Data Description
As mentioned, we collected Wikipedia articles corresponding to the five major social science disciplines: Economics, Political Science, Anthropology, Sociology and Psychology. To understand how we did this, one needs to know how Wikipedia organises information into categories (very exciting!). Each category contains several subcategories, which in turn contain sub-subcategories, and so on. Besides its subcategories, each depth level also contains the pages that belong to the categories at that level. This is visualised in the image below.
# Create illustration of the Wikipedia category structure
import pygraphviz as pgv

G = pgv.AGraph(directed=True)
G.add_node("ROOT", label="Category: Social sciences", fontsize=20)
G.add_node("ROOT_i", label="Depth 0", shape="plaintext", fontsize=20)
disciplines = ["Anthropology",
               "Economics",
               "Sociology",
               "Political Science",
               "Psychology"]
for i, k in enumerate(disciplines):
    # Category chain for discipline k: subcategory -> sub-subcategories -> ... depth n
    G.add_node("Child_%i" % i, label=f"Subcategory: {k}")
    G.add_edge("ROOT", "Child_%i" % i)
    G.add_node("Grandchild_%i" % i, label=f"List of {k} sub-subcategories")
    G.add_edge("Child_%i" % i, "Grandchild_%i" % i)
    G.add_node("Greatgrandchild_%i" % i, label=f"... n list of {k} sub-subcategories")
    G.add_edge("Grandchild_%i" % i, "Greatgrandchild_%i" % i)
    # Plain-text labels marking the depth of each level
    G.add_node("Child_%ix" % i, label="Depth 1", shape="plaintext", fontsize=20)
    G.add_node("Grandchild_%ix" % i, label="Depth 2", shape="plaintext", fontsize=20)
    G.add_node("Greatgrandchild_%ix" % i, label="Depth n", shape="plaintext", fontsize=20)
    G.add_edge("ROOT_i", "Child_%ix" % i)
    G.add_edge("Child_%ix" % i, "Grandchild_%ix" % i)
    G.add_edge("Grandchild_%ix" % i, "Greatgrandchild_%ix" % i)
G.layout(prog='dot')
G.draw('wikipedia_struture.png')
Perhaps a specific depth level could be the key to a representative sampling strategy, we thought, and so we used the tool PetScan to access the articles at a predefined depth. This enabled us to find all the pages for each discipline down to the depth of the query, which in our case was set to 2. A nice bonus of this approach is that PetScan can be accessed programmatically through Python and thus provides us with a relevant list of pages to collect from Wikipedia: a structured way to sample from an otherwise chaotic encyclopedia. But now that we have the data, let’s get serious.
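To make the idea of a depth-limited query concrete, here is a minimal sketch of the traversal on a made-up category tree. It does not call PetScan; the tree, the category names and the helper `pages_up_to_depth` are hypothetical, and counting the starting category as depth 0 is an assumption for illustration.

```python
from collections import deque

# Hypothetical toy category tree: category -> (subcategories, pages).
# The structure is invented for illustration and is not real Wikipedia data.
TOY_TREE = {
    "Economics": (["Microeconomics", "Macroeconomics"], ["Economics"]),
    "Microeconomics": (["Game theory"], ["Supply and demand"]),
    "Macroeconomics": ([], ["Inflation"]),
    "Game theory": (["Auction theory"], ["Nash equilibrium"]),
    "Auction theory": ([], ["Vickrey auction"]),
}


def pages_up_to_depth(tree, root, max_depth):
    """Collect pages of `root` and of every subcategory at most `max_depth` levels below it."""
    pages = set()
    seen = {root}
    queue = deque([(root, 0)])  # breadth-first walk, tracking the depth of each category
    while queue:
        category, depth = queue.popleft()
        subcategories, category_pages = tree.get(category, ([], []))
        pages.update(category_pages)
        if depth < max_depth:
            for sub in subcategories:
                if sub not in seen:  # a category may be reachable along several paths
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return pages


sample = pages_up_to_depth(TOY_TREE, "Economics", max_depth=2)
print(sorted(sample))
```

In this toy tree, "Auction theory" sits three levels below the root, so its page is left out at depth 2. Raising the depth pulls in more pages, but also categories that drift further away from the discipline.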
The table below shows the first and last five observations of our dataset, which includes the following variables:
title
: The name of the Wikipedia article.
parent
: The discipline to which the article belongs.
edges
: All links from the article to other Wikipedia pages.
text
: The raw text of the article.
cleaned_text
: The text with punctuation removed and lower-cased.
lemmatized
: The cleaned text in lemmatized form, with stop words removed.
tokens
: The lemmatized text tokenized into a list of words.
gcc
: Dummy indicating whether the article is part of the giant component in the network.
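As a rough illustration of how the text variables relate to each other, the sketch below goes from raw text to a cleaned string and then to a token list. It is not our actual preprocessing code: the function names and the tiny stop-word list are hypothetical, and the lemmatization step is left out entirely, so the tokens are not lemmatized.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "of", "in", "to", "and"}


def clean_text(text):
    """Lower-case the text and replace punctuation with spaces, then collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(no_punct.split())


def tokenize(cleaned, stop_words=STOP_WORDS):
    """Split a cleaned string into words and drop stop words (no lemmatization here)."""
    return [word for word in cleaned.split() if word not in stop_words]


raw = "The World Values Survey (WVS) is a global research project."
cleaned = clean_text(raw)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

Replacing punctuation with a space rather than deleting it is a guess, chosen because it turns "a.k.a." into "a k a".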
Once again, the data can be downloaded from the following link (if you really want to see for yourself).
import pandas as pd
import numpy as np
from ast import literal_eval
from collections import defaultdict
df = pd.read_pickle('Final_df.pickle')
df = df[['title', 'parent', 'edges', 'text', 'cleaned_text', "lemmatized", "tokens", "gcc"]]
df["title"] = df["title"].apply(lambda x: " ".join(x.split("_")))
df["parent"] = df["parent"].apply(lambda x: " ".join(x.split("_")))
display(df)
| | title | parent | edges | text | cleaned_text | lemmatized | tokens | gcc |
---|---|---|---|---|---|---|---|---|
0 | political science | political science | [comparative_politics, public_administration, ... | \n Political science is the scientific study ... | political science is the scientific study of ... | political science scientific study politics so... | [political, science, scientific, study, politi... | 1 |
1 | world values survey | political science | [religion, left-wing_politics, joseph_schumpet... | The World Values Survey (WVS) is a global res... | the world values survey wvs is a global resea... | world value survey wv global research project ... | [world, value, survey, wv, global, research, p... | 1 |
2 | voter turnout | political science | [tactical_voting, political_science, political... | In political science, voter turnout is the pe... | in political science voter turnout is the per... | political science voter turnout percentage eli... | [political, science, voter, turnout, percentag... | 1 |
3 | mierscheid law | political science | [] | The Mierscheid law is a satirical forecast[ci... | the mierscheid law is a satirical forecast ci... | mierscheid law satirical forecast citation nee... | [mierscheid, law, satirical, forecast, citatio... | 0 |
4 | political groups of the european parliament | political science | [civil_service, foreign_policy, london_school_... | \n \n \n \n The political groups of the Europ... | the political groups of the european parliame... | political group european parliament parliament... | [political, group, european, parliament, parli... | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4297 | sascha altman dubrul | psychology | [anti-psychiatry, medicalization, psychoanalyt... | Sascha Altman DuBrul, a.k.a. Sascha DuBrul or... | sascha altman dubrul a k a sascha dubrul or s... | sascha altman dubrul k sascha dubrul sascha sc... | [sascha, altman, dubrul, k, sascha, dubrul, sa... | 1 |
4298 | grotesque body | psychology | [social_system, burlesque] | The grotesque body is a concept, or literary ... | the grotesque body is a concept or literary t... | grotesque body concept literary trope put forw... | [grotesque, body, concept, literary, trope, pu... | 1 |
4299 | imaginary audience | psychology | [] | The imaginary audience refers to a psychologi... | the imaginary audience refers to a psychologi... | imaginary audience refers psychological state ... | [imaginary, audience, refers, psychological, s... | 1 |
4300 | lady wonder | psychology | [astrology, mediumship, communal_reinforcement... | Reportedly haunted locations:\n Lady Wonder (... | reportedly haunted locations lady wonder born... | reportedly haunted location lady wonder born l... | [reportedly, haunted, location, lady, wonder, ... | 1 |
4301 | ego integrity | psychology | [psychometrics, religion, object_relations_the... | Ego integrity was the term given by Erik Erik... | ego integrity was the term given by erik erik... | ego integrity term given erik erikson last eig... | [ego, integrity, term, given, erik, erikson, l... | 1 |
4302 rows × 8 columns
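The gcc column flags membership of the giant connected component. A minimal sketch of how such a dummy could be derived from the edges column is given below, assuming links are treated as undirected and only links between articles in the dataset count. The helper `giant_component` and the toy link lists are hypothetical; this is not necessarily how the column in our data was built.

```python
from collections import deque


def giant_component(edges_by_title):
    """Return the largest connected set of articles, treating links as undirected."""
    # Build an undirected adjacency map restricted to articles we actually have.
    neighbours = {title: set() for title in edges_by_title}
    for source, targets in edges_by_title.items():
        for target in targets:
            if target in neighbours and target != source:
                neighbours[source].add(target)
                neighbours[target].add(source)

    best, seen = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy link lists (made up for illustration, in the style of the edges column).
toy_edges = {
    "voter_turnout": ["political_science", "tactical_voting"],
    "political_science": ["voter_turnout"],
    "tactical_voting": [],
    "mierscheid_law": [],
}
gcc = giant_component(toy_edges)
gcc_dummy = {title: int(title in gcc) for title in toy_edges}
print(gcc_dummy)
```

Under this undirected reading, an article with an empty edges list can still belong to the giant component when other articles link to it. That would be consistent with rows such as "imaginary audience", which has no outgoing links but gcc equal to 1.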
In the table below we display summary statistics for each discipline: the number of articles, the average number of edges and the average word count. As expected, the distribution is rather skewed, with Political Science having more than double the number of articles of Sociology. Remember this detail - Political Science is dominating our dataset and therefore most likely our analysis…
# Create descriptives table
tab = defaultdict(list)
for discipline in df['parent'].unique():
    avg_edges = []
    avg_pagelen = []
    # Collect the number of edges and tokens for every article in the discipline
    for row in df.loc[df['parent']==discipline].iterrows():
        avg_edges.append(len(row[1]['edges']))
        avg_pagelen.append(len(row[1]['tokens']))
    tab['Discipline'].append(discipline)
    tab['Number of articles'].append(df.loc[df['parent']==discipline].shape[0])
    tab['Avg. edges'].append(np.mean(avg_edges))
    tab['Avg. word count'].append(np.mean(avg_pagelen))
tab = pd.DataFrame(tab)
tab.set_index('Discipline').round(2)
Discipline | Number of articles | Avg. edges | Avg. word count |
---|---|---|---|
political science | 1614 | 3.97 | 714.27 |
economics | 1020 | 3.30 | 433.05 |
anthropology | 510 | 4.86 | 815.08 |
psychology | 543 | 4.68 | 857.45 |
sociology | 615 | 3.28 | 575.52 |
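The skew can be checked with a little arithmetic on the article counts in the table above. The snippet below is only a sanity check on those five numbers; the variable names are ours for illustration.

```python
# Article counts copied from the summary table above.
counts = {
    "political science": 1614,
    "economics": 1020,
    "anthropology": 510,
    "psychology": 543,
    "sociology": 615,
}

total = sum(counts.values())
shares = {discipline: n / total for discipline, n in counts.items()}
ratio = counts["political science"] / counts["sociology"]

print(f"Total articles: {total}")
for discipline, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{discipline}: {share:.1%}")
print(f"Political science vs. sociology: {ratio:.2f}x")
```

The counts add up to the 4302 rows of the dataset, and Political Science alone accounts for more than a third of all articles.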