Data Collection¶
This notebook outlines our data collection strategy, which consists of the following steps:
Finding the relevant Wikipedia pages for each discipline through PetScan.
Scraping each page to parse out its text and the hyperlinks to other Wikipedia pages.
Creating a smaller, more manageable subgraph from the full network.
Lastly, we describe the steps taken to preprocess the text for the forthcoming natural language processing.
# imports
import re
import json
import string
import random
import requests
import pickle
import warnings
import powerlaw
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from typing import List
from itertools import chain
from bs4 import BeautifulSoup
from dataclasses import dataclass
import networkx as nx
from littleballoffur import MetropolisHastingsRandomWalkSampler
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# The stopword list and WordNet data used below must be downloaded once
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
Finding Relevant Articles¶
To collect the relevant Wikipedia pages for our project, we specify the dataclass WikiPage. Collection is based on the open-source tool PetScan, which, given a list of Wikipedia categories, returns the corresponding page names. We furthermore specify the depth of our PetScan query, which is a measure of how deeply nested we allow the categories to be. As the list of pages grows exponentially with depth, we limit ourselves to depths 0, 1 and 2. The reason for not choosing one specific depth is that the group and subgroup structure differs between disciplines, which means that the different depths yield widely different numbers of pages.
@dataclass(frozen=False)
class WikiPage:
    """
    Data obj that stores an article and
    its relevant attributes
    """
    title: str
    parent: str
    depth: int
    text: str = np.nan
    edges: List = np.nan
def collect_pages(parents: list,
                  depth: int = 0) -> List[WikiPage]:
    """
    Finds relevant articles from petscan based on some initial query.
    See https://petscan.wmflabs.org/ for api reference.
    """
    pages = list()
    errors = 0
    # Setup API call
    base_url = 'https://petscan.wmflabs.org/?ns%5B0%5D=1&'
    params = {'project': 'wikipedia',
              'language': 'en',
              'format': 'json',
              'interface_language': 'en',
              'depth': str(depth),
              'doit': ''}
    # Loop over parents and get corresponding page names
    for cat in parents:
        params['categories'] = cat
        resp = requests.get(url=base_url, params=params).json()
        try:
            for page in resp['*'][0]['a']['*']:
                # Append nodes
                pages.append(WikiPage(title=page['title'],
                                      parent=cat,
                                      depth=depth))
        except KeyError:
            errors += 1
    print(f'Petscan failed to retrieve {errors} pages in depth {depth}...')
    return pages
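For orientation, the part of the PetScan JSON response that collect_pages() relies on has roughly the following shape. This is a sketch inferred from the keys accessed in the function above (resp['*'][0]['a']['*'] and page['title']); the concrete values are made up.
# Sketch of the PetScan JSON structure consumed by collect_pages() (values are made up)
example_response = {
    '*': [
        {'a': {'*': [
            {'title': 'Economics'},       # one record per page found in the category
            {'title': 'Macroeconomics'},  # records carry further metadata that we do not use
        ]}}
    ]
}
# The page names are read via example_response['*'][0]['a']['*']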
Below we call the function collect_pages(), create page lists for depths 0, 1 and 2, and display the resulting counts. As can be seen, anthropology is a clear outlier because of a different group structure on Wikipedia.
# Define initial query groups
query = ['political_science', 'economics', 'sociology', 'anthropology', 'psychology']
depths = [0, 1, 2]
pages = []
for d in tqdm(depths):
    pages += collect_pages(query, d)
# Show marginal distribution of pages
pd.DataFrame(pages).groupby('parent').count()['title']
Petscan failed to retrieve 0 pages in depth 0...
Petscan failed to retrieve 0 pages in depth 1...
Petscan failed to retrieve 0 pages in depth 2...
parent
anthropology 17621
economics 6023
political_science 7011
psychology 8757
sociology 5895
Name: title, dtype: int64
Collect Page Text and Edges¶
In the function collect_attributes() we use BeautifulSoup to scrape the HTML content of the Wikipedia pages we have found. The key HTML node is the div with attribute {'id':'mw-content-text'}, from which we parse out all paragraphs and hyperlinks, disregarding section headings, tables and other irrelevant content and page attributes.
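To show the parsing logic in isolation, here is a miniature example on a hand-written HTML snippet; the snippet itself is made up, and the real pages are of course fetched with requests in the function below.
# Miniature illustration of the parsing logic on a made-up HTML snippet
snippet = """
<div id="mw-content-text">
  <h2>Section heading that is ignored</h2>
  <p>Economics studies <a href="/wiki/Scarcity">scarcity</a> and choice.</p>
  <table><tr><td>table content that is ignored</td></tr></table>
</div>
"""
content = BeautifulSoup(snippet, 'html.parser').find('div', {'id': 'mw-content-text'})
text = ' '.join(p.text for p in content.find_all('p'))
links = [a.text for a in content.find_all('a', href=True) if 'wiki' in a.get('href')]
print(text)   # Economics studies scarcity and choice.
print(links)  # ['scarcity']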
def collect_attributes(articles: List[WikiPage]) -> List[WikiPage]:
    """
    Parses the Wikipedia article text and the urls pointing to other wiki pages.
    """
    base_url = 'https://en.wikipedia.org/wiki/'
    error_log = dict()
    for page in tqdm(articles):
        try:
            try:
                resp = requests.get(base_url + page.title, timeout=10)
            except requests.exceptions.Timeout as e:
                # Log the timeout and skip the page
                error_log[page.title] = str(e)
                continue
            soup = BeautifulSoup(resp.content, 'html.parser')
            content = soup.find('div', {'id': 'mw-content-text'})
            text = ''
            for paragraph in content.find_all('p'):
                text += ' ' + paragraph.text
            page.text = text
            page.edges = [ref.text for ref in content.find_all('a', href=True)
                          if 'wiki' in ref.get('href')]
        except Exception as e:
            # Log potential errors in collection
            error_log[page.title] = str(e)
    return articles, error_log
pages, error_log = collect_attributes(pages)
pages_df = pd.DataFrame(pages)
pages_df.to_pickle("full_data.pickle")
print(f'Amount of pages that failed to be collected: {len(error_log.keys())}')
Amount of pages that failed to be collected: 135
As only 135 out of the 45,307 pages failed to be collected, we do not consider this a substantial problem and keep the dataset as is.
Subsetting a Smaller Network¶
Because of the large size of the network, we deem it necessary to create a smaller, more manageable subgraph. To do this we initially perform three operations:
(1) Restrict anthropology pages to depth ≤ 1: Due to the inherent structure of Wikipedia we cannot expect all categories to contain an equal number of pages. Some categories may be defined more loosely and therefore span more broadly, whereas others may be larger because more people participate in activities related to the discipline and are therefore more likely to contribute to the related Wikipedia pages. Whatever the reason, anthropology pages are heavily overrepresented in our dataset, which is why we remove the anthropology pages collected at depth 2.
(2) Remove duplicates: Some pages occur in two categories, e.g. the page elitism occurs in both the political science and the anthropology category. As it then becomes ambiguous which discipline such a page belongs to, we remove it. Where a page occurs several times within the same category (because it was collected at different depths), we collapse it to one observation.
(3) Remove edges to pages that were not collected: Lastly, we remove all edges from the pages we have collected to pages that are not in our sample.
pages_df = pd.read_pickle("full_data.pickle")

def remove_anthro(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove pages in the anthropology category collected at depth 2
    """
    df = df.loc[~((df["depth"] == 2) & (df["parent"] == "anthropology"))]
    return df

def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove inter-category page duplicates and collapse intra-category duplicates
    """
    nodes_to_remove = [node for node in tqdm(set(df[df.duplicated("title")]["title"]))
                       if len(set(df[df["title"] == node]["parent"])) > 1]
    df = df[~df['title'].isin(nodes_to_remove)]
    df = df.drop_duplicates(subset="title", keep="first")
    return df

def uniform_page_and_edge_names(df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalise the spelling and format of the strings in the title and edges columns
    """
    df['title'] = df['title'].str.lower()
    df['edges'] = df['edges'].apply(lambda x: [re.sub(' ', '_', l.strip().lower()) for l in x])
    return df

def remove_edges_not_in_nodelist(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove all edges not pointing to a page in the title column
    """
    tqdm.pandas()
    nodes = set(df["title"].tolist())  # set for fast membership checks
    df['edges'] = df['edges'].progress_apply(lambda x: [e for e in x if e in nodes])
    return df
# Calling the functions
pd.options.mode.chained_assignment = None # Hide Pandas Warnings
df = remove_duplicates(pages_df)
df = remove_anthro(df)
df = uniform_page_and_edge_names(df)
df = remove_edges_not_in_nodelist(df)
df = df.reset_index()
To further subset our data, we remove self-loops and model an undirected network from which we extract the giant connected component. We then sample a representative subgraph containing \(\frac{1}{4}\) of the giant connected component's nodes using a Metropolis algorithm as stated in [Hübler et al., 2008] and implemented as the function MetropolisHastingsRandomWalkSampler() in the module littleballoffur. The intuition behind the method is to initialise a random subgraph \(S\) of our full network \(G\) from which we iteratively remove and add nodes in order to mimic some topological properties of \(G\), in this case the degree distribution. This is of course not without consequences, as we partly remove the complexity of our network and thereby lose information; however, we deem this necessary to reduce the network to a more manageable size.
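To make this intuition concrete, the sketch below implements a toy version of such a Metropolis node-swap scheme targeting the degree distribution. It is purely illustrative: the names toy_metropolis_sample and degree_hist, the L1 histogram distance and the temp parameter are our own simplifications, and this is not the littleballoffur implementation used below.
# Toy sketch of a Metropolis node-swap sampler (illustrative only, not littleballoffur's code)
def degree_hist(graph, bins):
    # Normalised histogram of node degrees
    degrees = [d for _, d in graph.degree()]
    return np.histogram(degrees, bins=bins, density=True)[0]

def toy_metropolis_sample(G, n_nodes, n_iter=2000, temp=0.01, seed=0):
    rng = random.Random(seed)
    bins = np.arange(0, max(d for _, d in G.degree()) + 2)
    target = degree_hist(G, bins)
    nodes = list(G.nodes())
    # Start from a uniformly random node set S
    S = set(rng.sample(nodes, n_nodes))
    quality = np.abs(degree_hist(G.subgraph(S), bins) - target).sum()
    for _ in range(n_iter):
        out_node = rng.choice(tuple(S))
        in_node = rng.choice(nodes)
        if in_node in S:
            continue
        candidate = (S - {out_node}) | {in_node}
        cand_quality = np.abs(degree_hist(G.subgraph(candidate), bins) - target).sum()
        # Metropolis acceptance: always keep improvements, occasionally accept
        # worse swaps so the sampler does not get stuck in a local optimum
        if cand_quality <= quality or rng.random() < np.exp((quality - cand_quality) / temp):
            S, quality = candidate, cand_quality
    return G.subgraph(S)
Calling toy_metropolis_sample(G, number_of_nodes) would return an induced subgraph whose degree histogram roughly tracks that of \(G\); in practice we rely on the far more efficient library sampler below, and the sketch only illustrates the accept/reject logic.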
# We create a dictionary mapping each page name to an integer index
index_dict = {title: idx for idx, title in enumerate(df['title'])}
# We create an edgelist ...
edge_list = []
for node, edges in zip(df['title'].tolist(), df['edges'].tolist()):
    for edge in edges:
        edge_list.append((index_dict[node], index_dict[edge]))
edge_list = [e for e in edge_list if e[0] != e[1]]  # ... and remove self-loops
# We model the actual graph and extract the gcc
G = nx.Graph()
G.add_edges_from(edge_list)
gcc = max(nx.connected_components(G), key=len)
G = G.subgraph(gcc)
# littleballoffur requires all node labels to be translated into integers
G = nx.relabel.convert_node_labels_to_integers(G)
# We make a sample with 0.25 of the gcc's nodes
number_of_nodes = int(0.25 * G.number_of_nodes())
sampler = MetropolisHastingsRandomWalkSampler(number_of_nodes=number_of_nodes)
sampled_supgraph = sampler.sample(G)
To assess the bias induced by our sampling, we compare the full network and the sample in terms of their degree exponent and transitivity. This way we can check that the degree distribution and the level of clustering of the sample are reasonably comparable to those of the full network.
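For reference, the degree exponent is the \(\alpha\) in the power-law form of the degree distribution, \(p(k) \propto k^{-\alpha}\), estimated here with the powerlaw package, while the transitivity is the fraction of connected triples of nodes that are closed into triangles:
\[ T = \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}. \]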
%%capture
# Calculating the exponent
G_degree_dist = [i[1] for i in G.degree]
sampled_supgraph_degree_dist = [i[1] for i in sampled_supgraph.degree]
G_exponent = powerlaw.Fit(G_degree_dist)
sampled_supgraph_exponent = powerlaw.Fit(sampled_supgraph_degree_dist)
# Calculating the transitivity
G_transitivity = nx.transitivity(G)
sampled_supgraph_transitivity = nx.transitivity(sampled_supgraph)
print(f"Degree distribution exponent full network G: {G_exponent.power_law.alpha}")
print(f"Degree distribution exponent supgraph S: {sampled_supgraph_exponent.power_law.alpha}")
print("-"*70)
print('Transitivity full network G: {:.4f}'.format(G_transitivity))
print('Transitivity supgraph S: {:.4f}'.format(sampled_supgraph_transitivity))
Degree distribution exponent full network G: 3.2974527129230946
Degree distribution exponent supgraph S: 3.8342121796821265
----------------------------------------------------------------------
Transitivity full network G: 0.1796
Transitivity supgraph S: 0.2822
Reassuringly, the exponents of the two networks are fairly similar, suggesting that the degree distributions follow roughly the same structure. A bit more worrying is the difference in transitivity, which shows that the sample overestimates the transitivity of the full network: the sampled network contains a higher share of closed "triads" than is really the case. Nonetheless, we choose to use the subgraph for our further analysis. We can now model our final network, and the corresponding DataFrame that we will use for our analysis, based on the nodes from the subgraph. Finally, since we now create a directed network, we restrict it to the largest weakly connected component.
# We define a list of nodes to keep and remove edges that refer to nodes that we have removed
nodes_to_keep = [list(index_dict.keys())[i] for i in list(sampled_supgraph.nodes())]
df = df[df['title'].isin(nodes_to_keep)].reset_index()[["title", "parent", "depth", "text", "edges"]]
df = remove_edges_not_in_nodelist(df)
# Set node attributes
node_attr = df[["title", "parent", "depth"]].set_index("title").to_dict(orient='index')
# Create an edgelist
edge_list = []
for node, edges in zip(df['title'].tolist(), df['edges'].tolist()):
    for edge in edges:
        edge_list.append((node, edge))
# Model the network
G = nx.DiGraph()
G.add_edges_from(edge_list)
nx.set_node_attributes(G, node_attr)
# Extract the largest weakly connected component
gcc = max(nx.weakly_connected_components(G), key=len)
G = G.subgraph(gcc)
# Add a gcc indicator column to our final DataFrame
df["gcc"] = df["title"].apply(lambda x: 1 if x in G.nodes() else 0)
# Save the edge list, as the directed network can not be pickled directly
with open("Final_edge_list.pickle", 'wb') as f:
    pickle.dump(edge_list, f)
# Save the node attributes
with open("Final_node_attr.pickle", 'wb') as f:
    pickle.dump(node_attr, f)
Text Preprocessing¶
In this section we conduct light preprocessing, since the Wikipedia documents are fairly clean. We take the following preprocessing steps:
1. Getting the clean text from the Wikipedia pages: Each page consists of a page-specific text section followed by a section of links called "See also". We are only interested in the first part. We further clean the text by removing all non-alphanumeric characters.
2. Removing stopwords: To reduce the size of the vocabulary we remove all stopwords, as they carry little information relative to how frequent they are.
3. Lemmatizing words: Lemmatization reduces words to their base form (lemma), aligning words despite grammatical modifications, which is useful for our analysis of the text. We therefore add a column with the lemmatized text to our final DataFrame.
4. Tokenizing words: We split each lemmatized page into a list of words (tokens) and add them to a new column in our final DataFrame.
def clean_text(text: str) -> str:
    """
    Extracts the clean text from the Wikipedia pages by splitting
    on the "See also" section and removing non-alphanumerical characters
    """
    text = text.split('See also')[0]
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

def remove_stopwords(text: str) -> str:
    """
    Removes nltk stopwords bounded by whitespace and replaces them
    with whitespace.
    """
    patterns = set(stopwords.words('english'))
    for pattern in patterns:
        if re.search(' ' + pattern + ' ', text):
            text = re.sub(' ' + pattern + ' ', ' ', text)
    return text

def lemmatize(text: str) -> str:
    """
    Lemmatize all text.
    """
    lemmatizer = WordNetLemmatizer()
    text = word_tokenize(text)
    sent_lemmatized = [lemmatizer.lemmatize(word) for word in text]
    return ' '.join(sent_lemmatized)

def word_tokenize(text: str) -> List[str]:
    """
    Tokenize all text.
    """
    text = WordPunctTokenizer().tokenize(text)
    return text
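As a quick illustration of what these steps do, they can be chained on a single made-up sentence (the example string below is ours and not taken from the data):
# Illustrative example on a made-up sentence (not from the dataset)
example = "The economists were studying the institutions of modern societies. See also Economics"
cleaned = clean_text(example)        # lowercased, the 'See also' part and punctuation removed
no_stop = remove_stopwords(cleaned)  # stopwords bounded by whitespace are dropped
lemmas = lemmatize(no_stop)          # e.g. 'institutions' -> 'institution', 'societies' -> 'society'
tokens = word_tokenize(lemmas)       # the token list that ends up in the 'tokens' column
print(tokens)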
# we apply all functions to the text
warnings.filterwarnings('ignore') # Ignore deprecation warnings
tqdm.pandas()
df['cleaned_text'] = df['text'].astype(str).progress_apply(lambda x: clean_text(x))
df['lemmatized'] = df['cleaned_text'].astype(str).progress_apply(lambda x: remove_stopwords(x))
df['lemmatized'] = df['lemmatized'].astype(str).progress_apply(lambda x: lemmatize(x))
df['tokens'] = df['lemmatized'].astype(str).progress_apply(lambda x: word_tokenize(x))
# Saving and displaying the discipline distribution of the final DataFrame
df.to_pickle("Final_df.pickle")
df.groupby("parent").count()["title"]
parent
anthropology 510
economics 1020
political_science 1614
psychology 543
sociology 615
Name: title, dtype: int64