The `rankings.ipynb` notebook analyzes and compares different centrality metrics for a citation network. It loads node and edge data from JSON files, calculates various centrality metrics (degree, betweenness, closeness, and others), and compares them against ground-truth measures such as importance and document type.
To set up the project, follow these steps:

- **Clone the repository:**

  ```bash
  git clone https://github.com/davidwickerhf/rankings.git
  ```

- **Navigate to the project directory:**

  ```bash
  cd rankings
  ```

- **Create a virtual environment** (optional but recommended) to manage dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- **Install dependencies** using pip:

  ```bash
  pip install -r requirements.txt
  ```

- **Launch Jupyter Notebook:**

  ```bash
  jupyter notebook
  ```

- **Open the rankings notebook:** In the Jupyter interface, open `rankings.ipynb` to begin your analysis.
The notebook begins by loading node and edge data from JSON files. Ensure that your data files are in the correct format and located in the specified directory.
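As a rough illustration, the loading step might look like the following sketch; the file names and record keys here are assumptions, not the notebook's exact schema:

```python
import json

import networkx as nx

# Hypothetical file names; substitute the paths your data actually uses.
with open("data/ECHR/nodes.json") as f:
    nodes = json.load(f)
with open("data/ECHR/edges.json") as f:
    edges = json.load(f)

# Build a directed citation graph; "id", "source", and "target" are
# assumed record keys.
G = nx.DiGraph()
G.add_nodes_from((n["id"], n) for n in nodes)
G.add_edges_from((e["source"], e["target"]) for e in edges)
```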
You can run the `load.py` script to load the data into the `data/ECHR` directory:

```bash
python load/load.py --input_path ECHR_metadata.csv --save_path ECHR --metadata
```

You can also run the `split.py` script to split the data into smaller networks:

```bash
python load/split.py --input_path data/ECHR --output_path networks --min_cases 50
```

Preprocessing steps are applied to clean and prepare the data for analysis. This includes the following (a minimal sketch appears after the list):
- Converting document types to numeric values.
- Filtering out rows with uncomputed metric values.
- Filtering out rows with a NaN `doctypebranch` value.
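Here is a minimal sketch of these steps, assuming a pandas DataFrame `df` with a `doctypebranch` column and one column per metric (the column names are assumptions):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, metric_cols: list) -> pd.DataFrame:
    # Drop rows whose doctypebranch is NaN.
    df = df.dropna(subset=["doctypebranch"])
    # Drop rows where any centrality metric was left uncomputed.
    df = df.dropna(subset=metric_cols)
    # Convert document types to numeric codes for correlation analysis.
    df["doctype_numeric"] = df["doctypebranch"].astype("category").cat.codes
    return df
```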
The notebook calculates various centrality measures using the NetworkX library. Key centrality metrics include (a sketch of the calculations follows the list):
- Degree Centrality
- Betweenness Centrality
- Closeness Centrality
- Eigenvector Centrality
- PageRank
- Disruption
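These can be computed with NetworkX along the following lines, assuming a directed citation graph `G`; disruption has no NetworkX built-in, so it is omitted from this sketch:

```python
import networkx as nx

def compute_centralities(G: nx.DiGraph) -> dict:
    # Each call returns a {node: score} mapping.
    return {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "pagerank": nx.pagerank(G),
    }
```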
The notebook creates composite rankings based on the best-performing centrality measures for predicting high and low relevance scores (see the sketch after this list). It includes:
- Error bar plots for centrality measures against ground truth scores.
- Functions to find the best centrality measures and create composite rankings.
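An illustrative sketch of one way to build such a composite ranking; the averaging-of-ranks approach and the column names are assumptions, not necessarily the notebook's exact method:

```python
import pandas as pd

def composite_ranking(df: pd.DataFrame, best_measures: list) -> pd.Series:
    # Rank nodes under each selected measure (1 = most central),
    # then average the per-measure ranks into one composite score.
    ranks = df[best_measures].rank(ascending=False)
    return ranks.mean(axis=1)
```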
The notebook calculates correlations between individual centrality measures and ground truth scores, as well as between composite rankings and ground truths. It visualizes these correlations using plots.
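For example, a rank-based (Spearman) correlation between one centrality column and one ground-truth column could be computed as follows; the column names are assumptions:

```python
from scipy.stats import spearmanr

# Correlate a centrality measure with a ground-truth score.
rho, p_value = spearmanr(df["pagerank"], df["importance"])
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```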
The `analyze_network()` function performs comprehensive network analysis using various centrality measures and composite rankings. It returns an `AnalysisResults` dictionary containing (a usage sketch follows the list):
- Basic network statistics (nodes, edges, density, etc.)
- Correlation coefficients between rankings and ground truths
- Best performing centrality measures for each ground truth
- Composite ranking results
- The final processed DataFrame with all measures included
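A hypothetical usage sketch; the call signature and the result keys below are assumptions inferred from the description above, not the function's documented API:

```python
# Assumed signature and keys; adjust to match the notebook.
results = analyze_network(G, ground_truths=["importance", "doctype_numeric"])

print(results["stats"])          # basic network statistics
print(results["correlations"])   # rankings vs. ground truths
print(results["best_measures"])  # best measure per ground truth
df_final = results["dataframe"]  # processed DataFrame with all measures
```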
The `compare_networks()` function allows results to be compared across different networks (see the usage sketch after this list). It analyzes:
- Correlation comparisons between centrality measures and ground truth metrics across networks.
- Ranking comparisons to see how centrality measures rank relative to each other in different networks.
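A hypothetical usage sketch; the argument and the shape of the return value are assumptions:

```python
# Assumed to take the per-network results produced by analyze_network().
comparison = compare_networks([results_network_a, results_network_b])

print(comparison["correlation_comparison"])  # correlations across networks
print(comparison["ranking_comparison"])      # relative measure rankings
```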
By following the steps outlined above, you can effectively use the `rankings.ipynb` notebook to analyze and visualize centrality metrics in citation networks. Feel free to modify the notebook to suit your specific analysis needs.