Parkinson's Insight Engine (PIE)

Overview

The Parkinson's Insight Engine (PIE) is a comprehensive pipeline designed for researchers working with the Michael J. Fox Foundation's Parkinson's Progression Markers Initiative (PPMI) dataset. PIE automates the entire machine learning workflow, from loading and consolidating raw multi-modal data to training models and generating insightful reports. It provides a reproducible, configurable, and transparent framework to accelerate research.

The primary way to use PIE is through its main pipeline script, which orchestrates all the steps required to go from raw data to a full classification analysis with a single command.

Key Features

End-to-End Automation: A single command runs the full data processing and machine learning pipeline.
Modular Pipeline: Each step (Data Reduction, Feature Engineering, Feature Selection, Classification) generates its own detailed HTML report and intermediate data files.
Intelligent Data Reduction: Analyzes and removes low-value features before merging, drastically reducing memory usage and feature space complexity.
Robust Feature Engineering: Applies one-hot encoding, numeric scaling, and other transformations to prepare data for modeling.
Advanced Model Training: Leverages pycaret to compare a suite of models, tune the best performer, and evaluate its performance on a held-out test set.
Leakage Prevention: Employs a configurable list of features to exclude, preventing data leakage and ensuring more realistic model evaluation.
Comprehensive Reporting: Generates a main HTML report that links to detailed reports for each stage of the pipeline, providing full transparency.
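
The encoding and scaling transformations named above can be illustrated with a minimal, standard-library sketch. This is not PIE's implementation; the function names and the toy `sex` and `age` columns are hypothetical and only show what one-hot encoding and standard scaling do to a column.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Return (categories, rows): one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


def standardize(values):
    """Scale numeric values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Hypothetical toy columns, not real PPMI data.
sex = ["F", "M", "F"]
age = [60.0, 70.0, 80.0]

categories, encoded = one_hot(sex)
scaled_age = standardize(age)
```

In practice these statistics should be computed on the training split only and then applied to the test split, so that no information from held-out subjects reaches the model.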

The PIE Workflow

PIE processes data in a sequential, multi-stage workflow. Each stage produces outputs that feed into the next.

[Raw PPMI Data]
       |
       v
[1. Data Loading]
   - Loads all raw data modalities into memory.
   - (This step is integrated into the start of the Data Reduction stage).
       |
       v
[2. Data Reduction]
   - Analyzes loaded data tables.
   - Drops low-value columns (e.g., high missingness, zero variance).
   - Merges and consolidates all tables into a single CSV.
   - (Report: data_reduction_report.html)
       |
       v
[3. Feature Engineering]
   - Applies one-hot encoding, scaling, etc. to create model-ready features.
   - (Report: feature_engineering_report.html)
       |
       v
[4. Feature Selection]
   - Splits data into training and testing sets.
   - Selects the most relevant features from the training data.
   - (Report: feature_selection_report.html)
       |
       v
[5. Classification]
   - Compares multiple ML models on the final feature set.
   - Tunes and evaluates the best model.
   - (Report: classification_report.html)
       |
       v
[Final Pipeline Report]
   - (pipeline_report.html)
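
As a rough illustration of the Data Reduction stage, the sketch below flags columns that are mostly missing or hold a single value. It is a simplified stand-in, not PIE's code: the helper names, the `max_missing_fraction` threshold, and the toy rows are all assumptions made for the example.

```python
def is_missing(value):
    """Treat None and empty strings as missing values."""
    return value is None or value == ""


def low_value_columns(rows, max_missing_fraction=0.95):
    """Return names of columns that are mostly missing or have zero variance."""
    if not rows:
        return []
    drop = []
    for col in rows[0]:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        missing_fraction = 1 - len(present) / len(values)
        # Zero variance here means at most one distinct non-missing value.
        if missing_fraction > max_missing_fraction or len(set(present)) <= 1:
            drop.append(col)
    return drop


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    columns = set(columns)
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Hypothetical toy table; column names are illustrative only.
rows = [
    {"PATNO": "1", "SITE": "A", "NOTES": "", "AGE": "61"},
    {"PATNO": "2", "SITE": "A", "NOTES": "", "AGE": "70"},
    {"PATNO": "3", "SITE": "A", "NOTES": "", "AGE": ""},
]
to_drop = low_value_columns(rows, max_missing_fraction=0.5)
reduced = drop_columns(rows, to_drop)
```

Here `SITE` is dropped because it is constant and `NOTES` because it is entirely missing, while `AGE` survives with one missing entry. Filtering each table this way before merging is what keeps the consolidated table small.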

Getting Started

Prerequisites

Python 3.8 or later.
Required dependencies can be installed from requirements.txt:
```
pip install -r requirements.txt
```

Installation

Clone the repository and install the PIE package. For development, use the editable "-e" flag.

git clone https://github.com/MJFF-ResearchCommunity/PIE.git
cd PIE
pip install -e .

Data Setup

Download PPMI Data: You must apply for access to the PPMI data.

Organize Data: Create a directory named PPMI at the root of the cloned PIE repository. Download the individual study data folders from LONI and place them inside the PPMI directory. The structure should look like this:

PIE/
├── PPMI/
│   ├── _Subject_Characteristics/
│   ├── Biospecimen/
│   ├── Motor___MDS-UPDRS/
│   ├── Non-motor_Assessments/   
│   ├── Medical_History/
│   └── ... (other data folders)
├── pie/
└── ... (other project files)

How to Use PIE: The Main Pipeline

The most effective way to use PIE is by running the main pipeline script from your terminal. This script executes the entire workflow and provides configurable parameters.

A Standard End-to-End Run

This example demonstrates a typical use case: predicting the COHORT of a subject.

1. Configure Leakage Features

Before running the pipeline, configure data leakage prevention. Open config/leakage_features.txt. This file lists the column names (one per line) to remove from the data because they would "leak" information about the target variable.

For example, if you are predicting COHORT, you should exclude features like subject_characteristics_APPRDX (the clinician's diagnosis), as this is nearly identical to the target. The default file provides a starting point, but you must review and customize it for your specific research question.
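
To make the file format concrete, here is a small sketch of how a one-name-per-line leakage list could be read and applied. It only assumes the format described above; the function names and the `example_feature` column are hypothetical, and PIE's own loader may behave differently (for example, around comments or casing).

```python
from pathlib import Path


def parse_leakage_features(text):
    """Parse one column name per line, ignoring blank lines and whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_leakage_features(path):
    """Read the leakage list from a file such as config/leakage_features.txt."""
    return parse_leakage_features(Path(path).read_text(encoding="utf-8"))


def split_columns(columns, leakage_features):
    """Return (kept, removed) column lists, preserving the original order."""
    blocked = set(leakage_features)
    kept = [c for c in columns if c not in blocked]
    removed = [c for c in columns if c in blocked]
    return kept, removed


# Toy example: only COHORT and subject_characteristics_APPRDX come from this README.
leakage = parse_leakage_features("subject_characteristics_APPRDX\n\n")
columns = ["COHORT", "subject_characteristics_APPRDX", "example_feature"]
kept, removed = split_columns(columns, leakage)
```

Printing `removed` before a run is a quick check that the names in the file match the column names in the consolidated data.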

2. Execute the Pipeline

Run the following command from the root PIE/ directory:

python3 pie/pipeline.py \
    --data-dir ./PPMI \
    --output-dir ./output/my_first_run \
    --target-column COHORT \
    --leakage-features-path config/leakage_features.txt \
    --fs-method fdr \
    --fs-param 0.05 \
    --n-models 5 \
    --tune \
    --budget 60.0
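
If you prefer to launch runs from a Python script (for example, to loop over several targets), the same command can be assembled with `subprocess`. This is a convenience sketch, not part of PIE: `build_pipeline_command` and `run_pipeline` are hypothetical helpers that only pass through the flags shown in the command above.

```python
import subprocess
import sys


def build_pipeline_command(data_dir, output_dir, target_column, leakage_path,
                           fs_method="fdr", fs_param=0.05, n_models=5,
                           tune=True, budget=60.0):
    """Build the argument list for pie/pipeline.py using the documented flags."""
    cmd = [
        sys.executable, "pie/pipeline.py",
        "--data-dir", str(data_dir),
        "--output-dir", str(output_dir),
        "--target-column", target_column,
        "--leakage-features-path", str(leakage_path),
        "--fs-method", fs_method,
        "--fs-param", str(fs_param),
        "--n-models", str(n_models),
    ]
    if tune:
        cmd.append("--tune")  # boolean flag: present or absent
    cmd += ["--budget", str(budget)]
    return cmd


def run_pipeline(**kwargs):
    """Run the pipeline from the PIE/ root; raises CalledProcessError on failure."""
    return subprocess.run(build_pipeline_command(**kwargs), check=True)


cmd = build_pipeline_command(
    "./PPMI", "./output/my_first_run", "COHORT", "config/leakage_features.txt"
)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed run raise instead of passing silently.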

Understanding the Command-Line Arguments

--data-dir: Path to your raw PPMI data.
--output-dir: Where all results, reports, and data files will be saved.
--target-column: The variable you want your models to predict.
--leakage-features-path: Path to your leakage prevention file.
--fs-method: The feature selection algorithm to use (fdr or k_best).
--fs-param: The parameter for the feature selection method (e.g., 0.05 for FDR's alpha).
--n-models: The number of models to compare.
--tune: A flag to enable hyperparameter tuning for the best model.
--budget: A time limit in minutes for the model comparison step.
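
To give a feel for what an FDR alpha of 0.05 means, the sketch below implements the Benjamini-Hochberg procedure in plain Python. It assumes that the `fdr` method controls the false discovery rate over per-feature p-values in this way (as scikit-learn's `SelectFdr` does); this README does not confirm which implementation PIE uses, and the p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of features kept at false discovery rate alpha."""
    m = len(p_values)
    # Rank features from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its scaled threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five features.
p_values = [0.001, 0.8, 0.012, 0.04, 0.3]
selected = benjamini_hochberg(p_values, alpha=0.05)
```

Note that the feature with p = 0.04 is not selected even though it is below 0.05: its rank-scaled threshold is 0.03. A smaller `--fs-param` therefore yields a stricter, smaller feature set.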

Pipeline Output

After the run completes, the specified output directory (./output/my_first_run) will contain:

Intermediate Data: The CSV file output from each major step.
HTML Reports: A separate, detailed HTML report for each step.
pipeline_report.html: A top-level summary report that links to all the individual step reports. The script will attempt to open this file in your browser automatically upon completion.
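
A quick way to confirm that a run produced its reports is to look for the report files named in the workflow diagram. The helper below is a hypothetical sketch; it searches recursively because the exact layout inside the output directory is an assumption here.

```python
from pathlib import Path

# Report file names taken from the workflow diagram in this README.
EXPECTED_REPORTS = [
    "data_reduction_report.html",
    "feature_engineering_report.html",
    "feature_selection_report.html",
    "classification_report.html",
    "pipeline_report.html",
]


def missing_reports(output_dir, expected=EXPECTED_REPORTS):
    """Return the expected report names not found anywhere under output_dir."""
    found = {path.name for path in Path(output_dir).rglob("*.html")}
    return [name for name in expected if name not in found]
```

For example, `missing_reports("./output/my_first_run")` should return an empty list after a complete run.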

Example Visualizations

The PIE pipeline generates detailed HTML reports at each stage, including visualizations in the final classification report.

[Image: example of a full classification report]

Running Tests

To verify your setup and ensure all components are working correctly, you can run the integration test. This test executes a complete, expedited run of the pipeline.

pytest tests/test_pipeline.py

This test will create its own output in output/test_pipeline_run and check that all expected files are generated and that data leakage prevention is working.

Deeper Dive: Understanding the Modules

While the main pipeline is the recommended entry point, PIE is composed of modular components. You can learn more about each one in the detailed documentation:

Please also check out the notebooks directory for some great examples of how to use PIE in a more modular fashion.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a new branch for your feature: git checkout -b feature-name.
Make your changes.
Add or update tests for your changes.
Ensure the full test suite passes: pytest tests/.
Commit your changes and create a pull request.

Contributors

Cameron Hamilton
Victoria Catterson

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

If you have any questions or suggestions, please don't hesitate to contact [email protected].
