This project automates the retrieval and compilation of species-specific biological trait data by integrating biodiversity APIs with large language models. It is designed to scale from focused case studies to generalized, cross-species analyses.
The first phase is a proof of concept focused on amphibians (frogs).
- Uses multiple APIs to collect ecological and biological information:
  - AmphibiaWeb: morphology and reproductive traits (snout–vent length, clutch size, egg diameter)
  - IUCN Red List: elevation ranges and habitat categories
  - World Bank Climate Change Knowledge Portal (CCKP): temperature and rainfall statistics
- Automates retrieval of structured (API) and semi-structured (XML parsed via GPT-4o) data
- Compiles outputs into a clean CSV/Excel dataset with traits like morphology, reproduction, climate, and altitude
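The compilation step can be pictured as merging one record per source into a single row per species. The following is a minimal sketch under stated assumptions, not the project's actual code: the function names `merge_records` and `write_csv`, the column names, and the values are all hypothetical placeholders, not real measurements.

```python
import csv
import io


def merge_records(species, *sources):
    """Combine per-source trait dicts into one row keyed by column name.

    Later sources overwrite earlier ones if they share a column name.
    """
    row = {"species": species}
    for source in sources:
        row.update(source)
    return row


def write_csv(rows, handle):
    """Write rows to an open text handle, using the union of all columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills traits that a species is missing with an empty cell.
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    morphology = {"svl_mm": 10}
    altitude = {"elevation_max_m": 100}
    rows = [
        merge_records("Example species A", morphology, altitude),
        merge_records("Example species B", morphology),
    ]
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
```

Collecting the column names from every row keeps the output rectangular even when a source returns nothing for a given species.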
The system then expands into a general-purpose trait extraction pipeline.
- Uses Europe PMC / PubMed Central (PMC) to query scientific literature
- Retrieves PDFs, parses them, and applies LLM-based extraction prompts to pull out traits such as diet, size, habitat, or environmental associations
- Works for any list of species and any set of traits, driven by an Excel file and a trait-description mapping
- Provides a UI for easy use, supporting batch processing across taxa
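As an illustration of the literature-query step, the sketch below builds a search URL for the public Europe PMC REST search endpoint. The function name `build_search_url` and the query shape (species AND trait, restricted to open-access records) are assumptions made for this example; the project's real query may differ.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(species: str, trait: str, page_size: int = 25) -> str:
    """Build a search URL for open-access papers mentioning a species and a trait."""
    # Quoting each term asks for an exact phrase match; OPEN_ACCESS:y
    # limits results to records flagged as open access.
    query = f'"{species}" AND "{trait}" AND OPEN_ACCESS:y'
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("Rana temporaria", "clutch size"))
```

The resulting URL can be passed to any HTTP client; the example stops short of the request so that it runs without network access.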
- Run the GUI script:
python3 Data-Compilation-Model/02_generic_data_compilation/scripts/gui.py
- In the popup window:
  - Upload your Excel file: the first column holds species names and the remaining columns are traits.
  - Upload your trait descriptions text file: it must be UTF-16 encoded, with each line in the format trait: description.
- Start extraction:
  - Click Start Data Extraction.
  - The system queries the APIs, fetches papers, and extracts trait data.
  - Results are saved to: Data-Compilation-Model/02_generic_data_compilation/results/
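To check a trait descriptions file before uploading it, a small parser for the trait: description format can help. This is a sketch under stated assumptions rather than the GUI's own loader: the function names are hypothetical, and skipping blank lines and lines without a colon is a choice made for this example.

```python
from pathlib import Path


def parse_trait_descriptions(text: str) -> dict[str, str]:
    """Parse lines of the form 'trait: description' into a dict."""
    traits = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines without a colon are ignored.
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so descriptions may contain colons.
        name, description = line.split(":", 1)
        traits[name.strip()] = description.strip()
    return traits


def load_trait_descriptions(path: str) -> dict[str, str]:
    """Read a UTF-16 encoded file and parse its trait descriptions."""
    return parse_trait_descriptions(Path(path).read_text(encoding="utf-16"))


if __name__ == "__main__":
    sample = "diet: main food items\nhabitat: typical environment\n"
    print(parse_trait_descriptions(sample))
```

Opening the file with `encoding="utf-16"` fails with a decoding error on most UTF-8 files, which makes a wrongly encoded upload easy to spot early.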