A Generic pipeline to integrate large trait matrices with large phylogenies

This is a basic README file for the pipeline. See the tutorial for step-by-step details on how to use the pipeline and visualize character evolution on a phylogenetic tree.

This pipeline transforms a character matrix obtained from the Phenoscape Knowledgebase (KB) into a version that can be efficiently integrated with an Open Tree phylogeny.

Inputs

There are two user inputs required for the pipeline:

1. A character matrix downloaded from the Phenoscape KB
2. An Open Tree phylogeny downloaded from Open Tree of Life

Optional: you can request a meta-data file for the character you are interested in; this enables you to distinguish between inferred and asserted states (step 4 of the pipeline). Without the meta-data file you can still run the pipeline, but you cannot visualize inferred and asserted states differently. Refer to the tutorial for more details.

Please move the input files to the folder named ‘inputs’.

Requirements

To run the pipeline you need Python 2 (version 2.7 or newer). The pipeline will not work on Python 3.

You also need to install the following external libraries:

1. networkx
2. dendropy
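
Before running the pipeline, it can help to confirm that both libraries are importable in the interpreter you plan to use. The following is a minimal sketch; the helper name `missing_libraries` is hypothetical and is not part of the pipeline.

```python
def missing_libraries(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # networkx and dendropy are the two libraries listed above.
    # An empty list means both are installed.
    print(missing_libraries(["networkx", "dendropy"]))
```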

Implementation

If all the requirements are met, execute the main.py Python script. The pipeline will run, and the output matrices can be found in the outputs folder.

The tree_integration_pipeline folder contains the scripts for each step of the pipeline. You can look into the scripts and refer to the tutorial for more details.

The pipeline produces the following output files:

1. preprocessed_tree.tre: This is the pre-processed version of your input phylogeny from Open Tree of Life. This version must be used for the ancestral state reconstruction because it does not have OTT ids at the end of each species name. If you use the input version, the ancestral state reconstruction will not run efficiently.

2. finalfullopentree_matrix.txt: This matrix is ready to be merged with the Open Tree phylogeny. It contains all the taxa in the Open Tree file, including taxa that are not in the propagated matrix based on the VTO taxonomy. For instance, our example had 11,786 taxa after propagation, whereas the Open Tree phylogeny had 38,830 taxa in total. The matrix therefore contains 38,830 taxa, but only 11,786 of them have data for the pectoral fin. This matrix can be used in further analyses to investigate phylogenetic clades with missing data.

3. finalopentree_matrix_onlydata.txt: This matrix is also ready to be merged with the Open Tree phylogeny. Unlike the previous matrix, it contains only the taxa that are in the propagated matrix. In our example, this means it has 11,786 taxa, not 38,830. It does not include the taxa with missing data.
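
The relationship between the tree labels and the two output matrices can be sketched as follows. This is an illustration only, not the pipeline's own code. It assumes that tip labels end in a suffix of the form `_ott<digits>`, that matrix and tree use the same name format once that suffix is removed, and that missing data is written as `?`. The function names and the taxon names are made up.

```python
import re

MISSING = "?"  # assumed placeholder for missing data; the pipeline may use another symbol


def strip_ott_id(label):
    """Drop a trailing '_ott<digits>' suffix from a tip label, if present."""
    return re.sub(r"_ott\d+$", "", label)


def build_matrices(tip_labels, propagated):
    """Return (full, only_data) dicts mapping taxon name -> state.

    tip_labels: tip labels from the tree, possibly carrying ott id suffixes.
    propagated: dict of taxon name -> state from the propagated matrix.
    """
    tips = [strip_ott_id(label) for label in tip_labels]
    # Full matrix: every tip in the tree, with a placeholder where no data exists.
    full = dict((tip, propagated.get(tip, MISSING)) for tip in tips)
    # Only-data matrix: the subset of tips that actually have a state.
    only_data = dict((tip, state) for tip, state in full.items() if state != MISSING)
    return full, only_data


# Made-up labels and states, for illustration only.
example_tips = ["Aus_bus_ott1", "Aus_cus_ott2", "Cus_dus_ott3"]
example_states = {"Aus_bus": "1", "Cus_dus": "0"}
full, only_data = build_matrices(example_tips, example_states)
```

Here `full` has one entry per tip in the tree, while `only_data` keeps only the tips that received a state, mirroring the difference between the two matrices described above.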

Statistics folder

This folder contains files with statistics about each stage of the pipeline.

originaldatamatrix_taxalist.txt: contains a list of taxa in the original input data matrix

originaldatamatrix_taxalist_separated.txt: the same original taxa list, separated into different taxonomic levels (families, genera, etc.)

missingtaxa.txt: List of taxa with missing data that were removed during step 2 of the pipeline

conflict_counts.txt: a list of all taxa that have the ‘0&1’ state, separated into different taxonomic levels, with their literature sources.

inferredstats.txt: lists of taxa separated by asserted vs inferred presence and absence. This file is only generated by step 4 if you provide the meta-data file.

propagationstatistics.txt: statistics about the propagation process conducted during step 5 of the pipeline.

finalVTOspecieslist.txt: list of all the species in the propagated data matrix

opentree_specieslist.txt: list of all the species in the Open Tree phylogeny

finalmatchedvtlist.txt: list of species in the propagated matrix that matched Open Tree species during step 6 of the pipeline.

finalmismatchedlist_andstats.txt: list of species in the propagated matrix that did not match any Open Tree species. This list is separated by the reason for the mismatch (improper naming syntax, taxa being extinct, etc.)
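
The matched and mismatched lists can be thought of as a set comparison between the two species lists. The sketch below assumes exact string matching on species names; the pipeline's step 6 may normalize names first, and the classification of mismatches by reason is not shown. The function name `match_species` is hypothetical.

```python
def match_species(propagated_species, tree_species):
    """Split the propagated species into those found in the tree and those not found."""
    propagated = set(propagated_species)
    tree = set(tree_species)
    matched = sorted(propagated & tree)     # present in both lists
    mismatched = sorted(propagated - tree)  # in the propagated matrix only
    return matched, mismatched


# Made-up species names, for illustration only.
matched, mismatched = match_species(
    ["Aus bus", "Aus cus", "Cus dus"],
    ["Aus bus", "Cus dus", "Eus fus"],
)
```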

intermediate_matrices folder

Each step of the pipeline generates an intermediate version of the data matrix downloaded from KB. Only the final output matrices are included in the outputs folder. All the intermediate matrices are included in this intermediate_matrices folder.

tabdelemited_charactermatrix.txt: this matrix is generated in step 1 of the pipeline by converting the input data matrix downloaded from the KB to tab-delimited format. It contains two columns: taxa_name and the character you are interested in. The meaning of the character states in the character column is given below.

0: absence
1: presence
0&1: presence and absence: conflicts or polymorphisms
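
A two-column, tab-delimited matrix of this kind can be read with a few lines of Python. The sketch below assumes the file starts with a header row and that a taxon without data has an empty second column; both are assumptions, as are the function name `parse_matrix` and the sample rows.

```python
STATE_MEANINGS = {
    "0": "absence",
    "1": "presence",
    "0&1": "presence and absence (conflict or polymorphism)",
}


def parse_matrix(text):
    """Parse tab-delimited text with a header row into (taxon, state) pairs."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the assumed header row
        fields = line.split("\t")
        taxon = fields[0].strip()
        # A row with no second column is treated as having no state.
        state = fields[1].strip() if len(fields) > 1 else ""
        rows.append((taxon, state))
    return rows


# Made-up rows, for illustration only.
sample = "taxa_name\tpectoral fin\nAus bus\t1\nAus cus\t0&1\nAus\t\n"
for taxon, state in parse_matrix(sample):
    print(taxon, STATE_MEANINGS.get(state, "no data"))
```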

preprocessed_matrix.txt: this matrix is generated in step 2 of the pipeline. Taxa with missing data are removed from this matrix.
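
The removal of taxa with missing data can be sketched as a simple filter that also records which taxa were dropped, similar in spirit to missingtaxa.txt. How the pipeline encodes missing data is not stated here, so the markers below (an empty string or `?`) are assumptions, as is the function name `drop_missing`.

```python
MISSING_MARKERS = ("", "?")  # assumed encodings of missing data


def drop_missing(rows):
    """Split (taxon, state) pairs into rows with data and the taxa removed for lacking it."""
    kept = [(taxon, state) for taxon, state in rows if state not in MISSING_MARKERS]
    removed = [taxon for taxon, state in rows if state in MISSING_MARKERS]
    return kept, removed


# Made-up rows, for illustration only.
kept, removed = drop_missing([("Aus bus", "1"), ("Aus cus", ""), ("Cus dus", "0&1")])
```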

conflicts_removed_datamatrix.txt: this matrix is generated in step 3 after removal of ‘0&1’ states from higher-level taxa.

modified_inferredadded_matrix.txt: this matrix is only generated in step 4 if you provide the meta-data file. The matrix contains an additional column (character_name_inferred) for inferred character states. The meaning of the different states in this column is given below.

0: asserted absence
1: asserted presence
2: inferred presence
3: inferred absence

finalVTOmatrix.txt: this matrix is generated after propagation in step 5. The matrix contains a new column (char_name_propagated) that records the propagated status of each taxon. The propagated state can be:

0: not propagated
1: propagated
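
One way to picture the propagated flag is the sketch below. It assumes that propagation means a taxon without its own state receives the state of its nearest ancestor that has one, and that the taxonomy is a tree without cycles; the README does not spell out the actual rule, so treat this as an illustration rather than the pipeline's method. The `parent_of` dictionary is a hypothetical stand-in for the VTO taxonomy, and the taxon names are made up.

```python
def propagate(parent_of, states):
    """Return dict taxon -> (state, propagated_flag).

    parent_of: dict mapping each child taxon to its parent taxon.
    states: dict mapping taxa that have data to their state.
    The flag is 0 when the state is the taxon's own and 1 when it was
    copied from an ancestor. Taxa with no state anywhere above them are omitted.
    """
    result = {}
    taxa = set(parent_of) | set(parent_of.values()) | set(states)
    for taxon in taxa:
        if taxon in states:
            result[taxon] = (states[taxon], 0)  # not propagated
            continue
        # Walk up the taxonomy until an ancestor with data is found.
        ancestor = parent_of.get(taxon)
        while ancestor is not None and ancestor not in states:
            ancestor = parent_of.get(ancestor)
        if ancestor is not None:
            result[taxon] = (states[ancestor], 1)  # propagated
    return result


# Made-up taxonomy and states, for illustration only.
example_parents = {"Aus bus": "Aus", "Aus cus": "Aus", "Aus": "Aidae", "Bus dus": "Bus"}
example_states = {"Aus": "1", "Aus cus": "0"}
propagated = propagate(example_parents, example_states)
```

In this example "Aus bus" inherits the state of "Aus" and is flagged 1, while "Aus cus" keeps its own state and is flagged 0.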

Important: after running the pipeline for a desired matrix, please move the contents of the outputs, intermediate_matrices, and statistics folders to another location or make a backup. If you rerun the pipeline for a new matrix (another character), these output files will be replaced. Make sure to keep the three folders themselves intact: do not delete them; move only their contents.
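
The backup step can be scripted. The sketch below moves the contents of the three folders into a backup directory and leaves the (now empty) folders in place. The function name `archive_results` and the example paths are hypothetical; use a fresh backup directory for each run so that earlier results are not overwritten.

```python
import os
import shutil

RESULT_FOLDERS = ("outputs", "intermediate_matrices", "statistics")


def archive_results(pipeline_dir, backup_dir):
    """Move the contents of the three result folders into backup_dir.

    The folders themselves are kept so the next run can write into them.
    """
    for folder in RESULT_FOLDERS:
        source = os.path.join(pipeline_dir, folder)
        target = os.path.join(backup_dir, folder)
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in os.listdir(source):
            shutil.move(os.path.join(source, name), os.path.join(target, name))


# Example call (paths are placeholders):
# archive_results(".", "../backup_pectoral_fin")
```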

Repository: pasanfernando/generic_pipeline_for_trait_integration
