This is a basic README file for the pipeline. See the tutorial for step-by-step details on how to use the pipeline and visualize character evolution on a phylogenetic tree.
This pipeline transforms a character matrix obtained from the Phenoscape Knowledgebase (KB) into a version that can be efficiently integrated with an Open Tree phylogeny.
The pipeline requires two user inputs:
1. A character matrix downloaded from the Phenoscape KB
2. An Open Tree phylogeny downloaded from Open Tree of Life
Optional: you can request a meta-data file for the character you are interested in; this lets you distinguish between inferred and asserted states (step 4 of the pipeline). Without the meta-data file you can still run the pipeline, but you cannot visualize inferred and asserted states differently. Refer to the tutorial for more details.
Please move the input files to the folder named ‘inputs’.
To run the pipeline you need Python 2 (version 2.7). The pipeline will not work on Python 3.
You also need to install the following external libraries:
1. networkx
2. dendropy
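Before running the pipeline, the requirements above can be checked with a small script such as the following. This is only a sketch and not part of the pipeline; the helper name missing_modules is hypothetical.

```python
from __future__ import print_function

import importlib
import sys


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The README asks for Python 2.7; warn on any other version.
    if sys.version_info[:2] != (2, 7):
        print("Warning: Python 2.7 expected, found %d.%d" % sys.version_info[:2])
    absent = missing_modules(["networkx", "dendropy"])
    if absent:
        print("Missing libraries: " + ", ".join(absent))
    else:
        print("All required libraries are installed.")
```

Run it with the same Python interpreter you intend to use for main.py, since libraries are installed per interpreter.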
If all the requirements are met, execute the main.py Python script. The pipeline will run, and the output matrices can be found in the outputs folder.
The tree_integration_pipeline folder contains the scripts for each step of the pipeline. You can look into the scripts and refer to the tutorial for more details.
The outputs folder contains the following files:
1. preprocessed_tree.tre: This is the pre-processed version of your input phylogeny from Open Tree of Life. Use this version for the ancestral state reconstruction, because it does not have ott ids at the end of each species name. If you use the input version, the ancestral state reconstruction will not run efficiently.
2. finalfullopentree_matrix.txt: This matrix is ready to be merged with the Open Tree phylogeny. It contains all the taxa in the Open Tree file, including taxa that are not in the matrix propagated based on the VTO taxonomy. For instance, our example had 11,786 taxa after propagation, whereas the Open Tree phylogeny had 38,830 taxa in total. This matrix therefore also contains 38,830 taxa, but only 11,786 of them have data for pectoral fin. It can be used for further analysis to investigate phylogenetic clades with missing data.
3. finalopentree_matrix_onlydata.txt: This matrix is also ready to be merged with the Open Tree phylogeny. Unlike the previous matrix, it contains only the taxa that are in the propagated matrix. In our example, this means it has 11,786 taxa, not 38,830. It does not include the taxa with missing data.
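How the three output files relate to each other can be illustrated with a short sketch. It is not the pipeline's own code: the function names, the "_ott<digits>" label format, the made-up taxon labels, and the use of "?" for missing data are all assumptions made for illustration.

```python
import re

# Assumption: tree labels end in "_ott<digits>"; the ids used in examples are made up.
OTT_SUFFIX = re.compile(r"_ott\d+$")


def strip_ott_id(label):
    """Return the label without a trailing ott id, if it has one."""
    return OTT_SUFFIX.sub("", label)


def build_matrices(tree_labels, propagated, missing="?"):
    """Return (full, only_data) lists of (taxon, state) rows in tree order.

    `propagated` maps taxon names (without ott ids) to character states.
    `missing` is an assumed placeholder for tree taxa that have no data.
    """
    names = [strip_ott_id(label) for label in tree_labels]
    # Full matrix: every tree taxon, with a placeholder where data is absent.
    full = [(name, propagated.get(name, missing)) for name in names]
    # Data-only matrix: only the tree taxa that have a state.
    only_data = [(name, propagated[name]) for name in names if name in propagated]
    return full, only_data
```

For example, calling build_matrices(["Taxon_a_ott101", "Taxon_b_ott102"], {"Taxon_a": "1"}) gives a full matrix with two rows and a data-only matrix with one row, mirroring the 38,830 versus 11,786 difference described above.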
The statistics folder contains files with useful statistics about the pipeline run:
originaldatamatrix_taxalist.txt: contains a list of taxa in the original input data matrix
originaldatamatrix_taxalist_separated.txt: The same original taxa list separated to different taxonomic levels (families, genera, etc.)
missingtaxa.txt: List of taxa with missing data that were removed during step 2 of the pipeline
conflict_counts.txt: a list of all taxa that have the state ‘0&1’, separated into different taxonomic levels, with their literature sources.
inferredstats.txt: lists of taxa separated by asserted vs inferred presence and absence. This file is only generated by step 4 if you provide the meta-data file.
propagationstatistics.txt: statistics about the propagation process conducted during step 5 of the pipeline.
finalVTOspecieslist.txt: list of all the species in the propagated data matrix
opentree_specieslist.txt: list of all the species in the Open Tree phylogeny
finalmatchedvtlist.txt: list of species in the propagated matrix that matched Open Tree species during step 6 of the pipeline.
finalmismatchedlist_andstats.txt: list of species in the propagated matrix that did not match any Open Tree species. This list is separated by the reason for the mismatch (improper naming syntax, taxa being extinct, etc.)
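The matched and mismatched lists can be thought of as the result of comparing two species lists. The sketch below assumes exact name matching and uses a hypothetical function name; the actual step 6 script may normalize names differently, and the sketch does not attempt to classify the reason for a mismatch.

```python
def match_species(propagated_species, tree_species):
    """Split propagated species into (matched, mismatched) sorted lists.

    A species counts as matched only if its name appears verbatim
    in `tree_species` (an assumption made for this sketch).
    """
    tree_set = set(tree_species)
    matched = sorted(s for s in propagated_species if s in tree_set)
    mismatched = sorted(s for s in propagated_species if s not in tree_set)
    return matched, mismatched
```

The two input lists correspond to the contents of finalVTOspecieslist.txt and opentree_specieslist.txt.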
Each step of the pipeline generates an intermediate version of the data matrix downloaded from the KB. Only the final output matrices are included in the outputs folder. All the intermediate matrices are included in the intermediate_matrices folder.
tabdelemited_charactermatrix.txt: this matrix is produced in step 1 of the pipeline by converting the input data matrix downloaded from the KB to tab-delimited format. It contains two columns: taxa_name and the character you are interested in. The character states in the character column have the following meanings:
0: absence
1: presence
0&1: presence and absence: conflicts or polymorphisms
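To see how these states are distributed in a tab-delimited matrix, a tally like the following can help. It is a sketch under assumptions: the function name is hypothetical, the first row is assumed to be a header, and any value found in the second column is counted as-is because the encoding of missing data is not specified here.

```python
from collections import Counter


def count_states(lines, has_header=True):
    """Count how often each state occurs in a two-column tab-delimited matrix."""
    counts = Counter()
    for index, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if not line or (has_header and index == 0):
            continue  # skip blank lines and the assumed header row
        # Everything before the first tab is the taxon, the rest is the state.
        taxon, _, state = line.partition("\t")
        counts[state] += 1
    return counts
```

Because an open file yields lines, the function can be called directly on a file handle, for example on intermediate_matrices/tabdelemited_charactermatrix.txt.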
preprocessed_matrix.txt: this matrix is generated in step 2 of the pipeline. Taxa with missing data are removed from this matrix.
conflicts_removed_datamatrix.txt: this matrix is generated in step 3 after removal of ‘0&1’ states from higher-level taxa.
modified_inferredadded_matrix.txt: this matrix is only generated in step 4 if you provide the meta-data file. It contains an additional column (character_name_inferred) for inferred character states. The states in this column have the following meanings:
0: asserted absence
1: asserted presence
2: inferred presence
3: inferred absence
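The four codes combine a presence/absence state with an asserted/inferred flag. A minimal sketch of that mapping follows, assuming that 0 denotes asserted absence and 1 asserted presence; the function name is hypothetical, and how the pipeline treats other states such as ‘0&1’ in this column is not covered here.

```python
def inferred_code(state, inferred):
    """Map a "0"/"1" state and an inferred flag to a code from 0 to 3."""
    if state not in ("0", "1"):
        raise ValueError("only states '0' and '1' are handled in this sketch")
    if inferred:
        # 2: inferred presence, 3: inferred absence
        return 2 if state == "1" else 3
    # Assumed: 0 is asserted absence, 1 is asserted presence.
    return int(state)
```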
finalVTOmatrix.txt: this matrix is generated after propagation in step 5. It contains a new column (char_name_propagated) that records the propagation status of each taxon. The status can be:
0: not propagated
1: propagated
Important: after running the pipeline for a matrix, please move the contents of the outputs, intermediate_matrices, and statistics folders to another location or make a backup. If you rerun the pipeline for a new matrix (another character), these output files will be overwritten. Keep the three folders themselves in place; do not delete them, move only their contents.
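Moving the results aside can be scripted. The following is a sketch, not part of the pipeline: the function name and the idea of mirroring the three folder names inside the backup directory are assumptions. It moves only the contents and leaves the three folders in place.

```python
import os
import shutil

FOLDERS = ("outputs", "intermediate_matrices", "statistics")


def move_results(base_dir, backup_dir):
    """Move the contents of each results folder into backup_dir.

    The folders themselves stay in base_dir. Returns the relative
    paths of everything that was moved.
    """
    moved = []
    for folder in FOLDERS:
        source = os.path.join(base_dir, folder)
        target = os.path.join(backup_dir, folder)
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in sorted(os.listdir(source)):
            shutil.move(os.path.join(source, name), os.path.join(target, name))
            moved.append(os.path.join(folder, name))
    return moved
```

Use a fresh backup directory for each character, so that files from an earlier run with the same names are not replaced.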