Installation
GToTree runs in a Unix-like command-line environment. It is tested most extensively in the bash shell, but is written to be as POSIX-compliant as possible so that it runs smoothly on as many Unix variants as possible.
If you don't already have the glorious package manager conda, I highly recommend you get it. This really isn't the venue to go into why it's so helpful, but it really is, I promise 🙂
To get conda up and running (which is very quick), you can follow the instructions to install miniconda (a light-weight version) for your appropriate system starting from here. You will want a python 3.X version, and more than likely a 64-bit version. And if you'd like to learn more about conda sometime, I have an introduction page here 🙂
The following line will create a gtotree conda environment and install GToTree:
conda create -y -n gtotree -c conda-forge -c bioconda -c defaults -c astrobiomike gtotree
NOTE: Very kindly, @JoeVineis has noted that he had trouble installing GToTree with just the above command, but things did work out when breaking apart the environment build and the installation. So if you run into any snags with the one-line install above, maybe try this way :)
conda create -y -n gtotree python=3.7
conda activate gtotree
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike gtotree
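If you would rather script the fallback described in the note above, here is a minimal sketch. It is not something GToTree ships: the function name install_gtotree is hypothetical, and it targets the environment with conda's -n flag instead of activating it, on the assumption that conda activate may not be available inside a non-interactive script.

```shell
#!/bin/sh
# Hypothetical helper (not part of GToTree): try the one-line install,
# then fall back to the two-step build if that fails.
CHANNELS="-c conda-forge -c bioconda -c defaults -c astrobiomike"

install_gtotree() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; install miniconda first" >&2
        return 1
    fi

    # $CHANNELS is intentionally unquoted so it splits into separate arguments.
    if conda create -y -n gtotree $CHANNELS gtotree; then
        return 0
    fi

    echo "One-line install failed; trying the two-step approach" >&2
    conda create -y -n gtotree python=3.7 || return 1
    conda install -y -n gtotree $CHANNELS gtotree
}
```

Calling install_gtotree with no arguments runs the whole thing and returns non-zero if neither route succeeds.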
Now you should be able to enter and exit the environment with conda activate gtotree and conda deactivate. If you enter the environment and run the following:
gtt-hmms
It will print out where the GToTree default HMMs directory is located, and list the available HMMs there. And if you enter GToTree with no arguments, you should see the help menu. You can run a test that takes about 2 minutes like so:
gtt-test.sh
For which the end of the standard output should look like this:
#################################################################################
#### Done!! ####
#################################################################################
Overall, 12 genomes of the input 14 were retained (see notes below).
Tree written to:
GToTree_test/GToTree_test.tre
Alignment written to:
GToTree_test/Aligned_SCGs_mod_names.faa
Main genomes summary table with comp./redund. estimates written to:
GToTree_test/Genomes_summary_info.tsv
Summary table with hits per target gene per genome written to:
GToTree_test/SCG_hit_counts.tsv
Files for additional PFam searches written to:
GToTree_test/additional_pfam_search_results/
Partitions (for downstream use with mixed-model treeing) written to:
GToTree_test/run_files/Partitions.txt
_______________________________________________________________________________
Notes:
1 accession(s) not successfully found at NCBI.
1 genome(s) removed due to having too few hits to the targeted SCGs.
Reported along with additional informative run files in:
GToTree_test/run_files/
_______________________________________________________________________________
Log file written to:
GToTree_test/gtotree-runlog.txt
Programs used and their citations have been written to:
GToTree_test/citations.txt
Total process runtime: 0 hours and 1 minutes.
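If you'd like a quick scripted check that the test run produced the files listed above, something like the following sketch could work. The function name check_test_outputs is hypothetical, and the file list is simply copied from the example output, so it may not match other GToTree versions.

```shell
#!/bin/sh
# Hypothetical helper: confirm the files named in the test output exist
# and are non-empty. Pass the test output directory (default: GToTree_test).
check_test_outputs() {
    dir="${1:-GToTree_test}"
    missing=0
    for f in GToTree_test.tre \
             Aligned_SCGs_mod_names.faa \
             Genomes_summary_info.tsv \
             SCG_hit_counts.tsv \
             run_files/Partitions.txt \
             gtotree-runlog.txt \
             citations.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if nothing was missing.
    [ "$missing" -eq 0 ]
}
```

Running check_test_outputs from the directory where you ran gtt-test.sh should return 0 if everything is in place.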
And if you took that output tree file "GToTree_test.tre" and threw it into a tree viewer, such as uploading it to the Interactive Tree of Life site, rooting it at the included archaeal sequence, and dragging and dropping in the "additional_pfam_search_results/PF05400.13-iToL.txt" file, it would look something like this (though with some different labels now, as I haven't updated this image):

Where the blue branches go to those genomes in which the FliT protein involved in flagellar biosynthesis was detected (searched for by its PFam, PF05400.13, being specified in the "pfam_targets.txt" input file).
You can clean out the results from the test run by running:
gtt-clean-after-test.sh
GToTree comes packaged with full example datafiles and outputs, as outlined on the example-usage page, in directories where conda installed it on your system. You can access these while in the "gtotree" conda environment under the $EXAMPLE_DATA_DIR variable. E.g. echo $EXAMPLE_DATA_DIR or cd $EXAMPLE_DATA_DIR.
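Since that directory lives inside the conda installation, you may prefer to work on a copy. Here is a small sketch of one way to do that; the function name copy_example_data and the default destination name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the packaged example data somewhere writable.
copy_example_data() {
    dest="${1:-gtotree-example-data}"
    if [ -z "${EXAMPLE_DATA_DIR:-}" ] || [ ! -d "$EXAMPLE_DATA_DIR" ]; then
        echo "EXAMPLE_DATA_DIR is not set or not a directory; is the gtotree environment active?" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; not overwriting" >&2
        return 1
    fi
    cp -R "$EXAMPLE_DATA_DIR" "$dest"
}
```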
Again, the conda installation is highly recommended as it is more robust across different systems. But to try installing without conda, download and unpack/decompress GToTree wherever you'd like it to live on your system (be sure to change the versions below to the latest found here):
curl -L https://github.com/AstrobioMike/GToTree/archive/v1.5.22.tar.gz -o GToTree-v1.5.22.tar.gz
tar -xzvf GToTree-v1.5.22.tar.gz
Now we need to add the "bin" directory to our PATH (see here if you are unfamiliar with what the PATH is and you'd like to know more).
One way we can do this is to change directories into the bin directory and use pwd inside an echo command to put the full path into our PATH:
cd GToTree-1.5.22/bin # make sure you are in this bin directory
echo "export PATH=\"$(pwd):\$PATH\"" >> ~/.bash_profile
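One thing to be aware of is that running that echo line more than once will append the same export line to ~/.bash_profile each time. If that bothers you, here is a sketch of a small guard; the function name append_once is hypothetical and not part of GToTree.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage, run from inside the GToTree bin directory:
#   append_once "export PATH=\"$(pwd):\$PATH\"" ~/.bash_profile
```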
If you'd like to more easily be able to use the included single-copy gene HMM profiles, you can also add a variable to your bash profile so that you don't need to provide the full path to them whenever you use them. If you change directories into the "hmm_sets" directory, this can be done in a similar way as above:
cd ../hmm_sets/ # from where we were above
echo "export GToTree_HMM_dir=\"$(pwd)/\"" >> ~/.bash_profile
Last thing to do is source the ~/.bash_profile we just modified so those changes take effect in our current session:
source ~/.bash_profile
You can run gtt-hmms with no arguments to make sure the default HMM directory is set, and see what taxa the currently available HMM files can more specifically target.
And now if you type GToTree with no arguments, you should see the help menu (but note that you still need to take care of the dependencies presented below before you're ready to rock):
GToTree v1.5.22
(github.com/AstrobioMike/GToTree)
---------------------------------- HELP INFO ----------------------------------
This program takes input genomes from various sources and ultimately produces
a phylogenomic tree. You can find detailed usage information at:
github.com/AstrobioMike/GToTree/wiki
------------------------------- REQUIRED INPUTS -------------------------------
1) Input genomes in one or any combination of the following formats:
- [-a <file>] single-column file of NCBI assembly accessions
- [-g <file>] single-column file with the paths to each GenBank file
- [-f <file>] single-column file with the paths to each fasta file
- [-A <file>] single-column file with the paths to each amino acid file,
each file should hold the coding sequences for just one genome
2) [-H <file>] location of the uncompressed HMM file being used, or just the
HMM name if you've set the environment variable 'GToTree_HMM_dir'
to the appropriate location or installed via conda (run 'gtt-hmms'
by itself to view the available gene-sets)
------------------------------- OPTIONAL INPUTS -------------------------------
Output directory specification:
- [-o <str>] default: GToTree_output
Specify the desired output directory.
User-specified modification of genome labels:
- [-m <file>] specify desired genome labels
A two- or three-column tab-delimited file where column 1 holds either
the file name or NCBI accession of the genome to name (depending
on the input source), column 2 holds the desired new genome label,
and column 3 holds something to be appended to either initial or
modified labels (e.g. useful for "tagging" genomes in the tree based
on some characteristic). Columns 2 or 3 can be empty, and the file does
not need to include all input genomes.
Options for adding taxonomy information:
- [-t ] default: false
Provide this flag with no arguments if you'd like to add NCBI taxonomy
info to the sequence headers for any genomes with NCBI taxids. This will
will largely be effective for input genomes provided as NCBI accessions
(provided to the `-a` argument), but any input GenBank files will also
be searched for an NCBI taxid. See `-L` argument for specifying desired
ranks.
- [-D ] default: false
Provide this flag with no arguments if you'd like to add taxonomy from the
Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
effective for input genomes provided as NCBI accessions (provided to the
`-a` argument). This can be used in combination with the `-t` flag, in
which case any input accessions not represented in the GTDB will have NCBI
taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
for help getting input accessions based on GTDB taxonomy searches.
- [-L <str>] default: Domain,Phylum,Class,Species,Strain
A comma-separated list of the taxonomic ranks you'd like added to
the labels if adding taxonomic information. E.g., all would be
"-L Domain,Phylum,Class,Order,Family,Genus,Species". Note that
strain-level information is available through NCBI, but not GTDB.
Filtering settings:
- [-c <float>] default: 0.2
A float between 0-1 specifying the range about the median of
sequences to be retained. For example, if the median length of a
set of sequences is 100 AAs, those seqs longer than 120 or shorter
than 80 will be filtered out before alignment of that gene set
with the default 0.2 setting.
- [-G <float>] default: 0.5
A float between 0-1 specifying the minimum fraction of hits a
genome must have of the SCG-set. For example, if there are 100
target genes in the HMM profile, and Genome X only has hits to 49
of them, it will be removed from analysis with default value 0.5.
- [-B ] default: false
Provide this flag with no arguments if you'd like to run GToTree
in "best-hit" mode. By default, if a SCG has more than one hit
in a given genome, GToTree won't include a sequence for that target
from that genome in the final alignment. With this flag provided,
GToTree will use the best hit. See here for more discussion:
github.com/AstrobioMike/GToTree/wiki/things-to-consider
Additional PFam searching:
- [-p <file>] single-column file of additional PFam targets to search for.
Table of hit counts, fasta of hit sequences, and files compatible
with the iToL web-based tree-viewer will be generated for each
target. See visualization of gene presence/absence example at
github.com/AstrobioMike/GToTree/wiki/example-usage for example.
General run settings:
- [-N ] default: false
No tree. Generate alignment only.
- [-T <str>] default: FastTree
Which program to use for tree generation. Currently supported are
"FastTree" and "IQ-TREE". As of now, these run with default settings
only (and IQ-TREE includes "-mset WAG,LG". To run either with more
specific options (and there is a lot of room for variation here), you
can use the output alignment file from GToTree (and partitions file if
wanted for mixed-model specification) as input into a dedicated treeing
program.
- [-n <int> ] default: 2
The number of cpus you'd like to use during the HMM search. (Given
these are individual small searches on single genomes, 2 is probably
always sufficient.)
- [-j <int> ] default: 1
The number of jobs you'd like to run in parallel during steps
that are parallelizable. This includes things like downloading input
accession genomes and running parallel alignments, but not the tree step.
Note that I've occassionally noticed NCBI not being happy with over ~50
downloads being attempted concurrently. So if using a `-j` setting around
there or higher, and GToTree is saying a lot of input accessions were not
successfully downloaded, consider trying with fewer.
- [-P ] default: false
Provide this flag with no arguments if your system can't use ftp,
and you'd like to try using http.
- [-d ] default: false
Provide this flag with no arguments if you'd like to keep the
temporary directory. (Mostly useful for debugging.)
-------------------------------- EXAMPLE USAGE --------------------------------
GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4
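The -f option in the help menu above expects a single-column file with the path to each fasta file. As a rough sketch of one way to build such a file, the hypothetical helper below lists every fasta file under a directory; the .fa, .fasta and .fna extensions are an assumption about how your files are named.

```shell
#!/bin/sh
# Hypothetical helper: write one fasta path per line, the single-column
# layout described for the -f input.
make_fasta_list() {
    dir="$1"
    out="$2"
    # The extensions below are assumptions; adjust them to match your files.
    find "$dir" -type f \( -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \) \
        | sort > "$out"
}

# Usage ("my_genomes" is a placeholder directory name):
#   make_fasta_list my_genomes fasta_files.txt
#   GToTree -f fasta_files.txt -H Bacteria
```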
By far, the easiest way to get all the dependencies up and running is with conda as done above. But if you don't want to use conda, here are links to installing all the dependencies (be sure to install Easel along with HMMER3 as well if you are doing things the non-conda way).
If you use GToTree, please be sure to cite these folks – a citations.txt file including used programs is produced with each run to help 🙂
- Biopython - citation
- HMMER3 v3.2.1 - citation: they note in the user manual to cite the website, but there is also this paper (be sure to install Easel along with HMMER3 as well, see more at the HMMER3 install page here)
- Muscle v3.8.1551 - citation
- Trimal v1.4.1 - citation
- FastTree v2.1.10 - citation
If you use GToTree in a manner that uses these tools, please cite these folks – a citations.txt file including used programs is produced with each run to help 🙂
- Prodigal v2.6.3 - citation
  - if providing input genomes in fasta format, or GenBank format with no CDS annotations, or NCBI accessions to genomes with no gene calls
  - if providing input genomes as NCBI assembly accessions
- TaxonKit v0.6.0 - citation
  - if adding NCBI taxonomy information to input genomes
- Genome Taxonomy Database Release R05-RS95 - citation
  - if adding GTDB taxonomy information to input genomes
- GNU Parallel v20161122 - citation info
  - if running things in parallel (specifically set with the -j argument)
- IQ-TREE v1.6.9 - citation
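For a non-conda setup, it can help to check which of the dependencies above are already on your PATH before running anything. The following is a minimal sketch with a hypothetical function name, check_deps. The executable names in the usage comment are assumptions about how each tool is typically installed and may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given executables are not on PATH.
check_deps() {
    missing=0
    for exe in "$@"; do
        if ! command -v "$exe" >/dev/null 2>&1; then
            echo "not found on PATH: $exe" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if everything was found.
    [ "$missing" -eq 0 ]
}

# Usage (executable names are assumptions, not taken from the GToTree docs):
#   check_deps hmmsearch esl-reformat muscle trimal FastTree prodigal taxonkit parallel iqtree
# Biopython is a Python module rather than an executable, so check it with:
#   python -c "import Bio"
```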
NOTE: If doing a non-conda installation, you may also need to temporarily change your terminal's localization settings if you're not in the United States or Australia, as GToTree expects things to be encoded a certain way. If you run locale in the terminal, you will get a list of these. If any do not say "en_US.UTF-8", then you can run these two commands to temporarily change them (for the current terminal session): export LC_ALL="en_US.UTF-8" and export LANG="en_US.UTF-8". Now in this terminal window, GToTree will run appropriately. When you open a new terminal, your settings will be back to the way they were.
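If you'd rather not change the settings for the whole terminal session, one alternative is to set the two variables for a single command only. This is just a sketch, and the wrapper name with_en_us_locale is hypothetical; it assumes the en_US.UTF-8 locale is available on your system.

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with LC_ALL and LANG set to
# en_US.UTF-8, leaving the rest of the session untouched.
with_en_us_locale() {
    env LC_ALL="en_US.UTF-8" LANG="en_US.UTF-8" "$@"
}

# Usage:
#   with_en_us_locale GToTree -a ncbi_accessions.txt -H Bacteria
```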