
Installation

Mike Lee edited this page Aug 25, 2020 · 138 revisions

GToTree runs in a Unix-like command-line environment. It is tested most extensively in the bash shell, but it is written to be as POSIX-compliant as possible in the hope that it runs smoothly on as many Unix variants as possible.

Conda quickstart!

If you don't already have the glorious package manager conda, I highly recommend you get it. This really isn't the venue to go into why it's so helpful, but it really is, I promise 🙂

To get conda up and running (which is very quick), you can follow the instructions to install miniconda (a lightweight version) for your system starting from here. You will want a Python 3.X version, and more than likely a 64-bit version. And if you'd like to learn more about conda sometime, I have an introduction page here 🙂


The following line will create a gtotree conda environment and install GToTree:

conda create -y -n gtotree -c conda-forge -c bioconda -c defaults -c astrobiomike gtotree

DONE!

NOTE: Very kindly, @JoeVineis has noted that he had trouble installing GToTree with just the above command, but things did work out when he split the environment build and the installation into separate steps. So if you run into any snags with the one-line install above, maybe try this way :)

conda create -y -n gtotree python=3.7
conda activate gtotree
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike gtotree

Now you should be able to enter and exit the environment with conda activate gtotree and conda deactivate (which takes no environment name). If you enter the environment and run the following:

gtt-hmms

It will print out where the GToTree default HMMs directory is located, and list the available HMMs there. And if you enter GToTree with no arguments, you should see the help menu. You can run a test that takes about 2 minutes like so:

gtt-test.sh

For which the end of the standard output should look like this:

#################################################################################
####                                 Done!!                                  ####
#################################################################################

  Overall, 12 genomes of the input 14 were retained (see notes below).

    Tree written to:
        GToTree_test/GToTree_test.tre

    Alignment written to:
        GToTree_test/Aligned_SCGs_mod_names.faa

    Main genomes summary table with comp./redund. estimates written to:
        GToTree_test/Genomes_summary_info.tsv

    Summary table with hits per target gene per genome written to:
        GToTree_test/SCG_hit_counts.tsv

    Files for additional PFam searches written to:
        GToTree_test/additional_pfam_search_results/

    Partitions (for downstream use with mixed-model treeing) written to:
        GToTree_test/run_files/Partitions.txt

 _______________________________________________________________________________

  Notes:

        1 accession(s) not successfully found at NCBI.
        1 genome(s) removed due to having too few hits to the targeted SCGs.

    Reported along with additional informative run files in:
        GToTree_test/run_files/

 _______________________________________________________________________________

    Log file written to:
        GToTree_test/gtotree-runlog.txt

    Programs used and their citations have been written to:
        GToTree_test/citations.txt


                                         Total process runtime: 0 hours and 1 minutes.
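If you'd rather not eyeball that output, here is a minimal sketch for checking that the files named above were actually written. The function name `check_gtt_test_outputs` is made up for illustration and is not part of GToTree; the file names are copied from the test output above.

```shell
# Hypothetical helper (not part of GToTree): reports whether each output file
# named in the test summary exists and is non-empty under a results directory.
# Returns the number of files that were not found (0 means all present).
check_gtt_test_outputs() {
    dir="${1:-GToTree_test}"
    missing=0
    for f in \
        GToTree_test.tre \
        Aligned_SCGs_mod_names.faa \
        Genomes_summary_info.tsv \
        SCG_hit_counts.tsv \
        run_files/Partitions.txt \
        gtotree-runlog.txt \
        citations.txt
    do
        if [ -s "$dir/$f" ]; then
            printf 'ok       %s\n' "$dir/$f"
        else
            printf 'MISSING  %s\n' "$dir/$f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Run from the directory where gtt-test.sh was run.
check_gtt_test_outputs GToTree_test || echo "Some expected files were not found."
```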

And if you took that output tree file "GToTree_test.tre" and threw it into a tree viewer, such as uploading it to the Interactive Tree of Life site, rooting it at the included archaeal sequence, and dragging and dropping in the "additional_pfam_search_results/PF05400.13-iToL.txt" file, it would look something like this (though with some different labels now, as I haven't updated this image):

Where the blue branches go to those genomes in which the FliT protein involved in flagellar biosynthesis was detected (searched for by its PFam, PF05400.13, being specified in the "pfam_targets.txt" input file).

You can clean out the results from the test run by running:

gtt-clean-after-test.sh

GToTree comes packaged with full example data files and outputs, as outlined on the example-usage page, in the directories where conda installed it on your system. You can access these while in the "gtotree" conda environment under the $EXAMPLE_DATA_DIR variable. E.g. echo $EXAMPLE_DATA_DIR or cd $EXAMPLE_DATA_DIR.
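As a small sketch, the lookup can be wrapped so that it gives a hint when the variable is empty, which usually just means the environment isn't active. The function name `show_example_data` is hypothetical; only `$EXAMPLE_DATA_DIR` comes from GToTree's conda install.

```shell
# Hypothetical helper: list the packaged example data, or say why it can't.
# EXAMPLE_DATA_DIR is expected to be set only inside the gtotree conda environment.
show_example_data() {
    if [ -z "${EXAMPLE_DATA_DIR:-}" ]; then
        echo "EXAMPLE_DATA_DIR is not set; is the gtotree environment active?"
        return 1
    fi
    ls "$EXAMPLE_DATA_DIR"
}

show_example_data || true
```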


Installation without conda

Again, the conda installation is highly recommended as it is more robust across different systems. But to try installing without conda, download and unpack/decompress GToTree wherever you'd like it to live on your system (be sure to change the versions below to the latest found here):

curl -L https://github.com/AstrobioMike/GToTree/archive/v1.5.22.tar.gz -o GToTree-v1.5.22.tar.gz
tar -xzvf GToTree-v1.5.22.tar.gz
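Since the version number appears in several places, one option is to keep it in a single variable. This is only a sketch: the `gtt_*` variable names are made up for illustration, and the commands are printed rather than run so you can look them over first.

```shell
# Build the download URL and file names from one version string.
# The gtt_* variable names are illustrative, not something GToTree defines.
gtt_version="1.5.22"   # replace with the latest release
gtt_url="https://github.com/AstrobioMike/GToTree/archive/v${gtt_version}.tar.gz"
gtt_tarball="GToTree-v${gtt_version}.tar.gz"
gtt_dir="GToTree-${gtt_version}"   # directory the tarball unpacks into

# Print the commands instead of running them.
echo "curl -L $gtt_url -o $gtt_tarball"
echo "tar -xzvf $gtt_tarball"
echo "cd $gtt_dir/bin"
```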

Add the bin to your PATH

Now we need to add the "bin" directory to our PATH (see here if you are unfamiliar with what the PATH is and you'd like to know more). One way we can do this is to change directories into the bin, and use pwd inside an echo command to put the full path into our PATH:

cd GToTree-1.5.22/bin # make sure you are in this bin directory
echo "export PATH=\"$(pwd):\$PATH\"" >> ~/.bash_profile
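One thing to be aware of is that running the echo line more than once appends the same export line each time. A minimal sketch of a guard against that is below; `append_once` is a hypothetical helper, and the `/opt/...` path and scratch file are stand-ins so the example doesn't touch your real ~/.bash_profile.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present in it.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    if ! grep -qxF -e "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demo on a scratch file standing in for ~/.bash_profile.
# The /opt path is a made-up install location.
profile="$(mktemp)"
append_once 'export PATH="/opt/GToTree-1.5.22/bin:$PATH"' "$profile"
append_once 'export PATH="/opt/GToTree-1.5.22/bin:$PATH"' "$profile"   # no-op
cat "$profile"
```

For real use from inside the bin directory, the call would be something like `append_once "export PATH=\"$(pwd):\$PATH\"" ~/.bash_profile`.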

Add path to included HMM files

If you'd like to use the included single-copy gene HMM profiles more easily, you can also add a variable to your bash profile so that you don't need to provide the full path to them whenever you use them. If you change directories into the "hmm_sets" directory, this can be done in a similar way as above:

cd ../hmm_sets/ # from where we were above
echo "export GToTree_HMM_dir=\"$(pwd)/\"" >> ~/.bash_profile

Last thing to do is source the ~/.bash_profile we just modified so those changes take effect in our current session:

source ~/.bash_profile

You can run gtt-hmms with no arguments to make sure the default HMM directory is set, and see what taxa the currently available HMM files can more specifically target.
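If something doesn't seem to have taken effect, a quick sanity check along these lines can help narrow it down. This is a sketch only: `dir_in_path` is a hypothetical helper, and `$HOME/GToTree-1.5.22/bin` is an assumed install location that you'd swap for your own.

```shell
# Hypothetical helper: return 0 if directory $1 is an entry in the
# colon-separated list $2 (such as $PATH), 1 otherwise.
dir_in_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed install location; change it to wherever you unpacked GToTree.
gtt_bin="$HOME/GToTree-1.5.22/bin"

if dir_in_path "$gtt_bin" "$PATH"; then
    echo "bin directory is on PATH"
else
    echo "bin directory is NOT on PATH: $gtt_bin"
fi

if [ -d "${GToTree_HMM_dir:-}" ]; then
    echo "GToTree_HMM_dir points to: $GToTree_HMM_dir"
else
    echo "GToTree_HMM_dir is unset or not a directory"
fi
```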

And now if you type GToTree with no arguments, you should see the help menu (but note that you still need to take care of the dependencies presented below before you're ready to rock):

                                  GToTree v1.5.22
                         (github.com/AstrobioMike/GToTree)


 ----------------------------------  HELP INFO  ----------------------------------

  This program takes input genomes from various sources and ultimately produces
  a phylogenomic tree. You can find detailed usage information at:
                                  github.com/AstrobioMike/GToTree/wiki


 -------------------------------  REQUIRED INPUTS  -------------------------------

      1) Input genomes in one or any combination of the following formats:
        - [-a <file>] single-column file of NCBI assembly accessions
        - [-g <file>] single-column file with the paths to each GenBank file
        - [-f <file>] single-column file with the paths to each fasta file
        - [-A <file>] single-column file with the paths to each amino acid file,
                      each file should hold the coding sequences for just one genome

      2)  [-H <file>] location of the uncompressed HMM file being used, or just the
                      HMM name if you've set the environment variable 'GToTree_HMM_dir'
                      to the appropriate location or installed via conda (run 'gtt-hmms'
                      by itself to view the available gene-sets)


 -------------------------------  OPTIONAL INPUTS  -------------------------------


      Output directory specification:

        - [-o <str>] default: GToTree_output
                  Specify the desired output directory.


      User-specified modification of genome labels:

        - [-m <file>] specify desired genome labels
                  A two- or three-column tab-delimited file where column 1 holds either
                  the file name or NCBI accession of the genome to name (depending
                  on the input source), column 2 holds the desired new genome label,
                  and column 3 holds something to be appended to either initial or
                  modified labels (e.g. useful for "tagging" genomes in the tree based
                  on some characteristic). Columns 2 or 3 can be empty, and the file does
                  not need to include all input genomes.


      Options for adding taxonomy information:

        - [-t ] default: false
                  Provide this flag with no arguments if you'd like to add NCBI taxonomy
                  info to the sequence headers for any genomes with NCBI taxids. This will
                  will largely be effective for input genomes provided as NCBI accessions
                  (provided to the `-a` argument), but any input GenBank files will also
                  be searched for an NCBI taxid. See `-L` argument for specifying desired
                  ranks.

        - [-D ] default: false
                  Provide this flag with no arguments if you'd like to add taxonomy from the
                  Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
                  effective for input genomes provided as NCBI accessions (provided to the
                  `-a` argument). This can be used in combination with the `-t` flag, in
                  which case any input accessions not represented in the GTDB will have NCBI
                  taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
                  specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
                  for help getting input accessions based on GTDB taxonomy searches.

        - [-L <str>] default: Domain,Phylum,Class,Species,Strain
                  A comma-separated list of the taxonomic ranks you'd like added to
                  the labels if adding taxonomic information. E.g., all would be
                  "-L Domain,Phylum,Class,Order,Family,Genus,Species". Note that
                  strain-level information is available through NCBI, but not GTDB.


      Filtering settings:

        - [-c <float>] default: 0.2
                  A float between 0-1 specifying the range about the median of
                  sequences to be retained. For example, if the median length of a
                  set of sequences is 100 AAs, those seqs longer than 120 or shorter
                  than 80 will be filtered out before alignment of that gene set
                  with the default 0.2 setting.

        - [-G <float>] default: 0.5
                  A float between 0-1 specifying the minimum fraction of hits a
                  genome must have of the SCG-set. For example, if there are 100
                  target genes in the HMM profile, and Genome X only has hits to 49
                  of them, it will be removed from analysis with default value 0.5.

        - [-B ] default: false
                  Provide this flag with no arguments if you'd like to run GToTree
                  in "best-hit" mode. By default, if a SCG has more than one hit
                  in a given genome, GToTree won't include a sequence for that target
                  from that genome in the final alignment. With this flag provided,
                  GToTree will use the best hit. See here for more discussion:
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider


      Additional PFam searching:

        - [-p <file>] single-column file of additional PFam targets to search for.
                  Table of hit counts, fasta of hit sequences, and files compatible
                  with the iToL web-based tree-viewer will be generated for each
                  target. See visualization of gene presence/absence example at
                  github.com/AstrobioMike/GToTree/wiki/example-usage for example.


      General run settings:

        - [-N ] default: false
                  No tree. Generate alignment only.

        - [-T <str>] default: FastTree
                  Which program to use for tree generation. Currently supported are
                  "FastTree" and "IQ-TREE". As of now, these run with default settings
                  only (and IQ-TREE includes "-mset WAG,LG". To run either with more
                  specific options (and there is a lot of room for variation here), you
                  can use the output alignment file from GToTree (and partitions file if
                  wanted for mixed-model specification) as input into a dedicated treeing
                  program.

        - [-n <int> ] default: 2
                  The number of cpus you'd like to use during the HMM search. (Given
                  these are individual small searches on single genomes, 2 is probably
                  always sufficient.)

        - [-j <int> ] default: 1
                  The number of jobs you'd like to run in parallel during steps
                  that are parallelizable. This includes things like downloading input
                  accession genomes and running parallel alignments, but not the tree step.
                  Note that I've occassionally noticed NCBI not being happy with over ~50
                  downloads being attempted concurrently. So if using a `-j` setting around
                  there or higher, and GToTree is saying a lot of input accessions were not
                  successfully downloaded, consider trying with fewer.
        - [-P ] default: false
                  Provide this flag with no arguments if your system can't use ftp,
                  and you'd like to try using http.

        - [-d ] default: false
                  Provide this flag with no arguments if you'd like to keep the
                  temporary directory. (Mostly useful for debugging.)


 --------------------------------  EXAMPLE USAGE  --------------------------------

	GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4
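To make the two filtering settings in that help menu a bit more concrete, here is the arithmetic from its examples as a small sketch. It reflects my reading of the help text only, not GToTree's actual implementation, and the function names `length_window` and `min_hits` are made up for illustration.

```shell
# Print the shortest and longest sequence lengths retained for a gene set,
# given the median length ($1) and the -c value ($2).
length_window() {
    awk -v m="$1" -v c="$2" 'BEGIN { printf "%g %g\n", m * (1 - c), m * (1 + c) }'
}

# Print the number of target-gene hits that corresponds to the -G fraction,
# given the number of targets in the HMM set ($1) and the -G value ($2).
min_hits() {
    awk -v n="$1" -v g="$2" 'BEGIN { printf "%g\n", n * g }'
}

length_window 100 0.2   # help-text example: median of 100 AAs, default -c
min_hits 100 0.5        # help-text example: 100 targets, default -G
```

With the defaults, a median of 100 gives a window of 80 to 120, and a genome with 49 hits out of 100 targets falls below the cutoff, matching the examples in the help menu.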

Installing dependencies without conda

By far, the easiest way to get all the dependencies up and running is with conda as done above. But if you don't want to use conda, here are links to installing all the dependencies (be sure to install Easel along with HMMER3 as well if you are doing things the non-conda way).

Essential dependencies

If you use GToTree, please be sure to cite these folks – a citations.txt file including used programs is produced with each run to help 🙂

Optional dependencies depending on use

If you use GToTree in a manner that uses these tools, please cite these folks – a citations.txt file including used programs is produced with each run to help 🙂

  • Prodigal v2.6.3 - citation
    • if providing input genomes in fasta format, or GenBank format with no CDS annotations, or NCBI accessions to genomes with no gene calls
    • if providing input genomes as NCBI assembly accessions
  • TaxonKit v0.6.0 - citation
    • if adding NCBI taxonomy information to input genomes
  • Genome Taxonomy Database Release R05-RS95 - citation
    • if adding GTDB taxonomy information to input genomes
  • GNU Parallel v20161122 - citation info
    • if running things in parallel (specifically set with the -j argument)
  • IQ-TREE v1.6.9 - citation
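
A quick way to see which of these are already available is to look for their executables on your PATH. In the sketch below, `check_commands` is a hypothetical helper, and the executable names (`prodigal`, `taxonkit`, `parallel`, `iqtree`) are assumptions about how each tool is typically installed, so they may differ on your system.

```shell
# Hypothetical helper: print "found" or "missing" for each command name given.
# Returns the number of commands that were not found on PATH.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf 'found    %s\n' "$cmd"
        else
            printf 'missing  %s\n' "$cmd"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Executable names are assumed; adjust them to match your installs.
check_commands prodigal taxonkit parallel iqtree \
    || echo "Some optional tools were not found on PATH."
```

This only checks that a command exists, not that it is the version listed above.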

NOTE: If doing a non-conda installation, you may also need to temporarily change your terminal's localization settings if you're not in the United States or Australia, as GToTree expects things to be encoded a certain way. If you run locale in the terminal, you will get a list of these. If any do not say "en_US.UTF-8", then you can run these two commands to temporarily change them (for the current terminal session): export LC_ALL="en_US.UTF-8" and export LANG="en_US.UTF-8". Now in this terminal window, GToTree will run appropriately. When you open a new terminal, your settings will be back to the way they were.
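
As a rough sketch of that check, the function below prints only the locale settings that do not say en_US.UTF-8. The name `non_en_us_settings` is made up for illustration. Note that an empty `LC_ALL=` line gets listed too, since it doesn't say en_US.UTF-8 either.

```shell
# Hypothetical helper: read `locale` output on stdin and print every line
# whose value is not en_US.UTF-8 (quoted or unquoted). Prints nothing if
# all settings match.
non_en_us_settings() {
    grep -v -E '^[A-Z_]+="?en_US\.UTF-8"?$' || true
}

if command -v locale >/dev/null 2>&1; then
    locale | non_en_us_settings
fi
```

If that prints anything, the two export commands above are the ones to run for the current session.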

