
CLI argument to allow skipping genomes when lineage_tup is None? #25

@ccbaumler

When a genome's lineage in the taxonomy file contains no rank information, we get the output from Line 241.

I worked around this by updating the taxdump names and nodes files from NCBI, but should we allow skipping unclassified genomes, or group them in some other way?
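To make the question concrete, here is a minimal sketch of what an opt-in skip could look like. It is not the plugin's actual code: the `--skip-unclassified` flag, `build_parser`, and `group_by_lineage` are hypothetical names, and the only thing taken from this issue is the idea that a missing lineage shows up as `lineage_tup` being `None` (or empty).

```python
import argparse


def build_parser():
    # Hypothetical flag; the plugin does not currently define it.
    parser = argparse.ArgumentParser(description="sketch of a skip option")
    parser.add_argument(
        "--skip-unclassified",
        action="store_true",
        help="skip genomes whose lineage is empty instead of stopping",
    )
    return parser


def group_by_lineage(idents, lineages, skip_unclassified=False):
    """Group identifiers by lineage tuple.

    `lineages` maps ident -> lineage tuple, or None/() when the taxonomy
    file has no rank information for that ident (an assumption about how
    the empty rows are represented after loading).
    """
    grouped = {}
    skipped = []
    for ident in idents:
        lineage_tup = lineages.get(ident)
        if not lineage_tup:
            if skip_unclassified:
                skipped.append(ident)
                continue
            raise ValueError(f"no taxonomy information for {ident}")
        grouped.setdefault(lineage_tup, []).append(ident)
    return grouped, skipped
```

With the flag set, the caller gets the skipped identifiers back and can report them at the end of the run; without it, the behavior stays as strict as it is today.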

code:

== This is sourmash version 4.9.0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loading taxonomies from ['/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806/lineages.protozoa.csv']
found 2424 identifiers in taxdb.
selecting sketches: k=21 scaled=1000 moltype=DNA
loading sketches from file /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806/genbank-20250806-protozoa-k21.zip
cannot find ident GCA_051400955 in the provided taxonomy ifle.
The three closest matches to GCA_051400955 are:
* 'GCA_015146095.1'
* 'GCA_002140095.1'
* 'GCA_964014055.1'

No taxonomy information in the lineage file.

/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806$ grep 051400955 lineages.protozoa.csv
GCA_051400955.1,3042617,,,,,,,,

/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806$ grep 051624525 lineages.fungi.csv
GCA_051624525.1,3075319,,,,,,,,

/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806$ grep 050924785 lineages.viral.csv
GCA_050924785.1,2851401,,,,,,,,,,,,,,,,

/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20250806$ grep 050886295 lineages.archaea.csv
GCA_050886295.1,3025951,,,,,,,,
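Rather than grepping one accession at a time, rows like these can be listed up front. The sketch below assumes the layout seen in the grep output above: the first two columns are the identifier and the taxid, and every remaining column is a rank. The function name `find_empty_lineages` and the header and classified row in the usage example are made up for illustration.

```python
import csv
import io


def find_empty_lineages(handle, n_id_columns=2):
    """Return identifiers whose rank columns are all empty.

    Assumes the first `n_id_columns` columns are ident and taxid and
    that everything after them is a taxonomic rank. A header row is
    never reported because its rank columns are not empty.
    """
    empty = []
    for row in csv.reader(handle):
        if not row:
            continue
        ranks = row[n_id_columns:]
        if ranks and all(not cell.strip() for cell in ranks):
            empty.append(row[0])
    return empty


# Usage with an in-memory file; the header and the first data row are
# placeholders, the last row is copied from the grep output above.
example = io.StringIO(
    "ident,taxid,r1,r2,r3,r4,r5,r6,r7,r8\n"
    "PLACEHOLDER_1,1,a,b,c,d,e,f,g,h\n"
    "GCA_051400955.1,3042617,,,,,,,,\n"
)
unclassified = find_empty_lineages(example)
```

Running this over each lineages CSV before building the database would show how many genomes are affected, which may help decide between skipping them and grouping them.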

They all contain taxonomy information from NCBI...

https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_051400955.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_051624525.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_050924785.1/

Using sourmash_plugin_pangenomics version 0.3.1

$ sourmash info -v

== This is sourmash version 4.9.0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

sourmash version 4.9.0
- loaded from path: /home/baumlerc/miniforge3/envs/pangenomes/lib/python3.13/site-packages/sourmash/cli

khmer version: None (internal Nodegraph)

screed version 1.1.3
- loaded from path: /home/baumlerc/miniforge3/envs/pangenomes/lib/python3.13/site-packages/screed

the following plugins are installed:

plugin type          from python module             v     entry point name
-------------------- ------------------------------ ----- --------------------
sourmash.cli_script  sourmash_plugin_pangenomics    0.3.1 classify_command
sourmash.cli_script  sourmash_plugin_pangenomics    0.3.1 createdb_command
sourmash.cli_script  sourmash_plugin_pangenomics    0.3.1 merge_command
sourmash.cli_script  sourmash_plugin_pangenomics    0.3.1 ranktable_command
