Clusterupdate : clustering of deleted sequences and conversion to tsv file #272

@ApollineBruley

Expected Behavior

I want to update my clusters after a database update in which I both add new sequences and delete some sequences relative to the old database.
The clusterupdate command runs without error, but converting the resulting cluster database to a tsv file fails with an index-related error message (see below).

I tried the same procedure on a new database in which I only added sequences, and it worked perfectly, so I assume the problem is caused by removing sequences from the old database.

Current Behavior

createtsv fails with an error when generating the tsv file.
In the cluster database obtained after clusterupdate ('CLU_updated'), the removed sequences still appear, but they are absent from the updated sequence database ('DB_updated').

Steps to Reproduce (for bugs)

  1. Creation of old DB (oldDB.fa : 17 amino acid sequences)
    mmseqs createdb oldDB.fa DB_old

  2. Clustering of old DB
    mmseqs cluster DB_old CLU_old tmp

  3. Creation of new DB (newDB.fa : 13 sequences are identical to those in the old DB, 4 were removed, 4 were added)
    mmseqs createdb newDB.fa DB_new

  4. Cluster update
    mmseqs clusterupdate DB_old DB_new CLU_old DB_updated CLU_updated tmp
    No error here, but even though the sequences with numeric identifiers 12, 11, 16 and 15 in the old DB have been removed, they still appear in the CLU_updated file. They do not appear in the DB_updated files.

  5. Conversion of the cluster DB to tsv:
    mmseqs createtsv DB_updated DB_updated CLU_updated clusters.tsv
    => Error message, and only empty files are generated: clusters.tsv.1 ... clusters.tsv.7 and clusters.tsv.index.1 ... clusters.tsv.index.7
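
To check the observation from step 4 more systematically, a small shell sketch like the one below could list the keys that are referenced in CLU_updated but have no entry in DB_updated.index. It rests on two assumptions that are not confirmed in this report: that the .index file is tab-separated with the numeric key in its first column, and that an uncompressed cluster database stores each cluster as a NUL-terminated list of member keys, one per line. The function name missing_keys and the file clu_member_keys.txt are hypothetical names chosen for illustration.

```shell
# Print every key listed in KEYS_FILE that has no entry in INDEX_FILE.
# Assumption: INDEX_FILE is tab-separated with the key in column 1.
missing_keys() {
    keys_file=$1
    index_file=$2
    awk -v idx="$index_file" '
        BEGIN {
            # Load all known keys from the index file.
            while ((getline line < idx) > 0) {
                split(line, field, "\t")
                known[field[1]] = 1
            }
            close(idx)
        }
        # Skip blank lines; report keys that the index does not contain.
        NF && !($1 in known) { print $1 }
    ' "$keys_file"
}

# Only runs when the files from the steps above are present.
if [ -f CLU_updated ] && [ -f DB_updated.index ]; then
    # Assumption: uncompressed cluster entries are NUL-terminated lists of member keys.
    tr '\0' '\n' < CLU_updated | sort -u > clu_member_keys.txt
    missing_keys clu_member_keys.txt DB_updated.index
fi
```

If the assumptions hold, the printed keys would be the ones that createtsv cannot resolve in DB_updated; in this report they would be expected to include 11, 12, 15 and 16.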

MMseqs Output (for bugs)

Program call:
createtsv DB_updated DB_updated CLU_updated clusters.tsv 

MMseqs Version:                  	2f66ae897fc813450fa5ef0c78123bd3c41c4717
first sequence as respresentative	false
Target column                    	1
Add Full Header                  	false
Database Output                  	false
Threads                          	8
Compressed                       	0
Verbosity                        	3

Query database: DB_updated
Touch data file DB_updated_h ... Done.
Result database: CLU_updated
Start writing to clusters.tsv
Invalid database read for database data file=DB_updated_h, database index=DB_updated_h.index
getData: local id (4294967295) >= db size (17)
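
A note on the value in the last line: 4294967295 is 2^32 - 1, the largest unsigned 32-bit integer. That would be consistent with a lookup that failed and returned an "invalid id" sentinel rather than a real position in the 17-entry database, although this is only a guess about the internals and is not confirmed by the log. The arithmetic can be checked with the short sketch below, which assumes a shell with 64-bit integer arithmetic.

```shell
# Largest value that fits in an unsigned 32-bit integer: 2^32 - 1.
max_u32=$(( (1 << 32) - 1 ))

# Local id copied from the getData error message above.
reported_id=4294967295

if [ "$reported_id" -eq "$max_u32" ]; then
    echo "reported id equals the unsigned 32-bit maximum"
fi
```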

Your Environment

  • Git commit used: 2f66ae8
  • Which MMseqs version was used: Compilation from source
  • Cmake versions used: cmake version 3.5.1
  • Operating system and version: Ubuntu 16.04 LTS

Thank you in advance for your help :)
