Releases: soedinglab/MMseqs2
MMseqs2 Release 18-8cc5c
MMseqs2 Release 18 at a glance: new Forward–Backward aligner, re-enabled substitution matrix parameter estimation, faster ARM64 alignments, improved GPU support.
Breaking changes
- CMake ≥ 3.15 and <4 is required to compile MMseqs2 (da0b2c3).
- `gpuserver` no longer accepts the `--gpu` parameter (3b5d13e).
- Databases generated by the `databases` module are now GPU compatible. This might slightly alter search/clustering results since the sequence order was changed.
New features and enhancements
- New Forward‑Backward (FWBW) aligner `fwbw` (003fabc, fb687b7) by @Gyuuul2 @elpis51613 @lasseReifenrath
- Custom substitution matrices are supported again through a new lambda calculator (5ebd6e9, efad625, b76ebc4) by @edawson.
- Proximity‑aware pairing: `pairaln` can now match sequences that are physically close in accession space (`--pairing-filter 1` and `--pairing-prox-dist`; e019185, c9107ba, 835acb9, 1970db6, f6e9636, 60a894b).
- `createdb` and `databases` accept the `--gpu` parameter to directly produce GPU databases (0578939, 90ee542). Sequence databases generated by the `databases` module use the flag by default and are GPU compatible.
- Speedup aarch64 SIMD alignment with new/improved `simd_any`, `simd_eq_all`, `simd_hmax*` instructions (103fe79). Thanks @nskyav
- aarch64 GPU binaries `mmseqs-linux-gpu-arm64.tar.gz` are now built with Clang 20 and wide Neon registers for additional speedup (62cf4d0, 9564601, 1668032). Thanks to @nskyav. `mmseqs-linux-arm64.tar.gz` is still compiled with the older, slower configuration.
- `createdmptaxonomy` allows converting taxonomy databases back to .dmp files (bc0f9cb).
- `taxonomyreport` can now emit one database per query with `--report-mode 3` (8284a8b, 033d5f5).
- Reduce thread start overhead in `expandaln`, `pairaln`, `subtractdbs` and `unpackdb` (18d8ddc).
Bug fixes
- Improve error handling for `createindex` with only ungapped prefilter (9668e96, 829003a).
- Fix precomputed index being slightly too large (b98f207).
- `expandaln` now skips entries lacking alignments (9c13275) and finds correct representatives (#691, 8783404).
- `clusterupdate` preserves members correctly (#961, e7f5852, 296d912).
- MPI nucleotide clustering was crashing (defe1af)
- SAM start coordinate was wrong sometimes (c13eef0)
- Systems with old `mawk` could result in corrupted databases due to large‑int printing (eaecacf)
- Wrong `createdb` mode message (#955, 48143e7)
- Fix `createdb` could crash if FD 0 was closed (99a025e). Thanks @jnooree
Developer notes
MMseqs2 Release 17-b804f
MMseqs2 Release 17 is mostly a bug fix release. Highlights include usability improvements in MMseqs2-GPU and a fix for a common crash in the prefilter that was affecting many clustering runs.
New Features and Enhancements
- `result2profile` can print frequencies in TSV format (c2c3ad9)
- Add a new masking mode `--mask-n-repeat` (c2c3ad9)
- Improve GPU clients in server mode to wait for databases to be loaded (e095774)
- GPU server now takes `CUDA_VISIBLE_DEVICES` into account. (b804fbe)
- Reduced glibc requirements for precompiled MMseqs2-GPU binaries to 2.17 (i.e. CentOS 7; db8ad2d)
Bug Fixes
- Segmentation fault in `easy-cluster` starting in #916 (dc7f8ad)
- GPU version generated corrupted sequence outputs #912 (e3b16fa)
- Sequences starting with `*` could break Sequence mapping #927 (492297b)
- Indexes without k-mer index are masked now (4766f92)
- Invalid taxids check in `majoritylca` does not abort the whole process (8d17137)
- Merged taxID larger than any taxid in `nodes.dmp` could corrupt memory #931 (fd37b37)
Developer Notes
- Export NATIVE_ARCH in cmake (17cd5c0)
MMseqs2 Release 16-747c6
MMseqs2 Release 16 introduces support for GPU-accelerated searches [1]. Additionally, we fixed numerous bugs and relicensed MMseqs2 under the MIT license.
[1] Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Dallago C, Mirdita M, Schmidt B, Steinegger M: GPU-accelerated homology search with MMseqs2. bioRxiv (2024).
Breaking Changes
- Custom substitution matrices (--seed-sub-mat, --sub-mat) are not supported in this release. Only the built-in matrices will work. We will restore support in the next release. (93b2d94)
New Features and Enhancements
- Added GPU support to MMseqs2, allowing for faster computations of sensitive alignments on CUDA-compatible hardware on the Turing generation or newer (a66ad0c, 81171a5, 1806c0c)
- Added full-length six-frame translated search with `--translation-mode 1` (#885)
- Implement `qframe` and `tframe` output fields in `convertalis` (#615, #803, 417f22f)
- Allows resuming of interrupted downloads in `databases` and `createtaxdb` (0b27c9d)
- MMseqs2 taxonomy now always keeps at least the longest open reading frame within each input sequence after fragment elimination (#832, 5b4c816)
- Added option to not compress outputs in `tsv2exprofiledb` (a146887)
- `filterdb` has learned a new sort mode (`--sort-entries 4 --weights file`) to sort by priority (54f8983)
- Updated tantan (3e53eee)
Bug Fixes
- `prefilter` could use excessive memory and crash for highly redundant databases (950342d)
- `prefilter` was not properly evaluating the last potential hit; the fix slightly increases the sensitivity of the k-mer prefilter (06f7429)
- `result2msa` works correctly with clustered databases (78ae2c5)
- Fixed `ppos` output field calculation in `convertalis` (fb38b7d, 816c5c9)
- Fixed wrong coverage being passed to realignment (6267ffb)
- Fixed `--taxon-list` being broken in multi-threaded `prefilter` and `ungappedprefilter` (804bb2a)
- Fixed segmentation fault in `prefilter` (#872, a64d60a, ef2ebe9)
- Fixed inconsistent ordering issue in `createclusearchdb` (b59ad53)
- Corrected backtrace in SAM output for nucleotide-protein alignments and show reverse complement sequence correctly (#845, 5f23f1f)
Developer Notes
MMseqs2 Release 15-6f452
MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (--prefilter-mode 1). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
Breaking
- Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink
New Features and Enhancements
- Implement additional `prefilter` modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
- Added `createclusearchdb` and `mkrepseqdb` modules to build cluster-search databases. This was implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
- Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
- Rework `ungappedprefilter` to improve performance and expose additional parameters such as taxon filtering and db-load-mode to `ungappedprefilter` (8a89305, 800eb09, eb01b5b, 20d3afc)
- Added `gappedprefilter` module for Smith-Waterman prefiltering, similar to `ungappedprefilter` (df77d9e)
- Reworked `pairaln` for the ColabFold greedy taxonomy pairing mode (1514015)
- Implemented experimental module for A3M filtering (167bbd1, 499bb73)
- Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
- Precomputed indices without k-mers can be created with `--index-subset` (314c1f0, 8fe3bf9)
- Add `result2neff` module to extract Neff scores (4148e09) Thanks @neftlon
- Add `ppos` format-output to `convertalis` for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
- Speed-up FASTA parsing in `kseq.h` with memchr (98406dd) Thanks @valentynbez @kloetzl
Bugfixes
- Add min and max modes for `result2stats` (19dce03, 61e7734) Thanks @ClovisG
- Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
- Fix crash when some input file sizes are an exact multiple of 4096 in `convertalis` and `gff2db` (712f288) Thanks @RuoshiZhang
- Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
- Fix source number being limited to 16-bit (65k) (1d62fa0)
- `kseq` now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
- Fixed `unpackdb` to work without a `.lookup` file and added support for writing compressed files (92d8cc3, 570e3ed)
- `createindex --check-compatible` checks the k-mer threshold correctly now (bb0a1b3)
- Fixed excessively long `prefilter` result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
- Corrected handling of multiline checks in `createdb` (6b93884)
- Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
- Fixed logic in reciprocal-best-hit by removing `resAB_sort` (3bcbdba) Thanks @StephanieSKim
- Corrected handling of differently ordered parts of sequence databases in `concatdbs` (ea17d30)
- Fix `--single-step-clustering` misspelled in cluster warning (fa6c093) Thanks @valentynbez
Build and Compatibility Updates
- Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
- Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
- Updated regression testing to fix errors in MPI test (2113766)
Developer
MMseqs2 Release 14-7e284
This is a major release containing features implemented for ColabFold, Foldseek, MMseqs2 profile-profile (not published yet, and still in preview) and many bugfixes. Thanks a lot to the contributors who submitted bug fixes.
If you are using the Docker Hub based MMseqs2 containers, please switch to the new Github Container Registry based ones. The Docker Hub containers will not be maintained in the future.
Breaking
- Profile databases created by previous MMseqs2 releases won't work anymore with this release. Please recreate them from previous search results or MSAs with `result2profile` or `msa2profile`.
- Profile k-mer threshold parameters were fitted to the new pseudo-count parameters (`--pca`, `--pcb`). Previous `--k-score` parameters will have differing sensitivity. However, most users will have set `-s` instead, which was fitted to match as closely as possible.
Features
- `gff2db` now should actually work correctly after refactoring (488df86, thanks @RuoshiZhang)
- `result2msa` now supports reading from precomputed index
- Add `db2tar`: Create a tar file from a database
- Add parsable columnar tsv output to `databases` with `--tsv`
- Add taxonomic filtering during `prefilter` with `--taxon-list`
- Add `--comp-bias-corr-scale` to adjust the weight of the compositional bias correction
- Add `--mask-prob` parameter to adjust tantan's masking threshold
- Add context specific pseudo-counts to `result2profile`
- Add iterative profile-profile search workflow (thanks @haydenji0731)
- Add support for profile-profile scoring in striped Smith-Waterman algorithm (thanks @haydenji0731)
- Add support for gap-open/gap-close costs to striped Smith-Waterman algorithm (thanks @hgsommer)
- Add environment variable `MMSEQS_IGNORE_INDEX` to ignore an existing precomputed index
- `createsubdb` and `view` can now return results from identifiers in `.lookup` with `--id-mode 1`
- Change `compressdb` loop to `omp static` to keep order
- Improvements to nucleotide alignments and scoring (thanks @AnnSeidel)
Features built for ColabFold now available in MMseqs2
- Add `pairaln`: taxonomic pairing on sequences for MSA building (9a0df0d, 5e245d1, 3f8695e, 3e92abf, edb8223, e19df7c)
- Add A3M support to `result2msa` (`--msa-format-mode 5`)
- Add A3M support with alignment information to `result2msa` (`--msa-format-mode 6`)
- `result2profile` allows `--diff 0`
- Make taxonomy mapping mmap'able for (near) instant read-in
- Add workflow to create expandable profile (profile-profile) db from TSVs `tsv2exprofiledb`
- Enable `result2profile`/`filterresult` to read new expand alignment index
- Add support to filter MSAs in buckets `filterresult`, `result2profile`
- Add `--filter-min-enable` to enable filtering only above a minimum threshold of hits (c6d8ae0)
- Expand can filter in each target cluster before expanding (75af0c8, 85ce847)
Bugfixes
- `summarizeresult` was rejecting hits that match the coverage threshold exactly (#586, 67949d7)
- Don’t use reserved filename characters in unpackdb (#467, c663497 thanks @cutecutecat)
- Fix typo (violoations -> violations) (#526, 74c3aa6, thanks @Benjamin-Lee)
- Fix potential endless loop in `rescorediagonal`
- Fix prefilter/alignment with 0-size query input #433
- Fix `unpackdb` parameter parsing issue
- Make sure `FILTER_RESULT` variable is always correctly set for exhaustive search (d4a3354)
- `tar2db` breaking with `--tar-include/exclude` (#561)
- Wrong database name printed for variadic input when creating a tmp directory
- `extractorfs` sometimes loading invalid start/stop codons on non-avx2 platforms
- Don't mask consensus sequences in profiles
- `result2msa` correctly prints X residues
- Allocate `CSProfile` only if it's going to be used (d873697)
- Taxonomy db paths are now correctly found if given a precomputed index (8ff26f2)
- Encode more strings internally as base64 if special characters are used (16b5774, d155586)
- Disable broken iterative profile searches in taxonomy (#432)
- Fixed a possible segmentation fault in `align` (thanks @rchikhi)
MMseqs2 databases
- Added VOGDB
- Updated dbCAN2 to V9 and removed `.aln` suffix from profile names
- Fix issues with ResFinder (#494, 56816b3), GTDB (#561, 678c82a), Kalamari (#531, ce7bf53), Uniref (#496, e85ceb9, thanks to @fanhuan)
Speedup
- Rework of `result2msa` to avoid allocating a lot of memory
- Improvement of speed for ungapped alignment in `prefilter`
- `TaxonomyExpression` is faster with a single tax identifier (8ff7279)
MMseqs2 subprojects
- MMseqs2-based subprojects can use `databases` too (5afd33c)
- Add `appenddbtoindex`: augment a precomputed index with other databases in sub-projects
- Allow subprojects to build their own precomputed indices (a506d67)
- Add support for external k-mer thresholds for the prefilter (fea8d20)
- Subprojects can define their own DbType validators
Developers
- Added CirrusCI to test FreeBSD and old compilers (a2e2129, 904d0c6, a09a704, 4f1996a, 482dedc, 16830a5)
- MMseqs2 Docker containers are now published in the Github Container Registry (eb203d3, 5185d3c, ba4e11f)
- Our microtar fork can write tar files again (dcd180b)
- Add URIs as allowed parameter inputs (3b9cf88)
- Additional s390x fixes (linclust might work now)
- Add support for new MultiParameter type
- Bundled SIMDe was updated (thanks @mr-c)
MMseqs2 Release 13-45111
New Taxonomy Workflow (new feature and breaking change)
We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:
The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.
As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) as default. Approximate (see manuscript) 2bLCA is now again the default and we automatically switch to top-hit if given nucleotide-to-nucleotide input.
Breaking changes
- `--slice-search` is now called `--exhaustive-search`
- Unify `--compress`, `--summarize` and `--omit-consensus` (in `result2msa`) to `--msa-format-mode`
Features
- Add GTDB and CDD to databases downloader #410
- Add `nrtotaxmapping` to create taxonomy mapping from NR
- Add `unpackdb` to split a database into separate files #406
- Add `majoritylca` module for majority voting based taxonomy from alignment results
- Add `cpdb` and `lndb`
- Taxonomy information is stored in binary format (a single `db_taxonomy` file, instead of `db_{named,nodes,merged}.dmp`, `db_mapping`) to speed up read-in. Old format is still supported.
- `--exhaustive-search` is usable with ungapped alignments (`--alignment-mode 4`)
- Allow sequence/result database input in `taxonomyreport` #401/#408
- `msa2profile/result` can skip the first sequence with `--skip-query`
- `createtaxdb` can create a taxdb by mapping through `.source` in addition to `.lookup` (`--tax-mapping-mode 1`)
- `splitsequence` can create a sequence database with original headers
- `align` can return short cluster format if only identifiers are required `--alignment-output-mode`
- `tar2db` can be used multi-threaded if input allows (e.g. `.tar` containing `.gz` files)
- Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
- Split non-index parts over additional files in split index case to reduce peak memory use
- `proteinaln2nucl` can now compute scores and e-values
- `createdb` can create a sequence database from a database containing fasta files (e.g. created by `tar2db`)
- Add `MMSEQS_FORCE_MERGE` environment variable to force generating fully merged databases
- Improved many descriptions, warnings and error messages
Bugs fixed
- Fix `filterresult` off by one issue removing wrong sequences
- Fix `addtaxonomy` always crashing due to invalid check #355
- Reduce numbers of calls to `posix_memalign` to fix lock contention on macOS
- `extractorfs` doesn't flood warnings due to short sequences anymore
- `expand2profile` `--pca` is correctly set to `0`
- `msa2profile` always copies `.lookup`/`source` files instead of symlinking
- Clustering of clustering input would not work with set-cover or connected-component
- Short circuit `--cluster-reassign` if nothing can be reassigned
- Fix temporary files not getting removed in `linclust`/`cluster` with `--remove-tmp-files`
- Fix `kmermatcher` setting user k-mer pattern in auto k-mer selection and breaking
- Krona `taxonomyreport` was not working if no sequence was unclassified
- Make `Matcher::resultToBuffer` buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
- Fix multiple locations where `Util::checkAllocation` could never be called as it would have crashed before
- Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to `--sub-mat`)
- `taxonomyreport` and `addtaxonomy` parameters were not adjustable in `easy-taxonomy`
- E-value parameters are now correctly parsed as doubles instead of floats #379
- Add symlinks to `splitdb` #376
- Increase maximum number of open files in `DBReader`
- Include file size and modified date of inputs in temporary file hash calculation #372
- `--cov-mode 5` was not working #371
- Database downloader deals correctly with redirects now
- `result2profile` could crash if target database contained much longer sequences than query database
- Stop symlinking header database (and other ancillary files) in `filterresult`
Developer
- Add vector of predefined substitution matrices to add additional matrices in subprojects
- Don't create false `_has_{builtin,attribute}` macros (see simd-everywhere/simde#691 (comment))
- Add `USE_SYSTEM_ZSTD` cmake flag to use system provided zstd #411
- Replace texlive with tectonic for faster/prettier userguide
- Add more instructions to `simd.h`
- Add initial fixes to get MMseqs2 working on s390x (work in progress)
- Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
- Build ARM64/PPC64LE binaries by cross-compiling
- Add missing licenses and READMEs for vendored libraries #403
- Update ALP to 1.98
- Update xxhash to v0.8.0
MMseqs2 Release 12-113e3
Breaking changes
- Remove `--add-internal-id` parameter from `result2msa`
- `filterdb --shuffle` is now randomly instead of deterministically shuffled
- Taxonomy expressions in filtertax(seq)db interpret `,` as `||` now #320
- `convertalis` `pident` output field now correctly reports percentage (0-100) sequence identity instead of fraction (0.00-1.00), use `fident` to print the fraction instead
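
Downstream scripts that filter on the `pident` column need their thresholds rescaled after this change. A minimal sketch of the relationship between the two fields, assuming only what is stated above (`pident` is a percentage, `fident` a fraction); the helper names are hypothetical and not part of MMseqs2:

```python
def fident_to_pident(fident: float) -> float:
    """Convert a fractional identity (0.00-1.00) to a percentage (0-100)."""
    if not 0.0 <= fident <= 1.0:
        raise ValueError(f"fraction out of range: {fident}")
    return fident * 100.0


def pident_to_fident(pident: float) -> float:
    """Convert a percentage identity (0-100) back to a fraction (0.00-1.00)."""
    if not 0.0 <= pident <= 100.0:
        raise ValueError(f"percentage out of range: {pident}")
    return pident / 100.0


# A threshold written for the old fractional output ...
old_threshold = 0.9
# ... has to be expressed as a percentage for the new pident column.
new_threshold = fident_to_pident(old_threshold)
```

A filter that used to keep hits with `pident >= 0.9` would now compare against `90`, or switch to the `fident` column and keep the old threshold.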
Features
- Support nucleotide clustering in `cluster` and `easy-cluster`
- Support other architectures (SSE2/ARM64/POWER8/POWER9/etc) through SIMDe
- Linclust is much faster on systems with a lot of CPU cores
- Clustering update is faster, more stable and correctly deals with deleted sequences #272
- Add easy workflow for reciprocal best hit searches `easy-rbh`
- Add SILVA, Pfam-B, dbCAN2 to `databases`
- `databases` produces taxonomy information for NR
- Replace old greedy incremental clustering with new memory efficient version
- Add `result2dnamsa` module to create MSAs of nucleotide sequences
- Continued progress on profile-profile searching (`result2pp`, `expandaln`, `expand2profile`), stay tuned!
- Add multi-parameter support to overwrite sequence type specific parameters: e.g. `--gap-open "nucl:5,aa:11"`
- Add ORF information as output options to `convertalis` (qOrfStart/qOrfEnd, dbOrfStart, dbOrfEnd)
- Speed up sorting using ips4o
- Speed up masking through new version of tantan
- Speed up multi-threaded writing of clustering results
- Speed up reading of database indices and merging target split databases
- Add memory tracking to account for index size when computing available memory (`--split-memory-limit` should be more reliable when searching/clustering billions of sequences).
- Add `--search-type 4` (translated/translated search) to `createindex`
- Add `convertalis --format-mode 3` HTML output based on MMseqs2 app (app.mmseqs.com)
- Improve memory management in `result2msa` and `result2profile` modules
- Add `msa2result` module to create an alignment result db from MSAs
- Add `filterresult` to slim down result dbs with pairwise HHblits filtering #316
- Add `--kmers-per-sequence-scale` to `linsearch` to extract a k-mer fraction instead of a fixed count
- Add a random integer to `--local-tmp` path to avoid race conditions if multiple MMseqs2 runs happen on the same machine
- Add `--max-seqs` to `ungappedprefilter`
- Add `--tax-lineage-mode 2` parameter to print numeric taxids
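
The multi-parameter syntax from the list above (`--gap-open "nucl:5,aa:11"`) can be read as a small key/value format. The following is a rough sketch of how such a value could be interpreted, not the MMseqs2 parser: the function name is hypothetical, and the rule that a plain number applies to every sequence type is an assumption.

```python
def parse_multi_param(value: str, seq_type: str) -> int:
    """Return the setting for seq_type from a value such as 'nucl:5,aa:11'.

    Assumption: a plain value without a type prefix (e.g. '11') applies to
    every sequence type.
    """
    if ":" not in value:
        return int(value)
    settings = {}
    for part in value.split(","):
        key, _, number = part.partition(":")
        settings[key.strip()] = int(number)
    if seq_type not in settings:
        raise KeyError(f"no value given for sequence type {seq_type!r}")
    return settings[seq_type]


gap_open_nucl = parse_multi_param("nucl:5,aa:11", "nucl")
gap_open_aa = parse_multi_param("nucl:5,aa:11", "aa")
```

Here a nucleotide search would use a gap-open cost of 5 and a protein search a cost of 11, both taken from the same command-line value.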
Bugs fixed
- `rbh` workflow was broken due to issues with `filterdb`
- Fix `-a` in RBH search to show alignments
- Fix PDB70 database creation in `databases`
- Fix aria2c download support
- Fix memory issues and MPI in kmermatcher
- Fix memory issues in `extractorfs` when using AVX2
- Fix `--cluster-reassign` to respect `--cov-mode`
- Set-cover supports up to 2^32 sequences (previously crashed with more than 2^31)
- Exit correctly if there is not enough disk space instead of crashing in the next module
- Fix `prefilter` order instability when searching very redundant databases
- Correctly parse keys from data files in `filterdb --filter-file`, this was causing instability in `linsearch`
- Allow overwriting string parameters with empty strings
- Fix ASAN issue in `extractorf` when using AVX2
- Microtar would try to seek backwards constantly resulting in horrible gzip read performance
- Avoid lookup writing to corrupt memory if an accession is too long
- Fix various inconsistencies and usability issues in `alignall`:
  - `--alignment-mode` inconsistent with `align` module
  - `--add-backtrace` did not do anything
- Fix restart of clusterings using reassignment `cluster --cluster-reassign`
- Fix createdb did not correctly read gz/bzip files with `--createdb-mode 1` #323
MMseqs2 Release 11-e1a1c
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps to download and set up databases. We now have chat support at chat.mmseqs.com.
Known Issues
- `rbh` crashes due to invalid sorting mode (#290)
- Homebrew's macOS version does not use multiple cores (#289)
- `prefilter` results can be unstable between different runs for extremely redundant databases (#277)
- `linclust/cluster` can crash for very small input sets (#274)
Breaking Changes
- `kmermatcher` `--skip-n-repeat-kmer` parameter was replaced with `--ignore-multi-kmer`.
  It does not discard whole sequences anymore if a k-mer occurred too often; instead it skips the specific k-mers.
  Either mode is only used in Plass and not in Linclust
- `--lca-ranks` from `(easy-)taxonomy` and `lca` has to be delimited with semicolons (;) instead of colons (:)
- `--dont-shuffle` flag was renamed to `--shuffle true/false`
Features
- new `databases` workflow to list and download common databases. Supported databases:

| Name | Type | Taxonomy | Url |
|---|---|---|---|
| UniRef100 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
| UniRef90 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
| UniRef50 | Aminoacid | yes | https://www.uniprot.org/help/uniref |
| UniProtKB | Aminoacid | yes | https://www.uniprot.org/help/uniprotkb |
| UniProtKB/TrEMBL | Aminoacid | yes | https://www.uniprot.org/help/uniprotkb |
| UniProtKB/Swiss-Prot | Aminoacid | yes | https://uniprot.org |
| NR | Aminoacid | - | https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA |
| NT | Nucleotide | - | https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA |
| PDB | Aminoacid | - | https://www.rcsb.org |
| PDB70 | Profile | - | https://github.com/soedinglab/hh-suite |
| Pfam-A.full | Profile | - | https://pfam.xfam.org |
| Pfam-A.seed | Profile | - | https://pfam.xfam.org |
| eggNOG | Profile | - | http://eggnog5.embl.de |
| Resfinder | Nucleotide | - | https://cge.cbs.dtu.dk/services/ResFinder |
| Kalamari | Nucleotide | yes | https://github.com/lskatz/Kalamari |

- `(easy-)search --slice-search` is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by `--disk-space-limit`
- `createdb` and the various `easy-` workflows learned to read query input from `STDIN`
- `taxonomyreport` learned to display the summarized taxonomy result with Krona
- new `filtertaxseqdb` module for filtering sequence DBs with taxonomy information according to provided taxa
- `--taxon-list` parameter understands expressions. E.g. get all bacterial and human sequences `--taxon-list "2||9606"`
- `easy-search` and `convertalis` can now output taxonomic information using `--format-output`
  - `taxid`: Taxonomic identifier
  - `taxname`: Taxon Name
  - `taxlineage`: Taxonomic lineage
- speed up in `(easy-)cluster/linclust` by improving k-mer extraction
- MMseqs2 consistently creates .source and .lookup files to record which input file a sequence came from.
  E.g.: after `mmseqs createdb input1.fa input2.fa seqDB`, each sequence in seqDB can tell if it came from `input1.fa` or `input2.fa`
- `createdb` learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
- `align` and `rescorediagonal` learned to align circular sequences
- `align` exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
- `reverseseq` learned to reverse profiles
- `filterdb` can filter rows with value within given percentage of first row
- new `aggregatetax` module to assign a taxonomic label to contigs according to the fragments matched on the contig
- Adjusting `--max-seq-len` is not required anymore, MMseqs2 automatically increases the length now.
- MMseqs2 on Cygwin/Windows uses `nedmalloc` as its memory allocator now and does not massively slow down due to lock contention
- new `tar2db` module to efficiently transform content of `tar` archives to MMseqs2 databases
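
The `--taxon-list "2||9606"` example from the list above selects sequences that fall under Bacteria (taxid 2) or human (taxid 9606). As an illustration of that semantics only, here is a small sketch that handles the `||` operator; it is not the MMseqs2 expression parser, the function name is hypothetical, and the lineages are abbreviated toy data.

```python
from typing import Dict, Set


def matches_taxon_list(expression: str, lineage: Set[int]) -> bool:
    """True if any taxid in an 'a||b||c' expression occurs in the lineage.

    The lineage is the set of taxids from the root down to the sequence's
    own taxon, so a match on an ancestor (e.g. 2 = Bacteria) also counts.
    """
    wanted = {int(term) for term in expression.split("||")}
    return not wanted.isdisjoint(lineage)


# Abbreviated lineages (assumption: real NCBI lineages contain more ranks).
LINEAGES: Dict[str, Set[int]] = {
    "E. coli": {1, 2, 562},      # root, Bacteria, Escherichia coli
    "human": {1, 2759, 9606},    # root, Eukaryota, Homo sapiens
    "yeast": {1, 2759, 4932},    # root, Eukaryota, Saccharomyces cerevisiae
}

selected = sorted(
    name for name, lineage in LINEAGES.items()
    if matches_taxon_list("2||9606", lineage)
)
```

With this toy data the bacterial and human entries are kept and the yeast entry is dropped, because neither 2 nor 9606 appears in its lineage.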
Bug fixes
- `createindex` would create corrupted indices for profile target databases
- `rbh` workflow would create its result DB at an unexpected (wrong) location
- `(easy)-taxonomy --lca-mode 3` (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
- `lca` (and `(easy)-taxonomy`) add empty columns for unclassified sequences to be valid TSVs
- `kmermatcher` uses xxhash for hashing now (faster)
- `kmermatcher` avoids a crash if the machine does not have enough memory to process the data at once (affects linclust/cluster)
- `kmermatcher` correctly deals with sequences longer than MAX_SHRT now
- `kmermatcher` fixed various edge cases (e.g. alignment of 1-char sequences)
- `kmermatcher` hash-shift would be ignored
- `offsetalignment` could produce wrong results in the minus-strand
- `clust` now correctly and consistently handles alignment DB input
- `clusthash` better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
- `(easy-)cluster` `--single-step-clustering` could cluster unrelated sequences due to hash collisions
- `prefilter --diag-score 0` respects `--min-ungapped-score`
- `createseqfiledb` could print empty sequence lines
- `taxonomyreport` could crash if no sequence was unclassified
- `result2flat` could crash with long sequence input
- `result2msa, result2profile, msa2profile` backport filtering fix from HHblits
- `align` could produce bad alignments if all sequence lengths in query DB were a lot shorter than in target DB
- `splitsequence` fix issues with splitsequence if combined with compressed
- `result2profile` fix Filter2 bug of HH-suite in MMseqs2
- `apply` would crash due to reading wrong entry lengths
- `filterdb --filter-expression` was not thread safe and could corrupt results
- `filterdb` `--extract-lines` and `--trim-to-one-column` are compatible with each other
Developers
- Internal representation of sequences changed from 4-byte per character to 1-byte per character
- Compilation under AppleClang + libomp works now (see `util/build_osx.sh`)
- Tools inheriting from MMseqs2 can now add their own citations
- MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed `symlinkat` call; relevant for bioconda)
MMseqs2 Release 10-6d92c
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.
Known Issues
- High sensitivity searches (higher than -s 6) with precomputed indices should fail. Pass `--db-load-mode 3` as a workaround to the MMseqs2 call.
Breaking Changes
- Default taxonomy mode is assigning the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode can be used with `--lca-mode 3` or the non-approximated 2bLCA with `--lca-mode 2`
- MMseqs2 will refuse to compile on compilers without OpenMP support (Use `-DREQUIRE_OPENMP=0` to force a single-threaded no OpenMP build)
- The confusingly named (and probably non-functional) `--global-alignment` parameter is gone
- File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):
| SIMD | Linux | macOS | Windows |
|---|---|---|---|
| SSE4.1 | mmseqs-linux-sse41.tar.gz | mmseqs-osx-sse41.tar.gz | mmseqs-win64.zip |
| AVX2 | mmseqs-linux-avx2.tar.gz | mmseqs-osx-avx2.tar.gz | - |
Known Issues
- MMseqs2 on Windows seems to not scale well on multiple threads
- MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)
Features
- `createindex` can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
- `easy-search`, `easy-taxonomy`, `easy-linclust` and `easy-cluster` workflows can take any number of query FASTA or FASTQ files
- MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
- `kmermatcher` reports the diagonal with the most k-mer matches
- `kmermatcher` scales the number of k-mers with sequence length (`--kmer-per-seq-scale`)
- `rescorediagonal` got two new rescore modes, one for global alignment scoring and one for scoring a quasi global alignment fulfilling a local window criterion
- Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
- Parameters taking byte values support syntax with a SI suffix (e.g., `--split-memory-limit 64G`)
- Nucleotide substitution matrices should be user definable
- Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
- `cluster` workflow learned a reassignment mode `--cluster-reassign`. This mode corrects errors that occurred because of cascaded clustering
- `extractorfs` can directly translate a nucleotide ORF to an amino acid sequence
- `result2stats` can write TSV files
- `createsubdb` supports softlinks instead of always hard copying the whole file to disk
- reduced harddisk space usage for all cascaded clusterings
- `easy-taxonomy` reports the top hit alignment as a separate output file with the suffix `tophit_aln`
- `createindex` checks if an index needs to be recomputed were improved
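
Byte-valued parameters such as `--split-memory-limit 64G` combine a number with a unit suffix. A minimal sketch of how such a value could be turned into a byte count follows; the helper is hypothetical, and both the accepted suffix set (B/K/M/G/T) and the use of 1024-based multipliers are assumptions rather than documented MMseqs2 behavior.

```python
# Hypothetical helper, not MMseqs2 code. Assumption: suffixes are multiples
# of 1024; a tool using decimal SI units would use powers of 1000 instead.
_SUFFIXES = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_byte_size(text: str) -> int:
    """Parse values such as '64G', '800B' or '4096' into a number of bytes."""
    text = text.strip().upper()
    if not text:
        raise ValueError("empty byte value")
    if text[-1] in _SUFFIXES:
        number, factor = text[:-1], _SUFFIXES[text[-1]]
    else:
        # Assumption: a bare number is already given in bytes.
        number, factor = text, 1
    return int(number) * factor


limit = parse_byte_size("64G")
```

Keeping the multipliers in one table makes it easy to switch the sketch to decimal units if that matches the tool's behavior better.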
Bug fixes
- MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
- `proteinaln2nucl` could return wrong coordinates
- `apply` would deadlock when running with multiple threads
- MPI searches are way more reliable, there were various issues around merging the separate results. MPI logic of split and merge is also integrated into the regression tests suite
- `prefilter` splits nucleotide searches if not enough memory is available
- `kmermatcher` could corrupt memory
- `rescorediagonal` could produce wrong sequence identities when aligning mixed-case sequences
- macOS builds were not actually static (still dynamically link libsystem however)
- `lca` module could corrupt memory and crash
- `createdb` does not crash on systems with only 4GB of RAM anymore
- AVX2 and SSE4.1 builds could produce slightly different results
- `summarizeresults` does not crash on empty alignments results anymore
- fix wrong tophit_report in `easy-taxonomy`
- Precompiled Windows builds were broken
- Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer
Developers
- Tools using MMseqs2 as a framework do not need to export MMseqs2 modules again anymore
- MMseqs2 uses Azure Pipelines for all platforms to run our regression tests suite and provide precompiled binaries
- MMseqs2 runs under ASan without any issues. We fixed various small memory leaks
- The regression suite is directly linked through a submodule. It can be used by running:

      git submodule update --init
      ./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR
MMseqs2 Release 9-d36de
At a glance: Improved taxonomy, add colors to user output, improve computation progress bar, small speed ups and many bug fixes
Features
- Add support for Kraken style taxonomy reports. Thanks to Florian Breitwieser
- New easy-taxonomy workflow
- New progress bar to reduce output
- Colored errors and warnings
Bugs
- Fix alignment problem in SSW library mengyao/Complete-Striped-Smith-Waterman-Library#61
- Fix iterative profile search
- Fix protein nucleotide index issues
- Fix cluster update workflow
- Fix critical multi threading bug in taxonomy workflow