Data Resources from the MMseqs2 Family

Our data resources in one place.

6M Marine Eukaryotic Proteins and 89M Profiles

Tara Oceans metagenomic datasets assembled with MEGAHIT. Proteins predicted with MetaEuk.

Levy Karin et al, Microbiome 8, 48, 2020

Metaclust - Clustering of 2B Metagenomic Predicted Proteins

Proteins predicted with Prodigal from >1800 metagenomes and >400 metatranscriptomes. Clustered to 50% and 95% sequence identity with Linclust.

Steinegger, Söding, Nat Commun 9, 2542, 2018
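
The exact settings used to build Metaclust are not listed here, so the following is only a minimal sketch of how one might cluster a protein set at comparable identity thresholds, assuming MMseqs2 is installed. It builds `mmseqs easy-linclust` command lines with the `--min-seq-id` option. The file names and the helper functions are hypothetical, and all other parameters (for example coverage) are left at their defaults, which may differ from the published pipeline.

```python
import shutil
import subprocess
from typing import List, Optional


def linclust_command(fasta: str, out_prefix: str, tmp_dir: str,
                     min_seq_id: float) -> List[str]:
    """Build an `mmseqs easy-linclust` command line for one identity threshold."""
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be a fraction between 0 and 1")
    return [
        "mmseqs", "easy-linclust",
        fasta, out_prefix, tmp_dir,
        "--min-seq-id", f"{min_seq_id:g}",
    ]


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run the command if its binary is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True).returncode


# Hypothetical input and output names; 0.5 and 0.95 mirror the
# 50% and 95% identity levels named in the entry above.
commands = [
    linclust_command("proteins.fasta", f"clu{round(t * 100)}", "tmp", t)
    for t in (0.5, 0.95)
]
for cmd in commands:
    print(" ".join(cmd))
```

Passing one of the generated lists to `run_if_available` executes the clustering only when an `mmseqs` binary is found, which keeps the sketch safe to run on machines without MMseqs2.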

Uniclust - High-Quality Clustering of UniProt

Clustered to 30%, 50% and 90% sequence identity. Clustered with MMseqs2 and Linclust.

Mirdita et al, Nucleic Acids Res 45, D170–6, 2017

Soil Reference Catalogue and Marine Eukaryotic Reference Catalogue

SRC: 2B proteins from soil metagenomics. MERC: 300M proteins from marine metatranscriptomics. Assembled with Plass.

Steinegger et al, Nat Methods 16, 603–6, 2019

Big Fantastic Database - Clustering of SRC and MERC

Over 300M HMMs containing over 2.6B proteins for HHblits. Clustered with MMseqs2 and Linclust.

Steinegger et al, Nat Methods 16, 603–6, 2019

ColabFold databases

Environmental protein databases built for ColabFold - an accessible AlphaFold2 implementation.

Mirdita et al, Nat Methods 19, 679–82, 2022

Foldcomp databases

Compressed protein structure databases for Foldcomp: a library and format for compressing and indexing large protein structure sets.

Kim, Mirdita, Steinegger, bioRxiv, 2022