Overview#
Here we test the scaling of MMseqs2 version 15 on
Rome (dual EPYC 7552, 2 x 48 cores, 512GB DDR4 RAM in total) and
Turin (single EPYC 9655, 96 cores, 768GB DDR5 RAM) CPUs.
Note that GPU support in MMseqs2 version 17 is very slow on our test samples and should not be used on Maestro.
Methodology#
We use the metagenome SRR006547 (40k samples), available at NIH, to establish the baseline, and the larger SRR006546 (327k samples) to check the scaling.
Note that the execution times and the memory requirements depend on the genome size.
First, we copy over the FASTA file and create indices for our metagenome and for UniProt, against which we will search.
```bash
mkdir -p TURIN
cd TURIN
cp /pasteur/helix/projects/hpc/BENCHMARKS/SRR006547.fasta .
module load MMseqs2/15-6f452
mmseqs createdb /local/databases/rel/uniprot/current/fasta/3.6/uniprot.fa target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
mmseqs createdb SRR006547.fasta source --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
THREADS=XXXX; /usr/bin/time mmseqs search --threads ${THREADS?} source target result tmpdirectory
```
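The scan over thread counts used for the results below can be scripted with a small loop. This is a sketch, not the exact procedure we ran: the per-thread-count result and tmp directory names are our choice, and the leading `echo` makes it a dry run.

```bash
# Dry-run sketch: print the search command for each thread count in the table.
# Remove the "echo" to actually run (the source/target DBs must already exist).
for T in 96 48 32 24 16; do
  CMD="/usr/bin/time mmseqs search --threads ${T} source target result_${T} tmp_${T}"
  echo "${CMD}"
done
```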
Results#
| Threads | Execution time, Turin | Execution time, Rome | Used RAM, Turin | Used RAM, Rome |
|---|---|---|---|---|
| 96 | 6m30 | 24m | 250g | 170g |
| 48 | 9m40 | 25m | 324g | 178g |
| 32 | 13m30 | 29m | 318g | 237g |
| 24 | 19m | 38m | 319g | 232g |
| 24x2 | 22m | 42m | 316gx2 | 216gx2 |
| 24x4 | 24m | 62m | 100gx4 | 100gx4 |
| 16 | 27m | 60m | 320g | 242g |
A cold start (cache invalidation with drop_caches=3) can be noticeable at high core counts: 12 min instead of 6. At lower core counts, e.g. 24 cores, the effect of a cold start is negligible.
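For reference, a cold-start measurement can be reproduced by flushing the page cache before the run. Writing to /proc/sys/vm/drop_caches requires root, so this sketch only prints the commands by default (set DRY_RUN=0 to execute them for real):

```bash
# Sketch of a cold-start run: flush the page cache, then time the search.
# Dropping caches needs root; with DRY_RUN=1 (default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "${DRY_RUN}" = 1 ]; then echo "$@"; else "$@"; fi; }
run sync
run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
run /usr/bin/time mmseqs search --threads 96 source target result tmpdirectory
```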
There is no difference in execution times between using the appa and the helix filers for I/O.
Reducing available memory (--split-memory-limit XXXG) to about half of the unrestricted usage has no performance impact and allows running four copies, on 24 cores each, in about the same wall time.
Reducing it to about a third has a small impact, of order 20% or less.
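Running the four concurrent copies can be sketched as below. The per-sample DB names (`sample_1` .. `sample_4`) are hypothetical placeholders for four query databases created with `mmseqs createdb`; the leading `echo` makes this a dry run:

```bash
# Sketch: four concurrent 24-thread searches, each capped at ~100G of RAM.
# sample_N, result_N, tmp_N are placeholder names; remove "echo" to run.
for I in 1 2 3 4; do
  echo mmseqs search --threads 24 --split-memory-limit 100G \
      "sample_${I}" target "result_${I}" "tmp_${I}" &
done
wait
```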
For the SRR006546 metagenome the corresponding times on 24 cores are 78m on Turin and 155m on Rome, with the same memory requirement of 100G.
Conclusions#
We recommend using 24 cores and about 100G of RAM per mmseqs run. Here is the full command line to start with:
```bash
mmseqs search --threads 24 --split-memory-limit 110G source target result tmpdirectory
```
Do not forget to clear tmpdirectory once you are done with the sample, and to clean up target* once you are finished with all samples.
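The cleanup can look like this; the file names follow from the `createdb` and `search` commands above (`createdb` writes the database plus auxiliary `.index`/`.dbtype` files and a `_h` header database):

```bash
# Per-sample cleanup: remove the temporary directory used by mmseqs search.
rm -rf tmpdirectory
# After the last sample: remove the target DB and its auxiliary files.
rm -f target target.* target_h target_h.*
```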