Overview

Here we test the scaling of MMseqs2 version 15 on two CPU types:

Rome (dual EPYC 7552 with 48 cores each and 512GB DDR4 RAM combined) and

Turin (single EPYC 9655 with 96 cores and 768GB DDR5 RAM).

Note that GPU support in MMseqs2 version 17 is very slow on our test samples and should not be used on Maestro.

Methodology

We use the metagenome SRR006547 (40k samples), available at NIH, to establish the baseline and SRR006546 (327k samples) to check the scaling.

Note that the execution times and the memory requirements depend on the genome size.

First, we copy over the FASTA file and create MMseqs2 databases for our metagenome and for UniProt, against which we will search.

cmd line (text)

mkdir -p TURIN
cd TURIN
cp /pasteur/helix/projects/hpc/BENCHMARKS/SRR006547.fasta .
module load MMseqs2/15-6f452

mmseqs createdb /local/databases/rel/uniprot/current/fasta/3.6/uniprot.fa target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
mmseqs createdb SRR006547.fasta  source --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
THREADS=XXXX; /usr/bin/time mmseqs search --threads ${THREADS?} source target result tmpdirectory
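
The thread counts in the results can be swept with a small loop around the search command above. The following is a minimal sketch, not the script used for the benchmark: the per-run output names (result_tN, tmp_tN, time_tN.log) and the DRY_RUN switch are hypothetical, and it assumes /usr/bin/time is GNU time, whose -v report includes the "Maximum resident set size" line used to read off peak RAM.

```shell
# Sketch of a thread-count sweep; assumes the `source` and `target`
# databases were created as shown above.

# Run (or, in dry-run mode, just print) one timed search for a given thread count.
run_search() {
    threads=$1
    # Separate result/tmp/log names per run are hypothetical, chosen so runs do not overlap.
    set -- /usr/bin/time -v -o "time_t${threads}.log" \
        mmseqs search --threads "$threads" source target \
        "result_t${threads}" "tmp_t${threads}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # default: only show the command
    else
        "$@"           # DRY_RUN=0: execute it
    fi
}

# Loop over all requested thread counts.
run_sweep() {
    for t in "$@"; do
        run_search "$t"
    done
}

run_sweep 96 48 32 24 16
```

By default the sketch only prints the commands; setting DRY_RUN=0 in the environment runs them one after another.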

Results

Threads   Execution time (Turin)   Execution time (Rome)   Used RAM (Turin)   Used RAM (Rome)
96        6m30                     24m                     250g               170g
48        9m40                     25m                     324g               178g
32        13m30                    29m                     318g               237g
24        19m                      38m                     319g               232g
24x2      22m                      42m                     316g x2            216g x2
24x4      24m                      24m... see note         100g x4            100g x4
16        27m                      60m                     320g               242g
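
To make the scaling easier to read, the single-run Turin times can be turned into speedup and parallel efficiency relative to the 16-thread run. This is a sketch derived only from the table values; the choice of 16 threads as the baseline and the function name scaling_table are ours.

```shell
# Turin wall times from the table, converted to minutes
# (6m30 = 6.5, 9m40 = 9.67, 13m30 = 13.5).
turin_times="96 6.5
48 9.67
32 13.5
24 19
16 27"

# Read "threads minutes" pairs on stdin and print speedup and parallel
# efficiency relative to the 16-thread run.
scaling_table() {
    awk -v base_threads=16 '
        { t[$1] = $2; order[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) {
                n = order[i]
                speedup = t[base_threads] / t[n]
                eff = 100 * speedup / (n / base_threads)
                printf "%d threads: speedup %.2f, efficiency %.0f%%\n", n, speedup, eff
            }
        }'
}

echo "$turin_times" | scaling_table
```

On these numbers, going from 16 to 96 threads on Turin gives a speedup of about 4.15 on 6 times the cores, i.e. roughly 69% efficiency, while 24 threads stay near 95%.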

Cold start (cache invalidated with drop_caches 3) can be noticeable at high core counts: 12 min instead of 6 min. For lower core counts, such as 24 cores, the effect of a cold start is negligible.
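
A cold start can be reproduced by dropping the Linux page cache before a run. The sketch below assumes that the cache invalidation above refers to writing 3 to /proc/sys/vm/drop_caches, which frees the page cache together with dentries and inodes; it requires root, so on a shared cluster it may only be available to administrators. The function name drop_page_cache is hypothetical.

```shell
# Drop the page cache, dentries and inodes (Linux only, root required).
drop_page_cache() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "drop_page_cache: root privileges required" >&2
        return 1
    fi
    sync                                 # flush dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches    # 3 = page cache + dentries and inodes
}

# Usage (as root), right before a timed run:
#   drop_page_cache && /usr/bin/time mmseqs search --threads 96 source target result tmpdirectory
```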

There is no difference in execution times between using appa and helix filers for IO.

Reducing the available memory (--split-memory-limit XXXG) to about half of the unrestricted usage has no performance impact and allows running 4 copies on 24 cores each in about the same wall time.

Reducing it to about a third has a small impact, on the order of 20% or less.
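
Running several memory-capped copies side by side can be sketched as background jobs followed by a wait. The sample database names (sample1 to sample4), the per-sample result and tmp names, and the DRY_RUN switch are hypothetical; the 110G cap is the value from the recommended command in the conclusions. Giving each copy its own tmp directory is our assumption and was not part of the benchmark.

```shell
# Run (or, in dry-run mode, print) one 24-thread search with a memory cap.
search_one() {
    sample=$1
    mem_limit=$2
    set -- mmseqs search --threads 24 --split-memory-limit "$mem_limit" \
        "$sample" target "result_${sample}" "tmp_${sample}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # default: only show the command
    else
        "$@"           # DRY_RUN=0: execute it
    fi
}

# Start one search per sample in the background, then wait for all of them.
# Exit statuses of the individual jobs are not checked in this sketch.
run_parallel() {
    mem_limit=$1
    shift
    for s in "$@"; do
        search_one "$s" "$mem_limit" &
    done
    wait
}

run_parallel 110G sample1 sample2 sample3 sample4
```

With four samples this occupies 4 x 24 = 96 threads, which matches the 24x4 row of the results on the Turin node.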

For the SRR006546 genome, the corresponding times on 24 cores are 78m for Turin and 155m for Rome, with the same memory requirement of 100G.

Conclusions

We recommend using 24 cores and 100G of memory per mmseqs run. The full command line is:

Code Block (bash)

mmseqs search --threads 24 --split-memory-limit 110G source target result tmpdirectory

Do not forget to clear tmpdirectory once you are done with the sample, and to clean up target* once you are finished with all samples.
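
The two cleanup steps can be wrapped in small helpers. This is a sketch: the function names are hypothetical, and the target* pattern is taken literally from the text, so it removes every file whose name starts with "target" in the given directory.

```shell
# Remove the scratch directory of one finished sample.
cleanup_sample() {
    rm -rf -- "${1:?tmp directory required}"
}

# Remove the target database files (target, target.index, target_h, ...)
# from a directory, defaulting to the current one.
cleanup_target() {
    dir=${1:-.}
    rm -f -- "$dir"/target*
}

# Usage:
#   cleanup_sample tmpdirectory    # after each sample
#   cleanup_target                 # once, after the last sample
```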