Overview#

Here we test how MMseqs2 version 15 scales on two CPU types:

Rome (dual EPYC 7552 with 48 cores each and 512GB DDR4 RAM combined) and

Turin (single EPYC 9655 with 96 cores and 768GB DDR5 RAM).

Note that GPU support in MMseqs2 version 17 is very slow on our test samples and should not be used on Maestro.

Methodology#

We use the metagenome SRR006547 (40k samples), available at NIH, to establish the baseline and SRR006546 (327k samples) to check the scaling.

Note that the execution times and the memory requirements depend on the genome size.

First, we copy over the FASTA file and create MMseqs2 databases (mmseqs createdb) for our metagenome and for UniProt, against which we will search.

cmd line (text)

mkdir -p TURIN
cd TURIN
cp /pasteur/helix/projects/hpc/BENCHMARKS/SRR006547.fasta .
module load MMseqs2/15-6f452

mmseqs createdb /local/databases/rel/uniprot/current/fasta/3.6/uniprot.fa target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
mmseqs createdb SRR006547.fasta  source --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
THREADS=XXXX; /usr/bin/time mmseqs search --threads ${THREADS?} source target result tmpdirectory
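
The thread counts in the results can be swept with a small loop. The following is a minimal sketch, not the script used for the measurements: the function name sweep_threads, the MMSEQS override variable, and the result_tN / tmp_tN naming are assumptions made for illustration. It records wall time only; /usr/bin/time as used above is still needed to see memory usage.

```shell
# Sketch: run the same search once per thread count and print the wall time.
# MMSEQS can be overridden (for example MMSEQS=true) to dry-run the loop.
# result_tN and tmp_tN are hypothetical names, one pair per thread count.
sweep_threads() {
    for threads in "$@"; do
        start=$(date +%s)
        # mmseqs log output goes to stderr so stdout holds only the summary
        "${MMSEQS:-mmseqs}" search --threads "$threads" source target \
            "result_t${threads}" "tmp_t${threads}" >&2
        end=$(date +%s)
        echo "threads=${threads} seconds=$(( $end - $start ))"
    done
}
```

For example, sweep_threads 16 24 32 48 96 prints one summary line per thread count. Runs after the first one benefit from a warm file cache, so the order of the thread counts can influence the timings.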

Results#

threads	execution time Turin	execution time Rome	Used RAM Turin	Used RAM Rome
96	6m30	24m	250g	170g
48	9m40	25m	324g	178g
32	13m30	29m	318g	237g
24	19m	38m	319g	232g
24x2	22m	42m	316gx2	216gx2
24x4	24m	62m	100gx4	100gx4
16	27m	60m	320g	242g
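
To read the table in terms of scaling, the times can be converted to speedup and parallel efficiency relative to the 16-thread run. The helpers below are a sketch under our own naming (to_seconds and scaling are not part of MMseqs2); efficiency is defined here as speedup divided by the ratio of thread counts.

```shell
# Convert a time written as in the table ("6m30" or "24m") into seconds.
to_seconds() {
    minutes=${1%%m*}        # text before the first "m"
    seconds=${1#*m}         # text after the first "m", may be empty
    echo $(( $minutes * 60 + ${seconds:-0} ))
}

# usage: scaling BASE_THREADS BASE_TIME THREADS TIME
# Prints the speedup over the baseline and the parallel efficiency.
scaling() {
    base_s=$(to_seconds "$2")
    run_s=$(to_seconds "$4")
    awk -v bt="$1" -v bs="$base_s" -v t="$3" -v s="$run_s" 'BEGIN {
        speedup = bs / s
        printf "speedup=%.2f efficiency=%.2f\n", speedup, speedup / (t / bt)
    }'
}
```

With the Turin numbers, scaling 16 27m 96 6m30 prints speedup=4.15 efficiency=0.69; with the Rome numbers, scaling 16 60m 96 24m prints speedup=2.50 efficiency=0.42. Going from 16 to 96 threads therefore falls well short of the ideal 6x on both machines.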

A cold start (cache invalidation with drop_caches 3) can be noticeable at high core counts: 12 min instead of 6. At lower core counts, such as 24 cores, the effect of a cold start is negligible.

There is no difference in execution times between using appa and helix filers for IO.

Limiting the available memory (--split-memory-limit XXXG) to about half of the unrestricted usage has no performance implications and allows running 4 copies with 24 cores each (the 24x4 row in the table) in about the same wall time.

Limiting it to about a third of the unrestricted usage has a small impact, of the order of 20% or less.
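
Running several memory-limited copies side by side can be sketched as follows. This is an illustration rather than the benchmark script: the function name run_copies, the MMSEQS override variable, and the per-sample database names (source_1 and so on, each with its own result and tmp names) are assumptions.

```shell
# Sketch: start one 24-thread search per sample database in the background,
# then wait until all of them have finished.
# usage: run_copies MEMORY_LIMIT SAMPLE_DB...
# MMSEQS can be overridden (for example MMSEQS=echo) to print the commands.
run_copies() {
    mem_limit=$1
    shift
    for sample in "$@"; do
        # each copy gets its own result and tmp names (hypothetical naming)
        "${MMSEQS:-mmseqs}" search --threads 24 --split-memory-limit "$mem_limit" \
            "$sample" target "${sample}_result" "${sample}_tmp" &
    done
    # a bare wait blocks until every background job is done;
    # it does not report the exit status of the individual searches
    wait
}
```

For example, run_copies 110G source_1 source_2 source_3 source_4 starts four searches at once against the shared target database. Because the exit codes are not collected, the result files should be checked afterwards.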

For the SRR006546 metagenome, the corresponding times on 24 cores are 78m for Turin and 155m for Rome, with the same memory requirement of 100G.

Conclusions#

We recommend using 24 cores and 100g per mmseqs run. Here is the full command line to start:

Code Block (bash)

mmseqs search --threads 24 --split-memory-limit 110G source target result tmpdirectory

Do not forget to clear tmpdirectory once you are done with the sample, and to clean up target* once you are finished with all samples.
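
The cleanup can be sketched as two small helpers. The function names are ours, and the sketch assumes that it is run from the directory holding the databases and that nothing else in that directory has a name starting with target.

```shell
# Remove the temporary directory of one finished sample.
# usage: cleanup_sample tmpdirectory
cleanup_sample() {
    rm -rf -- "${1:?tmp directory required}"
}

# Remove the target database files (target*) from the current directory.
# Call this only after all samples have been processed.
cleanup_target() {
    rm -f -- ./target*
}
```

Call cleanup_sample tmpdirectory after each sample and cleanup_target once at the very end, since every search needs the target database.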