Linear Algebra Library Benchmarks on Maestro

Intro#

I wanted to get an idea of the performance that can be expected from several linear algebra libraries on the maestro cluster.

These benchmarks are single-process. They cover common operations: accessing a column or a row of a matrix, transposing a matrix, multiplying two matrices, and inverting a matrix.

Source code for the benches can be found here: https://gitlab.pasteur.fr/vlegrand/linear_algebra_lib_comparison

I tested the 4 libraries listed below. For each run, only one benchmark and nothing else is running on the compute node.

The compiler is always the same (gcc 10, as available on maestro), with the flag -O3.
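As an illustration of how a single-process benchmark of this kind can be timed, here is a minimal sketch using only the C++ standard library. It is not taken from the benchmark repository: the helper name `time_in_whole_seconds` and the example workload `sum_column` are hypothetical, and truncating to whole seconds is an assumption based on the fact that the result tables only show whole seconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `op` once and returns the elapsed wall-clock
// time truncated to whole seconds (duration_cast rounds toward zero).
template <typename Op>
long long time_in_whole_seconds(Op&& op) {
    const auto start = std::chrono::steady_clock::now();
    op();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::seconds>(stop - start).count();
}

// Illustrative workload only: sums column j of a row-major rows x cols matrix.
double sum_column(const std::vector<double>& m, std::size_t rows,
                  std::size_t cols, std::size_t j) {
    assert(m.size() == rows * cols && j < cols);
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i) {
        s += m[i * cols + j];
    }
    return s;
}
```

With such a timer, any operation that completes in under one second is reported as 0s, which is one way to read the many 0s entries in the tables below.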

Anyone interested in the subject is most welcome to make suggestions regarding new operations or new libraries to test.

1- armadillo 10.8.2#

Code Block (bash)

module load gcc/10.1.0
module load CBLAS/3.8.0
module load atlas/3.10.2
module load armadillo/10.8.2
g++ -O3 yourmain.cpp -DARMA_DONT_USE_WRAPPER -larmadillo -lcblas -latlas

2- Eigen 3.4#

Code Block (bash)

module load gcc/10.1.0
module load eigen/3.4.0
g++ -O3 yourmain.cpp

3- blas/lapack#

I chose to use the versions available in gensoft.

Code Block (bash)

module load gcc/10.1.0
module load BLAS/3.8.0
module load lapack/3.10.0

4- Blaze:#

I copied the sources of Blaze into the directory containing my code.

Blaze must be linked with BLAS and LAPACK. I link with the BLAS and LAPACK versions that are available on the cluster.

5- Results#

test1 : Accessing the columns of a matrix (float implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2010)	Blas + Lapack (maestro-2008)	Blaze (maestro-2007)
500000*50	0s	0s	0s	0s
1000000*100	0s	0s	0s	0s

Note: For blas and lapack, matrices are plain C arrays. I wrote one implementation using a 2D array and another one using a 1D array.

There was no difference in access time.
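To make the two plain C array layouts concrete, here is a minimal sketch of column access in each of them. It assumes row-major storage for the 1D version and an array of row pointers for the 2D version; the benchmark code may organize its arrays differently, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// 1D layout (assumed row-major): element (i, j) is stored at index i * cols + j.
float sum_column_1d(const float* m, std::size_t rows, std::size_t cols,
                    std::size_t j) {
    assert(j < cols);
    float s = 0.0f;
    for (std::size_t i = 0; i < rows; ++i) {
        s += m[i * cols + j];
    }
    return s;
}

// 2D layout (assumed array of row pointers): element (i, j) is m[i][j].
float sum_column_2d(const float* const* m, std::size_t rows, std::size_t cols,
                    std::size_t j) {
    assert(j < cols);
    float s = 0.0f;
    for (std::size_t i = 0; i < rows; ++i) {
        s += m[i][j];
    }
    return s;
}
```

In both versions, walking down a column jumps from one row to the next at every step, so the memory access pattern is similar; this is consistent with the observation that access times did not differ.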

test1 : Accessing the columns of a matrix (double implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2004)	Blas + Lapack (maestro-2008)	Blaze (maestro-2007)
500000*50	0s	0s	0s	0s
1000000*100	0s	0s	0s	0s

test2a : inverting a matrix (float implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2010)	Blas + Lapack	Blaze
50*50	0s	0s	0s	0s
500*500	0s	0s	0s	0s
1000*1000	0s	0s	0s	0s
2500*2500	11s	2s	11s	3s
5000*5000	87s	18s	87s	27s
10000*10000	691s	140s	694s	200s
25000*25000	10654s	2171s	10734s	3354s
50000*50000	>18h	4h48min	segfault	exception std::bad_alloc
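The timings above grow roughly as the cube of the matrix size, as expected for dense inversion with the standard algorithms: each doubling of the size multiplies the time by about 8 (for Armadillo, 11s to 87s to 691s). The sketch below uses this to extrapolate a measurement to a larger size. The helper name `extrapolate_cubic` is hypothetical and the pure n^3 model is an assumption that ignores cache and memory effects.

```cpp
#include <cassert>

// Hypothetical helper: extrapolates a measured time from size n_measured to
// size n_target, assuming the cost grows as n^3.
double extrapolate_cubic(double seconds_measured, double n_measured,
                         double n_target) {
    assert(n_measured > 0.0 && seconds_measured >= 0.0);
    const double ratio = n_target / n_measured;
    return seconds_measured * ratio * ratio * ratio;
}
```

Applied to the 25000*25000 row, this gives 2171s * 8 = 17368s (about 4h49min) for Eigen at 50000*50000, close to the measured 4h48min, and 10654s * 8 = 85232s (about 23.7h) for Armadillo, which is compatible with the ">18h" entry.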

test2a : inverting a matrix (double implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2004)	Blas + Lapack	Blaze
50*50	0s	0s	0s	0s
500*500	0s	0s	0s	0s
1000*1000	1s	0s	1s	0s
2500*2500	14s	5s	14s	6s
5000*5000	117s	36s	110s	50s
10000*10000	922s	288s	931s	398s
25000*25000	14252s	4522s	14888s	5933s
50000*50000	>21h	36606s	NA	exception std::bad_alloc
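For the largest size, it helps to look at the raw numbers involved. The sketch below computes the storage needed for one dense square matrix and checks whether the element count still fits in a signed 32-bit integer. The function names are hypothetical, and whether memory limits or 32-bit indexing actually explain the std::bad_alloc and segfault entries above has not been checked; this is only a possible reading.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes needed to store one dense n x n matrix whose elements take
// elem_size bytes each.
std::uint64_t dense_matrix_bytes(std::uint64_t n, std::uint64_t elem_size) {
    // Guard against overflow of the 64-bit product.
    assert(n == 0 ||
           elem_size <= std::numeric_limits<std::uint64_t>::max() / n / n);
    return n * n * elem_size;
}

// True when the number of elements n * n does not fit in a signed 32-bit int.
// Intended for sizes in the range used here (n * n must fit in 64 bits).
bool element_count_exceeds_int32(std::uint64_t n) {
    return n * n >
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

A single 50000*50000 matrix takes 10 000 000 000 bytes in float and 20 000 000 000 bytes in double, and an inversion needs at least the input plus the result or some workspace. The element count, 2.5 billion, is also larger than the maximum signed 32-bit value (2 147 483 647), whereas 25000*25000 is not.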

test2b : multiplying square matrices (same number of rows and columns) (float implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2010)	Blas + Lapack (maestro-2008)	Blaze
50*50	0s	0s	0s	0s
500*500	0s	0s	0s	0s
1000*1000	1s	0s	1s	0s
2500*2500	10s	2s	10s	1s
5000*5000	87s	13s	77s	11s
10000*10000	704s	104s	703s	96s
25000*25000	10836s	1606s	11808s	1512s
50000*50000	NA	inconsistency in results	segfault	exception

test2b : multiplying square matrices (same number of rows and columns) (double implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2004)	Blas + Lapack (maestro-2008)	Blaze
50*50	0s	0s	0s	0s
500*500	0s	0s	0s	0s
1000*1000	1s	0s	1s	0s
2500*2500	15s	3s	15s	3s
5000*5000	118s	27s	122s	23s
10000*10000	950s	213s	959s	190s
25000*25000	14737s	3335s	14315s	2985s
50000*50000	NA	NA	segfault	exception

test3 : Transpose a matrix (float implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2010)	Blas + Lapack	Blaze (maestro-2004)
500000*50	1s ( ** )	0s	2s ( * )	0s
1000000*100	7s ( ** )	0s	15s ( * )	0s (1s for the transpose+mul version)

( * ) For lapack, the transpose operation is followed by a multiplication (cblas_sgemm).

( ** ) For armadillo, I did a transpose followed by a multiplication to be able to compare with blas.

TODO: do the same with blaze and Eigen

test3: transpose a matrix (double implementation)

library→ matrix size	Armadillo (maestro-2011)	Eigen (maestro-2010)	Blas + Lapack	Blaze (maestro-2004)
500000*50	1s ( ** )	0s+0s=0s (3)	1s	0s
1000000*100	7s ( ** )	1s+2s=3s (3)	9s	0s (2s if transpose+mul)

( * ) For lapack, the transpose operation is followed by a multiplication (cblas_sgemm).

( ** ) For armadillo, I did a transpose followed by a multiplication to be able to compare with blas.

(3) For Eigen, I use 2 timers: one for the transpose and another one for the multiplication. The sum of the two is to be compared with the execution time of the cblas_sgemm operation.
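The two-timer measurement described in note (3) can be sketched as follows. Plain loops stand in for the library calls, so this is not the Eigen code used in the benchmark, and the choice of computing the product of the transpose with the original matrix is an assumption (the page does not say which product is formed). All names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Elapsed time of each stage, in seconds.
struct StageTimes {
    double transpose_s = 0.0;
    double multiply_s = 0.0;
    double total_s() const { return transpose_s + multiply_s; }
};

// Transposes a row-major rows x cols matrix into a row-major cols x rows one.
std::vector<double> transpose(const std::vector<double>& a, std::size_t rows,
                              std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<double> t(a.size());
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            t[j * rows + i] = a[i * cols + j];
    return t;
}

// Computes C = A * B for row-major A (n x k) and B (k x m).
std::vector<double> multiply(const std::vector<double>& a,
                             const std::vector<double>& b, std::size_t n,
                             std::size_t k, std::size_t m) {
    assert(a.size() == n * k && b.size() == k * m);
    std::vector<double> c(n * m, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < m; ++j)
                c[i * m + j] += a[i * k + p] * b[p * m + j];
    return c;
}

// Assumed workload: transpose A, then multiply the transpose by A,
// timing each stage with its own pair of clock readings.
std::vector<double> timed_transpose_multiply(const std::vector<double>& a,
                                             std::size_t rows, std::size_t cols,
                                             StageTimes& times) {
    using Clock = std::chrono::steady_clock;
    using Seconds = std::chrono::duration<double>;
    const auto t0 = Clock::now();
    const std::vector<double> at = transpose(a, rows, cols);
    const auto t1 = Clock::now();
    std::vector<double> result = multiply(at, a, cols, rows, cols);
    const auto t2 = Clock::now();
    times.transpose_s = Seconds(t1 - t0).count();
    times.multiply_s = Seconds(t2 - t1).count();
    return result;
}
```

Reporting `total_s()` gives a figure comparable to a single timer wrapped around a combined transpose-and-multiply call, while the two separate fields show how the time splits between the stages.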

6- Interesting articles on the subject:#

https://cs.stanford.edu/people/shadjis/blas.html