
Linear Algebra Library Benchmarks on Maestro

Intro#

I wanted to get an idea of the performance that can be expected from several linear algebra libraries on the maestro cluster.

These benchmarks are single-process. They exercise common operations: accessing a column or a row of a matrix, transposing a matrix, multiplying two matrices, and inverting a matrix.

Source code for the benches can be found here: https://gitlab.pasteur.fr/vlegrand/linear_algebra_lib_comparison

I tested the 4 libraries listed below. For each run, the benchmark was the only job running on the compute node.

The compiler is always the same (gcc 10, as available on maestro), with the -O3 flag.

Anyone interested in the subject is most welcome to make suggestions regarding new operations or new libraries to test.
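The results below are wall-clock times reported in whole seconds. As a point of reference, here is a minimal sketch of how a single operation can be timed in C++. It assumes std::chrono::steady_clock; the helper name time_seconds is hypothetical and is not taken from the benchmark repository, which may use a different timer.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from the benchmark repository): run `op` once
// and return the elapsed wall-clock time in seconds.
template <typename Op>
double time_seconds(Op&& op) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Op>(op)();  // the operation being measured
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as time_seconds([&] { /* operation under test */ }) returns a fractional number of seconds, so operations that show as 0s in the tables could still be told apart.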


1- armadillo 10.8.2#

Code Block (bash)

module load gcc/10.1.0
module load CBLAS/3.8.0
module load atlas/3.10.2
module load armadillo/10.8.2
g++ -O3 yourmain.cpp -DARMA_DONT_USE_WRAPPER -larmadillo -lcblas -latlas

2- Eigen 3.4#

Code Block (bash)

module load gcc/10.1.0
module load eigen/3.4.0
g++ -O3 yourmain.cpp

3- blas/lapack#

I chose to use the versions available in gensoft.

Code Block (bash)

module load gcc/10.1.0
module load BLAS/3.8.0
module load lapack/3.10.0

4- Blaze:#

I copied the Blaze sources into the directory containing my code.

It is necessary to link with blas and lapack to use blaze. I link with the blas and lapack versions that are available on the cluster.

5- Results#

  • test1: accessing the columns of a matrix (float implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack (maestro-2008) | Blaze (maestro-2007) |
|---|---|---|---|---|
| 500000*50 | 0s | 0s | 0s | 0s |
| 1000000*100 | 0s | 0s | 0s | 0s |

Note: for blas and lapack, matrices are plain C arrays. I wrote one implementation using a 2D array and another one using a 1D array.

There was no difference in access time.
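To make the two layouts concrete, here is a minimal sketch of a column read for each of them. It is an illustration only: the function names are hypothetical, the 1D buffer is assumed to be row-major, and the values are summed simply so that the reads have an observable result.

```cpp
#include <cstddef>
#include <vector>

// "2D array" layout: an array of row pointers. Reads column `col` of a matrix
// with `rows` rows and returns the sum of its entries.
double sum_column_2d(const float* const* m, std::size_t rows, std::size_t col) {
    double acc = 0.0;
    for (std::size_t i = 0; i < rows; ++i) acc += m[i][col];
    return acc;
}

// "1D array" layout: one contiguous buffer, assumed row-major, so element
// (i, j) lives at index i * cols + j.
double sum_column_1d(const float* m, std::size_t rows, std::size_t cols,
                     std::size_t col) {
    double acc = 0.0;
    for (std::size_t i = 0; i < rows; ++i) acc += m[i * cols + col];
    return acc;
}
```

In both versions, consecutive elements of a column sit one full row apart in memory (assuming the rows of the 2D version are themselves contiguous), which fits with the two implementations showing the same access time.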

  • test1: accessing the columns of a matrix (double implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack (maestro-2008) | Blaze (maestro-2007) |
|---|---|---|---|---|
| 500000*50 | 0s | 0s | 0s | 0s |
| 1000000*100 | 0s | 0s | 0s | 0s |

  • test2a: inverting a matrix (float implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 0s | 0s | 0s | 0s |
| 2500*2500 | 11s | 2s | 11s | 3s |
| 5000*5000 | 87s | 18s | 87s | 27s |
| 10000*10000 | 691s | 140s | 694s | 200s |
| 25000*25000 | 10654s | 2171s | 10734s | 3354s |
| 50000*50000 | >18h | 4h48min | segfault | exception std::bad_alloc |

  • test2a: inverting a matrix (double implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 14s | 5s | 14s | 6s |
| 5000*5000 | 117s | 36s | 110s | 50s |
| 10000*10000 | 922s | 288s | 931s | 398s |
| 25000*25000 | 14252s | 4522s | 14888s | 5933s |
| 50000*50000 | >21h | 36606s | NA | exception std::bad_alloc |

  • test2b: multiplying two square matrices (same number of rows and columns) (float implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack (maestro-2008) | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 10s | 2s | 10s | 1s |
| 5000*5000 | 87s | 13s | 77s | 11s |
| 10000*10000 | 704s | 104s | 703s | 96s |
| 25000*25000 | 10836s | 1606s | 11808s | 1512s |
| 50000*50000 | NA | inconsistency in results | segfault | exception |

  • test2b: multiplying two square matrices (same number of rows and columns) (double implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack (maestro-2008) | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 15s | 3s | 15s | 3s |
| 5000*5000 | 118s | 27s | 122s | 23s |
| 10000*10000 | 950s | 213s | 959s | 190s |
| 25000*25000 | 14737s | 3335s | 14315s | 2985s |
| 50000*50000 | NA | NA | segfault | exception |
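
The 50000*50000 rows are the only ones where runs fail outright. The cause of each failure is not established here, but the arithmetic below gives a sense of the sizes involved. It is a sketch with hypothetical helper names: one function computes the storage of a dense square matrix, the other checks whether the element count still fits in a signed 32-bit index.

```cpp
#include <cstdint>
#include <limits>

// Bytes needed to store one dense n x n matrix whose elements take
// `elem_size` bytes each (4 for float, 8 for double).
std::uint64_t dense_matrix_bytes(std::uint64_t n, std::uint64_t elem_size) {
    return n * n * elem_size;
}

// True when the element count n * n fits in a signed 32-bit integer.
bool element_count_fits_int32(std::uint64_t n) {
    return n * n <=
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

For n = 50000, one float matrix takes 10^10 bytes (about 9.3 GiB) and one double matrix twice that, and a product involves three such matrices. The element count, 2.5 * 10^9, also exceeds the largest signed 32-bit integer (2147483647), which matters for any code path that uses 32-bit indices. Neither point is confirmed as the cause of the failures reported above.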

  • test3: transposing a matrix (float implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze (maestro-2004) |
|---|---|---|---|---|
| 500000*50 | 1s ( ** ) | 0s | 2s ( * ) | 0s |
| 1000000*100 | 7s ( ** ) | 0s | 15s ( * ) | 0s (1s for the transpose+mul version) |

( * ) For lapack, the transpose operation is followed by a multiplication (cblas_sgemm).

( ** ) For armadillo, I did a transpose followed by a multiplication to be able to compare with blas.

TODO: do the same with Blaze and Eigen

  • test3: transposing a matrix (double implementation)

| Matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze (maestro-2004) |
|---|---|---|---|---|
| 500000*50 | 1s ( ** ) | 0s+0s=0s (3) | 1s | 0s |
| 1000000*100 | 7s ( ** ) | 1s+2s=3s (3) | 9s | 0s (2s if transpose+mul) |

( * ) For lapack, the transpose operation is followed by a multiplication (cblas_dgemm, the double-precision counterpart of cblas_sgemm).

( ** ) For armadillo, I did a transpose followed by a multiplication to be able to compare with blas.

(3) For Eigen, I have 2 timers: one for the transpose and another one for the multiplication. The sum of the two is to be compared with the execution time of the cblas_dgemm operation.
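
To make the "transpose followed by a multiplication" measurement concrete, here is a minimal sketch of the two steps written as naive loops over row-major buffers. This is not the code of the benchmarks, which call the libraries' own routines; the function names, the Matrix alias, and the choice of multiplying the transpose by the original matrix are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major storage, size rows * cols

// Return the transpose of a rows x cols matrix (the result is cols x rows).
Matrix transpose(const Matrix& a, std::size_t rows, std::size_t cols) {
    Matrix t(a.size());
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            t[j * rows + i] = a[i * cols + j];
    return t;
}

// Naive product of an (m x k) matrix by a (k x n) matrix, both row-major.
Matrix multiply(const Matrix& a, const Matrix& b,
                std::size_t m, std::size_t k, std::size_t n) {
    Matrix c(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}
```

Timing the two calls separately corresponds to the two figures given for Eigen, while timing them together corresponds to the single figure given for the BLAS call, where the transposition is requested as part of the multiplication.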

6- Interesting articles on the subject:#

https://cs.stanford.edu/people/shadjis/blas.html