Linear Algebra Library Benchmarks on Maestro
Intro#
I wanted to get an idea of the performance that can be expected from several linear algebra libraries on the maestro cluster.
These benchmarks are single-process. They perform common operations: accessing a column or a row of a matrix, transposing a matrix, multiplying two matrices, and inverting a matrix.
Source code for the benches can be found here: https://gitlab.pasteur.fr/vlegrand/linear_algebra_lib_comparison
I tested the 4 libraries listed below. Each time, only one benchmark and nothing else is running on the compute node.
The compiler is always the same (gcc 10, as available on maestro), with the flag -O3.
Anyone interested in the subject is most welcome to make suggestions regarding new operations or new libraries to test.
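All timings on this page are reported in whole seconds. As an illustration only, here is a minimal sketch of how one such measurement could be taken. It is not the code from the repository linked above: the helper name `elapsed_seconds` is hypothetical, and truncating to whole seconds is an assumption made because every reported result is an integer number of seconds.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `op` once and returns the elapsed wall-clock
// time truncated to whole seconds (assumed to match how results are reported).
long long elapsed_seconds(const std::function<void()>& op) {
    assert(op);  // an empty std::function cannot be called
    const auto start = std::chrono::steady_clock::now();
    op();
    const auto stop = std::chrono::steady_clock::now();
    // duration_cast to seconds truncates toward zero.
    return std::chrono::duration_cast<std::chrono::seconds>(stop - start).count();
}
```

Under this assumption, any operation that finishes in less than one second is reported as 0s, which would be consistent with the many 0s entries for small matrices.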
1- armadillo 10.8.2#
Code Block (bash)
module load gcc/10.1.0
module load CBLAS/3.8.0
module load atlas/3.10.2
module load armadillo/10.8.2
g++ -O3 yourmain.cpp -DARMA_DONT_USE_WRAPPER -larmadillo -lcblas -latlas
2- Eigen 3.4#
Code Block (bash)
module load gcc/10.1.0
module load eigen/3.4.0
g++ -O3 yourmain.cpp
3- blas/lapack#
I chose to use the versions available in gensoft.
Code Block (bash)
module load gcc/10.1.0
module load BLAS/3.8.0
module load lapack/3.10.0
4- Blaze:#
I copied the blaze sources into the directory containing my code.
It is necessary to link with blas and lapack to use blaze. I link with the blas and lapack versions that are available on the cluster.
5- Results#
- test1 : Accessing the columns of a matrix (float implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack (maestro-2008) | Blaze (maestro-2007) |
|---|---|---|---|---|
| 500000*50 | 0s | 0s | 0s | 0s |
| 1000000*100 | 0s | 0s | 0s | 0s |
Note: For blas and lapack, matrices are plain C arrays. I wrote one implementation using a 2D array and another one using a 1D array.
There was no difference in access time.
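To illustrate the two layouts mentioned in the note, here is a minimal sketch of reading one column from a flat 1D array and from a 2D array of row pointers. The function names `column_sum_1d` and `column_sum_2d` are hypothetical, row-major storage is an assumption, and summing the column is only a stand-in for whatever the actual benchmark does with the elements.

```cpp
#include <cassert>
#include <cstddef>

// Sum column `col` of a rows x cols matrix stored in one flat array.
// Row-major storage is assumed here; the benchmark's layout is not stated.
float column_sum_1d(const float* m, std::size_t rows, std::size_t cols, std::size_t col) {
    assert(col < cols);
    float sum = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
        sum += m[r * cols + col];  // consecutive rows are `cols` elements apart
    }
    return sum;
}

// Same operation on a 2D layout: an array of pointers, one per row.
float column_sum_2d(const float* const* m, std::size_t rows, std::size_t cols, std::size_t col) {
    assert(col < cols);
    float sum = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
        sum += m[r][col];  // one extra pointer lookup per row
    }
    return sum;
}
```

In both versions, walking down a column jumps from row to row in memory, so the access pattern is essentially the same.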
- test1 : Accessing the columns of a matrix (double implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack (maestro-2008) | Blaze (maestro-2007) |
|---|---|---|---|---|
| 500000*50 | 0s | 0s | 0s | 0s |
| 1000000*100 | 0s | 0s | 0s | 0s |
- test2a : inverting a matrix (float implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 0s | 0s | 0s | 0s |
| 2500*2500 | 11s | 2s | 11s | 3s |
| 5000*5000 | 87s | 18s | 87s | 27s |
| 10000*10000 | 691s | 140s | 694s | 200s |
| 25000*25000 | 10654s | 2171s | 10734s | 3354s |
| 50000*50000 | >18h | 4h48min | segfault | exception std::bad_alloc |
- test2a : inverting a matrix (double implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 14s | 5s | 14s | 6s |
| 5000*5000 | 117s | 36s | 110s | 50s |
| 10000*10000 | 922s | 288s | 931s | 398s |
| 25000*25000 | 14252s | 4522s | 14888s | 5933s |
| 50000*50000 | >21h | 36606s | NA | exception std::bad_alloc |
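The 50000*50000 rows are where failures (segfault, std::bad_alloc) start to appear. Their cause was not investigated here, but as a point of reference, the sketch below computes the raw storage that one dense square matrix needs. The function name `dense_matrix_bytes` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed to store one dense n x n matrix whose
// elements are `element_size` bytes each (typically 4 for float, 8 for double).
std::uint64_t dense_matrix_bytes(std::uint64_t n, std::size_t element_size) {
    assert(element_size > 0);
    return n * n * static_cast<std::uint64_t>(element_size);
}
```

For n = 50000 this gives 10,000,000,000 bytes (about 10 GB) per float matrix and 20,000,000,000 bytes (about 20 GB) per double matrix, before counting the result matrix or any workspace a library allocates. The element count itself, 2.5e9, also exceeds the range of a 32-bit signed int, which may matter for any code that indexes with int.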
- test2b : multiplying square matrices (same number of rows and columns) (float implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack (maestro-2008) | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 10s | 2s | 10s | 1s |
| 5000*5000 | 87s | 13s | 77s | 11s |
| 10000*10000 | 704s | 104s | 703s | 96s |
| 25000*25000 | 10836s | 1606s | 11808s | 1512s |
| 50000*50000 | NA | inconsistency in results | segfault | exception |
- test2b : multiplying square matrices (same number of rows and columns) (double implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2004) | Blas + Lapack (maestro-2008) | Blaze |
|---|---|---|---|---|
| 50*50 | 0s | 0s | 0s | 0s |
| 500*500 | 0s | 0s | 0s | 0s |
| 1000*1000 | 1s | 0s | 1s | 0s |
| 2500*2500 | 15s | 3s | 15s | 3s |
| 5000*5000 | 118s | 27s | 122s | 23s |
| 10000*10000 | 950s | 213s | 959s | 190s |
| 25000*25000 | 14737s | 3335s | 14315s | 2985s |
| 50000*50000 | NA | NA | segfault | exception |
- test3 : Transpose a matrix (float implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze (maestro-2004) |
|---|---|---|---|---|
| 500000*50 | 1s ( ** ) | 0s | 2s ( * ) | 0s |
| 1000000*100 | 7s ( ** ) | 0s | 15s ( * ) | 0s (1s for the transpose+mul version) |
( * ) For lapack, the transpose operation is followed by a multiplication (cblas_sgemm).
( ** ) For armadillo, I did a transpose followed by a multiplication, to be able to compare with blas.
TODO: do the same with blaze and Eigen
- test3: transpose a matrix (double implementation)
| library→ matrix size | Armadillo (maestro-2011) | Eigen (maestro-2010) | Blas + Lapack | Blaze (maestro-2004) |
|---|---|---|---|---|
| 500000*50 | 1s ( ** ) | 0s+0s=0s (3) | 1s | 0s |
| 1000000*100 | 7s ( ** ) | 1s+2s=3s (3) | 9s | 0s (2s if transpose+mul) |
( * ) For lapack, the transpose operation is followed by a multiplication (cblas_sgemm).
( ** ) For armadillo, I did a transpose followed by a multiplication, to be able to compare with blas.
(3) For Eigen, I have 2 timers: one for the transpose and another one for the multiplication. The sum of the two is to be compared with the execution time of the cblas_sgemm operation.
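Footnote (3) describes timing the transpose and the multiplication separately and then summing the two. A minimal sketch of that two-timer pattern follows. It uses naive loops on plain `std::vector` storage rather than Eigen so that it is self-contained, so its timings say nothing about any of the libraries. The names `TwoPhaseTiming` and `transpose_then_multiply` are hypothetical, and the product computed (the transpose of A times A) is an assumption about what gets multiplied.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

struct TwoPhaseTiming {
    double transpose_s;           // seconds spent in the transpose
    double multiply_s;            // seconds spent in the multiplication
    std::vector<double> product;  // cols x cols result of A^T * A, row-major
};

// Naive illustration: transpose a row-major rows x cols matrix, then compute
// A^T * A, timing each phase with its own pair of clock readings.
TwoPhaseTiming transpose_then_multiply(const std::vector<double>& a,
                                       std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    using clock = std::chrono::steady_clock;

    const auto t0 = clock::now();
    std::vector<double> at(cols * rows);  // A^T, cols x rows, row-major
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            at[c * rows + r] = a[r * cols + c];
    const auto t1 = clock::now();

    std::vector<double> p(cols * cols, 0.0);
    for (std::size_t i = 0; i < cols; ++i)
        for (std::size_t k = 0; k < rows; ++k)
            for (std::size_t j = 0; j < cols; ++j)
                p[i * cols + j] += at[i * rows + k] * a[k * cols + j];
    const auto t2 = clock::now();

    return {std::chrono::duration<double>(t1 - t0).count(),
            std::chrono::duration<double>(t2 - t1).count(),
            std::move(p)};
}
```

The sum `transpose_s + multiply_s` is the figure that would be compared against a single call that does both steps at once.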