Overview
Available GPU nodes#
To obtain information about the current GPU nodes in the cluster, use the sinfo command like this:
GPU description (bash)
yourlogin@maestro-submit ~ $ sinfo -e -o "%D %c %m %G" -p gpu
NODES CPUS MEMORY GRES
1 96 980000 disk:890000M,gmem:no_consume:80G,gpu:A100:4,spread:9600
3 96 480000 disk:890000M,gmem:no_consume:40G,gpu:A100:4,spread:9600
7 96 980000 disk:890000M,gmem:no_consume:48G,gpu:A40:8,spread:9600
1 96 980000 disk:890000M,gmem:no_consume:48G,gpu:l40s:8,spread:9600
yourlogin@maestro-submit ~ $ sinfo -e -o "%D %c %m %G" -p dedicatedgpu
NODES CPUS MEMORY GRES
1 96 480000 disk:890000M,gmem:no_consume:40G,gpu:A100:4,spread:9600
4 24 100000 disk:890000M,gmem:no_consume:40G,gpu:A100:2,spread:2400
2 24 100000 disk:890000M,gmem:no_consume:48G,gpu:A40:2,spread:2400
The above output means that the cluster contained (at the moment the command was run) the following GPU nodes:
- 3 A100 nodes with 4 GPUs, 40 GB of RAM per GPU, 96 cores, 500 GB of RAM
- 1 A100 node with 4 GPUs, 80 GB of RAM per GPU, 96 cores, 1TB of RAM
- 7 A40 nodes with 8 GPUs, 48 GB of RAM per GPU, 96 cores, 1 TB of RAM
- 1 l40s node with 8 GPUs, 48 GB of RAM per GPU, 96 cores, 1 TB of RAM
in the gpu partition, meant for production runs;
and the dedicatedgpu partition, reserved for opportunistic jobs that can be killed and rescheduled, containing:
- 1 A100 node with 4 GPUs, 40 GB of RAM per GPU, 96 cores, 500 GB of RAM,
- 4 A100 nodes with 2 GPUs, 40 GB of RAM per GPU, 24 cores, 100 GB of RAM,
- 2 A40 nodes with 2 GPUs, 48 GB of RAM per GPU, 24 cores, 100 GB of RAM.
Submission on GPUs#
Available QoS#
gpu#
To submit on common GPU nodes, you must use one of the allowed QoS. The gpu QoS is limited to 3 days, as you can see with sacctmgr. If you don't expect your jobs to run for more than a few hours, use normal (24 hours) as usual. The fast (2 hours) and ultrafast (5 min) QoS are also available to test your code.
sacctmgr to obtain information on QoS (bash)
$ sacctmgr show qos where name=gpu format=name,priority,maxwall
Name Priority MaxWall
---------- ---------- -----------
gpu 10000 3-00:00:00
$ sacctmgr show qos where name=fast format=name,priority,maxwall
Name Priority MaxWall
---------- ---------- -----------
fast 1000 02:00:00
$ sacctmgr show qos where name=ultrafast format=name,priority,maxwall
Name Priority MaxWall
---------- ---------- -----------
ultrafast 5000 00:05:00
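For instance, a minimal batch script for the gpu partition can declare the partition, the QoS, the GPU request and a wall time below the QoS limit directly with #SBATCH directives; this is only a sketch, and the program name is a placeholder for your actual command:
Example batch script for the gpu partition (bash)
#!/bin/bash
#SBATCH -p gpu                # gpu partition
#SBATCH -q gpu                # gpu QoS: up to 3 days of wall time
#SBATCH --gres=gpu:A100:1     # one A100 card
#SBATCH -t 1-00:00:00         # 1 day, below the 3-day limit of the gpu QoS
#SBATCH --mem=16G             # host RAM for the cores (not GPU memory)

srun your_gpu_program         # your_gpu_program is a placeholder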
dedicatedgpu#
The dedicatedgpu partition contains GPU nodes belonging to research units. You can therefore only submit short jobs there (using the fast or ultrafast QoS), since jobs can be killed (and automatically requeued) if the owner of the GPU nodes needs the resources.
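For instance, a short job could be submitted there like this (the batch script path is only a placeholder); if the owner reclaims the nodes, the job is killed and requeued automatically:
Example of submission on the dedicatedgpu partition (bash)
$ sbatch -p dedicatedgpu -q fast --gres=gpu:A100:1 /path/to/your/batch_script
$ srun -p dedicatedgpu -q ultrafast --gres=gpu:1 nvidia-smi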
General Resources: cards and their memory#
GPUs are a type of General RESource. As a consequence, you must also use the --gres option to indicate the number of GPUs you want to use and, if it matters, their type. Example:
Example of command using the gpu gres (bash)
$ sbatch -p gpu -q gpu --gres=gpu:A100 /path/to/your/batch_script
$ sbatch -p dedicatedgpu -q fast --gres=gpu:4 /path/to/your/batch_script
$ srun -p gpu -q ultrafast --gres=gpu:A40:2 your_command
For instance, you can check the parameters of a GPU this way:
Retrieve GPU node usage (bash)
yourlogin@maestro ~ $ srun -p gpu --qos=ultrafast --gres=gpu:A100:1 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 22C P0 49W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Please note that a GPU node has:
- RAM for the cores, which you can allocate with --mem or --mem-per-cpu (the MEMORY column of the sinfo output);
- a dedicated amount of RAM for each GPU card: the card memory or GPU memory. All GPU cards belonging to the same node are of the same type (A40, A100...) and each has the same amount of dedicated card memory (40 GB, 48 GB or 80 GB, as shown in the GRES column of the sinfo output above).
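As a minimal sketch of that distinction, the request below asks for host RAM with --mem and for one GPU card with --gres: the 10G only applies to the CPU side of the node, while the allocated card keeps its own dedicated memory (the batch script path is only a placeholder):
Example combining host RAM and a GPU request (bash)
$ sbatch -p gpu -q gpu --mem=10G --gres=gpu:A40:1 /path/to/your/batch_script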
The card memory is also a General RESource (called gmem) that you can add to the --gres option to specify the minimum amount of dedicated memory the GPU cards must have. Example:
Example of command using the gpu & gmem gres (bash)
yourlogin@maestro ~ $ srun -p gpu -q ultrafast --gres=gpu:1,gmem:50G nvidia-smi
Mon Apr 11 20:20:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 25C P0 67W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Here, we haven't explicitly asked for an A100, but it is the only card type with at least 50 GB of GPU memory per card (81920 MiB, i.e. 80 GB).
If you ask for less, Slurm will, depending on card availability, allocate the node whose cards have the closest memory amount, to avoid wasting memory. In the following example, the allocation is done on an A40 card, whose 48 GB of card memory are enough:
Example of command using the gpu & gmem gres (bash)
yourlogin@maestro ~ $ srun -p gpu -q ultrafast --gres=gpu:1,gmem:42G nvidia-smi
Mon Apr 11 20:28:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 26C P8 28W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Of course, if you ask for a combination of type and memory amount that doesn't exist, an error message is returned:
Example of command using the gpu & gmem gres (bash)
yourlogin@maestro ~ $ srun -p gpu -q ultrafast --gres=gpu:A40:1,gmem:50G nvidia-smi
srun: error: Unable to allocate resources: Requested node configuration is not available
GPU numbering inside and outside the allocation#
If you need the IDs of the GPU cards you are going to run on, please note that the numbering is not the same inside and outside the allocation. To simplify the example, let's use salloc instead of srun or sbatch and allocate one card on a GPU node.
Code Block (bash)
$ salloc -p gpu -q gpu --gres=gpu:1
salloc: Granted job allocation 10426439
salloc: Waiting for resource configuration
salloc: Nodes maestro-3002 are ready for job
yourlogin@maestro-3002 ~ $
If you check the number of the GPU card using scontrol, that is to say from outside the allocation (regardless of where you launch the scontrol command, since you refer to the job by its ID), you will retrieve the ID of the GPU card assigned to the job on the node (here 2):
Code Block (bash)
yourlogin@maestro-3002 ~ $ scontrol show jobid=10426439 -dd | grep " GRES="
Nodes=maestro-3002 CPU_IDs=4 Mem=4096 GRES=gpu:A100:1(IDX:2)
while if you check the number of the GPU card inside the allocation, for example by running nvidia-smi or by looking at the value of the $CUDA_VISIBLE_DEVICES environment variable inside the salloc, you will retrieve the number assigned to the card inside the allocation:
Code Block (bash)
yourlogin@maestro-3000 ~ $ echo $CUDA_VISIBLE_DEVICES
0
The numbering inside the allocation always starts at 0. So if you ask for 2 GPU cards with --gres=gpu:2, then in the allocation 2 GPU cards numbered 0 and 1 will be visible:
Code Block (bash)
yourlogin@maestro-3000 ~ $ echo $CUDA_VISIBLE_DEVICES
0,1
Since that numbering is perfectly predictable, you can potentially use it in the configuration of a software program requiring GPU card IDs.
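For instance, a program that expects an explicit device index in its configuration can be given the in-allocation numbering directly, or the value of $CUDA_VISIBLE_DEVICES; the program name and its --device option below are purely illustrative:
Code Block (bash)
# inside a job that requested --gres=gpu:1, the only visible card is always 0
srun your_gpu_program --device 0

# or pass along the IDs visible inside the allocation
srun your_gpu_program --device "$CUDA_VISIBLE_DEVICES"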
Interactive sessions#
If you need an interactive session to make tests, use salloc with the same options as above:
Code Block (bash)
yourlogin@maestro ~ $ salloc -p gpu -q gpu --gres=gpu:A100:1
You then obtain a shell on a GPU node, but that shell runs on what NVIDIA calls the "host", that is to say the CPU part of the node. To access the GPUs of your allocation from that shell, you must use srun, as you do in an sbatch script:
interactive session on gpu node (bash)
yourlogin@maestro $ salloc -p gpu -q gpu --gres=gpu:A100
salloc: Pending job allocation 46539046
salloc: job 46539046 queued and waiting for resources
salloc: job 46539046 has been allocated resources
salloc: Granted job allocation 46539046
salloc: Waiting for resource configuration
salloc: Nodes maestro-3004 are ready for job
yourlogin@maestro-3004 ~ $ srun <your GPU program with its options and arguments>
As usual, all srun option values are inherited from salloc unless you explicitly override them. So, in the above example, the command after srun runs on the allocated GPU node with access to the allocated GPUs.
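For instance, inside the salloc shell above you can first check the allocated card and then launch your program; to change an option such as the number of tasks, simply repeat it on the srun command line (your_gpu_program is only a placeholder):
Code Block (bash)
yourlogin@maestro-3004 ~ $ srun nvidia-smi                 # inherits all the salloc options
yourlogin@maestro-3004 ~ $ srun -n 1 your_gpu_program      # explicitly sets the task count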