
Monitor Multiple Jobs on the Same GPU Node

GPU jobs

If you have several jobs running on the same node and want to monitor each of them separately, for example with nvidia-smi, you must do it from inside the corresponding job allocation.


JOBID     NAME                PARTITION QOS       ST CPUS  NODES NODELIST                      
15676949  J3                  gpu       gpu       R  1     1     maestro-3015                  
15676914  J4                  gpu       gpu       R  2     1     maestro-3015
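
A listing like this one can be obtained with squeue. The format string below is only an example (adapt the columns to your needs; --me requires a recent Slurm version, otherwise use -u $USER):

$ squeue --me --states=RUNNING -o "%.9i %.19j %.9P %.9q %.2t %.5C %.6D %N"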

For example, the jobs above use 1 GPU card (index 5) and 2 GPU cards (indices 2 and 3), respectively, on maestro-3015:


$ scontrol show job -dd 15676949 | grep "GRES=gpu"
   JOB_GRES=gpu:A40:1
     Nodes=maestro-3015 CPU_IDs=16 Mem=4096 GRES=gpu:A40:1(IDX:5)

and


$ scontrol show job -dd 15676914 | grep "GRES=gpu"
   JOB_GRES=gpu:A40:2
     Nodes=maestro-3015 CPU_IDs=15,52 Mem=8192 GRES=gpu:A40:2(IDX:2-3)
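
If you have several GPU jobs running, a small loop can print the allocated GPU indices of all of them at once. This is only a sketch (again, --me requires a recent Slurm version, otherwise use -u $USER):

# Print the allocated GPU index(es) of each of your running jobs.
for jobid in $(squeue --me --states=RUNNING -h -o "%i"); do
    echo "== job $jobid =="
    scontrol show job -dd "$jobid" | grep "GRES=gpu"
done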

To see how the programs run by the jobs use the GPU card(s), you must run the nvidia-smi command inside the job allocation. To do so, use the --jobid option of srun like this:


$ srun --overlap --jobid 15676949 sh -c 'hostname && nvidia-smi'                                                              
maestro-3015
Fri Jul 19 15:18:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:C1:00.0 Off |                    0 |
|  0%   67C    P0             293W / 300W |  40928MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    870677      C   ...<path/to/the/executable>               40920MiB |
+---------------------------------------------------------------------------------------+

Here, instead of creating its own allocation, the srun command is executed directly as a step inside the allocation made for job 15676949, just as if it were in the sbatch script of that job.

Note the --overlap option. It allows a step (that is, an srun command) to share all resources (CPUs, memory and GRes such as GPUs) with any other step. A step using --overlap overlaps any other step, even those that did not specify --overlap, so you do not need to add --overlap to your sbatch script.
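
The same mechanism also works for continuous or interactive monitoring. For example (a sketch; watch must be available on the node, which is usually the case):

# Refresh nvidia-smi every 5 seconds inside the allocation of job 15676949
# (stop with Ctrl+C); --pty attaches your terminal to the step.
$ srun --overlap --pty --jobid 15676949 watch -n 5 nvidia-smi

# Or open an interactive shell inside the allocation and run commands by hand.
$ srun --overlap --pty --jobid 15676949 bash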

If you run the same nvidia-smi command for the second job, 15676914, both allocated cards are visible:


$ srun --overlap --jobid 15676914 sh -c 'hostname && nvidia-smi'                                                              
maestro-3015
Fri Jul 19 15:17:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:01:00.0 Off |                    0 |
|  0%   73C    P0             293W / 300W |  40928MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:24:00.0 Off |                    0 |
|  0%   67C    P0             297W / 300W |  40928MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    870415      C   ...<path/to/the/executable>               40920MiB |
|    1   N/A  N/A    870428      C   ...<path/to/the/executable>               40920MiB |
+---------------------------------------------------------------------------------------+

As usual, notice that the numbering of the GPU cards is not the same inside the allocation, where it always starts at 0, and outside the allocation, where the numbering corresponds to the one given to the cards of the whole node (2 and 3 for job 15676914 above), as described here.
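
If you need to map the numbering inside the allocation back to the physical cards of the node, one possibility (a sketch using standard nvidia-smi query options) is to compare the PCI bus IDs, which do not depend on the numbering:

# Inside the allocation the indices start at 0, but the PCI bus IDs
# still identify the physical cards of the node (compare them with
# the Bus-Id column of a node-wide nvidia-smi).
$ srun --overlap --jobid 15676914 sh -c 'nvidia-smi --query-gpu=index,pci.bus_id --format=csv'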