Monitor Multiple Jobs on the Same GPU Node
GPU jobs
If you have several jobs running on the same node and want to monitor each of them separately, for example with nvidia-smi, you must do it from inside each job's allocation.
Code Block (text)
   JOBID NAME PARTITION QOS ST CPUS NODES NODELIST
15676949   J3       gpu gpu  R    1     1 maestro-3015
15676914   J4       gpu gpu  R    2     1 maestro-3015
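A listing like the one above can be obtained with squeue and a custom format string. The field list below is only one possible choice and can be adapted to your needs:
Code Block (text)
$ squeue -u $USER --format="%.8i %.4j %.9P %.3q %.2t %.4C %.5D %N"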
For example, the jobs above have 1 GPU card (number 5) and 2 GPU cards (numbers 2 and 3) respectively on maestro-3015:
Code Block (text)
$ scontrol show job -dd 15676949 | grep "GRES=gpu"
JOB_GRES=gpu:A40:1
Nodes=maestro-3015 CPU_IDs=16 Mem=4096 GRES=gpu:A40:1(IDX:5)
and
Code Block (text)
$ scontrol show job -dd 15676914 | grep "GRES=gpu"
JOB_GRES=gpu:A40:2
Nodes=maestro-3015 CPU_IDs=15,52 Mem=8192 GRES=gpu:A40:2(IDX:2-3)
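If you want this GPU mapping for all of your running jobs at once, you can loop over the job IDs reported by squeue. This is just a convenience sketch built from the commands above:
Code Block (text)
# print the allocated GPU indices of every running job you own
for jobid in $(squeue -u $USER --states=RUNNING --noheader --format="%i"); do
    echo "== job $jobid =="
    scontrol show job -dd "$jobid" | grep "GRES=gpu"
done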
To see how the programs run by a job use its GPU card(s), you must run the nvidia-smi command inside that job's allocation. To do so, use the --jobid option of srun like this:
Code Block (text)
$ srun --overlap --jobid 15676949 sh -c 'hostname && nvidia-smi'
maestro-3015
Fri Jul 19 15:18:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:C1:00.0 Off | 0 |
| 0% 67C P0 293W / 300W | 40928MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 870677 C ...<path/to/the/executable> 40920MiB |
+---------------------------------------------------------------------------------------+
Instead of creating its own allocation, the srun command here is executed directly as a step inside the allocation made for job 15676949, just as if it were in the sbatch script of that job.
Note the --overlap option. It allows a step (that is to say, a srun command) to share all resources (CPUs, memory and GRES such as GPUs) with any other step. A step using --overlap overlaps any other step, even those that did not specify --overlap, so you do not need to add --overlap to your sbatch script.
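The same approach works with any monitoring command, not just the full nvidia-smi table. For example, if you prefer a compact view that refreshes periodically, nvidia-smi's query mode can be run as an overlapping step; the field list below is only an example and can be adjusted (stop it with Ctrl-C):
Code Block (text)
$ srun --overlap --jobid 15676949 \
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
    --format=csv -l 5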
If you do the same for the second job, 15676914, both allocated cards are visible:
Code Block (text)
$ srun --overlap --jobid 15676914 sh -c 'hostname && nvidia-smi'
maestro-3015
Fri Jul 19 15:17:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:01:00.0 Off | 0 |
| 0% 73C P0 293W / 300W | 40928MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:24:00.0 Off | 0 |
| 0% 67C P0 297W / 300W | 40928MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 870415 C ...<path/to/the/executable> 40920MiB |
| 1 N/A N/A 870428 C ...<path/to/the/executable> 40920MiB |
+---------------------------------------------------------------------------------------+
As usual, note that the numbering of the GPU cards is not the same inside the allocation, where it always starts at 0, and outside the allocation, where the numbering is the one given to the cards of the whole node (2 and 3 for job 15676914 above), as described here.
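If you need to relate the two numberings, one possible trick (just a sketch, using nvidia-smi's query mode) is to look at the PCI bus IDs, which identify the physical cards independently of how they are numbered:
Code Block (text)
$ srun --overlap --jobid 15676914 nvidia-smi --query-gpu=index,pci.bus_id --format=csv
The bus IDs (00000000:01:00.0 and 00000000:24:00.0 in the table above for job 15676914) do not change with the numbering, so they can be matched against a node-level listing.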