Overview
How to keep an eye on my job when it is running#
For that, you can use salloc. To be able to fully monitor your program, explicitly restrict yourself to one node if you use more than one CPU. Once you have obtained an interactive shell on the node, launch a tmux session and open as many panes as you need (see the sketch after this list):
- some to launch programs (that will be SLURM steps)
- some to monitor your programs with top, check the size of your temporary files with watch ls -lh /local/scratch, ...
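As a minimal sketch of that workflow (./my_program and the node prompt are placeholders; adapt the core count to your needs):
Code Block (bash)
# request an interactive allocation on a single node with 4 cores
login@maestro-submit ~ $ salloc -N 1 -c 4
# once you have a shell on the allocated node (prompt shown as an example):
login@allocated-node ~ $ tmux
# in one tmux pane, launch your program as a SLURM step:
login@allocated-node ~ $ srun ./my_program
# in other panes, monitor it:
login@allocated-node ~ $ top
login@allocated-node ~ $ watch ls -lh /local/scratch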
How to check my jobs' resource usage#
Example with memory and core usage#
If your job seems slower than expected or you need to adjust the amount of memory allocated to it, you can use the seff command. It provides an easy-to-read summary of CPU and memory efficiency.
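If you no longer have the job ID at hand, you can first list your recent jobs with sacct; the -X flag keeps one line per job:
Code Block (bash)
login@maestro-submit ~ $ sacct -X --format=jobid,jobname,state,elapsed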
Example:
Code Block (text)
login@maestro-submit ~ $ seff 10175173
Job ID: 10175173
Cluster: maestro
User/Group: biomaj-prod/banques
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 01:56:13
CPU Efficiency: 97.18% of 01:59:35 core-walltime
Job Wall-clock time: 01:59:35
Memory Utilized: 22.50 GB
Memory Efficiency: 93.75% of 24.00 GB
For more information, please consult the seff page.
Example with memory and core usage for multiple jobs#
If your jobs seem slower than expected or you need to adjust their memory allocation, you can use the reportseff command. It provides an easy-to-read summary of CPU and memory efficiency for several jobs at once.
Example:
Code Block (text)
login@maestro-submit ~ $ reportseff 10175173 10175174
JobID State Elapsed TimeEff CPUEff MemEff
10175173 COMPLETED 01:59:35 8.3% 97.2% 93.8%
10175174 COMPLETED 00:06:47 0.5% 82.3% 11.0%
TimeEff → Efficiency of time usage, calculated against the time limit. For example, the default time limit on the common partition is 24 hours.
CPUEff → Efficiency of core usage, calculated with the following formula: total CPU time / number of cores / job duration * 100
MemEff → Efficiency of memory usage, calculated with the following formula: max memory usage / amount of memory requested * 100 (a worked example follows)
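As a sanity check using job 10175173 from the seff example above: CPUEff = 01:56:13 of CPU time / 1 core / 01:59:35 of wall-clock time = 6973 s / 7175 s ≈ 97.2%; MemEff = 22.50 GB / 24.00 GB ≈ 93.8%; and with the 24-hour default time limit of the common partition, TimeEff = 01:59:35 / 24:00:00 ≈ 8.3%. All three match the reportseff output.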
If you use reportseff on a job array, you can see the efficiency of each array task:
Code Block (text)
login@maestro-submit ~ $ reportseff 30379298
JobID State Elapsed TimeEff CPUEff MemEff
30379298_0 COMPLETED 00:00:21 0.3% 90.5% 1.1%
30379298_1 COMPLETED 00:00:22 0.3% 86.4% 1.1%
30379298_2 COMPLETED 00:00:23 0.3% 87.0% 1.1%
30379298_3 COMPLETED 00:00:22 0.3% 90.9% 1.1%
30379298_4 COMPLETED 00:00:24 0.3% 87.5% 1.2%
30379298_5 COMPLETED 00:00:22 0.3% 90.9% 1.1%
[...]
30379298_2123 RUNNING 00:00:06 0.1% --- ---
30379298_2124 RUNNING 00:00:06 0.1% --- ---
30379298_2125 RUNNING 00:00:06 0.1% --- ---
30379298_[2126-4278%20] PENDING --- --- --- ---
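To survey many jobs at once without listing every job ID, recent reportseff versions also accept a user filter; a sketch, assuming your installed version provides the --user option:
Code Block (bash)
login@maestro-submit ~ $ reportseff --user $USER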
For more information, please consult the reportseff page.
Best practices for CPU and memory usage#
After checking a job's efficiency, you need to understand how to interpret the result.
CPU#
The following diagram shows how to interpret the core efficiency of a job:
[[DRAW.IO DIAGRAM PLACEHOLDER]]
To sum up:
→ If a job with a single allocated core has less than 75% efficiency, you should look for the job's bottleneck, but you don't need to allocate more cores.
→ If a job with a single allocated core has more than 75% efficiency, the allocation looks right. You should still check that the program is properly set up, i.e. that it does not run too many threads/processes on that single core.
→ If a job with more than one allocated core has less than 75% efficiency, you should check that the program is properly set up, i.e. that it starts enough threads/processes to use all allocated cores, or decrease the number of cores requested (see the example after this list).
→ If a job with more than one allocated core has more than 75% efficiency, the allocation looks right. You should still check that the program is properly set up, i.e. that it does not run too many threads/processes per core.
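A common way to keep the thread count and the core allocation in sync is to derive it from the SLURM environment rather than hard-coding it. A minimal sketch (my_tool, its --threads flag and input.dat are hypothetical):
Code Block (bash)
#!/bin/bash
#SBATCH -N 1
#SBATCH --cpus-per-task=4
# match the thread count to the allocated cores instead of hard-coding it
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"   # for OpenMP-based programs
my_tool --threads "${SLURM_CPUS_PER_TASK}" input.dat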
Memory#
The following diagram shows how to interpret the memory efficiency of a job:
[[DRAW.IO DIAGRAM PLACEHOLDER]]
To sum up:
→ If a job has less than 50% memory efficiency, you should decrease the memory allocated to it.
→ If a job has between 50% and 80% memory efficiency, the job seems to be well allocated.
→ If a job has more than 80% memory efficiency, you should increase the memory allocated to it in order to avoid "Out Of Memory" issues, as shown below.
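For instance, with the seff example above (22.50 GB used out of 24.00 GB requested, i.e. 93.75% efficiency), resubmitting with a slightly larger request such as --mem=28G would bring the efficiency down to about 80% (22.5 / 28 ≈ 0.80) and leave headroom against out-of-memory kills (my_job.sh is a hypothetical batch script):
Code Block (bash)
login@maestro-submit ~ $ sbatch --mem=28G my_job.sh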
Example with GPU usage#
Code Block (text)
login@maestro-submit ~ $ reportseff 13979617 -g
JobID State Elapsed TimeEff CPUEff MemEff GPUEff GPUMem
13979617 OUT_OF_MEMORY 00:01:06 0.0% 46.4% 64.8% 58.5% 80.2%
maestro-3003 46.4% 64.8% 93.3% 88.9%
1 94% 88.9%
2 95% 88.9%
3 91% 88.9%
maestro-3013 46.4% 64.8% 23.6% 71.4%
0 25% 88.9%
1 32% 88.9%
2 31% 88.9%
3 30% 88.9%
7 0% 1.3%
GPUEff → Efficiency of graphics card usage
GPUMem → Efficiency of graphics card memory usage
If the waiting time for a GPU job is too long, it can be worth checking whether your pipeline could run on a GPU with less dedicated GPU memory. The scheduler will then be more likely to find a node with the required resources, so the waiting time of your job should decrease.
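If you decide to target a GPU model with less memory, SLURM's generic-resource syntax lets you name a specific model. A sketch (the model name A40 and my_gpu_job.sh are placeholders; check which models your cluster exposes):
Code Block (bash)
# request one GPU of a specific model instead of any available GPU
login@maestro-submit ~ $ sbatch --gres=gpu:A40:1 my_gpu_job.sh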
For more information, please consult the dedicated reportseff page.
Example with I/O#
With sacct, you have access to the average and maximum amount of data written (and read) by your job:
Code Block (bash)
login@maestro-submit ~ $ sacct -j 350241 --format=jobid,jobname,user,state,exitcode,maxdiskwrite,avediskwrite
JobID JobName User State ExitCode MaxDiskWrite AveDiskWrite
-------- ---------- --------- ---------- -------- ------------ --------------
350241 myjob login TIMEOUT 1:0 291.89M 291.89M
This lets you determine whether the job's I/O is worth investigating and optimizing in the code.
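sacct also exposes the read-side counterparts, MaxDiskRead and AveDiskRead, so you can query the full I/O picture in one command:
Code Block (bash)
login@maestro-submit ~ $ sacct -j 350241 --format=jobid,maxdiskread,avediskread,maxdiskwrite,avediskwrite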
Example with nodes#
If, in your srun/sbatch command, you were flexible about the number of nodes used to execute your tasks by using -N <minnode>-<maxnode>, you have access to the number of allocated nodes and their list:
Code Block (bash)
login@maestro-submit ~ $ sacct -j 13292320 --format=jobid,jobname,user,qos,state,exitcode,ncpus,nnodes,nodelist%25
JobID JobName User QOS State ExitCode NCPUS NNodes NodeList
-------- ---------- ------- ------- ---------- -------- ------ ------- -----------------------
13292320 myjob login fast COMPLETED 0:0 8 4 maestro-[1012-1013,1015,1018]
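For reference, such a flexible request can be submitted as follows (my_job.sh is a hypothetical batch script); SLURM allocates any node count within the range, which often shortens the queueing time:
Code Block (bash)
login@maestro-submit ~ $ sbatch -N 2-4 my_job.sh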