Using scontrol to Update, Hold, or Release Pending Jobs
Checking the characteristics of a job#
To see all the details of a running or pending job, you can use the scontrol show command:
Code Block (bash)
login@maestro-submit ~ $ scontrol show job <job id>
The output looks like:
Code Block (bash)
login@maestro-submit ~ $ scontrol show job 16876320
JobId=16876320 JobName=J401
UserId=<login>(<userID>) GroupId=<GroupName>(<groupID>) MCS_label=N/A
Priority=5458 Nice=0 Account=<account name> QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2017-08-09T20:16:32 EligibleTime=2017-08-09T20:16:32
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=common AllocNode:Sid=maestro-submit0:28063
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=5000,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pasteur/appa/homes/login/script.sh
WorkDir=/pasteur/appa/homes/login
StdErr=/pasteur/appa/homes/login/J401.err
StdIn=/dev/null
StdOut=/pasteur/appa/homes/login/J401.out
Power=
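If you are only interested in one or two of these fields, a possible shortcut (just a sketch, filtering the text output) is to pipe it through grep:
Code Block (bash)
# keep only the lines showing the job state and the time limit
login@maestro-submit ~ $ scontrol show job 16876320 | grep -E 'JobState|TimeLimit'
JobState=PENDING Reason=Priority Dependency=(null)
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A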
Update jobs#
Update a job#
If you made a mistake when you submitted a job and that job is still pending, you can correct your error with the scontrol update command. Indeed,
Code Block (bash)
login@maestro-submit ~ $ scontrol update job <jobid>
updates the information of pending job(s) to change how Slurm will schedule them. You can update:
- the qos,
- the partition,
- the gres,
- the licenses,
- the timelimit,
- the priority, using the nice field.
For example, if you submitted your job in the default partition common with the default QoS normal (24 hours) whereas you know that your job won't take more than 30 minutes, then you can update it and change:
- the partition, to allow the job to run either in the common or in the dedicated partition,
- the QoS, to run the job in the higher-priority QoS fast,
- and even the timelimit, so that the scheduler will try to launch the job in compatible time windows when cores are available.
The command then looks like:
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobid=<jobid> partition=common,dedicated qos=fast timelimit=00:30:00
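The other fields listed above can be updated in the same way. Here is a minimal sketch for the gres and the licenses, assuming a hypothetical GPU gres and a hypothetical license named matlab (both names are for illustration only, and updating these fields depends on what was requested at submission and on the Slurm configuration):
Code Block (bash)
# hypothetical gres and license names, for illustration only
login@maestro-submit ~ $ scontrol update jobid=<jobid> gres=gpu:2
login@maestro-submit ~ $ scontrol update jobid=<jobid> licenses=matlab:1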
Unfortunately, job characteristics such as the memory, the CPUs per task or the number of tasks can't be updated.
If the job is a job array, then all pending array tasks are modified but the running ones remain untouched.
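If you only want to modify one pending task of the array, you can target it with the <jobid>_<taskid> notation. A quick sketch, using a hypothetical task index 12:
Code Block (bash)
# only array task 12 (a hypothetical index) is updated, the other tasks are left untouched
login@maestro-submit ~ $ scontrol update jobid=<job array jobid>_12 qos=fast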
Change the priority order of your jobs#
You can't increase the priority of one of your jobs, but you can lower it using nice:
Code Block (bash)
$ scontrol update jobid=<jobid> nice
The scheduling priority of the job is then decreased by 100 (the default). But you can order your jobs exactly the way you want by specifying a value:
Code Block (bash)
$ scontrol update jobid=<jobid> nice=<positive integer value>
Let's say that you have 2 jobs with exactly the same priority:
Code Block (bash)
login@maestro-submit ~ $ squeue -u <your login> -O jobid,name,partition,qos,state,reason,prioritylong,nice
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191049 first common normal PENDING Resources 5511 0
57191050 second common normal PENDING Priority 5511 0
You can make the job called second (57191050) the first to start by lowering the priority of the job called first (57191049):
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobid=57191049 nice
login@maestro-submit ~ $ squeue -u <login> -O jobid,name,partition,qos,state,reason,prioritylong,nice
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191050 second common normal PENDING Resources 5511 0
57191049 first common normal PENDING Priority 5412 100
Now the job called second has a higher priority than the job called first. As a consequence, the PENDING REASON of the second job is Resources, meaning that it will start as soon as the resources are available, while the first job has a lower priority and so has Priority as PENDING REASON.
But you can change that starting order by lowering the priority of the second job (57191050)
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobid=57191050 nice=200
login@maestro-submit ~ $ squeue -u <your login> -O jobid,name,partition,qos,state,reason,prioritylong,nice
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191049 first common normal PENDING Resources 5412 100
57191050 second common normal PENDING Priority 5312 200
Note that, if you make a mistake, you can always correct it afterwards
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobid=57191050 nice=555
login@maestro-submit ~ $ squeue -u <login> -O jobid,name,partition,qos,state,reason,prioritylong,nice
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191049 first common normal PENDING Resources 5413 100
57191050 second common normal PENDING None 4958 555
If you have several jobs with the same name (given with -J/--job-name), you can change the priority of all of them at once by replacing jobid= with jobname= in the scontrol command. Example:
Code Block (bash)
login@maestro-submit ~ $ squeue -u <your login> -O jobid,name,partition,qos,state,reason,prioritylong,nice --name=third
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191391 third common normal PENDING Resources 5512 0
57191392 third common normal PENDING Priority 5512 0
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobname=third nice=333
login@maestro-submit ~ $ squeue -u <your login> -O jobid,name,partition,qos,state,reason,prioritylong,nice --name=third
JOBID NAME PARTITION QOS STATE REASON PRIORITY NICE
57191391 third common normal PENDING None 5179 333
57191392 third common normal PENDING None 5179 333
Update a list of jobs#
Let's imagine that you have submitted a list of jobs in the common partition with the fast QoS. The common partition is very busy and, given your resource consumption over the last 7 days, your jobs have a low priority compared to the jobs of other users. You wish you had submitted them in the dedicated partition as well to maximize their chances of starting, even if they could then be killed and resubmitted automatically.
You can use the squeue command to retrieve the job ids of your (-u <your login>) pending jobs (-t PD) submitted in the common partition (-p common) with the fast QoS (-q fast) this way:
Code Block (bash)
login@maestro-submit ~ $ squeue -u <your login> -t PD -p common -q fast --Format=jobid --noheader
Note the use of:
- --Format=jobid to only output 1 column containing the job ids,
- --noheader to suppress the header of the column, which is only useful to humans.
That command returns 1 job id per line:
Code Block (bash)
1108497
1108499
1108493
1108495
1108496
For each of them, you want to perform the following scontrol update command:
Code Block (bash)
login@maestro-submit ~ $ scontrol update job <jobid> partition=common,dedicated
to tell Slurm that, from now on, it can launch the job with the provided job id on any node of one of these partitions (as long as the required resources specified in the original submission command are available, of course).
That command must be applied to every single job id returned by the previous squeue command. To do so, use a "for" loop (for/do/done). One by one, each job id from the list returned by squeue is assigned to a variable (which we choose to name jobid), and the content of this variable (accessible with ${jobid}) is then used to build the scontrol update command. To have that command executed, put it in the "do ... done" block. In a script, you would write it this way:
Code Block (bash)
for jobid in $(squeue -u <your login> -t PD -p common -q fast --Format=jobid --noheader)
do
scontrol update job ${jobid} partition=common,dedicated
done
But if you want to write it on a single line for an easy copy/paste in a terminal, you would rather write it this way:
Code Block (bash)
login@maestro-submit ~ $ for jobid in $(squeue -u <yourlogin> -t PD -p common -q fast --Format=jobid --noheader); do scontrol update job $jobid partition=common,dedicated; done
It's exactly the same, but you use ";" instead of newlines to separate the instructions. If you want to display the result of the scontrol update command, you can add another instruction such as:
Code Block (bash)
login@maestro-submit ~ $ squeue --Format="jobid,name:20,username,partition,qos,statecompact,reason,starttime" -j ${jobid} --noheader
The output looks like:
Code Block (bash)
<jobid> <job name> <your login> dedicated fast R None 2020-04-07T10:30:40
if the job can start immediately or
Code Block (bash)
<jobid> <job name> <your login> dedicated fast PD Priority 2020-04-08T00:37:00
if the job must wait (pending state PD) because of its low priority (Priority) compared to other jobs. In that case, the time displayed in the last column is the worst-case start time of the job.
Inserted in the previous do/done block, it looks like:
Code Block (bash)
for jobid in $(squeue -u <your login> -t PD -p common -q fast --Format=jobid --noheader)
do
scontrol update job ${jobid} partition=common,dedicated
squeue --Format="jobid,name:20,username,partition,qos,statecompact,reason,starttime" -j ${jobid} --noheader
done
The one-liner version is then:
Code Block (bash)
login@maestro-submit ~ $ for jobid in $(squeue -u <your login> --Format=jobid -t PD --noheader -p common -q fast); do scontrol update job ${jobid} partition=common,dedicated; squeue --Format="jobid,name:20,username,partition,qos,statecompact,reason,starttime" -j ${jobid} --noheader; done
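As a side note, if you only want the scheduler's current estimate of the start time of a pending job, squeue has a dedicated option for that. A quick sketch:
Code Block (bash)
# report the expected start time of the pending job
login@maestro-submit ~ $ squeue --start -j <jobid>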
Update one or more job(s) using a jobname instead of job id(s)#
You can replace the job id by the job name. With jobname, all jobs with the same name are processed, so indicate your login as well to avoid trying to update other people's jobs that have the same job name (typically wrap). Example:
Code Block (bash)
login@maestro-submit ~ $ scontrol update jobname=<job name> userid=<yourlogin> qos=fast
Hold and release jobs#
hold: puts a lock on specific pending jobs to prevent them from starting and let other jobs pass first. It can be used with a job name or a job id. With a name, all jobs with the same name are prevented from starting:
Code Block (bash)
login@maestro-submit ~ $ scontrol hold name=<name of your job>
or
Code Block (bash)
login@maestro-submit ~ $ scontrol hold <jobid>
If the job is a job array, then the hold only applies to pending job array tasks. The running ones are neither suspended, killed, nor requeued. This is what the "No error" message means in the output of the command for the corresponding tasks
Code Block (bash)
login@maestro-submit ~ $ scontrol hold <job array jobid>
<job array jobid>_2520,2536-2997,2999-4045: No error
<job array jobid>_2526,2528-2535: Job has already finished
even if the REASON "JobHeldUser" appears in squeue output for these running tasks. Of course completed tasks are not affected by the lock.
But note that if a job array task is requeued by Slurm (due to a node failure or because the task was running on the dedicated partition), then the job array task will remain pending with the REASON "JobHeldUser" in the squeue output, like the job array itself:
Code Block (bash)
login@maestro-submit ~ $ squeue -j <job array jobid> -t pd
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<job array jobid>_[3342-4045] common <job name> login PD 0:00 1 (JobHeldUser)
release: releases held jobs. It can be used with a job name or a job id. With a name, all jobs with the same name are released:
Code Block (bash)
login@maestro-submit ~ $ scontrol release name=<name of your job>
or
Code Block (bash)
login@maestro-submit ~ $ scontrol release <jobid>
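If you have held many jobs and want to release all of them at once, you can reuse the for-loop pattern from the previous section. This is only a sketch: it filters your pending jobs on the JobHeldUser reason using the two columns requested with -O:
Code Block (bash)
login@maestro-submit ~ $ for jobid in $(squeue -u <your login> -t PD -O jobid,reason --noheader | awk '$2 == "JobHeldUser" {print $1}'); do scontrol release ${jobid}; done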