Which Partition/QoS for Which Type of Computation
Partitions and QoS#
Partitions are sets of nodes. When you submit a job, you tell SLURM which set of nodes it must run on. If you don't specify one with the -p or --partition option, the job is sent to the default partition.
QoS stands for Quality of Service. It restricts what jobs are allowed to do on the chosen partition (maximum running time, maximum number of allocated cores per job or user, etc.). The same QoS can apply to different partitions.
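To see the partitions you can submit to (the default one is marked with a *), you can query SLURM directly; a minimal sketch:

```bash
# One summary line per partition: availability, time limit and node count
sinfo --summarize
```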
Available partitions#
- common: the default partition, giving access to the shared nodes offered by the DSI
- long: a small partition allowing jobs unlimited in time (thanks to the long QoS)
- gpu: the partition containing the shared GPU nodes offered by the DSI
- clcbio: nodes having licenses for CLC Assembly Cell tools from Qiagen
- clcgwb: nodes having licenses for CLC Genomics WorkBench from Qiagen
- dedicated: the partition containing all the special nodes, that is to say:
    - nodes belonging to research units,
    - nodes with licenses attached to them (nodes belonging to the clcbio and clcgwb partitions).
The other existing partitions belong to research units or projects. Their names are the same as the SLURM account or unix group of the research unit/project. Only people belonging to the account can access the corresponding partition.
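To check which accounts you belong to, and therefore which of these partitions you can access, you can list your SLURM associations; a minimal sketch:

```bash
# One line per account/partition association, with the QoS you may use there
sacctmgr show associations where user=$USER format=account,user,partition,qos
```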
Available QoS#
- ultrafast: limited to 5 minutes, allowed on the common, dedicated and gpu partitions
- fast: limited to 2 hours, allowed on the common, dedicated and gpu partitions
- normal: the default QoS, limited to 24 hours (1 day in the MaxWall column), allowed on the common and gpu partitions
- long: unlimited in time, limited to 5 cores per user, allowed on the long partition only
- gpu: limited to 3 days, allowed on the gpu partition only
The QoS that apply to research units' partitions are not limited in time or in number of cores: owners can use all their resources as they want.
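To see which QoS and limits are attached to a given partition, you can ask SLURM for its configuration; a minimal sketch, using a hypothetical unit partition name myunit:

```bash
# Print the partition's configuration: QoS, allowed accounts, time limit...
scontrol show partition myunit
```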
You can obtain this information through sacctmgr, as in:

```bash
login@maestro-submit ~ $ sacctmgr show qos where name=ultrafast,fast,normal,long,gpu,clcbio,clcgwb format=name,priority,maxwall,maxcpusperuser
      Name   Priority     MaxWall MaxCPUsPU
---------- ---------- ----------- ---------
    clcbio      10000  1-00:00:00
    clcgwb      10000  1-00:00:00
      fast       1000    02:00:00
       gpu      10000  3-00:00:00
      long          0 365-00:00:00        5
    normal        100  1-00:00:00
 ultrafast       5000    00:05:00
```
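Once a job is submitted, you can verify which partition and QoS it was actually assigned; a minimal sketch using squeue:

```bash
# Partition, QoS, remaining walltime and state of your jobs
squeue --user=$USER --Format=jobid,partition,qos,timeleft,state
```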
Submitting a job specifying QoS and partition#
Reminder:

- if you don't specify the partition, the job will run on nodes from common
- if you don't specify the QoS, the job will run with the normal QoS (limited to 24 hours)
To specify the partition and the QoS:

- explicitly indicate the partition on the command line using option -p or --partition=
- explicitly request a QoS allowed on this partition using option -q or --qos=
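The same options can be set inside a batch script. A minimal sketch (the resource values are illustrative, not Maestro-specific, and the --gres line uses the generic SLURM syntax, an assumption here):

```bash
#!/bin/bash
#SBATCH --partition=gpu      # run on the shared GPU nodes
#SBATCH --qos=gpu            # gpu QoS: up to 3 days of walltime
#SBATCH --time=2-00:00:00    # request 2 days, within the QoS limit
#SBATCH --gres=gpu:1         # request one GPU (generic syntax, assumption)

<your command with its options and arguments>
```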
If you want to maximize the chance of your job starting quickly, you can specify more than one partition, as long as the QoS is allowed on all of them. For example, you can run jobs with the fast QoS either on the common or on the dedicated partition. SLURM will then look for nodes matching the requested resources in both sets of nodes and will launch the job as soon as it has found them. Such a job is submitted with a command line like:
```bash
login@maestro-submit ~ $ srun -p common,dedicated --qos=fast <your command with its options and arguments>
```
Note that job arrays can run on more than one partition because each array task is considered a separate job. As a consequence, each array task can run either on the common or on the dedicated partition depending on resource availability.
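For instance, a job array spread over both partitions could be submitted like this (the script name and array range are illustrative):

```bash
# Each of the 100 array tasks is scheduled independently on common or dedicated
login@maestro-submit ~ $ sbatch -p common,dedicated --qos=fast --array=1-100 my_array_job.sh
```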
Priorities#
Short jobs that release resources quickly for other people are encouraged. As a consequence, as you can see in the sacctmgr output above, QoS priorities have been set this way: ultrafast > fast > normal > long.
Be opportunistic: run fast jobs on idle nodes from the dedicated partition#
To avoid wasting computing power, when the special nodes are not used by the jobs they were designed for, they can run short regular jobs submitted:

- in the dedicated partition
- with the fast QoS
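For example, such an opportunistic job can be submitted with:

```bash
login@maestro-submit ~ $ srun -p dedicated --qos=fast <your command with its options and arguments>
```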
This is possible because the dedicated partition overlaps with all the project/unit partitions and the special-node partitions (clcgwb, clcbio).
Jobs sent to the dedicated partition with the fast QoS are said to be "opportunistic". They can be killed and requeued by SLURM at any moment if a more legitimate job (one requiring a license, or belonging to a member of the owner unit) is launched and needs the requested resources to be freed in order to start. This mechanism is called preemption. Jobs running in a unit's/project's partition can preempt opportunistic jobs on the nodes belonging to that partition.
The requeued jobs will start again on the same partition (dedicated) with the same QoS (fast) as soon as resources are available again. Thus, if nodes belonging to another unit/project are still available and match the requested resources, the requeued jobs can start again right away.
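Since an opportunistic job may be killed and restarted at any time, it helps to make it requeue-friendly. A minimal sketch of batch options that ease restarts (whether they are strictly needed depends on the cluster's preemption configuration, which is an assumption here):

```bash
#SBATCH --requeue            # mark the job as eligible for requeueing
#SBATCH --open-mode=append   # append to the output file instead of truncating it on restart
```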