Which Partition/QoS for Which Type of Computation
Partitions and QoS#
Partitions are sets of nodes. When you submit a job, you tell SLURM which set of nodes it must run on. If you don't specify one with the -p or --partition option, the job is sent to the default partition.
QoS stands for Quality of Service. It restricts what jobs are allowed to do on the chosen partition (maximum running time, maximum number of allocated cores per job or user, etc.). The same QoS can apply to different partitions.
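To see the partitions you can submit to (the default one is marked with a *), you can query SLURM directly; a minimal sketch:

```bash
# One summary line per partition: availability, time limit and node count
sinfo --summarize
```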
Available partitions#
- common: the default partition, giving access to the shared nodes offered by the DSI
- long: a small partition allowing jobs unlimited in time (thanks to the long QoS)
- gpu: the partition containing the shared GPU nodes offered by the DSI
- clcbio: nodes having licenses for CLC Assembly Cell tools from Qiagen
- clcgwb: nodes having licenses for CLC Genomics WorkBench from Qiagen
- dedicated: the partition containing all the special nodes, that is to say:
    - nodes belonging to research units,
    - nodes with licenses attached to them (nodes belonging to the clcbio and clcgwb partitions).
The other existing partitions belong to research units or projects. Their names are the same as the SLURM account or unix group of the research unit/project. Only people belonging to the account can access the corresponding partition.
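To check which accounts you belong to, and therefore which of these partitions you can access, you can list your SLURM associations; a minimal sketch:

```bash
# One line per account/partition association, with the QoS you may use there
sacctmgr show associations where user=$USER format=account,user,partition,qos
```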
Available QoS#
- ultrafast: limited to 5 minutes, allowed on the common, dedicated and gpu partitions
- fast: limited to 2 hours, allowed on the common, dedicated and gpu partitions
- normal: the default QoS, limited to 24 hours (1 day in the MaxWall column), allowed on the common and gpu partitions
- long: unlimited in time, limited to 5 cores per user, allowed on the long partition only
- gpu: limited to 3 days, allowed on the gpu partition only
The QoS that apply to research units' partitions are not limited in time or in number of cores: owners can use all their resources as they want.
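To see which QoS and limits are attached to a given partition, you can ask SLURM for its configuration; a minimal sketch, using a hypothetical unit partition name myunit:

```bash
# Print the partition's configuration: QoS, allowed accounts, time limit...
scontrol show partition myunit
```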
You can obtain this information through sacctmgr, as in:

```bash
login@maestro-submit ~ $ sacctmgr show qos where name=ultrafast,fast,normal,long,gpu,clcbio,clcgwb format=name,priority,maxwall,maxcpusperuser
      Name   Priority     MaxWall MaxCPUsPU
---------- ---------- ----------- ---------
    clcbio      10000  1-00:00:00
    clcgwb      10000  1-00:00:00
      fast       1000    02:00:00
       gpu      10000  3-00:00:00
      long          0 365-00:00:00        5
    normal        100  1-00:00:00
 ultrafast       5000    00:05:00
```
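Once a job is submitted, you can verify which partition and QoS it was actually assigned; a minimal sketch using squeue:

```bash
# Partition, QoS, remaining walltime and state of your jobs
squeue --user=$USER --Format=jobid,partition,qos,timeleft,state
```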
Submitting a job specifying QoS and partition#
Reminder:

- if you don't specify the partition, the job will run on nodes from common
- if you don't specify the QoS, the job will run with the normal QoS (limited to 24 hours)
To specify the partition and the QoS:

- explicitly indicate the partition on the command line using option -p or --partition=
- explicitly request a QoS allowed on this partition using option -q or --qos=
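The same options can be set inside a batch script. A minimal sketch (the resource values are illustrative, not Maestro-specific, and the --gres line uses the generic SLURM syntax, an assumption here):

```bash
#!/bin/bash
#SBATCH --partition=gpu      # run on the shared GPU nodes
#SBATCH --qos=gpu            # gpu QoS: up to 3 days of walltime
#SBATCH --time=2-00:00:00    # request 2 days, within the QoS limit
#SBATCH --gres=gpu:1         # request one GPU (generic syntax, assumption)

<your command with its options and arguments>
```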
If you want to maximize the chance of your job starting quickly, you can specify more than one partition, as long as the QoS is allowed on all of them. For example, you can run jobs with the fast QoS either on the common or on the dedicated partition. SLURM will then look for nodes matching the requested resources in both sets of nodes and will launch the job as soon as it has found them. Such a job is submitted with a command line like:
```bash
login@maestro-submit ~ $ srun -p common,dedicated --qos=fast <your command with its options and arguments>
```
Note that job arrays can run on more than one partition because each array task is considered a separate job. As a consequence, each array task can run either on the common or on the dedicated partition depending on resource availability.
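For instance, a job array spread over both partitions could be submitted like this (the script name and array range are illustrative):

```bash
# Each of the 100 array tasks is scheduled independently on common or dedicated
login@maestro-submit ~ $ sbatch -p common,dedicated --qos=fast --array=1-100 my_array_job.sh
```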
Priorities#
Short jobs that release resources quickly for other people are encouraged. As a consequence, as you can see in the sacctmgr output above, QoS priorities have been set this way: ultrafast > fast > normal > long.
Be opportunistic: run fast jobs on idle nodes from the dedicated partition#
To avoid wasting computing power, when the special nodes are not used by the jobs they were designed for, they can run short regular jobs submitted:

- in the dedicated partition
- with the fast QoS
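For example, such an opportunistic job can be submitted with:

```bash
login@maestro-submit ~ $ srun -p dedicated --qos=fast <your command with its options and arguments>
```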
This is possible because the dedicated partition overlaps with all the project/unit partitions and the special-node partitions (clcgwb, clcbio).
Jobs sent to the dedicated partition with the fast QoS are said to be "opportunistic". They can be killed and requeued by SLURM at any moment if a more legitimate job (one requiring a license, or belonging to a member of the owner unit) is launched and needs the requested resources to be freed in order to start. This mechanism is called preemption. Jobs running in a unit's/project's partition can preempt opportunistic jobs on the nodes belonging to that partition.
The requeued jobs will start again on the same partition (dedicated) with the same QoS (fast) as soon as resources are available again. Thus, if nodes belonging to another unit/project are still available and match the requested resources, the requeued jobs can start again right away.
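Since an opportunistic job may be killed and restarted at any time, it helps to make it requeue-friendly. A minimal sketch of batch options that ease restarts (whether they are strictly needed depends on the cluster's preemption configuration, which is an assumption here):

```bash
#SBATCH --requeue            # mark the job as eligible for requeueing
#SBATCH --open-mode=append   # append to the output file instead of truncating it on restart
```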