How to Create a Disconnection-Proof Interactive Session

Unlike sbatch, salloc gives you an interactive shell. The problem is that if you get disconnected, because of a network glitch for example, a regular salloc releases the allocation and the processes running in it are killed. Here is a work-around.
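
For contrast, a plain interactive allocation typically looks like the sketch below (the resource request shown is purely illustrative): the shell opened by salloc lives only as long as your connection to the submit host.

Code Block (bash)

login@maestro-submit ~ $ salloc -n 4 -J interactive
# salloc typically opens a shell inside the allocation on the submit host.
# If the ssh connection to maestro-submit drops, that shell dies, the
# allocation is released and any step running in it is killed.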

Prerequisite

This work-around requires connecting to a node through ssh. To do that, you need a pair of ssh keys. If you don't have one already, follow the procedure described on this page.
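
If you need to create a key pair, the commands below are only a generic sketch (the ed25519 key type and the shared-home authorized_keys step are assumptions, not the official procedure); the page linked above remains the reference.

Code Block (bash)

# Generic sketch only; follow the site procedure if it differs.
login@maestro-submit ~ $ ssh-keygen -t ed25519
login@maestro-submit ~ $ cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
login@maestro-submit ~ $ chmod 600 ~/.ssh/authorized_keys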

Creation and use of the allocation

  1. Create an allocation with salloc with the required resources, but add the --no-shell option. As usual, the QoS determines the duration of the allocation.
  2. Once your job is submitted and running, choose a node belonging to your allocation. If your allocation spans several nodes, you can choose the first one (the BatchHost, from which the srun commands are generally launched); the sketch after the code block below shows one way to list the individual node names.

Code Block (text)

login@maestro-submit ~ $ salloc -n 32 -J interactivealloc  --no-shell
salloc: Pending job allocation 42995555
salloc: job 42995555 queued and waiting for resources
salloc: job 42995555 has been allocated resources
salloc: Granted job allocation 42995555
salloc: Waiting for resource configuration
salloc: Nodes maestro-[1013-1017] are ready for job
login@maestro-submit ~ $ squeue -j 42995555
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          42995555    common interact    login  R       0:19      5 maestro-[1013-1017] 
login@maestro-submit ~ $ scontrol show job 42995555 | grep -i batchhost
   BatchHost=maestro-1013
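
If you prefer an explicit list of the allocated nodes rather than the compressed maestro-[1013-1017] notation, scontrol can expand it; a small sketch reusing the job id from the example above:

Code Block (bash)

login@maestro-submit ~ $ scontrol show hostnames "$(squeue -j 42995555 -h -o '%N')"
# Prints one hostname per line (maestro-1013 ... maestro-1017 here); any of
# these nodes can be used for the ssh connection in the next step.
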
  3. Connect to the chosen node through ssh. Since you have a job running on it, you are allowed to log in to that node.

Code Block (text)

login@maestro-submit ~ $ ssh maestro-1013
login@maestro-1013's password: 
                           _             
 _ __ ___   __ _  ___  ___| |_ _ __ ___  
| '_ ` _ \ / _` |/ _ \/ __| __| '__/ _ \ 
| | | | | | (_| |  __/\__ \ |_| | | (_) |
|_| |_| |_|\__,_|\___||___/\__|_|  \___/ 

login@maestro-1013 ~ $
  4. Beware: you are on the same node as your allocation, but you are not inside the allocation. To use the allocated resources, you have to tell srun that you want to use the allocation of the job running on that node. To do that, just add the --jobid=<job id> option to your srun commands. Compare the behaviors with and without --jobid=<job id> in this example:

Code Block (bash)

login@maestro-1013 ~ $ hostname
maestro-1013
login@maestro-1013 ~ $ srun --jobid=42995555 hostname
maestro-1013
maestro-1016
maestro-1014
maestro-1015
maestro-1017

In the first case, the hostname command runs only on the host you are logged on to. In the second case, the command is executed on every node of the allocation, here 5 nodes, as shown in the squeue output above.

  5. Now, do as you would do on maestro.pasteur.fr: create a tmux or screen session, so that if a network glitch disconnects you from the node, you can reattach the tmux session once you log in to the node again (a detach/reattach sketch follows the code block below):

Code Block (bash)

login@maestro-1013 ~ $ tmux new -s my_tmux_session
login@maestro-1013 ~ $ tmux ls
my_tmux_session: 1 windows (created Mon Sep 18 17:09:31 2023) [193x28] (attached)
login@maestro-1013 ~ $ srun --jobid=42995555 /path/to/my/program arg1 arg2


[my_tmux_session 0:bash*
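
For reference, detaching and reattaching work as with any tmux session; a minimal sketch, assuming the session name used above and that the allocation is still running:

Code Block (bash)

# Detach from the session (the program keeps running): press Ctrl-b, then d.
# After a disconnection, log back on to the node and reattach:
login@maestro-submit ~ $ ssh maestro-1013
login@maestro-1013 ~ $ tmux attach -t my_tmux_session
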
  6. You can of course detach from your tmux or screen session, disconnect from the node explicitly, and come back later, as you would do on maestro.pasteur.fr.

Code Block (bash)

login@maestro-1013 ~ $ exit 

[my_tmux_session 0:bash*                              
[exited] 
login@maestro-1013 ~ $ tmux ls 
no server running on /tmp/tmux-XXXXX/default
login@maestro-1013 ~ $ exit
logout
login@maestro-submit ~ $
  7. If you forget about your job and your tmux or screen session, they will be wiped out when the allocation reaches its time limit, and you will no longer be able to log in to any of the nodes that belonged to the deleted allocation. If the nodes belong to your unit partition, do not forget to close your tmux session and terminate your job explicitly with scancel <job id>:

Code Block (bash)

login@maestro-submit ~ $ scancel 42995555
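
If you are not sure how much time an allocation has left before it expires, squeue can report the remaining time (the %L format field); a quick check, reusing the job id from the example above:

Code Block (bash)

login@maestro-submit ~ $ squeue -j 42995555 -o "%.12i %.10P %.12L"
# Shows the job id, partition and remaining time before the time limit
# wipes out the allocation, together with the tmux session running in it.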