How to Create a Disconnection-Proof Interactive Session
Unlike sbatch, salloc lets you open an interactive shell. The problem is that in case of disconnection, for example due to a network glitch, a plain salloc releases the allocation and the processes running in it are killed. Here is a work-around.
Prerequisite
This work-around involves connecting to a node through ssh. To be able to do that, you need a pair of ssh keys. If you don't have such a pair already, follow the procedure described on this page.
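If you need to create a key pair, the following is a minimal sketch with standard OpenSSH tooling; the key type and the shared home directory are assumptions on our part, so follow the procedure linked above for the cluster's actual requirements:
Code Block (bash)
# Generate a key pair (adjust the key type and options to your site's policy)
ssh-keygen -t ed25519
# Authorize the new public key for intra-cluster ssh; this assumes the home
# directory is shared between the submit host and the compute nodes
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys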
Creation and use of the allocation
1. Create an allocation with salloc and the required resources, but add the --no-shell option. As usual, the QoS will determine the duration of the allocation.
2. Once your job is submitted and running, choose a node belonging to your allocation. If your allocation spans several nodes, you can choose the first one (the batchHost, from which the srun commands are generally launched).
Code Block (text)
login@maestro-submit ~ $ salloc -n 32 -J interactivealloc --no-shell
salloc: Pending job allocation 42995555
salloc: job 42995555 queued and waiting for resources
salloc: job 42995555 has been allocated resources
salloc: Granted job allocation 42995555
salloc: Waiting for resource configuration
salloc: Nodes maestro-[1013-1017] are ready for job
login@maestro-submit ~ $ squeue -j 42995555
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
42995555 common interact login R 0:19 5 maestro-[1013-1017]
login@maestro-submit ~ $ scontrol show job 42995555 | grep -i batchhost
BatchHost=maestro-1013
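As a convenience, you can also print the batch host directly with squeue's %B format field (a sketch; check man squeue on your cluster to confirm the field is available there):
Code Block (bash)
# Print only the executing/batch host of the job, without the header line
squeue -h -j 42995555 -o "%B"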
3. Connect to the chosen node through ssh. Since you have a job running on it, you will be allowed to log in to that node.
Code Block (text)
login@maestro-submit ~ $ ssh maestro-1013
login@maestro-1013's password:
_
_ __ ___ __ _ ___ ___| |_ _ __ ___
| '_ ` _ \ / _` |/ _ \/ __| __| '__/ _ \
| | | | | | (_| | __/\__ \ |_| | | (_) |
|_| |_| |_|\__,_|\___||___/\__|_| \___/
login@maestro-1013 ~ $
4. Beware: you are on the same node as your allocation, but you are not inside the allocation. To use the allocated resources, you have to tell srun that you want to use the allocation of the job running on that node. To do that, just add the option --jobid=<job id> to your srun commands. Look at the different behaviors with and without srun --jobid=<job id> in this example:
Code Block (bash)
login@maestro-1013 ~ $ hostname
maestro-1013
login@maestro-1013 ~ $ srun --jobid=42995555 hostname
maestro-1013
maestro-1016
maestro-1014
maestro-1015
maestro-1017
In the first case, the hostname command runs only on the host you are logged on to. In the second case, the command is executed on every node of the allocation, 5 in this case, as shown in the output of squeue above. If you would rather have a full interactive shell running inside the allocation, see the sketch below.
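Depending on your site's configuration, you can also start an interactive shell that runs as a job step inside the allocation itself; this is a sketch rather than part of the procedure above, and the shell will occupy one task of the allocation:
Code Block (bash)
# Start an interactive shell as a job step inside the existing allocation:
# --pty attaches a pseudo-terminal, -N1 -n1 restricts the step to a single task
srun --jobid=42995555 -N1 -n1 --pty bash -i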
5. Now, do as you would do on maestro.pasteur.fr: create a tmux or screen session so that, if you experience a network glitch and are disconnected from the node, you will be able to reattach the tmux session once you have logged back in to the node.
Code Block (bash)
login@maestro-1013 ~ $ tmux new -s my_tmux_session
login@maestro-1013 ~ $ tmux ls
my_tmux_session: 1 windows (created Mon Sep 18 17:09:31 2023) [193x28] (attached)
login@maestro-1013 ~ $ srun --jobid=42995555 /path/to/my/program arg1 arg2
[my_tmux_session 0:bash*
6. You can of course detach from your tmux or screen session, disconnect from the node explicitly, and come back later, as you would do on maestro.pasteur.fr (see the detach/reattach sketch after the example below).
Code Block (bash)
login@maestro-1013 ~ $ exit
[my_tmux_session 0:bash*
[exited]
login@maestro-1013 ~ $ tmux ls
no server running on /tmp/tmux-XXXXX/default
login@maestro-1013 ~ $ exit
logout
login@maestro-submit ~ $
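Note that the example above exits the shell running inside tmux, which terminates the session (hence "no server running"). If you want to come back to the session later, detach instead of exiting; a minimal sketch with standard tmux commands, reusing the session name from the example above:
Code Block (bash)
# Inside the session, detach with the prefix key (Ctrl-b) followed by d,
# or explicitly with:
tmux detach
# Later, once logged back in on maestro-1013, reattach the session:
tmux attach -t my_tmux_session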
7. If you forget about your job and your tmux or screen session, they will be wiped out when the allocation reaches its time limit, and you won't be able to log in to any of the nodes belonging to the deleted allocation anymore. If the nodes belong to your unit partition, do not forget to close your tmux session and terminate your job explicitly with scancel <job id>:
Code Block (bash)
login@maestro-submit ~ $ scancel 42995555
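To avoid being caught out by the time limit, you can check how much walltime your allocation has left (a sketch; %L is squeue's remaining-time format field, confirm it with man squeue on your cluster):
Code Block (bash)
# Show the job's remaining walltime (TIME_LEFT), without the header line
squeue -h -j 42995555 -o "%L"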