# Cluster Daily Use Overview
Welcome to the Daily Use Guide for the cluster. This page provides a concise overview of essential workflows and best practices for efficient cluster usage. Whether you're submitting jobs, managing resources, or troubleshooting issues, this guide covers the key topics you need every day.
## Core Job Management Commands

Master the main commands to submit, monitor, update, and cancel jobs (a quick session sketch follows the list):

- `sbatch`: Submit a job script.
- `squeue`: View the status of all jobs.
- `sinfo`: Check node and partition availability.
- `scancel`: Cancel a running or pending job.
- `scontrol update`: Modify job parameters (e.g., time, memory) before the job starts.
- `sacct`: Retrieve job accounting and performance data.
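A minimal sketch of a typical session; the job ID `12345` and the script name `job.sh` are placeholders:

```bash
sbatch job.sh                     # submit a job script; prints the job ID

squeue -u $USER                   # show your jobs (state, partition, nodes)
sinfo                             # list partitions and node availability

# Adjust the time limit of a pending job (site policy permitting)
scontrol update JobID=12345 TimeLimit=02:00:00

scancel 12345                     # cancel a job you no longer need

# Review what a finished job actually used
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,State
```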
For detailed instructions, see: Main Commands: Submit, Monitor, Update, and Cancel Jobs
## Requesting and Adjusting Resources

Efficiently request and modify compute resources (CPU, memory, time, etc.) based on your job's needs; a sample script follows the list:

- Use `#SBATCH` directives in your job script to specify:
  - `--time`: Wall clock time limit.
  - `--cpus-per-task`, `--mem`: CPU and memory requirements.
  - `--partition`: Target compute partition.
- Update running jobs with `scontrol update` (if allowed).
- Avoid over-requesting; use `sacct` to analyze past usage.
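A minimal job-script sketch putting these directives together; the partition name `normal` and the executable `./my_program` are placeholders for your site's values:

```bash
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --time=01:00:00           # wall clock limit: 1 hour
#SBATCH --cpus-per-task=4         # 4 CPU cores for the single task
#SBATCH --mem=8G                  # 8 GiB of memory for the job
#SBATCH --partition=normal        # site-specific; list partitions with sinfo

srun ./my_program                 # placeholder executable
```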
For detailed instructions, see: How to Request and Adjust Resources
## Using Graphical Tools

Access cluster tools via graphical interfaces (e.g., web portals, GUI clients); a short example follows the list:

- Prerequisites:
  - Enable SSH X11 forwarding (`ssh -X user@cluster`).
  - Install required GUI applications (e.g., `xterm`, `gvim`, `jupyter`).
- Use `srun --x11` to launch GUI apps on compute nodes.
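A quick sketch, assuming your Slurm installation was built with X11 support (`user@cluster` is a placeholder):

```bash
# Log in with X11 forwarding enabled (try -Y if -X is too restrictive)
ssh -X user@cluster

# Launch a GUI application on a compute node in an interactive step
srun --x11 --time=00:30:00 --cpus-per-task=1 xterm
```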
For detailed instructions, see: Prerequisites for Using Tools with a Graphical Interface
## Running MPI Programs

Submit and run parallel programs using the Message Passing Interface (MPI); a batch-script sketch follows the list:

- Use `mpirun` or `srun` with `--mpi=pmi2` or `--mpi=pmix`.
- Example: `srun --mpi=pmix --ntasks=4 --cpus-per-task=2 ./my_mpi_program`. Launch with either `srun` or `mpirun`, not both: nesting `mpirun` under `srun` starts one full MPI run per task instead of one run with four ranks.
- Ensure your job script requests sufficient cores and nodes.
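A minimal batch-script sketch; the `openmpi` module name and the `pmix` plugin are assumptions that must match your site's MPI stack:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --nodes=2                 # spread the ranks over 2 nodes
#SBATCH --ntasks=4                # 4 MPI ranks in total
#SBATCH --cpus-per-task=2         # 2 cores per rank (e.g., hybrid MPI+OpenMP)
#SBATCH --time=00:30:00

module load openmpi               # site-specific; adjust to your MPI stack

# srun launches the ranks itself; --mpi must match how your MPI was built
srun --mpi=pmix ./my_mpi_program
```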
For detailed instructions, see: How to Run MPI Programs
## Submitting Jobs on GPU Nodes

Run GPU-accelerated workloads using dedicated GPU partitions; an example script follows the list:

- Request GPU resources with:
  - `--gres=gpu:1` (1 GPU)
  - `--partition=gpu` or `--partition=gpu-a100`
- Use `nvidia-smi` to check GPU availability.
- Ensure your application is compiled with CUDA support.
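A minimal GPU job sketch; the partition and module names mirror the examples above and are site-specific, and `./my_cuda_program` is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu           # or gpu-a100, depending on the hardware you need
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

module load cuda                  # site-specific; load your CUDA toolchain

nvidia-smi                        # record which GPU was allocated
srun ./my_cuda_program            # placeholder CUDA-enabled executable
```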
For detailed instructions, see: How to Submit on GPU Nodes
## Troubleshooting and Monitoring

Common issues and solutions (diagnostic examples follow the list):

- Job fails to start? Check `sacct -j <jobid>` for error details.
- Job stuck in pending? Use `squeue -u $USER` and verify resource availability.
- Out of memory? Increase your memory request (`--mem`) or reduce your program's memory footprint.
- GPU not detected? Confirm the correct partition is used and the GPU module is loaded.
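A few diagnostic one-liners; the job ID `12345` is a placeholder:

```bash
# Why did the job fail? Check its state and exit code
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS,Elapsed

# Why is the job still pending? The reason appears in the last column
squeue -u $USER --format="%.10i %.20j %.8T %.20R"

# When does Slurm estimate the job will start?
squeue -j 12345 --start
```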
For detailed guidance, see: Troubleshooting and Monitoring
## Temporary File Locations

Avoid writing temporary files to `/tmp` or your home directory. Instead (see the sketch after this list):

- Use the local scratch space on compute nodes:
  - `/scratch/$USER` (preferred)
  - `$TMPDIR` (automatically set by Slurm)
- Files in `/scratch` are ephemeral and deleted after job completion.
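A job-script sketch that works in scratch and copies results back before the space is cleaned; `./my_program` is a placeholder workload:

```bash
#!/bin/bash
#SBATCH --time=00:30:00

# Prefer $TMPDIR (set by Slurm); fall back to per-user scratch
WORKDIR=${TMPDIR:-/scratch/$USER}
mkdir -p "$WORKDIR"
cd "$WORKDIR"

./my_program > results.out        # placeholder workload

# Scratch is wiped after the job: copy results to permanent storage
cp results.out "$SLURM_SUBMIT_DIR"/
```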
For detailed guidance, see: Where to Write Temporary Files
## Choosing the Right Partition and QoS
Match your workload to the appropriate compute resource:
| Use Case | Recommended Partition / QoS |
|---|---|
| Short, interactive jobs | `interactive`, `normal` |
| Long-running simulations | `long`, `batch` |
| High-memory jobs | `highmem` |
| GPU-intensive tasks | `gpu`, `gpu-a100` |
| High-priority urgent jobs | `priority`, `urgent` |
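For example, a long-running simulation might select its partition and QoS like this; the names mirror the (site-specific) table above, and you can list the real ones with `sinfo` and `sacctmgr show qos`:

```bash
#!/bin/bash
#SBATCH --partition=long          # long-running simulation partition
#SBATCH --qos=batch               # matching QoS from the table above
#SBATCH --time=72:00:00

srun ./my_simulation              # placeholder executable
```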
For detailed guidance, see: Which Partition or QoS for Which Type of Computation
## Quick Tips

- Always test small jobs before scaling up.
- Use `sbatch --test-only` to validate job scripts.
- Keep job scripts clean and well-commented.
- Monitor resource usage to avoid overloading the system.
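For instance, `--test-only` checks a script without queuing it (`job.sh` is a placeholder):

```bash
# Prints the estimated start time and allocation, or an error if the
# request is invalid; the job is NOT submitted
sbatch --test-only job.sh
```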
- Monitor resource usage to avoid overloading the system.
Need help? Contact your system administrator or check the cluster’s official documentation.
Happy computing!