# Cluster Daily Use Overview
Welcome to the Daily Use Guide for the cluster. This page provides a concise overview of essential workflows and best practices for efficient cluster usage. Whether you're submitting jobs, managing resources, or troubleshooting issues, this guide covers the key topics you need every day.
## Core Job Management Commands

Master the main commands to submit, monitor, update, and cancel jobs (a quick session sketch follows the list):

- `sbatch`: Submit a job script.
- `squeue`: View the status of all jobs.
- `sinfo`: Check node and partition availability.
- `scancel`: Cancel a running or pending job.
- `scontrol update`: Modify job parameters (e.g., time, memory) before the job starts.
- `sacct`: Retrieve job accounting and performance data.
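A minimal sketch of a typical session; the job ID `12345` and the script name `job.sh` are placeholders:

```bash
sbatch job.sh                     # submit a job script; prints the job ID

squeue -u $USER                   # show your jobs (state, partition, nodes)
sinfo                             # list partitions and node availability

# Adjust the time limit of a pending job (site policy permitting)
scontrol update JobID=12345 TimeLimit=02:00:00

scancel 12345                     # cancel a job you no longer need

# Review what a finished job actually used
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,State
```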
For detailed instructions, see: Main Commands: Submit, Monitor, Update, and Cancel Jobs
## Requesting and Adjusting Resources

Efficiently request and modify compute resources (CPU, memory, time, etc.) based on your job's needs; a sample script follows the list:

- Use `#SBATCH` directives in your job script to specify:
  - `--time`: Wall clock time limit.
  - `--cpus-per-task`, `--mem`: CPU and memory requirements.
  - `--partition`: Target compute partition.
- Update running jobs with `scontrol update` (if allowed).
- Avoid over-requesting; use `sacct` to analyze past usage.
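A minimal job-script sketch putting these directives together; the partition name `normal` and the executable `./my_program` are placeholders for your site's values:

```bash
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --time=01:00:00           # wall clock limit: 1 hour
#SBATCH --cpus-per-task=4         # 4 CPU cores for the single task
#SBATCH --mem=8G                  # 8 GiB of memory for the job
#SBATCH --partition=normal        # site-specific; list partitions with sinfo

srun ./my_program                 # placeholder executable
```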
For detailed instructions, see: How to Request and Adjust Resources
## Using Graphical Tools

Access cluster tools via graphical interfaces (e.g., web portals, GUI clients); a short example follows the list:

- Prerequisites:
  - Enable SSH X11 forwarding (`ssh -X user@cluster`).
  - Install required GUI applications (e.g., `xterm`, `gvim`, `jupyter`).
- Use `srun --x11` to launch GUI apps on compute nodes.
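A quick sketch, assuming your Slurm installation was built with X11 support (`user@cluster` is a placeholder):

```bash
# Log in with X11 forwarding enabled (try -Y if -X is too restrictive)
ssh -X user@cluster

# Launch a GUI application on a compute node in an interactive step
srun --x11 --time=00:30:00 --cpus-per-task=1 xterm
```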
For detailed instructions, see: Prerequisites for Using Tools with a Graphical Interface
## Running MPI Programs

Submit and run parallel programs using the Message Passing Interface (MPI); a batch-script sketch follows the list:

- Use `mpirun` or `srun` with `--mpi=pmi2` or `--mpi=pmix`.
- Example: `srun --mpi=pmix --ntasks=4 --cpus-per-task=2 ./my_mpi_program`. Launch with either `srun` or `mpirun`, not both: nesting `mpirun` under `srun` starts one full MPI run per task instead of one run with four ranks.
- Ensure your job script requests sufficient cores and nodes.
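A minimal batch-script sketch; the `openmpi` module name and the `pmix` plugin are assumptions that must match your site's MPI stack:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --nodes=2                 # spread the ranks over 2 nodes
#SBATCH --ntasks=4                # 4 MPI ranks in total
#SBATCH --cpus-per-task=2         # 2 cores per rank (e.g., hybrid MPI+OpenMP)
#SBATCH --time=00:30:00

module load openmpi               # site-specific; adjust to your MPI stack

# srun launches the ranks itself; --mpi must match how your MPI was built
srun --mpi=pmix ./my_mpi_program
```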
For detailed instructions, see: How to Run MPI Programs
## Submitting Jobs on GPU Nodes

Run GPU-accelerated workloads using dedicated GPU partitions; an example script follows the list:

- Request GPU resources with:
  - `--gres=gpu:1` (1 GPU)
  - `--partition=gpu` or `--partition=gpu-a100`
- Use `nvidia-smi` to check GPU availability.
- Ensure your application is compiled with CUDA support.
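A minimal GPU job sketch; the partition and module names mirror the examples above and are site-specific, and `./my_cuda_program` is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu           # or gpu-a100, depending on the hardware you need
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

module load cuda                  # site-specific; load your CUDA toolchain

nvidia-smi                        # record which GPU was allocated
srun ./my_cuda_program            # placeholder CUDA-enabled executable
```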
For detailed instructions, see: How to Submit on GPU Nodes
## Troubleshooting and Monitoring

Common issues and solutions (diagnostic examples follow the list):

- Job fails to start? Check `sacct -j <jobid>` for error details.
- Job stuck in pending? Use `squeue -u $USER` and verify resource availability.
- Out of memory? Increase your memory request (`--mem`) or reduce your program's memory footprint.
- GPU not detected? Confirm the correct partition is used and the GPU module is loaded.
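A few diagnostic one-liners; the job ID `12345` is a placeholder:

```bash
# Why did the job fail? Check its state and exit code
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS,Elapsed

# Why is the job still pending? The reason appears in the last column
squeue -u $USER --format="%.10i %.20j %.8T %.20R"

# When does Slurm estimate the job will start?
squeue -j 12345 --start
```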
For detailed guidance, see: Troubleshooting and Monitoring
## Temporary File Locations

Avoid writing temporary files to `/tmp` or your home directory. Instead (see the sketch after this list):

- Use the local scratch space on compute nodes:
  - `/scratch/$USER` (preferred)
  - `$TMPDIR` (automatically set by Slurm)
- Files in `/scratch` are ephemeral and deleted after job completion.
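A job-script sketch that works in scratch and copies results back before the space is cleaned; `./my_program` is a placeholder workload:

```bash
#!/bin/bash
#SBATCH --time=00:30:00

# Prefer $TMPDIR (set by Slurm); fall back to per-user scratch
WORKDIR=${TMPDIR:-/scratch/$USER}
mkdir -p "$WORKDIR"
cd "$WORKDIR"

./my_program > results.out        # placeholder workload

# Scratch is wiped after the job: copy results to permanent storage
cp results.out "$SLURM_SUBMIT_DIR"/
```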
For detailed guidance, see: Where to Write Temporary Files
## Choosing the Right Partition and QoS
Match your workload to the appropriate compute resource:
| Use Case | Recommended Partition / QoS |
|---|---|
| Short, interactive jobs | `interactive`, `normal` |
| Long-running simulations | `long`, `batch` |
| High-memory jobs | `highmem` |
| GPU-intensive tasks | `gpu`, `gpu-a100` |
| High-priority urgent jobs | `priority`, `urgent` |
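For example, a long-running simulation might select its partition and QoS like this; the names mirror the (site-specific) table above, and you can list the real ones with `sinfo` and `sacctmgr show qos`:

```bash
#!/bin/bash
#SBATCH --partition=long          # long-running simulation partition
#SBATCH --qos=batch               # matching QoS from the table above
#SBATCH --time=72:00:00

srun ./my_simulation              # placeholder executable
```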
For detailed guidance, see: Which Partition or QoS for Which Type of Computation
## Quick Tips

- Always test small jobs before scaling up.
- Use `sbatch --test-only` to validate job scripts.
- Keep job scripts clean and well-commented.
- Monitor resource usage to avoid overloading the system.
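For instance, `--test-only` checks a script without queuing it (`job.sh` is a placeholder):

```bash
# Prints the estimated start time and allocation, or an error if the
# request is invalid; the job is NOT submitted
sbatch --test-only job.sh
```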
- Monitor resource usage to avoid overloading the system.
Need help? Contact your system administrator or check the cluster’s official documentation.
Happy computing!