Reference
Quick reference guides, cheat sheets, and troubleshooting for TAU HPC
Common Slurm Commands
Quick reference for the most common Slurm commands on the TAU HPC cluster.
Submitting Jobs
| Command | Description |
|---|---|
| sbatch job.sh | Submit a batch job script |
| srun --pty bash | Start an interactive session |
| sbatch --dependency=afterok:JOBID job.sh | Submit a job that starts only after job JOBID completes successfully |
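Chaining two jobs by hand means copying the first job ID into the second command. A minimal sketch of automating that follows; `submit_chain` is a hypothetical helper name, not a cluster tool, and it assumes both job scripts already carry their own account, partition, and QOS directives. It relies on `sbatch --parsable`, which prints only the job ID (optionally followed by `;cluster`).

```shell
#!/usr/bin/env bash
# submit_chain FIRST.sh SECOND.sh
# Hypothetical helper: submit FIRST, then submit SECOND so that it starts
# only if FIRST finishes successfully (afterok).
submit_chain() {
    local first=$1 second=$2 first_id
    # --parsable prints "JOBID" or "JOBID;CLUSTER" instead of a sentence
    first_id=$(sbatch --parsable "$first") || return 1
    first_id=${first_id%%;*}    # drop the optional ";CLUSTER" suffix
    sbatch --dependency=afterok:"$first_id" "$second"
}
```

Usage would look like `submit_chain preprocess.sh analyze.sh`, where both script names are placeholders.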
Monitoring Jobs
| Command | Description |
|---|---|
| squeue -u username | Your running and pending jobs |
| squeue | All jobs on the cluster |
| scontrol show job JOBID | Full details of a job |
| sacct -j JOBID --format=JobID,JobName,State,MaxRSS,Elapsed | Job accounting and memory usage |
| sattach JOBID.STEPID | Attach to the input/output of a running job step |
Managing Jobs
| Command | Description |
|---|---|
| scancel JOBID | Cancel a specific job |
| scancel -u username | Cancel all your jobs |
Cluster Information
| Command | Description |
|---|---|
| sinfo | Partition and node status |
| scontrol show partition PARTITION | Partition details and limits |
| check_my_partitions | Your available partitions and accounts |
| features | Available node constraints/features |
Environment Modules
| Command | Description |
|---|---|
| module avail | List all available modules |
| module avail NAME | Search for a specific module |
| module spider NAME | Detailed module info including dependencies |
| module load NAME | Load a module |
| module list | List loaded modules |
| module unload NAME | Unload a module |
| module purge | Unload all modules |
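Inside a job script it is common to start from a clean environment and then load exactly the modules the job needs. The sketch below combines `module purge`, `module load`, and `module list` from the table above; `load_modules` is a hypothetical helper name, and it assumes that `module load` returns a non-zero status when a module cannot be loaded, which is the usual behaviour but worth confirming on this cluster.

```shell
#!/usr/bin/env bash
# load_modules NAME [NAME...]
# Hypothetical helper: reset the environment, load the given modules in
# order, and stop at the first one that fails to load.
load_modules() {
    local m
    module purge                      # start from a clean module environment
    for m in "$@"; do
        if ! module load "$m"; then
            echo "could not load module: $m" >&2
            return 1
        fi
    done
    module list                       # record what was loaded in the job log
}
```

A job script might call `load_modules NAME1 NAME2` near the top, with the names taken from `module avail`.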
Common SBATCH Directives
| Directive | Description |
|---|---|
| #SBATCH --job-name=NAME | Job name |
| #SBATCH --account=ACCOUNT | Account name |
| #SBATCH --partition=PARTITION | Partition/queue |
| #SBATCH --qos=QOS | Quality of Service |
| #SBATCH --time=HH:MM:SS | Max run time |
| #SBATCH --ntasks=N | Number of tasks |
| #SBATCH --nodes=N | Number of nodes |
| #SBATCH --cpus-per-task=N | CPU cores per task |
| #SBATCH --mem-per-cpu=NG | Memory per CPU |
| #SBATCH --mem=NG | Total memory per node |
| #SBATCH --gres=gpu:N | Number of GPUs |
| #SBATCH --constraint=FEATURE | Node constraint/feature |
| #SBATCH --array=1-N | Job array |
| #SBATCH --output=FILE_%j.out | Output file (%j = job ID) |
| #SBATCH --error=FILE_%j.err | Error file |
| #SBATCH --mail-user=EMAIL | Notification email |
| #SBATCH --mail-type=END,FAIL | When to notify |
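To show how the directives above fit together, here is a minimal sketch of a batch script. ACCOUNT, PARTITION, and QOS are placeholders to be replaced with values reported by check_my_partitions; the job name, resource sizes, and the use of `OMP_NUM_THREADS` are illustrative assumptions, not cluster requirements.

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --qos=QOS
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back
# to 1 so the script also runs outside a job for a quick syntax check.
threads=${SLURM_CPUS_PER_TASK:-1}
export OMP_NUM_THREADS=$threads

echo "job ${SLURM_JOB_ID:-(not under Slurm)} using $threads thread(s)"
# Replace the line below with the actual workload.
hostname
```

Saved as `job.sh`, this would be submitted with `sbatch job.sh`.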
PBS to Slurm Migration
Quick reference for users migrating from the old PBS/Torque system to Slurm.
Command Equivalents
| PBS/Torque | Slurm | Description |
|---|---|---|
| qsub job.sh | sbatch job.sh | Submit a job |
| qsub -I | srun --pty bash | Interactive session |
| qstat | squeue | View jobs |
| qstat -u username | squeue -u username | Your jobs |
| qdel JOBID | scancel JOBID | Cancel a job |
| pbsnodes | sinfo | Node status |
| qstat -f JOBID | scontrol show job JOBID | Job details |
Directive Equivalents
| PBS/Torque | Slurm | Description |
|---|---|---|
| #PBS -N name | #SBATCH --job-name=name | Job name |
| #PBS -q queue | #SBATCH --partition=partition | Queue/partition |
| #PBS -l nodes=1:ppn=8 | #SBATCH --nodes=1 --cpus-per-task=8 | Nodes and cores |
| #PBS -l mem=4gb | #SBATCH --mem=4G | Memory |
| #PBS -l walltime=02:00:00 | #SBATCH --time=02:00:00 | Wall time |
| #PBS -o output.log | #SBATCH --output=output.log | Output file |
| #PBS -e error.log | #SBATCH --error=error.log | Error file |
| #PBS -M email | #SBATCH --mail-user=email | Notification email |
| #PBS -m abe | #SBATCH --mail-type=ALL | Mail events |
| #PBS -V | #SBATCH --export=ALL | Export environment |
| #PBS -t 1-10 | #SBATCH --array=1-10 | Job array |
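For scripts that use only the simple directive forms in the table above, the translation can be sketched as a `sed` filter. `pbs2slurm` is a hypothetical helper, not an installed tool; it handles only the exact patterns listed (for example, memory given in `gb`), leaves everything else untouched, and does not add the account and QOS lines that jobs on this cluster also need.

```shell
#!/usr/bin/env bash
# pbs2slurm: hypothetical filter that rewrites the simple #PBS directives
# from the table above into their #SBATCH equivalents. Reads a PBS script
# on stdin and writes the converted script to stdout.
pbs2slurm() {
    sed -E \
        -e 's/^#PBS[[:space:]]+-N[[:space:]]+(.+)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS[[:space:]]+-q[[:space:]]+(.+)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+nodes=([0-9]+):ppn=([0-9]+)$/#SBATCH --nodes=\1 --cpus-per-task=\2/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+mem=([0-9]+)gb$/#SBATCH --mem=\1G/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+walltime=(.+)$/#SBATCH --time=\1/' \
        -e 's/^#PBS[[:space:]]+-o[[:space:]]+(.+)$/#SBATCH --output=\1/' \
        -e 's/^#PBS[[:space:]]+-e[[:space:]]+(.+)$/#SBATCH --error=\1/' \
        -e 's/^#PBS[[:space:]]+-M[[:space:]]+(.+)$/#SBATCH --mail-user=\1/' \
        -e 's/^#PBS[[:space:]]+-m[[:space:]]+abe$/#SBATCH --mail-type=ALL/' \
        -e 's/^#PBS[[:space:]]+-V$/#SBATCH --export=ALL/' \
        -e 's/^#PBS[[:space:]]+-t[[:space:]]+(.+)$/#SBATCH --array=\1/'
}
```

A typical call would be `pbs2slurm < old_job.pbs > new_job.sh` (both file names are placeholders); review the result by hand before submitting, since any `#PBS` line the filter does not recognise is passed through unchanged.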
Environment Variables
| PBS/Torque | Slurm | Description |
|---|---|---|
| $PBS_JOBID | $SLURM_JOB_ID | Job ID |
| $PBS_JOBNAME | $SLURM_JOB_NAME | Job name |
| $PBS_NODEFILE | $SLURM_JOB_NODELIST | Node list |
| $PBS_ARRAYID | $SLURM_ARRAY_TASK_ID | Array task ID |
| $PBS_NP | $SLURM_NTASKS | Number of tasks |
| $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | Submission directory |
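One mapping in this table is not a drop-in replacement: `$PBS_NODEFILE` is the path of a file listing hosts, while `$SLURM_JOB_NODELIST` holds a compact string such as `node[01-02]`. A minimal sketch of producing a host file for tools that expect one is shown below; `write_nodefile` and the default file name are hypothetical, and the expansion uses `scontrol show hostnames`, which prints one hostname per line. Unlike a PBS node file, each host appears once rather than once per core.

```shell
#!/usr/bin/env bash
# write_nodefile [OUTPUT_PATH]
# Hypothetical helper: expand the compact Slurm node list into a file with
# one hostname per line and print the path of that file.
write_nodefile() {
    local out=${1:-nodefile.${SLURM_JOB_ID:-$$}}
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$out" || return 1
    echo "$out"
}
```

Inside a job script this could be used as `NODEFILE=$(write_nodefile)`.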
Key Differences
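- Slurm requires --account and --qos; run check_my_partitions to find yours
- Slurm uses --cpus-per-task instead of ppn
- Memory in Slurm is per-CPU (--mem-per-cpu) or total per node (--mem)
- By default Slurm starts a batch job in the directory it was submitted from, whereas PBS starts in the home directory, so the cd $PBS_O_WORKDIR line from PBS scripts needs no direct replacement; add cd $SLURM_SUBMIT_DIR only if your script changes directory and has to return there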
Troubleshooting
Common errors and solutions for job submission and cluster usage.
Job Submission Errors
No partition specified
srun: error: Unable to allocate resources: No partition specified or system default partition
Always specify a partition. Run check_my_partitions to find yours.
Invalid account or partition
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Your account and partition combination is incorrect. Run check_my_partitions and make sure both match.
QOS not permitted
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
The QOS you specified doesn't match your account/partition. Run check_my_partitions to see valid combinations.
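All three errors above come from a missing or mismatched account, partition, or QOS. A quick pre-submission check can be sketched as follows; `check_directives` is a hypothetical helper that only verifies the three directives are present in the script (in `--name=value` or short `-A`/`-p`/`-q` form), not that their values are valid together. Validity still has to be confirmed with check_my_partitions.

```shell
#!/usr/bin/env bash
# check_directives JOB_SCRIPT
# Hypothetical helper: report which of --account, --partition and --qos
# are missing from the #SBATCH lines of a job script.
# Returns 0 if all three are present, 1 otherwise.
check_directives() {
    local script=$1 missing=0 pair long short pattern
    for pair in account:A partition:p qos:q; do
        long=${pair%%:*}
        short=${pair##*:}
        # Accept "--account=VALUE", "-A VALUE" or "-AVALUE" on a #SBATCH line
        pattern="^#SBATCH.*[[:space:]](--${long}=|-${short}[[:space:]]*)[^[:space:]]"
        if ! grep -Eq "$pattern" "$script"; then
            echo "missing: --${long}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_directives job.sh && sbatch job.sh` submits only when all three directives are present.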
Job Failures
Out of Memory (OOM)
sacct -j JOBID -o JobID,JobName,State%20
JobID JobName State
-------- -------------------- --------------------
71 my_job OUT_OF_MEMORY
Your job used more RAM than allocated. Resubmit with a higher --mem or --mem-per-cpu. To estimate needed memory:
sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
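Turning a MaxRSS figure into a new memory request is simple arithmetic: convert the value to megabytes and add some headroom. The sketch below does this for a single value; `suggest_mem_mb` is a hypothetical helper, the 20% default headroom is an arbitrary assumption, and the unit suffixes are interpreted with a factor of 1024.

```shell
#!/usr/bin/env bash
# suggest_mem_mb MAXRSS [HEADROOM_PERCENT]
# Hypothetical helper: convert a sacct MaxRSS value such as 1536000K,
# 1500M or 1.5G to megabytes, add headroom (default 20%), and round up.
suggest_mem_mb() {
    local maxrss=$1 headroom_pct=${2:-20}
    awk -v v="$maxrss" -v h="$headroom_pct" 'BEGIN {
        unit = substr(v, length(v))      # last character is the unit suffix
        num = v + 0                      # awk keeps only the leading number
        if (unit == "K")      mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else                  mb = num / (1024 * 1024)   # no suffix: bytes
        need = mb * (100 + h) / 100
        rounded = int(need)
        if (rounded < need) rounded++    # round up to a whole megabyte
        print rounded
    }'
}
```

With a MaxRSS of `1536000K`, `suggest_mem_mb 1536000K` prints `1800`, which would translate to `--mem=1800M`. When a job has several steps, apply it to the largest MaxRSS that sacct reports.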
Timeout
Job state shows TIMEOUT — your job exceeded the time limit. Resubmit with a longer --time value.
Job stuck in Pending (PD)
Check the reason:
squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
Common reasons:
- Resources: cluster is busy, wait for nodes to free up
- QOSMaxCpuPerUserLimit: you've hit your CPU quota, wait for running jobs to finish
- InvalidQOS: wrong QOS, check check_my_partitions
- ReqNodeNotAvail: requested node is down, remove the --nodelist constraint
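When many jobs are pending, a per-reason count is often easier to read than the full queue listing. A minimal sketch follows; `pending_reasons` is a hypothetical helper built on standard squeue options (`-h` to suppress the header, `-t PENDING` to filter by state, and the `%r` format field for the reason).

```shell
#!/usr/bin/env bash
# pending_reasons [USERNAME]
# Hypothetical helper: count pending jobs per reason, most frequent first.
pending_reasons() {
    local user=${1:-$USER}
    squeue -u "$user" -h -t PENDING -o "%r" |
        awk '{ count[$0]++ } END { for (r in count) print count[r], r }' |
        sort -rn
}
```

Each output line has the form `COUNT REASON`, so a line such as `12 Resources` means twelve jobs are simply waiting for free nodes.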
NFS / Storage Issues
Job hangs or freezes on file operations
May indicate an NFS mount issue. Check if your home directory is accessible:
ls ~
If it hangs, contact HPC support — do not kill the job manually as it may cause further issues.
Disk quota exceeded
bash: cannot create temp file: Disk quota exceeded
Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.
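Before moving files it helps to know which entries take the most space. The sketch below lists the largest top-level entries of a directory using `du` and `sort`; `largest_entries` is a hypothetical helper, and it skips hidden files and directories, which can also be large.

```shell
#!/usr/bin/env bash
# largest_entries [DIR] [COUNT]
# Hypothetical helper: print the COUNT largest top-level entries of DIR
# (default: home directory, 10 entries) as "SIZE_IN_KB PATH".
largest_entries() {
    local dir=${1:-$HOME} count=${2:-10}
    # -s: one total per entry; -k: sizes in KB so a numeric sort works
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Running `largest_entries ~ 5` shows the five biggest visible entries in the home directory, which are the first candidates to move to scratch space.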
Module Issues
Module not found
module avail MODULE_NAME
Check the exact module name. Use module spider MODULE_NAME for a broader search including partial matches.
Getting Help
If you can't resolve an issue, contact HPC support at hpc@tauex.tau.ac.il. Include:
- Your username
- Job ID
- The command you ran
- The full error message
Security Installations
Required security software for TAU workstations and servers.
NAC — Forescout
Network Access Control client required for connecting to the TAU network.
ForeScoutSecureConnector_64_visible_daemon.tar.gz
Installation
tar -zxvf ForeScoutSecureConnector_64_visible_daemon.tar.gz
cd secure_connector
./install.sh
systemctl start SecureConnector.service
EDR — CrowdStrike Falcon
Endpoint Detection and Response client.
- falcon-sensor_7.18.0-17106_amd64.deb: Ubuntu 22.04 / 24.04
- falcon-sensor-7.17.0-17005.el9.x86_64.rpm: Rocky / RHEL
Installation — Ubuntu
dpkg -i falcon-sensor_7.18.0-17106_amd64.deb
systemctl restart falcon-sensor.service
Installation — Rocky / RHEL
rpm -ivh falcon-sensor-7.17.0-17005.el9.x86_64.rpm
systemctl restart falcon-sensor.service
Registration (new installations only)
/opt/CrowdStrike/falconctl -s --cid=<cid-code>
To obtain the CID code, contact infosec@tauex.tau.ac.il.
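After installing both clients, it is worth confirming that their services are actually running. A minimal sketch, assuming the two systemd unit names used in the installation steps above; `check_security_services` is a hypothetical helper, not part of either product.

```shell
#!/usr/bin/env bash
# check_security_services
# Hypothetical helper: report whether the Forescout and CrowdStrike
# services named in the installation steps are active.
# Returns 0 if both are active, 1 otherwise.
check_security_services() {
    local svc status=0
    for svc in SecureConnector.service falcon-sensor.service; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            status=1
        fi
    done
    return "$status"
}
```

If either line reports NOT active, `systemctl status` on that unit is the usual next step.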