Reference

Quick reference guides, cheat sheets, and troubleshooting for TAU HPC

Common Slurm Commands

Quick reference for the most common Slurm commands on the TAU HPC cluster.

Submitting Jobs

Command                                    Description
sbatch job.sh                              Submit a batch job script
srun --pty bash                            Start an interactive session
sbatch --dependency=afterok:JOBID job.sh   Submit a job that starts only after job JOBID completes successfully
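
A chain of dependent jobs can be scripted instead of typed by hand. The sketch below is one possible approach, not a tool provided on the cluster: submit_chain is a hypothetical helper name, and the script names in the usage line are placeholders. It relies on sbatch --parsable, which prints the job ID (optionally followed by ;cluster), so each ID can be passed to the next submission.

```shell
# submit_chain SCRIPT...  (hypothetical helper, not a cluster command)
# Submits each script so that it starts only after the previous one
# finished successfully (afterok).
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        prev="${jobid%%;*}"          # drop an optional ";cluster" suffix
        echo "$script -> job $prev"
    done
}
```

Usage, assuming three job scripts exist in the current directory: submit_chain prepare.sh compute.sh collect.sh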

Monitoring Jobs

Command                                                      Description
squeue -u username                                           Your running and pending jobs
squeue                                                       All jobs on the cluster
scontrol show job JOBID                                      Full details of a job
sacct -j JOBID --format=JobID,JobName,State,MaxRSS,Elapsed   Job accounting and memory usage
sattach JOBID.STEPID                                         Attach to the output of a running job step

Managing Jobs

Command               Description
scancel JOBID         Cancel a specific job
scancel -u username   Cancel all your jobs

Cluster Information

Command                             Description
sinfo                               Partition and node status
scontrol show partition PARTITION   Partition details and limits
check_my_partitions                 Your available partitions and accounts
features                            Available node constraints/features

Environment Modules

Command              Description
module avail         List all available modules
module avail NAME    Search for a specific module
module spider NAME   Detailed module info including dependencies
module load NAME     Load a module
module list          List loaded modules
module unload NAME   Unload a module
module purge         Unload all modules

Common SBATCH Directives

Directive                       Description
#SBATCH --job-name=NAME         Job name
#SBATCH --account=ACCOUNT       Account name
#SBATCH --partition=PARTITION   Partition/queue
#SBATCH --qos=QOS               Quality of Service
#SBATCH --time=HH:MM:SS         Max run time
#SBATCH --ntasks=N              Number of tasks
#SBATCH --nodes=N               Number of nodes
#SBATCH --cpus-per-task=N       CPU cores per task
#SBATCH --mem-per-cpu=NG        Memory per CPU
#SBATCH --mem=NG                Total memory per node
#SBATCH --gres=gpu:N            Number of GPUs
#SBATCH --constraint=FEATURE    Node constraint/feature
#SBATCH --array=1-N             Job array
#SBATCH --output=FILE_%j.out    Output file (%j = job ID)
#SBATCH --error=FILE_%j.err     Error file (%j = job ID)
#SBATCH --mail-user=EMAIL       Notification email
#SBATCH --mail-type=END,FAIL    When to notify
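
Put together, these directives form the header of a batch script. The following is a minimal sketch rather than a recommended configuration: ACCOUNT and PARTITION are placeholders to be replaced with values reported by check_my_partitions, and the resource numbers and the example name are arbitrary choices for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=example
# ACCOUNT and PARTITION are placeholders; take real values from check_my_partitions.
#SBATCH --account=ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

# Match the thread count to the allocated cores; fall back to 1 outside Slurm.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${SLURM_JOB_ID:-none} using ${OMP_NUM_THREADS} thread(s) on $(hostname)"
```

Saved as job.sh, the script would be submitted with sbatch job.sh. The #SBATCH lines are ordinary comments to bash, so the script also runs unchanged outside Slurm for a quick syntax check.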

PBS to Slurm Migration

Quick reference for users migrating from the old PBS/Torque system to Slurm.

Command Equivalents

PBS/Torque          Slurm                     Description
qsub job.sh         sbatch job.sh             Submit a job
qsub -I             srun --pty bash           Interactive session
qstat               squeue                    View jobs
qstat -u username   squeue -u username        Your jobs
qdel JOBID          scancel JOBID             Cancel a job
pbsnodes            sinfo                     Node status
qstat -f JOBID      scontrol show job JOBID   Job details

Directive Equivalents

PBS/Torque                  Slurm                                 Description
#PBS -N name                #SBATCH --job-name=name               Job name
#PBS -q queue               #SBATCH --partition=partition         Queue/partition
#PBS -l nodes=1:ppn=8       #SBATCH --nodes=1 --cpus-per-task=8   Nodes and cores
#PBS -l mem=4gb             #SBATCH --mem=4G                      Memory
#PBS -l walltime=02:00:00   #SBATCH --time=02:00:00               Wall time
#PBS -o output.log          #SBATCH --output=output.log           Output file
#PBS -e error.log           #SBATCH --error=error.log             Error file
#PBS -M email               #SBATCH --mail-user=email             Email
#PBS -m abe                 #SBATCH --mail-type=ALL               Mail events
#PBS -V                     #SBATCH --export=ALL                  Export environment
#PBS -t 1-10                #SBATCH --array=1-10                  Job array
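
For scripts with many directives, the one-to-one rows of this table can be applied mechanically. The filter below is a rough sketch under stated assumptions, not a complete converter: pbs_to_slurm is a hypothetical name, it only handles directives written exactly in the forms shown above, and anything else (including -l nodes=...:ppn=..., which needs a decision about tasks versus threads) passes through unchanged for manual review.

```shell
# pbs_to_slurm: read a PBS script on stdin, write a Slurm version to stdout.
# Only the simple one-to-one directives from the table are rewritten.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS -N (.*)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS -q (.*)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS -l mem=([0-9]+)gb$/#SBATCH --mem=\1G/' \
        -e 's/^#PBS -l walltime=(.*)$/#SBATCH --time=\1/' \
        -e 's/^#PBS -o (.*)$/#SBATCH --output=\1/' \
        -e 's/^#PBS -e (.*)$/#SBATCH --error=\1/' \
        -e 's/^#PBS -M (.*)$/#SBATCH --mail-user=\1/' \
        -e 's/^#PBS -m abe$/#SBATCH --mail-type=ALL/' \
        -e 's/^#PBS -V$/#SBATCH --export=ALL/' \
        -e 's/^#PBS -t (.*)$/#SBATCH --array=\1/'
}
```

Usage: pbs_to_slurm < old_job.pbs > job.sh, then search the result for any remaining #PBS lines and convert those by hand. The file names here are placeholders.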

Environment Variables

PBS/Torque       Slurm                  Description
$PBS_JOBID       $SLURM_JOB_ID          Job ID
$PBS_JOBNAME     $SLURM_JOB_NAME        Job name
$PBS_NODEFILE    $SLURM_JOB_NODELIST    Node list (PBS: path to a file; Slurm: compact hostlist string)
$PBS_ARRAYID     $SLURM_ARRAY_TASK_ID   Array task ID
$PBS_NP          $SLURM_NTASKS          Number of tasks
$PBS_O_WORKDIR   $SLURM_SUBMIT_DIR      Submission directory
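
Legacy scripts that read the PBS variables can be kept working with a short compatibility preamble. This is a sketch of one way to do it, not something the cluster sets up for you. The fallback values used outside a Slurm job are assumptions made for illustration. Note that $PBS_NODEFILE pointed to a file of hostnames, whereas $SLURM_JOB_NODELIST holds a compact string (for example node[01-04], a made-up name); scontrol show hostnames expands it to one hostname per line, listing each node once rather than once per core.

```shell
# Map Slurm variables to their PBS-era names for legacy scripts.
# Outside a Slurm job the SLURM_* variables are unset, so defaults apply.
export PBS_JOBID="${SLURM_JOB_ID:-}"
export PBS_JOBNAME="${SLURM_JOB_NAME:-}"
export PBS_ARRAYID="${SLURM_ARRAY_TASK_ID:-}"
export PBS_NP="${SLURM_NTASKS:-1}"
export PBS_O_WORKDIR="${SLURM_SUBMIT_DIR:-$PWD}"

# Build a PBS-style node file only when running inside a Slurm allocation.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    PBS_NODEFILE="$(mktemp)"
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$PBS_NODEFILE"
    export PBS_NODEFILE
fi
```

Placing these lines near the top of a migrated job script lets the rest of the script stay as it was while it is gradually updated to the Slurm names.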

Key Differences

Troubleshooting

Common errors and solutions for job submission and cluster usage.

Job Submission Errors

No partition specified

srun: error: Unable to allocate resources: No partition specified or system default partition

Always specify a partition with --partition (or -p). Run check_my_partitions to find the partitions available to you.

Invalid account or partition

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

The account and partition you specified are not a valid combination. Run check_my_partitions and make sure the values you pass to --account and --partition match one of the listed combinations.

QOS not permitted

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy

The QOS you specified doesn't match your account/partition. Run check_my_partitions to see valid combinations.

Job Failures

Out of Memory (OOM)

sacct -j JOBID -o JobID,JobName,State%20

JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY

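Your job used more RAM than it was allocated. Resubmit with a higher --mem or --mem-per-cpu. To estimate the memory needed, check the peak usage (MaxRSS) recorded for the job; for a job that was killed for running out of memory, treat this value as a lower bound: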

sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
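
The MaxRSS value can be turned into a new --mem request with some safety margin. The helper below is only a sketch: suggest_mem is a hypothetical name, the default 20% headroom is an arbitrary assumption rather than a site recommendation, and it expects MaxRSS in the form sacct commonly prints it (a number ending in K, M or G).

```shell
# suggest_mem MAXRSS [HEADROOM_PERCENT]
# Prints a --mem value in megabytes: MaxRSS plus headroom, rounded up.
suggest_mem() {
    local maxrss="$1" headroom="${2:-20}"
    case "$maxrss" in
        *[KMGkmg]) ;;
        *) echo "suggest_mem: expected a value ending in K, M or G" >&2
           return 1 ;;
    esac
    awk -v v="$maxrss" -v h="$headroom" 'BEGIN {
        unit = toupper(substr(v, length(v)))   # last character is the unit
        num  = v + 0                           # leading numeric part
        if (unit == "K")      mib = num / 1024
        else if (unit == "M") mib = num
        else                  mib = num * 1024
        need = mib * (1 + h / 100)
        if (need > int(need)) need = int(need) + 1   # round up
        printf "%dM\n", need
    }'
}
```

For example, if sacct reports a MaxRSS of 2097152K, suggest_mem 2097152K prints 2458M, which can be used as #SBATCH --mem=2458M.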

Timeout

A job state of TIMEOUT means your job exceeded its time limit. Resubmit with a longer --time value, staying within the partition's limit (see scontrol show partition PARTITION).

Job stuck in Pending (PD)

Check the reason:

squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Common reasons:

NFS / Storage Issues

Job hangs or freezes on file operations

May indicate an NFS mount issue. Check if your home directory is accessible:

ls ~

If the command hangs, contact HPC support. Do not kill the job manually, as this may cause further issues.

Disk quota exceeded

bash: cannot create temp file: Disk quota exceeded

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.

Module Issues

Module not found

module avail MODULE_NAME

Check the exact module name. Use module spider MODULE_NAME for a broader search including partial matches.

Getting Help

If you can't resolve an issue, contact HPC support at hpc@tauex.tau.ac.il. Include:

Security Installations

Required security software for TAU workstations and servers.

NAC — Forescout

Network Access Control client required for connecting to the TAU network.

ForeScoutSecureConnector_64_visible_daemon.tar.gz

Installation

tar -zxvf ForeScoutSecureConnector_64_visible_daemon.tar.gz
cd secure_connector
./install.sh
systemctl start SecureConnector.service

EDR — CrowdStrike Falcon

Endpoint Detection and Response client.

Installation — Ubuntu

dpkg -i falcon-sensor_7.18.0-17106_amd64.deb
systemctl restart falcon-sensor.service

Installation — Rocky / RHEL

rpm -ivh falcon-sensor_7.17.0-17005.el9.x86_64.rpm
systemctl restart falcon-sensor.service

Registration (new installations only)

/opt/CrowdStrike/falconctl -s --cid=<cid-code>

To obtain the CID code, contact infosec@tauex.tau.ac.il.