Reference

Quick reference guides, cheat sheets, and troubleshooting for TAU HPC

Common Slurm Commands
PBS to Slurm Migration
Troubleshooting
Security Installations

Common Slurm Commands

Quick reference for the most common Slurm commands on the TAU HPC cluster.

Submitting Jobs

Command	Description
`sbatch job.sh`	Submit a batch job script
`srun --pty bash`	Start an interactive session
`sbatch --dependency=afterok:JOBID job.sh`	Submit a job that starts only after job JOBID finishes successfully
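
When chaining jobs from a script, the job ID of the first submission has to be captured before the second one can depend on it. The sketch below is one possible way to do that with sbatch --parsable, which prints only the job ID (optionally followed by ";cluster"). The function name submit_chain and the script names in the usage line are illustrative, not part of the cluster tooling.

```shell
#!/bin/bash
# Hypothetical helper: submit two scripts so that the second starts
# only if the first one finishes successfully (afterok).
submit_chain() {
    local first_script="$1" second_script="$2" first_id
    # --parsable prints "JOBID" or "JOBID;CLUSTER" instead of a sentence
    first_id=$(sbatch --parsable "$first_script") || return 1
    # Drop the optional ";CLUSTER" suffix
    first_id="${first_id%%;*}"
    # Prints the job ID of the dependent job
    sbatch --parsable --dependency=afterok:"$first_id" "$second_script"
}
```

Usage would look like submit_chain preprocess.sh analyze.sh. If the first job fails, the dependent job never becomes runnable and can be removed with scancel.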

Monitoring Jobs

Command	Description
`squeue -u username`	Your running and pending jobs
`squeue`	All jobs on the cluster
`scontrol show job JOBID`	Full details of a job
`sacct -j JOBID --format=JobID,JobName,State,MaxRSS,Elapsed`	Job accounting and memory usage
`sattach JOBID.STEPID`	Attach to the input/output of a running job step (for example `sattach JOBID.0`)

Managing Jobs

Command	Description
`scancel JOBID`	Cancel a specific job
`scancel -u username`	Cancel all your jobs

Cluster Information

Command	Description
`sinfo`	Partition and node status
`scontrol show partition PARTITION`	Partition details and limits
`check_my_partitions`	Your available partitions and accounts
`features`	Available node constraints/features

Environment Modules

Command	Description
`module avail`	List all available modules
`module avail NAME`	Search for a specific module
`module spider NAME`	Detailed module info including dependencies
`module load NAME`	Load a module
`module list`	List loaded modules
`module unload NAME`	Unload a module
`module purge`	Unload all modules

Common SBATCH Directives

Directive	Description
`#SBATCH --job-name=NAME`	Job name
`#SBATCH --account=ACCOUNT`	Account name
`#SBATCH --partition=PARTITION`	Partition/queue
`#SBATCH --qos=QOS`	Quality of Service
`#SBATCH --time=HH:MM:SS`	Max run time
`#SBATCH --ntasks=N`	Number of tasks
`#SBATCH --nodes=N`	Number of nodes
`#SBATCH --cpus-per-task=N`	CPU cores per task
`#SBATCH --mem-per-cpu=NG`	Memory per CPU
`#SBATCH --mem=NG`	Memory per node
`#SBATCH --gres=gpu:N`	Number of GPUs per node
`#SBATCH --constraint=FEATURE`	Node constraint/feature
`#SBATCH --array=1-N`	Job array
`#SBATCH --output=FILE_%j.out`	Output file (%j = job ID)
`#SBATCH --error=FILE_%j.err`	Error file
`#SBATCH --mail-user=EMAIL`	Notification email
`#SBATCH --mail-type=END,FAIL`	When to notify
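
The directives above are normally combined at the top of a batch script, before the first command, because sbatch stops reading #SBATCH lines once the commands begin. A minimal sketch follows. ACCOUNT, PARTITION and QOS are placeholders to replace with values from check_my_partitions, and the job name and resource numbers are arbitrary illustrations.

```shell
#!/bin/bash
# Placeholders: replace ACCOUNT, PARTITION and QOS with your own values.
#SBATCH --job-name=example
#SBATCH --account=ACCOUNT
#SBATCH --partition=PARTITION
#SBATCH --qos=QOS
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

# Slurm sets SLURM_* variables inside a job; the defaults below only
# keep the script runnable when it is tested outside of one.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$THREADS"

echo "Job ${SLURM_JOB_ID:-none} on $(hostname) using ${THREADS} thread(s)"
```

Saved as job.sh, this would be submitted with sbatch job.sh.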

PBS to Slurm Migration

Quick reference for users migrating from the old PBS/Torque system to Slurm.

Command Equivalents

PBS/Torque	Slurm	Description
`qsub job.sh`	`sbatch job.sh`	Submit a job
`qsub -I`	`srun --pty bash`	Interactive session
`qstat`	`squeue`	View jobs
`qstat -u username`	`squeue -u username`	Your jobs
`qdel JOBID`	`scancel JOBID`	Cancel a job
`pbsnodes`	`sinfo`	Node status
`qstat -f JOBID`	`scontrol show job JOBID`	Job details

Directive Equivalents

PBS/Torque	Slurm	Description
`#PBS -N name`	`#SBATCH --job-name=name`	Job name
`#PBS -q queue`	`#SBATCH --partition=partition`	Queue/partition
`#PBS -l nodes=1:ppn=8`	`#SBATCH --nodes=1 --cpus-per-task=8`	Nodes and cores
`#PBS -l mem=4gb`	`#SBATCH --mem=4G`	Memory
`#PBS -l walltime=02:00:00`	`#SBATCH --time=02:00:00`	Wall time
`#PBS -o output.log`	`#SBATCH --output=output.log`	Output file
`#PBS -e error.log`	`#SBATCH --error=error.log`	Error file
`#PBS -M email`	`#SBATCH --mail-user=email`	Email
`#PBS -m abe`	`#SBATCH --mail-type=ALL`	Mail events
`#PBS -V`	`#SBATCH --export=ALL`	Export environment
`#PBS -t 1-10`	`#SBATCH --array=1-10`	Job array
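
For scripts with many directives, the one-to-one rows of this table can be applied mechanically. The sketch below is a hypothetical helper named pbs_to_slurm, not a supported tool. It rewrites only the simple cases and assumes memory is written as a whole number of gb. It leaves nodes/ppn requests, mail options and queue names that differ from partition names for manual editing.

```shell
#!/bin/bash
# Hypothetical filter: read a PBS script on stdin, write it to stdout
# with the simple one-to-one directives rewritten for Slurm.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS[[:space:]]+-N[[:space:]]+(.*)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS[[:space:]]+-q[[:space:]]+(.*)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+walltime=(.*)$/#SBATCH --time=\1/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+mem=([0-9]+)gb$/#SBATCH --mem=\1G/' \
        -e 's/^#PBS[[:space:]]+-o[[:space:]]+(.*)$/#SBATCH --output=\1/' \
        -e 's/^#PBS[[:space:]]+-e[[:space:]]+(.*)$/#SBATCH --error=\1/' \
        -e 's/^#PBS[[:space:]]+-t[[:space:]]+(.*)$/#SBATCH --array=\1/'
}
```

A typical call would be pbs_to_slurm < old_job.pbs > job.sh, followed by a manual review that also adds the --account and --qos lines.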

Environment Variables

PBS/Torque	Slurm	Description
`$PBS_JOBID`	`$SLURM_JOB_ID`	Job ID
`$PBS_JOBNAME`	`$SLURM_JOB_NAME`	Job name
`$PBS_NODEFILE`	`$SLURM_JOB_NODELIST`	Node list (in Slurm a compact hostlist string such as node[01-03], not a path to a file)
`$PBS_ARRAYID`	`$SLURM_ARRAY_TASK_ID`	Array task ID
`$PBS_NP`	`$SLURM_NTASKS`	Number of tasks
`$PBS_O_WORKDIR`	`$SLURM_SUBMIT_DIR`	Submission directory
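
Programs that expect a PBS-style node file need one more step, because Slurm provides the node list as a string rather than a file. The sketch below shows one way to write such a file with scontrol show hostnames. The function name make_nodefile and the default file name are assumptions made for illustration. The result lists each node once, whereas a PBS node file may repeat a node once per core.

```shell
#!/bin/bash
# Hypothetical helper: write the job's nodes to a file, one per line,
# and print the file name.
make_nodefile() {
    local nodefile="${1:-nodefile.${SLURM_JOB_ID:-0}.txt}"
    # scontrol show hostnames expands e.g. "node[01-03]" to one name per line
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$nodefile" || return 1
    echo "$nodefile"
}
```

Inside a job script this could be used as NODEFILE=$(make_nodefile).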

Key Differences

Slurm requires --account and --qos — run check_my_partitions to find yours
Slurm uses --cpus-per-task instead of ppn
Memory in Slurm is requested per CPU (--mem-per-cpu) or per node (--mem)
Slurm starts a batch job in the directory you submitted it from (PBS started it in your home directory), so cd $SLURM_SUBMIT_DIR is normally not needed; use --chdir=DIR to run the job somewhere else

Troubleshooting

Common errors and solutions for job submission and cluster usage.

Job Submission Errors

No partition specified

srun: error: Unable to allocate resources: No partition specified or system default partition

Always specify a partition. Run check_my_partitions to find yours.

Invalid account or partition

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Your account and partition combination is incorrect. Run check_my_partitions and make sure both match.

QOS not permitted

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy

The QOS you specified doesn't match your account/partition. Run check_my_partitions to see valid combinations.
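
All three errors above come from a missing or mismatched account, partition, or QOS. A quick check before submitting can catch the missing ones. The sketch below is a hypothetical helper (check_directives is not a cluster command) that only looks for the long --option=value form on #SBATCH lines. It does not see short options such as -A or -p, and it cannot tell whether the values are a valid combination.

```shell
#!/bin/bash
# Hypothetical pre-submission check: report which of --account,
# --partition and --qos are missing from a batch script.
check_directives() {
    local script="$1" missing=0 opt
    for opt in account partition qos; do
        if ! grep -Eq "^#SBATCH[[:space:]]+--${opt}=" "$script"; then
            echo "missing #SBATCH --${opt}" >&2
            missing=1
        fi
    done
    # 0 if all three were found, 1 otherwise
    return "$missing"
}
```

One way to use it is check_directives job.sh && sbatch job.sh.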

Job Failures

Out of Memory (OOM)

sacct -j JOBID -o JobID,JobName,State%20

JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY

Your job used more RAM than it was allocated. Resubmit with a higher --mem or --mem-per-cpu. To estimate how much it needs, check the peak memory recorded for the job (MaxRSS):

sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
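
Turning a MaxRSS value into a new memory request is simple arithmetic. The sketch below assumes MaxRSS is a whole number with a K, M or G suffix (for example 4194304K) and adds 20% headroom. Both the helper name suggest_mem and the 20% margin are illustrative choices, not a site recommendation. For a job that was killed for running out of memory, the recorded peak may be lower than what the job really needed.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value such as "4194304K" into
# a --mem request in MiB with 20% headroom.
suggest_mem() {
    local rss="$1" value unit mib
    value="${rss%[KMG]}"      # numeric part
    unit="${rss#"$value"}"    # trailing K, M or G (empty if absent)
    case "$unit" in
        K) mib=$(( value / 1024 )) ;;
        M) mib=$value ;;
        G) mib=$(( value * 1024 )) ;;
        *) echo "unrecognized MaxRSS value: $rss" >&2; return 1 ;;
    esac
    echo "--mem=$(( mib * 120 / 100 ))M"
}
```

For example, suggest_mem 4194304K prints --mem=4915M.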

Timeout

Job state shows TIMEOUT — your job exceeded the time limit. Resubmit with a longer --time value.

Job stuck in Pending (PD)

Check the reason:

squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Common reasons:

Resources — cluster is busy, wait for nodes to free up
QOSMaxCpuPerUserLimit — you've hit your CPU quota, wait for running jobs to finish
InvalidQOS — wrong QOS, check check_my_partitions
ReqNodeNotAvail — requested node is down, remove --nodelist constraint
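
The reasons above can be looked up automatically for every pending job. The following sketch pairs each pending job with the matching advice from this list. The function names explain_reason and pending_advice are hypothetical. The squeue options used are -h (no header), -t PENDING (pending jobs only) and the format fields %i (job ID) and %r (reason).

```shell
#!/bin/bash
# Hypothetical helper: map a pending reason to the advice listed above.
explain_reason() {
    case "$1" in
        Resources)             echo "cluster is busy; wait for nodes to free up" ;;
        QOSMaxCpuPerUserLimit) echo "CPU quota reached; wait for running jobs to finish" ;;
        InvalidQOS)            echo "wrong QOS; check check_my_partitions" ;;
        ReqNodeNotAvail*)      echo "requested node unavailable; remove --nodelist" ;;
        *)                     echo "not covered here; see scontrol show job" ;;
    esac
}

# Print "JOBID REASON: advice" for each pending job of the given user
# (defaults to $USER).
pending_advice() {
    squeue -u "${1:-$USER}" -h -t PENDING -o "%i %r" |
    while read -r jobid reason; do
        echo "$jobid $reason: $(explain_reason "$reason")"
    done
}
```

Running pending_advice with no argument reports on your own jobs.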

NFS / Storage Issues

Job hangs or freezes on file operations

This may indicate an NFS mount issue. Check whether your home directory is accessible:

ls ~

If it hangs, contact HPC support — do not kill the job manually as it may cause further issues.
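
A plain ls on a stuck mount can freeze the shell it is typed into. The sketch below wraps the check in timeout from GNU coreutils so that it gives up after a few seconds. The function name dir_responds and the 10-second default are assumptions. On a badly stuck mount even a killed process may take a while to exit, so treat this as a convenience rather than a guarantee.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if listing the directory finishes
# within the time limit, 1 otherwise.
dir_responds() {
    local dir="${1:-$HOME}" seconds="${2:-10}"
    # timeout sends SIGKILL to ls if it has not finished in time
    if timeout -s KILL "$seconds" ls "$dir" > /dev/null 2>&1; then
        echo "$dir responded"
    else
        echo "$dir did not respond within ${seconds}s (or could not be listed)"
        return 1
    fi
}
```

Called without arguments, dir_responds checks your home directory.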

Disk quota exceeded

bash: cannot create temp file: Disk quota exceeded

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.
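
Before moving anything, it helps to see what is taking the space. The sketch below lists the largest top-level entries of a directory with du and sort. The function name largest_entries is illustrative. Hidden files and directories (names starting with a dot) are not matched by the * pattern and would need a separate check.

```shell
#!/bin/bash
# Hypothetical helper: show the N largest entries in a directory,
# sizes in KiB, largest last.
largest_entries() {
    local dir="${1:-$HOME}" count="${2:-10}"
    # du -sk prints one total per entry in KiB; sort -n orders by size
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

For example, largest_entries ~ 5 shows the five largest items in your home directory.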

Module Issues

Module not found

module avail MODULE_NAME

Check the exact module name. Use module spider MODULE_NAME for a broader search including partial matches.

Getting Help

If you can't resolve an issue, contact HPC support at hpc@tauex.tau.ac.il. Include:

Your username
Job ID
The command you ran
The full error message

Security Installations

Required security software for TAU workstations and servers.

NAC — Forescout

Network Access Control client required for connecting to the TAU network.

ForeScoutSecureConnector_64_visible_daemon.tar.gz

Installation

tar -zxvf ForeScoutSecureConnector_64_visible_daemon.tar.gz
cd secure_connector
./install.sh
systemctl start SecureConnector.service

EDR — CrowdStrike Falcon

Endpoint Detection and Response client.

falcon-sensor_7.18.0-17106_amd64.deb — Ubuntu 22.04 / 24.04
falcon-sensor_7.39.0-19203_amd64.deb — Ubuntu 24.04
falcon-sensor-7.17.0-17005.el9.x86_64.rpm — Rocky / RHEL

Installation — Ubuntu

dpkg -i falcon-sensor_7.18.0-17106_amd64.deb
systemctl restart falcon-sensor.service

Installation — Rocky / RHEL

rpm -ivh falcon-sensor-7.17.0-17005.el9.x86_64.rpm
systemctl restart falcon-sensor.service

Registration (new installations only)

/opt/CrowdStrike/falconctl -s --cid=<cid-code>

To obtain the CID code, contact infosec@tauex.tau.ac.il.