# Reference

Quick reference guides, cheat sheets, and troubleshooting for TAU HPC

# Common Slurm Commands

Quick reference for the most common Slurm commands on the TAU HPC cluster.

## Submitting Jobs

| Command | Description |
| --- | --- |
| `sbatch job.sh` | Submit a batch job script |
| `srun --pty bash` | Start an interactive session |
| `sbatch --dependency=afterok:JOBID job.sh` | Submit a job that starts only after job JOBID completes successfully |
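
The dependency option above needs the ID of the first job. A minimal sketch of chaining two submissions follows; the function name `submit_chain` and the script names in the usage line are hypothetical, and it assumes `sbatch --parsable` prints the job ID, optionally followed by `;cluster`.

```shell
#!/bin/bash
# Sketch: submit two jobs so the second starts only if the first succeeds.
# submit_chain is a hypothetical helper, not a cluster-provided command.

submit_chain() {
    local first_script="$1" second_script="$2" first_id
    # --parsable makes sbatch print only the job ID (possibly "ID;cluster").
    first_id=$(sbatch --parsable "$first_script") || return 1
    first_id=${first_id%%;*}   # drop a ";cluster" suffix if present
    sbatch --dependency=afterok:"$first_id" "$second_script"
}

# Usage (hypothetical script names):
#   submit_chain preprocess.sh analyze.sh
```

If the first job fails, the dependent job never becomes eligible to run and may need to be cancelled by hand with `scancel`.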

## Monitoring Jobs

| Command | Description |
| --- | --- |
| `squeue -u username` | Your running and pending jobs |
| `squeue` | All jobs on the cluster |
| `scontrol show job JOBID` | Full details of a job |
| `sacct -j JOBID --format=JobID,JobName,State,MaxRSS,Elapsed` | Job accounting and memory usage |
| `sattach JOBID.STEPID` | Attach to the input/output of a running job step |

## Managing Jobs

| Command | Description |
| --- | --- |
| `scancel JOBID` | Cancel a specific job |
| `scancel -u username` | Cancel all your jobs |

## Cluster Information

| Command | Description |
| --- | --- |
| `sinfo` | Partition and node status |
| `scontrol show partition PARTITION` | Partition details and limits |
| `check_my_partitions` | Your available partitions and accounts |
| `features` | Available node constraints/features |

## Environment Modules

| Command | Description |
| --- | --- |
| `module avail` | List all available modules |
| `module avail NAME` | Search for a specific module |
| `module spider NAME` | Detailed module info including dependencies |
| `module load NAME` | Load a module |
| `module list` | List loaded modules |
| `module unload NAME` | Unload a module |
| `module purge` | Unload all modules |

## Common SBATCH Directives

| Directive | Description |
| --- | --- |
| `#SBATCH --job-name=NAME` | Job name |
| `#SBATCH --account=ACCOUNT` | Account name |
| `#SBATCH --partition=PARTITION` | Partition/queue |
| `#SBATCH --qos=QOS` | Quality of Service |
| `#SBATCH --time=HH:MM:SS` | Max run time |
| `#SBATCH --ntasks=N` | Number of tasks |
| `#SBATCH --nodes=N` | Number of nodes |
| `#SBATCH --cpus-per-task=N` | CPU cores per task |
| `#SBATCH --mem-per-cpu=NG` | Memory per CPU |
| `#SBATCH --mem=NG` | Memory per node |
| `#SBATCH --gres=gpu:N` | Number of GPUs |
| `#SBATCH --constraint=FEATURE` | Node constraint/feature |
| `#SBATCH --array=1-N` | Job array |
| `#SBATCH --output=FILE_%j.out` | Output file (%j = job ID) |
| `#SBATCH --error=FILE_%j.err` | Error file (%j = job ID) |
| `#SBATCH --mail-user=EMAIL` | Notification email |
| `#SBATCH --mail-type=END,FAIL` | When to notify |
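
Put together, the directives above form the header of a job script. The following is a minimal sketch: `my_account`, `my_partition`, and `my_qos` are placeholders (run `check_my_partitions` for real values), and the resource figures are arbitrary.

```shell
#!/bin/bash
#SBATCH --job-name=example
# The next three values are placeholders; take yours from check_my_partitions.
#SBATCH --account=my_account
#SBATCH --partition=my_partition
#SBATCH --qos=my_qos
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

# Slurm sets these variables inside a job; the fallbacks only apply
# when the script is run by hand outside the scheduler.
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"
echo "Job ${job_id} running on $(hostname) with ${cpus} CPU(s)"
```

Submit it with `sbatch example.sh`, where `example.sh` is whatever file name you saved the script under.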

# PBS to Slurm Migration

Quick reference for users migrating from the old PBS/Torque system to Slurm.

## Command Equivalents

| PBS/Torque | Slurm | Description |
| --- | --- | --- |
| `qsub job.sh` | `sbatch job.sh` | Submit a job |
| `qsub -I` | `srun --pty bash` | Interactive session |
| `qstat` | `squeue` | View jobs |
| `qstat -u username` | `squeue -u username` | Your jobs |
| `qdel JOBID` | `scancel JOBID` | Cancel a job |
| `pbsnodes` | `sinfo` | Node status |
| `qstat -f JOBID` | `scontrol show job JOBID` | Job details |

## Directive Equivalents

| PBS/Torque | Slurm | Description |
| --- | --- | --- |
| `#PBS -N name` | `#SBATCH --job-name=name` | Job name |
| `#PBS -q queue` | `#SBATCH --partition=partition` | Queue/partition |
| `#PBS -l nodes=1:ppn=8` | `#SBATCH --nodes=1 --cpus-per-task=8` | Nodes and cores |
| `#PBS -l mem=4gb` | `#SBATCH --mem=4G` | Memory |
| `#PBS -l walltime=02:00:00` | `#SBATCH --time=02:00:00` | Wall time |
| `#PBS -o output.log` | `#SBATCH --output=output.log` | Output file |
| `#PBS -e error.log` | `#SBATCH --error=error.log` | Error file |
| `#PBS -M email` | `#SBATCH --mail-user=email` | Email |
| `#PBS -m abe` | `#SBATCH --mail-type=ALL` | Mail events |
| `#PBS -V` | `#SBATCH --export=ALL` | Export environment |
| `#PBS -t 1-10` | `#SBATCH --array=1-10` | Job array |
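
For scripts with many directives, the mappings in this table can be applied mechanically. The sketch below is an illustration only: `pbs_to_sbatch` is a hypothetical helper, it handles just the single-resource forms listed above (one `-l` resource per line, memory in `gb`), and its output should be reviewed by hand before submitting.

```shell
#!/bin/bash
# Sketch: rewrite common #PBS directives as #SBATCH directives.
# Reads a script on stdin and writes the converted script to stdout.
# Lines that match none of the patterns pass through unchanged.

pbs_to_sbatch() {
    sed -E \
        -e 's/^#PBS -N (.+)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS -q (.+)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS -l nodes=([0-9]+):ppn=([0-9]+)$/#SBATCH --nodes=\1 --cpus-per-task=\2/' \
        -e 's/^#PBS -l mem=([0-9]+)gb$/#SBATCH --mem=\1G/' \
        -e 's/^#PBS -l walltime=(.+)$/#SBATCH --time=\1/' \
        -e 's/^#PBS -o (.+)$/#SBATCH --output=\1/' \
        -e 's/^#PBS -e (.+)$/#SBATCH --error=\1/' \
        -e 's/^#PBS -t (.+)$/#SBATCH --array=\1/'
}

# Usage (hypothetical file names):
#   pbs_to_sbatch < old_job.pbs > new_job.sh
```

Queue names rarely map one-to-one onto partitions, so the `--partition` value in particular will usually need editing, and `--account` and `--qos` still have to be added.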

## Environment Variables

| PBS/Torque | Slurm | Description |
| --- | --- | --- |
| `$PBS_JOBID` | `$SLURM_JOB_ID` | Job ID |
| `$PBS_JOBNAME` | `$SLURM_JOB_NAME` | Job name |
| `$PBS_NODEFILE` | `$SLURM_JOB_NODELIST` | Node list (PBS: path to a file; Slurm: compact hostlist string) |
| `$PBS_ARRAYID` | `$SLURM_ARRAY_TASK_ID` | Array task ID |
| `$PBS_NP` | `$SLURM_NTASKS` | Number of tasks |
| `$PBS_O_WORKDIR` | `$SLURM_SUBMIT_DIR` | Submission directory |
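
One mapping in this table is not a drop-in replacement: `$PBS_NODEFILE` is the path to a file, while `$SLURM_JOB_NODELIST` holds a compact string such as `node[01-04]`. For tools that expect a file, a sketch along these lines may help. `make_nodefile` is a hypothetical helper built on `scontrol show hostnames`, which expands the compact list to one hostname per line.

```shell
#!/bin/bash
# Sketch: write the job's nodes to a file, one hostname per line.
# Unlike a typical PBS node file, each node appears once,
# not once per allocated core.

make_nodefile() {
    local nodefile="$1"
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$nodefile"
}

# Usage inside a job script:
#   make_nodefile "nodes.$SLURM_JOB_ID"
```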

## Key Differences

- Slurm requires `--account` and `--qos`. Run `check_my_partitions` to find yours.
- Slurm uses `--cpus-per-task` instead of `ppn`.
- Memory in Slurm is requested per CPU (`--mem-per-cpu`) or per node (`--mem`).
- By default, Slurm starts a batch job in the directory it was submitted from, so the PBS-style `cd $PBS_O_WORKDIR` line is usually unnecessary. Add `cd $SLURM_SUBMIT_DIR` to your script if you need to be explicit.

# Troubleshooting

Common errors and solutions for job submission and cluster usage.

## Job Submission Errors

### No partition specified

```bash
srun: error: Unable to allocate resources: No partition specified or system default partition
```

Always specify a partition. Run `check_my_partitions` to find yours.

### Invalid account or partition

```bash
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```

Your account and partition combination is incorrect. Run `check_my_partitions` and make sure both match.

### QOS not permitted

```bash
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
```

The QOS you specified doesn't match your account/partition. Run `check_my_partitions` to see valid combinations.

## Job Failures

### Out of Memory (OOM)

```bash
sacct -j JOBID -o JobID,JobName,State%20
JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY
```

Your job used more RAM than allocated. Resubmit with a higher `--mem` or `--mem-per-cpu`. To estimate needed memory:

```bash
sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
```

### Timeout

Job state shows `TIMEOUT`: your job exceeded the time limit. Resubmit with a longer `--time` value.

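
To turn the `MaxRSS` figure from the Out of Memory section above into a new request, a small helper can add some headroom. This is a sketch under assumptions: `suggest_mem_mb` is a hypothetical function, the 20% margin is arbitrary, and it only accepts values with a `K`, `M`, or `G` suffix as `sacct` commonly prints them.

```shell
#!/bin/bash
# Sketch: suggest a --mem value in megabytes from a MaxRSS figure,
# adding 20% headroom (an arbitrary safety margin).

suggest_mem_mb() {
    local maxrss="$1"
    awk -v v="$maxrss" 'BEGIN {
        n = v + 0                      # numeric prefix, e.g. 3512340 from "3512340K"
        u = substr(v, length(v), 1)    # unit suffix: K, M or G
        if (u == "K") mb = n / 1024
        else if (u == "M") mb = n
        else if (u == "G") mb = n * 1024
        else exit 1                    # unknown unit: give up rather than guess
        printf "%dM\n", int(mb * 1.2) + 1
    }'
}

# Usage: pass the largest MaxRSS value reported for the job, then
# use the printed figure as the new --mem request.
#   suggest_mem_mb 3512340K
```

`MaxRSS` is reported per job step, so use the largest value across the steps of the failed job. If the job was killed early, the recorded peak may understate what the full run needs.
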
### Job stuck in Pending (PD)

Check the reason:

```bash
squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
```

Common reasons:

- **Resources**: cluster is busy, wait for nodes to free up
- **QOSMaxCpuPerUserLimit**: you've hit your CPU quota, wait for running jobs to finish
- **InvalidQOS**: wrong QOS, check `check_my_partitions`
- **ReqNodeNotAvail**: requested node is down, remove `--nodelist` constraint

## NFS / Storage Issues

### Job hangs or freezes on file operations

May indicate an NFS mount issue. Check if your home directory is accessible:

```bash
ls ~
```

If it hangs, contact HPC support. Do not kill the job manually, as it may cause further issues.

### Disk quota exceeded

```bash
bash: cannot create temp file: Disk quota exceeded
```

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.

## Module Issues

### Module not found

```bash
module avail MODULE_NAME
```

Check the exact module name. Use `module spider MODULE_NAME` for a broader search including partial matches.

## Getting Help

If you can't resolve an issue, contact HPC support at . Include:

- Your username
- Job ID
- The command you ran
- The full error message

# Security Installations

Required security software for TAU workstations and servers.

## NAC: Forescout

Network Access Control client required for connecting to the TAU network.

[ForeScoutSecureConnector\_64\_visible\_daemon.tar.gz](https://hpcguide.tau.ac.il/attachments/4)

### Installation

```bash
tar -zxvf ForeScoutSecureConnector_64_visible_daemon.tar.gz
cd secure_connector
./install.sh
systemctl start SecureConnector.service
```

## EDR: CrowdStrike Falcon

Endpoint Detection and Response client.

- [falcon-sensor\_7.18.0-17106\_amd64.deb](https://hpcguide.tau.ac.il/attachments/3) (Ubuntu 22.04 / 24.04)
- [falcon-sensor\_7.39.0-19203\_amd64.deb](https://hpcguide.tau.ac.il/attachments/6) (Ubuntu 24.04)
- [falcon-sensor-7.17.0-17005.el9.x86\_64.rpm](https://hpcguide.tau.ac.il/attachments/5) (Rocky / RHEL)

### Installation: Ubuntu

```bash
dpkg -i falcon-sensor_7.18.0-17106_amd64.deb
systemctl restart falcon-sensor.service
```

### Installation: Rocky / RHEL

```bash
rpm -ivh falcon-sensor-7.17.0-17005.el9.x86_64.rpm
systemctl restart falcon-sensor.service
```

### Registration (new installations only)

```bash
/opt/CrowdStrike/falconctl -s --cid=
```

To obtain the CID code, contact .