# Reference

Quick reference guides, cheat sheets, and troubleshooting for the TAU HPC cluster.

# Common Slurm Commands

Quick reference for the most common Slurm commands on the TAU HPC cluster.

## Submitting Jobs

<table id="bkmrk-command-description-"><thead><tr><th>Command</th><th>Description</th></tr></thead><tbody><tr><td>`sbatch job.sh`</td><td>Submit a batch job script</td></tr><tr><td>`srun --pty bash`</td><td>Start an interactive session</td></tr><tr><td>`sbatch --dependency=afterok:JOBID job.sh`</td><td>Submit a job that starts only after job JOBID finishes successfully</td></tr></tbody></table>
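
As a minimal sketch of the dependency row above, the function below submits two scripts so that the second starts only if the first succeeds. It relies on `sbatch --parsable`, which prints just the job ID (followed by `;cluster` on multi-cluster setups). The function name `submit_chain` and the script names are hypothetical.

```bash
# Submit two job scripts so the second waits for the first to succeed.
# submit_chain is a hypothetical helper, not a cluster-provided command.
submit_chain() {
    local first=$1 second=$2
    local first_id

    # --parsable makes sbatch print only the job ID (plus ";cluster" if any).
    first_id=$(sbatch --parsable "$first") || return 1
    first_id=${first_id%%;*}   # drop the optional ";cluster" suffix

    # afterok: start only if the first job ends with exit code 0.
    sbatch --dependency="afterok:${first_id}" "$second"
}
```

Usage, assuming both scripts exist in the current directory: `submit_chain preprocess.sh analyze.sh`.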

## Monitoring Jobs

<table id="bkmrk-command-description--1"><thead><tr><th>Command</th><th>Description</th></tr></thead><tbody><tr><td>`squeue -u username`</td><td>Your running and pending jobs</td></tr><tr><td>`squeue`</td><td>All jobs on the cluster</td></tr><tr><td>`scontrol show job JOBID`</td><td>Full details of a job</td></tr><tr><td>`sacct -j JOBID --format=JobID,JobName,State,MaxRSS,Elapsed`</td><td>Job accounting and memory usage</td></tr><tr><td>`sattach JOBID.STEPID`</td><td>Attach to the standard I/O of a running job step</td></tr></tbody></table>

## Managing Jobs

<table id="bkmrk-command-description--2"><thead><tr><th>Command</th><th>Description</th></tr></thead><tbody><tr><td>`scancel JOBID`</td><td>Cancel a specific job</td></tr><tr><td>`scancel -u username`</td><td>Cancel all your jobs</td></tr></tbody></table>

## Cluster Information

<table id="bkmrk-command-description--3"><thead><tr><th>Command</th><th>Description</th></tr></thead><tbody><tr><td>`sinfo`</td><td>Partition and node status</td></tr><tr><td>`scontrol show partition PARTITION`</td><td>Partition details and limits</td></tr><tr><td>`check_my_partitions`</td><td>Your available partitions and accounts</td></tr><tr><td>`features`</td><td>Available node constraints/features</td></tr></tbody></table>

## Environment Modules

<table id="bkmrk-command-description--4"><thead><tr><th>Command</th><th>Description</th></tr></thead><tbody><tr><td>`module avail`</td><td>List all available modules</td></tr><tr><td>`module avail NAME`</td><td>Search for a specific module</td></tr><tr><td>`module spider NAME`</td><td>Detailed module info including dependencies</td></tr><tr><td>`module load NAME`</td><td>Load a module</td></tr><tr><td>`module list`</td><td>List loaded modules</td></tr><tr><td>`module unload NAME`</td><td>Unload a module</td></tr><tr><td>`module purge`</td><td>Unload all modules</td></tr></tbody></table>

## Common SBATCH Directives

<table id="bkmrk-directive-descriptio"><thead><tr><th>Directive</th><th>Description</th></tr></thead><tbody><tr><td>`#SBATCH --job-name=NAME`</td><td>Job name</td></tr><tr><td>`#SBATCH --account=ACCOUNT`</td><td>Account name</td></tr><tr><td>`#SBATCH --partition=PARTITION`</td><td>Partition/queue</td></tr><tr><td>`#SBATCH --qos=QOS`</td><td>Quality of Service</td></tr><tr><td>`#SBATCH --time=HH:MM:SS`</td><td>Max run time</td></tr><tr><td>`#SBATCH --ntasks=N`</td><td>Number of tasks</td></tr><tr><td>`#SBATCH --nodes=N`</td><td>Number of nodes</td></tr><tr><td>`#SBATCH --cpus-per-task=N`</td><td>CPU cores per task</td></tr><tr><td>`#SBATCH --mem-per-cpu=NG`</td><td>Memory per CPU</td></tr><tr><td>`#SBATCH --mem=NG`</td><td>Memory per node</td></tr><tr><td>`#SBATCH --gres=gpu:N`</td><td>Number of GPUs per node</td></tr><tr><td>`#SBATCH --constraint=FEATURE`</td><td>Node constraint/feature</td></tr><tr><td>`#SBATCH --array=1-N`</td><td>Job array</td></tr><tr><td>`#SBATCH --output=FILE_%j.out`</td><td>Output file (%j = job ID)</td></tr><tr><td>`#SBATCH --error=FILE_%j.err`</td><td>Error file</td></tr><tr><td>`#SBATCH --mail-user=EMAIL`</td><td>Notification email</td></tr><tr><td>`#SBATCH --mail-type=END,FAIL`</td><td>When to notify</td></tr></tbody></table>
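
The directives above are usually combined at the top of a job script. The following is a sketch only: the account, partition, and QOS values are placeholders (take the real ones from `check_my_partitions`), and the resource numbers are arbitrary. The script body falls back to defaults when the Slurm variables are unset, so it can also be dry-run with plain `bash`.

```bash
#!/bin/bash
# Placeholder values below: replace my_account, my_partition and my_qos
# with a combination reported by check_my_partitions.
#SBATCH --job-name=example
#SBATCH --account=my_account
#SBATCH --partition=my_partition
#SBATCH --qos=my_qos
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=example_%j.out
#SBATCH --error=example_%j.err

# Slurm sets these inside a job; the defaults only apply outside Slurm.
job_id=${SLURM_JOB_ID:-none}
threads=${SLURM_CPUS_PER_TASK:-1}

echo "Job ${job_id} on ${HOSTNAME:-unknown} with ${threads} CPU core(s)"
```

Keep the `#SBATCH` lines ahead of the first command: `sbatch` stops reading directives at the first line that is neither a comment nor blank.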

# PBS to Slurm Migration

Quick reference for users migrating from the old PBS/Torque system to Slurm.

## Command Equivalents

<table id="bkmrk-pbs%2Ftorque-slurm-des"><thead><tr><th>PBS/Torque</th><th>Slurm</th><th>Description</th></tr></thead><tbody><tr><td>`qsub job.sh`</td><td>`sbatch job.sh`</td><td>Submit a job</td></tr><tr><td>`qsub -I`</td><td>`srun --pty bash`</td><td>Interactive session</td></tr><tr><td>`qstat`</td><td>`squeue`</td><td>View jobs</td></tr><tr><td>`qstat -u username`</td><td>`squeue -u username`</td><td>Your jobs</td></tr><tr><td>`qdel JOBID`</td><td>`scancel JOBID`</td><td>Cancel a job</td></tr><tr><td>`pbsnodes`</td><td>`sinfo`</td><td>Node status</td></tr><tr><td>`qstat -f JOBID`</td><td>`scontrol show job JOBID`</td><td>Job details</td></tr></tbody></table>

## Directive Equivalents

<table id="bkmrk-pbs%2Ftorque-slurm-des-1"><thead><tr><th>PBS/Torque</th><th>Slurm</th><th>Description</th></tr></thead><tbody><tr><td>`#PBS -N name`</td><td>`#SBATCH --job-name=name`</td><td>Job name</td></tr><tr><td>`#PBS -q queue`</td><td>`#SBATCH --partition=partition`</td><td>Queue/partition</td></tr><tr><td>`#PBS -l nodes=1:ppn=8`</td><td>`#SBATCH --nodes=1 --cpus-per-task=8`</td><td>Nodes and cores</td></tr><tr><td>`#PBS -l mem=4gb`</td><td>`#SBATCH --mem=4G`</td><td>Memory</td></tr><tr><td>`#PBS -l walltime=02:00:00`</td><td>`#SBATCH --time=02:00:00`</td><td>Wall time</td></tr><tr><td>`#PBS -o output.log`</td><td>`#SBATCH --output=output.log`</td><td>Output file</td></tr><tr><td>`#PBS -e error.log`</td><td>`#SBATCH --error=error.log`</td><td>Error file</td></tr><tr><td>`#PBS -M email`</td><td>`#SBATCH --mail-user=email`</td><td>Email</td></tr><tr><td>`#PBS -m abe`</td><td>`#SBATCH --mail-type=ALL`</td><td>Mail events</td></tr><tr><td>`#PBS -V`</td><td>`#SBATCH --export=ALL`</td><td>Export environment</td></tr><tr><td>`#PBS -t 1-10`</td><td>`#SBATCH --array=1-10`</td><td>Job array</td></tr></tbody></table>
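
For the job array row, a common pattern is to let each array task pick its own input from a list file. The sketch below assumes array indices start at 1 so that they match line numbers. The function name `input_for_task` and the file `inputs.txt` are hypothetical, while `SLURM_ARRAY_TASK_ID` is the standard variable Slurm sets for each array task.

```bash
# Print the line of a list file that belongs to the current array task.
# Assumes "#SBATCH --array=1-N" with N equal to the number of lines.
input_for_task() {
    local list_file=$1
    # Use an explicit index if given, otherwise the one Slurm provides;
    # abort with a message when neither is available.
    local task_id=${2:-${SLURM_ARRAY_TASK_ID:?not running inside a job array}}

    sed -n "${task_id}p" "$list_file"
}
```

Inside a job script this could be used as `input=$(input_for_task inputs.txt)`.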

## Environment Variables

<table id="bkmrk-pbs%2Ftorque-slurm-des-2"><thead><tr><th>PBS/Torque</th><th>Slurm</th><th>Description</th></tr></thead><tbody><tr><td>`$PBS_JOBID`</td><td>`$SLURM_JOB_ID`</td><td>Job ID</td></tr><tr><td>`$PBS_JOBNAME`</td><td>`$SLURM_JOB_NAME`</td><td>Job name</td></tr><tr><td>`$PBS_NODEFILE`</td><td>`$SLURM_JOB_NODELIST`</td><td>Node list (PBS gives a file path; Slurm gives a compact hostlist string, not a file)</td></tr><tr><td>`$PBS_ARRAYID`</td><td>`$SLURM_ARRAY_TASK_ID`</td><td>Array task ID</td></tr><tr><td>`$PBS_NP`</td><td>`$SLURM_NTASKS`</td><td>Number of tasks</td></tr><tr><td>`$PBS_O_WORKDIR`</td><td>`$SLURM_SUBMIT_DIR`</td><td>Submission directory</td></tr></tbody></table>
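
Programs that read `$PBS_NODEFILE` expect a file with one hostname per line, whereas `$SLURM_JOB_NODELIST` holds a compact string. One possible bridge, sketched below, expands the string with `scontrol show hostnames` (a standard Slurm subcommand) into a temporary file. The function name `make_nodefile` is hypothetical, and unlike a PBS node file the result lists each host once rather than once per core.

```bash
# Write the hosts of the current job to a temporary file, one per line,
# and print the file's path. make_nodefile is a hypothetical helper.
make_nodefile() {
    local nodefile
    nodefile=$(mktemp) || return 1

    # Expands e.g. a bracketed range into individual hostnames.
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "$nodefile" || return 1

    echo "$nodefile"
}
```

In a job script: `nodefile=$(make_nodefile)`, then pass `"$nodefile"` wherever `$PBS_NODEFILE` was used before.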

## Key Differences

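- Slurm requires `--account` and `--qos`; run `check_my_partitions` to find yours
- `ppn` has no single equivalent: use `--cpus-per-task` for the cores of a multithreaded task, or `--ntasks-per-node` for MPI ranks
- Memory in Slurm is requested per CPU (`--mem-per-cpu`) or per node (`--mem`)
- Unlike PBS, Slurm starts a batch job in the directory it was submitted from, so the usual `cd $PBS_O_WORKDIR` line needs no replacement; `$SLURM_SUBMIT_DIR` is still available if your script changes directory and has to return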

# Troubleshooting

Common errors and solutions for job submission and cluster usage.

## Job Submission Errors

### No partition specified

```bash
srun: error: Unable to allocate resources: No partition specified or system default partition
```

Always specify a partition. Run `check_my_partitions` to find yours.

### Invalid account or partition

```bash
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```

Your account and partition combination is incorrect. Run `check_my_partitions` and make sure both match.

### QOS not permitted

```bash
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
```

The QOS you specified doesn't match your account/partition. Run `check_my_partitions` to see valid combinations.

## Job Failures

### Out of Memory (OOM)

```bash
sacct -j JOBID -o JobID,JobName,State%20

JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY
```

Your job used more RAM than allocated. Resubmit with a higher `--mem` or `--mem-per-cpu`. To estimate needed memory:

```bash
sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
```
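
To turn a reported `MaxRSS` into a new request, some headroom is normally added. The helper below is a rough sketch: it assumes the value is given in kibibytes with a `K` suffix (for example from `sacct --units=K`), and the 20% margin is an arbitrary assumption rather than a site recommendation. The function name `suggest_mem` is hypothetical.

```bash
# Convert a MaxRSS value such as "3500000K" into a --mem value in MiB,
# adding 20% headroom and rounding up.
suggest_mem() {
    local maxrss=$1
    if [[ ! $maxrss =~ ^[0-9]+K$ ]]; then
        echo "expected a value like 3500000K" >&2
        return 1
    fi
    local kib=${maxrss%K}

    # 10# forces base 10 so leading zeros are not read as octal.
    echo "$(( (10#$kib * 12 / 10 + 1023) / 1024 ))M"
}
```

For example, `suggest_mem 1048576K` prints `1229M`, which could then be used as `--mem=1229M`.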

### Timeout

Job state shows `TIMEOUT` — your job exceeded the time limit. Resubmit with a longer `--time` value.

### Job stuck in Pending (PD)

Check the reason:

```bash
squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
```

Common reasons:

- **Resources** — cluster is busy, wait for nodes to free up
- **QOSMaxCpuPerUserLimit** — you've hit your CPU quota, wait for running jobs to finish
- **InvalidQOS** — wrong QOS, check `check_my_partitions`
- **ReqNodeNotAvail** — requested node is down, remove `--nodelist` constraint
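
When many jobs are queued, it can help to count pending jobs by reason instead of reading the full listing. The sketch below uses standard `squeue` options (`--states`, `--noheader`, and the `%r` reason field); the function name `pending_reasons` is hypothetical.

```bash
# Count a user's pending jobs grouped by the reason Slurm reports.
pending_reasons() {
    local user=${1:-$USER}

    # %r prints only the reason each job is waiting.
    squeue -u "$user" --states=PENDING --noheader --format="%r" \
        | sort | uniq -c
}
```

Run `pending_reasons` for your own jobs, or `pending_reasons username` for another user.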

## NFS / Storage Issues

### Job hangs or freezes on file operations

This may indicate an NFS mount issue. Check whether your home directory is accessible:

```bash
ls ~
```

If it hangs, contact HPC support — do not kill the job manually as it may cause further issues.

### Disk quota exceeded

```bash
bash: cannot create temp file: Disk quota exceeded
```

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.
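
To see what is taking up the space before moving anything, a small wrapper around `du` can list the largest entries in a directory. This is a sketch using only standard tools; the function name `largest_entries` is hypothetical, and hidden files (names starting with a dot) are not covered by the `*` pattern.

```bash
# List the largest files and directories directly under a directory,
# biggest first, with sizes in KiB. Defaults: $HOME, top 10.
largest_entries() {
    local dir=${1:-$HOME}
    local count=${2:-10}

    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries ~ 5` shows the five largest items in your home directory.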

## Module Issues

### Module not found

```bash
module avail MODULE_NAME
```

Check the exact module name. Use `module spider MODULE_NAME` for a broader search including partial matches.

## Getting Help

If you can't resolve an issue, contact HPC support at <hpc@tauex.tau.ac.il>. Include:

- Your username
- Job ID
- The command you ran
- The full error message

# Security Installations

Required security software for TAU workstations and servers.

## NAC — Forescout

Network Access Control client required for connecting to the TAU network.

[ForeScoutSecureConnector\_64\_visible\_daemon.tar.gz](https://hpcguide.tau.ac.il/attachments/4)

### Installation

```bash
tar -zxvf ForeScoutSecureConnector_64_visible_daemon.tar.gz
cd secure_connector
./install.sh
systemctl start SecureConnector.service
```

## EDR — CrowdStrike Falcon

Endpoint Detection and Response client.

- [falcon-sensor\_7.18.0-17106\_amd64.deb](https://hpcguide.tau.ac.il/attachments/3): Ubuntu 22.04 / 24.04
- [falcon-sensor-7.17.0-17005.el9.x86\_64.rpm](https://hpcguide.tau.ac.il/attachments/5): Rocky / RHEL

### Installation — Ubuntu

```bash
dpkg -i falcon-sensor_7.18.0-17106_amd64.deb
systemctl restart falcon-sensor.service
```

### Installation — Rocky / RHEL

```bash
rpm -ivh falcon-sensor-7.17.0-17005.el9.x86_64.rpm
systemctl restart falcon-sensor.service
```

### Registration (new installations only)

```bash
/opt/CrowdStrike/falconctl -s --cid=<cid-code>
```

To obtain the CID code, contact <infosec@tauex.tau.ac.il>.