# Troubleshooting

Common errors and solutions for job submission and cluster usage.

## Job Submission Errors

### No partition specified

```bash
srun: error: Unable to allocate resources: No partition specified or system default partition
```

Always specify a partition with `--partition` (or `-p`). Run `check_my_partitions` to see which partitions you can use.

### Invalid account or partition

```bash
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```

The account and partition you specified are not a valid combination. Run `check_my_partitions` and use an account and a partition that are listed together.

### QOS not permitted

```bash
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
```

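The QOS you specified is not permitted for your account and partition. Run `check_my_partitions` to see valid combinations. The same message can also appear when a request exceeds a limit attached to the QOS, such as a maximum job size or run time.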
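
The three submission errors above usually come down to the `--account`, `--partition`, or `--qos` option being missing or mismatched. The following is a minimal sketch of a batch script header that sets all three explicitly. `ACCOUNT_NAME`, `PARTITION_NAME`, and `QOS_NAME` are placeholders, not real values on this cluster; replace them with a combination that `check_my_partitions` reports for you.

```bash
#!/bin/bash
# Placeholder values: replace each one with a combination that
# check_my_partitions lists for your user.
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition=PARTITION_NAME
#SBATCH --qos=QOS_NAME
#SBATCH --job-name=my_job

# Slurm sets these variables inside a running job; the fallbacks only
# apply when the script is run outside of Slurm.
echo "Job ${SLURM_JOB_ID:-unknown} on partition ${SLURM_JOB_PARTITION:-unknown}"
```

The same options can be passed on the command line to `sbatch` or `srun`, where they take precedence over the `#SBATCH` lines in the script.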

## Job Failures

### Out of Memory (OOM)

```bash
sacct -j JOBID -o JobID,JobName,State%20

JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY
```

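Your job used more memory than it was allocated. Resubmit with a higher `--mem` (memory per node) or `--mem-per-cpu` (memory per allocated CPU). To estimate how much to request, check the peak memory (`MaxRSS`) recorded for the job's steps: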

```bash
sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
```
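
As a rough guide, the largest `MaxRSS` value can be turned into a `--mem` request with some headroom. The sketch below assumes `MaxRSS` is an integer with an optional `K`, `M`, or `G` suffix (for example `3145728K`), and the 20% margin is an arbitrary choice, not a site policy. The helper names `maxrss_to_mb` and `suggest_mem_mb` are hypothetical.

```bash
# Convert a MaxRSS value such as "3145728K", "1500M" or "2G" to whole
# megabytes (rounded up). Assumes an integer with an optional K/M/G suffix.
maxrss_to_mb() {
    local value=$1 number unit
    number=${value%[KMG]}        # strip one trailing unit letter, if present
    unit=${value#"$number"}      # whatever was stripped (may be empty)
    [[ $number =~ ^[0-9]+$ ]] || return 1
    number=$((10#$number))       # force base 10 in case of leading zeros
    case "$unit" in
        K)  echo $(( (number + 1023) / 1024 )) ;;
        M)  echo "$number" ;;
        G)  echo $(( number * 1024 )) ;;
        "") echo $(( (number + 1048575) / 1048576 )) ;;   # plain bytes
    esac
}

# Add 20% headroom (an assumed margin) to the observed peak.
suggest_mem_mb() {
    local mb
    mb=$(maxrss_to_mb "$1") || return 1
    echo $(( mb * 120 / 100 ))
}

# A step that peaked at 3145728K (3072 MB) gives a suggestion of 3686 MB.
suggest_mem_mb 3145728K
```

The result could then be used as `--mem=3686M`. Because memory usage is sampled periodically, short spikes can be missed, so if the job still fails with `OUT_OF_MEMORY`, increase the margin.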

### Timeout

If the job state shows `TIMEOUT`, the job ran past its time limit. Resubmit with a longer `--time` value.
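
Slurm accepts several formats for `--time`, including `hours:minutes:seconds` and `days-hours:minutes:seconds`. The fragment below illustrates both; the durations are arbitrary examples, and only one `--time` line should be kept in a real script.

```bash
#!/bin/bash
# Keep only ONE of the --time lines below; they show two accepted formats.

# 90 minutes, written as hours:minutes:seconds
#SBATCH --time=01:30:00

# 2 days and 12 hours, written as days-hours:minutes:seconds
#SBATCH --time=2-12:00:00
```

The longest time you can request depends on the limits of your partition and QOS, so a very long request may be rejected or wait longer in the queue.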

### Job stuck in Pending (PD)

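Check the reason, shown in the last column (`NODELIST(REASON)`) of the output. Replace `username` with your own username: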

```bash
squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
```

Common reasons:

- **Resources**: the cluster is busy; wait for nodes to free up.
- **QOSMaxCpuPerUserLimit**: you have hit your CPU quota; wait for your running jobs to finish.
- **InvalidQOS**: the QOS is wrong; run `check_my_partitions` to see valid values.
- **ReqNodeNotAvail**: a requested node is down; remove the `--nodelist` constraint.
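
When many jobs are pending, it can help to count how many share each reason. This is a small sketch rather than a site-provided tool: `count_reasons` is a hypothetical helper that tallies one reason per input line, and the `squeue` call is skipped when the command is not available.

```bash
# Tally identical lines on stdin, most frequent first.
count_reasons() {
    sort | uniq -c | sort -rn
}

# On the cluster: print only the reason (%r) of your pending (PD) jobs,
# without a header line (-h), and count them.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -t PD -o "%r" | count_reasons
fi
```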

## NFS / Storage Issues

### Job hangs or freezes on file operations

This may indicate an NFS mount problem. Check whether your home directory is accessible:

```bash
ls ~
```

If the command hangs, contact HPC support. Do not kill the job manually, as that may cause further issues.

### Disk quota exceeded

```bash
bash: cannot create temp file: Disk quota exceeded
```

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.
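
To decide what to move, it helps to see which entries take the most space. The sketch below assumes GNU `du` and `sort` (standard on Linux); `largest_entries` is a hypothetical helper name. It includes hidden directories, which are sometimes where caches and package environments accumulate.

```bash
# List the largest files and directories directly inside a directory.
# Usage: largest_entries [DIR] [COUNT]   (defaults: $HOME and 10)
largest_entries() {
    local dir=${1:-$HOME} count=${2:-10}
    # -a: include files, -h: human-readable sizes, one level deep only.
    # The last line of the output is the total for DIR itself.
    du -ah --max-depth=1 "$dir" 2>/dev/null | sort -h | tail -n "$count"
}

# Example (may take a while on a large home directory):
# largest_entries "$HOME" 10
```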

## Module Issues

### Module not found

```bash
module avail MODULE_NAME
```

If a module fails to load, check its exact name and version with the command above. For a broader search that includes partial matches, use `module spider MODULE_NAME`.
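
If you are unsure of the capitalization, a case-insensitive search over the full module list can help. This is a sketch under two assumptions: that the `module` command supports the terse `-t` switch, and that it may write its listing to standard error, which is why the output is redirected before filtering. `search_modules` is a hypothetical helper name.

```bash
# Case-insensitive search through the list of available modules.
# Usage: search_modules PATTERN
search_modules() {
    local pattern=$1
    # 2>&1 captures the listing whether it goes to stdout or stderr.
    module -t avail 2>&1 | grep -i -- "$pattern"
}

# Example:
# search_modules python
```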

## Getting Help

If you can't resolve an issue, contact HPC support at <hpc@tauex.tau.ac.il>. Include:

- Your username
- Job ID
- The command you ran
- The full error message