Troubleshooting

Common errors and solutions for job submission and cluster usage.

Job Submission Errors

No partition specified

srun: error: Unable to allocate resources: No partition specified or system default partition

Always specify a partition. Run check_my_partitions to find yours.

Invalid account or partition

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

The account and partition you specified are not a valid combination. Run check_my_partitions and make sure both values come from the same listed combination.

QOS not permitted

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy

The QOS you specified doesn't match your account/partition. Run check_my_partitions to see valid combinations.
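The submission errors in this section all come down to the partition, account, and QOS options. As a minimal sketch, a batch script can set all three together with #SBATCH directives. The names my_partition, my_account, and my_qos are hypothetical placeholders, and the time and memory values are arbitrary; substitute a combination reported by check_my_partitions.

```shell
#!/bin/bash
# Sketch of a batch script header. The three names below are hypothetical
# placeholders; use a combination reported by check_my_partitions.
#SBATCH --partition=my_partition
#SBATCH --account=my_account
#SBATCH --qos=my_qos
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Print where the job landed, using environment variables that Slurm sets
# inside a job. Outside a job they are unset, so "unknown" is printed.
job_banner() {
    echo "Job ${SLURM_JOB_ID:-unknown} running in partition ${SLURM_JOB_PARTITION:-unknown}"
}

job_banner
```

The same options can be passed on the command line instead, for example to srun, as --partition, --account, and --qos.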

Job Failures

Out of Memory (OOM)

sacct -j JOBID -o JobID,JobName,State%20

JobID    JobName              State
-------- -------------------- --------------------
71       my_job               OUT_OF_MEMORY

Your job used more RAM than it was allocated. Resubmit with a higher --mem or --mem-per-cpu. To estimate the memory needed, check the peak memory use (MaxRSS) of the failed job or of a similar completed job:

sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed
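One way to turn the MaxRSS value from the sacct command above into a new request is to convert it to megabytes and add some headroom. The sketch below assumes MaxRSS is printed with a K, M, G, or T suffix, and the 20% margin is an arbitrary starting point rather than a site recommendation. The helper names to_mb and suggest_mem_mb are made up for this example.

```shell
#!/bin/bash
# Convert a MaxRSS value as printed by sacct (e.g. "2048K", "1.5G") to whole
# megabytes, rounding up. A value with no suffix is assumed to be bytes.
to_mb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0                      # numeric prefix of the value
        if (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else mb = num / 1024 / 1024      # assumption: plain bytes
        printf "%d\n", ((mb == int(mb)) ? mb : int(mb) + 1)
    }'
}

# Suggest a --mem value in MB: the peak plus 20% headroom, rounded up.
suggest_mem_mb() {
    local peak_mb
    peak_mb="$(to_mb "$1")"
    echo $(( (peak_mb * 120 + 99) / 100 ))
}

# Example with a made-up peak of 1048576K (1 GiB):
echo "Try --mem=$(suggest_mem_mb 1048576K)M"
```

MaxRSS is reported per job step, so use the largest value across the steps listed for the job.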

Timeout

A job state of TIMEOUT means your job exceeded its time limit. Resubmit with a longer --time value.
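To judge how much longer a job needs, it can help to compare the Elapsed and Timelimit fields that sacct reports. The sketch below converts those time strings to seconds so they can be compared. It assumes the [DD-]HH:MM:SS or MM:SS forms, the function name slurm_time_to_seconds is invented for this example, and the sample times are made up.

```shell
#!/bin/bash
# On the cluster, the two values can be listed with (JOBID is a placeholder):
#   sacct -j JOBID --format=JobID,Elapsed,Timelimit,State

# Convert a time string such as "1-02:03:04", "02:03:04" or "03:04" to seconds.
# Anything else (for example "UNLIMITED") is rejected with a non-zero status.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else if (NF == 3) print $1 * 3600 + $2 * 60 + $3
        else if (NF == 2) print $1 * 60 + $2
        else { print "unrecognised time: " $0 > "/dev/stderr"; exit 1 }
    }'
}

# Made-up sample values: a job that ran almost to a one-hour limit.
elapsed="$(slurm_time_to_seconds "00:59:58")"
limit="$(slurm_time_to_seconds "01:00:00")"
echo "used $(( elapsed * 100 / limit ))% of the time limit"
```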

Job stuck in Pending (PD)

Check the reason:

squeue -u username -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
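For a single job, the reason can also be read from the output of scontrol show job, which contains a Reason= field. The sketch below pulls out just that field. The function name pending_reason is invented for this example, and the sample line only imitates the scontrol output format.

```shell
#!/bin/bash
# Read "scontrol show job" output on stdin and print the value of Reason=.
pending_reason() {
    grep -o 'Reason=[^ ]*' | head -n 1 | cut -d= -f2
}

# On the cluster (JOBID is a placeholder):
#   scontrol show job JOBID | pending_reason

# Offline demonstration with an imitation of one scontrol output line:
echo "   JobState=PENDING Reason=Priority Dependency=(null)" | pending_reason
```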

Common reasons:

NFS / Storage Issues

Job hangs or freezes on file operations

This may indicate an NFS mount issue. Check whether your home directory is accessible:

ls ~

If the command hangs, contact HPC support. Do not kill the job manually, as that may cause further issues.
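Because a plain ls can itself freeze your shell on a stuck mount, the check can be wrapped in timeout so that it gives up after a few seconds. This is a sketch only: check_dir_responsive is a hypothetical helper name, the 10-second default is arbitrary, and on a badly hung mount even the timed-out process may take a while to exit.

```shell
#!/bin/bash
# Report whether a directory can be listed within a time limit (default 10 s).
# Returns 0 on success, 1 if the listing timed out or failed for any reason.
check_dir_responsive() {
    local dir="$1" limit="${2:-10}"
    # -k 2: send SIGKILL if ls is still running 2 s after the initial signal.
    if timeout -k 2 "$limit" ls -- "$dir" > /dev/null 2>&1; then
        echo "OK: $dir responded"
    else
        echo "PROBLEM: $dir did not respond within ${limit}s (or could not be listed)"
        return 1
    fi
}

# Example: check_dir_responsive "$HOME" 10
```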

Disk quota exceeded

bash: cannot create temp file: Disk quota exceeded

Your home directory is full. Move large files to scratch space or contact HPC support for a quota increase.
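Before moving files, it helps to see which directories take the most space. The sketch below lists the largest entries directly under a directory, including hidden ones such as cache directories. The function name largest_entries is invented for this example; sizes are in KiB as reported by du -k.

```shell
#!/bin/bash
# List the largest entries directly under a directory (default: $HOME),
# biggest first. Arguments: directory, number of entries to show (default 10).
largest_entries() {
    local dir="${1:-$HOME}" count="${2:-10}"
    # "$dir"/*      matches regular names
    # "$dir"/.[!.]* matches hidden names, skipping "." and ".."
    # Errors from patterns that match nothing are discarded.
    du -sk -- "$dir"/* "$dir"/.[!.]* 2>/dev/null | sort -rn | head -n "$count"
}

# Example: largest_entries "$HOME" 10
```

On a large home directory this can take a while, since du has to walk every file.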

Module Issues

Module not found

module avail MODULE_NAME

Check the exact module name. Use module spider MODULE_NAME for a broader search including partial matches.
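On many installations module avail writes its listing to standard error, so piping it straight into grep finds nothing. The sketch below redirects that output so it can be filtered case-insensitively. The function name find_module is invented for this example, and whether the redirect is needed depends on how the module system is configured on the cluster.

```shell
#!/bin/bash
# Search the list of available modules for a case-insensitive pattern.
# Returns 1 if the module command is missing or nothing matches.
find_module() {
    local pattern="$1"
    if ! type module > /dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    # 2>&1 merges standard error into the pipe so grep can see the listing.
    module avail 2>&1 | grep -i -- "$pattern"
}

# Example (MODULE_NAME is a placeholder): find_module MODULE_NAME
```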

Getting Help

If you can't resolve an issue, contact HPC support at hpc@tauex.tau.ac.il. Include:


Created 2026-06-14 08:41:36 UTC by levk
Updated 2026-06-14 08:42:04 UTC by levk