New slurm qos usage

From HPC Guide
Revision as of 06:45, 23 October 2025 by Dvory (talk | contribs)

A ChatGPT page in HPC-helper-toolkit explains all of this.

Each partition (or “pool”) now has several QoS tiers that determine job priority and preemption behavior.

QoS types for each pool

QoS | Purpose | Preempts | Can be preempted by
Share-type QoS (e.g. 0.125_48c_8g, 0.75_48c_8g) | For multi-owner pools; defines each owner’s guaranteed slice (CPU/GPU portion). | owner, public | --
owner | Used on your lab’s pool to run above your guaranteed slice (higher priority than public). | public | share-type QoS
public (partition: power-general-shared-pool) | Used on cluster-wide shared pools for friendly or opportunistic runs. | -- | owner, share-type QoS
public (partition: power-general-public-pool) | Used on a small, cluster-wide shared group of nodes; jobs here are not preemptable. | -- | --

Preemption rule summary: share-type QoS > owner > public

This means:

• A share-type QoS job can preempt owner or public jobs on the same pool.

• An owner job can preempt public jobs.

• Public jobs cannot preempt any other jobs.
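
The ordering above can be sketched as a small shell helper. This is only an illustration of the rule, not part of Slurm or of the cluster configuration: the function names qos_rank and can_preempt are hypothetical, and the sketch assumes that share-type QoS names always start with a digit (as in 0.125_48c_8g). It does not model the protected power-general-public-pool, where public jobs are never preempted.

```shell
#!/bin/sh
# Rank each QoS tier; a higher rank may preempt a lower one on the same pool.
# Assumption: share-type QoS names start with a digit (e.g. 0.125_48c_8g).
qos_rank() {
    case "$1" in
        public) echo 1 ;;
        owner)  echo 2 ;;
        [0-9]*) echo 3 ;;
        *)      echo 0 ;;   # unknown QoS name
    esac
}

# Print "yes" if a job with QoS $1 can preempt a job with QoS $2, else "no".
can_preempt() {
    preemptor=$(qos_rank "$1")
    victim=$(qos_rank "$2")
    if [ "$victim" -gt 0 ] && [ "$preemptor" -gt "$victim" ]; then
        echo yes
    else
        echo no
    fi
}

can_preempt 0.125_48c_8g owner   # share-type over owner
can_preempt owner public         # owner over public
can_preempt public owner         # public never preempts
```

Run as written, the three calls print yes, yes and no, matching the three bullets above.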


How to Submit Jobs with the Correct QoS

Below are examples of how to use the new QoS tiers with your account:

Owner QoS (on your lab’s pool)

sbatch -A UIDHERE-users_v2 -p UIDHERE-pool --qos=owner --time=02:00:00 run.sh

Share-type QoS (on a multi-owner pool, for your guaranteed slice)

sbatch -A UIDHERE-users_v2 -p gpu-dudu-tzach-yoav-pool --qos=0.125_48c_8g --gres=gpu:A100:1 run.sh

Public QoS (friendly, cluster-wide)

sbatch -A UIDHERE-users_v2 -p power-general-shared-pool --qos=public --time=01:00:00 run.sh

Public QoS (on the small, protected CPU pool)

sbatch -A UIDHERE-users_v2 -p power-general-public-pool --qos=public --time=01:00:00 run.sh
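
The same options can also be written into the batch script as #SBATCH directives, so that a plain "sbatch run.sh" picks them up. Below is a minimal sketch of such a run.sh, assuming the shared-pool public example above. The job_info helper is hypothetical and only logs where the job landed; SLURM_JOB_PARTITION and SLURM_JOB_QOS are set by Slurm inside a running job.

```shell
#!/bin/bash
#SBATCH --account=UIDHERE-users_v2
#SBATCH --partition=power-general-shared-pool
#SBATCH --qos=public
#SBATCH --time=01:00:00

# Print the partition and QoS the job is running under.
# Falls back to "unset" when the script is run outside Slurm.
job_info() {
    echo "partition=${SLURM_JOB_PARTITION:-unset} qos=${SLURM_JOB_QOS:-unset}"
}

job_info
# ... the actual workload goes here ...
```

Options given on the sbatch command line take precedence over the #SBATCH lines, so the same script can still be submitted with a different --qos when needed.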

Handy Checks During Usage

You can monitor your jobs, including their QoS, state, and the reason for that state, with squeue; sprio -w shows the configured priority weights:

squeue --me -O "JOBID,ACCOUNT,PARTITION,QOS,STATE,REASON"
sprio -w

If your job was preempted, check:

sacct -j <jobid> --format=JobID,State,Reason
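
When checking many jobs at once, the sacct output can be filtered instead of read by eye. The following is a rough sketch, assuming that preempted jobs are recorded with State PREEMPTED (this depends on how preemption is configured on the cluster). The preempted_jobs function is a hypothetical helper; it reads sacct's pipe-delimited output on stdin.

```shell
#!/bin/sh
# Read "JobID|State|Reason" lines on stdin and print the JobID of every
# entry whose State is exactly PREEMPTED.
preempted_jobs() {
    awk -F'|' '$2 == "PREEMPTED" { print $1 }'
}

# Usage on the cluster (-P gives "|"-separated fields, -n drops the header):
#   sacct -j <jobid> --format=JobID,State,Reason -P -n | preempted_jobs
```

An empty result means no entry for that job was recorded as PREEMPTED.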