Difference between revisions of "Submitting a job to a slurm queue"
Revision as of 14:04, 17 January 2024
SLURM (Simple Linux Utility for Resource Management) is a job scheduler used on many high-performance computing systems. It manages and allocates resources such as compute nodes and controls job execution.
Accessing the System
To submit jobs to the SLURM scheduler at Tel Aviv University, you must access the system through one of the designated login nodes. These nodes act as the gateway for submitting and managing your SLURM jobs. The available login nodes are:
powerslurm-login.tau.ac.il
powerslurm-login2.tau.ac.il
Login Requirements:
- Membership in the "power" group: Make sure you are a member of the "power" group, which grants the permissions needed to access the HPC resources.
- University Credentials: Log in using your Tel Aviv University credentials. This secures access and ensures that your job submissions are accounted for under your user profile.
Remember, these login nodes are the initial point of contact for all your job management tasks, including job submission, monitoring, and other SLURM-related operations.
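The login requirements above can be checked from a shell once you are connected. The following is a minimal sketch, not an official procedure: the helper name has_group is hypothetical, and the ssh line is shown only as a comment with a placeholder username.

```shell
#!/bin/bash
# Connect first (placeholder username, shown as a comment only):
#   ssh your_username@powerslurm-login.tau.ac.il

# Hypothetical helper: succeeds if the first argument appears among the rest.
has_group() {
    local wanted="$1"; shift
    local g
    for g in "$@"; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# id -nG prints the names of all groups the current user belongs to.
if has_group power $(id -nG); then
    echo "You are in the 'power' group."
else
    echo "You are not in the 'power' group; request membership before submitting jobs." >&2
fi
```

The check only reads local group information, so it is safe to run on any machine; the result is meaningful only on the login nodes themselves.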
Basic Job Submission Commands
- sbatch: Submit a batch job script.
  - Example: sbatch --ntasks=1 --time=10 pre_process.bash
  - This submits pre_process.bash with 1 task and a 10-minute time limit.
  - Example of chaining jobs: sbatch --ntasks=128 --time=60 --dependency=afterok:45001 do_work.bash
  - Here do_work.bash starts only after job 45001 has completed successfully.
- salloc: Allocate resources for an interactive session. The given command (here a shell) runs once the allocation is granted, and job steps are then launched inside it with srun.
  - Example: salloc --ntasks=8 --time=10 bash
- srun: Launch a "job step", either interactively or inside an existing allocation. It is commonly used to start parallel (MPI) tasks.
  - Example: srun --ntasks=2 --label hostname
  - With MPI: srun --ntasks=2 --label hostname (replace hostname with your MPI executable)
- sattach: Attach stdin/out/err to an existing job or job step.
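Job chaining with sbatch can be scripted so that the job ID does not have to be copied by hand. The sketch below assumes the two scripts from the examples above (pre_process.bash and do_work.bash) and uses the afterok dependency type; the function name submit_chain is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: submit two scripts so that the second one starts
# only if the first finishes successfully.
submit_chain() {
    local first_script="$1" second_script="$2"
    local first_id

    # --parsable makes sbatch print just the job ID
    # (followed by ";cluster_name" on multi-cluster setups).
    first_id=$(sbatch --parsable --ntasks=1 --time=10 "$first_script") || return 1
    first_id=${first_id%%;*}   # drop an optional ";cluster_name" suffix

    # The second job waits for the first to end with exit code 0.
    sbatch --parsable --ntasks=128 --time=60 \
        --dependency="afterok:${first_id}" "$second_script"
}

# Usage on a login node:
#   submit_chain pre_process.bash do_work.bash
```

If the first job fails, the dependent job stays pending with an unsatisfied dependency and can be removed with scancel.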
Interactive Job Examples
- Opening a bash shell: srun --ntasks=56 --pty bash
- Specifying compute nodes: srun --ntasks=56 -p gcohen_2018 --nodelist="compute-0-12" --pty bash
Script Example:
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --account=my_account # Account name for billing
#SBATCH --partition=long # Partition name
#SBATCH --time=02:00:00 # Time allotted for the job (hh:mm:ss)
#SBATCH --ntasks=4 # Number of tasks (processes)
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
#SBATCH --mem-per-cpu=4G # Memory per CPU core
#SBATCH --output=my_job_%j.out # Standard output log (%j expands to the job ID)
#SBATCH --error=my_job_%j.err # Separate file for standard error
# Load modules or software if required
# module load python/3.8
# Print some information about the job
echo "Starting my SLURM job"
echo "Job ID: $SLURM_JOB_ID"
echo "Running on nodes: $SLURM_JOB_NODELIST"
echo "Allocated CPUs: $SLURM_JOB_CPUS_PER_NODE"
# Run your application, this could be anything from a custom script to standard applications
# ./my_program
# python my_script.py
# End of script
echo "Job completed"
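For multithreaded programs, the --cpus-per-task value requested in the script above is often passed on to the application. The following sketch assumes an OpenMP-style program that reads OMP_NUM_THREADS; the helper name threads_for_task is hypothetical, and SLURM_CPUS_PER_TASK is only set by SLURM when --cpus-per-task is given.

```shell
#!/bin/bash
# Hypothetical helper: report the CPUs allocated per task,
# falling back to 1 when running outside a SLURM allocation.
threads_for_task() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Tell OpenMP-style programs how many threads they may use.
export OMP_NUM_THREADS="$(threads_for_task)"
echo "Using $OMP_NUM_THREADS thread(s) per task"

# The application would be started here, for example:
#   ./my_program
```

Placed before the application line of a batch script, this keeps the thread count in step with the #SBATCH request instead of hard-coding it twice.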
Error Handling
- On some clusters, specifying resources is necessary. Without them, the job may fail.
  - Example error: srun: error: Unable to allocate resources: No partition specified or system default partition
  - Correct usage: srun --pty -c 1 --mem=2G -p power-yoren /bin/bash
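One way to avoid the "No partition specified" error is to wrap the interactive command in a small function that refuses to run without a partition. This is a sketch only: the function name interactive_shell and its defaults (1 CPU, 2G of memory, taken from the example above) are assumptions.

```shell
#!/bin/bash
# Hypothetical wrapper: open an interactive shell on a compute node,
# insisting that a partition is named explicitly.
interactive_shell() {
    local partition="${1:-}"
    if [ -z "$partition" ]; then
        echo "usage: interactive_shell <partition> [cpus] [mem]" >&2
        return 2
    fi
    local cpus="${2:-1}" mem="${3:-2G}"
    srun --pty -c "$cpus" --mem="$mem" -p "$partition" /bin/bash
}

# Usage on a login node:
#   interactive_shell power-yoren          # 1 CPU, 2G
#   interactive_shell power-yoren 4 8G     # 4 CPUs, 8G
```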
SLURM Information Commands
- sinfo: View all queues (partitions).
- squeue: View all jobs.
- scontrol show partition: View all partitions.
- scontrol show job <job_number>: View a job's attributes.
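The information commands can also be combined in scripts. As an illustration, the sketch below extracts the state of a single job from the output of scontrol show job; the function name job_state is hypothetical, and the parsing assumes the usual Key=Value layout of that output.

```shell
#!/bin/bash
# Hypothetical helper: print the state (e.g. PENDING, RUNNING, COMPLETED)
# of the job whose ID is given as the first argument.
job_state() {
    scontrol show job "$1" | sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p'
}

# Usage on a login node:
#   job_state 45001
#   squeue -u "$USER"     # list only your own jobs
```

Note that scontrol keeps information about finished jobs only for a short time, so the function prints nothing for jobs that ended long ago.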
Tips for Managing SLURM Jobs
- Chain jobs by using the --dependency flag in sbatch.
- Use salloc for interactive jobs that require specific resources for a limited time.
- srun is versatile for both interactive and batch jobs, especially with MPI.
- Always specify the necessary resources on clusters where defaults are not set.