Submitting a job to a slurm queue
Latest revision as of 15:03, 29 September 2024
Accessing the System
To submit jobs to SLURM at Tel Aviv University, you need to access the system through one of the following login nodes:
- powerslurm-login.tau.ac.il
- powerslurm-login2.tau.ac.il
Requirements for Access
- Group Membership: You must be part of the "power" group to access the resources.
- University Credentials: Use your Tel Aviv University username and password to log in.
These login nodes are your starting point for submitting jobs, checking job status, and managing your SLURM tasks.
SSH Example
To access the system using SSH, use the following example:
# Replace 'your_username' with your actual Tel Aviv University username
ssh your_username@powerslurm-login.tau.ac.il
If you want to connect to the second login node, use:
# Replace 'your_username' with your actual Tel Aviv University username
ssh your_username@powerslurm-login2.tau.ac.il
If you have an SSH key set up for password-less login, you can specify it like this:
# Replace 'your_username' and '/path/to/your/private_key' accordingly
ssh -i /path/to/your/private_key your_username@powerslurm-login.tau.ac.il
Environment Modules
Environment Modules allow users to dynamically modify their shell environment, providing an easy way to load and unload different software applications, libraries, and their dependencies. The module system is separate from SLURM but is used alongside it on the cluster. It helps avoid conflicts between software versions and ensures the correct environment for running specific applications.
Here are some common commands to work with environment modules:
#List Available Modules: To see all the modules available on the system, use:
module avail
#Search for a Module: To list the available versions of a specific module by name (e.g., `gcc`), use:
module avail gcc
#Get Detailed Information About a Module: The `module spider` command provides detailed information about a module, including versions, dependencies, and descriptions:
module spider gcc/gcc-12.1.0
#View Module Settings: To see what environment variables and settings will be modified by a module, use:
module show gcc/gcc-12.1.0
#Load a Module: To set up the environment for a specific software, use the `module load` command. For example, to load GCC version 12.1.0:
module load gcc/gcc-12.1.0
#List Loaded Modules: To view all currently loaded modules in your session, use:
module list
#Unload a Module: To unload a specific module from your environment, use:
module unload gcc/gcc-12.1.0
#Unload All Modules: If you need to clear your environment of all loaded modules, use:
module purge
By using these commands, you can easily manage the software environments needed for different tasks, ensuring compatibility and reducing potential conflicts between software versions.
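Inside a job script, the purge and load commands above are often combined so that the job always starts from a known environment. The following is a minimal sketch; the function name load_clean_env is hypothetical, and it assumes the module command is already available in the shell that runs the script.

```shell
#!/bin/bash
# Hypothetical helper: clear the environment, then load each requested module in order.
# Assumes the `module` command is already initialised in this shell.
load_clean_env() {
    module purge || return 1
    local mod
    for mod in "$@"; do
        # Stop at the first module that fails to load.
        module load "$mod" || { echo "failed to load $mod" >&2; return 1; }
    done
    # Show what ended up loaded, so it is recorded in the job output.
    module list
}

# Example call inside a job script:
# load_clean_env gcc/gcc-12.1.0
```

Printing the module list at the end records the environment in the job's output file, which helps when comparing runs later.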
Basic Job Submission Commands
Finding Your Account and Partition
Before submitting a job, you need to know which partitions you have permission to use.
Run the command `check_my_partitions` to view a list of all the partitions you have permission to send jobs to.
Submitting Jobs
sbatch: Submits a job script for batch processing.
Example:
sbatch --ntasks=1 --time=10 -p power-general -A power-general-users pre_process.bash
# This command submits pre_process.bash to the power-general partition with a time limit of 10 minutes.
# With 1 GPU:
sbatch --gres=gpu:1 -p gpu-general -A gpu-general-users gpu_job.sh
Writing SLURM Job Scripts
Here is a simple job script example:
Basic Script
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --account=power-general-users # Account name
#SBATCH --partition=power-general # Partition name
#SBATCH --time=02:00:00 # Max run time (hh:mm:ss)
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=1 # CPUs per task
#SBATCH --mem-per-cpu=4G # Memory per CPU
#SBATCH --output=my_job_%j.out # Output file
#SBATCH --error=my_job_%j.err # Error file
echo "Starting my SLURM job"
echo "Job ID: $SLURM_JOB_ID"
echo "Running on nodes: $SLURM_JOB_NODELIST"
echo "Allocated CPUs: $SLURM_JOB_CPUS_PER_NODE"
# Your application commands go here
# ./my_program
echo "Job completed"
Script for 1 GPU
#!/bin/bash
#SBATCH --job-name=gpu_job # Job name
#SBATCH --account=my_account # Account name
#SBATCH --partition=gpu-general # Partition name
#SBATCH --time=02:00:00 # Max run time
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=1 # CPUs per task
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem-per-cpu=4G # Memory per CPU
#SBATCH --output=my_job_%j.out # Output file
#SBATCH --error=my_job_%j.err # Error file
module load python/python-3.8
echo "Starting GPU job"
echo "Job ID: $SLURM_JOB_ID"
echo "Running on nodes: $SLURM_JOB_NODELIST"
echo "Allocated CPUs: $SLURM_JOB_CPUS_PER_NODE"
# Your GPU commands go here
echo "Job completed"
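To confirm that a GPU job received the GPUs it requested, the script can report how many devices are visible. The sketch below assumes that SLURM exports CUDA_VISIBLE_DEVICES as a comma-separated list when GPUs are allocated through --gres, which is common but depends on the cluster configuration. The function name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: the variable holds a comma-separated list such as "0,1".
count_visible_gpus() {
    local devs=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devs" ]; then
        echo 0
        return
    fi
    local IFS=,
    # Split the list on commas into positional parameters and count them.
    set -- $devs
    echo $#
}

# Example call inside the GPU job script:
# echo "Visible GPUs: $(count_visible_gpus)"
```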
Importance of Correct RAM Usage in Jobs
When writing SLURM job scripts, it's crucial to understand and correctly specify the memory requirements for your job.
Proper memory allocation ensures efficient resource usage and prevents job failures due to out-of-memory (OOM) errors.
Why Correct RAM Usage Matters
- Resource Efficiency: Allocating the right amount of memory helps in optimal resource utilization, allowing more jobs to run simultaneously on the cluster.
- Job Stability: Underestimating memory requirements can lead to OOM errors, causing your job to fail and waste computational resources.
- Performance: Overestimating memory needs can lead to underutilization of resources, potentially delaying other jobs in the queue.
How to Specify Memory in SLURM
- --mem: Specifies the memory required per node.
- --mem-per-cpu: Specifies the memory required per allocated CPU.
Use one option or the other; SLURM treats them as mutually exclusive.
Example:
#SBATCH --mem=4G # Memory per node
or:
#SBATCH --mem-per-cpu=2G # Memory per CPU
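As a rough worked example, a request made with --mem-per-cpu grows with the number of CPUs the job asks for. The sketch below multiplies tasks, CPUs per task, and memory per CPU to estimate the total. The function name total_mem_gb is hypothetical, and the calculation assumes the job is allocated exactly the CPUs it requests.

```shell
#!/bin/bash
# Hypothetical helper: estimate the total memory (in GB) implied by --mem-per-cpu.
# Arguments: number of tasks, CPUs per task, memory per CPU in GB.
total_mem_gb() {
    local ntasks=$1 cpus_per_task=$2 mem_per_cpu_gb=$3
    echo $(( ntasks * cpus_per_task * mem_per_cpu_gb ))
}

# 1 task with 4 CPUs at 2G per CPU
total_mem_gb 1 4 2   # prints 8
```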
Interactive Jobs
#Start an interactive session:
srun --ntasks=1 -p power-general -A power-general-users --pty bash
#Specify a compute node:
srun --ntasks=1 -p power-general -A power-general-users --nodelist="compute-0-12" --pty bash
#Using GUI:
srun --ntasks=1 -p power-general -A power-general-users --x11 --pty /bin/bash
Submitting RELION Jobs
To submit a RELION job interactively on the gpu-relion queue with X11 forwarding, use the following steps:
#Start an interactive session with X11:
srun --ntasks=1 -p gpu-relion -A your_account --x11 --pty bash
#Load the RELION module:
module load relion/relion-4.0.1
#Launch RELION:
relion
AlphaFold
AlphaFold is a deep learning tool designed for predicting protein structures.
Guide: AlphaFold Guide
Common SLURM Commands
#View all queues (partitions):
sinfo
#View all jobs:
squeue
#View details of a specific job:
scontrol show job <job_number>
#Get information about partitions:
scontrol show partition
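On a busy cluster the full squeue listing is long, so it is common to restrict it to one user. The sketch below wraps squeue with its -u and -o options; the function name my_jobs and the chosen column layout are illustrative, not a site convention.

```shell
#!/bin/bash
# Hypothetical helper: list jobs for one user (default: the current user).
# Columns: job ID, job name, state, elapsed time, reason or node list.
my_jobs() {
    local user=${1:-$USER}
    squeue -u "$user" -o "%.10i %.20j %.10T %.10M %R"
}

# Example calls:
# my_jobs
# my_jobs another_username
```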
Troubleshooting & Tips
Common Errors
- srun: error: Unable to allocate resources: No partition specified or system default partition
Solution: Always specify a partition. Example:
srun --pty -c 1 --mem=2G -p power-general /bin/bash
- The job failed, and scontrol show job <job_id> or sacct -j <job_id> -o JobID,JobName,State%20 reports one of the following:
JobState=OUT_OF_MEMORY Reason=OutOfMemory
or:
JobID        JobName    State
------------ ---------- --------------------
71           oom_test   OUT_OF_MEMORY
71.batch     batch      OUT_OF_MEMORY
71.extern    extern     COMPLETED
Solution: The RAM requested for the job was not enough. Resubmit the job with more RAM. See Estimating RAM Usage below for help with understanding how much RAM your job may need.
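The out-of-memory check can also be scripted. The sketch below reads sacct output in the JobID, JobName, State layout on standard input and succeeds if any row ends in OUT_OF_MEMORY. The function name job_ran_out_of_memory is hypothetical, and the parsing assumes the state is the last column.

```shell
#!/bin/bash
# Hypothetical helper: exit 0 if any row of sacct output reports OUT_OF_MEMORY.
# Expects the header and separator lines followed by one row per job step.
job_ran_out_of_memory() {
    awk 'NR > 2 && $NF == "OUT_OF_MEMORY" { found = 1 } END { exit !found }'
}

# Example call (replace <job_id> with a real job number):
# sacct -j <job_id> -o JobID,JobName,State%20 | job_ran_out_of_memory && echo "Request more memory"
```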
Chain Jobs
Use the --depend flag (an abbreviation of --dependency) to set job dependencies.
Example:
sbatch --ntasks=1 --time=60 -p power-general -A power-general-users --depend=45001 do_work.bash
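When several scripts must run one after another, the job ID of each submission can be captured and passed to the next one. The sketch below uses sbatch --parsable to obtain the job ID and the afterok dependency type, so each job starts only if the previous one finished successfully. The function name chain_jobs is hypothetical, and the sketch assumes each script carries its own partition and account in #SBATCH lines.

```shell
#!/bin/bash
# Hypothetical helper: submit the given scripts as a chain of dependent jobs.
# Assumes partition, account, and resources are set by #SBATCH lines in each script.
chain_jobs() {
    local prev="" script out
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no dependency.
            out=$(sbatch --parsable "$script") || return 1
        else
            # Later jobs wait for the previous job to complete successfully.
            out=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
        prev="${out%%;*}"
        echo "$script -> job $prev"
    done
}

# Example call:
# chain_jobs pre_process.bash do_work.bash
```

If a job in the chain fails, the jobs after it never satisfy their afterok condition and remain pending until they are cancelled.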
Always Specify Resources
When submitting jobs, ensure you include all required resources like partition, memory, and CPUs to avoid job failures.
Attaching to Running Jobs
If you need to monitor or interact with a running job, use sattach. This command allows you to attach to a job's input, output, and error streams in real time.
Example:
sattach <job_id>
To view the job steps of a specific job, use the following command:
scontrol show step <job_id>
Each job step appears in the output as a block beginning with "StepId".
For specific job steps, use:
sattach <job_id.step_id>
Note: sattach is particularly useful for interactive jobs, where you can provide input directly. For non-interactive jobs, it acts like tail -f, allowing you to monitor the output stream.
Estimating RAM Usage
A good memory request starts from a measurement of what the job actually uses. The tips below describe several ways to obtain one.
Tips for Estimating RAM Usage
- Check Application Documentation: Refer to the official documentation or user guides for memory-related information.
- Run a Small Test Job: Submit a smaller version of your job and monitor its memory usage using commands like `free -m`, `top`, or `htop`.
- Use Profiling Tools: Tools like `valgrind`, `gprof`, or built-in profilers can help you understand memory usage.
- Analyze Previous Jobs: Review SLURM logs and job statistics for insights into memory consumption of past jobs.
- Consult with Peers or Experts: Ask colleagues or experts who have experience with similar workloads.
Example: Monitoring Memory Usage
#!/bin/bash
#SBATCH --job-name=memory_test
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --output=memory_test.out
#SBATCH --error=memory_test.err
# Record memory usage on the node before the application starts (free reports node-wide figures)
echo "Memory usage before running the job:"
free -m
# Your application commands go here
# ./your_application
# Record memory usage on the node after the application finishes
echo "Memory usage after running the job:"
free -m
General Tips
- Start Small: Begin with a conservative memory request and increase it based on observed usage.
- Consider Peak Usage: Plan for peak memory usage to avoid OOM errors.
- Use SLURM's Memory Reporting: Use `sacct` to view memory usage statistics.
Example:
sacct -j <job_id> --format=JobID,JobName,MaxRSS,Elapsed
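The MaxRSS value reported by sacct can be turned into a memory request for the next run by adding some headroom. The sketch below is one way to do that; the function name suggest_mem_mb, the 20% default headroom, and the treatment of a value without a suffix as bytes are all assumptions rather than site policy.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value (e.g. 2097152K) into a request in MB.
# Arguments: MaxRSS with a K, M, or G suffix, and an optional headroom percentage (default 20).
suggest_mem_mb() {
    local maxrss=$1 headroom_pct=${2:-20}
    awk -v v="$maxrss" -v h="$headroom_pct" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0                        # awk reads the leading number and ignores the suffix
        if (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else mb = num / (1024 * 1024)      # assumption: no suffix means bytes
        mb = mb * (100 + h) / 100          # add headroom
        rounded = int(mb)
        if (rounded < mb) rounded++        # round up to a whole MB
        print rounded
    }'
}

# A job that peaked at 2097152K (2048 MB), with the default 20% headroom
suggest_mem_mb 2097152K   # prints 2458
```

The result can then be used in the next submission, for example as #SBATCH --mem=2458M.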