Submitting a job to a queue
Latest revision as of 06:54, 2 April 2024
Power is a Linux cluster running CentOS (version 7.3-8). The cluster consists of a single head node (power9) and more than 400 compute nodes, each with 16 to 96 cores and with memory ranging from 16GB or 36GB up to 600GB or more. Users belonging to netgroup 'power' can log in and run their batch jobs on it.
The Faculty Computer Coordinators can change their netgroup from general to power.
Users’ jobs are executed on the compute nodes (compute-0-0 – compute-0-500) under the control of a queuing system (PBSPRO). Users can log on to the head node, power, via ssh (their home directory is mounted there from the CC filer, the same as on the other CC servers) and submit their jobs to the batch system.
Power cluster and pbspro queueing system
PBSPRO main commands
A good reference is the PBS Pro User Guide: http://www.pbsworks.com/documentation/support/PBSProUserGuide10.4.pdf
Start by logging in with one of the commands below:
ssh <username>@power9login.tau.ac.il
ssh powerlogin9.tau.ac.il -l <username>
Create a batch job script, for example a file named script that contains the following lines:
#!/bin/bash
cd executables
./a.out
Send the script to be executed in one of the existing queues, for example, to queue ‘public’:
qsub -q public script
The number which is returned from this command is the job id that was assigned to the new job:
6770818.power.tau.ac.il
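The other commands on this page take the job number, which is the part of the returned job id before the first dot. A minimal sketch of extracting it in bash follows; the function name job_number is an illustrative choice, not part of PBSPRO:

```shell
#!/bin/bash
# job_number (hypothetical helper): strip the server suffix from a job id,
# keeping only the part before the first dot.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Typical use on the cluster (commented out, since it needs qsub):
#   jobid=$(qsub -q public script)
#   qstat -f "$(job_number "$jobid")"

job_number "6770818.power.tau.ac.il"   # prints 6770818
```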
You can see the status of your own jobs by executing:
qstat -u <username>
This lists all the running and queued jobs of the specified user.
The job status is usually one of the following:
Q - queued (waiting for its run)
R - running
You can see the status of all the jobs by executing:
qstat
To see the current available queues and their cputime and memory limits, execute:
qstat -q
To see the status of a specific job, you may run:
qstat -f <job number>
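The full listing printed by qstat -f is long. When only the state is of interest, it can be filtered out of the output. The sketch below assumes the listing contains a line of the form "job_state = R"; the helper name job_state and the abbreviated sample text are written by hand for illustration, not captured from the cluster:

```shell
#!/bin/bash
# job_state (hypothetical helper): read `qstat -f` output on stdin and
# print the value of its job_state line.
job_state() {
    awk -F' = ' '$1 ~ /^[[:space:]]*job_state$/ { print $2; exit }'
}

# Hand-written, abbreviated sample of a full job listing.
sample='Job Id: 6770818.power.tau.ac.il
    Job_Name = script
    job_state = R
    queue = public'

printf '%s\n' "$sample" | job_state   # prints R

# On the cluster one would run instead:
#   qstat -f <job number> | job_state
```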
Some of the queues are private, accessible to a predefined group of users; others are public, open to all the users of power.
More detailed information on the limits of any queue may be viewed by:
qmgr -c "list queue <queuename>"
For example:
qmgr -c "list queue power-general"
Default queue limits are enforced unless other values (up to the queue maximum) are specified on the 'qsub' command, using the flag ‘-l’ (lowercase ‘L’), in the following format:
qsub -q <queue> -l<attribute=limit>,<attribute=limit>,... <script>
For example:
qsub -q power-dvory -lpmem=2000mb,pvmem=3000mb <script>
qsub -q power-dvory -lmem=14gb,pmem=5gb,vmem=20gb,pvmem=20gb <script>
qsub -q gpu -lngpus=1 <script>
qsub -q public -lselect=1:ncpus=4 <script>
Where:
mem - refers to maximum amount of memory to be allocated
pmem - refers to maximum amount of memory to be allocated per process
vmem - refers to maximum amount of virtual memory to be allocated
pvmem - refers to maximum amount of virtual memory to be allocated per process
nodes - number of required nodes (servers)
ppn - number of required cores (within a node)
ngpus - number of required gpus (exists only for queue gpu)
The more up-to-date syntax uses the keyword 'select' (for example, select=1:ncpus=4).
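When the limits vary from run to run, the comma-separated list can be assembled by a small helper rather than typed by hand. The following is a sketch only: the function qsub_command is hypothetical, and it prints the command line instead of executing it, so it can be tried away from the cluster:

```shell
#!/bin/bash
# qsub_command (hypothetical helper): print, but do not run, a qsub line.
#   $1 = queue, $2 = script, remaining arguments = attribute=limit pairs
qsub_command() {
    local queue=$1 script=$2
    shift 2
    local IFS=,          # "$*" joins the remaining arguments with commas
    local limits="$*"
    printf 'qsub -q %s -l %s %s\n' "$queue" "$limits" "$script"
}

qsub_command power-dvory script pmem=2000mb pvmem=3000mb
qsub_command public script select=1:ncpus=4
```

To actually submit, the printed line can be reviewed and then pasted into the shell on the head node.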
Queues list
gpu - the purpose of this queue is to run jobs that require gpu processing
public - queue for public users, who do not pay. Their jobs have lower priority.
power-PI-username - this queue is used by PI and her/his group. Jobs are directed to one global queue, named power-general
The standard output and standard error files will be written by default at the end of the execution to files in your home directory: script.o#n and script.e#n (where #n is the job number given to your job by the batch queueing system).
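Following the naming rule above, the names of the two files can be derived from the script name and the job id returned by qsub. This is a small sketch; the function name output_files is an assumption made for illustration:

```shell
#!/bin/bash
# output_files (hypothetical helper): print the expected names of the
# standard output and standard error files of a job.
#   $1 = script name, $2 = job id as returned by qsub
output_files() {
    local script=$1 jobid=$2
    local n=${jobid%%.*}      # job number: the part before the first dot
    printf '%s.o%s\n%s.e%s\n' "$script" "$n" "$script" "$n"
}

output_files script 6770818.power.tau.ac.il
```

For the job id shown earlier, this prints script.o6770818 and script.e6770818, one per line.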
To delete a job, use the qdel command:
qdel <job number>
PBSPRO file parameters
The script to be run may have additional commands which are directions to the scheduler, instead of adding parameters to the qsub command line.
Explanations regarding PBS script directives can be found at: https://www.osc.edu/supercomputing/batch-processing-at-osc/pbs-directives-summary
For example, instead of specifying ‘qsub -q public …’, one may add ‘#PBS -q public’ to the script to be executed. The script below, named ‘script.sh’, sets its limits in the same way and can be run using the command ‘qsub script.sh’:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l select=1:ncpus=4,mem=400mb
./my application
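Before submitting, it can help to check which scheduler directives a script actually carries. The sketch below lists the lines starting with #PBS; the helper name pbs_directives and the temporary demo file are assumptions made for illustration:

```shell
#!/bin/bash
# pbs_directives (hypothetical helper): print the #PBS lines of a job script.
pbs_directives() {
    grep -E '^#PBS[[:space:]]' "$1"
}

# Demo on a throwaway copy of a small job script.
tmp=$(mktemp)
printf '%s\n' '#!/bin/bash' \
              '#PBS -l walltime=1:00:00' \
              '#PBS -q public' \
              'hostname' > "$tmp"
pbs_directives "$tmp"
rm -f "$tmp"
```

The demo prints the two #PBS lines and skips the shebang and the command.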
Running matlab example
In this example there are 3 files:
myTable.m ⇒ This matlab file calculates something
fprintf('=======================================\n');
fprintf(' a b c d \n');
fprintf('=======================================\n');
while 1
    for j = 1:10
        a = sin(10*j);
        b = a*cos(10*j);
        c = a + b;
        d = a - b;
        fprintf('%+6.5f %+6.5f %+6.5f %+6.5f \n',a,b,c,d);
    end
end
fprintf('=======================================\n');
my_table_script.sh ⇒ This script executes the matlab program. It only needs to be submitted with qsub
#!/bin/bash
#PBS -e /tmp/dvory/matlab/output
#PBS -o /tmp/dvory/matlab/output
#PBS -l mem=5000mb
#PBS -q power-dvory
hostname
cd /a/home/cc/tree/taucc/staff/dvory/matlab
matlab -nodisplay -nosplash -nodesktop -r "myTable; exit;"
run_in_loop.sh ⇒ This script submits the job many times, generating many jobs at once
#!/bin/bash
for i in {1..100}
do
    qsub my_table_script.sh
done
Submit the jobs with the command:
./run_in_loop.sh
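A variation on this loop tags every job with its index, so that each run can tell itself apart from the others. The sketch below rests on assumptions: SUBMIT and RUN_INDEX are hypothetical names, and by default the function only prints the qsub lines instead of submitting. It relies on the qsub flag ‘-v’, which exports the listed variables to the job:

```shell
#!/bin/bash
# submit_many (hypothetical helper): submit the same script N times,
# passing each job its index in the variable RUN_INDEX.
# SUBMIT defaults to "echo qsub", so the commands are printed, not run;
# set SUBMIT=qsub on the cluster to really submit.
submit_many() {
    local count=$1 script=$2 i
    for ((i = 1; i <= count; i++)); do
        ${SUBMIT:-echo qsub} -v "RUN_INDEX=$i" "$script"
    done
}

submit_many 3 my_table_script.sh
```

Inside the job script the index would then be available as $RUN_INDEX, for example to choose an input file.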
Interactive session
Interactive sessions (line mode) are enabled using the flag ‘-I’ (a capital ‘i’):
qsub -q <queue name> -I
(without adding a script name)
Interactive sessions with X window
To open an X window (such as a matlab window or a math window), use the commands below.
Log in to the head node with ‘-X’:
ssh -X -l <username> powerlogin9.tau.ac.il
Then use the qsub command with ‘-X’:
qsub -I -X -q <queue>
(without adding a script name)
Keep in mind that running matlab via an X window slows the matlab execution.
Matlab needs more memory than the default of the public queues; at least the following memory needs to be requested:
qsub -q power-dvory -lmem=60gb,pmem=60gb,vmem=60gb,pvmem=60gb -I -X
Parallelism
Parallel jobs can be executed in the cluster - using up to 96 cores (=ppn) for a job. For example, jobs compiled with mpich can be submitted with the following command:
qsub -l select=1:ncpus=8 -q public <script-filename>
Multithreaded matlab jobs can be submitted with the following command:
qsub -l select=1:ncpus=8 -q parallel <matlab-script>
‘-l’ refers to a small ‘L’
Environment modules
The Environment Modules package provides for the dynamic modification of a user’s environment via modulefiles.
Typically modulefiles instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. Modules are useful in managing different versions of applications.
Useful commands:
module avail <required module>
e.g.
module avail python
⇒ lists the available modules on the system
module load <module>
e.g.:
module load intel/ifort10
⇒ loads the appropriate module, so that ifort version 10 can be used without specifying the path to its binaries and libraries
module list
⇒ lists the loaded modules
module unload intel/ifort10
⇒ unloads the loaded module
GPU usage
When no specific gpu is requested, the code uses the first one by default, which may be busy or have all its memory in use.
To specify the gpu to be used, set the environment variable CUDA_VISIBLE_DEVICES, e.g.
export CUDA_VISIBLE_DEVICES=2
To use more than one gpu, list them separated by commas, e.g.
export CUDA_VISIBLE_DEVICES="2,4"
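When the gpu numbers are chosen inside a script, the comma-separated value can be built from a list of arguments. A minimal sketch, in which the function name use_gpus is an illustrative assumption:

```shell
#!/bin/bash
# use_gpus (hypothetical helper): export CUDA_VISIBLE_DEVICES as the
# comma-separated list of the gpu numbers given as arguments.
use_gpus() {
    local IFS=,                       # "$*" joins the arguments with commas
    export CUDA_VISIBLE_DEVICES="$*"
}

use_gpus 2 4
echo "$CUDA_VISIBLE_DEVICES"          # prints 2,4
```

The function has to be called in the same shell that later starts the gpu program, since it changes that shell's environment.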
To find out which gpus are free, one may run the python script below (or copy it to a local disk and run it from there):
module load python/python-anaconda_3.7
python /powerapps/scripts/check_avail_gpu.py
It will output the first free gpu (in its last line).
If more than one gpu is needed, change the parameter num_gpus in line 96 of a local copy of the script:
num_gpus = 1
It needs to be set to the number of needed gpus.