Batch jobs are run by writing a script and submitting it to the queue with the sbatch command.
Scripts for batch jobs must start with the interpreter to be used to execute them (unlike PBS/Torque). You can give arguments to sbatch as comments in the script. Example:
#!/bin/bash
# Name of the job
#SBATCH -J testjob
# Partition to use - this example is commented out
##SBATCH -p NONIB
# Time limit. Often not needed as there are defaults,
# but you might have to specify it to get the maximum allowed.
# time: 10 hours
##SBATCH --time=10:0:0
# Pick nodes with feature 'foo'. Different clusters have
# different features available
# but most of the time you don't need this
##SBATCH -C foo
# Restrict the job to run on the node(s) named
##SBATCH -w compute-0-13
# Number of processes
#SBATCH -n1
# Start the program
srun myprogram
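The difference between #SBATCH and ##SBATCH lines in the script above can be checked before submitting. The following is a minimal sketch, not part of SLURM: list_sbatch_directives is a hypothetical helper name, and the temporary demo script stands in for a real job script. It relies on sbatch acting only on comment lines that begin with exactly #SBATCH, so a doubled ## turns a directive into an ordinary comment.

```shell
#!/bin/bash
# Hypothetical helper (not a SLURM command): print the #SBATCH lines of a
# script that are active. Lines starting with ##SBATCH do not match.
list_sbatch_directives() {
    grep '^#SBATCH' "$1"
}

# Build a small demo script in a temporary file (illustrative content only).
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH -J testjob
##SBATCH -p NONIB
#SBATCH -n1
srun myprogram
EOF

# Prints the two active directives: -J testjob and -n1
list_sbatch_directives "$demo"
# To submit the script for real you would then run: sbatch "$demo"
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so a directive can be overridden for one run without editing the file.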
A more complicated example which uses three tasks:
#!/bin/bash
#SBATCH -n3 # 3 tasks
echo Starting job $SLURM_JOB_ID
echo SLURM assigned me these nodes
srun -l hostname
srun -n2 program1 & # start 2 copies of program 1
srun -n1 program2 & # start 1 copy of program 2
wait # wait for all to finish
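A bare wait, as in the script above, returns success even if one of the background srun steps failed. One possible way to keep the exit status, sketched here with a hypothetical helper called wait_all (a plain bash function, not a SLURM feature), is to wait on each process ID in turn. The true and false commands are stand-ins for the srun lines so that the sketch runs anywhere.

```shell
#!/bin/bash
# wait_all: hypothetical helper. Waits for each given PID and returns 1
# if any of them exited with a non-zero status, otherwise 0.
wait_all() {
    local status=0 pid
    for pid in "$@"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Stand-ins for "srun -n2 program1 &" and "srun -n1 program2 &"
true & p1=$!
false & p2=$!

if wait_all "$p1" "$p2"; then
    echo "all steps finished successfully"
else
    echo "at least one step failed"
fi
```

In a real job script the two stand-in lines would be replaced by the srun commands, each followed by saving $! in the same way.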
Interactive jobs can be run in two ways, via salloc and srun. If you just want a single interactive session on a compute node, using srun to allocate resources for a single task and launch a shell as that one task is probably the way to go. If you want to run things in parallel, or run more than one task at once in your interactive job, use salloc to allocate resources and then srun or mpirun to start the tasks; starting multiple copies of an interactive shell at once probably isn't what you want.
# One interactive task. Quit the shell to finish
srun --pty -u bash -i
# One task with one GPU assigned (GPU clusters only, obviously)
srun --pty --gres=gpu:1 -u bash -i
# One task with one maxwell GPU
srun --pty --gres=gpu:1 -C maxwell -u bash -i
# One task with one Titan Black GPU
srun --pty --gres=gpu:1 -C titanblack -u bash -i
# One task with one GPU on the node called 'happy'
srun --pty --gres=gpu:1 -w happy -u bash -i
# Allocate three tasks, followed by running three instances of 'myprog' within the allocation.
# Then start one copy of longprog and two copies of myprog2, then release the allocation
salloc -n3
srun myprog
srun -n1 longprog &
srun -n2 myprog2
exit
squeue               # View the queue
scancel <jobid>      # Cancel a job
sinfo                # See the state of the system
sacct -l -j <jobid>  # List accounting info about a job
Asking for resources
salloc/srun/sbatch support a huge array of options which let you ask for nodes, CPUs, tasks, sockets, threads, memory etc. If you combine them, SLURM will try to work out a sensible allocation; for example, if you ask for 13 tasks and 5 nodes SLURM will cope. Here are the ones that are most likely to be useful:
|-n||Number of tasks (roughly, processes)|
|-N||Number of nodes to assign. If you're using this, you might also be interested in --tasks-per-node|
|--tasks-per-node||Maximum tasks to assign per node if using -N|
|--cpus-per-task||Assign more than one CPU to each task. Useful for jobs with shared memory parallelization|
|-C||Features the nodes assigned must have|
|-w||Names of nodes that must be included - for selecting a particular node or nodes|
Use this to make SLURM assign you more memory than the default amount available per CPU. The units are MB. Works by automatically assigning sufficient extra CPUs to the job to ensure it gets access to enough memory.
|--gres=gpu:X||Ask for X GPUs. NB if you combine this with a -N option you will get X GPUs per node you asked for with -N, not X GPUs total. SLURM does not support having varying numbers of GPUs per node in a job yet.|
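For a shared-memory program, --cpus-per-task is typically paired with a thread count set inside the script. The sketch below assumes an OpenMP program, here given the hypothetical name my_openmp_program, and uses the SLURM_CPUS_PER_TASK environment variable that SLURM sets when --cpus-per-task is requested. The fallback to 1 is an addition so that the script also behaves sensibly outside a job.

```shell
#!/bin/bash
#SBATCH -n1
#SBATCH --cpus-per-task=4
# Use as many threads as CPUs were assigned to the task; default to 1
# when SLURM_CPUS_PER_TASK is unset (for example when run outside SLURM).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Using $OMP_NUM_THREADS thread(s)"
# Start the (hypothetical) program here, for example:
# ./my_openmp_program
```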
SLURM can power off idle compute nodes and boot them up when a compute job comes along to use them. Because of this, compute jobs may take a couple of minutes to start when there are no powered-on nodes available. To see whether nodes are in power-saving mode, check the output of sinfo:
PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
DEBUG     up    infinite   0     n/a
CLUSTER   up    28-00:00:0 33    idle~ comp-0-[0-11,13-18,20-28,30-35]
CLUSTER   up    28-00:00:0 1     idle  comp-0-29
IB        up    28-00:00:0 21    idle~ comp-0-[1,3,5,8-9,11,13,16-17,20-24,26-28,31-33,35]
IB        up    28-00:00:0 1     idle  comp-0-29
NONIB*    up    28-00:00:0 12    idle~ comp-0-[0,2,4,6-7,10,14-15,18,25,30,34]
In this case all of the nodes on this cluster except comp-0-29 are shut down to save power. The tilde (~) suffix on the state (idle~) shows this. comp-0-29 is powered up but idle.
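Picking the powered-down nodes out of a long sinfo listing can be scripted. This is a rough sketch that assumes the default column layout shown above, with STATE in the fifth column; powersaved_nodes is a hypothetical helper name, and the sample rows are copied from the listing rather than read from a live system.

```shell
#!/bin/bash
# powersaved_nodes: hypothetical helper. Reads sinfo output in the default
# layout on stdin and prints partition, node count and node list for every
# row whose STATE ends in '~' (powered down to save energy).
powersaved_nodes() {
    awk '$5 ~ /~$/ { print $1, $4, $6 }'
}

# Sample rows taken from the listing above
powersaved_nodes <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
DEBUG up infinite 0 n/a
CLUSTER up 28-00:00:0 33 idle~ comp-0-[0-11,13-18,20-28,30-35]
CLUSTER up 28-00:00:0 1 idle comp-0-29
EOF
# On a cluster the same filter would be fed by: sinfo | powersaved_nodes
```

Matching on the state column rather than on the node names keeps the filter independent of how a particular cluster names its nodes.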
Inside a batch script you should just be able to call mpirun, which will communicate with SLURM and launch the job over the appropriate set of nodes for you:
#!/bin/bash
# 13 tasks over 5 nodes
#SBATCH -n13 -N5
echo Hosts are
srun -l hostname
mpirun /home/cen1001/src/mpi_hello
To run MPI jobs interactively you can assign some nodes using salloc, and then call mpirun from inside that allocation. Unlike PBS/Torque, the shell you launch with salloc runs on the same machine you ran salloc on, not on the first node of the allocation. But mpirun will do the right thing.
salloc -n12 bash
mpirun /home/cen1001/src/mpi_hello
You can even use srun to launch MPI jobs interactively without mpirun's intervention on some of the clusters. The --mpi option here is to tell srun which method the MPI library uses for launching tasks. This is the correct one for use with our OpenMPI installations.
srun --mpi=pmi2 -n13 ./mpi_hello
Non-MPI Parallel jobs
In a parallel job which doesn't use MPI you can find out which hosts you have and how many by running "srun -l hostname" inside the job script. The -l option prints the SLURM task number next to the assigned hostname for each task; skip it if you just want the list of hostnames.
You can then use srun inside the job to start individual tasks.
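The host list from srun hostname can be condensed into a count of tasks per host, which may help when deciding how to distribute work. The following sketch uses a hypothetical helper, tasks_per_host, built only from standard tools; the hostnames fed to it here are made-up sample input.

```shell
#!/bin/bash
# tasks_per_host: hypothetical helper. Reads one hostname per line on stdin
# and prints "hostname count" for each distinct host.
tasks_per_host() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Made-up sample input standing in for the output of "srun hostname"
printf '%s\n' comp-0-2 comp-0-1 comp-0-2 | tasks_per_host
# Inside a job script the real list would come from: srun hostname | tasks_per_host
```

The helper expects plain hostnames, so it should be fed from srun without the -l option.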
Jobs with multiple GPUs on GPU clusters
If you use -N (number of nodes) with --gres=gpu:X you will get X GPUs on each node you ask for. To assign these to different tasks use srun within the job:
#!/bin/bash
#SBATCH -J testmultigpu
#SBATCH --gres=gpu:2
#SBATCH -N2
##SBATCH -n12
# show us what resources we have: these run over everything
srun -l hostname
# single quotes so the variable is expanded by each task, not by this script
srun -l bash -c 'echo $CUDA_VISIBLE_DEVICES'
# assign one instance of show.sh to each GPU over all four GPUs.
# Have to set -N here or it will default to -N2 which makes no sense:
srun -l --gres=gpu:1 -n1 -N1 /home/cen1001/show.sh &
srun -l --gres=gpu:1 -n1 -N1 /home/cen1001/show.sh &
srun -l --gres=gpu:1 -n1 -N1 /home/cen1001/show.sh &
srun -l --gres=gpu:1 -n1 -N1 /home/cen1001/show.sh &
wait
# assign one instance of show.sh to both GPUs in a node over both nodes:
srun -l --gres=gpu:2 -N1 /home/cen1001/show.sh &
srun -l --gres=gpu:2 -N1 /home/cen1001/show.sh &
wait