Getting information
squeue      # view the queue
sprio       # see what priority each queued job has
sshare -a   # see the fairshare numbers
sinfo       # see the system state
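If you want to narrow these down, the standard SLURM flags apply. A sketch (the job id here is illustrative):
squeue -u $USER     # only your own jobs
squeue -j 1234      # one particular job
sinfo -N -l         # one line per node, with state and other details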
Running a job
SLURM is slightly different to Torque in that it has both jobs and job steps. A 'job' is an allocation of resources to an account for a time. A 'job step' is a task that runs inside the allocation. You can launch multiple job steps within a job if you want, although probably most of the time you'll just want one.
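As a minimal sketch of the distinction (the command is only an illustration, using the same GPU request as the examples below):
sbatch --gres=gpu:1 --wrap 'srun hostname'   # sbatch creates the job (the allocation); the srun inside it runs as a job step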
As the pat cluster is a GPU cluster, it makes sense to schedule by GPUs rather than CPUs, so the examples concentrate on GPUs.
Interactive jobs
All of these will run for the maximum allowed time, three days. Within the job the variable CUDA_VISIBLE_DEVICES will be set to the appropriate value for the assigned GPU.
srun --pty --gres=gpu:1 -u bash -i                             # one single GPU
srun --pty --gres=gpu:1 -C maxwell -u bash -i                  # one Maxwell GPU
srun --pty --gres=gpu:1 -C titanblack -u bash -i               # one Titan Black GPU
srun --pty --gres=gpu:1 -w compute-titanblack-0-6 -u bash -i   # one GPU on the node called compute-titanblack-0-6
srun --pty --gres=gpu:1 -C happy -u bash -i                    # any one GPU on the node that used to be called 'happy' - old node names are now set as 'features'
To change the GPU mode issue the appropriate nvidia-smi command within the job:
sudo nvidia-smi -c 3 -i $CUDA_VISIBLE_DEVICES # set to exclusive
sudo nvidia-smi -c 0 -i $CUDA_VISIBLE_DEVICES # set to default
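To check which mode the GPU is currently in, a standard nvidia-smi query works within the job:
nvidia-smi -q -d COMPUTE -i $CUDA_VISIBLE_DEVICES   # report the current compute mode of the assigned GPU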
Cancelling
scancel <jobid>
Batch jobs
Run a batch job with sbatch <scriptname>.
Example batch script
#!/bin/bash
#SBATCH --job-name=cudamemtest
#SBATCH --gres=gpu:1
#SBATCH --constraint=maxwell
#SBATCH --mail-type=ALL
hostname
source /etc/profile.d/modules.sh
module add cuda/6.5
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3
/home/cen1001/cuda_memtest-1.2.3/cuda_memtest --stress --num_passes 1 --num_iterations 100
Output will appear in slurm-<jobid>.out as each job step finishes. You can change that with sbatch options.
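For example, to pick the output filename yourself (the name here is illustrative; %j expands to the job id):
#SBATCH --output=myjob-%j.out   # send stdout to a file named after the job id
#SBATCH --error=myjob-%j.err    # optionally send stderr to a separate file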
Because a batch job can launch multiple job steps, each taking a part of the job's allocation, you can use the srun command within the batch job to tell SLURM to allocate particular resources to each job step, which in the case of GPUs means setting CUDA_VISIBLE_DEVICES. Here is an example which requests two GPUs and runs a different job step on each.
#!/bin/bash
#SBATCH --job-name=cudamemtest
#SBATCH --gres=gpu:2
#SBATCH --constraint=maxwell
source /etc/profile.d/modules.sh
module add cuda/6.5
srun --gres=gpu:1 /home/cen1001/cuda_memtest-1.2.3/cuda_memtest --disable_all --enable_test 1 --num_passes 100 --num_iterations 100 &
srun --gres=gpu:1 /home/cen1001/cuda_memtest-1.2.3/cuda_memtest --disable_all --enable_test 2 --num_passes 100 --num_iterations 100 &
wait
Attaching to a batch job while it runs
sattach <job>.<step> will let you peek at the output from a running job step.
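For example (the job and step ids are illustrative):
squeue -s        # list running job steps so you can see their <job>.<step> ids
sattach 1234.0   # attach to step 0 of job 1234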
Queues and constraints
Unlike the local Torque systems, there are not lots of different queues providing shorthand for requesting various combinations of CPUs and job time. SLURM's 'partitions', the equivalent of Torque's 'queues', do not support setting default task geometry in the same way, so there's no point in having lots of partitions. On the upside, the task geometry options available for jobs are far more flexible and powerful than Torque's; look at the srun and sbatch manpages if you want to know more. If you don't care, just ask for GPUs as in the examples above and let the CPU allocation take care of itself.
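If you do want to be explicit about CPUs and tasks, the standard SLURM options work alongside the GPU request. A sketch (the values and script name are illustrative):
srun --pty --gres=gpu:1 -c 4 -u bash -i                        # one GPU plus four CPU cores
sbatch --gres=gpu:2 --ntasks=2 --cpus-per-task=2 myscript.sh   # two tasks of two cores each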
You can set the time limit for a job with the -t/--time flag to srun or sbatch. The value is in minutes; the maximum is three days, and if you don't give a time that's what you'll get.
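For example (the times are illustrative):
srun --pty --gres=gpu:1 -t 60 -u bash -i   # a one-hour interactive job
#SBATCH --time=1440                        # in a batch script: 24 hours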
Having said that, the cluster does have two partitions: 'MAIN' and 'DEBUG'. 'DEBUG' is a special-purpose partition only available to people nominated by the Wales group computer reps. It allows unlimited time and jobs in this partition may pre-empt jobs in the 'MAIN' partition. It is there purely for debugging. By default all jobs use MAIN. There are also sometimes other partitions present for special purposes.
Types of compute node
The cluster's nodes are not identical, unlike many local cluster systems. There is a range of GPUs available, and sometimes more than one OS version. Different OS versions have different software available, as not all compilers and CUDA versions are supported on every OS. You select the features you want with SLURM constraints.
Currently available features:
Name | Description
teslak20 | Nvidia Tesla K20m GPUs
titanblack | Nvidia GeForce 700 Titan Black GPUs
3gpu | Node has 3 GPUs
4gpu | Node has 4 GPUs
happy, hitaki, joe, ronn, mongrol, morrigun, blackblood, rojaws, hammerstein, quartz | The names the nodes had before they were properly clustered
To see what combination of features each node has, run 'scontrol show nodes' on pat.
To select particular features use the '-C' or '--constraint' option to srun or sbatch. You can combine multiple features with & for a boolean AND, or | for a boolean OR.
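For example (feature names taken from the table above; the script name is illustrative):
sbatch -C "titanblack&4gpu" myscript.sh                       # a node with Titan Black GPUs AND four GPUs
srun --pty --gres=gpu:1 -C "teslak20|titanblack" -u bash -i   # either GPU type will do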
Memory limits
SLURM on pat is configured to set a default memory limit of 3.5GB per core, but jobs can set that higher by using the --mem-per-cpu setting. However, no matter how high we set the limits in SLURM, Linux itself still imposes some virtual memory limits under extreme conditions, despite our having turned off all the tunable ones we could find.
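For example, to raise the limit for a job (the value is illustrative):
#SBATCH --mem-per-cpu=8G                              # in a batch script
srun --pty --gres=gpu:1 --mem-per-cpu=8G -u bash -i   # or for an interactive job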
SLURM Power saving
The nodes in the ABC cluster sometimes have power saving enabled. When power saving is on SLURM will shut them down after ten minutes of inactivity and then boot them up automatically when they are assigned to a compute job. Booting a GPU node from cold takes about two and a half minutes, so there will be a wait when starting a job if the cluster has been idle for a while.
If a node is power saving then 'sinfo' will show its state as idle~, and 'scontrol show nodes' will show it as IDLE+POWER.
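A compact way to check this uses standard sinfo formatting options (a sketch, not specific to pat):
sinfo -N -o "%N %t"   # one line per node: node name and compact state; power-saving nodes show as idle~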
Using the DEBUG partition
The intention of this partition is to allow people to allocate a GPU for debugging for an indefinite time. It is not for running production work. Only people nominated by group computer reps can have access to this partition. To allocate a GPU, do something like this:
salloc -n1 --gres=gpu:1 -p DEBUG --no-shell
using whatever parameters you need to get the GPU you want. salloc understands all the same ones as sbatch and srun. SLURM will bump running jobs off the GPUs if it needs to in order to satisfy the allocation request. The salloc command will return a job id. You'll be able to see this job in the queue, running with unlimited walltime.
Then, to access the allocated GPU, do something like
srun --jobid=id mycommand
where 'id' is the job id that the salloc command gave you. To get rid of the allocation and allow others to use the GPU, cancel it with
scancel id
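Putting the whole DEBUG workflow together (the job id 1234 is illustrative):
salloc -n1 --gres=gpu:1 -p DEBUG --no-shell   # SLURM prints the granted job id, e.g. 1234
srun --jobid=1234 nvidia-smi                  # run a command on the allocated GPU
scancel 1234                                  # release the GPU when you're finished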