This page covers details specific to the pat cluster. For general SLURM use, see SLURM usage.
Partitions
pat has the following partitions:
| Name | Nodes | Time limit | Notes |
|------|-------|------------|-------|
| GPU | All the nodes with GPUs | 30 days | Default partition |
| CPU | All the nodes with only CPUs | 30 days | |
| DEBUG | All the nodes | None | Pre-emptor, restricted access |
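For example, to submit a batch job to the CPU partition rather than the default GPU partition (job.sh here is a placeholder for your own batch script):

sbatch -p CPU job.sh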
Types of compute node
Unlike many local cluster systems, the cluster's nodes are not identical. There is a range of GPUs available, and sometimes more than one OS version; different OS versions have different software available, as not all compilers/CUDA versions are supported on every OS. There are also CPU-only nodes. You select the features you want with SLURM constraints.
Currently available features:
| Name | Description |
|------|-------------|
| teslak20 | Nvidia Tesla K20m GPUs |
| titanblack | Nvidia GeForce GTX Titan Black GPUs |
| 3gpu | Node has 3 GPUs |
| 4gpu | Node has 4 GPUs |
| cpu | Node has dual 16-core CPUs |
To see what combination of features each node has, run 'scontrol show nodes' on pat.
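For a more compact view, sinfo can print one line per node with its feature list (the format string here is just one reasonable choice):

sinfo -N -o "%N %f"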
To select particular features, use the '-C' or '--constraint' option to srun or sbatch. You can combine multiple features with '&' for a boolean AND, or '|' for a boolean OR.
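For example, to ask for a node that has Tesla K20m GPUs and three of them (quote the expression so the shell doesn't interpret '&' or '|'; mycommand is a placeholder):

srun -C "teslak20&3gpu" --gres=gpu:1 mycommand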
Using the DEBUG partition
This partition exists to let people allocate a GPU for debugging for an indefinite time; it is not for running production work. Only people nominated by group computer reps have access to this partition. To allocate a GPU, do something like this:
salloc -n1 --gres=gpu:1 -p DEBUG --no-shell
using whatever parameters you need to get the GPU you want; salloc understands all the same options as sbatch and srun. SLURM will bump running jobs off the GPUs if it needs to in order to satisfy the allocation request. The salloc command will return a job ID. You'll be able to see this job in the queue, running with unlimited walltime.
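You can check on the allocation by listing your own jobs:

squeue -u $USER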
Then, to access the allocated GPU, do something like
srun --jobid=id mycommand
where 'id' is the job ID that the salloc command gave you. To get rid of the allocation and allow others to use the GPU, cancel it with
scancel id
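Putting it all together, a debugging session might look like this (12345 is a placeholder job ID and the constraint is illustrative; substitute whatever salloc actually prints and the features you need):

# request one Titan Black GPU on the DEBUG partition
salloc -n1 --gres=gpu:1 -C titanblack -p DEBUG --no-shell
# suppose salloc reported job ID 12345 (placeholder)
srun --jobid=12345 nvidia-smi
# or open an interactive shell on the allocated node
srun --jobid=12345 --pty bash
# release the GPU when finished
scancel 12345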