See the generic SLURM documentation for how to use SLURM. This page describes the setup on the Nest cluster.
Nest has sixteen nodes funded by the Theory RIG and four nodes funded by the Wales group. The Wales-funded nodes are therefore in a restricted-access partition (WALES).
Nest has five 'partitions':
Partition Name | Who can use it | Nodes | Default time limit | Maximum time limit | Other settings | Priority for preemption |
TEST | Anyone | node-0-0 | 30 minutes | 30 minutes | None | 100 |
MAIN | Anyone | node-0-1 to node-0-15 | 48 hours | 48 hours | This is the default partition | 50 |
WALES | Members of nest-wales-users group | node-0-16 to node-0-19 | 7 days | 28 days | None | 50 |
LONG | Anyone | node-0-1 to node-0-15 | 7 days | 28 days | Limited to max 200 cores at a time over all jobs in this partition | 50 |
CLUSTER | Anyone | All the nodes | 7 days | 28 days | None | 0 |
When submitting a compute job you can give a comma-separated list of possible partitions with the -p flag. SLURM will place the job in whichever suitable partition lets it start soonest. If you don't choose a partition at all, SLURM will use MAIN. Jobs in MAIN are safe from pre-emption, but MAIN does not have access to every node, so they may take longer to start.
If you want the job to start as soon as possible and don't mind risking pre-emption, use -p MAIN,CLUSTER. The job will then run in MAIN if there are enough free nodes there; if there are not, but there are enough free nodes over the whole machine, it will run in CLUSTER. However, CLUSTER has the lowest pre-emption priority (0), so if someone else then submits a job to MAIN, TEST, LONG, or WALES that could run if the CLUSTER job were cancelled, the CLUSTER job gets cancelled.
If you want a cancelled job to be put back on the queue to be restarted later, use the --requeue flag.
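As an illustration of the MAIN,CLUSTER plus --requeue combination described above, here is a minimal sketch of a batch script. The job name, node count, time limit, and the helper function describe_run are hypothetical and chosen only for the example; SLURM_JOB_PARTITION and SLURM_RESTART_COUNT are standard SLURM environment variables, but check the generic SLURM documentation for how they behave on your version.

```shell
#!/bin/bash
#SBATCH --job-name=example          # hypothetical job name
#SBATCH --partition=MAIN,CLUSTER    # allow either partition; CLUSTER jobs can be pre-empted
#SBATCH --requeue                   # put the job back on the queue if it is pre-empted
#SBATCH --nodes=1                   # hypothetical resource request
#SBATCH --time=24:00:00             # hypothetical limit, within MAIN's 48 hour maximum

# Hypothetical helper: report where the job landed and how often it was restarted.
# Falls back to placeholder values when run outside SLURM.
describe_run() {
  local partition="${1:-unknown}"
  local restarts="${2:-0}"
  echo "partition=${partition} restarts=${restarts}"
}

# SLURM sets these variables inside a running job.
describe_run "${SLURM_JOB_PARTITION:-}" "${SLURM_RESTART_COUNT:-}"

# The real workload would follow here. A job that may be requeued should be
# able to resume from its own checkpoint files rather than start from scratch.
```

Saved under a hypothetical name such as job.sh, this would be submitted with "sbatch job.sh". The same options can be given on the command line instead: "sbatch -p MAIN,CLUSTER --requeue job.sh". Logging the partition makes it easy to see afterwards whether a run was in the pre-emptible CLUSTER partition.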