See the general Rocks notes on the Computer Officers' wiki at https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Setting_up_clusters#t... which cover most things.
Power management
Use IPMI; not all of the nodes are on managed PDUs.
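As a rough sketch of what power control over IPMI can look like from the command line, the snippet below prints the ipmitool invocations for a power status query and a power cycle. It assumes ipmitool is available and that the BMCs accept the lanplus (IPMI v2.0) interface; the BMC hostname and user name are hypothetical placeholders, not values from these notes.

```shell
#!/bin/sh
# Hypothetical values: replace with the node's real BMC address and account.
BMC_HOST="${BMC_HOST:-compute-0-0-ipmi}"   # assumed naming, not from these notes
BMC_USER="${BMC_USER:-ADMIN}"              # assumed account name

# Print the ipmitool command for a chassis power action
# (status, on, off, cycle). The -E flag makes ipmitool read the
# password from the IPMI_PASSWORD environment variable.
ipmi_power_cmd() {
    echo "ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER -E chassis power $1"
}

# Dry run: show the commands rather than executing them.
ipmi_power_cmd status
ipmi_power_cmd cycle
```

Printing the command first makes it easy to check the target host before power cycling anything; pipe the output to sh once it looks right.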
IPMI
IPMIview20 is installed and has been the most successful way to get remote consoles. The IPMI controllers are flaky and a power cycle is often helpful.
GPU problems
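Sometimes GPUs get stuck running slowly. To see what speed each GPU is running at, ssh -Y compute-whatever (cuda-z is a graphical program, so it needs the X forwarding) and run /usr/local/bin/cuda-z. Resetting the affected GPU with nvidia-smi -r sometimes helps; the reset needs root and will normally fail while anything is still using the GPU.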
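If no X display is available, a similar check can be sketched with nvidia-smi's query mode. The helper below is an illustration only: the name flag_slow_gpus and the "below half of maximum SM clock" threshold are arbitrary assumptions, and an idle GPU clocks down by design, so the result only means something while the GPU is under load.

```shell
#!/bin/sh
# Reads "index, clocks.sm, clocks.max.sm" CSV rows (no header, no units)
# on stdin and prints the index of each GPU whose current SM clock is
# below half its maximum. Rows with non-numeric clocks are skipped.
# The 50% threshold is an arbitrary choice for illustration.
flag_slow_gpus() {
    awk -F', *' '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 + 0 < ($3 + 0) / 2 { print $1 }'
}

# Only query the driver if nvidia-smi is present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
               --format=csv,noheader,nounits | flag_slow_gpus
fi
```

A flagged index can then be passed to the reset, for example nvidia-smi -r -i 1, once nothing is running on that GPU.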
Virtual memory
CUDA uses insane amounts of virtual address space to communicate with its GPUs. Any CUDA program on a 4-GPU machine will allocate address space equal to 4 × the total physical memory, even if it only uses one GPU. So we turn off some of the VM system's sanity checking in sysctl:
vm.overcommit_memory = 1
and tell SLURM not to limit virtual memory, in slurm.conf:
VSizeFactor=0
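To confirm that both settings above are actually in effect on a node, something along these lines may help. This is a sketch under assumptions: check_setting is a hypothetical helper, and the awk field position relies on scontrol show config printing the line as "VSizeFactor = 0 percent", which may vary between SLURM versions.

```shell
#!/bin/sh
# Compare a live value against the one these notes expect.
# usage: check_setting NAME EXPECTED ACTUAL
check_setting() {
    if [ "$3" = "$2" ]; then
        echo "$1 ok ($3)"
    else
        echo "$1 is '$3', expected '$2'"
        return 1
    fi
}

# Kernel side: read the running value of vm.overcommit_memory.
if command -v sysctl >/dev/null 2>&1; then
    check_setting vm.overcommit_memory 1 \
        "$(sysctl -n vm.overcommit_memory 2>/dev/null)" || true
fi

# SLURM side: pull VSizeFactor out of the running configuration.
if command -v scontrol >/dev/null 2>&1; then
    vsize=$(scontrol show config | awk '$1 == "VSizeFactor" { print $3 }')
    check_setting VSizeFactor 0 "$vsize" || true
fi
```

Each check is skipped if the relevant tool is not installed, so the script can be run on head nodes and compute nodes alike.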