See the general xCAT notes on the Computer Officers' wiki at https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Setting_up_clusters and https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Cluster_quick_reference (both restricted access)
GPU problems
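Sometimes GPUs go on a go-slow and run at reduced clock speeds. Log in with ssh -Y compute-whatever (cuda-z is a graphical program, so it needs X forwarding) and run /usr/local/bin/cuda-z to see what speed each GPU is running at. Sometimes resetting the GPU with nvidia-smi -r helps; this needs root and the GPU must not be in use by any process.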
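If X forwarding is inconvenient, the clocks can also be read on the command line with nvidia-smi. The following is only a rough sketch: flag_slow_gpus is a hypothetical helper, not something installed on the clusters, and the 50% threshold is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads "index, current_mhz, max_mhz" CSV lines on stdin
# and prints the GPUs running below half of their maximum SM clock.
# The 50% threshold is an arbitrary assumption, not a site policy.
flag_slow_gpus() {
    awk -F', *' '$2 + 0 < 0.5 * $3 { printf "GPU %s slow: %d of %d MHz\n", $1, $2, $3 }'
}

# On a compute node (needs the NVIDIA driver tools):
#   nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
#              --format=csv,noheader,nounits | flag_slow_gpus
```

An idle GPU clocks itself down, so a low reading only means something while a job is actually running on that GPU.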
Virtual memory
CUDA uses huge amounts of virtual address space to communicate with its GPUs. Any CUDA program on a 4-GPU machine will allocate address space equal to 4 x total physical memory, even if it only uses one GPU. Most of this is never touched, but it still counts against virtual memory limits. So we turn off some of the VM system's sanity checking in sysctl:
vm.overcommit_memory = 1
and tell SLURM not to limit virtual memory, in slurm.conf:
VSizeFactor=0
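To check that a node has actually picked up both settings, something like the following may help. It is a sketch only: describe_overcommit is a hypothetical helper that turns the kernel's overcommit mode number into words, and the commented commands assume sysctl and scontrol are on the path.

```shell
#!/bin/sh
# Hypothetical helper: translate a vm.overcommit_memory value into words.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "unknown mode: $1" ;;
    esac
}

# On a compute node:
#   describe_overcommit "$(sysctl -n vm.overcommit_memory)"
#   scontrol show config | grep -i VSizeFactor
```

With the settings above in place, the first command should report "always overcommit".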