See the general Rocks notes on the Computer Officers' wiki, which cover most things.

Power management

Use IPMI; not all of the nodes are on managed PDUs.


IPMIview20 is installed and has been the most successful way to get remote consoles. The IPMI controllers are flaky and a power cycle is often helpful.
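
When a graphical console is not needed, the same controllers can usually be driven from the command line. The following is a minimal sketch, assuming ipmitool is installed and the controllers accept IPMI-over-LAN; the helper name ipmi_cmd, the DRY_RUN convention, the default user "admin" and the BMC hostname in the usage comments are all made up for illustration. The password is read from the IPMI_PASSWORD environment variable via ipmitool's -E option.

```shell
# Sketch only: helper name, default user and DRY_RUN switch are assumptions.
# Usage: ipmi_cmd <bmc-host> <ipmitool subcommand...>
# With DRY_RUN=1 the command line is printed instead of being executed.
ipmi_cmd() {
    bmc="$1"; shift
    # Rebuild the argument list as a full ipmitool invocation.
    set -- ipmitool -I lanplus -H "$bmc" -U "${IPMI_USER:-admin}" -E "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Examples (hypothetical BMC hostname):
#   ipmi_cmd compute-0-0-ipmi chassis power status
#   ipmi_cmd compute-0-0-ipmi chassis power cycle
#   ipmi_cmd compute-0-0-ipmi mc reset cold    # restarts only the controller
```

Trying the command with DRY_RUN=1 first shows exactly what would be sent before anything is power cycled.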

GPU problems

Sometimes GPUs go on a go-slow. Run ssh -Y compute-whatever and then /usr/local/bin/cuda-z to see what speed each GPU is running at. Resetting the GPU with nvidia-smi -r sometimes helps.
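
If X forwarding is awkward, a rough text-only check is possible with nvidia-smi's query mode. This is a sketch, not what cuda-z measures: it only compares the current SM clock with the maximum, the helper name slow_gpus and the 50% threshold are arbitrary choices made here, and an idle GPU clocks down on its own, so the result only means something while a job is running.

```shell
# Hypothetical helper: reads "index, current_sm_mhz, max_sm_mhz" CSV lines on
# stdin and prints the index of each GPU running below half its maximum SM clock.
# Lines whose clock fields are not plain numbers are ignored.
slow_gpus() {
    awk -F', *' '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 + 0 < ($3 + 0) / 2 { print $1 }'
}

# On a compute node, feed it from nvidia-smi:
#   nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
#              --format=csv,noheader,nounits | slow_gpus
# To reset one GPU (as root, with nothing running on it), e.g. GPU 0:
#   nvidia-smi -r -i 0
```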

Virtual memory

CUDA uses insane amounts of address space to communicate with its GPUs. Any CUDA program on a 4-GPU machine will allocate 4 x total physical memory in address space, even if it only uses one GPU. So we turn off some of the VM system's sanity checking in sysctl:

 vm.overcommit_memory = 1

and tell SLURM not to limit virtual memory.
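
To check that a node actually has these settings, something like the following may help. It is a sketch under assumptions: the helper name overcommit_mode is invented here, and these notes do not record which SLURM option is used, so the slurm.conf parameters named in the comments (VSizeFactor and PropagateResourceLimitsExcept) are only the likely candidates, not a statement of how this cluster is configured.

```shell
# Hypothetical helper: print the kernel overcommit mode by name.
# Reads /proc/sys/vm/overcommit_memory unless another file is given.
overcommit_mode() {
    case "$(cat "${1:-/proc/sys/vm/overcommit_memory}")" in
        0) echo "heuristic" ;;   # kernel default
        1) echo "always" ;;      # what the sysctl setting above selects
        2) echo "strict" ;;
        *) echo "unknown" ;;
    esac
}

# Apply the sysctl setting immediately (as root), or reload /etc/sysctl.conf:
#   sysctl -w vm.overcommit_memory=1
#   sysctl -p
# See what SLURM is currently doing about virtual memory limits:
#   scontrol show config | grep -i -E 'VSizeFactor|PropagateResourceLimits'
# Candidate slurm.conf lines (assumptions, not taken from this cluster):
#   VSizeFactor=0                      # no virtual memory limit enforcement
#   PropagateResourceLimitsExcept=AS   # do not pass on the submitter's ulimit -v
```

On a correctly configured node, overcommit_mode should print "always".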



System status 

