See the general Rocks notes on the Computer Officers' wiki at https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Setting_up_clusters#t... which cover most things.

Power management

Use IPMI; not all of the nodes are on managed PDUs.
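
As an illustration of power control over IPMI from the command line, here is a minimal sketch using ipmitool. This is an assumption: these notes only mention IPMIview, so ipmitool may not be installed. The helper name ipmi_power, the default user "admin" and the BMC hostname in the usage line are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: builds an ipmitool chassis power command and runs it,
# or just prints it when DRY_RUN is set.
# Usage: ipmi_power <bmc-host> <status|on|off|cycle|reset>
ipmi_power() {
    host="$1"
    action="$2"
    case "$action" in
        status|on|off|cycle|reset) ;;
        *) echo "unknown action: $action" >&2; return 2 ;;
    esac
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment
    # variable. The default user name "admin" is a placeholder assumption.
    set -- ipmitool -I lanplus -H "$host" -U "${IPMI_USER:-admin}" -E \
        chassis power "$action"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 ipmi_power compute-0-0-ipmi status prints the command without contacting the controller; drop DRY_RUN to run it for real.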

IPMI

IPMIview20 is installed and has been the most successful way to get remote consoles. The IPMI controllers are flaky and a power cycle is often helpful.

GPU problems

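Sometimes a GPU gets stuck running slowly. To check, ssh -Y compute-whatever (the -Y enables X forwarding) and run /usr/local/bin/cuda-z to see what speed each GPU is running at. Resetting the GPU with nvidia-smi -r sometimes helps; the reset needs root and nothing else using the GPU.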
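
If a graphical session is inconvenient, the clocks can also be read with nvidia-smi. The following is a rough sketch rather than the procedure used here: the function name flag_slow_gpus and the 50% threshold are arbitrary assumptions. An idle GPU clocks down by itself, so the result only means something while a job is running.

```shell
#!/bin/sh
# Reads "index, current_sm_mhz, max_sm_mhz" lines on stdin and prints every
# GPU whose current SM clock is below PCT percent of its maximum.
# Usage: ... | flag_slow_gpus [PCT]     (PCT defaults to 50, an arbitrary choice)
flag_slow_gpus() {
    awk -F', *' -v pct="${1:-50}" '
        $3 > 0 && $2 * 100 < $3 * pct {
            printf "GPU %s: %d of %d MHz\n", $1, $2, $3
        }'
}

# Only query the driver if nvidia-smi is actually present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
               --format=csv,noheader,nounits | flag_slow_gpus 50
fi
```

No output means no GPU was below the threshold at the moment of the query.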

Virtual memory

CUDA reserves insane amounts of virtual address space to communicate with its GPUs. Any CUDA program on a 4-GPU machine will allocate 4 x the total physical memory in address space, even if it only uses one GPU. So we turn off some of the kernel's VM overcommit sanity checking in sysctl:

 vm.overcommit_memory = 1

and tell SLURM not to limit virtual memory, in slurm.conf:

 VSizeFactor=0
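
To confirm that both settings are in effect on a node, something like the sketch below could be used. The function name check_overcommit is made up for illustration, and the script assumes a Linux /proc filesystem and that scontrol is on the PATH.

```shell
#!/bin/sh
# Prints "ok" if the overcommit mode read from the given file is 1
# (always overcommit); otherwise reports the value found and returns 1.
check_overcommit() {
    file="${1:-/proc/sys/vm/overcommit_memory}"
    mode=$(cat "$file") || return 2
    if [ "$mode" = "1" ]; then
        echo "ok"
    else
        echo "overcommit_memory is $mode, expected 1"
        return 1
    fi
}

# To change the value at runtime (as root): sysctl -w vm.overcommit_memory=1
check_overcommit || true

# Show what the running SLURM configuration reports for VSizeFactor.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i VSizeFactor || true
fi
```

A value set with sysctl -w does not survive a reboot, so the line still needs to be in the sysctl configuration as shown above.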

 

System status 

System monitoring page
