See the general xCAT notes on the Computer Officers' wiki at https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Setting_up_clusters and https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Cluster_quick_reference (both restricted access).

GPU problems

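Occasionally a GPU starts running much slower than it should. To check, ssh -Y compute-whatever (substituting the name of the affected node) and run /usr/local/bin/cuda-z to see what speed each GPU is running at. Resetting the GPU with nvidia-smi -r sometimes helps; the reset needs root and only works when nothing else is using that GPU.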
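
Where a non-graphical check is more convenient than cuda-z, nvidia-smi's query mode can report each GPU's current and maximum SM clock. The following is only a sketch: the helper name flag_slow_gpus and the 0.5 threshold are assumptions made for illustration, not part of the cluster's tooling.

```shell
# Hypothetical helper: reads "index, current SM clock, max SM clock" CSV rows
# on stdin and prints every GPU running below a fraction of its maximum clock.
# The 0.5 default threshold is an arbitrary assumption, not a site policy.
flag_slow_gpus() {
    awk -F', *' -v frac="${1:-0.5}" '
        ($2 + 0) < ($3 + 0) * frac { print "GPU " $1 ": " $2 " of " $3 " MHz" }
    '
}

# On a compute node with the NVIDIA driver installed, feed it live data.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
               --format=csv,noheader,nounits | flag_slow_gpus
fi
```

An idle GPU normally clocks itself down, so the output is only meaningful while the GPU is under load.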

Virtual memory

CUDA reserves enormous amounts of virtual address space to communicate with its GPUs. Any CUDA program on a 4 GPU machine will allocate address space equal to 4 × total physical memory, even if it only uses one GPU. So we turn off some of the kernel's virtual memory sanity checking (always allow overcommit) in sysctl:

 vm.overcommit_memory = 1

and tell SLURM not to limit virtual memory, in slurm.conf:

 VSizeFactor=0
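
To confirm that both settings are actually in effect on a node, something along these lines may help. It is a minimal sketch: the function name check_cuda_vm_settings is hypothetical, and /etc/slurm/slurm.conf is an assumed location that varies between installations.

```shell
# Hypothetical check: report whether the kernel and SLURM settings described
# above are in place. Both paths can be overridden; the slurm.conf default is
# an assumption and differs between installations.
check_cuda_vm_settings() {
    overcommit_file="${1:-/proc/sys/vm/overcommit_memory}"
    slurm_conf="${2:-/etc/slurm/slurm.conf}"

    # The kernel exposes the live value of vm.overcommit_memory under /proc.
    if [ "$(cat "$overcommit_file" 2>/dev/null)" = "1" ]; then
        echo "vm.overcommit_memory: ok"
    else
        echo "vm.overcommit_memory: not 1"
    fi

    # Look for an uncommented VSizeFactor=0 line in slurm.conf.
    if grep -Eiq '^[[:space:]]*VSizeFactor[[:space:]]*=[[:space:]]*0[[:space:]]*$' \
            "$slurm_conf" 2>/dev/null; then
        echo "VSizeFactor: ok"
    else
        echo "VSizeFactor: not explicitly 0"
    fi
}
```

Called with no arguments, the function reads the live kernel value and the assumed slurm.conf path. The kernel setting can be applied immediately with sysctl -w vm.overcommit_memory=1, but it only survives a reboot if it is also written to /etc/sysctl.conf or a file under /etc/sysctl.d/.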

