Pat cluster admin notes

See the general xCAT notes on the Computer Officers' wiki at https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Setting_up_clusters and https://wikis.ch.cam.ac.uk/cosdocco/wiki/index.php/Cluster_quick_reference (both restricted access).

GPU problems

GPUs sometimes drop to a reduced speed. To check, ssh -Y to the affected node (compute-whatever) and run /usr/local/bin/cuda-z to see what speed each GPU is running at. Resetting the GPU with nvidia-smi -r sometimes helps.
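For a quick check without the cuda-z GUI, the clocks can also be read from nvidia-smi. The following is a minimal sketch, not the procedure used on Pat: the helper name flag_slow_gpus and the 50 percent threshold are assumptions made for illustration. An idle GPU legitimately clocks down, so the result only means something while a job is running on the node.

```shell
#!/bin/sh
# Hypothetical helper: print the index of every GPU whose current SM clock
# is below a given percentage of its maximum SM clock.
# Input: one line per GPU in the form "index, current_mhz, max_mhz".
flag_slow_gpus() {
    threshold="${1:-50}"   # percent; 50 is an arbitrary illustrative default
    awk -F', *' -v t="$threshold" '$2 + 0 < ($3 + 0) * t / 100 { print $1 }'
}

# Only query the hardware if nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm \
               --format=csv,noheader,nounits | flag_slow_gpus 50
fi
```

Any index printed is a candidate for a reset. nvidia-smi -r normally needs root and a GPU with no running processes, and -i can be added to target a single GPU.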

Virtual memory

CUDA uses huge amounts of virtual address space to communicate with its GPUs. Any CUDA program on a 4-GPU machine will allocate four times the total physical memory, even if it only uses one GPU. So we turn off some of the VM system's sanity checking in sysctl:

 vm.overcommit_memory = 1

and tell SLURM not to limit virtual memory, in slurm.conf:

 VSizeFactor=0
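To confirm that both settings are live on a node, something like the following can be run. It is a sketch under assumptions: the helper name overcommit_mode is invented here, and the script simply reports what the kernel and SLURM currently say rather than changing anything.

```shell
#!/bin/sh
# Hypothetical helper: map a vm.overcommit_memory value to its kernel meaning.
overcommit_mode() {
    case "$1" in
        0) echo "heuristic" ;;
        1) echo "always" ;;
        2) echo "never" ;;
        *) echo "unknown" ;;
    esac
}

# Report the running kernel's overcommit policy (expected here: "always").
if [ -r /proc/sys/vm/overcommit_memory ]; then
    echo "vm.overcommit_memory: $(overcommit_mode "$(cat /proc/sys/vm/overcommit_memory)")"
fi

# Report the value SLURM is actually using, if scontrol is installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i VSizeFactor || echo "VSizeFactor not reported"
fi
```

A value edited in a sysctl configuration file only takes effect after it is reloaded (for example with sysctl -p) or after a reboot, so checking the running value is worthwhile after any change.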
