Rogue is a very small cluster consisting of two GPU servers and a filestore/head node, all running Linux. Each GPU server has eight Nvidia V100 GPUs, two 16-core Intel Skylake CPUs and 192 GB of RAM.

Rogue can only be accessed by sshing into the head node, whose external name is rogue.ch.private.cam.ac.uk. All work is done from there; there is no need to log in directly to the compute nodes. Rogue uses the local Admitto service, so you log in with the same password as on the workstations.
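For example, from a departmental workstation you would log in with a command like the one below ('abc123' is only a placeholder; use your own username):

    ssh abc123@rogue.ch.private.cam.ac.uk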

Homespace is on a disk array attached to the head node. The /home filesystem is currently 5TB in size, but we have room to grow this if necessary. It has user quotas, currently set to a soft limit of 20GB and a hard limit of 25GB. It is backed up regularly; the most recent backups are available under /rsnapshots on the head node, and older copies can be accessed by the IT staff.
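As a rough sketch, assuming the standard Linux quota tools are available on the head node, you can check your usage against the limits and see the recent backups like this:

    quota -s          # show your current usage, soft limit and hard limit in human-readable units
    ls /rsnapshots    # list the available backup snapshots on the head node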

/home is shared to all nodes on the cluster's internal network, so your job sees the same home directory whichever node it runs on. It is important to remember that, from a compute job's point of view, accessing this directory is extremely slow, especially if all the nodes are trying at once. Compute jobs should always write data to a local disk if possible and copy it back to /home at the end (the example job script later in this page shows this pattern).

There is also a shared scratch filesystem, /sharedscratch, in which you will have a directory. It is not backed up. It has a quota with a soft limit of 250GB and a hard limit of 300GB, but it is expected that most people will stay well within that amount. It has the same speed issue as /home.

Each node also has a local /scratch filesystem, on which the queueing system will create a directory for you when you use the node. These filesystems are about 4TB in size with no quota restriction, and they are the most appropriate place for your jobs to write temporary files during a run. They are local to each node and so considerably faster than the NFS-mounted /home and /sharedscratch. Please clean up files on /scratch when you are done with them; see the queueing documentation for how to find out which node's /scratch to look at. All of the node /scratch directories are accessible under /nodescratch on the head node. The system uses an automounter, so the directories only appear when you try to access them. For example, to see the /scratch from node node-0-0 you need to type something like 'ls /nodescratch/node-0-0'.
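For example, to check what you have left behind on node-0-0's local disk and remove it from the head node (the directory name 'myjob' below is only a placeholder, and this assumes the files are still owned by you; the queueing documentation explains what the real per-job directory is called):

    ls /nodescratch/node-0-0             # the automounter mounts the node's /scratch on first access
    rm -rf /nodescratch/node-0-0/myjob   # clean up your leftover files (placeholder path)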

A variety of compilers and libraries are installed. Like most local Linux machines, Rogue has the modules environment to allow you to switch between different compilers and libraries. The default environment is set up with all the available 64-bit compilers and the latest Nvidia CUDA. If you need to change this, use the module avail command to see what the other options are, and add the module load command for the version you want to your ~/.bashrc file.
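A minimal sketch of the workflow; the module name below is purely illustrative, so check module avail for what is actually installed:

    module avail              # list the available compilers and libraries
    module load gcc/8.2.0     # hypothetical example: switch to a particular compiler version
    module list               # confirm what is currently loaded

To make such a change permanent, put the corresponding module load line in your ~/.bashrc.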

All compute jobs should be run through the queueing system. The queueing system will run each job on a set of free compute nodes, copying the output back to a user-specified file at the end of the job. The queueing system is SLURM; this will be familiar to users of some of the other theory sector clusters, but please note that the available queues are not the same on every machine.

Read the generic instructions on how to use SLURM. Rogue's queueing setup is currently very simple: there is only one queue (partition, in SLURM usage), and jobs are started in priority order, provided the right number of CPUs and GPUs are available. Priority is calculated from recent usage, so if you have used a lot of time lately your waiting jobs' priority goes down. We may change the setup as we get more experience with the system.
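As an illustration only, a job script following the pattern described above (request some CPUs and a GPU, work on the node's local /scratch, copy results back to /home at the end) might look something like the sketch below. The GPU request syntax, the /scratch path and the program name are assumptions; check the generic SLURM instructions and the queueing documentation for the details that apply on Rogue. Since there is only one partition, no partition option is needed.

    #!/bin/bash
    #SBATCH --job-name=myrun           # placeholder job name
    #SBATCH --ntasks=1                 # a single task
    #SBATCH --cpus-per-task=4          # CPU cores for the job
    #SBATCH --gres=gpu:1               # request one GPU (syntax assumed)
    #SBATCH --output=myrun.%j.out      # file the output is copied back to

    # Work on the fast local /scratch of the compute node; the queueing
    # documentation describes the directory that is created for you there.
    cd /scratch/$USER                  # placeholder path

    ./my_simulation > results.dat      # placeholder program, writing locally

    # Copy the results back to the NFS-mounted home directory at the end.
    cp results.dat ~/

Submit the script with something like 'sbatch myrun.sh' and check its progress with 'squeue'.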

System status 

System monitoring page
