
This document is for anyone who has to manage cerebro. There's not much in here for end users.

Day-to-day stuff

Adding users

Cerebro picks up its user accounts from Active Directory. If the user doesn't already have an AD account, create one, then add them to the 'cerebro-users' group, which lives in the Alavi container. A cron job runs once an hour and copies any new accounts in 'cerebro-users' into cerebro's local Unix account system. Passwords are checked directly against the AD and not stored locally. The cerebro-users group includes the alavi group, so all members of that group get accounts too.
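
A quick way to check that the hourly sync has picked up a new account is to look it up on the head node (the username here is just a placeholder):

getent passwd spqr1
id spqr1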

Alavi group computer reps can edit the group membership using the Delegated Management System.

Give the user a copy of, or point them at, the cerebro user notes. If they haven't used any of the local clusters before, also give them the Theory sector Maui/Torque introduction to get them going.

SLURM

See the local SLURM notes.

Node access control for jobs is controlled by SLURM: people can only ssh to a node they have a live job on. The SLURM prologue script creates the local /scratch directories on nodes as needed.
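
For reference, a minimal sketch of what such a prologue script could look like; the real script and the exact /scratch layout on cerebro may well differ:

#!/bin/bash
# Runs as root on each allocated node before the job starts.
# slurmd sets SLURM_JOB_USER (and SLURM_JOB_ID) for prologue scripts.
mkdir -p /scratch/${SLURM_JOB_USER}
chown ${SLURM_JOB_USER}: /scratch/${SLURM_JOB_USER}
exit 0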

Parallel tools

Cerebro runs Rocks, so to run a command on all nodes use 'rocks run host', eg

rocks run host "ls -l /home/cen1001"
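
To hit just one node, give its name, eg

rocks run host compute-0-0 "uptime"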

Startup and shutdown

The nodes have IPMI cards and can be powered up and down from the head node.

# apc on compute-0-0
# apc off compute-0-0
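
You can also talk to a node's IPMI card directly with ipmitool; the user and password below are placeholders (see the note under 'Software' for the node0XX.ipmi.cluster names):

# ipmitool -I lanplus -H node0XX.ipmi.cluster -U admin -P <password> chassis power status
# ipmitool -I lanplus -H node0XX.ipmi.cluster -U admin -P <password> chassis power on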

Documentation

There are user docs on the web at cerebro's pages, and some more material in the filesystem at /cm/shared/docs/cm.

Updating software

# yum update 

# yum upgrade

Don't forget to reboot if you need to activate new kernels.
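
To see whether the running kernel is the newest one installed, compare:

# uname -r
# rpm -q kernel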

Adding software

Use yum to see if you can get the package as part of the OS.
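
For example (fftw is just an illustration):

# yum search fftw
# yum install fftw-devel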

It's generally best to put third-party applications under /usr/local, which is NFS-shared to the nodes, and to put any modulefiles in /usr/local/modulefiles.
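
A minimal modulefile for something installed under /usr/local might look like this; the package name and paths are just an example:

# cat > /usr/local/modulefiles/somepkg/1.0 <<'EOF'
#%Module1.0
prepend-path PATH            /usr/local/somepkg/1.0/bin
prepend-path LD_LIBRARY_PATH /usr/local/somepkg/1.0/lib
EOF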

Dealing with problems

Reinstalling the nodes

The nodes sync with the node image on boot, so if you suspect software problems, just reboot any misbehaving ones.
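
For example, to bounce a single node from the head node:

# rocks run host compute-0-0 "shutdown -r now"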

Hardware problems

If you need to remove a node, tell SLURM first:

scontrol update NodeName=compute-0-0 State=drain Reason="Hardware fault" # Do the other quads in the chassis too.
sinfo # wait until there are no jobs left on any of the four
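
To see what's still running on a draining node:

squeue -w compute-0-0
sinfo -R # lists drained/down nodes along with the reasons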

Once a node is back:

scontrol update NodeName=compute-0-0 State=resume
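
Then check it has gone back to idle:

sinfo -n compute-0-0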

Backups

On pip.

Other useful information

Hardware

The disks on the head node are arranged as a RAID1 and a RAID6. Hobbit keeps an eye on them.
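
If they are Linux software RAID arrays (an assumption; they may be on a hardware controller), you can check them by hand with

# cat /proc/mdstat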

Software

Chunks of /usr/local/shared are synced from the network every day to pick up new versions of compilers.
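
Roughly speaking the job is an rsync along these lines; the source host and exact paths here are made up:

# rsync -a fileserver:/usr/local/shared/compilers/ /usr/local/shared/compilers/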

You can use the web interface to the IPMI cards on the nodes by starting firefox on the head node and pointing it at node0XX.ipmi.cluster, and logging in with the right password, of course.
