This document is for anyone who has to manage sinister. There's not much in here for end users.

Day to day stuff

Sinister picks up its user accounts from Active Directory. If the user does not already have an AD account, create one, then add them to the 'sinister-users' group, which can be found in the Surface Science container. A cron job runs once per hour and copies any new user accounts in 'sinister-users' into sinister's local OpenLDAP system. It does not yet handle Unix groups other than the personal group. Passwords are checked directly against the AD and are not stored locally.
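
Because the sync only runs hourly, a new account may not appear straight away. A minimal sketch for checking whether an account has arrived yet, assuming the head node's name service lookups (and therefore getent) consult the local OpenLDAP; the function name account_synced is hypothetical and not part of sinister's tooling:

```shell
#!/bin/sh
# Hypothetical helper: report whether an account is visible on this host.
# Assumption: getent's passwd lookup consults the local OpenLDAP directory.
account_synced() {
    if getent passwd "$1" >/dev/null 2>&1; then
        echo "$1: present"
    else
        echo "$1: not yet present"
        return 1
    fi
}
```

If the account is still missing more than an hour after it was added to 'sinister-users', check the group membership in the AD before suspecting the cron job.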

Give the user a copy of the sinister user notes, or point them at where to find them. If they haven't used any of the local clusters before, also give them the Theory sector SLURM introduction to get them going.

SLURM

The SLURM prologue script makes the local /scratch directories on nodes as needed.

The SLURM config does not automatically put nodes back online if they recover after a crash; you need to run:

scontrol update node=<nodename> state=resume
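
When several nodes come back at once, the same command has to be repeated for each. The following is a small sketch of a dry-run wrapper that only prints the commands so they can be checked first; the function name resume_nodes is hypothetical, and the node names are examples:

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one scontrol resume command per node.
# "sinfo -R" lists nodes that are down or drained along with the recorded
# reason, which is a convenient way to decide which names to pass in.
resume_nodes() {
    for node in "$@"; do
        # Remove the leading "echo" to run the command instead of printing it.
        echo scontrol update node="$node" state=resume
    done
}

resume_nodes compute-0-13 compute-0-44
```

Keeping the echo in place by default is deliberate: resuming a node that is still broken just sends jobs to it, so it is worth reading the list before running anything.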

Parallel tools

To run a command on all of the compute nodes:

rocks run host compute "command arg1 arg2"

Documentation

The cluster runs Rocks, so the documentation is available online.

Startup and shutdown

Use the 'apc' script to power nodes on and off. Most nodes also have IPMI, but only try it if 'apc' does not work. For example, compute-0-13's IPMI controller is at ipmi-0-13.ipmi. If a node cannot be brought back, it must be marked as offline to the cluster. For example:

scontrol update node=compute-0-44 state=drain reason="Fails to boot RT192807"

Other useful information

Hardware

The disks on the head node are arranged as a RAID1 and a RAID6. Hobbit keeps an eye on them.

Software

Chunks of /usr/local/shared are synced from the network every day to pick up new versions of compilers.

You can use the web interface to the IPMI cards on the nodes by starting firefox on the head node, pointing it at ipmi-0-X.ipmi and logging in with the right password.
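
The IPMI names follow the node names (compute-0-13 has its controller at ipmi-0-13.ipmi). A minimal sketch of that mapping, assuming every compute node follows the same pattern; the function name ipmi_host is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: derive a node's IPMI controller name from its node name.
# Assumption: compute-R-N always maps to ipmi-R-N.ipmi.
ipmi_host() {
    case "$1" in
        compute-*)
            # Strip the "compute-" prefix and wrap the remainder.
            echo "ipmi-${1#compute-}.ipmi"
            ;;
        *)
            echo "ipmi_host: not a compute node name: $1" >&2
            return 1
            ;;
    esac
}

ipmi_host compute-0-13   # prints ipmi-0-13.ipmi
```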

System status

System monitoring page

