This document is for anyone who has to manage elvis. There's not much in here for end users.

Day-to-day stuff

Adding users

Elvis picks up its user accounts from the Active Directory. Create an AD account for the user if they don't have one already, then add them to the 'elvis-users' group, which can be found in the Surface Science container. A cron job runs once per hour and copies any new accounts in 'elvis-users' into elvis's local OpenLDAP directory. It doesn't handle Unix groups other than the personal group yet. Passwords are checked directly against the AD and not stored locally.
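
Once the hourly job has run you can check that the account has arrived (the username here is made up):

getent passwd jbloggs    # should show the copied account
id jbloggs               # and its personal group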

Give the user a copy of, or point them at, the elvis user notes. If they haven't used any of the local clusters before, also give them the Theory sector Maui/Torque introduction to get them going.

IPMI

Elvis only has IPMI on its internal network. Luckily elvis-filestore has an interface onto the Chemistry network, so you can get at the IPMI by logging onto elvis-filestore and using ipmitool there. The hostname is elvis-ipmi.
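
For example, from elvis-filestore (the BMC username here is a guess; use whatever was set on the board):

ipmitool -I lanplus -H elvis-ipmi -U ADMIN chassis power status
ipmitool -I lanplus -H elvis-ipmi -U ADMIN sel list    # hardware event log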

Torque/Maui

See the local Maui admin guide. The queue setup is very basic as elvis has few users.

Node access control for jobs is completely open: any user can log into any node. The Torque prologue script creates the local /scratch directories on the nodes as needed.
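
Roughly what the prologue does, as a sketch rather than the real script (Torque passes the job id, user and group as the first three arguments):

#!/bin/sh
# $1 = job id, $2 = user, $3 = group
mkdir -p "/scratch/$2"
chown "$2:$3" "/scratch/$2"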

The epilogue script tries to clean up IPC resources (ipcs) where it is safe to do so.
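
Something along these lines, though the real script is more careful about jobs sharing a node ($2 is again the job's user):

# remove shared memory segments left behind by the job's user
for id in $(ipcs -m | awk -v u="$2" '$3 == u {print $2}'); do
    ipcrm -m "$id"
done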

Parallel tools

pexec still exists. You can also do a great deal from the cmgui cluster management GUI or its command-line equivalent, cmsh.
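
For example, pexec runs a command on every compute node:

pexec uptime
pexec date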

Startup and shutdown

The nodes are on IPMI and can be powered up and down from the head node.

# cmsh -c "device; power on -c slave"
# cmsh -c "device; power off -n node002"
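
A status check should work the same way if you just want to see what is on:

# cmsh -c "device; power status -c slave"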

Documentation

There is /info/new, the user docs on the web at elvis's pages, and some Cluster Manager documentation in the filesystem at /cm/shared/docs/cm.

The /etc/motd file should be kept up to date from /info/new using the make-motd script (on the PATH for root).

When you edit the 'new' file, run make-motd to regenerate /etc/motd from it. Run without options the script just prints the new motd to stdout; run as make-motd -r ('r' for really) it copies the new one into place and tries to update the copy on the web server too.
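
So the usual sequence is:

vi /info/new
make-motd        # check the output looks right
make-motd -r     # install it and update the web copy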

Updating software

# yum update # even kernel updates are safe now
# yum --installroot /cm/images/default-image update

Don't forget to reboot if you need to activate new kernels.
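
A quick way to see whether a reboot is needed (running kernel versus newest installed):

uname -r
rpm -q kernel --last | head -1    # most recently installed kernel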

Adding software

Use yum to see if you can get the package as part of the OS.
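
For example ('foobar' is just a stand-in name):

yum search foobar
yum info foobar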

It is generally best to put third-party applications under /usr/local, which is NFS-shared to the nodes, and any modulefiles under /usr/local/Modules.
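
A typical install, with made-up names and versions:

./configure --prefix=/usr/local/foobar/1.0
make && make install
# then add a matching modulefile under /usr/local/Modules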

If you need to add something to every compute node that can't go in /usr/local, add it using the parallel tools and then edit the node install image so that reinstalled nodes still get it. The image lives under /cm/images/default-image/.

# yum --installroot /cm/images/default-image install foobar

After doing this, reboot a node and check it all still works; nodes resync with the image on every boot. Or you can get fancy with the Cluster Manager GUI (cmgui) by making a test image and putting one node on it.
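
For example, after adding a package to the image (package name illustrative):

# cmsh -c "device; reboot -n node001"
# ssh node001 rpm -q foobar    # once it is back up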

InfiniBand

The head node has an IB card and runs the IB subnet manager daemon, smd. node001 also runs a copy of the IB subnet manager as a backup; you don't need more than that.

Troubleshooting tips from Pieter:

ibclearcounters
ibclearerrors
ibchecknet
ibcheckerrors
ibdiagnet # will produce one harmless warning about subnets if all's well
ibstat 
ibstatus
pexec ibstatus | grep state
pexec ibstatus | grep rate
perfquery | grep SymbolErrors # should be zero
pexec perfquery | grep SymbolErrors

Dealing with problems

Reinstalling the nodes

The nodes sync with the node image on boot, so just reboot any misbehaving ones if you suspect software problems.
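
If a node is too wedged to reboot itself, power cycle it from the head node:

# cmsh -c "device; power reset -n node002"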

Hardware problems

If you need to remove a node, tell PBS first:

pbsnodes -o nodeXXX.cm.cluster # yes, you need the full name. Do the twin too.
checknode nodeXXX # wait until there are no reservations on either twin

Once a node is back:

pbsnodes -c nodeXXX.cm.cluster

You can get a nifty IPMI console on a node by running Firefox on the head node, pointing it at nodeXXX.ipmi.cluster, and watching the node boot. You can run memtest from there.

Backups

Officially there are none. The Computer Officers back up the OS, but not the user data.

Tech support

support@clustervision.com, quote id 90441

Other useful information

Hardware

The disks on the head node are arranged as a RAID1. The controller is set up to email on problems.

There are three switches: the IB switch ibswitch01 (not managed, no IP), switch01 (the main ethernet, managed) and switch02 (IPMI net, not managed, no IP).

Software

Chunks of /usr/local/shared are synced from the network every day to pick up new versions of compilers.
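
Something like this, run daily (the source host and exact paths here are made up; the real mechanism may differ):

rsync -a theoryserver:/usr/local/shared/ /usr/local/shared/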
