
This document is for anyone who has to manage Hathor. There's not much in here for end users.

Day to day stuff

Adding users

Hathor takes its user accounts from Active Directory. Create an AD account for the user if they don't already have one, then add them to the 'hathor-users' group. A cron job runs every ten minutes and copies any new accounts in 'hathor-users' into Hathor's local accounts database. It does not yet handle Unix groups other than each user's personal group. Passwords are checked directly against AD and are not stored locally.

Torque/Maui

See the local Maui admin guide. The queue setup is very basic.

Node access control for jobs is completely open. Any user can log into any node.

Parallel tools

Hathor has the xCAT cluster management stack installed. The parallel shells are xdsh and psh, and the parallel copy tools are xdcp and pscp (parallel scp).
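
As a rough illustration of how these tools are typically invoked, the sketch below wraps xdsh and xdcp in two small shell functions. The function names (on_nodes, push_file) and the DRY_RUN switch are made up for this guide; 'compute' is the node group used in the rpower examples in this document. With DRY_RUN left at its default the commands are only printed, not run.

```shell
#!/bin/sh
# Sketch only: on_nodes, push_file and DRY_RUN are illustrative names,
# not part of xCAT. Commands are printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run one command on every node in a noderange (xdsh <noderange> <command>).
on_nodes() {
    noderange="$1"
    shift
    run xdsh "$noderange" "$@"
}

# Copy a file to the same path on every node (xdcp <noderange> <source> <target>).
push_file() {
    run xdcp "$1" "$2" "$2"
}
```

For example, on_nodes compute uptime prints the line "xdsh compute uptime". Set DRY_RUN=0 before calling the functions to run the commands for real.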

Startup and shutdown

The nodes are on IPMI and can be powered up and down from the head node.

# rpower compute state # show power status of all compute nodes
# rpower node001 off # power off node001
# rpower compute on # power on all compute nodes

Documentation

http://xcat.sourceforge.net/

Customizing

Don't edit things under /opt/xcat/share/xcat; put customizations under /install/custom/install/centos instead.

Notes: synclists for files under /install; extra RPMs; createrepo.

Updating software

# yum update # I guess. But kernel updates may not be safe.

 

Adding software

Use yum to see if you can get the package as part of the OS.

It is generally best to put third-party applications under /usr/local/apps, which is NFS-shared to the nodes.

Dealing with problems

Reinstalling the nodes

# nodeset nodeXXX install # set the node to install on its next boot
# nodestat nodeXXX # check the node's current status
# rinstall nodeXXX # sets the node to install on boot and power-cycles it

These days the autoinstaller seems a bit broken, and I have to install OFED once the install has finished by doing:

# updatenode nodeXXX -P ofed

Or, as of 2016, this works (NB the node will take a very long time to boot until you do this, because it times out on a lot of things):

# ssh nodeXXX
# mount 192.168.122.254:/install /mnt
# /mnt/MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64/mlnxofedinstall

Hardware problems

If you need to remove a node, tell PBS first:

pbsnodes -o nodeXXX

Once the node is back, clear the offline flag:

pbsnodes -c nodeXXX
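
Combined with the rpower commands from the startup and shutdown section, a maintenance window for one node might look like the sketch below. The function names and the DRY_RUN switch are invented for illustration, and the ordering (mark offline, then power off; power on, then clear) is an assumption about what is sensible rather than a documented local procedure. It does not wait for running jobs to finish.

```shell
#!/bin/sh
# Sketch only: take_down, bring_back and DRY_RUN are illustrative names.
# Commands are printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Stop new jobs landing on the node, then power it off.
# NOTE: does not check for jobs still running on the node.
take_down() {
    run pbsnodes -o "$1" && run rpower "$1" off
}

# Power the node back on, then clear its offline flag.
bring_back() {
    run rpower "$1" on && run pbsnodes -c "$1"
}
```

Running take_down node001 with the default DRY_RUN prints the two commands it would issue, which is a cheap way to check the plan before doing it for real.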

Backups

See http://hobbit.ch.cam.ac.uk/xymon/2/2Elliott/2ElliottLinux/

Tech support

support@ocf.co.uk

Other useful information

Hardware

The disks on the head node are arranged as a RAID1.

There are some nifty commands for managing nodes.

rbeacon nodeXXX # flash the lights in nodeXXX
wcons nodeXXX # KVM over IP console, needs X
rcons nodeXXX # same but without X; quit with Ctrl-E, then c, then . (as on a Sun SSP)

Software

xCAT's config database is in SQLite on the head node. tabdump is a useful query command.
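
For example, tabdump with no arguments lists the tables, and tabdump nodelist dumps the node list as CSV. The sketch below filters that CSV for one group. It assumes the first two columns are the quoted node name and a quoted, comma-separated group list; check the header line of your own output before relying on it. The function name nodes_in_group is hypothetical. xCAT's own nodels command answers the same question directly, so this only illustrates working with tabdump output.

```shell
#!/bin/sh
# Sketch only: nodes_in_group is an illustrative helper, not an xCAT command.
# Reads `tabdump nodelist` style CSV on stdin and prints the nodes in a group.
# ASSUMPTION: column 1 is "node", column 2 is "group1,group2,...", both quoted.
nodes_in_group() {
    awk -F'"' -v group="$1" '
        /^#/ { next }                      # skip the header line
        {
            n = split($4, groups, ",")     # $2 = node name, $4 = group list
            for (i = 1; i <= n; i++)
                if (groups[i] == group) print $2
        }'
}
```

Typical use would be: tabdump nodelist | nodes_in_group compute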

updatenode is the command for updating nodes. Be careful with its arguments. By default it runs all of the node's config scripts, which includes reinstalling InfiniBand.
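
One way to guard against that default is a small wrapper that refuses to call updatenode unless a postscript list is given. The sketch below relies on the -P option used in the reinstall section; the function name safe_updatenode and the DRY_RUN switch are made up here.

```shell
#!/bin/sh
# Sketch only: safe_updatenode and DRY_RUN are illustrative names.
# The command is printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Usage: safe_updatenode <noderange> <script[,script...]>
# Refuses to run without an explicit postscript list, so it can never
# fall through to "run every config script".
safe_updatenode() {
    if [ "$#" -ne 2 ] || [ -z "$2" ]; then
        echo "usage: safe_updatenode <noderange> <script[,script...]>" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "updatenode $1 -P $2"
    else
        updatenode "$1" -P "$2"
    fi
}
```

Calling safe_updatenode nodeXXX ofed prints the updatenode line it would run, while calling it without a script list fails with a usage message instead of touching the node.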
