This document is for anyone who has to manage Hathor. There's not much in here for end users.
Day to day stuff
Hathor picks up its user accounts from Active Directory. If the user doesn't already have an AD account, create one, then add them to the 'hathor-users' group. A cron job runs every ten minutes and copies any new accounts in 'hathor-users' into Hathor's local accounts database. It doesn't handle Unix groups other than the personal group yet. Passwords are checked directly against AD and are not stored locally.
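Because the sync runs from cron, a new user may not appear for up to ten minutes. A minimal sketch for checking whether an account has arrived yet, assuming the local accounts database is visible through getent (an assumption, not something confirmed here); user_synced is a made-up helper name:

```shell
#!/bin/sh
# Report whether an account is visible on this machine yet.
# Assumes the local accounts database is reachable via "getent passwd".
user_synced() {
    if getent passwd "$1" >/dev/null; then
        echo "$1: present"
    else
        echo "$1: not synced yet (the cron job runs every ten minutes)"
        return 1
    fi
}
```

For example, `user_synced alice` prints one line saying whether alice has been picked up.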
See the local Maui admin guide. The queue setup is very basic.
Node access control for jobs is completely open. Any user can log into any node.
Hathor has the xCAT cluster management stack installed. The parallel shell is called xdsh. There is also one called psh. And there is pscp (parallel scp) and xdcp.
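As a rough illustration of the parallel tools, the sketch below copies a file to a set of nodes with xdcp and then lists it on each node with xdsh. push_file is a hypothetical helper name, and "compute" is the node group used in the rpower examples in this guide.

```shell
#!/bin/sh
# Copy a file from the head node to every node in a noderange,
# then list it on each node to confirm it arrived.
push_file() {
    noderange=$1   # e.g. compute, or a single node such as node001
    src=$2         # file on the head node
    dest=$3        # destination path on the nodes
    xdcp "$noderange" "$src" "$dest" || return 1
    xdsh "$noderange" "ls -l $dest"
}
```

For example, `push_file compute /etc/hosts /etc/hosts` pushes the head node's hosts file to all compute nodes.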
Startup and shutdown
The nodes are on IPMI and can be powered up and down from the head node.
# rpower compute state    # show power status of all compute nodes
# rpower node001 off      # power off node001
# rpower compute on       # power on all compute nodes
Don't edit things under /opt/xcat/share/xcat. Use /install/custom/install/centos instead.
Synclists for files go under /install. For extra RPMs, run createrepo.
# yum update # I guess. But kernel updates may not be safe.
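Since kernel updates may not be safe, one cautious option is to hold kernel packages back during an update. This is a sketch only: update_no_kernel is a made-up name, and whether holding the kernel back is right for Hathor is an assumption to check first.

```shell
#!/bin/sh
# Run a yum update but leave every package matching kernel* alone.
# Extra arguments (e.g. -y) are passed straight through to yum.
update_no_kernel() {
    yum update --exclude='kernel*' "$@"
}
```

Calling `update_no_kernel` with no arguments behaves like an interactive `yum update` minus the kernel packages.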
Use yum to see if you can get the package as part of the OS.
Generally best to put 3rd party applications under /usr/local/apps where they are NFS-shared to the nodes.
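A small sketch of that decision: ask yum whether it knows the package, and fall back to a manual install under /usr/local/apps if it doesn't. os_has_package is a hypothetical helper name, and the check only covers whatever repositories yum is currently configured with.

```shell
#!/bin/sh
# Ask yum whether a package is installed or available from the
# configured repositories. "yum list" exits non-zero when nothing matches.
os_has_package() {
    if yum -q list "$1" >/dev/null 2>&1; then
        echo "$1: available from the OS repositories"
    else
        echo "$1: not in the OS repositories"
        return 1
    fi
}
```

If `os_has_package somepackage` reports that the package is missing, it is a candidate for /usr/local/apps.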
Dealing with problems
Reinstalling the nodes
# nodeset nodeXXX install   # set the node to install on next boot
# nodestat nodeXXX          # check the node's status
# rinstall nodeXXX          # does both: sets it to install and power cycles it
These days the autoinstaller seems a bit broken and I have to install OFED after the install by doing:
# updatenode node0XX -P ofed
Or (2016) this works (NB: the node will take a very long time to boot until you do this, as it will time out on a lot of stuff):
# ssh node00X
# mount 192.168.122.254:/install /mnt
# /mnt/MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64/mlnxofedinstall
If you need to remove a node, tell PBS first:
pbsnodes -o nodeXXX
Once the node is back:
pbsnodes -c nodeXXX
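The two pbsnodes calls can be wrapped around a piece of maintenance so the node is only returned to service if the work succeeded. This is a sketch, not an established procedure here: with_node_offline is a made-up helper, and marking a node offline only stops new jobs being scheduled on it, so anything already running there should be checked separately.

```shell
#!/bin/sh
# Mark a node offline in PBS, run a maintenance command, and clear the
# offline flag only if that command succeeded.
with_node_offline() {
    node=$1
    shift
    pbsnodes -o "$node" || return 1    # no new jobs will be scheduled here
    if "$@"; then
        pbsnodes -c "$node"            # work succeeded: back into service
    else
        echo "maintenance failed; $node left offline" >&2
        return 1
    fi
}
```

For example, `with_node_offline node001 updatenode node001 -P ofed`. It is not suitable for commands that return before the work is finished, such as rinstall, because the node would be cleared too early.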
Other useful information
The disks on the head node are arranged as a RAID1.
There are some nifty commands for managing nodes.
rbeacon nodeXXX   # flash the lights in nodeXXX
wcons nodeXXX     # KVM over IP console, needs X
rcons nodeXXX     # same but without X. Quit from it with control-e c . like on a Sun SSP.
xCAT's config database is in SQLite on the head node. tabdump is a useful query command.
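tabdump prints a table as CSV with a header line, so its output filters easily. The sketch below pulls the header and one node's row out of a table. node_rows is a hypothetical helper, and it assumes the table's first column is the node name, which holds for tables such as nodelist but not for every table. Settings made at group level will not show up under an individual node's name.

```shell
#!/bin/sh
# Print the header line of an xCAT table plus the row whose first
# column is exactly the given node name (tabdump quotes each value).
node_rows() {
    table=$1
    node=$2
    tabdump "$table" | awk -F, -v n="$node" 'NR == 1 || $1 == ("\"" n "\"")'
}
```

For example, `node_rows nodelist node001`. Running tabdump with no arguments lists the available table names.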
updatenode is the command for updating nodes. Be careful with its arguments. By default it runs all the config scripts for the node, which includes reinstalling InfiniBand.
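One way to avoid triggering the default behaviour by accident is to always name the scripts explicitly with -P, as in the OFED example earlier. A minimal sketch, where run_postscripts is a made-up wrapper that refuses to run without a script list:

```shell
#!/bin/sh
# Run only the named postscripts on a noderange. Refuses to call
# updatenode at all if no script list is given, so it can never fall
# back to running everything.
run_postscripts() {
    noderange=$1
    scripts=$2     # comma-separated postscript names, e.g. ofed
    if [ -z "$noderange" ] || [ -z "$scripts" ]; then
        echo "usage: run_postscripts <noderange> <script[,script...]>" >&2
        return 2
    fi
    updatenode "$noderange" -P "$scripts"
}
```

For example, `run_postscripts node001 ofed` is equivalent to `updatenode node001 -P ofed`.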