This document is for anyone who has to manage venus or swan. There's not much in here for end users.
Create an Active Directory account for the user if they don't have one already, then add them to the venus-users or swan-users group in AD. A cron job on the cluster takes care of the rest, so their account will be ready about 15 minutes after you add them to the group. They log in with their Admitto password. Venus and Swan talk to AD via its LDAP interface, so look at the LDAP config files, not the Samba config files.
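You can check from the head node whether the cron sync has run yet with getent, which queries the same name service the login path uses. A small sketch (the username here is a placeholder; substitute the real one):

```shell
# Check whether an account has appeared on the cluster after the cron sync.
# "getent group venus-users" similarly shows who is in the AD group.
check_user() {
    if getent passwd "$1" > /dev/null; then
        echo "$1: account present"
    else
        echo "$1: not synced yet, wait for the next cron run"
    fi
}

check_user root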
Point the user at the venus user notes and the venus queue and parallel setup pages, or at Swan's documentation if it's Swan they are using. If they haven't used any of the local clusters before, also point them at the Maui/Torque introduction.
To quickly increase someone's quota on a particular filesystem by 25%, do
# /usr/local/sbin/bumpquota username filesystem
# /usr/local/sbin/bumpquota cen1001 /sharedscratch
The filesystems with quotas are /home and /sharedscratch. The default quota on a new account is set by /usr/local/sbin/ad-sync.py, the script that creates the user accounts.
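Since bumpquota raises the quota by 25%, the arithmetic is easy to sanity-check. This is only an illustration of the numbers, not bumpquota itself, and the block counts are made up:

```shell
# Illustration only: what a 25% bump does to a quota expressed in 1K blocks.
# bumpquota does the real work against the filesystem.
bump25() {
    local old=$1
    echo $(( old + old / 4 ))
}

bump25 10485760   # 10 GB in 1K blocks -> 13107200 (12.5 GB)
```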
See the local Maui admin guide. The copies on Venus and Swan are part of Rocks and, unlike practically every other installation locally, reside under /opt: binaries, spools, logs, and all.
Rocks has two tools for running commands across the nodes: rocks run host and tentakel. rocks run host is like dsh in that it operates on one node at a time; tentakel is genuinely parallel.
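The difference matters for long-running commands: a one-node-at-a-time loop takes (nodes x runtime), whereas a parallel fan-out takes roughly one runtime. A sketch of the two patterns, with a local echo standing in for ssh (the hostnames and the stand-in are illustrative, not cluster state):

```shell
# Serial vs parallel dispatch, sketched with a stand-in for ssh.
HOSTS="compute-0-0 compute-0-1 compute-0-2"
run() { echo "$1: ok"; }   # stand-in for: ssh "$1" uptime

# dsh / "rocks run host" style: one node at a time
for h in $HOSTS; do run "$h"; done

# tentakel style: fan out to all nodes at once, then collect
for h in $HOSTS; do run "$h" & done
wait
```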
Startup and shutdown
The compute nodes have IPMI, but it doesn't work properly, so we use the APC PDU to switch them on and off.
apc status compute-0-0  # are you on?
apc on compute-0-0      # switch it on
apc off compute-0-0     # switch it off
apc on ALL              # switch all on
Only root can run this.
To shut down the whole cluster:
# shut down compute nodes
tentakel /sbin/shutdown -h now
# wait a bit, then shut down head node
shutdown -h now
After a power cut the nodes will need to be powered back on with the apc command.
It's generally best to put applications under /usr/local, which is NFS-shared to the nodes. Anything under /opt is likely part of Rocks, and we don't mix our stuff with theirs.
If you need to add something to every compute node that can't go in /usr/local, install it with the parallel tools and then edit the node install config files so that reinstalled nodes still get it. These live under /home/install/site-profiles/5.4/nodes and are XML; the extend-compute.xml file applies to all compute nodes. The files are pretty self-explanatory. Once you've edited them, run them through xmllint --noout to check the syntax: a malformed profile will break node reinstalls. Then
cd /export/rocks/install
rocks create distro
to rebuild the installer. Newly installed nodes should then pick up the changes.
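For reference, this is the general shape of a Rocks 5.x node profile such as extend-compute.xml: a package list plus a post-install section. The package name and the post command here are made-up examples, not something our nodes actually need:

```xml
<?xml version="1.0" standalone="no"?>
<kickstart>
  <!-- extra RPM to install on every compute node (example name) -->
  <package>environment-modules</package>

  <!-- shell commands run at the end of the node install (example) -->
  <post>
echo "installed by extend-compute.xml" >> /root/install.log
  </post>
</kickstart>
```

xmllint --noout on a file of this shape should produce no output; any message means the XML is malformed.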
Dealing with problems
See local backup documentation for venus. Swan is backed up on uccbackup.
Venus and Swan are built from separate components: the compute nodes are Avantek and the head node is WoC. We have a three-year warranty, running to March 2014.
The Rocks philosophy is to reinstall a node if it does anything odd, so reinstalling is the usual fix for a misbehaving node. To do this:
rocks list host boot
rocks set host boot compute-X-Y action=install
ssh compute-X-Y reboot
To watch the node install, log on to the head node with X forwarding and do
You may have to wait a bit for the node to be in a state where the console will connect.
Other useful information
Venus and Swan run Rocks, which uses a NIS-like system called 411 to sync files between the head node and the compute nodes. You can see the list of files to be synced in /var/411/Files.mk, and push them out with make -C /var/411.
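The 411 build is driven by an ordinary Makefile, so the usual make -C pattern applies. This sketch mimics the call in a throwaway directory rather than touching /var/411; the recipe body is a stand-in for what 411 actually does (encrypting and publishing the files listed in Files.mk):

```shell
# Mimic "make -C /var/411" in a scratch directory.
# The recipe here only echoes; the real Makefile pushes the 411 files.
dir=$(mktemp -d)
printf 'all:\n\t@echo "publishing files from Files.mk"\n' > "$dir/Makefile"
make -C "$dir"
rm -rf "$dir"
```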