 

This document is for anyone who has to manage venus or swan. There's not much in here for end users.


Adding users


Create an Active Directory account for the user if they don't have one already, then add them to the venus-users or swan-users group in AD. A cron job on the cluster takes care of the rest, so their account will be ready 15 minutes after you add them to the group. They log in with their Admitto password. Venus and Swan access AD via the LDAP interface, so look at the LDAP config files, not the Samba config files.
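
To check that the account has appeared on the cluster once the cron job has run, something like the following should work on the head node (the username here is just a placeholder):

id someuser # does the cluster know about the account?
getent passwd someuser # is it coming through from AD via LDAP?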


Point the user at the venus user notes and the venus queue and parallel setup pages, or at Swan's documentation if it's swan they are using. If they haven't used any of the local clusters before, also point them at the Maui/Torque introduction.


Quotas


To quickly increase someone's quota on a particular filesystem by 25%, do



# /usr/local/sbin/bumpquota username filesystem


for example



# /usr/local/sbin/bumpquota cen1001 /sharedscratch


The filesystems with quotas are usually /home and /sharedscratch. The default quota for a new account is set by /usr/local/sbin/ad-sync.py, the script that creates the user accounts.
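
To see a user's current usage and limits before or after bumping them, the standard Linux quota tools should do the job (the username and filesystem are placeholders):

quota -s someuser # usage and limits on every filesystem with quotas
repquota -s /sharedscratch # per-user report for one filesystem, root only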


Torque/Maui


See the local Maui admin guide. The copies on Venus and Swan are part of Rocks and, unlike practically every other installation locally, reside under /opt: binaries, spools, logs, and all.
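
A quick sanity check, assuming the usual Torque and Maui client commands are on the PATH, is to confirm you really are using the Rocks copies and that they respond:

which qmgr qstat showq # should all point under /opt
qmgr -c 'print server' # dump the Torque server configuration
showq # Maui's view of the queues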



Parallel tools


Rocks has two: rocks run host and tentakel. rocks run host is like dsh in that it operates on one node at a time; tentakel is genuinely parallel.
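
A sketch of typical invocations, assuming the stock Rocks 5 syntax (the node name is illustrative):

rocks run host compute-0-0 command="uptime" # one named node
rocks run host command="uptime" # every node, one at a time
tentakel uptime # every node in the default group, in parallel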


Startup and shutdown


The compute nodes have IPMI but it doesn't work properly, so we use the APC PDU to switch them on and off.



apc status compute-0-0 # are you on?
apc on compute-0-0 # switch it on
apc off compute-0-0 # switch it off
apc on ALL # switch all on


Only root can run this.


To shut down the whole cluster:



# shut down compute nodes
tentakel /sbin/shutdown -h now
# wait a bit, then shut down head node
shutdown -h now


After a power cut the compute nodes will need to be powered back on with the apc command.
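
Once the head node is back up, something along these lines brings the rest up and shows what has returned (the wait is a guess; give the nodes time to boot):

apc on ALL # power the compute nodes back on
rocks run host command="uptime" # a few minutes later, see who is responding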


Adding software


Generally best to put applications under /usr/local where they are NFS-shared to the nodes. Anything under /opt is likely part of Rocks, and we don't mix our stuff with theirs.


If you need to add something to every compute node that can't go in /usr/local, you need to add it using the parallel tools and then edit the node install config files so that reinstalled nodes still get it. These live under /home/install/site-profiles/5.4/nodes. They are XML. The extend-compute.xml file applies to all compute nodes. The files are pretty self-explanatory. Once you've edited them, run them through xmllint --noout to check the syntax; a syntax error here will break node reinstalls. Then



cd /export/rocks/install
rocks create distro


to rebuild the installer. Newly installed nodes should then pick up the changes.
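
For illustration, a minimal extend-compute.xml has roughly this shape; the package name and post-install command are placeholders, not anything we actually install:

<?xml version="1.0" standalone="no"?>
<kickstart>
  <!-- extra RPMs to install on every compute node -->
  <package>example-package</package>
  <!-- shell run at the end of the node install -->
  <post>
touch /root/installed-by-extend-compute
  </post>
</kickstart>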


Dealing with problems


Backups


See local backup documentation for venus. Swan is backed up on uccbackup.


Tech support


Venus and Swan are built out of pieces from more than one supplier: the compute nodes are from Avantek and the head node is from WoC. We have three years of warranty, so cover runs until March 2014.


Reinstalling nodes


The Rocks philosophy is to reinstall any node that does anything odd, so a misbehaving node is usually best dealt with by reinstalling it. To do this:



rocks list host boot # check the current boot action for each node
rocks set host boot compute-X-Y action=install # reinstall on next boot
ssh compute-X-Y
reboot


To watch the node install, log on to the head node with X forwarding and do



rocks-console compute-X-Y


You may have to wait a bit for the node to be in a state where the console will connect.


Other useful information


Rocks


Venus and Swan run Rocks, which uses a NIS-like system called 411 to sync files between the head node and the compute nodes. The list of files to be synced is in /var/411/Files.mk; sync them with make -C /var/411.
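
If a node looks out of sync, the usual sequence is something like this (411get --all on the node side is from memory, so double-check it before relying on it):

rocks sync users # regenerate the user files and push them via 411
make -C /var/411 # push any other changed 411 files
ssh compute-0-0 411get --all # force a pull on a node that looks stale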

