This document is for anyone who has to manage zero. There's not much in here for end users.
Day to day
Create an Active Directory account for the user if they don't already have one, then add them to the 'zero-users' group. A cron job on zero runs every ten minutes; it will notice the new user and add them to the local system files. The user should log in with their Admitto password.
Give the user a copy of zero user notes, zero queue setup, and zero parallel environments, or point them at these links. If they haven't used any of the local clusters before, also give them/point to the Maui/Torque introduction.
Zero's name service is flat files, controlled by the NIS-like '411' service; see below.
The /export (aka /home) and /sharedscratch filesystems live on zero-filestore (filestore-1-0 from inside the cluster) and have quotas. Quotas can be viewed with:
quota -u cen1001
Quotas have six numbers: size used, soft size limit, hard size limit, files used, soft file limit, hard file limit. We don't set file limits, so the file limit numbers should both be zero. The size limits are given in blocks of 1024 bytes. The hard limit is usually set a bit higher than the soft limit. Quota increases should be given out with care: anything bigger than about a 25% increase needs a very good reason. It's also a good idea to check the free space in the filesystem with 'df' before giving an increase. To set quotas for a user on a particular filesystem (usually /home), give the four limit numbers in order (soft size, hard size, soft files, hard files) and then the filesystem name, like this:
setquota -u cen1001 1024000 1500000 0 0 /home
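For scale, the limits in the example above come to roughly 1000 MiB soft and 1464 MiB hard; converting from 1024-byte blocks is just arithmetic:

```shell
# Block counts from the setquota example, in 1024-byte blocks
soft_blocks=1024000
hard_blocks=1500000
# blocks * 1024 bytes-per-block / 1048576 bytes-per-MiB
soft_mb=$((soft_blocks * 1024 / 1048576))
hard_mb=$((hard_blocks * 1024 / 1048576))
echo "soft=${soft_mb}MiB hard=${hard_mb}MiB"   # soft=1000MiB hard=1464MiB
```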
Node access control for jobs is handled by a PAM module. It talks to SLURM and only allows access to users with a job running on the node. Root can always log in.
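The usual way this is wired up is an account line in the sshd PAM stack; a minimal sketch, assuming the stock pam_slurm module (the exact module name and file on zero may differ):

```
# /etc/pam.d/sshd (fragment) -- deny SSH to users with no job on this node
account    required     pam_slurm.so
```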
The SLURM prologue script makes the local /scratch directories.
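As a rough sketch of what such a prologue can look like (the directory layout, permissions, and fallback defaults here are assumptions for illustration, not the script actually deployed on zero; a real prologue would use /scratch as the base):

```shell
#!/bin/sh
# Sketch of a SLURM prologue: make a per-user, per-job scratch directory.
# slurmd exports SLURM_JOB_USER and SLURM_JOB_ID to the prologue; the
# defaults below just let the sketch run outside SLURM for testing.
base=${SCRATCH_BASE:-/tmp/scratch-demo}
user=${SLURM_JOB_USER:-$(id -un)}
job=${SLURM_JOB_ID:-0}
dir="$base/$user/$job"
mkdir -p "$dir"
chmod 700 "$dir"     # keep scratch private to the job owner
echo "$dir"
```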
The SLURM config is updated nightly by a script that checks the departmental database and puts people into the 'external' QOS group when they have left. That group can collectively use at most 80% of the machine at once. If you need to take someone out of that group (only if Daan or Mark gives permission), put them into the 'zero-unrestricted-users' group in AD and the fact that they have left will be ignored.
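If you need to inspect or change a user's QOS by hand between nightly runs, sacctmgr can do it; a sketch, assuming the QOS is literally named 'external' and using a hypothetical username:

```
sacctmgr show assoc where user=ab123 format=User,QOS
sacctmgr modify user where name=ab123 set qos=external
```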
Rocks has two tools for running a command on many nodes at once: "rocks run host" and tentakel. WARNING: tentakel operates on all the nodes, including the filestore node, by default. "rocks run host" only works on compute nodes by default.
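For example (the tentakel group name here is an assumption; groups are defined in tentakel's own config):

```
rocks run host compute 'uptime'   # compute appliances only
tentakel -g compute uptime        # whatever the named group contains
```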
Startup and shutdown
Although the system has APC PDUs the nodes are mainly twins, with one power supply per pair of nodes. Remote power cycling therefore isn't a good idea and is not configured. To shut down the whole cluster:
rocks run host /sbin/shutdown -h now
# wait a bit
shutdown -h now
Generally best to put applications under /usr/local where they are NFS-shared to the nodes. Anything under /opt is likely part of Rocks, and we don't mix our stuff with theirs. See also [[Setting up clusters]] for how to build your own RPMs for all the local clusters.
If you need to add something to every compute node that can't go in /usr/local you need to add it using the parallel tools and then edit the node install config files so that reinstalled nodes still get it. These live under /export/rocks/install/site-profiles/6.1.1/nodes . They are XML. The extend-compute.xml file applies to all compute nodes. Once you've edited it run it through xmllint to check the syntax. Then
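A quick syntax check before rebuilding, for example:

```
xmllint --noout /export/rocks/install/site-profiles/6.1.1/nodes/extend-compute.xml
```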
cd /export/rocks/install
rocks create distro
to rebuild the installer. Newly installed nodes should then pick up the changes.
Dealing with problems
The Rocks philosophy is to reinstall nodes if they do anything odd. To do this
rocks list host boot
rocks set host boot compute-X-Y action=install
# to set the lot: rocks set host boot compute action=install
ssh compute-X-Y reboot
To watch a node install, log on to the head node with X forwarding and open a console to the node.
Or use the shoot-node command, which does it all in one go.
You may have to wait a bit for the node to be in a state where the console will connect.
Other useful information
Zero runs Rocks, which uses a NIS-alike system called 411 to sync files between the head node and compute nodes. The list of files to be synced is in /var/411/Files.mk; push a sync with 'make -C /var/411'.