This document is for anyone who has to manage zero. There's not much in here for end users.

Day to day

Adding users

Create an Active Directory account for the user if they don't already have one, then add them to the 'zero-users' group. A cron job on zero runs every ten minutes, notices the new user, and adds them to the local system files. The user logs in with their Admitto password.
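
Once the cron job has run you can check that the account has propagated with standard lookups on the head node (cen1001 is just an example username):

getent passwd cen1001
id cen1001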

Give the user a copy of zero user notes, zero queue setup, and zero parallel environments, or point them at these links. If they haven't used any of the local clusters before, also give them, or point them at, the Maui/Torque introduction.

Zero's name service is flat files, controlled by the somewhat-NIS-like '411' service; see below.

Quotas

The /export (aka /home) and /sharedscratch filesystems live on zero-filestore (filestore-1-0 from inside the cluster) and have quotas. Quotas can be viewed with:


quota -u cen1001

Quotas have six numbers: size used, soft size limit, hard size limit, files used, soft file limit, hard file limit. We don't set file limits, so the file limit numbers should both be zero. The size limits are given in blocks of 1024 bytes. The hard limit is usually set a bit higher than the soft limit. Quota increases should be given out with care: anything bigger than about a 25% increase needs a very good reason. It's also a good idea to check the free space in the filesystem with 'df' before giving an increase. To set quotas for a user on a particular filesystem (usually /home), give the four limit numbers in order (soft size, hard size, soft files, hard files) and then the filesystem name, like this:


setquota -u cen1001 1024000 1500000 0 0 /home
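
As a worked example, the soft limit above is 1024000 one-kilobyte blocks (roughly 1 GB) and the hard limit 1500000 blocks (roughly 1.5 GB); the two zeros leave the file limits unset. A 25% increase, with the free-space check first, might look like this (numbers purely illustrative):

df -h /home
quota -u cen1001
setquota -u cen1001 1280000 1875000 0 0 /home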

SLURM

Node access control for jobs is handled by a PAM module. It talks to SLURM and only allows access to users with a job running on the node. Root can always log in.
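
Zero's exact PAM configuration isn't reproduced here, but the usual way this is done with the stock pam_slurm module is a single account line in the sshd PAM stack, something like:

account    required    pam_slurm.so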

The SLURM prologue script makes the local /scratch directories.
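
A minimal sketch of that sort of prologue, assuming it just creates a per-job directory owned by the job's user (the real script on zero may do more):

#!/bin/bash
# SLURM runs the prolog as root on each allocated node and sets
# SLURM_JOB_ID and SLURM_JOB_UID in its environment.
mkdir -p /scratch/$SLURM_JOB_ID
chown "$SLURM_JOB_UID" /scratch/$SLURM_JOB_ID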

The SLURM config is updated nightly by a script that checks the departmental database and puts people into the 'external' QOS group when they have left. That group can only have up to 80% of the machine at once. If you need to take someone out of it (only with permission from Daan or Mark), put them into the 'zero-unrestricted-users' group in AD and the fact that they have left will be ignored.
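
To see which QOS a user currently has, the standard sacctmgr query works (assuming the usual association setup):

sacctmgr show assoc user=cen1001 format=user,account,qos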

Parallel tools

Rocks provides two: "rocks run host" and tentakel. WARNING: tentakel operates on all the nodes by default, including the filestore node; "rocks run host" defaults to the compute nodes only.
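
For example, to run a command on the compute nodes only with either tool (the tentakel group name comes from its config file, so check that locally first):

rocks run host compute command="uptime"
tentakel -g compute uptime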

Startup and shutdown

Although the system has APC PDUs, the nodes are mainly twins, with one power supply per pair of nodes. Remote power cycling therefore isn't a good idea and is not configured. To shut down the whole cluster:

rocks run host /sbin/shutdown -h now
# wait a bit
shutdown -h now

Adding software

It's generally best to put applications under /usr/local, which is NFS-shared to the nodes. Anything under /opt is likely part of Rocks, and we don't mix our stuff with theirs. See also [[Setting up clusters]] for how to build your own RPMs for all the local clusters.

If you need to add something to every compute node that can't go in /usr/local, add it using the parallel tools and then edit the node install config files so that reinstalled nodes still get it. These live under /export/rocks/install/site-profiles/6.1.1/nodes and are XML; the extend-compute.xml file applies to all compute nodes. Once you've edited it, run it through xmllint to check the syntax (see the example below). Then


cd /export/rocks/install
rocks create distro

to rebuild the installer. Newly installed nodes should then pick up the changes.
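
For the xmllint check mentioned above, a plain well-formedness test is enough; --noout stops it echoing the whole document back:

xmllint --noout /export/rocks/install/site-profiles/6.1.1/nodes/extend-compute.xml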

Dealing with problems

Reinstalling nodes

The Rocks philosophy is to reinstall nodes if they do anything odd. To do this


rocks list host boot
rocks set host boot compute-X-Y action=install
# to set the lot: rocks set host boot compute action=install
ssh compute-X-Y
reboot

To watch the node install, log on to the head node with X forwarding and do


rocks-console compute-X-Y

Or use the shoot-node command, which does it all in one go.
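
Assuming the usual invocation, that's just:

shoot-node compute-X-Y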

You may have to wait a bit for the node to be in a state where the console will connect.

Other useful information

Rocks

Zero runs Rocks, which uses a NIS-alike system called 411 to sync files between the head node and the compute nodes. The list of files to be synced is in /var/411/Files.mk, and you can push them out with:

make -C /var/411
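
If the user files on the nodes look stale, the standard Rocks command that regenerates the auto-built 411 files and pushes them out is also worth knowing (it's stock Rocks, so it should be present on zero):

rocks sync users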
