- System Architecture
- Starting and Stopping
- Changing configuration
- When all the commands hang
- Blocked jobs
- Jumping the queue
- Jobs that won't start
- Jobs that won't die
This is a quick reference to managing Maui/PBS/Torque on the local clusters. It's no substitute for the manual. Local systems use either OpenPBS or Torque as the workload management system and Maui as the scheduler. Torque is a fork of OpenPBS, with fewer bugs.
There are three daemons: maui, pbs_server, and pbs_mom. pbs_mom runs only on compute nodes; maui and pbs_server run only on head nodes. pbs_server manages the whole system: it handles job submissions, starts and stops jobs, and monitors jobs and nodes. pbs_server and pbs_mom run as root. pbs_mom has to; pbs_server could probably be made not to, but it would be difficult.
maui runs as user maui and makes the scheduling decisions when pbs_server asks it to. It does not collect monitoring information or control jobs itself; it simply gets data from pbs_server and then tells it which job to start next and where to start it.
pbs_mom runs on the nodes and carries out the actual starting, stopping and job monitoring functions on behalf of pbs_server. It runs the jobs as the user who submitted them, which is why it has to run as root itself.
pbs_server on the head node can be shut down quickly with qterm -t quick. This does not kill running jobs. qterm with no arguments will kill all jobs. The daemon is not started automatically on most of the servers, because after a crash this can cause more problems. To start up the server safely after a qterm -t quick, do pbs_server -t hot to tell it not to restart the entire job pool.
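The quick stop and hot restart described above can be sketched as a small shell function. This is only an illustration: the run helper, the DRY_RUN variable and the function name are my own inventions, and by default nothing is executed, the commands are just printed. In practice you may want to confirm that the old server process has gone before restarting.

```shell
#!/bin/sh
# Sketch: quick stop and hot restart of pbs_server, leaving running jobs alone.
# DRY_RUN is a hypothetical switch; it defaults to 1 so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_pbs_server() {
    run qterm -t quick      # stop the server without killing running jobs
    run pbs_server -t hot   # start it again without restarting the entire job pool
}

# Usage (as root on the head node):
#   restart_pbs_server              # print what would be done
#   DRY_RUN=0 restart_pbs_server    # actually do it
```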
Maui relies on pbs_server so it is often not started by default. I will admit this isn't totally consistent on different clusters. The danger with starting it by default is that if there's been a power cut and the system is a mess it can just cause all the queued jobs to fail and be removed. However if you don't start it by default then the cluster can't start up without help. Swings and roundabouts.
You need to be user maui, not superuser, to do anything with the maui daemon. The command to close it down cleanly is schedctl -k. To start it, run /usr/local/sbin/maui. On most of the clusters there is an init script, /etc/init.d/maui, that can be run as root and handles both cases.
pbs_mom is usually configured to start at boot on the compute nodes. This is harmless as mom will not do anything until pbs_server is running. Starting mom when she's already running does not cause any problem, so if in doubt just dsh -a /usr/local/sbin/pbs_mom, or the parallel shell of your choice.
If you want to drain a single node, tell pbs_server that it is offline. Maui will then not schedule new jobs onto the node, but any existing jobs will finish normally. The command is pbsnodes -o nodename.
To drain the whole system you could do several things. If you want to have a scheduled downtime then the best way is to use Maui's setres command to set up a reserved time in the system. Maui will then plan things so that no job starts which would not have finished before the reserved time. An example would be setres -s 07:00:00_11/30/05 -d 12:00 ALL, which would drain the system so that it was completely idle by 7am on November 30th (beware American date format).
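Because the start time in that example is in American month/day/year order, it is easy to get wrong by hand. A minimal sketch, assuming GNU date is available (the -d option), that builds the string from an unambiguous ISO date; the function name setres_start is hypothetical:

```shell
#!/bin/sh
# Sketch: turn an ISO date/time into the HH:MM:SS_MM/DD/YY form used in the
# setres example above. Assumes GNU date; other date implementations differ.
setres_start() {
    date -d "$1" +%H:%M:%S_%m/%d/%y
}

setres_start "2005-11-30 07:00:00"   # prints 07:00:00_11/30/05

# The result could then be passed to setres as in the example:
#   setres -s "$(setres_start '2005-11-30 07:00:00')" -d 12:00 ALL
```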
An emergency drain can be done by stopping Maui from scheduling with schedctl -s, but this can be confusing as the users can't tell anything's changed; they just see their jobs not starting. Telling pbs_server not to request service from Maui doesn't always seem to work (qmgr -c 'set server scheduling=false', should you care). I find that the best way is often to disable all the queues in PBS with qdisable and qstop. The former stops the users from submitting new jobs, the latter stops PBS from running the already queued jobs.
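A rough sketch of the qdisable/qstop approach for several queues at once. The queue names, the run helper and the DRY_RUN switch are made up for illustration. The reopen function uses qenable and qstart, the counterparts of the two commands above, which are not otherwise covered in this reference. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: close the named queues for an emergency drain, and reopen them later.
# DRY_RUN is a hypothetical switch; it defaults to 1 so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

close_queues() {
    for q in "$@"; do
        run qdisable "$q"   # stop users submitting new jobs to this queue
        run qstop "$q"      # stop PBS running jobs already queued here
    done
}

reopen_queues() {
    for q in "$@"; do
        run qenable "$q"    # accept submissions again
        run qstart "$q"     # allow queued jobs to run again
    done
}

# Usage, with hypothetical queue names:
#   close_queues batch long
#   reopen_queues batch long
```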
You can change certain items of Maui configuration on the fly, but not everything. It is normally best to edit the config file ~maui/maui.cfg and restart Maui instead (on tardis, where it was installed by the vendor, it's ~maui/spool/maui.cfg for some reason). The schedctl -r command mentioned in the manual does not work! A restart of Maui won't affect running jobs, although it can take a newly started Maui a few minutes to get all the information it needs from PBS and produce sensible output. Be patient if showq seems to be wrong at first.
Maui is extremely tolerant of mistakes in config files, and will ignore any line that it doesn't understand without warning you. This would not matter much except that there are incorrect versions of some options given in the manual. If in doubt you can always check against the source code, which is usually in Maui's home directory. On some machines it's a very slightly modified version so always look at the source on the machine you're interested in.
PBS config has to be changed on the fly using qmgr; most of the config files are not human-readable. Maui does not always pick up PBS config changes very well; in particular it seems to have a problem with queues added on the fly. A restart of Maui cures this.
When all the PBS and Maui commands hang, it usually means that one or more nodes have failed and PBS is still trying to contact them for an update. This rarely happens with Torque, but frequently with PBS. The newest versions of Torque support the useful but dangerous qdel -p command, which wipes the job from the system even if its MOM has vanished. It is normally better to fix the node than resort to this!
Find the offending node(s) first; dsh -a uptime is pretty useful for this. Use pbsnodes -o nodename to mark them as offline. Then find what jobs were running on those nodes with pbsnodes -a or checknode, and note the job ids.
Shut down the server with qterm -t quick and then change directory to /var/spool/PBS/server_priv/jobs. On systems with older copies of PBS (just rama I think) it may be in /usr not /var. Remove all files relating to the job ids you noted, and start PBS with pbs_server -t hot. The PBS and Maui commands should all now respond normally.
Then you can deal with the crashed nodes. Once they are up again, bring them online in PBS with pbsnodes -c nodename.
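Before removing anything from the jobs directory it is worth listing exactly which files belong to the job ids in question. A minimal sketch, assuming the files are named with the job id followed by a dot and some suffix; that naming is an assumption, so check it against what is actually in the directory. The function name is hypothetical, and it only lists files, it deletes nothing.

```shell
#!/bin/sh
# Sketch: list the files in a jobs directory that belong to the given job ids,
# so they can be reviewed before being removed by hand.
# Assumes files are named "<jobid>.<something>"; verify this on your system.
stale_job_files() {
    dir="$1"
    shift
    for id in "$@"; do
        for f in "$dir/$id".*; do
            # An unmatched glob is left as a literal pattern, so test it exists.
            if [ -e "$f" ]; then
                echo "$f"
            fi
        done
    done
}

# Usage, after qterm -t quick, with hypothetical job ids:
#   stale_job_files /var/spool/PBS/server_priv/jobs 1234 1240
```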
Use diagnose -q to check how a job became blocked. Many blocked states are normal and can be ignored.
If the job has been deferred then this is usually because it failed after starting. This can be through no fault of the job's own: the node it ran on may have had a problem, or another user's job may have interfered with it. To release a deferred job for another try, do runjob -c jobid to clear the assigned host list, and then releasehold jobid to put it back on the queue. Check the dodgy node with checknode -v.
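The two commands for releasing a deferred job can be wrapped up as follows. This is a sketch only: the run helper, the DRY_RUN switch and the function name are hypothetical, and by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: give a deferred job another try.
# DRY_RUN is a hypothetical switch; it defaults to 1 so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

release_deferred() {
    jobid="$1"
    run runjob -c "$jobid"     # clear the assigned host list
    run releasehold "$jobid"   # put the job back on the queue
}

# Usage, with a hypothetical job id:
#   release_deferred 1234
```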
The setspri command will allow you to bring any job to the top of the queue.
If you think that an idle job should have started then you can check why Maui hasn't started it with the checkjob command. Often it will be being blocked because Maui is trying to get enough nodes free to let a higher priority parallel job run. If your job is not at the top of the queue and won't start even though there seem to be free nodes, this is almost always the reason. Always look at the top job.
The reason for this behaviour is reservation and backfill, which is a system that allows mixed serial and parallel jobs to run while keeping utilization reasonably high. Maui takes the queued job with the highest priority and works out the earliest time it could run, assuming all running jobs use their full walltime allocation. It then takes the calculated start time for the queued job and the set of nodes it will run on and reserves those nodes. Nothing will be allowed to start that could impinge on the start time of the top job. This means that if a big parallel job is waiting and the system has some free nodes, but not enough to let it run, then no small jobs will be allowed to start on the free nodes unless their walltime limit is small enough that they will be finished before the top job's calculated start time. Otherwise the top job could get delayed by a stream of smaller jobs.
If you want to run a small, short job in this situation, use the showbf command to see the biggest walltime limit you can use without delaying the waiting large job. If you set a limit just under that time then your job will get backfilled onto the free nodes.
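The backfill rule boils down to a simple comparison of times. The sketch below is a deliberately simplified illustration and not Maui's actual algorithm: it ignores which nodes are involved and only checks whether a job started now would finish by the top job's reserved start time. The function name and the use of times in seconds are my own choices.

```shell
#!/bin/sh
# Sketch: simplified backfill test. Succeeds if a job started at time $1
# (seconds) with a walltime limit of $3 seconds would finish no later than
# the top job's reserved start time $2 (seconds).
can_backfill() {
    now="$1"
    reserved_start="$2"
    walltime="$3"
    [ $((now + walltime)) -le "$reserved_start" ]
}

# Example: the top job is reserved to start one hour from now.
if can_backfill 0 3600 1800; then
    echo "a 30 minute job fits in the gap"
fi
if ! can_backfill 0 3600 7200; then
    echo "a 2 hour job would delay the top job"
fi
```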
Maui recalculates the top job and the reservations regularly, so it's possible for things to change if jobs finish earlier than expected or as fairshare changes the priority of the queued jobs.
Reservation and backfill does not maximise utilization; to do that you need backfill alone. The problem with not having reservation is that you will almost never be able to get a parallel job to start on a busy machine, as the chances of a large number of nodes all being free at once are small.
You can see the reservations with showres -v. You can see which nodes are doing what with diagnose -n and checknode is also helpful.
Often a user will complain that they can't qdel a particular job. There are two common causes: the job has exited and PBS hasn't noticed, or the node that the job is running on has crashed.
First of all try qsig -s0 jobid. If the job had already exited, this will make PBS notice, and in a minute or two the job will vanish from the queue.
If that doesn't work then run pbsnodes -l and look for nodes that are marked as down but NOT offline; offline nodes have already been dealt with. Also run checkjob on the job id. If the job is on a node that is down but not offline then offline the node with pbsnodes -o nodename. Then you can try rebooting it. On some of the clusters there are remote power management commands so you won't have to walk to the machine; check the docs for the particular cluster.
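Picking out the down-but-not-offline nodes by eye gets tedious on a big cluster. A minimal filter, assuming pbsnodes -l prints one node per line as a node name followed by a comma-separated state such as down or down,offline. That output format is an assumption, so check it against your version before relying on this. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read "nodename state" lines on stdin and print the names of nodes
# whose state includes "down" but not "offline".
# Assumes the two-column layout described above.
down_not_offline() {
    awk '$2 ~ /down/ && $2 !~ /offline/ { print $1 }'
}

# Usage:
#   pbsnodes -l | down_not_offline
```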
If that fails then check logs to see if there's something else up.
As a last desperate resort you can run qdel -p but it's not recommended - there is no guarantee it won't muck up the whole queueing system.