Job Monitoring

A number of commands can be used to monitor jobs, queues, and the machines that jobs run on.

The qstat command shows a summary of jobs on a cluster.

The checkjob command shows detailed information for a single job.

The mdiag -n command shows a summary of the condition of machines in a cluster.

qstat

The qstat command shows a limited amount of information on a number of jobs in the scheduling system.

The "qstat -a" command shows information on all jobs in the queues, whether running or not.
The "qstat -r" command shows information on all running jobs.
The "qstat -i" command shows information on all non-running jobs.
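
As a rough illustration of the three variants above, the following Python sketch builds the qstat command line for a chosen view. The view names ("all", "running", "non-running") and the function name qstat_argv are made up for this example; only the -a, -r and -i flags come from the text above.

```python
# Hypothetical view names mapped to the qstat flags described above.
QSTAT_FLAGS = {
    "all": "-a",
    "running": "-r",
    "non-running": "-i",
}


def qstat_argv(view):
    """Return the argument list for the qstat variant matching `view`."""
    if view not in QSTAT_FLAGS:
        raise ValueError("unknown view: " + repr(view))
    return ["qstat", QSTAT_FLAGS[view]]
```

On a machine where qstat is installed, the returned list could be passed to subprocess.run to execute the command.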


Job ID refers to the job identifier assigned by PBS.
Username refers to the job owner.
Queue refers to the queue in which the job currently resides.
Jobname refers to the job name given by the owner.
SessID refers to the session ID (if the job is running).
NDS refers to the number of nodes requested by the job.
TSK refers to the number of CPUs or tasks requested by the job.
Req'd Memory is the amount of memory requested by the job.
Req'd Time is the wall time requested by the job (hh:mm).
S refers to the job's current state:
E – Job is exiting.
H – Job is held.
Q – Job is queued.
R – Job is running.
Elap Time is the time elapsed since the job started (hh:mm).
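
When qstat output is processed by a script, the single-letter codes in the S column can be translated with a small lookup table. The Python sketch below covers only the four codes listed above. The function name describe_state is an assumption for this example, and any other code is reported as unknown rather than guessed.

```python
# State codes as listed for the S column above.
JOB_STATES = {
    "E": "exiting",
    "H": "held",
    "Q": "queued",
    "R": "running",
}


def describe_state(code):
    """Return a readable description for a qstat state code."""
    code = code.strip()
    # Codes not covered by the list above are flagged, not interpreted.
    return JOB_STATES.get(code, "unknown state " + repr(code))
```

For example, describe_state("Q") gives "queued".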

For more information on the qstat command, please refer to the Cluster Resources documentation.

checkjob

The checkjob command shows a large amount of detailed information for a single job.
The checkjob command is invoked as "checkjob <jobid>"; the -v -v flags can be added for more detail.
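
The invocation described above can be sketched in Python as follows. The function name checkjob_argv, the verbose parameter and the job ID "1234" are placeholders chosen for this example. The code only builds the argument list and does not run anything.

```python
def checkjob_argv(job_id, verbose=False):
    """Build the checkjob command line for one job.

    With verbose=True the -v -v flags are added for more detail.
    """
    argv = ["checkjob"]
    if verbose:
        argv += ["-v", "-v"]
    argv.append(str(job_id))
    return argv


# On a cluster where checkjob is installed, the list could be run with:
#   import subprocess
#   subprocess.run(checkjob_argv("1234", verbose=True), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a job ID contains unusual characters.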

Class refers to the queue that the job is currently running in.
State refers to the state of the job: Idle, Starting or Running.
Time Queued Total refers to the amount of time the job has spent in the queue.
Time Queued Eligible refers to the amount of queue time that is eligible for consideration when job priority based on queue time is calculated.
If a user submits many jobs, it is possible that only the first few are accruing eligible queue time.
Required Hostlist is a list of hosts on one or more of which the job must run.
Reserved Nodes is the list of hosts that the job is currently reserved to run on.
StartPriority is the current priority of the job; the higher the better.
The end of the "checkjob -v -v" output lists the reasons the job is not being started on each node.
There may be, and usually is, more than one reason the job is not being started on a node, but only one reason is shown.

For more information on the checkjob command, please refer to the Cluster Resources documentation.

mdiag -n

The "mdiag -n" command shows a summary of the current condition of the machines in a cluster.

This command is useful on large SMP machines or small clusters, where
one can get an overview of what is happening on the cluster in a single page.

The State field refers to the state each machine is in: Idle, Busy or Down.
The Name field refers to the machine (cluster node) name.
The first number in the Procs field is the number of currently free processors, and the second number,
after the colon, is the total number of processors available on the machine in question.
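
To make the Procs format concrete, here is a small Python sketch that splits such a field into its two numbers and works out how many processors are in use. The sample value "4:8" and the function names are made up for illustration and are not taken from real mdiag output.

```python
def parse_procs(field):
    """Split a Procs field such as "4:8" into (free, total)."""
    free_text, total_text = field.strip().split(":")
    return int(free_text), int(total_text)


def busy_procs(field):
    """Return the number of processors currently in use."""
    free, total = parse_procs(field)
    return total - free
```

With the made-up value "4:8", parse_procs returns (4, 8) and busy_procs returns 4, meaning half of the machine is in use.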

For more information on the mdiag -n command, please refer to the Cluster Resources documentation.


Updated 2011-11-09.