System Notices

Resolved: Jasper and Hungabee /home filesystem issue


11 December 2012:  The Lustre filesystem is now responding at full speed again.

11 December 2012:  The Lustre filesystem has now improved, though issues remain.  We are continuing to work on the problem.

10 December 2012:  The shared Lustre filesystem on Jasper and Hungabee is having a problem.  Access to /home on both machines is very slow.  We are looking into the problem and will fix it as soon as possible.  Apologies for any inconvenience.

Unscheduled Jasper login node reboot - 06 Dec 2012 at 11:00 AM MST

There was an unscheduled reboot of the Jasper login node when it became unresponsive.  PBS jobs were unaffected.

Hungabee complex scheduled outage complete


1630h MST:  The scheduled outage has now ended.


Hungabee will have a scheduled outage on Tuesday, Dec 04 2012, from 9:00 AM to 6:00 PM MST.  Jobs that cannot finish before the outage will be scheduled to run after it.

Resolved: Hermes/Nestor - scheduler inactive

Update 20121114 2255 Pacific

We have re-enabled the queues and job scheduling has resumed.  Some jobs died due to a node that went offline, but other jobs should have been unaffected.

Original message:

Our resource manager/scheduler is currently inactive.  We are investigating.

In the meantime, jobs are not running and we have halted job submission.  Sorry for the inconvenience.

Nov. 10, 2012, 1730 MST: Power bump at U of C affected jobs on Breezy, Lattice and Parallel

A power bump at the University of Calgary on Saturday afternoon at about 1730 MST caused some compute nodes on Breezy, Lattice and Parallel to reboot or fail system checks.  Please check your output carefully, as some running jobs were lost.  Service was largely restored by about 2215 MST Saturday night, although some nodes may be out of service until after the current long weekend.  Sorry for the inconvenience.

UBC Glacier - Back Online

All file systems on Glacier have been restored. Unfortunately, we were not able to save any of the existing files stored on /global/scratch. We are very sorry for this. However, if you lost data from this file system, please contact us and we will try to increase your job's priority. Also, please check /global/scratch on Orcinus to see whether we previously copied your data there.

Please note that all data stored on /global/home was unaffected. We have also added 4 TB to this file system.

Once again, we are extremely sorry for this inconvenience and delay.

Resolved, was: Unscheduled outage Checkers: 08 Nov 2012


Update 09 Nov 2012

At about 4:50 PM MST, Checkers was put back into production.  Unfortunately, some running and queued jobs were affected by the disk array outage.


Original Message

Starting at about 4:00 PM MST, there has been an unscheduled outage of Checkers.  The disk array serving the /home file system is having hardware errors.  We are working with the vendor to resolve this issue.  We apologize for any inconvenience.

UPDATE - Hermes/Nestor not properly scheduling "procs" jobs

Update 20121108 1428 PST

We have implemented a submit filter to temporarily address this issue.  All jobs submitted with "procs=X" in the job script, for X>1, will have "-q nestor" added to the script.  Jobs submitted with "-l procs=X" on the qsub command line will, for now, still route incorrectly; when invoking qsub in this way, please also specify "-q nestor".  Sorry for the inconvenience.

Original message

We have discovered an issue with scheduling since yesterday's upgrade of
Torque. The queue resource configuration is not currently honouring the
maximum procs count for Hermes jobs, and so it is possible to submit jobs
to Hermes while specifying more than one processor. These jobs will then
not run.

This also means that jobs that specify "procs=X" are not getting routed
correctly. For the time being, if you are specifying multiple processors
using the procs syntax, please specify the Nestor queue using "-q nestor".

The "nodes=X:ppn=Y" is still interpreted correctly and these jobs are routed
to the correct destination.
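
As an illustration only, a minimal job script using the procs syntax with the Nestor queue named explicitly might look like the sketch below; the processor count, walltime and program name are placeholders, not recommendations.

    #!/bin/bash
    #PBS -l procs=8            # placeholder processor count (X>1)
    #PBS -l walltime=01:00:00  # placeholder walltime
    #PBS -q nestor             # explicit queue, per the workaround above
    cd $PBS_O_WORKDIR
    mpiexec ./my_program       # placeholder executable

The equivalent command-line submission also needs the explicit queue, e.g. "qsub -l procs=8 -q nestor myjob.pbs".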

UBC Glacier - File System Update

We regret to inform you that we were unable to save the /global/scratch file system. As a consequence, all data stored there was lost. (Please note that /global/home is unaffected.) We are currently working to create a new /global/scratch file system. More information will follow as soon as this file system is online.

Resolved, was: Unscheduled outage Jasper head node: 02 Nov 2012

We had an unscheduled outage of the Jasper head node; user jobs were not affected.
