System Notices

Fixed: Login node unscheduled outage on Jasper, Checkers, Hungabee - 10:45 - 11:00 AM MDT, 5 October 2012

Login was unavailable for Checkers, Jasper and Hungabee from 10:45 to 11:00 AM MDT on 5 October 2012 due to an LDAP server problem.

 

Users were unable to log into these services during this time.

We apologize for any inconvenience.

Fixed: Hungabee head node unscheduled outage - 10:00 - 10:15 AM MDT, 5 October 2012

The Hungabee head node (UV100) crashed because of a software bug and was rebooted.

Users who were logged in to, or transferring data to or from, the Hungabee cluster were affected.

Jobs running on the UV1000 (2048-core compute node) should not have been affected.

We apologize for any inconvenience.

Update: Back in production - Jasper, Hungabee, Checkers: Planned Outage - Wednesday, October 10

The UofA site, including the Checkers and Jasper clusters as well as the Hungabee SMP system, will be unavailable for hardware upgrades and reconfiguration from Wednesday 10 October 9:00 AM MDT to Thursday 11 October 9:00 AM MDT.

 

System reservations have been put into place, preventing jobs from being run during the outage.

 

Update: Back in production Wednesday 10 October 5:19 PM MDT

Sep. 28, 2012, 21:15: Bugaboo Available Again

The Bugaboo system is available again after the power outage.

Silo planned outage for maintenance: 0900-1300 CST Wed October 17 2012

Silo and Hopper will be down for scheduled maintenance on Wed, Oct 17 2012 from 0900 to 1300 CST, and will be unavailable for transfers during that time.  Please schedule your usage accordingly.

Please contact support@westgrid.ca with any questions or concerns.

Sep. 28, 2012: Bugaboo Shutdown

Because of a power outage in the datacentre, the Bugaboo system needs to be powered off on Friday, Sep. 28, at 16:00 (Pacific).

All jobs that are still running at the time of the shutdown will be terminated. The system will not start jobs with a walltime that would extend into the shutdown period.

We expect that the system will become available again late Friday evening.

Fixed: Hungabee Unscheduled Outage - 21 Sept 2012

25 Sept - 1600h:  Hungabee is now back in production.

25 Sept - 1000h:  During final testing, the UV100 login node was restarted.

24 Sept - 1800h:  Further analysis showed that the problem was software-related. A workaround has been arranged and final testing is scheduled to be completed overnight. If this testing is successful, the UV1000 will be put back in production tomorrow morning.

24 Sept - 1400h:  New information has become available indicating the problem may be hardware-related.  It is now unlikely the UV1000 will be back in production today.

24 Sept - 1300h:  Testing continues.  Unless new information indicates otherwise, the UV1000 will likely be put back into production later today.

21 Sept - 1630h:  The UV1000 was once again restarted but, due to the ongoing instability, it will be left out of production over the weekend for testing purposes.

21 Sept - 1400h:  The UV1000 (2048-core compute node) crashed again.

21 Sept - 1300h:  Hungabee has been returned to full production.

21 Sept - 1230h:  The UV1000 was successfully restarted.  The hardware vendor was engaged to help search for the cause of the failure, but none was found.  Some diagnostic mechanisms failed to work as expected during the failure, and we are working with the vendor to determine the cause of this, so that future failures can be properly diagnosed.

21 Sept - 1030h:  Access to the login node of Hungabee has been restored.  The UV1000 compute node remains down.  We are continuing to investigate.

21 Sept - 1000h:  Hungabee is not currently available.  The cause of this outage is not yet known.  We are investigating and will report more detail as it becomes available.

Sep. 17, 2012 Orcinus: Limited computing capacity (updated Sep. 20, 2012, 3:30 PM (PT))

UBC Plant Operations need to perform urgent maintenance of the cooling infrastructure in the building and, as a result, Orcinus' computing capacity has to be scaled down to ~50%. The maintenance will be completed on Thursday, September 20, 2012, in the afternoon. At that time we will resume full scheduling. We apologise for any delays.

 

 

 Thursday, September 20, 2012

 

The cooling infrastructure maintenance has been completed and full Orcinus operation has been restored. Thanks for your patience.

Restored - Jasper cluster unscheduled outage due to filesystem problem - Sept 15

 

Sept 15 00:30 MDT:  Lustre is once again running on Jasper and the ability to log in has been restored.  We have updated software on the Lustre systems and are hopeful the filesystem will be stable now.

Sept 14 18:36 MDT:  Lustre caused job failures on all Jasper nodes. DDN has been contacted and we have uploaded logs; DDN analysts are looking at them.

 

Sept 14 16:10 MDT:  Jasper cluster unscheduled outage due to a filesystem problem. We apologize for any inconvenience.

Calgary WestGrid network connection upgrade - Wednesday, September 19 @ 9AM

We will be moving the Calgary connection to Cybera to a new device. We expect a brief interruption in connectivity while the network cables are moved. Breezy, Lattice and Parallel connectivity will be affected.
