System Notices

Power bump at the UofC

November 1: Due to a power bump at the UofC, lattice, parallel and breezy had nodes that rebooted. Some jobs were lost. Please check your jobs and restart as necessary.

UBC Glacier - File System Crash

Due to a hardware issue, the /global/scratch file system on Glacier crashed. As a result, any compute jobs using /global/scratch will fail. In preparation for fixing the issue, there is currently no user access to any Glacier head nodes. We will endeavor to fix the issue as quickly as possible.

Fixed: UofA Site Power Outage Oct. 28 (unscheduled)

 

5:30 pm The Checkers Cluster is back in production.  Some running jobs were affected.  Please check your jobs and restart as necessary.

4:00 pm The Jasper cluster and Hungabee, the large shared memory machine, are back in production. Some running and queued jobs were affected. Some jobs were successfully requeued, but please check your jobs and restart as necessary.

 

1:00 pm Power and cooling are back in the data centre. Machines have been booted but there are issues that need to be addressed because of the power outage. Machines will be made available as they are confirmed to function properly. This outage covered the entire UofA North campus and some surrounding neighbourhoods.

 

8:00 am MST Oct. 28, 2012 The University of Alberta suffered a power outage. All Compute Canada machines at the University of Alberta are affected and personnel are onsite investigating the cause.

UPDATED - Hermes/Nestor outage, 7 November 2012

Update 20121107 2030 PDT.

The cluster is back up and reservations have been released.

Apart from the main purpose of this outage, which was to upgrade the InfiniBand firmware and software levels, we took the opportunity to upgrade systems and Torque as well, so some of the client tools may have a slightly different look.

Please let us know if you notice any issues.  Thanks for your patience.

 

Update 20121107 1700 PDT.

The cluster is back up and we are finishing off testing. The head nodes are available, but we will not declare the outage over until we have finished testing. Thank you for your patience.

 

Original message:

A one-day maintenance outage is planned for 7 November 2012 for the Hermes/Nestor clusters. This outage is to perform necessary upgrades to our IB infrastructure and will require a full site outage. Work will commence at 0800 Pacific time and, while it is not expected to take the full day, we are reserving the whole day for contingencies.

Jobs will not run during this time and a system reservation has been put in place across the cluster such that jobs will not start prior to the outage if their projected runtime would overlap the outage window.  Such jobs will remain queued.
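
To estimate whether a queued job can still start before the outage, the rule above can be sketched as a simple time comparison. This is only an illustration of the rule as described in this notice, not how Torque or the scheduler implements reservations; the function names and the submission time below are hypothetical.

```python
from datetime import datetime, timedelta

def can_start_before_outage(now, walltime, outage_start):
    """Return True if a job started at `now` would finish by the outage start."""
    return now + walltime <= outage_start

def max_walltime(now, outage_start):
    """Longest walltime request that would still let a job start at `now`."""
    return max(outage_start - now, timedelta(0))

# Outage start taken from the notice: 0800 Pacific time, 7 November 2012
# (treated here as a naive local time).
outage_start = datetime(2012, 11, 7, 8, 0)

# Hypothetical time at which the job would otherwise be eligible to start.
now = datetime(2012, 11, 6, 20, 0)

print(can_start_before_outage(now, timedelta(hours=10), outage_start))  # True
print(can_start_before_outage(now, timedelta(hours=24), outage_start))  # False
print(max_walltime(now, outage_start))  # 12:00:00
```

A job whose requested walltime exceeds the remaining time stays queued until the reservation is released; requesting a shorter walltime may allow it to run beforehand.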

Updates will be posted here.

Hungabee, Jasper, Checkers network outage on Oct. 24

25 Oct:  The outage is now complete.  All U of A WestGrid machines should now be accessible.

23 Oct:  We have been informed by the WestGrid network provider - CANARIE - that they will be doing network maintenance on Wednesday, October 24th between 2000h and 2100h MDT.  During this time, connectivity to UAlberta WestGrid systems, including Hungabee, Jasper and Checkers, may be disrupted.  If you cannot connect to one of these systems during the maintenance, please try again later.

Glacier - storage maintenance. Nov 6, 2012

Dear Glacier Users


We need to perform some maintenance on Glacier's storage configuration. As a result Glacier will not be available for either computation or file transfers starting:

           Tuesday, November 6, 2012, at 10:00 AM PST

We hope to return Glacier to full operation within a day or two. A system-wide reservation has been set. Please adjust your walltime requests accordingly. Thank you for your understanding and cooperation.

University of Calgary access to WestGrid interruption

November 1st, 7AM-8AM: The UofC's network connection to WestGrid will move from an old device to the new campus router. All UofC access to WestGrid will be affected during this window, and any existing connections to WestGrid sites will be terminated. UPDATE: This change is being moved to November 1st due to technical issues.

Fixed: Hungabee Unscheduled Outage - 18 October

19 October - 1300h:  Replacement parts have been installed in the UV1000, and the machine has now been put back into production.

18 October - 1530h:  The vendor has confirmed a hardware failure.  Until replacement parts can be installed, the UV1000 will not be put back into production.  The estimated time for returning to production is later tomorrow, 19 October.

18 October - 0630h:  The UV1000 compute node of Hungabee has crashed.  All running jobs have been lost.  We are investigating the cause of the failure and will return it to production as soon as possible.

Update: Nestor/Hermes GPFS outage (12 October 1445 PDT)

Update 2, 12 October 1445

We have had limited success working with the vendors to determine the cause of this problem and have decided to bring the cluster back online.  We brought 75% of the newer Hermes nodes online yesterday for testing and so far have experienced no problems, so we are putting these nodes into production.  We will be monitoring the cluster for developments.  Thank you once again for your patience.

Update 1, 10 October 1545

After disconnecting the newer Hermes nodes and their IB switches from the fabric and rebooting the core IB switch, the original Hermes nodes and Nestor have both recovered.  We have re-enabled the queues and are monitoring as we work with the vendors to determine the cause of the problem and a plan to reintroduce the newer Hermes nodes and IB switches.  In the meantime, resources for serial jobs are very limited.

I apologise for this outage and appreciate your patience as we investigate.  We will post updates as they become available.

Original Message

Nestor/Hermes are experiencing an unscheduled GPFS outage.  We are investigating and will post updates as they become available.

Fixed: October 9-10: Grex Unscheduled Outage

October 9, 5:14pm (Central time): The storage system on Grex went offline. We are working to bring Grex back into production ASAP.

October 9, 8:30pm (Central time): The Lustre file system (/global/scratch) is unstable and we have referred this to our storage vendor.

October 10, 9:00pm (Central time): Issues are resolved and Grex is back in full production. Many running jobs seem to have survived the outage.
