System Notices

Fixed - Unscheduled outage of Hungabee and Jasper clusters, filesystem failure - Sept 11

Sept 11 3:00 PM MDT:  The Lustre filesystem of Hungabee and Jasper has now returned to production.  We are continuing to work with the vendor to prevent future problems, and will implement any required changes as soon as possible.

Sept 11 8:45 AM MDT: Unscheduled outage of the Hungabee and Jasper clusters due to a filesystem failure.

   The Lustre filesystem serving both Hungabee and Jasper clusters experienced a failure.

   Both clusters are currently down. The vendor has been contacted and is examining the problem.

   We apologize for any inconvenience.

Sep. 8, 2012: Bugaboo Power Outage Cancelled

The planned power outage for Sep. 8, 2012 for the datacentre that houses the Bugaboo system has been cancelled and all systems should continue to operate as usual.

Sept 4 3:45 PM MDT Update: Hungabee back in production with replacement hardware.

Sept 4 3:45 PM MDT

   Update: The 2048-core Hungabee compute node is back in production with replacement hardware.

Sept 1 1:30 PM MDT

    Unscheduled outage of the 2048-core Hungabee compute node due to a hardware error.

    We apologize for any inconvenience.

Sep. 8, 2012: Bugaboo Unavailable

Due to a power outage that affects the whole data centre, Bugaboo must be powered off on Saturday, Sep. 8 around 7:00 AM. We expect that the system will become available late in the evening on that day.

The scheduler will not start jobs with a walltime that extends into the power outage. Jobs that are still running on the morning of Sep. 8 will be terminated.
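
The walltime rule above can be sketched as a simple time comparison. This is only an illustration for estimating whether a job request would fit before the outage, not the scheduler's actual logic: the function names `can_start` and `max_walltime` are hypothetical, the 7:00 AM start is the approximate time from the notice, and times are assumed to be naive local times at the data centre.

```python
from datetime import datetime, timedelta

# Approximate outage start taken from the notice (Saturday, Sep. 8, 2012, around 7:00 AM).
# Assumption: naive local time; the real scheduler may apply a different cutoff.
OUTAGE_START = datetime(2012, 9, 8, 7, 0)


def can_start(now: datetime, walltime: timedelta,
              outage_start: datetime = OUTAGE_START) -> bool:
    """Return True if a job started at `now` would end no later than the outage start."""
    return now + walltime <= outage_start


def max_walltime(now: datetime,
                 outage_start: datetime = OUTAGE_START) -> timedelta:
    """Longest walltime that still fits before the outage (zero if it has already begun)."""
    return max(outage_start - now, timedelta(0))


if __name__ == "__main__":
    submit_time = datetime(2012, 9, 7, 18, 0)  # hypothetical start time
    print(can_start(submit_time, timedelta(hours=12)))
    print(can_start(submit_time, timedelta(hours=14)))
    print(max_walltime(submit_time))
```

Under these assumptions, a job that could start at 6:00 PM on Sep. 7 would need a requested walltime of 13 hours or less to be eligible to run before the outage.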

September 4, 2012: Reminder of extended Lattice, Parallel and Breezy maintenance outage

This is a reminder of the announcement originally made in mid-July: Parallel, Lattice and Breezy will be removed from service on September 4 for approximately 10 days for the addition of new hardware.  Firmware upgrades will also be installed.  We apologize for any inconvenience this might cause.


Updated 2012-08-31.

Noticeable network latencies from/to Nestor/Hermes - Updated

August 30, 2012 at 11:00PM:

The network latencies seem to be fixed. We are keeping an eye on the network performance. Please let us know if you see any slowness from/to Hermes/Nestor.

August 30, 2012 at 12:00PM:

We are experiencing noticeable network latencies when working on Hermes/Nestor, at least from within UVic or from other WestGrid machines. Our system and network admins are actively working on the issue.

We apologize for any inconvenience.

Resolved: Some Checkers compute nodes down - Aug 30

Aug 30 1600h MDT - A switch issue was confirmed.  Two network cables were replaced.  Most compute nodes have now returned to production.

Aug 30 1130h MDT - due to a suspected switch problem, many Checkers compute nodes have gone down.  Jobs on these nodes have likely been lost.  The problem is being investigated.  Apologies for any inconvenience this issue may be causing.

Jasper/Hungabee Shared Filesystem Issue Resolved - August 28

August 28 - 2330h MDT:  After consultation with the vendor, the problem was determined to be high load levels on the OSS component of Lustre.  The system recovered when cluster activity was reduced.  The job queue has now been restarted.  To prevent future issues, the vendor will apply performance tuning changes to the system.

August 28 - 1500h MDT:  The Lustre distributed filesystem that provides /home and /global/scratch for Jasper and Hungabee is having issues.  In many cases, the filesystem is non-functional and will likely prevent any use of Jasper and Hungabee.  The problem is being investigated.

RESOLVED: Silo/hopper: /home2, /data2, /data6 available again

1630 CST Update:

Replacement parts were installed in the UPS and the UPS was returned to service.  The back-end storage controller has been turned back on, and all filesystems are again mounted on silo/hopper.

Please contact support@westgrid.ca with any questions or concerns.

1600 CST:

One of the Uninterruptible Power Supplies powering the back-end data storage system on silo/hopper has failed over to battery.  As a precautionary measure, the following filesystems have been unmounted so that the storage controllers can be powered down gracefully until the UPS is returned to service:

/home2
/data2
/data6

These filesystems will be unavailable for use until their controllers are powered back on.

Please contact support@westgrid.ca with any questions or concerns.

Rolling reboots Silo: Normal service resumed.

Edit 1545 CST:

Maintenance is complete on the silo/hopper servers.  Normal service has resumed.

Please contact support@westgrid.ca with any questions or concerns.

----

As part of regular maintenance and in response to last week's power outage, we will be doing a rolling reboot of all the silo/hopper servers on Monday Aug 27 1430-1630 CST.  Each server should be down for no longer than 15 minutes, so the silo service should remain available for quick transfers during this period.  Extended transfers to/from silo should be avoided during this maintenance window to minimize the chance that a server goes down during a transfer.  Ten minutes' notice will be given before a given silo/hopper server reboots.

Please contact support@westgrid.ca with any questions or concerns.
