System Notices

Silo/Hopper rolling downtime Monday Aug 13: maintenance complete

Update 1054CST: All maintenance is complete.  Please feel free to resume normal usage of Silo.

Please contact support@westgrid.ca with any questions or concerns.

----------------

As part of regular maintenance and in response to a minor hardware issue earlier in the month, we will be doing a rolling reboot of all the Silo/Hopper servers on Monday Aug 13 0900-1100 CST.  Each server should be down for no longer than 15 minutes, so the Silo service should remain available for quick transfers during this period.  Extended transfers to/from Silo should be avoided during this maintenance window to minimize the chance that a server reboot occurs mid-transfer.  15 minutes of notice will be given before each Silo/Hopper server reboots.

Please contact support@westgrid.ca with any questions or concerns.

09 August 2012 - Jasper available to all WG users

Jasper is now available to all WestGrid users.

09/08/12 noon MDT: Jasper Queueing Turned On

We now have PBS running jobs on Jasper.

Aug. 9, 2012 2:45 pm Update: Jasper, Checkers back in production. Was: Power fluctuation at UofA - cluster nodes down

Update Aug 9 14:45:

Checkers cluster put back into production.

We apologize for any inconvenience. 

Update Aug 8 19:10:

Jasper is being put back online.

Checkers cluster networking is broken; the vendor is working on a solution.

Aug. 8, 2012 11:00:

A power fluctuation at the UofA has caused some systems not on UPS to shut down.

Jasper nodes are currently all down. Half of the Checkers cluster is now visible on the network.

Admins are restarting nodes and investigating possible physical problems.

The cause of the fluctuation is being investigated.

Aug 8 2:05 PM MDT Lustre and Hungabee back in production. Was: Hungabee and Jasper unscheduled outage, parallel filesystem

Aug 8 8:50 AM MDT Hungabee and Jasper unscheduled outage, parallel filesystem

The Lustre high-performance filesystem serving Hungabee and Jasper is down. We are currently investigating the cause.

Aug 8 2:05 PM MDT Hungabee and Lustre back in production.

Jasper is still down due to a separate power outage.

Aug. 9, 2012 10:00 to 13:00 Jasper/Checkers scheduled outage - switch firmware maintenance

Aug. 9, 2012 10:00 to 13:00

Users will be unable to submit or run jobs during this time. Jobs whose walltime fits in the time remaining before the maintenance period will run; other submitted jobs will start running when the machines are released.

Internal switches in both the Jasper and Checkers clusters will have firmware versions downgraded to stabilize communication within and between the clusters.
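
For users wondering whether a queued job is likely to start before the outage, the walltime rule above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the scheduler's actual logic; the function name `runs_before_outage` and the example times are hypothetical, and the real decision also depends on node availability and queue priority.

```python
from datetime import datetime, timedelta

def runs_before_outage(now, walltime, outage_start):
    """Return True if a job started at `now` with the requested
    `walltime` would finish no later than `outage_start`.

    Hypothetical helper: it ignores queue wait time and resource
    availability, which the real scheduler also takes into account.
    """
    return now + walltime <= outage_start

# Outage start taken from the notice above (Aug. 9, 2012 10:00).
outage_start = datetime(2012, 8, 9, 10, 0)

# Assumed current time, for illustration only.
now = datetime(2012, 8, 9, 6, 0)

print(runs_before_outage(now, timedelta(hours=3), outage_start))  # 3h job ends 09:00
print(runs_before_outage(now, timedelta(hours=6), outage_start))  # 6h job ends 12:00
```

In this sketch the 3-hour request fits before 10:00 and the 6-hour request does not, so requesting a shorter walltime may let a job run ahead of the maintenance window.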

Checkers Maintenance Update - August 1-9, 2012

August 2

In order to allow all running jobs to finish, the switch restart is now scheduled for Thursday, August 9.  No new jobs will be run until after the restart.

August 1

The management network switches on Checkers will be restarted sometime during the day on August 2.  Job scheduling has been paused until the end of August 2 in order to reduce the chance of jobs being lost during the outage.

Hungabee (UV1000) Outage Ended - August 2

August 2

Hungabee is now up again.  Hardware failure of a blade was confirmed.  The failed components were replaced and the UV1000 was brought back up again.  During the outage, changes were made to the NFS-mounted directories on the UV100 (Hungabee's login node) which should result in greatly reduced NFS-related system stalls and hangs.

July 28 0835h MDT

Hungabee compute node (UV1000) failed to restart and remains down.  Possible hardware failure of a compute blade.  The investigation continues.

Jul 28 0125h MDT

Hungabee compute node (UV1000) became unresponsive and was consequently rebooted; the cause is being investigated.

COMPLETE - Nestor/Hermes Site Outage: 31 July - 1 August

Update 20120801 0010 PDT - outage complete

The work scheduled for this outage has now been completed and systems are now back in production.  Nestor jobs remain queued at the moment due to test jobs we would like to complete as they will provide useful performance metrics as well as further assurance that our InfiniBand issues have been resolved.  Hermes, Atlas and Xen queues are all operational.  Thank you for your patience.

Original Message

With issues in our InfiniBand and management Ethernet switches apparently resolved, we are ready to continue with integration of the new hardware for the Hermes expansion.  This requires a site outage and we will be performing some necessary maintenance at this time as well.  The outage is scheduled for 8 a.m. 31 July to 5 p.m. 1 August, all times Pacific.
