System Notices

Hermes/Nestor Login Nodes Temporarily Unavailable

Due to excessive memory consumption by users on both of our interactive login nodes, Hermes and Nestor were unavailable for interactive sessions from around 2100 to 2350 PST this evening.  Memory and other user limits have been implemented to help manage this problem.  Running jobs were not affected.

Feb. 18, 2013 Back in production: Jasper login node unavailable 19:45 - 20:15 MST

7:45 pm Network connections to Jasper are failing. A system administrator is looking into the problem.

 

8:25 pm Jasper back in production.

  • No running or queued jobs were affected.
  • Users were unable to log in to the Jasper cluster to submit jobs or view job results between 19:45 and 20:15 MST.

Feb 14-15, 2013: Brief network interruption on Grex.

UPDATE: The faulty network switch has been replaced and Grex is back at full capacity. Unfortunately, in order to install the new switch we had to reset the whole network. This reset caused a timeout within Torque, and all jobs running across multiple nodes were terminated. Single-node jobs (e.g. Gaussian jobs) were unaffected. Please contact support@westgrid.ca if you have any concerns or if you require elevated priority to rerun any jobs that were terminated. We apologize for the inconvenience.

Feb 15: We were unable to complete the replacement yesterday and we will continue today. The queues will be halted from 8:30 until 13:00 (central time).

Feb 14: One of our network switches died on Feb 12 and we are replacing it today (Feb 14). About 150 compute nodes will be disconnected from the Ethernet network while we replace the switch. Most running jobs should be unaffected by this outage, but the queuing system will be stopped.

Jasper / Hungabee / Checkers Site Network Upgrade - Feb 19

There will be a network upgrade - from 2200h to 0000h Mountain time on February 19 - affecting all WestGrid facilities at the University of Alberta - Jasper, Hungabee and Checkers.  Access to these machines from outside the U of A may be interrupted during the upgrade.  No scheduled or running jobs will be affected unless the jobs require network access to resources outside the U of A.

Complete: Hermes/Nestor Outage scheduled for 7 February 2013

Update 7 February 2013 22:45 PST.

This outage is now complete.  Thank you for your patience and please let us know if you see any issues.

Original Message

An all-day outage is scheduled for the Hermes and Nestor clusters on Thursday 7 February 2013.  The outage starts at 4:30 a.m. PST and is scheduled to end the following morning at 5:00 a.m. PST.

This outage is needed to upgrade network equipment servicing the data centre, to complete the final migration of user data to the new storage, and to carry out other maintenance tasks.

A reservation has been placed on the system so that jobs will be blocked from running during the outage.

WestGrid SFU Network Outage: 9:00AM PST, Wed January 30, 2013

We will move the WestGrid SFU connection from the current Dell switch to the new Force 10 switch on Wed, Jan 30 at 9AM PST. We do not expect the outage to last more than 5 minutes. If you have any questions, please e-mail support@westgrid.ca.

Hungabee (UV1000) Unscheduled Outage - 20 January

22 January, 1630h:  The planned Hungabee outage, starting 900h on 22 Jan., allowed further work to be done on the UV1000.  It is now scheduled to go back into production at the end of the planned outage.

21 January, 830h:  In the last several hours, the behaviour of the UV1000 has degraded.  There is the possibility that jobs may now be lost.  Our investigation continues.

20 January, 2320h:  Hungabee's UV1000 compute node has experienced an event that caused it to briefly lose contact with our monitoring tools.  Running jobs appear to be unaffected but, as a precaution, all queues have been paused to prevent new jobs from starting.  We are investigating the cause and will restart queues as soon as possible.

Brief GPFS interruption affected running jobs (20130117 17:48 PST)

Yesterday evening at 17:48, Hermes and Nestor experienced a three-second interruption in our GPFS cluster.  Some jobs have been affected.  We are investigating the cause and apologise if your work was affected.

Hungabee/Jasper Back in production: Was: scheduled outage Tuesday 22 January 2013

22 Jan. - 2015h:  Hungabee is now back in production.

22 Jan. - 1630h:  Jasper is now back in production.

 

Hungabee and Jasper will have a scheduled outage starting on Tuesday, January 22 at 9AM.  The outage may last up to 24 hours. During this outage, quotas will be enabled on the Lustre filesystem and other maintenance work will be carried out.  Jobs unable to complete before the outage starts will be run after the outage.

Unexpected network outage at UVic affecting Hermes/Nestor

During planned electrical maintenance at VICTX this morning, an unexpected problem arose that caused all of the CANARIE equipment at VICTX to power down. This power-down was unexpected, as this equipment has dual power supplies.

All network traffic leaving Victoria through BCNET and CANARIE, apart from commodity internet, would have been affected, including CANARIE and research traffic.

Outages were at the times detailed below (PST):

Start: January 17, 2013 - 06:57
End: January 17, 2013 - 07:30

We apologize for this unplanned outage.
