System Notices

Resolved - Unexpected GPFS interruption on Nestor/Hermes

This has been resolved.  Thanks for your patience.

 

During integration of new GPFS nodes into our cluster, some nodes were expelled from GPFS, including our login nodes.  Logins may be unsuccessful or take a very long time, and existing sessions may not be able to access home, scratch and software directories.  We are working to correct the issue.

 

Sorry for the inconvenience.

Silo maintenance completed: 1300 CST Wed Feb 6

Silo and Hopper are up and in full production after scheduled maintenance.

Please contact support@westgrid.ca with any questions or concerns.

Jan 2, 2013, 11:00: Bugaboo Filesystem Problems (again)

The filesystem problems have reappeared. We are now running a check on all components (not just the ones that show errors). This will take a few hours and logins are disabled during this time.

Update: Jan 2, 2013, 15:00:

The file system checks have been completed. Logins have been re-enabled.

Jan 1, 2013: Orcinus Logins Restored

Logins have been restored to Orcinus. However, normal scheduling operations will not resume until we receive confirmation that the issue with the chiller has been resolved. Hopefully this will occur early tomorrow, as soon as UBC Plant Operations staff return from the holiday break. Until then, please check the status of your compute jobs and resubmit any that were lost. We are sorry for this inconvenience.

Jan 1, 2013: Bugaboo File System Problems

There are again filesystem problems on the Bugaboo system. Logins are currently disabled.

We are currently investigating the problem and will post updates here.

Update, Jan 1, 23:00 (PST):

Logins are enabled for now. One of the file servers has hardware problems and remains powered off. Hence failover is currently not available, and the system will go down again if new problems develop.

January 1, 2013 Orcinus cluster is down

January 1, 2013, 6:15 AM

The Chemistry building experienced a chiller failure. As a result, Orcinus had to be powered down and is off-line. We will try to restore normal operation as soon as possible. Sorry for any inconvenience.


Dec. 19-23, 2012 - Bugaboo file system problems

 

Dec. 19:

There is a general file system problem on Bugaboo this afternoon (Wednesday, Dec. 19).   This notice will be updated when the problem has been resolved or new information becomes available.  Sorry for the inconvenience.

Update Dec. 20, 15:46 Pacific:

The cause of yesterday's problem was a power failure.  Bugaboo is now back in service. There were some residual problems this afternoon (Thursday, Dec. 20) with the /home file system not being mounted on some of the compute nodes, causing some jobs to fail. The file systems have now been checked and appear to be mounted on all the compute nodes. Please check output carefully and resubmit any jobs as necessary.

Update Dec 20, 17:05 Pacific:

There are still problems with the /usr/local file system on Bugaboo. Please refrain from logging in and submitting jobs. Check this page for updates on Bugaboo's status.

Update Dec 23, 01:50 Pacific:

Bugaboo was back online for some hours on Saturday but further file system difficulties were discovered early Sunday morning.  Logins have once again been disabled while the system administrators diagnose the problem.

 

Update Dec 23,  8:00 PM (PST):

Bugaboo is back on-line. Normal cluster operations have been restored and access to the system has been re-enabled.

 


 

Updated: 2012-12-23 0150 PST.

Dec 18 - Campus-Wide Power Outage at UBC

In the early morning, the UBC campus lost power and, as a result, both Orcinus and Glacier went offline. We are working to restore full operation. Sorry for this inconvenience.

 

Update: Dec. 18, 2012  7:30PM (PST)

The Orcinus system is back on-line.  We still cannot restore full electrical power to one of Glacier's UPS units. Glacier will remain offline until tomorrow.

Update: Dec 19, 2012 3:00PM (PST)

Glacier is on-line and available for job processing.

Dec. 26, 2012, Calgary - Local Network Outage Scheduled

 

 

Between December 26 at 7:00 a.m. and December 27 at 11:00 p.m. (2012), a local network outage will occur on the Calgary campus.  The outage is required to perform a major upgrade to the core network infrastructure.  WestGrid machines should continue to run jobs; however, local connections to Breezy, Lattice and Parallel may not be possible during this time.

 

Brief network disruption scheduled for 8 p.m. PST tonight 17 December

Update: This upgrade was performed without incident.  Thanks for your patience.

Original message:

To address network performance issues on some of our 10G connections, we plan to implement a firmware update on our Force 10 network switch.  Due to time constraints, the severity of the problem, and the relatively low expected impact on the majority of our users, we will be implementing this change without the usual notice.

The upgrade will require a switch reboot, which may take up to 30 minutes.  During this time access to the interactive nodes will be impossible and jobs' access to external resources will also be affected.

We will post updates in this space.
