System Notices

Wednesday, April 10, 2013 - U of Calgary air conditioner maintenance complete

UPDATE: The AC is back online and job scheduling has resumed.

One of the AC units in the research data center is undergoing maintenance. We have stopped scheduling jobs to Lattice and Parallel so as to not increase the heat load. We expect the AC to be operational by 2PM MST. Scheduling will resume after the AC is back in operation.

March 26, 2013, 12:45 PM (PDT) UBC Chiller failure

 

We have suffered a chiller failure and, as a result, Orcinus is down.

All running jobs were lost. We will try to restore operations as soon as possible.

 

 

Update: March 26, 2013, 7:30 PM (PDT)

Plant Operations were able to restore partial cooling. Since the cause of the failure is unknown and the chiller still has some vital components off-line, scheduling has been suspended.

We hope that normal operations can resume tomorrow, Wednesday, March 27.

Please check your data and resubmit the lost computations.

We are sorry for the interruption and any inconvenience.

Hungabee uv1000 back in production. Was: Hungabee uv1000 scheduled outage April 2, 9:00 AM MST - 5:00 PM MST

Update 5:00 PM MST April 2 2013:  

Hungabee uv1000 is back in production with its full complement of 2048 cores and 16 TB of RAM.  

Hungabee uv1000 Scheduled outage April 2 9:00 AM MST - 5:00 PM MST: 

This outage is to replace failed components and bring
the uv1000 into production with its full complement
of 2048 cores and 16 TB of RAM.
Jobs that cannot finish before the outage will not be started;
they will wait until the outage is over.
We apologize for the inconvenience.

Hungabee uv1000 back in production in a reduced state. Was: Hungabee uv1000 scheduled outage March 18 2013

Hungabee uv1000 is undergoing scheduled maintenance on March 18, 2013, ending at 5:00 PM.

The uv1000 will be unavailable for user jobs during this time.

 

Update March 18 21:00 MST:

 

    We ran into some hardware failures.

    Replacement parts are being shipped. We are working on a workaround. 

 

 

Update March 19 12:00 Noon MST: 

    The uv1000 is back in production in a degraded state:

    32 cores and 256 GB of RAM have been temporarily removed.

    Replacement parts are being shipped.         

UBC Glacier - Scheduled Maintenance Completed

 

GPFS maintenance has been completed and full operations are restored. Thank you for your patience; we apologize for any inconvenience.

Mar 12, 11:00 (CDT): Minor interruption to NFS on Grex

The NFS server on Grex was temporarily unavailable. The majority of jobs were unaffected, but some jobs may have timed out and failed due to the interruption. Please contact support@westgrid.ca if you need assistance or have any concerns. We apologize for the inconvenience.

Hungabee uv1000 will be participating in a special project and may be unavailable to most users from 1:00 am on Mar. 15 through the weekend.

 

 

Hungabee special project:
There is a reservation in place beginning at 1:00 am on Mar. 15
that will extend through the weekend. Currently the reservation
is for the entire machine; this may change as requirements 
become better defined.
You can continue to submit jobs and they will run as long as they 
don't conflict with the reservation and resources are available. 

 

Mar. 6, 12:12 (PST): Bugaboo Cooling System Failure

Due to a malfunction of the chiller system, about 200 of the Bugaboo nodes crashed. The affected nodes are b1 - b60, b62 - b121, b124 - b160, b393, s1 - s32. All jobs that were running on these nodes were terminated. We are sorry for the problems these node crashes have caused! Please resubmit the jobs.

UBC Glacier: Scheduled Maintenance - Tuesday, March 12, 2013

 

PLEASE NOTE: Glacier will be going down for file system maintenance starting:

 

   Tuesday, March 12, 2013

 

 The system will be offline until later in the evening. Please adjust your job submission walltimes accordingly.

Hermes/Nestor: decommissioning old GPFS filesystem Tuesday 19 February

At 10 a.m. PST on 19 February 2013 we will be decommissioning the old GPFS filesystem to release storage for other uses.  We migrated off this filesystem to new storage during the outage of 7 February 2013.

We do not expect this work to affect the new filesystem, but we will be removing storage and components from an active GPFS cluster, so there is some potential risk to the home and scratch filesystems.  Please let us know if you experience or observe any unexpected or adverse behaviour.
