System Notices

August 29, 12:30 MDT Update: Hungabee back in production. Was: Hungabee SMP 2048 core compute node scheduled outage to replace hardware.

August 29, 2012 9:00 AM - 5:00PM MDT: Hungabee SMP 2048 core compute node scheduled outage to replace hardware. 

     This scheduled outage is to replace internal communication (NUMAlink) hardware.

     Sorry for the inconvenience.

Update August 29, 2012 12:30 PM MDT:

     Hungabee back in production after hardware was replaced. 

Silo/Hopper back UP in production

UPDATE 9:15am CST:

 

All normal service has been restored (Silo and Hopper were up and functional as of 9:30pm Wednesday).

Please contact support@westgrid.ca with any questions or concerns.

 

------------

8:00pm CST:

A major power failure affecting much or all of the University of Saskatchewan has occurred.  Silo and Hopper are down.   Updates will be provided as they are available.

 

Please contact support@westgrid.ca with any questions or concerns.

Brief network disruption between world and Nestor/Hermes, 29 August 2012

There will be a 5 to 10-minute interruption in WestGrid connectivity at UVic between 0700 and 0800 on Wednesday 29 August.  We will be switching from the active path to a backup, as the active path between our WestGrid equipment and the Victoria Exchange will be affected by road work scheduled for September.  The internal networks should not be affected, but any jobs, processes or sessions requiring access outside of UVic will experience disruption during this time.

Update August 22, 2012 2:26 AM MDT: Hungabee SMP 2048 core compute node back in production. Was: Hungabee compute node scheduled outage to reboot

August 21, 2012 10:30 AM MDT: Hungabee SMP node scheduler problem.

    The scheduler running on the 2048 core node is locked; user jobs are still running.

    We need to reboot the 2048 core node to restart the scheduling system.

    We will wait for all user jobs to complete by 7:30 PM MDT and then restart the Hungabee SMP node.

 

August 21, 2012 7:30 - 11 PM MDT: Hungabee compute node scheduled outage to reboot.

    There will be no jobs affected by this reboot.

 

Update August 21, 2012 9:30 PM MDT: Hungabee SMP 2048 core compute node will not boot due to a hardware problem.
    We have contacted the vendor who is working on a fix.
    Sorry for the inconvenience.

 

Update  August 22, 2012 2:26 AM MDT:  Hungabee SMP 2048 core compute node back in production. 

Hungabee Not Accepting Jobs - Aug. 21

The main Hungabee node - the UV1000 - has an issue with its batch scheduling software which is preventing any new jobs from being run.  As well, it may cause the loss of currently running jobs.  The UV100 node (for small jobs) is not affected.

The problem is being investigated.  If a quicker solution cannot be found, the UV1000 will be restarted after all currently running jobs are off the machine.  If the machine is restarted, it should become available again late tonight or early tomorrow.

August 21, 2012: Bugaboo unresponsive - Resolved.

12:52 PM:

Bugaboo is back and available for users.

 

9:00 AM:

We are having an issue with Bugaboo. We will let you know once the issue is fixed.

Sorry for the inconvenience.

Checkers login node accessible again - August 18

Aug 18 - The login node was restarted, which restored service.  The cause of this incident is being investigated; any changes required to avoid similar incidents in the future will be implemented.

 

Aug 18 - The login node of the Checkers cluster - checkers.westgrid.ca - has stopped responding to ssh attempts.  Current jobs on the cluster appear to be continuing normally.  The situation is being investigated and will be fixed as soon as possible.

October 16: Grex Power Outage

October 16, 7:01am (central time): A brief power fluctuation at the University of Manitoba caused all compute nodes on Grex to reboot and all jobs were lost.

October 16, 7:33am (central time): Most compute nodes are online and we will enable job submission in the next hour or so.

October 16, 8:23am (central time): Grex is back in production.

We apologize for this inconvenience. Please contact support@westgrid.ca if you have any questions or concerns.

August 13, 2012: Updating Torque System on Grex.

We are updating the Torque system on Grex. The queues will be temporarily disabled and you may experience problems with job submission.

Sorry for the inconvenience.

Aug. 10, 2012 Checkers jobs lost

Torque on Checkers oversubscribed most nodes when the queues were restarted yesterday; this required Torque to be restarted at about 4:00 am.

All queued and running jobs were lost. Unmonitored jobs are being removed from nodes and the batch system will be monitored as jobs enter the queue.

We apologize for this inconvenience.
