Grex QuickStart Guide

About this QuickStart Guide

This QuickStart guide gives a brief overview of the WestGrid Grex facility, highlighting some of the features that distinguish it from other WestGrid resources. It is intended to be read by new WestGrid account holders and by current users considering whether to move to the Grex system. For more detailed information about the Grex hardware and performance characteristics, available software, usage policies and how to log in and run jobs, follow the links given below.

Introduction

Grex is an SGI Altix XE 1300 cluster with 3792 cores. It is intended for parallel applications that can take advantage of either the non-blocking, low-latency InfiniBand network or the large amount of memory per compute node.

Grex is Latin for "herd", and the name grex.westgrid.ca serves as an alias for two login nodes:

  • bison.westgrid.ca
  • tatanka.westgrid.ca

The login nodes are named after animals of the bovine family.


Hardware

Processors

The Grex cluster consists of 316 compute nodes, each with two 6-core Intel Xeon X5650 2.66GHz processors (Intel Westmere architecture). 24 compute nodes have 96GB of memory and the remaining 292 nodes have 48GB of memory.

Interconnect

All compute nodes are connected by a non-blocking InfiniBand 4X QDR network.

Storage

Storage is provided by a DDN S2A9900 system. The file systems are listed in the following table:

File system        Size                Quota (per user)     Purpose              Backup policy
/home              7TB                 10GB                 Home directories     Weekly backup
/global/scratch    95TB                1000GB               Global scratch       No backup
$TMPDIR            150GB-2TB           Request space when   Node-local scratch   Deleted when the
                   per compute node    submitting the job                        job is completed
                                       (-l file=size)

Software

On Grex, a variety of application software (Gaussian, GAMESS-US, ORCA, SIESTA, VMD, VASP, etc.), compilers (the Intel Fortran/C/C++ and GNU GCC compiler suites) and performance libraries (Intel MKL, GotoBLAS) is preinstalled for general use. Currently, two message-passing library distributions are supported: Intel MPI 4 and OpenMPI 1.6.1. In general, the software installed on Grex is made available through "environment modules"; to review and access it, use the module avail command. For a more complete list of application software, compilers, and mathematical and graphical libraries, see the tables on the main WestGrid software page.
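
For example, the following commands list the available modules and load one of them (the package name is only an illustration; check the output of module avail for the exact module names on Grex):

    module avail              # list all software available through environment modules
    module avail gaussian     # list only the modules whose names match "gaussian"
    module load gaussian      # illustrative name; add the chosen package to your environment
    module list               # show the modules that are currently loaded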

Please write to WestGrid support if there is additional software that you would like installed on Grex.

Using Grex

Logging In

To log in to Grex, connect to grex.westgrid.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, see the QuickStart Guide for New Users.
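
For example, from a terminal on your own computer (replace username with your WestGrid username):

    ssh username@grex.westgrid.ca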

Compiling and Running Jobs

The login nodes can be used to compile code and to run short interactive or test runs. All other jobs must be submitted to the batch system. Normally, you need to choose a compiler suite (Intel or GCC) and, for parallel applications, an MPI library (OpenMPI or Intel MPI). By default, the Intel 12.0.3 compilers and the OpenMPI libraries are loaded. If you want to change that, issue the module purge command to unload them, and then use module load to select the combination of compilers and libraries you need.
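
For example, a session that switches from the default Intel/OpenMPI combination to GCC with OpenMPI and then compiles an MPI program might look like this (the module names are illustrative; run module avail to see the exact names and versions installed on Grex):

    module purge                        # unload the default Intel compiler and OpenMPI modules
    module load gcc openmpi             # illustrative module names; check "module avail"
    mpicc  -O2 -o mycode mycode.c       # compile an MPI C program with the OpenMPI wrapper
    mpif90 -O2 -o mycode mycode.f90     # or an MPI Fortran program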

As on other WestGrid systems, batch jobs are handled by a combination of TORQUE and Moab software. For more information about submitting batch jobs, see Running Jobs. Grex now enforces memory limits for TORQUE batch jobs. Jobs that do not explicitly specify memory (the mem or pmem parameters in qsub) will be assigned the default value of pmem=256mb. Note that, unlike on some other WestGrid systems, vmem and pvmem resource requests should not be used on Grex.
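
As an illustration only (the resource values, script name and program name are placeholders, not recommendations), a simple TORQUE job script for Grex might look like this:

    #!/bin/bash
    #PBS -l nodes=1:ppn=12              # one full node (12 cores)
    #PBS -l pmem=2000mb                 # memory per process; do not use vmem/pvmem on Grex
    #PBS -l walltime=24:00:00           # must stay within the walltime limits listed below

    cd $PBS_O_WORKDIR                   # run from the directory the job was submitted from
    mpiexec -n 12 ./mycode              # placeholder MPI program

The script would be submitted with qsub jobscript.pbs. If a job needs node-local scratch space in $TMPDIR, the corresponding request (for example, -l file=100gb) can be added in the same way, as noted in the storage table above.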

The /home file system on Grex is not designed for heavy I/O operations. Therefore, it is important that jobs that perform a significant amount of I/O are not run from /home/<username>. Instead, /global/scratch/<username> should be used for all such jobs.
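
For example, a typical workflow (the directory and file names are illustrative) is to stage the input in the global scratch space and submit the job from there:

    mkdir -p /global/scratch/$USER/myrun           # illustrative directory name
    cp ~/myrun/input.dat /global/scratch/$USER/myrun/
    cd /global/scratch/$USER/myrun
    qsub jobscript.pbs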

Batch Job Policies

The following policies are implemented on Grex:

  • The default walltime is 3 hours
  • The default amount of memory per processor (the pmem parameter) is 256MB. Memory limits are now enforced, so an accurate estimate of the memory requirement should be provided.
  • The maximum walltime is 21 days for Gaussian jobs and 7 days for all other jobs. 
  • The maximum total of processor-days over all of a single user's currently running jobs is 2,800. (Attempting to exceed that limit will lead to an error message about MAXPS, maximum processor-seconds.)
  • The maximum number of jobs that a user may have queued to run is 4000. The maximum size of an array job is 2000 (see the array job example after this list).
  • As of this writing (2013-01-14), groups of users without a Resource Allocation Committee award are allowed to simultaneously use only up to 300 CPU cores per accounting group.
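
For example, an array job of 500 tasks, well within the 2000-element limit, could be submitted as follows (the script name and index range are placeholders):

    qsub -t 1-500 jobscript.pbs         # each task sees its own index in $PBS_ARRAYID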

Updated 2013-01-14.