Running Jobs on the WestGrid Glacier Cluster

Introduction

Please see the Running Jobs page for a general introduction to the batch queuing and scheduling system used at WestGrid sites. Additional site-specific information for the Glacier system is given below.

File Systems and Storage

Storage space is provided through IBM's General Parallel File System (GPFS), a high-performance shared-disk file system that provides fast data access from all nodes in a cluster. A Storage Area Network (SAN) with almost 14 TB of disk space is connected directly to four storage nodes (moraine1, ..., moraine4), which fulfill I/O requests from all nodes in the cluster.

There are two general-access file systems with different characteristics available on glacier.westgrid.ca:

  • /global/home

    /global/home/username is your $HOME directory.
    Disk space is limited, so please use this file system only for essential data (source code, processed results, small output files, etc.).
    If your code creates large data sets, please do not use this file system as the working directory for your jobs; use /global/scratch instead.
    The /global/home file system is backed up, with a 14-day expiration policy.
    Size: 5.4 TB.
  • /global/scratch

    A file system designed as a work area for large, rapidly changing data sets.
    Please create a directory of your own (cd /global/scratch ; mkdir username) and use it as the working directory for your jobs, as shown in the example below.
    This file system is not backed up.
    Size: 8.6 TB.
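
For example, assuming your username is someuser (a placeholder), a scratch work area can be created and used as follows:

cd /global/scratch
mkdir someuser
cd someuser             # start your jobs from this directory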

In addition, each compute node has a local /scratch partition of approximately 18 GB.

For long-term storage of your files, please move them out of /global/home or /global/scratch to the WestGrid storage facility or to your own system. The gcp command (/usr/apps/bin/gcp) can be used to transfer files efficiently to the storage facility. See the Gridstore QuickStart Guide for more information.
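
The exact gcp syntax is documented in the Gridstore QuickStart Guide; the sketch below is an illustration only and assumes that gcp accepts cp-style source and destination arguments (an assumption, not confirmed here). The archive name and destination are placeholders:

cd /global/scratch/someuser
tar czf results.tar.gz results/                            # bundle a results directory (placeholder name) into one archive
/usr/apps/bin/gcp results.tar.gz <gridstore_destination>   # hypothetical destination; check the Gridstore QuickStart Guide for the real syntax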

Batch Jobs and Usage Guidelines

As described on the main Running Jobs page, batch jobs are submitted with qsub and deleted with qdel.  However, because of the large number of jobs that are active or queued on the Glacier cluster at any one time, you may prefer to monitor jobs with the site-specific qsort command instead of qstat or showq (which are typically used on other WestGrid systems).  Unlike qstat and showq, which require a -u <username> option to restrict the output to your own jobs, qsort shows only your jobs by default.  Type qsort -h for information about qsort options.
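
For example, a typical sequence of commands might look like the following (the job script name and job ID are placeholders):

cd /global/scratch/someuser     # your scratch directory (someuser is a placeholder)
qsub myjob.pbs                  # submit the job script; qsub prints a job ID
qsort                           # list your own jobs (no -u option needed)
qdel 123456                     # delete the job with that ID, if necessary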

Sample job scripts are available on the Glacier Programming page.

Some points to keep in mind when submitting jobs on Glacier:

  • As noted above, jobs that produce significant output should be started from your subdirectory under /global/scratch and not from your home directory.
  • The maximum walltime limit is 240 hours.
  • There is no specific limit on the maximum number of processors that can be used for a given job, but waiting times increase as the number of requested processors increases.
  • There are two processors on each node, so the maximum processors per node (ppn) is 2. See the main Running Jobs page for more information about how to specify ppn. A sample job script illustrating these settings is shown after this list.
  • The majority of the Glacier compute nodes have 2 GB of physical RAM, while about 10% have 4 GB.
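
The following is a minimal job script sketch illustrating these limits. The job name, program name and memory request are placeholders, and the resource syntax follows standard TORQUE/PBS conventions as described on the main Running Jobs page; adjust it to your own needs:

#!/bin/bash
#PBS -N myjob                    # job name (placeholder)
#PBS -l walltime=240:00:00       # must not exceed the 240-hour limit
#PBS -l nodes=1:ppn=2            # at most ppn=2 on Glacier
#PBS -l mem=1800mb               # stay within the 2 GB of RAM on most nodes (assumed syntax)

# Run in the directory the job was submitted from,
# which should be your subdirectory under /global/scratch.
cd "$PBS_O_WORKDIR"

./myprogram > myprogram.out 2>&1   # placeholder program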

Interactive Sessions

Limited interactive work can be done on the Glacier login node. For more extended interactive sessions for debugging or visualization, submit an "interactive batch job", as described on the main Running Jobs page, with a command like:

qsub -I -l walltime=00:30:00

There are two compute nodes specifically reserved for short debugging runs, with a maximum walltime limit of 10 minutes. These are usually available without a long wait.  To request one of these nodes, add -W x="QOS:debug" to the qsub command line:

qsub -I -l walltime=00:10:00 -W x="QOS:debug"
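
If both processors on a debug node are needed, the resource requests can presumably be combined in the usual way. This particular combination is an assumption based on the standard qsub syntax shown above, not taken from the Glacier documentation:

qsub -I -l walltime=00:10:00,nodes=1:ppn=2 -W x="QOS:debug"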

 


Updated 2009-02-18.