QuickStart Guide to Bugaboo

About this QuickStart Guide

This QuickStart guide gives a brief overview of the WestGrid Bugaboo facility, highlighting some of the features that distinguish it from other WestGrid resources. It is intended to be read by new WestGrid account holders and by current users considering whether to move to the Bugaboo system. For more detailed information about the Bugaboo hardware and performance characteristics, available software, usage policies and how to log in and run jobs, follow the links given below.

Introduction

Bugaboo is a Dell blade cluster with 4584 cores connected with Infiniband, running the Scientific Linux operating system. It is intended for jobs that require access to large storage systems (terabytes of data, for example).

Hardware

Processors

The Bugaboo cluster comprises 28 chassis. Ten of these contain 16 8-core blades each, for a total of 1280 cores. Each of these blades (compute nodes) contains two sockets, each holding an Intel Xeon E5430 quad-core processor running at 2.66 GHz. Each blade has 16 GB of memory that can be shared among the 8 cores on that node.

Another 16 chassis contain 254 12-core blades, for a total of 3048 cores. Each of these blades contains two sockets, each holding an Intel Xeon X5650 6-core processor running at 2.66 GHz. Each blade has 24 GB of memory that can be shared among the 12 cores on that node.

The remaining two chassis contain 16 8-core blades each, for a total of 256 cores. Each of these compute nodes contains two sockets, each holding an Intel Xeon X5355 quad-core processor running at 2.66 GHz. Each blade has 16 GB of memory that can be shared among the 8 cores on that node. These two chassis have only a Gigabit Ethernet interconnect and are used for serial jobs only.

In summary, there are a total of 4584 cores in the cluster, with 2 GB of memory per core.

The scheduler is configured such that it will never assign nodes with different processors to a single job. Thus, from a job perspective the cluster will always appear homogeneous, i.e., all processes run on the same type of processor.

Interconnect

The compute nodes are connected with Infiniband using a 288-port QLogic switch. The connections between nodes within a chassis are non-blocking, but there is 2-to-1 blocking for connections that span chassis. This means that the maximum bandwidth for communication between nodes in different chassis is only half that for nodes within the same chassis. The exceptions are the blades with X5355 processors (see above), which are connected via Gigabit Ethernet only and are used for serial programs only.

Storage

Bugaboo uses Lustre, a high-performance cluster file system, to provide storage for /home (containing users' home directories) and /global/scratch (space for temporary storage, typically associated with running jobs). The size, per-user quota and backup policy for these file systems are shown in this table:

 

File system       Size     Quota (per user)   Backup policy
/home             115 TB   300 GB             Daily backup
/global/scratch   455 TB   1 TB               No backup

 

To check your usage and how it relates to your quota for /home or /global/scratch, you can use commands of the form:

lfs quota -u your_username /home

lfs quota -u your_username /global/scratch

There is additional, currently unallocated space available (up to a total of 2.4 PB) that can be assigned to either /home or /global/scratch, depending on future usage requirements.

On the 8-core nodes (E5430 processors) there are also two 146 GB SATA 10000 RPM drives per node for local storage. These are in a RAID 0 (striped) configuration for extra performance. About 248 GB is available to users as /scratch on each compute node. The 12-core nodes (X5650 processors) have two 300 GB drives instead, with about 450 GB of disk space available to users in /scratch.

Software

A list of the installed software on Bugaboo is available on the WestGrid software web page. The usual numerical libraries (e.g., BLAS, LAPACK, SCALAPACK, GSL, FFTW) are available, along with several simulation packages, such as the molecular dynamics packages GROMACS, NAMD and LAMMPS, and SIESTA, ABINIT and DIRAC for electronic structure calculations. GCC, Intel and Open64 compilers are available.

There are other software packages installed on Bugaboo that are not listed on the WestGrid software web page, e.g., PETSC, SLATEC, ARPACK, PARPACK, etc. Please write to support@westgrid.ca if there is software that you would like to have installed.

Using Bugaboo

Logging in and using the login server

To log in to Bugaboo, connect to the head node (login server) bugaboo.westgrid.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, see the QuickStart Guide for New Users. The login server is used for tasks such as editing and compiling source code, performing short, small test runs for debugging purposes and submitting batch jobs.

Compiling and running programs

Since the expansion in the summer of 2011, the Bugaboo cluster has nodes with slightly different processors. For that reason, when compiling programs with the Intel compiler it is important NOT to use the -xHost option and NOT to use the -fast option (which implies -xHost). Instead, when the highest level of optimization is desired, the recommended options are -O3 -xSSSE3 -axSSE4.2,SSE4.1. Programs compiled this way will run on all Bugaboo nodes.
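For example, a Fortran program could be compiled with these options as follows (a sketch only; ifort is the Intel Fortran compiler driver and myprog.f90 is a placeholder source file):

ifort -O3 -xSSSE3 -axSSE4.2,SSE4.1 -o myprog myprog.f90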

Bugaboo uses the so-called "unified environment", i.e., you can use the generic compilers cc, c++, f77, f90, mpicc, mpicxx, mpif77, mpif90 to compile programs. By default these generic compilers will use the Intel compilers. However, this default can be changed by loading so-called "modules" - ask support@westgrid.ca for details.

All installed software libraries can be linked at compile time by adding -lname_of_library to the compiler options, e.g., -lblas for linking with the BLAS library and -lfftw3 for linking with the FFTW library, version 3. Some libraries depend on other libraries, e.g., to link with the LAPACK library you also need the BLAS library, and to link with the SCALAPACK library -lscalapack -lblacs -llapack -lblas is required. These dependencies are listed on the WestGrid software web page. Under no circumstances is it necessary to specify a path to library directories (-L/directory options) or to set the LD_LIBRARY_PATH environment variable. In fact, this is strongly discouraged, as it may actually prevent the generation of a working executable.
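For instance, a program that uses SCALAPACK could be compiled and linked in one step as follows (a sketch; myprog.f90 is a placeholder source file):

mpif90 -O3 -o myprog myprog.f90 -lscalapack -lblacs -llapack -lblas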

All programs can be run just by typing the name of the program, i.e., without specifying the full path (directory) of the program - all programs are in your PATH. If you find a program that you cannot run this way, please report this as a bug to support@westgrid.ca. Specifying the full path of a program is strongly discouraged and we do not guarantee that this will continue to work (e.g., the path may change when a new version is installed).

Running parallel programs

MPI programs are run using the mpiexec command, which in general has the form

mpiexec -n # cmd cmdargs

where # is the number of processes to be used, cmd is the name of the MPI program to be run and cmdargs are the arguments of that program. The head node (bugaboo) can be used for short test runs and for debugging purposes. All long-running programs and programs that use a large number of processes must be submitted to the queueing system using qsub - see the Running Jobs web page. Within the "unified environment" there are two supported ways of requesting processors within a PBS submission script:

  1. -l procs=N
  2. -l nodes=n:ppn=m

The first method requests N processors for the job. The scheduler will assign the next available N processors to the job. This method results in the shortest waiting time in the queue and is the recommended method if there is no particular reason to use method 2.

The second method requests exactly m processors on each of n nodes, i.e., a total of N = m*n processors. Because the scheduler has to find n nodes that each have at least m idle processors, this method leads to longer waiting times than method 1. Be aware that -l nodes=n is completely equivalent to -l nodes=n:ppn=1, which in almost all cases is undesirable. DO NOT USE -l nodes=n UNLESS THIS IS ESSENTIAL FOR YOUR JOB. Ask support@westgrid.ca if in doubt about the best strategy.
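As an illustration of method 2 (useful mainly when a job needs complete nodes to itself), a request for two whole 12-core nodes would look like this:

#PBS -l nodes=2:ppn=12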

Here is a sample job submission script for Bugaboo:

#!/bin/bash
# Sample MPI job: 42 processes, 48 hours of walltime and 1600 MB of memory
# per process; email notification (-m bea) is sent when the job begins,
# ends or aborts.
#PBS -r n
#PBS -l walltime=48:00:00
#PBS -l procs=42
#PBS -l pmem=1600m
#PBS -m bea
#PBS -M johann_bach@nowhere.ca

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
echo "prog started at: `date`"
mpiexec prog
echo "prog finished at: `date`"

 

Specifying Memory Requirements

In addition to requesting a specific number of processors with the -l procs=N syntax, each job must also specify how much memory it requires to run. This can be done in two ways: either by specifying the total amount of memory used by the job (summed over all processes) or by specifying the amount of memory per process. The syntax for the latter method (memory per process) is:

#PBS -l pmem=2000mb

which would request 2000 MB per process, e.g., 20000 MB in total for a 10-processor job. Alternatively, the total amount of memory can be specified with:

#PBS -l mem=8000mb

which would request 8000 MB in total for the job, e.g., 800 MB per process for a 10-processor job. It is possible to use the gb unit to specify memory amounts in gigabytes (GB); however, only integer amounts can be specified - a decimal point is not allowed.
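For example, a request for 2 GB per process using the gb unit would look like this (integer amounts only; a value such as 1.5gb would be rejected):

#PBS -l pmem=2gb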

If neither pmem nor mem is specified, the system will assign 256 MB per process (corresponding to a -l pmem=256mb specification). Jobs may be terminated if they use significantly more memory than specified! The system will send an email to the owner of a job when the job uses more memory than assigned. For that reason it is important that a valid email address is specified using the

#PBS -M email@address.ca

syntax in the job submission script.

The current memory usage of a running job can be determined using the qstat -f <jobid> command (substitute the actual job ID for <jobid>); the memory usage is shown in the output of the command under resources_used.vmem. This is the total amount used by the job, i.e., it must be divided by the number of processors to get the correct pmem amount.
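For example, the relevant line can be picked out of the (long) qstat -f output with grep (replace <jobid> with the actual job ID):

qstat -f <jobid> | grep resources_used.vmem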

Interactive and Batch Usage

Two nodes have been set aside for interactive work:

  • debugging of programs
  • testing of programs
  • short runs of programs
  • steered computations
  • and more

Basically, these nodes are intended for all types of work that require interaction with programs and for which waiting in the batch queue is too cumbersome.

In order to run on the interactive nodes, a job script must be submitted with the -I argument to qsub, e.g.,

bach@bugaboo:~> qsub -I jobscript.pbs
qsub: waiting for job 6481270.b0 to start
qsub: job 6481270.b0 ready

bach@b402:~>

The system will read the job requirements (number of processors, walltime, memory, etc.) from the PBS section of the submission script, as it does for batch jobs. However, the system will not process any commands from the body of the submission script; instead, it will wait until the resources become available on the interactive nodes (this can take up to 3 minutes) and then open a normal shell on one of the interactive nodes. At this point any interactive command can be entered and executed, including the execution of parallel programs using "mpiexec myMPIprog", and the system will use as many processors as have been requested in the job submission script.

Currently, interactive jobs are limited to a maximum walltime of 2 hours; the default walltime is 10 minutes. The two interactive nodes are the same as the 12-core nodes of the Bugaboo cluster, with 24 GB of memory, but they actually have 24 "processors" because hyperthreading is enabled.
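A minimal submission script for such an interactive session might look like the following sketch (only the #PBS resource requests matter, since the body of the script is not executed for interactive jobs; adjust the values to your needs):

#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l procs=4
#PBS -l pmem=2000mb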

All other programs must be submitted to the queuing system using the qsub command (without the -I argument) and a proper submission script.

Batch job policies

The following gives a brief overview of the rules for batch jobs on the Bugaboo facility:

  • The default walltime is 72 hours;
  • The maximum walltime is 122 days (4 months);
  • A user's running and submitted jobs may together request at most 120000 processor-hours (5000 processor-days), where the processor-hours of a job are the product of its requested number of processors and its requested walltime; all additional jobs that exceed this limit will be queued under blocked jobs (see the example below this list).
  • The maximum number of jobs that a user may have queued to run is 500. In particular, this means that the maximum size of an array job is 500 as well.
  • A maximum of 10 jobs per user is considered for scheduling at any time. These jobs are listed under eligible jobs in the output of the showq command. There is no limit on the number of running jobs per user other than the processor-hour limit mentioned above. All other jobs are listed under blocked jobs; these move to the eligible queue when the number of eligible jobs drops below 10.
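
As an illustration of the processor-hour limit: a job that requests 42 processors with a walltime of 48 hours counts as 42 × 48 = 2016 processor-hours, so roughly 59 such jobs could be running or eligible at the same time before further jobs are blocked.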

File transfers

In addition to the login server, there is another server (the file server), bugaboo-fs.westgrid.ca, which should be used for data transfers to or from the Bugaboo cluster. Bugaboo-fs has a better network connection to the internet (a 10 GigE interface), and using it also takes load off the login server. The gcp command can be used to efficiently transfer files between Bugaboo and other WestGrid sites.
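For example, a results file could be copied from Bugaboo to your own computer via the file server using scp, run on your own computer (a sketch; the path and file name are placeholders):

scp your_username@bugaboo-fs.westgrid.ca:/global/scratch/your_username/results.dat .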

 


Updated 2011-04-13.