Programming on the WestGrid Checkers System

Introduction

Documentation

This page deals with compilation, debugging and optimization of serial and parallel programs on the WestGrid Checkers system. Especially if you are new to programming in a UNIX/Linux HPC environment, please start at the main WestGrid programming page for a more general introduction. On that page you will also find links to details about programming on other WestGrid machines.

More advanced programmers may want to refer to vendor supplied documentation:

  • For Intel compiler, debugger and mathematical library documentation, start at the Intel Software Development page and follow the link according to the language of interest. Choose the Linux version when there is a choice. Once on the language-specific compiler page, scroll down to the Product Documentation section for Getting Started and User's Guides.
  • For GCC (GNU Compiler Collection) documentation see gcc.gnu.org/onlinedocs/.

For compiler options not presented here, details are available through the UNIX man command: man ifort, man icc, man gcc, etc.

Hardware Considerations

Checkers is an SGI Altix XE320-based cluster with 160 8-core nodes (1280 cores total) connected with a high-bandwidth, low-latency InfiniBand network.  This makes the system suitable for distributed memory parallel jobs, typically programmed with MPI.

Hybrid OpenMP/MPI programs may also run effectively, but OpenMP parallelization is limited to the eight cores within a node. Breezy is probably more suitable for pure OpenMP programs, as it has 24 cores and 256 GB of memory per node.

Each 8-core node has 16 GB of memory, which limits the size of jobs that can be run on Checkers.

More details about the Checkers hardware are available in the Checkers QuickStart Guide.

Compiler Recommendation

See the programming table on the WestGrid software page for a comparison of the compilers available on the various WestGrid computers. The table also lists the specific version numbers of the compilers on Checkers.

Both Intel and GCC compilers are available on the Checkers cluster. Our expectation is that the Intel compilers will produce faster code, but feedback to support@westgrid.ca would be appreciated if you experiment with both compilers.

Compiling Serial Code

Introduction

In the compilation discussion that follows, two examples are shown for each language. One illustrates compiler flags to use when developing new code or debugging. The second shows optimization options that could be tried for production code. It is advisable to test that the non-optimized and production versions give similar numerical results: sensitivity of the answers to the changes introduced by the optimization flags may indicate a problem with the stability of the algorithm you are using.
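As a quick illustration of why optimized builds can change numerical answers: floating-point addition is not associative, and an aggressive optimizer may reorder operations. The sketch below (using awk simply to evaluate the expressions in double precision) shows two mathematically identical sums producing different results:

```shell
# Floating-point addition is not associative: reordering a sum,
# as an optimizer may do, can change the last digits of the result.
awk 'BEGIN { a = 1.0; b = 1e-16; c = -1.0;
             printf "(a+b)+c = %g\n", (a + b) + c;
             printf "a+(b+c) = %g\n", a + (b + c) }'
```

Differences of this rounding magnitude between -O0 and -O3 builds are normal; much larger discrepancies suggest an unstable algorithm rather than a compiler problem.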

Note, the examples shown here are for the Intel compiler.

Fortran

Although g77 and gfortran are available, better results are generally expected with the Intel Fortran compiler, which is invoked as ifort.

By default, the Intel compiler will interpret your source code as fixed-form or free-form according to the file suffix. Source code files ending in .f, .for or .ftn are treated as the older fixed-form Fortran style, whereas files with names ending in .f90 are treated as free-form. Source code ending in .F, .FOR, .FTN or .FPP (all fixed-form) or .F90 (free-form) is also accepted, but will be preprocessed by fpp before compilation.

Example with debugging options (-CB for array bounds checking):

ifort -g -fpe0 -O0 -CB diffuse.f writeppm.f -o diffuse

Note that O0 in the above is the letter "oh" followed by the number "zero".

Examples with optimization options:

ifort -fast diffuse.f writeppm.f -o diffuse
ifort -O3 -axW diffuse.f writeppm.f -o diffuse

Caution regarding use of -fast in makefiles: The -fast option in the above example is equivalent to -O3, -ipo and -static. The -ipo option calls for interprocedural optimization. This leads to an error if -fast is used to link routines that have been compiled individually with the -c flag (as is often done in makefiles). This problem can be avoided by compiling two or more routines together, or by using -O3 instead of -fast in your makefile, as shown in the second example above. The -axW option in that example will turn on vectorization.

C

The C compilers available on Checkers are those from Intel (icc) and the GNU Compiler Collection (cc, gcc).  Faster code is expected from icc.

Example with a debugging option:

icc -g pi.c -o pi

Example with an optimization option:

icc -O3 pi.c -o pi

C++

The C++ compilers available on Checkers are those from Intel (icc, icpc) and the GNU Compiler Collection (g++). Code generated by the Intel compiler is expected to be faster than that from g++, but you might like to try both.

The Intel compiler accepts C++ source code files ending in .C, .cc, .cp, .cpp, .cxx and .c++. Files with a .c suffix will be treated as C source code.

Example with debugging options:

icpc -g pi.cxx -lm -o pi

Example with an optimization option:

icpc -O3 pi.cxx -lm -o pi

Running Serial Code

Interactive Runs

The Checkers login node may be used for short interactive runs during program development and porting. For longer runs, the regular production batch queue should be used, as described in the section on batch jobs below.

To run a compiled program interactively through an ssh window on the login node, just type its name with any required arguments at the UNIX shell prompt. File redirection commands can be added if desired. For example, to run a program named diffuse, with input taken from diffuse.in and output (that normally goes to the screen) sent to a file diffuse.out, type:

diffuse < diffuse.in > diffuse.out

Batch Runs

Production runs should be submitted as a batch job script to a TORQUE queue with the qsub command as described on the Running Jobs pages.

For serial jobs, an example job script is shown below. Replace the program name, diffuse, with the name of your executable.

#!/bin/bash
#PBS -S /bin/bash

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

echo "Starting run at: `date`"
./diffuse

It is recommended that you record the performance characteristics of your code for a series of test runs so that you can estimate the run time (walltime) of a long job more accurately. Similarly, you will need to know how your program's memory requirements scale as you increase the problem size. This kind of information is used during the batch job submission to ensure that your program is run on a node with appropriate hardware and runtime limits.
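A minimal way to record walltime for such test runs is to bracket the run with date calls. In this sketch, sleep 2 is a placeholder for a real program invocation such as ./diffuse < diffuse.in > diffuse.out:

```shell
# Record the elapsed walltime of a test run; "sleep 2" is a
# placeholder for the real program invocation.
start=$(date +%s)
sleep 2
end=$(date +%s)
echo "Elapsed walltime: $((end - start)) s"
```

For memory scaling, on Linux systems where GNU time is installed, /usr/bin/time -v ./diffuse reports the maximum resident set size of the run.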

Parallel Programming

Introduction

The Checkers environment can be used for interactive development of parallel programs by running them directly on the login server. However, testing should be limited to one hour using a maximum of two CPUs.

Basic commands for compiling and running MPI or OpenMP-based parallel programs are given in the following sections.

Message Passing Interface (MPI)

Compiling

To use the Intel compilers for Fortran parallel MPI code, use the wrapper script mpiifort. For C and C++, use mpiicc and mpiicpc, respectively.
Note that the commands mpif77, mpif90, mpicc and mpicxx will invoke the GNU compilers.

Add debugging or optimization options, as appropriate, similar to what was shown for serial compilation in the previous section.

To check exactly what commands are executed by these scripts, add a -show argument. For example,

mpif90 -show

To compile an MPI Fortran program, diffuse.f, with the Intel compiler, type:

mpiifort -O3 diffuse.f -o diffuse

Similarly, to compile an MPI C program, pi.c, linking with the standard math library, type:

mpiicc -O3 pi.c -lm -o pi

For a C++ program, the command line would look like:

mpiicpc -O3 pi.C -lm -o pi

Running

If your program allows, compare the results with a single processor to those from a two-processor run. Gradually increase the number of processors to see how performance scales. After you have learned the characteristics of your code, please do not run with more processors than can be efficiently used, as the system is typically very busy.

Long tests or production jobs should be submitted to a TORQUE queue with the qsub command as described on the Running Jobs pages.  Options for specifying the number and distribution of processors, memory and run time are mentioned there.

Here is an example of a script to run an MPI program, pn, using 2 processors. If the script file is named pn.pbs, submit the job with qsub pn.pbs. 

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=2

# Script for running a parallel MPI job, pn, on Checkers
# 2010-01-12 DSP

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."

echo "Starting run at: `date`"
mpiexec ./pn

In the above script, the form "./pn" is used to ensure that the program can be run even if "." (the current directory) is not in your command PATH.
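The awk line in the script simply counts the lines of the node file that TORQUE provides, one line per allocated processor. The same counting can be seen outside the batch system with a mock two-processor node file (the hostname cl001 is hypothetical):

```shell
# Mock node file: TORQUE writes one line per allocated processor,
# repeating a hostname once per processor on that node.
printf "cl001\ncl001\n" > mock_nodefile

# Same counting as in the job script (NR is awk's line counter).
NUM_PROCS=$(awk 'END {print NR}' mock_nodefile)
echo "Running on $NUM_PROCS processors."
```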

Source code for the pn sample program itself, pn.f, is available here.

Please note that if you are running MPI programs interactively, you will need to run mpdboot before running mpiexec.  You also need to specify the number of processes with the mpiexec "-n 2" option.  Finally, you should run mpdallexit at the end of your session to terminate the mpd daemon that was started with mpdboot.

mpdboot
mpiexec -n 2 ./pn
mpdallexit

Another alternative for interactive work is to use an "interactive" batch job, initiated with "qsub -I -l procs=2", for example.

OpenMP

Compiling

To compile a program containing OpenMP directives with the Intel compilers, add the -openmp flag to the compilation. Here are some examples:

ifort -openmp -fast diffuse.f writeppm.f -o diffuse
icc -openmp -O3 pi.c -lm -o pi
icpc -openmp -O3 pi.cxx -lm -o pi

Running

Long tests or production jobs should be submitted to a TORQUE queue with the qsub command as described on the Running Jobs pages.  Options for specifying the number of processors, memory and run time are mentioned there.

For OpenMP batch jobs submitted with qsub, the environment variable OMP_NUM_THREADS should be set to the number of processors that TORQUE assigns to your job. This is shown in the following script:

#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=2

# Script for running an OpenMP sample program, pi, on two processors on Checkers
# 2010-01-13 DSP

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."

# Note: The OMP_NUM_THREADS should match the number of processors requested.
export OMP_NUM_THREADS=$NUM_PROCS

echo "Starting run at: `date`"
./pi

Debugging

Introduction

The Intel idb graphical debugger is available on Checkers. The gdb debugger is also available for use from character-based terminals.

Regardless of the debugger being used, add a -g flag when compiling your code as a minimum prerequisite for debugging.

See the general comments on debugging on the main WestGrid programming page.

The following shows an example of debugging an MPI program using gdb.

First compile the program:

mpif77 -o hello hello.f -g

Start mpdboot to allow interactive use of mpiexec:

mpdboot

Add the -gdb flag to the mpiexec command line to enter a debugging session. Use gdb commands at the gdb prompt. Type quit to stop the debugging session. Note that the output is prefixed by the MPI rank (0, 1 or both):

mpiexec -gdb -n 2 ./hello
0-1: (gdb) break 9
0-1: Breakpoint 2 at 0x401026: file hello.f, line 9.
0-1: (gdb) run
1: Continuing.
0: Continuing.
0-1:
0-1: Breakpoint 2, MAIN__ () at hello.f:9
0-1: 9 PRINT *, "Hello world from ",rank,hostname
0-1: Current language: auto; currently fortran
0-1: (gdb)
0-1: (gdb) print rank
0: $1 = 0
1: $1 = 1
0-1: (gdb) quit
rank 0 in job 1 checkers.westgrid.ca_52503 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

Shut down the mpd daemon using mpdallexit:

mpdallexit

Please write to support@westgrid.ca for help with debugging.

Linking with Installed Libraries

Introduction

See the Mathematical Libraries and Applications section of the WestGrid Software page for a description of some of the optimized linear algebra and Fourier transform libraries that can be linked with your code.

Improving Performance

Introduction

We encourage you to have your code reviewed by a WestGrid analyst. Please write to support@westgrid.ca .

Basic optimization techniques, some of which are applicable to the environment on Checkers, are outlined in these course notes.

Here is an example of profiling an MPI program to look for communication bottlenecks.

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=4
cd $PBS_O_WORKDIR
unset TMPDIR
export -n TMPDIR
export PATH=/global/scratch/software/intel/impi/3.2.1.009/bin64:$PATH
mpirun -r ssh -trace -n 4 ./sample1

This will produce a trace file called sample1.stf, which can be viewed (in a graphical environment) with the command:

traceanalyzer sample1.stf


Updated 2011-11-09.