R

Introduction

R is a language and environment for statistical computing and graphics. It is a GNU project which provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. 

See the main WestGrid software table for the version currently installed on WestGrid systems.

Some brief notes are given below about submitting R jobs on WestGrid systems.

R on Glacier

Selecting the R version

There are several versions of R installed on Glacier. As of this writing (Dec. 14, 2010), if you just type "R", you get version 2.7.1, because the installation directory /global/software/R-2.7.1/gcc/bin/R is on the default PATH. However, version 2.12.0 is also available; you can access it by changing your PATH or by typing the full path to the R binary.

There are actually two variations on version 2.12.0, one compiled with GCC 3 compilers and one with GCC 4.  These are /global/software/R-2.12.0/bin/R and /global/software/R-2.12.0/gcc4/bin/R, respectively. So, for example, to access the GCC 4 version, you can type:

/global/software/R-2.12.0/gcc4/bin/R
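Alternatively, to make this build the default "R" for your session, you can prepend its directory to your PATH (the path is the one given above):

```shell
# Put the GCC 4 build of R 2.12.0 first on the search path,
# so that typing "R" starts this version for the current session
export PATH=/global/software/R-2.12.0/gcc4/bin:$PATH
```

Adding this line to your shell start-up file (e.g. ~/.bashrc) makes the change persistent across logins.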

Using Rmpi for Parallel Jobs

For parallel R with the Rmpi package on Glacier, it appears that one must use LAM MPI. As such, lamrun, not mpiexec, is used to start the R program. Here is an example job script showing the use of LAM MPI to run a parallel R job. The file r.in contains the R commands to run. The Rmpi package itself is discussed in the Submitting R Jobs section below.

#!/bin/bash
#PBS -S /bin/bash

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"

echo "Node file: $PBS_NODEFILE :"
echo "---------------------"
cat $PBS_NODEFILE
echo "---------------------"
NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."

R=/global/software/R-2.12.0/gcc4/bin/R
echo $R

export LAMRSH="/usr/bin/ssh"

lamboot -v $PBS_NODEFILE

# With slaves being created within R, only the master process should be started.
/usr/bin/lamrun -np 1 $R --slave CMD BATCH r.in

lamhalt

 

Submitting R Jobs

This section was written with Bugaboo in mind, but most of the material applies to other installations of R. Like other jobs on WestGrid systems, R jobs are run by submitting an appropriate script for batch scheduling using the qsub command. See the documentation on running batch jobs for more information.

Running a Serial R Job

Any submission script for a serial program can be used with R.  For example:

#!/bin/bash
#PBS -r n
#PBS -m bea
#PBS -M jsbach@nowhere.ca
#PBS -l walltime=24:00:00
#PBS -l procs=1
 
cd $PBS_O_WORKDIR
R --vanilla < myRscript.R

will do for a small-memory job on most WestGrid systems.
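For completeness, myRscript.R above stands for any ordinary R script; a minimal, hypothetical example that would run under this submission script:

```r
# myRscript.R -- a minimal serial example (hypothetical contents)
set.seed(42)                        # fixed seed so reruns give the same result
x <- rnorm(1000)                    # draw 1000 standard-normal samples
cat("sample mean:", mean(x), "\n")  # written to the job's output file
```

With R --vanilla < myRscript.R as in the script above, everything the script prints ends up in the job's standard-output file.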

Using Rmpi for Parallel Jobs

Things get more complicated if you want to use the Rmpi package. As with any other serial program, you cannot simply write
#PBS -l procs=42
mpiexec serial_prog
and expect the program to run in parallel on 42 processors. Instead, you need to change serial_prog into a parallel program that calls MPI functions to spread the load across the processors. R is no different in this respect.
 
Thus, in order to run R in parallel you need to do at least two things:
  1. Load the Rmpi package;
  2. Use functions from the Rmpi package to distribute the workload between processors.
Rmpi contains functions that implement a master-slave algorithm: 
  • Initially only the master process is started
  • Then the master process spawns the slaves handing out different tasks to each of the slaves
  • At the end the master closes down the slaves
  • And then finishes itself
Here is a submission script that uses this scheme:
#!/bin/bash
#PBS -N Rmpi-hello
#PBS -l walltime=10:00
#PBS -l procs=10
cd $PBS_O_WORKDIR
mpiexec -n 1 R --vanilla < Rmpi-hello.R
Please note that on Glacier you should replace mpiexec with /global/software/bin/mpiexec in the above script.
This script requests 10 processors, but the mpiexec command starts only a single process (-n 1): the master process. The Rmpi-hello.R script takes care of spawning the slaves. Here is the Rmpi-hello.R example script: 
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
   library("Rmpi")
}
library(rsprng)  # interface to SPRNG parallel random-number streams (loaded but not used in this example)

# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
    if (is.loaded("mpi_initialize")){
        if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}
# Spawn as many slaves as possible
mpi.spawn.Rslaves()
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()

All the mpi.xxxx functions are defined in the Rmpi package.  The actual "work" of this parallel R program is the single line:

mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
which prints lines like "I am 7 of 11" to the output file. Thus, your own program must have the following structure:

# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
   library("Rmpi")
}
# load all packages that are needed
library(xyz)

# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
    if (is.loaded("mpi_initialize")){
        if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}

# Spawn as many slaves as possible
mpi.spawn.Rslaves()


<insert own R MPI program code here>


# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()  

You have to provide everything that replaces <insert own R MPI program code here> in the above.

A few remarks:
 
In the output you see lines like
I am 4 of 11
That is, it says "of 11", not "of 10", even though only 10 processors were requested in the submission script. The reason is that mpi.spawn.Rslaves() spawns as many slaves as there were processors requested by the submission script; since the master process was already started, you end up with 11 = 10 + 1 processes. This is fine as long as the master process does nothing more than spawn the slaves and all the work is done by the slaves. However, if your program is such that the master process itself does a significant amount of computing, you must start only N-1 slaves. In that case use
Nprocs=Sys.getenv("PBS_NP")
mpi.spawn.Rslaves(nslaves=type.convert(Nprocs)-1)

Please note that, at the time of this writing, some systems (including Checkers, Glacier and Lattice) do not support the PBS_NP environment variable, so the number of processors has to be derived from other PBS variables, such as PBS_NODEFILE.
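On such systems, the processor count can be obtained inside R by counting the lines of PBS_NODEFILE (one line per requested processor). A sketch that falls back from PBS_NP to PBS_NODEFILE:

```r
# Number of processors requested by the job:
# use PBS_NP when the batch system provides it,
# otherwise count the lines of PBS_NODEFILE (one line per processor)
np <- Sys.getenv("PBS_NP")
if (np != "") {
  nprocs <- as.integer(np)
} else {
  nprocs <- length(readLines(Sys.getenv("PBS_NODEFILE")))
}
# then reserve one slot for the master:
# mpi.spawn.Rslaves(nslaves = nprocs - 1)
```

The mpi.spawn.Rslaves() call is commented out here because it only works inside a running MPI job.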

It is strongly suggested that you experiment before running a full-blown Rmpi calculation, which is difficult to debug: 

  1. Run your serial R program first before even attempting to use Rmpi.
    Debug the serial R program until it runs without problems.
  2. Come up with a scheme of how to distribute work between slaves.
  3. Implement that scheme using Rmpi.
  4. Test the Rmpi program using only a small number of processors.
  5. Try using as many processors as appear to be reasonable.

Updated 2012-09-30.