
# Using the consortium license to run MATLAB Distributed Computing Server workers

# Introduction

Starting in 2009, WestGrid has made an annual purchase of a 64-worker "consortium" license for the MATLAB Distributed Computing Server. In March 2013, the number of worker licenses was increased to 160. The consortium license allows researchers from Canadian academic institutions who have licensed the Parallel Computing Toolbox (or have access to it through a local server) to submit jobs to a WestGrid cluster, even if it is not located at their home institution. Orcinus is the only WestGrid cluster on which the Distributed Computing Server workers run under the consortium license. A separate SFU site license allows for Distributed Computing Server jobs to be run on Bugaboo by researchers from that university.

The MathWorks web site has detailed instructions for running MATLAB in parallel using a combination of the Parallel Computing Toolbox and the Distributed Computing Server. If you own a Parallel Computing Toolbox license and would like to get started using the MATLAB Distributed Computing Server environment on Orcinus, please contact WestGrid technical support to discuss such issues as whether you can submit remote jobs from your own machine or from a supported MATLAB installation at your institution. Researchers from several WestGrid institutions (including U of A, U of C, UBC and SFU) can avoid purchasing Parallel Computing Toolbox licenses, as these are available on local servers. Other institutions may also provide this support.

If you are not using an institution-supported version of MATLAB that has already been set up to submit remote jobs through the Distributed Computing Server and want to submit jobs from your own computer instead, the examples shown below give an idea of the steps required. However, since this involves modifying files within the MATLAB distribution on your own computer, it can be difficult for WestGrid to troubleshoot problems should they arise.

Note that the specific details of the scripts used to submit jobs vary depending on the version of MATLAB being used. We encourage you to use the most recent version whenever possible. In particular, versions R2012a and later handle the remote connection more gracefully, so the public key authentication setup described below is needed only for older versions.

# Using public key authentication to connect to Orcinus (pre-R2012a)

Note: if you are using R2012a or later, you may skip this section.

The MATLAB Parallel Computing Toolbox uses the SSH (secure shell) network protocol to log in to Orcinus to execute commands (such as qsub for submitting batch jobs). Similarly, the SCP (secure copy) protocol is used for transferring files back and forth between Orcinus and the system on which the Parallel Computing Toolbox is running. It would be very cumbersome to be prompted for a password every time MATLAB needed to execute a remote command or transfer a file. This can be avoided by using public key authentication. With this method, you have to enter your WestGrid password only once during a session in which you are submitting MATLAB jobs.

Before attempting to use the Parallel Computing Toolbox for remote MATLAB job submission on Orcinus, you should set up public key authentication and verify that you can use it to connect to Orcinus with an SSH client and a secure copy (SCP) file transfer client. The details of how you do that depend on what type of system (Linux, Microsoft Windows, Mac OS X, ...) you are using.

In brief, for Linux and Macintosh systems, you generate keys with ssh-keygen (making sure that you use a pass phrase) and transfer the public key to the .ssh/authorized_keys file on Orcinus. Then, whenever you want to submit MATLAB jobs, you run commands like ssh-agent /bin/bash and ssh-add key_file (which should prompt you for the pass phrase associated with your ssh key) before starting MATLAB. On Microsoft Windows systems the idea is similar, in that you need to generate ssh keys and install the public key in .ssh/authorized_keys on Orcinus and then request that ssh/scp connections use the installed keys. However, the key generation and management software does not come pre-installed. Typically, PuTTY is used, as described here.

# Distributed Computing Server - basic concepts

As mentioned above, the MATLAB Parallel Computing Toolbox is used to control job submission when using the Distributed Computing Server installation of MATLAB on a WestGrid cluster, such as Orcinus. Details of these MathWorks products are available on their web site, including a user's guide for the Parallel Computing Toolbox. There is also an administrator's guide for the Distributed Computing Server, but most end users will not need to look at that.

The figure below illustrates the relationships among the various software and hardware components involved in using the Parallel Computing Toolbox on your computer to submit a batch job on an Orcinus login node, which, in turn, runs workers under the Distributed Computing Server license on the Orcinus compute nodes.

Some of the details of the interaction among the various components shown above are either largely hidden from view or require only a one-time setup. For example, the SSH interactions in the diagram are taken care of by the public-key authentication setup described in the previous section, or internally by MATLAB for versions R2012a and later. However, a basic understanding of what is going on "under the hood" is helpful, so it is discussed below. For additional details, see the MathWorks web site, particularly the section on Using the Generic Scheduler Interface in the Parallel Computing Toolbox User's Guide.

The **Generic Scheduler Interface** is a way of describing to MATLAB where you want to run your job (on a remote cluster in most cases) and sending the necessary commands to the batch job scheduler on that remote system. This is done by defining something called a **scheduler object** in your MATLAB session. The scheduler object is a structure with several components, most of which will be the same from job to job. Specific examples of how to set up the scheduler object for submitting jobs to a WestGrid cluster will be given later in these notes.

One of the most important components of the scheduler object is a reference to the MATLAB code, called a **submit function**, that is used to construct the batch job script (or a series of scripts if you are submitting a number of tasks at the same time), send that script to the remote cluster, construct the TORQUE qsub command line that is used to submit jobs to the batch job system on the cluster for execution and then actually run the qsub command. The submit function can also be used to copy any data files that are needed for the job from your local machine to the remote cluster, although if you have a large data set that is referenced by several different jobs, you could manually copy that to the cluster ahead of time.
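
The command line the submit function constructs follows the same pattern as the SubmitArguments string built in the getschedule.m examples later in these notes. As a rough illustration (a Python sketch with hypothetical names, not the actual MathWorks code), the assembly amounts to string formatting:

```python
# Illustrative sketch only (hypothetical names, not the MathWorks code) of
# the kind of qsub command line a submit function assembles from a job's
# resource requests.
def build_qsub_command(script, nprocs, mem, walltime, email):
    # mirrors the SubmitArguments string used in the getschedule.m examples
    resources = "procs={0},mem={1},walltime={2},software=MDCE:{0}".format(nprocs, mem, walltime)
    return "qsub -l {0} -m bea -M {1} {2}".format(resources, email, script)

print(build_qsub_command("Job1.sh", 4, "512mb", "00:05:00", "me@example.ca"))
# qsub -l procs=4,mem=512mb,walltime=00:05:00,software=MDCE:4 -m bea -M me@example.ca Job1.sh
```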

When the batch job script that is created by the submit function is actually executed on the compute nodes of the cluster, a Distributed Computing Toolbox worker will be started up. Some environment variables that are defined in the batch job are used by the MATLAB worker to locate such things as the directory where data needed for the job is located.
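
The mechanism can be pictured as the batch script exporting variables that the worker reads back at startup. A minimal Python sketch (the variable name below is a made-up stand-in, not necessarily what the Distributed Computing Server actually uses):

```python
import os

# The generated batch script would export something like this before
# launching the worker ("JOB_DATA_LOCATION" is a hypothetical name):
os.environ["JOB_DATA_LOCATION"] = "/global/scratch/myWestgridID/Job1"

# At startup the worker reads the variable back to find its job data:
data_dir = os.environ["JOB_DATA_LOCATION"]
print(data_dir)
```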

You may use one of two different submit functions, depending on whether your MATLAB calculations are essentially a number of independent (serial) tasks or whether you have a parallel calculation in which different workers need to communicate and exchange data as the calculation proceeds. You may not need to know the details of the submit function code if you are using an institutional server to submit your jobs. However, if you are submitting the jobs directly from your own computer, you will have to edit a few lines of sample submit functions that MathWorks provides and install the submit functions where they can be found by your MATLAB session. Until these notes are more self-contained, please contact support@westgrid.ca for more specific advice on editing and installing the submit functions.

# Examples for MATLAB versions before R2011a

Some examples are given in the tutorial notes available here.

These examples rely on scripts that should be installed in the toolbox/local directory (matlab-toolboxwin.zip, matlab-toolboxunix.tar.gz).

Make sure that the directory referred to by remoteDataLocation exists.

Once your public key authentication is set up, make a script based on the following example:

### getschedule.m

```matlab
function [ sched ] = getschedule()
% Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:05:00';
Memory='512mb';

SubmitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime, ...
    ',software=MDCE:',Nprocs,' -m bea -M ',Email);

VER=version('-release');
switch VER
    case '2009a'
        remoteMatlabRoot='/global/software/matlab-2009a';
    case '2009b'
        remoteMatlabRoot='/global/software/matlab-2009b';
    case '2010a'
        remoteMatlabRoot='/global/software/matlab-2010a';
    case '2010b'
        remoteMatlabRoot='/global/software/matlab-2010b';
    otherwise
        fprintf(' Matlab version %s is not supported\n',VER);
        return;
end

clusterHost=strcat(WestgridID,'@orcinus.westgrid.ca');
remoteDataLocation=strcat('/global/scratch/',WestgridID);

sched = findResource('scheduler','type','generic');
set(sched,'ClusterSize',1);
set(sched,'ClusterOsType','unix');
set(sched,'HasSharedFilesystem',0);
set(sched,'ClusterMatlabRoot',remoteMatlabRoot);
set(sched,'GetJobStateFcn',@pbsGetJobState);
set(sched,'DestroyJobFcn',@pbsDestroyJob);
set(sched,'SubmitFcn',{@pbsNonSharedSimpleSubmitFcn,clusterHost, ...
    remoteDataLocation,SubmitArguments});
set(sched,'ParallelSubmitFcn',{@pbsNonSharedParallelSubmitFcn,clusterHost, ...
    remoteDataLocation,SubmitArguments});
```

### testserial.m

```matlab
function testserial()
sched=getschedule
j=createJob(sched)
createTask(j,@rand,1,{3,3});
submit(j)
```

To run,

```
ssh-agent bash
ssh-add
(give passphrase)
matlab
testserial
quit
```

You will see two numbers corresponding to the job. The first is the job ID assigned by MATLAB, and the second is the identifier assigned by the batch system on Orcinus, which has the form xxxxx.orca1.ibb.

Once the job has finished, you should get an email. Then to retrieve the results you can use these steps:

```
ssh-agent bash
ssh-add
(give passphrase)
matlab
```

Then, within MATLAB:

```matlab
sched=getschedule
% using as an example matlab job ID=1
j=findJob(sched,'ID',1)
results=getAllOutputArguments(j)
results{:}
```

Here is an example of a job using matlabpool and parfor.

In getschedule.m, change

Nprocs='1'

to

Nprocs='4';

### testparfor2.m

```matlab
function a = testparfor2(N)
a = zeros(N,1);
parfor(i=1:N)
    t=getCurrentTask();
    a(i) = t.ID;
end
```

### testparfor.m

```matlab
function testparfor()
sched=getschedule
j = createMatlabPoolJob(sched);
j.FileDependencies={'testparfor2.m'};
set(j,'MaximumNumberofWorkers',4);
set(j,'MinimumNumberofWorkers',4);
t = createTask(j,@testparfor2,1,{3})
alltasks = get(j, 'Tasks');
set(alltasks, 'CaptureCommandWindowOutput', true)
submit(j)
```

The procs value in SubmitArguments should correspond to MaximumNumberofWorkers and MinimumNumberofWorkers. The argument for testparfor2 is set to 3 because the actual number of workers is one less than the number of processors requested: one processor is used by the master process.
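
The bookkeeping can be summarized in one line (a Python sketch of the arithmetic, not MATLAB code):

```python
# Of the processors requested in SubmitArguments, one runs the pool's
# master (client) process; the rest are available as parfor workers.
def parfor_workers(nprocs_requested):
    return nprocs_requested - 1

print(parfor_workers(4))  # 3, which is why testparfor2 is called with {3}
```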

To run,

```
ssh-agent bash
ssh-add
(give passphrase)
matlab
testparfor
quit
```

A parallel job will be slightly different as shown:

### testparallel.m

```matlab
function testparallel()
sched=getschedule
% create the matlab job
pjob=createParallelJob(sched);
set(pjob, 'MaximumNumberOfWorkers', 4)
set(pjob, 'MinimumNumberOfWorkers', 4)
% create parallel task using colsum.m
set(pjob, 'FileDependencies', {'colsum.m'})
t=createTask(pjob, @colsum, 1, {})
% submit PBS job
submit(pjob)
```

If your job requires input data files and/or produces output files, it is easiest to transfer them to and from Orcinus using scp. Note that MATLAB jobs on Orcinus start in your home directory, so if you transferred your input data to the /global/scratch/myWestgridID/DATA directory, you should include

```matlab
cd /global/scratch/myWestgridID/DATA
```

in the main function that you are running in your job.

# Examples for MATLAB version R2011a

In this version, MathWorks changed how jobs are created and the scripts for sending jobs to the remote server.

The new scripts that need to be installed in toolbox/local can be obtained from matlab2011-toolbox.zip.

The corresponding example scripts for running jobs are

### getschedule.m

```matlab
function [ sched ] = getschedule()
% Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:01:00';
Memory='512mb';

submitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime,',software=MDCE:',Nprocs,' -m bea -M ',Email);

VER=version('-release');
switch VER
    case '2011a'
        remoteMatlabRoot='/global/software/matlab-2011a';
    otherwise
        fprintf(' Matlab version %s is not supported\n',VER);
        return;
end

clusterHost='orcinus.westgrid.ca';
remoteDataLocation=strcat('/global/scratch/',WestgridID);

sched = findResource('scheduler','type','generic');
set(sched,'ClusterSize',str2num(Nprocs));
set(sched,'ClusterOsType','unix');
set(sched,'HasSharedFilesystem',0);
set(sched,'ClusterMatlabRoot',remoteMatlabRoot);
set(sched,'GetJobStateFcn',@getJobStateFcn);
set(sched,'DestroyJobFcn',@destroyJobFcn);
set(sched,'SubmitFcn',{@distributedSubmitFcn,clusterHost,remoteDataLocation,submitArguments});
set(sched,'ParallelSubmitFcn',{@parallelSubmitFcn,clusterHost,remoteDataLocation,submitArguments});
```

### testserial.m

```matlab
function testserial()
sched=getschedule
j=createJob(sched)
createTask(j,@rand,1,{3,3});
submit(j)
wait(j)
results = getAllOutputArguments(j);
results{:}
```

Public key authentication is no longer used. To run testserial,

```
matlab
testserial
Enter the username for orcinus.westgrid.ca:
(enter Westgrid ID)
Use an identity file to login to orcinus.westgrid.ca? (y or n)
n
Please enter the password for user fujinaga on orcinus.westgrid.ca:
(enter password)
```

# Examples for MATLAB version R2012a

This version introduces further script changes.

The new scripts that need to be installed in toolbox/local can be obtained from matlab2012a-toolboxunix.tar.gz.

The corresponding example scripts for running jobs are

### getcluster.m

```matlab
function [ cluster ] = getcluster()
% Change the following five lines
WestgridID='myWestgridID';
Email='myEmail@email_address';
Nprocs='1';
Wtime='00:01:00';
Memory='512mb';

submitArguments=strcat(' -l procs=',Nprocs,',mem=',Memory,',walltime=',Wtime,',software=MDCE:',Nprocs,' -m bea -M ',Email);

VER=version('-release');
switch VER
    case '2012a'
        remoteMatlabRoot='/global/software/matlab-2012a';
    otherwise
        fprintf(' Matlab version %s is not supported\n',VER);
        return;
end

clusterHost='orcinus.westgrid.ca';
remoteJobStorageLocation = strcat('/global/scratch/',WestgridID);

cluster = parallel.cluster.Generic();
set(cluster, 'HasSharedFilesystem', false);
set(cluster, 'ClusterMatlabRoot', remoteMatlabRoot);
set(cluster, 'OperatingSystem', 'unix');
% The IndependentSubmitFcn must be a MATLAB cell array that includes the three additional inputs
set(cluster, 'IndependentSubmitFcn', {@independentSubmitFcn, clusterHost, remoteJobStorageLocation, submitArguments});
% If you want to run communicating jobs (including matlabpool), you must specify a CommunicatingSubmitFcn
set(cluster, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn, clusterHost, remoteJobStorageLocation, submitArguments});
set(cluster, 'GetJobStateFcn', @getJobStateFcn);
set(cluster, 'DeleteJobFcn', @deleteJobFcn);
```

### testserial.m

```matlab
function testserial()
cluster=getcluster
j=createJob(cluster)
createTask(j,@rand,1,{3,3});
submit(j)
wait(j)
results = fetchOutputs(j);
results{:}
```

Public key authentication is no longer used. To run testserial,

```
matlab
testserial
Enter the username for orcinus.westgrid.ca:
(enter Westgrid ID)
Use an identity file to login to orcinus.westgrid.ca? (y or n)
n
Please enter the password for user fujinaga on orcinus.westgrid.ca:
(enter password)
```

For a matlabpool parallel job, change Nprocs in getcluster.m to 8.

### testparfor.m

```matlab
function testparfor()
cluster=getcluster
j = createCommunicatingJob(cluster,'Type','pool')
j.AttachedFiles={'testparfor2.m'};
t = createTask(j,@testparfor2,1,{7});
j.NumWorkersRange = [8,8]
submit(j)
wait(j)
results = fetchOutputs(j);
results{:}
```

### testparallel.m

```matlab
function testparallel()
cluster=getcluster
% create the matlab job
pjob = createCommunicatingJob(cluster,'Type','spmd')
pjob.AttachedFiles={'colsum.m'};
% create parallel task using colsum.m
t=createTask(pjob, @colsum, 1, {});
pjob.NumWorkersRange = [8,8]
% submit PBS job
submit(pjob)
```

### colsum.m

```matlab
function total_sum = colsum
if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs));
else
    % Receive broadcast on other labs
    A = labBroadcast(1);
end
% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex));
% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum);
```
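
To clarify what colsum computes across the labs, here is a serial Python simulation of the same logic (an illustrative sketch, not the MATLAB code): lab 1 broadcasts a magic square, each lab sums its own column, and gplus combines the per-lab sums into the total of all entries.

```python
def colsum_simulation(A):
    # Each "lab" k sums column k of the broadcast matrix (labindex in MATLAB);
    # gplus then reduces the per-lab column sums to a single global total.
    numlabs = len(A)
    column_sums = [sum(row[k] for row in A) for k in range(numlabs)]
    return sum(column_sums)  # equal to the sum of every entry of A

# A 3x3 magic square: every row and column sums to 15, all entries sum to 45.
A = [[2, 7, 6],
     [9, 5, 1],
     [4, 3, 8]]
print(colsum_simulation(A))  # 45
```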

# License status

Distributed Computing Server jobs on Orcinus are not started until Distributed Computing Server licenses are available. The license status can be checked by running the lmstat command on Orcinus:

```
/global/software/matlab-lm/etc/lmstat -a
```

Updated 2013-03-26.