Introduction
Matlab is a numerical computing and programming environment with a broad range of functionality (matrix manipulation, numerical linear algebra, generalpurpose graphics, etc.). Additionally, specialised application areas (e.g. bioinformatics or financial derivatives) are served by a large number of optional toolboxes
Matlab is installed on the ARC clusters along with all the toolboxes covered by the Oxford University Concurrent license.
The purpose of this examplebased tutorial is to guide the user through a few aspects of running efficient Matlab jobs on the ARC resources. These aspects are
 using the Matlab module,
 running Matlab scripts noninteractively from batch jobs and

running Matlab jobs on parallel hardware:
 overview of parallel computing in Matlab,
 using the Matlab Parallel Computing toolbox,
 exploiting trivial task parallelism and
 multithreaded Mex programming.
The Matlab Module
Like all applications, Matlab has to be loaded using the module utility prior to running it. This ensures that Matlab is included in the path. More information on modules can be found here.
Get the list of available versions of Matlab using module avail matlab command. Load the module for the default Matlab version with module load matlab. Load a particular version by specifying the Matlab version in the command, e.g. module load matlab/R2012a.
Loading the module gives access to the main Matlab product as well as to all the installed toolboxes, including the Parallel Computing Toolbox. To check toolbox availability (and the Matlab version), type ver at the prompt of an interactive Matlab session.
Noninteractive Matlab Sessions
During code development on standard machines Matlab is usually run in interactive mode, in this way making full use of its integrated environment. By contrast, given the "batch processing" nature of supercomputing resources, the preferred mode of operation for Matlab on our systems is noninteractive.
One way to run Matlab noninteractively is through
 redirecting the standard input and output when invoking Matlab and
 invoking Matlab from a submission script, submitted to the queue via the PBS Pro scheduler.
Input and output redirection is arguably the easiest way of running Matlab noninteractivelly. It is achieved using the Linux operators < and >, with Matlab taking a code file as an input and writing the output to a file, e.g. matlab < myScript.m > myOutput.txt. The main function/program (e.g. myScript.m) should have the exit command at the end in order to force Matlab to quit after finishing the execution of the code.
A simple example illustrating noninteractive Matlab use is found in /system/software/linuxx86_64/matlab/examples/seq directory. In this example, the MATLAB program main.m sets a linear system with the righthand side read from a file provided, solves it and saves the result to another file. A Matlab job is sent to the queue and executed on a backend node using the job scheduler. The Torque job job scheduler runs on the cluster arcus, while arcusb runs Slurm; a submission script example is provided for each.
Both submission scripts contains the following line to run the Matlab script:
matlab nodisplay nosplash < main.m > run.log
The flag nodisplay instruct Matlab to run without the GUI, while nosplash prevents the display of the Matlab logo. The < redirection operator ensures that Matlab runs the script main.m, while the > operator redirects the standard output (normally to the terminal) to run.log file.
Job submission is done with the command qsub run_torque.sh (Torque on arcus) or sbatch run_slurm.sh (Slurm on arcusb).
Running Matlab on parallel hardware
Overview of parallel computing in Matlab
Matlab was developed for a long time as a product for singleprocessor computing, partly because distributed parallelism was incompatible with Matlab's original design principles and partly because the potential market was perceived as too small to justify major development efforts. However, the advent of multicore CPUs and the changing nature of Matlab, from the original educational "matrix laboratory" to a complex technical computing environment, prompted a revision of this situation. Matlab now benefits from running on modern parallel hardware in at least two ways.
The first is a builtin feature of Matlab, which "naturally" exploits multicore processing via the underlying multithreaded libraries Intel MKL and FFTW. Thus, linear algebra operations (such as the solution to a linear system A\b or matrix products A*B) and FFT operations (using the function fft) are implicitly multithreaded and make use of all the cores available on a multicore system without user intervention or special extra programming. Some of the vectorised operations in Matlab are also multithreaded. However, this type of operations are only a part of Matlab programming and the vast proportion of the Matlab functionality are scripts or functions that can only use a single core.
Second, users can exploit parallel processing through a series of explicit programming techniques. The following techniques are briefly discussed below, with examples given:
 using the Matlab toolbox Parallel Computing Toolbox;
 trivial parallelism exploited through independent Matlab processes;
 multithreaded MEX programming.
Matlab has two toolboxes (licensed separately from the main distribution) that enable explicit parallel programming: the Parallel Computing Toolbox and the Distributed Computing Server. The Parallel Computing Toolbox is designed for programming multicore architectures, while the Distributed Computing Server extends the Matlab's functionality to large resources, such as clusters.
The functionality of the Parallel Computing Toolbox is extended from single cluster node processing to distributed processing across multiple nodes by the Distributed Computing Server. To learn more about the product, please visit the Distributed Computing Server webpage.
The University has licenses for the Parallel Computing Toolbox and users are encouraged to use it in their jobs run on the ARC hardware. However, neither the University nor the ARC have licenses for the Distributed Computing Engine. If you need to use this product, please contact the ARC staff to make your interest known.
Using the Parallel Computing Toolbox
The Parallel Computing Toolbox offers the programmer a range of highlevel parallelism contructs such as parfor (parallel for loops) and distributed arrays, which can be used to parallelise processing. Matlab scripts enhanced using these contructs can be run on a single multicore system (such as a node of the ARC clusters), utilising all the cores available for parallel processing. While this offers scope for faster execution, the programming is not without catches and requires both programming experience and the understanding of the underlying algorithms. The MathWorks pages are the best introduction to the product.
A simple example of multicore parallelism via the Parallel Computing Toolbox is provided in
/system/software/linuxx86_64/matlab/examples/par
The program main.m evaluates an expensive function within a for loop and stores the results in an array. The for loop is parallelised using the parfor construct; a parfor loop behaves like an ordinary for loop on a singlecore execution but shares the computational load between several workers (normally, each run on a separate core) in parallel execution. To make workers available for parallel execution, the command matlabpool is used in main.m; the example illustrates the behaviour of parfor both before and after the workers are initiated.
The example is run in batch mode with the command qsub run_torque.sh, (Torque on arcus) or sbatch run_slurm.sh (Slurm on arcusb). The submission file is :
#!/bin/bash #SBATCH nodes=1 matlab nodisplay nosplash < main.m > run.log 
Notice how MATLAB is instructed to not load the interactive window.
Note: do not turn java off when lauching MATLAB (i.e. do not invoke matlab nojvm); matlabpool uses the Java Virtual Machine.
After the job finishes, the CPU times spent executed the loops in main.m can be found in timings.dat, showing a clear speedup of the execution in parallel.
Exploiting trivial parallelism
An easy way to exploit multicore systems is to split the workflow into parts that can be processed completely independently. The typical example in this category is a parameter sweep, where the same Matlab script is run a large number of times using different inputs; these runs are indepent from each other and can be carried out concurrently. Thus, the entire workflow can be scheduled in jobs that group 8 independent runs to match the 8 cores available per compute node. This strategy is best coupled with the use of the Matlab mcc compiler in order to avoid an excessive use of licenses.
A simple example is found in
/system/software/linuxx86_64/matlab/examples/mcc
The file oscillator.m is a Matlab script that computes the solution of a damped oscillator of unit mass (using the Matlab ode45 solver) and outputs the maximum oscillation in that solution. The script is prepared for use as a standalone deployed executable using the Matlab function isdeployed.
To deploy the script as a standalone application, load the modules for Matlab and for the Intelcompilers, e.g.
module load matlab/R2012a intelcompilers/2012
Then, compile the script using mcc and the command
mcc v \
R nojvm \
R singleCompThread \
f ./mbuildopts.sh \
m oscillator.m
This command makes use of the options in the file mbuildopts.sh, provided alongside the Matlab script and customised for the Intel compilers. If no option file is passed through the option f, mcc uses the default options file, which uses the Gnu compilers gcc and g++; in principle, using the Intel compilers can lead to a faster executable.
The deployed executable is compiled to run using a single thread via the option singleCompThread. This is important as a number of process are to run concurrently on the same multicore system.
The mcc compilation creates an executable called oscillator. In addition to this, the process generates the files mccExcludedFiles.log and readme.txt, which can be safely discarded. Also, the wrapper script run_oscillator.sh is generated; this can be used to launch the executable oscillator into execution as it ensures the correct environment (paths to shared libraries and other environment variables) is set before execution. The ARC Matlab module updates all the necessary variables, and the executable oscillator can be launched directly, so using run_oscillator.sh is unecessary.
The submission script submitq.sh gives an example of how the deployed executable can be used to launch concurrent processes within the same job. On the clusters, the script requests a single compute node (#SBATCH nodes=1), so that 8 cores are available for processing. 8 separate processes are started with different parameters, such that the 8 processes compute a parameter sweep. The processes are started with a commands such as
oscillator 0.01 0.3 > result1 &
with the run parameters (representing the damping coefficient and the stiffness) passed on as command line arguments. The processes are started in the background (using the symbol &), such that the second process can start before the first finishes, and so on. At the end of the PBS script, a synchronisation point is necessary, which is implemented using a "wait" loop which "listens" for any processes calle oscillator; without this synchronisation, the job launches the oscillator processes into background execution and finishes, without waiting for the processes to finish. Each process prints the result (maximum oscillation) to the standard output; there is now way to "return" a numeric result from a stnadalone executable. It is easy to preserve the results after the job runs by redirecting the output to the files result*.
Finally, the use of mcc can be avoided altogether and Matlab can be run directly. For example, the first processing line in the PBS can be
matlab nojvm singleCompThread r "oscillator(0.01, 0.3); exit" > result1 &
However, deployed executables do not require Matlab licenses to run, which can make an important economy, especially in the case of a large number of concurrent processes (such as a parameter sweep).
Multithreaded MEX programming
Yet another way to exploit multicore systems is via multithreaded Mex programming. Mex (Matlab EXecutable) files are dynamically linked subroutines compiled from C, C++ or Fortran source code that can be run from within Matlab in the same way as Mfiles or builtin functions. These guidelines assume knowledge of serial Mex programming and provide an example of how to augment serial execution with multithreading through OpenMP. Coupled with OpenMP multithreading, Mex files become a powerful method to accelerate key parts of a Matlab program.
The main reason to write Mex files in C or Fortran (thus abandoning the highlevel abstracted Matlab programming) is to gain speed of execution in computationally intensive operations that otherwise become a bottleneck in an application. Typically, this is done to replace a function that is identified through profiling as being slow and/or called a large number of times. Nevertheless, this programming effort is rewarded to various degrees, with the greatest relative benefits normally met when a Mex replaces a Matlab script (Mfile). At the other extreme, Matlab operations that rely on performance libraries like FFTW (e.g. fftn) or BLAS/LAPACK (e.g. solution of a dense linear systems, A\b), which are highly optimised have nothing or very little to benefit from Mex programming. The best source for learning Mex programming is the Mathworks webpages.
The functionality necessary for building Mex files is provided by the Matlab C, C++ or Fortran APIs. The main component of a Mex source code is the external gateway function mexFunction, which manages the transfer of data between Mex files and the Matlab environment. This can be combined with C or Fortran routines or any external library in a manner similar to any other source code. (Note: the API provides a Fortran gateway function too but practice shows it is trickier to compile pure Fortran Mex code than pure C or C++; the easiest way forward for Fortran programmers is to use the C gateway function to invoke the Fortran routines, which are compiled separately using a Fortran compiler.) The Mex gateway function is compiled using the Matlab mex utility, which invokes a backend compiler (such as gcc) and manages the linking stage with the Matlab Mex API libraries. The compiler and compiler flags are configurable and the options can be stored in an options file.
An example can be found in the directory
/system/software/linuxx86_64/matlab/examples/mex
This reimplements the BLAS axpy function, which adds the input matrix y to the input matrix x multiplied by the scalar a. This example is purely illustrative and its utility does not stretch beyond this guide. Matlab uses the threaded MKL implementation of axpy directly and this function is chosen for this example only because it is very simple to program and to understand.
The directory contains the C files axpy.c (which contains the gateway function) and axpy_kernel.c (which performs the a*x+y operation). The main function axpy.c contains various checks on the input data and the function axpy_kernel.c contains OpenMP directive for threading the loop over the array index. The directory also contains the Mex option files mexOpts.sh that is modified to use the Intel C compiler and to trigger AVX and SSE vectorisation. Lastly, a simple makefile is also provided.
To generate the Mex file, use the commands
module load matlab/R2012a intelcompilers/2013
make
This generates the file axpy.mexa64, which can be run directly from Matlab. To test the function axpy thus generated, you can load an interactive session and start Matlab. At the Matlab prompt, enter the commands
M = 60000; N = 30000;
x = rand(M,N); y = rand(M,N); a = rand();
tic; z = axpy(a, x,y, 1); toc;
tic; z = axpy(a, x,y, 2); toc;
tic; z = axpy(a, x,y, 4); toc;
The first two lines generate some data sets to work on. The third line and after run the newly built function axpy using a, x and y as arguments as well as the number of OpenMP threads to use in the computation. The threading is done in the computational function axpy_kernel.c and its effect can be seen in a reduction in the computational time when going from 2 threads to 4. Remember this is illustrative, a decrease of computational time is unlikely beyond 8 threads (there are several reasons for this).
The accuracy of the results can be checked against Matlab's own operations with the command
norm(z  (a*x + y), 'fro')
which should produce a number of the order of machine eps.