## Wednesday, 5 June 2019

### Procedural Map Reduce and Parallelism

Since toy examples seem like fun, I thought I would make my own toy example in Ada, and play around with some parallelism while I was at it.

The idea is that we want to code a very simple but optimised map reduce function across a dataset consisting of several million 64-bit floating point values. In our code, we want to map the square function over the data points and then reduce with the sum function. In other words, we wish to calculate the sum of the squares of the data points, as illustrated by the following Common Lisp:
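A minimal sketch of that one-liner, using five integers as placeholder data:

```lisp
;; Map the square function over the data, then reduce with +.
(reduce #'+ (mapcar #'(lambda (x) (* x x)) '(1 2 3 4 5)))
;; => 55
```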

In reality, instead of using five integers for our data, we will be using 256 million 64-bit floating point values, occupying 2,048,000,000 bytes (roughly 2 GB) of RAM. That should make for a fairly decent computation.

A straightforward way to do this in Ada is shown below. (Code is available at this link.)

The Params module merely keeps the desired number of data points, so it can be easily shared with the subsequent parallelised version of the algorithm:

It should be readily seen from perusing the code that we are allocating an array of 256 million IEEE 64-bit floating point values, setting each value to Pi, and then summing the squares straightforwardly with `Sum := Sum + Val * Val;`.
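To give a feel for the shape of such a single-threaded version, here is a condensed sketch (the identifiers `Params`, `Data_Array` and `Sum_Squares` are my own placeholders, not necessarily those of the original code):

```ada
--  params.ads: shared constant for the number of data points.
package Params is
   Num_Values : constant := 256_000_000;
end Params;

--  sum_squares.adb: allocate, fill with Pi, and sum the squares.
with Ada.Text_IO;
with Ada.Numerics;
with Ada.Real_Time;  use Ada.Real_Time;
with Params;

procedure Sum_Squares is
   type Data_Array is array (1 .. Params.Num_Values) of Long_Float;
   type Data_Access is access Data_Array;

   Data        : constant Data_Access := new Data_Array;
   Sum         : Long_Float := 0.0;
   Start, Stop : Time;
begin
   --  Set each value to Pi.
   for I in Data'Range loop
      Data (I) := Ada.Numerics.Pi;
   end loop;

   --  Time the map reduce: square each value and accumulate the sum.
   Start := Clock;
   for Val of Data.all loop
      Sum := Sum + Val * Val;
   end loop;
   Stop := Clock;

   Ada.Text_IO.Put_Line ("Sum           : " & Long_Float'Image (Sum));
   Ada.Text_IO.Put_Line
     ("Duration (ms) : "
      & Duration'Image (To_Duration (Stop - Start) * 1000));
end Sum_Squares;
```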

Compiling this with full optimisations (-O3), and running it gives us an output similar to the following.

```
Sum           :  2.52661872562787E+09
Duration (ms) :  203.672997000
```

The C example in the article mentioned in the first paragraph achieves 17 ms for 32 million data values. Scaled to 32 million values, our 203.7 ms works out to roughly 25.5 ms, so the C code is about 33% faster than this naive Ada code. I would attribute this to the C compiler having AVX2 instructions activated, which may not be the case with the GNAT Ada compiler I used.

However, unlike C, Ada has tasking built into the language, so let's see what we need to do to parallelise the computation, and what effect this may have on the execution time. Here is a parallelised version of the algorithm.
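The heart of such a version is a worker task type that receives its slice of the array via a rendezvous and hands back a partial sum. A sketch, with identifiers that are again my own guesses at the shape of the code:

```ada
--  Assumes Data (an access to the big array) and Params.Num_Values
--  are declared in an enclosing scope, as in the single-threaded version.
task type Worker is
   entry Start (First, Last : in Positive);
   entry Report (Partial : out Long_Float);
end Worker;

task body Worker is
   Lo, Hi : Positive;
   Sum    : Long_Float := 0.0;
begin
   --  Rendezvous: receive the slice boundaries from the main program.
   accept Start (First, Last : in Positive) do
      Lo := First;
      Hi := Last;
   end Start;

   --  The actual map reduce over this worker's slice.
   for I in Lo .. Hi loop
      Sum := Sum + Data (I) * Data (I);
   end loop;

   --  Rendezvous: hand the partial sum back to the caller.
   accept Report (Partial : out Long_Float) do
      Partial := Sum;
   end Report;
end Worker;

--  In the main program: launch the workers on their slices, then
--  collect and sum the partial results.
--
--  Workers : array (1 .. Num_Workers) of Worker;
--  Chunk   : constant Positive := Params.Num_Values / Num_Workers;
--
--  for W in Workers'Range loop
--     Workers (W).Start
--       ((W - 1) * Chunk + 1,
--        (if W = Num_Workers then Params.Num_Values else W * Chunk));
--  end loop;
--  for W in Workers'Range loop
--     Workers (W).Report (Part);
--     Total := Total + Part;
--  end loop;
```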

There are two main additional complexities here: defining the worker task (lines 28 to 59), and dividing up the array into suitable chunks depending on the number of worker tasks to be run (lines 63 to 87). Of these, I'm sure the latter can be greatly simplified, but the initial approach works.

The main part of the code (lines 109 to 120) takes care of starting each of the worker tasks, collecting the results, and summing them to arrive at the overall total.

The last few lines (123 to 127) merely calculate the map reduce time and then print out the details.

Running this code with eight worker tasks results in output similar to the following.

```
Sum           :  2.52661872742146E+09
Duration (ms) :  111.897841000
```

This represents around a 45% performance improvement over the single-threaded Ada code, and about an 18% improvement over the C version from the referenced article.

Now, a slightly dissatisfying aspect of these results is that the measured duration of the process would vary somewhat from run to run, no doubt depending upon the vagaries of what was happening in the system from moment to moment. In order to be a bit more scientific, it would be nice to run the job a certain number of times and then report the mean duration. To do this, I first added an extra parameter for the number of calculation runs to perform:
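The updated package might look simply like this (again a sketch; the constant names are my assumptions):

```ada
--  params.ads: number of data points plus the number of timed runs.
package Params is
   Num_Values : constant := 1_024_000_000;
   Num_Runs   : constant := 50;
end Params;
```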

I settled on a choice of fifty runs. As you can see, I also upped the number of data points to 1,024 million, which consumes roughly 8 GB of RAM to store the data array.

I then factored out the single calculation run into its own subroutine so that it could be called multiple times:

The factored-out procedure can be seen in lines 21 to 39. It is called the required number of times in the main program (lines 68 to 70); the mean duration is then calculated (line 72), and the results are printed out as before.
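A sketch of what such a factored-out single run might look like (the name `Do_Run` and the parameter profile are my own invention):

```ada
--  One map reduce pass over the data, reporting the elapsed
--  wall-clock time to the caller.
procedure Do_Run
  (Data    : in     Data_Access;
   Sum     :    out Long_Float;
   Elapsed :    out Duration)
is
   use Ada.Real_Time;
   Start, Stop : Time;
begin
   Sum   := 0.0;
   Start := Clock;
   for Val of Data.all loop
      Sum := Sum + Val * Val;
   end loop;
   Stop    := Clock;
   Elapsed := To_Duration (Stop - Start);
end Do_Run;

--  In the main program, call it Num_Runs times and average:
--
--  for R in 1 .. Params.Num_Runs loop
--     Do_Run (Data, Sum, Elapsed);
--     Total_Time := Total_Time + Elapsed;
--  end loop;
--  Mean_Time := Total_Time / Params.Num_Runs;
```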

As expected, this approach greatly evens out the variations previously obtained. The following is a sample output:

```
Data points    :  1024000000
Sum of squares :  1.01064747680946E+10
Duration (ms)  :  824.309429480
```

Comparing this again to the C time of 17 ms for 32 million floats: our 824.3 ms over 1,024 million values scales to about 25.8 ms per 32 million, so the C code using AVX2 instructions (not available on many hardware architectures) does about 34% better. So, the previously shown result was about right.

I made corresponding changes to the multi-threaded example, but also ensured that each task would run on a dedicated CPU, so that there would be no ping-ponging of tasks from one CPU to another. In this particular exercise, doing this did not seem to make much of a difference, as presumably the Linux scheduler already did an excellent job of spreading the load. Version two of the multi-threaded code follows.

Part one:

Part two:

I also used a more sensible method of setting the worker index ranges into the dataset.
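For the pinning, Ada 2012's `CPU` aspect together with `System.Multiprocessors` makes this quite tidy; a sketch of both that and a simpler slice computation (identifiers assumed, as before):

```ada
with System.Multiprocessors;  use System.Multiprocessors;

--  Pin each worker to a dedicated core: the CPU aspect takes its
--  value from the task discriminant at object creation time.
task type Worker (Core : CPU) with CPU => Core is
   entry Start (First, Last : in Positive);
   entry Report (Partial : out Long_Float);
end Worker;

--  Simpler index ranges: each worker gets Chunk values, with the
--  final worker absorbing any remainder.
--
--  Chunk : constant Positive := Params.Num_Values / Num_Workers;
--  First : constant Positive := (W - 1) * Chunk + 1;
--  Last  : constant Positive :=
--    (if W = Num_Workers then Params.Num_Values else W * Chunk);
```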

The results of running this were as follows: