Performance
Although I had promised in the previous post to follow the original series closely, a conversation with a friend prompted me to spend some time seeing whether I could improve the current performance of both the Python and Golang implementations by leveraging parallel processing. So I am going to take a small detour to figure out whether the function's performance can be improved with parallel processing, before coming back to the database implementation in the next post.
In Python, numba enables parallel processing by calling the appropriate APIs and passing the parallel option. Golang has native support for goroutines, which make concurrency and parallelism trivial. In this post I will try out both approaches and see how they fare.
I will not be rigorously following benchmarking best practices to get the most precise results. The focus will be mostly on the approaches rather than exact, reproducible numbers.
Status Quo
Before we start, let us revisit where we left off in both Python and Golang. I have also added a 20M test so that the runs last longer, reducing the impact of startup/teardown times.
| Implementation/Time(s) | 10M | 20M |
|---|---|---|
| Python | ~25 | ~60 |
| Golang | ~15 | ~40 |
We can also take a look at the code that gave the above results:
We start with the DivisorSum function. In Python, we have added JIT compilation using the numba njit decorator, which gives us the performance needed to keep it competitive with the compiled Go code.
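A minimal sketch of that function, for reference. Only the name `DivisorSum` and the `njit` decorator come from the text above; the divisor-pairing loop body is my assumption, and the try/except fallback is only there so the sketch also runs (without the speedup) when numba is not installed:

```python
try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (without JIT) if numba is absent.
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func

@njit
def DivisorSum(n):
    # Sum all divisors of n by pairing each divisor i <= sqrt(n)
    # with its complement n // i.
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i
            if i != n // i:
                total += n // i
        i += 1
    return total
```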
This is the equivalent
Golang code.
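A sketch of the Go version, using the same divisor-pairing loop (the body is my assumption; only the function name comes from the text):

```go
package main

import "fmt"

// DivisorSum returns the sum of all divisors of n by pairing each
// divisor i <= sqrt(n) with its complement n / i.
func DivisorSum(n int) int {
	total := 0
	for i := 1; i*i <= n; i++ {
		if n%i == 0 {
			total += i
			if i != n/i {
				total += n / i
			}
		}
	}
	return total
}

func main() {
	fmt.Println(DivisorSum(28)) // 28 is perfect: 1+2+4+7+14+28 = 56
}
```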
On to the wrapper function in Python, which runs the above function over all the numbers serially to find the best Witness.
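A sketch of that serial wrapper. The witness formula sigma(n) / (n · log log n) is assumed from the series this post follows (the post does not restate it), and the scan starts at 3 because log(log(n)) is non-positive below e:

```python
import math

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without numba (no JIT speedup).
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func

@njit
def DivisorSum(n):
    # Sum all divisors of n by pairing divisors up to sqrt(n).
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i
            if i != n // i:
                total += n // i
        i += 1
    return total

@njit
def BestWitness(limit):
    # Serially scan 3..limit, keeping the number whose witness value
    # sigma(n) / (n * log(log(n))) is largest (assumed formula).
    best_n, best_val = 0, 0.0
    for n in range(3, limit + 1):
        val = DivisorSum(n) / (n * math.log(math.log(n)))
        if val > best_val:
            best_n, best_val = n, val
    return best_n, best_val
```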
The Golang implementation is trivial as well:
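A sketch of the serial Go wrapper, under the same assumed witness formula (the helper name `WitnessValue` is mine):

```go
package main

import (
	"fmt"
	"math"
)

// DivisorSum returns the sum of all divisors of n.
func DivisorSum(n int) int {
	total := 0
	for i := 1; i*i <= n; i++ {
		if n%i == 0 {
			total += i
			if i != n/i {
				total += n / i
			}
		}
	}
	return total
}

// WitnessValue is assumed to be sigma(n) / (n * log(log(n))), as in
// the series this post builds on.
func WitnessValue(n int) float64 {
	return float64(DivisorSum(n)) / (float64(n) * math.Log(math.Log(float64(n))))
}

// BestWitness scans 3..limit serially (log(log(n)) is non-positive
// below e) and returns the number with the largest witness value.
func BestWitness(limit int) (int, float64) {
	bestN, bestVal := 0, 0.0
	for n := 3; n <= limit; n++ {
		if v := WitnessValue(n); v > bestVal {
			bestN, bestVal = n, v
		}
	}
	return bestN, bestVal
}

func main() {
	n, v := BestWitness(100)
	fmt.Printf("best witness %.4f at n=%d\n", v, n)
}
```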
Finally, the tests in each language.
Python:
Golang:
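And a matching Go harness sketch; again the limit is parameterized, and the real runs use 10M/20M:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// DivisorSum returns the sum of all divisors of n.
func DivisorSum(n int) int {
	total := 0
	for i := 1; i*i <= n; i++ {
		if n%i == 0 {
			total += i
			if i != n/i {
				total += n / i
			}
		}
	}
	return total
}

// WitnessValue is the assumed formula sigma(n) / (n * log(log(n))).
func WitnessValue(n int) float64 {
	return float64(DivisorSum(n)) / (float64(n) * math.Log(math.Log(float64(n))))
}

// BestWitness scans 3..limit serially for the largest witness value.
func BestWitness(limit int) (int, float64) {
	bestN, bestVal := 0, 0.0
	for n := 3; n <= limit; n++ {
		if v := WitnessValue(n); v > bestVal {
			bestN, bestVal = n, v
		}
	}
	return bestN, bestVal
}

func main() {
	// Time a full scan; swap in 10_000_000 or 20_000_000 to
	// reproduce the numbers in the tables.
	start := time.Now()
	n, v := BestWitness(100_000)
	fmt.Printf("n=%d witness=%.4f time=%s\n", n, v, time.Since(start))
}
```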
Parallelization
Python
For Python, numba makes parallelizing a piece of cake: just add the parallel=True argument to the njit decorator. However, the parallel iterations do not necessarily run in any particular order, so we cannot map the results to an object like we did previously. Instead, we preallocate an array that each parallel iteration can write to independently, and at the end of the calculation we use np.argmax to find the index of the maximum value. The full parallel implementation is given below:
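A sketch of that parallel version (the witness formula is the assumption noted earlier; the try/except fallback lets it run serially without numba):

```python
import math
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallbacks so the sketch still runs (serially) without numba.
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func
    prange = range

@njit
def DivisorSum(n):
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i
            if i != n // i:
                total += n // i
        i += 1
    return total

@njit(parallel=True)
def BestWitnessParallel(limit):
    # One preallocated slot per number, so parallel iterations can
    # write independently and in any order.
    witnesses = np.zeros(limit + 1)
    for n in prange(3, limit + 1):
        witnesses[n] = DivisorSum(n) / (n * math.log(math.log(n)))
    best = np.argmax(witnesses)  # reduce at the end
    return best, witnesses[best]
```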
With the above parallel code, the process ran in ~20 seconds on my M1 MacBook Air. This is 3x the speed of the non-parallelized version!
Golang
In Golang, parallelizing the code takes a little more effort. First, we need to define a Witness structure to hold the results.
We make a buffered channel of the Witness type, with a buffer size equal to the number of iterations:
Then we create a sync.WaitGroup and add a count equal to the number of iterations, to make sure that we proceed only once all of the iterations are done.
Now we iterate across the elements, starting one goroutine per iteration to find the Witness value. Each goroutine pushes its output to the channel once it has calculated its result, and decrements the WaitGroup counter via the defer keyword when the function finishes running.
We wait for all the iterations to finish using the wg.Wait() call and then close the channel. We can then iterate over the channel to find the element with the maximum Witness value.
Here is the final function in all its glory -
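Putting those pieces together, here is a sketch of that function (the struct, channel, and WaitGroup usage follow the steps above; the witness formula remains my assumption):

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Witness pairs a number with its witness value so goroutines can
// report results over a channel.
type Witness struct {
	N     int
	Value float64
}

// DivisorSum returns the sum of all divisors of n.
func DivisorSum(n int) int {
	total := 0
	for i := 1; i*i <= n; i++ {
		if n%i == 0 {
			total += i
			if i != n/i {
				total += n / i
			}
		}
	}
	return total
}

// WitnessValue is the assumed formula sigma(n) / (n * log(log(n))).
func WitnessValue(n int) float64 {
	return float64(DivisorSum(n)) / (float64(n) * math.Log(math.Log(float64(n))))
}

// BestWitnessParallel starts one goroutine per number -- the naive
// approach, with all its goroutine-creation overhead.
func BestWitnessParallel(limit int) (int, float64) {
	results := make(chan Witness, limit) // buffered: one slot per iteration
	var wg sync.WaitGroup
	wg.Add(limit - 2) // iterations 3..limit
	for n := 3; n <= limit; n++ {
		go func(n int) {
			defer wg.Done() // decrement the counter when done
			results <- Witness{N: n, Value: WitnessValue(n)}
		}(n)
	}
	wg.Wait()      // block until every goroutine has finished
	close(results) // so the range below terminates
	bestN, bestVal := 0, 0.0
	for w := range results {
		if w.Value > bestVal {
			bestN, bestVal = w.N, w.Value
		}
	}
	return bestN, bestVal
}

func main() {
	n, v := BestWitnessParallel(10_000)
	fmt.Printf("n=%d witness=%.4f\n", n, v)
}
```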
The above function took almost 40 seconds to run, which is roughly the same as the non-parallel version! I could see that the parallel cores were being used, but the overhead of creating 20 million goroutines was too high, causing Amdahl's law to come into play and negating any advantage of parallelization. We will have to do better!
Parallel Optimization
In the next iteration of the code, I try a more optimized approach: restrict the parallelism to a lower number, say 100 goroutines, and use a smarter work-splitting function that divides the work evenly across them.
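A sketch of that chunked version: each of a fixed number of goroutines gets a contiguous slice of the range, reports only its local best, and the main goroutine reduces over at most `workers` messages (the names and exact chunking scheme are illustrative):

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Witness pairs a number with its witness value.
type Witness struct {
	N     int
	Value float64
}

// DivisorSum returns the sum of all divisors of n.
func DivisorSum(n int) int {
	total := 0
	for i := 1; i*i <= n; i++ {
		if n%i == 0 {
			total += i
			if i != n/i {
				total += n / i
			}
		}
	}
	return total
}

// WitnessValue is the assumed formula sigma(n) / (n * log(log(n))).
func WitnessValue(n int) float64 {
	return float64(DivisorSum(n)) / (float64(n) * math.Log(math.Log(float64(n))))
}

// BestWitnessChunked caps parallelism at `workers` goroutines,
// splitting the range 3..limit into even contiguous chunks.
func BestWitnessChunked(limit, workers int) (int, float64) {
	results := make(chan Witness, workers) // at most one message per worker
	var wg sync.WaitGroup
	chunk := (limit - 2 + workers - 1) / workers // ceil((limit-2)/workers)
	for start := 3; start <= limit; start += chunk {
		end := start + chunk - 1
		if end > limit {
			end = limit
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			var best Witness
			for n := start; n <= end; n++ {
				if v := WitnessValue(n); v > best.Value {
					best = Witness{N: n, Value: v}
				}
			}
			results <- best // one message per worker, not per number
		}(start, end)
	}
	wg.Wait()
	close(results)
	bestN, bestVal := 0, 0.0
	for w := range results {
		if w.Value > bestVal {
			bestN, bestVal = w.N, w.Value
		}
	}
	return bestN, bestVal
}

func main() {
	n, v := BestWitnessChunked(100_000, 100)
	fmt.Printf("n=%d witness=%.4f\n", n, v)
}
```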
Now the test runs in 10s, which is
~4x faster than the non-parallel version!
Revisiting Python
We can now refactor the Python implementation using the same optimization logic.
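A sketch of the chunked Python version: prange now runs over a small number of workers, each scanning a contiguous chunk and recording its local best in a preallocated per-worker slot, with np.argmax reducing across workers at the end (names and chunking scheme are illustrative):

```python
import math
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallbacks so the sketch still runs (serially) without numba.
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func
    prange = range

@njit
def DivisorSum(n):
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i
            if i != n // i:
                total += n // i
        i += 1
    return total

@njit(parallel=True)
def BestWitnessChunked(limit, workers):
    # One preallocated slot per worker instead of per number.
    best_vals = np.zeros(workers)
    best_ns = np.zeros(workers, dtype=np.int64)
    chunk = (limit - 2 + workers - 1) // workers  # ceil((limit-2)/workers)
    for w in prange(workers):
        start = 3 + w * chunk
        end = min(start + chunk - 1, limit)
        for n in range(start, end + 1):
            val = DivisorSum(n) / (n * math.log(math.log(n)))
            if val > best_vals[w]:
                best_vals[w] = val
                best_ns[w] = n
    best = np.argmax(best_vals)
    return best_ns[best], best_vals[best]
```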
This optimized implementation takes ~15 seconds to run, which is 4x faster than the non-parallel version, and 1.3x faster than the naive implementation!
big.LITTLE
I was initially concerned that the maximum improvement we could attain with multi-core processing was only ~4x, not closer to 7x with the 8 cores in my MacBook. It turns out the M1 processor has a big.LITTLE architecture, with 4 heavy-duty performance cores and 4 much slower efficiency cores. This means we're constrained largely by the 4 big cores, and it is unlikely that we can go much faster. Hence, the 4x speedup that we got in both Python and Golang makes perfect sense.
Conclusion
Here are the final results of the different approaches we tried out:
| Implementation/Time(s) | serial | parallel | parallel (optimized) |
|---|---|---|---|
| Python | ~60 | ~20 | ~15 |
| Golang | ~40 | ~40 | ~10 |
Looks like we were able to improve both implementations by 4x, which I suppose is the maximum speedup we can realistically get. With the best implementations, Golang continues to retain a ~30% speedup over Python. However, the Python implementation was far simpler than the Golang one, which is always an aspect (probably the most important one) we need to consider before chasing benchmarks.