Multiprocessing in Python via C
Lately I've been using the interface between C and python, and they do play very well together! Most of the reasons for doing this are apparently obvious: C is much faster than python, so if you code your bottlenecks in C and wrap them with python, it's a win win scenario. In practice, this ends up not being done as often as one would think because python has many good libraries that are already doing this under the bonnet, for instance, numpy, pandas and others.
For me, there is another very attractive reason for calling C from python, even when there are C optimised numpy functions available: multiprocessing. It is true that numpy can be compiled to use multi cores, and in fact, in the Linux distributions I've tried numpy already has multiprocessing support by default. However, the operations that are using multi cores are limited: matrix multiplication is using many cores, but element wise operations are not.
So in this post I'm going to give you a small example on how to call C code from python, and how to use multi cores with C code. Both these tasks are documented around the web, but it took me a while to gather all the relevant information. This post is a good excuse to write it down in a place I can go back to whenever I need it. I hope it is helpful to you as well.
C code
The example we will be working on in an element wise square root of a numpy 2d array. In C, a numpy array is represented as a pointer, so our C function sqrt.c
is simply
#include "sqrt.h"
#include <math.h>
void c_sqrt(const double * matrix,
const unsigned long long int n_rows,
const unsigned long long int n_cols,
double * output) {
unsigned long long int i, j;
for(i=0; i<n_rows; i++){
for(j=0; j<n_cols; j++){
output[i*n_cols + j] = sqrt(matrix[i*n_cols + j]);
}
}
}
And the corresponding header file sqrt.h
#include <stdio.h>
void c_sqrt(const double * matrix,
const unsigned long long int n_rows,
const unsigned long long int n_cols,
double * output);
Wrapping C code with cython
There are more than one way to call C code from python. They are very well summarised here. From the four methods they present, I've only used two of them:

cython
It is quite versatile. All the main code is in C/C++, and then these C functions are called from a cython function that essentially does the conversion from python variables to C variables. This has the advantage of having C code that is python independent, and that could be easily wrapped by different languages, for example R.

python c api
This is the most versatile option. Python objects can actually be created directly in C, and python can call the C shared library directly. I recently had to use this api (actually, the numpy C API) because I needed to return an array from C which size was not known beforehand by the python code. I found that a clean way to accomplish this was to create the numpy array directly in the C code. This approach doesn't necessarily mean that the C code will be python dependent: we can always write the C code in a python agnostic way and then just write C wrappers for it.
To put it simple, for my use cases the difference between these two approaches is that in the first option, the wrapper is a cython function whereas in the second, the wrapper is a pure C function. In this post I'll do it the cython way.
The main purpose of these wrappers is to convert the python variables into C variables, and the other way around. In the cython case, in most cases, this conversion is mostly automatic and pretty simple. With numpy arrays, there are some tricks though. Before going through them, let me show how the wrapper wrapper.pyx
is going to look like
import numpy as np
cimport numpy as np
cdef extern from "sqrt.h":
void c_sqrt(const double * matrix,
const unsigned long long int n_rows,
const unsigned long long int n_cols,
double * output)
def c_wrapped_sqrt(np.ndarray[np.double_t, ndim=2, mode="c"] matrix):
cdef np.ndarray[np.double_t, ndim=2, mode="c"] to_return
to_return = np.empty_like(matrix)
c_sqrt(&matrix[0,0], matrix.shape[0], matrix.shape[1], &to_return[0,0])
return to_return
Some comments:

The way the C function will access memory of a numpy array is via a pointer. So we need to send the address to C, and that is what we are doing with
&matrix[0,0]
. I've seen other ways of doing this, for instance<double *> np.PyArray_DATA(matrix)
, which essentially returns a pointer of doubles to the numpy data inmatrix
. 
One way of having a numpy array changed or created by a C function is by passing the corresponding memory address as an argument. In our case, the memory is actually allocated in cython
to_return = np.empty_like(matrix)
, even though the data is only computed by the C program. 
A contiguous 2d numpy array is, under the bonnet, just a pointer to doubles. This means the data is organised in one big contiguous line. There are different ways of laying out a matrix in a line: C alignment, in which elements of a row are kept together, and at the end of one row, the next row starts; the other is Fortran alignment, in which the same argument applied, but column wise. The important thing is that the data that is passed is contiguous and, of course, that we are consistent with the alignment.

In case the input matrix is not a contiguous 2d array, which can happen for instance if it is the result of some slicing operations, then we would need to make it contiguous in the cython wrapper before calling the C function. That can be done with the function
np.ascontiguousarray
.
To tie together the cython wrapper with the C code, everything needs to be compiled together. The easiest way is to create a setup.py
file like the one below. The compilation will be done by running python setup.py build_ext inplace
.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(
name = "sqrt",
ext_modules=[ Extension("sqrt", ["wrapper.pyx", "sqrt.c"],
#extra_compile_args=["O3", "ffastmath"],
extra_link_args=["lm"])],
cmdclass = {'build_ext': build_ext}
)
The link argument lm
is needed because we are using the math.h
module in our C code. If we were not, that would not be needed.
As far as wrapping C code in python via cython, this is essentially it. We just need to compile the C and cython code by running
python setup.py build_ext inplace
We can now open a python terminal, run import sqrt
and call the function sqrt.c_sqrt_wrapper
.
Python code
To make it easier to compare results and to call either the numpy or the C versions of the element wise square root we can create a small python file, call it timer.py
:
import numpy as np
import sqrt
matrix_size = 600
np.random.seed(3435)
m = np.random.uniform(size=(matrix_size, matrix_size))
def numpy_call():
return np.sqrt(m)
def c_call():
return sqrt.c_wrapped_sqrt(m)
Performance
To time both the numpy and the C version of the code, we can use the magic from ipython
In [1]: import timer
In [2]: %timeit timer.numpy_call()
1000 loops, best of 3: 1.47 ms per loop
In [3]: %timeit timer.c_call()
100 loops, best of 3: 2.9 ms per loop
Immediately we see that the our C code is twice as slow as the numpy equivalent. That is expected: the compilation was done without any optimization. By uncommenting the extra_compile_args
in the setup.py
file, we can substantially improve the speed of our code
In [1]: import timer
In [2]: %timeit timer.numpy_call()
1000 loops, best of 3: 1.47 ms per loop
In [3]: %timeit timer.c_call()
1000 loops, best of 3: 1.47 ms per loop
So at the moment we have the speed of numpy, which was actually expected. Numpy runs on C.
Multiprocessing
It is surprisingly easy to improve the speed of a C program by using many cores, if you have them, of course!
Thanks to the OpenMP API, with just two extra lines we can easily parallelize our C code. And that is counting with the #include
#include "sqrt.h"
#include <math.h>
#include <omp.h>
void c_sqrt(const double * matrix,
const unsigned long long int n_rows,
const unsigned long long int n_cols,
double * output) {
unsigned long long int i, j;
#pragma omp parallel for private (i, j)
for(i=0; i<n_rows; i++){
for(j=0; j<n_cols; j++){
output[i*n_cols + j] = sqrt(matrix[i*n_cols + j]);
}
}
}
The #pragma omp parallel for private (i, j)
is a compiler directive, so that parallel code is generated at compile time. To compile this, we also need to add extra extra_compile_args
and extra_link_args
to the setup.py
file:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(
name = "sqrt",
ext_modules=[ Extension("sqrt", ["wrapper.pyx", "sqrt.c"],
extra_compile_args=["O3", "ffastmath", "fopenmp"],
extra_link_args=["fopenmp", "lm"])],
cmdclass = {'build_ext': build_ext}
)
After recompiling everything, we get a significant boost in speed on a 4 core machine.
In [1]: import timer
In [2]: %timeit timer.numpy_call()
1000 loops, best of 3: 1.47 ms per loop
In [3]: %timeit timer.c_call()
1000 loops, best of 3: 888 µs per loop
Conclusions
Python is quite slow, but it can be used quite efficiently if one uses the right libraries, like numpy or pandas. The style one codes in can also have an impact of the performance of a python program. But sometimes, the standard libraries are not enough to have a fast running python code. For those cases, the ability to write python bottlenecks in C is a great tool to have in ones toolbox. And the ability to easily parallelize C code by adding few lines is just great!