Recently we worked on the Kaggle Quora duplicate questions competition. In short, the challenge was to implement the best algorithm for deciding whether a given pair of questions should be considered duplicates or not. Eventually, our solution ended up among the
top 6%. While solving the task, we learned many things, which we would now like to share with you.

The Problem

A few days before the competition end date, we got a new idea to try. We wanted to add yet another feature to our already quite complex model (definitely a subject for another post) and check whether it improved the score. Unfortunately, after preparing and running the code, it turned out to be awfully slow. The first estimate gave us a maximum of 2 question pairs checked per second, while the train and test datasets contained over 2.7 million pairs in total. Since such a computation would not finish before the competition deadline, we needed to look for ways to speed up the process.


In each iteration, quite complex calculations took place, with execution passing through multiple libraries. And while we had a feeling that the code on this path was often suboptimal, we were not really in a position to rewrite everything from scratch. There was a much bigger problem, though: the code was running in a single thread! On an i7-4770 with 4 physical cores and Hyper-Threading (8 logical cores in total), only 1 was utilized.


We made some attempts at parallelization, since every question pair could be analyzed separately. We tried some standard approaches, like joblib:

>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=-1)(delayed(foo)(i) for i in data)

However, our function was using some external libraries, which brought us problems with pickling. Python has strict limitations on what can actually be pickled.
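As a minimal illustration (not our actual feature code): a plain top-level function pickles fine, but a lambda or a locally defined closure does not, which is exactly what breaks when a parallelization library hands such an object to a worker process:

```python
import pickle

# A top-level function can be pickled (and thus sent to a worker process),
# because pickle stores it by its importable module-level name:
def double(x):
    return 2 * x

payload = pickle.dumps(double)

# A lambda cannot -- it has no importable name, so handing one
# (directly or buried inside another object) to a worker fails:
try:
    pickle.dumps(lambda x: 2 * x)
except Exception as exc:
    print("pickling failed:", exc)
```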


Considering the problems we had encountered so far, we needed to think of other ways to solve our problem. One idea was to manually split our datasets into separate files and manually run separate Python processes. However, as lazy programmers, we found this approach troublesome, not easily scalable, and error-prone. Fortunately, we found another way.

Python's multiprocessing.Pool allows you to automatically perform an operation in multiple processes.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        result_df = pd.concat(pool.map(func, df_split))
    return result_df

This simple piece of code runs a specific function on a DataFrame's elements using a specified number of cores. It splits the data into a number of partitions, spawns a Pool, executes the function on every partition, and joins the results. Simple and efficient.
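To make this concrete, here is a self-contained sketch of how such a helper can be used. The per-partition function and the column name are made up for illustration (our real feature computation was far heavier), and we split by row positions, which produces the same partitioning as splitting the DataFrame directly:

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

NUM_PARTITIONS = 4  # chosen for illustration; tune to your data and machine
NUM_CORES = 4

def parallelize_dataframe(df, func):
    # Split by row positions, then run func on each chunk in its own process.
    chunks = [df.iloc[idx]
              for idx in np.array_split(np.arange(len(df)), NUM_PARTITIONS)]
    with Pool(NUM_CORES) as pool:
        return pd.concat(pool.map(func, chunks))

# A toy per-partition feature (hypothetical column names): question length.
def question_lengths(partition):
    partition = partition.copy()
    partition["q1_len"] = partition["question1"].str.len()
    return partition

if __name__ == "__main__":
    df = pd.DataFrame({"question1": ["How are you?", "What is AI?"] * 8})
    result = parallelize_dataframe(df, question_lengths)
    print(result.shape)  # (16, 2)
```

Note that the worker function must be defined at module level, so the child processes can pickle and import it.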

We ran the code and... it hung. It turned out that there are problems when you try to use multiprocessing from, e.g., a Jupyter Notebook. Running the code outside of the notebook worked!
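The usual fix (a generic sketch, not our competition code) is to define the worker at module level, ideally in a separate .py file when working from a notebook, and to start the Pool under the standard __main__ guard:

```python
from multiprocessing import Pool

# The worker must live at module level so child processes can import it;
# functions defined only inside a notebook cell often cannot be found
# by the workers.
def square(x):
    return x * x

# The guard keeps child processes from re-executing the Pool setup when
# the module is re-imported, which is one way such code ends up hanging.
if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```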

We spawned the computations on 7 cores (in order not to freeze other programs on the machine), and the new estimate gave us only... 96 hours. This was still too much. Fortunately, there is AWS!

We spawned a single AWS m4.16xlarge instance with 64 vCPUs. This allowed the computation to finish in around 11 hours. Thanks to that, we managed to include the new feature in our final algorithm and jump to a higher position on the competition leaderboard.

Other blogs

If you are looking for an introduction to Machine Learning, take a look at our Machine learning by example presentation, with Natural Language Processing basics and Apache Spark.

If you would like to learn about more advanced Machine Learning topics, also take a look at our other blog posts:
