Recently we have worked on the Kaggle Quora duplicate questions competition. In short, the challenge was to implement the best algorithm for finding out if given pair of questions is considered as duplicates or no. Eventually, our solution ended up among the
top 6%. While solving the task, we have learned many things, which we would like now to share with you.
A few days before the competition end date, we got a new idea to try. We wanted to add yet another feature to our already quite complex model (definitely a subject for another post) and check if it improves the score. Unfortunately, after preparing and running the code, it turned out that it was awfully slow. The first estimation gave us a maximum of 2 question pairs checked per second, where the train & test datasets contained over 2,7 million pairs in total. Since such a calculation would not finish before the competition end, we needed to look for ways to speed up the process.
In each iteration, there were quite complex calculations taking place, with their execution going through multiple libraries. And while we had a feeling that the code on this path was quite often suboptimal, we were not really in a position to rewrite everything from scratch. Simultaneously, there was a much bigger problem - the code was running in a single thread! On a
i7 4770 with 4 physical cores and Hyper-Threading (together 8 logical cores), only 1 was utilized.
We made some attempts at parallelization. Every question pair could be analyzed separately. We have tried some standard approaches like: joblib:
>>> from joblib import Parallel, delayed >>> Parallel(n_jobs=1)(delayed(foo)(i) for i in data)
Considering the problems we have encountered so far, we needed to think of other ways to solve our problems. One of the ideas was to manually split our datasets to separate files, and manually run separate Python processes. However, as lazy programmers, we have found this way troublesome, not easily scalable and prone to error. Fortunately, we have found another way.
Pool allows to automatically perform an operation in multiple processes.
def parallelize_dataframe(df, func): df_split = np.array_split(df, num_partitions) pool = Pool(num_cores) result_df = pd.concat(pool.map(func, df_split)) pool.close() pool.join() return result_df
This simple piece of code allows to run a specific function on DataFrame’s elements using a specified number of cores. It splits data to a number of partitions, spawns
Pool, executes code on every partition and joins results. Simple and efficient.
We have run the code and... it hung. It appeared that there are problems when you try to use
pool.map from e.g. a Jupyter Notebook. Running the code outside of the notebook worked!
We have spawn the computations on 7 cores (in order not to freeze other programs on the machine) and the new estimation gave us only... 96 hours. This was still too much. Fortunately there is AWS!
We have spawned a single AWS
m4.16xlarge instance with 64 cores. This allowed the computation to finish in around 11 hours. Thanks to that, we managed to include the new feature in our final algorithm and jump to a higher position in the competition leaderboard.
If you would like to learn about more advanced Machine Learning topics, also take a look at our other blog posts: