Deep learning for natural language processing, Part 2
In Part 1, I wrote about two useful advances in natural language processing: word embeddings (models that allow one to transform words into vectorised form while preserving semantic information) and recurrent neural networks (specifically the LSTM model). Here let's dive a little deeper and show a specific use case.
Quora on Kaggle
Kaggle competitions are highly competitive. One can usually expect state-of-the-art techniques to be used when tackling the problems served by the organisers. Quora's request to detect duplicate questions - a competition run a couple of months ago - was definitely one of those challenges where a set of top-notch data scientists employed quite sophisticated models.
Quora asked participants to help classify whether a pair of questions are duplicates of each other. For example:
"Why do rockets look white?", "Why are rockets and boosters painted white?"
If questions are duplicates they can get merged into a single one, de-cluttering the platform. Originally Quora used human moderators to achieve that, so the business value of improving an automated model is obvious.
Word embeddings for the win
Softwaremill’s team managed to finish in the top 6% of the leaderboard. Our final solution was an ensemble of models using dozens of features - as often happens with high-ranking solutions on Kaggle - but here I want to focus only on the deep learning and word embeddings parts, to relate back to Part 1.
A significant portion of our feature set was calculated on vectorisations of the questions. Once a word or a sentence of words is converted into vectors, one can do a lot of useful algebra on them. For the Kaggle competition we were mostly interested in checking whether specific collections of vectors have a “similar meaning”, which in the geometric world effectively means that they “point in a similar direction”.
One way to calculate similarity between vectors is to use cosine similarity. A sensible way to compare the meaning of two sentences is to calculate the angle between the averages of the word vectors of each sentence (as shown in Part 1 [link]). But in Kaggle’s Quora Question Pairs it was useful to generate many more features by applying specific transformations to the vectors, for example:
- use a weighted (by inverse frequency) average of the vectors in a sentence rather than a simple average - building on the intuition that less frequent words are often more significant to the sense of the whole sentence
- compare directions of vectors for matching parts of speech (so e.g. compare vectorised noun chunks with each other, vectorised verbs with each other, etc.)
- use only the top N similarities
- calculate similarities between generated n-grams
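To make the first of these transformations concrete, here is a minimal sketch of an inverse-frequency-weighted sentence vector compared with cosine similarity. The toy 2-d vectors and frequency counts are invented purely for illustration, not taken from our actual word2vec model:

```python
import numpy as np

def weighted_sentence_vector(words, vectors, frequency):
    """Average word vectors weighted by inverse corpus frequency,
    so rarer (and often more meaningful) words contribute more."""
    weighted = [vectors[w] / frequency[w] for w in words if w in vectors]
    return np.mean(weighted, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d "embeddings" and frequencies, just to illustrate the shapes
vectors = {"rocket": np.array([1.0, 0.2]), "white": np.array([0.3, 1.0]),
           "booster": np.array([0.9, 0.3]), "why": np.array([0.5, 0.5])}
frequency = {"rocket": 2, "white": 3, "booster": 1, "why": 10}

v1 = weighted_sentence_vector(["why", "rocket", "white"], vectors, frequency)
v2 = weighted_sentence_vector(["why", "booster", "white"], vectors, frequency)
print(cosine_similarity(v1, v2))
```

Note how the very frequent "why" barely moves the sentence vector, while the rare "booster" dominates it - exactly the intuition behind the weighting.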
Another useful measure for comparing vectors is the Euclidean distance between their endpoints - the smaller the distance, the more similar the vectors. This metric can again be applied in all the approaches mentioned above for cosine similarity.
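The Euclidean variant is a one-liner - a sketch on toy vectors:

```python
import numpy as np

def euclidean_distance(a, b):
    """Distance between the endpoints of two vectors: smaller = more similar."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

print(euclidean_distance([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 0.0
print(euclidean_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal unit vectors -> sqrt(2)
```

Unlike cosine similarity, this measure is sensitive to vector length, not only direction, which is why the two metrics gave us distinct features.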
Another way of calculating distance between collections of vectors that we used was Word Mover’s Distance, which is the minimal total distance the word vectors of one sentence have to be moved to transform it into the other sentence.
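In practice one would typically reach for a library implementation (e.g. gensim exposes this on its word2vec models), since the full metric solves an optimal-transport problem. As a sketch of the idea, here is the *relaxed* variant - a known lower bound where each word simply travels to its nearest counterpart - using the same invented toy vectors as before:

```python
import numpy as np

def relaxed_wmd(sent_a, sent_b, vectors):
    """Relaxed Word Mover's Distance: each word in sentence A 'travels'
    to its nearest word in sentence B; average those minimal travels.
    (The full WMD solves an optimal-transport problem instead.)"""
    costs = []
    for wa in sent_a:
        dists = [np.linalg.norm(vectors[wa] - vectors[wb]) for wb in sent_b]
        costs.append(min(dists))
    return float(np.mean(costs))

vectors = {"rocket": np.array([1.0, 0.2]), "booster": np.array([0.9, 0.3]),
           "white": np.array([0.3, 1.0])}
# Sentences whose words sit close together in embedding space
# need very little "moving", so the distance stays small.
print(relaxed_wmd(["rocket", "white"], ["booster", "white"], vectors))
```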
So these techniques, built on top of the vectorised representation (we used only word2vec, but other embeddings could be used as well), gave us a number of features that we coupled with some more “traditional” ones like edit distances, word frequencies etc. Now on to neural networks.
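For completeness, one of those "traditional" features - edit distance - can be sketched with the classic Levenshtein dynamic programme (a generic textbook version, not our exact implementation):

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions turning string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # the textbook example: 3
```

Applied to whole question strings (or their normalised forms), such distances complement the embedding-based features with purely lexical signals.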
LSTM on sentences
As explained in Part 1, a sentence can be seen as a time series of words, and this perspective fits nicely with the long short-term memory model, a special case of recurrent neural network. For the Quora challenge, both evaluated sentences were fed through a shared LSTM layer and the outputs merged into a single layer.
embedding_layer = Embedding(nb_words, EMBEDDING_DIM,
                            weights=[embedding_matrix],  # loaded from pre-trained word2vec model
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # we use it only to look up vector values, do not train it
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm,
                  recurrent_dropout=rate_drop_lstm)  # LSTM layer prototype, shared by both branches

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')  # input for first sentence
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')  # input for second sentence
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

merged = concatenate([x1, y1])
One trick that proved useful in the competition was to run the neural network model with the LSTM layers twice, swapping which sentence is fed into which input, and then averaging the two predictions.
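The idea of the trick can be sketched generically; `symmetric_predict` and the toy `predict` callable below are hypothetical stand-ins for the trained Keras model:

```python
import numpy as np

def symmetric_predict(predict, q1, q2):
    """Average the model's output over both input orderings, so the
    prediction no longer depends on which question comes first."""
    return (predict(q1, q2) + predict(q2, q1)) / 2.0

# Toy stand-in for a trained model that happens to be order-sensitive
toy = lambda a, b: float(np.dot(a, b)) + 0.1 * a[0]
q1, q2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(symmetric_predict(toy, q1, q2))
```

Since the "is duplicate" relation is symmetric, the averaged prediction is the more natural estimate and tends to be a bit more stable.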
So how did we generate predictions using the features from word embeddings (coupled with the more traditional ones) and the LSTM? The - let’s call them simple, even though they really are not - features were originally used as inputs to a decision-tree-based model trained using xgboost. The LSTM, based on word2vec representations of the sentences, was coupled with additional hidden neural layers to produce a deep learning model. We then combined the two with a simple ensembler generating a weighted average of predictions.
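The ensembling step itself is as simple as it sounds - a sketch with invented prediction values, where the weight would in practice be tuned on a hold-out set:

```python
import numpy as np

def ensemble(xgb_preds, nn_preds, w=0.5):
    """Weighted average of two models' predicted probabilities.
    The weight w is a hyperparameter, tuned on validation data."""
    return w * np.asarray(xgb_preds) + (1 - w) * np.asarray(nn_preds)

xgb_preds = [0.9, 0.2, 0.6]  # boosted-trees probabilities (toy values)
nn_preds = [0.7, 0.4, 0.8]   # deep-learning probabilities (toy values)
print(ensemble(xgb_preds, nn_preds, w=0.6))
```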
However, at this point in the competition a data leak was identified. A data leak is information that is predictive of the target variable but is present only because of errors in data selection or competition design, and is not really useful for tackling the problem in the real world. It meant that any model that did not use this information would perform significantly worse than the ones that did, so this feature had to be included in our LSTM-based model. Going further, we observed that feeding the rest of the features into the deep learning model significantly improved its performance, while still keeping the simple-features-only model trained with a different algorithm (boosted trees) and ensembling the two produced superior results overall. That’s basically where some magic and a lot of trial and error happens in machine learning 😉
So eventually the solution looked as follows:
There are two separate models, both fed with a set of common features, while Model B additionally gets the raw sentences, which are mapped via word2vec and fed into the LSTM part of the model.
In Keras code it looked as follows - a third input, “xf_input”, appears:
embedding_layer = Embedding(nb_words, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm,
                  recurrent_dropout=rate_drop_lstm)

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

xf_input = Input(shape=(ADDITIONAL_FEATURES_NUM,))  # here come the additional features we used
xf_hidden = Dense(25, activation='relu')(xf_input)
xf_branch = Dropout(rate_drop_dense)(xf_hidden)
xf_branch = BatchNormalization()(xf_branch)

merged = concatenate([x1, y1])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)
merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
lstm_branch = BatchNormalization()(merged)

joined = concatenate([lstm_branch, xf_branch])
joined = Dense(15, activation='sigmoid')(joined)
joined = Dropout(low_drop)(joined)
joined = Dense(5, activation='sigmoid')(joined)
preds = Dense(1, activation='sigmoid')(joined)
It can be visualised as:
So, as you can see, in practical applications - especially when every small difference in the final classification score counts, as during Kaggle challenges - the details get a little complicated. But while participating in this challenge we definitely found word embeddings and LSTMs super useful for natural language processing, and they were the bedrock of our solution.