Deep learning for natural language processing, Part 1

The machine learning revolution leaves no stone unturned. Natural language processing is yet another field that underwent a small revolution thanks to the second coming of artificial neural networks. Let’s just briefly discuss two advances in the natural language processing toolbox made thanks to artificial neural networks and deep learning techniques.

Word embeddings

Traditionally, lexical databases like WordNet were the main go-tos whenever one wanted to work on the semantics of words and sentences. Given a word, one could query the relevant synsets and reason about relations between words and how similar words are to each other. Simultaneously, Latent Semantic Analysis had been available for working with the structure of a document, or a sentence.

Those techniques were recently joined by a newer approach: word embeddings. The idea here is to use machine learning techniques (either based on artificial neural networks or on more traditional statistical learning) to create a model that maps each word into a vector of numbers.

One of the most popular word embeddings is Google’s Word2Vec. It is implemented as a two-layer neural network that is trained to predict the context of each word. You can find a pre-trained model (GoogleNews was used as corpus) here: https://code.google.com/archive/p/word2vec/

The Python ecosystem has a great set of libraries for all data science fields. I find Gensim to be a great choice for using pre-trained word2vec models as well as for training your own. So let’s see how to work with these vectors:

from gensim.models import KeyedVectors

# this loads pre-trained word embeddings model - based on Google News corpus - this is a 3.4G file
model = KeyedVectors.load_word2vec_format('input/GoogleNews-vectors-negative300.bin', binary=True)

# here we fetch the vector for the word 'airplane'
airplane_vec = model.word_vec('airplane')

# let's check the dimensions of this vector
airplane_vec.shape
# (300,)
# this vector has 300 dimensions

# let's see the first 5 values in this vector
airplane_vec[:5]
# array([ 0.16894531, -0.20019531, -0.16113281,  0.11669922, -0.16015625], dtype=float32)

# let's see another word
# array([-0.02050781, -0.08447266, -0.1875    , -0.265625  , -0.03930664], dtype=float32)

As you can see, each word is mapped to a sizable vector (300 dimensions), and the individual values do not seem to hold much meaning. What exactly are these numbers? In technical terms, in the case of word2vec, they are simply the weights found in the neural model after it was trained to predict the context of a given word. One can say they capture the knowledge the network gained from observing how the language is used: they describe how a given word is used in the language. Let’s see how we can use this representation.
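To make the remark about weights a bit more concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the 3x2 weight matrix are made up for illustration (the real model has a huge vocabulary and 300 columns), and the helper names are hypothetical. It shows that feeding a one-hot encoded word through the first layer simply selects one row of the weight matrix, and that row is what we call the word's vector.

```python
# Toy illustration: a word vector is one row of the first-layer weight matrix.
# The vocabulary and the weights are invented for this sketch.
vocab = ['airplane', 'bicycle', 'apple']
weights = [
    [0.2, -0.1],   # row for 'airplane'
    [0.4, 0.3],    # row for 'bicycle'
    [-0.5, 0.9],   # row for 'apple'
]

def one_hot(word):
    """Vector of zeros with a single 1 at the word's position in vocab."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(word):
    """Multiply the one-hot input by the weight matrix (input -> hidden)."""
    x = one_hot(word)
    n_dims = len(weights[0])
    return [sum(x[i] * weights[i][d] for i in range(len(vocab)))
            for d in range(n_dims)]

bicycle_vec = hidden_layer('bicycle')
print(bicycle_vec)  # the same numbers as the 'bicycle' row of weights
```

This is why a trained model can be shipped as a plain lookup table from words to vectors.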

Vectors' "magic"

The super interesting thing about the vectors generated by word embedding models is that the way the meaning of the word is encoded allows for reasoning about it in terms of a simple algebra. To find how similar one word is to another, one just has to compare the direction of vectors. Cosine similarity is a convenient way to do this. It basically outputs the cosine of the angle between two vectors. The tricky part is that here the space we are using has 300 dimensions:

from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2-d arrays, so we reshape each vector into a single row
to_vec = lambda w: model.word_vec(w).reshape(1,-1)

for a,b in combinations(('airplane','helicopter','bicycle','apple'),2):
    cos_similarity = cosine_similarity(to_vec(a),to_vec(b))[0][0] # the output array holds a single value
    print(a, b, cos_similarity)

# airplane helicopter 0.558049
# airplane bicycle 0.309346
# airplane apple 0.198847
# helicopter bicycle 0.243163
# helicopter apple 0.0966559
# bicycle apple 0.149487
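If you are curious what happens inside the similarity call, the cosine of the angle between vectors u and v is their dot product divided by the product of their lengths. Below is a minimal sketch in plain Python on toy two-dimensional vectors, so it runs without the 3.4G model. The function name `cosine` is just an illustrative choice, and the same formula applies unchanged to 300 dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite directions
print(same_direction, orthogonal, opposite)       # roughly 1, 0 and -1
```

Note that the length of the vectors does not matter here, only their direction.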

So how about comparing the meaning of the whole sentence? Quite sensible results are achieved by simply comparing the mean vector for all words in the sentence.

import numpy as np

#outputs the average word2vec for words in this sentence
def average_vec(sentence):
    words = sentence.split()
    word_vecs = [model.word_vec(w) for w in words]
    return (np.array(word_vecs).sum(axis=0)/len(word_vecs)).reshape(1,-1)

compare = lambda a,b: cosine_similarity(average_vec(a),average_vec(b)).sum()

compare('Quick fox jumps over dog','Fast fox jumps over puppy')
# 0.89086312 - basically the same

compare('Quick fox jumps over dog','Fast animal jumps over another one')
# 0.71557522 - quite similar

compare('Fruit fell from the tree','An apple has fallen')
# 0.51716453 - still significant similarity - even though there is not a single shared word

compare('Quick fox jumps over dog','The judge entered the courtroom')
# 0.19947506 - completely unrelated sentence still has some similarity, but a low one

The substructures that appear in the vector space allow for spotting some interesting relations between vectors. When you add or subtract vectors, you can find results that basically… make sense. Gensim again has a nice feature to support such use cases:

# let's find two most similar to the sum of given words 
def sum_words(words):
    return model.most_similar(positive=words,topn=2)

# [(u'bike', 0.7187385559082031), (u'scooter', 0.654274582862854)]

# And what will happen if we start with king, subtract man and add woman:
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# [(u'queen', 0.7118192315101624)]
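The arithmetic behind such a query can be sketched on a toy example. The two-dimensional vectors below are invented by hand (think of the first value as "royalty" and the second as "gender"), and the `analogy` helper is a hypothetical name, not Gensim's implementation. It adds the positive vectors, subtracts the negative ones and returns the closest remaining word by cosine similarity, skipping the query words themselves.

```python
import math

# Hand-made 2-d vectors: first value ~ "royalty", second ~ "gender".
# These numbers are invented for illustration only.
toy_vectors = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones and return
    the closest word that was not part of the query."""
    target = [0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, toy_vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, toy_vectors[w])]
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

result = analogy(positive=['king', 'woman'], negative=['man'])
print(result)  # queen
```

In the real model no dimension has such a clean meaning, but directions in the space still behave in a similar way.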

How to further leverage this vector representation? Let’s get back to the sentence or document level and discuss how these vectors could be fed into an artificial neural network.

LSTM - sentence as a time series

Consider this: is a sentence just a set of words - or maybe the order matters?

The quick brown fox jumps over the lazy dog
The quick brown dog jumps over the lazy fox

Surely it does - simply changing the order of words can create a completely different meaning, sometimes the opposite of the original one. So, to really understand the semantics of a sentence, simply averaging the meaning of words - as we did earlier - is not enough.
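A quick sketch shows why averaging cannot tell these two sentences apart. The toy two-dimensional vectors are invented for illustration (real ones would come from word2vec), and `average_vec` here is a simplified plain-Python variant of the helper used earlier.

```python
# Toy 2-d vectors, invented for this sketch (real ones come from word2vec).
toy_vectors = {
    'the': [0.0, 0.5], 'quick': [1.0, 0.0], 'brown': [0.5, 0.5],
    'fox': [2.0, 1.0], 'jumps': [0.0, 2.0], 'over': [0.5, 0.0],
    'lazy': [-1.0, 0.0], 'dog': [1.0, 2.0],
}

def average_vec(sentence):
    """Mean of the word vectors; word positions are never used."""
    vecs = [toy_vectors[w] for w in sentence.lower().split()]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

a = average_vec('The quick brown fox jumps over the lazy dog')
b = average_vec('The quick brown dog jumps over the lazy fox')
print(a == b)  # True: both sentences collapse to the same vector
```

Any model that should notice who jumps over whom has to look at the words in order.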

Unfortunately, once one gets into the business of analyzing the structure of a sentence, things get quite complicated. Tagging parts of the sentence, finding the predicate and the subject, and analyzing the dependency structure are all difficult tasks.

Deep learning techniques put forth the following proposal to address these issues: what if we think about the sentence as a time series or a temporal structure? This means that, when we consider the meaning of a given word, we “remember” what the previous word was. At the same time, we have some “memory” of words that could occur much earlier in the sentence.

Long short-term memory (LSTM) is well suited to implementing this idea. LSTM is a special case of a recurrent neural network, which means there are connections between the preceding and the following neurons (looking from the perspective of the network’s input shape). These connections allow the output of a “previous” neuron to be fed into the “next” neuron. LSTM builds on this idea of recurrent networks and adds a more complicated mechanism that allows the use of both recent and “older” data (generated much earlier in the sequence).
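To give a feel for that mechanism, below is a minimal plain-Python sketch of a single LSTM cell using the standard gate equations. It is an illustration under assumptions, not the code a library runs: the weights are random and untrained, the sizes are tiny, and all names (`make_gate`, `lstm_step`, `run_lstm`) are hypothetical. The cell state `c` plays the role of the long-term “memory”, while the hidden state `h` is what gets passed on to the next step.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_gate(rng, n_in, n_hidden):
    """Random (untrained) weights for one gate: input weights W,
    recurrent weights U and bias b."""
    return {
        'W': [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        'U': [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_hidden)],
        'b': [0.0] * n_hidden,
    }

def preactivation(gate, x, h):
    """W.x + U.h + b for every hidden unit."""
    return [sum(w * xi for w, xi in zip(gate['W'][k], x)) +
            sum(u * hi for u, hi in zip(gate['U'][k], h)) +
            gate['b'][k]
            for k in range(len(gate['b']))]

def lstm_step(params, x, h, c):
    """One time step: read input x, update cell state c and hidden state h."""
    i = [sigmoid(z) for z in preactivation(params['input'], x, h)]
    f = [sigmoid(z) for z in preactivation(params['forget'], x, h)]
    o = [sigmoid(z) for z in preactivation(params['output'], x, h)]
    g = [math.tanh(z) for z in preactivation(params['candidate'], x, h)]
    # cell state: keep part of the old memory, add part of the new candidate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    # hidden state: the part of the memory exposed to the next step
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, sequence, n_hidden):
    """Feed the vectors one by one and return the final hidden state."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

rng = random.Random(0)
n_in, n_hidden = 2, 3
params = {name: make_gate(rng, n_in, n_hidden)
          for name in ('input', 'forget', 'output', 'candidate')}

# toy 2-d "word vectors", invented for this sketch
fox, dog, jumps = [2.0, 1.0], [1.0, 2.0], [0.0, 2.0]
h_fox_first = run_lstm(params, [fox, jumps, dog], n_hidden)
h_dog_first = run_lstm(params, [dog, jumps, fox], n_hidden)
print(h_fox_first)
print(h_dog_first)  # differs from the line above: word order matters
```

Unlike the averaged vector, the final hidden state depends on the order in which the words were seen. Training then adjusts the gate weights so that this state carries whatever the task needs.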

When you construct a network with Keras, adding LSTM capabilities is just a matter of wiring a layer of type “LSTM”. Below is an annotated snippet of code constructing a network working on word embeddings and using LSTM to learn from sentence structure. By default Keras will use TensorFlow under the hood.

from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding
from keras.models import Model
import numpy as np

# first we need to tokenize our text, turning sequence of words into
# sequence of numbers

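vocabulary_size = 10000 # number of supported words, size of our vocabulary
MAX_SENTENCE_LENGTH = 20 # assumed value: every sentence is padded or cut to this many words
VECTOR_DIMENSION = 300 # the pre-trained word2vec vectors have 300 values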

texts = ['This is our super simple',
         'Corpus of texts we will process',
         'IRL you would load this data from somewhere']

tokenizer = Tokenizer(vocabulary_size)
tokenizer.fit_on_texts(texts) # we fit tokenizer on texts we will process
sequences = tokenizer.texts_to_sequences(texts) # here the conversion to tokens happens

word_index = tokenizer.word_index

# let's pad these sequences so all have equal size
data = pad_sequences(sequences, maxlen=MAX_SENTENCE_LENGTH)

# let's use pre-trained word2vec again
word2vec = KeyedVectors.load_word2vec_format('../input/GoogleNews-vectors-negative300.bin', binary=True)

embedding_matrix = np.zeros((vocabulary_size, VECTOR_DIMENSION)) #word to vec - maps word id (from tokenizer) into vector space

# fill this matrix with values from pre-trained word2vec
for word, i in word_index.items():
    # skip ids that do not fit in the matrix and words unknown to word2vec
    if i < vocabulary_size and word in word2vec:
        embedding_matrix[i] = word2vec.word_vec(word)

embedding_layer = Embedding(
        vocabulary_size, # how many words are mapped into vectors
        VECTOR_DIMENSION, # size of output vector dimension (we use pre-trained model with vectors of 300 values)
        weights=[embedding_matrix], # we initialize weight from pre-trained model
        input_length=MAX_SENTENCE_LENGTH, # how many words in the sentence we process
        trainable=False) # we will not update this layer

lstm_output_size = 30
lstm_layer = LSTM(
    lstm_output_size) # number of outputs

sentence_input = Input(shape=(MAX_SENTENCE_LENGTH,), dtype='int32') # the input takes a padded sequence of word ids
embedded_sentence = embedding_layer(sentence_input)
lstm_output = lstm_layer(embedded_sentence)

# you add all deep layers here - let's say we have a single one
size_of_dense = 10
deep_layer = Dense(
    size_of_dense, activation='relu')(lstm_output) # relu is an assumed choice of activation

# and now let's assume we have output layer for binary classification task:
prediction = Dense(1, activation='sigmoid')(deep_layer)

model = Model(inputs=[sentence_input], outputs=[prediction])

# and now the model is ready to train

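This code constructs the following network: an input of word ids, a frozen embedding layer, an LSTM layer with 30 outputs, a dense layer with 10 units and a single sigmoid output.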

So at this point our network is ready to start training.
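Going back to the first steps of the snippet, it may help to see what tokenizing and padding do to the text. The sketch below is a rough plain-Python imitation, not Keras's implementation: it ignores punctuation handling, the helper names are hypothetical, and the sentence length of 6 is an assumed value picked for the example. Words are replaced by integer ids (more frequent words get lower ids, and 0 is reserved for padding), and every sequence is then left-padded with zeros or cut from the front to a fixed length.

```python
from collections import Counter

def fit_word_index(texts):
    """Give every word an integer id starting at 1; frequent words get
    lower ids. Id 0 is kept free for padding."""
    counts = Counter(w for text in texts for w in text.lower().split())
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ordered)}

def texts_to_sequences(texts, word_index, vocabulary_size):
    """Replace words by ids, dropping unknown words and ids outside the vocabulary."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < vocabulary_size]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids."""
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]

texts = ['This is our super simple',
         'Corpus of texts we will process',
         'IRL you would load this data from somewhere']

word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, vocabulary_size=10000)
data = pad(sequences, maxlen=6)  # 6 is an assumed sentence length for this example
for row in data:
    print(row)
```

Each row of `data` now has the same length, which is what the input layer of the network expects.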

I hope that after this post you have a good grasp of the idea of word embeddings and of how it can be combined with a long short-term memory layer. Thanks to libraries like Gensim and Keras, it is quite easy to start using these techniques with just a couple of lines of Python code. In Part 2, I will show you how to leverage these to tackle a real-life problem.