Deep learning for natural language processing, Part 1
The machine learning revolution leaves no stone unturned. Natural language processing is yet another field that underwent a small revolution thanks to the second coming of artificial neural networks. Let’s just briefly discuss two advances in the natural language processing toolbox made thanks to artificial neural networks and deep learning techniques.
Traditionally, lexical databases like WordNet were the main go-tos whenever one wanted to work on the semantics of words and sentences. Given a word, one could query the relevant synsets and reason about relations between words and how similar words are to each other. Simultaneously, Latent Semantic Analysis had been available for working with the structure of a document, or a sentence.
Those techniques were recently joined by a newer approach: word embeddings. The idea here is to use machine learning techniques (either based on artificial neural networks or on more traditional statistical learning) to create a model that maps each word into a vector of numbers.
One of the most popular word embeddings is Google’s Word2Vec. It is implemented as a two-layer neural network that is trained to predict the context of each word. You can find a pre-trained model (GoogleNews was used as corpus) here:
The Python ecosystem has a great set of libraries for all data science fields. I find Gensim to be a great choice for using pre-trained word2vec models as well as for training your own.
So let’s see how to work with these vectors:
    from gensim.models import KeyedVectors

    # this loads the pre-trained word embeddings model - based on the Google News corpus
    # - this is a 3.4G file
    model = KeyedVectors.load_word2vec_format('input/GoogleNews-vectors-negative300.bin', binary=True)

    # here we fetch the vector for the word 'airplane'
    airplane_vec = model['airplane']

    # let's check the dimensions of this vector
    airplane_vec.shape
    # (300,)
    # this vector has 300 dimensions

    # let's see the first 5 values in this vector
    airplane_vec[:5]
    # array([ 0.16894531, -0.20019531, -0.16113281, 0.11669922, -0.16015625], dtype=float32)

    # let's see another word
    model['helicopter'][:5]
    # array([-0.02050781, -0.08447266, -0.1875 , -0.265625 , -0.03930664], dtype=float32)
As you can see, each word gets mapped into a sizable vector (300 dimensions), and the individual values do not seem to hold much meaning on their own. What exactly are these numbers? In technical terms, in the case of word2vec, they are simply the weights observed in the neural model after it was trained to predict the context of a given word. One can say they are the knowledge the neural network gained from observing how the language is used: a way to describe the use of a given word in the language. Let’s see how we can use this representation.
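To make "predict the context of a given word" a bit more concrete, here is a minimal sketch of how (word, context) training pairs could be generated from a sentence. It assumes the skip-gram flavour of word2vec and a small, arbitrarily chosen window; `context_pairs` is an illustrative helper, not part of Gensim.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for all words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
for target, context in context_pairs(sentence, window=1):
    print(target, "->", context)
```

A network trained on many such pairs has to place words that share contexts close to each other, which is where the vectors get their meaning.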
The super interesting thing about the vectors generated by word embedding models is that the way the meaning of the word is encoded allows for reasoning about it in terms of a simple algebra. To find how similar one word is to another, one just has to compare the direction of vectors.
Cosine similarity is a convenient way to do this. It basically outputs the cosine of the angle between two vectors. The tricky part is that here the space we are using has 300 dimensions:
    from itertools import combinations
    from sklearn.metrics.pairwise import cosine_similarity

    # cosine_similarity expects 2D arrays, so each vector is reshaped into a single row
    def to_vec(w):
        return model[w].reshape(1, -1)

    for a, b in combinations(('airplane', 'helicopter', 'bicycle', 'apple'), 2):
        # the output array holds only a single value, so we take it out directly
        cos_similarity = cosine_similarity(to_vec(a), to_vec(b))[0, 0]
        print(a, b, cos_similarity)

    # airplane helicopter 0.558049
    # airplane bicycle 0.309346
    # airplane apple 0.198847
    # helicopter bicycle 0.243163
    # helicopter apple 0.0966559
    # bicycle apple 0.149487
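For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch in plain Python, with short made-up vectors instead of the 300-dimensional ones; the `cosine` helper is only illustrative and is not the scikit-learn implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # 1.0  - same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0  - orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 - opposite direction
```

The length of the vectors does not matter, only their direction, and the same formula works unchanged in 300 dimensions.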
So how about comparing the meaning of the whole sentence? Quite sensible results are achieved by simply comparing the mean vector for all words in the sentence.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # outputs the average word2vec vector for the words in this sentence
    def average_vec(sentence):
        words = sentence.split()
        word_vecs = [model[w] for w in words]
        return np.mean(word_vecs, axis=0).reshape(1, -1)

    def compare(a, b):
        return cosine_similarity(average_vec(a), average_vec(b))[0, 0]

    compare('Quick fox jumps over dog', 'Fast fox jumps over puppy')
    # 0.89086312 - basically the same

    compare('Quick fox jumps over dog', 'Fast animal jumps over another one')
    # 0.71557522 - quite similar

    compare('Fruit fell from the tree', 'An apple has fallen')
    # 0.51716453 - still significant similarity - even though there is not a single shared word

    compare('Quick fox jumps over dog', 'The judge entered the courtroom')
    # 0.19947506 - completely unrelated sentence still has some similarity, but a low one
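One caveat with this approach: looking up a word that is missing from the model's vocabulary raises a `KeyError`. Here is a minimal sketch of one way to handle that, by skipping unknown words. The `average_vec_safe` helper and the tiny `toy_vectors` dictionary are made up for illustration, so the example runs without the 3.4G file.

```python
def average_vec_safe(sentence, vectors):
    """Mean of the vectors of known words; None if no word is known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# made-up 2-dimensional vectors, for illustration only
toy_vectors = {"fox": [1.0, 0.0], "dog": [0.0, 1.0]}

print(average_vec_safe("fox chases dog", toy_vectors))  # [0.5, 0.5] - 'chases' is skipped
```

With the real model, the same membership test can be written as `w in model`.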
The substructures that appear in the vector space allow for spotting some interesting relations between vectors. When you add or subtract vectors, you can find results that basically… make sense. Gensim again has a nice feature to support such use cases:
    # let's find the two words most similar to the sum of the given words
    def sum_words(words):
        return model.most_similar(positive=words, topn=2)

    sum_words(['bicycle', 'engine'])
    # [('bike', 0.7187385559082031), ('scooter', 0.654274582862854)]

    # and what will happen if we start with king, subtract man and add woman:
    model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    # [('queen', 0.7118192315101624)]
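The arithmetic behind such queries can be sketched with a handful of hand-made vectors. Everything below is an assumption made for illustration: the three dimensions (roughly "royalty", "male", "female"), the values, and the `analogy` helper. Gensim's implementation differs in its details; for example, it works with length-normalized vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative):
    """Word closest to sum(positive) - sum(negative), ignoring the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# hand-made vectors: (royalty, male, female) - illustration only
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

print(analogy(toy, positive=["king", "woman"], negative=["man"]))  # queen
```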
How to further leverage this vector representation? Let’s get back to the sentence or document level and discuss how these vectors could be fed into an artificial neural network.
LSTM - sentence as a time series
Consider this: is a sentence just a set of words - or maybe the order matters?
The quick brown fox jumps over the lazy dog
The quick brown dog jumps over the lazy fox
Surely it does - simply changing the order of words can create a completely different meaning, sometimes the opposite of the original one. So, to really understand the semantics of a sentence, simply averaging the meaning of words - as we did earlier - is not enough.
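The limitation is easy to demonstrate: an average does not depend on the order of its inputs, so both sentences above collapse into exactly the same vector. A minimal sketch, using made-up two-dimensional vectors in place of real embeddings:

```python
# made-up vectors (word length, code of first letter) - illustration only
WORDS = "the quick brown fox jumps over lazy dog".split()
toy_vectors = {w: [float(len(w)), float(ord(w[0]))] for w in WORDS}


def mean_vector(sentence):
    words = sentence.lower().split()
    return [sum(toy_vectors[w][d] for w in words) / len(words) for d in range(2)]


first = "The quick brown fox jumps over the lazy dog"
second = "The quick brown dog jumps over the lazy fox"
print(mean_vector(first) == mean_vector(second))  # True - word order is lost
```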
Unfortunately, once one gets into the business of analyzing the structure of a sentence, things get quite complicated. Tagging parts of the sentence, finding the predicate and the subject, and analyzing the dependency structure are all hard problems in their own right.
Deep learning techniques put forth the following proposal to address these issues: what if we think about the sentence as a time series or a temporal structure? This means that, when we consider the meaning of a given word, we “remember” what the previous word was. At the same time, we have some “memory” of words that could occur much earlier in the sentence.
Long short-term memory (LSTM) is well suited to implementing this idea. LSTM is a special kind of recurrent neural network. This means there are connections between the preceding (looking from the perspective of the network’s input shape) and the following neurons, which allow the output of a “previous” neuron to be fed into the “next” neuron. LSTM builds on this idea of recurrent networks and adds a more elaborate mechanism that allows the use of both recent and “older” (generated much earlier in the sequence) data.
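To give a feeling for that mechanism, here is a minimal sketch of a single LSTM unit in its standard formulation, simplified to scalar inputs and states. The weights are arbitrary illustrative values; in a real layer they are matrices learned during training, and all names below are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# (input weight, recurrent weight, bias) per gate - arbitrary illustrative values
FORGET = (0.5, 0.1, 1.0)
INPUT = (0.6, 0.2, 0.0)
CANDIDATE = (1.0, 0.3, 0.0)
OUTPUT = (0.4, 0.1, 0.0)


def pre_activation(params, x, h):
    w_x, w_h, b = params
    return w_x * x + w_h * h + b


def lstm_step(x, h, c):
    """One time step: returns the new hidden state h and cell state c."""
    f = sigmoid(pre_activation(FORGET, x, h))       # how much old memory to keep
    i = sigmoid(pre_activation(INPUT, x, h))        # how much new information to write
    g = math.tanh(pre_activation(CANDIDATE, x, h))  # the new information itself
    o = sigmoid(pre_activation(OUTPUT, x, h))       # how much memory to expose
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c


def run(sequence):
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c)
    return h


print(run([1.0, -1.0]))
print(run([-1.0, 1.0]))  # same inputs in a different order give a different result
```

The cell state `c` plays the role of the long-term memory, while the gates decide at every step what to forget, what to store and what to pass on.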
When you construct a network with Keras, adding LSTM capabilities is just a matter of wiring a layer of type “LSTM”. Below is an annotated snippet of code constructing a network working on word embeddings and using LSTM to learn from sentence structure. By default Keras will use TensorFlow under the hood.
    from gensim.models import KeyedVectors
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.layers import Dense, Input, LSTM, Embedding
    from keras.models import Model
    import numpy as np

    # first we need to tokenize our text, turning a sequence of words into
    # a sequence of numbers
    vocabulary_size = 10000  # number of supported words, size of our vocabulary
    MAX_SENTENCE_LENGTH = 10
    texts = ['This is our super simple',
             'Corpus of texts we will process',
             'IRL you would load this data from somewhere']
    tokenizer = Tokenizer(num_words=vocabulary_size)
    tokenizer.fit_on_texts(texts)  # we fit the tokenizer on the texts we will process
    sequences = tokenizer.texts_to_sequences(texts)  # here the conversion to tokens happens
    word_index = tokenizer.word_index

    # let's pad these sequences so all have equal size
    data = pad_sequences(sequences, maxlen=MAX_SENTENCE_LENGTH)

    # let's use pre-trained word2vec again
    word2vec = KeyedVectors.load_word2vec_format(
        '../input/GoogleNews-vectors-negative300.bin', binary=True)
    VECTOR_DIMENSION = 300

    # word to vec - maps word id (from tokenizer) into vector space
    embedding_matrix = np.zeros((vocabulary_size, VECTOR_DIMENSION))
    # fill this matrix with values from pre-trained word2vec
    for word, i in word_index.items():
        # skip ids beyond the vocabulary limit and words unknown to word2vec
        if i < vocabulary_size and word in word2vec:
            embedding_matrix[i] = word2vec[word]

    embedding_layer = Embedding(
        vocabulary_size,   # how many words are mapped into vectors
        VECTOR_DIMENSION,  # size of output vector (the pre-trained model has vectors of 300 values)
        weights=[embedding_matrix],  # we initialize weights from the pre-trained model
        input_length=MAX_SENTENCE_LENGTH,  # how many words in the sentence we process
        trainable=False)   # we will not update this layer

    lstm_output_size = 30
    lstm_layer = LSTM(lstm_output_size)  # number of outputs

    # the input takes a padded sequence of word ids
    sentence_input = Input(shape=(MAX_SENTENCE_LENGTH,), dtype='int32')
    embedded_sentence = embedding_layer(sentence_input)
    lstm_output = lstm_layer(embedded_sentence)

    # you add all deep layers here - let's say we have a single one
    size_of_dense = 10
    deep_layer = Dense(size_of_dense, activation='sigmoid')(lstm_output)

    # and now let's assume we have an output layer for a binary classification task:
    prediction = Dense(1, activation='sigmoid')(deep_layer)

    model = Model(inputs=[sentence_input], outputs=prediction)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    # and now the model is ready to train
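The preprocessing at the top of that snippet may be the least obvious part, so here is a rough plain-Python sketch of what tokenizing and padding do. It only mimics what I understand to be the default behaviour of the Keras utilities (lowercasing, ids starting at 1 with the most frequent word first, padding and truncating at the front); `build_word_index` and `pad_left` are hypothetical helpers, and the real `Tokenizer` also strips punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def pad_left(sequence, length, value=0):
    """Left-pad with `value`, or keep only the last `length` ids if too long."""
    sequence = sequence[-length:]
    return [value] * (length - len(sequence)) + sequence


texts = ['This is our super simple', 'Corpus of texts we will process']
word_index = build_word_index(texts)
sequences = [[word_index[w] for w in text.lower().split()] for text in texts]
data = [pad_left(seq, 8) for seq in sequences]
print(data)
```

Id 0 is reserved for padding, which is why the corresponding row of the embedding matrix stays filled with zeros.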
This code constructs the following network:
So at this point our network is ready to start training.
I hope that after this post you have a good grasp of the idea behind word embeddings and how it can be combined with a long short-term memory layer. Thanks to libraries like Gensim and Keras, it is quite easy to start using these techniques with just a couple of lines of Python code.
In Part 2, I will show you how to leverage these to tackle a real-life problem.