Introduction to Word2Vec & How it Works

In this post, I give an introduction to Word2Vec. We will first look at what Vector Space Models are and then dive into word embeddings. The aim of this post is to understand the Word2Vec model, a neural word embedding model that has become very popular in recent years. So let’s get started.

In this post, we use the term ‘Word Embeddings’ to refer to dense representations of words in a low-dimensional vector space. ‘Word Vectors’ and ‘Distributed Representations’ mean the same thing.

1. Vector Space Model

To understand word embeddings and Word2Vec, we first need to understand some baseline models, starting with the Vector Space Model. Vector Space Models represent words in a continuous vector space in which semantically similar words are mapped close to each other. They have been used for distributional semantics since the 1990s. Some of the older models for estimating continuous representations of words are Latent Semantic Analysis and Latent Dirichlet Allocation. Over time, newer and better models have appeared, both for building representations and for finding similarities between them. Whatever the model, almost all of them rest on the same idea: words that appear in the same contexts share semantic meaning.

Let’s first understand what representations mean and how the notion of relatedness works in vector space models.

1.1 Representation

In a Vector Space Model, we can represent basically anything in the form of a vector. Say, for example, we choose to represent a sentence as a vector. The vector can then be a huge N-dimensional vector that records which words appear in that sentence.

Fig. Vector Representation of Sentence

The size of the vector equals the size of the vocabulary. As the figure above shows, the sentence vector is built from the words that appear in the sentence. Every word in the vocabulary is given a fixed position in the vector. So if the word “hello” appears in the sentence, the bit at the position of “hello” is set to “1”.

And this is just one way of representing a sentence. Another way is to keep a count of the words. For example, if the word “hello” appears twice in a given sentence, the value at the position of “hello” in the vector will be 2.

We can also represent a word using a vector. For example, we can use a vector to record whether the word appears in each document or not. Each element of the vector now represents a document, and if a bit is “1”, the word appears in that specific document. We can go on making new representations, and not just for words: we can create representations for basically anything, such as phrases, paragraphs, and documents.
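As a minimal sketch of the two sentence representations described above (binary and count vectors), assuming a simple lowercase, whitespace-based tokenizer. The helper names `build_vocab`, `binary_vector`, and `count_vector` are made up for this illustration.

```python
from collections import Counter

def build_vocab(sentences):
    """Give every distinct word a fixed position in the vector."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def binary_vector(sentence, vocab):
    """1 at a word's position if the word appears in the sentence, else 0."""
    vec = [0] * len(vocab)
    for w in sentence.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

def count_vector(sentence, vocab):
    """Number of times each vocabulary word appears in the sentence."""
    vec = [0] * len(vocab)
    for w, c in Counter(sentence.lower().split()).items():
        if w in vocab:
            vec[vocab[w]] = c
    return vec

sentences = ["hello world", "hello hello there"]
vocab = build_vocab(sentences)             # {'hello': 0, 'there': 1, 'world': 2}
print(binary_vector(sentences[1], vocab))  # [1, 1, 0]
print(count_vector(sentences[1], vocab))   # [2, 1, 0]
```

With a real vocabulary the vectors would have tens of thousands of positions and be almost entirely zeros, which is one reason dense word embeddings are attractive.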

1.2 Notion of Relatedness

Relatedness in a vector space model is a measure of how similar two vectors are. But in order to find the similarity between vectors, we have to convert the vague notion of relatedness into a more precise definition that can be implemented in a program. In this process, we have to make a number of assumptions, and these assumptions depend on the representation we have used.

For example, if we used vectors to represent documents based on the words that appear in them, then one way to find the similarity between two documents (or vectors) is to count how many words the two documents have in common.

Fig. Relatedness of Vectors

Again, this is just one way of finding similarity; there are different ways to measure the similarity between two vectors, and which one fits depends on your representation. For example, you can also use cosine similarity to estimate how similar two words are in vector space.
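Both similarity measures mentioned above can be sketched in a few lines. This is only an illustration: the three binary document vectors are made up, and `overlap` and `cosine_similarity` are hypothetical helper names.

```python
import math

def overlap(a, b):
    """Count vocabulary positions where both binary vectors are 1."""
    return sum(1 for x, y in zip(a, b) if x and y)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up binary document vectors over a 4-word vocabulary.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 0, 1]
doc_c = [0, 0, 1, 0]

print(overlap(doc_a, doc_b))            # 2 words in common
print(cosine_similarity(doc_a, doc_b))  # about 0.816
print(cosine_similarity(doc_a, doc_c))  # 0.0, no words in common
```

Unlike the raw overlap count, cosine similarity is normalized by vector length, so a long document does not look similar to everything just because it contains many words.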

2. Word Embedding

Word Embeddings are dense vector representations of words in a low-dimensional vector space. They are used as input features in many Natural Language Processing tasks. Word embeddings are especially useful when an NLP task uses neural networks or deep learning, because the embeddings can be fed directly into a neural network as input. This makes them even more useful now that deep learning is so widely used in NLP.

2.1 History of Word Embeddings

The term ‘Word Embeddings’ was first used by Bengio et al. in 2003, who proposed a new model for distributed representations of words using neural networks. Then in 2008, a paper by Collobert and Weston, ‘A unified architecture for Natural Language Processing’, basically made the word embeddings approach mainstream. But it was in 2013 that Mikolov et al. introduced Word2Vec, which popularized the use of word embeddings, and especially pre-trained word embeddings, in many NLP tasks.

2.2 Why Are Word Embedding Models Used?

Computer Vision systems work with rich, high-dimensional vectors to store and process data. This was traditionally not the case in Natural Language Processing. In NLP, words were treated as discrete atomic symbols, i.e. the word ‘cat’ may be stored as ‘123’ and the word ‘dog’ as ‘346’. These numbers carry no semantic or syntactic information; they are just arbitrary IDs. So while processing ‘cat’, a model can leverage very little of what it has learned about ‘dog’. Representing words as unique, discrete IDs furthermore leads to data sparsity, which usually means that we need more data in order to successfully train statistical models.

2.3 Word Embedding Models

Word Embeddings are one of the few currently successful applications of unsupervised learning. All you need is a huge text corpus. That’s it. No annotations are required, and that is why they became so widely used. Word embedding models are very similar to language models. The quality of a language model is measured by its ability to learn a probability distribution over the words in the vocabulary $V$.

Language models compute the probability of the next word $w_t$ based on the previous words, i.e. $p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1})$. By applying the chain rule together with the Markov assumption, we can approximate the probability of a whole sentence or document by the product of the probabilities of each word given its $n-1$ previous words:

$p(w_1 , \cdots , w_T) = \prod\limits_i p(w_i \: | \: w_{i-1} , \cdots , w_{i-n+1})$

Language models are evaluated using perplexity, a cross-entropy based measure: the lower the perplexity, the better the model predicts the text. Word embedding models that are trained like language models can be evaluated with perplexity too.
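To make the product formula and perplexity concrete, here is a rough sketch of a count-based bigram model (the case $n = 2$) on a made-up three-sentence corpus. It uses plain maximum-likelihood estimates without smoothing, and the `<s>` start marker and all function names are assumptions made for this illustration.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count how often each word follows each previous word."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, cur, context_counts, bigram_counts):
    """Maximum-likelihood estimate of p(cur | prev), no smoothing."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(sentence, context_counts, bigram_counts):
    """Product over all words of p(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, cur, context_counts, bigram_counts)
    return prob

def perplexity(sentence, context_counts, bigram_counts):
    """exp of the average negative log-probability per word."""
    n_words = len(sentence.split())
    p = sentence_prob(sentence, context_counts, bigram_counts)
    if p == 0.0:
        return float("inf")  # an unseen bigram makes the sentence impossible
    return math.exp(-math.log(p) / n_words)

corpus = ["the cat sat", "the dog sat", "the cat ran"]
context_counts, bigram_counts = train_bigram(corpus)
print(sentence_prob("the cat sat", context_counts, bigram_counts))  # 1 * 2/3 * 1/2
print(perplexity("the cat sat", context_counts, bigram_counts))
print(perplexity("the dog ran", context_counts, bigram_counts))     # inf
```

The infinite perplexity for the unseen sentence shows the data sparsity problem of discrete symbols: the model knows nothing about "dog ran" even though it has seen "cat ran".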

3. Word2Vec

Word2Vec is the most popular word embedding model. It is often considered the starting point of ‘Deep Learning in NLP’. However, Word2Vec itself is not deep; what matters is that its output is something deep learning models can easily use. Word2Vec is basically a computationally efficient predictive model for learning word embeddings from raw text. Its purpose is to place words that are semantically similar close together in vector space, so that similarity can be computed mathematically.

Given a huge amount of data, the Word2Vec model can create a very rich distributed representation of words that also preserves the semantic relationships words have with other words. A trained Word2Vec model basically captures the meaning of a word from the contexts in which it has appeared. The most famous example is $v(king) - v(man) + v(woman) \approx v(queen)$. This means that we can now cluster words, find similar words, find similar relationships between different words, and more.

Another good example is that $v(man)$ is related to $v(boy)$ in a similar way to how $v(woman)$ is related to $v(girl)$.
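The vector arithmetic behind these analogies can be illustrated with a toy example. The 3-dimensional vectors below are hand-made, not trained by Word2Vec, and their axes (royalty, gender, youth) are an assumption chosen so that the arithmetic works out exactly; real embeddings have hundreds of dimensions with no such clean interpretation.

```python
import math

# Hand-made toy vectors (NOT trained): axes are [royalty, gender, youth].
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "boy":   [0.0,  1.0, 1.0],
    "girl":  [0.0, -1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to v(a) - v(b) + v(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
print(analogy("boy", "man", "woman", toy_vectors))   # girl
```

Excluding the three input words from the candidates is a common convention in analogy evaluation, since the nearest neighbor of the result is otherwise often one of the inputs.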

There are two different approaches in Word2Vec. Let’s see both of them.

3.1 Continuous Bag of Words (CBOW) Model

In the CBOW model, the input to the neural network is $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$ and we have to predict the word $w_i$. Basically, given the context, predict the word in the center. The image below will help you understand the model better.

Fig. CBOW Word2Vec Model
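To see what this model is trained on, here is a small sketch that only builds the (context, center word) training examples for a window of 2 words on each side. It does not train a network, and the function name `cbow_pairs` and the example sentence are made up for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, center_word) examples for a context-to-center model."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, center))
    return pairs

tokens = "the quick brown fox jumps".split()
for context, center in cbow_pairs(tokens):
    print(context, "->", center)
# e.g. ['the', 'quick', 'fox', 'jumps'] -> brown
```

Words near the edges of the sentence simply get a shorter context, which is one simple way to handle boundaries; padding with a special token would be another.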

3.2 Skip-gram Model

The Skip-gram model is the opposite of the CBOW model. In this model, $w_i$ is the input and the output is $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$. Basically, given a target word, predict the context words. The image below will help you understand the model better.

Fig. Skipgram Word2Vec Model
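The training examples for this model can be sketched in the same spirit: each target word is paired with every word inside its window, one (target, context) pair at a time. Again this only generates the pairs and does not train anything; `skipgram_pairs` and the example sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) examples, one per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(tokens):
    print(target, "->", context)
# e.g. brown -> the, brown -> quick, brown -> fox, brown -> jumps
```

Because every target word yields several separate examples, the same text produces more training pairs here than in the context-to-center setup.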

The CBOW model is faster to train than the Skip-gram model and gives slightly better accuracy for frequent words, but it needs a huge amount of data. According to Mikolov, the Skip-gram model works better in practice because, most of the time, people don’t have a huge amount of data. So the Skip-gram model is the better choice for small training data.