Life Lessons from Gradient Descent


It’s been a long time since I was introduced to the gradient descent algorithm. Gradient descent is an optimization algorithm that helps us find an optimal solution to a function. It is based on a simple idea: rather than trying to find the best solution to a problem immediately, move in the direction that seems to take us toward the optimal solution. I have made a small list of philosophical thoughts that compare gradient descent to real-life struggles.
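The core update is tiny. Here is a minimal sketch in Python (the function, starting point, and learning rate are illustrative choices, not from this post):

```python
# Gradient descent: repeatedly take a small step against the gradient.

def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Move proportionally in the direction of the negative gradient."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges very close to 3.0
```

Note the small, proportional steps: a learning rate that is too large overshoots the minimum, which is exactly the “drastic changes” failure mode discussed in point 1.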


I recently discovered a similar post. I would encourage you to check out “Life is a gradient descent”.


1. Drastic Changes NEVER WORK

Do you remember the time you told yourself, “I am going to work for the next 10 hours continuously and get this thing done!” and then it didn’t happen? Or the time you went to bed at 2:00 a.m. planning to wake up at 6:00 a.m. to complete the work, only to wake up much later?

Gradient descent teaches us to make changes proportionally (the negative of the gradient, or derivative) rather than drastically, so that we eventually reach the optimal solution instead of overshooting and making our condition even worse. I know tackling a problem head-on is tempting, and it gives us an adrenaline rush when we think about it, but that does not give you long-term results. Everyone wants to make a definitive to-do list and strike off items one by one, but habitual, long-term changes are not made like that. Small steps in the right direction are the only way to succeed in the long run.


2. The Journey Is IMPORTANT, Not the State

Paths of People

Different Paths to Success

In life, we see many people achieving the same things, and we judge them all using the same criteria. Most people will judge a billionaire by looking at his wealth, his house, his car and so on, but they will never see the journey he has been through to reach this successful state. Some people are fortunate enough to be born with advantages, and some are not. If both have reached the same state, that does not mean both took the same path and experienced the same journey. For example, one of my friends was born into a rich family with a million-dollar family business, and another friend was born into a middle-class family where his dad was the sole earner, driving a cab. If both of them are now running companies with the same net profit and turnover, it would be foolish of us to think that they have gone through the same struggles.


3. There can be better states, just keep WORKING

gradient landscape

As shown in the image above, there are two states a person can achieve. In our case, assume lower is better, so State 2 is better than State 1. When a lazy person is at State 1, he/she might think that he has already reached the lowest point and won’t make an effort to improve. But an active and ambitious person always tries different paths and keeps improving no matter what, even if he has to take a path full of hardship. Soon the ambitious person finds another slope that goes further downhill than his current state (i.e. State 1). This is when he reaches State 2. Success is subjective and can be deceiving; only the one who keeps trying different paths (stochasticity) gets to the most optimal state.


These are the three philosophical thoughts I relate to when I compare gradient descent to life struggles. If you have any other suggestions or examples, comment below.

Introduction to Word2Vec & How it Works


In this post, I will give readers an introduction to Word2Vec. We will first see what vector space models are, and then we will dive into word embeddings. The aim of this post is to understand the Word2Vec model, a neural word embedding model that has become very popular in recent times. So let’s get started.


In this post, we will use the term ‘word embeddings’ to refer to dense representations of words in a low-dimensional vector space. ‘Word vectors’ and ‘distributed representations’ mean the same thing.

1. Vector Space Model

To understand word embeddings and Word2Vec, we first need to understand some baseline models. We will start with the vector space model. Vector space models are used to represent words in a continuous vector space, so that words that are semantically similar are mapped close to each other. Vector space models have been used since the 1990s for distributional semantics. Some of the older models for estimating continuous representations of words were Latent Semantic Analysis and Latent Dirichlet Allocation. Over time, newer and better models have appeared for building these representations and finding similarities between them. Whatever the model, almost all of them rest on the idea that words that appear in the same contexts share semantic meaning.

Let’s first understand what representations mean and how the notion of relatedness works in vector space models.


1.1 Representation

In a vector space model, we can represent basically anything in the form of a vector. Say, for example, we choose to represent a sentence as a vector. Then the vector can be a huge N-dimensional vector that records which words appear in that sentence.

Vector Space Model

Figure 1.1 Vector Space Model

The size of the vector will be the size of the vocabulary. As we can see in Figure 1.1, the sentence vector is represented using the words that appear in that sentence. Every word in the vocabulary is given a fixed position in the vector. So if the word “hello” appears in the sentence, then the bit at the position of the word “hello” will be “1”.

And this is just one way of representing a sentence. Another way is to also keep a count of the words: for example, if the word “hello” appears twice in a given sentence, we increment the value at the position of “hello” in the vector.

We can also represent a word using a vector. For instance, the vector can record whether the word appears in each document or not: every element now corresponds to a document, and a “1” means the word appears in that specific document. And we can go on making new representations, not just for words; we can create representations for basically anything, like a phrase, a document, a paragraph, etc.
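The two sentence representations described above can be sketched in a few lines of Python (the four-word vocabulary is an illustrative toy):

```python
# Sentence-as-vector: one fixed position per vocabulary word.
vocabulary = ["hello", "world", "machine", "learning"]

def sentence_vector(sentence, vocab):
    """Binary vector: 1 if the vocabulary word appears in the sentence, else 0."""
    words = sentence.lower().split()
    return [1 if w in words else 0 for w in vocab]

def count_vector(sentence, vocab):
    """Count vector: how many times each vocabulary word appears."""
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

print(sentence_vector("hello machine learning", vocabulary))  # [1, 0, 1, 1]
print(count_vector("hello hello world", vocabulary))          # [2, 1, 0, 0]
```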


1.2 Notion of Relatedness

Relatedness is the property of a vector space model that lets us evaluate how similar two vectors are. But in order to find similarity between vectors, we have to convert the vague notion of relatedness into a more precise definition that can be implemented as a program. In this process, we have to make a number of assumptions, and these assumptions are based on the representation that we have used.

For example, if we used vectors to represent documents based on the words that appear in them, then one way to find the similarity between two documents (or vectors) is to see how many words those two documents (or vectors) have in common.

Relatedness of Vectors

Figure 1.2 Relatedness of Vectors

And again, this is just one way of finding similarity; you can use different ways to find similarities between two vectors. It all depends on your representation. For example, you can also use cosine similarity to estimate how similar two words are in vector space.
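As a concrete illustration, cosine similarity can be computed directly from its definition (the two document vectors below are made-up binary vectors of the kind described in Section 1.1):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a · b) / (|a| |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [1, 1, 0, 1]  # toy binary document vectors
doc2 = [1, 1, 1, 0]
print(round(cosine_similarity(doc1, doc2), 3))  # 2 shared words -> 0.667
```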


2. Word Embedding

Word embeddings are dense vector representations of words in a low-dimensional vector space. Word embeddings are used as neural language models in many natural language processing tasks. They are especially useful when an NLP task uses neural networks or deep learning, because the embedding representation can be fed directly as input to a neural network. This makes them all the more useful now that deep learning is so widely used in NLP.


2.1 History of Word Embeddings

The term ‘word embeddings’ was first used by Bengio et al. in 2003, who proposed a new model for distributed representations of words using neural networks. Then in 2008, a paper by Collobert and Weston, ‘A Unified Architecture for Natural Language Processing’, made the word embeddings approach mainstream. But in 2013, Mikolov et al. published the Word2Vec papers, which popularized the use of word embeddings, and especially pre-trained word embeddings, in many NLP tasks.


2.2 Why Are Word Embedding Models Used?

Computer vision systems work with rich, high-dimensional vectors to store and process data. But this was not the case in natural language processing tasks. In NLP, words were traditionally treated as discrete atomic symbols, i.e. the word ‘cat’ may be stored as ‘123’ and the word ‘dog’ may be stored as ‘346’. These numbers carry no semantic or syntactic information; they are just numbers representing words. So while processing ‘cat’, such a model can leverage very little of what it knows about ‘dog’. Representing words as unique, discrete ids furthermore leads to data sparsity, and usually means that we need more data in order to successfully train statistical models.


2.3 Word Embedding Models

Word embeddings are one of the few currently successful applications of unsupervised learning. All you need is a huge text corpus; no annotations are required, and that is why the approach became so widely used. Word embedding models are very similar to language models. The quality of a language model is measured by its ability to learn a probability distribution over the words in the vocabulary \(V\).

Language models compute the probability of the next word \(w_t\) based on the previous words, i.e. \(p(w_t \: | \: w_{t-1} , \cdots, w_{t-n+1})\). By applying the chain rule together with the Markov assumption, we can approximate the probability of a whole sentence or document by the product of the probabilities of each word given its \(n\) previous words:

\(p(w_1 , \cdots , w_T) = \prod\limits_i p(w_i \: | \: w_{i-1} , \cdots , w_{i-n+1})\)

Language models are evaluated using perplexity, a cross-entropy-based measure. We use perplexity to evaluate word embeddings too.
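As a sketch of how perplexity works (the probabilities below are made up), it is simply the exponential of the average negative log-probability the model assigns to each word:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    cross_entropy = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(cross_entropy)

# A model that gives each of 4 words probability 0.25 is exactly as
# uncertain as a uniform choice among 4 words, so its perplexity is 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

Lower perplexity means the model is less surprised by the text it sees.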


3. Word2Vec

Word2Vec is the most popular word embedding model, and it is often considered the starting point of ‘deep learning in NLP’. Word2Vec itself is not deep, but its output is something deep learning models can easily consume. Word2Vec is basically a computationally efficient predictive model for learning word embeddings from raw text. Its purpose is to group semantically similar words together in vector space, computing similarity mathematically.

Given a huge amount of data, a Word2Vec model can create a very rich distributed representation of words that also preserves the semantic relationships words have with other words. A trained Word2Vec model can basically infer the meaning of a word based on its past appearances. The most famous example would be \(v(king) - v(man) + v(woman) \approx v(queen)\). This means that we can now cluster words, find similar words, find similar relationships between different words, and more.

Another good example is that \(v(man)\) is related to \(v(boy)\) in a similar way to how \(v(woman)\) is related to \(v(girl)\).

There are two different approaches in Word2Vec. Let’s see both of them.


3.1 Continuous Bag of Words (CBOW) Model

In the CBOW model, the input to the neural network is \(w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\) and we have to predict the word \(w_i\). Basically, given the context, predict the word in the center. The image below will help you understand the model better.

CBOW Word2Vec

CBOW Word2Vec Model

The CBOW model gives better accuracy and is faster than the skip-gram model, but it requires a huge amount of data.


3.2 Skip Gram Model

The skip-gram model is the opposite of the CBOW model. In this model, \(w_i\) is the input and the output is \(w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}\). Basically, given a target word, predict the context words. The image below will help you understand better.

Skipgram Word2Vec

Skipgram Word2Vec

According to Mikolov, the skip-gram model works better in practice because most of the time people don’t have a huge amount of data. So the skip-gram model is better for small training datasets.
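To make the two framings concrete, here is a small sketch (not Mikolov’s actual implementation) of how each model carves a sentence into training examples, using a window of size 1:

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (target -> context word) example per neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """CBOW: one (all context words -> center word) example per position."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tuple(tokens[j]
                        for j in range(max(0, i - window),
                                       min(len(tokens), i + window + 1))
                        if j != i)
        pairs.append((context, center))
    return pairs

sentence = ["the", "cat", "sat"]
print(skipgram_pairs(sentence))  # each word predicts its neighbors
print(cbow_pairs(sentence))      # each context predicts its center word
```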


In the next post, I will write about the implementation details of these two models. So stay tuned.

Why I Read So Much And How It All Started


“A mind needs books as a sword needs a whetstone, if it is to keep its edge.” – Tyrion Lannister, A Song of Ice and Fire (ASOIAF)

As Tyrion says, books are needed to keep minds sharp, and being such a big fan, I had to follow his instructions. That aside, I would give the same advice to anyone who asks me why I read so much.

How To Create a Facebook Messenger ChatBot In 5 Minutes – Step By Step Instructions


The year 2017 will open a new era of how humans interact with digital services. Chatbots are the new interface for humans to interact with computers. Chatbots let us use the same digital services we have been using, but without needing to install an app. And that is not even the best part: chatbots do all of this using natural language.

So we can just say something like “Order A Game of Thrones book from Amazon”, and the chatbot handles the rest, placing the order on our behalf. All this happens without even installing or opening the Amazon app.

So when Facebook first announced the Messenger platform last year, I started learning about it and created some pet projects. I wanted to write tutorials about my experiences but was very busy with other things. Now, I have decided to keep posting new tutorials related to chatbots and natural language processing.

We are going to make a simple Messenger chatbot that just echoes back to the user the same message it received. So if I message “Hiii” to the bot, it will reply “Hiii” back to me. I know this is very simple and not of much use, but it is the Hello World project for people who want to learn how to make chatbots.
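As a preview, here is a framework-free sketch of the echo logic. The payload shapes follow Messenger’s webhook and Send API format, but check the field names against Facebook’s current documentation before relying on them:

```python
def build_echo_reply(event):
    """Build the Send API payload that echoes one incoming messaging event."""
    sender_id = event["sender"]["id"]  # who messaged the bot
    text = event["message"]["text"]    # what they said
    # Reply to the same user with the same text.
    return {"recipient": {"id": sender_id}, "message": {"text": text}}

incoming = {"sender": {"id": "12345"}, "message": {"text": "Hiii"}}
print(build_echo_reply(incoming))  # echoes "Hiii" back to sender 12345
```

In the real bot, this payload would be POSTed to the Graph API with your page access token; the sketch only shows the transformation in between.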

Let’s Get Started. 😀





This post consists of common terminologies, along with their explanations, that I will be using in my other articles.

Note: This is an ongoing post and will be updated as I come across new terminologies


Principal Component Analysis

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of a given set of data features (fields) and their values. It reduces the number of features in our data without compromising accuracy too much. Basically, we try to emphasize variation and bring out strong patterns in a dataset. This is done by combining a group of possibly related features into a single new feature called a ‘principal component’.

As an example, consider a set of features about houses.

  • It may have combinable features, like the length and breadth of a plot of land. These can be combined into “area”.
  • It may list similar data multiple times, like area in square meters as well as square feet.
  • Even without clear correlations, we can often combine features and receive a satisfactory result that needs less space and makes algorithms process this data faster.


In 2 dimensions, PCA combines features by finding a new vector (1 dimension) and projecting the values onto it. For larger dimensions, PCA reduces values in a ‘space’ of n dimensions to a ‘subspace’ of k dimensions (k<n).

  1. First, we standardize the dataset by scaling (reducing all data ranges to [0, 1]) and normalizing (subtracting the means so each feature has a mean of 0).
  2. The subspace is chosen to maximize the variance of the dataset. This means the magnitude of the projections (distance from the origin in the subspace) must be maximized.
  3. This can be achieved by choosing the new vectors along eigenvectors of the covariance matrix. The covariance matrix defines the ‘spread’ of the data: its values show how the values along its axes change with respect to each other, thus describing the shape of the data. By definition, eigenvectors denote directions along which projections scale linearly, so the variance is not reduced.
  4. Since we need to maximize the variance and we only need k vectors, we choose the k eigenvectors corresponding to the k largest eigenvalues.
  5. Now that the new vectors are obtained, multiply them by the data to create the reduced set of values.
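The steps above can be sketched for 2-dimensional data reduced to 1 dimension, where the eigenvector of the 2×2 covariance matrix has a closed form (the data points are illustrative):

```python
import math

# Toy 2-D dataset to be reduced to 1-D.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]

# Step 1: center each feature at mean 0.
mx = sum(x for x, _ in points) / len(points)
my = sum(y for _, y in points) / len(points)
centered = [(x - mx, y - my) for x, y in points]

# Covariance matrix [[a, b], [b, c]] (sample covariance, divisor n - 1).
n = len(points) - 1
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Steps 3-4: largest eigenvalue and its eigenvector, the principal component.
# For a symmetric 2x2 matrix, (b, lam - a) is an eigenvector (assumes b != 0,
# which holds for this data).
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
vx, vy = b, lam - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Step 5: project each centered point onto the principal component.
reduced = [round(x * vx + y * vy, 3) for x, y in centered]
print(reduced)  # 5 one-dimensional values replacing 5 two-dimensional points
```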





Autoencoder

An autoencoder is a type of unsupervised neural network whose job is to predict its own input, but through a bottleneck. Because of this bottleneck, the network has to learn to represent the input in fewer dimensions. That is why it is used in applications where dimensionality reduction is needed. Autoencoders are similar to Principal Component Analysis but are more effective at learning the mapping because of the non-linearity introduced between the layers of the neural network.





Regularization

The dataset used in most machine learning problems consists of two things: pattern + noise. The job of the machine learning model is to learn the pattern in the data and ignore the noise; if it also learns the noise, it is overfitting. Let’s take an example of pattern + noise in a house pricing dataset. The features in this dataset can be the number of rooms, area, location, etc. Based on these features we can estimate the price of a house. But as we know, not all houses with the same features have the same price. This variation in price among houses with the same features is the noise.

Our model must learn only the pattern (a simpler model) and ignore the noise (a higher-order polynomial). So to make the model ignore the noise, we need a mechanism that penalizes the model every time it considers the noise while training. This mechanism of penalizing the model every time it chooses a higher-order polynomial, one that gives only an insignificant reduction in error, is called regularization.
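One common form of this penalty is L2 (ridge) regularization, sketched below on a toy linear model (the data, weights, and lambda are made up): the penalty simply adds lambda times the squared size of the weights to the training error, so models that rely on large weights pay more.

```python
def regularized_loss(weights, data, lam):
    """Mean squared error plus an L2 penalty that grows with the weights."""
    mse = sum((weights[0] + weights[1] * x - y) ** 2 for x, y in data) / len(data)
    l2_penalty = lam * sum(w * w for w in weights)
    return mse + l2_penalty

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x, plus noise
weights = [0.0, 2.0]                          # simple model close to the pattern
print(regularized_loss(weights, data, lam=0.0))  # pure training error
print(regularized_loss(weights, data, lam=0.1))  # error + 0.1 * ||w||^2
```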




Latent Space

The word “latent” means “hidden”, and it is used that way in machine learning: you observe some data in a space you can observe, and you want to map it to a latent space where similar data points are closer together.




Random Variables

Random variables are central objects of probability theory whose possible numerical outcomes depend on randomness. A random variable is defined as a function that maps outcomes to numerical values. There are two types of random variables.



1. Continuous Random Variable

It can take on an infinite number of values. Usually, when continuous variables are plotted, they form a line on the graph.

For example, the weight of a person, the height of a person, or the temperature of water can be represented using any real number.

2. Discrete Random Variable

It can take on only a limited set of values. Usually, histograms are used to plot discrete variables.

For example, the number of candies in a jar or the number of days in a week can be represented using a positive integer.




Knowledge Base

A knowledge base is a large collection of curated knowledge, such as Freebase, YAGO, or CYC. The term knowledge base is also applied when the knowledge is automatically constructed, as with NELL. Most knowledge bases include subject-verb-object triples automatically extracted from large text corpora. We formally define a knowledge base as a collection of triples (es, r, et), where each triple expresses some relation r between a source entity es and a target entity et. As stated above, the relations r could come from an underlying ontology (such as /TYPE/OBJECT/TYPE from Freebase, or CONCEPT:LOCATEDAT from NELL), or they could be verb phrases extracted from text, such as “plays for”. The entities e could be formal representations of real-world people, places, categories, or things (such as /M/02MJMR from Freebase or CONCEPT:BARACKOBAMA from NELL, both representing U.S. President Barack Obama), or they could be noun phrases taken directly from text, such as the string “Obama”.

Referred From: Doctoral Thesis of Matthew Gardner





















