This post collects common terminology, along with explanations, that I will be using in my other articles.

Note: This is an ongoing post and will be updated as I come across new terminologies


Principal Component Analysis

Principal Component Analysis (PCA) is a technique used to reduce the dimensions of a given set of data features (fields) and their values. It reduces the number of features in our data without compromising accuracy too much. Basically, we try to emphasize variation and bring out strong patterns in a dataset. This is done by combining a group of possibly related features into a single new feature called a ‘principal component’.

As an example, consider a set of features about houses.

  • It may have combinable features, like the length and breadth of a plot of land. These can be combined into “area”.
  • It may list similar data multiple times, like area in square meters as well as square feet.
  • Even without clear correlations, we can often combine features and obtain a satisfactory result that takes less space and lets algorithms process the data faster.


In 2 dimensions, PCA combines features by finding a new vector (1 dimension) and projecting the values onto it. For larger dimensions, PCA reduces values in a ‘space’ of n dimensions to a ‘subspace’ of k dimensions (k<n).

  1. First, we standardize the data set by scaling (e.g. reduce all data ranges to [0,1]) and centering (subtract the mean of each feature so every feature has mean 0)
  2. The subspace is chosen to maximize the variance in the dataset. This means the variance of the projected points (their spread along the new axes) must be maximized.
  3. This can be achieved by choosing new vectors along eigenvectors of the covariance matrix. The covariance matrix defines the ‘spread’ of the data: its entries show how the values along pairs of axes change with respect to each other, thus describing the shape of the data. The eigenvectors of the covariance matrix point along the directions of greatest spread, so projecting onto them preserves as much variance as possible.
  4. Since we need to maximize the values and we only need k vectors, we choose k eigenvectors, corresponding to the k largest eigenvalues.
  5. Now that the new vectors are obtained, multiply the data by them to create the reduced set of values.
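The steps above can be sketched in a few lines of NumPy (an illustrative implementation; the function name and variable names are my own, not from any library):

```python
import numpy as np

def pca(X, k):
    """Reduce X (n_samples x n_features) to k dimensions."""
    # 1. Center each feature so it has mean 0.
    X_centered = X - X.mean(axis=0)
    # 2. The covariance matrix describes the spread of the data.
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvectors of the covariance matrix give the principal directions.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues.
    top_k = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]
    # 5. Project the data onto the chosen subspace.
    return X_centered @ top_k

# Example: 100 points in 3 dimensions reduced to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_reduced = pca(X, k=2)
print(X_reduced.shape)  # (100, 2)
```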





Auto Encoder

An Auto Encoder is a type of unsupervised neural network whose job is to reconstruct its own input, but through a bottleneck. Because of this bottleneck, the network has to learn to represent the input in fewer dimensions, which is why it is used in applications that need dimensionality reduction. Auto Encoders are similar to Principal Component Analysis but can learn more effective mappings because of the non-linearity introduced between the layers of the neural network.
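Here is a tiny autoencoder sketch in plain NumPy to make the bottleneck idea concrete (a real one would use a deep-learning framework; the data, sizes, and training loop here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data that truly lives on a 2-D subspace of 5-D space, so a
# 2-unit bottleneck is enough to reconstruct it.
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 5))

W1 = rng.normal(scale=0.1, size=(5, 2))  # encoder weights (the bottleneck)
W2 = rng.normal(scale=0.1, size=(2, 5))  # decoder weights

lr, losses = 0.1, []
for _ in range(500):
    h = np.tanh(X @ W1)          # encode: 5-D input -> 2-D code (non-linear)
    X_hat = h @ W2               # decode: 2-D code -> 5-D reconstruction
    losses.append(np.mean((X_hat - X) ** 2))
    # Backpropagate the mean-squared reconstruction error.
    d_out = 2 * (X_hat - X) / X.size
    dW2 = h.T @ d_out
    dh = d_out @ W2.T
    dW1 = X.T @ (dh * (1 - h ** 2))  # derivative of tanh is 1 - tanh^2
    W1 -= lr * dW1
    W2 -= lr * dW2

print(losses[0], losses[-1])  # reconstruction error falls as training proceeds
```

The network is "predicting the input itself": the loss compares the output against the very data that was fed in.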





Regularization

The dataset used in most machine learning problems consists of two things: pattern + noise. The job of the machine learning model is to learn the pattern in the data and ignore the noise; if it also learns the noise, it is overfitting. Consider pattern + noise in a house-pricing dataset. The features in this dataset can be the number of rooms, area, location, etc., and from these features we estimate the price of a house. But, as we know, not all houses with the same features have the same price. This variation in price among houses with the same features is called noise.

Our model must learn only the pattern (a simpler model) and ignore the noise (a higher-order polynomial fit). To make the model ignore the noise, we need a mechanism that penalizes it every time it fits the noise during training. This mechanism of penalizing the model whenever it chooses a higher-order polynomial that yields only an insignificant reduction in error is called regularization.
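One common form of this penalty is L2 (ridge) regularization, which adds the squared size of the coefficients to the error. The sketch below fits a high-degree polynomial to noisy linear data with and without the penalty (the data and the penalty strength are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.1, size=20)  # pattern (2x) + noise

# Polynomial features up to degree 7: plenty of capacity to fit the noise.
X = np.vander(x, N=8, increasing=True)

def fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y.
    # lam = 0 gives plain least squares (no penalty).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_plain = fit(X, y, lam=0.0)  # free to use huge coefficients to chase the noise
w_ridge = fit(X, y, lam=1.0)  # penalized for large coefficients

print(np.abs(w_plain).max(), np.abs(w_ridge).max())
```

The penalized fit keeps its coefficients small, which is exactly the "prefer the simpler model" behavior described above.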




Latent Space

The word “latent” means “hidden”, and it is used in that sense in machine learning: you observe data in a space you can see, and you want to map it to a latent space where similar data points lie closer together.




Random Variables

Random Variables are central objects of probability theory whose possible numerical values depend on the outcome of a random phenomenon. Formally, a random variable is defined as a function that maps outcomes to numerical values. There are two types of random variables.



1. Continuous Random Variable

It can take infinitely many values: any value within a range of real numbers. Usually, when continuous variables are plotted, they form a curve on the graph.

For example, the weight of a person, the height of a person, the temperature of water, etc. can be represented using any real number.

2. Discrete Random Variable

It can take only a countable set of values. Usually, discrete variables are plotted using histograms.

For example, the number of candies in a jar, the number of days in a week, etc. can be represented using a non-negative integer.
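The contrast can be shown with Python's standard library alone (the distributions and ranges below are illustrative choices, not the only options):

```python
import random

random.seed(42)

# Continuous: a person's height in cm, drawn from a normal distribution;
# the result can be any real number.
height = random.gauss(170, 10)

# Discrete: the number of candies in a jar; only whole numbers are possible.
candies = random.randint(0, 100)

print(height, candies)
```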




Knowledge Base

A Knowledge Base is a large collection of curated knowledge, such as Freebase, YAGO, or CYC. The term knowledge base is also applied when this knowledge is automatically constructed, as with NELL. Many knowledge bases include subject-verb-object triples automatically extracted from large text corpora. We formally define a knowledge base as a collection of triples, (es, r, et), where each triple expresses some relation r between a source entity es and a target entity et. As stated above, the relations r could be from an underlying ontology (such as /TYPE/OBJECT/TYPE, from Freebase, or CONCEPT:LOCATEDAT, from NELL), or they could be verb phrases extracted from the text, such as “plays for”. The entities e could be formal representations of real-world people, places, categories, or things (such as /M/02MJMR from Freebase or CONCEPT:BARACKOBAMA from NELL, both representing U.S. President Barack Obama), or they could be noun phrases taken directly from the text, such as the string “Obama”.

Referred From: Doctoral Thesis of Matthew Gardner
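The triple definition above can be sketched directly as a set of (source, relation, target) tuples; the entity and relation names below are illustrative, not actual Freebase or NELL identifiers:

```python
# A knowledge base as a collection of (es, r, et) triples.
kb = {
    ("BarackObama", "presidentOf", "USA"),
    ("BarackObama", "bornIn", "Honolulu"),
    ("Honolulu", "locatedIn", "Hawaii"),
}

def targets(kb, source, relation):
    """All target entities et such that (source, relation, et) is in the KB."""
    return {et for (es, r, et) in kb if es == source and r == relation}

print(targets(kb, "BarackObama", "bornIn"))  # {'Honolulu'}
```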





















