This post explains the terms most commonly used in Machine Learning and Deep Learning papers. I will refer back to it from my other posts whenever one of these terms comes up.
Principal Component Analysis (PCA) is a technique for reducing the number of dimensions in a dataset, that is, the number of features (fields) it contains, without compromising accuracy too much. Basically, we try to emphasize variation and bring out strong patterns in the dataset. This is done by combining a group of possibly related features into a single new feature called a ‘principal component’.
As an example, consider a set of features about houses.
- It may have combinable features, like the length and breadth of a plot of land. These can be combined into “area”.
- It may list similar data multiple times, like area in square meters as well as square feet.
- Even without clear correlations, we can often combine features and still get a satisfactory result that needs less space and lets algorithms process the data faster.
In 2 dimensions, PCA combines features by finding a new vector (1 dimension) and projecting the values onto it. In higher dimensions, PCA reduces values in a ‘space’ of n dimensions to a ‘subspace’ of k dimensions (k<n). The steps are:
- First, we standardize the data set: center each feature by subtracting its mean (so every feature has a mean of 0) and scale it, typically by dividing by its standard deviation, so that features measured in different units become comparable.
- The subspace is chosen to maximise the variance of the projected data. This means the projections of the data points should be spread out as much as possible (their squared distances from the origin in the subspace, taken over all points, must be maximised).
- This can be achieved by choosing the new vectors along eigenvectors of the covariance matrix. The covariance matrix describes the ‘spread’ of the data. Its entries show how each pair of features changes together, thus providing the shape of the data. By definition, an eigenvector is a direction that the matrix only scales and does not rotate, and the matching eigenvalue equals the variance of the data along that direction.
- Since we want to maximise the variance and we only need k vectors, we choose the k eigenvectors corresponding to the k largest eigenvalues.
- Now that the new vectors are obtained, multiply the standardized data by them to create the reduced set of values.
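The PCA steps listed above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `pca`, the synthetic house data, and the choice of k = 2 are all assumptions made for the example.

```python
import numpy as np

def pca(data, k):
    """Reduce `data` (rows = samples, columns = features) to k dimensions."""
    # Step 1: standardize each feature (mean 0, standard deviation 1).
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Step 2: covariance matrix of the features.
    cov = np.cov(standardized, rowvar=False)
    # Step 3: eigen-decomposition (eigh is meant for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    # Step 5: project the standardized data onto those eigenvectors.
    return standardized @ components, eigenvalues[order]

# Hypothetical house data: length, breadth, area in m^2, area in ft^2.
rng = np.random.default_rng(0)
length = rng.uniform(10, 30, size=200)
breadth = rng.uniform(5, 20, size=200)
area_m2 = length * breadth
area_ft2 = area_m2 * 10.7639  # same information as area_m2, different unit
houses = np.column_stack([length, breadth, area_m2, area_ft2])

reduced, variances = pca(houses, k=2)
print(reduced.shape)  # 200 rows, 2 columns instead of 4
print(variances)      # variance captured along each kept component
```

Here the two area columns carry the same information, so they are exactly the kind of redundant features that PCA folds together.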
The dataset used in most Machine Learning problems consists of two things: pattern + noise. The job of the machine learning model is to learn the pattern in the data and ignore the noise. If it also learns the noise, it is overfitting.
Let’s take an example of pattern + noise in a house pricing dataset. The features in this dataset can be the number of rooms, area, location, etc. Based on these features we can estimate the price of a house. But we know that not all houses with the same features have the same price. This variation in price among houses with the same features is called noise.
Our model should learn only the pattern (which a simpler model can capture) and ignore the noise (which a higher-order polynomial ends up fitting). To make the model ignore the noise, we need a mechanism that penalizes it every time it fits the noise during training. This mechanism of penalizing the model whenever it chooses a higher-order polynomial that brings only an insignificant reduction in error is called regularization.
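As a rough sketch of this idea, the snippet below fits a degree-7 polynomial to data whose true pattern is a straight line plus noise, once without a penalty and once with an L2 (ridge) penalty. Ridge is just one common form of regularization, picked here for illustration; the helper name `fit_ridge_polynomial`, the data, and the penalty value are assumptions. For simplicity the penalty is also applied to the intercept.

```python
import numpy as np

def fit_ridge_polynomial(x, y, degree, penalty):
    """Least-squares polynomial fit with an L2 penalty on the coefficients."""
    # Columns are x^0, x^1, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    # Closed-form ridge solution: solve (X^T X + penalty * I) w = X^T y.
    A = X.T @ X + penalty * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical data: linear pattern (3x + 1) plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 3.0 * x + 1.0 + rng.normal(scale=0.3, size=x.shape)

w_plain = fit_ridge_polynomial(x, y, degree=7, penalty=0.0)  # no regularization
w_ridge = fit_ridge_polynomial(x, y, degree=7, penalty=1.0)  # with regularization

print(np.round(w_plain, 2))
print(np.round(w_ridge, 2))
```

The penalized fit has a smaller overall coefficient norm, so it has less freedom to bend the curve through individual noisy points.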
1. Continuous Random Variable
It can take any value within a range, so it has infinitely many possible values. Usually, when the distribution of a continuous variable is plotted, it forms a smooth curve in the graph.
For example, the weight of a person, the height of a person, the temperature of water, etc. can each take any real value within some range.
2. Discrete Random Variable
It can take only a countable set of distinct values, often a finite one. Usually, discrete variables are plotted with bar charts or histograms.
For example, the number of candies in a jar, the number of days in a week, etc. can each be represented using a non-negative integer.
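To make the difference concrete, here is a small sketch using Python's standard `random` module. The ranges (50 to 90 kg, 0 to 50 candies) are made-up values for illustration only.

```python
import random

random.seed(0)  # fixed seed so repeated runs draw the same samples

# Continuous: a weight can be any real number within the range.
weights_kg = [random.uniform(50.0, 90.0) for _ in range(5)]

# Discrete: a candy count can only be a whole number.
candies_in_jar = [random.randint(0, 50) for _ in range(5)]

print(weights_kg)
print(candies_in_jar)
```

The weights come out as floats with many decimal places, while the candy counts are always integers.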