Tame the Curse of Dimensionality! Learn Dimensionality Reduction (PCA) and implement it with Python and Scikit-Learn.

Image source: unsplash.com.

In the novel Flatland, characters living in a two-dimensional world are perplexed and unable to make sense of what they see when they encounter a three-dimensional being. I use this analogy because something similar happens in Machine Learning when dealing with problems involving thousands or even millions of dimensions (i.e. features): counter-intuitive phenomena arise, with disastrous implications for our Machine Learning models.

I’m sure you have felt stunned, at least once, by the huge number of features involved in modern Machine Learning problems. Every Data Science practitioner, sooner or later, faces this challenge. This article explores the theoretical foundations and the Python implementation of the most widely used Dimensionality Reduction algorithm: Principal Component Analysis (PCA).

Why do we need to reduce the number of features?

Datasets involving thousands or even millions of features are common nowadays. Adding new features to a dataset can bring in valuable information; however, it also slows the training process and makes it harder to find good patterns and solutions. In Data Science this is called the Curse of Dimensionality, and it often leads to skewed interpretations of the data and inaccurate predictions.
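One well-known symptom of the curse, which the numbers below illustrate, is that in very high-dimensional spaces the distances between points become almost indistinguishable, so a "nearest" neighbor is barely nearer than any other point. The snippet is a minimal sketch of my own (random points in a unit hypercube, not data from this article):

```python
# As the number of dimensions grows, pairwise distances between random
# points concentrate around the same value: a symptom of the Curse of
# Dimensionality. (Illustrative sketch, not from the article.)
import numpy as np

rng = np.random.default_rng(42)

for n_dims in (2, 100, 10_000):
    points = rng.random((1_000, n_dims))                      # 1,000 random points in the unit hypercube
    dists = np.linalg.norm(points[0] - points[1:], axis=1)    # distances from the first point to all others
    spread = (dists.max() - dists.min()) / dists.min()        # relative contrast between farthest and nearest
    print(f"{n_dims:>6} dims -> relative distance spread: {spread:.3f}")
```

The relative spread shrinks as the dimensionality grows, which is why distance-based reasoning becomes unreliable with too many features.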

Machine learning practitioners like us can benefit from the fact that, for most ML problems, the number of features can be reduced considerably. For example, consider a picture: the pixels near the border often don’t carry any valuable information. However, the techniques to safely reduce the number of features in an ML problem are not trivial, and they deserve the explanation I will provide in this post.
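To make this concrete, here is a minimal sketch of how many pixel features can be dropped in practice. It uses scikit-learn’s built-in digits dataset (8×8 pixel images), which is my own choice for illustration and not part of the article:

```python
# Minimal sketch: reduce the 64 pixel features of the scikit-learn digits
# dataset while retaining ~95% of the original variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # shape (1797, 64): flattened 8x8 images

pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print(f"Original features:  {X.shape[1]}")
print(f"Components kept:    {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
```

Passing a float between 0 and 1 as n_components tells scikit-learn to keep just enough components to preserve that fraction of the variance, which is often far fewer than the original number of features.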

Image by the author.

The tools I will present not only reduce the computational effort and boost prediction accuracy, but they also make it possible to graphically visualize high-dimensional data. For…
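As a small preview of the visualization use case, the sketch below (again my own example on scikit-learn’s digits dataset, not taken from the article) projects the 64-dimensional data onto its first two principal components so it can be plotted on a plane:

```python
# Minimal sketch: project 64-dimensional digit images onto 2 principal
# components and scatter-plot them, colored by digit class.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

X_2d = PCA(n_components=2).fit_transform(X)   # 64 features -> 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="digit")
plt.show()
```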