

Last Updated: 2/6/2023

Dimensionality Reduction

What is Dimensionality Reduction?

Imagine you have a really big box full of different kinds of toys. There are dolls, cars, balls, stuffed animals, and all sorts of other things. It's hard to keep track of all the toys because there are so many different kinds, and they are all mixed together. One way to make it easier to keep track of the toys is to sort them into smaller boxes based on what they are. For example, you could have a box just for dolls, a box just for cars, and a box just for balls. This way, finding a specific toy is easier because you know which box to look in.

This is similar to what we do in computer science when we have a lot of data. Sometimes, the data has many different features or characteristics, like the toys in the big box. It can be hard to make sense of all the data when it has so many different features. One way to make it easier to understand the data is to use a technique called dimensionality reduction. This is like sorting the data into smaller groups based on the different features. For example, you might group the data based on age, gender, location, or interests. This way, you can look at each group separately and see how it differs from the other groups.

Dimensionality reduction is a useful way to make sense of large, complex datasets, and it can help us find patterns and trends that might not be obvious otherwise.

[Figure: projecting 3D data onto a 2D plane]

"When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction." — Machine Learning A Probabilistic Perspective, Kevin P. Murphy.

Examples

Here are a few examples of how you can apply dimensionality reduction in real life:

  1. Image compression: When we take a picture with our phones or cameras, the image consists of millions of tiny pixels, each with its color and brightness value. This is a lot of data and can take up a lot of space. Dimensionality reduction helps to identify patterns in the data and compress the image without losing too much quality, making the image easier to store and share.
  2. Speech recognition: When we speak, our voices produce sound waves that can be measured and analyzed. These sound waves have many different features, like pitch, volume, and frequency. Dimensionality reduction identifies patterns in the data and simplifies them to make it easier to analyze and recognize speech.
  3. Fraud detection: When we use our credit cards or make online transactions, we create a lot of data about what we buy, when, and how we pay. This data can have many different features, like the type of product, the price, and the location of the transaction. Dimensionality reduction identifies patterns in the data and simplifies it, making it easier to detect fraudulent activity.
  4. Marketing: Companies often have a lot of data about their customers, like their age, gender, location, and purchase history. This data can have many different features, and it can be hard to make sense of it all. Dimensionality reduction can identify patterns in the data and group customers into smaller, more manageable groups based on their characteristics. This can help companies target their marketing efforts more effectively.
  5. Data visualization: When we have a large dataset, it can be hard to understand or visualize it just by looking at a long list of numbers. Dimensionality reduction helps to identify patterns and reduce the data to two or three dimensions, making it easier to visualize and understand (see the sketch after this list). This can be useful for finding patterns and trends in the data that might not be obvious otherwise. Data visualization is essential for communicating conclusions to people who are not data scientists.
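To make the data-visualization example above concrete, here is a hedged sketch that uses t-SNE from scikit-learn to squeeze the classic 4-feature Iris dataset down to two dimensions and plot it; the choice of dataset and the fixed `random_state` are just assumptions to keep the example self-contained.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Load a small 4-feature dataset and reduce it to 2 dimensions
X, y = load_iris(return_X_y=True)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

# Each point is one flower; colors show the species so clusters become visible
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=20)
plt.title("Iris: 4 features reduced to 2 for visualization")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```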

Imagine you are organizing a book club. You have a list of members with their names, ages, genders, locations, and reading preferences. That is a lot of information, and it can be hard to make sense of it all. To make it easier, you group the members into smaller, more manageable groups based on their characteristics. For example, you could create a group for men, a group for women, a group for younger members, and a group for older members. You could also create groups based on reading preferences, like mystery, romance, sci-fi, etc. This way, you can look at each group separately and see how it differs from the other groups. You can also use the groups to plan book club events or discussions that are more tailored to the interests of each group. By reducing the dimensionality of the data, you are able to make sense of it in a more organized and meaningful way.

The Curse of Dimensionality

Most people think that more data is always better; however, that is not entirely true. More examples to train on may be beneficial, but more features per example may not be. This leads to a problem in artificial intelligence called the Curse of Dimensionality, which appears when a dataset has too many columns or features.

The "curse of dimensionality" refers to the challenges of working with high-dimensional data, that is, data with a large number of features or dimensions. The main problem is that finding patterns or trends in high-dimensional data is hard: as the number of dimensions increases, the volume of the space grows exponentially, so the data becomes sparse and meaningful relationships become increasingly difficult to find. Another challenge is visualization: when we have many dimensions, it is difficult to plot the data on a chart or graph, because we can't easily represent so many dimensions in two or three. Finally, high-dimensional data can be computationally expensive to work with, because it requires more processing power and storage to handle the larger volume of data.

The curse of dimensionality is a common problem in machine learning and data analysis, but several techniques can help us solve or reduce it. These include dimensionality reduction, which reduces the number of dimensions in the data, and feature selection, which selects a subset of the most relevant dimensions to focus on. You will learn more about the Curse of Dimensionality later.
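To make the "volume grows exponentially" argument tangible, here is a small NumPy experiment (the sample size and the list of dimensions are arbitrary choices): it samples random points in a unit hypercube and shows that, as the number of dimensions grows, the nearest and farthest neighbors of a point end up at almost the same distance, which is one common symptom of the curse of dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimensionality, sample random points in the unit hypercube and
# compare the nearest and farthest distances measured from the first point.
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    ratio = dists.min() / dists.max()
    print(f"{d:>5} dims: nearest/farthest distance ratio = {ratio:.2f}")
```

As the ratio approaches 1, the notion of a "nearest" neighbor stops being informative, which is exactly why distance-based methods struggle to find patterns in high-dimensional data.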

Remember:

  • In general, dimensionality reduction reduces noise in the data and removes redundant or highly correlated features (a small sketch of this idea follows this list).
  • Dimensionality reduction has two major applications: noise reduction and visualization.
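As a small, assumed example of the first bullet, the sketch below takes the simplest possible route: it computes a pandas correlation matrix on a made-up table (the column names and the 0.95 threshold are arbitrary) and drops one column from every pair that is almost perfectly correlated. Strictly speaking this is feature selection rather than a projection, but it is an easy way to remove redundant features before applying heavier techniques such as PCA.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy dataset: "height_cm" and "height_in" are almost perfectly correlated
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, 300),
    "weight_kg": rng.normal(70, 8, 300),
    "age": rng.integers(18, 80, 300),
})
df["height_in"] = df["height_cm"] / 2.54 + rng.normal(0, 0.1, 300)

# Keep only the upper triangle of the absolute correlation matrix, then drop
# any column whose correlation with an earlier column exceeds the threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print("dropping:", to_drop)          # expected: ['height_in']
reduced = df.drop(columns=to_drop)
```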

[Figure: data projected onto new dimensions]

Dimensionality Reduction in Real Life

  1. [Kaggle] Dimensionality Reduction
  2. [Kaggle] Swarm Behaviour Classification
  3. [Kaggle] Human Glioblastoma Dataset
  4. [Kaggle] 4D Temperature Monitoring with BEL
  5. [Kaggle] ICMR Dataset or Gene Expression Profiles of Breast Cancer
  6. [Kaggle] Bitcoin - Transactional Data

If you want to learn more about Dimensionality Reduction you should check out the following resources:

Textbooks

Chapter 8 (Dimensionality Reduction): Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality.

Chapter 3 (Unsupervised Learning and Preprocessing - Dimensionality Reduction, Feature Extraction, and Manifold Learning): Unsupervised transformations of a dataset are algorithms that create a new representation of the data which might be easier for humans or other machine learning algorithms to understand compared to the original representation of the data. A common application of unsupervised transformations is dimensionality reduction, which takes a high-dimensional representation of the data, consisting of many features, and finds a new way to represent this data that summarizes the essential characteristics with fewer features. A common application for dimensionality reduction is reduction to two dimensions for visualization purposes.

Chapter 9.3 (Unsupervised Learning - Dimensionality Reduction): Many modern machine learning algorithms, such as ensemble algorithms and neural networks, handle very high-dimensional examples well, up to millions of features. With modern computers and graphical processing units (GPUs), dimensionality reduction techniques are used much less in practice than in the past. The most frequent use case for dimensionality reduction is data visualization: humans can only interpret a maximum of three dimensions on a plot.

Videos



Podcasts

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some advantages. This episode gives a quick recap of t-SNE, especially the connection it shares with information theory, then gets into how UMAP is different (many say better). Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP’s advantages largely stem from good initialization, not from advantages inherent in the algorithm. We don’t cover that argument here obviously, because it wasn’t out there when we were recording, but you can find a link to the paper below.

More features are not always better! With an increasing number of features to consider, machine learning algorithms suffer from the curse of dimensionality, as they have a wider set and often sparser coverage of examples to consider. This episode explores a real-life example of this as Kyle and Linhda discuss their thoughts on purchasing a home. The curse of dimensionality was defined by Richard Bellman, and applies in several slightly nuanced cases. This mini-episode discusses how it applies to machine learning. This episode does not, however, discuss a slightly different version of the curse of dimensionality which appears in decision-theoretic situations. Consider the game of chess. One must think ahead several moves in order to execute a successful strategy. However, thinking ahead another move requires a consideration of every possible move of every piece controlled, and every possible response one's opponent may take. The space of possible future states of the board grows exponentially with the horizon one wants to look ahead to. This is present in the notably useful Bellman equation.

In this episode I talk about the problems of dimensionality reduction and clustering. I explain the applications of each of these problems and also the most famous methods for solving them, such as PCA, KPCA, ICA, and NNMF for dimensionality reduction and k-means for clustering. In the end, I also explain autoencoders, which are powerful neural networks that can be used for both problems.