Last Updated: 2/6/2023

Clustering

What is Clustering?

Clustering is a way of dividing data into groups, or clusters, based on the characteristics of the data. The goal is to group the data so that data points within the same cluster are similar to one another, while data points in different clusters are dissimilar.

There are many different algorithms and techniques for clustering, and the choice of method depends on the specific characteristics of the data and the goals of the analysis. Some algorithms, such as k-means, require you to specify the number of clusters up front; they then iteratively assign each data point to the nearest cluster and adjust the cluster centers to optimize the grouping. Others, such as agglomerative hierarchical clustering, build a hierarchy of clusters: each data point starts as its own cluster, and the most similar clusters are merged step by step until all the data points belong to a single cluster.
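
Here is a minimal sketch of both approaches, assuming scikit-learn is available; the synthetic data and all parameter values are illustrative rather than taken from any particular dataset.

```python
# Partition-based vs. hierarchical clustering on synthetic data (illustrative).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic 2-D data with three loose groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Partition-based: the number of clusters is chosen up front, and the algorithm
# repeatedly reassigns points and moves the cluster centers.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical (agglomerative): every point starts as its own cluster and the
# most similar clusters are merged until the requested number remains.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

print(kmeans_labels[:10])
print(agglo_labels[:10])
```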

Examples

Here are a few examples of how machine learning clustering is used in real life:

  1. Customer segmentation: Companies often use clustering to group their customers into different segments based on their characteristics and behaviors. For example, a retailer might use clustering to group its customers into segments based on their age, location, income, and purchase history. This can help the company target its marketing efforts more effectively and understand its customer base better (a small code sketch of this idea follows the list).
  2. Document classification: Clustering can be used to group documents, like news articles or research papers, into different categories based on their content. For example, a news website might use clustering to group articles into categories like politics, sports, entertainment, and business. This can help the website organize its content and make it easier for readers to find what they want.
  3. Fraud detection: Clustering can be used to identify patterns of fraudulent activity in financial transactions. For example, a bank might use clustering to group transactions based on their characteristics, like the amount, the location, and the timing. This can help the bank identify unusual patterns that might indicate fraud and flag them for further investigation.
  4. Medical diagnosis: Clustering can group patients into different categories based on their symptoms and medical history. This can help doctors identify patterns in the data and make more informed diagnoses and treatment decisions.
  5. Image classification: Clustering can group images into different categories based on their content. For example, a social media website might use clustering to group user-uploaded images into categories like pets, landscapes, people, and food. This can help the website organize its content and make it easier for users to find what they are looking for.
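
As a rough illustration of the customer segmentation example above, the sketch below clusters a handful of made-up customers by age and annual income using scikit-learn; the features, values, and cluster count are assumptions for demonstration only.

```python
# Toy customer segmentation with k-means (made-up data, illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Columns: age, annual income (hypothetical values).
customers = np.array([
    [22, 18_000], [25, 21_000], [27, 24_000],    # younger, lower income
    [41, 62_000], [45, 70_000], [48, 66_000],    # middle-aged, mid income
    [60, 110_000], [63, 120_000], [66, 115_000], # older, higher income
], dtype=float)

# The two features live on very different scales, so standardize first.
X = StandardScaler().fit_transform(customers)

# One segment label per customer; with three clusters the segments should
# roughly match the groups in the comments above.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments)
```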

Imagine you are organizing a party and have a list of guests with their names, ages, and interests. You want to group the guests into clusters based on their characteristics so that you can plan activities and games everyone will enjoy. One way to do this is with clustering. You could start by grouping the guests by age, so that you have a cluster of kids, a cluster of teenagers, and a cluster of adults. You could then refine the clusters based on interests, so that you have a cluster of kids who like sports, a cluster of teenagers who like music, and a cluster of adults who like board games. Grouping the guests this way makes it much easier to plan activities that everyone will enjoy. This is similar to how machine learning clustering groups data points based on their characteristics to find patterns in the data.

Remember:

  • Since labels are not required, clustering is a form of unsupervised learning; if we had labels, the task would be classification instead.
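
A minimal sketch of that distinction, assuming scikit-learn: the clustering model is fit on the features alone, while the classifier also needs the labels.

```python
# Unsupervised (clustering) vs. supervised (classification) in scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Clustering: no labels are passed in; the groups are discovered from X alone.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Classification: the same features, but the known labels y supervise the fit.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
predicted = classifier.predict(X)
```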

Clustering in Real Life

  1. [Kaggle] Worldwide Cities Data
  2. [Kaggle] Mall Customer Segmentation Data
  3. [Kaggle] Customer Personality Analysis
  4. [Kaggle] Wine Dataset for Clustering
  5. [Kaggle] Credit Card Dataset for Clustering

If you want to learn more about Clustering, you should check out the following resources:

Textbooks

Chapter 10 (Unsupervised Learning - Clustering Methods): This chapter will instead focus on unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features X1, X2,...,Xp measured on n observations. We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to discover interesting things about the measurements on X1, X2,...,Xp. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? Unsupervised learning refers to a diverse set of techniques for answering questions such as these. In this chapter, we will focus on two particular types of unsupervised learning: principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and clustering, a broad class of methods for discovering unknown subgroups in data.

Chapter 3 (Unsupervised Learning and Preprocessing - Clustering): As we described earlier, clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to.

Chapter 10 (Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling)

Chapter 9 (Unsupervised Learning Techniques - Clustering): As you enjoy a hike in the mountains, you stumble upon a plant you have never seen before. You look around and you notice a few more. They are not identical, yet they are sufficiently similar for you to know that they most likely belong to the same species (or at least the same genus). You may need a botanist to tell you what species that is, but you certainly don’t need an expert to identify groups of similar-looking objects. This is called clustering: it is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.

Podcasts

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
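
As a rough illustration (not taken from the episode), the sketch below runs scikit-learn's DBSCAN on synthetic two-moons data; eps and min_samples are the two parameters mentioned, and the values here are illustrative.

```python
# DBSCAN: density-based clustering that finds irregular shapes and outliers.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: irregularly shaped clusters.
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps controls how close points must be to count as neighbors;
# min_samples controls how many neighbors make a region "dense".
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points that fall in no dense region are marked as outliers with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| outliers:", int(np.sum(labels == -1)))
```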

Many people know K-means clustering as a powerful clustering technique, but not all listeners will be as familiar with spectral clustering. In today’s episode, Sibylle Hess from the Data Mining group at TU Eindhoven joins us to discuss her work on spectral clustering and how its results could potentially cause a massive shift away from conventional neural networks. Listen to learn about her findings.
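
For listeners new to spectral clustering, here is a hedged sketch (not from the episode) contrasting it with k-means on non-convex data, using scikit-learn; the dataset and parameters are illustrative.

```python
# K-means vs. spectral clustering on non-convex (two-moons) data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means assumes roughly spherical clusters and tends to cut each moon in half.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering works on a similarity graph instead, so it can follow
# the curved shape of each moon.
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)
```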

Have you ever wondered how you can use clustering to extract meaningful insights from single-feature time-series data? In today’s episode, Ehsan speaks about his recent research on actionable feature extraction using clustering techniques. Listen to discover the methodologies he used and the corresponding results.

K-means (sklearn vs FAISS), finding n_clusters via inertia/silhouette, Agglomerative, DBSCAN/HDBSCAN
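
A minimal sketch of choosing n_clusters via inertia (the elbow heuristic) and the silhouette score, assuming scikit-learn; the synthetic data and candidate range are illustrative.

```python
# Picking n_clusters for k-means using inertia (elbow) and silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    # Inertia: within-cluster sum of squares; look for the "elbow" as k grows.
    # Silhouette: cohesion vs. separation in [-1, 1]; higher is better.
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))
```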