Last Updated: 2/6/2023
Anomaly Detection
What is Anomaly Detection?
Anomaly detection is like a treasure hunt. Imagine you are looking for treasure in your backyard. You know that the treasure is something shiny and valuable, but you don't know exactly what it looks like or where it is. You start by looking for clues, like shiny objects or things that are out of place. These clues can help you narrow down where the treasure might be.
Now imagine that instead of treasure, you are looking for something unusual or unexpected in a large amount of data. Anomaly detection is a way of using clues or patterns in the data to find things that are different from what you usually see. This can be useful for finding problems or mistakes or discovering new and interesting things.
For example, if you look at a graph of how much money a store makes each day, you might see a pattern where most days, the store makes a similar amount of money. But one day, the store makes much more or much less money than usual. This might be an anomaly or something unusual. You might want to investigate this anomaly to see if it was caused by something special, like a sale or a holiday.
So, in summary, anomaly detection is a way of finding unusual or unexpected things in a large amount of data. It can be like a treasure hunt, where you are looking for clues to help you discover something new and interesting.
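The store revenue example can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive method: it assumes a simple z-score rule (how many standard deviations a day sits from the mean), and the function name `flag_unusual_days`, the revenue figures, and the threshold of 2.0 are all made up for the example.

```python
from statistics import mean, stdev


def flag_unusual_days(daily_revenue, threshold=2.0):
    """Return the indices of days whose revenue lies more than
    `threshold` sample standard deviations away from the mean."""
    mu = mean(daily_revenue)
    sigma = stdev(daily_revenue)
    if sigma == 0:
        # Every day is identical, so nothing stands out.
        return []
    return [
        day
        for day, revenue in enumerate(daily_revenue)
        if abs(revenue - mu) / sigma > threshold
    ]


# Hypothetical daily revenue; day 6 is far above the usual level.
revenue = [1020, 980, 1005, 995, 1010, 990, 2500, 1000, 1015, 985]
unusual_days = flag_unusual_days(revenue)
print(unusual_days)
```

With so few data points, a single extreme value also inflates the standard deviation, which is why this sketch uses a lower threshold than the 3.0 often quoted for large samples. A flagged day is only a prompt to investigate (a sale, a holiday, a data entry error), not a verdict.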
"Anomaly detection is the task of detecting instances that deviate strongly from the norm. These instances are called anomalies, or outliners, while the normal instances are called inliers." — Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Aurélien Géron.
Examples
Here are a few examples of how anomaly detection is used in real life:
- Fraud detection: AI models can identify unusual patterns in financial transactions that might indicate fraud. For example, if a credit card is suddenly used to make many high-dollar purchases in a short period of time, this might be an anomaly that warrants further investigation. Other indicators include the geographical region, description, and time of the transaction. Flagging these transactions early helps protect potential victims before money is withdrawn from their accounts.
- Network security: Anomaly detection can identify unusual patterns of network activity that might indicate a cyber attack. For example, if a computer starts sending out a large number of network requests to unfamiliar servers, this could be an anomaly that indicates malware or a virus.
- Manufacturing: Anomaly detection can identify unusual patterns in production data that might indicate a problem with a manufacturing process. For example, if the number of defective products produced by a machine suddenly spikes, this could be an anomaly that indicates a problem with the machine or the production process.
- Healthcare: Anomaly detection can identify unusual patterns in patient data that might indicate potential health problems. For example, if a patient's vital signs (like heart rate or blood pressure) suddenly change significantly, this could be an anomaly that warrants further investigation.
- Environmental monitoring: Anomaly detection can identify unusual patterns in environmental data that might indicate a problem or change in the environment. For example, if the levels of certain pollutants in the air suddenly spike, this could be an anomaly that indicates an environmental problem.
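As one illustration of the examples above, the credit card case (many high-dollar purchases in a short period) can be approximated with a simple sliding-window rule. This is only a sketch: the function name `burst_of_large_purchases` and the limits (500.0 per purchase, a one-hour window, more than three purchases) are hypothetical values chosen for the example. A real fraud system would combine many more signals, such as region and transaction description.

```python
from datetime import datetime, timedelta


def burst_of_large_purchases(
    transactions,
    min_amount=500.0,
    window=timedelta(hours=1),
    max_count=3,
):
    """Return True if more than `max_count` purchases of at least
    `min_amount` fall within any time span of length `window`.

    `transactions` is a list of (timestamp, amount) pairs.
    """
    # Keep only the large purchases, ordered by time.
    large = sorted(ts for ts, amount in transactions if amount >= min_amount)
    start = 0
    for end in range(len(large)):
        # Shrink the window from the left until it spans at most `window`.
        while large[end] - large[start] > window:
            start += 1
        if end - start + 1 > max_count:
            return True
    return False


t0 = datetime(2023, 2, 6, 14, 0)

# Hypothetical card with ordinary activity spread over several days.
normal_card = [
    (t0, 40.0),
    (t0 + timedelta(hours=5), 650.0),
    (t0 + timedelta(days=1), 25.0),
    (t0 + timedelta(days=3), 800.0),
]

# Hypothetical card with four large purchases inside twenty minutes.
suspicious_card = [
    (t0, 900.0),
    (t0 + timedelta(minutes=5), 1200.0),
    (t0 + timedelta(minutes=12), 750.0),
    (t0 + timedelta(minutes=20), 1500.0),
]

print(burst_of_large_purchases(normal_card))
print(burst_of_large_purchases(suspicious_card))
```

The same windowed-count idea carries over to the other examples, for instance counting defective products per hour on a machine or requests per minute to unfamiliar servers.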
Novelty Detection
Novelty detection and outlier detection often come up alongside anomaly detection. Novelty detection is a machine learning task that involves finding things that are new or different from what we have seen before. It is similar to anomaly detection, which involves finding things that are unusual or unexpected.
Anomaly detection is typically used to identify things that are unusual or unexpected within a known dataset. In other words, we have some idea of what "normal" looks like, and we are trying to find things that are different from that. Novelty detection, on the other hand, is used to identify things that are completely new and different from anything we have seen before. This can be more challenging because we have no examples of the new pattern itself to compare against; we can only model what we have already seen.
One key use case for novelty detection is online fraud detection. For example, a company might use novelty detection to identify new types of fraudulent activity it has never seen before. This can be challenging because the company doesn't have any examples of this new type of fraud to use as a baseline. Instead, it has to rely on more general patterns and characteristics to identify the fraud.
In summary, anomaly detection is used to identify unusual or unexpected things within a known dataset, while novelty detection is used to identify completely new and different things that we have not seen before.
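One common way to frame novelty detection, used for example in scikit-learn's documentation, is to learn a baseline from data assumed to be free of anomalies and then check each new observation against it. The sketch below follows that framing using only the Python standard library. The class `BaselineNoveltyDetector`, the threshold of 3.0, and the transaction amounts are hypothetical. A single mean and standard deviation is a deliberately crude stand-in for a real model.

```python
from statistics import mean, stdev


class BaselineNoveltyDetector:
    """Learn a baseline from clean historical values, then flag new
    values that fall far outside it (illustrative sketch only)."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.mu = None
        self.sigma = None

    def fit(self, clean_values):
        # The training data is assumed to contain no anomalies.
        self.mu = mean(clean_values)
        self.sigma = stdev(clean_values)
        return self

    def is_novel(self, value):
        # A new value is "novel" if it lies more than `threshold`
        # standard deviations away from the learned baseline.
        return abs(value - self.mu) > self.threshold * self.sigma


# Hypothetical past transaction amounts, all considered legitimate.
history = [20, 25, 22, 19, 24, 21, 23, 20, 26, 22]
detector = BaselineNoveltyDetector().fit(history)

print(detector.is_novel(23))  # close to what has been seen before
print(detector.is_novel(95))  # unlike anything in the history
```

The key design point is the separation between `fit`, which only ever sees past data, and `is_novel`, which is applied to observations that arrive later.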
Outlier Detection
Outlier detection, also known as outlier analysis and often used interchangeably with anomaly detection, is the process of identifying data points that are unusual or unexpected within a dataset. These data points, known as outliers, can be caused by errors in data collection, measurement, or entry, or they can be the result of unusual or unexpected events.
Outlier detection is similar to anomaly detection, but the two terms are not exactly the same. Anomaly detection is a more general term that can refer to any method used to identify unusual or unexpected patterns in data. Outlier detection is a specific type of anomaly detection that focuses on identifying individual data points that are significantly different from the other points in the dataset.
Several techniques can be used for outlier detection, including statistical methods, machine learning algorithms, and domain-specific approaches. The choice of method depends on the nature of the data and the specific goals of the analysis.
In summary, outlier detection is a specific type of anomaly detection that focuses on identifying unusual or unexpected data points within a dataset.
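As an example of the statistical methods mentioned above, the interquartile range (IQR) rule, sometimes called Tukey's fences, marks a point as an outlier when it lies well outside the middle half of the data. The sketch below is one possible implementation using Python's `statistics.quantiles`. The function name `iqr_outliers`, the sample readings, and the conventional multiplier of 1.5 are choices made for this example, not something the techniques above prescribe.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below the first
    quartile or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one entry that looks like a typo.
readings = [12, 13, 12, 14, 13, 15, 14, 13, 98, 12, 14]
outliers = iqr_outliers(readings)
print(outliers)
```

Because quartiles are barely affected by a single extreme value, this rule stays stable on small samples where a mean-and-standard-deviation rule can be distorted by the very outlier it is trying to find.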
Anomaly Detection in Real Life
- [Kaggle] Anomaly Detection
- [Kaggle] NYC Taxi Traffic
- [Kaggle] Credit Card Anomaly Detection
- [Kaggle] Time Series with anomalies
- [Kaggle] Medical Anomaly Detection
- [Kaggle] Network Anomaly Detection Dataset
- [Kaggle] Marble Surface Anomaly Detection - 2
- [Kaggle] Healthcare Providers Data For Anomaly Detection
If you want to learn more about anomaly detection, check out the following resources:
Textbooks
Videos
Podcasts
Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP’s advantages largely stem from good initialization, not from advantages inherent in the algorithm. We don’t cover that argument here obviously, because it wasn’t out there when we were recording, but you can find a link to the paper below.