Statistical anomaly detection

Anomaly detection, also known as outlier detection, is a technique used to identify unusual patterns or behaviours that deviate significantly from the norm. These deviations, or anomalies, often indicate problems or issues that warrant further investigation. Anomaly detection is applied in various domains, such as fraud detection, network security, quality control, and system health monitoring. There are several forms of anomaly detection; in this explanation, I will focus on statistical anomaly detection.

Statistical methods for anomaly detection are based on the assumption that the majority of the data points are governed by some underlying probability distribution. By fitting a statistical model to the data, we can estimate how likely each data point is under that model. Data points that the model rates as very unlikely are considered anomalous, since they were probably not generated by the same process as the majority of the data. One of the most widely used statistical models for anomaly detection is the Gaussian (normal) distribution.

Normal Distribution

A normal distribution, also known as the Gaussian distribution or bell curve, is a continuous probability distribution characterized by its mean (μ) and standard deviation (σ). The mean represents the central tendency of the data, while the standard deviation measures the dispersion or spread of the data.

The probability density function (PDF) of a normal distribution is given by:

f(x) = (1 / (σ √(2 π))) exp(-(x - μ)^2 / (2 σ^2))

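where x is a data point and exp() is the exponential function. Strictly speaking, the PDF gives a probability density at x rather than a probability: it measures how likely values near x are under the distribution, and it can be greater than 1 when σ is small. In the steps that follow, this density is what is meant by the "probability" of a data point.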
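As a minimal sketch, the PDF formula above can be written directly in Python using only the standard library. The function name normal_pdf and its parameter names are chosen here for illustration and do not come from any particular library.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Normalising constant: 1 / (sigma * sqrt(2 * pi))
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    # Exponent: -(x - mu)^2 / (2 * sigma^2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)
```

The density is highest at x = μ and falls off symmetrically on either side, so a point three standard deviations from the mean gets a much lower value than a point one standard deviation away.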

Anomaly Detection Using Normal Distribution

To implement anomaly detection using the normal distribution, we follow these general steps:

Data preprocessing: Clean and preprocess the data to remove noise and handle missing values. This step might also involve feature scaling or normalization to ensure that features have comparable magnitudes.
Model fitting: Fit a normal distribution to the data by calculating the mean and standard deviation of the dataset.
Threshold selection: Choose a probability threshold below which a data point is considered anomalous. This threshold can be determined empirically based on domain knowledge or by using cross-validation techniques on a labelled dataset.
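Anomaly detection: For each data point, calculate its probability using the fitted normal distribution. When a data point has several features, the probabilities of the individual features are multiplied together to give the overall probability of the data point, which assumes the features are independent of one another. If this probability is below the chosen threshold, classify the data point as an anomaly.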
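The model fitting and anomaly detection steps above can be sketched as follows. This is one possible reading rather than a definitive implementation. It assumes that each feature is fitted with its own normal distribution, that features are independent, and that the population standard deviation is used. The function names, the toy data, and the threshold value epsilon are all hypothetical. The code sums log-densities instead of multiplying densities; the two are mathematically equivalent, but the sum avoids numerical underflow when there are many features.

```python
import math
from statistics import fmean, pstdev


def fit_gaussian(rows):
    """Estimate the mean and population standard deviation of each feature (column)."""
    columns = list(zip(*rows))
    mus = [fmean(column) for column in columns]
    sigmas = [pstdev(column) for column in columns]
    if any(sigma == 0 for sigma in sigmas):
        raise ValueError("a feature with zero variance cannot be fitted")
    return mus, sigmas


def log_density(row, mus, sigmas):
    """Log of the product of per-feature normal densities (independence assumed)."""
    total = 0.0
    for x, mu, sigma in zip(row, mus, sigmas):
        total += -math.log(sigma * math.sqrt(2.0 * math.pi))
        total += -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return total


def detect_anomalies(rows, mus, sigmas, epsilon):
    """Return True for each row whose joint density falls below the threshold epsilon."""
    log_epsilon = math.log(epsilon)
    return [log_density(row, mus, sigmas) < log_epsilon for row in rows]


# Hypothetical toy data: two features per data point, all close to (10, 5).
train = [(10.0, 5.0), (10.2, 5.1), (9.8, 4.9), (10.1, 5.0), (9.9, 5.0)]
mus, sigmas = fit_gaussian(train)

# The threshold is an illustrative value; in practice it would be tuned as
# described in the threshold selection step.
epsilon = 1e-3
flags = detect_anomalies([(10.0, 5.0), (12.0, 5.0)], mus, sigmas, epsilon)
```

With this toy data, the point (12.0, 5.0) lies many standard deviations from the mean of the first feature, so it is flagged, while (10.0, 5.0) is not. Multiplying per-feature densities ignores any correlation between features; if features are strongly correlated, a multivariate normal distribution with a full covariance matrix is a common alternative.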