5 Essential Approaches for Robust Outlier Detection

Introduction

Have you ever come across strange data points while exploring a dataset? One or a few that differ sharply from the vast majority of observations, skewing your averages and inflating your variances? I have been there too. These points are outliers. Their impact goes beyond distorting summary statistics: outliers can also degrade the performance of the predictive models you build. It is therefore crucial to detect and handle them robustly in any data project. This article lists and compares five essential approaches to detecting them, along with a short Python example for each.

1. The Z-Score Method

Z-score calculation is a simple method that works best for approximately normally distributed variables. It measures how many standard deviations each point lies from the mean. A common rule flags any point whose absolute Z-score exceeds 3, that is, any point more than three standard deviations away from the mean. Despite its simplicity, the method has a built-in weakness: the mean and the standard deviation are themselves very sensitive to extreme values, so the outliers you are trying to detect distort the yardstick used to measure them.


import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])
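# Absolute Z-score of each point (population standard deviation, ddof=0)
z_scores = np.abs(stats.zscore(data))
# Flag points more than three standard deviations from the mean
outliers = data[z_scores > 3]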

print(outliers)
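
The sensitivity of the mean and standard deviation has a concrete consequence in small samples. As a minimal sketch using only Python's standard library (the ten-value sample is made up for illustration, and the helper name `z_scores` is hypothetical), removing a single value from the dataset is enough for the obvious outlier to slip under the threshold, because the outlier itself inflates the standard deviation:

```python
import statistics

def z_scores(values):
    """Absolute Z-scores using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) / std for v in values]

# Hypothetical sample: ten values, one of them clearly extreme
small_sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 250]
scores = z_scores(small_sample)
flagged = [v for v, z in zip(small_sample, scores) if z > 3]

print(round(max(scores), 4))  # largest absolute Z-score in the sample
print(flagged)                # values flagged by the |z| > 3 rule
```

With the population standard deviation, the largest possible absolute Z-score in a sample of n points is sqrt(n - 1), so with ten points no value can exceed 3. In this sketch the list of flagged values comes out empty even though 250 is plainly anomalous.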

2. The Interquartile Range (IQR) Method

Are your variables not normally distributed? Then IQR is a more robust bet than Z-scores. This method relies on percentiles: it computes the interquartile range, the gap between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). Cutoff points are then placed 1.5 times the IQR below Q1 and above Q3, and they act as “fences”: any point outside either fence is flagged as an outlier. The good news: the IQR is robust because extreme values do not shift quartiles the way they shift means and standard deviations.


import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])
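# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Fences sit 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Anything outside either fence is flagged
outliers = data[(data < lower_fence) | (data > upper_fence)]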

print(outliers)
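
To see that robustness in numbers, here is a small standard-library sketch (the helper `iqr_fences` and the exaggerated second dataset are illustrative assumptions, not part of the original example). It makes the outlier a hundred times larger and compares how the fences and the mean react. `statistics.quantiles` is called with `method="inclusive"` because that option interpolates the same way as NumPy's default percentile method:

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences placed k * IQR beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

base = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]
extreme = base[:-1] + [25000]  # same data, far more extreme outlier

fences_base = iqr_fences(base)
fences_extreme = iqr_fences(extreme)
mean_base = statistics.fmean(base)
mean_extreme = statistics.fmean(extreme)

print(fences_base, fences_extreme)  # the fences do not move
print(mean_base, mean_extreme)      # the mean does
```

The fences depend only on the middle of the sorted data, so the size of the extreme value has no effect on them, while the mean is dragged along with it.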

3. Isolation Forests

When processing complex, high-dimensional datasets, univariate methods such as Z-scores and IQR are often no longer sufficient, because they examine one variable at a time. Enter isolation forests, a machine learning technique that isolates anomalies from “normal” data. The approach builds an ensemble of trees, much like tree-based methods for classification and regression, except that each tree splits the data on randomly chosen features and thresholds. Outliers are rare and different, so they tend to be separated from the rest after only a few splits. So, when a point is very easily isolated by the trees, there is a good chance that it is an outlier.


import numpy as np
from sklearn.ensemble import IsolationForest

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)
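# contamination is the assumed share of outliers in the data (here 10%)
model = IsolationForest(contamination=0.1, random_state=42)
# fit_predict returns -1 for outliers and 1 for inliers
predictions = model.fit_predict(data)
outliers = data[predictions == -1]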

print(outliers)
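
The intuition that outliers are easy to isolate can be illustrated without scikit-learn. The sketch below is a deliberately simplified, one-dimensional toy (the functions `isolation_depth` and `mean_depth`, the trial count, and the seed are all assumptions made for illustration). It is not the `IsolationForest` implementation, which builds many trees on subsamples and converts path lengths into an anomaly score. It only counts how many random splits are needed before a value no longer shares its partition with any different value:

```python
import random
import statistics

def isolation_depth(values, target, rng):
    """Count random splits until `target` is separated from all other distinct values."""
    subset = list(values)
    depth = 0
    while min(subset) < max(subset):
        split = rng.uniform(min(subset), max(subset))
        # Keep only the side of the split that contains the target
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, trials=200, seed=42):
    """Average isolation depth of `target` over repeated random partitionings."""
    rng = random.Random(seed)
    return statistics.fmean(
        isolation_depth(values, target, rng) for _ in range(trials)
    )

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]
depth_outlier = mean_depth(data, 250)  # the extreme value
depth_typical = mean_depth(data, 12)   # a value from the dense region

print(depth_outlier, depth_typical)
```

The extreme value is usually cut off by the very first split, whereas a typical value always needs at least two, so its average depth is clearly larger.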

4. Median Absolute Deviation (MAD)

This is, so to speak, a considerably more robust version of the Z-score: MAD replaces the mean with the median, which is insensitive to outliers, and the standard deviation with the median of the absolute deviations from that median, yielding a modified “Z-score”. Be aware, however, that although it can be applied to non-normal variables, it is a univariate technique: it is normally applied to one variable at a time.


import numpy as np
from scipy.stats import median_abs_deviation

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])
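# scale="normal" rescales the MAD so it is comparable to a standard deviation for normal data
mad = median_abs_deviation(data, scale="normal")
median = np.median(data)
# Modified Z-score: distance from the median in units of the scaled MAD
modified_z_scores = np.abs(data - median) / mad
outliers = data[modified_z_scores > 3]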

print(outliers)
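
For readers curious about what `scale="normal"` does, the following standard-library sketch reproduces the calculation by hand. The variable names are illustrative, and the constant is taken from `statistics.NormalDist`, on the understanding that SciPy's "normal" scaling divides the raw MAD by the 75th percentile of the standard normal distribution (about 0.6745):

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]

median = statistics.median(data)
# Raw MAD: the median of the absolute deviations from the median
raw_mad = statistics.median(abs(v - median) for v in data)

# Consistency constant: 75th percentile of the standard normal, about 0.6745
normal_q75 = statistics.NormalDist().inv_cdf(0.75)
scaled_mad = raw_mad / normal_q75  # comparable to a standard deviation

# Modified Z-score: distance from the median in units of the scaled MAD
modified_z = [abs(v - median) / scaled_mad for v in data]
outliers = [v for v, z in zip(data, modified_z) if z > 3]

print(median, raw_mad, round(scaled_mad, 4))
print(outliers)
```

One caveat: if more than half of the values are identical, the MAD is zero and the modified Z-score is undefined, so real code should guard against that case.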

5. Density-Based Clustering: DBSCAN

This is an excellent approach for identifying outliers in spatial data or datasets with complex groupings. The DBSCAN algorithm builds clusters around points that lie close to each other in high-density areas. Isolated data points in low-density areas are not assigned to any cluster and are labeled as noise, i.e., outliers. Just like isolation forests (method 3), this is a multivariate technique: it evaluates multidimensional data points as a whole when detecting outliers.


import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)
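# eps is the neighborhood radius; min_samples is the minimum neighborhood size for a core point
model = DBSCAN(eps=5, min_samples=2)
# Points that belong to no cluster receive the label -1 (noise)
labels = model.fit_predict(data)
outliers = data[labels == -1]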

print(outliers)
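
Since the example above is one-dimensional, it does not show the multivariate side of the method. The plain-Python sketch below uses made-up 2D points and a hypothetical helper, `noise_points`, that applies only DBSCAN's noise rule: a point is noise if it is neither a core point nor within `eps` of one. It does not assign cluster labels and is not a substitute for scikit-learn's `DBSCAN`. The point (2, 9) is unremarkable on each axis taken separately, yet it sits far from the dense diagonal:

```python
import math

def noise_points(points, eps, min_samples):
    """Return the points that are neither core points nor within eps of one."""
    def neighbors(p):
        # The neighborhood includes the point itself, as in scikit-learn
        return [q for q in points if math.dist(p, q) <= eps]

    core = [p for p in points if len(neighbors(p)) >= min_samples]
    return [
        p for p in points
        if p not in core and not any(math.dist(p, c) <= eps for c in core)
    ]

# Ten points along the diagonal y = x, plus one point off the diagonal
points = [(float(i), float(i)) for i in range(1, 11)] + [(2.0, 9.0)]
noise = noise_points(points, eps=1.5, min_samples=2)

print(noise)
```

A univariate check on either coordinate would pass (2, 9), since both 2 and 9 fall inside the observed range; only the joint view reveals that the point has no close neighbors.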

Conclusion

Choosing the right outlier detection method comes down to understanding your data. Z-score and IQR are quick and easy options for univariate data, with IQR being the safer choice when your variables are not normally distributed. MAD offers a more robust univariate alternative for cases where extreme values might otherwise distort the result. When your data has multiple dimensions or a complex structure, isolation forests and DBSCAN extend outlier detection beyond simple statistical thresholds, capturing relationships between variables that univariate methods miss. There is no single best approach, only the one best suited to the shape and scale of your data.

Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and guides others in leveraging AI in the real world.
