An empirical analysis of whether ML models make more mistakes when predicting on outliers
Outliers are individuals that are very different from the majority of the population. Traditionally, practitioners have harbored a certain mistrust of outliers, which is why ad-hoc measures such as removing them from the dataset are often adopted.
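To make the point concrete, here is a minimal sketch of one such ad-hoc measure: filtering out observations whose z-score exceeds a fixed threshold. The data, the function name, and the threshold of 3 are my own illustrative choices, not a recommendation.

```python
import numpy as np

def drop_outliers_zscore(x, threshold=3.0):
    """Drop values whose absolute z-score exceeds the threshold.

    This is a common ad-hoc rule of thumb, not a principled method:
    the mean and std themselves are distorted by the outliers.
    """
    z = np.abs((x - x.mean()) / x.std())
    return x[z < threshold]

rng = np.random.default_rng(0)
data = np.append(rng.normal(0.0, 1.0, size=100), 15.0)  # one extreme value
cleaned = drop_outliers_zscore(data)
print(len(data), "->", len(cleaned))  # the extreme value gets dropped
```

Note that this throws the observation away entirely, which is exactly the practice questioned in the next paragraph.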
However, when working with real data, outliers come with the territory. Sometimes, they are even more important than other observations! Take for instance individuals who are outliers because they are very high-paying customers: you don’t want to discard them; in fact, you probably want to treat them with extra care.
An interesting — and quite unexplored — aspect of outliers is how they interact with ML models. My feeling is that data scientists believe that outliers harm the performance of their models. But this belief is probably based on a preconception more than on real evidence.
Thus, the question I will try to answer in this article is the following:
Is an ML model more likely to make mistakes when making predictions on outliers?
Suppose that we have a model that has been trained on these data points:
We receive new data points for which the model should make predictions.
Let’s consider two cases:
- the new data point is an outlier, i.e. different from most of the training observations.
- the new data point is “standard”, i.e. it lies in an area that is densely populated with training points.
We would like to understand whether, in general, the outlier is harder to predict than the standard observation.
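The setup above can be sketched in a few lines of code. This is my own toy version of the experiment, not the article's actual benchmark: I train a regressor on data concentrated around the origin, with a ground-truth function I invented for illustration, then compare the model's absolute error on a point from a dense region versus a point far from all training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Training data concentrated around the origin.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Hypothetical ground-truth function (chosen only for this sketch).
def true_y(x):
    return x[..., 0] ** 2 + x[..., 1]

model = RandomForestRegressor(random_state=42)
model.fit(X_train, true_y(X_train))

standard_point = np.array([[0.1, 0.2]])   # lies in a dense region
outlier_point = np.array([[6.0, -5.0]])   # far from all training points

for name, p in [("standard", standard_point), ("outlier", outlier_point)]:
    err = abs(model.predict(p)[0] - true_y(p)[0])
    print(f"{name}: |error| = {err:.2f}")
```

With this particular setup the outlier's error is much larger, because tree-based models cannot extrapolate beyond the range of targets seen in training. Whether that pattern holds in general, across models and datasets, is exactly the question this article investigates.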