Multi-Million Dollar Mistake in Data Science Applications


Recently, while training a machine learning model, I ran into something I had never experienced before. The model performed well in the validation and testing phases.

But when I deployed it and tested it against real-life cases, its accuracy dropped drastically. I dug into why, and that's when I learned about Data Leakage.

Well, what is Data Leakage?

Data Leakage is the use of information in the model training process which would not be available at prediction time, causing the predictive scores (metrics) to overestimate the model’s utility when running in a production environment.

Let me unpack that definition. We usually split a dataset into two parts, a training set and a test set. What if the test set already contains the information you are trying to predict? Of course, the model will produce accurate results.

The result is poor generalization and an over-estimation of expected performance. Data leakage often creeps in subtly and inadvertently, and it can cause over-fitting.
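To see how dramatic this over-estimation can be, here is a minimal, purely illustrative sketch (synthetic data, scikit-learn assumed): one feature is just a noisy copy of the target, so validation accuracy looks almost perfect even though such a feature would never be available at prediction time.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                      # binary target
honest_feature = rng.normal(size=1000)                 # unrelated to y
leaky_feature = y + rng.normal(scale=0.1, size=1000)   # a noisy copy of the target

X = np.column_stack([honest_feature, leaky_feature])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # near-perfect, but meaningless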

How does it Happen?

Data leakage can occur by:

  • Leaking data from the test set into the training set.
  • Leaking the correct prediction or ground truth into the test data.
  • Leaking information from the future into the past (see the time-split sketch after this list).
  • Distorting information from samples outside of the model’s intended use.
  • Any of the above present in third-party data joined to the training set.
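The third point is worth a quick illustration. Here is a minimal sketch (pandas assumed, column names hypothetical) of splitting by time instead of at random, so that rows from the future never end up in the training set for predictions about the past:

import pandas as pd

# Hypothetical data with a timestamp column; the names are illustrative only.
df = pd.DataFrame({
    "event_date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "feature": range(10),
    "target": [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})

# A random split could place later rows in training and earlier rows in testing,
# letting information from the future leak into the past. Splitting on time avoids that.
cutoff = pd.Timestamp("2020-01-08")
train = df[df["event_date"] < cutoff]
test = df[df["event_date"] >= cutoff]
print(len(train), len(test))   # 7 training rows, 3 test rows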

Types of Data Leakage

Leakage can occur at many steps of the machine learning process.

The causes can be grouped into two sources of leakage for a model:

1. Target/Feature Leakage:

Target leakage occurs when your predictors include data that will not be available at the time you make predictions. This can include leaks which partially give away the label.

Example:

Let's say you want to predict who will get sick with pneumonia. Alongside the target got_pneumonia, the data contains a boolean column took_antibiotic_medicine.

People take antibiotic medicines after getting pneumonia to recover. The model would see that anyone who has a value of False for took_antibiotic_medicine didn't have pneumonia.

The raw data shows a strong relationship between those columns, but took_antibiotic_medicine is frequently changed after the value of got_pneumonia is determined.

Since validation data comes from the same source as the training data, the pattern repeats itself in validation, and the model will have great validation (or cross-validation) scores. This is target leakage.
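A sketch of the fix, using a hypothetical DataFrame whose column names follow this example (the rows and extra columns are made up): drop the leaky column, along with the target, before training.

import pandas as pd

# Hypothetical rows; only the column names follow the example above.
data = pd.DataFrame({
    "age": [65, 72, 58, 36],
    "weight": [100, 130, 100, 80],
    "took_antibiotic_medicine": [False, False, True, False],
    "got_pneumonia": [False, False, True, False],
})

# took_antibiotic_medicine is set *after* the diagnosis, so it must be dropped;
# otherwise the model simply reads the answer off this column.
X = data.drop(columns=["took_antibiotic_medicine", "got_pneumonia"])
y = data["got_pneumonia"]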

2. Train-test Contamination:

This leak occurs when you aren’t careful to distinguish training data from validation data.

Validation is meant to be a measure of how the model does on data it hasn’t considered before. You can corrupt this process in subtle ways if the validation data affects pre-processing behavior.
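Before the full example, here is a minimal sketch of how this happens during pre-processing (scikit-learn assumed, placeholder data): fitting a scaler on the full dataset lets the validation rows influence how the training rows are transformed, while fitting it on the training rows alone keeps the validation set truly unseen.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)                # placeholder feature matrix
y = np.random.randint(0, 2, size=100)     # placeholder binary target

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Leaky: StandardScaler().fit(X) would compute its statistics on validation rows too.
# Safe: fit on the training rows only, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_valid_scaled = scaler.transform(X_valid)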

Example:

We will use a dataset about credit card applications. Information about each application is stored in a DataFrame X, and we'll use it to predict which applications were accepted, stored in a Series y.

Using Cross-Validation on the dataset we get:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# There is no preprocessing yet, so the pipeline isn't strictly needed;
# it is used anyway as a best practice.
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())

Output: the mean cross-validation accuracy comes out to about 0.98.

It’s very rare to find models that are accurate 98% of the time. It happens, but it’s uncommon enough that we should scrutinize the data for leakage.

A few variables look suspicious. For example, does **expenditure** mean expenditure on this card or on cards used before applying?

Basic data comparisons can be very helpful:

# Compare spending for applicants who did and did not receive a card.
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))

Output: 1.00 for those who did not receive a card, and 0.02 for those who did.

As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It’s not surprising that our model appeared to have high accuracy.

But this also seems to be a case of target leakage, where **expenditure** probably means spending on the card the person applied for.

Since **share** is partially determined by **expenditure**, it should be excluded too. The variables **active** and **majorcards** are a little less clear, but from the description, they sound concerning.

In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.
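One way to act on that caution, continuing from the snippet above, is to drop the suspect columns and run cross-validation again; the score should fall to a lower but far more trustworthy level.

# Drop the columns suspected of leaking the target, then re-run cross-validation.
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')
print("Cross-validation accuracy (leaky predictors removed): %f" % cv_scores.mean())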

Avoiding Data Leakage

  • Remove leaky variables from the feature set.
  • Use pipelines so that pre-processing is fit only on training data (see the sketch after this list).
  • Add noise to the dataset to blunt the effect of any leaky signal.
  • Keep a true holdout dataset for a final check before deployment.
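As a sketch of the pipeline point (scikit-learn assumed; X and y as before, and assuming the features are numeric): when the pre-processing steps live inside the pipeline, cross_val_score re-fits them on each training fold only, so the validation fold never influences them.

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imputation and scaling are part of the pipeline, so they are fit per training fold.
leak_free_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(leak_free_pipeline, X, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy: %f" % scores.mean())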

Thank you for reading this post. I hope you enjoyed it and learned something new today. Feel free to contact me through my blog if you have questions; I will be more than happy to help.

You can find more posts I’ve published related to Python and Machine Learning.
Stay safe and Happy coding!
