Why this matters
Data leakage silently inflates model performance during training and validation, and the model then fails in production. As a Data Scientist, you will:
- Build churn, fraud, demand, or credit risk models under strict time and data constraints.
- Engineer features from logs, transactions, text, or images where future information is easy to leak.
- Set up cross-validation that matches real-world deployment (especially time-based).
Preventing leakage gives realistic metrics, stable models, and trust from stakeholders.
Who this is for
- Data Scientists and ML Engineers shipping models to production.
- Analysts moving from EDA to predictive modeling.
- Students practicing Kaggle-like projects who want real-world rigor.
Prerequisites
- Basic supervised learning concepts (train/validation/test).
- Familiarity with cross-validation and pipelines.
- Comfort with feature engineering (aggregations, encoding, scaling).
Concept explained simply
Leakage happens when your model learns from information it would not have at prediction time. This could be future data, the target (directly or indirectly), or statistics computed using the full dataset (including validation/test or future rows).
Mental model
Imagine a sealed box at prediction time. Only data available before the event lives inside the box. If a feature uses anything outside the box (future rows, target, test folds), it's a leak.
Common types of leakage
- Target leakage: Features derived from the target or post-outcome data (e.g., including refund flag in fraud prediction).
- Train–test contamination: Fitting imputers/scalers/selectors on the full dataset; computing aggregates before splitting.
- Temporal leakage: Using future periods to create features for past predictions.
- Group leakage: Same entity appears in train and validation in a way that shares information (e.g., duplicate users across folds with target-related features).
- Proxy leakage: Unique IDs or timestamps that encode the target indirectly (e.g., specific clinic code that only appears for positive cases).
Detection checklist
- Was the split done before any fitting or aggregation?
- Do feature computations only use data available up to the prediction timestamp?
- Are preprocessing steps (imputation, scaling, encoding, selection) fit only on training folds?
- Is cross-validation aligned with the deployment scenario (time-aware or group-aware)?
- Do any features correlate suspiciously strongly with the target (e.g., |r| > 0.98), or derive from post-event flags? (See the audit sketch after this checklist.)
- Are entity IDs, timestamps, or location codes leaking label information?
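A quick audit sketch for that correlation check, assuming a pandas DataFrame df with a numeric (e.g., 0/1) target column; both names are placeholders, not part of the examples below:

import pandas as pd

def flag_suspicious_features(df: pd.DataFrame, target_col: str = 'target',
                             threshold: float = 0.98) -> pd.Series:
    # Numeric features whose absolute correlation with the target exceeds the
    # threshold are candidates for target or proxy leakage.
    numeric = df.select_dtypes('number').drop(columns=[target_col], errors='ignore')
    corr = numeric.corrwith(df[target_col]).abs()
    return corr[corr > threshold].sort_values(ascending=False)

Anything this flags still needs a manual check of how and when the feature is computed.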
Worked examples
Example 1 — Average spend aggregation (time leakage)
Task: Predict next-month churn. You compute for each user: avg_spend_last_3_months. You accidentally aggregate using all months, including months after the prediction month.
Show why it's leakage
You used future transactions that won't exist at prediction time. The fix: for each reference date, compute features from data strictly before that date (use expanding or rolling windows with proper cutoffs).
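A minimal sketch of the fix, assuming a transactions DataFrame with user_id, txn_date, and amount columns and a reference_dates DataFrame with user_id and ref_date (the per-user prediction date); all names are illustrative:

import pandas as pd

def avg_spend_last_3_months(transactions: pd.DataFrame,
                            reference_dates: pd.DataFrame) -> pd.DataFrame:
    # Average spend per user over the 90 days strictly before that user's
    # reference date; rows on or after the reference date never enter the window.
    merged = transactions.merge(reference_dates, on='user_id')
    window = merged[
        (merged.txn_date < merged.ref_date)
        & (merged.txn_date >= merged.ref_date - pd.Timedelta(days=90))
    ]
    return (window.groupby('user_id')['amount']
                  .mean()
                  .rename('avg_spend_last_3_months')
                  .reset_index())

Because the filter uses a strict inequality on ref_date, no transaction from the prediction month or later can leak into the aggregate.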
Example 2 — Scaling on full data (train–test contamination)
Task: Predict default. You fit StandardScaler on the full dataset, then split into train/test.
Show why it's leakage
Test set statistics influenced the scaler, making validation optimistic. The fix: put scaler inside a Pipeline and fit only on training data (or on each CV fold separately).
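A minimal contrast, assuming a feature matrix X and labels y are already loaded (placeholder names):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Leaky: StandardScaler().fit(X) before splitting lets test rows shape the mean/std.
# Safe: split first, then let the pipeline learn scaling statistics from X_train only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))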
Example 3 — Target encoding with KFold done wrong
Task: Categorical merchant_id target encoding for fraud detection. You compute mean(target) per merchant using the whole training set, then use it for both train and validation.
Show why it's leakage
Train rows "see" their own labels via the encoding. The fix: out-of-fold target encoding—compute encoding for each fold using only other folds; for test, fit encoding on the full training set.
Example 4 — Proxy feature
Task: Readmission prediction. Feature includes discharge_department_code. Certain rare departments only discharge critical cases.
Show why it's leakage
The code acts as a near-label proxy learned from process artifacts. Audit rare categories and high target correlation; consider removing, grouping, or proving availability at prediction time.
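One way to run that audit, assuming a DataFrame with the candidate categorical column and the target (placeholder names):

import pandas as pd

def audit_category(df: pd.DataFrame, col: str, target: str = 'target',
                   min_count: int = 30) -> pd.DataFrame:
    # Rare categories whose target rate sits near 0 or 1 are candidate proxies.
    stats = df.groupby(col)[target].agg(count='count', target_rate='mean')
    return stats[(stats['count'] < min_count)
                 & ((stats['target_rate'] > 0.95) | (stats['target_rate'] < 0.05))]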
How to prevent leakage
- Split first, then compute: Always split by time/groups before feature computation and fitting.
- Pipelines everywhere: Put imputation, scaling, selection, and models inside one Pipeline so CV fits steps per fold.
- Time-aware validation: Use rolling/expanding windows or TimeSeriesSplit. Never shuffle time.
- Group-aware validation: If the same entity can appear multiple times, use GroupKFold or grouped time splits (sketched after this list).
- Out-of-fold encodings: For target encoding, stacking, model-based features—compute with OOF strategy.
- Backtesting: Simulate deployment by scoring future slices only. Compare metrics across slices to detect drift/leakage.
- Data contracts: Document for each feature: source, timestamp, availability lag, and transformation.
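A minimal group-aware CV sketch, assuming a feature matrix X, labels y, and a groups array (e.g., user IDs) aligned with the rows; all names are placeholders:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# GroupKFold keeps every group (e.g., a user) entirely inside one fold,
# so no entity contributes rows to both train and validation.
pipe = make_pipeline(SimpleImputer(strategy='median'), GradientBoostingClassifier())
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)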
Implementation sketch (scikit-learn style)
# 'data' is assumed to be a DataFrame with a datetime 'date' column and a 'target' label
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Split by time: everything before 2023-01-01 trains, the next two months validate
train = data[data.date < '2023-01-01']
valid = data[(data.date >= '2023-01-01') & (data.date < '2023-03-01')]

train_features = train.drop(columns=['target', 'date'])
train_target = train['target']

# Pipeline to avoid contamination: imputer and scaler are refit inside each CV fold
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# TimeSeriesSplit for CV: folds stay chronological (rows must be sorted by date)
tss = TimeSeriesSplit(n_splits=4)
cv_scores = cross_val_score(pipe, train_features, train_target, cv=tss)

# Target encoding (OOF) concept:
#   for each fold in CV:
#       fit the encoding on the other folds
#       transform the held-out fold
#   fit the final encoding on the full training set before scoring test
Exercises
Exercise 1 — Spot the leaks
You are predicting whether a customer will churn in March 2024. You have features per customer constructed on April 5, 2024:
- months_active: count of months active up to April 5, 2024
- avg_txn_value_90d: average transaction value from Jan 1 to Apr 5, 2024
- last_ticket_resolved_date: date the most recent support ticket was resolved (could be after March)
- tenure_months: months since signup (as of March 1, 2024)
- is_discount_eligible: flag from pricing system as of Feb 28, 2024
Which features leak for a March prediction? Write the leaking feature names and why.
Expected output format
List features that leak and a one-line reason for each.
Exercise 2 — Safe target encoding plan
Design a step-by-step plan to compute target-encoded merchant_id safely with 5-fold CV and then score a held-out test set. Write the steps clearly.
Expected output format
Numbered steps covering data split, OOF computation, regularization, and final fit.
Self-checklist for exercises
- Did you separate data by prediction date before computing features?
- Did your target encoding avoid using a row’s own label?
- Are all preprocessing steps fit only on training folds?
Common mistakes and how to self-check
- Fitting on all data: Any transform fit before splitting is a red flag. Self-check: verify fit() is called inside CV/Pipeline.
- Shuffled time series: Random CV on time-ordered data hides leakage. Self-check: confirm folds respect chronology (see the diagnostic sketch after this list).
- Global aggregations: Aggregates computed over the full dataset, including future or held-out rows. Self-check: recompute with windowed/expanding features.
- Leaky encodings: Target encoding without OOF. Self-check: confirm fold-wise computation excludes the fold.
- Proxy IDs: High-cardinality IDs that yield unrealistically high accuracy. Self-check: remove or hash/bucket them, then re-evaluate.
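One quick diagnostic for the shuffled-time mistake, assuming time-ordered X and y and a pipeline like pipe from the implementation sketch (placeholder names): a large gap between shuffled and chronological scores suggests temporal leakage or drift.

from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Rows must be sorted by time for TimeSeriesSplit to be meaningful.
shuffled = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
chrono = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))
print(f'shuffled: {shuffled.mean():.3f}  time-aware: {chrono.mean():.3f}')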
Practical projects
- Time-based churn model: Build monthly churn prediction with rolling-window features and backtesting across 6 months.
- Fraud detection with OOF encodings: Use merchant and user categorical encodings with 5-fold OOF, evaluate stability by month.
- Group-aware recommendation: Predict next purchase category ensuring the same user never appears in both train and validation folds.
Learning path
- Before this: Data splitting strategies, Pipelines/transformers, Robust EDA.
- Now: Leakage Prevention in feature engineering.
- Next: Safe target encoding and stacking, Time-series backtesting, Monitoring for drift after deployment.
Next steps
- Refactor your current project to use pipelines and proper CV.
- Add time-aware feature generation with strict cutoffs.
- Run a rolling backtest and compare to your prior results—if the gap shrinks, you likely removed leakage.
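A minimal monthly rolling-backtest sketch, assuming a DataFrame df with date, feature, and target columns, a feature_cols list, and the pipeline from the implementation sketch; all names are placeholders:

import pandas as pd
from sklearn.metrics import roc_auc_score

# Score each month using only data strictly before its start.
for cutoff in pd.date_range('2023-07-01', '2023-12-01', freq='MS'):
    past = df[df.date < cutoff]
    month = df[(df.date >= cutoff) & (df.date < cutoff + pd.DateOffset(months=1))]
    pipe.fit(past[feature_cols], past['target'])
    preds = pipe.predict_proba(month[feature_cols])[:, 1]
    print(cutoff.date(), roc_auc_score(month['target'], preds))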
Mini challenge
Scenario: You forecast daily demand for April 2024 on March 31. Your dataset includes: weather (actual), promotions calendar (planned), inventory levels (as of month end), and sales up to April 30. Identify all sources of leakage and outline how to rebuild features without using any post-March 31 information.