Cross-Validation

A technique to estimate how well an ML model will generalise to unseen data.

What is Cross-Validation?

A single naive train/test split gives you one estimate of model performance. Cross-validation gives you several, by splitting the data in multiple ways and evaluating on each split, which makes the overall estimate more reliable.

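The classic approach is **K-fold cross-validation**: split the data into K roughly equal folds, train on K-1 of them, validate on the held-out fold, rotate until every fold has served as the validation set once, then average the K scores. K=5 and K=10 are typical choices. For time-series data, use **time-series cross-validation** instead, which always trains on earlier observations and validates on later ones so that temporal order is respected.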
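The splitting logic can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `k_fold_indices` and `time_series_indices` are hypothetical, the folds are contiguous blocks (in practice the rows are usually shuffled first for K-fold), and the time-series variant assumes an expanding training window. Libraries such as scikit-learn offer ready-made splitters (`KFold`, `TimeSeriesSplit`) that would normally be used instead.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, validation indices)


def k_fold_indices(n_samples: int, k: int) -> Iterator[Split]:
    """Yield one (train, validation) split per fold; each row is validated once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def time_series_indices(n_samples: int, n_splits: int) -> Iterator[Split]:
    """Yield expanding-window splits: train on the past, validate on the next block."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    # Rows left over after the last full block are ignored in this sketch.
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        validation = list(range(i * block, (i + 1) * block))
        yield train, validation


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, 5):
        print("k-fold      train:", train, "validate:", validation)
    for train, validation in time_series_indices(10, 4):
        print("time-series train:", train, "validate:", validation)
```

In the K-fold splits the validation blocks together cover every row exactly once, while in the time-series splits every training index precedes every validation index, so the model never sees the future.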

A single train/test split can give a misleadingly good or bad score by luck. Cross-validation also reveals the variability across splits: a model with a mean accuracy of 82% and a standard deviation of 2% is far more dependable than one with the same mean and a standard deviation of 15%.
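To make the mean-versus-spread comparison concrete, the sketch below summarises two sets of per-fold accuracies. The scores are invented purely for illustration and do not come from any real model; `summarise` is a hypothetical helper name. Both lists are chosen to share the same mean so that only the spread differs.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarise(fold_scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


# Hypothetical per-fold accuracies from two 5-fold runs (illustrative only).
stable_scores = [0.80, 0.82, 0.84, 0.81, 0.83]
unstable_scores = [0.62, 0.97, 0.70, 0.99, 0.82]

stable_mean, stable_std = summarise(stable_scores)
unstable_mean, unstable_std = summarise(unstable_scores)

if __name__ == "__main__":
    print(f"stable:   mean={stable_mean:.2%}  std={stable_std:.2%}")
    print(f"unstable: mean={unstable_mean:.2%}  std={unstable_std:.2%}")
```

A single train/test split could have landed on any one of the unstable scores, which is why reporting the spread alongside the mean matters.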

Why this matters

Cross-validation is one of the steps junior data scientists most often skip, and one that senior interviewers routinely probe. Understanding it well signals real ML maturity.

Real-world example (India)

A Hyderabad credit-risk team caught overfitting in their model only because cross-validation showed accuracy ranging from 71% to 86% across folds, a sign that the model was unstable. After they added regularisation, the variance across folds dropped and the model was deployed successfully.