Data Understanding: How Your Data Was Born Matters

Data doesn't appear from nowhere. Every dataset is the output of a process — a system, a workflow, a set of human decisions. Understanding that process is as important as understanding the data itself.

The Data Generating Process

The Data Generating Process (DGP) describes how observations ended up in your dataset. Who created this data? What system recorded it? What incentives shaped it? What events had to occur for a row to exist at all?

This sounds abstract, but the implications are concrete.

In credit risk, loan application data is generated by a specific process: a customer applies → a credit officer reviews the application → the bank approves or rejects. Only approved loans have repayment outcomes; rejected loans have no label. This is selection bias (in credit scoring, the reject-inference problem): your training data contains only the customers the bank already believed were good risks. If you train a default model on this data and apply it to all applicants, you are extrapolating into a region your model has never seen.
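
As a concrete check, here is a minimal sketch of measuring that gap, assuming a table that records every application, approved or not; the file name and the approved, credit_score, income, and debt_to_income columns are hypothetical:

```python
# A minimal sketch: how different is the labeled (approved) population from
# the full applicant pool? File and column names here are hypothetical.
import pandas as pd

applications = pd.read_csv("applications.csv")   # every application, approved or not
approved = applications[applications["approved"] == 1]

# Share of the applicant population the model will never see a label for
print(f"Unlabeled share: {1 - len(approved) / len(applications):.1%}")

# Compare key features between the full pool and the training population
for col in ["credit_score", "income", "debt_to_income"]:
    print(col,
          "| all applicants:", applications[col].median(),
          "| approved only:", approved[col].median())
```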

Understanding the DGP means asking: for which population is this data representative? The answer is almost never "everyone."

Population vs Sample

Your training data is a sample. The model will be deployed on a population. These two distributions need to match — and they often don't.

Common mismatches:

  • Geographic: trained on customers from urban areas, deployed nationally
  • Temporal: trained on Q1 data, deployed in Q4 (holiday spending patterns differ significantly)
  • Selection bias: trained on customers who opted in to a feature, deployed to all customers
  • Survivorship bias: trained only on customers who completed a transaction, deployed before any transaction occurs

The practical check: describe your training population precisely. Then describe your production population. List every dimension where they differ. Each difference is a potential failure mode.
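
One lightweight way to run this check programmatically, assuming train_df and prod_df are DataFrames sampled from each population with shared numeric columns (the Kolmogorov–Smirnov statistic and the 0.1 cutoff are illustrative choices, not a standard):

```python
# A sketch of a train-vs-production comparison over every numeric feature.
from scipy.stats import ks_2samp

def population_diff_report(train_df, prod_df, threshold=0.1):
    """Flag features whose train/production distributions diverge."""
    flagged = []
    for col in train_df.select_dtypes("number").columns:
        stat, _ = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if stat > threshold:                      # KS statistic lies in [0, 1]
            flagged.append((col, round(stat, 3)))
    return sorted(flagged, key=lambda x: -x[1])   # worst mismatches first
```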

EDA as Hypothesis Testing

Exploratory Data Analysis is often treated as "looking at data to get familiar with it." That framing is too passive.

Effective EDA is hypothesis testing about the DGP. You form a hypothesis ("income is normally distributed among applicants"), then check whether the data confirms or refutes it. When you find something unexpected, the question isn't "is this an error?" — it's "what does this tell me about how the data was generated?"

Target distribution — is the target imbalanced? If 97% of loans are repaid, a model that predicts "repaid" for everyone achieves 97% accuracy but is useless. Class imbalance shapes your choice of metrics, loss function, and thresholds.
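
A minimal sketch of the trap, with simulated labels at a 3% default rate:

```python
# The majority-class baseline: ~97% accuracy, zero ability to catch defaults.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.03).astype(int)  # simulated 3% default rate
y_pred = np.zeros_like(y_true)                    # predict "repaid" for everyone

print("accuracy:", accuracy_score(y_true, y_pred))          # ~0.97, looks great
print("recall on defaults:", recall_score(y_true, y_pred))  # 0.0, useless
```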

Missing values — the three types have very different implications:

  • MCAR (Missing Completely At Random): missingness is unrelated to any variable. Safe to drop, or to impute with the mean/median.
  • MAR (Missing At Random): missingness depends on other observed variables. Impute using those other features.
  • MNAR (Missing Not At Random): missingness depends on the missing value itself. The missingness is itself signal.

MNAR is the dangerous one. If "income" is missing because customers with very low income chose not to disclose it, then income_is_missing = True is a strong predictor of default. Imputing with the median destroys that signal.
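
A minimal sketch of the safe order of operations, using the income example above (the helper name is hypothetical):

```python
# Preserve the MNAR signal: record the missingness *before* imputing.
import pandas as pd

def impute_with_indicator(df: pd.DataFrame, col: str) -> pd.DataFrame:
    df = df.copy()
    df[f"{col}_is_missing"] = df[col].isna().astype(int)  # keep the signal
    df[col] = df[col].fillna(df[col].median())            # then fill the gap
    return df

# df = impute_with_indicator(df, "income")
```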

Outliers — noise or signal? A transaction of $1,000,000 in a dataset of consumer purchases might be a data entry error (noise) or a legitimate high-value transaction (signal) or a fraud attempt (signal of a different kind). The DGP tells you which interpretation is plausible.
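
One pragmatic pattern is to flag outliers and route them to review rather than deleting them. A minimal sketch, using an illustrative 3×IQR fence:

```python
# Flag outliers for DGP-informed review instead of silently dropping them.
import pandas as pd

def flag_outliers(df: pd.DataFrame, col: str, k: float = 3.0) -> pd.Series:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)

# Route flagged rows to someone with domain knowledge of the DGP:
# df[flag_outliers(df, "amount")].to_csv("outliers_for_review.csv")
```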

The Representativeness Principle

Training data must be representative of the distribution the model will face in production. This is the single most important data requirement.

When it's violated, the model learns patterns that don't generalize. This is not an overfitting problem — it's a distributional mismatch that no amount of regularization can fix.

Practical steps:

  1. Define the reference population for production deployment
  2. Compare training and production distributions for every feature, using PSI (Population Stability Index) or simple histograms; a PSI sketch follows this list
  3. Document every known mismatch and decide whether to resample, reweight, or accept the risk
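
For step 2, here is a minimal PSI sketch, assuming two numeric arrays sampled from the training and production populations. Binning by training-data quantiles is one common choice among several:

```python
# PSI between a training feature and its production counterpart.
# Bin edges come from training quantiles; eps avoids log(0).
import numpy as np

def psi(train, prod, bins=10, eps=1e-4):
    edges = np.unique(np.quantile(train, np.linspace(0, 1, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf        # catch values outside the range
    expected = np.histogram(train, edges)[0] / len(train) + eps
    actual = np.histogram(prod, edges)[0] / len(prod) + eps
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Common rule-of-thumb thresholds: < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant shift warranting investigation.
```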

What EDA Actually Produces

EDA should produce not just charts, but decisions:

  • Which features have distributions inconsistent with the DGP → investigate, potentially drop
  • Which missing value patterns are MNAR → create missingness indicator features
  • Which outliers are signal → keep and handle explicitly
  • Which segments of the population are underrepresented → flag for validation

A profiling tool like ydata-profiling is useful for surface-level checks, but it cannot tell you whether the data is correct relative to how it was generated. That judgment requires domain knowledge.
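
For reference, the surface-level pass itself is only a couple of lines; the file name and report title here are illustrative:

```python
# A surface-level profiling pass; it catches type issues and odd
# distributions, not DGP-level errors.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("loans.csv")
ProfileReport(df, title="Loan data profile").to_file("profile.html")
```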

The goal of data understanding is not familiarity — it's confidence that your training data represents the right slice of reality for the problem you're solving.