Surrogate Keys vs. Natural Keys: Why Stability Matters

Every row in a database needs an identifier. The decision of what that identifier is — a business key from the source system, or a system-generated one — has long-term consequences for the reliability of your data model.

Natural Keys: The Intuitive Choice

A natural key is an identifier that comes from the business domain or source system: customer_id from the CRM, order_number from the order management system, an email address, a product SKU.

The appeal is obvious: these identifiers already exist, have business meaning, and make data readable without joins. A fact table row with customer_id = 'C-1042' immediately tells you which customer it refers to.

The problems become apparent over time:

Source systems change natural keys. A company migrates CRMs. Customer IDs are reassigned. Order numbers are reformatted when the order system is upgraded. Your historical data now refers to identifiers that mean something different — or nothing at all.

Merges and deduplication. Two source systems have the same customer (same person, different IDs). A deduplication process merges them into one canonical record. Which ID do you keep? Both historical records need to point to the same entity.

Multi-source identity. You're ingesting from three different databases that all identify products differently. There's no common natural key — you have to build one.

PII leakage. If the natural key is an email address, it appears in every fact table row, in every join, in every exported report. That's a compliance risk.

Surrogate Keys: System-Generated Stability

A surrogate key is an identifier generated by the data system, decoupled from the source: a UUID, a hash, an auto-incrementing integer.

The data system controls this key. It never changes because the source changes. It's assigned once when the row enters the warehouse.

Why this matters:

Stability. When the source system reassigns a customer ID, the surrogate key for that customer's historical records remains unchanged. Joins work across time.

Deduplication without ambiguity. When two source records are determined to be the same entity, they map to one surrogate key. The history of both collapses cleanly into one entity.

Source system decoupling. The warehouse model is independent of the source system's key strategy. You can ingest from three different systems with different ID schemes and present a unified view.

The Standard Pattern

The standard approach in analytical engineering:

Ingest the natural key from the source (store it as a reference, not as the join key)
Assign a surrogate key at the staging layer — typically a hash of the natural key(s) or a UUID
Use the surrogate key in all fact-dimension joins

In staging:

SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_sk,
    customer_id AS source_customer_id,
    name,
    email,
    ...
FROM {{ source('raw', 'customers') }}

The dimension table has a customer_sk (surrogate, used for joins) and a source_customer_id (natural, for tracing back to the source).

Handling Duplicates at the Source

Source systems sometimes send duplicate records — the same entity with slightly different data. Before assigning a surrogate key, you need to pick one canonical version.

Common approaches:

Last-write-wins: take the row with the most recent updated_at. Simple and usually correct.

Explicit deduplication logic: for known sources of duplication (e.g., two systems each have a version of the same customer), apply a deterministic rule to pick the authoritative record.

Snapshot deduplication: for SCD sources, filter to the current version (WHERE valid_to IS NULL) before assigning the surrogate key.

The key point: deduplication must happen before the surrogate key is assigned. If you assign surrogate keys to duplicates, you have two rows in your dimension table for the same entity, and every downstream join will produce doubled results.

Natural Keys as Join Filters, Not Join Keys

Natural keys still have a role — just not as the primary join key. They're useful for:

Source traceability: finding the original source record from a warehouse row
Cross-system lookups: matching records when the surrogate key is unknown
Human-readable filters: WHERE source_customer_id = 'C-1042' in ad-hoc queries

Keep them in the model as a reference column. Don't use them as the primary join key between facts and dimensions.

The Takeaway

Natural keys are fragile because the business domain changes. Surrogate keys are stable because the data system controls them. Use surrogate keys as the primary join key in all fact-dimension relationships. Retain natural keys as a reference column for traceability. And deduplicate before assigning surrogate keys — downstream joins depend on dimension tables having exactly one row per entity.