
Data Engineering: Principles & Patterns

A series on the foundational ideas behind modern data pipelines — how to think about them, design them, and operate them reliably. Each post focuses on a principle rather than a tool. Code examples are drawn from the jaffle-shop reference implementation.


Module 1 — Foundations & Philosophy

The mental models that underpin everything else. Start here.

| # | Post | What it covers |
| --- | --- | --- |
| 1.1 | Two Mental Models: Mutable Entities vs. Immutable Events | The most fundamental split in data engineering — how you classify data determines every design decision |
| 1.2 | ELT vs. ETL: A Paradigm Shift, Not Just a Letter Swap | Why modern stacks load raw first and transform inside the warehouse |
| 1.3 | Layered Pipeline Architecture: Why Not One Script? | Separation of concerns across ingestion, staging, intermediate, and marts |
| 1.4 | Batch vs. Streaming vs. Micro-batch | Choosing the right processing model — and why real-time is often the wrong answer |
| 1.5 | Data Warehouse vs. Data Lake vs. Lakehouse | A tradeoff between structure, cost, and flexibility — not a tool war |
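The split in post 1.1 can be shown in a few lines: an immutable event log is append-only, and the mutable-entity view is just a fold over it. This is a minimal illustrative sketch — the records and field names are invented, not from the jaffle-shop repo.

```python
# Immutable events: append-only facts about what happened.
# (Illustrative records, not from the jaffle-shop repo.)
events = [
    {"order_id": 1, "event": "placed",   "ts": "2024-01-01"},
    {"order_id": 1, "event": "shipped",  "ts": "2024-01-02"},
    {"order_id": 2, "event": "placed",   "ts": "2024-01-03"},
    {"order_id": 1, "event": "returned", "ts": "2024-01-05"},
]

def current_state(events):
    """Mutable-entity view: fold the event log into the latest state per order."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        state[e["order_id"]] = e["event"]  # later events overwrite earlier ones
    return state

print(current_state(events))  # each order_id mapped to its most recent status
```

The event log can always reproduce the entity view; the reverse is not true — which is why the classification drives every downstream design decision.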

Module 2 — Ingestion

Getting data in reliably, at scale, without surprises.

| # | Post | What it covers |
| --- | --- | --- |
| 2.1 | Three Ingestion Patterns: Full Refresh, Incremental, CDC | Choosing by data nature, not tool capability |
| 2.2 | Schema Evolution: When Your Source Changes Without Warning | The most underestimated operational challenge — breaking vs. non-breaking changes |
| 2.3 | Idempotency: Pipelines That Are Safe to Run Twice | Why retries are inevitable and how to design for them |
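The idempotency idea from post 2.3 reduces to one rule: merge by key instead of blindly appending, so a retried batch is a no-op. A minimal sketch with invented rows (a real pipeline would do this as a warehouse `MERGE`, or via dlt's merge write disposition):

```python
def upsert(target, rows, key="id"):
    """Idempotent load: merge rows by primary key instead of blind append.
    Running the same batch twice leaves the target unchanged."""
    index = {r[key]: r for r in target}
    for r in rows:
        index[r[key]] = r  # insert new rows, overwrite existing ones
    return list(index.values())

target = [{"id": 1, "status": "placed"}]
batch = [{"id": 1, "status": "shipped"}, {"id": 2, "status": "placed"}]

once = upsert(target, batch)
twice = upsert(once, batch)  # simulate a retry of the same batch
assert once == twice         # safe to run twice: no duplicates, same state
```

An append-only load of the same batch would instead leave a duplicate of order 1 — the failure mode the post is about.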

Module 3 — Storage

How data is stored affects every query that runs against it.

| # | Post | What it covers |
| --- | --- | --- |
| 3.1 | Storage Formats: Why Parquet Is Not Just "Smaller CSV" | Row vs. column orientation, file format tradeoffs, compression |
| 3.2 | Partitioning and Clustering: Designing for Query Patterns | Partition pruning, the small files problem, clustering as a complement |
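Partition pruning (post 3.2) in miniature: when data is laid out by partition key, a filter on that key skips whole partitions by path alone, before any rows are read. The Hive-style `order_date=...` keys below are illustrative stand-ins for directories of Parquet files:

```python
# Hypothetical partitioned layout: one entry per order_date, standing in for
# a directory of Parquet files. Pruning decides what to read from the key alone.
partitions = {
    "order_date=2024-01-01": [{"id": 1, "amount": 10}],
    "order_date=2024-01-02": [{"id": 2, "amount": 20}],
    "order_date=2024-01-03": [{"id": 3, "amount": 30}],
}

def scan(partitions, date_from):
    """Read only partitions whose key passes the filter — pruned
    partitions incur no row-level I/O at all."""
    rows, partitions_read = [], 0
    for key, data in partitions.items():
        if key.split("=")[1] >= date_from:
            partitions_read += 1
            rows.extend(data)
    return rows, partitions_read

rows, partitions_read = scan(partitions, "2024-01-02")
assert partitions_read == 2  # the 2024-01-01 partition was never opened
```

The same mechanism is why partitioning only pays off when queries actually filter on the partition key — design for query patterns, not for tidiness.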

Module 4 — Data Modeling

Structuring data for how analysts actually query it.

| # | Post | What it covers |
| --- | --- | --- |
| 4.1 | Dimensional Modeling: Facts, Dimensions, and Star Schema | Why denormalization is correct in analytics |
| 4.2 | Surrogate Keys vs. Natural Keys | Stability over intuition — why business keys are fragile |
| 4.3 | Slowly Changing Dimensions: When History Matters | SCD Type 1/2/6 — the business question determines the approach |
| 4.4 | Fact Table Patterns: Transactional, Periodic Snapshot, Accumulating Snapshot | Three distinct patterns for three types of measurements |
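The SCD Type 2 mechanic from post 4.3, sketched in plain Python: instead of overwriting an attribute, close the current version row and open a new one, so the dimension is queryable "as of" any date. Column names (`valid_from`, `valid_to`, `is_current`) follow common convention but are illustrative here:

```python
from datetime import date

# SCD Type 2 sketch: a change closes the current row and appends a new version,
# preserving history instead of overwriting it (which would be Type 1).
dim_customer = [
    {"customer_id": 7, "city": "Oslo", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_change(dim, customer_id, new_city, changed_on):
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = changed_on   # close the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": changed_on, "valid_to": None,
                "is_current": True})       # open the new version

apply_change(dim_customer, 7, "Bergen", date(2024, 6, 1))
assert len(dim_customer) == 2  # both versions survive; facts from 2023 still
                               # join to "Oslo", newer facts to "Bergen"
```

Whether that history is worth the extra rows and join complexity is exactly the business question the post says should drive the choice of SCD type.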

Module 5 — Transformation

Building the logic that turns raw data into trusted analytics.

| # | Post | What it covers |
| --- | --- | --- |
| 5.1 | Staging → Intermediate → Marts: The Case for Layered Transforms | One responsibility per layer, clear boundaries, controlled blast radius |
| 5.2 | Incremental Processing and Late-Arriving Data | Lookback windows, deduplication, and when to trigger a full rebuild |
| 5.3 | The Semantic Layer: Define Metrics Once, Use Everywhere | Centralizing metric definitions to eliminate inconsistency at scale |
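The lookback-window-plus-dedup pattern from post 5.2, as a toy sketch. A real dbt incremental model expresses this in SQL; here the names (`loaded_at`, `lookback_days`, the high-water mark) are illustrative assumptions:

```python
from datetime import date, timedelta

def incremental_select(source, high_watermark, lookback_days=3):
    """Reprocess a lookback window behind the high-water mark so late-arriving
    rows are picked up, then deduplicate on the key, last write wins."""
    cutoff = high_watermark - timedelta(days=lookback_days)
    window = [r for r in source if r["loaded_at"] >= cutoff]
    latest = {}
    for r in sorted(window, key=lambda r: r["loaded_at"]):
        latest[r["order_id"]] = r  # newest version of each order wins
    return list(latest.values())

source = [
    {"order_id": 1, "loaded_at": date(2024, 1, 8), "status": "placed"},
    {"order_id": 1, "loaded_at": date(2024, 1, 9), "status": "shipped"},  # late update
    {"order_id": 2, "loaded_at": date(2024, 1, 1), "status": "placed"},   # already processed
]

rows = incremental_select(source, high_watermark=date(2024, 1, 10))
assert rows == [{"order_id": 1, "loaded_at": date(2024, 1, 9), "status": "shipped"}]
```

Data arriving later than the lookback window covers is the case this pattern cannot catch — that is when the post's "full rebuild" trigger applies.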

Module 6 — Quality & Reliability

Trustworthy data doesn't happen by accident.

| # | Post | What it covers |
| --- | --- | --- |
| 6.1 | Data Quality as Code: Tests That Ship With the Pipeline | Schema tests, business logic tests, freshness SLAs — fail fast, fail early |
| 6.2 | Data Contracts: The Agreement Between Producers and Consumers | Making implicit dependencies explicit before they break silently |
| 6.3 | Data Observability: Beyond "Did the Job Succeed?" | Freshness, volume, schema, distribution — and why pipeline success ≠ data correctness |
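The three test styles named in post 6.1 — a schema test, a uniqueness test, and a freshness SLA — written as plain functions. In practice these would be declared as dbt tests; this stand-alone sketch (invented rows and column names) just shows the shape of "tests that ship with the pipeline":

```python
from datetime import datetime, timedelta

def not_null(rows, column):
    """Schema test: return the violating rows (empty list means pass)."""
    return [r for r in rows if r.get(column) is None]

def unique(rows, column):
    """Uniqueness test: return duplicated rows (empty list means pass)."""
    seen, dupes = set(), []
    for r in rows:
        if r[column] in seen:
            dupes.append(r)
        seen.add(r[column])
    return dupes

def fresh_within(rows, column, max_age, now):
    """Freshness SLA: newest row must be no older than max_age."""
    return now - max(r[column] for r in rows) <= max_age

rows = [
    {"id": 1, "email": "a@x.io", "loaded_at": datetime(2024, 1, 10, 12, 0)},
    {"id": 2, "email": None,     "loaded_at": datetime(2024, 1, 10, 13, 0)},
]
assert len(not_null(rows, "email")) == 1   # fail fast: one null email caught
assert unique(rows, "id") == []            # ids are unique
assert fresh_within(rows, "loaded_at", timedelta(hours=2),
                    now=datetime(2024, 1, 10, 14, 0))
```

Returning the violating rows, rather than a bare boolean, is what makes a failed test actionable — the same convention dbt tests use.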

Module 7 — Orchestration

Coordinating pipelines that run reliably, recover gracefully, and scale.

| # | Post | What it covers |
| --- | --- | --- |
| 7.1 | DAGs: The Right Mental Model for Pipeline Dependencies | Directed, acyclic, graph — why each property matters |
| 7.2 | Trigger Patterns: Cron, Sensors, and Event-Driven Pipelines | Schedule-based vs. data-aware triggering — the coupling tradeoff |
| 7.3 | Backfilling and Catchup: Reprocessing Historical Data | Idempotency as a prerequisite, partitioned backfill, blast radius control |
| 7.4 | Pipeline as Code: Configuration Over Imperative DAG Definitions | Separating pipeline behavior from pipeline logic for maintainability |
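The DAG mental model from post 7.1 fits in a few lines with the standard library: declare each task's upstream dependencies, and a topological sort yields a valid execution order (and raises `CycleError` if the graph isn't acyclic — exactly the property an orchestrator enforces). Task names here mirror the layered architecture but are illustrative:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each task mapped to its upstream dependencies (its predecessors).
deps = {
    "staging":      {"ingest"},
    "intermediate": {"staging"},
    "marts":        {"intermediate"},
    "dashboard":    {"marts"},
}

order = list(TopologicalSorter(deps).static_order())
# Every task appears after all of its dependencies:
assert order.index("ingest") < order.index("staging") < order.index("marts")
```

Orchestrators like Dagster build on the same idea, adding the parts a sort alone doesn't give you: parallel execution of independent branches, retries, and partitioned backfills.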

Module 8 — Serving

Delivering data to the consumers who depend on it.

| # | Post | What it covers |
| --- | --- | --- |
| 8.1 | Designing the Serving Layer for Different Consumers | BI vs. ML vs. operational — different requirements, different designs |
| 8.2 | Feature Engineering: Bridging Analytics and Machine Learning | Point-in-time correctness, RFM framework, when you need a feature store |
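Point-in-time correctness from post 8.2, in miniature: when assembling training data, each label may only see feature values known *as of* its own timestamp — joining the latest value leaks the future into the model. The history table and field names below are invented for illustration:

```python
from datetime import date

# Versioned feature values over time for one customer (illustrative).
feature_history = [
    {"customer_id": 7, "as_of": date(2024, 1, 1), "lifetime_orders": 3},
    {"customer_id": 7, "as_of": date(2024, 3, 1), "lifetime_orders": 9},
]

def feature_as_of(history, customer_id, ts):
    """Most recent feature value observed at or before ts — never after."""
    candidates = [r for r in history
                  if r["customer_id"] == customer_id and r["as_of"] <= ts]
    return max(candidates, key=lambda r: r["as_of"]) if candidates else None

# A training label dated Feb 1 must see the January value, not March's —
# a naive "latest value" join would return 9 and leak the future.
assert feature_as_of(feature_history, 7, date(2024, 2, 1))["lifetime_orders"] == 3
```

This as-of join is the core operation a feature store automates; whether you need one is largely a question of how many models and teams share such histories.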

All examples reference the jaffle-shop repository — a complete demo pipeline built with dlt, dbt, DuckDB, and Dagster.