
Data Engineering: Principles & Patterns

A series on the foundational ideas behind modern data pipelines — how to think about them, design them, and operate them reliably. Each post focuses on a principle rather than a tool. Code examples are drawn from the jaffle-shop reference implementation.


Module 1 — Foundations & Philosophy

The mental models that underpin everything else. Start here.

| # | Post | What it covers |
| --- | --- | --- |
| 1.1 | Two Mental Models: Mutable Entities vs. Immutable Events | The most fundamental split in data engineering — how you classify data determines every design decision |
| 1.2 | ELT vs. ETL: A Paradigm Shift, Not Just a Letter Swap | Why modern stacks load raw first and transform inside the warehouse |
| 1.3 | Layered Pipeline Architecture: Why Not One Script? | Separation of concerns across ingestion, staging, intermediate, and marts |
| 1.4 | Batch vs. Streaming vs. Micro-batch | Choosing the right processing model — and why real-time is often the wrong answer |
| 1.5 | Data Warehouse vs. Data Lake vs. Lakehouse | A tradeoff between structure, cost, and flexibility — not a tool war |
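The split in post 1.1 can be shown in a few lines: an immutable event log is append-only, and the mutable-entity view is just a fold over it. This is a minimal illustrative sketch — the records and field names are invented, not from the jaffle-shop repo.

```python
# Immutable events: append-only facts about what happened.
# (Illustrative records, not from the jaffle-shop repo.)
events = [
    {"order_id": 1, "event": "placed",   "ts": "2024-01-01"},
    {"order_id": 1, "event": "shipped",  "ts": "2024-01-02"},
    {"order_id": 2, "event": "placed",   "ts": "2024-01-03"},
    {"order_id": 1, "event": "returned", "ts": "2024-01-05"},
]

def current_state(events):
    """Mutable-entity view: fold the event log into the latest state per order."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        state[e["order_id"]] = e["event"]  # later events overwrite earlier ones
    return state

print(current_state(events))  # each order_id mapped to its most recent status
```

The event log can always reproduce the entity view; the reverse is not true — which is why the classification drives every downstream design decision.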

Module 2 — Ingestion

Getting data in reliably, at scale, without surprises.

| # | Post | What it covers |
| --- | --- | --- |
| 2.1 | Three Ingestion Patterns: Full Refresh, Incremental, CDC | Choosing by data nature, not tool capability |
| 2.2 | Schema Evolution: When Your Source Changes Without Warning | The most underestimated operational challenge — breaking vs. non-breaking changes |
| 2.3 | Idempotency: Pipelines That Are Safe to Run Twice | Why retries are inevitable and how to design for them |
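The idempotency idea from post 2.3 reduces to one rule: merge by key instead of blindly appending, so a retried batch is a no-op. A minimal sketch with invented rows (a real pipeline would do this as a warehouse `MERGE`, or via dlt's merge write disposition):

```python
def upsert(target, rows, key="id"):
    """Idempotent load: merge rows by primary key instead of blind append.
    Running the same batch twice leaves the target unchanged."""
    index = {r[key]: r for r in target}
    for r in rows:
        index[r[key]] = r  # insert new rows, overwrite existing ones
    return list(index.values())

target = [{"id": 1, "status": "placed"}]
batch = [{"id": 1, "status": "shipped"}, {"id": 2, "status": "placed"}]

once = upsert(target, batch)
twice = upsert(once, batch)  # simulate a retry of the same batch
assert once == twice         # safe to run twice: no duplicates, same state
```

An append-only load of the same batch would instead leave a duplicate of order 1 — the failure mode the post is about.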

Module 3 — Storage

How data is stored affects every query that runs against it.

| # | Post | What it covers |
| --- | --- | --- |
| 3.1 | Storage Formats: Why Parquet Is Not Just "Smaller CSV" | Row vs. column orientation, file format tradeoffs, compression |
| 3.2 | Partitioning and Clustering: Designing for Query Patterns | Partition pruning, the small files problem, clustering as a complement |
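Partition pruning (post 3.2) in miniature: when data is laid out by partition key, a filter on that key skips whole partitions by path alone, before any rows are read. The Hive-style `order_date=...` keys below are illustrative stand-ins for directories of Parquet files:

```python
# Hypothetical partitioned layout: one entry per order_date, standing in for
# a directory of Parquet files. Pruning decides what to read from the key alone.
partitions = {
    "order_date=2024-01-01": [{"id": 1, "amount": 10}],
    "order_date=2024-01-02": [{"id": 2, "amount": 20}],
    "order_date=2024-01-03": [{"id": 3, "amount": 30}],
}

def scan(partitions, date_from):
    """Read only partitions whose key passes the filter — pruned
    partitions incur no row-level I/O at all."""
    rows, partitions_read = [], 0
    for key, data in partitions.items():
        if key.split("=")[1] >= date_from:
            partitions_read += 1
            rows.extend(data)
    return rows, partitions_read

rows, partitions_read = scan(partitions, "2024-01-02")
assert partitions_read == 2  # the 2024-01-01 partition was never opened
```

The same mechanism is why partitioning only pays off when queries actually filter on the partition key — design for query patterns, not for tidiness.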

Module 4 — Data Modeling

Structuring data for how analysts actually query it.

| # | Post | What it covers |
| --- | --- | --- |
| 4.1 | Dimensional Modeling: Facts, Dimensions, and Star Schema | Why denormalization is correct in analytics |
| 4.2 | Surrogate Keys vs. Natural Keys | Stability over intuition — why business keys are fragile |
| 4.3 | Slowly Changing Dimensions: When History Matters | SCD Type 1/2/6 — the business question determines the approach |
| 4.4 | Fact Table Patterns: Transactional, Periodic Snapshot, Accumulating Snapshot | Three distinct patterns for three types of measurements |
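The SCD Type 2 mechanic from post 4.3, sketched in plain Python: instead of overwriting an attribute, close the current version row and open a new one, so the dimension is queryable "as of" any date. Column names (`valid_from`, `valid_to`, `is_current`) follow common convention but are illustrative here:

```python
from datetime import date

# SCD Type 2 sketch: a change closes the current row and appends a new version,
# preserving history instead of overwriting it (which would be Type 1).
dim_customer = [
    {"customer_id": 7, "city": "Oslo", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_change(dim, customer_id, new_city, changed_on):
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = changed_on   # close the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": changed_on, "valid_to": None,
                "is_current": True})       # open the new version

apply_change(dim_customer, 7, "Bergen", date(2024, 6, 1))
assert len(dim_customer) == 2  # both versions survive; facts from 2023 still
                               # join to "Oslo", newer facts to "Bergen"
```

Whether that history is worth the extra rows and join complexity is exactly the business question the post says should drive the choice of SCD type.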

Module 5 — Transformation

Building the logic that turns raw data into trusted analytics.

| # | Post | What it covers |
| --- | --- | --- |
| 5.1 | Staging → Intermediate → Marts: The Case for Layered Transforms | One responsibility per layer, clear boundaries, controlled blast radius |
| 5.2 | Incremental Processing and Late-Arriving Data | Lookback windows, deduplication, and when to trigger a full rebuild |
| 5.3 | The Semantic Layer: Define Metrics Once, Use Everywhere | Centralizing metric definitions to eliminate inconsistency at scale |
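The lookback-window-plus-dedup pattern from post 5.2, as a toy sketch. A real dbt incremental model expresses this in SQL; here the names (`loaded_at`, `lookback_days`, the high-water mark) are illustrative assumptions:

```python
from datetime import date, timedelta

def incremental_select(source, high_watermark, lookback_days=3):
    """Reprocess a lookback window behind the high-water mark so late-arriving
    rows are picked up, then deduplicate on the key, last write wins."""
    cutoff = high_watermark - timedelta(days=lookback_days)
    window = [r for r in source if r["loaded_at"] >= cutoff]
    latest = {}
    for r in sorted(window, key=lambda r: r["loaded_at"]):
        latest[r["order_id"]] = r  # newest version of each order wins
    return list(latest.values())

source = [
    {"order_id": 1, "loaded_at": date(2024, 1, 8), "status": "placed"},
    {"order_id": 1, "loaded_at": date(2024, 1, 9), "status": "shipped"},  # late update
    {"order_id": 2, "loaded_at": date(2024, 1, 1), "status": "placed"},   # already processed
]

rows = incremental_select(source, high_watermark=date(2024, 1, 10))
assert rows == [{"order_id": 1, "loaded_at": date(2024, 1, 9), "status": "shipped"}]
```

Data arriving later than the lookback window covers is the case this pattern cannot catch — that is when the post's "full rebuild" trigger applies.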

Module 6 — Quality & Reliability

Trustworthy data doesn't happen by accident.

| # | Post | What it covers |
| --- | --- | --- |
| 6.1 | Data Quality as Code: Tests That Ship With the Pipeline | Schema tests, business logic tests, freshness SLAs — fail fast, fail early |
| 6.2 | Data Contracts: The Agreement Between Producers and Consumers | Making implicit dependencies explicit before they break silently |
| 6.3 | Data Observability: Beyond "Did the Job Succeed?" | Freshness, volume, schema, distribution — and why pipeline success ≠ data correctness |
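The three test styles named in post 6.1 — a schema test, a uniqueness test, and a freshness SLA — written as plain functions. In practice these would be declared as dbt tests; this stand-alone sketch (invented rows and column names) just shows the shape of "tests that ship with the pipeline":

```python
from datetime import datetime, timedelta

def not_null(rows, column):
    """Schema test: return the violating rows (empty list means pass)."""
    return [r for r in rows if r.get(column) is None]

def unique(rows, column):
    """Uniqueness test: return duplicated rows (empty list means pass)."""
    seen, dupes = set(), []
    for r in rows:
        if r[column] in seen:
            dupes.append(r)
        seen.add(r[column])
    return dupes

def fresh_within(rows, column, max_age, now):
    """Freshness SLA: newest row must be no older than max_age."""
    return now - max(r[column] for r in rows) <= max_age

rows = [
    {"id": 1, "email": "a@x.io", "loaded_at": datetime(2024, 1, 10, 12, 0)},
    {"id": 2, "email": None,     "loaded_at": datetime(2024, 1, 10, 13, 0)},
]
assert len(not_null(rows, "email")) == 1   # fail fast: one null email caught
assert unique(rows, "id") == []            # ids are unique
assert fresh_within(rows, "loaded_at", timedelta(hours=2),
                    now=datetime(2024, 1, 10, 14, 0))
```

Returning the violating rows, rather than a bare boolean, is what makes a failed test actionable — the same convention dbt tests use.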

Module 7 — Orchestration

Coordinating pipelines that run reliably, recover gracefully, and scale.

| # | Post | What it covers |
| --- | --- | --- |
| 7.1 | DAGs: The Right Mental Model for Pipeline Dependencies | Directed, acyclic, graph — why each property matters |
| 7.2 | Trigger Patterns: Cron, Sensors, and Event-Driven Pipelines | Schedule-based vs. data-aware triggering — the coupling tradeoff |
| 7.3 | Backfilling and Catchup: Reprocessing Historical Data | Idempotency as a prerequisite, partitioned backfill, blast radius control |
| 7.4 | Pipeline as Code: Configuration Over Imperative DAG Definitions | Separating pipeline behavior from pipeline logic for maintainability |
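The DAG mental model from post 7.1 fits in a few lines with the standard library: declare each task's upstream dependencies, and a topological sort yields a valid execution order (and raises `CycleError` if the graph isn't acyclic — exactly the property an orchestrator enforces). Task names here mirror the layered architecture but are illustrative:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each task mapped to its upstream dependencies (its predecessors).
deps = {
    "staging":      {"ingest"},
    "intermediate": {"staging"},
    "marts":        {"intermediate"},
    "dashboard":    {"marts"},
}

order = list(TopologicalSorter(deps).static_order())
# Every task appears after all of its dependencies:
assert order.index("ingest") < order.index("staging") < order.index("marts")
```

Orchestrators like Dagster build on the same idea, adding the parts a sort alone doesn't give you: parallel execution of independent branches, retries, and partitioned backfills.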

Module 8 — Serving

Delivering data to the consumers who depend on it.

| # | Post | What it covers |
| --- | --- | --- |
| 8.1 | Designing the Serving Layer for Different Consumers | BI vs. ML vs. operational — different requirements, different designs |
| 8.2 | Feature Engineering: Bridging Analytics and Machine Learning | Point-in-time correctness, RFM framework, when you need a feature store |
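Point-in-time correctness from post 8.2, in miniature: when assembling training data, each label may only see feature values known *as of* its own timestamp — joining the latest value leaks the future into the model. The history table and field names below are invented for illustration:

```python
from datetime import date

# Versioned feature values over time for one customer (illustrative).
feature_history = [
    {"customer_id": 7, "as_of": date(2024, 1, 1), "lifetime_orders": 3},
    {"customer_id": 7, "as_of": date(2024, 3, 1), "lifetime_orders": 9},
]

def feature_as_of(history, customer_id, ts):
    """Most recent feature value observed at or before ts — never after."""
    candidates = [r for r in history
                  if r["customer_id"] == customer_id and r["as_of"] <= ts]
    return max(candidates, key=lambda r: r["as_of"]) if candidates else None

# A training label dated Feb 1 must see the January value, not March's —
# a naive "latest value" join would return 9 and leak the future.
assert feature_as_of(feature_history, 7, date(2024, 2, 1))["lifetime_orders"] == 3
```

This as-of join is the core operation a feature store automates; whether you need one is largely a question of how many models and teams share such histories.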

All examples reference the jaffle-shop repository — a complete demo pipeline built with dlt, dbt, DuckDB, and Dagster.