Blog

Backfilling and Catchup: Reprocessing Historical Data

Every data pipeline will eventually need to reprocess historical data. A bug is discovered that corrupted two months of records. A business rule changes retroactively. A new model is added that needs to be populated from the beginning. The pipeline was down for three days and missed scheduled runs.

How you handle these situations — and whether your pipeline was designed to support them safely — determines whether reprocessing is a controlled operation or an emergency.
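A minimal sketch of the controlled version, assuming a pipeline that writes one partition per day: if each run overwrites its partition rather than appending, a backfill is just replaying the affected date range, and replaying it twice is harmless. The names `run_partition` and `backfill` are hypothetical, and the in-memory `store` dict stands in for a real partitioned table.

```python
from datetime import date, timedelta

def run_partition(day, store):
    """Process one day's partition. Overwriting the partition (rather than
    appending to it) makes the run idempotent: reprocessing is safe."""
    store[day.isoformat()] = f"processed:{day.isoformat()}"  # placeholder transform

def backfill(start, end, store):
    """Reprocess every daily partition in [start, end], one run per day."""
    day = start
    while day <= end:
        run_partition(day, store)
        day += timedelta(days=1)

store = {}
backfill(date(2024, 1, 1), date(2024, 1, 3), store)
# Re-running the same range leaves the store in the same state.
backfill(date(2024, 1, 1), date(2024, 1, 3), store)
```

The idempotency is what turns reprocessing into a controlled operation: the backfill command needs no special-case logic, because it is the same code path as a normal run.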

DAGs: The Right Mental Model for Pipeline Dependencies

Modern data pipelines are not sequences of steps — they're dependency graphs. Step C can't run until both A and B have completed. Steps D and E can run in parallel once C finishes. If F fails, G and H should not start. A Directed Acyclic Graph (DAG) is the data structure that represents these relationships precisely.
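The example dependencies above can be sketched with Python's standard-library `graphlib`, which both orders a DAG and exposes which steps are runnable in parallel at each point:

```python
from graphlib import TopologicalSorter

# Each key maps a step to the steps it depends on.
deps = {
    "C": {"A", "B"},   # C can't run until both A and B have completed
    "D": {"C"},        # D and E can run in parallel once C finishes
    "E": {"C"},
}

ts = TopologicalSorter(deps)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # every step whose dependencies are done
    waves.append(ready)
    ts.done(*ready)

# waves is [["A", "B"], ["C"], ["D", "E"]]: each inner list could run in parallel.
```

The "waves" structure is exactly what an orchestrator computes internally: which tasks are unblocked right now, and which must wait.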

Data Contracts: The Agreement Between Producers and Consumers

Most data pipeline failures don't originate in the pipeline itself. They originate upstream: a backend engineer adds a column, renames a field, or changes the semantics of a value — unaware that a downstream data pipeline depends on the exact current schema. The pipeline breaks, or worse, silently produces wrong results.

Data contracts are the formalization of the agreement between the producer of data and the consumer of that data.
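At its simplest, a contract is a machine-checkable description of the fields and types the consumer depends on. A minimal sketch, with a hypothetical `orders` contract expressed as field-to-type expectations:

```python
# Hypothetical contract for an `orders` event: the fields and types
# the downstream pipeline depends on. Producer changes that violate
# it fail loudly instead of silently producing wrong results.
ORDERS_CONTRACT = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}

def violations(record, contract):
    """Return a list of contract violations for one record."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

ok = violations({"order_id": "o1", "amount_cents": 999, "currency": "EUR"}, ORDERS_CONTRACT)
bad = violations({"order_id": "o2", "amount_cents": "999"}, ORDERS_CONTRACT)
```

Real contract tooling adds semantics, nullability, and versioning on top, but the core idea is this check running at the boundary between producer and consumer.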

Data Observability: Beyond "Did the Job Succeed?"

A pipeline job that completes successfully is not the same as data that is correct. Jobs can succeed while loading stale data, computing incorrect aggregations, dropping rows, or passing through schema-shifted values. "The job ran" answers the operational question. It says nothing about whether the data is trustworthy.

Data observability is the practice of continuously understanding the state of the data — not just the state of the pipeline.

Data Quality as Code: Tests That Ship With the Pipeline

Data quality checks are often treated as operational afterthoughts: a dashboard someone checks weekly, an alert someone set up once and forgot about, a manual audit done before a quarterly report. By the time a problem is discovered, bad data has propagated through the entire stack.

The alternative: define quality checks in code, run them as part of every pipeline execution, and fail fast when expectations are violated.
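A minimal sketch of that alternative: each check is a plain function that raises when an expectation is violated, so a failing check fails the pipeline run immediately. The check names and the `rows` shape are illustrative, not any particular framework's API.

```python
def check_not_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    assert not bad, f"{len(bad)} null values in {column}"

def check_unique(rows, column):
    values = [r[column] for r in rows]
    assert len(values) == len(set(values)), f"duplicate values in {column}"

def check_row_count(rows, minimum):
    assert len(rows) >= minimum, f"expected at least {minimum} rows, got {len(rows)}"

# Run as part of every pipeline execution, right after the load step.
rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
check_not_null(rows, "email")
check_unique(rows, "id")
check_row_count(rows, minimum=1)  # any violation raises and stops the run
```

Because the checks live next to the transformation code, they are versioned, reviewed, and deployed with it, not maintained as a separate dashboard someone has to remember to look at.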

Dimensional Modeling: Facts, Dimensions, and Star Schema

Normalized schemas are correct for transactional systems. They eliminate redundancy, enforce referential integrity, and make writes fast. They are also painful for analytics: answering "total revenue by product category for customers who signed up in Q1" requires joining six tables and knowing the exact relationship between each.

Dimensional modeling solves this. It's a schema design approach built for how analysts actually query data, not how applications write it.

ELT vs. ETL: A Paradigm Shift, Not Just a Letter Swap

ETL and ELT differ by more than the order of two letters. They represent fundamentally different assumptions about where computation should happen — and those assumptions have very different consequences for how you build, debug, and evolve data pipelines.

Feature Engineering: Bridging Analytics and Machine Learning

The gap between an analytical mart and a machine learning feature table is larger than it first appears. An analyst's dim_customers table answers "what is this customer's current profile?" A machine learning feature table must answer "what was this customer's profile at the moment of the event we're predicting?" These are fundamentally different questions, and conflating them is one of the most common sources of training data bugs in ML systems.
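The difference between the two questions can be made concrete with a point-in-time lookup. A minimal sketch, assuming a hypothetical subscription-tier history kept as (valid_from, tier) pairs in ascending order:

```python
from datetime import date

# Hypothetical tier history for one customer, one entry per change.
tier_history = {
    "c1": [(date(2024, 1, 1), "free"), (date(2024, 3, 1), "pro")],
}

def tier_as_of(customer_id, as_of):
    """Return the tier in effect at `as_of` — the customer's profile at
    event time, not their current profile."""
    current = None
    for valid_from, tier in tier_history.get(customer_id, []):
        if valid_from <= as_of:
            current = tier
        else:
            break
    return current

tier_as_of("c1", date(2024, 2, 15))  # "free": the upgrade hadn't happened yet
tier_as_of("c1", date(2024, 4, 1))   # "pro"
```

Using the current profile instead of `tier_as_of` at training time leaks future information into the features, which is exactly the class of training data bug the paragraph describes.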

Incremental Processing and Late-Arriving Data

Processing only new data seems like an obvious optimization: why rebuild everything when you only need to process what changed? The reality is more complex. Incremental processing introduces correctness challenges — particularly around late-arriving data — that don't exist with full rebuild approaches.
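One common mitigation for late-arriving data is a lookback window: each incremental run re-reads a small range behind the watermark, so records whose event time falls before the watermark but which arrived after it are still picked up. A minimal sketch (the three-hour lookback is an assumed tuning choice, not a recommendation):

```python
from datetime import datetime, timedelta

def incremental_window(last_watermark, now, lookback=timedelta(hours=3)):
    """Select [last_watermark - lookback, now): re-reading a window behind
    the watermark catches records that arrived late."""
    return (last_watermark - lookback, now)

def select_rows(rows, window):
    lo, hi = window
    return [r for r in rows if lo <= r["event_time"] < hi]

rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 10)},  # already processed
    {"id": 2, "event_time": datetime(2024, 1, 1, 13)},  # late: arrived after the
]                                                       # watermark had passed 13:00
watermark = datetime(2024, 1, 1, 14)
window = incremental_window(watermark, now=datetime(2024, 1, 1, 16))
picked = select_rows(rows, window)  # includes id 2 despite event_time < watermark
```

The tradeoff is explicit: a wider lookback tolerates later arrivals but reprocesses more data on every run, which is why the downstream writes must be idempotent.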

Layered Pipeline Architecture: Why Not One Script?

The most natural instinct when building a data pipeline is to write one script that does everything: connects to the source, applies transformations, and writes results to the destination. It works on day one. By month six, it's the most feared file in the codebase.

Partitioning and Clustering: Designing for Query Patterns

Partitioning is often described as "splitting a table into smaller pieces." That description is accurate but misses the point. The purpose of partitioning is not to create smaller files — it's to allow the query engine to skip large portions of data based on filter conditions. Designed correctly, partitioning eliminates reads. Designed incorrectly, it creates new problems.
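The skipping behavior can be sketched in a few lines: when a table is laid out as one partition per event date, a date filter prunes entire partitions before any rows are read. The dict-of-lists layout is a stand-in for real partitioned storage:

```python
# Hypothetical table stored as one partition per event date.
partitions = {
    "2024-01-01": [{"amount": 10}, {"amount": 20}],
    "2024-01-02": [{"amount": 5}],
    "2024-01-03": [{"amount": 7}],
}

def scan(partitions, date_filter):
    """Read only the partitions matching the filter. Pruned partitions
    are never opened at all — the read simply doesn't happen."""
    partitions_read = 0
    rows = []
    for key, part in partitions.items():
        if not date_filter(key):
            continue  # pruned: no I/O for this partition
        partitions_read += 1
        rows.extend(part)
    return rows, partitions_read

rows, partitions_read = scan(partitions, lambda d: d >= "2024-01-02")
# partitions_read is 2: the 2024-01-01 partition was skipped entirely.
```

This only works when queries actually filter on the partition key; partitioning on a column nobody filters by produces many small files and no pruning, which is the "creates new problems" case.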

Pipeline as Code: Configuration Over Imperative DAG Definitions

The second pipeline you build will look a lot like the first. The third even more so. Without a principled approach to pipeline definition, you end up with duplicated code, inconsistent patterns, and orchestration logic that's harder to change than the transformation logic it was meant to manage.

Pipeline-as-code is the practice of defining pipeline behavior in code rather than through UI configuration — and configuration-driven orchestration takes this further by separating pipeline behavior from pipeline logic.

Schema Evolution: When Your Source Changes Without Warning

Source systems change. A backend engineer adds a column. A product team renames a field. A third-party API deprecates an attribute and replaces it with two new ones. Your pipeline was working yesterday and now it isn't — or worse, it's still running but silently producing wrong results.

Schema evolution is the most underestimated operational challenge in data engineering.

The Semantic Layer: Define Metrics Once, Use Everywhere

Ask three analysts at the same company what "monthly active users" means and you may get three different answers. One filters to users who logged in. One filters to users who performed any action. One includes trial accounts, another excludes them. All three produce different numbers from the same underlying data.

This is the metrics consistency problem, and it gets worse as organizations scale.

Designing the Serving Layer for Different Consumers

A fact table that perfectly serves an executive dashboard is often a poor fit for a machine learning model. A feature table optimized for model training is unnecessarily complex for a business analyst. The serving layer is not one thing — it's a set of tables designed for specific consumers, each with different requirements.

Slowly Changing Dimensions: When History Matters More Than Current State

A dimension describes an entity — a customer, a product, a salesperson. Dimensions change over time: customers move to different cities, change their subscription tier, or update their email. Products get recategorized. Salespeople change regions.

The question every data model must answer: when a dimension changes, what do you do with history?
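One standard answer is SCD Type 2: never update a dimension row in place; instead, close the current version and append a new one, so every historical state survives with a validity range. A minimal sketch with hypothetical `valid_from`/`valid_to` columns (`valid_to is None` marks the current version):

```python
from datetime import date

def scd2_update(history, key, new_attrs, effective):
    """Apply an SCD Type 2 change: close the open row for `key` and
    append a new version effective from `effective`."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return history  # nothing changed; keep the open row
            row["valid_to"] = effective  # close out the old version
    history.append({"key": key, **new_attrs,
                    "valid_from": effective, "valid_to": None})
    return history

dim = [{"key": "c1", "city": "Berlin",
        "valid_from": date(2023, 1, 1), "valid_to": None}]
scd2_update(dim, "c1", {"city": "Munich"}, effective=date(2024, 6, 1))
# dim now holds two rows: the closed Berlin version and the open Munich one.
```

A fact recorded in March 2024 then joins to the Berlin row, not the Munich one, which is precisely what "history matters more than current state" means in practice.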

Staging → Intermediate → Marts: The Case for Layered Transforms

Every transformation layer should have one clear responsibility. When you mix cleaning, joining, and business logic in the same model, you get a model that's hard to test, hard to debug, and hard to change. Layered transformation architecture solves this by giving each responsibility its own layer with clear rules.

Storage Formats: Why Parquet Is Not Just "Smaller CSV"

When people first encounter Parquet, the framing is often "it's like CSV but compressed." This undersells both what Parquet does and why the format matters for query performance. The difference between row-oriented and column-oriented storage is not about compression — it's about fundamentally different access patterns.
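The access-pattern difference can be shown without any file format at all, just the same records stored both ways:

```python
# The same three records, stored row-wise and column-wise.
row_store = [
    {"id": 1, "amount": 10, "country": "DE"},
    {"id": 2, "amount": 20, "country": "FR"},
    {"id": 3, "amount": 30, "country": "DE"},
]
col_store = {
    "id": [1, 2, 3],
    "amount": [10, 20, 30],
    "country": ["DE", "FR", "DE"],
}

# To compute sum(amount), the row store must touch every field of every
# record; the column store reads exactly one contiguous column and can
# ignore `id` and `country` entirely.
total_rows = sum(r["amount"] for r in row_store)
total_cols = sum(col_store["amount"])
```

Columnar formats like Parquet build on this layout: an analytical query touching 3 of 200 columns reads roughly 3/200 of the data, and the homogeneity of each column is also why they compress so well — compression is a consequence of the layout, not the point of it.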

Trigger Patterns: Cron, Sensors, and Event-Driven Pipelines

Every pipeline step needs to know when to run. The simplest answer — "run on a schedule" — works well in isolation but breaks down when pipelines have dependencies on upstream data that doesn't arrive on a perfectly predictable cadence.

There are three fundamental trigger patterns, each with different tradeoffs in simplicity, correctness, and coupling.

Data Warehouse vs. Data Lake vs. Lakehouse: A Tradeoff, Not a War

Every few years, the data industry declares that one storage paradigm has won and the others are dead. Warehouses were obsolete when lakes arrived. Lakes failed when lakehouses emerged. None of these declarations aged well, because they missed the point: each architecture makes different tradeoffs, and the right choice depends on your constraints.