Blog

Backfilling and Catchup: Reprocessing Historical Data

Every data pipeline will eventually need to reprocess historical data. A bug is discovered that corrupted two months of records. A business rule changes retroactively. A new model is added that needs to be populated from the beginning. The pipeline was down for three days and missed scheduled runs.

How you handle these situations — and whether your pipeline was designed to support them safely — determines whether reprocessing is a controlled operation or an emergency.
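A minimal sketch of the controlled version, assuming a pipeline that writes one partition per day: if each run overwrites its partition rather than appending, a backfill is just replaying the affected date range, and replaying it twice is harmless. The names `run_partition` and `backfill` are hypothetical, and the in-memory `store` dict stands in for a real partitioned table.

```python
from datetime import date, timedelta

def run_partition(day, store):
    """Process one day's partition. Overwriting the partition (rather than
    appending to it) makes the run idempotent: reprocessing is safe."""
    store[day.isoformat()] = f"processed:{day.isoformat()}"  # placeholder transform

def backfill(start, end, store):
    """Reprocess every daily partition in [start, end], one run per day."""
    day = start
    while day <= end:
        run_partition(day, store)
        day += timedelta(days=1)

store = {}
backfill(date(2024, 1, 1), date(2024, 1, 3), store)
# Re-running the same range leaves the store in the same state.
backfill(date(2024, 1, 1), date(2024, 1, 3), store)
```

The idempotency is what turns reprocessing into a controlled operation: the backfill command needs no special-case logic, because it is the same code path as a normal run.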

DAGs: The Right Mental Model for Pipeline Dependencies

Modern data pipelines are not sequences of steps — they're dependency graphs. Step C can't run until both A and B have completed. Steps D and E can run in parallel once C finishes. If F fails, G and H should not start. A Directed Acyclic Graph (DAG) is the data structure that represents these relationships precisely.
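The example dependencies above can be sketched with Python's standard-library `graphlib`, which both orders a DAG and exposes which steps are runnable in parallel at each point:

```python
from graphlib import TopologicalSorter

# Each key maps a step to the steps it depends on.
deps = {
    "C": {"A", "B"},   # C can't run until both A and B have completed
    "D": {"C"},        # D and E can run in parallel once C finishes
    "E": {"C"},
}

ts = TopologicalSorter(deps)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # every step whose dependencies are done
    waves.append(ready)
    ts.done(*ready)

# waves is [["A", "B"], ["C"], ["D", "E"]]: each inner list could run in parallel.
```

The "waves" structure is exactly what an orchestrator computes internally: which tasks are unblocked right now, and which must wait.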

Data Contracts: The Agreement Between Producers and Consumers

Most data pipeline failures don't originate in the pipeline itself. They originate upstream: a backend engineer adds a column, renames a field, or changes the semantics of a value — unaware that a downstream data pipeline depends on the exact current schema. The pipeline breaks, or worse, silently produces wrong results.

Data contracts are the formalization of the agreement between the producer of data and the consumer of that data.
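At its simplest, a contract is a machine-checkable description of the fields and types the consumer depends on. A minimal sketch, with a hypothetical `orders` contract expressed as field-to-type expectations:

```python
# Hypothetical contract for an `orders` event: the fields and types
# the downstream pipeline depends on. Producer changes that violate
# it fail loudly instead of silently producing wrong results.
ORDERS_CONTRACT = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}

def violations(record, contract):
    """Return a list of contract violations for one record."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

ok = violations({"order_id": "o1", "amount_cents": 999, "currency": "EUR"}, ORDERS_CONTRACT)
bad = violations({"order_id": "o2", "amount_cents": "999"}, ORDERS_CONTRACT)
```

Real contract tooling adds semantics, nullability, and versioning on top, but the core idea is this check running at the boundary between producer and consumer.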

Data Observability: Beyond "Did the Job Succeed?"

A pipeline job that completes successfully is not the same as data that is correct. Jobs can succeed while loading stale data, computing incorrect aggregations, dropping rows, or passing through schema-shifted values. "The job ran" answers the operational question. It says nothing about whether the data is trustworthy.

Data observability is the practice of continuously understanding the state of the data — not just the state of the pipeline.

Data Quality as Code: Tests That Ship With the Pipeline

Data quality checks are often treated as operational afterthoughts: a dashboard someone checks weekly, an alert someone set up once and forgot about, a manual audit done before a quarterly report. By the time a problem is discovered, bad data has propagated through the entire stack.

The alternative: define quality checks in code, run them as part of every pipeline execution, and fail fast when expectations are violated.
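A minimal sketch of that alternative: each check is a plain function that raises when an expectation is violated, so a failing check fails the pipeline run immediately. The check names and the `rows` shape are illustrative, not any particular framework's API.

```python
def check_not_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    assert not bad, f"{len(bad)} null values in {column}"

def check_unique(rows, column):
    values = [r[column] for r in rows]
    assert len(values) == len(set(values)), f"duplicate values in {column}"

def check_row_count(rows, minimum):
    assert len(rows) >= minimum, f"expected at least {minimum} rows, got {len(rows)}"

# Run as part of every pipeline execution, right after the load step.
rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
check_not_null(rows, "email")
check_unique(rows, "id")
check_row_count(rows, minimum=1)  # any violation raises and stops the run
```

Because the checks live next to the transformation code, they are versioned, reviewed, and deployed with it, not maintained as a separate dashboard someone has to remember to look at.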

Dimensional Modeling: Facts, Dimensions, and Star Schema

Normalized schemas are correct for transactional systems. They eliminate redundancy, enforce referential integrity, and make writes fast. They are also painful for analytics: answering "total revenue by product category for customers who signed up in Q1" requires joining six tables and knowing the exact relationship between each.

Dimensional modeling solves this. It's a schema design approach built for how analysts actually query data, not how applications write it.

ELT vs. ETL: A Paradigm Shift, Not Just a Letter Swap

ETL and ELT differ by more than the order of two letters. They represent fundamentally different assumptions about where computation should happen — and those assumptions have very different consequences for how you build, debug, and evolve data pipelines.

Feature Engineering: Bridging Analytics and Machine Learning

The gap between an analytical mart and a machine learning feature table is larger than it first appears. An analyst's dim_customers table answers "what is this customer's current profile?" A machine learning feature table must answer "what was this customer's profile at the moment of the event we're predicting?" These are fundamentally different questions, and conflating them is one of the most common sources of training data bugs in ML systems.
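The difference between the two questions can be made concrete with a point-in-time lookup. A minimal sketch, assuming a hypothetical subscription-tier history kept as (valid_from, tier) pairs in ascending order:

```python
from datetime import date

# Hypothetical tier history for one customer, one entry per change.
tier_history = {
    "c1": [(date(2024, 1, 1), "free"), (date(2024, 3, 1), "pro")],
}

def tier_as_of(customer_id, as_of):
    """Return the tier in effect at `as_of` — the customer's profile at
    event time, not their current profile."""
    current = None
    for valid_from, tier in tier_history.get(customer_id, []):
        if valid_from <= as_of:
            current = tier
        else:
            break
    return current

tier_as_of("c1", date(2024, 2, 15))  # "free": the upgrade hadn't happened yet
tier_as_of("c1", date(2024, 4, 1))   # "pro"
```

Using the current profile instead of `tier_as_of` at training time leaks future information into the features, which is exactly the class of training data bug the paragraph describes.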

Incremental Processing and Late-Arriving Data

Processing only new data seems like an obvious optimization: why rebuild everything when you only need to process what changed? The reality is more complex. Incremental processing introduces correctness challenges — particularly around late-arriving data — that don't exist with full rebuild approaches.
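One common mitigation for late-arriving data is a lookback window: each incremental run re-reads a small range behind the watermark, so records whose event time falls before the watermark but which arrived after it are still picked up. A minimal sketch (the three-hour lookback is an assumed tuning choice, not a recommendation):

```python
from datetime import datetime, timedelta

def incremental_window(last_watermark, now, lookback=timedelta(hours=3)):
    """Select [last_watermark - lookback, now): re-reading a window behind
    the watermark catches records that arrived late."""
    return (last_watermark - lookback, now)

def select_rows(rows, window):
    lo, hi = window
    return [r for r in rows if lo <= r["event_time"] < hi]

rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 10)},  # already processed
    {"id": 2, "event_time": datetime(2024, 1, 1, 13)},  # late: arrived after the
]                                                       # watermark had passed 13:00
watermark = datetime(2024, 1, 1, 14)
window = incremental_window(watermark, now=datetime(2024, 1, 1, 16))
picked = select_rows(rows, window)  # includes id 2 despite event_time < watermark
```

The tradeoff is explicit: a wider lookback tolerates later arrivals but reprocesses more data on every run, which is why the downstream writes must be idempotent.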

Layered Pipeline Architecture: Why Not One Script?

The most natural instinct when building a data pipeline is to write one script that does everything: connects to the source, applies transformations, and writes results to the destination. It works on day one. By month six, it's the most feared file in the codebase.

Partitioning and Clustering: Designing for Query Patterns

Partitioning is often described as "splitting a table into smaller pieces." That description is accurate but misses the point. The purpose of partitioning is not to create smaller files — it's to allow the query engine to skip large portions of data based on filter conditions. Designed correctly, partitioning eliminates reads. Designed incorrectly, it creates new problems.
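The skipping behavior can be sketched in a few lines: when a table is laid out as one partition per event date, a date filter prunes entire partitions before any rows are read. The dict-of-lists layout is a stand-in for real partitioned storage:

```python
# Hypothetical table stored as one partition per event date.
partitions = {
    "2024-01-01": [{"amount": 10}, {"amount": 20}],
    "2024-01-02": [{"amount": 5}],
    "2024-01-03": [{"amount": 7}],
}

def scan(partitions, date_filter):
    """Read only the partitions matching the filter. Pruned partitions
    are never opened at all — the read simply doesn't happen."""
    partitions_read = 0
    rows = []
    for key, part in partitions.items():
        if not date_filter(key):
            continue  # pruned: no I/O for this partition
        partitions_read += 1
        rows.extend(part)
    return rows, partitions_read

rows, partitions_read = scan(partitions, lambda d: d >= "2024-01-02")
# partitions_read is 2: the 2024-01-01 partition was skipped entirely.
```

This only works when queries actually filter on the partition key; partitioning on a column nobody filters by produces many small files and no pruning, which is the "creates new problems" case.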

Pipeline as Code: Configuration Over Imperative DAG Definitions

The second pipeline you build will look a lot like the first. The third even more so. Without a principled approach to pipeline definition, you end up with duplicated code, inconsistent patterns, and orchestration logic that's harder to change than the transformation logic it was meant to manage.

Pipeline-as-code is the practice of defining pipeline behavior in code rather than through UI configuration — and configuration-driven orchestration takes this further by separating pipeline behavior from pipeline logic.

Schema Evolution: When Your Source Changes Without Warning

Source systems change. A backend engineer adds a column. A product team renames a field. A third-party API deprecates an attribute and replaces it with two new ones. Your pipeline was working yesterday and now it isn't — or worse, it's still running but silently producing wrong results.

Schema evolution is the most underestimated operational challenge in data engineering.

The Semantic Layer: Define Metrics Once, Use Everywhere

Ask three analysts at the same company what "monthly active users" means and you may get three different answers. One filters to users who logged in. One filters to users who performed any action. One includes trial accounts, another excludes them. All three produce different numbers from the same underlying data.

This is the metrics consistency problem, and it gets worse as organizations scale.

Designing the Serving Layer for Different Consumers

A fact table that perfectly serves an executive dashboard is often a poor fit for a machine learning model. A feature table optimized for model training is unnecessarily complex for a business analyst. The serving layer is not one thing — it's a set of tables designed for specific consumers, each with different requirements.

Slowly Changing Dimensions: When History Matters More Than Current State

A dimension describes an entity — a customer, a product, a salesperson. Dimensions change over time: customers move to different cities, change their subscription tier, or update their email. Products get recategorized. Salespeople change regions.

The question every data model must answer: when a dimension changes, what do you do with history?
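One standard answer is SCD Type 2: never update a dimension row in place; instead, close the current version and append a new one, so every historical state survives with a validity range. A minimal sketch with hypothetical `valid_from`/`valid_to` columns (`valid_to is None` marks the current version):

```python
from datetime import date

def scd2_update(history, key, new_attrs, effective):
    """Apply an SCD Type 2 change: close the open row for `key` and
    append a new version effective from `effective`."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return history  # nothing changed; keep the open row
            row["valid_to"] = effective  # close out the old version
    history.append({"key": key, **new_attrs,
                    "valid_from": effective, "valid_to": None})
    return history

dim = [{"key": "c1", "city": "Berlin",
        "valid_from": date(2023, 1, 1), "valid_to": None}]
scd2_update(dim, "c1", {"city": "Munich"}, effective=date(2024, 6, 1))
# dim now holds two rows: the closed Berlin version and the open Munich one.
```

A fact recorded in March 2024 then joins to the Berlin row, not the Munich one, which is precisely what "history matters more than current state" means in practice.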

Staging → Intermediate → Marts: The Case for Layered Transforms

Every transformation layer should have one clear responsibility. When you mix cleaning, joining, and business logic in the same model, you get a model that's hard to test, hard to debug, and hard to change. Layered transformation architecture solves this by giving each responsibility its own layer with clear rules.

Storage Formats: Why Parquet Is Not Just "Smaller CSV"

When people first encounter Parquet, the framing is often "it's like CSV but compressed." This undersells both what Parquet does and why the format matters for query performance. The difference between row-oriented and column-oriented storage is not about compression — it's about fundamentally different access patterns.
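The access-pattern difference can be shown without any file format at all, just the same records stored both ways:

```python
# The same three records, stored row-wise and column-wise.
row_store = [
    {"id": 1, "amount": 10, "country": "DE"},
    {"id": 2, "amount": 20, "country": "FR"},
    {"id": 3, "amount": 30, "country": "DE"},
]
col_store = {
    "id": [1, 2, 3],
    "amount": [10, 20, 30],
    "country": ["DE", "FR", "DE"],
}

# To compute sum(amount), the row store must touch every field of every
# record; the column store reads exactly one contiguous column and can
# ignore `id` and `country` entirely.
total_rows = sum(r["amount"] for r in row_store)
total_cols = sum(col_store["amount"])
```

Columnar formats like Parquet build on this layout: an analytical query touching 3 of 200 columns reads roughly 3/200 of the data, and the homogeneity of each column is also why they compress so well — compression is a consequence of the layout, not the point of it.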

Trigger Patterns: Cron, Sensors, and Event-Driven Pipelines

Every pipeline step needs to know when to run. The simplest answer — "run on a schedule" — works well in isolation but breaks down when pipelines have dependencies on upstream data that doesn't arrive on a perfectly predictable cadence.

There are three fundamental trigger patterns, each with different tradeoffs in simplicity, correctness, and coupling.

Data Warehouse vs. Data Lake vs. Lakehouse: A Tradeoff, Not a War

Every few years, the data industry declares that one storage paradigm has won and the others are dead. Warehouses were obsolete when lakes arrived. Lakes failed when lakehouses emerged. None of these declarations aged well, because they missed the point: each architecture makes different tradeoffs, and the right choice depends on your constraints.