How do I know if our AI problem is actually a data problem?

Run the reproducibility test: can you rebuild yesterday's retrieval index — or last month's training set — from scratch and get the same result? If not, every quality problem is unattributable: you can't tell whether the model, the prompt, or the data changed. Other tells: eval scores that move when nobody changed the model, retrieval results engineers spot-check by hand before trusting, and pipeline fixes that consume the roadmap.

Why does RAG quality degrade over time?

Usually the corpus, not the model. Documents go stale, ingestion silently drops or mangles sources, chunking changes shift what's retrievable, and nobody owns freshness — so retrieval quietly serves older, thinner context while everyone tunes prompts. Without lineage from an answer back to the exact source snapshot it used, degradation is invisible until users report it.

Do we need a data platform before shipping an LLM feature?

You need a platform proportional to the feature. A model answering from a static corpus needs versioned snapshots and a rebuildable index — days of work, not a replatform. An agent acting on live business data needs lineage, freshness guarantees, and governance, because every action it takes inherits the quality of the data underneath it. The mistake is shipping the second kind of feature on the first kind of platform.

What does 'data lineage' mean for an AI system, concretely?

That for any model output you can answer: which source documents or records fed it, which pipeline version transformed them, and when. Concretely — versioned ingestion, deterministic transforms, and an index or feature store you can rebuild to a point in time. It's what turns 'the model got worse' from an argument into a diff, and it's what a regulated-industry review will ask you to show.

← Notes

Field note

Your data platform is the real bottleneck on your AI roadmap

AI initiatives stall on data long before they stall on models: brittle pipelines, no lineage, retrieval nobody trusts. Here's how to tell — and the test that settles it.

Last updated July 2, 2026

When an AI initiative stalls, the model gets the blame and the data gets the roadmap. The usual pattern: a promising feature ships, quality drifts, the team tunes prompts and swaps models — and the actual cause is underneath all of it: brittle ingestion, no lineage, no reproducibility, and a retrieval layer nobody fully trusts. Model improvements are gated on plumbing. This piece is the diagnosis: how to tell when your bottleneck is the data platform, not the model — and the one test that settles it.

Why data problems wear a model costume

A model failure and a data failure look identical at the surface: a wrong answer. But they respond to completely different fixes, and teams almost always reach for the model fix first — it’s visible, it’s interesting, and prompt changes are cheap to try. The data fix is invisible and unglamorous, so it loses the prioritization fight until it’s the only thing left.

The result is a familiar loop: quality dips → prompt tuning → marginal recovery → quality dips again. If you’ve been around that loop more than twice, the model was never your problem. Something under it is moving.

The four failure modes

Symptom you see	What’s actually broken
Eval scores move when nobody touched the model	No reproducibility — the corpus or index shifted under the harness
RAG answers get thinner and staler over time	No ingestion ownership — sources silently dropped, corpus aging
”Why did it say that?” has no answer	No lineage — you can’t trace an output to the data that fed it
Every AI feature starts with two weeks of pipeline work	Brittle, feature-specific plumbing instead of a platform

1. Brittle ingestion

Each AI feature gets its own scraper, parser, and loader, built to demo speed. Sources change format and the pipeline half-fails: no alert, no gap report — just a quietly thinner corpus. The feature keeps answering, from less.

2. No reproducibility

Ask the team to rebuild yesterday’s retrieval index from scratch. If the honest answer is “we can’t” or “it would come out different,” then your eval harness is scoring a moving target. A regression could be the model, the prompt, the chunking, or the corpus — and you have no way to isolate which. Reproducibility is what makes every other quality signal mean something.

3. No lineage

When an answer is wrong — or when a regulator, auditor, or security review asks — you need to trace it: which documents fed this output, transformed by which pipeline version, ingested when. Without lineage, “the model got worse” is an argument. With it, it’s a diff. In regulated environments this stops being an engineering preference and becomes an audit requirement.

4. Retrieval nobody trusts

The tell is behavioral: engineers spot-check retrieval results by hand before believing them, and product quietly narrows the feature’s scope to the slices of the corpus that work. A retrieval layer that’s trusted in one corner and worked-around everywhere else isn’t a platform — it’s a liability with an API.

The test that settles it

One question separates a data platform from a pile of pipelines:

Can you rebuild yesterday’s index — or last quarter’s training set — from source, and get the same result?

Yes means you have reproducibility, which means lineage mostly exists and quality problems are attributable. No means every downstream investment — evals, observability, model upgrades — is standing on sand. This is the data-layer counterpart of The Production-Readiness Bar: the bar scores the system around your model; this test scores the ground it stands on.

What good looks like

Not a two-year replatform. Proportional to the feature, a sound data foundation is four properties:

Owned ingestion — pipelines with contracts, alerts on partial failure, and a named owner, so the corpus can’t thin out silently.
Deterministic transforms — versioned, so the same inputs produce the same index, every time.
Point-in-time rebuilds — snapshots of corpus and index, so evals score against a pinned world and incidents can be replayed.
Lineage on demand — output → sources → pipeline version, answerable in minutes, because someone will eventually ask.

A static-corpus assistant needs a lightweight version of this — days of work. An agent acting on live business data needs the full set, because every action it takes inherits the quality of the data underneath it. The expensive mistake is shipping the second kind of feature on the first kind of platform.

If your roadmap keeps stalling on plumbing, that’s the data-platform architecture engagement: ingestion, transformation, lineage, and retrieval designed for reproducibility and governance — so data stops being the reason every AI initiative runs late.

Questions this raises

Straight answers.

How do I know if our AI problem is actually a data problem?: Run the reproducibility test: can you rebuild yesterday's retrieval index — or last month's training set — from scratch and get the same result? If not, every quality problem is unattributable: you can't tell whether the model, the prompt, or the data changed. Other tells: eval scores that move when nobody changed the model, retrieval results engineers spot-check by hand before trusting, and pipeline fixes that consume the roadmap.
Why does RAG quality degrade over time?: Usually the corpus, not the model. Documents go stale, ingestion silently drops or mangles sources, chunking changes shift what's retrievable, and nobody owns freshness — so retrieval quietly serves older, thinner context while everyone tunes prompts. Without lineage from an answer back to the exact source snapshot it used, degradation is invisible until users report it.
Do we need a data platform before shipping an LLM feature?: You need a platform proportional to the feature. A model answering from a static corpus needs versioned snapshots and a rebuildable index — days of work, not a replatform. An agent acting on live business data needs lineage, freshness guarantees, and governance, because every action it takes inherits the quality of the data underneath it. The mistake is shipping the second kind of feature on the first kind of platform.
What does 'data lineage' mean for an AI system, concretely?: That for any model output you can answer: which source documents or records fed it, which pipeline version transformed them, and when. Concretely — versioned ingestion, deterministic transforms, and an index or feature store you can rebuild to a point in time. It's what turns 'the model got worse' from an argument into a diff, and it's what a regulated-industry review will ask you to show.

Data Platform Architecture

This is the work, not just the writeup.

If this is your situation, the data platform architecture is where it gets fixed — by the person who wrote this.

Request an audit