Field note

The AI production-readiness bar: the 6 things that block your POC

Your AI prototype works and won't ship. It's almost never the model. Here's the 6-part production-readiness bar most teams fail — and how to score yourself.

Last updated June 29, 2026

A production-ready AI system has six things a prototype almost never does: a reproducible deployment path, an evaluation harness, observability, guardrails, a cost ceiling, and a named owner. The model working is the start of the job, not the end of it. When a promising AI prototype stalls before production, it is almost never the model that’s blocking it — it’s one or more of these six.

This is the bar we assess against in a production-readiness audit. Here it is in the open, so you can score yourself before talking to anyone.

Why your prototype works and your production system doesn’t

A prototype is built to answer one question: is this possible? So it skips everything that doesn’t serve the demo — and skipping that is exactly what made it fast. Production answers a different set of questions: is it safe, can we see what it did, can we afford it, and who owns it when it breaks at 2am? None of those were in scope for the notebook. That’s not a failure of the team; it’s the nature of the two artifacts. The gap between them is the work the demo deferred, and it has six named parts.

The six-part bar

Score each from 0 (absent) to 5 (production-grade). Anything you can’t honestly score above a 3 is a blocker, not a polish item.

#	The bar	The question it answers	Common fail signal
1	Deployment	Can we ship a change reproducibly and roll it back?	Click-ops, long-lived keys, “only Sam can deploy it”
2	Evaluation	Will we catch a regression before users do?	No evals; quality judged by vibes and demos
3	Observability	Can we see what the model actually did?	No traces, no logs of prompts/outputs, silent failures
4	Guardrails	Are inputs and outputs controlled?	No input/output policy, no human-in-the-loop where it matters
5	Cost	Is spend bounded and attributable?	A surprise bill; no cost-per-feature, no ceiling that pages
6	Ownership	Who runs this, and what wakes them up?	“The AI team,” i.e. no one specifically, no on-call

1. Deployment: can you ship and roll back, reproducibly?

If shipping a change means one person running commands by hand, you don’t have a deployment path — you have a person. Production needs a pipeline that is reproducible, gated, and rollback-safe, with no click-ops and no long-lived credentials. Until then, every release is a held breath.
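A minimal sketch of what "gated and rollback-safe" can mean in code, assuming a hypothetical in-memory release history. The names `Release`, `ReleaseHistory`, `model_id`, and `prompt_sha` are illustrative, not part of any real deployment tool; an actual pipeline would live in your CI/CD system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Release:
    """An immutable, fully pinned description of what is deployed."""
    version: str
    model_id: str      # hypothetical pinned model identifier
    prompt_sha: str    # hypothetical hash of the prompt templates shipped with it


@dataclass
class ReleaseHistory:
    """Keeps every promoted release so rollback is a lookup, not a reconstruction."""
    releases: List[Release] = field(default_factory=list)

    @property
    def current(self) -> Optional[Release]:
        return self.releases[-1] if self.releases else None

    def deploy(self, candidate: Release, gate: Callable[[Release], bool]) -> bool:
        """Promote the candidate only if the gate (evals, checks) passes."""
        if not gate(candidate):
            return False
        self.releases.append(candidate)
        return True

    def rollback(self) -> Optional[Release]:
        """Drop the current release and return to the previous one."""
        if len(self.releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self.releases.pop()
        return self.current
```

The point of the sketch is that a release is a pinned record anyone can promote or revert, not a sequence of commands one person remembers.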

2. Evaluation: will you catch a regression before users do?

Evals are the unit tests of an LLM feature. Without a harness — a representative dataset, scored against a rubric, run on every change — quality is judged by whoever last clicked through a demo. A prompt tweak that quietly degrades 8% of responses ships invisibly. No evals, no production.
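A harness of that shape can be sketched as follows. This is an illustration under stated assumptions: the rubric here is a toy (required terms present in the answer), `generate` stands in for whatever calls your model, and the `max_drop` threshold of 0.02 is an arbitrary example value, not a recommendation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    required_terms: List[str]   # toy rubric: terms the answer must mention


def score_case(output: str, case: EvalCase) -> float:
    """Fraction of required terms present in the output (0.0 to 1.0)."""
    if not case.required_terms:
        return 1.0
    text = output.lower()
    hits = sum(1 for term in case.required_terms if term.lower() in text)
    return hits / len(case.required_terms)


def run_evals(generate: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Mean rubric score across the whole dataset."""
    if not cases:
        raise ValueError("eval dataset is empty")
    return sum(score_case(generate(c.prompt), c) for c in cases) / len(cases)


def passes_gate(candidate: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Block the change if quality falls more than max_drop below the baseline."""
    return candidate >= baseline - max_drop
```

Run on every change, a gate like this turns a quiet 8-point drop into a failed check instead of a shipped regression.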

3. Observability: can you see what the model actually did?

When an agent touches a real system, “it usually works” is not an operational posture. You need traces of what was retrieved, what was prompted, what came back, and what action was taken — so that when something goes wrong you can answer what happened without re-running it and hoping. Silent failure is the default state of an unobserved LLM system.
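As a rough sketch of the trace described above, assuming a hypothetical `TraceRecord` shape and a plain JSON-lines log rather than any particular tracing product:

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TraceRecord:
    """One model interaction: what was retrieved, prompted, returned, and done."""
    feature: str
    retrieved: List[str]
    prompt: str
    output: str
    action: Optional[str] = None   # side effect taken on a real system, if any
    error: Optional[str] = None    # recorded here instead of failing silently
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def emit(record: TraceRecord) -> str:
    """Serialize the trace as one JSON line; a real system would ship it to a log sink."""
    line = json.dumps(asdict(record), sort_keys=True)
    print(line)
    return line
```

With a record like this per call, "what happened?" becomes a query over logs rather than a re-run.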

4. Guardrails: are inputs and outputs actually controlled?

Guardrails are the input and output controls that keep a model’s behavior inside a policy: validation, filtering, policy-as-code, and human-in-the-loop on the actions that carry real consequences. In regulated settings this is frequently the thing that fails security review and kills the launch — not because the model is wrong, but because nothing constrains it when it is.
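A minimal policy-as-code sketch of those three controls. Everything specific here is an assumption for illustration: the 4000-character limit, the SSN-shaped output pattern, and the action names in `NEEDS_APPROVAL` are placeholders for whatever your own policy requires.

```python
import re
from dataclasses import dataclass
from typing import Optional

MAX_INPUT_CHARS = 4000                                  # assumed input limit
BLOCKED_OUTPUT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # toy rule: SSN-shaped strings
NEEDS_APPROVAL = {"issue_refund", "delete_record"}      # hypothetical consequential actions


@dataclass
class Decision:
    allowed: bool
    reason: str


def check_input(text: str) -> Decision:
    """Validate input before it reaches the model."""
    if not text.strip():
        return Decision(False, "empty input")
    if len(text) > MAX_INPUT_CHARS:
        return Decision(False, "input too long")
    return Decision(True, "ok")


def check_output(text: str) -> Decision:
    """Filter model output before it reaches a user or a downstream system."""
    if BLOCKED_OUTPUT.search(text):
        return Decision(False, "output matches blocked pattern")
    return Decision(True, "ok")


def check_action(action: str, approved_by: Optional[str] = None) -> Decision:
    """Require a named human approver for actions that carry real consequences."""
    if action in NEEDS_APPROVAL and not approved_by:
        return Decision(False, "human approval required")
    return Decision(True, "ok")
```

The checks are deliberately boring. What a security review looks for is that they exist, are versioned, and cannot be bypassed by the model.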

5. Cost: is spend bounded, and attributable to a decision?

Inference and GPU spend climb faster than the value they produce when no one can attribute the cost to a feature, a team, or a decision. Agentic workflows make this sharper — multi-step agents can consume many times the tokens of a single call. Production needs per-feature cost visibility and a ceiling that pages before it bankrupts. A bill is not a budget.
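One way to sketch per-feature attribution with a paging ceiling, assuming a hypothetical `CostLedger` and caller-supplied token prices. The prices in the usage below are placeholders, not any provider's real rates, and `page` stands in for whatever alerting hook you already have.

```python
from collections import defaultdict
from typing import Callable, Dict


class CostLedger:
    """Attributes spend to features and pages once when the ceiling is crossed."""

    def __init__(self, ceiling_usd: float, page: Callable[[str], None]) -> None:
        self.ceiling_usd = ceiling_usd
        self.page = page
        self.spend_by_feature: Dict[str, float] = defaultdict(float)
        self._paged = False

    @property
    def total(self) -> float:
        return sum(self.spend_by_feature.values())

    def record(self, feature: str, input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> float:
        """Add one call's cost to its feature and check the ceiling."""
        cost = (input_tokens / 1000) * usd_per_1k_input \
             + (output_tokens / 1000) * usd_per_1k_output
        self.spend_by_feature[feature] += cost
        if self.total >= self.ceiling_usd and not self._paged:
            self._paged = True   # page once, not on every later call
            self.page(f"spend {self.total:.2f} USD reached ceiling {self.ceiling_usd:.2f} USD")
        return cost


# Usage with placeholder prices: the agent feature pushes total spend past the ceiling.
alerts = []
ledger = CostLedger(ceiling_usd=5.0, page=alerts.append)
ledger.record("search", 1000, 1000, usd_per_1k_input=0.5, usd_per_1k_output=1.5)
ledger.record("agent", 2000, 2000, usd_per_1k_input=0.5, usd_per_1k_output=1.5)
```

In practice you would likely page at a fraction of the ceiling so there is time to react, but the attribution by feature is the part that makes the number actionable.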

6. Ownership: who runs this, and what wakes them up?

“The AI team owns it” means no one owns it. A production system needs a named human who runs it, knows what to watch, and is on call when it breaks. When the one person who understood the system leaves, an unowned model in production becomes a liability no one can safely change. Ownership is a control, not an org-chart footnote.

How to use this bar

Score your system honestly against all six. The pattern is almost always lopsided: a strong model and a 1/5 on observability and ownership. That lopsidedness is the diagnosis. The work isn’t more model, it’s the six unglamorous bars around it. Fix the lowest scores first; they’re the ones keeping you in the demo.
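The scoring rule can be written down in a few lines. This is a sketch under one assumption worth stating: it reads "can't honestly score above a 3" as a score of 3 or lower, and it orders blockers lowest first.

```python
from typing import Dict, List

BARS = ("deployment", "evaluation", "observability", "guardrails", "cost", "ownership")
BLOCKER_MAX = 3   # assumption: a score of 3 or lower counts as a blocker


def blockers(scores: Dict[str, int]) -> List[str]:
    """Return the bars that block production, lowest score first."""
    missing = [bar for bar in BARS if bar not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for bar in BARS:
        if not 0 <= scores[bar] <= 5:
            raise ValueError(f"{bar} must be scored from 0 to 5")
    flagged = [bar for bar in BARS if scores[bar] <= BLOCKER_MAX]
    # sorted() is stable, so ties keep the order of the bar table
    return sorted(flagged, key=lambda bar: scores[bar])


example = {"deployment": 4, "evaluation": 2, "observability": 1,
           "guardrails": 3, "cost": 5, "ownership": 1}
print(blockers(example))
```

The example scores are made up to show the typical lopsided pattern; the output is the order in which to fix things.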

If you’d rather not grade your own homework, that’s the production-readiness audit: a fixed-scope assessment against this exact bar, with a prioritized risk register and a scoped path to close the gaps. You can take the findings and run, or have the same person who scoped it do the buildout.

Questions this raises

Straight answers.

What does production-ready mean for an LLM application?

It means the system has a reproducible deployment path, an evaluation harness that catches regressions before users do, observability into what the model actually did, guardrails on inputs and outputs, a cost ceiling that's attributable and bounded, and a named human owner who is on call for it. The model working is table stakes; production-ready is everything around it.

Why do most AI POCs never reach production?

Because a prototype optimizes for showing what's possible, and production optimizes for what's safe, observable, affordable, and owned. The prototype skips deployment, evals, observability, guardrails, cost control, and ownership precisely because skipping them is what made it fast to build. Those six things are the work, and they're what the demo deferred.

How long does it take to get an LLM prototype to production?

It depends almost entirely on how many of the six bars are already met and on your regulatory posture, not on model work. A focused audit can establish where you stand against the bar in 1–2 weeks; the buildout that closes the gaps is scoped from there. The honest answer is that the timeline is set by deployment, governance, and ownership debt, not by the model.

Production-Readiness Audit

This is the work, not just the writeup.

If this is your situation, the production-readiness audit is where it gets fixed — by the person who wrote this.
