Framework

The Production-Readiness Bar

A six-dimension standard for deciding whether an AI system — LLM or agentic — is ready for production. Score each dimension from 0 to 5. The model working is table stakes, not a passing grade.

Version 1.0 · Updated June 30, 2026 · CC BY 4.0

A prototype answers one question — is this possible? Production answers four more: is it safe, can we see what it did, can we afford it, and who owns it when it breaks at 2am? The Production-Readiness Bar names the six dimensions where that gap lives, so you can score a system honestly instead of arguing about whether it “feels ready.”

It is published in the open so any team can self-assess. For the argument behind it (why prototypes stall, and why the cause is almost never the model), read the field note. For an assessment against this exact bar, see the production-readiness audit.

The scale

What a score means.

One scale, applied to every dimension, so a “3” means the same thing everywhere. Anything you cannot honestly score above 3 is a blocker, not a polish item.

0 Absent: Not addressed at all.
1–2 Ad hoc: Exists informally or as one-off scripts; not reproducible, not owned.
3 Partial: Works in the common case; gaps appear under load, failure, or audit.
4 Production-grade: Reproducible, monitored, documented; holds under normal failure.
5 Governed: Production-grade plus policy-as-code, audited, and resilient to personnel change.

The six dimensions

The bar, in full.

01

Deployment

Can you ship a change reproducibly and roll it back?

A reproducible, gated path to ship and roll back a change — no click-ops, no long-lived credentials.

Score 0: Shipping a change means one person running commands by hand.
Score 5: Every change ships through a gated, reproducible pipeline with keyless auth; rollback is a routine, tested operation.
Fail: Click-ops, long-lived keys, “only Sam can deploy it.”
02

Evaluation

Will you catch a regression before users do?

An evaluation harness — a representative dataset scored against a rubric, run on every change — that gates releases.

Score 0: Quality is judged by whoever last clicked through a demo.
Score 5: Evals run in CI on every change, block regressions, and are versioned alongside the prompts and code they cover.
Fail: No evals; a prompt tweak that quietly degrades 8% of responses ships invisibly.
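
As an illustration of the evaluation harness described above, here is a minimal sketch of a release gate. It is not a reference implementation: `EvalCase`, `pass_rate`, `gate_release`, and the `max_drop` tolerance are hypothetical names and values chosen for this example, and a real rubric is usually richer than a single pass/fail check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One row of a representative dataset (hypothetical shape)."""
    prompt: str
    expected: str


def pass_rate(
    cases: list[EvalCase],
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> float:
    """Fraction of cases whose output the rubric accepts."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(grade(generate(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def gate_release(candidate: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Allow a release only if the pass rate has not dropped by more than max_drop.

    max_drop is an assumed tolerance; choose one that fits your dataset size.
    """
    return candidate >= baseline - max_drop
```

Run in CI, a gate like this would stop the 8% degradation from the failure case above: `gate_release(0.75, 0.83)` returns `False`, so the change does not ship.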
03

Observability

Can you see what the model actually did?

Traces of what was retrieved, prompted, returned, and acted on — enough to reconstruct any decision after the fact.

Score 0: Silent failure is the default; you re-run it and hope.
Score 5: Every model interaction is traced and queryable; failures alert with the context needed to diagnose them.
Fail: No traces, no prompt/output logs, no idea what happened at 2am.
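
To make "retrieved, prompted, returned, and acted on" concrete, the sketch below records one model interaction as a single JSON line. The `Trace` fields and the `emit` helper are assumptions made for illustration, not a prescribed schema; in practice the sink would likely be a tracing backend rather than a file.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """One model interaction (hypothetical schema)."""
    feature: str                                        # which product feature made the call
    prompt: str                                         # what was prompted
    retrieved: list[str] = field(default_factory=list)  # what was retrieved
    output: str = ""                                    # what was returned
    actions: list[str] = field(default_factory=list)    # what was acted on
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)


def emit(trace: Trace, sink) -> None:
    """Append the trace as one JSON line to any file-like sink."""
    sink.write(json.dumps(asdict(trace)) + "\n")
```

One line per interaction keeps the log queryable with ordinary tools, and the `trace_id` gives an alert something to point at when a failure needs diagnosing after the fact.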
04

Guardrails

Are inputs and outputs actually controlled?

Input/output validation, filtering, policy-as-code, and human-in-the-loop on the actions that carry real consequences.

Score 0: Nothing constrains the model when it is wrong.
Score 5: Inputs and outputs are policy-checked in code; consequential actions require human approval; the policy itself is tested.
Fail: Passes the demo, fails security review; no human-in-the-loop where it matters.
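
A minimal sketch of what policy-as-code with human-in-the-loop could look like follows. The action names, the `MAX_INPUT_CHARS` limit, and the `Decision` type are all hypothetical; a real policy would cover far more than input length and an action allowlist.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy values, for illustration only.
CONSEQUENTIAL_ACTIONS = frozenset({"issue_refund", "delete_record", "send_email"})
MAX_INPUT_CHARS = 4000


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_input(text: str) -> Decision:
    """Validate user input before it reaches the model."""
    if not text.strip():
        return Decision(False, "empty input")
    if len(text) > MAX_INPUT_CHARS:
        return Decision(False, "input too long")
    return Decision(True, "ok")


def check_action(action: str, approved_by: Optional[str] = None) -> Decision:
    """Block consequential actions unless a named human has approved them."""
    if action in CONSEQUENTIAL_ACTIONS and not approved_by:
        return Decision(False, "human approval required")
    return Decision(True, "ok")
```

Because the policy is plain code, it can be unit-tested like anything else, which is what "the policy itself is tested" asks for. For example, `check_action("issue_refund")` is refused, while `check_action("issue_refund", approved_by="sam")` is allowed.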
05

Cost

Is spend bounded and attributable to a decision?

Per-feature cost visibility and a ceiling that pages before it bankrupts — especially for multi-step agents.

Score 0: A surprise bill; no one can attribute spend to a feature, team, or decision.
Score 5: Spend is attributed by feature and team, forecast, and capped by alarms that page before the budget breaks.
Fail: Agentic workflows quietly burn many times the tokens of a single call; a bill arrives in place of a budget.
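
The sketch below shows one way per-feature attribution and a paging ceiling could fit together. `CostLedger`, the `alert_fraction` of 0.8, and the per-1k-token prices passed in are assumptions for illustration; real prices come from your provider, and `page` stands in for whatever alerting hook you use.

```python
from collections import defaultdict
from typing import Callable


class CostLedger:
    """Attributes spend to features and pages once, before the budget is exhausted."""

    def __init__(
        self,
        budget_usd: float,
        page: Callable[[str, float], None],
        alert_fraction: float = 0.8,  # assumed threshold: page at 80% of budget
    ) -> None:
        self.budget_usd = budget_usd
        self.page = page
        self.alert_fraction = alert_fraction
        self.by_feature: dict[str, float] = defaultdict(float)
        self._paged = False

    @property
    def total(self) -> float:
        return sum(self.by_feature.values())

    def record(
        self,
        feature: str,
        input_tokens: int,
        output_tokens: int,
        usd_per_1k_in: float,
        usd_per_1k_out: float,
    ) -> float:
        """Add one call's cost to its feature and page if the threshold is crossed."""
        cost = input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out
        self.by_feature[feature] += cost
        if not self._paged and self.total >= self.alert_fraction * self.budget_usd:
            self._paged = True
            self.page(feature, self.total)
        return cost
```

For a multi-step agent, calling `record` on every step rather than once per user request is what surfaces the multiplied token use that a single-call estimate hides.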
06

Ownership

Who runs this, and what wakes them up?

A named human who runs the system, knows what to watch, and is on call when it breaks.

Score 0: “The AI team” owns it, i.e., no one specifically.
Score 5: A named owner with a documented operating model, on-call rotation, and a runbook; the system survives that person changing.
Fail: When the one person who understood it leaves, no one can safely change it.

Score yourself

The scorecard.

Score each dimension 0–5 against the scale above. The result is almost always lopsided: a strong model next to a 1/5 on observability and ownership. That lopsidedness is the diagnosis. The work is rarely more model; it is the unglamorous dimensions around it. Fix the lowest first.

| # | Dimension | The question it answers | Score 0–5 |
|---|---|---|---|
| 1 | Deployment | Can you ship a change reproducibly and roll it back? | __ / 5 |
| 2 | Evaluation | Will you catch a regression before users do? | __ / 5 |
| 3 | Observability | Can you see what the model actually did? | __ / 5 |
| 4 | Guardrails | Are inputs and outputs actually controlled? | __ / 5 |
| 5 | Cost | Is spend bounded and attributable to a decision? | __ / 5 |
| 6 | Ownership | Who runs this, and what wakes them up? | __ / 5 |

Reading it: any dimension at 3 or below is a launch blocker. A single low score is usually what keeps an otherwise good system stuck at the demo stage.
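
The reading rule is mechanical enough to sketch in code. The helper below is a hypothetical illustration (the names `read_scorecard` and `BLOCKER_MAX` are not part of the bar): it validates a filled-in scorecard and returns the launch blockers, lowest score first.

```python
DIMENSIONS = (
    "Deployment",
    "Evaluation",
    "Observability",
    "Guardrails",
    "Cost",
    "Ownership",
)
BLOCKER_MAX = 3  # any dimension at 3 or below is a launch blocker


def read_scorecard(scores: dict[str, int]) -> list[tuple[str, int]]:
    """Return (dimension, score) pairs that block launch, lowest score first."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 5:
            raise ValueError(f"{d} must be scored 0 to 5, got {scores[d]}")
    blockers = [(d, scores[d]) for d in DIMENSIONS if scores[d] <= BLOCKER_MAX]
    # sorted() is stable, so ties keep the order of the scorecard.
    return sorted(blockers, key=lambda item: item[1])


scores = {
    "Deployment": 4,
    "Evaluation": 3,
    "Observability": 1,
    "Guardrails": 4,
    "Cost": 2,
    "Ownership": 1,
}
for dimension, score in read_scorecard(scores):
    print(f"{dimension}: {score}/5")
```

With the example scores, the list starts with Observability and Ownership at 1/5, which is the "fix the lowest first" order.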

Reference & reuse

Cite it, build on it.

The Production-Readiness Bar is published under CC BY 4.0 — use it, adapt it, fold it into your own standards. Attribute RingMod and link back so a reader can find the current version.

Suggested citation

The Production-Readiness Bar v1.0. RingMod (Matt Woolly), 2026. https://ringmod.ai/production-readiness-bar/

Questions this raises

Straight answers.

What is the Production-Readiness Bar?
It is an open, six-dimension standard for deciding whether an AI system — LLM or agentic — is ready for production: deployment, evaluation, observability, guardrails, cost, and ownership. You score each dimension from 0 to 5. The model working is table stakes, not a passing grade.
How is this different from a generic AI readiness checklist?
It is a scored rubric, not a checklist: each dimension has a 0–5 scale, so a score means the same thing across systems and over time. Its emphasis falls on the things that actually block regulated-industry launches (guardrails, cost control, and named ownership), and it is drawn from production work, not from a vendor maturity model that exists to sell a product.
Can I use the Production-Readiness Bar for my own assessments?
Yes. It is published openly under CC BY 4.0 — score your system against the six dimensions, share the result, and attribute the source with a link. If you would rather not grade your own homework, a production-readiness audit assesses against this exact bar.
Who maintains it, and how often does it change?
It is maintained by RingMod and versioned, so a reference to “v1.0” is stable. It is updated as the practice evolves — particularly as agentic systems change what observability, guardrails, and cost control have to cover.

Production-Readiness Audit

Don’t want to grade your own homework?

A fixed-scope assessment against this exact bar, with a prioritized risk register and a scoped path to close the gaps — by the person who wrote it.
