Field note

FinOps for AI: why your inference bill keeps climbing

AI spend climbs faster than the value when no one can attribute it. The fix isn't a cheaper model — it's per-feature cost visibility and a ceiling that pages.

Last updated June 30, 2026

An AI inference bill that climbs faster than the value it produces looks like a pricing problem. It almost never is. It’s an attribution problem: no one can tie the spend to a feature, a team, or a decision, so no one can decide what to cut. You can’t optimize what you can’t attribute — and a bill is not a budget.

Why it climbs

Four things compound, usually at once. The biggest model gets used by default, even for tasks a smaller one handles. Context grows unchecked — whole documents stuffed into every call “to be safe.” Retries and failures quietly double some requests. And then agents multiply all of it: a multi-step agent makes many calls per task, so cost scales with autonomy in a way a single chat call never did. None of these shows up until the invoice does.
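A rough way to see the compounding is to multiply the factors out. The sketch below is a minimal cost model, not a measurement: the `CallProfile` fields, the `cost_per_task` helper, and every price and token count are hypothetical numbers chosen only to show how the four drivers stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Shape of one task's model usage. All numbers here are illustrative."""
    input_tokens: int         # context shipped on each call
    output_tokens: int        # tokens generated on each call
    calls_per_task: int = 1   # 1 for a plain chat call, many for an agent
    retry_rate: float = 0.0   # fraction of calls repeated once after a failure


def cost_per_task(p: CallProfile, usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Expected USD cost of one task, given prices per million tokens."""
    per_call = (p.input_tokens * usd_per_m_in + p.output_tokens * usd_per_m_out) / 1_000_000
    return per_call * p.calls_per_task * (1 + p.retry_rate)


# Hypothetical prices: a small model at $1/$4 per million tokens (input/output),
# a large model at $10/$40. Real prices vary by provider and change over time.
lean = cost_per_task(CallProfile(2_000, 500), 1.0, 4.0)
bloated = cost_per_task(
    CallProfile(20_000, 500, calls_per_task=12, retry_rate=0.25), 10.0, 40.0
)
print(f"lean: ${lean:.4f}  bloated: ${bloated:.2f}  ratio: {bloated / lean:.0f}x")
```

Under these made-up inputs the bloated agent task costs several hundred times the lean call, and no single factor accounts for it: model price, context size, call count, and retries each contribute a multiplier.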

The drivers, and the fixes

Cost driver	What’s actually happening	The fix
No attribution	Spend can’t be tied to a feature, team, or decision	Tag and meter spend per feature before optimizing anything
Biggest model by default	Paying frontier prices for tasks a smaller model handles	Match model to task difficulty; route, don’t default
Context & retrieval bloat	Huge context shipped on every call	Trim retrieval, summarize, cache; make every token earn its place
Agentic multiplication	Dozens of calls per task, plus tool use and retries	Per-task token budgets and a ceiling that halts runaway loops
No ceiling	Spend is discovered via the invoice	Alarms that page before the budget breaks; hard caps
Idle / over-provisioned GPUs	Paying for capacity you don’t use (self-hosted)	Right-size, autoscale, or move to serverless inference

Notice the order: model choice is one row of six, and it is not the first. The first move is always attribution, because every other decision depends on it.
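Attribution can start as little more than a tag on every call and a roll-up by that tag. The following is a minimal sketch, assuming each model call is logged as a usage event. The `UsageEvent` fields, the `PRICES` table, and the `cost_by` helper are hypothetical names for illustration, not a particular vendor's billing API.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical price table: USD per million (input, output) tokens.
PRICES = {"small": (1.0, 4.0), "large": (10.0, 40.0)}


@dataclass(frozen=True)
class UsageEvent:
    """One model call, tagged at the call site with who and what it was for."""
    feature: str
    team: str
    model: str
    input_tokens: int
    output_tokens: int


def event_cost(e: UsageEvent) -> float:
    """USD cost of a single call under the assumed price table."""
    price_in, price_out = PRICES[e.model]
    return (e.input_tokens * price_in + e.output_tokens * price_out) / 1_000_000


def cost_by(events, tag: str) -> dict:
    """Sum cost per value of one tag, e.g. 'feature', 'team', or 'model'."""
    totals = defaultdict(float)
    for e in events:
        totals[getattr(e, tag)] += event_cost(e)
    return dict(totals)


events = [
    UsageEvent("search", "growth", "large", 8_000, 400),
    UsageEvent("search", "growth", "large", 9_000, 500),
    UsageEvent("summaries", "support", "small", 3_000, 300),
]
by_feature = cost_by(events, "feature")
by_team = cost_by(events, "team")
```

The design choice that matters is tagging at the call site: once every event carries a feature and a team, the same log answers "which feature," "which team," and "which model" without new instrumentation.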

Agentic systems are where this gets sharp

The shift to agents changes the cost curve. A chat completion is one bounded unit of spend. An agent is a loop — it plans, calls tools, calls the model again, retries when something fails — and a single task can fan out into dozens of calls. That’s the leverage you wanted; it’s also a new failure mode, where one pathological trajectory burns many times a normal request before anyone notices. Cost scaling with autonomy is why a per-task budget and a hard ceiling stop being nice-to-haves.
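A per-task budget with a hard ceiling can be as small as a counter that raises when crossed. This is a sketch under stated assumptions: `TaskBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names, and `step` stands in for whatever your agent framework does in one plan, tool, and model iteration, returning the tokens it used and whether the task is done.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task crosses its token or step ceiling."""


class TaskBudget:
    def __init__(self, max_tokens: int, max_steps: int):
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one completed step and halt the task if a ceiling is crossed."""
        self.steps += 1
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens or self.steps > self.max_steps:
            raise BudgetExceeded(
                f"halted after {self.steps} steps and {self.tokens_used} tokens"
            )


def run_agent(step, budget: TaskBudget) -> None:
    """Drive the loop; step() returns (tokens_used, done)."""
    while True:
        tokens, done = step()
        budget.charge(tokens)  # charged after the call, so overshoot is at most one step
        if done:
            return


def stuck_step():
    """A pathological trajectory: burns tokens and never finishes."""
    return 3_000, False


budget = TaskBudget(max_tokens=10_000, max_steps=8)
try:
    run_agent(stuck_step, budget)
    halted = False
except BudgetExceeded:
    halted = True
```

Because the charge is recorded after each step, the ceiling bounds the damage to one step past the limit rather than preventing that step. A stricter variant would estimate the next call's tokens and refuse it up front.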

This is the Cost dimension of the Production-Readiness Bar: spend bounded and attributable to a decision. And it leans on Observability — you cannot attribute what you cannot see, so cost control and tracing are the same project from two angles.

The discipline, not the heroics

FinOps for AI isn’t a one-time teardown; it’s the same loop cloud FinOps already runs — attribute, forecast, cap — applied to tokens and GPUs. Done once, the spend drops; done as a standing control, it stays down. The deliverable that matters is the one that survives after the engagement: per-feature cost visibility and budgets that page, not a clever prompt that shaved tokens for a month.
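The forecast and cap steps of that loop can be sketched with a simple run-rate projection. This is an illustration only: the function names, the linear projection, and the 80% warning threshold are assumptions, and a real forecast would account for growth and seasonality in usage.

```python
def projected_month_spend(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Linear run-rate forecast: assume the rest of the month looks like the days so far."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be between 1 and days_in_month")
    return spend_to_date / day_of_month * days_in_month


def budget_status(spend_to_date: float, day_of_month: int, days_in_month: int,
                  budget: float, warn_ratio: float = 0.8) -> str:
    """Return 'cap', 'page', 'warn', or 'ok' for the current month."""
    if spend_to_date >= budget:
        return "cap"    # budget already spent: enforce the hard cap
    projected = projected_month_spend(spend_to_date, day_of_month, days_in_month)
    if projected >= budget:
        return "page"   # on track to break the budget: page someone now
    if projected >= warn_ratio * budget:
        return "warn"   # close to the line: notify without paging
    return "ok"


# Day 10 of a 30-day month, $4,000 spent against a $10,000 budget (made-up numbers).
status = budget_status(4_000, 10, 30, 10_000)
```

The point of the projection is timing: the page fires on the day the trend crosses the budget, not on the day the invoice arrives.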

If your inference and cloud spend are climbing faster than the value and no one can say where the money goes, that’s the AI cost optimization engagement: a measured teardown of where the spend lands and a remediation plan with the savings quantified before any change ships — priced on a share of what’s actually saved.

Questions this raises

Straight answers.

Why is our AI inference bill so high?
Usually it isn't the per-token price; it's the absence of attribution. No one can say which feature, team, or decision drives the spend, so nothing gets optimized. Add multi-step agents that make many calls per task, retries, oversized context windows, and the biggest model used by default, and the bill compounds quietly.

Isn't the answer just to switch to a cheaper model?
Sometimes, but it's the last lever, not the first. The early wins are attribution (cost per feature), right-sizing the model to the task instead of defaulting to the biggest, trimming context and retrieval bloat, caching repeated calls, and capping runaway agent loops. Model choice matters, but you can't choose well without per-feature cost data.

What's different about cost control for agentic systems?
A single chat call is one unit of spend; an agent can take dozens of steps per task, each a model call, plus tool use and retries. Cost scales with autonomy. Without a per-task budget and a ceiling that halts a runaway loop, one bad trajectory can cost many times a normal request.

How do we make AI spend predictable?
Attribute it (tag spend by feature, team, and workload), forecast it against usage, and cap it with alarms that page before the budget breaks: the same FinOps discipline as cloud, applied to tokens and GPUs. A bill is not a budget.
