
Building a Cost Optimisation Loop for AI Agents

AI models are getting cheaper. But most organisations can't tell you what their AI agent actually costs per outcome.

They see monthly bills. They see token counts. What they don't see is: "It cost us £0.47 to deploy that load balancer, and £2.30 to configure that firewall rule."

Without that visibility, you can't optimise. You're just hoping costs go down as models get cheaper.

There's a better way: build a system that finds the cheapest path to the correct outcome, automatically.


You Can't Optimise What You Can't Measure

Here's the problem with most AI agent deployments: there's nothing to measure against.

A chatbot conversation has no defined outcome. A copilot suggestion has no success criteria. An agent that "helps with tasks" produces results that are subjective, variable, and impossible to cost-attribute.

To build a cost optimisation loop, you need structure:

  • Defined services - What is this agent supposed to do? What are the inputs? What does success look like?
  • Discrete pipelines - Each run is a measurable unit with a start, end, and outcome
  • Clear success criteria - Did we achieve the customer's intent, or didn't we?

Without these edges, there's nothing to optimise. With them, you can build a feedback loop that compounds over time.


The Mechanism

Cost optimisation for AI agents isn't magic. It's measurement plus experimentation plus routing.

Step 1: Track Everything

Every pipeline run captures:

  • Model used - Which LLM handled this request?
  • Tokens in/out - How much did this specific call cost?
  • Iterations - How many attempts to reach success?
  • Outcome - Success, failure, or partial?
  • Service type - What kind of task was this?

This gives you a cost per pipeline, attributable to a specific service and outcome.
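A run record like this can be sketched in a few lines. The model names and per-token prices below are illustrative assumptions, not real provider rates:

```python
from dataclasses import dataclass

# Illustrative per-1k-token prices in GBP (assumptions; real prices vary by
# provider and change frequently).
PRICE_PER_1K = {
    "gpt-4": {"in": 0.024, "out": 0.048},
    "gpt-4o-mini": {"in": 0.00012, "out": 0.00048},
}

@dataclass
class PipelineRun:
    service_type: str   # what kind of task was this?
    model: str          # which LLM handled this request?
    tokens_in: int      # how much did this specific call cost?
    tokens_out: int
    iterations: int     # how many attempts to reach success?
    outcome: str        # "success", "failure", or "partial"

    @property
    def cost(self) -> float:
        # Cost attributable to this specific run
        p = PRICE_PER_1K[self.model]
        return (self.tokens_in / 1000) * p["in"] + (self.tokens_out / 1000) * p["out"]

run = PipelineRun("firewall_classification", "gpt-4", 3200, 850, 2, "success")
print(f"£{run.cost:.4f}")  # cost per pipeline, tied to a service and outcome
```

Once every run is captured in this shape, the later steps are just queries over the log.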

Step 2: Establish Baselines

Once you have data, patterns emerge:

  • "Firewall classification averages £0.12 per rule with GPT-4"
  • "Load balancer deployments take 4 iterations on average"
  • "WAF policy analysis costs 3x more than basic config generation"

Now you have something to optimise against.
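Baselines fall out of a simple group-by over the run log. The rows below are hypothetical, but the aggregation is the point: average cost, average iterations, and success rate per (service, model) pair:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run log: (service_type, model, cost_gbp, iterations, outcome)
runs = [
    ("firewall_classification", "gpt-4", 0.11, 2, "success"),
    ("firewall_classification", "gpt-4", 0.13, 3, "success"),
    ("lb_deployment", "gpt-4", 0.45, 4, "success"),
    ("lb_deployment", "gpt-4", 0.52, 5, "failure"),
]

# Group by (service_type, model): each group is a baseline to optimise against.
groups = defaultdict(list)
for service, model, cost, iters, outcome in runs:
    groups[(service, model)].append((cost, iters, outcome == "success"))

for (service, model), rows in groups.items():
    costs, iters, successes = zip(*rows)
    print(f"{service} / {model}: "
          f"avg £{mean(costs):.2f}, "
          f"avg {mean(iters):.1f} iterations, "
          f"{100 * mean(successes):.0f}% success")
```

In practice this would run over thousands of logged pipelines, but the shape of the output is the same: "firewall classification averages £X per rule with model Y".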

Step 3: Experiment

The question isn't "which model is cheapest?" It's "which model is cheapest for this task?"

  • Try the cheaper model on the next firewall classification
  • Did it work? How many iterations?
  • Compare to baseline

Some tasks need the expensive model. Some don't. You won't know which until you test.
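The promote-or-keep decision can be made explicit. The numbers and the risk-tolerance threshold below are assumptions for illustration:

```python
# Hypothetical baseline (from Step 2) and trial runs on the cheaper model.
baseline = {"model": "gpt-4", "avg_cost": 0.12, "success_rate": 0.98}
trials = [  # (cost_gbp, iterations, succeeded) for the cheap model, same task
    (0.015, 3, True), (0.012, 2, True), (0.020, 5, False), (0.014, 2, True),
]

trial_cost = sum(c for c, _, _ in trials) / len(trials)
trial_success = sum(ok for _, _, ok in trials) / len(trials)

# Promote the cheap model only if it is actually cheaper AND reliability
# stays within an acceptable margin of the baseline.
MAX_SUCCESS_DROP = 0.05  # tuneable risk tolerance (an assumption)
if (trial_cost < baseline["avg_cost"]
        and trial_success >= baseline["success_rate"] - MAX_SUCCESS_DROP):
    decision = "promote cheap model for this task"
else:
    decision = "keep baseline model"
print(decision)  # here the 75% trial success rate falls short, so we keep gpt-4
```

The comparison is always against the baseline for *this* task, not a global leaderboard.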

Step 4: Route Intelligently

With enough data, you can route automatically:

  • Low-risk, routine tasks → Cheap model, fail fast, retry if needed
  • High-risk, complex tasks → Expensive model, more verification
  • Unknown tasks → Start cheap, escalate on failure

The routing logic itself becomes a tuneable parameter.
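The three routing rules above can be sketched directly. Model names, risk tiers, and the retry limit are all assumptions, not a real API:

```python
CHEAP, EXPENSIVE = "small-model", "frontier-model"

def choose_model(risk: str) -> str:
    if risk == "low":
        return CHEAP      # routine task: fail fast, retry if needed
    if risk == "high":
        return EXPENSIVE  # complex task: pay for reliability up front
    return CHEAP          # unknown task: start cheap, escalate on failure

def run_with_escalation(task, risk: str, attempt_fn, max_cheap_retries: int = 2):
    """Try the routed model; escalate to the expensive model if the cheap path fails."""
    if choose_model(risk) == CHEAP:
        for _ in range(max_cheap_retries):
            result = attempt_fn(task, CHEAP)
            if result["success"]:
                return result
    return attempt_fn(task, EXPENSIVE)  # escalation (or high-risk default)

# Toy attempt function: the cheap model fails on this task, forcing escalation.
def attempt(task, model):
    return {"success": model == EXPENSIVE, "model": model}

print(run_with_escalation("configure WAF rule", "unknown", attempt))
# → {'success': True, 'model': 'frontier-model'}
```

Note that `MAX_CHEAP_RETRIES`-style limits are exactly the tuneable parameters the paragraph above refers to: the routing policy itself gets optimised from the run data.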

Step 5: Learn at Fleet Level

Individual pipeline optimisation is good. Fleet-level learning is better.

  • "All firewall classifications across all customers work fine with the cheap model"
  • "This specific prompt phrasing reduces iterations from 8 to 3"
  • "Services with this schema pattern need the expensive model"

The learnings from one service improve all similar services.
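Fleet-level learning is the same aggregation as Step 2, but pooled by service type across customers rather than per customer. The log rows and threshold below are hypothetical:

```python
from collections import defaultdict

# Hypothetical fleet log: one row per pipeline run, pooled across customers.
fleet_runs = [
    {"customer": "acme",   "service": "firewall_classification", "model": "small-model", "success": True},
    {"customer": "acme",   "service": "firewall_classification", "model": "small-model", "success": True},
    {"customer": "globex", "service": "firewall_classification", "model": "small-model", "success": True},
    {"customer": "globex", "service": "waf_policy_analysis",     "model": "small-model", "success": False},
]

# Pool by service type, not by customer: one customer's runs inform
# the fleet-wide default for everyone.
by_service = defaultdict(list)
for r in fleet_runs:
    if r["model"] == "small-model":
        by_service[r["service"]].append(r["success"])

MIN_SUCCESS = 0.95  # tuneable threshold (an assumption)
for service, outcomes in by_service.items():
    rate = sum(outcomes) / len(outcomes)
    default = "small-model" if rate >= MIN_SUCCESS else "frontier-model"
    print(f"{service}: {rate:.0%} cheap-model success -> fleet default: {default}")
```

In a real system you would also want a minimum sample size before changing a fleet default; four runs is far too few.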


The Tradeoffs

Cost optimisation isn't about minimising cost. It's about appropriate cost.

Cheap isn't always better:

  • A £0.05 pipeline that fails 30% of the time can cost more than a £0.15 pipeline that works first time, once retries and failure handling are counted
  • A fast, cheap model that produces subtly wrong configs creates expensive problems downstream
  • Skipping verification to save tokens is false economy in high-risk domains

Expensive isn't always necessary:

  • Routine tasks don't need frontier models
  • Verification steps can often use smaller models than generation steps
  • Most iterations are wasted on edge cases that a better prompt would handle

The goal is finding the minimum cost path to the correct outcome - not the minimum cost path to an outcome.
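The first tradeoff above can be made concrete. On retry cost alone the flaky cheap pipeline would still win (£0.05 / 0.7 ≈ £0.07), so the comparison hinges on what each failure costs to handle: triage, rollback, re-verification. The £0.20 failure overhead below is an assumption for illustration:

```python
def expected_cost(run_cost, success_rate, failure_overhead=0.0):
    """Expected cost per successful outcome, retrying until success.

    Expected attempts = 1 / success_rate (geometric distribution);
    each failed attempt additionally incurs a fixed handling overhead.
    """
    attempts = 1 / success_rate
    failures = attempts - 1
    return attempts * run_cost + failures * failure_overhead

# Cheap-but-flaky vs pricier-but-reliable, with an assumed £0.20 cost to
# diagnose and clean up each failure:
cheap = expected_cost(0.05, 0.70, failure_overhead=0.20)
solid = expected_cost(0.15, 0.99, failure_overhead=0.20)
print(f"cheap path:    £{cheap:.3f} per success")
print(f"reliable path: £{solid:.3f} per success")
```

Under these assumptions the "£0.05" pipeline costs about £0.16 per success and the "£0.15" one about £0.15, and the gap widens as failure handling gets more expensive, which is exactly the false economy in high-risk domains.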


Why Structure Matters

This optimisation loop only works because we have structure to measure against.

In AI That Removes the Boring Parts, we talked about using AI for classification and analysis tasks. Those tasks have defined inputs, defined outputs, and clear success criteria. That's what makes them optimisable.

Similarly, in What Does an AI Agent in Production Actually Look Like?, we described systems with audit trails, rollback capabilities, and measurable outcomes. Those aren't just governance requirements - they're the foundation for optimisation.

Ad-hoc AI assistance can't be optimised because there's nothing to measure. A chatbot that "helped with something" has no cost-per-outcome metric.

Structured agentic systems - with defined services, discrete pipelines, and clear success criteria - can be measured, experimented on, and continuously improved.


The Compounding Effect

The more pipelines run, the more data you collect. The more data you collect, the better your routing. The better your routing, the lower your costs.

This is a flywheel:

  1. Run pipelines → collect cost/outcome data
  2. Analyse data → identify optimisation opportunities
  3. Experiment → find cheaper paths that work
  4. Update routing → apply learnings to future pipelines
  5. Repeat

Early pipelines subsidise the learning. Later pipelines benefit from it. Over time, the system converges on the cheapest reliable path for each service type.


What This Looks Like in Practice

We're building this optimisation loop into NetOrca Pack. The foundation is already there:

  • Services are schema-defined - Every service has a clear definition of what success looks like
  • Pipelines are discrete and auditable - Every run is tracked from intent to outcome
  • Cost is attributable - Every LLM call is logged with model, tokens, and result

The next layer is automated experimentation and routing - trying cheaper models, learning what works, and applying those learnings across the fleet.

The goal: every pipeline runs at the minimum cost required to achieve the customer's intent. Not cheaper than that (which risks failure), not more expensive (which wastes resources).

That's what cost optimisation for AI agents actually looks like. Not hoping models get cheaper. Building a system that finds the cheapest path, automatically, and gets better at it over time.