✶Essay1 June 2026 · 6 min read

Everyone's Burning Millions on AI Coding. I Tried to Fix the Bill — and Built the Wrong Thing.

I spent days building a proxy to cut my AI coding bill. The benchmark deleted it — and showed me the one setup change that actually cuts cost by an order of magnitude. It wasn't the model.

AI codingcostbenchmarkspost-mortem

You've seen the headlines — the industry torching hundreds of millions of dollars in tokens on AI coding agents. My reaction was probably the same as a lot of engineers': there's got to be waste in there, and waste is a product.

So I built one. Hannah was a proxy you'd point your coding agent at. It would trim and compress the context being re-sent every turn, and route the easy work to a cheap local model while the hard work went to a frontier model. Classic middleman-takes-a-cut economics. On paper, obvious.

Then I benchmarked it properly, and the data quietly took the idea out back.

The setup

I built the same two real apps — a Next.js PWA and a mobile game — four ways: on Codex (gpt-5.5), on Claude (opus-4.8), on a local model, and through my router. Same tasks, real build-and-test verification on every run, actual token and dollar accounting.

Three things fell out, and none of them were what I expected.

1. "Subscription" coding agents are not free — they just report $0

The harness shows you nothing because it isn't pricing that path, but the API-equivalent bill is real: ~~$0.48 to build the PWA on Codex, ~~$0.95 on Claude. And the bill is dominated by output tokens (~~$25–30/M), not the input you're tempted to compress — cached input reads are nearly free (~~$0.50/M). A terser model beats a verbose one on the same task, even when it reads more context. The thing I built Hannah to optimise barely mattered.

2. The 10× lever isn't the model — it's the harness

This is the one that actually matters. Same model (gpt-5.5), same kind of build, two harnesses:

Harness	Input tokens / turn	Build cost
Native Codex CLI	~50–70k	~$5–11
Lean harness (Pi)	~4k	$0.22

That's ~15× more context shipped per turn — purely from the harness, before a single optimisation. The native CLI re-ships a big system prompt, full tool schemas, and accumulated history every turn; the lean harness ships a tiny prompt and four tools.

And no — caching doesn't save you. The native CLI cached well here (most of its input was cache-reads), but at enough volume even near-free reads add up. The harness sends more; caching only softens the blow.

3. Routing barely pays on real work

On a from-scratch build there's no big "easy slice" to offload — the planner sent almost everything to the frontier anyway, and the local model got one or two buggy steps. The routed run ended up the most expensive and the slowest: you pay a planner tax on top of frontier executors. Routing only wins when there's a large, separable, low-risk chunk — which complex builds don't have.

The conclusion: it's a setup choice, not a product

Across both tasks, one setup was consistently the cheapest, the fastest, and the only clean first pass:

Pi + Codex. $0.22–0.48 a build, 2–4 minutes, correct on the first try. Roughly an order of magnitude cheaper than the same build in a heavyweight native harness.

And it needs no proxy, no routing, no local model — the exact machinery I'd spent days building. The whole savings story collapses to two decisions you make once, for free: use a lean harness, pair it with a terse frontier model.

The proxy was a clever answer to a question a good default already solves. Two things did survive the cull, and I think they're the real lesson:

Cost has to be visible — per-task, per-model, in real dollars. "It says $0" is how the bill gets to nine figures.
You cannot trust an agent's own "done." A wall-clipping game with a dead control pad passed every "no console errors" check I had. A functional check — actually run it, screenshot it — caught the bug in seconds.

I'll publish the full benchmark and the Hannah code separately for anyone who wants to pull the numbers apart. But the one-line takeaway is cheap and worth more than the proxy ever was: before you optimise the middle, look at the edges.

←All writings