Lawrence Hui · builds AI · writes in public
Essay · 19 June 2026 · 5 min read

Three Tools Want to Cut Your AI Token Bill. They're Fixing Three Different Leaks.

Caveman, Headroom, and the proxy I built all promise cheaper agents. Put side by side, they attack three different points in the pipe — and only one of them is aimed at the tokens that actually cost money.

AI coding · cost · context · tooling

Two token-compression tools went around my corner of GitHub this week — Headroom and Caveman — and I'd already spent days building a third one of my own. The instinct behind all three is the same: agents read and write a lot, tokens cost money, so squeeze the tokens.

But put them next to each other and the interesting thing isn't that they compress. It's where they compress. There are three leaks in an agent's token bill, not one, and these tools each plug a different one. Pick the wrong one for your setup and you save nothing — or you make it worse.

The pipe has three points

Every turn, an agent does three things that cost tokens: it reads the context you feed in, it emits a reply, and it does both through a harness that may or may not be caching the boring parts. Input, output, harness. Each tool lives at exactly one of these.

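[Figure: a pipe running from INPUT (context in) through MODEL (the work) to OUTPUT (the reply), with HARNESS / CACHE underneath. Headroom sits at the input, Caveman at the output (≈ 50× the cost), and Hannah at the harness: compress only when the cache isn't already doing it.]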
Three tools, three leaks. Headroom trims the input, Caveman trims the output, Hannah decides whether either is worth it.

Caveman compresses the output. Its tagline — "why use many token when few token do trick" — is the whole pitch. It's a skill that makes the model write tersely, and it reports a ~65% cut in output tokens across real tasks, reasoning tokens left untouched.

Headroom compresses the input. It routes content by type, runs AST-aware compression on code and a trained model on prose, and claims 60–95% fewer input tokens with answers held steady. It's the most technically ambitious of the three, and notably it ships a CacheAligner to keep provider caches hitting — which tells you the authors already saw the trap I'm about to describe.

My proxy, Hannah, decides whether to compress at all. That sounds like a non-answer until you look at the prices.

Output tokens are where the money is

Here's the number that orders everything. When I benchmarked real builds properly, output tokens ran ~$25–30 per million. Cached input reads ran ~$0.50 per million. That's a gap of 50× or more.

So a tool that shaves input is fighting over the cheap tokens, and a tool that shaves output is fighting over the expensive ones. Caveman, the simplest of the three, is aimed straight at the costly side of the bill. On a strong-cache harness it'll out-save a sophisticated input compressor for a fraction of the complexity. That surprised me, and it shouldn't have — it's the same lesson my own benchmark taught me: a terser model beats a verbose one even when it reads more context.
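To make the gap concrete, here is a small arithmetic sketch using the per-million prices quoted above. The function name and the idea of expressing a saving as "cached-input-token equivalents" are illustrative only, not part of any of the three tools.

```python
# Prices in USD per million tokens, taken from the benchmark figures above.
OUTPUT_USD_PER_M_LOW = 25.00    # low end of the ~$25-30 output range
OUTPUT_USD_PER_M_HIGH = 30.00   # high end of the same range
CACHED_INPUT_USD_PER_M = 0.50   # cached input reads


def cached_input_equivalent(output_tokens_saved,
                            output_price=OUTPUT_USD_PER_M_LOW,
                            cached_price=CACHED_INPUT_USD_PER_M):
    """Cached input tokens you would have to trim to save the same dollars
    as trimming `output_tokens_saved` output tokens."""
    return output_tokens_saved * output_price / cached_price


ratio_low = cached_input_equivalent(1)
ratio_high = cached_input_equivalent(1, output_price=OUTPUT_USD_PER_M_HIGH)
print(f"one output token costs as much as {ratio_low:.0f} to {ratio_high:.0f} "
      "cached input tokens")
```

The ratio only says what each trimmed token is worth. Whether an output trimmer actually out-saves an input trimmer on a given setup still depends on how many tokens of each kind a turn uses.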

On a strong cache, compressing the input can cost you

This is the part most people miss, and it's why my proxy's real job is to do nothing on purpose.

Claude Code serves 90%+ of its input from the prompt cache at roughly a tenth of the price. That cached prefix is stable — same bytes every turn, so the cache keeps hitting. Now run an input compressor over it. You've rewritten the prefix. The hash changes, the cache misses, and the input you were paying 10% for snaps back to full price. The "optimisation" just sent your bill up.
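A back-of-envelope sketch of that failure mode, measured in full-price-token equivalents so no dollar price is needed. It assumes the worst case, where rewriting the prefix drops the hit rate to zero for that turn, and it ignores any cache-write surcharge. The 100,000-token context and the 50% compression ratio are made-up inputs.

```python
def effective_input_units(input_tokens, cache_hit_rate, cached_discount=0.10):
    """Input cost of one turn, in full-price-token equivalents.

    Cached tokens are billed at `cached_discount` of the full price
    (the "roughly a tenth" figure above); the rest are billed in full.
    """
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return cached * cached_discount + uncached


def break_even_keep_fraction(cache_hit_rate, cached_discount=0.10):
    """Largest fraction of the input a compressor may keep before a total
    cache miss makes the compressed turn cost more than the untouched one."""
    return cache_hit_rate * cached_discount + (1.0 - cache_hit_rate)


# Hypothetical turn: 100k tokens of context, 90% served from cache.
untouched = effective_input_units(100_000, cache_hit_rate=0.90)
# Same turn after a 50% input compression that invalidates the whole cache.
compressed = effective_input_units(50_000, cache_hit_rate=0.0)
keep = break_even_keep_fraction(0.90)

print(f"untouched prefix:  {untouched:,.0f} units")
print(f"compressed prefix: {compressed:,.0f} units")
print(f"break-even: keep at most {keep:.0%} of the input")
```

Under these assumptions the half-sized input costs about 2.6 times what the untouched one did, and the compressor would have to strip more than about 81% of the input just to break even.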

So the proxy resolves a strategy once, per harness, and never touches it again:

Harness | Cache behaviour | Right move
Codex (subscription) | Weak: ~10% hit, context re-sent near-full every turn | Compress + trim. Input is the bill.
Claude Code | Strong: 91%+ from cache at ~10% cost | Passthrough. Touch nothing. Watch the numbers.

On Codex, that compress-and-trim path cut input by ~35% a turn and pulled the cache hit rate from 10% to 59%, with quality held. On Claude Code the same machinery is worse than useless. Same code, opposite verdict, decided entirely by what the harness already does for free.
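The resolve-once behaviour could be sketched roughly like this. It is not the proxy's actual code: the class, the harness keys, and especially the 0.5 cutoff are hypothetical, since the only data points given here are a ~10% hit rate (compress) and a 91%+ hit rate (passthrough).

```python
from enum import Enum


class Strategy(Enum):
    COMPRESS_AND_TRIM = "compress_and_trim"
    PASSTHROUGH = "passthrough"


class StrategyResolver:
    """Picks a strategy per harness once, then keeps returning it."""

    def __init__(self, strong_cache_threshold=0.5):
        # Hypothetical cutoff between "weak" and "strong" caching.
        self.strong_cache_threshold = strong_cache_threshold
        self._resolved = {}

    def resolve(self, harness, observed_cache_hit_rate):
        # Only the first observation for a harness decides the strategy;
        # later hit rates are ignored so the choice never flips mid-session.
        if harness not in self._resolved:
            if observed_cache_hit_rate >= self.strong_cache_threshold:
                self._resolved[harness] = Strategy.PASSTHROUGH
            else:
                self._resolved[harness] = Strategy.COMPRESS_AND_TRIM
        return self._resolved[harness]


resolver = StrategyResolver()
codex_first = resolver.resolve("codex", 0.10)
claude = resolver.resolve("claude-code", 0.91)
# After compression lifts Codex's hit rate to 59%, the decision stays put.
codex_later = resolver.resolve("codex", 0.59)
print(codex_first.value, claude.value, codex_later.value)
```

One reason to pin the decision in a sketch like this: the compressed path itself raises the hit rate, so re-evaluating every turn could flip a harness back to passthrough because of the improvement the compression produced.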

So which tool do you reach for

Not whichever has the best benchmark on its README. The one that matches your leak:

  • Verbose model, costs piling up on replies? Compress the output. That's the expensive token, on every harness. Caveman's lane.
  • Weak-cache harness re-sending huge context? Compress the input. Codex, RAG-heavy pipelines, anything stateless. Headroom's lane.
  • Strong-cache harness? Mostly leave the input alone and make the cost visible instead. The cache is already doing the work; don't break it.

The honest version of all of this: compression isn't a product, it's a response to a specific leak. Before you bolt a compressor onto your agent, find out where the bill is actually going and whether your harness is already discounting it. Same advice I gave myself after deleting most of my proxy — before you optimise the middle, look at the edges.
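One way to find out where the bill is going is to price a session's token counts by category. The sketch below uses the output and cached-read prices quoted earlier. The uncached input price is an assumption derived from "cached is roughly a tenth of the price", and the session totals are hypothetical numbers, not measurements.

```python
# USD per million tokens. Output and cached figures are the essay's;
# the uncached input price is an assumption (about 10x the cached price).
PRICES_USD_PER_M = {
    "uncached_input": 5.00,
    "cached_input": 0.50,
    "output": 25.00,
}


def bill_breakdown(tokens, prices=PRICES_USD_PER_M):
    """Return {category: (dollars, share_of_total)} for one session."""
    dollars = {k: tokens[k] * prices[k] / 1_000_000 for k in prices}
    total = sum(dollars.values())
    return {k: (d, d / total if total else 0.0) for k, d in dollars.items()}


# Hypothetical session totals, e.g. summed from your provider's usage records.
session = {"uncached_input": 5_000, "cached_input": 95_000, "output": 4_000}
breakdown = bill_breakdown(session)
biggest_leak = max(breakdown, key=lambda k: breakdown[k][0])

for category, (usd, share) in breakdown.items():
    print(f"{category:>15}: ${usd:.4f} ({share:.0%})")
print("biggest leak:", biggest_leak)
```

With these made-up counts the output line dominates even though it is the smallest token count; with a weak cache and a large uncached share, the same function would point at the input instead.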