Tokenomics: Where AI Agents Actually Spend Their Tokens

You drop a GitHub issue into your agentic coding pipeline and wait. Ten minutes later, the PR arrives — working code, tests passing, documentation updated. Then you get the bill.

Token pricing for reasoning models runs anywhere from $2 to $15 per million input tokens. A single agentic software engineering task on GPT-5 can consume over 40 million tokens. That's not a typo: forty million. Billed entirely at those input rates, with no caching discount, 40 million tokens comes to roughly $80 to $600 for a single task, depending on your model and provider.

The question that keeps engineering leaders awake: where does it all go?

What the Data Says

A team at Concordia University ran the numbers. Their paper, Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering (Salim et al., arXiv 2601.14470), analysed 30 software development tasks executed by the ChatDev multi-agent framework on GPT-5. They mapped ChatDev's internal phases to the standard Software Development Life Cycle — Design, Coding, Code Completion, Code Review, Testing, and Documentation — then measured exactly where every token was spent.

The results are not what most developers expect.

Code Review Eats 60% of Your Budget

Development Stage	Average Token Share
Code Review	59.4%
Code Completion	26.8% (in affected tasks)
Documentation	20.1%
Testing	10.3%
Coding (initial)	8.6%
Design	2.4%

The dominant cost is not writing code. It's reviewing code. The programmer and reviewer agents engage in multi-turn dialogue: pointing out issues, suggesting fixes, iterating. Each round passes the full code context back and forth. Across 30 tasks, this single phase consumed nearly 60% of all tokens. (The shares are averages per stage, and Code Completion is averaged only over the tasks where it ran, so the column does not sum to 100%.)

Initial Coding, meanwhile, accounts for less than 9%. Design is cheaper still at 2.4%. The conventional intuition — "generation is expensive, review is cheap" — is inverted.
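To see how a breakdown like this is produced from your own pipeline, here is a minimal sketch that aggregates per-call usage records into stage shares. The records and their numbers are invented for illustration; they are not the paper's data, and the tuple layout is an assumption about what your logging captures.

```python
from collections import defaultdict

# Hypothetical per-call usage records: (stage, input, output, reasoning) token counts.
# These numbers are made up for illustration only.
calls = [
    ("Design", 300, 200, 500),
    ("Coding", 400, 3000, 600),
    ("Code Review", 9000, 1500, 1500),
    ("Code Review", 11000, 1800, 1200),
    ("Testing", 2500, 600, 400),
    ("Documentation", 4000, 700, 300),
]


def stage_shares(records):
    """Return each stage's fraction of all tokens, largest share first."""
    totals = defaultdict(int)
    for stage, inp, out, reasoning in records:
        totals[stage] += inp + out + reasoning
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {stage: tokens / grand_total for stage, tokens in ranked}


shares = stage_shares(calls)
for stage, share in shares.items():
    print(f"{stage:<15}{share:6.1%}")
```

Note that this computes shares within one pooled set of calls, so they sum to 100%. Averaging per-task shares, as a study across many tasks might, gives a different kind of number.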

The Communication Tax

Across all phases, the ratio of input to output tokens sits at roughly 2:1:

Token Type	Average Share
Input	53.9%
Output	24.4%
Reasoning	21.6%

Input tokens consistently dominate. Agents spend more than half their budget consuming context rather than producing useful output. This "communication tax" (formally named by Wang et al., 2025) appears to be an inherent property of conversational multi-agent architectures. Each agent in the chain receives the full conversation history, codebase context, and prior agent outputs before it can contribute its own work.
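A toy model makes the mechanism concrete. The sketch below assumes every turn re-reads a fixed base context plus everything written in earlier turns, and that each turn writes the same number of tokens. Both are simplifications, and the specific numbers are hypothetical rather than taken from either study.

```python
def cumulative_input_tokens(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every turn re-reads the full history.

    Turn k (0-based) reads the base context plus the k messages written so far.
    """
    return sum(base_context + k * tokens_per_turn for k in range(turns))


def cumulative_output_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total output tokens: each turn writes one message of fixed size."""
    return tokens_per_turn * turns


# Hypothetical dialogue: 5,000-token codebase context, 500 tokens written per turn.
total_in = cumulative_input_tokens(base_context=5_000, tokens_per_turn=500, turns=10)
total_out = cumulative_output_tokens(tokens_per_turn=500, turns=10)
print(total_in, total_out, total_in / total_out)  # 72500 5000 14.5
```

Output grows linearly with the number of turns, while input grows roughly quadratically, because the history term is re-billed on every turn. That is why long review dialogues are so much more expensive than their visible output suggests.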

Stanford's Digital Economy Lab independently confirmed this pattern (Bai et al., 2026) across eight frontier LLMs on 500 SWE-bench tasks. Their findings: agentic coding consumes over 1,000× more tokens than single-turn code reasoning. Input tokens dominate at a ratio exceeding 150:1 in some configurations. Token usage varies by up to 30× across runs of the same task — and higher token burn does not correlate with higher accuracy. Accuracy peaks at intermediate cost and plateaus.

Stages Have Distinct Fingerprints

Not all software engineering activities consume tokens the same way:

Coding is output-heavy (58% output, 6.9% input): the agent receives a concise design spec and produces verbose source code.
Code Review is input-heavy (51.4% input): the reviewer consumes the full codebase and produces a short critique.
Documentation is extremely input-heavy (80.2% input): the agent reads the entire codebase to write a handful of explanatory paragraphs.
Design burns heavy reasoning tokens (36.0%): planning and architectural decisions require the model's chain-of-thought.

This creates a "cost map" for different engineering activities. A refactoring task that triggers heavy review cycles will have a fundamentally different cost profile from a greenfield feature that passes through once.
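One way to read the fingerprints is as a blended price per token. The sketch below uses the input shares quoted above and assumes that all non-input tokens (output and reasoning) are billed at the output rate. The output price of $8 per million is a hypothetical placeholder, not a figure from the paper or from any provider's price list.

```python
INPUT_PRICE = 2.0   # USD per million input tokens (low end of the range quoted earlier)
OUTPUT_PRICE = 8.0  # USD per million output tokens: hypothetical placeholder

# Input share of each stage's tokens, from the figures above.
input_share = {"Coding": 0.069, "Code Review": 0.514, "Documentation": 0.802}


def blended_price(share_in: float) -> float:
    """Average USD per million tokens, billing all non-input tokens at the output rate."""
    return share_in * INPUT_PRICE + (1 - share_in) * OUTPUT_PRICE


for stage, share in input_share.items():
    print(f"{stage:<15}${blended_price(share):.2f} per million tokens")
```

Under these assumptions a Coding token costs more than twice as much as a Documentation token, yet Code Review still dominates the bill because of sheer volume. Price per token and token count are separate levers.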

Why This Matters Now

The sheer scale of token consumption in agentic systems has shifted from an academic curiosity to a production bottleneck. OpenRouter's weekly token volume grew from 0.4 trillion in December 2024 to 27 trillion by March 2026 — a 68× increase in 15 months. Agent workflows, with their iterative loops of reasoning, tool use, and self-correction, are the primary driver.

This has three concrete implications for teams building on agentic pipelines:

1. Optimise Review, Not Generation

If you're trying to cut agent costs, the lever is not cheaper code generation — it's reducing review iterations. Strategies include:

Human-in-the-loop gating: Insert a human review before the agentic code review phase. A human can approve or redirect before the expensive multi-turn review dialogue begins.
Single-pass review prompts: Structure the reviewer agent to produce all feedback in one shot rather than iterating.
Context pruning: Strip irrelevant conversation history before passing context to the reviewer. Much of what the reviewer "reads" is past discussion, not current code.
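Context pruning can be sketched in a few lines. The `Message` type, its `kind` field, and the `prune_for_reviewer` helper below are all hypothetical; ChatDev's actual message format is not described here, so treat this as one possible shape, not the framework's API.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # e.g. "programmer" or "reviewer"
    kind: str  # "code" for a full code snapshot, "chat" for discussion
    text: str


def prune_for_reviewer(history: list[Message], keep_recent_chat: int = 2) -> list[Message]:
    """Keep only the latest code snapshot and the last few chat messages."""
    latest_code = next((m for m in reversed(history) if m.kind == "code"), None)
    chat = [m for m in history if m.kind == "chat"]
    # Guard against chat[-0:], which would return the whole list.
    recent_chat = chat[-keep_recent_chat:] if keep_recent_chat > 0 else []
    return ([latest_code] if latest_code else []) + recent_chat


history = [
    Message("programmer", "code", "def add(a, b): return a - b"),
    Message("reviewer", "chat", "add() subtracts instead of adding."),
    Message("programmer", "code", "def add(a, b): return a + b"),
    Message("reviewer", "chat", "Fixed. Please add a docstring."),
    Message("programmer", "chat", "Will do."),
]
pruned = prune_for_reviewer(history)
print(len(history), "->", len(pruned), "messages")
```

The stale code snapshot and the already-resolved comment are dropped; the reviewer sees the current code and the two most recent remarks. How much chat to keep is a judgment call that depends on how often reviewers need earlier context.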

2. Input Tokens Are the Real Cost Driver

Output token pricing gets all the attention, but input tokens dominate in agentic workflows. This changes the economics of which model to choose:

Models with larger context windows (GPT-5 at 400K, Claude Opus at 200K) are valuable because they reduce the need for context truncation — but they also encourage agents to pass full context indiscriminately.
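Token caching helps. Google's and Anthropic's APIs offer discounted pricing for cached input; the size of the discount varies by provider and model, so check current price lists rather than assuming a flat 50%. Architect your agent workflows to hit the cache: keep conversation prefixes stable and group requests that share a prefix.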
For internal agent loops, consider smaller, cheaper models for the "reading" phases and reserve expensive reasoning models for actual code generation.
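To get a feel for what caching is worth, here is a rough cost model. The function name and parameters are illustrative. The default `cached_price_factor` of 0.5 mirrors a 50% discount and is only a placeholder, and the model ignores any surcharge a provider may apply when writing to the cache.

```python
def input_cost_usd(
    input_tokens: int,
    price_per_million: float,
    cache_hit_rate: float = 0.0,
    cached_price_factor: float = 0.5,
) -> float:
    """Input cost when a fraction of tokens is served from a prompt cache.

    cached_price_factor is the cached price as a fraction of the normal
    input price (0.5 means a 50% discount). Substitute your provider's rate.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    billable = uncached + cached * cached_price_factor
    return billable * price_per_million / 1_000_000


# Hypothetical workload: 10M input tokens at $2 per million.
no_cache = input_cost_usd(10_000_000, 2.0)
with_cache = input_cost_usd(10_000_000, 2.0, cache_hit_rate=0.8)
print(f"${no_cache:.2f} without caching, ${with_cache:.2f} at an 80% hit rate")
```

The saving scales with the hit rate, which is why stable prompt prefixes matter more than the headline discount.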

3. Variance Is the Enemy of Predictability

A single task can consume anywhere from 17,000 to 40,000 reasoning tokens (per the ChatDev data), and token usage can vary by up to 30× between runs of the same task (per Stanford). This makes cost estimation fundamentally difficult. Frontier models themselves fail to predict their own token consumption before execution (Pearson's r < 0.15).

Practical mitigations:

Set hard token budgets per task phase, not per task. If review exceeds 100K input tokens, escalate to a human.
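Abort and retry instead of letting an expensive run finish. When a run blows past its budget, stop it and retry with a different seed; because cost varies so much between runs and higher token burn does not buy higher accuracy, a retry often reaches the same quality for less. Running every task twice and keeping the cheaper result does not save anything, since you pay for both runs.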
Monitor token distribution per phase. Review averaged about 60% in the ChatDev data, so a share well above that suggests the review loop is iterating more than it should.
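A per-phase budget can be a very small piece of code. The sketch below is one possible shape; `PhaseBudget` and `BudgetExceeded` are hypothetical names, and the 100K limit for review is the example threshold from the list above. In a real pipeline, `record` would be fed from the usage counts your provider returns with each response.

```python
class BudgetExceeded(Exception):
    """Raised when a phase uses more input tokens than its limit allows."""


class PhaseBudget:
    def __init__(self, limits: dict[str, int]):
        self.limits = limits                      # phase name -> max input tokens
        self.used = {phase: 0 for phase in limits}

    def record(self, phase: str, input_tokens: int) -> None:
        """Add usage for a phase and raise if its limit is now exceeded."""
        self.used[phase] = self.used.get(phase, 0) + input_tokens
        limit = self.limits.get(phase)  # phases without a limit are only tracked
        if limit is not None and self.used[phase] > limit:
            raise BudgetExceeded(
                f"{phase}: {self.used[phase]:,} input tokens used, limit {limit:,}"
            )


budget = PhaseBudget({"review": 100_000})
budget.record("review", 60_000)      # first review round: within budget
try:
    budget.record("review", 50_000)  # second round pushes the phase over
except BudgetExceeded as exc:
    print("Escalating to a human:", exc)
```

Raising an exception keeps the policy decision (abort, retry, or hand off to a person) in the caller rather than in the counter.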

The Road Ahead

The tokenomics findings point toward a clear research agenda. Current conversational multi-agent architectures, where each agent receives the full dialogue history, are fundamentally wasteful. Future systems will need:

Differentiable token budgets: Agents that can estimate and communicate their expected token consumption before executing, enabling intelligent routing decisions.
Structured agent communication protocols: Rather than passing raw conversation history, agents could exchange structured summaries, diff-like updates, or capability attestations.
Graph-based coordination topologies: Research from MultiAgentBench shows that graph topologies best balance performance against coordination overhead, reducing redundant context passing.
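The diff-like update idea can be tried today with Python's standard library. This sketch uses `difflib.unified_diff` to send only what changed between two versions of a file. The `diff_update` helper and the synthetic 200-line file are illustrative, and character count is used as a crude stand-in for token count.

```python
import difflib


def diff_update(old_source: str, new_source: str, filename: str = "main.py") -> str:
    """Return a unified diff describing how old_source became new_source."""
    return "".join(
        difflib.unified_diff(
            old_source.splitlines(keepends=True),
            new_source.splitlines(keepends=True),
            fromfile=f"a/{filename}",
            tofile=f"b/{filename}",
        )
    )


# Synthetic 200-line file with a single-line fix.
old = "".join(f"line_{i} = {i}\n" for i in range(200))
new = old.replace("line_100 = 100\n", "line_100 = 101\n")

update = diff_update(old, new)
print(update)
print(f"full file: {len(new)} chars, diff: {len(update)} chars")
```

The receiving agent needs the previous version to apply the diff, so this only pays off when agents share state. For a first-time reader, the full file is still required.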

The most important takeaway for engineers: the next generation of cost-efficient agentic systems will not come from cheaper models. They will come from smarter collaboration protocols. The communication tax is a design tax, not a model tax.