The Cost of Indirection

March 29, 2026

A question we've started asking ourselves while instructing Claude to build our features is whether our usual coding practices still apply to AI-generated and AI-maintained code.

We've repeated this pattern many times: we give the agent a prompt and look at the implementation. We iterate until we get something that works. Then we review the code and start asking the agent to do what we used to do ourselves: clean it up, make it easier to understand, improve reusability, introduce better abstractions.

At some point, though, the question becomes unavoidable: does this process still make sense?

The industry seems to be pushing us in a direction where we pay less and less attention to the code. LLMs write code faster than we can review it. Companies are starting to automate the review process as well.

So what exactly are we doing when we "refactor" AI-generated code? Is it just a form of programmer nostalgia, an attempt to retain some degree of control? Or does it still provide real value?

More importantly, does it actually make the code easier to maintain or reason about when the most likely maintainer is no longer a human, but another AI agent? Are we helping the agent by making code more readable for humans, or unintentionally making things harder for it?

My initial thoughts

From my limited understanding of how LLMs work under the hood, my initial reaction to these questions was fairly intuitive: these agents were trained on programming languages in much the same way they were trained to produce human-readable text and hold conversations. The input was, after all, real code written by humans over several decades.

So my intuition was that, given the nature of the data these models were trained on, making a program easier to reason about for a human would also make it easier for a model to understand.

A valid concern with this interpretation is the sheer diversity of possible implementations for the same program, even within the same language. Models can write in many different styles, so it's worth asking whether some of these choices help or hurt their performance.

How to measure it

Experimenting with how humans maintain code is hard. There are too many variables that are difficult to control or measure: level of expertise, prior knowledge of the codebase, personal preferences, investment in a particular pattern or architecture. Experiments take time, and results are hard to reproduce.

With AI agents, the setup is much simpler. "Easy to maintain" translates into concrete, measurable questions: how much it costs, and how long it takes, for an agent to implement a new feature in an existing system.

I decided to design and "vibecode" a small experiment to explore these questions. Not long ago, this would have been an overly ambitious task. But we live in a different world now, where even chronic procrastinators can spend a few hours over a weekend and actually get something meaningful done.

Designing the experiment

The overall idea is simple: generate equivalent codebases (same functionality) with different implementation flavors. On top of these base implementations, I created a set of end-to-end tests to ensure they actually work and are functionally equivalent.

The next step was to define a prompt spec for a new task that introduces a non-trivial change to the system. For this new task, I added another set of e2e tests, this time focused only on validating the new feature.

Finally, I wrote* a script that programmatically calls the Claude CLI with the prompt and one of the code variants. The script collects metrics and runs the e2e tests, saving the results. Since we're dealing with a non-deterministic process, each variant is tested multiple times (3 to 5 runs), and results are averaged. After each run, the codebase is reset to its initial state.
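As a sketch, the core of such a runner might look like the following. The helper names and the metrics shape are my own assumptions; the only interface taken from the experiment is the `claude -p --output-format json` invocation described in the Metrics section:

```typescript
import { execSync } from "node:child_process";

// Subset of the structured result this experiment cares about
// (hypothetical shape; see the Metrics section for the full field list).
interface RunMetrics {
  total_cost_usd: number;
  num_turns: number;
  duration_ms: number;
}

// One agent run: call the Claude CLI inside a variant directory and
// parse the structured result. JSON.stringify is a crude way to
// shell-quote the prompt; a real script would be more careful.
function runOnce(variantDir: string, prompt: string): RunMetrics {
  const raw = execSync(
    `claude -p ${JSON.stringify(prompt)} --output-format json`,
    { cwd: variantDir, encoding: "utf8" },
  );
  return JSON.parse(raw) as RunMetrics;
}

// Average a numeric metric across the 3-5 repeated runs of a variant.
function averageMetric(runs: RunMetrics[], key: keyof RunMetrics): number {
  return runs.reduce((sum, r) => sum + r[key], 0) / runs.length;
}
```

After each run the script would also reset the variant's working tree to its baseline commit (not shown here).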

I conducted two separate experiments, each with its own set of variants. The first one focused on system architecture, or more explicitly, code structure. The code itself was more or less equally well written across all variants, but organized differently.

The second experiment focused more on code style: duplication, abstractions, and similar trade-offs.

* When I say "I wrote", I mean I asked Claude to write it. At some point it got too meta (asking Claude to write a script that calls Claude), so I'm taking credit for the sake of clarity.

Metrics

The following metrics are extracted from claude -p --output-format json, which returns structured data after each agent run.

Core metrics (from Claude)

total_cost_usd Dollar cost of the full run (input + output + cache, at API pricing). This is the single best summary metric, as it captures everything: how much the agent read, how much it wrote, and how many turns it took to get there.

num_turns Number of agent loop iterations (read → think → act cycles). A turn might be "read a file", "edit a function", or "run the build". More turns means the agent needed more steps to navigate and implement the requested change.

usage.output_tokens Tokens generated by the model (code, tool calls, reasoning). This represents the "writing" cost.

usage.input_tokens Non-cached input tokens. Usually very small, as most input is served from cache.

usage.cache_read_input_tokens Tokens loaded from the prompt cache across all turns. This is the dominant cost driver, as it represents how much context the agent reads (file contents, system prompt, conversation history). It's the best proxy for what we can call "navigation cost."

usage.cache_creation_input_tokens Tokens written to cache the first time a prefix is seen. Higher on initial runs, lower on subsequent runs due to the API-level cache.

duration_ms Wall clock time, including API calls and tool execution (file reads, TypeScript builds, etc.).

stop_reason How the agent finished. end_turn means it completed naturally. tool_use means it was cut off by the max-turn limit (the agent wanted to continue but ran out of budget).
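Pulling these fields out of the CLI's JSON might look like the sketch below. The nesting is inferred from the dotted `usage.*` names above, so treat the exact shape as an assumption:

```typescript
// Inferred shape of the relevant fields in `claude -p --output-format json`
// output; only the fields used by the experiment are listed.
interface AgentOutput {
  total_cost_usd: number;
  num_turns: number;
  duration_ms: number;
  stop_reason: string;
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_read_input_tokens: number;
    cache_creation_input_tokens: number;
  };
}

// Flatten the nested usage block into the per-run record the
// result tables below are built from.
function flattenMetrics(out: AgentOutput) {
  const { usage, ...top } = out;
  return { ...top, ...usage };
}
```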

Supplementary metrics (from git)

After each agent run, and before resetting to baseline, the script captures:

files_changed Number of existing files modified (git diff --name-only). This is the best proxy for edit dispersion: how spread out the changes are across the codebase.

new_files Files created by the agent that didn't exist before (git ls-files --others).

lines_added Total lines of code written (git diff --numstat).
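A collector for these three metrics could be sketched as follows. The git commands are the ones named above; the helper names and parsing details are my own:

```typescript
import { execSync } from "node:child_process";

// Sum the added-line counts from `git diff --numstat`, where each line is
// "<added>\t<deleted>\t<path>" (binary files report "-" instead of a count).
function linesAdded(numstat: string): number {
  return numstat
    .split("\n")
    .filter((line) => line.trim() !== "")
    .reduce((sum, line) => {
      const added = line.split("\t")[0];
      return added === "-" ? sum : sum + Number(added);
    }, 0);
}

// Hypothetical collector built on the three git commands named above,
// run before the codebase is reset to its baseline.
function collectGitMetrics(repoDir: string) {
  const run = (cmd: string) =>
    execSync(cmd, { cwd: repoDir, encoding: "utf8" }).trim();
  const nonEmpty = (s: string) => s.split("\n").filter((l) => l !== "");
  return {
    files_changed: nonEmpty(run("git diff --name-only")).length,
    new_files: nonEmpty(run("git ls-files --others")).length,
    lines_added: linesAdded(run("git diff --numstat")),
  };
}
```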

Correctness metrics (from test runner)

baseline_tests_pass Whether the agent broke existing functionality. Acts as a regression check.

task_tests_pass Whether the agent correctly implemented the requested feature. These are feature-specific tests written before the experiment and committed to the baseline, so the agent never sees them as part of its task.

First experiment: project structure

The initial code structure and overall project architecture effectively become both the context and the guide for the agent's subsequent maintenance tasks.

My intention here was to measure how much these initial structural decisions affect model performance once agents start writing and maintaining most of the code.

Domain

The project for this experiment was a simple REST API for a wallet/money transfer system, built with Node.js, Express, and SQLite.

It implements account creation, deposits, withdrawals, transfers between accounts, and transaction history, along with business rules such as overdraft protection, a minimum transaction amount of $1, daily transfer limits of $10,000 per account, and atomic transfers.
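As an illustration, the business rules are small enough to sketch in a few lines. The function name and signature are hypothetical, not taken from any of the variants; the rules themselves are the ones listed above:

```typescript
// Illustrative sketch of the wallet's business rules
// (amounts in dollars; names are my own, not the experiment's).
const MIN_AMOUNT = 1;
const DAILY_TRANSFER_LIMIT = 10_000;

// Returns null when the transfer is valid, or a reason string otherwise.
function validateTransfer(
  amount: number,
  senderBalance: number,
  transferredToday: number,
): string | null {
  if (amount < MIN_AMOUNT) return "amount below $1 minimum";
  if (amount > senderBalance) return "insufficient funds (overdraft protection)";
  if (transferredToday + amount > DAILY_TRANSFER_LIMIT)
    return "daily transfer limit of $10,000 exceeded";
  return null;
}
```

What differs across the five variants is not this logic, but where it lives and how many hops it takes to reach it.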

Variants

  1. Single file, flat: Everything lives in a single app.ts: routes, SQL, validation, all inline.
  2. Single file, structured: A single app.ts, but organized around classes such as DB, AccountService, TransferService, TransactionService, and Validators.
  3. Multi-file, light: Files are separated by concern, with a relatively lightweight structure: app.ts, routes.ts, services.ts, validators.ts, db.ts, and types.ts.
  4. Layered architecture: A more traditional layered setup, including a repository pattern, service layer, DTOs, middleware, error classes, and a DI container, spread across around 15 files.
  5. Hexagonal architecture (ports and adapters): A domain core built around ports (interfaces) and adapters (HTTP, persistence), with application services and strict dependency inversion.

Task prompt

This is the exact prompt used during the experiment:

Second experiment: code abstractions

Whereas the first experiment focused on capturing the cost of structural decisions in the architecture of the code, the second one shifts attention to the abstractions we introduce to make code "cleaner."

In particular, I was interested in the techniques we use to refactor duplicated code into shared abstractions (the DRY principle), with the goal of reducing maintenance effort.

The main question here is whether these abstractions work just as well for a coding agent, actually making its job easier, or if, on the contrary, they end up forcing the model to do more work to achieve the same results.

Domain

The project for this experiment was a CLI data processing tool that reads a CSV file of sales data and generates different analytical reports, built with Node.js and TypeScript.

It implements four report types: revenue (grouped by region or category), trend (monthly, with growth rates and moving averages), anomaly (statistical outlier detection using standard deviation), and ranking (top salespeople with regional breakdowns), each outputting structured JSON to stdout.
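The anomaly report's statistical core can be sketched like this. The two-standard-deviation threshold is my assumption; the experiment's actual cutoff isn't specified:

```typescript
// Statistical outlier detection via standard deviation, as used by the
// anomaly report. The default threshold of 2 sigma is an assumption.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// Values further than `threshold` standard deviations from the mean.
function anomalies(xs: number[], threshold = 2): number[] {
  const m = mean(xs);
  const sd = stdDev(xs);
  return xs.filter((x) => Math.abs(x - m) > threshold * sd);
}
```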

Variants

  1. Inline: Everything lives in a single main.ts file (~284 lines). Each report function is fully self-contained: CSV parsing, data aggregation, and output formatting are duplicated across all four report handlers. There are no shared utilities or abstractions.
  2. Abstracted: The same logic is split across ~8 files (~235 lines total). Shared concerns are extracted into reusable modules: csv.ts for parsing, aggregations.ts for functions like groupBy, sumBy, and topN, and stats.ts for mean, stdDev, growthRate, and movingAverage. Each report type has its own file under reports/, and main.ts is reduced to CLI argument handling and dispatch (~47 lines).
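For a concrete picture of the abstracted variant, here is a hedged guess at part of aggregations.ts. The names groupBy and sumBy come from the description above, but the signatures and implementations are my own:

```typescript
// Possible shape of the shared aggregation helpers in aggregations.ts
// (signatures are assumptions; only the names appear in the variant).
function groupBy<T>(rows: T[], key: (row: T) => string): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const row of rows) {
    const k = key(row);
    const bucket = groups.get(k) ?? [];
    bucket.push(row);
    groups.set(k, bucket);
  }
  return groups;
}

function sumBy<T>(rows: T[], value: (row: T) => number): number {
  return rows.reduce((sum, row) => sum + value(row), 0);
}
```

With helpers like these, the revenue report reduces to grouping rows by region (or category) and summing amounts per group, instead of repeating that loop inline in every handler.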

Task prompt

Results

First experiment: project structure

| Metric | 1. Flat | 2. Structured | 3. Multi-light | 4. Full arch | 5. Hexagonal |
|---|---|---|---|---|---|
| total_cost_usd | $0.25 | $0.26 | $0.26 | $0.53 | $0.89 |
| duration_ms | 99,877 | 97,903 | 112,398 | 194,564 | 274,364 |
| num_turns | 10 | 12 | 16 | 35 | 53 |
| output_tokens | 2,862 | 3,627 | 3,948 | 8,927 | 15,295 |
| input_tokens | 12 | 14 | 14 | 29 | 43 |
| files_changed | 1 | 1 | 5 | 9 | 14 |
| new_files | 0 | 0 | 0 | 0 | 0 |
| lines_added | 30 | 28 | 22 | 27 | 60 |
| cache_read_input_tokens | 217,563 | 268,370 | 276,591 | 768,109 | 1,445,814 |
| cache_creation_input_tokens | 12,361 | 15,622 | 10,837 | 21,399 | 33,355 |
| stop_reason | end_turn (5/5) | end_turn (5/5) | end_turn (5/5) | end_turn (5/5) | end_turn (5/5) |
| baseline_tests_pass | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| task_tests_pass | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |

Second experiment: code abstractions

| Metric | 1. Inline | 2. Abstracted |
|---|---|---|
| total_cost_usd | $0.13 | $0.13 |
| duration_ms | 60,700 | 50,229 |
| num_turns | 7 | 12 |
| output_tokens | 2,146 | 2,047 |
| input_tokens | 9 | 12 |
| files_changed | 1 | 1 |
| new_files | 0 | 1 |
| lines_added | 86 | 3 (+60 new file) |
| cache_read_input_tokens | 142,282 | 215,564 |
| cache_creation_input_tokens | 9,999 | 9,391 |
| stop_reason | end_turn (5/5) | end_turn (5/5) |
| baseline_tests_pass | 5/5 | 5/5 |
| task_tests_pass | 5/5 | 5/5 |

Interpretation

After running the experiment, I started to see some patterns. I'm aware that this was not a particularly rigorous process in terms of scientific methodology or experimental design. Still, it provides a useful starting point, and more signal than intuition alone.

Architecture doesn't affect correctness

At least in this experiment, I was not able to show that architecture has any impact on correctness.

Every single variant across both experiments achieved a 100% pass rate (task + baseline). The agent can implement the feature regardless of the structure.

So the question is not whether it can do it, but at what cost.

The cost mechanism: indirection is key

More complex architectures cost more to maintain. This aligns with my initial intuition, but the underlying mechanism becomes clearer when looking at the metrics.

The first three structural variants (flat, structured, multi-light) are essentially identical in cost. The jump appears at variant 4 (2.1×) and increases further at variant 5 (3.6×).

So the problem doesn't seem to be the number of files, but how many steps the agent needs to follow to trace a single logical change through the system.

In V3, a change is relatively direct: the agent reads a handful of files, each clearly mapped to a specific concern, and applies targeted edits.

In V4, the same change is fragmented across layers. The agent has to follow chains of dependencies: routes to services, services to repositories, repositories to the database, just to understand where a change should happen. Each concern is split across multiple files that only make sense when read in sequence.

In V5, this indirection increases further. Each step introduces an additional level of abstraction: interfaces, ports, adapters. The agent can no longer rely on direct imports alone; it has to resolve "what implements this?" at every boundary.
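To make the "what implements this?" hop concrete, here is a minimal, hypothetical sketch in the style of variant 5. None of these names come from the actual codebase:

```typescript
// The application service only ever sees the port (interface), so tracing
// a balance check means hopping from service, to port, to whichever
// adapter happens to implement it.
interface AccountPort {
  getBalance(accountId: string): number;
}

// One possible adapter; in the real variant this would wrap SQLite.
class InMemoryAccountAdapter implements AccountPort {
  constructor(private balances: Record<string, number>) {}
  getBalance(accountId: string): number {
    return this.balances[accountId] ?? 0;
  }
}

class TransferService {
  constructor(private accounts: AccountPort) {}
  canTransfer(from: string, amount: number): boolean {
    return this.accounts.getBalance(from) >= amount;
  }
}
```

Each of these hops is cheap for a compiler, but for the agent every one of them is another file to locate, read, and hold in context.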

The number of turns reflects this clearly. V3 completes the task in 16 turns with a small number of targeted edits. V4 requires 35 turns. V5 goes up to 53 turns, as each conceptual change requires multiple steps just to locate and understand the relevant code.

Reading dominates the cost

Looking at the metrics, cache_read_input_tokens dominates everything else.

Across all variants, the agent reads roughly 75 to 95 times more tokens than it writes. And since the full conversation history is re-read on every turn, this cost compounds quickly. More turns mean disproportionately more reading.
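As a quick sanity check, the ratio can be computed directly from the flat and hexagonal rows of the results table above:

```typescript
// Tokens read from cache vs. tokens generated, per variant.
// Numbers are taken verbatim from the first experiment's results table.
const flat = { cache_read_input_tokens: 217_563, output_tokens: 2_862 };
const hexagonal = { cache_read_input_tokens: 1_445_814, output_tokens: 15_295 };

function readWriteRatio(v: {
  cache_read_input_tokens: number;
  output_tokens: number;
}): number {
  return v.cache_read_input_tokens / v.output_tokens;
}
```

This gives roughly 76x for the flat variant and 95x for the hexagonal one: the two ends of the range quoted above.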

This is what drives the cost increase in the more complex variants. The hexagonal version, for example, doesn't write dramatically more code (roughly 60 lines versus 30 in the flat version), but it goes through many more turns, repeatedly reprocessing the same context.

The result is a 3.6× increase in cost, driven almost entirely by accumulated reading rather than code generation.

What about code abstractions?

The second experiment isolates a different dimension: abstraction, not structure.

Interestingly, the more abstract variant runs 18% faster, despite taking nearly twice as many turns (12 vs. 7).

A possible explanation is that the abstracted version works with smaller, more focused files. Each read is cheaper, and edits are more surgical. The agent spends more time navigating, but less time generating large chunks of code.

The inline variant behaves in the opposite way: fewer turns, but heavier ones, dominated by large reads and code generation.

In the end, both variants cost roughly the same, as navigation (reading) and generation (writing) balance each other out.

Conclusion

Choosing a good architecture still matters. Agents tend to follow existing patterns, so the initial structure becomes the foundation for all future changes. Architectures that keep related concerns close together and make dependencies explicit are cheaper to navigate and modify than those that scatter logic across multiple layers.

Some of the traditional arguments in favor of more complex architectures may also be less relevant nowadays. The classic "what if we need to replace the database?" is no longer as strong as it once was, since LLMs can handle large-scale refactors across a codebase relatively quickly.

On the other hand, abstraction at the code level, like removing duplication or extracting reusable functions, does not seem to hurt the model. In fact, it can help. Writing DRY code, splitting logic into smaller functions, and keeping responsibilities clear may reduce the amount of code the agent needs to generate, without introducing significant additional cost.

Having reusable pieces of code also means the agent doesn't have to recreate them from scratch, which can reduce potential errors while saving both time and tokens.

Finally, we might ask: is it actually easier for a human to follow the logic in a hexagonal architecture than in a layered one? Or are we also paying the cost of navigating these layers of indirection?

If we stop writing code by hand, this might become less relevant. But it doesn't feel like these findings are exclusive to agents. The difference is that now we can measure the cost, instead of relying on personal preference or intuition.


The full experiment code, including all variants and scripts, is available on GitHub.

© 2026 AJ Chambeaud