Mon. May 11th, 2026

Why AI Agents Cost More Than LLMs (And How to Stop Bleeding Tokens)


I was building a small bookmark app last weekend. You send it a URL, Gemini
summarizes and tags the page, the result goes into Postgres. A few hundred lines
of TypeScript.

The first version cost almost nothing. One LLM call per URL, that’s it. Then I
added “tools” so the model could fetch pages, look up similar bookmarks, or
check things against Google Search.

That’s where most people building agents land for the first time. Going from a
plain chat call to an agent loop is way more expensive than docs make it sound,
and the reason isn’t obvious until you watch the round trips happen one by one.
Let’s do that.

A plain chat call is simple: two numbers on your bill, input tokens and output tokens. If your prompt is 500 tokens and the answer is 200, you pay for 700 tokens. Done.
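That arithmetic fits in a few lines. The per-token prices below are placeholders for illustration, not any provider's real rates:

```typescript
// Placeholder prices for illustration only — not real provider rates.
const PRICE_IN = 0.10 / 1_000_000;  // $ per input token (assumed)
const PRICE_OUT = 0.40 / 1_000_000; // $ per output token (assumed)

// A plain call is billed on exactly two numbers: tokens in, tokens out.
function plainCallCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * PRICE_IN + outputTokens * PRICE_OUT;
}

console.log(plainCallCost(500, 200)); // the 700-token example above
```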

Tools are how the model talks to the outside world. Calling an API, querying a
database, fetching a URL, anything. You describe each tool with a small JSON
schema, and the model can ask to “call” one mid-conversation. You actually run
the function, send the result back, and the model writes its final answer using
that result.
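As a sketch, one of those small JSON schemas might look like this — the tool name and fields are made up for illustration, not taken from the original app:

```typescript
// A hypothetical tool declaration in the JSON-schema style the text
// describes. Name and parameters are illustrative.
const searchBookmarks = {
  name: "searchBookmarks",
  description: "Find saved bookmarks matching a text query.",
  parameters: {
    type: "object",
    properties: {
      query: { type: "string", description: "Free-text search query." },
      limit: { type: "number", description: "Maximum results to return." },
    },
    required: ["query"],
  },
};

console.log(searchBookmarks.name);
```

Every one of those description strings is tokens you pay for — on every single turn, as we'll see below.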

This is the part that surprises people. You asked the model a question, and instead of answering, it asked you to run a function. So you run it and ship the result back before it will answer.
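The two-turn loop can be sketched like this, with the model stubbed out so it runs offline. In a real app, callModel would be a generateContent request; the tool and the stub's replies are invented for illustration:

```typescript
// Minimal sketch of one tool round trip. The "model" is a stub.
type Msg = { role: string; content: string };

// Stub model: with no tool result in the history it asks for a tool;
// once the result is there, it writes the final answer.
function callModel(history: Msg[]): Msg {
  const toolResult = history.find((m) => m.role === "tool");
  return toolResult
    ? { role: "model", content: `It's ${toolResult.content} in Tokyo.` }
    : { role: "model", content: "CALL getWeather(Tokyo)" };
}

// Hypothetical tool implementation.
function getWeather(_city: string): string {
  return "23°C";
}

const history: Msg[] = [{ role: "user", content: "Weather in Tokyo?" }];
let reply = callModel(history); // turn 1: a tool request, not an answer
history.push(reply);
if (reply.content.startsWith("CALL")) {
  history.push({ role: "tool", content: getWeather("Tokyo") }); // we run it
  reply = callModel(history); // turn 2: the real answer — two calls billed
}
console.log(reply.content); // → It's 23°C in Tokyo.
```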

The reason is simple. The model can’t predict what the tool will return. The
temperature in Tokyo isn’t in its training data, the API hasn’t been hit yet,
the result doesn’t exist. You can’t write “It’s 23°C in Tokyo” before you know
it’s 23°C.

So turn 1 is “decide what to do.” Turn 2 is “use what you learned.” They can’t
be merged. The model has no memory between calls.

One exception is worth knowing about: server-side tools. Things like
googleSearch or urlContext in Gemini run inside Google’s own servers, and
the API returns one merged response. From your side it looks like a single call.
You lose some control (you can’t see exactly what got searched), but you save
a round trip.
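With the @google/genai SDK, a server-side tool request looks roughly like this — treat it as a sketch, since the exact config shape can vary by SDK version:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// googleSearch runs inside Google's servers: one request, one merged
// response, no tool-handling loop on your side.
const res = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "What changed in the latest React release?",
  config: { tools: [{ googleSearch: {} }] },
});

console.log(res.text); // already grounded — no second round trip billed
```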

Here’s where the cost lives: turn 2 has to re-send everything turn 1 sent, plus the tool call and its result on top.

Your system prompt and tool definitions get sent to the API twice. Turn 1
doesn’t free you from re-sending everything in turn 2, because the model is
stateless. It forgets the whole conversation between calls.

And that’s the best case. Exactly one tool call, no follow-ups. Real agents are
worse. A lot worse.

My bookmark agent’s typical query takes 4 LLM turns total: ask, get tool calls, send back results, ask again, get more calls, send results, finally write the summary.

For comparison, a plain generateContent({ contents: "summarize this:\n" + pageText })
costs ~1500 input + 200 output. About 1700 tokens.

It gets worse. Cost grows quadratically with the number of turns, because
each turn replays everything that came before. A 10-turn agent isn’t 10x the
cost. It’s closer to 30x.
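You can see the replay effect with a small token-counting sketch. The prefix and per-turn sizes below are made-up round numbers, not measurements:

```typescript
// Total input tokens over n turns when every turn replays the whole
// history. p = fixed prefix (system prompt + tool schemas), g = new
// tokens appended per turn. Numbers are illustrative.
function totalInputTokens(n: number, p: number, g: number): number {
  let total = 0;
  for (let turn = 1; turn <= n; turn++) {
    total += p + g * (turn - 1); // this turn re-sends prefix + all history so far
  }
  return total; // closed form: n*p + g*n*(n-1)/2 — quadratic in n
}

console.log(totalInputTokens(4, 1000, 500));  // 7000
console.log(totalInputTokens(10, 1000, 500)); // 32500
```

With these made-up numbers, 10 turns cost about 4.6x the 4-turn flow and ~32x a single 1,000-token call. The blow-up comes from replaying history, not from any per-call price change.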

Prompt caching is the biggest lever by far. Every major provider supports it now: OpenAI, Anthropic, Google. The system prompt and tool schemas don’t change between turns, so cache them once and pay about 25% of the normal input price on every reuse.

For my 4-turn flow this cuts input costs by roughly half. Anthropic and OpenAI
do the same thing with different syntax.
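With Gemini's explicit caching in the @google/genai SDK, the stable prefix is cached once and referenced on every later turn. This is a sketch under assumed field names and model IDs; check the current SDK docs before relying on it:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Cache the parts that never change between turns (sketch; exact
// config fields may differ across SDK versions).
const cache = await ai.caches.create({
  model: "gemini-2.5-flash",
  config: {
    systemInstruction: "You are a bookmark assistant...",
    ttl: "3600s", // keep the cache alive for an hour
  },
});

// Later turns reference the cache instead of re-sending the prefix.
const res = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Tag this URL: https://example.com",
  config: { cachedContent: cache.name },
});
```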

Gemini also has implicit caching: it auto-caches recent prefixes for you with zero code changes, and you simply see cheaper repeat requests. Check whether your provider has it on by default before reinventing the wheel.

The “decide which tool to call” turn is dumb work. It barely needs reasoning.
It’s pattern matching on a question. The final synthesis turn is where you
actually want a smart model.

In a 4-turn flow, three of the turns can run on the cheap model. Only the last
one, the user-facing answer, needs the expensive one. For high-volume agents
this saves more than caching does.
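That split is a one-line routing decision. The model names here are illustrative Gemini tiers, not a recommendation:

```typescript
// Route each turn to a model tier. Names are illustrative.
const CHEAP_MODEL = "gemini-2.5-flash";
const SMART_MODEL = "gemini-2.5-pro";

function pickModel(isFinalAnswer: boolean): string {
  // Tool-selection turns are pattern matching; only the user-facing
  // synthesis turn gets the expensive model.
  return isFinalAnswer ? SMART_MODEL : CHEAP_MODEL;
}

console.log(pickModel(false)); // gemini-2.5-flash
console.log(pickModel(true));  // gemini-2.5-pro
```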

The model can ask for multiple tools in a single response. The code I see in tutorials usually reads functionCalls[0] and silently drops the rest, turning what could be one round trip into several.

For “summarize all my React bookmarks from last month,” the model might call
searchBookmarks and getDateRange in parallel. Handle both, and you save a
whole round trip.
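Handling the whole batch is a Promise.all away. The tool names and stub implementations below are invented for illustration:

```typescript
// Handle every tool call in the model's response, not just functionCalls[0].
type FnCall = { name: string; args: Record<string, unknown> };

// Stub tool implementations (illustrative).
const toolImpls: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {
  searchBookmarks: async (args) => [`bookmark matching ${String(args.query)}`],
  getDateRange: async () => ({ from: "2026-04-01", to: "2026-04-30" }),
};

// Run all requested tools in parallel and return one batch of results,
// so the model gets everything back in a single round trip.
async function runAll(calls: FnCall[]) {
  return Promise.all(
    calls.map(async (c) => ({ name: c.name, response: await toolImpls[c.name](c.args) }))
  );
}

runAll([
  { name: "searchBookmarks", args: { query: "React" } },
  { name: "getDateRange", args: {} },
]).then((results) => console.log(results.length)); // 2
```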

Tools have a real cost, and they buy you real value. The reason you reach for
them is the same reason they’re expensive. You’re forcing the model to use
facts that exist outside its head instead of making them up.

A plain LLM call will happily tell you the weather in Tokyo. It’ll just be
wrong.

Most apps shouldn’t be agents. If your task is “summarize this text I’m pasting
in” or “rewrite this email,” you don’t need tools. You need one call. A lot of
agent frameworks make it really easy to add tools by default, which makes it
really easy to spend 5x what you should.

Tools earn their cost when you have side effects (writing to a DB, sending a
message), grounded data (today’s weather, this user’s bookmarks, current docs),
or chained reasoning where intermediate steps actually need verification.

Last week I added one tool to a Gemini call and watched the cost go from 850
tokens to 1530 for the same question. Once I started parallelizing calls and
caching the system prompt, I got the bookmark agent down to about 4500 tokens
across all four turns. Still 2.5x a plain call, but way better than the 7900
the naive version was burning.

Your agent isn’t a smarter LLM. It’s the same LLM with a longer receipt. Once
you can read the receipt, every optimization becomes obvious.

If you liked this, support it with a like and a share 💟, and don’t forget to follow me on Twitter/X and LinkedIn. If you want to connect, check out my site. See you in the next one.
