post agentic · 2025-03-22 · 5 min read
When the multi-agent split is wrong, and a single LLM call probably wins
The agentic-systems space spent 2024 evangelising multi-agent architectures. Specialists, orchestrators, planners, executors, hand-offs, message buses. By early 2025 it became defensible to say something heretical out loud: most LLM workloads do not need a multi-agent split. A single well-scoped LLM call, with the right tools available, is faster, cheaper, and more debuggable.
This post is the decision framework I use, plus three concrete cases where I almost-built a multi-agent system before realising a single call would win.
The hidden cost of “splitting it into agents”
A multi-agent system isn’t free. Each agent adds:
| Cost | What you pay |
|---|---|
| Latency | Each agent is at least one LLM call. N agents serially = N × latency. Even parallel, the slowest path is the slowest call. |
| Tokens | Every hand-off carries context. By agent 3, you’re spending 5x the tokens of a single call. |
| Observability | Tracing failures across 4 agents and a router is harder than tracing one call. |
| Failure modes | Each hand-off is a place where the format / typing of intermediate state can drift. |
| Debugging | “Why did the router pick agent B instead of agent C?” is now a question you regularly answer in pull-request reviews. |
If a single LLM call with the right tools can do the job at acceptable quality, do that. Defaults matter.
When the split does win
There are three real reasons to split:
1. Tool-call surface is too large for one agent. If you have 50+ tools, an LLM gets confused about which to use. Splitting by domain (one agent per related set of tools) keeps each agent’s tool-call surface narrow. The router then becomes a domain classifier (see the sketch after this list).
2. The roles think differently. A “planner” that decomposes a goal into steps thinks differently from an “executor” that runs each step. Forcing one prompt to do both compresses badly. Two agents, two prompts, two prompt iterations.
3. Outputs need different post-processing. If you need structured JSON for tool outputs and free-form prose for the user-facing summary, two agents (with different output schemas and different temperature settings) is cleaner than one agent juggling modes.
That’s the list. Most other reasons are folklore.
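To make reason 1 concrete, here’s a minimal sketch of the per-domain split in Python. Every name in it (the tool handles, `classify_domain`, `llm_call`) is a hypothetical stand-in, not an Agno API:

```python
from typing import Callable

# Illustrative tool handles -- real ones would hit your backend services.
def search_pois(query: str) -> str: ...
def plan_route(origin: str, dest: str) -> str: ...
def get_incidents(road: str) -> str: ...

# One narrow tool set per domain, instead of 50+ tools on one agent.
DOMAIN_TOOLS: dict[str, list[Callable]] = {
    "search":  [search_pois],
    "routing": [plan_route],
    "traffic": [get_incidents],
}

def classify_domain(user_msg: str) -> str:
    """Hypothetical router: a cheap model call or even a keyword heuristic,
    because its only job is picking one of three domains."""
    ...

def llm_call(user_msg: str, tools: list[Callable]) -> str:
    """Hypothetical wrapper around whichever tool-calling LLM you use."""
    ...

def handle(user_msg: str) -> str:
    domain = classify_domain(user_msg)   # the router is just a domain classifier
    return llm_call(user_msg, tools=DOMAIN_TOOLS[domain])
```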
Case 1: The “intent classifier + handler” anti-pattern
A common mistake. You have a chat interface; users say things like “show me cafes in Berlin”, “route from A to B”, “is there traffic on I-405”. The instinct is:
```
user msg → intent classifier (LLM) → routing → search agent / route agent / traffic agent
```

The cost: two LLM calls per turn (one for classification, one for the actual handler). Latency doubles. And the classification is a single trivial decision that the router is somehow making with a 12-billion-parameter model.
The single-call alternative:
```
user msg → one LLM call with all 8 tools available → it picks
```

Modern tool-calling LLMs (Claude Sonnet 4.x, GPT-4o, Gemini 2.x) are extremely good at “given this user message and this list of tools, which one fits?” That decision is the classification. Skip the dedicated classifier; let the LLM tool-call directly.
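For the flavour of it, here’s a minimal sketch of the single-call version using the Anthropic Python SDK. The tool definitions are illustrative (the real surface has 8 tools), and any tool-calling model slots in:

```python
import anthropic

client = anthropic.Anthropic()

# Two of the 8 tools, defined inline for illustration.
tools = [
    {
        "name": "search_places",
        "description": "Find places matching a query in a given city.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "city": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "get_traffic",
        "description": "Get current traffic conditions for a road.",
        "input_schema": {
            "type": "object",
            "properties": {"road": {"type": "string"}},
            "required": ["road"],
        },
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # any tool-calling model works here
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "is there traffic on I-405"}],
)

# The tool choice IS the intent classification -- no separate classifier call.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)   # e.g. get_traffic {'road': 'I-405'}
```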
Case 2: The “synthesiser” agent that adds nothing
Multi-agent demos love a “Synthesiser” agent at the end:
```
question → router → 3 specialists → synthesiser → answer
```

The synthesiser receives outputs from the three specialists and “composes a final answer”. This pattern reads well in architecture diagrams.
In production it usually adds latency without adding quality. Why? Because the last specialist that runs already has the context it needs to answer. Make that specialist’s prompt do the synthesis as part of its output. One fewer call. Same answer.
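A sketch of what “make the last specialist do the synthesis” looks like in practice. The `llm` helper and the role prompts are hypothetical:

```python
def run_specialist(llm, role_prompt: str, question: str,
                   prior: list[str], is_last: bool) -> str:
    """One LLM call per specialist. `llm` is a hypothetical text-in/text-out helper."""
    findings = "\n\n".join(prior) if prior else "(none yet)"
    task = (
        # The last specialist synthesises as part of its own answer,
        # so no separate synthesiser call is needed.
        "Answer the user's question directly, weaving the findings above in."
        if is_last else
        "Report your findings for the next specialist."
    )
    return llm(
        f"{role_prompt}\n\nQuestion: {question}\n\n"
        f"Findings so far:\n{findings}\n\n{task}"
    )

def answer(llm, question: str, role_prompts: list[str]) -> str:
    outputs: list[str] = []
    for i, role in enumerate(role_prompts):
        outputs.append(run_specialist(llm, role, question, outputs,
                                      is_last=(i == len(role_prompts) - 1)))
    return outputs[-1]   # the final specialist's output IS the user-facing answer
```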
Cases where a synthesiser does earn its keep: when the specialists produce structured outputs (JSON tables, numeric stats) and a separate agent transforms them into prose for the user. Two genuinely different jobs.
Case 3: The “validator” agent that rubber-stamps
Another common pattern:
```
generator (LLM) → output → validator (LLM) → "this looks correct" → user
```

Asking an LLM to validate another LLM’s output is, more often than not, expensive theatre. The validator usually agrees. When it disagrees, it’s about as likely to be wrong as the original. You doubled your cost and added latency for a marginal-to-zero quality lift.
The fix: replace the validator-LLM with a deterministic validator. Schema validation. A small list of regex checks. A hard-coded list of “things this output must contain.” For tool-call outputs, validate the JSON against your tool’s schema. For prose, run a sentence-count check or a forbidden-phrase regex.
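A minimal sketch of such a deterministic validator, using the `jsonschema` package for the schema check; the schema and the specific regex rules are illustrative:

```python
import re
import jsonschema   # pip install jsonschema

# Schema the tool-call output must satisfy (illustrative).
ROUTE_SCHEMA = {
    "type": "object",
    "properties": {
        "distance_km": {"type": "number"},
        "duration_min": {"type": "number"},
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["distance_km", "duration_min", "steps"],
}

FORBIDDEN = re.compile(r"as an AI language model", re.IGNORECASE)

def validate_output(tool_json: dict, prose: str) -> list[str]:
    """Deterministic checks only -- no LLM judging an LLM."""
    errors: list[str] = []
    try:
        jsonschema.validate(tool_json, ROUTE_SCHEMA)
    except jsonschema.ValidationError as e:
        errors.append(f"schema: {e.message}")
    if FORBIDDEN.search(prose):
        errors.append("forbidden phrase in prose")
    if len(re.findall(r"[.!?]", prose)) > 12:   # crude sentence-count ceiling
        errors.append("summary too long")
    return errors   # empty list means the output passes
```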
A deterministic validator is faster, cheaper, and more reliable than an LLM judging an LLM.
My decision tree
```
Does a single tool-call-equipped LLM produce acceptable output?
                         │
               ┌─────────┴─────────┐
               │                   │
              YES                  NO
               │                   │
         Use one call.      ┌──────┴───────┐
         Stop here.         │              │
                            ▼              ▼
                  Different output   Tool-call surface
                  modes (JSON vs     is huge (50+ tools)?
                  prose)?                  │
                       │                   ▼
                       ▼            Split by domain.
                Split for output    Use a router OR
                pipeline.           per-domain tool sets.
```

If the answer to the first question is yes, you’re done. Anything more is over-engineering.
What “acceptable output” means
The honest version: try the single-call architecture first. Run your evals. If the eval pass rate is above your bar, ship it. If it’s below your bar, identify which failures multi-agent would actually fix, and build the minimum split that addresses those.
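The loop is short. `single_call` and the per-case checks are stand-ins for your pipeline and your deterministic eval assertions:

```python
from typing import Callable

def single_call(msg: str) -> str:
    """Hypothetical: the one-LLM-call pipeline under test."""
    ...

def eval_single_call(cases: list[tuple[str, Callable[[str], bool]]],
                     bar: float = 0.9) -> bool:
    """Each case pairs an input with a deterministic pass/fail check."""
    passed = sum(check(single_call(msg)) for msg, check in cases)
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%} (bar: {bar:.0%})")
    return rate >= bar   # ship if True; otherwise find the failures a split would fix
```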
Don’t start from architecture and back into a use-case.
A note on Agno, AG-UI, and the multi-agent ecosystem
I work with Agno (multi-agent framework) and AG-UI (agent-UI streaming protocol) in production. They are excellent at what they’re for. Nothing here is anti-Agno; Agno makes the multi-agent split easier when you do need it, which is exactly when its value lands.
What I’m pushing back on is the default. The default in 2024 was “split everything into agents”. The default in 2025 should be “single call until proven insufficient, then split with intent.”
The systems we ship in production at TomTom that do use multiple agents do so because each split passes the decision tree above with a clear yes. None of them has a synthesiser agent. None of them has an LLM validator. Each agent earns its keep by handling a tool surface or role that the others meaningfully cannot.
Default to one. Earn the split.