post agentic · 2025-07-15 · 6 min read
Designing tools for LLMs: what makes a tool the model actually uses
The interesting thing about LLM tool use is that the quality of your tool design shows up directly as the quality of your agent's output. A perfectly capable Claude or GPT will produce vague answers if your tool schema is ambiguous, and tight, grounded answers if your tool schema is sharp. The model is not the bottleneck. The interface is.
This post collects the patterns I now reach for after authoring two production MCP servers (tomtom-mcp and tomtom-traffic-analytics-mcp) and watching real agents call them tens of thousands of times. Some of these are obvious in hindsight; I had to learn each one the hard way.
1. Tool names are documentation
LLMs pick which tool to call almost entirely from the name and description. The body of your tool implementation is invisible to them. So the name is doing real work.
Avoid: get_data, search, run_query
Prefer: search-restaurants-by-location, get-route-details-between-points, list-active-traffic-incidents-in-area
The bad names are ambiguous: an LLM looking at four get_data tools has to read every description carefully. The good names are self-describing: an LLM glancing at a list immediately knows what each does.
A simple heuristic: if I replaced your tool's name with an arbitrary label, would anyone lose information about what it does? If not, the name isn't doing any work. Rename it.
2. Descriptions are prompts
The description field on a tool isn't documentation for human developers. It's a prompt fragment the LLM reads at every call. Treat it like a prompt.
Bad description (reads like a docstring):

    Returns a list of POIs (points of interest) matching the given query
    parameters. The results may include restaurants, shops, or other
    amenities depending on the query.

Good description (reads like a prompt):

    Use this tool when the user asks to FIND or SEARCH for a place by name,
    category, or vague description. Returns up to 20 ranked results.
    Do NOT use this tool for routing between two known places; use
    get-route-details-between-points instead.

The good one tells the LLM when to call it and when not to. That second part is the secret weapon: explicitly disambiguating from sibling tools. Models hesitate less when the rule for "is this the right tool" is on the tin.
3. Argument schemas: tight, named, ordered by importance
LLMs are good at filling in named arguments. They're bad at remembering positional ones. They're terrible at obeying constraints that aren't expressed in the schema.

    // Avoid: ambiguous, untyped, easy to misfill
    {
      q: string,
      loc: string,
      n: number
    }

    // Prefer: self-documenting, constrained
    {
      query: string,            // free-text search term, e.g. "vegan restaurants"
      centerLat: number,        // latitude of search centre, decimal degrees
      centerLon: number,        // longitude of search centre, decimal degrees
      radiusMeters: number,     // 50 to 50000, default 1000
      limit: number,            // 1 to 50, default 10
      minRating: number | null  // 0 to 5; null means no filter
    }

Three rules:
- Names are full words. query not q, radiusMeters not r.
- Bounds in the description. "1 to 50, default 10" so the LLM doesn't try limit: 1000000.
- Required vs nullable explicit. minRating: number | null says "you can pass null to skip this". minRating: number (required) says "you must give a number, even if it's 0". These are different contracts; LLMs respect the difference.
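Documenting bounds only helps if the server also enforces them. Here is a minimal sketch of what that enforcement might look like, assuming a hand-rolled validator rather than any particular schema library. The names SearchArgs, Validation, and validateSearchArgs are hypothetical, and the latitude and longitude ranges are my assumption (the schema above only says "decimal degrees").

```typescript
// Illustrative argument type mirroring the "Prefer" schema above.
interface SearchArgs {
  query: string;
  centerLat: number;
  centerLon: number;
  radiusMeters: number; // 50 to 50000
  limit: number; // 1 to 50
  minRating: number | null; // 0 to 5; null means no filter
}

type Validation =
  | { ok: true; args: SearchArgs }
  | { ok: false; problems: string[] };

// True when value is a finite number within [min, max].
function inRange(value: unknown, min: number, max: number): value is number {
  return typeof value === "number" && Number.isFinite(value) && value >= min && value <= max;
}

// Hypothetical validator: collects every problem in one pass so the model
// can correct all arguments in a single retry.
function validateSearchArgs(input: Record<string, unknown>): Validation {
  const problems: string[] = [];
  const { query, centerLat, centerLon, radiusMeters, limit, minRating } = input;

  if (typeof query !== "string" || query.trim() === "") {
    problems.push("query must be a non-empty string");
  }
  if (!inRange(centerLat, -90, 90)) {
    problems.push("centerLat must be between -90 and 90");
  }
  if (!inRange(centerLon, -180, 180)) {
    problems.push("centerLon must be between -180 and 180");
  }
  if (!inRange(radiusMeters, 50, 50000)) {
    problems.push("radiusMeters must be between 50 and 50000");
  }
  if (!inRange(limit, 1, 50) || !Number.isInteger(limit)) {
    problems.push("limit must be an integer between 1 and 50");
  }
  // Nullable but required: null is accepted, a missing value is not.
  if (minRating !== null && !inRange(minRating, 0, 5)) {
    problems.push("minRating must be null or a number between 0 and 5");
  }

  if (problems.length > 0) return { ok: false, problems };

  return {
    ok: true,
    args: {
      query: query as string,
      centerLat: centerLat as number,
      centerLon: centerLon as number,
      radiusMeters: radiusMeters as number,
      limit: limit as number,
      minRating: minRating as number | null,
    },
  };
}
```

Reporting all problems at once, phrased with the same bounds as the description, gives the model one clear correction to make instead of a guess-and-retry loop.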
4. Error envelopes that let the agent recover
The default tool-call error path is "the API call failed, the framework throws, the LLM sees a stack trace and gives up." That is bad. Replace it with a structured error envelope:

    // Avoid: unstructured failure
    throw new Error("connection refused");

    // Prefer: structured envelope
    return {
      ok: false,
      error: {
        code: "UPSTREAM_TIMEOUT",
        message: "The geocoding service did not respond in 5 seconds.",
        retryable: true,
        suggestedAction: "Retry the call once. If it fails again, fall back to a coarser-grained search."
      }
    };

The agent, given this envelope, can read the suggestedAction and actually do something useful. The LLM is good at following English instructions; "if X fails, try Y" is exactly the shape it handles best.
Common code values worth standardising across all your tools:
- INVALID_ARGUMENTS: the schema check failed; LLM should re-read the description and retry with corrected args
- NOT_FOUND: the lookup succeeded but returned zero results; LLM should broaden the query
- UPSTREAM_TIMEOUT: transient; LLM should retry once
- UPSTREAM_UNAVAILABLE: non-transient; LLM should explain the failure to the user
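One way to make the envelope the default is to wrap every handler so nothing escapes as a raw exception. The sketch below is an assumption about how that could be wired, not the code from my servers: toToolError, withEnvelope, and the mapping rules are hypothetical. The only external fact it relies on is that Node.js system errors carry a string code property such as "ETIMEDOUT".

```typescript
type ErrorCode =
  | "INVALID_ARGUMENTS"
  | "NOT_FOUND"
  | "UPSTREAM_TIMEOUT"
  | "UPSTREAM_UNAVAILABLE";

interface ToolError {
  code: ErrorCode;
  message: string;
  retryable: boolean;
  suggestedAction: string;
}

type ToolResult<T> = { ok: true; data: T } | { ok: false; error: ToolError };

// Reads a string `code` property off a thrown value, if there is one.
function errorCodeOf(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

// Maps any thrown value to an envelope. Assumed mapping: timeouts are
// retryable, everything else is reported as a non-transient upstream failure.
function toToolError(err: unknown): ToolError {
  const message = err instanceof Error ? err.message : String(err);
  if (errorCodeOf(err) === "ETIMEDOUT") {
    return {
      code: "UPSTREAM_TIMEOUT",
      message,
      retryable: true,
      suggestedAction: "Retry the call once.",
    };
  }
  return {
    code: "UPSTREAM_UNAVAILABLE",
    message,
    retryable: false,
    suggestedAction: "Explain the failure to the user.",
  };
}

// Wraps a handler so that it resolves to an envelope instead of throwing.
async function withEnvelope<T>(handler: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, data: await handler() };
  } catch (err) {
    return { ok: false, error: toToolError(err) };
  }
}
```

Keeping the mapping in one function means every tool speaks the same small vocabulary of codes, which is what lets the model learn the recovery pattern once.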
5. Idempotency, where it makes sense
Every tool that reads should be idempotent. Same arguments → same result (modulo external state changes). This is obvious for GET-shaped tools.
The interesting case: tools that write. If your tool creates a resource or sends a message, accept an idempotencyKey argument:

    {
      message: string,
      recipient: string,
      idempotencyKey: string  // any unique string; same key = same op
    }

The LLM, when it retries (and it will, when it's confused), can pass the same key. Your tool then no-ops the duplicate instead of sending the message twice.
This single pattern has saved me from two production incidents involving "the agent retried the slack-post tool four times because it didn't understand the success response and now the channel has four copies of the same alert."
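A minimal sketch of the no-op-on-duplicate behaviour, assuming an in-memory store. Everything here (postMessage, sendMessage, the sent log, the duplicate flag) is hypothetical; a real server would need shared storage with expiry, and would have to decide what happens when the same key arrives with different arguments.

```typescript
interface PostMessageArgs {
  message: string;
  recipient: string;
  idempotencyKey: string;
}

interface PostResult {
  messageId: string;
  duplicate: boolean; // true when the call was recognised as a retry
}

// Log of messages actually sent; stands in for the real side effect.
const sent: Array<{ recipient: string; message: string }> = [];

// Idempotency key -> messageId of the operation that already completed.
// In-memory for illustration only.
const completed = new Map<string, string>();

// Hypothetical stand-in for the call that really posts the message.
function sendMessage(recipient: string, message: string): string {
  sent.push({ recipient, message });
  return `msg-${sent.length}`;
}

function postMessage(args: PostMessageArgs): PostResult {
  const previous = completed.get(args.idempotencyKey);
  if (previous !== undefined) {
    // Same key seen before: return the earlier result without sending again.
    return { messageId: previous, duplicate: true };
  }
  const messageId = sendMessage(args.recipient, args.message);
  completed.set(args.idempotencyKey, messageId);
  return { messageId, duplicate: false };
}
```

Returning the original messageId on a duplicate, rather than an error, matters: the retrying model sees an unambiguous success and stops retrying.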
6. Cost-aware tool surfaces
Some tools are expensive (paid APIs, heavy compute). Tell the LLM in the description.

    Use this tool to compute a routing matrix between many points. Each call
    costs significantly more than a single point-to-point routing call.
    PREFER get-route-details-between-points for fewer than 5 stops.

The model will respect the guidance most of the time. Combined with rate-limiting at the server, you have two layers: the LLM avoids expensive calls when it doesn't need them, the server enforces the limit when it does.
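For the server-side layer, here is a sketch of one simple option, a fixed-window call budget. This is an assumed technique chosen for brevity, not necessarily what a production server should use (a token bucket smooths bursts better). CallBudget and its parameters are hypothetical, and the clock is passed in so the behaviour is easy to test.

```typescript
// Fixed-window limiter: allows at most maxCalls per window of windowMs.
class CallBudget {
  private readonly maxCalls: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private used = 0;

  constructor(maxCalls: number, windowMs: number) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
  }

  // Consumes one call and returns true if budget remains in the current
  // window; returns false without consuming anything otherwise.
  tryConsume(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // The previous window has elapsed: start a fresh one.
      this.windowStart = nowMs;
      this.used = 0;
    }
    if (this.used >= this.maxCalls) return false;
    this.used += 1;
    return true;
  }
}
```

In a handler this would run before the expensive upstream call, for example `matrixBudget.tryConsume(Date.now())`, with a rejected call answered by a structured error telling the model when to retry or which cheaper tool to use.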
7. Return shape: structured, not prose
Tool outputs that are big blobs of natural-language prose are wasteful. The LLM has to parse them again. Return structured data:

    // Avoid: prose-shaped tool output
    "I found 3 restaurants near Brandenburger Tor: Café Einstein at 4.6 stars, the Barn Roastery at 4.7 stars, and Distrikt Coffee at 4.6 stars. Café Einstein is closest at 250m..."

    // Prefer: structured data, agent renders the prose
    {
      "results": [
        { "name": "The Barn Roastery", "rating": 4.7, "distanceMeters": 320 },
        { "name": "Café Einstein", "rating": 4.6, "distanceMeters": 250 },
        { "name": "Distrikt Coffee", "rating": 4.6, "distanceMeters": 410 }
      ],
      "totalFound": 3
    }

The model can iterate over results directly, sort, filter, decide. With prose, it has to parse before it can act.
Compose the user-facing prose in the agent's response, not in your tool's response.
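To make the chaining argument concrete, here is a sketch of a downstream step (agent-side glue code, or another tool) consuming the structured response above. The types and the closestWithRating helper are hypothetical; the data is the example from this section.

```typescript
interface Place {
  name: string;
  rating: number;
  distanceMeters: number;
}

interface SearchResponse {
  results: Place[];
  totalFound: number;
}

const sampleResponse: SearchResponse = {
  results: [
    { name: "The Barn Roastery", rating: 4.7, distanceMeters: 320 },
    { name: "Café Einstein", rating: 4.6, distanceMeters: 250 },
    { name: "Distrikt Coffee", rating: 4.6, distanceMeters: 410 },
  ],
  totalFound: 3,
};

// Returns the closest place whose rating meets the threshold,
// or null when no result qualifies.
function closestWithRating(res: SearchResponse, minRating: number): Place | null {
  const candidates = res.results
    .filter((place) => place.rating >= minRating) // filter() copies, so the input is not reordered
    .sort((a, b) => a.distanceMeters - b.distanceMeters);
  return candidates[0] ?? null;
}
```

With the prose-shaped output, the same question ("closest place rated 4.6 or better") would need a parsing step that can silently get a number wrong.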
What I no longer do
A few practices I've dropped:
- Over-using "auto" tools that try to do many things based on argument shape. They confuse models. One tool, one job.
- Hiding internal IDs in opaque cookies ("pass cursor: 'abc123' to paginate"). Models lose them across turns. Use simple offset / limit pagination instead.
- Returning HTML or markdown in tool responses. Just data. The LLM does the rendering.
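For the pagination point, a minimal sketch of an offset / limit response shape. Page, paginate, and the nextOffset field are hypothetical names; the idea is that the model only has to carry forward a plain number it can also recompute.

```typescript
interface Page<T> {
  items: T[];
  offset: number;
  limit: number;
  total: number;
  nextOffset: number | null; // null when there are no more items
}

// Slices one page out of a full result list. Offset and limit are clamped
// so that out-of-range values from the model cannot produce surprising slices.
function paginate<T>(all: readonly T[], offset: number, limit: number): Page<T> {
  const start = Math.max(0, Math.floor(offset));
  const size = Math.max(1, Math.floor(limit));
  const items = all.slice(start, start + size);
  const end = start + items.length;
  return {
    items,
    offset: start,
    limit: size,
    total: all.length,
    nextOffset: end < all.length ? end : null,
  };
}
```

Returning total and nextOffset explicitly means the model never has to do the arithmetic itself, and a lost value can be reconstructed from the last page it saw.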
The mental model
Treat your tool surface like an API for a junior engineer. Names that mean what they say. Arguments that fill in the way they read. Errors that suggest what to try next. Outputs that are easy to chain into the next decision.
If your junior engineer would file a bug against your tool's design, your LLM is probably already silently working around it. You just can't see the work-around because the LLM is too polite to complain.
Practical checklist
When I review a new tool, this is the list I run:
- Name is verb-first and self-describing
- Description says when to use AND when not to use
- Argument names are full words, not abbrevs
- Required vs nullable is correct in the schema
- Bounds and defaults documented in argument descriptions
- Errors return a structured envelope with code, message, retryable, suggestedAction
- Side-effecting tools accept idempotencyKey
- Expensive tools warn in their description; cheap alternatives mentioned
- Return shape is structured data, not prose
If a tool clears all nine, agents use it correctly the first time. If it fails three or more, you'll spend the next sprint adding "the agent keeps doing X" defensive patches.
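Parts of this checklist can be roughed out as an automated pass. The sketch below is a crude heuristic under loud assumptions: the ToolSpec shape, the verb list, the "three or more name segments" rule, and the "do not use" phrase match are all mine, invented for illustration. It covers only the mechanical items and is no substitute for reading the description as the model would.

```typescript
// Hypothetical summary of a tool, just enough for the checks below.
interface ToolSpec {
  name: string;
  description: string;
  argumentNames: string[];
  hasSideEffects: boolean;
}

// Assumed verb list; extend it to match your own naming conventions.
const KNOWN_VERBS = ["get", "list", "search", "create", "update", "delete", "send", "compute"];

// Returns one finding per checklist item that the tool appears to miss.
function lintTool(tool: ToolSpec): string[] {
  const findings: string[] = [];
  const segments = tool.name.split(/[-_]/);

  if (!KNOWN_VERBS.includes(segments[0] ?? "")) {
    findings.push("name is not verb-first");
  }
  if (segments.length < 3) {
    findings.push("name may not be self-describing");
  }
  if (!/do not use/i.test(tool.description)) {
    findings.push("description does not say when NOT to use the tool");
  }
  const abbreviated = tool.argumentNames.filter((arg) => arg.length <= 3);
  if (abbreviated.length > 0) {
    findings.push(`argument names look abbreviated: ${abbreviated.join(", ")}`);
  }
  if (tool.hasSideEffects && !tool.argumentNames.includes("idempotencyKey")) {
    findings.push("side-effecting tool does not accept idempotencyKey");
  }
  return findings;
}
```

A check like this is cheap to run in CI over every registered tool, which at least catches the regressions that creep in when someone adds a quick get_data under deadline.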
The MCP servers I've shipped follow this checklist. So do the agentic systems built on top of them. The pattern transfers.