post agentic · 2025-07-15 · 6 min read
Designing tools for LLMs: what makes a tool the model actually uses
The interesting thing about LLM tool-use is that the quality of your tool design shows up directly as the quality of your agent’s output. A perfectly capable Claude or GPT will produce vague answers if your tool schema is ambiguous, and tight, grounded answers if your tool schema is sharp. The model is not the bottleneck. The interface is.
This post collects the patterns I now reach for after authoring two production MCP servers (tomtom-mcp and tomtom-traffic-analytics-mcp) and watching real agents call them tens of thousands of times. Some of these are obvious in hindsight; I had to learn each one the hard way.
1. Tool names are documentation
LLMs pick which tool to call almost entirely from the name and description. The body of your tool implementation is invisible to them. So the name is doing real work.
Avoid:

```
get_data
search
run_query
```

Prefer:

```
search-restaurants-by-location
get-route-details-between-points
list-active-traffic-incidents-in-area
```

The bad names are ambiguous: an LLM looking at four `get_data` tools has to read every description carefully. The good names are self-describing: an LLM glancing at a list immediately knows what each does.
A simple heuristic: if I hide your tool’s description, can I still tell what it does from the name alone? If not, rename it.
2. Descriptions are prompts
The description field on a tool isn’t documentation for human developers. It’s a prompt fragment the LLM reads at every call. Treat it like a prompt.
Bad description (reads like a docstring):

```
Returns a list of POIs (points of interest) matching the given query
parameters. The results may include restaurants, shops, or other
amenities depending on the query.
```

Good description (reads like a prompt):

```
Use this tool when the user asks to FIND or SEARCH for a place by name,
category, or vague description. Returns up to 20 ranked results.
Do NOT use this tool for routing between two known places — use
get-route-details-between-points instead.
```

The good one tells the LLM when to call it and when not to. That second part is the secret weapon: explicitly disambiguating from sibling tools. Models hesitate less when the rule for “is this the right tool?” is printed on the tin.
3. Argument schemas: tight, named, ordered by importance
LLMs are good at filling in named arguments. They’re bad at remembering positional ones. They’re terrible at obeying constraints that aren’t expressed in the schema.
```ts
// Avoid: ambiguous, untyped, easy to misfill
{
  q: string,
  loc: string,
  n: number
}
```

```ts
// Prefer: self-documenting, constrained
{
  query: string,           // free-text search term, eg "vegan restaurants"
  centerLat: number,       // latitude of search centre, decimal degrees
  centerLon: number,       // longitude of search centre, decimal degrees
  radiusMeters: number,    // 50 to 50000, default 1000
  limit: number,           // 1 to 50, default 10
  minRating: number | null // 0 to 5; null means no filter
}
```

Three rules:
- Names are full words. `query` not `q`, `radiusMeters` not `r`.
- Bounds in the description. “1 to 50, default 10” so the LLM doesn’t try `limit: 1000000`.
- Required vs nullable explicit. `minRating: number | null` says “you can omit this”. `minRating: number` (required) says “you must give a number, even if it’s 0”. These are different contracts; LLMs respect the difference.
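These rules can also be enforced at runtime. A minimal sketch for the search schema above, with a hand-rolled validator (in production you would more likely generate this from a JSON Schema or zod definition; `validateSearchArgs` and its exact bounds are illustrative):

```typescript
type SearchArgs = {
  query: string;
  centerLat: number;         // decimal degrees
  centerLon: number;         // decimal degrees
  radiusMeters?: number;     // 50 to 50000, default 1000
  limit?: number;            // 1 to 50, default 10
  minRating?: number | null; // 0 to 5; null means no filter
};

// Applies defaults, then rejects anything outside the documented bounds.
function validateSearchArgs(
  raw: SearchArgs
): { ok: true; args: Required<SearchArgs> } | { ok: false; reason: string } {
  const radiusMeters = raw.radiusMeters ?? 1000;
  const limit = raw.limit ?? 10;
  const minRating = raw.minRating ?? null;
  if (radiusMeters < 50 || radiusMeters > 50000)
    return { ok: false, reason: "radiusMeters must be 50 to 50000" };
  if (limit < 1 || limit > 50)
    return { ok: false, reason: "limit must be 1 to 50" };
  if (minRating !== null && (minRating < 0 || minRating > 5))
    return { ok: false, reason: "minRating must be 0 to 5" };
  return { ok: true, args: { ...raw, radiusMeters, limit, minRating } };
}
```

The bounds in the argument descriptions and the bounds in the validator are the same numbers, so the model and the runtime see one contract.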
4. Error envelopes that let the agent recover
The default tool-call error path is “the API call failed, the framework throws, the LLM sees a stack trace and gives up.” That is bad. Replace it with a structured error envelope:
```ts
// Avoid: unstructured failure
throw new Error("connection refused");
```

```ts
// Prefer: structured envelope
return {
  ok: false,
  error: {
    code: "UPSTREAM_TIMEOUT",
    message: "The geocoding service did not respond in 5 seconds.",
    retryable: true,
    suggestedAction: "Retry the call once. If it fails again, fall back to a coarser-grained search."
  }
};
```

The agent, given this envelope, can read the `suggestedAction` and actually do something useful. The LLM is good at following English instructions; “if X fails, try Y” is exactly the shape it handles best.
Common code values worth standardising across all your tools:
- `INVALID_ARGUMENTS` — the schema check failed; the LLM should re-read the description and retry with corrected args
- `NOT_FOUND` — the lookup succeeded but returned zero results; the LLM should broaden the query
- `UPSTREAM_TIMEOUT` — transient; the LLM should retry once
- `UPSTREAM_UNAVAILABLE` — non-transient; the LLM should explain the failure to the user
5. Idempotency, where it makes sense
Every tool that reads should be idempotent. Same arguments → same result (modulo external state changes). This is obvious for GET-shaped tools.
The interesting case: tools that write. If your tool creates a resource or sends a message, accept an idempotencyKey argument:
```ts
{
  message: string,
  recipient: string,
  idempotencyKey: string // any unique string; same key = same op
}
```

The LLM, when it retries (and it will, when it’s confused), can pass the same key. Your tool then no-ops the duplicate instead of sending the message twice.
This single pattern has saved me from two production incidents involving “the agent retried the slack-post tool four times because it didn’t understand the success response and now the channel has four copies of the same alert.”
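A minimal in-memory sketch of the no-op behaviour. `sendChannelMessage` is a hypothetical send-message tool, and a real server would persist keys with a TTL rather than hold them in a `Map`:

```typescript
let sendCount = 0; // stands in for the real side effect, for illustration

const seen = new Map<string, { status: "sent" }>();

// Hypothetical side-effecting tool with idempotency built in.
function sendChannelMessage(args: { message: string; recipient: string; idempotencyKey: string }) {
  const cached = seen.get(args.idempotencyKey);
  if (cached) return cached; // duplicate retry: no-op, return the original result

  sendCount += 1; // ...the real message send would happen here...
  const result = { status: "sent" as const };
  seen.set(args.idempotencyKey, result);
  return result;
}
```

Two calls with the same key, one message sent — which is exactly what a confused, retrying agent needs from you.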
6. Cost-aware tool surfaces
Some tools are expensive (paid APIs, heavy compute). Tell the LLM in the description.
```
Use this tool to compute a routing matrix between many points. Each call
costs significantly more than a single point-to-point routing call.
PREFER get-route-details-between-points for fewer than 5 stops.
```

The model will respect the guidance most of the time. Combined with rate-limiting at the server, you have two layers: the LLM avoids expensive calls when it doesn’t need them, the server enforces the limit when it does.
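That server-side layer can start as simply as a per-session call budget. A hypothetical sketch — a production server would use a real rate limiter with time windows — but the shape of the guard is this:

```typescript
// Caps calls to an expensive tool, regardless of what the model decides.
function makeCallBudget(maxCalls: number) {
  let used = 0;
  return function tryAcquire(): boolean {
    if (used >= maxCalls) return false;
    used += 1;
    return true;
  };
}

const matrixBudget = makeCallBudget(3);

// Inside the matrix tool handler, before doing any work:
// if (!matrixBudget()) return a structured error envelope instead.
```

When the budget is exhausted, return the structured envelope from section 4 rather than throwing, so the agent can fall back gracefully.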
7. Return shape: structured, not prose
Tool outputs that are big blobs of natural-language prose are wasteful. The LLM has to parse them again. Return structured data:
```ts
// Avoid: prose-shaped tool output
"I found 3 restaurants near Brandenburger Tor: Café Einstein at 4.6 stars, the Barn Roastery at 4.7 stars, and Distrikt Coffee at 4.6 stars. Café Einstein is closest at 250m..."
```

```ts
// Prefer: structured data, agent renders the prose
{
  "results": [
    { "name": "The Barn Roastery", "rating": 4.7, "distanceMeters": 320 },
    { "name": "Café Einstein", "rating": 4.6, "distanceMeters": 250 },
    { "name": "Distrikt Coffee", "rating": 4.6, "distanceMeters": 410 }
  ],
  "totalFound": 3
}
```

The model can iterate over `results` directly, sort, filter, decide. With prose, it has to parse before it can act.
Compose the user-facing prose in the agent’s response, not in your tool’s response.
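The payoff, using the structured results above: sorting and filtering become one-liners over data instead of a parse of prose (the `Poi` type is illustrative):

```typescript
type Poi = { name: string; rating: number; distanceMeters: number };

const results: Poi[] = [
  { name: "The Barn Roastery", rating: 4.7, distanceMeters: 320 },
  { name: "Café Einstein", rating: 4.6, distanceMeters: 250 },
  { name: "Distrikt Coffee", rating: 4.6, distanceMeters: 410 },
];

// Nearest result: sort by distance, take the head.
const nearest = [...results].sort((a, b) => a.distanceMeters - b.distanceMeters)[0];

// Best-rated within 400 m: filters compose cleanly.
const bestNearby = results
  .filter((p) => p.distanceMeters <= 400)
  .sort((a, b) => b.rating - a.rating)[0];
```

None of this is possible against the prose-shaped output without re-extracting the numbers first.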
What I no longer do
A few practices I’ve dropped:
- Over-using “auto” tools that try to do many things based on argument shape. They confuse models. One tool, one job.
- Hiding internal IDs in opaque cursors (“pass `cursor: 'abc123'` to paginate”). Models lose them across turns. Use simple offset / limit pagination instead.
- Returning HTML or markdown in tool responses. Just data. The LLM does the rendering.
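The offset/limit alternative is trivial to implement and trivial for the model to reason about. A sketch with a hypothetical `listIncidents` tool:

```typescript
// Everything the model needs to paginate is plain arithmetic:
// the next page is simply offset + limit, no opaque token to carry.
function listIncidents<T>(all: T[], offset: number, limit: number) {
  return {
    items: all.slice(offset, offset + limit),
    offset,
    limit,
    totalCount: all.length,
    hasMore: offset + limit < all.length,
  };
}
```

Even if the model drops the response mid-conversation, it can re-derive the next page from numbers it can reconstruct, which is exactly what an opaque cursor forbids.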
The mental model
Treat your tool surface like an API for a junior engineer. Names that mean what they say. Arguments that fill in the way they read. Errors that suggest what to try next. Outputs that are easy to chain into the next decision.
If your junior engineer would file a bug against your tool’s design, your LLM is probably already silently working around it. You just can’t see the work-around because the LLM is too polite to complain.
Practical checklist
When I review a new tool, this is the list I run:
- Name is verb-first and self-describing
- Description says when to use AND when not to use
- Argument names are full words, not abbreviations
- Required vs nullable is correct in the schema
- Bounds and defaults documented in argument descriptions
- Errors return a structured envelope with `code`, `message`, `retryable`, `suggestedAction`
- Side-effecting tools accept `idempotencyKey`
- Expensive tools warn in their description; cheap alternatives mentioned
- Return shape is structured data, not prose
If a tool clears all nine, agents use it correctly the first time. If it fails three or more checks, you’ll spend the next sprint adding “the agent keeps doing X” defensive patches.
The MCP servers I’ve shipped follow this checklist. So do the agentic systems built on top of them. The pattern transfers.