post agentic · 2025-07-15 · 6 min read

Designing tools for LLMs: what makes a tool the model actually uses

#mcp #agentic #llm #tool-design

The interesting thing about LLM tool use is that the quality of your tool design shows up directly as the quality of your agent’s output. A perfectly capable Claude or GPT will produce vague answers if your tool schema is ambiguous, and tight, grounded answers if your tool schema is sharp. The model is not the bottleneck. The interface is.

This post collects the patterns I now reach for after authoring two production MCP servers (tomtom-mcp and tomtom-traffic-analytics-mcp) and watching real agents call them tens of thousands of times. Some of these look obvious in hindsight, but I had to learn each one the hard way.

1. Tool names are documentation

LLMs pick which tool to call almost entirely from the name and description. The body of your tool implementation is invisible to them. So the name is doing real work.

Avoid:   get_data
         search
         run_query

Prefer:  search-restaurants-by-location
         get-route-details-between-points
         list-active-traffic-incidents-in-area

The bad names are ambiguous: an LLM looking at four get_data tools has to read every description carefully. The good names are self-describing: an LLM glancing at a list immediately knows what each does.

A simple heuristic: if I could swap your tool’s name for a placeholder like tool_1 without losing any information about what it does, the name is not doing any work. Rename it.
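One way to make that heuristic mechanical is a small lint over tool names. The sketch below is only an illustration: the generic-word list and the "three words, two of them specific" threshold are assumptions chosen to separate the two lists above, not an established rule.

```typescript
// Words that say nothing on their own. Illustrative list, not exhaustive.
const GENERIC_WORDS = new Set(["data", "info", "item", "items", "query", "run", "do", "process", "thing"]);

// Split a snake_case or kebab-case tool name into lowercase words.
function splitToolName(name: string): string[] {
  return name.toLowerCase().split(/[-_]/).filter((word) => word.length > 0);
}

// Assumed rule of thumb: at least three words, at least two of them non-generic.
function isSelfDescribingName(name: string): boolean {
  const words = splitToolName(name);
  const specific = words.filter((word) => !GENERIC_WORDS.has(word));
  return words.length >= 3 && specific.length >= 2;
}

// Collect the names a reviewer should look at again.
const flagged = ["get_data", "search", "run_query", "search-restaurants-by-location"]
  .filter((name) => !isSelfDescribingName(name));
```

A check like this only catches the worst offenders; a long name can still be misleading, so it complements a human read of the tool list rather than replacing it.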

2. Descriptions are prompts

The description field on a tool isn’t documentation for human developers. It’s a prompt fragment the LLM reads at every call. Treat it like a prompt.

Bad description (reads like a docstring):

Returns a list of POIs (points of interest) matching the given query
parameters. The results may include restaurants, shops, or other
amenities depending on the query.

Good description (reads like a prompt):

Use this tool when the user asks to FIND or SEARCH for a place by name,
category, or vague description. Returns up to 20 ranked results.
Do NOT use this tool for routing between two known places — use
get-route-details-between-points instead.

The good one tells the LLM when to call it and when not to. That second part is the secret weapon: explicitly disambiguating from sibling tools. Models hesitate less when the rule for “is this the right tool” is on the tin.
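If every description should carry a "when to use" and a "when not to use" part, it can help to build descriptions from those parts so that neither gets forgotten. A minimal sketch, assuming a hypothetical buildToolDescription helper and a DescriptionParts shape that are not part of any MCP SDK:

```typescript
// Hypothetical shape: the three things a prompt-style description should state.
interface DescriptionParts {
  useWhen: string;                                          // the trigger for calling this tool
  returns: string;                                          // what comes back, with limits
  doNotUseFor: { situation: string; useInstead: string }[]; // sibling-tool disambiguation
}

// Assemble the parts into one prompt-style description string.
function buildToolDescription(parts: DescriptionParts): string {
  const sentences = [`Use this tool when ${parts.useWhen}.`, `Returns ${parts.returns}.`];
  for (const rule of parts.doNotUseFor) {
    sentences.push(`Do NOT use this tool for ${rule.situation}; use ${rule.useInstead} instead.`);
  }
  return sentences.join(" ");
}

const searchDescription = buildToolDescription({
  useWhen: "the user asks to FIND or SEARCH for a place by name, category, or vague description",
  returns: "up to 20 ranked results",
  doNotUseFor: [
    { situation: "routing between two known places", useInstead: "get-route-details-between-points" },
  ],
});
```

Making doNotUseFor a required field means a tool with no disambiguation rule has to say so with an explicit empty array, which is easy to spot in review.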

3. Argument schemas: tight, named, ordered by importance

LLMs are good at filling in named arguments. They’re bad at remembering positional ones. They’re terrible at obeying constraints that aren’t expressed in the schema.

// Avoid: ambiguous, untyped, easy to misfill
{
  q: string,
  loc: string,
  n: number
}

// Prefer: self-documenting, constrained
{
  query: string,                    // free-text search term, e.g. "vegan restaurants"
  centerLat: number,                // latitude of search centre, decimal degrees
  centerLon: number,                // longitude of search centre, decimal degrees
  radiusMeters: number,             // 50 to 50000, default 1000
  limit: number,                    // 1 to 50, default 10
  minRating: number | null          // 0 to 5; null means no filter
}

Three rules:

Names are full words. query not q, radiusMeters not r.
Bounds in the description. “1 to 50, default 10” so the LLM doesn’t try limit: 1000000.
Required vs nullable explicit. minRating: number | null says “pass null when you don’t want a filter”. minRating: number (required) says “you must give a number, even if it’s 0”. These are different contracts; LLMs respect the difference.
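MCP tools declare their arguments as JSON Schema, so the bounds can live in the schema itself as well as in the prose. The sketch below writes the "Prefer" shape as a schema object and adds a deliberately tiny checker. Several things here are assumptions for illustration: the ObjectSchema types, the findViolations helper, the choice of which arguments are required, and the latitude and longitude ranges. The checker is not a full JSON Schema validator.

```typescript
// Minimal subset of JSON Schema, typed just enough for this sketch.
interface StringProperty { type: "string"; description: string }
interface NumberProperty {
  type: "number" | ["number", "null"]; // the second form also accepts null
  description: string;
  minimum: number;
  maximum: number;
  default?: number;
}
interface ObjectSchema {
  type: "object";
  properties: Record<string, StringProperty | NumberProperty>;
  required: string[];
}

const searchArgsSchema: ObjectSchema = {
  type: "object",
  properties: {
    query: { type: "string", description: 'Free-text search term, e.g. "vegan restaurants"' },
    centerLat: { type: "number", minimum: -90, maximum: 90, description: "Latitude of search centre, decimal degrees" },
    centerLon: { type: "number", minimum: -180, maximum: 180, description: "Longitude of search centre, decimal degrees" },
    radiusMeters: { type: "number", minimum: 50, maximum: 50000, default: 1000, description: "50 to 50000, default 1000" },
    limit: { type: "number", minimum: 1, maximum: 50, default: 10, description: "1 to 50, default 10" },
    minRating: { type: ["number", "null"], minimum: 0, maximum: 5, description: "0 to 5; null means no filter" },
  },
  // Assumption: arguments with a default may be omitted; minRating must be sent, as a number or null.
  required: ["query", "centerLat", "centerLon", "minRating"],
};

// Return human-readable violations; an empty array means the arguments pass.
function findViolations(schema: ObjectSchema, args: Record<string, unknown>): string[] {
  const violations: string[] = [];
  for (const name of schema.required) {
    if (!(name in args)) violations.push(`${name} is required`);
  }
  for (const name of Object.keys(args)) {
    const property = schema.properties[name];
    const value = args[name];
    if (property === undefined) {
      violations.push(`${name} is not a known argument`);
    } else if (property.type === "string") {
      if (typeof value !== "string") violations.push(`${name} must be a string`);
    } else if (value === null) {
      if (property.type === "number") violations.push(`${name} must not be null`);
    } else if (typeof value !== "number" || value < property.minimum || value > property.maximum) {
      violations.push(`${name} must be a number from ${property.minimum} to ${property.maximum}`);
    }
  }
  return violations;
}
```

Calling findViolations with limit: 1000000 yields a single message naming the allowed range, which is the kind of text an agent can act on when it retries.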

4. Error envelopes that let the agent recover

The default tool-call error path is “the API call failed, the framework throws, the LLM sees a stack trace and gives up.” That is bad. Replace it with a structured error envelope:

// Avoid: unstructured failure
throw new Error("connection refused");

// Prefer: structured envelope
return {
  ok: false,
  error: {
    code: "UPSTREAM_TIMEOUT",
    message: "The geocoding service did not respond in 5 seconds.",
    retryable: true,
    suggestedAction: "Retry the call once. If it fails again, fall back to a coarser-grained search."
  }
};

The agent, given this envelope, can read the suggestedAction and actually do something useful. The LLM is good at following English instructions; “if X fails, try Y” is exactly the shape it handles best.

Common code values worth standardising across all your tools:

INVALID_ARGUMENTS — the schema check failed; LLM should re-read the description and retry with corrected args
NOT_FOUND — the lookup succeeded but returned zero results; LLM should broaden the query
UPSTREAM_TIMEOUT — transient; LLM should retry once
UPSTREAM_UNAVAILABLE — non-transient; LLM should explain the failure to the user
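With a fixed set of codes, the recovery advice can be defined once and attached wherever a failure is produced. The following is a sketch under stated assumptions: runTool, failure, and UpstreamTimeoutError are hypothetical names, the handler is synchronous for brevity (real handlers are usually async), and retryable is read here as "the identical call may succeed if repeated".

```typescript
type ErrorCode = "INVALID_ARGUMENTS" | "NOT_FOUND" | "UPSTREAM_TIMEOUT" | "UPSTREAM_UNAVAILABLE";

interface ToolFailure {
  ok: false;
  error: { code: ErrorCode; message: string; retryable: boolean; suggestedAction: string };
}
interface ToolSuccess<T> { ok: true; data: T }
type ToolResult<T> = ToolSuccess<T> | ToolFailure;

// One recovery hint per code, so every tool gives the agent the same advice.
const RECOVERY: Record<ErrorCode, { retryable: boolean; suggestedAction: string }> = {
  INVALID_ARGUMENTS: { retryable: false, suggestedAction: "Re-read the tool description and retry with corrected arguments." },
  NOT_FOUND: { retryable: false, suggestedAction: "Broaden the query and try again." },
  UPSTREAM_TIMEOUT: { retryable: true, suggestedAction: "Retry the call once." },
  UPSTREAM_UNAVAILABLE: { retryable: false, suggestedAction: "Do not retry. Explain the failure to the user." },
};

function failure(code: ErrorCode, message: string): ToolFailure {
  return { ok: false, error: { code, message, ...RECOVERY[code] } };
}

// Hypothetical error type an upstream client might throw on timeout.
class UpstreamTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UpstreamTimeoutError";
  }
}

// Run a handler and convert anything it throws into an envelope.
function runTool<T>(handler: () => T): ToolResult<T> {
  try {
    return { ok: true, data: handler() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    if (err instanceof Error && err.name === "UpstreamTimeoutError") {
      return failure("UPSTREAM_TIMEOUT", message);
    }
    // Assumption: anything unrecognised is treated as a non-transient upstream failure.
    return failure("UPSTREAM_UNAVAILABLE", message);
  }
}
```

Because the envelope is a discriminated union on ok, calling code has to check ok before touching data, which keeps failure handling from being skipped by accident.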

5. Idempotency, where it makes sense

Every tool that reads should be idempotent. Same arguments → same result (modulo external state changes). This is obvious for GET-shaped tools.

The interesting case: tools that write. If your tool creates a resource or sends a message, accept an idempotencyKey argument:

{
  message: string,
  recipient: string,
  idempotencyKey: string  // any unique string; same key = same op
}

The LLM, when it retries (and it will, when it’s confused), can pass the same key. Your tool then no-ops the duplicate instead of sending the message twice.

This single pattern has saved me from two production incidents involving “the agent retried the slack-post tool four times because it didn’t understand the success response and now the channel has four copies of the same alert.”
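The server side of that contract can be sketched as a wrapper that remembers which keys it has already handled. This is a minimal illustration, not the implementation behind either server: createIdempotentSender and the send callback are hypothetical, and the in-memory Map stands in for whatever shared store with expiry a real deployment would need.

```typescript
interface SendMessageArgs {
  message: string;
  recipient: string;
  idempotencyKey: string; // any unique string; same key = same op
}

interface SendResult {
  ok: true;
  messageId: string;
  duplicate: boolean; // true when the key was seen before and nothing was sent
}

// Wrap a side-effecting send function so repeated keys become no-ops.
function createIdempotentSender(send: (recipient: string, message: string) => string) {
  const completed = new Map<string, string>(); // idempotencyKey -> messageId of the first send

  return function sendMessage(args: SendMessageArgs): SendResult {
    const previousId = completed.get(args.idempotencyKey);
    if (previousId !== undefined) {
      // Retry of an operation that already happened: report the original result.
      return { ok: true, messageId: previousId, duplicate: true };
    }
    const messageId = send(args.recipient, args.message);
    completed.set(args.idempotencyKey, messageId);
    return { ok: true, messageId, duplicate: false };
  };
}
```

Returning the original messageId with duplicate: true, rather than an error, gives a confused agent an unambiguous success to stop on.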

6. Cost-aware tool surfaces

Some tools are expensive (paid APIs, heavy compute). Tell the LLM in the description.

Use this tool to compute a routing matrix between many points. Each call
costs significantly more than a single point-to-point routing call.
PREFER get-route-details-between-points for fewer than 5 stops.

The model will respect the guidance most of the time. Combined with rate limiting at the server, you have two layers: the LLM avoids expensive calls when it doesn’t need them, and the server enforces the limit when the LLM makes them anyway.
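The server-side layer can be as simple as a fixed-window counter in front of the expensive tool. The sketch below is one possible shape, with assumed names (createFixedWindowLimiter, tryAcquire) and an injectable clock so it can be exercised without waiting; a token bucket or a shared store would be reasonable alternatives.

```typescript
interface RateLimitDecision {
  allowed: boolean;
  retryAfterMs: number; // 0 when allowed; otherwise time until the window resets
}

// Allow at most maxCalls per windowMs. The clock is injectable for testing.
function createFixedWindowLimiter(maxCalls: number, windowMs: number, now: () => number = Date.now) {
  let windowStart = now();
  let callsInWindow = 0;

  return function tryAcquire(): RateLimitDecision {
    const current = now();
    if (current - windowStart >= windowMs) {
      // The previous window has elapsed: start a fresh one.
      windowStart = current;
      callsInWindow = 0;
    }
    if (callsInWindow < maxCalls) {
      callsInWindow += 1;
      return { allowed: true, retryAfterMs: 0 };
    }
    return { allowed: false, retryAfterMs: windowStart + windowMs - current };
  };
}
```

When a call is denied, returning the wait time in a structured response gives the agent something concrete to act on, in the same spirit as the error envelopes in section 4.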

7. Return shape: structured, not prose

Tool outputs that are big blobs of natural-language prose are wasteful. The LLM has to parse them again. Return structured data:

// Avoid: prose-shaped tool output
"I found 3 restaurants near Brandenburger Tor: Café Einstein at 4.6 stars, the Barn Roastery at 4.7 stars, and Distrikt Coffee at 4.6 stars. Café Einstein is closest at 250m..."

// Prefer: structured data, agent renders the prose
{
  "results": [
    { "name": "The Barn Roastery", "rating": 4.7, "distanceMeters": 320 },
    { "name": "Café Einstein", "rating": 4.6, "distanceMeters": 250 },
    { "name": "Distrikt Coffee", "rating": 4.6, "distanceMeters": 410 }
  ],
  "totalFound": 3
}

The model can iterate over results directly, sort, filter, decide. With prose, it has to parse before it can act.

Compose the user-facing prose in the agent’s response, not in your tool’s response.
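To make the difference concrete, here is what a follow-up decision looks like against the structured response above. The PlaceResult and SearchResponse types and the closestWithRating helper are illustrative assumptions; the data is the example response from this section.

```typescript
interface PlaceResult { name: string; rating: number; distanceMeters: number }
interface SearchResponse { results: PlaceResult[]; totalFound: number }

const response: SearchResponse = {
  results: [
    { name: "The Barn Roastery", rating: 4.7, distanceMeters: 320 },
    { name: "Café Einstein", rating: 4.6, distanceMeters: 250 },
    { name: "Distrikt Coffee", rating: 4.6, distanceMeters: 410 },
  ],
  totalFound: 3,
};

// Pick the nearest place at or above a rating threshold, or undefined if none qualifies.
function closestWithRating(searchResponse: SearchResponse, minRating: number): PlaceResult | undefined {
  return searchResponse.results
    .filter((place) => place.rating >= minRating)              // filter returns a copy, so the sort below does not mutate the response
    .sort((a, b) => a.distanceMeters - b.distanceMeters)[0];
}
```

The same question asked of the prose version would need the sentence parsed first, with all the usual ways for that to go wrong.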

What I no longer do

A few practices I’ve stopped doing:

Over-using “auto” tools that try to do many things based on argument shape. They confuse models. One tool, one job.
Hiding internal IDs in opaque cursors (“pass cursor: 'abc123' to paginate”). Models lose them across turns. Use simple offset / limit pagination instead.
Returning HTML or markdown in tool responses. Just data. The LLM does the rendering.
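For the pagination point, a response shape that spells out the next offset as a plain number leaves nothing for the model to carry across turns except an integer it can read back from the previous result. A minimal sketch, assuming a hypothetical paginate helper and Page shape:

```typescript
interface Page<T> {
  items: T[];
  offset: number;
  limit: number;
  total: number;
  nextOffset: number | null; // null when there are no more items
}

// Slice one page out of a full result list using plain offset / limit.
function paginate<T>(all: readonly T[], offset: number, limit: number): Page<T> {
  const items = all.slice(offset, offset + limit);
  const end = offset + items.length;
  return {
    items,
    offset,
    limit,
    total: all.length,
    nextOffset: end < all.length ? end : null,
  };
}
```

Including total and nextOffset in every page means the agent can tell whether another call is worth making without guessing.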

The mental model

Treat your tool surface like an API for a junior engineer. Names that mean what they say. Arguments that fill in the way they read. Errors that suggest what to try next. Outputs that are easy to chain into the next decision.

If your junior engineer would file a bug against your tool’s design, your LLM is probably already silently working around it. You just can’t see the work-around because the LLM is too polite to complain.

Practical checklist

When I review a new tool, this is the list I run:

Name is verb-first and self-describing
Description says when to use AND when not to use
Argument names are full words, not abbrevs
Required vs nullable is correct in the schema
Bounds and defaults documented in argument descriptions
Errors return a structured envelope with code, message, retryable, suggestedAction
Side-effecting tools accept idempotencyKey
Expensive tools warn in their description; cheap alternatives mentioned
Return shape is structured data, not prose

If a tool clears all nine, agents use it correctly the first time. If it fails three or more, you’ll spend the next sprint adding “the agent keeps doing X” defensive patches.

The MCP servers I’ve shipped follow this checklist. So do the agentic systems built on top of them. The pattern transfers.