post agentic · 2026-03-15 · 6 min read
Evals for agentic systems: what to measure, what to skip
The first eval suite I built for an agentic system measured "did the final answer match the expected answer". It passed at 92% and I shipped. The system then failed in production for reasons the eval suite couldn't see: tool calls timing out silently, the agent retrying the same tool 5 times in a row, output formatting that broke the downstream UI.
The eval was looking at the right output but not the right things. This post covers the five eval categories I now run on every agentic system, the gotchas of LLM-as-judge, and how to ship a useful eval harness in a week.
The mistake of evaluating only the final answer
Agentic systems have many internal decisions: which tool to call, with what arguments, in what order, when to stop, how to recover from errors. The final answer is the visible output of those decisions. Evaluating only the output is like reviewing a chess game by looking at the final position: you can't tell if the moves were good, you can only tell if the endgame was won.
Five things to measure, in the order I'd implement them:
1. Output quality (the obvious one)
Does the final answer match the expected answer? Or, if there's no exact-match expected answer, is it acceptable?
Two flavours:
Exact match / structured output: when you can specify the right answer programmatically.
    def eval_routing_question(test_case):
        response = agent.run(test_case.question)
        expected = test_case.expected_route
        return {
            "passed": response.route == expected.route,
            "details": {"got": response.route, "expected": expected.route},
        }

Semantic match / LLM-as-judge: when the answer can be phrased in many valid ways.

    def eval_explanation_quality(test_case):
        response = agent.run(test_case.question)
        score = llm_judge(
            question=test_case.question,
            answer=response.text,
            rubric="""
            Score 1-5:
            5 = factually correct AND well-explained
            3 = factually correct but unclear
            1 = factually wrong or confused
            """,
        )
        return {"passed": score >= 4, "score": score}

This category gets you started but doesn't catch most production issues.
2. Tool-call correctness
Did the agent call the right tools, with the right arguments, in a defensible order?
    def eval_tool_calls(test_case):
        response = agent.run(test_case.question)
        actual_tools = [tc.tool_name for tc in response.tool_calls]
        return {
            "right_tools_called": set(actual_tools) >= set(test_case.required_tools),
            "no_extra_tools": set(actual_tools) <= set(test_case.allowed_tools),
            "correct_order": is_subsequence(test_case.required_order, actual_tools),
        }

Three checks I run:
- Required tools: the agent must call these. Missing one = test fail.
- Allowed tools: the agent must not call anything outside this set. Calling delete_user when the question was "look up user" = bad.
- Argument shapes: the args passed to each tool match the tool's schema (validate against the Pydantic / Zod model). Strict mode catches the agent inventing args that don't exist.
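The eval above leans on an is_subsequence helper that the post doesn't define, and the argument-shape check isn't shown at all. Here is a minimal sketch of both, under assumptions: `unknown_args` is a hypothetical plain-Python stand-in for schema validation and only catches invented argument names, not wrong types. With a Pydantic or Zod model you would validate against the model itself.

```python
from typing import Iterable, Sequence


def is_subsequence(required: Sequence[str], actual: Iterable[str]) -> bool:
    """Return True if `required` appears in `actual` in order (gaps allowed)."""
    remaining = iter(actual)
    # `name in remaining` advances the iterator until it finds `name`,
    # so each later match must occur after the previous one.
    return all(name in remaining for name in required)


def unknown_args(args: dict, allowed_params: set[str]) -> set[str]:
    """Return argument names the agent passed that the tool does not define.

    Hypothetical helper: a stand-in for validating against the tool's schema.
    """
    return set(args) - allowed_params
```

Allowing gaps is deliberate: the agent can interleave other allowed tools between the required ones, and the order check still passes as long as the required calls happen in sequence.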
This is the eval category that caught most of my early bugs. The output was fine but the agent was burning compute calling tomtom-search three times when it should have called tomtom-routing once.
3. Latency and cost
The agent answered correctly, but it took 47 seconds and cost $0.84. Both unacceptable.
    import time

    def eval_efficiency(test_case):
        start = time.perf_counter()
        response = agent.run(test_case.question)
        duration = time.perf_counter() - start

        # Per-token prices are constants defined elsewhere for the model in use.
        cost = (
            response.usage.input_tokens * MODEL_INPUT_COST
            + response.usage.output_tokens * MODEL_OUTPUT_COST
        )

        return {
            "duration_p95_ok": duration <= test_case.latency_budget_p95,
            "cost_per_query_ok": cost <= test_case.cost_budget,
            "duration": duration,
            "cost": cost,
            "tool_calls_count": len(response.tool_calls),
        }

Specific things to track:
- End-to-end latency: under the user-experience budget? A typical p95 budget for a chat UI is ~5s; for a background job, 60s might be fine.
- Cost per query: under the per-query budget? Multiply by expected QPM and by the minutes in a month (43,200 for 30 days) to get $/month.
- Tool-call count: more calls = more cost + latency. Track this even when others pass; rising trend signals a regression.
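The cost-per-query projection is simple arithmetic, but it is easy to drop the minutes-per-month factor. A small sketch, assuming a 30-day month and a sustained query rate (the function name and the 2 QPM figure in the note below are illustrative, not from a real system):

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 43,200, assuming a 30-day month


def monthly_cost(cost_per_query: float, queries_per_minute: float) -> float:
    """Project monthly spend from per-query cost and a sustained QPM."""
    return cost_per_query * queries_per_minute * MINUTES_PER_MONTH
```

At the $0.84 per query from the example above and an assumed 2 QPM, that comes to roughly $72,576 per month, which is why the per-query budget deserves its own check.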
4. Failure-mode resilience
The agent works on the happy path. What does it do when:
- A tool times out.
- A tool returns a structured error.
- The user asks something out of scope.
- The user provides ambiguous input ("show me the latest report": which report?).
- The agent's LLM call fails or rate-limits.
This is the category most teams skip and most teams regret.
    def eval_tool_failure_recovery(test_case):
        # Inject a tool failure
        with mock_tool_failure("tomtom-routing", error_code="UPSTREAM_TIMEOUT"):
            response = agent.run(test_case.question)
        return {
            "agent_handled_gracefully": response.status != "crashed",
            "user_facing_error_helpful": llm_judge_is_helpful(response.text),
            "no_secret_leakage": "API_KEY" not in response.text,
        }

Build this eval by:
- Running the system, intercepting tool calls, injecting failures.
- Checking the agent's response is still helpful (not just "an error occurred").
- Verifying no internal state leaks (API keys, internal IDs, stack traces).
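The mock_tool_failure context manager used above isn't defined in this post, so here is one way it could look. This is a sketch under assumptions: `TOOL_REGISTRY` and `ToolError` are hypothetical names, and it presumes the agent looks tools up by name in a dict at call time. A real framework will have its own registry, so the patching step needs adapting.

```python
from contextlib import contextmanager

# Hypothetical tool registry: tool name -> callable. Assumed for illustration;
# adapt the patching to however your agent framework resolves tools.
TOOL_REGISTRY = {}


class ToolError(Exception):
    """Structured tool failure carrying a machine-readable error code."""

    def __init__(self, tool_name: str, error_code: str):
        super().__init__(f"{tool_name} failed: {error_code}")
        self.tool_name = tool_name
        self.error_code = error_code


@contextmanager
def mock_tool_failure(tool_name: str, error_code: str):
    """Temporarily replace a registered tool with one that always fails."""
    original = TOOL_REGISTRY[tool_name]

    def failing_tool(*args, **kwargs):
        raise ToolError(tool_name, error_code)

    TOOL_REGISTRY[tool_name] = failing_tool
    try:
        yield
    finally:
        # Restore the real tool even if the code under test raised.
        TOOL_REGISTRY[tool_name] = original
```

Restoring in a finally block matters: without it, one crashed test case leaves the tool broken for every test that runs afterwards, and the suite reports failures that aren't real.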
For a multi-agent system, also test partial failures: agent A succeeds, agent B fails. Does the orchestrator handle it?
5. Drift detection (the operational one)
Your evals pass on day 1. On day 30, with no code change, they pass at 87%. What happened?
LLMs change behaviour. Models get retrained. Tools' upstream APIs return different shapes. The world moves; your evals need to detect when it has.
Two specific checks:
Snapshot regression:
    # Run evals against the current model. Save the answer.
    # Run again next week. Has the answer drifted on any test case?
    def detect_drift(test_case):
        snapshot = load_snapshot(test_case.id)
        current = agent.run(test_case.question)
        similarity = embedding_similarity(snapshot.answer, current.text)
        return {
            "drifted": similarity < 0.85,
            "similarity": similarity,
        }

Behaviour distribution drift:
- Average tool-call count per query: did it spike from 2.3 to 4.1?
- Average latency: trending up?
- Cost-per-query: trending up?
These don't pass or fail; they alert. Drift in the wrong direction is a signal to investigate.
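The behaviour-distribution checks can be sketched as a comparison of per-metric means between a baseline run and the current run. The function name, the metric names, and the 25% relative threshold are all assumptions for illustration; pick a threshold that matches how noisy your metrics are.

```python
from statistics import mean


def drift_alerts(baseline: dict, current: dict, max_increase: float = 0.25) -> dict:
    """Flag metrics whose mean rose more than `max_increase` (relative) vs baseline.

    Both arguments map a metric name to a list of per-query samples.
    """
    alerts = {}
    for metric, base_samples in baseline.items():
        base = mean(base_samples)
        now = mean(current[metric])
        change = (now - base) / base  # assumes a non-zero baseline mean
        if change > max_increase:
            alerts[metric] = round(change, 3)
    return alerts
```

The result is a dict of alerts rather than a pass/fail flag, so it can feed a dashboard or a notification without blocking a merge.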
LLM-as-judge: the gotchas
Using an LLM to grade another LLM's outputs is tempting because it scales. It's also full of gotchas.
1. Judge model bias. Using GPT-4 to grade GPT-4 outputs introduces bias. The judge will tend to rate its own family's outputs higher. Use a different model for judging.
2. Position bias. When asking the judge to compare A vs B, the judge over-weights the option presented first. Always run pairwise comparisons twice (A,B then B,A) and average the two results.
3. Verbosity bias. LLM judges often rate longer responses higher even when length is irrelevant. Add explicit guidance: "do not penalise terse answers when terse is appropriate".
4. Inconsistency. Same prompt, same answer, different score. Run the judge 3-5 times and average; or pin deterministic settings (temperature=0) and still treat single-shot scores as noisy.
5. Calibration drift. Judge models change. A score of 4 today may be a 3 next month from the same prompt. Re-anchor with a small set of human-graded examples periodically.
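Gotchas 2 and 4 share a mechanical fix: score both presentation orders, repeat, and average. A minimal sketch, assuming a hypothetical judge callable that returns its preference for the first-presented answer as a number in [0, 1]; in practice that callable would wrap a call to the judge model.

```python
from statistics import mean
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> preference for
# `first` in [0, 1]. Assumed for illustration only.
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, question: str, a: str, b: str,
                        repeats: int = 3) -> float:
    """Average preference for `a` over both presentation orders and several runs."""
    scores = []
    for _ in range(repeats):
        scores.append(judge(question, a, b))        # a shown first
        scores.append(1.0 - judge(question, b, a))  # b shown first, flipped back
    return mean(scores)
```

A judge that only ever favours whichever answer comes first lands at 0.5 here, which is the behaviour you want: pure position bias cancels out instead of picking a winner.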
When LLM-as-judge is OK: high-volume eval where exact match isn't possible, you've audited the judge against a human-graded set, you've controlled for the biases above.
When it isn't: critical safety evals, anything legal/medical, anywhere you'd be uncomfortable defending the score in a meeting.
Building a useful eval harness in a week
Realistic scope for an agentic system that already exists:
Day 1: collect 30-50 real user queries from logs. Annotate each with: expected tools called, expected output (where applicable), latency budget, cost budget.
Day 2-3: build the eval runner. Iterate over test cases, run the agent, collect metrics. Keep the runner simple: a Python script + JSON output is fine.
Day 4: add the 5 eval categories. Most are 20 lines each. The hard one is failure-mode injection; that takes a day on its own.
Day 5: run the suite. Inspect failures. Half will be eval bugs (your expected output was wrong). Half will be real agent bugs.
Day 6-7: wire to CI. Every PR runs evals. Failure rate over a threshold blocks merge. Ship.
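The runner from days 2-3 and the CI gate from days 6-7 fit in a few lines. This is a sketch under assumptions: test cases are dicts with an "id" key, every eval returns a dict containing a boolean "passed", and the 5% threshold is a placeholder. The evals earlier in this post return differently named keys, so they would need a thin wrapper to fit this shape.

```python
from typing import Callable, Iterable


def run_suite(test_cases: Iterable[dict],
              evals: dict[str, Callable[[dict], dict]]) -> dict:
    """Run every eval on every test case and return a JSON-serialisable report."""
    results = []
    for case in test_cases:
        for name, eval_fn in evals.items():
            try:
                outcome = eval_fn(case)
            except Exception as exc:  # an eval crash counts as a failure
                outcome = {"passed": False, "error": repr(exc)}
            results.append({"case_id": case["id"], "eval": name, **outcome})
    failed = sum(1 for r in results if not r.get("passed", False))
    failure_rate = failed / len(results) if results else 0.0
    return {"failure_rate": failure_rate, "results": results}


def ci_exit_code(report: dict, max_failure_rate: float = 0.05) -> int:
    """Return a non-zero exit code when the failure rate exceeds the threshold."""
    return 1 if report["failure_rate"] > max_failure_rate else 0
```

In a CI script you would write the report out with json.dump and pass ci_exit_code(report) to sys.exit, so the job fails when the threshold is crossed. Counting a crashed eval as a failure keeps a broken eval from silently shrinking the suite.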
After this: keep the eval set growing. Every production bug becomes a test case. Every regression caught in CI tightens the safety net.
What I no longer do
- Eval only on output quality. Doesn't catch tool-call regressions or latency drift.
- Use the same model as both agent and judge. Self-evaluation bias.
- Run evals once at PR time and never again. Drift is real; weekly cron-runs catch it.
- Treat LLM-as-judge scores as ground truth. They're noisy. Treat them as a smoke alarm, not a measurement.
- Skip failure-mode evals because "it'll be fine". It will not be fine.
Closing
Five eval categories: output quality, tool-call correctness, latency + cost, failure-mode resilience, drift detection. None of them is sufficient alone; together they catch ~90% of production issues before users do. Build the harness in a week, wire to CI, grow the test set with every bug. The system that ships at 95% pass-rate stays at 95% only if you're watching it.