2026-03-15 · 6 min read

Evals for agentic systems: what to measure, what to skip

#agentic #llm #evals #testing #production

The first eval suite I built for an agentic system measured “did the final answer match the expected answer”. It passed at 92% and I shipped. The system then failed in production for reasons the eval suite couldn’t see: tool calls timing out silently, the agent retrying the same tool 5 times in a row, output formatting that broke the downstream UI.

The eval was looking at the output, not at the process that produced it. This post covers the five eval categories I now run on every agentic system, the gotchas of LLM-as-judge, and how to ship a useful eval harness in a week.

The mistake of evaluating only the final answer

Agentic systems have many internal decisions: which tool to call, with what arguments, in what order, when to stop, how to recover from errors. The final answer is the visible output of those decisions. Evaluating only the output is like reviewing a chess game by looking at the final position — you can’t tell if the moves were good, you can only tell if the endgame was won.

Five things to measure, in the order I’d implement them:

1. Output quality (the obvious one)

Does the final answer match the expected answer? Or, if there’s no exact-match expected answer, is it acceptable?

Two flavours:

Exact match / structured output: when you can specify the right answer programmatically.

def eval_routing_question(test_case):
    response = agent.run(test_case.question)
    expected = test_case.expected_route
    return {
        "passed": response.route == expected,
        "details": {"got": response.route, "expected": expected},
    }

Semantic match / LLM-as-judge: when the answer can be phrased in many valid ways.

def eval_explanation_quality(test_case):
    response = agent.run(test_case.question)
    score = llm_judge(
        question=test_case.question,
        answer=response.text,
        rubric="""
        Score 1-5:
        5 = factually correct AND well-explained
        3 = factually correct but unclear
        1 = factually wrong or confused
        """,
    )
    return {"passed": score >= 4, "score": score}

This category gets you started but doesn’t catch most production issues.

2. Tool-call correctness

Did the agent call the right tools, with the right arguments, in a defensible order?

def eval_tool_calls(test_case):
    response = agent.run(test_case.question)
    actual_tools = [tc.tool_name for tc in response.tool_calls]
    return {
        "right_tools_called": set(actual_tools) >= set(test_case.required_tools),
        "no_extra_tools": set(actual_tools) <= set(test_case.allowed_tools),
        "correct_order": is_subsequence(test_case.required_order, actual_tools),
    }

Three checks I run: every required tool was called, nothing outside the allowlist was called, and the required calls happened in the required order (checked as a subsequence, so extra calls in between don't fail it).

This is the eval category that caught most of my early bugs. The output was fine but the agent was burning compute calling tomtom-search three times when it should have called tomtom-routing once.
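
The order check leans on an is_subsequence helper the snippet above assumes exists; a minimal sketch:

def is_subsequence(required, actual):
    # True if `required` appears in `actual` in order,
    # allowing other tool calls in between.
    it = iter(actual)
    return all(tool in it for tool in required)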

3. Latency and cost

The agent answered correctly, but it took 47 seconds and cost $0.84. Both unacceptable.

def eval_efficiency(test_case):
    start = time.perf_counter()
    response = agent.run(test_case.question)
    duration = time.perf_counter() - start
    cost = (response.usage.input_tokens * MODEL_INPUT_COST
            + response.usage.output_tokens * MODEL_OUTPUT_COST)
    return {
        "duration_p95_ok": duration <= test_case.latency_budget_p95,
        "cost_per_query_ok": cost <= test_case.cost_budget,
        "duration": duration,
        "cost": cost,
        "tool_calls_count": len(response.tool_calls),
    }

Specific things to track: latency per query against a budget, token cost per query against a budget, and tool-call count per query — the first thing to check when cost creeps up.

4. Failure-mode resilience

The agent works on the happy path. What does it do when a tool times out, when an upstream API returns an error or a malformed payload, or when a dependency is down entirely?

This is the category most teams skip and most teams regret.

def eval_tool_failure_recovery(test_case):
    # Inject a tool failure
    with mock_tool_failure("tomtom-routing", error_code="UPSTREAM_TIMEOUT"):
        response = agent.run(test_case.question)
    return {
        "agent_handled_gracefully": response.status != "crashed",
        "user_facing_error_helpful": llm_judge_is_helpful(response.text),
        "no_secret_leakage": "API_KEY" not in response.text,
    }

Build this eval by:

  1. Running the system, intercepting tool calls, injecting failures (a sketch of the injection helper follows this list).
  2. Checking the agent’s response is still helpful (not just “an error occurred”).
  3. Verifying no internal state leaks (API keys, internal IDs, stack traces).
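
A minimal sketch of a mock_tool_failure context manager for step 1, assuming the agent dispatches tools through a registry dict mapping names to callables (TOOL_REGISTRY and ToolError are hypothetical stand-ins for your system's equivalents):

from contextlib import contextmanager

TOOL_REGISTRY: dict = {}  # assumed: the agent's tool dispatch table

class ToolError(Exception):
    # stand-in for the agent's real tool error type
    pass

@contextmanager
def mock_tool_failure(tool_name, error_code):
    # Swap the real tool for one that raises, then restore it.
    original = TOOL_REGISTRY[tool_name]

    def failing_tool(*args, **kwargs):
        raise ToolError(error_code)

    TOOL_REGISTRY[tool_name] = failing_tool
    try:
        yield
    finally:
        TOOL_REGISTRY[tool_name] = original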

For a multi-agent system, also test partial failures: agent A succeeds, agent B fails. Does the orchestrator handle it?

5. Drift detection (the operational one)

Your evals pass on day 1. On day 30, with no code change, they pass at 87%. What happened?

LLMs change behaviour. Models get retrained. Tools’ upstream APIs return different shapes. The world moves; your evals need to detect when it has.

Two specific checks:

Snapshot regression:

# Run evals against the current model. Save the answer.
# Run again next week. Has the answer drifted on any test case?
def detect_drift(test_case):
    snapshot = load_snapshot(test_case.id)
    current = agent.run(test_case.question)
    similarity = embedding_similarity(snapshot.answer, current.text)
    return {
        "drifted": similarity < 0.85,
        "similarity": similarity,
    }

Behaviour distribution drift: track aggregate stats across the whole eval set (tool-call counts per query, answer lengths, latency percentiles) and compare each run's distribution to the last. A shift with no code change means the model or an upstream tool changed underneath you.

These don’t fail or pass — they alert. Drift in the wrong direction is a signal to investigate.
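
A minimal sketch of the distribution check, comparing mean tool calls per query between two runs; load_run_stats and the stats layout are assumptions:

from statistics import mean

def detect_distribution_drift(current_run_id, baseline_run_id, threshold=0.25):
    # Alert when mean tool calls per query shifts by more than
    # `threshold` relative to the baseline run.
    current = load_run_stats(current_run_id)["tool_calls_per_query"]
    baseline = load_run_stats(baseline_run_id)["tool_calls_per_query"]
    relative_shift = abs(mean(current) - mean(baseline)) / max(mean(baseline), 1e-9)
    return {
        "alert": relative_shift > threshold,
        "relative_shift": relative_shift,
    }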

LLM-as-judge: the gotchas

Using an LLM to grade another LLM’s outputs is tempting because it scales. It’s also full of gotchas.

1. Judge model bias. Using GPT-4 to grade GPT-4 outputs introduces bias. The judge will tend to rate its own family’s outputs higher. Use a different model for judging.

2. Position bias. When asking the judge to compare A vs B, the judge over-weights the option presented first. Always run pairwise comparisons twice (A,B then B,A) and average.

3. Verbosity bias. LLM judges often rate longer responses higher even when length is irrelevant. Add explicit guidance: “do not penalise terse answers when terse is appropriate”.

4. Inconsistency. Same prompt, same answer, different score. Run the judge 3-5 times and average, or use deterministic settings (temperature=0) and treat single-shot scores as noisy (see the sketch after this list).

5. Calibration drift. Judge models change. A score of 4 today may be a 3 next month from the same prompt. Re-anchor with a small set of human-graded examples periodically.
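
A minimal sketch of the mitigations for gotchas 2 and 4 together, assuming an llm_judge_pairwise(question, first, second) helper that returns which position won ("first" or "second"); the helper name and return shape are assumptions:

def debiased_pairwise(question, answer_a, answer_b, n_rounds=3):
    # Judge with both orderings, several times, and average so that
    # position bias and run-to-run noise wash out.
    a_wins = 0
    total = 0
    for _ in range(n_rounds):
        if llm_judge_pairwise(question, answer_a, answer_b) == "first":
            a_wins += 1
        if llm_judge_pairwise(question, answer_b, answer_a) == "second":
            a_wins += 1
        total += 2
    return a_wins / total  # fraction of comparisons A won; 0.5 = tie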

When LLM-as-judge is OK: high-volume evals where exact match isn’t possible, where you’ve audited the judge against a human-graded set, and where you’ve controlled for the biases above.

When it isn’t: critical safety evals, anything legal/medical, anywhere you’d be uncomfortable defending the score in a meeting.

Building a useful eval harness in a week

Realistic scope for an agentic system that already exists:

Day 1: collect 30-50 real user queries from logs. Annotate each with: expected tools called, expected output (where applicable), latency budget, cost budget.
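
A test case carrying those annotations might look like this dataclass (a sketch; field names follow the snippets above, defaults are assumptions):

from dataclasses import dataclass, field

@dataclass
class TestCase:
    id: str
    question: str
    expected_route: str | None = None    # for exact-match evals
    required_tools: list[str] = field(default_factory=list)
    allowed_tools: list[str] = field(default_factory=list)
    required_order: list[str] = field(default_factory=list)
    latency_budget_p95: float = 10.0     # seconds, assumed default
    cost_budget: float = 0.05            # dollars, assumed default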

Day 2-3: build the eval runner. Iterate over test cases, run the agent, collect metrics. Keep the runner simple — Python script + JSON output is fine.
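
A runner in that spirit, assuming the eval functions and test cases above (the output path is an assumption):

import json

def run_suite(eval_fns, test_cases):
    # Run every eval against every case; dump raw results as JSON.
    results = []
    for case in test_cases:
        for eval_fn in eval_fns:
            outcome = eval_fn(case)
            outcome.update({"case_id": case.id, "eval": eval_fn.__name__})
            results.append(outcome)
    with open("eval_results.json", "w") as f:
        json.dump(results, f, indent=2)
    return results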

Day 4: add the five eval categories. Most are 20 lines each. The hard one is failure-mode injection; that takes a day on its own.

Day 5: run the suite. Inspect failures. Half will be eval bugs (your expected output was wrong). Half will be real agent bugs.

Day 6-7: wire to CI. Every PR runs evals. Failure rate over a threshold blocks merge. Ship.
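
The merge gate can be a few lines that read the runner’s JSON and exit non-zero below a threshold (file name and threshold are assumptions):

import json
import sys

def ci_gate(results_path="eval_results.json", min_pass_rate=0.90):
    # Fail the build when the pass rate drops below the threshold.
    with open(results_path) as f:
        results = json.load(f)
    scored = [r for r in results if "passed" in r]
    if not scored:
        sys.exit("no scored results found")
    pass_rate = sum(r["passed"] for r in scored) / len(scored)
    print(f"pass rate: {pass_rate:.1%} (threshold {min_pass_rate:.0%})")
    if pass_rate < min_pass_rate:
        sys.exit(1)  # blocks the merge

if __name__ == "__main__":
    ci_gate()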

After this: keep the eval set growing. Every production bug becomes a test case. Every regression caught in CI tightens the safety net.

What I no longer do

Evaluate only the final answer. Ship on an aggregate pass rate without reading the individual failures. Trust a judge score I haven’t audited against a human-graded set.

Closing

Five eval categories: output quality, tool-call correctness, latency + cost, failure-mode resilience, drift detection. None is sufficient alone; together they catch ~90% of production issues before users do. Build the harness in a week, wire it to CI, grow the test set with every bug. The system that ships at a 95% pass rate stays at 95% only if you’re watching it.