Evals for agentic systems: what to measure, what to skip
Most agent evals measure the wrong thing. Output quality alone misses 80% of production failure modes. This post covers the five eval categories I now run on every agentic system, the LLM-as-judge gotchas, and how to ship a useful eval harness in a week.