Discuss a project
Back to Blog

LLM evals for production: what to test before an agent touches real users

2026-05-19·2 min read
LLMEvalsAgentsObservabilityEnterprise AI

If an LLM system has no evaluations, it is not production-ready. It is a demo with hope attached.

Evals are not only about answer quality. In enterprise systems, they are how you control regressions, security, cost, latency, and operational risk.

What needs to be evaluated

For most LLM and agent workflows, I split evals into seven layers.

  1. Intent recognition. Did the system understand what the user is trying to do?
  2. Retrieval quality. Did it find the right documents, snippets, records, or past cases?
  3. Answer quality. Is the answer correct, grounded, complete, and useful?
  4. Tool behavior. Did the agent call the right tool with the right arguments at the right time?
  5. Refusals and escalation. Did it refuse unsafe requests and escalate ambiguous cases?
  6. Cost and latency. Is the workflow still economically viable under realistic usage?
  7. Human handoff. Can an operator understand what happened and take over?

The final answer is only one observable output. The path matters as much as the result.

The minimum eval set

For a serious pilot, I want at least:

  • 30-50 real examples from the business process;
  • a golden set with expected outcomes;
  • negative cases and adversarial cases;
  • regression checks for every prompt, retrieval, and tool change;
  • manual review for high-risk scenarios;
  • tracing that connects input, context, model calls, tool calls, and output.

This does not require a huge platform on day one. It requires discipline.

Common mistakes

Teams often make the same mistakes:

  • using synthetic examples that do not match production language;
  • testing only happy paths;
  • scoring answers but ignoring retrieval failures;
  • changing prompts without regression checks;
  • accepting “looks good” as a metric;
  • not separating model errors from product logic errors.

That is how pilots look impressive and then collapse when real users arrive.

A better question

Instead of asking “which model is best?”, ask:

What workflow behavior must remain stable after every model, prompt, retrieval, and tool change?

That is the foundation of production LLM engineering.

Have a similar AI task?

Send a short brief and I will suggest the smallest paid next step: consultation, audit, security review, or build.