LLM evals for production: what to test before an agent touches real users

2026-05-19

If an LLM system has no evaluations, it is not production-ready. It is a demo with hope attached.

Evals are not only about answer quality. In enterprise systems, they are how you control regressions, security, cost, latency, and operational risk.

What needs to be evaluated

For most LLM and agent workflows, I split evals into seven layers.

Intent recognition. Did the system understand what the user is trying to do?
Retrieval quality. Did it find the right documents, snippets, records, or past cases?
Answer quality. Is the answer correct, grounded, complete, and useful?
Tool behavior. Did the agent call the right tool with the right arguments at the right time?
Refusals and escalation. Did it refuse unsafe requests and escalate ambiguous cases?
Cost and latency. Is the workflow still economically viable under realistic usage?
Human handoff. Can an operator understand what happened and take over?

The final answer is only one observable output. The path matters as much as the result.
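To make "the path matters" concrete, here is a minimal sketch that scores a recorded run layer by layer instead of grading only the final answer. Everything in it is an assumption for illustration: the `Trace` and `Expectation` schemas, the field names, the budget defaults, and the substring check standing in for a real answer grader. Human handoff is left out because this sketch has no obvious automatic check for it.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    """One recorded run of the workflow (hypothetical schema)."""
    intent: str
    retrieved_ids: list[str]
    tool_calls: list[ToolCall]
    answer: str
    refused: bool
    escalated: bool
    cost_usd: float
    latency_ms: float


@dataclass
class Expectation:
    """What one eval case expects from the run (hypothetical schema)."""
    intent: str
    relevant_ids: set[str]
    tool_calls: list[ToolCall]
    must_contain: list[str]
    should_refuse: bool = False
    should_escalate: bool = False
    max_cost_usd: float = 0.05      # illustrative budget, not a recommendation
    max_latency_ms: float = 5000.0  # illustrative budget, not a recommendation


def check_layers(trace: Trace, exp: Expectation) -> dict[str, bool]:
    """Return one pass/fail flag per layer instead of a single score."""
    answer = trace.answer.lower()
    return {
        "intent": trace.intent == exp.intent,
        # Retrieval passes only if every relevant document was found.
        "retrieval": exp.relevant_ids <= set(trace.retrieved_ids),
        # Substring matching is a crude stand-in for a real grader.
        "answer": all(s.lower() in answer for s in exp.must_contain),
        # Dataclass equality compares tool names and arguments, in order.
        "tools": trace.tool_calls == exp.tool_calls,
        "refusal_escalation": (
            trace.refused == exp.should_refuse
            and trace.escalated == exp.should_escalate
        ),
        "cost_latency": (
            trace.cost_usd <= exp.max_cost_usd
            and trace.latency_ms <= exp.max_latency_ms
        ),
    }


# Example case with made-up identifiers and values.
trace = Trace(
    intent="refund_request",
    retrieved_ids=["policy-12", "faq-3"],
    tool_calls=[ToolCall("lookup_order", {"order_id": "A1"})],
    answer="Your order A1 is eligible for a refund within 30 days.",
    refused=False,
    escalated=False,
    cost_usd=0.01,
    latency_ms=1200.0,
)
exp = Expectation(
    intent="refund_request",
    relevant_ids={"policy-12"},
    tool_calls=[ToolCall("lookup_order", {"order_id": "A1"})],
    must_contain=["refund", "30 days"],
)
results = check_layers(trace, exp)
```

Keeping the flags separate is the point of the design: a run can produce a plausible answer while failing the tool or retrieval check, and a single blended score would hide that.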

For a serious pilot, I want at least:

This does not require a huge platform on day one. It requires discipline.

Teams often make the same mistakes:

That is how pilots look impressive and then collapse when real users arrive.

Instead of asking “which model is best?”, ask:

What workflow behavior must remain stable after every model, prompt, retrieval, and tool change?
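One way to turn that question into a check is a regression gate: run the same eval cases before and after a change, then flag any check whose pass rate dropped. The sketch below assumes each eval case yields a dict of named pass/fail flags; the function names, the 2-point tolerance, and the sample data are all hypothetical.

```python
def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of eval cases that passed each check."""
    if not results:
        raise ValueError("no eval results to aggregate")
    checks = results[0].keys()
    return {c: sum(r[c] for r in results) / len(results) for c in checks}


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed tolerance; tune per workflow
) -> dict[str, tuple[float, float]]:
    """Checks whose pass rate dropped by more than `tolerance`.

    A check missing from the candidate counts as a pass rate of 0.0.
    """
    return {
        c: (baseline[c], candidate.get(c, 0.0))
        for c in baseline
        if candidate.get(c, 0.0) < baseline[c] - tolerance
    }


# Made-up results for four eval cases, before and after a prompt change.
before = [
    {"tools": True, "answer": True},
    {"tools": True, "answer": True},
    {"tools": True, "answer": True},
    {"tools": True, "answer": False},
]
after = [
    {"tools": True, "answer": True},
    {"tools": False, "answer": True},
    {"tools": True, "answer": True},
    {"tools": False, "answer": True},
]
failed = regressions(pass_rates(before), pass_rates(after))
```

In this made-up data the answer check improves while the tool check drops from 1.0 to 0.5, so `failed` contains only `"tools"`. That is the kind of change a gate like this is meant to block, even when the answers look better.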

That is the foundation of production LLM engineering.