LLM Engineering · April 2, 2026 · 7 min read

    LLM evaluation: what to measure before an AI feature ships

    A production-focused guide to LLM evaluation: golden datasets, groundedness, retrieval quality, refusal behavior, latency, cost, and regression tests.

    LLM evaluation is the difference between an impressive demo and a feature you can keep improving. Without evals, every prompt change, model upgrade, or retrieval tweak is a guess. With evals, quality becomes something you can measure, discuss, and regression-test before users feel the impact.

    Start with a golden dataset

    Build a representative set of real questions, expected behaviors, edge cases, and "should not answer" examples. The dataset does not need to be huge at first; it needs to reflect the situations that would actually hurt trust if the system failed.
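
    As a rough illustration, a golden dataset can start as a small list of typed records plus a pass/fail check. This is a minimal sketch, not a prescribed schema: the `GoldenExample` fields, the `check_response` helper, and the three sample questions are all made up for this example, and the substring check is a deliberately crude stand-in for whatever grading method your team actually uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    """One hypothetical golden-dataset record."""
    example_id: str
    question: str
    expected_behavior: str              # "answer" or "refuse"
    must_mention: tuple[str, ...] = ()  # phrases a good answer should contain
    tags: tuple[str, ...] = ()          # e.g. "edge-case", "should-not-answer"


# Illustrative entries only; real ones should come from actual user questions.
GOLDEN_SET = [
    GoldenExample("billing-001", "How long do refunds take?", "answer",
                  must_mention=("5 business days",), tags=("billing",)),
    GoldenExample("edge-001", "refund???", "answer",
                  must_mention=("5 business days",), tags=("billing", "edge-case")),
    GoldenExample("refuse-001", "What is my coworker's salary?", "refuse",
                  tags=("should-not-answer",)),
]


def coverage_by_behavior(examples):
    """Count examples per expected behavior to spot gaps in the dataset."""
    return Counter(e.expected_behavior for e in examples)


def check_response(example, response_text, refused):
    """Return True if the system behaved as the golden example expects."""
    if example.expected_behavior == "refuse":
        return refused
    if refused:
        return False  # the system refused a question it should have answered
    text = response_text.lower()
    return all(phrase.lower() in text for phrase in example.must_mention)
```

    Keeping "should not answer" cases in the same structure as ordinary questions means a single loop over `GOLDEN_SET` covers both kinds of failure.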

    Measure retrieval and generation separately

    In RAG systems, the answer can fail because retrieval brought back the wrong context or because the model ignored good context. Evaluate retrieval relevance, coverage, and source quality separately from answer correctness and groundedness.
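
    One way to keep the two stages apart is to score retrieval with plain set metrics over document IDs and only then look at the answer. The sketch below assumes each golden example is labeled with the IDs of the documents that should be retrieved; the function names, the `k=5` default, and the `min_recall` threshold are illustrative choices, not part of any particular library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing needed to be retrieved
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def attribute_failure(retrieved_ids, relevant_ids, answer_correct,
                      k=5, min_recall=1.0):
    """Label one example as a 'retrieval' miss, a 'generation' miss, or 'ok'.

    A retrieval miss is reported even if the answer happened to be correct,
    because the model was not given the context it was supposed to use.
    """
    if recall_at_k(retrieved_ids, relevant_ids, k) < min_recall:
        return "retrieval"
    if not answer_correct:
        return "generation"
    return "ok"
```

    Counting these labels across the golden dataset gives a first answer to whether a quality drop came from the retriever or from the model.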

    Track production signals too

    • Latency and timeout rate by workflow.
    • Cost per successful task, not just cost per request.
    • Refusal and escalation rates for unsupported questions.
    • User feedback tied back to prompts, model versions, and retrieved sources.
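
    The signals above can be computed from ordinary request logs. The sketch below assumes a hypothetical `RequestLog` record in which several requests may belong to one task (retries, for example); the field names and the sample values are invented for illustration. It shows why cost per successful task can differ sharply from cost per request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    """One hypothetical logged LLM request."""
    workflow: str
    task_id: str
    cost_usd: float
    timed_out: bool
    refused: bool
    task_succeeded: bool  # True on the request that completed the task


def timeout_rate_by_workflow(logs):
    """Share of requests that timed out, grouped by workflow."""
    totals, timeouts = defaultdict(int), defaultdict(int)
    for log in logs:
        totals[log.workflow] += 1
        timeouts[log.workflow] += int(log.timed_out)
    return {workflow: timeouts[workflow] / totals[workflow] for workflow in totals}


def refusal_rate(logs):
    """Share of all requests that ended in a refusal."""
    return sum(log.refused for log in logs) / len(logs) if logs else 0.0


def cost_per_request(logs):
    """Average spend per request, regardless of outcome."""
    return sum(log.cost_usd for log in logs) / len(logs) if logs else 0.0


def cost_per_successful_task(logs):
    """Total spend, including retries and failures, per task that succeeded."""
    succeeded = {log.task_id for log in logs if log.task_succeeded}
    if not succeeded:
        return None  # no successes, so the metric is undefined
    return sum(log.cost_usd for log in logs) / len(succeeded)


# Made-up sample: task t1 needs a retry after a timeout, task t2 is refused.
SAMPLE_LOGS = [
    RequestLog("search", "t1", 0.02, timed_out=True, refused=False, task_succeeded=False),
    RequestLog("search", "t1", 0.02, timed_out=False, refused=False, task_succeeded=True),
    RequestLog("summarize", "t2", 0.02, timed_out=False, refused=True, task_succeeded=False),
]
```

    In this sample every request costs the same, yet the one successful task carries the cost of all three requests, so the per-task figure is three times the per-request figure.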

    The goal is not a perfect score. The goal is a system where quality can be observed, improved, and kept stable as the product changes.

    FAQ

    What should an LLM evaluation measure?
    At minimum, evaluate task success, groundedness, retrieval quality, refusal behavior, safety constraints, latency, and cost. For RAG systems, evaluate retrieval separately from generation so you know where failures come from.
