You've seen it everywhere on LinkedIn: practitioners and researchers say things like, “our model hit 78% on the benchmark.”
Sounds good, right? It depends.
Researchers in AI measurement science (AI metrology) just published a new benchmarking analysis of 22 frontier LLMs, using three well‑known tests: GPQA‑Diamond (graduate‑level science questions), BIG‑Bench Hard (a set of especially challenging reasoning tasks), and Global‑MMLU Lite (a multilingual knowledge test balanced across languages).
In the new NIST AI 800‑3 report from CAISI and NIST’s Information Technology Laboratory, the authors note that common benchmark analyses can rely on implicit assumptions or produce inaccurate uncertainty estimates. Therefore, they propose a statistical modeling approach (including GLMMs) to distinguish benchmark accuracy from generalized accuracy and quantify uncertainty more rigorously.
GLMMs, or generalized linear mixed models, are a well-established statistical technique in fields like biostatistics and educational testing. Applied to benchmarks, they can model both the variation in difficulty across questions and the inconsistency of a model's responses to the same question across repeated trials.
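To make that concrete, here is a minimal sketch of what such a model can look like in practice: a random-intercept logistic GLMM fit with statsmodels on simulated benchmark results. The data, benchmark size, and parameter values are all hypothetical, and the report's actual model specifications may differ.

```python
# Hypothetical sketch: a random-intercept logistic GLMM for one model's
# benchmark results. Illustrative only; not the report's exact specification.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
n_items, n_trials = 100, 5          # made-up benchmark size and repeat count

# Simulate: each question gets its own difficulty (a random intercept);
# repeated trials on the same question vary only through Bernoulli noise.
item_effect = rng.normal(0.0, 1.2, size=n_items)
rows = []
for i in range(n_items):
    p = 1.0 / (1.0 + np.exp(-(0.8 + item_effect[i])))  # logit(p) = b0 + u_i
    for _ in range(n_trials):
        rows.append({"item": i, "correct": rng.binomial(1, p)})
df = pd.DataFrame(rows)

# Fixed intercept = overall (logit-scale) accuracy; the "item" variance
# component captures question-to-question difficulty spread.
glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ 1", vc_formulas={"item": "0 + C(item)"}, data=df)
fit = glmm.fit_vb()                  # variational Bayes fit
print(fit.summary())
```

Here the per-trial Bernoulli noise plays the role of run-to-run inconsistency, while the item random effect separates out how much of the score spread comes from which questions happen to be in the benchmark.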
The report identifies three issues in AI benchmark evaluations:
- Evaluations rely on implicit assumptions.
- Evaluations conflate performance on the specific questions in a benchmark (benchmark accuracy) with performance on the broader population of questions the benchmark is meant to represent (generalized accuracy).
- When confidence intervals are reported, they're often calculated using methods that don't match the metric being estimated (see the sketch after this list).
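The second and third issues are easiest to see side by side. In the hypothetical sketch below, a naive interval that treats every item-by-trial response as independent can be a defensible interval for benchmark accuracy, but it understates the uncertainty in generalized accuracy, where the relevant unit of resampling is the question itself. The data shapes and numbers are invented for illustration.

```python
# Minimal sketch of the CI-mismatch issue: a Wald interval over all
# trial-level responses vs. a bootstrap that resamples whole items.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_trials = 100, 5
item_p = rng.beta(2, 1, size=n_items)                 # per-item success rates
scores = rng.binomial(1, item_p[:, None], size=(n_items, n_trials))

acc = scores.mean()
n_total = scores.size

# Naive Wald CI: treats all item-x-trial responses as i.i.d. draws --
# a reasonable interval for accuracy on *these exact questions*.
se_naive = np.sqrt(acc * (1 - acc) / n_total)
print("naive 95% CI:", acc - 1.96 * se_naive, acc + 1.96 * se_naive)

# Item-level bootstrap: resamples questions, so the interval reflects
# uncertainty about accuracy on *new questions like these* (generalization).
boot = [scores[rng.integers(0, n_items, n_items)].mean() for _ in range(2000)]
print("item-bootstrap 95% CI:", np.quantile(boot, [0.025, 0.975]))
```

On data like this, the item-level bootstrap interval is typically noticeably wider than the naive one, because it accounts for which questions were sampled into the benchmark in the first place.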
NIST AI 800-3 argues that the statistical validity of LLM evaluations benefits from evaluators explicitly adopting a model for analyzing evaluation results and disclosing related assumptions.
The authors also outline when GLMMs are most appropriate:
- When a benchmark is not very large.
- When many LLMs are evaluated.
- When an evaluator wants to estimate the difficulty of individual items in a benchmark (see the sketch after this list).
- When an evaluator is interested in understanding sources of variation.
- When an explanatory statistical model is preferred to regression-free techniques.
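For the item-difficulty use case, the fitted GLMM already contains the relevant estimates. Continuing the hypothetical fit from the earlier sketch, the posterior means of the item random effects act as difficulty estimates on the logit scale:

```python
# Sketch, continuing the earlier hypothetical `fit`: per-item random-effect
# estimates double as item-difficulty estimates (logit scale).
re = fit.random_effects()       # posterior means/SDs of the item intercepts
means = re.iloc[:, 0]           # first column holds the posterior means
print("five hardest items:")    # most negative intercept = hardest
print(means.nsmallest(5))
```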
Why This Matters
For teams and researchers evaluating AI performance, the report adds a new tool to the toolbox: keep running benchmarks like BIG‑Bench Hard, but analyze results with GLMMs to estimate generalized accuracy and quantify uncertainty more explicitly, when the statistical assumptions are reasonable.
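Concretely, and again continuing the hypothetical fit above, a generalized-accuracy estimate falls out of the GLMM by averaging the success probability over the estimated distribution of question difficulties, rather than over the specific questions in the benchmark:

```python
# Sketch: generalized accuracy from the earlier hypothetical `fit`,
# marginalizing over new questions drawn from the estimated
# difficulty distribution.
import numpy as np

beta0 = fit.fe_mean[0]                # fixed intercept (logit scale)
sigma_item = np.exp(fit.vcp_mean[0])  # variance components stored as log-SDs

u = np.random.default_rng(2).normal(0.0, sigma_item, size=100_000)
gen_acc = (1.0 / (1.0 + np.exp(-(beta0 + u)))).mean()
print("estimated generalized accuracy:", round(float(gen_acc), 3))
```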
For policymakers and procurement teams, the benchmark vs. generalized accuracy distinction is directly relevant. If you’re deciding whether to deploy an AI system based on benchmark scores, you need to know whether those scores generalize and how much uncertainty surrounds them. The report argues that common evaluations can leave those questions under‑specified.
For anyone following AI development and metrology more broadly, the takeaway is that evaluation practice hasn't caught up with model capability. The report's core recommendations, that evaluators should choose an explicit statistical model, disclose its assumptions, and in some cases report results under multiple models, are standards we also endorse as a lab.
Read the report or summary to find out more. NIST AI 800-3, "Expanding the AI Evaluation Toolbox with Statistical Models," was published on February 19, 2026. The report was authored by Drew Keller, Kweku Kwegyir-Aggrey, Ryan Steed, Anita K. Rao, Julia L. Sharp, and A. Stevie Bergman at NIST's CAISI and ITL.
Join our newsletter for AI safety news and research updates