Ai2 opens olmo-eval for models still in progress

Ai2 has released olmo-eval, an open-source workbench for comparing models inside the development loop rather than after it.

Ai2, the Allen Institute for Artificial Intelligence, published olmo-eval on June 12, 2026, as an open-source workbench for evaluating large language models during development, not only after a model is finished. The official post describes it as an extension of OLMES, the 2024 standard created to make model scores easier to compare across releases. The shift that matters is where evaluation sits: it becomes part of the build loop, with repeated comparisons between checkpoints, instead of a final judgment attached to a frozen model.

During model training, every change in data, architecture, hyperparameters, or scale can improve some behaviors while weakening others. Traditional benchmarks mostly answer the question, "What score does this model get?" olmo-eval is aimed at a more operational question: "What changed since the previous checkpoint, and is that change reliable or just statistical noise?" Ai2 highlights per-question analysis, standard errors, and the minimum detectable effect, meaning the smallest difference between two checkpoints that a benchmark of a given size can reliably separate from random variation.
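
To make the statistics concrete, here is a minimal Python sketch of a paired, per-question comparison between two checkpoints. It is not olmo-eval's code or API: the function name `compare_checkpoints`, the default significance level and power, and the toy scores are assumptions for illustration, and the minimum detectable effect uses the common normal-approximation formula, which may differ from the tool's own calculation.

```python
import math
from statistics import NormalDist, mean, stdev


def compare_checkpoints(scores_old, scores_new, alpha=0.05, power=0.8):
    """Paired comparison of per-question scores from two checkpoints.

    Hypothetical helper, not part of olmo-eval. Both lists must cover
    the same questions in the same order.
    """
    if len(scores_old) != len(scores_new):
        raise ValueError("both checkpoints must be scored on the same questions")
    if len(scores_old) < 2:
        raise ValueError("need at least two questions to estimate variance")

    # Per-question differences keep the pairing, which removes
    # question-difficulty variance from the comparison.
    diffs = [new - old for old, new in zip(scores_old, scores_new)]
    delta = mean(diffs)

    # Standard error of the mean difference.
    se = stdev(diffs) / math.sqrt(len(diffs))

    # Normal-approximation MDE: the smallest true difference a two-sided
    # test at level `alpha` would detect with probability `power`.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    mde = (z_alpha + z_power) * se

    return {
        "delta": delta,
        "se": se,
        "mde": mde,
        "significant": abs(delta) > z_alpha * se,
    }


# Toy data: 1 = correct, 0 = incorrect, for ten questions.
old = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
new = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]
result = compare_checkpoints(old, new)
print(result)
```

In this toy run the new checkpoint gains 0.2 in accuracy, but with only ten questions the minimum detectable effect is about 0.37, so the gain cannot be separated from noise. That is the kind of conclusion a bare score would hide.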

The project also separates what is being evaluated from how the model is run. A task defines the dataset, request format, and scoring method. A harness defines the runtime context: model, tools, isolated environment, optional judge model, and execution policy. That separation lets the same benchmark run as a plain baseline, as a tool-using task, or inside a containerized sandbox when the evaluation needs to execute code or simulate an interaction. A sandbox, in this context, is an isolated environment that limits the effects of a model action on the rest of the system.
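
The separation can be pictured with a short Python sketch. Every name here (`Task`, `Harness`, `run`, the toy arithmetic dataset, and the two stand-in "models") is hypothetical and chosen only to mirror the description above; olmo-eval's real classes and signatures may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    """What is evaluated: dataset, request format, and scoring method."""
    name: str
    dataset: Sequence[dict]
    format_request: Callable[[dict], str]
    score: Callable[[str, dict], float]


@dataclass(frozen=True)
class Harness:
    """How the model is run: the model call plus its runtime context."""
    name: str
    generate: Callable[[str], str]      # stand-in for a model call
    tools: Tuple[str, ...] = ()         # tool names exposed to the model
    sandboxed: bool = False             # whether execution is isolated
    judge_model: Optional[str] = None   # optional judge, unused in this sketch


def run(task: Task, harness: Harness) -> List[float]:
    """Score every example of one task under one harness."""
    scores = []
    for example in task.dataset:
        prompt = task.format_request(example)
        output = harness.generate(prompt)
        scores.append(task.score(output, example))
    return scores


def constant_model(prompt: str) -> str:
    # Stand-in baseline that always gives the same answer.
    return "4"


def calculator_model(prompt: str) -> str:
    # Stand-in for a tool-using run: adds the integers after the colon.
    expression = prompt.split(":", 1)[1]
    return str(sum(int(term) for term in expression.split("+")))


toy_math = Task(
    name="toy_math",
    dataset=[{"question": "2 + 2", "answer": "4"},
             {"question": "3 + 5", "answer": "8"}],
    format_request=lambda ex: f"Compute: {ex['question']}",
    score=lambda output, ex: 1.0 if output.strip() == ex["answer"] else 0.0,
)

baseline = Harness(name="baseline", generate=constant_model)
with_tool = Harness(name="with_calculator", generate=calculator_model,
                    tools=("calculator",))

baseline_scores = run(toy_math, baseline)
tool_scores = run(toy_math, with_tool)
print(baseline_scores, tool_scores)
```

The task definition is untouched between the two runs; only the harness changes. Any difference in scores can therefore be attributed to how the model was run, not to a change in the benchmark.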

The relevance goes beyond the OLMo project itself. Teams fine-tuning models or building agents often need to add an internal evaluation quickly, then rerun it dozens of times as weights and prompts change. If every test requires a custom integration, measurements arrive too late or become inconsistent. olmo-eval offers a registry of benchmark tasks and composable suites, named variants, support for inference providers such as vLLM and LiteLLM, and normalized storage for predictions at both the aggregate and instance levels.
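
Storing results at the instance level is what makes "what changed?" answerable after the fact. The sketch below assumes a flat record per prediction; the `InstanceRecord` fields, the helper names, and the checkpoint labels are illustrative guesses, not olmo-eval's actual storage schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class InstanceRecord:
    """One prediction for one question (hypothetical schema)."""
    task: str
    variant: str
    checkpoint: str
    instance_id: str
    prediction: str
    score: float


def aggregate(records):
    """Roll instance records up to one mean score per (task, variant, checkpoint)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.task, r.variant, r.checkpoint)].append(r.score)
    return {key: mean(values) for key, values in groups.items()}


def changed_instances(records, task, variant, old, new):
    """Return ids of questions whose score differs between two checkpoints."""
    by_checkpoint = {old: {}, new: {}}
    for r in records:
        if r.task == task and r.variant == variant and r.checkpoint in by_checkpoint:
            by_checkpoint[r.checkpoint][r.instance_id] = r.score
    shared = by_checkpoint[old].keys() & by_checkpoint[new].keys()
    return sorted(i for i in shared
                  if by_checkpoint[old][i] != by_checkpoint[new][i])


records = [
    InstanceRecord("toy_math", "zero_shot", "step1000", "q1", "4", 1.0),
    InstanceRecord("toy_math", "zero_shot", "step1000", "q2", "4", 0.0),
    InstanceRecord("toy_math", "zero_shot", "step2000", "q1", "4", 1.0),
    InstanceRecord("toy_math", "zero_shot", "step2000", "q2", "8", 1.0),
]

summary = aggregate(records)
flipped = changed_instances(records, "toy_math", "zero_shot", "step1000", "step2000")
print(summary)
print(flipped)
```

The aggregate view shows the score moving from 0.5 to 1.0; the instance view shows that the whole change comes from question `q2`. Keeping both in one normalized format means neither has to be recomputed when a new checkpoint arrives.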

This release does not solve the broader benchmark quality problem. Benchmarks can still be incomplete, contaminated, or poorly aligned with real use. It does, however, provide a concrete way to make evaluation less handmade. As models improve through rapid iteration and agents increasingly need to be tested with their tools, the ability to see exactly what changed between two versions becomes almost as important as the final score published in a model card.