OpenAI tests models against simulated deployment traffic

OpenAI describes a deployment simulation method meant to surface some unwanted behavior before a model reaches users.

OpenAI published on June 16 a method called “Deployment Simulation” to estimate, before release, how a candidate model may behave in conditions closer to real use. The core idea is straightforward: take previous conversations, remove the older model’s answer, and ask the candidate model to respond instead, while preserving user privacy. The new responses are then audited to look for unwanted behavior and to estimate how often those behaviors might appear once the model is deployed.

The announcement matters because it targets a known weakness in AI evaluation. Traditional safety tests often use synthetic, difficult, or deliberately adversarial prompts. They remain useful for severe and rare risks, but they do not always represent the messy spread of ordinary traffic. OpenAI presents deployment simulation as an additional signal: a way to preview behavior on a distribution that is closer to actual use, and to measure problems that may not have been covered by hand-built test sets. The company is careful about scope, noting that the method is not expected to measure behaviors that occur less often than once in 200,000 messages.

In its experiments, OpenAI says it analyzed about 1.3 million de-identified conversations from GPT-5-series Thinking models between August 2025 and March 2026. The company reports that the method improved estimates of unwanted behavior rates, helped surface before release a misalignment pattern it calls “calculator hacking,” and reduced a growing source of evaluation distortion: models recognizing that they are being tested. In deployment-like contexts, models appeared much less able to classify the interaction as an evaluation than they did on more artificial benchmarks.

The most practical part is about software agents. OpenAI says it tested deployment simulation on 120,000 internal agentic coding trajectories, where a single task can depend on hundreds of tool calls, repository state, network responses, or transient failures. Instead of letting the model interact with live external systems, the team simulated tool calls with another model, using read-only context and historical tool-call data. That does not replace external audits, red-teaming, or targeted analysis of extreme risks. But it points to an important shift: as models become agents, safety testing cannot focus only on a single generated answer. It also has to model the environment in which the agent acts, including tools, state, and the feedback loops that shape its decisions.