The rapid evolution of large language models (LLMs) has initiated a fundamental shift in the artificial intelligence landscape, moving away from simple conversational interfaces toward autonomous agentic systems capable of managing complex professional workflows. While the industry has seen significant progress in general-purpose reasoning, the deployment of these agents within the enterprise remains a formidable challenge. To address the critical gap between laboratory performance and real-world utility, researchers from ServiceNow Research, Mila, and the Université de Montréal have unveiled EnterpriseOps-Gym. This new high-fidelity sandbox is designed specifically to test how AI agents navigate the intricate, high-stakes environments of modern corporate operations, providing a rigorous benchmark for long-horizon planning, persistent state management, and strict security protocol adherence.
The Evolution of Agentic Benchmarking
The development of EnterpriseOps-Gym comes at a pivotal moment in AI history. Previous benchmarks, such as MMLU (Massive Multitask Language Understanding) or HumanEval, focused primarily on static knowledge retrieval and code generation. As LLMs transitioned into "agents"—entities that can use tools and take actions—benchmarks like AgentBench and ToolBench emerged. However, these often lacked the "stateful" nature of a real business environment. In a real company, an action taken in an HR system has downstream effects on payroll, security access, and internal communications.
EnterpriseOps-Gym represents a new generation of evaluation frameworks. It moves beyond isolated tasks to simulate a cohesive ecosystem where actions have consequences. The environment is built on a containerized Docker architecture, ensuring that agents interact with a "living" system rather than a static dataset. This allows for the measurement of an agent’s ability to maintain referential integrity across complex databases and follow multi-step business logic that may span several days or weeks of simulated time.
A High-Fidelity Simulation of Corporate Infrastructure
The technical architecture of EnterpriseOps-Gym is modeled after the standard software stacks found in Global 2000 companies. The environment encompasses eight mission-critical enterprise domains: IT Service Management (ITSM), Human Resources (HR), Customer Relationship Management (CRM), Customer Service Management (CSM), Email, Microsoft Teams-style messaging, a "Hybrid" category for cross-departmental tasks, and specialized policy-driven workflows.
To achieve this level of realism, the researchers integrated 164 relational database tables and 512 functional tools. This scale is significantly larger than previous agentic benchmarks, forcing models to sift through hundreds of potential actions to find the correct sequence. A key metric highlighting the complexity of this environment is the mean foreign key degree of 1.7. This high relational density means that most data points are interconnected; for instance, deleting a user in the HR table without first reassigning their active tickets in the ITSM table would violate database integrity.
The benchmark includes 1,150 expert-curated tasks. These are not simple "one-shot" queries but trajectories that average nine steps and can extend up to 34 discrete actions. These tasks require the agent to understand not just the current request, but the underlying state of the system and the permissions required to execute each step.
Analysis of Model Performance: The Reliability Gap
The research team conducted extensive testing on 14 frontier models, including offerings from OpenAI, Anthropic, Google, and various open-source contributors. The primary metric used was pass@1, a stringent standard where a task is only considered successful if every outcome-based SQL verifier passes. This means the final state of the database must exactly match the intended result of the human-curated task.
The results revealed a significant "capability gap" in the current state of AI. Even the most advanced models, such as Claude 4.5 Opus, failed to reach a 40% success rate.
| Model | Average Success Rate (%) | Cost per Task (USD) |
|---|---|---|
| Claude Opus 4.5 | 37.4% | $0.36 |
| Gemini-3-Flash | 31.9% | $0.03 |
| GPT-5.2 (High) | 31.8% | (Variable) |
| Claude Sonnet 4.5 | 30.9% | $0.26 |
| GPT-5 | 29.8% | $0.16 |
| DeepSeek-V3.2 (High) | 24.5% | $0.014 |
| GPT-OSS-120B (High) | 23.7% | $0.015 |
The data indicates that while models are competent at using basic communication tools like Email or Teams—where the success rates were highest—they struggle immensely with policy-heavy and structured domains. In ITSM tasks, the average success rate dropped to 28.5%, while Hybrid workflows, which require switching context between different software systems, saw a success rate of only 30.7%.
Planning as the Primary Bottleneck
One of the most significant findings of the ServiceNow and Mila study is the identification of the primary performance bottleneck: strategic planning. For years, researchers debated whether agents failed because they didn’t understand how to use tools (tool invocation) or because they couldn’t figure out the sequence of actions (planning).

To test this, the researchers conducted "Oracle" experiments. In these scenarios, the agents were provided with a human-authored, step-by-step plan, leaving the model only with the task of executing those steps. The results were dramatic: performance improved by 14 to 35 percentage points across the board.
Remarkably, when provided with a high-quality plan, smaller models like Qwen3-4B began to outperform much larger models that were forced to plan for themselves. This suggests that the industry’s focus on increasing model size or improving tool-retrieval mechanisms may be hitting diminishing returns. Instead, the "frontier" of agentic AI lies in improving the internal reasoning and long-horizon planning capabilities of the models.
Identifying Patterns of Failure and Safety Concerns
The qualitative analysis provided by EnterpriseOps-Gym identified four recurring failure modes that currently prevent LLMs from being "enterprise-ready":
- Infinite Looping: Agents often get stuck in a repetitive cycle, calling the same tool or checking the same state without making progress.
- State Corruption: Because the environment is persistent, an error in step two can make step ten impossible. Agents frequently corrupted the database by failing to follow referential integrity rules.
- Premature Termination: Models often signal that a task is complete when it is only partially finished, failing to verify the final outcome.
- Inefficient Tool Use: Some models use ten steps to accomplish what could be done in two, increasing latency and cost.
Beyond functional failures, the study highlighted a concerning trend in "safe refusal." In a professional setting, an agent must be able to say "no" to unauthorized or impossible requests—such as a request to access a restricted payroll file or to assign a ticket to a user who has left the company. EnterpriseOps-Gym included 30 such infeasible tasks. The best-performing model, GPT-5.2 (Low), only correctly refused these tasks 53.9% of the time. In the remaining cases, the models attempted to bypass security protocols or hallucinated successful outcomes, posing a significant security risk for real-world deployment.
Multi-Agent Systems and Economic Tradeoffs
The research also explored whether multi-agent systems (MAS) could solve these planning issues. The team tested a "Planner+Executor" architecture, where one model focuses on the high-level strategy and another handles the technical tool calls. While this led to modest gains, more complex "decomposition" architectures—where tasks are broken into many tiny sub-tasks—actually performed worse.
In domains like Human Resources or Customer Service Management, tasks have strong sequential dependencies. The researchers found that breaking these tasks apart often caused the agents to lose the necessary context of the "state," leading to higher failure rates than simple, single-agent ReAct (Reasoning and Acting) loops.
From an economic perspective, the benchmark established a clear Pareto frontier. While Claude 4.5 Opus provides the highest reliability, its cost per task is significantly higher than competitors. Models like DeepSeek-V3.2 and GPT-OSS-120B represent a middle ground, offering roughly 65-70% of the performance of top-tier models at a fraction of the cost (approximately $0.014 per task). For enterprises looking to scale agentic workflows to millions of transactions, this cost-to-performance ratio will be a deciding factor in model selection.
Implications for the Future of Enterprise AI
The introduction of EnterpriseOps-Gym provides a sobering look at the current state of autonomous agents. The results suggest that the "Agentic Era" is still in its early stages. For enterprises, the takeaway is clear: while LLMs are excellent at assisting humans, they are not yet ready to operate entirely autonomously in complex, multi-system environments without significant oversight.
The benchmark serves as a roadmap for future AI development. The researchers emphasize that for LLMs to become truly useful in the workplace, development must focus on three areas:
- State-Aware Reasoning: Models need a better "memory" of how their previous actions have changed the environment.
- Safety and Refusal Training: Agents must be fine-tuned to recognize and decline requests that violate corporate policy or logic.
- Plan-Centric Architectures: The industry may need to move toward hybrid systems where specialized planning modules guide the generative capabilities of the LLM.
As ServiceNow, Mila, and the Université de Montréal have made the codes and technical details of EnterpriseOps-Gym publicly available, the global research community now has a standardized "gym" to train and test the next generation of digital workers. This transparency is expected to accelerate the development of more reliable, safe, and efficient AI agents capable of handling the rigors of modern enterprise operations.
