Causely is the missing intelligence layer for your Production Ops agents
We ran 72 experiments across four agent configurations, with and without Causely.
The results show Causely improves both Coding and SRE agents: 63% faster diagnoses, 60% fewer tokens, 100% root-cause accuracy, and hallucinated incidents eliminated or halved.
Causely improves every agent on every dimension that matters
Causely reduces token consumption and improves response times across all agent configurations.
Tokens per run (thousands)
Time to diagnosis (seconds)
Causely transforms raw telemetry into actionable knowledge, cutting time to diagnosis by 63% on average and by up to 83%
Mean time to diagnosis, Coding vs SRE agents
Mean time to diagnosis, by agent configuration
| Configuration | Base (s) | +Causely (s) | % Change |
|---|---|---|---|
| Claude Code | 91.7 | 30.5 | −66.7% |
| Codex | 49.3 | 32.2 | −34.8% |
| HolmesGPT (Gemini Pro 3) | 74.5 | 12.8 | −82.8% |
| HolmesGPT (Claude Sonnet) | 74.7 | 23.7 | −68.3% |
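As a sanity check, the headline 63% figure matches the unweighted average of the per-configuration reductions in the table above (a quick sketch using the rounded table values, so the last digit may differ slightly from the unrounded data):

```python
# Mean time to diagnosis (seconds), from the table above.
base = [91.7, 49.3, 74.5, 74.7]      # baseline, per configuration
causely = [30.5, 32.2, 12.8, 23.7]   # +Causely, per configuration

# Per-configuration percent reduction, then the unweighted average.
reductions = [100 * (1 - c / b) for b, c in zip(base, causely)]
avg_reduction = sum(reductions) / len(reductions)
print(round(avg_reduction))  # 63
```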
Causely understands complex topologies to drive accurate root cause analysis
Diagnostic accuracy, active-fault scenario
Per-configuration accuracy
| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 100% | 100% |
| Codex | 83% | 100% |
| HolmesGPT (Gemini Pro 3) | 83% | 100% |
| HolmesGPT (Claude Sonnet) | 83% | 100% |
Causely preserves accuracy where baseline agents break down
Per-use-case accuracy and resource use (pooled across configurations)
| Use case | Accuracy (base) | Accuracy (+Causely) | Time (base) | Time (+Causely) | Tokens (base) | Tokens (+Causely) |
|---|---|---|---|---|---|---|
| Health assessment | 87.5% | 100% | 44s | 19s | 151K | 100K |
| Impact analysis | 75.0% | 91.7% | 116s | 31s | 427K | 181K |
| Root cause diagnosis | 75.0% | 100% | 148s | 57s | 694K | 351K |
| Remediation / triage | 100% | 100% | 50s | 29s | 233K | 176K |
Causely stops AI agents from fabricating incidents that don't exist
Hallucination rate, healthy-baseline scenario
| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 0% | 0% |
| Codex | 67% | 33% |
| HolmesGPT (Gemini Pro 3) | 0% | 0% |
| HolmesGPT (Claude Sonnet) | 67% | 0% |
Causely tells agents when nothing is wrong
Token consumption per run, healthy-baseline scenario
* HolmesGPT (Gemini Pro 3) is the one exception: on a single outlier run, the model continued probing even after Causely reported no active root cause. The configuration still reached the correct answer but consumed more tokens than its baseline.
Causely reduces tool calls by 4.8×
Tool calls per run, active-fault scenario
Token consumption per run, active-fault scenario
Token consumption, active-fault scenario
| Configuration | Avg (base) | Avg (+Causely) | Max (base) | Max (+Causely) | % Change |
|---|---|---|---|---|---|
| Claude Code | 126K | 56K | 278K | 58K | −55.7% |
| Codex | 467K | 216K | 615K | 456K | −53.7% |
| HolmesGPT (Gemini Pro 3) | 334K | 94K | 813K | 111K | −71.7% |
| HolmesGPT (Claude Sonnet) | 304K | 126K | 416K | 158K | −58.4% |
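The summary's "60% fewer tokens" claim can be reproduced from the table above by pooling average token counts across configurations (a sketch using the rounded table values in thousands):

```python
base = [126, 467, 334, 304]     # avg tokens per run (thousands), baseline
causely = [56, 216, 94, 126]    # avg tokens per run (thousands), +Causely

# Pooled reduction: total +Causely tokens versus total baseline tokens.
pooled = 100 * (1 - sum(causely) / sum(base))
print(round(pooled))  # 60
```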
Tool invocations per investigation, active-fault scenario
| Configuration | Avg (base) | Avg (+Causely) | Max (base) | Max (+Causely) | % Change |
|---|---|---|---|---|---|
| Claude Code | 13.7 | 4.0 | 29 | 4 | −70.7% |
| Codex | 22.0 | 4.5 | 33 | 8 | −79.5% |
| HolmesGPT (Gemini Pro 3) | 16.3 | 3.3 | 28 | 4 | −79.6% |
| HolmesGPT (Claude Sonnet) | 23.3 | 3.8 | 30 | 5 | −83.6% |
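The 4.8× headline follows directly from the averages in this table (a sketch; the table's rounded values reproduce the ratio to one decimal place):

```python
base = [13.7, 22.0, 16.3, 23.3]   # avg tool calls per run, baseline
causely = [4.0, 4.5, 3.3, 3.8]    # avg tool calls per run, +Causely

# Ratio of mean baseline tool calls to mean +Causely tool calls.
ratio = (sum(base) / len(base)) / (sum(causely) / len(causely))
print(round(ratio, 1))  # 4.8
```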
Causely reduces cost per investigation by 57%
Mean cost per run, active-fault scenario
Cost per run, active-fault scenario
| Configuration | Base | +Causely | % Change |
|---|---|---|---|
| Claude Code | $0.216 | $0.117 | −46.0% |
| Codex | $0.119 | $0.049 | −58.6% |
| HolmesGPT (Gemini Pro 3) | $0.044 | $0.011 | −75.7% |
| HolmesGPT (Claude Sonnet) | $0.286 | $0.150 | −47.4% |
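Likewise, the 57% cost headline is consistent with averaging the per-configuration reductions in this table (a sketch from the rounded dollar figures):

```python
base = [0.216, 0.119, 0.044, 0.286]     # mean $ per run, baseline
causely = [0.117, 0.049, 0.011, 0.150]  # mean $ per run, +Causely

# Unweighted average of per-configuration percent reductions.
reductions = [100 * (1 - c / b) for b, c in zip(base, causely)]
avg_reduction = sum(reductions) / len(reductions)
print(round(avg_reduction))  # 57
```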
Causal intelligence is the missing piece for reliable, production-grade AI agents
Incident management gets faster and more reliable. Causal intelligence cuts mean time-to-diagnosis by more than half and raises accuracy across every agent configuration, which directly lowers the incident burden on engineers and shortens the window from page to resolution across the org.
Your existing agents, tools, and observability stack stay in place. Only the information the agent reasons from changes: raw telemetry is replaced by a grounded diagnosis and structured environment state. Time, tokens, and accuracy all improve in lockstep.
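To make "grounded diagnosis and structured environment state" concrete, here is a purely illustrative sketch of the difference in input shape. Every field name and value below is invented for illustration and does not reflect Causely's actual API or payloads:

```python
# Baseline: the agent starts from raw telemetry and must reconstruct
# causality itself -- logs, metrics, and traces fanned out across services.
# (Illustrative values only.)
raw_telemetry = {
    "logs": ["payment-service: charge rejected", "checkout: upstream 500"],
    "metrics": {"checkout.error_rate": 0.42, "payment.error_rate": 0.97},
    "traces": ["<thousands of spans>"],
}

# Treatment (hypothetical shape): a grounded diagnosis the agent can reason
# from directly, instead of rebuilding context one tool call at a time.
grounded_diagnosis = {
    "root_cause": {"entity": "payment-service", "defect": "code-level"},
    "impacted": ["checkout", "frontend", "email"],
    "status": "active",
}
```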
Today, causal intelligence saves expensive developer hours on every investigation, directly lowering the cost of incident response. Looking ahead, as agent workloads scale under usage-based pricing, token consumption becomes a first-order cost. Staying ahead means running agents that are both efficient and accurate, not ones burning tokens reconstructing context.
How the benchmark was run
Environment
OpenTelemetry Astronomy Shop (CNCF-maintained, polyglot, 24 microservices across Kubernetes, PostgreSQL, Valkey/Redis, Kafka, OpenSearch, feature flags). Deployed on a local kind cluster with native OTel instrumentation. The Causely mediator ran alongside the application in treatment runs. Baseline runs used the same cluster without the Causely MCP server, holding raw telemetry constant across conditions.
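A minimal sketch of standing up a comparable environment. The cluster and release names are arbitrary; the chart is the community `opentelemetry-demo` Helm chart, and the Causely mediator install step is omitted because it is deployment-specific:

```shell
# Create a local kind cluster.
kind create cluster --name otel-demo

# Install the OpenTelemetry Astronomy Shop via the community Helm chart.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install otel-demo open-telemetry/opentelemetry-demo \
  --namespace otel-demo --create-namespace

# Verify the workloads are up before running any agent queries.
kubectl get pods -n otel-demo
```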
Agent configurations
Two Coding agents (Claude Code with Claude Sonnet, Codex with GPT-5.4-mini), each with shell and kubectl access. Two SRE agents (HolmesGPT with Gemini Pro 3 Flash Lite, HolmesGPT with Claude Sonnet), each with HolmesGPT's standard built-in toolsets (task planning, bash, pod logs, cluster queries, network probes, ephemeral debug pods). Treatment condition adds the Causely MCP server without removing any baseline tools.
Fault and rubric
A code-level defect was injected to produce a complex, multi-service failure: the blast radius spans a mix of healthy and degraded services, with no infrastructure alarms or process-level failures to anchor the investigation. Responses were scored against a pre-registered rubric defined prior to data collection.
| Measurement | Definition |
|---|---|
| Wall-clock time | Seconds from prompt submission to final agent response. |
| Correctness | Binary score against a pre-registered rubric. |
| Token consumption | Input and output tokens reported by the provider's usage fields. |
| Tool invocations | Count of tool calls per run, stratified by tool category. |
| Query cost | Derived from token counts at published per-token pricing. |
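The "Query cost" row can be reproduced mechanically from the token counts; a sketch with hypothetical per-million-token prices (the actual published rates vary by provider and are not listed here):

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in dollars, given token counts and per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical prices: $0.25 / M input tokens, $1.25 / M output tokens.
cost = query_cost(400_000, 8_000, 0.25, 1.25)
print(round(cost, 2))  # 0.11
```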
The exact queries submitted to each agent
Healthy-Baseline Scenario, 3 queries
Application running with normal traffic and no injected faults. Correct answer to all three: "no active incidents."
Health assessment
What's the current health of the otel-demo namespace? Are there any active incidents or issues?
Impact analysis
I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?
Root cause diagnosis
I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?
Active-Fault Scenario, 6 queries
A code-level defect in the payment service rejects every transaction and propagates to three downstream services. The correct root cause is the payment-service defect, not any downstream symptom.
Health assessment
What's the current health of the otel-demo namespace? Are there any active incidents or issues?
Impact analysis
I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?
Root cause diagnosis
I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?
Root cause diagnosis (ownership projection)
My team owns the payment service. We got paged about checkout errors. Is this our fault?
Remediation and triage
Is it safe to restart the payment pods, or is there a deeper issue we need to fix first?
Impact analysis
Checkout is down. Is this a single team issue or do multiple teams need to be involved?
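The nine queries above, run once per cell, account for the 72 experiments cited in the summary (assuming one run per query per configuration per condition, which the per-query accuracy denominators suggest):

```python
configurations = 4      # Claude Code, Codex, HolmesGPT (x2 models)
conditions = 2          # baseline vs +Causely
queries = 3 + 6         # healthy-baseline + active-fault scenarios

print(configurations * conditions * queries)  # 72
```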
See what Causely does for the agents in your environment
Talk to an engineer about running the benchmark on your own telemetry, or integrating Causely with an existing agent workflow.