Causely is the missing intelligence layer for your Production Ops agents

We ran 72 experiments across four agent configurations, with and without Causely.

The results show Causely improves both Coding and SRE agents: 63% faster diagnoses, 60% fewer tokens, 100% root-cause accuracy, and hallucinated incidents eliminated or halved.

Key Findings

Causely improves every agent on every dimension that matters

Across 72 runs, access to Causely improved every agent configuration on every measured dimension. The four numbers below are means across the four configurations: Coding (Claude Code, Codex) and SRE (HolmesGPT) agents. Per-configuration breakdowns follow in the sections below.
- 63% faster time to diagnosis: seconds from prompt to a correct root cause
- 60% fewer tokens per run: total model tokens consumed per investigation
- 14% higher root-cause accuracy: average improvement across agent configurations
- 57% lower cost per investigation: average API spend per run

Causely reduces token consumption and improves response times across all agent configurations.

Tokens per run (thousands)

Time to diagnosis (seconds)

Speed

Causely transforms raw telemetry into actionable knowledge, resulting in diagnoses that arrive 63% faster on average

Causely maintains a real-time semantic and causal understanding of your environment. Agents no longer need to reconstruct that understanding through multiple expensive CLI and tool calls on every query. Both Coding and SRE agents see mean time-to-diagnosis fall by more than half, and every individual configuration improves with Causely. SRE agents show the largest gain because more of their baseline runtime was spent reconstructing the environment state.

Mean time to diagnosis, Coding vs SRE agents

Mean time to diagnosis, by agent configuration

| Configuration | Base (s) | +Causely (s) | % Change |
|---|---|---|---|
| Claude Code | 91.7 | 30.5 | −66.7% |
| Codex | 49.3 | 32.2 | −34.8% |
| HolmesGPT (Gemini Pro 3) | 74.5 | 12.8 | −82.8% |
| HolmesGPT (Claude Sonnet) | 74.7 | 23.7 | −68.3% |
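The 63% headline figure is the mean of the per-configuration percent changes in the table above, which is easy to check:

```python
# Per-configuration time-to-diagnosis reductions (percent), from the table above.
reductions = [66.7, 34.8, 82.8, 68.3]

mean_reduction = sum(reductions) / len(reductions)
print(f"Mean time-to-diagnosis reduction: {mean_reduction:.0f}%")  # 63%
```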
Accuracy

Causely understands complex topologies to drive accurate root cause analysis

Causely derives causality graphs across your service topology, anticipating common root-cause and symptom patterns based on data flow and interactions across services. Both Coding (Claude Code, Codex) and SRE agents (HolmesGPT) see meaningful accuracy gains under Causely, with the largest improvements on questions that require evidence across multiple services. Our experiments find that baseline AI agents are more likely to misdiagnose symptoms as root causes or to conflate anomalies with incidents. Causely provides agents with both root-cause diagnoses from its causal inference engine and a real-time structured interpretation of environment state, grounded in observed telemetry and topology.
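The core intuition can be sketched in a few lines. This is a toy illustration of topology-aware root-cause localization, not Causely's algorithm: given a service dependency graph, a symptom at one service can only be explained by that service or something it transitively calls, and the most upstream degraded candidate is the root cause. The topology and telemetry below are hypothetical.

```python
# Toy sketch (illustrative only, not Causely's implementation).
# Edges point from caller to callee; service names are hypothetical.
DEPENDS_ON = {
    "frontend": ["checkout"],
    "checkout": ["payment", "cart"],
    "payment": [],
    "cart": [],
}

def downstream(graph, start):
    """Everything `start` transitively depends on (fault candidates)."""
    seen, stack = set(), [start]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# A symptom at checkout can be explained by checkout or anything it calls.
candidates = {"checkout"} | downstream(DEPENDS_ON, "checkout")

# Degraded services observed in telemetry (hypothetical).
degraded = {"frontend", "checkout", "payment"}

# Root causes: degraded candidates with no degraded dependency of their own.
root_causes = [s for s in candidates & degraded
               if not set(DEPENDS_ON[s]) & degraded]
print(root_causes)  # ['payment']
```

Without the graph, an agent has to rediscover these relationships from raw telemetry on every query; with it, checkout's errors point directly to payment rather than to the noisier downstream symptoms.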

Diagnostic accuracy, active-fault scenario

Per-configuration accuracy

| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 100% | 100% |
| Codex | 83% | 100% |
| HolmesGPT (Gemini Pro 3) | 83% | 100% |
| HolmesGPT (Claude Sonnet) | 83% | 100% |

Causely preserves accuracy where baseline agents break down

Baseline agents hold up on single-entity tasks but drop to around 75% accuracy on impact analysis and root-cause diagnosis, the categories that require correlating evidence across multiple services. These are also the most token-expensive at baseline (694K mean tokens, 148 seconds mean time), so agents spend the most context exactly where they are least reliable. Because Causely already represents the cross-service relationships in its topology and causality model, accuracy holds up where baseline breaks down.

Per-use-case accuracy and resource use (pooled across configurations)

| Use case | Acc. base | Acc. +C | Time base | Time +C | Tokens base | Tokens +C |
|---|---|---|---|---|---|---|
| Health assessment | 87.5% | 100% | 44s | 19s | 151K | 100K |
| Impact analysis | 75.0% | 91.7% | 116s | 31s | 427K | 181K |
| Root cause diagnosis | 75.0% | 100% | 148s | 57s | 694K | 351K |
| Remediation / triage | 100% | 100% | 50s | 29s | 233K | 176K |
Hallucinations

Causely stops AI agents from fabricating incidents that don't exist

Causely provides agents with a clear signal when no root cause is active, so healthy environments are recognized as healthy instead of mistaken for a hidden incident. Both Coding and SRE agents hallucinate less under Causely, with SRE agents eliminating hallucinations entirely. Our experiments find that baseline agents, faced with a healthy cluster and a leading query, keep searching raw telemetry until normal traffic starts to resemble an incident. Causely removes the ambiguity by giving the agent an accurate interpretation of environment state using advanced machine learning and learned symptom analytics.

Hallucination rate, healthy-baseline scenario

| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 0% | 0% |
| Codex | 67% | 33% |
| HolmesGPT (Gemini Pro 3) | 0% | 0% |
| HolmesGPT (Claude Sonnet) | 67% | 0% |

Causely tells agents when nothing is wrong

Causely continuously builds dynamic models of expected behavior for key service metrics and applies machine learning to decide when an anomaly actually constitutes a symptom rather than routine variation. Without that distinction, baseline agents treat every deviation as a potential incident and keep searching, burning hundreds of thousands of tokens on healthy clusters. With Causely's explicit "no active root cause" response, the agent has a grounded stop condition and token consumption on healthy investigations drops by roughly half to two thirds in three of four configurations.
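The stop condition can be pictured as a guard at the top of the agent loop. This is a hypothetical sketch; the function and field names are invented for illustration, not Causely's API.

```python
# Hypothetical agent loop showing how an explicit "no active root cause"
# signal bounds exploration. `query_causal_model` and its response shape
# are invented for illustration.

def investigate(query, query_causal_model, explore_step, max_steps=20):
    verdict = query_causal_model(query)
    if not verdict["active_root_causes"]:
        # Grounded stop condition: report healthy instead of probing telemetry.
        return "No active incidents."
    findings = []
    for _ in range(max_steps):  # a baseline agent has only this budget cap
        done, evidence = explore_step(verdict, findings)
        findings.append(evidence)
        if done:
            break
    return findings
```

Without the early return, the only brakes on a healthy-cluster investigation are the step budget and the context window, which is exactly where the baseline token spikes come from.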

Token consumption per run, healthy-baseline scenario

* HolmesGPT (Gemini Pro 3) is the exception. On one outlier run, the model continued probing even after Causely reported no active root cause. The configuration still reached the correct answer but consumed more tokens than its baseline.

Efficiency

Causely reduces tool calls by 4.8×

Because Causely delivers a grounded diagnosis alongside a structured interpretation of environment state, agents spend their reasoning budget on resolution rather than exploratory search across raw telemetry. Both Coding and SRE agents see their tool-call volume collapse by roughly 4.8× and their token consumption drop by 60%, with the heaviest baseline run (813K tokens) resolving in 111K under Causely. The worst-case run is the envelope that matters operationally, since it determines context-window pressure and exposure to provider rate limits.

Tool calls per run, active-fault scenario

Token consumption per run, active-fault scenario

Token consumption, active-fault scenario

| Configuration | Avg base | Avg +C | Max base | Max +C | % Change |
|---|---|---|---|---|---|
| Claude Code | 126K | 56K | 278K | 58K | −55.7% |
| Codex | 467K | 216K | 615K | 456K | −53.7% |
| HolmesGPT (Gemini Pro 3) | 334K | 94K | 813K | 111K | −71.7% |
| HolmesGPT (Claude Sonnet) | 304K | 126K | 416K | 158K | −58.4% |

Tool invocations per investigation, active-fault scenario

| Configuration | Avg base | Avg +C | Max base | Max +C | % Change |
|---|---|---|---|---|---|
| Claude Code | 13.7 | 4.0 | 29 | 4 | −70.7% |
| Codex | 22.0 | 4.5 | 33 | 8 | −79.5% |
| HolmesGPT (Gemini Pro 3) | 16.3 | 3.3 | 28 | 4 | −79.6% |
| HolmesGPT (Claude Sonnet) | 23.3 | 3.8 | 30 | 5 | −83.6% |
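The 4.8× figure follows directly from the average-call columns above: the mean baseline call count divided by the mean call count under Causely.

```python
# Mean tool calls per run, from the table above.
avg_base = [13.7, 22.0, 16.3, 23.3]
avg_causely = [4.0, 4.5, 3.3, 3.8]

ratio = (sum(avg_base) / len(avg_base)) / (sum(avg_causely) / len(avg_causely))
print(f"Tool-call reduction: {ratio:.1f}x")  # 4.8x
```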
Reliability improves alongside efficiency. Coding agents that rely on free-form shell access incur between 1.5 and 3.5 failed tool calls per baseline run, driven by mistyped commands and missing binaries. Under Causely's typed causal interface, failed calls fall to zero.
Cost

Causely reduces cost per investigation by 57%

Mean per-run cost falls by roughly half for both Coding and SRE agents, with the best configuration dropping 76%. Baseline cost also spikes when telemetry is ambiguous or the scenario is healthy, while cost under Causely stays bounded and predictable, the property that makes agent-driven investigation economically viable at volume.

Mean cost per run, active-fault scenario

Cost per run, active-fault scenario

| Configuration | Base | +Causely | % Change |
|---|---|---|---|
| Claude Code | $0.216 | $0.117 | −46.0% |
| Codex | $0.119 | $0.049 | −58.6% |
| HolmesGPT (Gemini Pro 3) | $0.044 | $0.011 | −75.7% |
| HolmesGPT (Claude Sonnet) | $0.286 | $0.150 | −47.4% |
What the benchmark reveals

Causal intelligence is the missing piece for reliable, production-grade AI agents

The same agent, given the same prompt and the same telemetry, produces a materially better answer, faster and at lower cost, when it can query a causal model of the environment instead of reconstructing one from raw signals. Across 72 runs, diagnoses arrived 63% faster, consumed 60% fewer tokens, cost 57% less per investigation, and hallucinated incidents dropped by roughly 75% on average.
For engineering leadership

Incident management gets faster and more reliable. Causal intelligence cuts mean time-to-diagnosis by more than half and raises accuracy across every agent configuration, which directly lowers the incident burden on engineers and shortens the window from page to resolution across the org.

For platform and SRE teams

Your existing agents, tools, and observability stack stay in place. Only the information the agent reasons from changes, from raw telemetry to a grounded diagnosis and structured environment state. Time, tokens, and accuracy all improve in lockstep.

For finance and procurement

Today, causal intelligence saves expensive developer hours on every investigation, directly lowering the cost of incident response. Looking ahead, as AI providers move to usage-based pricing, agent token consumption turns into a balance-sheet liability. Staying ahead means running agents that are both efficient and accurate, not ones burning tokens reconstructing context.

Methodology

How the benchmark was run

The study is a fully crossed factorial benchmark across four agent configurations (spanning Coding and SRE agents) and two causal access levels (baseline, and baseline plus the Causely MCP server). Each of the eight resulting cells ran under two scenarios: an active-fault scenario, with a code-level defect injected into the payment service, and a healthy-baseline scenario, with the same application running without faults. Active-fault cells received six queries and healthy-baseline cells three, totaling 72 runs. Prompts, models, and permissions were held constant across conditions within a configuration, so any observed difference is attributable to the presence or absence of Causely.
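The run count follows from the design arithmetic:

```python
# Run count for the fully crossed design described above.
configurations = 4        # Claude Code, Codex, HolmesGPT (Gemini), HolmesGPT (Claude)
causal_access_levels = 2  # baseline, baseline + Causely MCP server
queries_per_cell = 6 + 3  # 6 active-fault queries + 3 healthy-baseline queries

total_runs = configurations * causal_access_levels * queries_per_cell
print(total_runs)  # 72
```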

Environment

OpenTelemetry Astronomy Shop (CNCF-maintained, polyglot, 24 microservices across Kubernetes, PostgreSQL, Valkey/Redis, Kafka, OpenSearch, feature flags). Deployed on a local kind cluster with native OTel instrumentation. The Causely mediator ran alongside the application in treatment runs. Baseline runs used the same cluster without the Causely MCP server, holding raw telemetry constant across conditions.

Agent configurations

Two Coding agents (Claude Code with Claude Sonnet, Codex with GPT-5.4-mini), each with shell and kubectl access. Two SRE agents (HolmesGPT with Gemini Pro 3 Flash Lite, HolmesGPT with Claude Sonnet), each with HolmesGPT's standard built-in toolsets (task planning, bash, pod logs, cluster queries, network probes, ephemeral debug pods). Treatment condition adds the Causely MCP server without removing any baseline tools.
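Wiring an MCP server into an agent typically amounts to a single configuration entry. A hypothetical Claude Code-style `.mcp.json` fragment is shown below; the server name, transport, and URL are illustrative placeholders, not the actual Causely MCP server configuration, and other agents (Codex, HolmesGPT) have their own equivalent config formats.

```json
{
  "mcpServers": {
    "causely": {
      "type": "http",
      "url": "http://localhost:8090/mcp"
    }
  }
}
```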

Fault and rubric

A code-level defect was injected to produce a complex, multi-service failure: the blast radius spans a mix of healthy and degraded services, with no infrastructure alarms or process-level failures to anchor the investigation. Responses were scored against a pre-registered rubric defined prior to data collection.

| Measurement | Definition |
|---|---|
| Wall-clock time | Seconds from prompt submission to final agent response. |
| Correctness | Binary score against a pre-registered rubric. |
| Token consumption | Input and output tokens reported by the provider's usage fields. |
| Tool invocations | Count of tool calls per run, stratified by tool category. |
| Query cost | Derived from token counts at published per-token pricing. |
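Query cost is a mechanical function of token usage, along the lines of the sketch below. The prices are illustrative placeholders (per million tokens), not the rates used in the study.

```python
# Derive per-run cost from token counts at per-token pricing.
# Prices are hypothetical placeholders, per million tokens.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

print(f"${run_cost(120_000, 4_000):.3f}")  # $0.420
```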
Query Catalog

The exact queries submitted to each agent

The healthy-baseline queries use identical phrasing to the first three active-fault queries, including the phrase "I'm seeing CheckoutServiceHighRequestErrors", to control for phrasing effects and to measure whether agents can distinguish a genuine incident from a mistaken premise. This is intentionally challenging. An agent that always confirms what the user implies will score well on fault-scenario accuracy but generate unacceptable false-positive rates on healthy baselines.

Healthy-Baseline Scenario, 3 queries

Application running with normal traffic and no injected faults. Correct answer to all three: "no active incidents."

Q1

Health assessment

What's the current health of the otel-demo namespace? Are there any active incidents or issues?

Q2

Impact analysis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?

Q3

Root cause diagnosis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?

Active-Fault Scenario, 6 queries

A code-level defect in the payment service rejects every transaction and propagates to three downstream services. The correct root cause is the payment-service defect, not any downstream symptom.

Q1

Health assessment

What's the current health of the otel-demo namespace? Are there any active incidents or issues?

Q2

Impact analysis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?

Q3

Root cause diagnosis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?

Q4

Root cause diagnosis (ownership projection)

My team owns the payment service. We got paged about checkout errors. Is this our fault?

Q5

Remediation and triage

Is it safe to restart the payment pods, or is there a deeper issue we need to fix first?

Q6

Impact analysis

Checkout is down. Is this a single team issue or do multiple teams need to be involved?

See what Causely does for the agents in your environment

Talk to an engineer about running the benchmark on your own telemetry, or integrating Causely with an existing agent workflow.