Causely is the missing intelligence layer for your Production Ops agents

We ran 72 experiments across four agent configurations, with and without Causely.

The results show Causely improves both Coding and SRE agents: 63% faster diagnoses, 60% fewer tokens, 100% root-cause accuracy, and hallucinated incidents eliminated or halved.

Key Findings

Causely improves every agent on every dimension that matters

Across 72 runs, access to Causely improved every agent configuration on every measured dimension. The four numbers below are means across the four configurations: Coding (Claude Code, Codex) and SRE (HolmesGPT) agents. Per-configuration breakdowns follow in the sections below.
- 63% faster time to diagnosis: seconds from prompt to a correct root cause
- 60% fewer tokens per run: total model tokens consumed per investigation
- 14% higher root-cause accuracy: average improvement across agent configurations
- 57% lower cost per investigation: average API spend per run

Causely reduces token consumption and improves response times across all agent configurations.

Tokens per run (thousands)

Time to diagnosis (seconds)

Speed

Causely transforms raw telemetry into actionable knowledge, resulting in diagnoses that arrive 63% faster on average

Causely maintains a real-time semantic and causal understanding of your environment. Agents no longer need to reconstruct that understanding through multiple expensive CLI and tool calls on every query. Both Coding and SRE agents see mean time-to-diagnosis fall by more than half, and every individual configuration improves with Causely. SRE agents show the largest gain because more of their baseline runtime was spent reconstructing the environment state.

Mean time to diagnosis, Coding vs SRE agents

Mean time to diagnosis, by agent configuration

| Configuration | Base (s) | +Causely (s) | % Change |
|---|---|---|---|
| Claude Code | 91.7 | 30.5 | −66.7% |
| Codex | 49.3 | 32.2 | −34.8% |
| HolmesGPT (Gemini Pro 3) | 74.5 | 12.8 | −82.8% |
| HolmesGPT (Claude Sonnet) | 74.7 | 23.7 | −68.3% |
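The 63% headline figure is the mean of the per-configuration percent changes in the table above, which is easy to check:

```python
# Per-configuration time-to-diagnosis reductions (percent), from the table above.
reductions = [66.7, 34.8, 82.8, 68.3]

mean_reduction = sum(reductions) / len(reductions)
print(f"Mean time-to-diagnosis reduction: {mean_reduction:.0f}%")  # 63%
```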
Accuracy

Causely understands complex topologies to drive accurate root cause analysis

Causely derives causality graphs across your service topology, anticipating common root-cause and symptom patterns based on data flow and interactions across services. Both Coding (Claude Code, Codex) and SRE agents (HolmesGPT) see meaningful accuracy gains under Causely, with the largest improvements on questions that require evidence across multiple services. Our experiments find that baseline AI agents are more likely to misdiagnose symptoms as root causes or to conflate anomalies with incidents. Causely provides agents with both root-cause diagnoses from its causal inference engine and a real-time structured interpretation of environment state, grounded in observed telemetry and topology.
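The core intuition can be sketched in a few lines. This is a toy illustration of topology-aware root-cause localization, not Causely's algorithm: given a service dependency graph, a symptom at one service can only be explained by that service or something it transitively calls, and the most upstream degraded candidate is the root cause. The topology and telemetry below are hypothetical.

```python
# Toy sketch (illustrative only, not Causely's implementation).
# Edges point from caller to callee; service names are hypothetical.
DEPENDS_ON = {
    "frontend": ["checkout"],
    "checkout": ["payment", "cart"],
    "payment": [],
    "cart": [],
}

def downstream(graph, start):
    """Everything `start` transitively depends on (fault candidates)."""
    seen, stack = set(), [start]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# A symptom at checkout can be explained by checkout or anything it calls.
candidates = {"checkout"} | downstream(DEPENDS_ON, "checkout")

# Degraded services observed in telemetry (hypothetical).
degraded = {"frontend", "checkout", "payment"}

# Root causes: degraded candidates with no degraded dependency of their own.
root_causes = [s for s in candidates & degraded
               if not set(DEPENDS_ON[s]) & degraded]
print(root_causes)  # ['payment']
```

Without the graph, an agent has to rediscover these relationships from raw telemetry on every query; with it, checkout's errors point directly to payment rather than to the noisier downstream symptoms.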

Diagnostic accuracy, active-fault scenario

Per-configuration accuracy

| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 100% | 100% |
| Codex | 83% | 100% |
| HolmesGPT (Gemini Pro 3) | 83% | 100% |
| HolmesGPT (Claude Sonnet) | 83% | 100% |

Causely preserves accuracy where baseline agents break down

Baseline agents hold up on single-entity tasks but drop to around 75% accuracy on impact analysis and root-cause diagnosis, the categories that require correlating evidence across multiple services. These are also the most token-expensive at baseline (694K mean tokens, 148 seconds mean time), so agents spend the most context exactly where they are least reliable. Because Causely already represents the cross-service relationships in its topology and causality model, accuracy holds up where baseline breaks down.

Per-use-case accuracy and resource use (pooled across configurations)

| Use case | Acc. base | Acc. +C | Time base | Time +C | Tokens base | Tokens +C |
|---|---|---|---|---|---|---|
| Health assessment | 87.5% | 100% | 44s | 19s | 151K | 100K |
| Impact analysis | 75.0% | 91.7% | 116s | 31s | 427K | 181K |
| Root cause diagnosis | 75.0% | 100% | 148s | 57s | 694K | 351K |
| Remediation / triage | 100% | 100% | 50s | 29s | 233K | 176K |
Hallucinations

Causely stops AI agents from fabricating incidents that don't exist

Causely provides agents with a clear signal when no root cause is active, so healthy environments are recognized as healthy instead of mistaken for a hidden incident. Both Coding and SRE agents hallucinate less under Causely, with SRE agents eliminating hallucinations entirely. Our experiments find that baseline agents, faced with a healthy cluster and a leading query, keep searching raw telemetry until normal traffic starts to resemble an incident. Causely removes the ambiguity by giving the agent an accurate interpretation of environment state using advanced machine learning and learned symptom analytics.

Hallucination rate, healthy-baseline scenario

| Configuration | Base | +Causely |
|---|---|---|
| Claude Code | 0% | 0% |
| Codex | 67% | 33% |
| HolmesGPT (Gemini Pro 3) | 0% | 0% |
| HolmesGPT (Claude Sonnet) | 67% | 0% |

Causely tells agents when nothing is wrong

Causely continuously builds dynamic models of expected behavior for key service metrics and applies machine learning to decide when an anomaly actually constitutes a symptom rather than routine variation. Without that distinction, baseline agents treat every deviation as a potential incident and keep searching, burning hundreds of thousands of tokens on healthy clusters. With Causely's explicit "no active root cause" response, the agent has a grounded stop condition and token consumption on healthy investigations drops by roughly half to two thirds in three of four configurations.
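The stop condition can be pictured as a guard at the top of the agent loop. This is a hypothetical sketch; the function and field names are invented for illustration, not Causely's API.

```python
# Hypothetical agent loop showing how an explicit "no active root cause"
# signal bounds exploration. `query_causal_model` and its response shape
# are invented for illustration.

def investigate(query, query_causal_model, explore_step, max_steps=20):
    verdict = query_causal_model(query)
    if not verdict["active_root_causes"]:
        # Grounded stop condition: report healthy instead of probing telemetry.
        return "No active incidents."
    findings = []
    for _ in range(max_steps):  # a baseline agent has only this budget cap
        done, evidence = explore_step(verdict, findings)
        findings.append(evidence)
        if done:
            break
    return findings
```

Without the early return, the only brakes on a healthy-cluster investigation are the step budget and the context window, which is exactly where the baseline token spikes come from.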

Token consumption per run, healthy-baseline scenario

* HolmesGPT (Gemini Pro 3) is the exception. On one outlier run, the model continued probing even after Causely reported no active root cause. The configuration still reached the correct answer but consumed more tokens than its baseline.

Efficiency

Causely reduces tool calls by 4.8×

Because Causely delivers a grounded diagnosis alongside a structured interpretation of environment state, agents spend their reasoning budget on resolution rather than exploratory search across raw telemetry. Both Coding and SRE agents see their tool-call volume collapse by roughly 4.8× and their token consumption drop by 60%, with the heaviest baseline run (813K tokens) resolving in 111K under Causely. The worst-case run is the envelope that matters operationally, since it determines context-window pressure and exposure to provider rate limits.

Tool calls per run, active-fault scenario

Token consumption per run, active-fault scenario

Token consumption, active-fault scenario

| Configuration | Avg base | Avg +C | Max base | Max +C | % Change |
|---|---|---|---|---|---|
| Claude Code | 126K | 56K | 278K | 58K | −55.7% |
| Codex | 467K | 216K | 615K | 456K | −53.7% |
| HolmesGPT (Gemini Pro 3) | 334K | 94K | 813K | 111K | −71.7% |
| HolmesGPT (Claude Sonnet) | 304K | 126K | 416K | 158K | −58.4% |

Tool invocations per investigation, active-fault scenario

| Configuration | Avg base | Avg +C | Max base | Max +C | % Change |
|---|---|---|---|---|---|
| Claude Code | 13.7 | 4.0 | 29 | 4 | −70.7% |
| Codex | 22.0 | 4.5 | 33 | 8 | −79.5% |
| HolmesGPT (Gemini Pro 3) | 16.3 | 3.3 | 28 | 4 | −79.6% |
| HolmesGPT (Claude Sonnet) | 23.3 | 3.8 | 30 | 5 | −83.6% |
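The 4.8× figure follows directly from the average-call columns above: the mean baseline call count divided by the mean call count under Causely.

```python
# Mean tool calls per run, from the table above.
avg_base = [13.7, 22.0, 16.3, 23.3]
avg_causely = [4.0, 4.5, 3.3, 3.8]

ratio = (sum(avg_base) / len(avg_base)) / (sum(avg_causely) / len(avg_causely))
print(f"Tool-call reduction: {ratio:.1f}x")  # 4.8x
```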
Reliability improves alongside efficiency. Coding agents that rely on free-form shell access incur between 1.5 and 3.5 failed tool calls per baseline run, driven by mistyped commands and missing binaries. Under Causely's typed causal interface, failed calls fall to zero.
Cost

Causely reduces cost per investigation by 57%

Mean per-run cost falls by roughly half for both Coding and SRE agents, with the best configuration dropping 76%. Baseline cost also spikes when telemetry is ambiguous or the scenario is healthy, while cost under Causely stays bounded and predictable, the property that makes agent-driven investigation economically viable at volume.

Mean cost per run, active-fault scenario

Cost per run, active-fault scenario

| Configuration | Base | +Causely | % Change |
|---|---|---|---|
| Claude Code | $0.216 | $0.117 | −46.0% |
| Codex | $0.119 | $0.049 | −58.6% |
| HolmesGPT (Gemini Pro 3) | $0.044 | $0.011 | −75.7% |
| HolmesGPT (Claude Sonnet) | $0.286 | $0.150 | −47.4% |
What the benchmark reveals

Causal intelligence is the missing piece for reliable, production-grade AI agents

The same agent, given the same prompt and the same telemetry, produces a materially better answer, faster and at lower cost, when it can query a causal model of the environment instead of reconstructing one from raw signals. Across 72 runs, diagnoses arrived 63% faster, consumed 60% fewer tokens, cost 57% less per investigation, and hallucinated incidents dropped by roughly 75% on average.
For engineering leadership

Incident management gets faster and more reliable. Causal intelligence cuts mean time-to-diagnosis by more than half and raises accuracy across every agent configuration, which directly lowers the incident burden on engineers and shortens the window from page to resolution across the org.

For platform and SRE teams

Your existing agents, tools, and observability stack stay in place. Only the information the agent reasons from changes, from raw telemetry to a grounded diagnosis and structured environment state. Time, tokens, and accuracy all improve in lockstep.

For finance and procurement

Today, causal intelligence saves expensive developer hours on every investigation, directly lowering the cost of incident response. Looking ahead, as AI providers move to usage-based pricing, agent token consumption turns into a balance-sheet liability. Staying ahead means running agents that are both efficient and accurate, not ones burning tokens reconstructing context.

Methodology

How the benchmark was run

The study is a fully crossed factorial benchmark across four agent configurations (spanning Coding and SRE agents) and two causal access levels (baseline, and baseline plus the Causely MCP server). Each of the eight resulting cells ran under two scenarios: an active-fault scenario, with a code-level defect injected into the payment service, and a healthy-baseline scenario, with the same application running without faults. Active-fault cells received six queries and healthy-baseline cells three, totaling 72 runs. Prompts, models, and permissions were held constant across conditions within a configuration, so any observed difference is attributable to the presence or absence of Causely.
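The run count follows from the design arithmetic:

```python
# Run count for the fully crossed design described above.
configurations = 4        # Claude Code, Codex, HolmesGPT (Gemini), HolmesGPT (Claude)
causal_access_levels = 2  # baseline, baseline + Causely MCP server
queries_per_cell = 6 + 3  # 6 active-fault queries + 3 healthy-baseline queries

total_runs = configurations * causal_access_levels * queries_per_cell
print(total_runs)  # 72
```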

Environment

OpenTelemetry Astronomy Shop (CNCF-maintained, polyglot, 24 microservices across Kubernetes, PostgreSQL, Valkey/Redis, Kafka, OpenSearch, feature flags). Deployed on a local kind cluster with native OTel instrumentation. The Causely mediator ran alongside the application in treatment runs. Baseline runs used the same cluster without the Causely MCP server, holding raw telemetry constant across conditions.

Agent configurations

Two Coding agents (Claude Code with Claude Sonnet, Codex with GPT-5.4-mini), each with shell and kubectl access. Two SRE agents (HolmesGPT with Gemini Pro 3 Flash Lite, HolmesGPT with Claude Sonnet), each with HolmesGPT's standard built-in toolsets (task planning, bash, pod logs, cluster queries, network probes, ephemeral debug pods). Treatment condition adds the Causely MCP server without removing any baseline tools.
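Wiring an MCP server into an agent typically amounts to a single configuration entry. A hypothetical Claude Code-style `.mcp.json` fragment is shown below; the server name, transport, and URL are illustrative placeholders, not the actual Causely MCP server configuration, and other agents (Codex, HolmesGPT) have their own equivalent config formats.

```json
{
  "mcpServers": {
    "causely": {
      "type": "http",
      "url": "http://localhost:8090/mcp"
    }
  }
}
```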

Fault and rubric

A code-level defect was injected to produce a complex, multi-service failure: the blast radius spans a mix of healthy and degraded services, with no infrastructure alarms or process-level failures to anchor the investigation. Responses were scored against a pre-registered rubric defined prior to data collection.

| Measurement | Definition |
|---|---|
| Wall-clock time | Seconds from prompt submission to final agent response. |
| Correctness | Binary score against a pre-registered rubric. |
| Token consumption | Input and output tokens reported by the provider's usage fields. |
| Tool invocations | Count of tool calls per run, stratified by tool category. |
| Query cost | Derived from token counts at published per-token pricing. |
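Query cost is a mechanical function of token usage, along the lines of the sketch below. The prices are illustrative placeholders (per million tokens), not the rates used in the study.

```python
# Derive per-run cost from token counts at per-token pricing.
# Prices are hypothetical placeholders, per million tokens.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

print(f"${run_cost(120_000, 4_000):.3f}")  # $0.420
```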
Query Catalog

The exact queries submitted to each agent

The healthy-baseline queries use identical phrasing to the first three active-fault queries, including the phrase "I'm seeing CheckoutServiceHighRequestErrors", to control for phrasing effects and to measure whether agents can distinguish a genuine incident from a mistaken premise. This is intentionally challenging. An agent that always confirms what the user implies will score well on fault-scenario accuracy but generate unacceptable false-positive rates on healthy baselines.

Healthy-Baseline Scenario, 3 queries

Application running with normal traffic and no injected faults. Correct answer to all three: "no active incidents."

Q1

Health assessment

What's the current health of the otel-demo namespace? Are there any active incidents or issues?

Q2

Impact analysis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?

Q3

Root cause diagnosis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?

Active-Fault Scenario, 6 queries

A code-level defect in the payment service rejects every transaction and propagates to three downstream services. The correct root cause is the payment-service defect, not any downstream symptom.

Q1

Health assessment

What's the current health of the otel-demo namespace? Are there any active incidents or issues?

Q2

Impact analysis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What services are impacted and how widespread is it?

Q3

Root cause diagnosis

I'm seeing CheckoutServiceHighRequestErrors in the otel-demo namespace. What's the root cause?

Q4

Root cause diagnosis (ownership projection)

My team owns the payment service. We got paged about checkout errors. Is this our fault?

Q5

Remediation and triage

Is it safe to restart the payment pods, or is there a deeper issue we need to fix first?

Q6

Impact analysis

Checkout is down. Is this a single team issue or do multiple teams need to be involved?

See what Causely does for the agents in your environment

Talk to an engineer about running the benchmark on your own telemetry, or integrating Causely with an existing agent workflow.