Making Observability Work: From Hype to Causal Insight

Severin Neumann
July 3, 2025

A few weeks back, I joined Charity Majors, Paige Cruz, Avi Freedman, Shahar Azulay, and Adam LaGreca for a roundtable on the state of modern observability. It was an honest conversation about where we are, what’s broken, and where things are heading. You can read the full summary on The New Stack. This exchange inspired me to write down my thoughts and to expand on them.
Let’s Not Rename Observability — Let’s Make It Work
Every few months, a new term pops up: understandability, explainability, controllability. And sure, we all want better systems. But do we really need new language? Or do we need better outcomes?
As I said in the panel: six people will have twelve opinions on what observability means. But the real point is this: users don’t care what we call it! They want to catch root causes in production early and minimize the business impact during an incident. And that requires systems that deliver causal, actionable insights that allow teams to return their software to a healthy state quickly.
Renaming observability and packaging it as something else does not get us closer to the outcome everyone wants – they are just words without real action and change to back them up. We have seen this before, when “monitoring” was replaced with “observability” without any actual change.
Observability should support the full software lifecycle, from design to incident to fix. If it’s not helping you move faster and safer, then it’s just more dashboards. And no one wants that. Let’s not get stuck in terminology and instead get back to building systems that help us move.
Value Over Volume
One of the loudest themes from the roundtable: cost. Observability spend is skyrocketing, while the value people get is… questionable.
We are stuck in a cycle of collecting everything, just in case! But volume does not equal value. More logs, more traces, and more storage don’t solve problems; they mostly just add noise.
Any vendor claiming to innovate in observability has to answer this question: how do you shift the focus from collecting more data to delivering only useful insights?
Causely’s answer: we do mediation at the cluster level (i.e., edge-based processing) and send only distilled insights to the cloud. Our system doesn’t aim to collect all the data. It aims to understand what’s wrong, fast. That means we’re not just watching metrics; we’re diagnosing causal chains. That’s how we make observability affordable and, more importantly, useful.
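To make that concrete, here is a minimal, hypothetical sketch of edge-side distillation (not Causely’s actual implementation; the service name, SLO threshold, and payload fields are invented for illustration). Raw latency samples are summarized inside the cluster, and only a small insight record leaves it when something looks unhealthy:

```python
# Hypothetical edge-side distillation sketch (Python 3.10+).
# Raw samples never leave the cluster; only a compact insight does.
import statistics

LATENCY_SLO_MS = 250.0  # illustrative threshold, not from the post


def distill(service: str, latency_samples_ms: list[float]) -> dict | None:
    """Return a small insight record if the latency SLO is breached, else None."""
    if len(latency_samples_ms) < 20:
        return None  # not enough data to say anything useful
    p95 = statistics.quantiles(latency_samples_ms, n=20)[-1]  # ~95th percentile
    if p95 <= LATENCY_SLO_MS:
        return None  # healthy: nothing is sent upstream
    return {
        "service": service,
        "symptom": "latency_slo_breach",
        "p95_ms": round(p95, 1),
        "sample_count": len(latency_samples_ms),
    }


# Thousands of raw samples reduce to one small payload (or to nothing at all):
insight = distill("checkout", [120.0, 140.0, 130.0, 480.0, 510.0] * 200)
if insight:
    print(insight)  # only this record would be shipped to the cloud
```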
From Optional to Invisible: The Future of OpenTelemetry
During the roundtable I said that OpenTelemetry will have won when people use it without realizing it: when the libraries, frameworks, and programming languages we use every day ship with out-of-the-box integration. Developers will use it like any other core feature of their language, like if-statements, variables, and comments.
That said, this vision is more of a stretch goal than a milestone on the horizon. In many ways, OpenTelemetry has already “won.” Making code observable, whether through OpenTelemetry or another system, is no longer optional; it is expected. Observability has become a baseline capability, and OpenTelemetry helped set that standard.
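For a sense of where things stand today, here is roughly what manual instrumentation looks like with OpenTelemetry’s Python API (the service, span, and attribute names are made up for illustration, and an SDK with an exporter still has to be configured for data to go anywhere). The vision above is that even these few lines eventually disappear into the libraries and frameworks themselves:

```python
# Minimal manual instrumentation with the OpenTelemetry Python API.
# Requires the opentelemetry-api package; without a configured SDK and
# exporter this runs as a no-op, which is exactly why it is safe to add.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # illustrative name


def handle_checkout(order_id: str) -> None:
    # One span per request; attributes make the trace searchable later.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        ...  # business logic would go here


handle_checkout("order-42")
```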
Smart Automation, Not AI Hype
A few years ago, AIOps promised to make sense of our systems with artificial intelligence and mostly delivered confusion. The hype faded, the promises didn’t hold up, and most teams were left with “smart” alert suppressors or dashboards that looked impressive but didn’t help when it counted.
Today’s AI wave is louder: LLMs that summarize incidents, tools that promise auto-remediation with natural language, anomaly detectors wrapped in glossy UIs. But most of them suffer from the same core flaw: LLMs are general-purpose tools; they excel at pattern recognition but lack the deep, real-time causal reasoning needed to adapt to novel, dynamic environments.
At Causely, we’re not building a chatbot. We’re not doing log clustering or time-series correlation. We’re building a causal reasoning system.
Our system encodes known failure modes as structured models. That means it can infer root causes even when the triggering signal isn’t directly observable. It doesn’t just suppress noise; it explains it. It doesn’t just summarize symptoms; it traces causality.
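As a toy illustration of that idea (not Causely’s engine; the failure modes and symptom names are invented), a structured model can list the symptoms each known root cause produces and rank candidates by how well they explain what is actually observed, even when the root cause itself emits no signal of its own:

```python
# Toy causal-model sketch: known failure modes and the symptoms they cause.
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    symptoms: frozenset  # observable effects this root cause is known to produce


KNOWN_FAILURE_MODES = [  # illustrative entries only
    FailureMode("db_connection_pool_exhausted",
                frozenset({"api_latency_high", "db_wait_time_high", "request_timeouts"})),
    FailureMode("memory_leak",
                frozenset({"rss_growth", "gc_pause_high", "api_latency_high"})),
]


def rank_root_causes(observed: set[str]) -> list[tuple[str, float]]:
    """Rank failure modes by the fraction of their expected symptoms observed."""
    scored = []
    for mode in KNOWN_FAILURE_MODES:
        overlap = len(mode.symptoms & observed)
        if overlap:
            scored.append((mode.name, overlap / len(mode.symptoms)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# The pool limit itself is not directly observable, but its symptom
# pattern still points to it as the most likely explanation:
print(rank_root_causes({"api_latency_high", "db_wait_time_high"}))
```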
This isn’t black-box “AI.” It’s smart, explainable automation rooted in mathematics — built to help engineers understand why things happen and what to do next.
And the goal isn’t to replace humans! It’s to get them out of the loop for the boring stuff: the CPU spike, the misconfigured downstream service, the memory leak that shows up every Tuesday. These aren’t mysteries. They’re patterns. And they can be handled automatically.
That way, engineers can spend their time doing what they love: building.
Closing Thoughts
The roundtable showed there’s still strong alignment across the industry: observability remains essential, but it needs to deliver more than just data. We need clearer value, smarter automation, and systems that help us move faster, not just monitor more.
Thanks to Adam, Charity, Paige, Shahar, and Avi for a thoughtful and honest discussion. It’s good to see real progress, and even better to debate what is coming next.