<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Causely Blog</title>
    <link>https://causely.ai/blog</link>
    <description>Latest posts from Causely about application reliability, observability, and cloud native technologies</description>
    <language>en</language>
    <lastBuildDate>Fri, 06 Mar 2026 23:07:35 GMT</lastBuildDate>
    
    <item>
      <title><![CDATA[Reliability Is Managed In Services, But Felt In Transactions]]></title>
      <link>https://causely.ai/blog/reliability-is-managed-in-services-but-felt-in-transactions</link>
      <guid>https://causely.ai/blog/reliability-is-managed-in-services-but-felt-in-transactions</guid>
      <pubDate>Thu, 19 Feb 2026 19:18:14 GMT</pubDate>
      <description><![CDATA[Reliability is managed in services, but users experience outcomes. In complex, multi-service and AI-driven architectures, systems can look healthy in isolation while end-to-end workflows still fail. Product reliability needs visibility at the level of transactions and flows.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/02/sharks-love-transactions.png" type="image/png" />
      <content:encoded><![CDATA[<p>Modern reliability practice is incredibly good at one thing: making complex systems operable. We break systems into services, assign owners, instrument the surfaces, and build crisp operational feedback loops. When something goes wrong, we can usually answer: which service is unhealthy, and which team should act? That’s what makes running large, distributed systems tractable in the first place. But product reliability asks a different question:</p><p>Did the user succeed?</p><p>That question often gets neglected. Not because teams don’t care, but because most reliability management is organized around services, while the user experiences transactions.</p><h2 id="reliability-is-felt-in-transactions-not-services"><strong>Reliability Is Felt in Transactions, Not Services</strong></h2><p>Consider an end-user experience like placing an order in an application. From the user’s perspective, this is one thing they are trying to accomplish. They add an item, confirm, and expect the order to go through. For them, the outcome is binary: the order worked, or it didn’t.</p><p>From a system perspective, that same experience fans out across many services. Inventory and pricing services check availability, payment authorization happens synchronously, a machine-learning-based fraud detection system evaluates the order in real time, and downstream fulfillment, recommendation updates, and notification steps continue asynchronously.</p><p>Each service might look “fine” when viewed in isolation. Latency could be slightly elevated in one place, retries could be masking failures in another, and asynchronous steps could be lagging, all without any single service showing a clear red flag.</p><p>This is not a theoretical edge case.
It’s a natural consequence of composing many independently reliable components into an end-to-end experience.</p><p>Reliability is managed at the service level, but it is felt at the level of transactions and flows.</p><h2 id="why-transaction-level-reliability-doesn%E2%80%99t-collapse-into-one-simple-metric">Why Transaction-Level Reliability Doesn’t Collapse Into One Simple Metric</h2><p>You can try to model this by defining reliability objectives for individual transactions, and in some cases that works well. A critical HTTP path on a frontend service can and should have a clear objective.</p><p>But systems rarely stop there. Real products are full of flows that are asynchronous, multi-step, or partially decoupled. Password reset workflows, AI-assisted report generation, onboarding sequences, background processing pipelines: these don’t fit neatly into a single request-response boundary.</p><p>In these cases, the user-facing “success” signal isn’t always available at the moment a request returns. You often only know whether the user truly succeeded after downstream work completes and state converges—sometimes minutes later, sometimes after multiple systems reconcile.</p><p>This is also where many SLO programs end up skewing toward internal service-level contracts: useful for ownership and operations, but not always capturing whether the user intent completed successfully end-to-end.</p><p>So the challenge isn’t “measure reliability.” It’s to measure reliability at the level where success is actually determined.</p><h2 id="product-ownership-doesn%E2%80%99t-map-cleanly-to-services"><strong>Product Ownership Doesn’t Map Cleanly to Services</strong></h2><p>There is another dimension where this becomes visible: products and features are rarely owned by a single service. They span multiple microservices, owned by different teams, evolving at different speeds.
Developers and product managers tend to think in terms of features and user journeys, not in terms of which service is emitting which metric.</p><p>In a DevOps model, that creates friction. The people responsible for improving a product’s reliability need to understand how changes affect end-to-end behavior. Service-level views are necessary for fixing issues, but they are not sufficient for reasoning about how reliable a feature or product feels to users over time.</p><p>And this is where a lot of reliability programs stall: they’re excellent at service health and incident response, but weaker at answering:</p><ul><li>Which user-facing flows are degrading?</li><li>Which products are accumulating reliability risk?</li><li>How is reliability trending from the user’s perspective?</li><li>Where are we “green” locally but failing end-to-end?</li></ul><p>Consider a subscription upgrade flow initiated through a support chatbot. A customer messages: “Upgrade me to Pro”. The bot confirms success immediately.</p><p>Behind the scenes, that single interaction fans out across a chain of services, agents, and actions: billing is updated, an entitlements agent validates the new plan and pushes updated permissions, usage limits refresh, and internal automation agents orchestrate background jobs that reconcile state across systems.</p><p>No single service is down. The agents are executing as designed. Everything looks acceptable in isolation.</p><p>But minutes or hours later, the customer hits a limit they shouldn’t have, or a feature remains locked. Support tickets spike. Revenue recognition is delayed. Trust erodes.</p><p>From a service view, everything looks acceptable. From a product view, the upgrade flow is unreliable.</p><h2 id="closing-the-gap-requires-productand-flow-centric-views"><strong>Closing the Gap Requires Product- and Flow-Centric Views</strong></h2><p>Addressing that gap requires additional lenses.
Reliability systems need ways to look at signals through transaction-centric and product-centric views: views that reflect how users experience the system and how teams build products across services, not only how services behave in isolation.</p><p>Services remain the building blocks. Service-level discipline remains essential. But to understand what users are actually feeling, and how reliable a product truly is, reliability needs to be observable at the level where experience happens: across transactions, flows, and products composed from many services working together.</p><p>We’ve been working on addressing this gap in our own product. We are exploring how reliability systems can make transaction- and product-level views first-class, while strengthening the service foundations teams rely on today. We’ll share more about that soon.</p><p>In the meantime, we’re curious whether this resonates: Does this match what you’re seeing in your own systems, or do you think service-level reliability already covers more of this than we’ve described?</p><p>Let us know! We’d love to hear what you think: <a href="mailto:community@causely.ai" rel="noreferrer">reach out to us via email</a> or <a href="https://www.linkedin.com/posts/severinneumann_reliability-is-managed-in-services-users-share-7430639096806436864-nhEa?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAMovzoBr0mJ9JwO6vH6gE-ew2o1-Hnr5pc" rel="noreferrer">discuss with us on LinkedIn</a>!</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Alerts Aren’t the Investigation: What Comes Next in Incident Response?]]></title>
      <link>https://causely.ai/blog/alerts-arent-the-investigation-what-comes-next-in-incident-response</link>
      <guid>https://causely.ai/blog/alerts-arent-the-investigation-what-comes-next-in-incident-response</guid>
      <pubDate>Mon, 09 Feb 2026 18:01:22 GMT</pubDate>
      <description><![CDATA[Alerts are signals, not explanations. By explicitly mapping alerts to symptoms and inferred root causes, Causely turns alert noise into a coherent explanation of what is actually happening in the system.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/02/New-Feature----alerts-to-root-cause.png" type="image/png" />
      <content:encoded><![CDATA[<p>In a recent post,&nbsp;<a href="https://www.causely.ai/blog/alerts-arent-the-investigation?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Alerts Aren’t the Investigation</u></a>, we described a problem that shows up in&nbsp;nearly every&nbsp;on-call rotation: alerts fire&nbsp;quickly, but&nbsp;understanding arrives late. A page tells you something&nbsp;crossed&nbsp;a threshold. It does not tell you what behavior is unfolding, how signals relate, or where to start if you want to stop the damage.&nbsp;</p><p>That delay is the Page-to-Understanding Gap. It is the translation cost of turning a page into an explicit explanation of system behavior. This is why MTTR&nbsp;improvements&nbsp;plateau even in teams with mature alerting and observability.&nbsp;</p><p>Engineers feel this cost most acutely during incidents today, but it is also what prevents alerts from being usable inputs to automated response workflows.</p><p>What comes next is closing that gap. That means grounding alerts directly in system behavior and causal context, so investigation starts with understanding instead of translation.&nbsp;</p><h2 id="the-problem-alerts-create-during-investigation"><strong>The problem alerts create during investigation</strong>&nbsp;</h2><p>When an alert fires, the first requirement is not diagnosis. It is orientation.&nbsp;</p><p>Engineers need to know what kind of behavior is happening, whether multiple alerts point to the same underlying issue, and where to focus first. Alerts do not answer those questions. They were never designed to.&nbsp;</p><p>Most alerting systems encode thresholds, burn rates, or proxy signals that usually correlate with impact. They reflect past incidents and operational heuristics, not real-time system behavior. 
That is why the same alert can&nbsp;represent&nbsp;very different&nbsp;failure modes, and why many different alerts can describe the same underlying degradation.&nbsp;</p><p>As a result, the first minutes of an incident are spent translating alert language into a mental model of the system.&nbsp;Engineers jump between dashboards, traces, and logs to figure out what the alert&nbsp;actually means&nbsp;right now.&nbsp;Slack fills with partial hypotheses. Context fragments before anyone agrees on what problem they are solving.&nbsp;</p><p>This translation work is&nbsp;the&nbsp;gap. It is where time is lost and coordination breaks down.&nbsp;</p><h2 id="the-missing-bridge-between-alerts-and-understanding"><strong>The missing bridge between alerts and understanding</strong>&nbsp;</h2><p>Alerts do carry some intent. They are not random&nbsp;noise. Each alert exists because someone believed a certain condition mattered.&nbsp;</p><p>What has been missing is an explicit bridge between that alert intent and an explanation of system behavior that can drive consistent action. Without that bridge, alerts&nbsp;remain&nbsp;isolated&nbsp;signals. Engineers are left to do the mapping themselves under pressure.&nbsp;</p><p>Closing the Page-to-Understanding Gap requires making that mapping explicit. 
An alert needs to land inside an explanation, not start a scavenger hunt for one.&nbsp;</p><h2 id="what-changed-in-causely"><strong>What changed in</strong>&nbsp;<strong>Causely</strong>&nbsp;</h2><p>Our&nbsp;<a href="https://docs.causely.ai/changelog/v1.0.114/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>recent release</u></a>&nbsp;adds the missing bridge.&nbsp;</p><p>Causely&nbsp;already ingests alerts from tools teams rely on, including&nbsp;Alertmanager, Prometheus or Mimir, incident.io, and Datadog.&nbsp;What’s&nbsp;new is that the relationship between those alerts and&nbsp;Causely’s&nbsp;symptom and causal model is now explicit and visible.&nbsp;</p><p>Each ingested alert is mapped to the symptom it&nbsp;represents&nbsp;and shown directly in the context of the inferred root cause.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/02/data-src-image-e071a432-5b90-47ab-b2c3-8fde9e9c65cc.png" class="kg-image" alt="A screenshot of a computer

" loading="lazy" width="2000" height="686" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/02/data-src-image-e071a432-5b90-47ab-b2c3-8fde9e9c65cc.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/02/data-src-image-e071a432-5b90-47ab-b2c3-8fde9e9c65cc.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/02/data-src-image-e071a432-5b90-47ab-b2c3-8fde9e9c65cc.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2026/02/data-src-image-e071a432-5b90-47ab-b2c3-8fde9e9c65cc.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Alerts mapped to symptoms are shown as evidence for the inferred root cause.&nbsp;</span></figcaption></figure><p>Instead of treating alerts as separate problems,&nbsp;Causely&nbsp;treats them as evidence of system behavior. Multiple alerts that describe the same behavior collapse into a single symptom. Those symptoms are then connected through the causal model to the change, dependency, or resource&nbsp;actually driving&nbsp;the issue.&nbsp;</p><p>Alerts are no longer just timestamps and labels. They become part of a coherent&nbsp;system&nbsp;story.&nbsp;</p><p>You can also view alerts over time and see which symptoms and root causes they mapped to. This makes alert behavior visible and understandable, rather than noisy.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/02/data-src-image-aab07489-b26c-4386-850f-8426b609bbc1.png" class="kg-image" alt="A screenshot of a computer

" loading="lazy" width="2000" height="629" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/02/data-src-image-aab07489-b26c-4386-850f-8426b609bbc1.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/02/data-src-image-aab07489-b26c-4386-850f-8426b609bbc1.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/02/data-src-image-aab07489-b26c-4386-850f-8426b609bbc1.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2026/02/data-src-image-aab07489-b26c-4386-850f-8426b609bbc1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">All ingested alerts are shown with their mapping status, including which alerts map to symptoms and which do not.&nbsp;</span></figcaption></figure><h2 id="how-this-changes-the-first-minutes-of-an-incident"><strong>How this changes the first minutes of an incident</strong>&nbsp;</h2><p>For the on-call engineer, the workflow changes&nbsp;immediately.&nbsp;</p><p>When a page fires, the alert can be pasted directly into Ask&nbsp;Causely. Instead of starting with dashboards, the engineer sees which symptom the alert maps to and whether there is an active root cause. The first question shifts from what to check to whether action is&nbsp;required&nbsp;now.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/02/data-src-image-efd26c7d-fcdf-494a-9f23-1f17eee8bb79.png" class="kg-image" alt="A screenshot of a computer

" loading="lazy" width="2000" height="802" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/02/data-src-image-efd26c7d-fcdf-494a-9f23-1f17eee8bb79.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/02/data-src-image-efd26c7d-fcdf-494a-9f23-1f17eee8bb79.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/02/data-src-image-efd26c7d-fcdf-494a-9f23-1f17eee8bb79.png 1600w, https://causely-blog.ghost.io/content/images/2026/02/data-src-image-efd26c7d-fcdf-494a-9f23-1f17eee8bb79.png 2304w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Ask&nbsp;Causely&nbsp;interprets the&nbsp;question about an&nbsp;alert and&nbsp;immediately&nbsp;explains&nbsp;the&nbsp;root cause.&nbsp;</span></figcaption></figure><p>The same mapping is also available as structured context for automated workflows, so alerts no longer require manual interpretation to become actionable.</p><p>For teams dealing with alert noise, the change is cumulative. Over time, they can see which alerts consistently map to the same symptoms and causes. This builds trust that important behavior is covered and makes it easier to reason about which alerts are redundant versus genuinely distinct.&nbsp;</p><p>For reliability leads, gaps become visible. Alerts that do not map cleanly to existing symptoms stand out. Those gaps are no longer discovered mid-incident.&nbsp;They become clear opportunities to extend coverage where it&nbsp;actually matters.&nbsp;</p><h2 id="from-paging-interface-to-investigation-entry-point"><strong>From paging interface to investigation entry point</strong>&nbsp;</h2><p>Alerts still wake people up.
That does not change.&nbsp;</p><p>What changes is where investigation&nbsp;starts.&nbsp;Instead of beginning with translation and hypothesis building, teams start with an explanation that already accounts for how alerts relate to each other and to the system.&nbsp;</p><p>Over time, this shifts behavior. Engineers stop debating which alert is primary. War rooms converge faster on a shared narrative.&nbsp;Investigation&nbsp;energy moves from interpretation to containment.&nbsp;More importantly, alerts become explicit, interpretable signals that can drive response consistently, whether the next step is taken by a human or a system.</p><p>Causely&nbsp;does not replace alerting. It turns alerts into inputs to understanding.&nbsp;</p><h2 id="closing-the-gap"><strong>Closing the gap</strong>&nbsp;</h2><p>Alerts are&nbsp;not the investigation. They never were.&nbsp;</p><p>The cost teams pay today is not because they lack alerts or data, but because understanding arrives too late. By explicitly mapping alerts to symptoms and root causes, Causely closes the Page-to-Understanding Gap where most incidents stall, turning alerts from pages into grounded inputs for understanding and action.</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">Know what to chase when everything breaks.&nbsp;</a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry]]></title>
      <link>https://causely.ai/blog/how-to-turn-slow-queries-into-actionable-reliability-metrics-with-opentelemetry</link>
      <guid>https://causely.ai/blog/how-to-turn-slow-queries-into-actionable-reliability-metrics-with-opentelemetry</guid>
      <pubDate>Wed, 04 Feb 2026 10:11:00 GMT</pubDate>
      <description><![CDATA[Slow SQL queries degrade UX and reliability. This guide shows how to distill OpenTelemetry DB spans into actionable metrics: build span-derived slow-query dashboards, rank queries by traffic impact, and detect regressions with anomaly baselines, so you fix what matters first. Hands-on lab included.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/01/sharks-love-fast-queries.png" type="image/png" />
      <content:encoded><![CDATA[<p>Slow SQL queries degrade user experience, cause cascading failures, and turn simple operations into production incidents. The traditional fix? Collect more telemetry. But more telemetry means more things to look at, not necessarily more understanding.</p><p>Instead of treating traces as a data stream we might analyze someday, we should be opinionated about what matters at the moment of decision. As we argued in <a href="https://www.causely.ai/blog/the-signal-in-the-storm?ref=causely-blog.ghost.io"><em>The Signal in the Storm</em></a>, raw telemetry only becomes useful when we extract meaningful patterns. </p><p>In this guide, you’ll build a repeatable workflow that turns OpenTelemetry database spans into span-derived metrics you can dashboard and alert on—so you can identify what’s slow, what matters most, and what just regressed.</p><p>We’ll make this concrete with slow SQL queries, serving two use cases:</p><ul><li><strong>Optimization</strong>: Which queries yield the most value if made faster, weighted by traffic?</li><li><strong>Incident response</strong>: Which queries are behaving abnormally <em>right now</em>?</li></ul><p>We’ll build a <a href="https://github.com/causely-oss/slow-query-lab?ref=causely-blog.ghost.io" rel="noreferrer">lab</a> where your app emits <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry</a> traces, and we distill those into actionable metrics, starting with simple slow query detection, then adding traffic-weighted impact, and finally anomaly detection.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text">Want to skip the theory? <a href="#lab-setup" rel="noreferrer">Jump to the Lab</a>. 
But the context helps you understand what you’re building.</div></div><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/Gy38gx-7phA?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="The Signal in the Storm: Practical Strategies for Managing Telemetry Overload - Endre Sara"></iframe><figcaption><p><span style="white-space: pre-wrap;">The Signal in the Storm: Practical Strategies for Managing Telemetry Overload - Endre Sara</span></p></figcaption></figure><h2 id="what-makes-a-query-slow">What Makes a Query Slow?</h2><p>“Slow” isn’t a single problem. It’s a symptom with fundamentally different causes. A 50ms query might be fine for a reporting dashboard but catastrophic for checkout. As <a href="https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/?ref=causely-blog.ghost.io"><em>High Performance MySQL</em></a> emphasizes, understanding why a query is slow determines how to fix it. Here are the most common problems that may cause slow queries:</p><h3 id="excessive-work">Excessive Work</h3><p>The database does more than necessary—typically full table scans due to missing or unusable indexes. Without an index on <code>customer_id</code>, a simple <code>SELECT * FROM orders WHERE customer_id = $1</code> grows from 20ms at 10K rows to minutes at 10M rows. The query didn’t change; the data volume did. See <a href="https://use-the-index-luke.com/?ref=causely-blog.ghost.io"><em>Use The Index, Luke!</em></a> for the fundamentals.</p><p>Aggregations and joins compound this. Even indexed queries can explode when the planner misjudges cardinality and chooses the wrong join strategy.</p><h3 id="resource-contention">Resource Contention</h3><p>Perfectly optimized queries can be slow when waiting for resources. 
Lock contention blocks queries until other transactions release rows. Connection pool exhaustion adds latency before the query even starts. A query spending 95% of its time waiting for locks won’t be fixed by query optimization—it needs transaction redesign.</p><h3 id="environmental-pressure">Environmental Pressure</h3><p>CPU saturation, I/O bottlenecks, and memory pressure can slow any query. The same SQL with the same plan performs completely differently under resource contention.</p><h3 id="plan-regressions">Plan Regressions</h3><p>Performance degrades when execution plans change—even with identical queries and data. Parameter-sensitive plans optimize for one set of values but fail for others. Stale statistics after bulk loads cause the planner to choose terrible strategies. The <a href="https://www.postgresql.org/docs/current/performance-tips.html?ref=causely-blog.ghost.io">PostgreSQL Performance Tips</a> documentation covers how to catch these regressions.</p><h3 id="pathological-patterns">Pathological Patterns</h3><p>Some slowness doesn’t appear in slow query logs. The N+1 problem executes 100 fast queries (2ms each) sequentially, adding 200ms latency plus network overhead. No individual query is “slow,” but the pattern is catastrophic.</p><h2 id="the-classic-workflow-db-native-tooling-manual-triage">The classic workflow: DB-native tooling + manual triage</h2><p>Databases ship with excellent diagnostic tools: slow query logs, query stores like PostgreSQL’s <a href="https://www.postgresql.org/docs/current/pgstatstatements.html?ref=causely-blog.ghost.io"><code>pg_stat_statements</code></a>, and plan inspection with <code>EXPLAIN</code>. These tell you what’s expensive inside the database.</p><p>What they don’t provide is context. Which service triggered the slow query? Is it user-facing or background work? Does it correlate with the latency spike you’re investigating? 
You’re left with a list of slow queries and no signal about which ones matter most.</p><p>Typically, someone bridges this gap manually: a developer notices a slow endpoint, brings the query to a DBA, and they optimize it together. This works, but that manual linking is exactly what we can automate.</p><h2 id="bringing-context-to-slow-queries">Bringing Context to Slow Queries</h2><p>Database tools tell you what is slow, but not why it matters. A slow query log entry carries none of the request context: which service triggered the query, whether it was user-facing or background work, or how it relates to the incident at hand.</p><p>Distributed traces provide this context. Each database span is embedded in a request context—it knows which service, endpoint, and user triggered it.</p><p>Instead of correlating database logs and traces after the fact, we analyze slow queries directly from traces with all the application context built in.</p><h2 id="the-building-blocks">The Building Blocks</h2><p>Now that we understand the philosophy and the value of context-rich traces, let’s look at the building blocks we’ll use to implement slow query analysis.</p><h3 id="the-observability-stack">The Observability Stack</h3><p>For our lab, we use the <a href="https://opentelemetry.io/docs/collector/?ref=causely-blog.ghost.io">OpenTelemetry Collector</a> paired with <a href="https://github.com/grafana/docker-otel-lgtm?ref=causely-blog.ghost.io">docker-otel-lgtm</a>—a pre-packaged stack from Grafana that bundles Loki, Grafana, Tempo, and Mimir in a single container. This gives us a complete observability environment with minimal setup.</p><h3 id="the-application">The Application</h3><p>Our sample application is a simple Go-based “Album API” that serves music album data from PostgreSQL. It’s intentionally designed to produce the kind of intermittent slow queries that are common in production.
The services use <a href="https://github.com/XSAM/otelsql?ref=causely-blog.ghost.io">otelsql</a> to instrument database calls, emitting spans with the <a href="https://opentelemetry.io/docs/specs/semconv/database/?ref=causely-blog.ghost.io">stable OpenTelemetry database semantic conventions</a>.</p><h3 id="the-dashboards">The Dashboards</h3><p>We’ll build three dashboards, each adding a layer of insight:</p><ol><li>A simple view of the queries by duration</li><li>Queries weighted by traffic to surface optimization opportunities</li><li>Anomaly detection to identify queries deviating from their normal behavior</li></ol><h2 id="lab-setup">Lab Setup</h2><p>Let’s put the theory into practice. We’ll clone a <a href="https://github.com/causely-oss/slow-query-lab?ref=causely-blog.ghost.io" rel="noreferrer">sample application</a>, start the observability stack, and explore three progressively more sophisticated approaches to slow query analysis. All you need is <a href="https://www.docker.com/?ref=causely-blog.ghost.io" rel="noreferrer">Docker</a> installed.</p><h3 id="clone-and-run">Clone and Run</h3><pre><code>git clone https://github.com/causely-oss/slow-query-lab
cd slow-query-lab
docker-compose up -d</code></pre><p>Once running, open Grafana at <a href="http://localhost:3001/?ref=causely-blog.ghost.io">http://localhost:3001</a>—that’s where we’ll explore our dashboards.</p><h2 id="queries-by-duration">Queries by Duration</h2><p>The first dashboard takes the most direct approach: query Tempo for database spans and aggregate them to find queries that take the longest time. This is what you’d naturally build when you first start exploring traces for slow query analysis.</p><h3 id="what-it-shows">What It Shows</h3><p>The <strong>Slow SQL - By Duration</strong> dashboard queries traces directly using TraceQL:</p><pre><code>{ span.db.system != "" } | select(span.db.query.text, span.db.statement)</code></pre><p>This finds all spans with database attributes, then uses Grafana transformations to:</p><ol><li><strong>Group by</strong> root operation (API endpoint) and SQL statement</li><li><strong>Aggregate</strong> duration into mean, max, and count</li><li><strong>Sort by</strong> average duration (slowest first)</li></ol><p>The result is a table showing your slowest queries, which endpoints triggered them, and how often they occur.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/slow-queries-by-root-operation.png" class="kg-image" alt="Slowest queries by root operation" loading="lazy" width="1133" height="304" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/slow-queries-by-root-operation.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/slow-queries-by-root-operation.png 1000w, https://causely-blog.ghost.io/content/images/2026/01/slow-queries-by-root-operation.png 1133w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Slowest queries by root operation</span></figcaption></figure><h3 id="what%E2%80%99s-good-about-this">What’s Good About This</h3><p>This approach gives you immediate visibility 
into queries with full application context:</p><ul><li>You can see exactly which SQL statements are taking the most time</li><li>You know which API endpoints trigger them</li><li>You have the count to understand frequency</li><li>You can click through to individual traces for debugging</li></ul><p>It’s a first improvement over raw database logs because you’re already seeing the application context that makes slow queries actionable.</p><h3 id="the-limitation">The Limitation</h3><p>Here’s the problem: sorting by average duration doesn’t tell you which queries matter most.</p><p>Consider two queries:</p>
<!--kg-card-begin: html-->
<table class="caption-top table">
<thead>
<tr class="header">
<th>Query</th>
<th>Avg Duration</th>
<th>Count</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Complex report</td>
<td>2.3s</td>
<td>5</td>
</tr>
<tr class="even">
<td>Search</td>
<td>150ms</td>
<td>10,000</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
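<p>Before reading on, it’s worth doing the arithmetic on these two rows (durations and counts are taken from the table above; the snippet is only an illustrative sketch):</p>

```python
# Aggregate time per query: average duration multiplied by call count.
# Figures come from the table above.
queries = [
    ("Complex report", 2.3, 5),       # avg seconds, call count
    ("Search", 0.150, 10_000),
]

for name, avg_s, count in queries:
    total_s = avg_s * count  # total time users spend waiting on this query
    print(f"{name}: {total_s:g}s of aggregate query time")
```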
<p>The complex report is “slower” by average duration, so it appears first. But the search query, despite being faster on average, runs 2,000 times more often. Its aggregate impact on your users is far greater.</p><p>This dashboard tells you what’s slow, but not what’s <em>impactful</em>. For that, we need to consider traffic volume.</p><h2 id="traffic-weighted-impact-analysis">Traffic-Weighted Impact Analysis</h2><p>The second dashboard addresses this limitation by introducing an <strong>impact score</strong>: the product of average duration and call count.</p><h3 id="what-it-shows-1">What It Shows</h3><p>The <strong>Slow SQL - Traffic Weighted</strong> dashboard uses the same TraceQL query but adds a calculated field:</p><pre><code>Impact = Avg Duration × Count</code></pre><p>This simple formula captures a key insight: a moderately slow query that runs thousands of times has more total impact than a very slow query that runs rarely. The dashboard sorts by impact score, surfacing the queries that matter most to your users.</p><p>The dashboard also adds:</p><ul><li><strong>Service breakdown</strong>: See which service triggered each query</li><li><strong>Latency distribution</strong>: Visualize duration over time, not just averages</li><li><strong>Top queries by impact</strong>: A quick view of where to focus optimization efforts</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/slow-queries-by-root-operation-with-impact.png" class="kg-image" alt="Highest impact queries by root operation" loading="lazy" width="1151" height="306" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/slow-queries-by-root-operation-with-impact.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/slow-queries-by-root-operation-with-impact.png 1000w, https://causely-blog.ghost.io/content/images/2026/01/slow-queries-by-root-operation-with-impact.png 1151w" 
sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Highest impact queries by root operation</span></figcaption></figure><h3 id="what%E2%80%99s-good-about-this-1">What’s Good About This</h3><p>Traffic-weighted impact gives you a much better prioritization signal for optimization work:</p><ul><li>High-volume, moderately-slow queries surface above rare-but-slow ones</li><li>You can justify optimization work with concrete impact numbers</li><li>The service and endpoint context helps you route issues to the right team</li></ul><p>When someone asks “which slow queries should we optimize first?”, this dashboard gives you a defensible answer. It’s exactly what you need for planning performance improvements.</p><h3 id="the-limitation-1">The Limitation</h3><p>But this dashboard is for optimization, not incident response. Even with traffic-weighted impact, it can’t answer a critical question:</p><blockquote><strong>“What has changed?”</strong></blockquote><p>Suppose your search query has an impact score of 150,000. Is that normal? Is it higher than yesterday? Higher than last week? The dashboard shows you a snapshot of current state, but it has no concept of baseline.</p><p>This matters enormously during incidents. When latency spikes, you don’t just want to know “search queries are slow”—you want to know “search queries are slower than normal”. You need to distinguish between:</p><ul><li>A query that’s always been slow (known behavior, maybe acceptable)</li><li>A query that just became slow (new problem, needs investigation)</li></ul><p>Without a baseline, every slow query looks the same. 
You’re left manually comparing current values to your memory of what’s “normal,” or digging through historical data to establish context.</p><p>This is the gap that the third dashboard addresses.</p><h2 id="symptom-detection-with-anomaly-baselines">Symptom Detection with Anomaly Baselines</h2><p>Because of these limitations, the third dashboard changes our approach: instead of just querying traces, we distill metrics from spans and then apply anomaly detection to identify deviations from normal behavior.</p><h3 id="the-setup">The Setup</h3><p>For this dashboard, we add the <code>spanmetrics</code> connector to the OpenTelemetry Collector. Here’s the relevant part of the collector configuration:</p><pre><code>connectors:
  spanmetrics:
    dimensions:
      - name: db.system
        default: "unknown"
      - name: db.query.text
      - name: db.statement
      - name: db.name
        default: "unknown"
    exemplars:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [spanmetrics, otlphttp/lgtm]
    
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [otlphttp/lgtm]</code></pre><p>The <code>spanmetrics</code> connector examines every database span and generates histogram metrics for query latency, labeled by:</p><ul><li><code>service_name</code>: Which service made the query</li><li><code>db_system</code>: Database type (postgresql)</li><li><code>db_query_text</code> or <code>db_statement</code>: The SQL query</li><li><code>db_name</code>: Database name</li></ul><p>These metrics are stored in Mimir (the Prometheus-compatible backend in docker-otel-lgtm), where we can apply PromQL-based anomaly detection.</p><h3 id="anomaly-detection-with-adaptive-baselines">Anomaly Detection with Adaptive Baselines</h3><p>The sample app includes Prometheus recording rules from Grafana’s <a href="https://github.com/grafana/promql-anomaly-detection?ref=causely-blog.ghost.io">PromQL Anomaly Detection</a> framework. These rules calculate:</p><ul><li><strong>Baseline</strong>: A smoothed average of historical values (what’s “normal”)</li><li><strong>Upper band</strong>: Baseline + N standard deviations (upper threshold)</li><li><strong>Lower band</strong>: Baseline - N standard deviations (lower threshold)</li></ul><p>When current values exceed the bands, we have an anomaly—a clear signal that something has changed.</p><h3 id="what-it-shows-2">What It Shows</h3><p>The <strong>Slow SQL - Anomaly Detection</strong> dashboard displays:</p><ol><li><strong>Current latency</strong> plotted against the adaptive baseline bands</li><li><strong>Anomaly indicators</strong> when latency exceeds normal bounds</li><li><strong>Per-query breakdown</strong> so you can see which specific queries are anomalous</li></ol><p>The key insight is the visual comparison: instead of just showing “p95 latency is 450ms”, it shows “p95 latency is 450ms, which is above the expected range of 200-350ms.”</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/slow-query-anomaly.png" 
class="kg-image" alt="" loading="lazy" width="2000" height="641" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/slow-query-anomaly.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/slow-query-anomaly.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/01/slow-query-anomaly.png 1600w, https://causely-blog.ghost.io/content/images/2026/01/slow-query-anomaly.png 2194w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Query latency with anomaly bands</span></figcaption></figure><h3 id="why-this-is-better">Why This Is Better</h3><p>This dashboard answers the question the previous one couldn’t: “What has changed?”</p><ul><li>A query that’s always slow (450ms baseline) won’t trigger anomalies when it runs at 450ms</li><li>A query that’s normally fast (50ms baseline) will trigger anomalies if it suddenly runs at 200ms</li><li>You get automatic context for what’s “normal” without maintaining manual thresholds</li></ul><p>The anomaly detection acts as a symptom detector. 
It tells you: “This query is behaving differently than it usually does.” That’s a high-signal insight you can act on immediately.</p><h3 id="from-metrics-to-symptoms">From Metrics to Symptoms</h3><p>Notice what we’ve achieved with this architecture:</p><ol><li><strong>Raw telemetry</strong> (traces) flows from the application</li><li><strong>Distillation</strong> (spanmetrics connector) extracts metrics from those traces</li><li><strong>Anomaly detection</strong> (Prometheus rules) identifies deviations from baseline</li><li><strong>Symptoms</strong> (anomalous queries) surface for investigation</li></ol><p>We went from thousands of trace spans to a handful of anomaly signals that tell you exactly where to look.</p><h2 id="taking-this-to-production">Taking This to Production</h2><h3 id="metric-cardinality">Metric Cardinality</h3><p>Raw SQL in metric labels will explode your metrics backend—<code>SELECT * FROM orders WHERE customer_id = 12345</code> becomes a separate series per customer. Use prepared statements (so instrumentation captures templates, not literals), normalize query text, or use <code>aggregation_cardinality_limit</code> in the spanmetrics connector.</p><h3 id="privacy">Privacy</h3><p>SQL may contain sensitive data. The Collector is the ideal place to redact: drop or transform sensitive attributes before shipping downstream. This aligns with distillation: sanitize at the edge, not centrally.</p><h3 id="anomaly-detection-baseline">Anomaly Detection Baseline</h3><p>Adaptive rules need 24-48 hours of data to establish baselines. Start with wider bands and tighten as confidence grows.</p><h2 id="the-remaining-gap-from-symptoms-to-root-causes">The Remaining Gap: From Symptoms to Root Causes</h2><p>Even with anomaly detection, you’re still looking at symptoms. In real-world incident scenarios, especially in large environments, slow queries are just one of many symptoms that pop up at once. 
You’re not only trying to understand the cause of this one; you’re triaging a flood of alerts and correlating many symptoms to find the real root cause.</p><p>When the dashboard shows “search query latency spiked,” you know something changed. But you don’t know <em>why</em> it changed. The root cause might be:</p><ul><li>A missing index after a schema migration</li><li>Query plan regression due to stale statistics</li><li>Lock contention from a concurrent batch job</li><li>Resource pressure from a noisy neighbor on the database host</li><li>Upstream service degradation causing retry storms</li></ul><p>Connecting the symptom (“search query is slow”) to the root cause (“index was dropped during last night’s migration”) requires causal reasoning—understanding the relationships between system components and tracing the chain of causation from effect back to cause.</p><p>You can absolutely do this reasoning yourself. Look at deployment timestamps, check for schema changes, investigate resource metrics, correlate with other symptoms. Good engineers do this every day.</p><p>But it’s manual, time-consuming, and doesn’t scale.</p><h3 id="going-beyond-symptoms-with-causely">Going Beyond Symptoms with Causely</h3><p>This is where Causely comes in: Causely extracts slow queries (and other symptoms) as distilled insights out of the box—the same pattern we implemented manually. But it goes further:</p><ul><li><a href="https://docs.causely.ai/getting-started/how-causely-works/?ref=causely-blog.ghost.io" rel="noreferrer"><strong>Causal model</strong></a>: Slow queries are connected into a model of your system’s dependencies. 
You can see what they <em>impact</em> (which endpoints, which users) and what <em>causes</em> them (resource constraints, upstream failures, configuration changes).</li><li><a href="https://docs.causely.ai/in-action/root-causes/?ref=causely-blog.ghost.io" rel="noreferrer"><strong>Root cause identification</strong></a>: Instead of showing you a list of symptoms to investigate, Causely traces causation chains to identify the underlying root cause. “Search queries are slow <em>because</em> the index was dropped.”</li><li><a href="https://docs.causely.ai/in-action/ask-causely/?ref=causely-blog.ghost.io#analyzing-slow-sql-queries" rel="noreferrer"><strong>Actionable recommendations</strong></a>: AskCausely helps you get to “what should we change?”—whether that’s adding an index, reverting a deployment, or addressing the upstream pressure that made the query slow in the first place.</li></ul><p>The pattern we built in this post—distill, detect anomalies, surface symptoms—is the foundation. Causely is the natural next step: turning symptoms into root causes at scale.</p><p>Want to see how Causely connects your slow queries to their root causes? 
<a href="https://www.causely.ai/try?ref=causely-blog.ghost.io">Try it yourself</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/ask-causely-slow-queries-6ebf934bdaa316de63e9a10af404cc4f.png" class="kg-image" alt="" loading="lazy" width="1167" height="1329" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/ask-causely-slow-queries-6ebf934bdaa316de63e9a10af404cc4f.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/ask-causely-slow-queries-6ebf934bdaa316de63e9a10af404cc4f.png 1000w, https://causely-blog.ghost.io/content/images/2026/01/ask-causely-slow-queries-6ebf934bdaa316de63e9a10af404cc4f.png 1167w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Ask Causely about slow queries</span></figcaption></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[When Asynchronous Systems Fail Quietly, Reliability Teams Pay the Price]]></title>
      <link>https://causely.ai/blog/when-asynchronous-systems-fail-quietly-reliability-teams-pay-the-price</link>
      <guid>https://causely.ai/blog/when-asynchronous-systems-fail-quietly-reliability-teams-pay-the-price</guid>
      <pubDate>Wed, 28 Jan 2026 20:19:59 GMT</pubDate>
      <description><![CDATA[Causely’s causal model has been expanded for asynchronous messaging systems. Instead of treating queues as opaque buffers, Causely models messaging infrastructure as it operates in production, making asynchronous failures explicit and explainable.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/01/New-Feature----expanded-causal-model-for-asynchronous-communications--2-.png" type="image/png" />
      <content:encoded><![CDATA[<p>In our previous post, <a href="https://www.causely.ai/blog/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Queue Growth, Dead Letter Queues, and Why Asynchronous Failures Are Easy to Misread</u></a>, we described a failure pattern that plays out repeatedly in modern systems built on asynchronous messaging.</p><p>A queue starts to grow slowly. Nothing looks obviously broken at first. Publish calls are succeeding and consumers are still running, just not quite keeping up. Over time, messages begin to age out, and dead-letter queues start accumulating entries. Downstream services that depend on those messages begin to behave unpredictably. Partial data, delayed processing, and subtle customer-facing issues appear that are hard to tie back to a single event. By the time the impact is visible in latency or error rates elsewhere in the system, the original cause is buried several layers upstream and hours in the past.</p><p>Teams do not miss these failures because they lack data.
They miss them because the signals do not point clearly to the cause.</p><p>Over the past several weeks, we’ve expanded Causely’s asynchronous and messaging queue capabilities to make these failures explicit, explainable, and actionable. This includes:</p><ul><li>An expanded causal model for <a href="https://docs.causely.ai/changelog/v1.0.108/?ref=causely-blog.ghost.io#expanded-messaging-queue-causal-model" rel="noreferrer noopener"><u>Amazon SNS, Amazon SQS, and RabbitMQ</u></a></li><li>A new <a href="https://docs.causely.ai/reference/root-causes/applications/?ref=causely-blog.ghost.io#producer-publish-rate-spike" rel="noreferrer noopener"><u>Producer Publish Rate Spike</u></a> root cause</li><li><a href="https://docs.causely.ai/changelog/v1.0.109/?ref=causely-blog.ghost.io#expanded-causal-model-for-asynchronous-communications" rel="noreferrer noopener"><u>Queue Size Growth and Dead-Letter Queue</u></a> added as first-class symptoms in our model</li></ul><h2 id="reliability-blind-spot-in-messaging-driven-architectures"><strong>Reliability Blind Spot in Messaging-Driven Architectures</strong></h2><p>Asynchronous communication is foundational to how modern systems scale. The same advantages systems like Kafka and RabbitMQ provide, decoupling services and absorbing traffic spikes, also introduce new reliability challenges.</p><p>The core issue is not that these systems fail quietly, but that cause and effect are separated. A producer can overload the system without returning errors. A broker can continue accepting traffic while consumers fall behind. By the time downstream symptoms appear, the triggering behavior has often already passed.</p><p>For engineering managers, and for those on the front line of the on-call Slack channel, this creates a familiar and frustrating dynamic.
Reliability degrades without a clear trigger. Incident response turns into a debate about whether the producer or consumer is responsible. Teams chase anomalies across dashboards while backlogs continue to grow. By the time a decisive action is taken, the customer impact is already real.</p><h2 id="why-traditional-observability-falls-short"><strong>Why Traditional Observability Falls Short</strong></h2><p>Metrics, logs, and traces are excellent at answering local questions. They tell you what a service is doing, how long an operation took, or how many messages are currently sitting in a queue.</p><p>What they do not provide is causal understanding across asynchronous boundaries.</p><p>In messaging-driven systems, cause and effect are separated in time and space. A spike in publish rate from one service may not create visible impact until hours later, in a different service, owned by a different team. A slow consumer may be the result of downstream backpressure rather than a defect in the consumer itself. Dead-letter queues tell you that messages failed, but not why the system reached that state.</p><p>Without a causal model of how producers, exchanges, queues, and consumers interact, teams are forced to infer failures indirectly. That inference is slow, fragile, and heavily dependent on tribal knowledge. Under pressure, it leads to overcorrection, unnecessary rollbacks, and missed root causes.</p><h2 id="expanding-the-causal-model-for-messaging-systems"><strong>Expanding the Causal Model for Messaging Systems</strong></h2><p>To close this gap, we have significantly expanded Causely’s causal model for asynchronous messaging systems.</p><p>Rather than treating queues as opaque buffers, Causely now models messaging infrastructure the way it actually operates in production.
Producers, exchanges, queues, and consumers are represented as distinct entities with explicit relationships and data flows. This applies across common technologies, including Amazon SQS, Amazon SNS, and RabbitMQ, whether used in simple queue mode or exchange-based pub/sub patterns.</p><p>By modeling the topology directly, Causely can reason about how work enters the system, how it is routed, where it accumulates, and how pressure propagates across services. This makes it possible to explain failures that previously required intuition and guesswork.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/data-src-image-1a5bf397-ae85-457b-b971-fd32e48b76f1.png" class="kg-image" alt="Causely Dataflow Map showing producers, exchanges, queues, and consumers" loading="lazy" width="1974" height="1206" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/data-src-image-1a5bf397-ae85-457b-b971-fd32e48b76f1.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/data-src-image-1a5bf397-ae85-457b-b971-fd32e48b76f1.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/01/data-src-image-1a5bf397-ae85-457b-b971-fd32e48b76f1.png 1600w, https://causely-blog.ghost.io/content/images/2026/01/data-src-image-1a5bf397-ae85-457b-b971-fd32e48b76f1.png 1974w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Causely Dataflow Map makes it easy for engineers to understand how data moves between services and through the exchanges and queues that make up Amazon SQS, Amazon SNS, and RabbitMQ</span></figcaption></figure><h2 id="making-queue-growth-and-dead-letter-failures-first-class-signals"><strong>Making Queue Growth and Dead-Letter Failures First-Class Signals</strong></h2><p>We have also expanded the causal model to treat queue size growth and dead-letter queue activity as first-class symptoms, not secondary indicators.</p><p>This changes how asynchronous failures are diagnosed. Instead of surfacing queue metrics as passive signals, Causely reasons about them causally, linking backlog growth and dead-letter events directly to the producers, consumers, and operations involved.</p><p>As a result, queue-related failures are no longer inferred indirectly from downstream latency or error spikes.
The failure mode is explicit, explainable, and traceable to the point where intervention is most effective.</p><h2 id="a-new-root-cause-producer-publish-rate-spike"><strong>A New Root Cause: Producer Publish Rate Spike</strong></h2><p>One of the most common and least understood asynchronous failure modes is a sudden change in publish behavior. Causely now includes a dedicated root cause for this pattern: Producer Publish Rate Spike.</p><p>This occurs when a service, HTTP path, or RPC method begins publishing messages at a significantly higher rate than normal. The increase may be triggered by a code change, a configuration update, or an unexpected shift in traffic patterns. Downstream queues absorb the initial surge, but consumers cannot keep up indefinitely. Queue depth grows, message age increases, and backpressure begins to affect the rest of the system.</p><p>What makes this failure particularly dangerous is that the producer often looks healthy. Publish requests succeed, error rates remain low, and nothing appears obviously wrong at the source. Without causal reasoning, teams frequently blame consumers or infrastructure capacity, missing the true trigger entirely.</p><p>Causely now detects this condition explicitly. It ties unexpected increases in publish rate to queue growth, consumer pressure, and downstream service degradation, making the failure both visible and explainable.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2026/01/data-src-image-5724806e-aee9-45d1-b284-f1bb19005727.png" class="kg-image" alt="Root cause view tracing increased queue depth to a producer publish rate spike" loading="lazy" width="2000" height="957" srcset="https://causely-blog.ghost.io/content/images/size/w600/2026/01/data-src-image-5724806e-aee9-45d1-b284-f1bb19005727.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2026/01/data-src-image-5724806e-aee9-45d1-b284-f1bb19005727.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2026/01/data-src-image-5724806e-aee9-45d1-b284-f1bb19005727.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2026/01/data-src-image-5724806e-aee9-45d1-b284-f1bb19005727.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Understanding the cause of increased queue depths and the resulting performance degradation</span></figcaption></figure><h2 id="what-this-changes-for-reliability-teams"><strong>What This Changes for Reliability Teams</strong></h2><p>For teams responsible for revenue-critical services, these capabilities change how asynchronous failures are handled in practice.</p><p>Instead of reacting after queues are saturated and customers are impacted, teams can see which producer initiated the failure, how pressure propagated through the messaging system, and where intervention will have the greatest effect. Slow consumers, misconfigured routing, and unexpected publish spikes are distinguished clearly rather than conflated into a single “queue issue.”</p><p>This shortens incident response, reduces unnecessary mitigation, and eliminates the finger-pointing that often arises when failures span multiple teams.
More importantly, it enables a proactive reliability posture in systems that are constantly changing.</p><h2 id="asynchronous-reliability-without-guesswork"><strong>Asynchronous Reliability Without Guesswork</strong></h2><p>Asynchronous architectures are essential for scale, but they demand a different approach to reliability than synchronous request paths.</p><p>With its expanded messaging and asynchronous causal model, Causely provides deterministic, explainable reasoning over how data flows through your system. Teams do not need to stitch together dashboards to reconstruct timelines after the fact. They do not need to trust black-box AI summaries that cannot explain their conclusions. They no longer have to exhaustively eliminate possibilities to arrive at a root cause.</p><p>Instead, they get clear answers to the questions that matter most: what is breaking, why it is breaking, and where to act first to protect reliability and revenue.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Alerts Aren’t the Investigation]]></title>
      <link>https://causely.ai/blog/alerts-arent-the-investigation</link>
      <guid>https://causely.ai/blog/alerts-arent-the-investigation</guid>
      <pubDate>Thu, 22 Jan 2026 16:59:39 GMT</pubDate>
      <description><![CDATA[Alerts are supposed to start an investigation. Too often, they start translation: what is the system doing right now? That translation slows containment, splinters context, and stretches customer impact.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/01/sharks-do-not-love-alerts.png" type="image/png" />
      <content:encoded><![CDATA[<p>PagerDuty fires: CheckoutAPI burn rate (2m/1h). Grafana shows p99 going from ~120ms to ~900ms. Retries doubled. DB CPU is flat, but checkout pods are throttling and a downstream dependency’s error budget is evaporating. Ten minutes in, you’ve collected artifacts, not understanding.</p><p>If you’ve been on call, you’ve seen this movie.</p><p>This is also why plenty of “AI-powered observability” rollouts still don’t change the lived experience of on-call. Leadership expects response times to improve. On-call gets richer dashboards, smarter summaries, and more plausible explanations. To be fair, those do help with faster lookup and briefing in the room, but the social reality stays the same: alerts get silenced in PagerDuty, rules get tagged as “flappy,” and old pages keep firing long after anyone can justify what they were meant to protect. The problem isn’t effort. It’s that the page still doesn’t reliably collapse into shared understanding.&nbsp;Call it the Page-to-Understanding Gap: the time and coordination cost of turning a threshold into a system story.</p><p>Alerts are supposed to start an investigation. Too often, they start translation: <em>what is the system doing right now?</em> That translation slows containment, splinters context, and stretches customer impact.&nbsp;</p><p>That decoding work is what incident response depends on, yet it’s rarely made explicit. It’s why MTTR gains often plateau even after teams invest heavily in monitoring and dashboards.&nbsp;</p><h2 id="alerts-are-a-paging-interface-not-a-language-of-explanation">Alerts are a paging interface, not a language of explanation&nbsp;</h2><p>Alerting is optimized for one job: interrupt a human at the right moment.&nbsp;</p><p>So alerts are built from what’s easiest to express at scale: thresholds, proxy signals, and rules that “usually work.” They encode operational history, not system truth.&nbsp;</p><p>That’s not a failure of alerting. 
It’s what alerting is for, but it also makes alerts a shaky foundation for understanding. They’re a simplified label over messy reality. When alerts aren’t grounded in clear definitions of “healthy” behavior, the signal loses meaning and teams stop trusting it.&nbsp;</p><h2 id="one-alert-name-can-mean-multiple-different-realities">One alert name can mean multiple different realities&nbsp;</h2><p>Take a familiar page: “latency high,” or the modern equivalent: an SLO burn rate page.&nbsp;</p><p>Burn rate fires on checkout. You assume checkout is slow, start in the service dashboard, see p99 up, then notice retries doubled. Meanwhile, Slack is already split: “DB” vs. “checkout.” Fifteen minutes later you realize a downstream dependency is brownouting and checkout is just drowning in retries. The giveaway is usually the shape: one hop shows rising timeouts and retry storms while upstream looks "healthy" until it saturates. The alert didn’t lie—it just didn’t tell you what you needed first.&nbsp;</p><p>The page looks identical. The mechanism isn’t.&nbsp;</p><p>So teams build muscle memory around “what usually causes this,” and it works—until the system changes just enough that it stops working. Scale and change are exactly what modern organizations optimize for, so the failure mode is guaranteed. When that happens, the alert doesn’t just wake you up. It points you in the wrong direction.&nbsp;</p><h2 id="many-alerts-can-describe-the-same-underlying-behavior">Many alerts can describe the same underlying behavior&nbsp;</h2><p>The reverse problem happens just as often. A single degradation creates a cascade of pages across services: latency, errors, saturation, queueing, burn rate alarms. 
Each is technically “true,” but treating them as separate problems creates thrash.&nbsp;</p><p>People split into parallel investigations, duplicate context gathering, and argue about which page is “the real one.” Context fragments across Slack threads and war rooms, ownership ping‑pongs, and escalations get noisy. By the time you agree which page is “primary,” you’ve already created a coordination incident. The outcome isn’t just wasted engineer time—it’s that nobody has one shared narrative everyone can repeat while impact is unfolding. The incident becomes less about understanding and more about sorting competing signals. This is how alert fatigue turns into incident fatigue.&nbsp;</p><h2 id="the-silent-gap-important-behavior-you-don%E2%80%99t-alert-on">The silent gap: important behavior you don’t alert on&nbsp;</h2><p>The costly ones are the slow degradations that ship impact before they page.&nbsp;</p><p>A dependency gets a little slower. Retries creep up. One critical route starts timing out for a slice of customers. Averages look fine. Thresholds don’t trip. You don’t notice until the page fires, or until customers do.&nbsp;</p><p>It’s not that teams don’t care. It’s that these behaviors don’t fit neatly into alert rules: risk builds up, dependencies decay, partial impact hides in aggregates, and propagation only makes sense once you’ve traced it end to end.&nbsp;</p><p>Most alerts only get defined after you’ve understood the behavior in the middle of an incident. The post-mortem produces a new rule and a brief feeling of closure. Then traffic shifts, dependencies evolve, and the next incident arrives with a different shape. 
You’re never done.&nbsp;</p><p>So teams find these late, after impact is already underway, when time is most expensive.&nbsp;</p><h2 id="why-teams-don%E2%80%99t-switch-investigation-entry-points">Why teams don’t switch investigation entry points&nbsp;</h2><p>When teams adopt a new investigation system, they often ask a simple question: “Does it match our alerts?”&nbsp;</p><p>What they’re really asking is: “Can I trust this in the first 90 seconds?”&nbsp;Because in the first minute, the primary goal isn’t elegance, it’s not making it worse.</p><p>A system that generates more hypotheses doesn’t help if it can’t connect the page to what the system is doing in a way the on-call trusts.&nbsp;</p><p>If the system describes an incident in a different language than the alert model engineers rely on, mismatches get interpreted as duplication, contradiction, or risk. The result is predictable: people consult it late, after they’ve already committed to a direction.&nbsp;</p><p>Even correct insights arrive too late to change behavior.&nbsp;</p><h2 id="why-this-problem-is-getting-worse">Why this problem is getting worse&nbsp;</h2><p>Systems are becoming more dynamic: more dependencies, faster deploys, and more integration points. Deploy frequency keeps climbing—and AI-assisted coding is only accelerating it—so the number of failure paths keeps growing. Meanwhile, alert fatigue is already high, and teams are hesitant to change workflows mid-incident.&nbsp;</p><p>Better tooling can speed up lookup and correlation, but it can’t compensate for an alerting model that no longer maps cleanly to real system behavior.&nbsp;</p><p>So the interpretation workload keeps rising. 
Every page demands more interpretation, more cross-checking, more manual stitching of symptoms into a coherent story.&nbsp;</p><h2 id="what%E2%80%99s-actually-broken">What’s actually broken&nbsp;</h2><p>Most organizations are operating with two different languages: the language of paging and the language of understanding. The persistent MTTR plateau is the Page-to-Understanding Gap between them.</p><p>Incidents start with the first, but the work happens in the second.&nbsp;</p><h2 id="a-better-way-to-think-about-alerts">A better way to think about alerts&nbsp;</h2><p>Alerts are not the investigation. They’re a notification that something is going sideways.&nbsp;</p><p>The goal is not to tune thresholds until the noise feels tolerable. It’s to shorten the time from page to shared understanding: what behavior is emerging, what changed, what’s being impacted—and whether it matters to the business.&nbsp;</p><p>Treating that translation as unwritten know‑how is not a workflow quirk. It’s a structural weakness. If your incident response starts with decoding alerts, you’re spending your best engineers on interpretation instead of containment.&nbsp;</p><div class="kg-card kg-cta-card kg-cta-bg-green kg-cta-minimal    " data-layout="minimal">
            
            <div class="kg-cta-content">
                
                
                    <div class="kg-cta-content-inner">
                    
                        <div class="kg-cta-text">
                            <p dir="ltr"><b><strong style="white-space: pre-wrap;">Shipped in v1.0.114</strong></b><span style="white-space: pre-wrap;">: Now each ingested alert is mapped to the symptom it&nbsp;represents&nbsp;and shown directly in the context of the inferred root cause.&nbsp;Alerts are no longer just timestamps and labels. </span><a href="https://www.causely.ai/blog/alerts-arent-the-investigation-what-comes-next-in-incident-response?ref=causely-blog.ghost.io" rel="noreferrer" class="cta-link-color"><span style="white-space: pre-wrap;">They become part of a coherent&nbsp;system&nbsp;story</span></a><span style="white-space: pre-wrap;">.</span></p>
                        </div>
                    
                    
                        <a href="https://docs.causely.ai/changelog/v1.0.114/?ref=causely-blog.ghost.io" class="kg-cta-button " style="background-color: #000000; color: #ffffff;">
                            See the release notes
                        </a>
                        
                    </div>
                
            </div>
        </div>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Queue Growth, Dead-Letter Queues, and Why Asynchronous Failures Are Easy to Misread]]></title>
      <link>https://causely.ai/blog/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread</link>
      <guid>https://causely.ai/blog/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread</guid>
      <pubDate>Tue, 20 Jan 2026 19:18:45 GMT</pubDate>
      <description><![CDATA[Asynchronous pipelines sit at the core of most modern systems. Message brokers accept traffic, consumers process it in the background, and downstream services depend on the results. When these systems fail, the failure rarely shows up where it starts.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/01/queue.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Asynchronous pipelines sit at the core of most modern systems. Message brokers accept traffic, consumers process it in the background, and downstream services depend on the results.</p><p>When these systems fail, the failure rarely shows up where it starts.</p><p>Teams often notice stale data, degraded behavior, or latency spikes elsewhere in the system. By the time those symptoms appear, the underlying problem has usually been present for some time.</p><p>In many real-world failures, two signals appear earlier: <strong>queue growth</strong> and <strong>dead-letter queues</strong>. They are widely monitored, but they are still widely misunderstood.</p><h2 id="the-common-misunderstanding"><strong>The common misunderstanding</strong>&nbsp;</h2><p>Queues are often treated as infrastructure components rather than behavioral signals. When a queue grows, it is attributed to load. When messages land in a DLQ, it is treated as a retry policy doing its job. Investigation tends to focus downstream, where symptoms are visible.</p><p>This framing obscures where asynchronous systems actually break.&nbsp;In many failures, message brokers continue to accept traffic normally. Producers succeed. Nothing looks obviously down. The problem is that consumers are no longer able to keep up reliably or consistently.&nbsp;That distinction matters.</p><h2 id="queue-growth-is-not-just-volume"><strong>Queue growth is not just volume</strong>&nbsp;</h2><p>Queue growth occurs when messages arrive faster than they can be processed successfully over time.&nbsp;This does not require a traffic spike. It can result from:&nbsp;</p><ul><li>Consumers slowing due to code changes or resource pressure&nbsp;</li><li>Dependencies becoming latent or unreliable&nbsp;</li><li>Retry rates increasing&nbsp;</li><li>Backpressure failing to engage&nbsp;</li><li>Partition skew concentrating work unevenly&nbsp;</li></ul><p>In these cases, the broker behaves correctly. 
Messages are accepted. The queue grows quietly.&nbsp;What is accumulating is <strong>lag</strong>.&nbsp;A sustained backlog means work is no longer flowing through the system at the intended rate, even if no explicit failures are visible yet.&nbsp;&nbsp;</p><h2 id="why-this-matters-before-anything-looks-broken"><strong>Why this matters before anything looks broken</strong>&nbsp;</h2><p>Asynchronous systems are designed to absorb instability. Queues buffer mismatches. Retries smooth over failures. Backlogs delay visible impact.&nbsp;This is useful, but it also postpones feedback.&nbsp;As queues grow:&nbsp;</p><ul><li>Processing time increases&nbsp;</li><li>Derived state falls behind&nbsp;</li><li>Downstream services operate on increasingly stale or incomplete data&nbsp;</li></ul><p>The transition from “degraded” to “broken” often appears sudden because the system has been accumulating lag for some time before any external threshold is crossed.&nbsp;</p><h2 id="dead-letter-queues-signal-a-different-failure"><strong>Dead-letter queues signal a different failure</strong>&nbsp;</h2><p>Dead-letter queues exist to capture messages that cannot be processed successfully.&nbsp;Messages land there after repeated failures, timeouts, or deterministic errors. DLQs prevent infinite retries and protect the main pipeline.&nbsp;What they represent is not transient instability, but <strong>persistent processing failure under current system behavior</strong>.&nbsp;</p><p>A non-empty DLQ means some class of messages cannot be handled as the system is currently operating. 
That incompatibility can come from:&nbsp;</p><ul><li>Broken contracts between producers and consumers&nbsp;</li><li>Partial or skewed deployments&nbsp;</li><li>Schema drift&nbsp;</li><li>Unhandled edge cases&nbsp;</li><li>Dependencies that fail consistently rather than intermittently&nbsp;</li></ul><p>DLQs often grow alongside backlogs, but they can also appear independently.&nbsp;</p><h2 id="why-these-problems-are-so-common"><strong>Why these problems are so common</strong>&nbsp;</h2><p>In real systems, producers and consumers evolve independently. Load shifts. Dependencies degrade. Retry behavior changes system dynamics in non-obvious ways.&nbsp;It is common for:&nbsp;</p><ul><li>Brokers to continue accepting traffic&nbsp;</li><li>Queues to grow steadily&nbsp;</li><li>Consumers to fail intermittently or slow down&nbsp;</li><li>Processing failures to accumulate quietly&nbsp;</li></ul><p>Operationally, queues and DLQs sit between services. They rarely have clear ownership. They are easy to monitor superficially and hard to reason about in context.&nbsp;As a result, many teams only notice these issues once downstream behavior degrades.&nbsp;&nbsp;</p><h2 id="queue-growth-and-dlqs-are-related-but-distinct"><strong>Queue growth and DLQs are related, but distinct</strong>&nbsp;</h2><p>Queue growth and DLQs are often discussed together, but they answer different questions.&nbsp;</p><p>Queue growth asks:&nbsp;"Are messages flowing through the system fast enough?"</p><p>DLQs ask:&nbsp;&nbsp;"Are some messages failing to be processed at all?"</p><p>In many incidents, sustained queue growth precedes DLQs. 
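Both orderings fall out of a toy model of the pipeline: hold arrivals flat and vary only consumer behavior. Everything here is hypothetical, including the rates, the retry policy, and the deterministic “poison message” stand-in for a broken contract:

```python
# Toy async pipeline: constant arrivals, a consumer with a fixed rate,
# and a 3-attempt retry policy feeding a dead-letter queue. "Poison"
# messages (every Nth id) fail deterministically. All numbers hypothetical.
from collections import deque

MAX_ATTEMPTS = 3

def simulate(ticks, arrivals, rate, poison_every=0):
    """Return (backlog, dlq_size) after `ticks` steps."""
    queue, dlq, next_id = deque(), [], 0
    for _ in range(ticks):
        for _ in range(arrivals):
            next_id += 1
            queue.append((next_id, 0))              # (message id, attempts)
        for _ in range(min(rate, len(queue))):
            msg_id, attempts = queue.popleft()
            if poison_every and msg_id % poison_every == 0:
                attempts += 1                       # processing failed
                if attempts >= MAX_ATTEMPTS:
                    dlq.append(msg_id)              # retries exhausted
                else:
                    queue.append((msg_id, attempts))  # retry later
    return len(queue), len(dlq)

print(simulate(50, arrivals=100, rate=120))                   # (0, 0): healthy
print(simulate(50, arrivals=100, rate=80))                    # (1000, 0): lag, no spike, empty DLQ
print(simulate(50, arrivals=100, rate=120, poison_every=10))  # (20, 480): DLQ fills, depth looks fine
```

The same toy shows the distinction from the previous section: a slow consumer produces lag with an empty DLQ, while a deterministic processing failure fills the DLQ even though queue depth stays flat.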
Consumers slow down, retries increase, and retry limits are eventually exceeded.&nbsp;In others, DLQs appear immediately due to deterministic processing failures, even while queue depth looks healthy.&nbsp;Treating one as a proxy for the other creates blind spots.&nbsp;</p><h2 id="the-deeper-diagnostic-challenge"><strong>The deeper diagnostic challenge</strong>&nbsp;</h2><p>Most teams diagnose asynchronous failures indirectly.&nbsp;They look at:</p><ul><li>Latency spikes&nbsp;</li><li>Error rates&nbsp;</li><li>Timeouts&nbsp;</li><li>User-visible symptoms&nbsp;</li></ul><p>Those signals matter, but they are downstream effects.&nbsp;Earlier and more precise signals exist inside the message pipeline itself: where messages are accepted, where they slow down, and where they fail to be processed reliably. When those signals are ignored or misinterpreted, teams spend time chasing symptoms rather than isolating where the workflow is actually breaking.&nbsp;&nbsp;</p><h2 id="a-better-way-to-think-about-queues"><strong>A better way to think about queues</strong>&nbsp;</h2><p>Queues are not just buffers.&nbsp;Queue growth is not harmless backlog.&nbsp;Dead-letter queues are not operational exhaust.&nbsp;They are indicators of whether asynchronous workflows are functioning as intended or quietly degrading under real conditions.&nbsp;</p><p>The goal is not to watch queue depth; instead, it’s to continuously understand flow: where work accumulates, why it accumulates, and which downstream interactions it impacts.&nbsp;</p><p>Understanding these signals is not an optimization. It is foundational to operating reliable, event-driven systems.&nbsp;</p><div class="kg-card kg-cta-card kg-cta-bg-green kg-cta-minimal    " data-layout="minimal">
            
            <div class="kg-cta-content">
                
                
                    <div class="kg-cta-content-inner">
                    
                        <div class="kg-cta-text">
                            <p dir="ltr"><b><strong style="white-space: pre-wrap;">Shipped in v1.0.109</strong></b><span style="white-space: pre-wrap;">: Causely now models the </span><b><strong style="white-space: pre-wrap;">async failure mode behind queue growth and DLQs</strong></b><span style="white-space: pre-wrap;">, so teams can </span><b><strong style="white-space: pre-wrap;">pinpoint where processing breaks down</strong></b><span style="white-space: pre-wrap;"> instead of misreading the signals as generic load.</span></p>
                        </div>
                    
                    
                        <a href="https://docs.causely.ai/changelog/v1.0.109/?ref=causely-blog.ghost.io" class="kg-cta-button " style="background-color: #000000; color: #ffffff;">
                            See the release notes
                        </a>
                        
                    </div>
                
            </div>
        </div>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Slight Reliability EP 113: AI Use-cases for SRE with Shmuel Kliger]]></title>
      <link>https://causely.ai/blog/slight-reliability-ai-use-cases-for-sre</link>
      <guid>https://causely.ai/blog/slight-reliability-ai-use-cases-for-sre</guid>
      <pubDate>Mon, 12 Jan 2026 19:31:00 GMT</pubDate>
      <description><![CDATA[Originally published to the Slight Reliability Podcast.]]></description>
      <author>Shmuel Kliger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2026/01/Slight-Reliability-with-Causely-Shmuel-Kliger.png" type="image/jpeg" />
<content:encoded><![CDATA[<p>From the day we invented computers, we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us?<br><br>This week I'm joined by Causely founder Shmuel Kliger to dive into...<br><br>🌊 The three waves of AI hype over the decades (the history of AI)<br>☠️ The dangers of over-promising and under-delivering what AI can do<br>🧠 What is causal reasoning?<br>😱 Is AI replacing SREs?<br>🔮 AI as a way to allow humans to solve higher-level problems<br><br></p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/e1L9YE7igz4?si=SqMOgs6teJO1_AQP" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments]]></title>
      <link>https://causely.ai/blog/causely-expands-datadog-integration-to-deliver-causal-intelligence-across-hybrid-environments</link>
      <guid>https://causely.ai/blog/causely-expands-datadog-integration-to-deliver-causal-intelligence-across-hybrid-environments</guid>
      <pubDate>Mon, 22 Dec 2025 14:43:25 GMT</pubDate>
      <description><![CDATA[Causely’s expanded Datadog integration turns Datadog APM signals into system-level causal intelligence, helping teams understand how issues propagate across services and pinpoint true root cause.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Datadog-and-Causely-integration.png" type="image/jpeg" />
<content:encoded><![CDATA[<p>Causely is expanding its <a href="https://www.datadoghq.com/?ref=causely-blog.ghost.io" rel="noreferrer">Datadog</a> integration to address a problem every senior engineering team eventually runs into: observability data keeps growing, but confidence during incidents does not. Even with Datadog APM, infrastructure metrics, and monitors deployed everywhere, engineers are still forced to interpret symptoms and argue about which change or dependency actually caused an outage. The issue is not missing telemetry. It is the lack of a system-level understanding of cause and effect.&nbsp;</p><p>This limitation becomes especially visible in modern, hybrid architectures. Services span Kubernetes clusters, standalone EC2 instances, ECS tasks, and legacy infrastructure, all connected through real production traffic. Datadog can surface signals across these environments, but understanding how failures propagate across those boundaries remains a manual, error-prone exercise. The result is slower recovery, repeated incidents, and reduced confidence in change.&nbsp;</p><p>With this <a href="https://docs.causely.ai/telemetry-sources/datadog/?ref=causely-blog.ghost.io" rel="noreferrer">expanded Datadog integration</a>, Causely gives teams a unified, causal model of their entire application across Kubernetes and non-Kubernetes environments. This model explains <em>why</em> services are impacted, not just <em>where</em> symptoms appear.&nbsp;</p><h2 id="from-observability-signals-to-system-understanding"><strong>From Observability Signals to System Understanding</strong>&nbsp;</h2><p>Many teams already rely on Datadog APM, infrastructure metrics, and monitors as the backbone of their observability stack. 
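Causely consumes those same traces via Datadog APM dual shipping, which is configured on the Datadog Agent rather than in application code. A hedged sketch of what the relevant `datadog.yaml` fragment can look like; the endpoint URL and key are placeholders, and exact option names should be verified against your Agent version:

```yaml
# datadog.yaml -- keep shipping APM traces to Datadog as usual, and
# dual-ship the same traces to a second endpoint.
# The URL and API key below are placeholders, not real values.
apm_config:
  additional_endpoints:
    "https://causely-mediator.example.internal":
      - "placeholder-api-key"
```

Because the Agent fans the traces out itself, no additional agents or application changes are involved.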
With&nbsp;Causely’s&nbsp;expanded support, those same Datadog signals can now be used to build a complete and&nbsp;accurate&nbsp;causal model of the system without changing existing instrumentation.&nbsp;</p><p>Causely&nbsp;supports Datadog&nbsp;APM&nbsp;dual shipping, which allows trace data to be sent directly from the Datadog collector into&nbsp;Causely’s&nbsp;mediator. Teams continue using Datadog exactly as they do today, while&nbsp;Causely&nbsp;consumes the same traces for causal reasoning. This approach avoids&nbsp;additional&nbsp;agents, avoids data duplication, and does not introduce new egress costs.&nbsp;</p><p>Just as importantly,&nbsp;Causely&nbsp;now supports services running outside Kubernetes&nbsp;and keeps causality intact across the hybrid boundary. By tagging Datadog APM traces with host identity metadata,&nbsp;Causely&nbsp;can stitch together services running on EC2 with those running inside Kubernetes clusters.&nbsp;What previously broke at environment boundaries becomes a single, end-to-end behavioral model of how the application&nbsp;actually runs&nbsp;in production.&nbsp;</p><p>Datadog monitors can also be ingested directly into&nbsp;Causely&nbsp;and treated as symptoms rather than conclusions. Instead of reacting to alerts in isolation,&nbsp;Causely&nbsp;uses them as signals that inform its understanding of what is happening in the system and why.&nbsp;That’s&nbsp;how you get faster convergence, fewer false leads, and higher confidence in the fix.&nbsp;</p><h2 id="a-real-world-hybrid-application-scenario"><strong>A Real-World Hybrid Application Scenario</strong>&nbsp;</h2><p>Consider a typical production application. Customer-facing APIs and&nbsp;frontend&nbsp;services run in a Kubernetes cluster. Background workers, billing services, or legacy processing jobs run on standalone EC2 instances. The application depends on shared infrastructure such as Postgres, Redis, and external APIs. 
Datadog is already deployed across all of it.&nbsp;</p><p>Under normal conditions, everything appears&nbsp;healthy. Then, during a traffic spike, latency starts creeping up in one of the Kubernetes services. Shortly after, Datadog monitors begin firing for elevated error rates in downstream components.&nbsp;Engineers&nbsp;open dashboards, inspect traces, and try to correlate timelines across environments. The symptoms are visible, but the cause is not obvious.&nbsp;</p><p>This is where&nbsp;Causely&nbsp;changes the workflow.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/12/causely_dd_serviceMap4.png" class="kg-image" alt="" loading="lazy" width="2000" height="1001" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/12/causely_dd_serviceMap4.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/12/causely_dd_serviceMap4.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/12/causely_dd_serviceMap4.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/12/causely_dd_serviceMap4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely&nbsp;leverages Datadog&nbsp;APM&nbsp;dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&nbsp;</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/12/data-src-image-ea6fb104-c884-44d6-996c-62636154f91c.png" class="kg-image" alt="A screenshot of a computer

" loading="lazy" width="1311" height="1285" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/12/data-src-image-ea6fb104-c884-44d6-996c-62636154f91c.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/12/data-src-image-ea6fb104-c884-44d6-996c-62636154f91c.png 1000w, https://causely-blog.ghost.io/content/images/2025/12/data-src-image-ea6fb104-c884-44d6-996c-62636154f91c.png 1311w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely&nbsp;leverages Datadog&nbsp;APM&nbsp;dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&nbsp;</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/12/data-src-image-c34f290b-7606-4566-b45d-fd2547d5eeee.png" class="kg-image" alt="A computer screen shot of a computer

" loading="lazy" width="1743" height="1057" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/12/data-src-image-c34f290b-7606-4566-b45d-fd2547d5eeee.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/12/data-src-image-c34f290b-7606-4566-b45d-fd2547d5eeee.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/12/data-src-image-c34f290b-7606-4566-b45d-fd2547d5eeee.png 1600w, https://causely-blog.ghost.io/content/images/2025/12/data-src-image-c34f290b-7606-4566-b45d-fd2547d5eeee.png 1743w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely&nbsp;leverages Datadog&nbsp;APM&nbsp;dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&nbsp;</span></figcaption></figure><h2 id="automatically-pinpointing-the-true-cause"><strong>Automatically Pinpointing the True Cause</strong>&nbsp;</h2><p>Using Datadog traces, infrastructure metadata, and&nbsp;monitor&nbsp;events,&nbsp;Causely&nbsp;continuously reconstructs end-to-end request paths and dependencies across Kubernetes and EC2. That continuity holds across environment boundaries, so you&nbsp;don’t&nbsp;have to manually stitch together “what talks to what” in the middle of an incident.&nbsp;&nbsp;Instead of reacting to individual alerts,&nbsp;Causely&nbsp;continuously builds and&nbsp;maintains&nbsp;a behavioral model of the entire system. This model captures how services, infrastructure, and data flows interact, and how specific failure modes produce observable symptoms.&nbsp;</p><p>Datadog APM traces provide&nbsp;the raw&nbsp;evidence of system behavior, including service interactions, request paths, and downstream dependencies. Datadog monitors are ingested and mapped as symptoms within&nbsp;Causely’s&nbsp;knowledge base. 
Together, these signals allow&nbsp;Causely&nbsp;to&nbsp;maintain&nbsp;an up-to-date causal model that explicitly links&nbsp;observed&nbsp;symptoms to the conditions and changes that produced them.&nbsp;</p><p>Because this causal model is updated continuously,&nbsp;Causely&nbsp;can explain not just what is failing, but what changed first, how the impact propagated, and why specific services or endpoints are affected. The result is a precise, system-level explanation of performance degradation that teams can act on&nbsp;immediately.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/12/data-src-image-ef924856-d2ab-4b92-873b-e51a7f51a731.png" class="kg-image" alt="A screenshot of a computer

" loading="lazy" width="2000" height="641" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/12/data-src-image-ef924856-d2ab-4b92-873b-e51a7f51a731.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/12/data-src-image-ef924856-d2ab-4b92-873b-e51a7f51a731.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/12/data-src-image-ef924856-d2ab-4b92-873b-e51a7f51a731.png 1600w, https://causely-blog.ghost.io/content/images/2025/12/data-src-image-ef924856-d2ab-4b92-873b-e51a7f51a731.png 2016w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Using the system-level understanding,&nbsp;Causely&nbsp;applies its knowledge base of known failure patterns to infer the exact root&nbsp;cause&nbsp;driving service degradation.&nbsp;</span></figcaption></figure><h2 id="from-incident-response-to-reliability-assurance"><strong>From Incident Response to Reliability Assurance</strong>&nbsp;</h2><p>This expanded Datadog integration is not just about faster root cause analysis during incidents. By continuously modeling system behavior,&nbsp;Causely&nbsp;enables teams to&nbsp;validate&nbsp;reliability before changes reach production,&nbsp;monitor&nbsp;how reliability evolves over time, and detect drift caused by infrastructure or configuration changes.&nbsp;</p><p>Modern systems are hybrid by default, and reliability problems do not respect environment boundaries. To&nbsp;operate&nbsp;confidently at scale, teams need more than visibility. They need to understand how their systems behave and why failures occur.&nbsp;</p><p>With expanded Datadog support across Kubernetes and EC2,&nbsp;Causely&nbsp;helps teams move from alert-driven firefighting to causal reliability engineering. 
The result is fewer war rooms, faster resolution, and the confidence to ship changes without fear.&nbsp;</p><p>To learn more about using&nbsp;Causely&nbsp;with Datadog, <a href="https://docs.causely.ai/telemetry-sources/datadog/?ref=causely-blog.ghost.io" rel="noreferrer">explore the integration guide</a> or <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">reach out to see a unified service graph in action</a>.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Thank You, FluxCD: How it helps us, and how you can use it too!]]></title>
      <link>https://causely.ai/blog/thank-you-fluxcd</link>
      <guid>https://causely.ai/blog/thank-you-fluxcd</guid>
      <pubDate>Tue, 16 Dec 2025 17:47:11 GMT</pubDate>
      <description><![CDATA[How Causely uses FluxCD and GitOps to ship weekly on Kubernetes, keep clusters in sync, and wire up OpenTelemetry and Causely in a hands-on lab you can copy.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Gemini_Generated_Image_na4zj1na4zj1na4z.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>The second post in our “thank you” series, just in time for the end of the year.</p><p><a href="https://www.causely.ai/blog/thank-you-grafana-beyla-how-to?ref=causely-blog.ghost.io">In the first one, we said thanks to Grafana for donating Beyla</a> and making it easier for teams to get to usable telemetry quickly. This time we want to zoom out to something that quietly runs under the hood at Causely every day: GitOps with <a href="https://fluxcd.io/?ref=causely-blog.ghost.io">FluxCD</a>.</p><p>Causely is a member of the <a href="https://www.cncf.io/?ref=causely-blog.ghost.io" rel="noreferrer">Cloud Native Computing Foundation (CNCF)</a>. That’s not just a logo on the website for us: our entire product and our own operations lean heavily on CNCF projects. We build on <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry</a>, we run on <a href="https://kubernetes.io/?ref=causely-blog.ghost.io" rel="noreferrer">Kubernetes</a>, and we <a href="https://www.causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai?ref=causely-blog.ghost.io" rel="noreferrer">dogfood our own reliability engine</a> against that stack.</p><p>Another key piece in that puzzle is FluxCD. It is what takes “the desired state in git” and makes it true in our clusters, repeatedly. It’s the heartbeat behind our weekly releases.</p><h2 id="why-flux-helps">Why Flux Helps</h2><p>If you’re operating modern Kubernetes environments, you’ve probably felt this tension. On one hand, you want velocity: teams push changes constantly, services multiply, and configurations evolve all the time. On the other hand, every manual kubectl apply is a potential one-off change that no one can fully reconstruct later. 
Over time, clusters drift away from whatever was last written down as “how things should be,” and you are left relying on muscle memory and shell history.</p><p>Flux solves exactly that problem by turning git into the control surface and reacting to changes within seconds of a merge.</p><p>For us, that has very practical consequences.</p><ul><li>Releases are commits, not hand-crafted ceremonies.</li><li>A new Causely version is rolled out by changing a tag or a value in git; Flux notices the change within seconds, reconciles the cluster, and either converges to the new state or loudly tells us why it couldn’t.</li><li>Environments stay in sync because the same manifests back our test clusters, staging, and production. The differences between them are intentional and visible in overlays, not hidden in one-off fixes on a live production cluster.</li><li>And drift becomes a signal, not a mystery: when the cluster does not match git, Flux shows it, which turns the familiar “what changed?” question into a quick investigation rather than a full-blown incident archaeology.</li></ul><p>The net effect is that we can move quickly without giving up control. Our reliability engine depends on a stable substrate; Flux helps us keep it that way.</p><h2 id="best-practices-we%E2%80%99ve-learned-using-flux">Best Practices We’ve Learned Using Flux</h2><p>We didn’t get there on day one. 
It took a set of habits to turn FluxCD from a cool project into a core platform primitive.</p><ol><li><a href="#treat-manifests-like-product-code" rel="noreferrer">Treat manifests like product code</a></li><li><a href="#keep-kustomize-overlays-boring" rel="noreferrer"> Keep Kustomize overlays boring</a></li><li><a href="#watch-flux-like-any-other-production-controller" rel="noreferrer">Watch Flux like any other production controller</a></li><li><a href="#make-promotion-a-path-not-an-event" rel="noreferrer">Make promotion a path, not an event</a></li></ol><h3 id="treat-manifests-like-product-code">Treat manifests like product code</h3><p>All of our Kubernetes manifests live in git. Not most of them, not “the important ones” – all of them. That sounds obvious, but it changes behavior in subtle ways.</p><ul><li>Reviews happen before things break, because a change to a HelmRelease or a Kustomize overlay goes through the same review process as a feature change.</li><li> The commit history becomes an operational log: when we see a strange spike in errors, we can line it up against recent git changes, including configuration tweaks that would otherwise live only in someone’s bash history.</li><li>Our issue tracker also stays connected to reality, because we reference issue numbers in commit messages, so the question “why did we change this setting?” always has a direct link back to the discussion that justified it.</li></ul><p>In the early days, we still made the occasional manual change in production, usually in the name of speed. Those changes always came back to haunt us as confusing states that no one could fully explain. 
Once we committed to git as the only source of truth and forced ourselves to route every change through it, the platform became much more predictable.</p><h3 id="keep-kustomize-overlays-boring">Keep Kustomize overlays boring</h3><p>We use <a href="https://kustomize.io/?ref=causely-blog.ghost.io" rel="noreferrer">Kustomize</a> to manage environment-specific differences across test clusters, staging, production, and chaos environments. The rule we eventually settled on is simple: overlays describe differences, not alternative universes.</p><p>In practice, that means we maintain a clean base with shared resources such as namespaces, common <a href="https://fluxcd.io/flux/components/helm/helmreleases/?ref=causely-blog.ghost.io" rel="noreferrer">HelmReleases</a>, and shared configuration. On top of that base, we keep the environment overlays as thin as possible. They patch what truly needs to change, such as cluster names, resource limits, or a particular feature flag, rather than redefining whole stacks.</p><p>Whenever we tried to be clever with external references or overlays that diverged heavily from one another, troubleshooting became harder. Keeping overlays compact and predictable means we can scan a diff and understand at a glance what will change in a given cluster. Before committing, we render Kustomize configs locally as a quick sanity check that catches typos and misaligned paths before Flux has to complain about them.</p><h3 id="watch-flux-like-any-other-production-controller">Watch Flux like any other production controller</h3><p>GitOps is not “set and forget.” Flux is a control loop running in production and, when it is unhappy, your platform will slowly drift.</p><p>We treat Flux like a critical controller. We watch reconciliation health and consider a stuck HelmRelease or Kustomization as important as any failing deployment. 
When Flux cannot talk to git, or when an apply keeps failing, that is something we alert on rather than something we notice days later in a dashboard. And when “nothing seems to be changing” in a cluster despite recent commits, Flux logs are one of the first places we look.</p><p>This mindset becomes even more important when GitOps extends beyond just core applications and starts to manage your observability stack, gateways, and even Causely itself.</p><h3 id="make-promotion-a-path-not-an-event">Make promotion a path, not an event</h3><p>Flux really shines when you treat deployments as a series of git-based promotions instead of isolated production pushes. A typical Causely release starts with a change landing in a test environment: we use clusters like test1 and test2 for this. We verify that the change behaves as expected there, including how it interacts with telemetry and Causely’s own reasoning about incidents. Once we are happy, we promote the same change to staging by updating the relevant overlay or values. Only after staging behaves as expected do we roll the change into production.</p><p>Alongside this path, we maintain dedicated chaos clusters, chaos1 and chaos2, where we deliberately break things to see how the system responds. Because everything flows through git, we can rehearse failure modes without fear of leaving behind strange manual fixes that only exist on one cluster. Keeping cluster-specific configuration isolated and well documented is what allows us to run realistic experiments in those chaos clusters without letting that complexity bleed into production.</p><h2 id="try-flux-on-your-own">Try Flux On Your Own</h2><p>To really understand Flux, it helps to feel git driving your cluster. The smallest useful experiment is a git repository, a local Kubernetes cluster, and Flux bootstrapped from that repository. 
The nice part is that <code>flux bootstrap</code> already does most of the heavy lifting: it creates the repository, installs the controllers, and wires everything together for you.</p><p>You can run the following guide on your laptop with <a href="https://kind.sigs.k8s.io/?ref=causely-blog.ghost.io">kind</a>.</p><p>Start by installing the Flux CLI. The easiest way is via the official install script; if you prefer Homebrew, apt, or other package managers, the Flux documentation lists those options as well in the <a href="https://fluxcd.io/flux/installation/?ref=causely-blog.ghost.io">Flux installation guide</a>.</p><pre><code class="language-bash">curl -s https://fluxcd.io/install.sh | sudo bash</code></pre><p>Next, export your GitHub credentials so Flux can authenticate and create the repository for you. If you are logged in with the GitHub CLI (<code>gh auth login</code>), you can derive both the user name and token directly from it.</p><pre><code class="language-bash">export GITHUB_USER="$(gh api user --jq '.login')"
export GITHUB_TOKEN="$(gh auth token)"</code></pre><p>Now, create a local Kubernetes cluster and verify that Flux can run there.</p><pre><code class="language-bash">kind create cluster --name flux-playground</code></pre><p>When your cluster is ready, bootstrap Flux into it. This command will create a repository called <code>flux-playground-gitops</code> under your GitHub account, install Flux into the <code>flux-system</code> namespace, and configure it to track the <code>./clusters/flux-playground</code> path in that repo.</p><pre><code class="language-bash">flux bootstrap github \
  --owner=$GITHUB_USER \
  --repository=flux-playground-gitops \
  --branch=main \
  --path=./clusters/flux-playground \
  --personal</code></pre><p>Clone the newly created repository to your machine and change into it.</p><pre><code class="language-bash">gh repo clone flux-playground-gitops
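# Optional sanity check: confirm the Flux controllers installed by the
# bootstrap are healthy before adding workloads (flux check ships with the CLI)
flux check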
cd flux-playground-gitops</code></pre><p>You are now ready to define the <a href="https://opentelemetry.io/docs/demo/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry demo</a> as a git-managed workload by adding a manifest for the demo under <code>clusters/flux-playground</code>.</p><pre><code class="language-bash">cat &gt; clusters/flux-playground/oteldemo.yaml &lt;&lt;'EOF'
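# Two YAML documents follow: a HelmRepository that tells Flux where the
# OpenTelemetry Helm charts live, and a HelmRelease that installs the
# opentelemetry-demo chart from that source.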
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: open-telemetry
  namespace: flux-system
spec:
  interval: 1m
  url: https://open-telemetry.github.io/opentelemetry-helm-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: otel-demo
  namespace: flux-system
spec:
  interval: 1m
  chart:
    spec:
      chart: opentelemetry-demo
      sourceRef:
        kind: HelmRepository
        name: open-telemetry
        namespace: flux-system
  targetNamespace: otel-demo
  install:
    createNamespace: true
EOF</code></pre><p>At this point your git repository fully describes both Flux itself and the OpenTelemetry demo that Flux will deploy.</p><p>Commit and push these changes so Flux can reconcile the new state.</p><pre><code class="language-bash">git add .
git commit -m "Add OpenTelemetry demo via Flux"
git push origin main</code></pre><p>Within seconds of the push, Flux will see the new revision, apply the changes, and start rolling out the OpenTelemetry demo. You can watch the pods come up.</p><pre><code class="language-bash">watch kubectl get pods -n otel-demo</code></pre><p>After a few minutes you will see the OpenTelemetry demo microservices starting in the <code>otel-demo</code> namespace.</p><p>You now have a real application being managed by Flux from git: the desired state lives in a repository, Flux reconciles it into the cluster within seconds of your merge, and you never had to use <code>kubectl apply</code> for the actual deployment.</p><h2 id="flux-causely-gitops-all-the-way-down">Flux + Causely: GitOps All the Way Down</h2><p>If you followed the small lab above, you already have a local cluster, Flux installed, and the OpenTelemetry demo running under GitOps control. From there, adding <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a> is just one more git-driven change.</p><p>Conceptually, the flow is simple. You obtain a Causely access token, store it as a Kubernetes Secret, and add the Causely FluxCD manifests to the same git repository that Flux already manages. Git drives the rollout, Flux reconciles it into the cluster, the OpenTelemetry demo generates realistic behavior, and Causely explains what is going on.</p><p><a href="https://docs.causely.ai/installation/flux/?ref=causely-blog.ghost.io" rel="noreferrer noopener">Our documentation</a> has a full FluxCD installation guide with more background and variations. The example below is meant to be something you can copy and adapt directly from your existing <code>flux-playground-gitops</code> setup.</p><p>First, retrieve your Causely <a href="https://portal.causely.app/?ref=causely-blog.ghost.io" rel="noreferrer">access token from Causely</a> and keep it handy. Then, create a namespace for Causely and a Kubernetes Secret with your token. 
The Causely FluxCD manifests expect a secret named <code>causely-secrets</code> and use Flux's native post-build substitution to inject <code>CAUSELY_TOKEN</code> into the <code>HelmRelease</code>, so the token never has to be committed to git:</p><pre><code class="language-bash">kubectl create namespace causely
kubectl create secret generic causely-secrets \
  --from-literal=CAUSELY_TOKEN=your-actual-gateway-token-here \
  -n causely</code></pre><p>Next, from inside your GitOps repository, clone the public <a href="https://github.com/causely-oss/causely-deploy?ref=causely-blog.ghost.io" rel="noreferrer">causely-deploy repository</a> and copy the FluxCD manifests into your own cluster configuration.</p><pre><code class="language-bash">cd flux-playground-gitops
git clone https://github.com/causely-oss/causely-deploy.git
mkdir -p clusters/flux-playground/causely
cp causely-deploy/kubernetes/fluxcd/causely/*.yaml clusters/flux-playground/causely/</code></pre><p>With this setup, your git repository now contains everything Flux needs to deploy Causely, and you can commit and push these changes so Flux can reconcile the new state.</p><pre><code class="language-bash">git add clusters/flux-playground
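# Optional cleanup (our suggestion, not part of the upstream guide): remove the
# upstream clone so only the copied manifests are tracked in your GitOps repo
rm -rf causely-deploy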
git commit -m "Add Causely via Flux"
git push origin main</code></pre><p>After the push, Flux notices the change, applies the new manifests, and starts deploying Causely into your cluster. You can watch it the same way you watched the OpenTelemetry demo roll out.</p><pre><code class="language-bash">flux get kustomizations -A
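# If the rollout stalls, the HelmRelease status usually says why (for example a
# chart fetch error or a failed install); -A lists releases in all namespaces
flux get helmreleases -A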
kubectl get pods -n causely</code></pre><p>Once the Causely pods are healthy, you can return to the Causely portal, where the cluster you just configured will appear. Over time, topology fills in and, as issues arise, you will see root cause views associated with the services in your demo.</p><p>At that point, you have a complete loop on a single laptop: Git drives change, Flux applies it, the OpenTelemetry demo generates behavior, and Causely explains what happens when things go wrong. If you want more variations, production-grade knobs, or to run this across multiple clusters, <a href="https://docs.causely.ai/installation/flux/?ref=causely-blog.ghost.io" rel="noreferrer">the FluxCD installation guide</a> in our docs walks through additional options in detail.</p><h2 id="closing-thank-you-flux">Closing: Thank You, Flux</h2><p>FluxCD is a great example of the kind of infrastructure we love in the CNCF ecosystem. It nudges teams toward good habits, turns the question “how did this get here?” into something with a clear, auditable answer, and helps keep complex Kubernetes estates boring and predictable.</p><p>As a CNCF member building on OpenTelemetry, Kubernetes, and the wider cloud-native stack, we are genuinely grateful for projects like Flux that quietly raise the floor for everyone.</p><p>So: thank you to the Flux maintainers and community for building and maintaining an engine that lets us practice what we preach about control, desired state, and autonomous reliability. If you are running Kubernetes and still relying on manual deploys or one-off scripts, Flux is worth a serious look.</p><p>And if you want to see what happens when you combine GitOps with causal reasoning, we are always happy to show you how Causely fits into that picture.</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener">Book a demo today -&gt;</a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Named a Gartner Cool Vendor in AI for IT Operations 2025]]></title>
      <link>https://causely.ai/blog/gartner-cool-vendor-ai-it-operations</link>
      <guid>https://causely.ai/blog/gartner-cool-vendor-ai-it-operations</guid>
      <pubDate>Tue, 09 Dec 2025 20:25:00 GMT</pubDate>
      <description><![CDATA[Gartner recognized Causely for maintaining a live causality graph and using continuous inference to identify the underlying driver behind changes in golden signals as they emerge, even when failures cascade across multiple services.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Causely-Named-a-Gartner-Cool-Vendor-in-AI-for-IT-Operations-2025.png" type="image/png" />
      <content:encoded><![CDATA[<p>We are excited to share that Gartner has named&nbsp;Causely&nbsp;a&nbsp;<a href="https://www.gartner.com/en/documents/7233330?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Cool Vendor for AI in IT Operations for 2025</u></a>.&nbsp;For us, this recognition reflects what&nbsp;we’re&nbsp;seeing&nbsp;across engineering and operations teams everywhere. Systems are changing faster than traditional tools can keep up, and teams need a more reliable way to understand how their applications behave as they evolve.&nbsp;</p><p>At the pace modern cloud-native systems&nbsp;move,&nbsp;reacting after symptoms appear is not enough. Teams need a&nbsp;reliability&nbsp;operating system that works continuously alongside their applications,&nbsp;maintains&nbsp;an up-to-date understanding of how everything fits together, and provides the context&nbsp;required&nbsp;for safe automation and proactive reliability.&nbsp;</p><h3 id="why-this-recognition-matters"><strong>Why this recognition matters</strong>&nbsp;</h3><p>Modern cloud-native applications evolve constantly. Every&nbsp;deploy, configuration change, traffic surge, and infrastructure adjustment can reshape system behavior. 
The pace is so fast that even the best teams struggle to reason about what is happening and why.&nbsp;</p><p>Traditional observability dashboards can show what happened after symptoms appear, and AI copilots can help speed up triage, but reacting after the fact&nbsp;isn’t&nbsp;good enough for business-critical applications.&nbsp;</p><p>Teams need something that runs continuously alongside their systems, understands how everything fits together, and helps them see how changes will affect performance before they land in production.&nbsp;That’s&nbsp;the foundation for proactive reliability, not just faster incident response.&nbsp;</p><h3 id="what-gartner-recognized-about-causely"><strong>What Gartner recognized about</strong>&nbsp;<strong>Causely</strong>&nbsp;</h3><p>Gartner recognized&nbsp;Causely&nbsp;for taking a fundamentally different approach to reliability in modern systems. Instead of reacting to symptoms after they spread,&nbsp;Causely&nbsp;maintains a&nbsp;<a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>live causality graph</u></a>&nbsp;that reflects how services, dependencies, and performance constraints relate to one another as the environment evolves.&nbsp;By continuously analyzing telemetry,&nbsp;Causely&nbsp;identifies&nbsp;the underlying driver behind emerging changes in golden signals, even when failures cascade across multiple services, including the code change, configuration update, or operational event that first introduced risk.&nbsp;</p><p>This continuous causal inference is what enables proactive reliability.&nbsp;Causely&nbsp;provides clear direction on where to focus and what action is most likely to reduce performance risk, long before issues escalate. 
The same causal model supports both pre-production and production, helping teams understand how behavior will shift during testing, rollout, and real-world load.&nbsp;</p><p>Causely&nbsp;runs locally as a lightweight, containerized system and&nbsp;<a href="https://www.causely.ai/blog/demystifying-automatic-instrumentation?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>processes telemetry without exporting raw data</u></a>. This&nbsp;eliminates&nbsp;the need for central pipelines, avoids sampling or data volume constraints, and gives teams high-fidelity insight that integrates directly into their engineering workflows through APIs, webhooks, and an MCP server. This structured context supports both human decisions and safe automation.&nbsp;</p><p>For us, this recognition&nbsp;validates&nbsp;the direction we have been building toward. The future of reliability is proactive, predictive, and grounded in causal understanding.&nbsp;</p><h3 id="who-should-care"><strong>Who</strong>&nbsp;<strong>should care?</strong>&nbsp;</h3><p>Causely&nbsp;is designed for teams responsible for building and&nbsp;operating&nbsp;modern distributed systems. It gives engineering and operations organizations a clearer understanding of how system behavior changes over time and how those changes affect reliability. Whether preparing a release, managing growth, or diagnosing unexpected behavior, teams need deeper clarity to make confident decisions. Continuous causal inference provides&nbsp;that clarity.&nbsp;</p><h3 id="looking-ahead"><strong>Looking ahead</strong>&nbsp;</h3><p>We’re&nbsp;grateful&nbsp;to Gartner for this recognition and excited about what it&nbsp;represents. Reliability is entering a new chapter. 
Systems are more dynamic,&nbsp;<a href="https://www.linkedin.com/pulse/when-ai-overwhelms-your-architecture-machine-load-new-shergilashvili-zhw5f/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>AI workloads are growing</u></a>, and teams need deeper clarity to keep everything running smoothly.&nbsp;</p><p>Causely’s&nbsp;mission is to provide that&nbsp;reliability.&nbsp;Continuous causal inference helps teams prevent issues before they escalate, support high-velocity engineering without sacrificing reliability, and give both humans and automation the context they need to act safely.&nbsp;</p><p>We’re&nbsp;excited for&nbsp;what’s&nbsp;ahead and proud to help shape the future of reliable, intelligent, and resilient systems.&nbsp;</p><hr><p>Want to learn what this could look like for your organization?&nbsp;<a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Get started with Causely today.</u></a>&nbsp;&nbsp;</p><p>Gartner&nbsp;subscribers can&nbsp;<a href="https://www.gartner.com/en/documents/7233330?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>view the full report</u></a>&nbsp;for more information.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Announcing Reliability Delta: Clear, Objective Insight into Whether Your Release Made Your System Better or Worse]]></title>
      <link>https://causely.ai/blog/announcing-reliability-delta</link>
      <guid>https://causely.ai/blog/announcing-reliability-delta</guid>
      <pubDate>Thu, 04 Dec 2025 21:23:59 GMT</pubDate>
      <description><![CDATA[In a 50 to 100+ microservice environment with dense service-to-service dependencies, even small regressions can cascade silently. And slowing down isn’t an option. Leadership needs faster delivery and fewer incidents. This is why we built Reliability Delta.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Screenshot-2025-12-04-at-4.08.32---PM.png" type="image/png" />
      <content:encoded><![CDATA[<p>Your team has been grinding for days, tuning a critical service to improve performance without lighting your cloud bill on fire.&nbsp;It’s&nbsp;the kind of systemic change you&nbsp;can’t&nbsp;hand off to an AI coding agent. After countless reviews, experiments, and late nights, the update is finally in production.&nbsp;</p><p>You take a breath. Maybe even consider sleeping.&nbsp;</p><p>Then Slack lights up:&nbsp;</p><blockquote><strong>“Did it work?” — CTO</strong>&nbsp;</blockquote><p>You stare at&nbsp;dashboards. Nothing’s red. But you still&nbsp;don’t&nbsp;actually know:&nbsp;</p><ul><li>Did reliability improve, or quietly regress?&nbsp;</li><li>Did the change shift bottlenecks or introduce new stress points?&nbsp;</li><li>Are you now closer to the edge under peak load?&nbsp;</li></ul><p>In a 50 to 100+ microservice environment with dense service-to-service dependencies, even small regressions can cascade silently. And slowing down&nbsp;isn’t&nbsp;an option. Leadership needs faster delivery and fewer incidents.&nbsp;</p><p>This is exactly why we built&nbsp;Reliability&nbsp;Delta.&nbsp;</p>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/GiXq71HEGwE?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely Feature Demo: Reliability Delta">
  </iframe>
</div>
<!--kg-card-end: html-->
<h2 id="reliability-delta-a-deterministic-answer-to-%E2%80%9Cdid-this-change-make-things-better-or-worse%E2%80%9D"><strong>Reliability Delta: A Deterministic Answer to “Did This Change Make Things Better or Worse?”</strong></h2><p>Reliability Delta turns subjective guesswork&nbsp;(like&nbsp;manual diffing of dashboards, correlation hunts, “nobody is complaining”&nbsp;anecdotes)&nbsp;into clear, evidence-based reliability signals.&nbsp;</p><p>It’s&nbsp;powered by&nbsp;Causely’s&nbsp;continuously updated understanding of your environment:&nbsp;</p><h3 id="causality-mapping">Causality Mapping&nbsp;</h3><p>Causely&nbsp;builds a Bayesian network that models how issues propagate across services, enabling true cause-and-effect visibility.&nbsp;</p><h3 id="attribute-dependency-graph">Attribute Dependency Graph&nbsp;</h3><p>A DAG of functional dependencies generated from live topology and&nbsp;Causely’s&nbsp;attribute models, highlighting how attributes influence one another.&nbsp;</p><p>These models allow&nbsp;Causely&nbsp;to compare two snapshots—two releases, two load tests, or two moments in time—and&nbsp;determine&nbsp;whether system behavior:&nbsp;</p><ul><li>Improved&nbsp;</li><li>Regressed&nbsp;</li><li>Or shifted in ways you need to investigate&nbsp;</li></ul><p>The result: deterministic&nbsp;signals&nbsp;engineers can trust.&nbsp;&nbsp;</p><h2 id="use-cases-for-reliability-delta"><strong>Use Cases for</strong>&nbsp;<strong>Reliability Delta</strong>&nbsp;</h2><h3 id="1-validate-every-release-instantly"><em>1.</em>&nbsp;<em>Validate</em>&nbsp;<em>Every Release Instantly</em>&nbsp;</h3><p>Know immediately whether your change introduced risk.&nbsp;</p><p>Feature flags and canaries help, but they&nbsp;don’t&nbsp;guarantee safety. 
What matters is whether the system is behaving normally.</p><p>Reliability Delta automatically surfaces:</p><ul><li>Behavior changes isolated to a specific flag, tenant, or traffic segment</li><li>Downstream effects in pipelines, async jobs, and data flows</li><li>Subtle regressions that don’t trip alerts but violate known patterns of normal</li></ul><p>This isn’t “no alerts fired = good.”</p><p>This is evidence-based release confidence.</p><h3 id="2-understand-load-test-results-beyond-passfail"><em>2. Understand Load Test Results Beyond Pass/Fail</em></h3><blockquote><em>“With Causely’s Reliability Delta, we can quantify how each release behaves under identical load. It surfaces changes in bottlenecks, stress patterns, and causal relationships that traditional load tests miss. At our scale, having that level of confidence before shipping is critical.”</em></blockquote><p><strong>Cade Moore, Performance Engineering Lead at Hard Rock Digital</strong></p><p>Did this release push you closer to the breaking point?</p><p>A load test passing doesn’t mean you’re safe.</p><p>Reliability Delta shows:</p><ul><li>How bottlenecks shifted compared to last time</li><li>Whether the same load now produces more stress</li><li>Early signs of fragility or shrinking performance margins</li></ul><p>It answers the question load tests never answer:</p><p>“Are we drifting toward failure or away from it?”</p><h3 id="3-detect-reliability-drift-over-time"><em>3. 
Detect Reliability Drift Over Time</em>&nbsp;</h3><p>Systems naturally drift through config changes, dependency updates, scaling events, and organic load shifts.&nbsp;</p><p>By capturing snapshots periodically, you can:&nbsp;</p><ul><li>Spot slow-building risk&nbsp;</li><li>Track reliability trends&nbsp;</li><li>Validate that ongoing changes are improving SLO posture, not eroding it&nbsp;</li></ul><p>This moves teams from reactive firefighting to proactive reliability assurance.&nbsp;</p><h3 id="4-validate-experiments-with-confidence"><em>4.</em>&nbsp;<em>Validate</em>&nbsp;<em>Experiments with Confidence</em>&nbsp;</h3><p>Know&nbsp;immediately&nbsp;whether your experiment improved or degraded system behavior.&nbsp;</p><p>Teams&nbsp;frequently&nbsp;adjust timeouts, concurrency, sampling, queue behavior, or other system parameters,&nbsp;but these changes rarely trigger alerts, and standard dashboards make it hard to see their true impact.&nbsp;</p><p>Reliability Delta lets you&nbsp;validate&nbsp;experiments with clear before-and-after evidence by automatically highlighting:&nbsp;</p><p>• Shifts in bottlenecks or stress patterns across services&nbsp;</p><p>• Degradations hidden behind “passing” performance metrics&nbsp;</p><p>• Unexpected side effects in downstream dependencies&nbsp;</p><p>• Whether the experiment made the system more resilient or more fragile&nbsp;</p><p>This&nbsp;isn’t&nbsp;trial-and-error tuning.&nbsp;It&nbsp;is evidence-based experiment validation.&nbsp;</p><h2 id="why-reliability-delta-matters-for-modern-engineering-teams"><strong>Why Reliability Delta Matters for Modern Engineering Teams</strong>&nbsp;</h2><p>If&nbsp;you’re&nbsp;accountable for revenue-critical systems—measured&nbsp;by&nbsp;99.9%+ SLOs,&nbsp;delivery&nbsp;pace, and incident reduction—you need more than observability dashboards. 
You need a deterministic framework for evaluating how change affects system behavior.&nbsp;&nbsp;</p><p>Reliability Delta gives you:&nbsp;</p><ul><li>Objective, repeatable comparisons between versions&nbsp;</li><li>Root-cause-aware analysis using causal models&nbsp;</li><li>Clear guardrails leadership can trust&nbsp;</li><li>Confidence to ship fast without risking SLOs or customer experience&nbsp;</li></ul><p>It transforms subjective judgment into trusted, actionable reliability signals—so every release, load test, and system change is safer, faster, and more predictable.&nbsp;&nbsp;</p><h2 id="ship-faster-with-confidence"><strong>Ship Faster with Confidence&nbsp;</strong>&nbsp;</h2><p>Reliability&nbsp;isn’t&nbsp;something you can&nbsp;eyeball&nbsp;anymore. With Reliability Delta, engineering leaders get the missing layer between observability and automation: clear causal evidence of how changes affect system behavior. It ensures your team can move fast, protect SLOs, and deliver with&nbsp;the confidence&nbsp;that every release is safer than the last.&nbsp;</p><p>To learn more,&nbsp;see our docs:&nbsp;<a href="https://docs.causely.ai/in-action/reliability-delta/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>https://docs.causely.ai/in-action/reliability-delta/</u></a>&nbsp;&nbsp;<br>&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[eAfterWork EP 9: What Every Leader Needs to Know with Severin Neumann]]></title>
      <link>https://causely.ai/blog/eafterwork-ep-9-what-every-leader-needs-to-know-with-severin-neumann</link>
      <guid>https://causely.ai/blog/eafterwork-ep-9-what-every-leader-needs-to-know-with-severin-neumann</guid>
      <pubDate>Wed, 03 Dec 2025 11:36:00 GMT</pubDate>
      <description><![CDATA[Originally published as a livestream to e-After Work.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Screenshot-2025-12-09-at-6.36.12---AM.png" type="image/png" />
      <content:encoded><![CDATA[<p>In this episode of eAfterWork, we’re going straight to the source: Severin, OpenTelemetry maintainer, member of the OpenTelemetry Governance Committee, and one of the people who writes and maintains the official OpenTelemetry documentation.<br><br>He’ll help us understand why OpenTelemetry matters for both technical and non-technical leaders, how it’s shaping the future of observability, and what you really need to know to make the right decisions.</p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/LkaytYEAJnQ?si=FzCQwQO6LtsRhuJa" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely: Continuous service reliability root cause hunting]]></title>
      <link>https://causely.ai/blog/causely-continuous-service-reliability-root-cause-hunting</link>
      <guid>https://causely.ai/blog/causely-continuous-service-reliability-root-cause-hunting</guid>
      <pubDate>Mon, 01 Dec 2025 11:30:00 GMT</pubDate>
      <description><![CDATA[Originally posted to Intellyx by Jason English.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/intellyx-header.png" type="image/png" />
<content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://intellyx.com/2025/12/01/causely-continuous-service-reliability-root-cause-hunting/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Intellyx</em></a><em> by </em><a href="https://intellyx.com/author/jenglish/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Jason English</em></a><em>.</em></p><h2 id="an-intellyx-brain-candy-brief">An Intellyx Brain Candy Brief</h2><p><a href="https://www.causely.ai/?ref=causely-blog.ghost.io"><strong>Causely</strong></a> monitors a real-time Bayesian network of semantically abstracted runtime operational telemetry data in order to observe and alert engineers to highly probable causes of issues and failure conditions, so they can ideally be resolved before they can emerge as customer-facing incidents.</p><p>Their “Causal Inference System” is a directed acyclic graph populated with tons of possible failure mode indicators. You can think of these indicators like micro-CVEs for observability, so that the system can know what to look for as it passively observes payloads within OTel logs and traces alongside golden signals such as latency or errors. It’s not another AI SRE, as the inferences it surfaces are deterministically based on live indicators that are semantically enriched with context.</p><p>When causes are observed, they can be captured as feedback so DevOps teams can flow through changes in the next CI/CD cycle, or reported into the enterprise’s incident management, ITSM and observability tools of choice, with direct links to contextual insights.</p><p>Sure, you could still do root cause hunting with any major observability platform worth its salt. 
However, to do that for a large distributed enterprise system, you would need to define thousands of policies that collect and tag telemetry data, and set up triggers and automation for each, such that the cognitive load and cloud data costs to keep it current might be prohibitively high. There’s always another way to do things!</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Purposeful OpenTelemetry]]></title>
      <link>https://causely.ai/blog/purposeful-opentelemetry</link>
      <guid>https://causely.ai/blog/purposeful-opentelemetry</guid>
      <pubDate>Wed, 26 Nov 2025 16:27:02 GMT</pubDate>
      <description><![CDATA[Originally posted as a livestream from OllyGarden.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/12/Screenshot-2025-12-03-at-6.17.42---AM.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Originally posted as a </em><a href="https://www.youtube.com/watch?v=sTOa0MxAm78&ref=causely-blog.ghost.io" rel="noreferrer"><em>livestream from OllyGarden</em></a><em>.</em></p><p>Organizations collect far more telemetry than they use. The result? Exponential costs, PII risks, and still no root cause when incidents happen.<br><br>In this technical session, Yuri Oliveira (OllyGarden) and Severin Neumann (Causely) demonstrate:</p><ul><li>How to identify instrumentation problems at the source</li><li>Why traces and semantic conventions enable purposeful telemetry</li><li>How quality telemetry enables causal reasoning at scale</li></ul>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/sTOa0MxAm78?si=eKzEeTU0bhErolqf" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How Causal AI Is Transforming SRE Reliability in Kubernetes Environments]]></title>
      <link>https://causely.ai/blog/how-causal-ai-is-transforming-sre-reliability-in-k8s</link>
      <guid>https://causely.ai/blog/how-causal-ai-is-transforming-sre-reliability-in-k8s</guid>
      <pubDate>Tue, 25 Nov 2025 15:25:00 GMT</pubDate>
      <description><![CDATA[Originally posted to TFIR by Monika Chauhan. Causely’s Severin Neumann explains how causal reasoning, MCP, and AI-driven automation are transforming SRE workflows and Kubernetes reliability.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/TFIR-header.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://tfir.io/ai-sre-causal-reasoning/?ref=causely-blog.ghost.io" rel="noreferrer"><em>TFIR</em></a><em> by </em><a href="https://tfir.io/author/mona2772/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Monika Chauhan</em></a><em>.</em></p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/BKUs9j8hIiY?si=wJbWfWpS9kMUGkYU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
<p><a href="https://tfir.io/gremlin-empowers-sre-and-chaos-engineering-teams-with-detected-risks-capability-kolton-andrus/?ref=causely-blog.ghost.io" rel="noopener">SRE teams</a>&nbsp;are hitting a breaking point as&nbsp;<a href="https://tfir.io/dynatrace-brings-ai-powered-advanced-observability-to-kubernetes-environments/?ref=causely-blog.ghost.io" rel="noopener">Kubernetes environments</a>&nbsp;scale faster than traditional workflows can keep up. Alerts pile up, incidents drag on, and teams lose hours in reactive firefighting instead of building reliability into their systems. At&nbsp;<a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/?ref=causely-blog.ghost.io" rel="noopener">KubeCon + CloudNativeCon in Atlanta,</a>&nbsp;<a href="https://www.linkedin.com/in/severinneumann/?ref=causely-blog.ghost.io" rel="noopener">Severin Neumann,</a>&nbsp;Head of Community at&nbsp;<a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>, offered a clear perspective on how AI can finally shift SRE practice from reactive to proactive. But this requires more than feeding logs into an&nbsp;<a href="https://tfir.io/tag/llm/?ref=causely-blog.ghost.io" rel="noopener">LLM</a>&nbsp;— it requires causal reasoning.</p><p>For years, the industry has talked about proactive SRE, but the reality rarely matches the aspiration. Neumann has spent more than a decade in&nbsp;<a href="https://tfir.io/tag/observability/?ref=causely-blog.ghost.io" rel="noopener">observability</a>&nbsp;and monitoring, and in his experience, teams always circle back to the same pattern: there’s an alert, teams react, things stabilize, and then the cycle starts again. The promise of AI offered a potential breakthrough, but early attempts fell into a familiar trap.</p><p>When companies say “AI SRE,” many immediately think of a large language model interpreting alerts and suggesting fixes. Neumann stressed that this approach often works only at the surface level. 
LLMs are powerful in pattern matching and summarization, but they don’t understand systems the way SREs need them to. They can connect correlations, but they can’t determine causation. And in high-pressure incident situations, an AI hallucination is more than an annoyance — it can waste hours and cost companies real money.</p><p>Causely takes a different approach. Their platform is built on a causal model that understands the structural relationships inside Kubernetes environments. Instead of giving an LLM raw observability data and hoping it finds meaning, Causely creates a deterministic model of how services interact and what components influence one another. The model knows which symptoms map to which possible root causes and can identify the single cause that explains them all.</p><p>This is a critical shift. Instead of chasing noisy correlations, SRE teams can rely on a reasoning engine that explains exactly why a problem is happening. LLMs do come into the picture, but only after the causal model has already identified the issue. At that point, a language model can help generate explanations, provide remediation guidance, or suggest kubectl or Helm commands to fix the problem. The heavy lifting — the understanding — remains deterministic.</p><p>Neumann also explained why this approach stands apart in the industry. Many teams tried the LLM-first path and burned their fingers. Some built internal solutions, others adopted competitors, but most eventually discovered the limitations. LLMs often fail when precision matters most, and in distributed systems, precision is everything. One misleading answer can waste hours in the middle of a firefight.</p><p>Causely’s causal reasoning avoids that by grounding all decisions in a model built around real system behavior. This also sets the stage for a more ambitious goal: shifting reliability left. 
Instead of waiting for symptoms, the model can analyze normal system behavior and identify bottlenecks or weak points before they cause downtime. This turns SRE work from reaction into prevention.</p><p>The conversation also explored the role of MCP (Model Context Protocol). Causely launched their&nbsp;<a href="https://www.causely.ai/blog/cloud-native-now-causely-adds-mcp-server-to-causal-ai-platform?ref=causely-blog.ghost.io" rel="noopener">MCP Server</a>&nbsp;at KubeCon, enabling developers to pull causal insights directly into tools they already use. Through MCP, the remediation guidance — including detailed commands — can appear directly in an IDE or command line. The SRE no longer has to dig through logs or dashboards to figure out what’s going wrong. The causal model does that work and surfaces the fix.</p><p>Neumann outlined a vision where, over time, teams can set boundaries that allow AI to autonomously remediate certain classes of issues. If a service needs to scale up or memory needs to be increased, those actions could eventually be automated within limits defined by humans. This is where AI becomes a genuine partner — one that teams can trust to handle repetitive corrective tasks while they focus on system design, SLOs, and reliability architecture.</p><p>Trust is a key theme. Introducing AI into SRE workflows doesn’t eliminate human responsibility; it shifts it. Humans now become orchestrators who decide when to delegate and when to intervene. As Neumann put it, the more trust teams build in the system, the more they can remove themselves from situations where they’re no longer needed. This opens space for deeper reliability engineering, building guardrails, and designing better services.</p><p>A real-world example makes the value of causal reasoning clear. Imagine a financial services company running hundreds of microservices. Suddenly, everything turns red and user transactions fail. 
Traditional debugging would mean looking across alerts, logs, and traces to guess where the bottleneck is. Causely can cut straight through the noise, identify the one overloaded service, and surface the exact command needed to scale it up or adjust its memory. The time shaved off matters — for reliability, user experience, and cost.</p><p>Throughout the conversation, it became evident that AI isn’t replacing SREs. It’s allowing them to finally escape the cycle of constant firefighting. With causal reasoning, proactive reliability, and tools like MCP, SREs can shift to roles that emphasize guidance, architecture, and strategic improvements rather than crisis management.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Brings Reliability Engineering to the Heart of Cloud-Native Development with Yotam Yemini]]></title>
      <link>https://causely.ai/blog/causely-brings-reliability-engineering-to-the-heart-of-cloud-native-development</link>
      <guid>https://causely.ai/blog/causely-brings-reliability-engineering-to-the-heart-of-cloud-native-development</guid>
      <pubDate>Mon, 17 Nov 2025 15:39:00 GMT</pubDate>
      <description><![CDATA[Originally posted to Techstrong.tv. Learn how Causely integrates reliability engineering into product development, tackling challenges in cloud-native applications.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/techstrong-1.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://techstrong.tv/videos/interviews/causley-brings-reliability-engineering-to-the-heart-of-cloud-native-development-with-yotam-yemini?ref=causely-blog.ghost.io" rel="noreferrer"><em>Techstrong.tv</em></a><em> by </em><a href="https://techstrong.tv/host/alan-shimel?ref=causely-blog.ghost.io" rel="noreferrer"><em>Alan Shimel</em></a><em>.</em></p><p>Yotam Yemini, CEO of Causely, explains how the company integrates reliability engineering into product development, tackling challenges in cloud-native applications. The evolving role of Site Reliability Engineers (SREs) is discussed, highlighting the need to understand cause and effect in system performance. Causely provides tools for developers to enhance performance and reliability, promoting collaboration among engineering teams within a vibrant open source and cloud technology community.</p><p>Watch the interview <a href="https://techstrong.tv/videos/interviews/causley-brings-reliability-engineering-to-the-heart-of-cloud-native-development-with-yotam-yemini?ref=causely-blog.ghost.io" rel="noreferrer">here</a>. </p><figure class="kg-card kg-image-card"><a href="https://techstrong.tv/videos/interviews/causley-brings-reliability-engineering-to-the-heart-of-cloud-native-development-with-yotam-yemini?ref=causely-blog.ghost.io"><img src="https://causely-blog.ghost.io/content/images/2025/11/Screenshot-2025-11-26-at-10.44.06---AM.png" class="kg-image" alt="" loading="lazy" width="905" height="514" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/11/Screenshot-2025-11-26-at-10.44.06---AM.png 600w, https://causely-blog.ghost.io/content/images/2025/11/Screenshot-2025-11-26-at-10.44.06---AM.png 905w" sizes="(min-width: 720px) 720px"></a></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Drinking the OTel SODA: Send Observability Data Anywhere]]></title>
      <link>https://causely.ai/blog/drinking-the-otel-soda-send-observability-data-anywhere</link>
      <guid>https://causely.ai/blog/drinking-the-otel-soda-send-observability-data-anywhere</guid>
      <pubDate>Mon, 17 Nov 2025 11:38:28 GMT</pubDate>
      <description><![CDATA[With community-standard instrumentation and the OTel Collector, your metrics, logs, and traces are no longer trapped in a walled garden. Originally posted to the ClickHouse blog.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/soda-clickhouse.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Originally posted to the </em><a href="https://clickhouse.com/blog/otel-soda-send-observability-data-anywhere?ref=causely-blog.ghost.io" rel="noreferrer"><em>ClickHouse blog</em></a><em>.</em></p><p>For a long time,&nbsp;<strong>observability has meant buying into a full stack you can’t really change</strong>: proprietary agents to collect the data, a proprietary protocol to move it, and a proprietary backend to look at it.&nbsp;<strong>Your telemetry lived inside a walled garden.</strong></p><p><a href="https://opentelemetry.io/?ref=causely-blog.ghost.io"><strong>OpenTelemetry</strong></a>&nbsp;<strong>(OTel) is breaking that pattern</strong>. With community-standard instrumentation and the&nbsp;<a href="https://www.causely.ai/blog/using-opentelemetry-and-the-otel-collector-for-logs-metrics-and-traces?ref=causely-blog.ghost.io">OTel Collector</a>&nbsp;acting as a translation and routing engine, your&nbsp;<strong>metrics, logs, and traces are no longer trapped in that garden</strong>.</p><h2 id="observability-isn%E2%80%99t-a-monolith">Observability isn’t a monolith&nbsp;</h2><p>There’s nothing inherently wrong with proprietary software; plenty of great systems are closed source.&nbsp;<strong>The problem is when your data becomes proprietary</strong>&nbsp;within these systems.</p><p>When collection, transport, and storage are tightly coupled to one vendor, your options shrink. Want support for a less common programming language? You might wait quarters for an agent. Want to change vendors? Prepare for weeks of reinstrumentation. Even simple ideas, like experimenting with a second backend in parallel, can become “projects.”</p><p><strong>OTel changes the economics</strong>. Today, you can instrument almost everything consistently, and, yes, it has&nbsp;<strong>never been easier to swap your observability platform</strong>&nbsp;without touching application code. 
But, it’s not just about reducing vendor lock-in;&nbsp;<strong>when you own how your data moves, you can send it anywhere</strong>.</p><p><a href="https://www.causely.ai/blog/reflections-on-apmdigests-observability-series-and-where-we-go-next?ref=causely-blog.ghost.io">I promised not to rant about this again</a>, but I need to come back to the “observability” debate once more: the way “observability” is used as a marketing term makes it seem like collecting, processing, and storing telemetry is one big monolith. It isn’t. Your pipeline is inherently composable, and the most leverage shows up at the tail: the “backend.” Treat that tail as a junction, not a cul-de-sac.</p><h2 id="what-does-soda-mean">What does SODA mean?&nbsp;</h2><p>That’s the idea behind&nbsp;<strong>SODA: Send Observability Data Anywhere</strong>.</p><p>SODA is simple on the surface: send a sensible combination or subset of your signals to the best tool for the job at hand. Under the hood, it means&nbsp;<strong>being deliberate about what you send and where you send it</strong>, keeping a durable copy you control, and refusing to recreate a walled garden in the name of convenience. The OTel Collector makes this practical: you can enrich events, redact sensitive fields, apply sampling and routing policies, and fan out to multiple consumers, without touching application code.</p><p>We should understand that&nbsp;<strong>observability data is data, and often has value outside of just observability.</strong></p><h2 id="what-does-soda-look-like-in-practice">What does SODA look like in practice?&nbsp;</h2><p>By owning the transport of your data,&nbsp;<strong>your architecture becomes pluggable, and can adapt to how your needs change over time</strong>.</p><p>In the immediate term, it means you can&nbsp;<strong>look at the most pressing concern most teams face: cost</strong>. 
You may not be able to move observability vendors overnight, but by instrumenting your app with OTel, you solve one of the biggest hurdles to getting started: testing a new observability stack is one configuration change away, and you can run multiple tools in parallel for comparison or during a migration.</p><p>For long-term retention,&nbsp;<strong>OTel makes it easy to store your complete historical data as a&nbsp;</strong><a href="https://clickhouse.com/blog/lakehouses-path-to-low-cost-scalable-no-lockin-observability?ref=causely-blog.ghost.io"><strong>durable copy in inexpensive object storage in open formats</strong></a>. Doing so frees you from keeping all of your history in an expensive vendor bucket, and gives you replay: you can hydrate any alternative tool later without asking teams to reinstrument.</p><p>From there, you can&nbsp;<strong>send slices of telemetry to systems that act</strong>. Use signals inside the cluster to make decisions, like autoscaling with&nbsp;<a href="https://keda.sh/?ref=causely-blog.ghost.io">KEDA</a>&nbsp;or the&nbsp;<a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=causely-blog.ghost.io">Horizontal Pod Autoscaler</a>&nbsp;(HPA), gating progressive delivery so a canary only promotes when golden metrics and dependency health agree, or flipping feature flags when error budgets trend in the wrong direction. Security teams can mirror selected logs and traces to their SIEM, while the raw truth stays immutable in cold storage.</p><p>As AI enters the observability space,&nbsp;<strong>you are free to adopt new platforms that innovate in specific areas that matter to you</strong>, without being restricted to the capabilities of your core observability platform. 
You can use&nbsp;<a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a>&nbsp;to apply model-driven causal inference over your topology and recent changes to pinpoint likely causes for reliability issues and recommend or auto-apply safe actions. Or&nbsp;<a href="https://ollygarden.com/?ref=causely-blog.ghost.io">OllyGarden</a>&nbsp;to help improve the quality of your OTel implementation.</p><p>Your&nbsp;<strong>telemetry is also product and business data in disguise</strong>, and can enable customer-facing teams to see product usage, user journeys, or&nbsp;<a href="https://www.causely.ai/blog/causely-feature-demo-clickstack?ref=causely-blog.ghost.io">customer impact as a result of an incident</a>.&nbsp;Observability data is often siloed and unavailable to internal analytics, but OTel allows you to bridge that gap whether you have separate data platforms or a unified data stack. Platforms like&nbsp;<a href="https://clickhouse.com/use-cases/observability?ref=causely-blog.ghost.io">ClickStack</a>&nbsp;build on top of flexible, open datastores like ClickHouse, and demonstrate how observability and business data can be co-located and correlated within one platform.</p><p>When the pipeline is open, you can experiment with, and adopt, new tools without rearchitecting your stack.</p><h2 id="composed-not-siloed">Composed, not siloed&nbsp;</h2><p>OpenTelemetry already breaks silos at the source by giving us a shared schema and transport for metrics, logs, and traces. SODA applies the same principle at the tail. Keep signals together as shared context on a durable stream and let specialized tools subscribe to that context, hand off seamlessly, and act without duplicating or fragmenting the truth. 
In a composed flow, a root cause identified in one place becomes the pivot to quantify impact and drive recovery in another—without losing the thread.</p><h2 id="with-freedom-comes-responsibility">With freedom comes responsibility&nbsp;</h2><p>I’d like to stress an important note of caution:&nbsp;<strong>“send anywhere” is not the same as “send everything everywhere.”</strong></p><p><a href="https://clickhouse.com/blog/breaking-free-from-rising-observability-costs-with-open-cost-efficient-architectures?ref=causely-blog.ghost.io">Splitting your metrics, logs, and traces across disjointed backends that cannot be correlated is a fast path to longer MTTR and finger-pointing</a>. If you want a thoughtful breakdown of why unified access matters for investigations,&nbsp;<a href="https://medium.com/womenintechnology/storing-all-of-your-observability-signals-in-one-place-matters-36178cd0ce10?ref=causely-blog.ghost.io">this article is a good primer</a>.</p><p>The SODA posture helps keep a coherent, durable source of truth under your control, then route purposeful subsets to the systems that extract additional value.</p><h2 id="drink-some-soda-and-let-us-know-what-you-think">Drink some SODA and let us know what you think&nbsp;</h2><p>If you’ve read this far, you probably already have a mental list of places you wish your telemetry could go but currently doesn’t. That list is your SODA plan.</p><p>Maybe you start by dual-writing to object storage and turning on replay. Maybe you add a lightweight autoscaling signal for a spiky service. Maybe you route instrumentation health to a specialized tool while keeping unified access for investigations. The point isn’t to chase a shiny mesh of destinations; it’s to get more leverage from the data you already collect: safely, cheaply, and under your control.</p><p><a href="https://clickhouse.com/slack?ref=causely-blog.ghost.io">We’d love to hear how you’re doing SODA today</a>. 
Where are you sending telemetry beyond the one vendor you pay for? Which use cases are you covering: cost reduction, faster incident response, safer rollouts, richer product insights? Which ones do you want to see covered outside the “standard observability pipeline”?</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[International Business Times: Cutting Through the Noise — Startups to Watch at KubeCon 2025]]></title>
      <link>https://causely.ai/blog/international-business-times-cutting-through-the-noise-startups-to-watch-at-kubecon-2025</link>
      <guid>https://causely.ai/blog/international-business-times-cutting-through-the-noise-startups-to-watch-at-kubecon-2025</guid>
      <pubDate>Mon, 10 Nov 2025 03:26:00 GMT</pubDate>
      <description><![CDATA[Originally posted to International Business Times by David Thompson.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/IBT-header-1.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://www.ibtimes.com/cutting-through-noise-startups-watch-kubecon-2025-3789977?ref=causely-blog.ghost.io" rel="noreferrer"><em>International Business Times</em></a><em> by David Thompson.</em></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/11/kubernetes.webp" class="kg-image" alt="" loading="lazy" width="736" height="416" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/11/kubernetes.webp 600w, https://causely-blog.ghost.io/content/images/2025/11/kubernetes.webp 736w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Kubernetes</span></figcaption></figure><p>It's back! This week is KubeCon, North America—one of the most beloved technology conferences in the world, centered around one of the most beloved enterprise technologies: Kubernetes. Born out of Google, Kubernetes has changed the very infrastructure on top of which modern applications are built. More than half of all enterprises have adopted Kubernetes, with millions of developers deploying their applications on top of Kubernetes every day.</p><p>But like any other technology, it comes with its trade-offs...mainly with the complexity of managing performance and debugging issues, which is why hundreds, if not thousands, of software vendors are touting their Kubernetes solutions this week. 
To help you separate the noise from the value, we want to highlight four companies that truly help deliver value to enterprises using Kubernetes.</p><h2 id="gremlin">Gremlin</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://d.ibtimes.com/en/full/4633718/gremlin.png?w=736&amp;f=ee6a09ff939121bd7d5e7624e65ec634" class="kg-image" alt="Gremlin" loading="lazy" width="736" height="579"><figcaption><span style="white-space: pre-wrap;">Gremlin</span></figcaption></figure><p>We're very excited to see this company ramp back up their public persona.&nbsp;<a href="http://www.gremlin.com/?ref=causely-blog.ghost.io" rel="noopener">Gremlin</a>&nbsp;was founded by ex-Netflix and Amazon engineers in 2017 and made a lot of noise around the cutting-edge discipline of Chaos Engineering. For those of you who are unfamiliar, Gremlin has built a platform that empowers companies to run experiments on their systems in order to proactively identify where the weaknesses are.</p><p>Like getting people to the gym, it isn't always easy to convince people to change their habits—but the enterprises that build Chaos Engineering into their routines benefit from much healthier, more reliable systems. At this KubeCon, Gremlin is announcing a strategic partnership with&nbsp;<a href="https://www.dynatrace.com/?ref=causely-blog.ghost.io" rel="noopener">Dynatrace</a>—a leader in observability and application performance monitoring—to help Kubernetes users keep their applications in a desired state.</p><p>Kubernetes services are automatically discovered within Gremlin, powered by Dynatrace's AI-driven observability and topology mapping. 
Health checks are then applied to Kubernetes objects, allowing organizations to efficiently implement standardized reliability testing and gain deeper insights into their environments.</p><h2 id="mezmo">Mezmo</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://d.ibtimes.com/en/full/4633719/mezmo.jpg?w=736&amp;f=5c59190c0b17dbee6137eed9014621d2" class="kg-image" alt="Mezmo" loading="lazy" width="736" height="414"><figcaption><span style="white-space: pre-wrap;">Mezmo</span></figcaption></figure><p><a href="http://www.mezmo.com/?ref=causely-blog.ghost.io" rel="noopener"><strong>Mezmo</strong></a>&nbsp;is pioneering the concept of active telemetry in order to provide AI agents with better data and context. The company recently launched a root cause analysis (RCA) agent aimed at Kubernetes users that automatically identifies and fixes common issues such as deployment failures, resource issues, configuration errors, application-level failures, and more.</p><p>The proof is in the pudding: their AI agent consistently resolves issues in complex cloud environments&nbsp;<a href="https://www.mezmo.com/blog/why-your-sre-agent-overpromises-and-underproduces-plus-how-to-fix-that?ref=causely-blog.ghost.io" rel="noopener">faster and more accurately</a>&nbsp;than other AI agents and models. 
According to the company's blog,&nbsp;<em>"we're entering an era where incidents resolve themselves before engineers even know they exist."</em>&nbsp;By leveraging agentic AI workflows, Mezmo rapidly analyzes telemetry data to pinpoint root causes, eliminate noise, and recommend actionable remediation steps.</p><h2 id="causely">Causely</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://d.ibtimes.com/en/full/4633720/causely.png?w=736&amp;f=4353e13881c8d1a23650b31e19f41e14" class="kg-image" alt="Causely" loading="lazy" width="736" height="397"><figcaption><span style="white-space: pre-wrap;">Causely</span></figcaption></figure><p><a href="http://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>&nbsp;is pioneering the category of AI SRE—a category that will undoubtedly explode in popularity in 2026. Now more than ever, engineering teams are drowning in too much data and too many alerts. Having a system like Causely on the market—founded by a veteran of the industry with two prior startups in IT Operations—is a natural and critical response to the rise of AI code-generation tools that are shipping code faster than humans can reasonably understand it or manage it.</p><p>At this year's KubeCon, Causely is announcing an MCP Server that seamlessly integrates into any MCP-compatible IDE and enables developers to automatically diagnose, understand, and remediate complex issues within Kubernetes and application code using natural language prompts.</p><p>It works by analyzing the real-time state of the system, identifying whether the cause of an issue is in the infrastructure or application layer, recommending the precise code changes, configuration changes, or helm chart updates, and presenting these suggestions inline within the developer's IDE for review, refinement, or approval.</p><h2 id="komodor">Komodor</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://d.ibtimes.com/en/full/4633721/komodor.png?w=736&amp;f=3f8e3fad1256018bf01ffddeb85b41cc" class="kg-image" alt="Komodor" loading="lazy" width="736" height="447"><figcaption><span style="white-space: pre-wrap;">Komodor</span></figcaption></figure><p>It's hard to talk about Kubernetes troubleshooting and not mention&nbsp;<a href="http://www.komodor.com/?ref=causely-blog.ghost.io" rel="noopener">Komodor</a>. Founded by an ex-Google engineer, Komodor has been dedicated to the Kubernetes ecosystem since it launched out of stealth in 2021. Their management platform simplifies operations, provides automated troubleshooting, and helps teams manage complex environments. It tracks changes, analyzes their impact, and provides actionable context for issues, which reduces troubleshooting time and improves delivery velocity. Key features include automated drift detection, root cause analysis, and monitors for cluster health and resource optimization.</p><p>If you'll be in Atlanta, Georgia, this week for KubeCon 2025—stop by the booth of each of these companies. They are adding tremendous value to enterprises looking to maximize the performance and benefits of using Kubernetes.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[TechOps Talk: Causal Reasoning-Based Root Cause Analysis]]></title>
      <link>https://causely.ai/blog/techops-talk-causal-reasoning-based-root-cause-analysis</link>
      <guid>https://causely.ai/blog/techops-talk-causal-reasoning-based-root-cause-analysis</guid>
      <pubDate>Sun, 09 Nov 2025 14:51:00 GMT</pubDate>
      <description><![CDATA[Learn why causal inference is the missing piece in AI-driven observability, and how Causely is the only AI SRE platform that uses causal reasoning to pinpoint the where, what, and why of application- and system-related issues.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/Screenshot-2025-11-10-at-9.48.13---AM.png" type="image/png" />
      <content:encoded><![CDATA[<p>Tune into this TechOps Talk between UKG SRE Olivier Mwanza Tshibemba and Causely CEO Yotam Yemini. They discuss why causal inference is the missing piece in AI-driven observability, and how Causely is the only AI SRE platform that uses causal reasoning to pinpoint where, what, and why application- and system-related issues occur, making them easier than ever to troubleshoot and resolve.</p>
<!--kg-card-begin: html-->
<iframe src="https://www.linkedin.com/video/embed/live/urn:li:ugcPost:7392939107649441793?embedDomain=www.causely.ai" height="500" width="710" frameborder="0" allowfullscreen="" title="Embedded post"></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Cloud Native Now: Causely Adds MCP Server to Causal AI Platform for Troubleshooting Kubernetes Environments]]></title>
      <link>https://causely.ai/blog/cloud-native-now-causely-adds-mcp-server-to-causal-ai-platform</link>
      <guid>https://causely.ai/blog/cloud-native-now-causely-adds-mcp-server-to-causal-ai-platform</guid>
      <pubDate>Thu, 06 Nov 2025 15:21:00 GMT</pubDate>
      <description><![CDATA[Originally posted to Cloud Native Now by Mike Vizard.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/Cloud-Native-Now.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://cloudnativenow.com/features/causely-adds-mcp-server-to-causal-ai-platform-for-troubleshooting-kubernetes-environments/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Cloud Native Now</em></a><em> by </em><a href="https://cloudnativenow.com/author/mikevizard/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Mike Vizard</em></a><em>.</em></p><p>Causely today unveiled a Model Context Protocol (MCP) server that enables developers to automatically diagnose, understand, and remediate complex issues within Kubernetes and application code using natural language prompts from within their integrated developer environment (IDE).</p><p>Severin Neumann, head of community for Causely, said the Causely MCP Server will make it easier for application developers to troubleshoot Kubernetes issues using a platform that already uses causal artificial intelligence (AI) models that were created to augment site reliability engineers (SREs).</p><p>The overall goal is to provide developers and SREs with insights needed to reduce downtime by analyzing the state of the system in real time to identify whether the cause of an issue is in the infrastructure or application layer. 
The Causely platform will then recommend precise code, configuration or Helm chart changes for developers to review, refine, or approve within their IDE.</p><p>Additionally, the Causely platform will also generate patches for Terraform, Helm, or application code to prevent issues from recurring.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/2025/11/service-overview-1.webp" class="kg-image" alt="" loading="lazy" width="836" height="450" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/11/service-overview-1.webp 600w, https://causely-blog.ghost.io/content/images/2025/11/service-overview-1.webp 836w" sizes="(min-width: 720px) 720px"></figure><p>It’s not clear at this point just how many software engineering teams are relying on AI to automate the management of Kubernetes workflows, but given the complexity of these environments, there is a significant opportunity to reduce stress and toil. The simple truth is that many organizations are limiting the number of Kubernetes clusters they might deploy simply because there isn’t enough SRE expertise available to manage them.</p><p>Many developers, meanwhile, have historically been intimidated by the complexity of Kubernetes clusters. The MCP server developed by Causely makes it simpler for developers to resolve issues on their own, both before and after cloud-native applications are deployed.</p><p>It’s not likely AI platforms such as Causely will replace the need for SREs any time soon, but they do present an opportunity for fewer SREs to successfully manage a larger number of Kubernetes clusters at scale at a time when the number of cloud-native applications being deployed continues to steadily increase. 
The issue, as always, is reducing mean time to remediation whenever all but inevitable incidents occur.</p><p>In fact, AI advances should enable SREs to spend more time on strategic issues such as ensuring availability versus constantly performing tactical incident management tasks, said Neumann.</p><p>Ultimately, multiple AI models that are being invoked by AI agents will soon be pervasively embedded across every DevOps workflow. The next major challenge will be finding a way to orchestrate the management of AI agents that will be accessing causal, predictive and generative AI models to optimize IT environments. Hopefully, those advances will significantly reduce the level of burnout that many software engineering teams experience as they are overwhelmed by repetitive manual tasks, especially when managing complex Kubernetes clusters that can be easily misconfigured.</p><p>In fact, it’s even conceivable that as AI starts to eliminate many of the&nbsp;<a href="https://cloudnativenow.com/webinar/big-risks-engineering-bottlenecks-and-ai/?ref=causely-blog.ghost.io" rel="noopener">bottlenecks that exist in software engineering</a>&nbsp;workflows, much of the joy that attracted SREs and application developers to IT in the first place might soon be rediscovered.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[TechTimes: Causely Launches MCP Server for Automated Issue Resolution in Kubernetes]]></title>
      <link>https://causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes</link>
      <guid>https://causely.ai/blog/techtimes-causely-launches-mcp-server-for-automated-issue-resolution-in-kubernetes</guid>
      <pubDate>Thu, 06 Nov 2025 15:14:00 GMT</pubDate>
      <description><![CDATA[Reposted from its original publication on TechTimes by Carl Williams]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/techtimes.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Author: </em><a href="https://www.techtimes.com/reporters/carl-williams?ref=causely-blog.ghost.io" rel="noreferrer"><em>Carl Williams</em></a><em>; Republished with permission from its source on </em><a href="https://www.techtimes.com/articles/312540/20251106/causely-launches-mcp-server-automated-issue-resolution-kubernetes.htm?ref=causely-blog.ghost.io" rel="noreferrer"><em>TechTimes</em></a></p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/2025/11/service-overview.webp" class="kg-image" alt="" loading="lazy" width="836" height="450" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/11/service-overview.webp 600w, https://causely-blog.ghost.io/content/images/2025/11/service-overview.webp 836w" sizes="(min-width: 720px) 720px"></figure><p>If you haven't heard the term&nbsp;<em>"AI SRE,"</em>&nbsp;you will soon. Following the boom of AI-generated code helping developers significantly increase their output, it was only a matter of time before more of the operations-related tasks around debugging, root cause analysis, and incident resolution were also AI-assisted.</p><p>One of the startups pioneering the AI SRE category is&nbsp;<a href="http://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>—founded by IT Ops veteran Shmuel Kliger, who successfully sold two of his prior companies. Causely is unique in its ability to leverage causal inference to arrive at the root cause of issues accurately and more quickly than other solutions on the market. They are also shifting reliability left and identifying concerns before the code even gets shipped to production.</p><p>Today, the company announced the release of the new Causely MCP Server, a powerful tool designed to streamline and automate troubleshooting within Kubernetes. The company is making the announcement ahead of the popular conference KubeCon next week. 
This innovative solution integrates seamlessly with any MCP-compatible IDE, empowering developers to diagnose, understand, and resolve complex system issues using simple natural language prompts.</p><p>As Kubernetes' scalability and flexibility grow, so does the complexity of managing these systems. Issues like resource conflicts, unexpected pod evictions, and DNS delays often lead engineers to patch symptoms without uncovering root causes. Traditional monitoring tools offer valuable data but can make troubleshooting a manual, time-consuming process.</p><p>Causely's MCP Server aims to change that by embedding advanced causal reasoning directly into the developer workflow. Once integrated into popular IDEs such as Cursor or Claude, the system allows engineers to describe problems or desired outcomes conversationally, removing the need for manual searches or scripting.</p><h2 id="key-features-of-the-causely-mcp-server-include"><strong>Key features of the Causely MCP Server include:</strong></h2><ul><li><strong>IDE-Centric Integration</strong>: Easy installation into MCP-compatible IDEs without significant infrastructure changes.</li><li><strong>Natural Language Prompts</strong>: Developers communicate issues and fixes naturally, streamlining problem reporting and resolution.</li><li><strong>Context-Aware Recommendations</strong>: The system uses real-time data and causal models to suggest effective fixes at the runtime, configuration, or code level.</li><li><strong>Upstream Fixes</strong>: Generates patches for Terraform, Helm, or application code to prevent similar issues in future deployments.</li><li><strong>Immediate Review &amp; Refinement</strong>: Recommendations appear inline for iterative review before applying changes.</li></ul><p>The MCP Server analyzes real-time system states to pinpoint whether an issue stems from infrastructure or application layers, then recommends precise modifications—be it code, configuration, or Helm chart adjustments. 
This approach simplifies maintaining systems in their desired state.</p><p>Karthik Ramakrishan, VP of Artificial General Intelligence at Amazon, praised the innovation:&nbsp;</p><blockquote><em>"Language models are powerful, but they require structured causal context to make the right decisions. Causely fills this gap, enabling real-time automation and reliable microservice operations."</em></blockquote><p>By embedding intelligent, causal remediation directly into developers' workflows, Causely makes maintaining Kubernetes applications more straightforward and efficient than ever before. Certainly a company worth checking out if you'll be in Atlanta for KubeCon...or even if you won't be there, for that matter.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Introduces MCP Server for Automated Remediation Across Kubernetes and Application Code]]></title>
      <link>https://causely.ai/blog/causely-introduces-mcp-server-for-automated-remediation</link>
      <guid>https://causely.ai/blog/causely-introduces-mcp-server-for-automated-remediation</guid>
      <pubDate>Thu, 06 Nov 2025 15:04:00 GMT</pubDate>
      <description><![CDATA[Causely announced the launch of the Causely MCP Server that seamlessly integrates into any MCP-compatible IDE and enables developers to automatically diagnose, understand, and remediate complex issues within Kubernetes and application code using natural language prompts.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/businesswire.png" type="image/png" />
      <content:encoded><![CDATA[<p><strong>NEW YORK--(</strong><a href="https://www.businesswire.com/news/home/20251106134583/en/Causely-Introduces-MCP-Server-for-Automated-Remediation-Across-Kubernetes-and-Application-Code?ref=causely-blog.ghost.io" rel="noreferrer"><strong>BUSINESS WIRE</strong></a><strong>)</strong>--<a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a>, a leader in AI-driven Site Reliability Engineering, today announced the launch of the Causely MCP Server that seamlessly integrates into any MCP-compatible IDE and enables developers to automatically diagnose, understand, and remediate complex issues within Kubernetes and application code using natural language prompts.</p><p>Kubernetes scalability and flexibility come with increased complexity. Services conflict for resources, pods are evicted unexpectedly, DNS queries lag, etc. When outages occur, engineers are often left patching symptoms without understanding the cause of these observed problems. Traditional monitoring and observability tools provide useful data, but troubleshooting remains a manual and time-consuming process.</p><blockquote>“Causely’s MCP Server accelerates incident response by placing sophisticated causal reasoning directly in the hands of developers,” said Ben Yemini, Head of Product at Causely. 
“Once integrated into IDEs such as Cursor or Claude, the MCP Server allows engineers to describe problems or desired outcomes using simple natural language commands.”</blockquote><h2 id="key-features-of-the-causely-mcp-server-include"><strong>Key Features of the Causely MCP Server include:</strong></h2><ul><li><strong>IDE-Centric Integration:</strong>&nbsp;Installs seamlessly into any MCP-compatible IDE, requiring no significant infrastructure overhaul.</li><li><strong>Natural Language Prompts:</strong>&nbsp;Developers communicate problems or fixes conversationally, without needing to write scripts or manually search dashboards.</li><li><strong>Context-Aware Recommendations</strong>: The system uses real-time system data and causal models to propose specific, effective fixes at runtime, configuration, or code level.</li><li><strong>Upstream Fixes:</strong>&nbsp;Generates patches for Terraform, Helm, or application code to prevent issues from recurring in future deployments.</li><li><strong>Immediate Review &amp; Refinement:</strong>&nbsp;Developers see recommendations inline, allowing iterative improvements before applying changes.</li></ul><p>Causely’s new MCP server works by analyzing the real-time state of the system; identifying whether the cause of an issue is in the infrastructure or application layer; recommending the precise code changes, configuration changes, or helm chart updates; and presenting these suggestions inline within the developer’s IDE for review, refinement, or approval.</p><blockquote>"If you’re serious about automating reliability in microservices you need what Causely is doing,” said Karthik Ramakrishan, VP of Artificial General Intelligence at Amazon. “Language models are powerful, but they can’t make the right calls without structured causal context. 
That’s the gap Causely fills, and it’s what makes real-time automation possible."</blockquote><p>By embedding intelligent, causal remediation into the developer workflow, Causely makes it simpler than ever to maintain Kubernetes applications in their desired state. To learn more, read the announcement blog post or stop by their booth at KubeCon, Atlanta November 11-13.</p><h2 id="about-causely">About Causely</h2><p>Causely&nbsp;is an AI startup dedicated to transforming Site Reliability Engineering through innovative automation, causal reasoning, and developer-centric tools. Their solutions help organizations manage complex distributed systems more efficiently and reliably.</p><h2 id="media-contact">Media Contact</h2><p>Adam LaGreca</p><p>Founder, 10K Media</p><p><a href="mailto:adam@10kmedia.co" rel="noreferrer">adam@10kmedia.co</a> </p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Introducing Causely’s MCP Server for Automated Remediation in Kubernetes and Beyond]]></title>
      <link>https://causely.ai/blog/introducing-causelys-mcp-server</link>
      <guid>https://causely.ai/blog/introducing-causelys-mcp-server</guid>
      <pubDate>Wed, 05 Nov 2025 23:37:07 GMT</pubDate>
      <description><![CDATA[The Causely MCP Server brings our Causal Reasoning Engine directly into the IDE so engineers can understand why incidents happen and apply the right fix at the right layer, whether that’s runtime, configuration, or code.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/Screenshot-2025-11-05-at-3.06.04---PM.png" type="image/png" />
      <content:encoded><![CDATA[<p>Today, we’re releasing the Causely MCP Server. It brings our Causal Reasoning Engine directly into the IDE so engineers can understand why incidents happen and apply the right fix at the right layer, whether that’s runtime, configuration, or code.</p><h2 id="the-problem-complexity-hides-the-real-issue"><strong>The Problem: Complexity Hides the Real Issue</strong></h2><p>Kubernetes gives teams <a href="https://www.causely.ai/blog/do-you-even-need-kubernetes?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>the power to scale fast</u></a>, but it also introduces new layers of complexity. Services contend for memory. Pods get evicted. DNS queries slow to a crawl. When something breaks, symptoms often show up far away from the real cause.</p><p>A latency spike might surface at an API gateway, but the actual issue could be a congested message queue. Pod evictions might trace back to a misconfigured limit in a different service. Engineers end up chasing alerts, patching downstream effects, and firefighting without ever fully closing the loop.</p><p>In systems like this, even well-intentioned remediation often lands in the wrong place. The action isn’t wrong; it’s just applied to the symptoms, not the cause.</p><h2 id="how-causely-approaches-this"><strong>How Causely Approaches This</strong></h2><p>Causely was built for distributed systems where problems propagate in non-obvious ways. It is designed to understand cause and effect across services and layers of the stack.</p><p>At the core is a <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Engine</a> (CRE) that applies domain-specific causal models to real-time telemetry. It maps causes to the symptoms they produce with high precision. It understands how services interact, how constraints emerge, and how changes ripple through the environment. This allows it to pinpoint the cause of service degradations, even in noisy environments where symptoms may be missing or spurious.</p><p>Once the cause is pinpointed, Causely drives resolution at the appropriate layer, whether that’s a runtime adjustment, a configuration change, or a code fix.</p><h2 id="closing-the-loop-with-the-mcp-server"><strong>Closing the Loop with the MCP Server</strong></h2><p>The new MCP Server connects this reasoning engine directly to the developer workflow. Through integrations with MCP-compatible editors like Cursor and Claude, Causely now:</p><ul><li>Generates upstream patches to Terraform, Helm charts, or application code to prevent the same issue from recurring.</li><li>Remediates runtime issues in Kubernetes automatically, including CPU starvation, noisy neighbor interference, and memory exhaustion.</li><li>Delivers these remediations directly into the IDE for review, approval, or refinement, with full causal context.</li></ul><p>This isn't about writing scripts or building brittle rules. Causely analyzes the environment in real time and proposes the correct remediation at the right layer.</p><h3 id="example-slow-database-queries"><strong>Example: Slow Database Queries</strong></h3><p>Let’s say your service slows down due to database query latency. Most tools might point you to the spike. Causely uses its CRE to map causes to the symptoms they produce; for example, a slow-running query may cause elevated latencies across multiple HTTP paths. It combines observed signals with domain-specific causal models to deterministically infer the cause of the observed symptoms. The MCP Server then surfaces the recommended fix directly in your IDE, with full causal context.</p>
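<p>The cause-to-symptom mapping described above can be illustrated with a toy, self-contained sketch of codebook-style root cause inference: each candidate cause is associated with the set of symptoms it is expected to produce, and the inferred cause is the one whose expected symptoms best match what is observed. The cause names and symptom labels below are invented for illustration; this is not Causely's actual engine.</p>

```python
# Toy illustration of codebook-style root cause inference.
# All cause and symptom names here are hypothetical examples.

CODEBOOK = {
    # cause -> symptoms that cause is expected to produce
    "slow_db_query":     {"high_latency_/checkout", "high_latency_/orders", "db_cpu_high"},
    "memory_exhaustion": {"pod_evictions", "oom_kills", "high_latency_/checkout"},
    "dns_degradation":   {"resolution_timeouts", "high_error_rate_/orders"},
}

def infer_root_cause(observed):
    """Return the cause whose expected symptom set is closest to the observed set,
    measured by the size of the symmetric difference (tolerates missing or
    spurious symptoms)."""
    def distance(cause):
        return len(CODEBOOK[cause].symmetric_difference(observed))
    return min(CODEBOOK, key=distance)

observed = {"high_latency_/checkout", "high_latency_/orders", "db_cpu_high"}
print(infer_root_cause(observed))  # prints: slow_db_query
```

<p>Because the match is a nearest-neighbor lookup rather than an exact one, a single missing or extra symptom does not derail the diagnosis, which mirrors the "noisy environments" point above.</p>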
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/nN4Iy5BuC3c?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Solving Slow Database Queries with Causely and its MCP Server">
  </iframe>
</div>
<!--kg-card-end: html-->
<h2 id="remediation-where-engineers-work"><strong>Remediation, Where Engineers Work</strong></h2><p>The MCP Server brings this reasoning and remediation into the tools engineers already use. There’s no jumping between dashboards, terminals, and editors to track an issue from signal to fix. Everything happens in place, with the right context.</p><p>Whether you’re using Cursor, Claude, or any MCP-compatible editor, Causely now provides inline, explainable remediations. Developers can fix what’s broken, whether runtime, config, or code, without switching tools or digging through dashboards.</p><p>Here’s another look at how this works in practice. Causely identifies a CPU resource contention issue that’s degrading multiple services, traces it to a misconfigured Helm value, and proposes the corrected setting, delivered directly into the IDE via the MCP Server.</p>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/L-nWJr4tZ7U?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely's MCP Server Brings Reliability into Your IDE | Helm Chart">
  </iframe>
</div>
<!--kg-card-end: html-->
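<p>For reference, MCP-compatible editors typically register servers through a small JSON configuration entry. The sketch below shows only the general shape of such an entry; the command name, arguments, and environment variable are hypothetical placeholders, not Causely's actual distribution, so consult the Causely documentation linked below for the real values.</p>

```json
{
  "mcpServers": {
    "causely": {
      "command": "causely-mcp",
      "args": ["--endpoint", "https://your-causely-instance.example.com"],
      "env": { "CAUSELY_API_TOKEN": "your-token" }
    }
  }
}
```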
<h2 id="built-for-complex-systems-kubernetes-is-just-the-start"><strong>Built for Complex Systems; Kubernetes Is Just the Start</strong></h2><p>We’ve focused first on Kubernetes because it concentrates so many reliability challenges in one environment: ephemeral workloads, misaligned configurations, and distributed dependencies. But the underlying problems we solve go beyond the cluster.</p><p>Whether the root cause lives in your service mesh, Terraform plan, or application code, Causely’s causal model surfaces it and delivers the fix where it matters. Kubernetes is one environment we operate in. The goal is broader: to make reliability engineered, not reactive, across every layer of modern software delivery.</p><h2 id="see-it-in-action"><strong>See It in Action</strong></h2><p>If you’re attending KubeCon North America, come see it in action at Causely Booth #1661. We’ll show you how Causely detects what’s wrong, explains why it’s happening, and remediates it, right where the fix belongs.</p><p>And if you’re ready to explore more on your own, start here: <a href="https://docs.causely.ai/ask-causely/mcp-server/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>https://docs.causely.ai/ask-causely/mcp-server/</u></a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Pairs Its Causal Reasoning Engine with Gemini for Automated Service Reliability]]></title>
      <link>https://causely.ai/blog/causely-pairs-its-causal-reasoning-engine-with-gemini-for-automated-service-reliability</link>
      <guid>https://causely.ai/blog/causely-pairs-its-causal-reasoning-engine-with-gemini-for-automated-service-reliability</guid>
      <pubDate>Wed, 29 Oct 2025 14:35:38 GMT</pubDate>
      <description><![CDATA[Causely now leverages Google’s Gemini models to enhance how users interact with its Causal Reasoning Engine.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/businesswire-2.png" type="image/png" />
      <content:encoded><![CDATA[<p><strong>SEATTLE--(</strong><a href="https://www.businesswire.com/news/home/20251029667702/en/Causely-Pairs-Its-Causal-Reasoning-Engine-with-Gemini-for-Automated-Service-Reliability?ref=causely-blog.ghost.io" rel="noreferrer"><strong>BUSINESS WIRE</strong></a><strong>)</strong>--<a href="https://causely-blog.ghost.io/causely-raises-8-8m-in-seed-funding-to-deliver-it-industrys-first-causal-ai-platform/" rel="noreferrer">Causely</a>, the only AI SRE using a structured causal graph to enable deterministic automation, now leverages Google’s Gemini models to enhance how users interact with its Causal Reasoning Engine and is available today on Google Cloud Marketplace. Together, Causely’s causal inference and Gemini's language-to-query and summarization capabilities help teams proactively resolve issues and keep SLOs on track.</p><blockquote>“Bringing Causely to Google Cloud Marketplace will help customers quickly deploy, manage, and grow the Causal Reasoning Engine on Google Cloud's trusted, global infrastructure," said Dai Vu, Managing Director, Marketplace &amp; ISV GTM Programs at Google Cloud.</blockquote><p>Causely’s reasoning engine models how distributed systems behave, identifying the cause of reliability risks, their impacts, and the actions required to assure performance. With Google’s Gemini models, Causely automatically generates clear, context-rich explanations and remediation guidance that help teams act with speed and confidence.</p><blockquote>“Our causal engine determines why services experience increased latency and error rates. Google’s Gemini models help communicate what to do next, translating our causal inference into actionable, automated guidance. 
We’re eliminating the need for incident war rooms and enabling proactive reliability,” said Yotam Yemini, CEO of Causely.</blockquote><p><strong><u>Two new features leverage Gemini models within Causely:</u></strong></p><ul><li><strong>Ask Causely:</strong>&nbsp;A chat interface for using Causely's inferencing engine to understand the cause of increased service latency and error rates and the corresponding blast radius.</li><li><strong>Enhanced Root Cause Descriptions</strong>: Causely prompts Gemini models with the metrics, symptoms, events, and logs associated with each root cause, allowing it to summarize the issue and expand Causely’s remediation with detailed, context-specific guidance.</li></ul><p>Causely customers have reported up to 75 percent faster recovery and 25 percent fewer incidents, improving uptime and productivity. While optimized for Google Cloud, Causely remains multi-cloud and model-agnostic, operating across public clouds and on-prem environments and integrating with a range of large language models.</p><h2 id="about-causely"><strong>About Causely</strong></h2><p>Causely&nbsp;is an AI startup dedicated to transforming Site Reliability Engineering through innovative automation, causal reasoning, and developer-centric tools. Their solutions help organizations manage complex distributed systems more efficiently and reliably.</p><h2 id="media-contact">Media Contact</h2><p>Adam LaGreca<br>Founder of 10KMedia<br><a href="mailto:adam@10kmedia.co" rel="nofollow">adam@10kmedia.co</a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How Causely and Google Gemini Are Powering Autonomous Reliability]]></title>
      <link>https://causely.ai/blog/causely-and-google-gemini</link>
      <guid>https://causely.ai/blog/causely-and-google-gemini</guid>
      <pubDate>Tue, 28 Oct 2025 23:51:02 GMT</pubDate>
      <description><![CDATA[Gemini’s ability to interpret natural language, generate structured code, and summarize technical context complements Causely’s deterministic causal inference engine, turning complex telemetry into clear and reliable insights.]]></description>
      <author>Steffen Geißinger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/causely-gemini.png" type="image/png" />
      <content:encoded><![CDATA[<p>As systems scale and interactions multiply, reliability can’t be assured through dashboards and alerts.&nbsp;When hundreds of interdependent services rely on managed components, asynchronous communication, and shared databases, engineers spend valuable hours chasing symptoms because they lack a system that infers causality across dependencies.&nbsp;</p><p>Causely&nbsp;addresses this gap through its <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causal Reasoning Engine</u></a>, which models how dependencies interact in real time and accurately determines the cause of observed service latency and errors. By inferring the cause of performance degradation and understanding the affected dependencies, Causely enables automated actions to assure performance.&nbsp;&nbsp;</p><p>Now, through a new collaboration with Google Gemini, engineering teams can act on those insights faster and more intuitively.&nbsp;</p><h2 id="why-we-started-with-gemini">Why We Started with Gemini &nbsp;</h2><p>Reliability engineering depends on both accuracy and trust. <a href="https://www.infoq.com/articles/causal-reasoning-observability/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>LLMs excel</u></a> at interpreting vast, unstructured data, but without a principled understanding of cause and effect, their outputs are prone to hallucination.&nbsp;&nbsp;</p><p>Causely provides that missing foundation. Its Causal Reasoning Engine models how services, dependencies, and resources interact. These causal models provide deterministic truth about the causes of performance anomalies, and their blast radius. LLMs&nbsp;build on this foundation by translating the results of this causal inference into natural language explanations and action plans that help teams act with confidence.  
The result is a real-time, closed loop between insight and action.&nbsp;</p><p>We chose Gemini because of its contextual reasoning, enterprise-grade security, and deep integration with Google Cloud workloads. Gemini’s ability to interpret natural language, generate structured code, and summarize technical context complements Causely’s deterministic causal inference engine, turning complex telemetry into clear and reliable insights.&nbsp;</p><h2 id="how-causely-uses-gemini-to-enhance-autonomous-reliability">How Causely Uses Gemini to Enhance Autonomous Reliability&nbsp;</h2><p>While Causely remains interoperable with any LLM that a customer wishes to use, we’ve integrated Gemini into Causely for two new features that make interacting with our Causal Reasoning Engine more intuitive and powerful.&nbsp;</p><h3 id="ask-causely">Ask Causely&nbsp;&nbsp;</h3><p><a href="https://www.causely.ai/blog/causely-feature-demo-using-ask-causely-to-transform-incident-response?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Ask Causely</u></a> empowers users to ask complex questions about their environment, check service health, and identify both existing root causes and potential failure points. Ask Causely leverages Gemini’s natural language understanding and generation capabilities to deliver a conversational and seamless experience. 
It uses multiple Gemini models and takes advantage of Gemini’s generative features to provide a white-glove reliability experience.&nbsp;</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/SbpVWUjRyfU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="Causely Feature Demo: Using Ask Causely to Transform Incident Response"></iframe><figcaption><p><span style="white-space: pre-wrap;">Ask Causely leverages Gemini's natural language understanding and generation capabilities.</span></p></figcaption></figure><p>Gemini is integrated into several stages of the Ask Causely pipeline, offering a high degree of flexibility, control, and integration within Causely’s autonomous reliability framework. To ensure timely and accurate results, Causely uses low-latency Gemini models to support data-intensive operations such as log summarization, entity extraction, and contextual signal analysis across diverse telemetry sources.&nbsp;&nbsp;</p><p>Key aspects of the integration include:&nbsp;</p><ul><li><strong>Adaptive Model Selection:</strong> Causely strategically deploys low-latency Gemini models to ensure quick responses while using higher-capability reasoning models to convert Causely’s causal diagnoses into clear, actionable remediations.&nbsp;</li><li><strong>Grounded search for reliable knowledge: </strong>Ask Causely uses Gemini’s grounded search capability to deliver accurate, context-aware remediations based on trusted external sources such as vendor documentation, Stack Overflow, and GitHub.&nbsp;</li><li><strong>Tool calling and code generation for live system intelligence</strong>: Ask Causely uses Gemini’s tool calling and code generation to query live services, interpret telemetry, and surface insights from Causely’s causal engine on 
identified symptoms and root causes.&nbsp;</li><li><strong>Code generation for automation: </strong>Gemini’s code generation enables Causely’s <a href="https://docs.causely.ai/installation/?ref=causely-blog.ghost.io" rel="noreferrer">Code Agents</a> to analyze time series and topology data, generate diagnostic workflows, automate remediations, and perform dynamic analysis during active incidents.&nbsp;</li><li><strong>Entity recognition: </strong>Gemini’s strong entity recognition helps Causely rapidly locate and correlate critical services, nodes, and components within complex environments.&nbsp;</li><li><strong>Embeddings for enterprise grounding: </strong>Gemini embeddings enable Causely to integrate internal documentation and historical incidents to deliver organization-aware, contextually grounded insights.&nbsp;</li><li><strong>Interpretable causal insights: </strong>Gemini translates Causely’s causal signal extraction from unstructured telemetry into clear, human-readable explanations and actionable remediations.&nbsp;&nbsp;</li></ul><h3 id="causal-explanation-and-remediation"><strong>Causal&nbsp;Explanation and Remediation</strong>&nbsp;</h3><p>Causely uses Gemini to generate SLO-aware, application-specific explanations and actionable remediations grounded in verified causal data and analysis. Causely infers the precise cause of observed anomalies and gathers the most relevant logs and events from across the environment to enable automated action. 
Gemini then contextualizes this evidence with Causely’s live <a href="https://docs.causely.ai/reference/terminology/?ref=causely-blog.ghost.io#causality-graph-cg" rel="noreferrer noopener"><u>causal graph</u></a> to produce coherent, human/machine-readable descriptions and remediations that reflect the underlying issue, its operational impact, and suggested remediation steps.&nbsp;</p><p><strong>Supporting features include:</strong>&nbsp;</p><ul><li><strong>Log and event contextualization:</strong> Gemini interprets logs and events selected by Causely’s reasoning engine, connecting raw telemetry to observed symptoms and their SLO implications.&nbsp;</li><li><strong>Causal-grounded remediation actions:</strong> Recommendations are based on observed symptoms and tailored to the affected application or service context.&nbsp;</li><li><strong>Automated post-incident summaries:</strong> Gemini compiles structured summaries that capture causal explanations, operational impact, and applied remediations, ensuring consistency and traceability across incidents.&nbsp;&nbsp;</li></ul><h2 id="open-and-flexible">Open and Flexible&nbsp;</h2><p>We started with Gemini as a foundational proof point, but our platform was built to be multi-cloud and model-agnostic. Causely runs across public clouds and on-prem environments, and we’ll continue to develop integrations with other large language models, giving customers flexibility without lock-in. If you have a particular model and use case in mind, please contact us!&nbsp;</p><h2 id="the-future-from-reactive-to-autonomous-reliability">The Future: From Reactive to Autonomous Reliability&nbsp;</h2><p>Causely and Google Gemini together mark a step toward <a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>autonomous service reliability</u></a>, where systems can understand, explain, and prevent issues before users are affected. 
This shift moves reliability from reactive firefighting to proactive, explainable prevention.&nbsp;&nbsp;</p><h2 id="see-it-in-action">See It in Action&nbsp;</h2><p>Explore the new Gemini-powered capabilities in Causely: </p><ul><li>View Causely on <a href="https://console.cloud.google.com/marketplace/product/causely-public/crp-cna-gcp?hl=en&project=dspopup-austin&ref=causely-blog.ghost.io" rel="noreferrer"><u>Google Cloud Marketplace </u>➜</a></li><li><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer"><u>Request a Demo</u> ➜</a>&nbsp;</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Modern CTO Podcast: The AI SRE Hype and How to Get it Right]]></title>
      <link>https://causely.ai/blog/modern-cto-podcast-ai-sre</link>
      <guid>https://causely.ai/blog/modern-cto-podcast-ai-sre</guid>
      <pubDate>Mon, 27 Oct 2025 16:43:38 GMT</pubDate>
      <description><![CDATA[Modern CTO Podcast's Joel Beasley sits down with Causely CEO Yotam Yemini to dive deep into the world of AI Site Reliability Engineering.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/Screenshot-2025-10-27-at-12.38.27---PM.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>In this episode, Modern CTO Podcast's Joel Beasley sits down with Causely CEO Yotam Yemini to dive deep into the world of AI Site Reliability Engineering. </p><p>They cut through the hype and explore:</p><ul><li>Why AI SREs are gaining so much attention</li><li>How to make AI benefits in operations tangible</li><li>The limitations of language models in SRE work</li></ul><p>Yotam shares insights from his unique journey from psychology to tech, and offers a fresh perspective on building reliable systems in the age of AI. Listen below or <a href="https://youtu.be/_OkvGj4h-ts?si=G99y-h9NnVNZYF-6&ref=causely-blog.ghost.io" rel="noreferrer">watch it on YouTube</a>. </p>
<!--kg-card-begin: html-->
<iframe frameborder="0" height="200" scrolling="no" src="https://playlist.megaphone.fm?e=LDB2451234294" width="100%"></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Zero Downtime at 30,000 RPS: How Quantum Metric Rearchitected with Causely]]></title>
      <link>https://causely.ai/blog/quantum-metric-rearchitected-with-causely</link>
      <guid>https://causely.ai/blog/quantum-metric-rearchitected-with-causely</guid>
      <pubDate>Mon, 27 Oct 2025 15:44:05 GMT</pubDate>
      <description><![CDATA[During a high-risk migration, Causely gave Quantum Metric a new kind of clarity rooted in cause-and-effect across dynamic systems. This helped them improve how they think about managing complexity at scale and move fast without breaking things.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/QM-icon.webp" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Quantum Metric’s digital analytics platform processes millions of requests per second and ingests petabytes of data daily. When their platform team needed to migrate a core Google Kubernetes Engine (GKE) cluster to <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2?ref=causely-blog.ghost.io"><u>Dataplane V2,</u></a> they needed to carefully avoid impacting the stability of critical services powering some of the world’s largest brands.</p><p><strong>The goal: </strong>modernize without compromising performance.&nbsp;</p><p><strong>The constraint: </strong>no in-place upgrade path.</p><p><strong>The outcome: </strong>a seamless migration, with zero regressions.</p><h2 id="the-stakes-migrating-a-backbone-cluster-under-load"><strong>The Stakes: Migrating a Backbone Cluster Under Load</strong></h2><p><a href="https://www.quantummetric.com/?ref=causely-blog.ghost.io" rel="noreferrer">Quantum Metric</a> is the customer-centered digital analytics platform for today’s leading organizations, enabling global brands to make smarter, faster decisions about their digital customer experiences. Their platform ingests and processes petabytes of data each day to help these brands avoid revenue loss and deliver more seamless digital journeys across their web and mobile applications.</p><p>Operating at this scale presents numerous challenges and Quantum Metric’s engineering team is constantly looking for ways to best serve their customers. The team wanted to upgrade to Google Cloud’s Dataplane V2 as a way to address operational challenges related to networking, but the team knew this wouldn’t be a simple migration to execute. The cluster in question hosted some of their highest-throughput services, including internal request handlers and real-time data processing pipelines. 
Several of these services alone handled over 30,000 requests per second.</p><p>With no in-place upgrade available, success meant spinning up a new cluster, migrating services in stages, and deprecating the old one without impacting production.</p><h2 id="stepwise-migration-with-no-room-for-error"><strong>Stepwise Migration with No Room for Error</strong></h2><p>The team adopted a blue-green migration strategy. They started by shifting lower-risk, stateless services with minimal dependencies, which was ideal for early validation. From there, they moved to heavier components with deeper integrations and higher throughput.</p><p>Every step introduced risk. The services being moved were foundational; any missed dependency or regression could create a cascade of downstream failures.</p><h2 id="why-causely-was-essential"><strong>Why Causely Was Essential</strong></h2><p>For Kevin Ard, the Staff Platform Engineer at Quantum Metric who led the migration, confidence throughout the process came from one source: <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a>.</p><blockquote>“Because of Causely, I didn’t need to do any custom telemetry work or worry about if something was off as changes were rolled out. The system made it simple to complete our migration with confidence.”</blockquote><p>Causely continuously builds a live model of service and data flows using lightweight, eBPF-based tracing - no manual instrumentation, dashboard-building or query-writing is required. As Kevin shifted traffic, Causely proactively showed and analyzed real-time changes in service dependencies and system health. 
If something started to show signs of degradation, it showed exactly where and why so that changes could be applied ahead of any major problems.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/2025/10/data-src-image-0d6e9817-f0fc-428d-8eea-fe10f1ea2f78.png" class="kg-image" alt="" loading="lazy" width="1600" height="1600" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/10/data-src-image-0d6e9817-f0fc-428d-8eea-fe10f1ea2f78.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/10/data-src-image-0d6e9817-f0fc-428d-8eea-fe10f1ea2f78.png 1000w, https://causely-blog.ghost.io/content/images/2025/10/data-src-image-0d6e9817-f0fc-428d-8eea-fe10f1ea2f78.png 1600w" sizes="(min-width: 720px) 720px"></figure><p><em>Example: Causely’s service and dataflow graphs show how requests and data move across services—and surface the true root cause directly within that flow.</em></p><p>Rather than stitching together dashboards or relying on intuition, Kevin was able to rely on Causely as a real-time copilot that analyzed cause-and-effect for him as changes were being made.&nbsp;</p><h2 id="the-result-two-weeks-saved-no-fire-drills"><strong>The Result: Two Weeks Saved, No Fire Drills</strong></h2><p>Without Causely, Kevin estimates he would’ve spent two weeks building dashboards, coordinating instrumentation with at least three other engineers, and piecing together a view of the system. Instead, he had a single, always-on copilot keeping an eye on things as traffic shifted.</p><p>“With the sheer volume of telemetry our systems can generate, filtering the noise to focus on causality was a game-changer. 
Knowing that Causely would proactively spot degradations without needing to configure any rules or alerts was icing on the cake.”</p><p>The result:</p><p><strong>✅ Zero regressions</strong></p><p><strong>✅ No performance degradations</strong></p><p><strong>✅ No fire drills</strong></p><p><strong>✅ A smooth cutover of a high-throughput cluster</strong></p><h2 id="making-high-scale-reliability-practical"><strong>Making High-Scale Reliability Practical</strong></h2><p>Even in environments with strong observability, Causely adds something critical: real-time causal reasoning. It understands not just what changed, but why – and it does this automatically without custom dashboards or complex rules.&nbsp;</p><p>At Quantum Metric’s scale, reliability means preventing issues before they become incidents. During a high-risk migration, Causely gave Kevin and his team a new kind of clarity rooted in cause-and-effect across dynamic systems. This helped them improve how they think about managing complexity at scale and move fast without breaking things.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Do You Even Need Kubernetes for Reliable Service Delivery?]]></title>
      <link>https://causely.ai/blog/do-you-even-need-kubernetes</link>
      <guid>https://causely.ai/blog/do-you-even-need-kubernetes</guid>
      <pubDate>Mon, 27 Oct 2025 13:45:00 GMT</pubDate>
      <description><![CDATA[Kubernetes has become the default backbone of cloud native architecture. But does it actually help you ship services more reliably, or is it just more moving parts?]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/Untitled-design--1-.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Originally posted to </em><a href="https://cloudnativenow.com/contributed-content/do-you-even-need-kubernetes-for-reliable-service-delivery/?ref=causely-blog.ghost.io" rel="noreferrer"><em>Cloud Native Now</em></a><em>.</em></p><p>Kubernetes has become the default backbone of cloud native architecture. But does it actually help you ship services more reliably, or is it just more moving parts?</p><p><a href="https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines?ref=causely-blog.ghost.io">Despite Betteridge’s law of headlines</a>, the answer is yes, for the vast majority of companies. Kubernetes has proven essential for reliable service delivery in today’s cloud native world. But not because it magically sprinkles reliability over your stack. Rather, it gives you what you need to build reliability like an engineer, not a magician. Kubernetes is a means, never the end. But let me explain before you&nbsp;<a href="https://www.theverge.com/2023/4/27/23701551/bluesky-skeets-now?ref=causely-blog.ghost.io">skeet</a>&nbsp;this out of context…</p><h2 id="kubernetes-is-not-the-end-but-a-means"><strong>Kubernetes&nbsp;Is&nbsp;Not&nbsp;the&nbsp;End,&nbsp;but&nbsp;a&nbsp;Means</strong></h2><p>Like any technology, Kubernetes is just a building block supporting your goals. Treat it that way. If you adopt it simply because&nbsp;<em>nobody gets fired for choosing Kubernetes</em>, you’re setting yourself up for pain. Kubernetes is powerful and unforgiving — there’s no shortage of&nbsp;<a href="https://thenewstack.io/day-2-kubecon-europe-keynotes-users-share-kubernetes-war-stories/?ref=causely-blog.ghost.io">real‑world lessons</a>&nbsp;and&nbsp;<a href="https://k8s.af/?ref=causely-blog.ghost.io">failure stories</a>&nbsp;from teams led by hype and not purpose.</p><p>Instead of chasing hype, start with the essentials. 
Cut to the core by asking a deceptively simple question:&nbsp;<em>“Who cares?”</em></p><p>Who cares about the benefits —reliability, consistency and automation — that Kubernetes gives you? Teams running services on your platform expect those properties to hold under load and during change. Who cares about that? Your organization, offering those services to users. Users, who have their own goals. Those goals aren’t yours to reach, but you can make the path smooth and successful. When users succeed, they’re delighted — and that’s our end.</p><h2 id="why-delighted-users-matter-to-engineers"><strong>Why&nbsp;Delighted&nbsp;Users&nbsp;Matter&nbsp;to&nbsp;Engineers</strong></h2><p>Keeping that mindset, you might ask: “Why should I, as an engineer, care about delighted users?” You might wonder why you should care at all. Sure, you could go eat pretzels and scroll&nbsp;<a href="http://www.opentelemetry.bayern/?ref=causely-blog.ghost.io">silly websites</a>. But engineers don’t stop there. We look deeper — because by profession, you are a systematic thinker. You solve human needs with science and&nbsp;engineering.</p><p>So instead of just answering the initial question (“Do we even need Kubernetes for reliable service delivery?”), we approached it like engineers. We used a structured method called causal analytics — asking “Who cares?” — until the real goal emerges: Does Kubernetes deliver the level of service reliability that we need to delight our&nbsp;users?</p><p>All that remains is to break it down in your specific context: Who are your users? What is a delightful experience for them? What expectations for reliability do they have?</p><p>I am certain that after answering those questions, 95% of you will conclude that your needs for reliability are not exceptional, and that Kubernetes is the right choice to achieve those goals. 
This should not be surprising, because as with most requirements, there’s a bell curve; most teams need a reasonable amount of reliability, some need a little and a few need the extreme.</p><p>The teams needing extreme reliability are building systems where lives are literally at stake if they fail — aviation, medicine and similar fields.</p><p>For the teams that need less than average reliability, here’s a&nbsp;story:&nbsp;Years ago, I asked the ops team of a world-class soccer club how much reliability mattered for their online store. Their answer surprised me —&nbsp;<em>“Our users are fans!&nbsp;If they can’t get a&nbsp;jersey&nbsp;now,&nbsp;they’ll&nbsp;wait.&nbsp;Reliability&nbsp;isn’t&nbsp;our&nbsp;differentiator — we&nbsp;focus&nbsp;on&nbsp;a&nbsp;unique&nbsp;fan experience.”</em></p><p>Indeed, fans happily wait. Half the fun is in the waiting, which deepens their sense of&nbsp;belonging.</p><p>What I learned from that is simple — it’s perfectly fine to be average in most areas, as long as you know where not to compromise.</p><h2 id="you%E2%80%99re-an-engineer-%E2%80%93-behave-like-one"><strong>You’re&nbsp;an&nbsp;Engineer&nbsp;–&nbsp;Behave&nbsp;Like&nbsp;One!</strong></h2><p>As you might have guessed, the point of this text is not to settle the question of whether you need Kubernetes for reliable service delivery. It is to remind you that you should not only like building and tinkering like an engineer but also&nbsp;enjoy&nbsp;thinking and making decisions like one. It’s your job to know when it is perfectly fine to deliberately pick the boring, established standard and when you need to go against the grain.</p><p>When you make decisions that way, you gain clarity and calmness. You will read yet another blog post from teams ditching Kubernetes and understand that they might be one of the outliers that need something different. 
But because you made the choice to adopt it with your context in mind, you won’t regret it.</p><p>Even if you reach the point where you need to revisit your decision, as new information and context emerge, you’ll approach the problem with the right questions: Given our current context, is moving away from Kubernetes the right thing to do? Or is it preferable to continue using it, even if your enthusiasm has waned?</p><p>Finally, it’s not just about Kubernetes. Every technology you add is a means, not the end, and should be evaluated in your context. This mindset is especially useful for approaching hype cycles. Right now, everyone is trying to solve every problem with LLMs. Next year it will be something else. The cycle never stops.</p><p>But as an engineer, your job isn’t to chase the latest shiny thing. It’s to understand your context, weigh the trade-offs and decide deliberately. Sometimes that means adopting the new; sometimes it means sticking with the boring. The strength lies in knowing the&nbsp;difference.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Brings Causal Intelligence to IBM Instana]]></title>
      <link>https://causely.ai/blog/causely-brings-causal-intelligence-to-ibm-instana</link>
      <guid>https://causely.ai/blog/causely-brings-causal-intelligence-to-ibm-instana</guid>
      <pubDate>Thu, 23 Oct 2025 17:13:29 GMT</pubDate>
      <description><![CDATA[Causely already mediates to OpenTelemetry, Datadog, and Dynatrace to consume traces, metrics, alerts and logs. Today we’re adding IBM Instana Observability to that list.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/instana-blog-graphic0.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Are you tired of chasing alerts? Are you drowning in endless traces, metrics, and dashboards? You don’t have a telemetry problem—you have a triage problem. Your observability tool fills your screen with alerts, anomalies, and dashboards; bridges spin up; people pile into war rooms. The job becomes hunting for the signal instead of fixing the system. Your observability tools turned your operations into a big data problem. It is time to stop this journey to nowhere. There must be a different way.&nbsp;</p><p><a href="https://docs.causely.ai/telemetry-sources/?ref=causely-blog.ghost.io" rel="noreferrer">Causely already mediates</a> to OpenTelemetry, Datadog, and Dynatrace to consume traces, metrics, alerts and logs. Today we’re adding <a href="https://www.ibm.com/products/instana?ref=causely-blog.ghost.io" rel="noreferrer"><strong>IBM Instana Observability</strong></a> to that list. </p><p>This is yet another step in our journey to <a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer">autonomous service reliability</a>: Our <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Engine</a> (CRE) takes the flood of data you already collect and turns it into action by automatically inferring and pinpointing the causes of service degradations, reliability issues, and emerging risks. Furthermore, Causely provides context-rich explanations and remediation guidance—so responders move faster and with confidence.&nbsp;</p><h2 id="why-this-matters">Why This Matters&nbsp;</h2><p>Causely eliminates the need for war rooms. Causely’s reasoning engine analyzes telemetry, traces, logs and alerts from your observability stack—including IBM Instana Observability—minimizing MTTR, protecting SLOs, turning dashboards into decisions, and enabling reliability. 
</p><blockquote>“Our goal is to collapse the war room—use the Instana data you already have to identify true causes and risks, then deliver clear, guided remediation,” says Ben Yemini, Head of Product at Causely.&nbsp;</blockquote><p>Causely’s users stop chasing alerts and act on clear, context-rich explanations and guided remediation. Causely maintains a continuously updated <a href="https://docs.causely.ai/getting-started/how-causely-works/?ref=causely-blog.ghost.io" rel="noreferrer">topology graph</a> of your services, data flows and infrastructure and a <a href="https://docs.causely.ai/reference/terminology/?ref=causely-blog.ghost.io#causality-graph-cg" rel="noreferrer">causality graph</a> that maps causes to symptoms. Using probabilistic inference, CRE pinpoints causes, evaluates blast radius, and prioritizes remediation—so teams converge faster with fewer escalations. Because the engine runs continuously, it also highlights emerging risks before they become incidents.&nbsp;</p><p>As one customer put it:  </p><blockquote>“We don’t have a shortage of telemetry—we’ve been using Instana for years. We’re overwhelmed by data and dashboards. 
Dealing with performance issues is labor-intensive, and it’s been nearly impossible to prevent them.” </blockquote><p>This integration is designed to change that.&nbsp;</p><h2 id="key-features-of-the-ibm-instana-observability-integration">Key Features of the IBM Instana Observability Integration&nbsp;</h2><ul><li><strong>Automatic root cause analysis pinpointing cause and risk in real time.</strong> Causely’s CRE continuously ingests Instana signals, pinpoints the causes of service degradations and failures, and surfaces emerging risks before they become incidents.&nbsp;&nbsp;</li><li><strong>Automatic impact analysis and blast radius mapping.</strong> Causely’s CRE continuously infers the impacted services, endpoints, databases, and SLOs to drive the right prioritization.&nbsp;&nbsp;</li><li><strong>Automatic remediation.</strong> Causely provides context-rich explanations of the root causes, but more importantly can automatically remediate the cause by providing the specific fix at runtime, configuration, or code level.&nbsp;&nbsp;</li><li><strong>Fast, low-lift enablement.</strong> Connect Instana via API and start diagnosing in minutes without changes to your services.&nbsp;</li></ul><p>The result:&nbsp;&nbsp;</p><ul><li>Continuously assuring service reliability&nbsp;</li><li>Minimizing MTTR&nbsp;</li><li>Improving operational efficiencies&nbsp;&nbsp;</li><li>Increasing productivity&nbsp;</li></ul><h2 id="learn-more">Learn More</h2><p>To learn more about using Causely with IBM Instana Observability, see the <a href="https://docs.causely.ai/telemetry-sources/instana/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>integration guide</u></a>.&nbsp;</p>
    </item>
  
    <item>
      <title><![CDATA[Thank You, Grafana: How Beyla Helped Us, and How You Can Use it Too!]]></title>
      <link>https://causely.ai/blog/thank-you-grafana-beyla-how-to</link>
      <guid>https://causely.ai/blog/thank-you-grafana-beyla-how-to</guid>
      <pubDate>Tue, 07 Oct 2025 22:59:32 GMT</pubDate>
      <description><![CDATA[By open-sourcing eBPF-based auto-instrumentation and then donating it as an OpenTelemetry BPF Instrumentation (OBI) project, Grafana didn’t just release code, they lowered the onramp for observability.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/causely-loves-beyla2.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>This post is the first in what we hope will become a series where we pause to say thanks to the projects and communities that move us forward. Today it’s <a href="https://grafana.com/oss/beyla-ebpf/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Beyla</u></a>’s turn. By open-sourcing eBPF-based auto-instrumentation and then <a href="https://github.com/open-telemetry/community/issues/2406?ref=causely-blog.ghost.io" rel="noreferrer">donating</a> it as an OpenTelemetry BPF Instrumentation (<a href="https://opentelemetry.io/docs/zero-code/obi/?ref=causely-blog.ghost.io" rel="noreferrer">OBI</a>) project, <a href="https://grafana.com/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Grafana Labs</u></a> didn’t just release code, they lowered the onramp for observability. </p><p>That matters to us because Causely turns your observability data into distilled insights to reason over cause and effect to put teams in control. The faster you get baseline, trustworthy signals, the faster we can do that work.&nbsp;</p><h2 id="why-beyla-helps">Why Beyla Helps&nbsp;</h2><p>Teams come to us in two states. Some already have a healthy OpenTelemetry footprint; for them, we plug Causely in and get to causal reasoning quickly. Others are still early: traces are patchy, metrics are inconsistent, logs are everywhere. Beyla makes that second state less painful. You turn it on and a picture forms: service boundaries, dependency maps, request paths. Suddenly they’re no longer guessing. 
Beyla creates a consistent baseline of telemetry they can trust, and our engine can begin attributing symptoms to specific causes with confidence.&nbsp;&nbsp;</p><p>The effect shows up in the first week: fewer blind spots, clearer causal analytics, and the ability to move from “seeing an error” to “knowing why it happened and what to change.” Beyla helps create the raw material; Causely turns it into decisions.&nbsp;</p><h2 id="what-the-obi-donation-signals">What the OBI Donation Signals&nbsp;</h2><p>Donating Beyla to the OTel ecosystem signals a commitment to standards and longevity. It’s a move toward shared building blocks rather than one-off integrations. We value that because our promise — <em>autonomous reliability without drama</em> — depends on predictable inputs and open interfaces. The more OTel wins, the healthier the whole stack becomes.&nbsp;</p><h2 id="how-this-looks-with-causely">How This Looks With Causely&nbsp;</h2><p>Our agents roll out Beyla by default. Setup is minimal: deploy Beyla, establish a baseline of traces/metrics, and immediately begin causal reasoning. Beyla provides out-of-the-box visibility; Causely layers causal inference, risk scoring, and proposed actions. </p><p>You get:&nbsp;</p><ul><li><strong>A consistent baseline</strong> of traces/metrics powered by eBPF auto-instrumentation.&nbsp;</li><li><strong>High-confidence root cause detection</strong> that doesn’t drown you in correlations.&nbsp;</li><li><strong>Guardrails that respect zero-trust boundaries </strong>— you keep your data; Causely reasons over signals, not secrets.&nbsp;</li></ul><p>It adds up to a simple promise: </p><blockquote>Faster time to control, not just faster dashboards.&nbsp;&nbsp;</blockquote><h2 id="try-beyla-on-your-own">Try Beyla on Your Own&nbsp;</h2><p>If you’d like to see what we’re talking about, the easiest path is to stand up a tiny environment locally and watch Beyla fill in the picture. 
The steps below take you from a single service to a small conversation between services, and then into traces and metrics you can actually explore.&nbsp;</p><h3 id="single-service-beyla-docker">Single Service + Beyla (Docker)&nbsp;</h3><p>Pick a simple HTTP service; the example below instruments a demo app listening on port 5678 and prints spans to the console — no OTLP required to get started.&nbsp;</p><pre><code class="language-shell"># Terminal 1 — start your app
docker run --rm --name demo -p 5678:5678 golang:1.23 go run github.com/hashicorp/http-echo@latest -text=hello</code></pre><pre><code class="language-shell"># Terminal 2 — run Beyla next to it (console output only)
docker run --rm \
  --name beyla \
  --privileged \
  --pid="container:demo" \
  -e BEYLA_OPEN_PORT=5678 \
  -e BEYLA_TRACE_PRINTER=text \
  grafana/beyla:latest</code></pre><p>Open the app in your browser (http://localhost:5678), click around to generate traffic, and watch spans print in your Beyla terminal. Keep a short snippet of that output handy; it is a useful reference for what healthy, auto-instrumented traffic looks like.&nbsp;</p><p>If you want to send data to an OpenTelemetry Collector, Tempo, or Jaeger, add the following to the Beyla container:&nbsp;</p><pre><code class="language-shell">  -e OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318</code></pre><h3 id="a-small-web-of-services-docker-compose">A Small Web of Services (Docker Compose)&nbsp;</h3><p>To make this concrete, the snippet below starts a frontend and a backend service, runs Beyla against the frontend, and wires up local backends (Jaeger ingesting traces over OTLP, and Prometheus scraping Beyla’s metrics).&nbsp;</p><pre><code class="language-yaml">services:
  frontend:
    image: golang:1.23
    command: ["go", "run", "github.com/hashicorp/http-echo@latest", "-listen=:5678", "-text=hello"]
    ports:
      - "5678:5678"
    environment:
      - BACKEND_URL=http://backend:9090

  backend:
    image: ealen/echo-server:latest
    environment:
      - PORT=9090
    expose:
      - "9090"

  beyla:
    image: grafana/beyla:latest
    privileged: true
    pid: "service:frontend"
    environment:
      - BEYLA_OPEN_PORT=5678
      - OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
      - BEYLA_PROMETHEUS_PORT=8999
      - BEYLA_TRACE_PRINTER= # empty disables the console printer used earlier
    depends_on:
      - frontend
      - jaeger

  jaeger:
    image: cr.jaegertracing.io/jaegertracing/jaeger:2.10.0
    ports:
      - "16686:16686" # Jaeger UI
      - "4318:4318"   # OTLP/HTTP ingest (native in v2)

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"</code></pre><p>Create the Prometheus config next to your compose file:&nbsp;</p><p>prometheus.yml&nbsp;</p><pre><code class="language-yaml">global:
  scrape_interval: 5s
scrape_configs:
  - job_name: "beyla"
    static_configs:
      - targets: ["beyla:8999"]</code></pre><p>Bring it up:&nbsp;</p><pre><code class="language-shell">docker compose up -d</code></pre><p>Hit the frontend to generate traffic:&nbsp;</p><pre><code class="language-shell">curl -s http://localhost:5678/ | head -n1</code></pre><p>When the stack is up, open the Jaeger UI at <code>http://localhost:16686</code> and search for the frontend service to browse traces. For metrics, visit Prometheus at <code>http://localhost:9090</code> and try queries like <code>http_server_request_duration_seconds_count</code> or <code>http_client_request_duration_seconds_count</code> to see call patterns emerge.</p>
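<p>Counts show that traffic is flowing; quantiles show where the time goes. As a sketch, the query below computes a per-service p95 you can paste into the Prometheus UI. The metric and label names assume Beyla’s default Prometheus export and may differ between versions:</p><pre><code class="language-promql"># Approximate p95 server-side latency per service over the last minute.
# Assumes Beyla's default histogram metric; adjust names if your version differs.
histogram_quantile(
  0.95,
  sum by (le, service_name) (
    rate(http_server_request_duration_seconds_bucket[1m])
  )
)</code></pre><p>Swapping in the <code>http_client_request_duration_seconds_bucket</code> series gives the client-side view, which is handy for comparing where latency is added.</p>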
<h3 id="what-to-look-for">What to Look for&nbsp;</h3><p>Generate a little pressure (for example, <code>hey -z 30s http://localhost:5678/</code>). In Jaeger, follow a slow trace end-to-end and note where p95 shifts between the frontend and the backend call. In Prometheus, line up the client and server RED metrics to see where latency actually accumulates — it’s a simple way to separate symptoms from causes.</p>
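<p>If you prefer to keep the load generator in Compose as well, a throwaway service works. The snippet below is illustrative: the <code>williamyeh/hey</code> image is an assumption, and any container that can run <code>hey</code> or a curl loop will do:</p><pre><code class="language-yaml">  # Hypothetical addition to the compose file above: a one-shot load generator.
  loadgen:
    image: williamyeh/hey:latest
    command: ["-z", "30s", "http://frontend:5678/"]
    depends_on:
      - frontend</code></pre><p>Start it with <code>docker compose up loadgen</code> and watch the traces and metrics fill in while it runs.</p>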
<h3 id="where-to-read-more">Where to Read More&nbsp;</h3><p>If you want the exact steps, the canonical source is the <a href="https://grafana.com/docs/beyla/latest/?ref=causely-blog.ghost.io" rel="noreferrer">Beyla documentation</a> and the <a href="https://opentelemetry.io/docs/zero-code/obi/?ref=causely-blog.ghost.io" rel="noreferrer">OBI pages in OpenTelemetry</a>. </p><p><a href="https://docs.causely.ai/telemetry-sources/grafana/?ref=causely-blog.ghost.io" rel="noreferrer">Our own docs</a> show how Beyla and the Causely agents fit together in a few minutes of setup.&nbsp;</p><h2 id="closing-the-loop">Closing the Loop&nbsp;</h2><p>Beyla is the right kind of infrastructure: minimal friction, maximal signal, and donated to the place where open standards live. </p><p>If you’re ready to move from “seeing it” to “controlling it,” we’d be happy to show how Causely turns that signal into confident action.&nbsp;</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer"><em>Ready when you are</em></a><em>.</em>&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How a Leading Gaming Platform Prevented Revenue Loss Using Causely in Grafana]]></title>
      <link>https://causely.ai/blog/gaming-platform-prevented-revenue-loss-using-causely-in-grafana</link>
      <guid>https://causely.ai/blog/gaming-platform-prevented-revenue-loss-using-causely-in-grafana</guid>
      <pubDate>Tue, 30 Sep 2025 16:26:07 GMT</pubDate>
      <description><![CDATA[With Causely + Grafana, the gaming platform can spot reliability risks early, take the right action, and avoid revenue-impacting incidents before users even notice.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/amit-lahav-6I-HWjwn-hk-unsplash.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>In live sports betting, every second counts. That’s why a major interactive gaming company brought Causely’s AI SRE into Grafana. With Causely, they can now spot reliability risks early, take the right action, and avoid revenue-impacting incidents before users even notice.&nbsp;</p><h2 id="the-stakes-real-time-bets-real-time-risk"><strong>The Stakes: Real-Time Bets, Real-Time Risk</strong>&nbsp;</h2><p>For this industry-leading platform, performance is crucial. During live events, hundreds of thousands of users rely on the app to place bets in real time. If a service slows down, bets fail. If latency spikes, revenue is lost. There’s no buffer for outages, and no time for guesswork.&nbsp;</p><p>Many of the platform’s highest-traffic moments, such as marquee games, playoff weekends, or last-minute betting surges, occur on weekends or late at night. This adds strain on a small team that's already stretched thin, often pulling engineers into incident response during off hours, when the toll on focus, sleep, and morale is highest.&nbsp;</p><p>The company had modern observability tools in place, including Grafana dashboards, Prometheus metrics, and alerting pipelines. But when something went wrong, their SREs still had to hunt for the cause across a vast array of charts and logs. They needed something that could connect the dots for them, quickly and accurately.&nbsp;</p><h2 id="the-challenge-observability-without-clarity"><strong>The Challenge: Observability Without Clarity</strong>&nbsp;</h2><p>Like many teams operating large-scale microservices architectures, they had invested in and standardized on an observability platform that provides lots of visibility. What the team lacked was the understanding needed to make rapid decisions. They could see when something was breaking, but not why, or what to fix first.&nbsp;</p><p>The complexity came not just from the number of services, but from the nature of the system itself. 
Services communicated asynchronously over messaging queues, wrote to distributed databases, and scaled dynamically based on real-time load. These components operated independently yet were deeply interdependent. During peak sporting events, traffic could spike dramatically in seconds. This created unpredictable cascades of latency, retries, and congestion.&nbsp;</p><p>Under these conditions, their observability tools struggled to identify the real cause of issues. Dashboards showed metric anomalies. Alerts fired across disconnected tools. However, understanding what was actually happening required engineers to mentally reconstruct a constantly shifting topology and then guess at the most likely root cause based on limited data.&nbsp;</p><p>Each incident became a manual, high-stakes investigation. Teams would dig through logs and traces, cross-reference alerts, and escalate issues across multiple teams to determine where to start. That meant long triage loops, missed SLOs, and far too much time reacting instead of preventing.&nbsp;</p><p>What they needed was a system that could keep up with the complexity, something that could analyze the real-time behavior of the entire environment, understand how changes ripple across services, and surface what matters before user impact.</p><h2 id="the-solution-causal-reasoning-embedded-in-grafana"><strong>The Solution: Causal Reasoning Embedded in Grafana</strong>&nbsp;</h2><p>To close this gap, the team rolled out Causely’s <a href="https://www.causely.ai/blog/from-dashboards-to-decisions-introducing-the-causely-plugin-for-grafana?ref=causely-blog.ghost.io" rel="noreferrer">Grafana plugin</a>. It gave their SREs a new layer of intelligence, right inside the dashboards they already use.&nbsp;</p><p>Causely continuously consumes telemetry from their existing observability tools and applies causal reasoning to identify what matters. 
It answers three essential questions:&nbsp;</p><blockquote>Where is the problem?<br>What is it?<br>And why did it happen?&nbsp;</blockquote><p>With that context, <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely enables remediation</a>, either by guiding engineers to the right action or automating it through AI.&nbsp;</p><p>In this case, when service latency began trending in the wrong direction, Causely didn’t just trigger an alert. It identified the actual cause, explained the potential <a href="https://www.causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues?ref=causely-blog.ghost.io" rel="noreferrer">blast radius</a>, and gave the team the confidence to act quickly, without second-guessing or escalating the situation.&nbsp;</p><p><a href="https://www.causely.ai/blog/causely-feature-demo-unlock-root-cause-analysis-in-grafana?ref=causely-blog.ghost.io" rel="noreferrer">Watch the demo</a>:</p>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/D6Ps1VoGHvw?si=qtH0THeJXzj0WmWw&amp;rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely root cause analysis in Grafana">
  </iframe>
</div>
<!--kg-card-end: html-->
<h2 id="the-outcome-reliability-under-constant-change"><strong>The Outcome: Reliability Under Constant Change&nbsp;</strong>&nbsp;</h2><p>During a major live event, the platform began showing early signs of degradation. But instead of spinning up a war room, the SRE team used Causely to trace the issue to a misconfigured connection pool in an upstream service, before end users were ever impacted.&nbsp;</p><p>They resolved the issue fast, stayed within SLOs, and avoided revenue loss during one of their busiest betting windows. There was no guesswork, no back-and-forth, and no missed opportunities.&nbsp;&nbsp;</p><h2 id="see-it-at-observabilitycon"><strong>See It at ObservabilityCon</strong>&nbsp;</h2><p>Causely will be at Grafana’s <a href="https://grafana.com/events/observabilitycon/?ref=causely-blog.ghost.io" rel="noreferrer">ObservabilityCon</a>, showing how causal reasoning is helping engineering teams stay ahead of incidents—and focus on building, not firefighting.&nbsp;</p><p>Want a 1:1 walkthrough before the event?&nbsp;</p><p>👉 <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">Book time with our team</a>. &nbsp;</p><h3 id="causely-turns-observability-into-action">Causely turns observability into action.&nbsp;</h3><p>It brings clarity, speed, and confidence to every incident—right from your existing Grafana workflows.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: From Root Cause to Business Impact with Causely and ClickStack by ClickHouse]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-clickstack</link>
      <guid>https://causely.ai/blog/causely-feature-demo-clickstack</guid>
      <pubDate>Tue, 23 Sep 2025 02:17:59 GMT</pubDate>
      <description><![CDATA[See how Causely and ClickStack by ClickHouse help teams fix failures confidently and address their real-world impact.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/Screenshot-2025-09-22-at-10.06.17---PM.png" type="image/jpeg" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/sk_KmMOF1lE?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely ClickStack by ClickHouse integration">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>When a system fails, identifying and resolving the technical root cause is what brings services back to health. But the story doesn't end there. To fully recover, teams also need to understand the impact outside their systems: which customers were affected, how many transactions failed, and what actions to take next. Traditional observability tools stop at symptoms, leaving engineers without a clear view from failure to business impact.<br><br><a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a> changes that. By continuously reasoning over your telemetry, Causely identifies the true root cause in real time and delivers it straight into Slack — no guesswork required. From there, engineers can drill into the Causely UI to validate technical details and confirm the non-technical context behind the failure.<br><br>Combined with <a href="https://clickhouse.com/use-cases/observability?ref=causely-blog.ghost.io" rel="noreferrer">ClickStack</a> by ClickHouse, teams can move beyond the technical fix to quantify and remediate the wider impact. Starting from the root cause Causely highlights, ClickStack makes it easy to pivot directly to the affected users, measure the scope of the issue, and take recovery actions like targeted win-back campaigns. Together, they close the loop: Causely restores system health, while ClickStack ensures customer trust is recovered just as quickly.<br><br>Watch the video to see how Causely and ClickStack by ClickHouse help you fix failures confidently — and address their real-world impact.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Accelerate Incident Response with Causely + incident.io]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-incident-io</link>
      <guid>https://causely.ai/blog/causely-feature-demo-incident-io</guid>
      <pubDate>Mon, 22 Sep 2025 16:56:38 GMT</pubDate>
      <description><![CDATA[By combining Causely’s causal reasoning engine with incident.io’s powerful automation platform, engineering teams can identify the true root cause of incidents faster and respond with greater focus.]]></description>
      <author>Anson McCook</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/Screenshot-2025-09-22-at-12.46.48---PM.png" type="image/jpeg" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/p07c2gy3baM?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely incident.io integration">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>Incidents create noise, pull in too many teams, and waste valuable engineering time chasing symptoms instead of causes. </p><p>That’s why we’re excited to share <a href="https://www.causely.ai/blog/causely-integrates-with-incident-io?ref=causely-blog.ghost.io" rel="noreferrer">our new integration with&nbsp;incident.io</a>. By combining Causely’s <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">causal reasoning engine</a> with&nbsp;<a href="http://incident.io/?ref=causely-blog.ghost.io" rel="noopener noreferrer">incident.io</a>’s powerful automation platform, engineering teams can identify the true root cause of incidents faster and respond with greater focus. </p><p>The result: fewer people distracted, faster MTTR, and incidents contained before they spread.</p><p>In this short demo, you’ll see how Causely analyzes incoming alerts in real time, maps them onto live dependency graphs, and pinpoints the single root cause behind cascading failures.&nbsp;incident.io&nbsp;then orchestrates the response, routing the incident to the right team with the right context from the start. Together, Causely and&nbsp;incident.io&nbsp;replace reactive firefighting with clarity, precision, and automation so your teams can spend less time triaging and more time building.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Now Integrates With incident.io to Accelerate Incident Response]]></title>
      <link>https://causely.ai/blog/causely-integrates-with-incident-io</link>
      <guid>https://causely.ai/blog/causely-integrates-with-incident-io</guid>
      <pubDate>Mon, 22 Sep 2025 10:24:45 GMT</pubDate>
      <description><![CDATA[By combining Causely’s causal reasoning engine with incident.io, engineering teams with complex microservices environments can go from incident to resolution much faster.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/incident-io-blog-graphic0.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>At Causely, our mission is to help engineers deliver services reliably and reclaim the hours lost to reactive troubleshooting. Achieving this goal requires seamless integration with the tools and processes teams already rely on to manage incidents.</p><p>That’s why we’re excited to share how Causely now integrates with <a href="https://incident.io/?ref=causely-blog.ghost.io" rel="noreferrer">incident.io</a>. By combining Causely’s <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">causal reasoning engine</a> with incident.io, engineering teams with complex microservices environments can go from incident to resolution much faster.</p><h2 id="why-this-matters"><strong>Why This Matters</strong></h2><p>When an incident begins, the first question everyone asks is, <em>“Why?”&nbsp;</em></p><p>In modern microservices environments, answering that causality question is hard. Doing it programmatically, in a way that enables automation, can feel nearly impossible. Dependencies and data flows shift unexpectedly, and latency and errors ripple outward in ways that are difficult to untangle. Without an understanding of causal relationships and emergent behaviors across services, multiple service owners often get pulled into noisy investigations and waste time chasing downstream effects instead of the actual cause.</p><p>We love incident.io as an automation platform because it pairs a great user experience with the structure and automation that incident response needs. For customers using incident.io in large and dynamic environments, Causely adds critical intelligence by applying causal reasoning directly to live infrastructure, services, and data flows to infer the single cause behind a flood of alerts. 
This ensures the right service owner is reached with the right context from the start.</p><p><strong>The result:</strong> faster resolution, fewer people distracted, and incidents contained before they spread.</p><h2 id="how-it-works"><strong>How It Works</strong></h2><ol><li><strong>Alerts flow into incident.io</strong> from multiple sources such as Alertmanager, cloud platforms, and observability tools.</li><li><strong>Causely analyzes those alerts in real time</strong>, mapping them onto live dependency maps of services, infrastructure, and data flows it discovers. It applies causal inference to identify the cause of the anomalies. This requires no complex rules, policies, or prior training, and scales naturally to large, dynamic microservices environments.</li><li><strong>incident.io orchestrates the response</strong> with this context: routing incidents to the right owners, engaging on-call engineers, opening Slack channels, automating where possible, and keeping stakeholders informed.</li></ol><p>For large engineering organizations with hundreds or thousands of dependencies, the value is clear. 
Instead of multiple teams chasing scattered symptoms, the integration ensures the right team gets the right context immediately, and impacted teams get the fastest path to innocence.</p><h2 id="the-joint-value"><strong>The Joint Value</strong></h2><p>By combining causal inference with automated orchestration, teams get:</p><ul><li><strong>Clarity without chaos</strong>: Causely distinguishes cause from effect across live dependency maps, so only the right team is engaged.</li><li><strong>Seamless execution</strong>: incident.io automates the response, removing friction from every operational step.</li><li><strong>Aligned productivity</strong>: The integration prevents cross-team thrash, reduces MTTR, and frees engineers to stay focused on building.</li></ul><p>Together, Causely and incident.io replace reactive firefighting with precise, dependency-aware incident response and automation.</p><h2 id="see-it-in-action"><strong>See It in Action</strong></h2><p>We’ll be showcasing this integration at <strong>incident.io’s </strong><a href="https://www.sev0.com/?ref=causely-blog.ghost.io" rel="noreferrer"><strong>Sev0 Conference</strong></a><strong> in San Francisco on September 23rd</strong>. Stop by the Causely booth to see how our joint solution can work for you, or request a free consultation <a href="https://www.causely.ai/campaigns/sev0-causely?ref=causely-blog.ghost.io" rel="noreferrer">here</a>. <br></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Cost of Confusing SRE, DevOps, and Platform Engineering]]></title>
      <link>https://causely.ai/blog/sre-devops-and-platform-engineering</link>
      <guid>https://causely.ai/blog/sre-devops-and-platform-engineering</guid>
      <pubDate>Thu, 18 Sep 2025 02:37:00 GMT</pubDate>
      <description><![CDATA[Confusing SRE, DevOps, and Platform Engineering may work at 20 engineers, but at 200 it creates chaos. Here’s why the distinctions matter and how to scale them effectively.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/pexels-diva-26885601-1-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Few terms in software get misused more than DevOps, SRE, and Platform Engineering. Too often they’re treated as interchangeable labels, or worse, slapped on job titles without clear intent. The result? Confused teams, duplicated work, and brittle systems held together by heroics.&nbsp;&nbsp;</p><p>These aren’t interchangeable hats. They’re disciplines with fundamentally different goals. Blurring them may help you survive at 20 engineers, but at 200 it will strangle velocity and reliability. Even Netflix and Spotify, the poster children for speed at scale, had to abandon blended roles once the stakes got high.&nbsp;&nbsp;</p><p>This article cuts through the noise: what these roles really are, where they overlap, and why betting on “one-size-fits-all ops” is a costly mistake.&nbsp;&nbsp;&nbsp;</p><h2 id="the-three-disciplines"><strong>The Three Disciplines</strong>&nbsp;</h2><h3 id="devops-a-cultural-foundation-not-a-job-title"><strong>DevOps: A Cultural Foundation, Not a Job Title</strong>&nbsp;</h3><p>DevOps is more than just a role: It is a mindset. Born to tear down the wall between dev and ops, DevOps emphasizes shared responsibility for software delivery.&nbsp;</p><ul><li><strong>Goal:</strong> Deliver software quickly, safely, and consistently&nbsp;</li><li><strong>Focus Areas:</strong> Collaboration, automation, CI/CD&nbsp;</li><li><strong>Common Activities:</strong> Pipelines, IaC, release orchestration&nbsp;</li><li><strong>Metric:</strong> Deployment frequency and lead time for changes&nbsp;&nbsp;</li></ul><p><strong>Where companies go wrong: </strong>hiring “DevOps engineers” as if DevOps were just another box on the org chart. 
Rather than existing as a role or even a team, DevOps should be considered the baseline culture for modern software delivery.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/09/devops.webp" class="kg-image" alt="" loading="lazy" width="1536" height="864" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/09/devops.webp 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/09/devops.webp 1000w, https://causely-blog.ghost.io/content/images/2025/09/devops.webp 1536w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The eight phases of DevOps</span></figcaption></figure><h3 id="site-reliability-engineering-sre-reliability-as-a-feature"><strong>Site Reliability Engineering (SRE): Reliability as a Feature</strong>&nbsp;</h3><p>SRE, <a href="https://sre.google/20/?ref=causely-blog.ghost.io" rel="noreferrer">created at Google</a>, applies engineering discipline to operations. It treats reliability not as an afterthought but as a product feature with measurable outcomes.&nbsp;</p><ul><li><strong>Goal:</strong> Keep systems reliable, scalable, and performant&nbsp;</li><li><strong>Focus Areas:</strong> SLOs, error budgets, capacity planning, incident response&nbsp;</li><li><strong>Common Activities:</strong> Defining SLIs/SLOs, automating toil, postmortems&nbsp;</li><li><strong>Metrics:</strong> Error rates, availability, latency&nbsp;</li></ul><p>Where DevOps creates shared accountability, SRE enforces it with data and rigor. 
Reliability becomes something you can track, budget for, and improve.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/09/SRE-hierarchy.jpg" class="kg-image" alt="" loading="lazy" width="900" height="779" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/09/SRE-hierarchy.jpg 600w, https://causely-blog.ghost.io/content/images/2025/09/SRE-hierarchy.jpg 900w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The SRE hierarchy, as defined by </span><a href="https://sre.google/sre-book/part-III-practices/?ref=causely-blog.ghost.io" rel="noreferrer"><span style="white-space: pre-wrap;">Google</span></a></figcaption></figure><h3 id="platform-engineering-scaling-without-chaos"><strong>Platform Engineering: Scaling Without Chaos</strong>&nbsp;</h3><p>Platform Engineering is the newest of the three, and it is rapidly becoming essential. Its job is to build internal products for developers, standardized pipelines, self-service infrastructure, and golden paths that reduce cognitive load.&nbsp;</p><ul><li><strong>Goal:</strong> Improve developer productivity and consistency&nbsp;</li><li><strong>Focus Areas:</strong> Internal developer platforms (IDPs), golden paths, service catalogs&nbsp;</li><li><strong>Common Activities:</strong> CI/CD frameworks, self-service infra, observability integrations&nbsp;</li><li><strong>Metrics:</strong> Developer satisfaction, time-to-value for new features&nbsp;&nbsp;</li></ul><p>Where SRE enforces reliability, Platform Engineering makes speed sustainable by removing friction and standardizing how teams build and ship.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/09/diagram-of-platform-engineering.png" class="kg-image" alt="" loading="lazy" width="1360" height="766" 
srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/09/diagram-of-platform-engineering.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/09/diagram-of-platform-engineering.png 1000w, https://causely-blog.ghost.io/content/images/2025/09/diagram-of-platform-engineering.png 1360w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Platform Engineering, according to </span><a href="https://www.gartner.com/en/infrastructure-and-it-operations-leaders/topics/platform-engineering?ref=causely-blog.ghost.io" rel="noreferrer"><span style="white-space: pre-wrap;">Gartner</span></a></figcaption></figure><h2 id="how-they-interrelate"><strong>How They Interrelate</strong>&nbsp;</h2><p>These disciplines do not compete, they reinforce one another:&nbsp;</p><ul><li><strong>DevOps</strong> creates the culture of shared responsibility.&nbsp;</li><li><strong>SRE</strong> makes reliability measurable and actionable.&nbsp;</li><li><strong>Platform Engineering</strong> productizes infrastructure so both can scale.&nbsp;</li></ul><p>Together, they form a virtuous cycle: culture drives collaboration, reliability ensures quality, and platforms remove friction.&nbsp;</p><h2 id="why-the-lines-blur"><strong>Why The Lines Blur</strong>&nbsp;</h2><p>In small companies, blending roles is inevitable. One engineer might build the pipeline, run on-call, and hack together infra automation. That works when survival depends on speed.&nbsp;</p><p>But the trade-offs are real: constant context switching, shallow specialization, and bottlenecks that emerge as systems grow. 
What looks like efficiency at 20 engineers becomes fragility at 200.&nbsp;</p><p>Common blends include:&nbsp;</p><ul><li><strong>DevOps + SRE:</strong> one ops-focused engineer juggling deployments and reliability.&nbsp;</li><li><strong>SRE + Platform:</strong> reliability engineers building tooling to reduce toil.&nbsp;</li><li><strong>DevOps + Platform:</strong> CI/CD pipelines evolving into internal platforms.&nbsp;</li></ul><p>These hybrids buy time early, but they do not scale.&nbsp;</p><h2 id="the-growth-path-from-blended-roles-to-specialization"><strong>The Growth Path: From Blended Roles to Specialization</strong>&nbsp;</h2><p>As headcount and complexity rise, most organizations follow a predictable arc:&nbsp;</p><ul><li><strong>Early Stage (1–20 engineers):</strong> Generalists do everything. Speed over rigor.&nbsp;</li><li><strong>Growth (20–100 engineers):</strong> “DevOps” teams emerge, basic SRE practices are introduced.&nbsp;</li><li><strong>Scaling (100–500 engineers):</strong> Dedicated SRE and Platform teams form to manage reliability and developer experience.&nbsp;</li><li><strong>Enterprise (500+ engineers):</strong> Clear ownership is established across disciplines, with DevOps principles embedded everywhere.&nbsp;</li></ul><p>Ignore this progression and you will pay the price, either in velocity lost to chaos or outages caused by brittle systems.&nbsp;</p><h2 id="lessons-from-netflix-and-spotify"><strong>Lessons from Netflix and Spotify</strong>&nbsp;</h2><p>Even the best had to evolve.&nbsp;</p><ul><li><strong>Netflix:</strong> Started with infra-savvy generalists. They embraced end-to-end DevOps culture, pioneered Chaos Monkey, and eventually built platforms like Spinnaker (CI/CD) and Titus (containers). Today, reliability is a shared mandate, supported by platform teams with a product mindset.&nbsp;</li><li><strong>Spotify:</strong> Grew fast with generalist ops, then adopted squad-based DevOps. 
Over time, they created dedicated reliability teams and built Backstage (later open-sourced) to tame service sprawl. Platform and SRE teams now enable hundreds of squads to ship quickly without burning out.&nbsp;&nbsp;</li></ul><p>Both prove the same point: blended roles help in the sprint, but specialization wins the marathon.&nbsp;</p><h2 id="conclusion-one-size-fits-all-ops-is-a-trap"><strong>Conclusion: One-Size-Fits-All Ops is a Trap</strong>&nbsp;</h2><p>SRE, DevOps, and Platform Engineering are not buzzwords or interchangeable hats. They are complementary disciplines that companies must deliberately balance as they scale.&nbsp;</p><p>The playbook is clear:&nbsp;</p><ol><li><strong>Start with DevOps</strong> as cultural glue.&nbsp;</li><li><strong>Invest in SRE</strong> to treat reliability as a product feature.&nbsp;</li><li><strong>Build platforms</strong> once scale demands it.&nbsp;&nbsp;</li></ol><p>Done right, this progression does not just prevent outages. It transforms operational chaos into a lasting competitive advantage.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Reflections on APMdigest’s Observability Series — and Where We Go Next]]></title>
      <link>https://causely.ai/blog/reflections-on-apmdigests-observability-series-and-where-we-go-next</link>
      <guid>https://causely.ai/blog/reflections-on-apmdigests-observability-series-and-where-we-go-next</guid>
      <pubDate>Wed, 17 Sep 2025 11:00:02 GMT</pubDate>
      <description><![CDATA[Whether we call it APM or observability is bikeshedding. What really matters is ensuring systems deliver the service levels users expect. That’s where AI comes in.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/pexels-tommes-frites-1141358642-33905801-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Over the summer, <a href="https://www.apmdigest.com/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>APMdigest</u></a> published a <a href="https://www.apmdigest.com/apm-observability-cutting-through-confusion-1?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>fantastic 12-part series on APM and observability</u></a>, bringing together dozens of voices from across our industry. First things first: a huge thank you to <a href="https://www.linkedin.com/in/pete-goldin-a5503416/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Pete Goldin</u></a> and the APMdigest team for pulling it together. If you haven’t read it yet, I highly recommend it. It’s one of the best ways to see how practitioners, vendors, and thought leaders are thinking about the future of monitoring and reliability.&nbsp;</p><p>Two topics stood out most in the series. The first is the ongoing wrestling match between the terms Application Performance Management (APM) and observability. The second is how AI will change this space over the coming years. Let’s take them one by one.&nbsp;&nbsp;</p><h2 id="apm-vs-observability-%E2%80%94-the-bikeshedding-debate">APM vs. Observability — The Bikeshedding Debate&nbsp;</h2><p>When we talk about observability, it’s important to distinguish between the technical definition and the marketing definition. In control theory, observability means “a measure of how well internal states of a system can be inferred from knowledge of its external outputs” (<a href="https://en.wikipedia.org/wiki/Observability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Wikipedia</u></a>). We borrowed that definition from control theory into the domain of software, where those “internal states” are the real things that are happening invisible to us: CPU cycles, memory consumption, packet flows, the actual bits moving through our systems. 
The “outputs” are what we measure: traces, metrics, logs, profiles, and ultimately what the end user sees.&nbsp;</p><p>It’s a good definition because it clearly states how we can make a system objectively more observable: by providing increasingly better external outputs that describe the internal states as closely as possible, until in theory one could determine the entire system’s behavior from its outputs. This is how engineers think, which is why it resonates so well with us.&nbsp;&nbsp;</p><p>Furthermore, we can see this approach in action, as open projects and standards like Prometheus and OpenTelemetry evolve, adding new signals, more domains, and even semantic conventions to steadily improve what can be inferred from telemetry. Observability in the technical sense is just about making more of the system’s inner state visible.&nbsp;</p><p>But then there’s the marketing use of the term “observability,” particularly in how vendors brand and sell their offerings as “observability platforms.” Look under the hood and you’ll see evolution, not revolution. Many features touted as “new” under the observability banner — distributed tracing, combined app and infra visibility, end-user monitoring, business-centric insights — were already present in APM solutions long before. </p><p>APM, traditionally defined as the discipline of monitoring and managing the performance and availability of software applications, had long offered these capabilities — the key difference is that they were locked inside vendor walled gardens, fragmented across products, and often poorly implemented. Observability didn’t invent them; it generalized them and, crucially, opened them up.&nbsp;&nbsp;</p><p>That’s the real accomplishment of observability, in its marketing sense: expanding a space once dominated by closed APM vendors, breaking down their walls, and making telemetry collection a commodity.&nbsp;</p><p>Beyond that, does it really matter whether we call it APM or observability? 
The debate may have been useful to move the industry forward, but today it’s a distraction and a waste of energy; it’s <a href="https://en.wiktionary.org/wiki/bikeshedding?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>bikeshedding</u></a>. Use the terms however they suit you — what really matters is ensuring systems consistently deliver the service levels our users expect in the years ahead. And that’s where AI comes in.&nbsp;</p><h2 id="ai-in-observability-%E2%80%94-hype-reality-and-what%E2%80%99s-missing">AI in Observability — Hype, Reality, and What’s Missing&nbsp;</h2><p>The APMdigest series also collected a wide range of AI predictions. Many of them, we at Causely agree with wholeheartedly. Machine learning is already reducing alert fatigue, clustering related events, and detecting anomalies faster than humans. LLMs are making telemetry more accessible in many languages, so engineers can query systems in their own tongue without having to learn domain-specific query languages. These are real wins.&nbsp;</p><p>Where we diverge is in how far LLMs can take us. At their core, LLMs are pattern-matching machines. They can generate plausible explanations, but they don’t actually understand causality. That’s a critical limitation when the task is diagnosing why a system failed. <a href="https://www.tylervigen.com/spurious-correlations?ref=causely-blog.ghost.io" rel="noreferrer">Correlation is not causation</a> — and if you’ve been on-call at 3 a.m., you know how costly the wrong guess can be.&nbsp;</p><p>LLMs are also inherently reactive. If I have to ask the chatbot why something is broken, it’s already too late. They’re not designed to proactively detect precursors of failure, to flag the subtle changes that precede an outage. That requires something different: causal reasoning.&nbsp;</p><p>Causal reasoning is what turns noise into signal. It builds an explicit model of how services depend on each other and how failures propagate. 
That model acts as a pre-processor layer: it distills raw telemetry into precise insights about what is really going on. Once you have that, LLMs can actually shine. They can take the output of causal analysis and do what they do best — generate natural-language explanations, propose remediations, even open pull requests to apply safe fixes.&nbsp;</p><p>In other words, causal reasoning is the missing layer between raw observability data and generative AI. Without it, we risk drowning in correlations. With it, we unlock the path to truly autonomous reliability. We’ve written more about this in <a href="https://www.infoq.com/articles/causal-reasoning-observability/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>InfoQ</u></a>, on <a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>our own blog</u></a>, and in industry pieces like <a href="https://www.ciodive.com/spons/ai-sres-separating-hype-from-reality/759482/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>CIO Dive</u></a> if you want to dive deeper.&nbsp;</p><h2 id="closing-thoughts">Closing Thoughts&nbsp;</h2><p>The APM vs. observability debate may make for lively discussion, but the more interesting question is how AI will actually change the way we run systems. At Causely, we agree with much of the optimism in the APMdigest series, but we believe the crucial next step is to go beyond correlation and embrace causal reasoning. That’s how we move from data to control, from dashboards to autonomous systems.&nbsp;</p><p>It’s a privilege to be part of a community that debates these questions so openly. We’re grateful to APMdigest for hosting the series, and we look forward to future editions. 
In the meantime, we’d love to hear your perspective — drop us a note at <a href="mailto:community@causely.ai" rel="noreferrer noopener"><u>community@causely.ai</u></a> or join the conversation with us on <a href="https://www.linkedin.com/company/causely-ai?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>social media</u></a>.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Microservices and the Myth of Fault Isolation]]></title>
      <link>https://causely.ai/blog/microservices-and-the-myth-of-fault-isolation</link>
      <guid>https://causely.ai/blog/microservices-and-the-myth-of-fault-isolation</guid>
      <pubDate>Wed, 10 Sep 2025 14:05:45 GMT</pubDate>
      <description><![CDATA[Microservices do not automatically deliver fault isolation by design. They replace one obvious forest fire with a sprawling network of subtle, cascading brush fires.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/gui-muller-dRqOMXgo0NQ-unsplash-1-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><a href="https://www.atlassian.com/microservices?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Atlassian’s guide on microservices</u></a><em> </em>makes the claim: <em>“One error affects the entire application in monolithic architectures. But microservices are independent. One failure won't affect the other parts of the application.”</em>&nbsp;</p><p>It’s a reassuring idea, but it’s a myth. Microservices don’t isolate failure; they multiply it.&nbsp;</p><p>In real-world distributed systems operating at scale, failures do not stay politely in their lanes. They leak across queues, caches, retries, and shared state. They multiply through invisible dependencies. And the more services you run, the harder it gets to see where the <a href="https://www.causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>blast radius</u></a> actually stops and, critically, what the cause is.&nbsp;</p><p><strong>Microservices do not automatically deliver fault isolation by design.</strong> They replace one obvious forest fire with a sprawling network of subtle, cascading brush fires.&nbsp;</p><h2 id="the-promise-of-microservices">The Promise of Microservices&nbsp;</h2><p>On paper, microservices look like a natural cure for fragility:&nbsp;</p><ul><li>Each service is independent, so one crash should not cascade to others.&nbsp;</li><li>Circuit breakers, bulkheads, and fallbacks can contain failures.&nbsp;</li><li>Advanced designs like cell-based architectures further limit blast radius.&nbsp;</li></ul><p>This sounds good in theory. 
And in tightly disciplined environments with mature engineering practices, some of these promises hold.&nbsp;</p><h2 id="the-reality">The Reality&nbsp;</h2><p>In practice, fault isolation is rarely automatic and microservices make it harder to understand and control the blast radius.&nbsp;&nbsp;</p><ul><li><strong>New failure modes emerge.</strong> Latency, coordination bugs, partial outages, and data drift become common.&nbsp;</li><li><strong>Shared dependencies betray isolation. </strong>A database, queue, or cache hiccup can silently spread impact across dozens of services.&nbsp;</li><li><strong>Partial degradation is worse than full failure.</strong> Services stuck in retry storms or serving stale data prolong incidents instead of containing them.&nbsp;</li><li><strong>Operational burden grows.</strong> Effective isolation requires top-tier observability, disciplined retry policies, and carefully engineered degradation strategies.&nbsp;</li></ul><p>Without disciplined engineering focus, fault isolation remains more promise than reality.&nbsp;</p><h2 id="our-perspective-at-causely">Our Perspective at Causely&nbsp;</h2><p>We do not believe resilience is a side effect of microservices. It is a design goal that must be deliberately engineered, monitored, and maintained.&nbsp;</p><p>That is why our system focuses on <a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=causely-blog.ghost.io" rel="noreferrer">causal reasoning</a>:&nbsp;</p><ul><li><strong>Mapping dependencies.</strong> We continuously uncover how services and infrastructure actually connect, not just how architects think they do.&nbsp;</li><li><strong>Analyzing blast radius. 
</strong>We model how failures propagate, not just where they originate.&nbsp;</li><li><strong>Pinpointing cause and effect.</strong> We distinguish symptoms from true root causes, even when failures ripple through retries, caches, and queues.&nbsp;</li></ul><p>The challenge is not shrinking the blast radius. It is being able to understand it clearly and programmatically mitigate the damage. That is the gap our system closes.&nbsp;</p><h2 id="tldr">TL;DR?&nbsp;</h2><p>Resilience does not come from the architecture you choose. It comes from how well you understand causality inside it.&nbsp;</p><p>For engineering teams, that means:&nbsp;</p><ul><li><strong>Continuously mapping </strong>dependencies and blast radius across services.&nbsp;</li><li><strong>Designing isolation explicitly </strong>instead of assuming microservices will provide it.&nbsp;</li><li><strong>Using systems that reason about cause and effect</strong>, so humans are not left guessing when it matters most.&nbsp;</li></ul><p>The myth is that microservices give you fault isolation for free. The reality is that they make causal reasoning non-optional.&nbsp;</p><p>If your experience has confirmed (or contradicted) this reality, <a href="mailto:community@causely.ai" rel="noreferrer">we would love to hear it</a>. Engineers get stronger when we challenge the myths together.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Software With Podcast: Find the Root Cause in Seconds]]></title>
      <link>https://causely.ai/blog/the-software-with-podcast-find-the-root-cause-in-seconds</link>
      <guid>https://causely.ai/blog/the-software-with-podcast-find-the-root-cause-in-seconds</guid>
      <pubDate>Wed, 10 Sep 2025 10:38:00 GMT</pubDate>
      <description><![CDATA[Severin shares insights into his career path, including his involvement with AppDynamics and Cisco, and his current role at Causely, where he focuses on OpenTelemetry and causal reasoning for root cause analysis.]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/Screenshot-2025-09-16-at-6.37.04---AM-1.png" type="image/png" />
      <content:encoded><![CDATA[<p>This interview with Severin Neumann on <a href="https://www.youtube.com/watch?v=0aBF7t3l9y4&ref=causely-blog.ghost.io" rel="noreferrer">The Software With</a> podcast examines his journey from writing his first line of code in PHP 3.0 to becoming a key figure in the world of OpenTelemetry. Severin shares insights into his career path, including his involvement with AppDynamics and Cisco, and his current role at Causely, where he focuses on OpenTelemetry and causal reasoning for root cause analysis.</p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/0aBF7t3l9y4?si=jrqDSfR6XxVcnyL7" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[AI SREs: Separating hype from reality]]></title>
      <link>https://causely.ai/blog/ai-sres-separating-hype-from-reality</link>
      <guid>https://causely.ai/blog/ai-sres-separating-hype-from-reality</guid>
      <pubDate>Mon, 08 Sep 2025 14:30:00 GMT</pubDate>
      <description><![CDATA[This article has been reposted with permission from CIO Dive.]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/cio-dive-blk.png" type="image/png" />
      <content:encoded><![CDATA[<p><em>Reposting with permission; the article was originally published in </em><a href="https://www.ciodive.com/spons/ai-sres-separating-hype-from-reality/759482/?ref=causely-blog.ghost.io" rel="noreferrer"><em>CIO Dive</em></a><em>.</em></p><h2 id="introduction"><strong>Introduction</strong>&nbsp;</h2><p>“AI SRE” – shorthand for applying artificial intelligence (AI) to Site Reliability Engineering (SRE) – has been bubbling up in conference talks, blog posts, and vendor marketing. SRE is a discipline that applies software engineering principles to ensure systems meet defined reliability targets. The idea of automating or augmenting this work with AI is compelling: fewer mistakes, less downtime, faster resolution, and lower cost. Not to mention, most engineering leaders would acknowledge a talent shortage in trying to fill these roles.&nbsp;</p><p>But the conversation is often fuzzy due to:&nbsp;</p><ul><li>A deafening level of marketing hype disconnected from reality.&nbsp;</li><li>Varied understanding and adoption of SRE as a job function.&nbsp;</li><li>Different interpretations and implementations of AI.&nbsp;</li><li>Technology being implemented that is a poor fit for the task at hand.&nbsp;</li></ul><p>In this article, I’ll strip away the buzzwords and define what I mean by AI SRE in precise, engineering-grounded terms. I’ll clarify how “AI” is being used today and what it should mean in the context of reliability engineering. 
From there, I’ll examine where it adds value, where it falls short, and outline a more effective starting point built on structured causal reasoning that can empower the safe and effective automation of reliability work.&nbsp;</p><h2 id="sre-one-discipline-different-meanings"><strong>SRE: One Discipline, Different Meanings</strong>&nbsp;</h2><p>The SRE role traces its roots to Google in the 2000s, when Ben Treynor Sloss described it as “what happens when you ask a software engineer to design an operations function.” At its core, SRE is about taking accountability for reliability at a defined level that balances user and business needs while preserving engineering velocity.&nbsp;</p><p>To support this, the discipline introduced a set of practices that gave teams a measurable way to manage reliability: Service Level Objectives (SLOs) to make goals explicit, error budgets to balance those goals with the pace of change, and blameless postmortems to learn from failure. Automation followed as a way to reduce toil and eliminate human error. Together, these practices positioned reliability as an engineering problem rather than an afterthought.&nbsp;</p><p>However, in practice, many teams struggle to adopt these ideals. Reliability work often takes a backseat to feature delivery, and SLOs are often only partially implemented across sprawling microservices. What Google codified in its SRE handbook often reaches production environments as a patchwork.&nbsp;</p><p>Organizations also implement the role itself in different ways: embedded with product teams, focused on platform infrastructure, parachuting in as consultants, or owning services end to end. Most end up with some hybrid.&nbsp;</p><p>No matter the org structure or maturity, the discipline of reliability engineering is well understood: applying engineering principles to keep systems available, performant, and predictable. 
While the goals are clear, most engineering leaders would agree there is still plenty of room for improvement in practice.&nbsp;</p><h2 id="what-%E2%80%9Cai%E2%80%9D-actually-means"><strong>What “AI” Actually Means</strong>&nbsp;</h2><p>Before defining the ideal AI SRE, we need to clarify what “AI” means and how the term is used today.&nbsp;&nbsp;</p><p>In its original sense, Artificial Intelligence refers to systems capable of completing tasks that normally require human intelligence: reasoning, learning, problem-solving, perception, and natural language understanding.&nbsp;&nbsp;</p><p>Classical AI approached these challenges by decomposing complex tasks into structured representations of knowledge, explicit planning procedures, and inference engines that could compute outcomes from evidence. For example, building a diagnostic assistant for medicine required an ontology of diseases and symptoms, probabilistic rules for how symptoms map to likely conditions, and an inference engine to suggest tests or treatments based on the evolving patient state. Historically, AI has included:&nbsp;</p><ul><li>Rule-based systems</li><li>Search and planning algorithms</li><li>Probabilistic reasoning</li><li>Constraint satisfaction</li></ul><p>For deeper study, I recommend:&nbsp;</p><ul><li><em>Artificial Intelligence: A Guide for Thinking Humans</em>&nbsp;– Melanie Mitchell</li><li><em>Artificial Intelligence: A Modern Approach</em>&nbsp;– Stuart Russell &amp; Peter Norvig</li><li><em>The Book of Why</em>&nbsp;– Judea Pearl</li><li>Judea Pearl on&nbsp;<em>Cause and Effect</em>&nbsp;(Sean Carroll Podcast)&nbsp;</li></ul><p>These systems performed well when the domain was well defined and the relationships between causes, observations, and interventions could be explicitly modeled. 
But they broke down when faced with ambiguity, missing context, or natural language.&nbsp;</p><p>In common usage, “AI” today typically refers to large language models (LLMs), and in some cases, smaller-scale variants known as small language models (SLMs). These models generate language by predicting the most likely next word given the previous context, using patterns learned from massive training datasets. They have become popular because they handle a wide range of natural language tasks with impressive fluency, such as text generation, summarization, question answering, and code completion.&nbsp;&nbsp;</p><p>But while language models are effective at producing language and code, they have no inherent understanding of system state, live telemetry, or causal dependencies.&nbsp;&nbsp;</p><h2 id="why-language-models-are-the-wrong-starting-point-for-reliability-engineering"><strong>Why Language Models Are the Wrong Starting Point for Reliability Engineering</strong>&nbsp;</h2><p>Reliability engineering requires exactly this kind of causal reasoning, which is why language models alone are the wrong foundation. The irony is that in its foundational sense, AI is a natural fit for reliability engineering: diagnosing outages, predicting failures, and recommending fixes are structured decision problems that align well with decades-old AI techniques. 
But when “AI” is reduced to LLMs and SLMs, the fit becomes like shoving a square peg through a round hole.&nbsp;</p><p>Surveying the market, most self-proclaimed AI SRE approaches fall into one or more of these categories:&nbsp;</p><ul><li><strong>Postmortem narratives</strong>&nbsp;– After-the-fact write-ups shaped as much by bias as data, crafted to explain what went wrong in hindsight.&nbsp;</li><li><strong>Correlation engines</strong>&nbsp;– Systems that surface co-occurring anomalies and related events during incidents but conflate correlation with causation.&nbsp;</li><li><strong>Data-fetching assistants</strong>&nbsp;– Interfaces that summarize telemetry and suggest plausible explanations without guarantees they are correct or verifiable. Some resemble traditional automation engines, requiring heavy setup in data formatting, rules, and conditions before delivering value.&nbsp;</li></ul><p>When language models are used as the foundation for reliability engineering, four weaknesses undermine their effectiveness:&nbsp;</p><ul><li><strong>Spurious causes</strong>&nbsp;– Coherent but incorrect diagnoses from hallucination, logical inconsistency, or lack of live-environment awareness.&nbsp;</li><li><strong>Unprincipled reasoning</strong>&nbsp;– Mimicking the language of reasoning without performing structured inference.&nbsp;</li><li><strong>Causal identification failures</strong>&nbsp;– Difficulty pinpointing causes in dynamic systems, especially when new evidence contradicts learned assumptions.&nbsp;</li><li><strong>Runaway costs</strong>&nbsp;– Without precise prompting and context, LLMs consume large amounts of compute to generate answers that may still be inaccurate.&nbsp;</li></ul><p>For reliability engineering, starting with a language model is starting in the wrong place. These approaches assume you already know which signals are worth chasing, when in reality multiple noisy alerts often trace back to a single root cause. 
Without causal reasoning at the foundation, both humans and AI waste time chasing symptoms instead of causes. Whether you are building internally at enterprise scale or looking for an off-the-shelf capability, any viable AI SRE must begin with causal analysis as its core.&nbsp;</p><h2 id="what-effective-ai-for-reliability-engineering-looks-like"><strong>What Effective AI for Reliability Engineering Looks Like</strong>&nbsp;</h2><p>An effective AI SRE isn’t just a chatbot or a rule engine. It’s a framework that combines structured causal knowledge, probabilistic inference, and agentic capabilities. This foundation enables advanced reasoning and supports automation of real reliability engineering work.&nbsp;</p><p>Such a system needs at least three interdependent capabilities:&nbsp;</p><h3 id="1-a-live-causal-representation-of-the-environment"><strong>1. A live causal representation of the environment</strong></h3><p>A domain-specific causal model that encodes how components in a distributed system can fail, how those failures propagate, and the symptoms they produce, paired with a continuously updated topology graph showing the real-time structure of services, infrastructure, and their interconnections.&nbsp;</p><p><em>Why it matters:</em>&nbsp;Replaces fuzzy LLM pattern-matching with a verifiable system map, enabling deterministic reasoning.&nbsp;</p><p><em>Counter to LLMs:</em>&nbsp;Addresses unprincipled reasoning by grounding analysis in a causal Bayesian network that models directionality as probabilities.&nbsp;</p><p><em>Example:</em>&nbsp;The model knows that a latency spike in a database layer can propagate through dependent APIs three layers away and can quantify that likelihood based on current conditions.&nbsp;</p><h3 id="2-real-time-probabilistic-inference-over-live-telemetry"><strong>2. 
Real-time probabilistic inference over live telemetry</strong></h3><p>Continuous ingestion of metrics, traces, and logs, mapped against the causal model to identify the most likely root cause at any given moment. This inference layer reflects both structural dependencies and observed patterns of failure propagation, updating conclusions the instant new evidence arrives.&nbsp;</p><p><em>Why it matters:</em>&nbsp;Dynamic systems change fast, with new code, new dependencies, and shifting load. LLMs cannot adapt without retraining, while a probabilistic inference engine adjusts instantly.&nbsp;</p><p><em>Counter to LLMs:</em>&nbsp;Addresses causal identification failures, especially in counterfactual scenarios where new observations contradict prior assumptions.&nbsp;</p><p><em>Example:</em>&nbsp;If a new deployment creates a previously nonexistent dependency, the model incorporates it into its reasoning immediately, without new rules or configs.&nbsp;</p><h3 id="3-cross-attribute-and-cross-service-dependency-reasoning"><strong>3. Cross-attribute and cross-service dependency reasoning</strong></h3><p>Beyond linking causes to symptoms, the system maps how performance attributes such as latency, throughput, and utilization depend on one another across services and infrastructure layers. By modeling these relationships, it can trace how a change in one part of the system cascades elsewhere, identify emerging bottlenecks, and detect when operational constraints are at risk. An added benefit is a sharp reduction in alert fatigue, because by isolating the true point of failure, only the responsible service owner is notified rather than multiple teams chasing downstream symptoms.&nbsp;</p><p><em>Why it matters:</em>&nbsp;Many incidents stem from chains of interdependent changes, not a single fault. 
Modeling these relationships eliminates blind spots and false leads.&nbsp;</p><p><em>Counter to LLMs:</em>&nbsp;Addresses spurious causes by ruling out explanations that violate known constraints or dependencies.&nbsp;</p><p><em>Example:</em>&nbsp;If a queue length increase is due to a downstream service slowdown rather than local CPU saturation, the model identifies the real cause directly.&nbsp;</p><p>Either way, the same foundation of causal reasoning, probabilistic inference, and dependency modeling is required. With that foundation in place, language models finally have a meaningful role to play.&nbsp;</p><h2 id="the-role-of-language-models-in-an-effective-ai-sre"><strong>The Role of Language Models in an Effective AI SRE</strong>&nbsp;</h2><p>Earlier, I noted that large and small language models can add value to reliability workflows. When provided with the right context, they can help accelerate diagnosis, generate remediations, and improve resolution timelines. But on their own, they lack causal understanding. Without grounding in system state and structure, even the most advanced model tends to produce responses that are plausible but operationally useless.&nbsp;</p><p>Modern AI SRE solutions must transcend summarizing symptoms or correlating metrics. To truly support autonomous site reliability, they must be able to understand complex IT environments, reason over live telemetry, and intervene intelligently to maintain SLOs and keep services healthy. AI for SRE should not be defined by language generation alone. 
It must include:&nbsp;</p><ul><li>A continuously updated causal model of the environment</li><li>The ability to perform probabilistic inference over live telemetry</li><li>The capacity to reason across attributes, services, and dependencies</li><li>An agentic interface that can propose or execute safe and effective actions</li></ul><p>Taken together, these capabilities reframe AI SRE from buzzword to engineering discipline.&nbsp;</p><h2 id="conclusion"><strong>Conclusion</strong>&nbsp;</h2><p>The conversation about AI SREs is too important to leave to buzzwords. “AI SRE” should not mean a chatbot guessing its way through your telemetry. Treating “AI” narrowly as LLMs or SLMs misses the chance to apply decades of proven AI techniques — causal reasoning, probabilistic inference, and constraint satisfaction — to one of the highest-impact opportunities in modern software engineering: autonomous service reliability.&nbsp;</p><p>The future of AI SRE will not be built on pattern-matching language models alone, but on systems that can reason causally about live environments and adapt in real time. Language models then become a powerful augmentation layer. Without that foundation, AI SRE is just another buzzword.&nbsp;</p><p>Whether you are an enterprise building internally and looking to integrate causal reasoning into your architecture, or a mid-market company that needs a turnkey solution, the same principle applies: causal reasoning must come first. At Causely, we provide the flexibility to address both situations - a platform that can serve as a foundation for your own build, or a ready-to-deploy solution that works out of the box.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Address External API Slowdowns]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-external-api-slowdown</link>
      <guid>https://causely.ai/blog/causely-feature-demo-external-api-slowdown</guid>
      <pubDate>Wed, 03 Sep 2025 15:16:21 GMT</pubDate>
      <description><![CDATA[When a provider slows down, Causely shows exactly how the impact ripples across your services and identifies the external API as the root cause.]]></description>
      <author>Anson McCook</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/Screenshot-2025-09-04-080734.png" type="image/png" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/tT0Ju5vO97w?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely External API Slowdown Demo">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>When an upstream API slows down or quietly rate-limits requests, your customers feel the pain long before your team finds the cause. Error rates spike, retries pile up, queues fill, and internal services appear unreliable. The hard part is proving it isn’t you. Traditional observability tools only show symptoms inside your stack — latency, 5xx errors, and timeouts across services — making it nearly impossible to tell whether the real problem is internal or hiding in an external provider.<br><br><a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a> changes that. With <a href="https://docs.causely.ai/telemetry-sources/ebpf/?ref=causely-blog.ghost.io" rel="noreferrer">eBPF auto-instrumentation</a>, Causely sees client communication with external services and maps those dependencies directly into your topology and causal reasoning engine. When a provider slows down, Causely shows exactly how the impact ripples across your services and identifies the external API as the root cause. That means your team can stop chasing internal alerts and start taking action — escalating with proof in hand or triggering fallback strategies. <br><br>Watch the video to see how Causely helps you go from symptoms to root cause instantly, even when the root cause isn’t yours.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How Causal Reasoning Addresses the Limitations of LLMs in Observability]]></title>
      <link>https://causely.ai/blog/how-causal-reasoning-addresses-the-limitations-of-llms-in-observability</link>
      <guid>https://causely.ai/blog/how-causal-reasoning-addresses-the-limitations-of-llms-in-observability</guid>
      <pubDate>Tue, 02 Sep 2025 18:15:29 GMT</pubDate>
      <description><![CDATA[Causal reasoning with AI agents enables proactive incident prevention, automated remediation, and a path toward autonomous service reliability.]]></description>
      <author>Dhairya Dalal</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/infoq-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Reposted with permission from </em><a href="https://www.infoq.com/articles/causal-reasoning-observability/?ref=causely-blog.ghost.io" rel="noreferrer"><em>InfoQ</em></a><em>.</em></p><h2 id="key-takeaways">Key Takeaways</h2><ul><li>Large language models (LLMs) in observability excel at turning high-volume telemetry such as logs, traces, and metrics into concise human-readable narratives, but they lack structural system knowledge and struggle to isolate root causes in complex distributed architectures.</li><li>Current LLM and agentic AI approaches are prone to hallucinating plausible but incorrect explanations, mistaking symptoms for causes, and ignoring event ordering, which leads to misdiagnosis and incomplete remediation.</li><li>Causal reasoning models service and resource dependencies explicitly, accounts for event temporality, and supports inference under partial or noisy observations, enabling more accurate root cause identification.</li><li>Causal graphs and Bayesian inference allow for counterfactual and probabilistic reasoning, which lets engineers evaluate remediation options and their likely impact before taking action.</li><li>Integrating LLM-based interfaces with continuously updated causal models and abductive inference engines provides a practical path to reliable, explainable, and eventually autonomous incident diagnosis and remediation in cloud native systems.</li></ul><hr><p>The central goal of IT operations and site reliability engineering (SRE) is to maintain the availability, reliability, and performance of services while enabling safe and rapid delivery of changes. Achieving this requires a deep understanding of how systems behave during incidents and under operational stress. Observability platforms provide the foundation for this understanding by exposing telemetry data (logs, metrics, traces) that support anomaly detection, performance analysis, and root cause investigations. 
However, modern applications are increasingly difficult to manage as cross-service calls, event-driven workflows, and distributed data stores introduce complex and dynamic interactions.</p><p>For instance, in July 2024, a faulty configuration update in CrowdStrike’s Falcon sensor caused&nbsp;<a href="https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages?ref=causely-blog.ghost.io">widespread crashes on millions of Windows systems</a>&nbsp;across industries worldwide. In another case, the 2016 removal of the tiny but widely used left-pad package from npm briefly broke thousands of builds and disrupted major websites until it was restored, revealing the&nbsp;<a href="https://arstechnica.com/information-technology/2016/03/rage-quit-coder-unpublished-17-lines-of-javascript-and-broke-the-internet?ref=causely-blog.ghost.io">fragility of transitive dependencies at scale</a>. Whether the trigger is an external contingency or a rare emergent interaction within a highly coupled system, modern IT infrastructure can experience widespread service outages due to complex cross-service dependencies. Implicit dependencies, asynchronous communication, and distributed state make it challenging to pinpoint the source of incidents or understand the chain of effects across the system.</p><p>A&nbsp;<a href="https://www.gartner.com/reviews/market/observability-platforms?ref=causely-blog.ghost.io">new class of AI-based observability solutions</a>&nbsp;built on LLMs is gaining traction as they promise to simplify incident management, identify root causes, and automate remediation. These systems sift through high-volume telemetry, generate natural-language summaries based on their findings, and propose configuration or code-level changes. Additionally, with the advent of agentic AI, remediation workflows can be automated to advance the goal of self-healing environments. 
However, such tools remain fundamentally limited in their ability to perform root-cause analysis for modern applications. LLM-based solutions often hallucinate plausible but incorrect explanations, conflate symptoms with causes, and disregard event ordering, leading to misdiagnosis and superficial fixes.</p><p>Fundamentally, LLMs operating only on observed symptoms gleaned from telemetry are attempting to deduce root causes by traversing logs and shallow topologies. However, LLMs lack an a priori understanding of the environment as a dynamic system with evolving interdependencies. As a result, underlying issues will persist even if symptoms are partially remediated in the short term.</p><p>Effective root-cause analysis in complex, distributed systems requires understanding the causal structure of events, services, and resources. Causal knowledge and reasoning remain critical missing components in modern AI-based observability solutions. Causal knowledge is codified into causal graphs, which model inter-service and resource dependencies explicitly. By supporting counterfactual inquiry, causal inference enables root-cause isolation and systematic remediation analysis. Augmenting LLMs and agentic AI with continuously updated causal models and an abductive inference engine (which identifies the best explanation for observed symptoms using causal reasoning) offers a path toward autonomous service reliability.</p><p>In this article, we begin by outlining the strengths of LLMs and agentic AI in observability and incident management. We then examine their limitations in performing accurate root cause analysis and driving effective remediation. Next, we introduce how causal knowledge and inference engines provide the missing context for precise incident diagnosis and response. 
Finally, we discuss how combining causal reasoning with AI agents enables proactive incident prevention, automated remediation, and the path toward autonomous service reliability.</p><h2 id="the-strengths-and-promise-of-llms-and-agentic-ai">The Strengths and Promise of LLMs and Agentic AI</h2><p>The term "AI" has become increasingly overloaded with marketing hype and public fascination. AI now applies to everything from threshold-based alerting scripts to autonomous agents capable of planning and acting across complex workflows. For simplicity, AI solutions in the observability space can be categorized as rule-based systems, LLM-based tools, and agentic AI systems. Rule-based systems include hand-crafted logic and statistical models configured to monitor baseline deviations, detect known signal patterns, and apply threshold-based alerting across logs, metrics, and traces.</p><p>LLM-based solutions leverage the generative and language-understanding capabilities of language models to support natural-language interactions with observability data. LLMs can process unstructured telemetry such as logs, traces, and alert descriptions to generate summaries, interpret errors, and create remediation plans. Agentic AI allows LLMs to act in a managed environment by providing multi-step planning, tool-assisted execution, and direct code and configuration changes. Next, we examine the specific strengths of LLMs and agentic AI in the context of observability.</p><p>LLMs are neural architectures pretrained on large-scale corpora of natural language, code, and other text-based resources. Despite being fundamentally next-token predictors trained to model the conditional probability of text, when scaled to billions of parameters and exposed to terabytes of diverse data, LLMs become highly effective at producing coherent language and supporting a wide range of language-based tasks. 
Modern LLMs have been further fine-tuned for instruction-following, factual recall, code generation, and domain-specific question answering, used subsequently to explain errors, answer technical questions, and generate code, scripts, and configuration changes.</p><p>In observability contexts, LLMs can interpret complex logs and trace messages, summarize high-volume telemetry, translate natural-language queries into structured filters, and synthesize scripts or configuration changes to support remediation. Most LLM solutions rely on proprietary providers such as OpenAI and Anthropic, whose training data is opaque and often poorly aligned with specific codebases or deployment environments. More fundamentally, LLMs can only produce text. They cannot observe system state, execute commands, or take action. These limitations gave rise to agentic systems that extend LLMs with tool use, memory, and control.</p><p>Agentic AI comes closest to delivering on the speculative promise of AI. In practice, agentic systems commonly follow the&nbsp;<a href="https://arxiv.org/abs/2210.03629?ref=causely-blog.ghost.io">ReAct framework introduced by Yao et al. (2022)</a>, which integrates reasoning and action in an interleaved loop. In this setup, the LLM generates intermediate reasoning steps, selects actions such as querying tools or retrieving information, and incorporates feedback from those actions to inform the next step. This cycle of thought, action, and observation allows the system to iteratively plan, update context, and progress toward a goal. 
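</p><p>To make the loop concrete, here is a schematic sketch of a ReAct-style cycle in Python. Everything in it is illustrative: the canned "plan" stands in for an LLM's reasoning, and query_metrics is a stub, not a real observability API:</p>

```python
# Schematic ReAct-style loop. The canned plan stands in for LLM reasoning,
# and query_metrics is a stub telemetry lookup, not a real API.

def query_metrics(service):
    # Stub "tool": returns a canned health observation for a service.
    return {"S5": "timeouts", "S3": "high latency"}.get(service, "healthy")

def run_react_loop(max_steps=4):
    history = []
    plan = iter(["S5", "S3", "S2"])           # "Thought": which suspect to inspect next
    for _ in range(max_steps):
        service = next(plan, None)
        if service is None:
            break
        observation = query_metrics(service)  # "Action" + "Observation"
        history.append((service, observation))
        if observation == "healthy":          # feedback ends or redirects the search
            break
    return history

trace = run_react_loop()
# trace walks S5 -> S3 -> S2, stopping once a healthy hop is observed
```

<p>A real agent would replace the canned plan with model-generated reasoning and the stub with actual tool calls, but the thought, action, and observation shape of the loop is the same.</p><p>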
With these capabilities, LLMs are able to write applications, solve multi-step reasoning problems, generate code based on system feedback, and interact with external services to complete goal-directed tasks.</p><p>Agentic AI shifts observability workflows from passive diagnostics to active response by predicting failure paths, initiating remediations, and executing tasks such as service restarts, configuration rollbacks, and state validation. However, current agentic systems lack a priori structural and causal models of the environment, which limits their ability to anticipate novel failure modes or explain observed behavior beyond surface-level associations. While these constraints remain, agentic AI represents a necessary step toward autonomous, tool-integrated systems capable of reasoning and acting within complex managed environments.</p><p>The ultimate promise of applying agentic AI to IT operations is autonomous service reliability. An ideal system continuously monitors telemetry, identifies potential failures, evaluates impact, and applies targeted interventions with limited human oversight. Integrated into the observability and operations stack, agentic AI should function as the control layer. It reasons over system state, coordinates diagnostics, and orchestrates remediations so that services operate reliably and in alignment with defined service level objectives (SLOs), which specify availability, latency, or other performance targets. 
Ultimately, autonomous service reliability reduces operational complexity, accelerates incident resolution, and improves service reliability across large-scale, dynamic environments.</p><h2 id="on-the-limitations-of-modern-ai-and-the-need-for-causal-reasoning">On The Limitations of Modern AI and the Need for Causal Reasoning</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://imgopt.infoq.com/fit-in/3000x4000/filters:quality(85)/filters:no_upscale()/articles/causal-reasoning-observability/en/resources/144figure-1-1756372621866.jpg" class="kg-image" alt="" loading="lazy" width="522" height="487"><figcaption><span style="white-space: pre-wrap;">Figure 1. Example Service Map where timeouts on service S5 are caused by connection exhaustion on resource R2.</span></figcaption></figure><p>Modern service architectures often rely on shared infrastructure and layered services, where dependencies are opaque and incident signals surface far from their origin. Imagine the simple service topology shown in Figure 1, which consists of two shared resources (R1 and R2) and a set of services (S1–S5) connected through overlapping dependencies. Now consider that we observe elevated latency and request timeouts at service S5. The underlying root cause is connection exhaustion on resource R2, which intermittently blocks new connections. This condition is not surfaced through direct telemetry because service S2, which depends on resource R2, reports only latency and timeouts without exposing the underlying resource-level failure. Tracing the issue upstream, service S3 shows increased request latency, and service S2 exhibits degraded performance. 
Meanwhile, service S1 also reports elevated CPU usage and latency, though its downstream service S6 remains unaffected.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://imgopt.infoq.com/fit-in/3000x4000/filters:quality(85)/filters:no_upscale()/articles/causal-reasoning-observability/en/resources/110figure-2-1756372621866.jpg" class="kg-image" alt="" loading="lazy" width="961" height="501"><figcaption><span style="white-space: pre-wrap;">Figure 2. An example of how Agentic AI would diagnose the timeouts on S5.</span></figcaption></figure><p>An LLM-based agent begins with observable symptoms (see Figure 2). It queries telemetry, inspects logs, and follows trace spans starting at S5. Along the path through S3 and S2, it observes anomalies based on latency and request failures. It also notices performance degradation on S1 and considers it a potential contributor. The agent restarts S1 and S2 and observes that the latency and timeouts at S5 are mitigated, leading it to conclude the issue is resolved. However, the problem resurfaces once connection limits on resource R2 are hit again. This scenario illustrates two key challenges.</p><p>First, spurious signals can misdirect diagnosis by drawing attention to unrelated events. Second, some root causes cannot be directly observed through telemetry and must instead be inferred from incomplete or indirect symptoms. For instance, in this scenario, the connection exhaustion on R2 emits no direct signal. It must be inferred by reasoning over the observed symptoms across S2, S3, and S5 in combination with structural knowledge of the system. 
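</p><p>The kind of structural knowledge involved can be pictured with a toy dependency map. This is our own sketch, loosely based on the Figure 1 topology rather than code from any tool; it asks which single component sits upstream of every symptomatic service:</p>

```python
# Toy dependency map, loosely following the Figure 1 topology (illustrative only).
# depends_on[X] lists the components X calls or consumes.
depends_on = {
    "S5": ["S3"], "S3": ["S2"], "S2": ["R2"],
    "S4": ["R2"], "S1": ["R1"], "S6": ["S1"],
}

def is_upstream(component, service):
    """True if `component` is reachable by walking `service`'s dependencies."""
    stack = [service]
    while stack:
        node = stack.pop()
        if node == component:
            return True
        stack.extend(depends_on.get(node, []))
    return False

symptomatic = ["S5", "S3", "S2"]
# Only R2 lies beneath every symptomatic service; S1's anomaly is a red herring.
culprits = [c for c in ("R1", "R2", "S1")
            if all(is_upstream(c, s) for s in symptomatic)]
```

<p>Telemetry alone never names R2 here; it falls out only when the observed symptoms are combined with this structural view.</p><p>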
Therefore, resolving such incidents requires principled causal reasoning and structural causal knowledge.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://imgopt.infoq.com/fit-in/3000x4000/filters:quality(85)/filters:no_upscale()/articles/causal-reasoning-observability/en/resources/91figure-3-1756372621866.jpg" class="kg-image" alt="" loading="lazy" width="1751" height="552"><figcaption><span style="white-space: pre-wrap;">Figure 3. Example of causal resources required for abductive reasoning.</span></figcaption></figure><p>Causal knowledge maps the various causal relationships typically found in modern service architectures. The architectural knowledge provides a structural understanding of the various resources, services, and data dependencies. Finally, the causal graphs provide a probabilistic framework for inferring root causes given the observed symptoms in the context of the architecture.</p><p>Causal knowledge represents the relationship between root causes and their observable symptoms. Such knowledge can be abstracted to support principled downstream reasoning over system behavior such as fault localization, impact analysis, and proactive mitigation. Causal graphs provide a formal structure for encoding this knowledge. Popularized by Judea Pearl,&nbsp;<a href="https://en.wikipedia.org/wiki/Causal_graph?ref=causely-blog.ghost.io">causal graphs</a>&nbsp;are directed acyclic graphs that represent cause-effect relationships among variables. When applied to reliability engineering, causal graphs describe how specific failure conditions (e.g., memory exhaustion, resource saturation, lock contention, etc.) 
produce observable symptoms (e.g., latency, connection errors, service timeouts, etc.).</p><p>Unlike telemetry signals capturing only runtime observations or dependency graphs, which represent observed service call relationships, causal graphs provide a richer structural understanding of how faults propagate across services and resources. This knowledge can be utilized to support inferential reasoning and identify root causes from partially observed symptoms, even when the underlying issue is not directly visible. This can be accomplished by abductive causal reasoning.</p><p>Abductive causal reasoning provides a principled, logical, and inferential framework for identifying the most likely explanation for observed symptoms. Provided a set of plausible candidate root causes and a graphical model of their symptoms, abduction selects the cause that best accounts for the observed evidence. This approach offers several advantages for modern applications: it supports inference under partial observability, selects explanations based on causal sufficiency, and provides formal guarantees about the inferred root cause. When paired with causal Bayesian graphs extending causal graphs with probabilistic reasoning, abductive inference becomes tractable in large systems. These graphs encode prior knowledge about potential root causes and their associated symptoms, allowing the system to compute the most probable root cause without requiring prior training. Furthermore, likelihoods can be updated over time using posterior observations, enabling continuous refinement in dynamic environments.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://imgopt.infoq.com/fit-in/3000x4000/filters:quality(85)/filters:no_upscale()/articles/causal-reasoning-observability/en/resources/60figure-4-1756372621866.jpg" class="kg-image" alt="" loading="lazy" width="2048" height="743"><figcaption><span style="white-space: pre-wrap;">Figure 4. 
A simplified example of the abductive reasoning where Causal Bayesian networks are utilized to infer the most probable root cause based on observed and unobserved symptoms. Prior probabilities capture the expected symptoms associated with a given root cause and likelihood is computed based on the aggregate observations to estimate which root cause best explains the observed symptoms.</span></figcaption></figure><p>Let’s revisit our scenario with the abductive reasoning framework (Figure 4). The framework begins with a defined set of all possible root causes. Each root cause is represented by a causal graph that maps the cause to its expected symptoms with associated prior probabilities. These priors capture the likelihood of each symptom occurring if the root cause is true, derived from historical incidents and domain expertise. The abductive process identifies which root cause best explains the observed symptoms while also accounting for those that are unobserved. It computes a likelihood score for each candidate by updating the prior probabilities of its associated symptoms with observational data.</p><p>In our scenario, the observed symptoms include timeouts at service S5 and latency at services S2 and S3, which align with the causal graph for connection exhaustion on resource R2. The graph also expects latency on S4; although it is not observed, that absence is included in the likelihood estimation. Competing explanations, such as CPU starvation on S1 or network congestion on S3, receive lower likelihoods because they either fail to explain all observed symptoms or have too few expected symptoms confirmed observationally. 
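</p><p>A toy version of that scoring step might look like the following sketch. All probabilities here are invented for illustration; each candidate multiplies in p(symptom) for observed expected symptoms, 1 − p for expected-but-missing ones, and a small leak factor for observed symptoms it cannot explain:</p>

```python
# Toy abductive scoring over candidate causal models (all numbers invented).
# Each candidate maps its expected symptoms to P(symptom | cause).
causal_models = {
    "R2 connection exhaustion": {"S5 timeouts": 0.9, "S3 latency": 0.8,
                                 "S2 latency": 0.8, "S4 latency": 0.5},
    "S1 CPU starvation":        {"S1 latency": 0.9, "S6 latency": 0.7},
    "S3 network congestion":    {"S3 latency": 0.9, "S5 timeouts": 0.6},
}

observed = {"S5 timeouts", "S3 latency", "S2 latency", "S1 latency"}
LEAK = 0.1  # chance an observed symptom arose from some unmodeled cause

def likelihood(expected):
    score = 1.0
    for symptom, p in expected.items():
        # Confirmed expectations support the hypothesis; missing ones count against it.
        score *= p if symptom in observed else (1.0 - p)
    for _ in observed - expected.keys():
        score *= LEAK  # penalize observed symptoms this cause cannot explain
    return score

scores = {cause: likelihood(model) for cause, model in causal_models.items()}
best = max(scores, key=scores.get)
```

<p>With these made-up numbers, "R2 connection exhaustion" scores highest: it explains three of the four observed symptoms, and the unobserved S4 latency only discounts it rather than ruling it out.</p><p>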
The key distinction between abductive causal reasoning and deductive reasoning is that abduction evaluates all candidate root causes against both the observed symptoms and the expected symptoms defined by the causal models.</p><p>The agent-based approach missed the actual root cause because it lacked structural causal knowledge and stopped at the first plausible symptom path. In contrast, the abductive process leverages causal models and Bayesian graphs to infer the most likely root cause, using these structured probabilities to isolate the most coherent explanation even when observations are incomplete or include spurious symptoms.</p><h2 id="limitations-of-causal-reasoning">Limitations of Causal Reasoning</h2><p>While causal reasoning is a powerful approach, it does have limitations. Constructing causal models requires significant domain knowledge and ongoing effort. The graphs must accurately capture service and resource dependencies and need regular updates as architectures evolve. Coverage is another constraint. A reasoning engine can only work with root causes that are defined in the model. If a candidate cause is missing, the engine has no basis to infer it, which reduces its effectiveness in novel or poorly understood failure cases. There are also computational challenges. In large distributed environments, some root causes map to broad sets of symptoms. Computing conditional probabilities and running Bayesian inference across these sets can become costly, especially when multiple competing explanations must be evaluated in real time.</p><p>These limitations do not diminish the value of causal reasoning, but they do highlight that such methods are more effective under specific circumstances. In service architectures with linear and well-understood dependencies, the root cause of an incident is often self-apparent, and a reasoning engine may not be required. 
The challenge arises in complex distributed environments where cross-service dependencies, asynchronous communication, and distributed state make it difficult to trace symptoms back to their causes. Humans can resolve these incidents, but the process is slow, resource-intensive, and often results in longer periods of downtime and service unavailability. Therefore, meeting this challenge requires developing more effective solutions that integrate causal reasoning with AI, moving beyond current limitations and enabling progress toward autonomous service reliability.</p><h2 id="towards-the-promise-of-autonomous-service-reliability">Towards the Promise of Autonomous Service Reliability</h2><p>Causal knowledge is essential for diagnosing incidents in modern service architectures. LLM-based solutions and current agentic AI often miss the forest for the trees. These systems operate over logs, traces, and metrics, but lack the structural context required to reason comprehensively about system-level behavior. While LLMs can provide useful insights by aggregating and parsing telemetry, they are ultimately bounded by the inputs they receive. With incomplete observations, LLMs tend toward speculative explanations that often result in hallucinations. Without a causal understanding of how failures propagate through services and infrastructure, modern AI solutions cannot move beyond summarization of observed events to reliably identify underlying causes. Causal knowledge and abductive causal reasoning provide the missing keys that, when combined with LLMs and agentic AI, unlock effective identification of likely root causes from incomplete and partial observations in complex, distributed environments.</p><p>Therefore, it is necessary to augment modern LLMs with a causal reasoning engine to achieve effective incident analysis and autonomous reliability. 
A causal reasoning engine combines three key components: causal models that encode known root causes and their associated symptoms, Bayesian causal graphs that apply probabilistic reasoning over service topologies, and an abductive inference engine that selects the most likely root cause given partial or noisy observations. This engine can operate as an external reasoning layer, providing structured causal context that modern LLMs lack.</p><p>Drawing from the principles of neuro-symbolic reasoning, the LLM serves as the flexible language interface, while the causal reasoning engine verifies hypotheses, refines candidate root causes, and performs advanced reasoning that goes beyond the LLM’s predictive text generation. By integrating this engine, LLM-based agents can transition from surface-level incident triage to precise root-cause identification and actionable remediation, constructing a path toward proactive, autonomous service reliability.</p><p>Causal agents provide a meaningful step in the pursuit of autonomous service reliability. By integrating causal knowledge and abductive reasoning, these systems move beyond reactive response and manual triage. They enable proactive incident prevention by identifying emerging risks, support targeted remediation through structural awareness, and drive self-healing by isolating and addressing root causes without human intervention. The result is a system that not only detects symptoms but understands their context. This shift reduces downtime, accelerates resolution, and aids teams in managing reliability at scale with minimal manual intervention.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[OpenTelemetry Logging Best Practices to Avoid Drowning in Data]]></title>
      <link>https://causely.ai/blog/opentelemetry-logging</link>
      <guid>https://causely.ai/blog/opentelemetry-logging</guid>
      <pubDate>Thu, 28 Aug 2025 15:19:48 GMT</pubDate>
      <description><![CDATA[We’ll recap OTel logging best practices, explore how to use logs effectively in troubleshooting without drowning in data, walk through a tutorial workflow you can apply today, and show how Causely operationalizes this approach automatically at scale.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/08/pexels-pixabay-247701.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>The OpenTelemetry (OTel) community has made enormous progress in how we think about logs. Not long ago, application logging was a patchwork of vendor- or language-specific formats and ad hoc fields, without a consistent way to connect logs with other telemetry. Now, thanks to OTel’s logging specification and Collector pipeline, teams can capture, enrich, and correlate logs across any service and runtime in a standardized way.&nbsp;<br>&nbsp;<br>In this post, we’ll recap OTel logging best practices, explore how to use logs effectively in troubleshooting without drowning in data, walk through a tutorial workflow you can apply today, and show how Causely operationalizes this approach automatically at scale.&nbsp;&nbsp;</p><h2 id="otel%E2%80%99s-established-best-practices-for-logging"><strong>OTel’s Established Best Practices for Logging</strong></h2><p>The OTel project has laid a strong foundation for logs to work alongside metrics and traces:&nbsp;</p><ul><li><strong>Consistent Structure</strong> – OTel’s log data model ensures every log record has a predictable shape, making it machine-parseable and easier to route or query.&nbsp;&nbsp;</li><li><strong>Context Propagation</strong> – By embedding trace IDs, span IDs, and resource attributes in every log entry, OTel makes it possible to pivot seamlessly between logs, traces, and metrics in your analysis tools, without guesswork. 
&nbsp;</li><li><strong>Unified Collection</strong> – The <a href="https://www.causely.ai/blog/using-opentelemetry-and-the-otel-collector-for-logs-metrics-and-traces?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>OTel Collector</u></a> can ingest logs from multiple sources — applications, infrastructure, third-party services — and normalize them before routing to storage or analysis tools.&nbsp;</li><li><strong>Enrichment and Filtering</strong> – Processors in the Collector allow you to enrich logs with metadata (region, environment, service version) and filter out noise or limit verbosity.&nbsp;</li></ul><p>If your team is following these practices, congratulations: you already have the building blocks for the next step.&nbsp;</p><h2 id="why-logs-need-context-to-be-useful"><strong>Why Logs Need Context to Be Useful&nbsp;</strong></h2><p>Logs are essential to troubleshooting — they carry the fine-grained evidence of what went wrong. But in large, OTel-instrumented systems, collecting every log line from every service creates new challenges:&nbsp;</p><ul><li><strong>Mountains of data</strong> – terabytes of logs that make queries slow and expensive.&nbsp;</li><li><strong>Information overload</strong> – too many log lines without guidance on which ones matter.&nbsp;</li><li><strong>Slow incident response</strong> – engineers spend precious time searching instead of fixing.&nbsp;</li></ul><p>The goal isn’t to reduce the importance of logs — it’s to use them in context. By correlating logs with metrics and traces, you can elevate logs from raw text into <a href="https://docs.causely.ai/getting-started/how-causely-works/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>causal signals</u></a>: proof tied directly to the root cause of a problem.&nbsp;<br>&nbsp;<br>Instead of asking engineers to manually dig through a warehouse of logs, the right approach is to automatically pull the relevant 0.1% of logs that explain the incident. 
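</p><p>That scoping step is mechanically simple once logs carry context. A minimal sketch in plain Python (the records and their values are invented; the field names echo the OTel log data model):</p>

```python
from datetime import datetime, timezone

def ts(s):
    """Parse an ISO-8601 timestamp as UTC (helper for the invented records below)."""
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc)

# Invented, already-enriched records shaped like the OTel log data model.
logs = [
    {"service.name": "checkout-service", "time": ts("2025-02-10T14:05:12"),
     "body": "Database connection pool exhausted"},
    {"service.name": "checkout-service", "time": ts("2025-02-10T11:00:00"),
     "body": "startup complete"},
    {"service.name": "cart-service", "time": ts("2025-02-10T14:05:30"),
     "body": "upstream call slow"},
]

def scoped(records, service, start, end):
    """Keep only records from the implicated service inside the incident window."""
    return [r for r in records
            if r["service.name"] == service and start <= r["time"] <= end]

relevant = scoped(logs, "checkout-service",
                  ts("2025-02-10T14:03:00"), ts("2025-02-10T14:08:00"))
# relevant now holds only the record that actually explains the incident
```

<p>Everything else stays in cheap storage; only the scoped slice gets pulled into the investigation.</p><p>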
Logs don’t lose their role; they gain clarity by being connected to the surrounding signals.&nbsp;</p><h2 id="tutorial-implementing-contextual-logs-in-otel"><strong>Tutorial: Implementing Contextual Logs in OTel&nbsp;</strong></h2><p>Here’s a practical outline for how to set this up today using OTel:&nbsp;</p><h3 id="step-1-%E2%80%93-ensure-rich-context-in-logs"><strong>Step 1 – Ensure Rich Context in Logs&nbsp;</strong></h3><p>Start with a simple log statement, for example:&nbsp;</p><p><code>log.warn("Database connection pool exhausted – queries may fail");</code></p><p>On its own, this message is useful but limited. In a large distributed system, you need more context to understand which service emitted it, in what environment, and how it relates to the rest of a trace.&nbsp;</p><p>That’s where OTel comes in. When the <a href="https://opentelemetry.io/docs/languages/go/getting-started/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>OTel SDK</u></a> is properly configured, it automatically enriches log records with resource attributes (service name, version, environment, region, etc.) and correlation identifiers (trace_id, span_id).&nbsp;</p><p>For example, the same log record might look like this once enriched:&nbsp;</p><p><code>resource:&nbsp;<br>&nbsp; attributes:&nbsp;<br>&nbsp;&nbsp;&nbsp; service.name: "checkout-service"&nbsp;<br>&nbsp;&nbsp;&nbsp; service.version: "1.2.3"&nbsp;<br>&nbsp;&nbsp;&nbsp; deployment.environment: "production"&nbsp;<br>&nbsp;&nbsp;&nbsp; cloud.region: "us-east-1"&nbsp;<br>trace_id: "91be8f92d45e471ea0bf1c25be8e3f1c"&nbsp;<br>span_id: "2d4b1c73c9c3e7f0"&nbsp;<br>severity_text: "WARN"&nbsp;<br>body: "Database connection pool exhausted – queries may fail"</code></p><p>This enrichment happens automatically if you’ve instrumented with OTel, which means developers don’t have to manually add metadata. 
They just log as usual, and OTel ensures those logs are correlatable with traces and metrics.&nbsp;</p><h3 id="step-2-%E2%80%93-collect-and-route-logs-through-the-otel-collector"><strong>Step 2 – Collect and Route Logs Through the OTel Collector&nbsp;</strong></h3><p>Define a dedicated logs pipeline in the Collector configuration so that application and file logs are received, enriched, and filtered on a single path before export:&nbsp;</p><p><code>service:&nbsp;<br>&nbsp; pipelines:&nbsp;<br>&nbsp;&nbsp;&nbsp; logs:&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; receivers: [otlp, filelog]&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; processors: [attributes, filter]&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exporters: [otlphttp]</code></p><h3 id="step-3-%E2%80%93-use-metricstraces-to-narrow-the-scope"><strong>Step 3 – Use Metrics/Traces to Narrow the Scope</strong>&nbsp;<br></h3><p>In your analysis tool, start with:&nbsp;</p><ul><li>Error rate, latency, or saturation metrics to find affected services.</li><li>Trace waterfalls to spot where requests slow down or fail.&nbsp;</li></ul><h3 id="step-4-%E2%80%93-pull-only-relevant-logs"><strong>Step 4 – Pull Only Relevant Logs&nbsp;</strong></h3><p>Then query only the logs scoped to the suspect service and the incident window, for example:&nbsp;</p><p><code>service.name = "checkout-service"&nbsp;<br>AND timestamp &gt;= "2025-02-10T14:03:00Z"&nbsp;<br>AND timestamp &lt;= "2025-02-10T14:08:00Z"</code></p><p>If using the OTel Collector with filtering, you can even forward only scoped logs to your log store during incidents.&nbsp;</p><h3 id="step-5-%E2%80%93-confirm-and-remediate"><strong>Step 5 – Confirm and Remediate</strong>&nbsp;<br></h3><p>Use the filtered logs to confirm the root cause hypothesis, capture the evidence you need for a fix, and move directly to remediation.&nbsp;</p><h2 id="how-causely-operationalizes-this-workflow"><strong>How Causely Operationalizes This Workflow&nbsp;</strong></h2><p>The tutorial above shows one way to approximate a more efficient log workflow by narrowing scope before querying logs. 
<a href="https://www.causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causely builds on this approach</u></a> by treating logs as integral to causal analysis — automatically surfacing the ones tied to the root cause.&nbsp;</p><p>Here’s how we do it automatically:&nbsp;</p><ul><li><a href="https://www.causely.ai/blog/why-there-needs-to-be-a-paradigm-shift-in-observability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><strong><u>Top-down causal reasoning</u></strong></a> – Causely continuously builds a model of your environment’s dependencies, behaviors, and known failure patterns.&nbsp;&nbsp;</li><li><strong>Automatic root cause inference with contextual logs</strong> – Metrics, traces, logs, and topology context are analyzed together in real time to pinpoint the single most likely root cause. Relevant error logs are automatically surfaced in that context — tied to the specific service, time window, and failure mode.&nbsp;&nbsp;</li><li><strong>Remediation specificity</strong> – Contextual logs and causal context are routed to an LLM, producing precise and actionable remediation paths.&nbsp;</li><li><strong>Bring your own logs</strong> – Whether shipped via OTel or another pipeline, Causely ingests your logs and automatically pulls the right ~0.1% on demand. No pipeline reconfiguration or log warehouse required.&nbsp;</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/08/data-src-image-9c166d66-8b25-4780-86b4-bce0be54760d.png" class="kg-image" alt="Causely root cause view: the external payment service identified as the root cause, with the correlated log evidence surfaced alongside it" loading="lazy" width="864" height="609" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/08/data-src-image-9c166d66-8b25-4780-86b4-bce0be54760d.png 600w, https://causely-blog.ghost.io/content/images/2025/08/data-src-image-9c166d66-8b25-4780-86b4-bce0be54760d.png 864w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example: Causely identified the external payment service as the root cause and automatically surfaced the relevant log evidence — repeated invalid token failures.</em></i></figcaption></figure><p><strong>The result: </strong>Logs remain first-class citizens. You see them in the context of metrics and traces, as proof and evidence of the root cause and remediation path — not a warehouse to sift through, and not a second-class afterthought.&nbsp;</p><h2 id="wrapping-up"><strong>Wrapping Up&nbsp;</strong></h2><p>The OTel community has done the hard work of making logs consistent, enriched, and correlated with other telemetry. The next step is using them more effectively: elevating logs from raw records to causal signals. That shift makes troubleshooting faster, cheaper, and more reliable.&nbsp;<br>&nbsp;<br>If you’re following OTel logging best practices, you’ve laid the foundation. With Causely, you can take it to the next level: automatically inferring the root cause in real time, surfacing logs in context, and generating a clear remediation path. All without workflow changes, pipeline reconfiguration, or trade-offs between log richness and operational efficiency.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Why Standalone Docker Matters: 4 Patterns Every Scalable Architecture Should Consider]]></title>
      <link>https://causely.ai/blog/standalone-docker-4-architecture-patterns</link>
      <guid>https://causely.ai/blog/standalone-docker-4-architecture-patterns</guid>
      <pubDate>Fri, 15 Aug 2025 13:46:01 GMT</pubDate>
      <description><![CDATA[This post explores four architecture patterns where standalone Docker is not only justified but recommended.]]></description>
      <author>Ben Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/08/markus-winkler-MJ-K_aFkjP0-unsplash.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>In modern cloud native environments, Kubernetes is often seen as the default. But sophisticated engineering teams know that distributed architectures are rarely that simple. In many high-performance production environments, <a href="https://docs.docker.com/engine/network/tutorials/standalone/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>standalone Docker</u></a> containers play a strategic role—complementing orchestrated services, not replacing them.&nbsp;</p><p>This post explores four architecture patterns where standalone Docker is not only justified but recommended. We’ll dive into the technical reasons behind each pattern, explore the operational tradeoffs, and show how Causely reduces the complexity of managing performance across these hybrid systems.&nbsp;&nbsp;</p><h2 id="architecture-pattern-1-edge-performance%E2%80%91critical-standalone-docker-services">Architecture Pattern #1: Edge &amp; Performance‑Critical Standalone Docker Services&nbsp;</h2><h3 id="use-case"><strong>Use Case: </strong></h3><p>Real-time inference, inline data transformation, high-throughput gateways, protocol termination (e.g., <a href="https://grpc.io/docs/what-is-grpc/core-concepts/?ref=causely-blog.ghost.io" rel="noreferrer">gRPC</a> or <a href="https://youtu.be/07n-fyqnAa0?si=oGaCaD826m6ZcU__&ref=causely-blog.ghost.io" rel="noreferrer">TLS offloading</a>)&nbsp;</p><h3 id="why-docker-makes-sense"><strong>Why Docker Makes Sense:</strong>&nbsp;</h3><ul><li>These services often sit at the ingress point of traffic or at the edge of the network, where ultra-low latency and deterministic behavior are required.&nbsp;</li><li>Standalone Docker provides tight control over <a href="https://enterprise-support.nvidia.com/s/article/what-is-cpu-affinity-x?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>CPU affinity</u></a>, memory isolation, and scheduling—where Kubernetes can introduce variability through abstraction.&nbsp;</li><li>With no 
orchestrator overhead and direct access to the host network or hardware, teams can squeeze out performance optimizations critical to meeting service level expectations.&nbsp;&nbsp;</li><li><strong>Example:</strong> A fraud scoring engine on a GPU-backed instance running outside the cluster for sub-10ms response time.&nbsp;</li></ul><h3 id="tradeoffs"><strong>Tradeoffs:</strong>&nbsp;</h3><ul><li>Requires homegrown automation for lifecycle management.&nbsp;</li><li>Correlating failures to upstream/downstream systems can be more difficult without orchestrated metadata.&nbsp;</li></ul><h3 id="when-to-consider-this"><strong>When to Consider This: </strong></h3><p>When you need deterministic performance, such as in ML inference, high-frequency trading, or streaming telemetry ingestion.&nbsp;</p><h2 id="architecture-pattern-2-specialized-customer%E2%80%91specific-components">Architecture Pattern #2: Specialized, Customer‑Specific Components&nbsp;&nbsp;</h2><p><strong>Use Case: </strong>Multi-tenant SaaS platforms or platforms with dedicated microservices per enterprise customer or vertical&nbsp;</p><h3 id="why-docker-makes-sense-1"><strong>Why Docker Makes Sense:</strong>&nbsp;</h3><ul><li>Standalone containers allow per-customer isolation without polluting the core K8s control plane.&nbsp;</li><li>Services can be versioned, tuned, and deployed independently, which is ideal when customers have differing SLAs, data residency needs, or workload profiles.&nbsp;</li><li>Enables progressive feature rollout, compliance isolation (e.g., GDPR), or regional data processing without touching the shared cluster.&nbsp;</li><li><strong>Example:</strong> Deploying a custom analytics pipeline for an EU customer with stricter data locality rules.&nbsp;</li></ul><h3 id="tradeoffs-1"><strong>Tradeoffs:</strong>&nbsp;</h3><ul><li>More surface area to monitor and secure.&nbsp;</li><li>Requires tooling to manage container sprawl and avoid operational drift.&nbsp;</li></ul><h3 
id="when-to-consider-this-1"><strong>When to Consider This: </strong></h3><p>When serving enterprise customers with tailored requirements or isolating sensitive data flows.&nbsp;&nbsp;</p><h2 id="architecture-pattern-3-asynchronous-messaging-backbone">Architecture Pattern #3: Asynchronous Messaging Backbone&nbsp;</h2><h3 id="use-case-1"><strong>Use Case: </strong></h3><p>Internal event buses, <a href="https://github.com/networknt/kafka-sidecar?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Kafka sidecars</u></a>, queue processors, or background workers handling fan-out/fan-in traffic patterns&nbsp;</p><h3 id="why-docker-makes-sense-2"><strong>Why Docker Makes Sense:</strong>&nbsp;</h3><ul><li>These components often require tight control over retry logic, <a href="https://aws.amazon.com/what-is/dead-letter-queue/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>dead-letter queues</u></a>, and message deduplication strategies.&nbsp;</li><li>Docker allows dedicated tuning of each message processor (e.g., <a href="https://www.causely.ai/blog/tackling-cpu-throttling-in-kubernetes?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>CPU throttling</u></a>, <a href="https://docs.oracle.com/cd/E13222_01/wls/docs81/perform/JVMTuning.html?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>JVM heap tuning</u></a>) without being subject to shared limits imposed by K8s resource quotas.&nbsp;</li><li>Services can scale independently of the producers and consumers managed in K8s.&nbsp;</li><li>Supports use cases like Command Query Responsibility Segregation (<a href="https://www.youtube.com/watch?v=SvjdJoNPcHs&ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>CQRS</u></a>), or event sourcing, where replay behavior and ordering matter.&nbsp;&nbsp;</li><li><strong>Example:</strong> Kafka sidecar processors that transform and route event data before entering the service mesh.&nbsp;</li></ul><h3 
id="tradeoffs-2"><strong>Tradeoffs:</strong>&nbsp;</h3><ul><li>Failures can ripple across multiple services but be misattributed.&nbsp;&nbsp;</li><li>Requires deeper visibility into both the message layer and consuming systems.&nbsp;</li></ul><h3 id="when-to-consider-this-2"><strong>When to Consider This: </strong></h3><p>When messaging is your system’s backbone and needs dedicated, configurable components.&nbsp;&nbsp;</p><h2 id="architecture-pattern-4-cicd-build-test-workloads">Architecture Pattern #4: CI/CD, Build &amp; Test Workloads&nbsp;</h2><h3 id="use-case-2"><strong>Use Case: </strong></h3><p>Ephemeral build/test containers in CI pipelines, sandbox environments for pre-merge validation&nbsp;</p><h3 id="why-docker-makes-sense-3"><strong>Why Docker Makes Sense:</strong>&nbsp;</h3><ul><li>CI systems like <a href="https://github.com/features/actions?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>GitHub Actions</u></a>, <a href="https://circleci.com/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>CircleCI</u></a>, and <a href="https://about.gitlab.com/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>GitLab</u></a> are built around standalone container workloads.&nbsp;</li><li>Docker ensures parity between dev and test environments, reducing false positives from environment drift.&nbsp;</li><li>Local development with <a href="https://docs.docker.com/compose/?ref=causely-blog.ghost.io" rel="noreferrer">Docker Compose</a> can mirror production service graphs with mocking or synthetic traffic.&nbsp;</li><li><strong>Example:</strong> nightly regression tests that spin up mocked API services using Docker Compose, inject synthetic latency, and validate SLA adherence before promoting a build to staging.&nbsp;</li></ul><h3 id="tradeoffs-3"><strong>Tradeoffs:</strong>&nbsp;</h3><ul><li>Pre-prod performance issues may not emerge until you shift into the orchestrated production environment unless your test setup includes sufficient load or 
concurrency.&nbsp;</li></ul><h3 id="when-to-consider-this-3"><strong>When to Consider This:</strong> </h3><p>If you need fast, reliable CI loops, or are troubleshooting performance regressions before services hit production.&nbsp;&nbsp;</p><h2 id="why-performance-management-gets-hard-in-these-hybrid-architectures">Why Performance Management Gets Hard in These Hybrid Architectures&nbsp;</h2><p>In a system composed of Kubernetes clusters, standalone Docker containers, and managed services like MongoDB or Kafka, performance issues don’t respect boundaries.&nbsp;</p><h3 id="symptoms-emerge-far-from-their-root-cause"><strong>Symptoms Emerge Far From Their Root Cause:</strong>&nbsp;&nbsp;</h3><ul><li>A degraded Kafka sidecar causes timeouts in upstream K8s services.&nbsp;&nbsp;</li><li>A CPU-hungry Docker-based customer module introduces latency that looks like a DB issue.&nbsp;</li><li>A slow edge container obscures the real issue in a downstream managed Redis instance.&nbsp;</li></ul><h3 id="these-environments-break-traditional-monitoring-assumptions"><strong>These Environments Break Traditional Monitoring Assumptions:</strong>&nbsp;</h3><ul><li>No single telemetry system captures the full execution path.&nbsp;&nbsp;</li><li>Logs, metrics, and traces must be stitched manually across platforms.&nbsp;&nbsp;</li><li>Minute-level delay in diagnosis equals dollars lost.&nbsp;</li></ul><p>You need to know not just <em>what</em> failed, but <em>why</em>. 
And you need it <em>now</em>.&nbsp;</p><h2 id="ebpf-based-auto-instrumentation-zero-overhead-insight">eBPF-Based Auto-Instrumentation: Zero-Overhead Insight&nbsp;</h2><p>eBPF provides kernel-level insight into what containers are doing, without needing in-app instrumentation, SDKs, or manual agents.&nbsp;</p><p>In standalone Docker environments, <a href="https://docs.causely.ai/telemetry-sources/ebpf/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causely uses eBPF</u></a> to:&nbsp;&nbsp;</p><ul><li>Capture syscall activity, I/O latency, CPU/memory contention, and network issues directly from the host.&nbsp;&nbsp;</li><li>Observe containers even if they’re running outside Kubernetes or disconnected from the service mesh.&nbsp;&nbsp;</li><li>Collect minimal, high-value telemetry without generating a flood of low-signal noise.&nbsp;</li></ul><p>And because eBPF instrumentation is automatic:&nbsp;</p><ul><li>There’s no overhead for developers&nbsp;</li><li>Visibility is always on&nbsp;</li><li>You can trace performance across heterogeneous systems without writing a line of config.&nbsp;</li></ul><p>Causely also enables end-to-end tracing across Kubernetes, Docker, and managed environments—stitching together execution flows in real time and letting you see how a single failure propagates.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/08/data-src-image-20c86440-74f5-4eec-a36f-b8b9e0c3fa45.png" class="kg-image" alt="Causely end-to-end trace view spanning Kubernetes, Docker, and managed services" loading="lazy" width="936" height="709" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/08/data-src-image-20c86440-74f5-4eec-a36f-b8b9e0c3fa45.png 600w, https://causely-blog.ghost.io/content/images/2025/08/data-src-image-20c86440-74f5-4eec-a36f-b8b9e0c3fa45.png 936w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely enables end-to-end tracing across K8s, Docker, and managed environments</span></figcaption></figure><h2 id="causely-makes-standalone-docker-first-class">Causely Makes Standalone Docker First-Class&nbsp;</h2><p>While eBPF gives us the raw signals, Causely’s <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>causal reasoning engine</u></a> makes them useful.&nbsp;</p><ul><li>It applies a growing library of causal models that capture failure patterns across services, queues, and infrastructure.&nbsp;</li><li>It builds a real-time causal graph from observed behavior.&nbsp;</li><li>It reasons probabilistically to pinpoint the root cause—even across multiple failure domains.&nbsp;</li></ul><h3 id="we-treat-docker-containers-as-full-participants-in-the-production-graph"><strong>We Treat Docker Containers as Full Participants in the Production Graph</strong></h3><p>With Causely, you can:&nbsp;&nbsp;</p><ul><li>Automatically discover Docker-hosted services and map their upstream/downstream relationships.&nbsp;&nbsp;</li><li>Trace symptoms across Kubernetes, Docker, and managed services without needing to standardize instrumentation.&nbsp;&nbsp;</li><li>Get clear, prioritized, explainable root causes in real time.&nbsp;</li></ul><p>You don’t have to fear hybrid complexity; you just need a system that understands it.&nbsp;</p><h3 id="ready-to-make-standalone-docker-part-of-your-reliability-strategy-not-a-blind-spot"><strong>Ready to Make Standalone Docker Part of Your Reliability Strategy, Not a Blind 
Spot?</strong>&nbsp;</h3><p>Causely <a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>installs in minutes</u></a>. With our <a href="https://www.causely.ai/blog/demystifying-automatic-instrumentation?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>automated instrumentation</u></a> and causal analysis, you’ll spend less time debugging and more time building.&nbsp;</p><p>Learn more about <a href="https://docs.causely.ai/installation/docker/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>how to add standalone Docker</u></a> to your Causely deployment.&nbsp;&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Solve the Root Cause of Message Queue Lag]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-solve-the-root-cause-of-message-queue-lag</link>
      <guid>https://causely.ai/blog/causely-feature-demo-solve-the-root-cause-of-message-queue-lag</guid>
      <pubDate>Tue, 12 Aug 2025 20:07:52 GMT</pubDate>
      <description><![CDATA[Watch the video to see how Causely turns “Lag High” chaos into confident, informed action in seconds.]]></description>
      <author>Anson McCook</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/08/Screenshot-2025-08-12-at-3.59.37---PM.png" type="image/jpeg" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/xah1-eSqO4A?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely Message Queue Demo">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>When a message queue starts to back up, every second counts. One slow consumer can stall a Kafka topic, delay downstream services, and turn healthy SLOs into a wall of red. </p><p>The hard part? Your observability tools only show the symptoms: rising lag, service timeouts, and growing error rates. With dozens of potential culprits — from a code change on one service to a misconfigured broker or even a hidden infrastructure issue — finding the exact cause can feel like guesswork.</p><p><a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a> takes the uncertainty out of the equation. By combining deep domain knowledge with your <a href="https://www.causely.ai/blog/navigating-kafka-and-the-challenges-of-asynchronous-communication?ref=causely-blog.ghost.io" rel="noreferrer">Kafka topology</a> and real-time telemetry, Causely identifies the precise reason behind the lag — whether it’s inefficient garbage collection, resource contention, or broker congestion — and shows you exactly how it’s impacting your services. No endless alert-chasing. No service-by-service elimination. Just clear, evidence-backed answers so your team can act fast and keep the backlog from turning into a crisis. </p><p>Watch the video to see how Causely turns “Lag High” chaos into confident, informed action in seconds.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Demystifying Automatic Instrumentation: How the Magic Actually Works]]></title>
      <link>https://causely.ai/blog/demystifying-automatic-instrumentation</link>
      <guid>https://causely.ai/blog/demystifying-automatic-instrumentation</guid>
      <pubDate>Thu, 07 Aug 2025 07:31:00 GMT</pubDate>
      <description><![CDATA[Most developers use automatic instrumentation without knowing how it actually works. This post breaks down the key techniques behind it—not to build your own, but to understand what’s really happening when things "just work."]]></description>
      <author>Severin Neumann</author>
      <enclosure url="https://images.unsplash.com/photo-1561551602-0cdfdf7b1491?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDg3fHxtYWdpY3xlbnwwfHx8fDE3NTQ1NzEyODR8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Despite the rise of <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io">OpenTelemetry</a> and <a href="https://ebpf.io/?ref=causely-blog.ghost.io">eBPF</a>, most developers don't know what automatic instrumentation actually does under the hood. This post breaks it down—not to suggest you build your own, but to help you understand what's going on when your tools magically "just work."</p><p>We'll explore five key techniques that power automatic instrumentation: monkey patching, bytecode instrumentation, compile-time instrumentation, eBPF, and language runtime APIs. Each technique leverages the unique characteristics of different programming languages and runtime environments to add observability without code changes.</p><h2 id="what-is-automatic-instrumentation">What is Automatic Instrumentation?</h2><p>According to <a href="https://opentelemetry.io/docs/concepts/glossary?ref=causely-blog.ghost.io">the OpenTelemetry glossary</a>, automatic instrumentation refers to “<em>telemetry collection methods that do not require the end-user to modify application’s source code. Methods vary by programming language, and examples include bytecode injection or monkey patching.</em>”</p><p>It’s worth noting that “automatic instrumentation” is often used to describe two related but distinct concepts. In the definition above and in this blog post, it refers to the specific techniques (like bytecode injection or monkey patching) that can be used to enable observability without code changes. However, when people use "automatic instrumentation" in conversations, they often mean complete zero-code solutions like the <a href="https://opentelemetry.io/docs/zero-code/java/agent/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry Java agent</a>.</p><p>The distinction is important: there's actually a three-layer hierarchy here. 
At the bottom are the&nbsp;<strong>automatic instrumentation techniques</strong>&nbsp;(bytecode injection, monkey patching, etc.) that we explore in this blog post. These techniques are used by&nbsp;<a href="https://opentelemetry.io/docs/concepts/glossary/?ref=causely-blog.ghost.io#instrumentation-library">instrumentation libraries</a>&nbsp;that target specific frameworks, for example, libraries that instrument&nbsp;<a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/spring?ref=causely-blog.ghost.io">Spring and Spring Boot</a>,&nbsp;<a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-express?ref=causely-blog.ghost.io">Express.js</a>,&nbsp;<a href="https://packagist.org/packages/open-telemetry/opentelemetry-auto-laravel?ref=causely-blog.ghost.io">Laravel</a>, or other popular frameworks. Finally, complete solutions like the&nbsp;OpenTelemetry Java agent&nbsp;bundle these instrumentation libraries together and add all the boilerplate configuration for exporters, samplers, and other building blocks.</p><p>There are ongoing debates in the observability community about the right terminology, and this blog post won’t attempt to resolve those discussions.</p><p>Also note that what appears "automatic" to one person might be "manual" to another: if a library developer integrates the OpenTelemetry API into their code, the users of that library will get traces, logs, and metrics from that library “automatically” when they add the OpenTelemetry SDK to their application.</p><h2 id="want-to-try-the-techniques-yourself">Want to try the techniques yourself?</h2><p>This blog post contains small code snippets to illustrate the concepts. 
Full working examples can be found in a lab repository at <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io">https://github.com/causely-oss/automatic-instrumentation-lab</a>, where you can try them out yourself.</p><p>Before we explore these techniques, it’s important to note that you should not build your own automatic instrumentation from scratch, especially not using this blog post as a blueprint. The examples here are simplified for educational purposes and skip many complex details that you would encounter in real-world implementations. There are established tools and mechanisms available that handle much of the complexity and edge cases you would face when building instrumentation from the ground up. If you’re interested in diving deeper into this field, the best approach is to <a href="https://opentelemetry.io/community/?ref=causely-blog.ghost.io#develop-and-contribute" rel="noreferrer">contribute to existing projects like OpenTelemetry</a>, where you can learn from experienced maintainers and work with production-ready code.</p><h2 id="automatic-instrumentation-techniques">Automatic Instrumentation Techniques</h2><p>Now let’s explore how these techniques work under the hood.</p><h3 id="monkey-patching-runtime-function-replacement">Monkey Patching: Runtime Function Replacement</h3><p>Monkey patching is perhaps the most straightforward automatic instrumentation technique, commonly used in dynamic languages like JavaScript, Python, and Ruby. The concept is simple: at runtime, we replace existing functions with instrumented versions that inject telemetry before and after calling the original function.</p><p>Here's how this works in Node.js:</p><pre><code class="language-javascript">const originalFunction = exports.functionName;

function instrumentedFunction(...args) {
  const startTime = process.hrtime.bigint();
  const result = originalFunction.apply(this, args);
  const duration = process.hrtime.bigint() - startTime;
  console.log(`functionName(${args[0]}) took ${duration} nanoseconds`);
  return result;
}

exports.functionName = instrumentedFunction;
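
// After the swap above, callers are unchanged: they still invoke
// exports.functionName as before, but each call now logs its duration
// before returning the original result.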
</code></pre><p>The <a href="https://www.npmjs.com/package/require-in-the-middle?ref=causely-blog.ghost.io" rel="noreferrer">require-in-the-middle</a> library allows us to perform this replacement at module load time, intercepting the module loading process to modify the exported functions before they’re used by the application:</p><pre><code class="language-javascript">const hook = require("require-in-the-middle");
hook(["moduleName"], (exports, name, basedir) =&gt; {
  const originalFunction = exports.functionName;
  // ...
  exports.functionName = instrumentedFunction;
  return exports;
});</code></pre><p>However, monkey patching has limitations. It can't instrument code that's already been compiled to machine code, and it may not work with functions that are called before the instrumentation is loaded. Additionally, the overhead of function wrapping can be significant for performance-critical applications. Monkey patching is also brittle when the implementation of the instrumented code changes significantly, as the instrumentation code needs to be updated to match the new interface.</p><p>To try this out yourself, take a look at the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io#monkey-patching-nodejs">Node.js example</a> from the lab.</p><p>If you’d like to see actively used implementations of monkey patching, you can take a look into the instrumentation libraries provided by OpenTelemetry for <a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/packages?ref=causely-blog.ghost.io">JavaScript</a> or <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation?ref=causely-blog.ghost.io">Python</a>.</p><h3 id="bytecode-instrumentation-modifying-the-virtual-machine">Bytecode Instrumentation: Modifying the Virtual Machine</h3><p>For languages that run on virtual machines, bytecode instrumentation offers a powerful approach. This technique works by modifying the compiled bytecode as it’s loaded by the virtual machine, allowing us to inject code at the instruction level.</p><p>Java’s Instrumentation API provides the foundation for this approach. When a Java agent is specified with the <code>-javaagent</code> flag, the JVM calls the agent’s premain method before the main application starts. This gives us the opportunity to register a class transformer that can modify any class as it’s loaded.</p><pre><code class="language-java">public static void premain(String args, Instrumentation inst) {
&nbsp;&nbsp;&nbsp; new AgentBuilder.Default()
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .type(ElementMatchers.nameStartsWith("com.example.TargetApp"))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .transform((builder, typeDescription, classLoader, module, protectionDomain) -&gt;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; builder.method(ElementMatchers.named("targetMethod"))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .intercept(MethodDelegation.to(MethodInterceptor.class))
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ).installOn(inst);
}</code></pre><p>The interceptor then wraps the original method call with timing logic:</p><pre><code class="language-java">@RuntimeType
public static Object intercept(@Origin String methodName,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @AllArguments Object[] args,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @SuperCall Callable&lt;?&gt; callable) throws Exception {
&nbsp;&nbsp;&nbsp; long startTime = System.nanoTime();
&nbsp;&nbsp;&nbsp; Object result = callable.call();
&nbsp;&nbsp;&nbsp; long duration = System.nanoTime() - startTime;
&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp; System.out.printf("targetMethod(%s) took %d ns%n", args[0], duration);
&nbsp;&nbsp;&nbsp; return result;
}</code></pre><p>Bytecode instrumentation is particularly powerful because it works at the JVM level, making it language-agnostic within the JVM ecosystem. It can instrument Java, Kotlin, Scala, and other JVM languages without modification. </p><p>The main advantage of bytecode instrumentation is its comprehensive coverage—it can instrument any code that runs on the JVM, including code loaded dynamically or from external sources. However, it comes with some overhead due to the bytecode transformation process.</p><p>In real implementations, <a href="https://bytebuddy.net/?ref=causely-blog.ghost.io#/">ByteBuddy</a> is the go-to library for bytecode instrumentation in Java, providing a powerful and flexible API for creating Java agents. It abstracts away much of the complexity of bytecode manipulation and provides a clean, type-safe way to define instrumentation rules.</p><p>To try this out yourself, take a look at the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io#byte-code-instrumentation-java">Java example</a> from the lab.</p><p>If you’d like to see actively used implementations of bytecode instrumentation, you can take a look into the instrumentation libraries provided by OpenTelemetry for <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation?ref=causely-blog.ghost.io">Java</a> or <a href="https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src?ref=causely-blog.ghost.io">.NET</a>.</p><h3 id="compile-time-instrumentation-baking-observability-into-the-binary">Compile-Time Instrumentation: Baking Observability into the Binary</h3><p>For statically compiled languages like Go, compile-time instrumentation offers a different approach. 
Instead of modifying code at runtime, we transform the source code during the build process using <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree?ref=causely-blog.ghost.io">Abstract Syntax Tree</a> (AST) manipulation.</p><p>The process involves parsing the source code into an AST, modifying the tree to add instrumentation code, and then generating the modified source code before compilation. This approach ensures that the instrumentation is baked into the final binary, providing zero runtime overhead for the instrumentation mechanism itself.</p><pre><code class="language-go">func instrumentFunction() {
&nbsp;&nbsp;&nbsp; fset := token.NewFileSet()
&nbsp;&nbsp;&nbsp; file, err := parser.ParseFile(fset, "app/target.go", nil, parser.ParseComments)
&nbsp;&nbsp;&nbsp; if err != nil {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return // without a parsed file there is nothing to instrument
&nbsp;&nbsp;&nbsp; }
&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp; // Find the target function and add timing logic
&nbsp;&nbsp;&nbsp; ast.Inspect(file, func(n ast.Node) bool {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if fn, ok := n.(*ast.FuncDecl); ok &amp;&amp; fn.Name.Name == "targetFunction" {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Add defer statement for timing
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; deferStmt := &amp;ast.DeferStmt{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Call: &amp;ast.CallExpr{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Fun: &amp;ast.CallExpr{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Fun: &amp;ast.Ident{Name: "trace_targetFunction"},
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; },
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; },
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fn.Body.List = append([]ast.Stmt{deferStmt}, fn.Body.List...)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return true
&nbsp;&nbsp;&nbsp; })
&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp; // Write the modified source back to the file
&nbsp;&nbsp;&nbsp; f, _ := os.Create("app/target.go")
&nbsp;&nbsp;&nbsp; defer f.Close()
&nbsp;&nbsp;&nbsp; printer.Fprint(f, fset, file)
}</code></pre><p>Compile-time instrumentation has several advantages. It provides zero runtime overhead for the instrumentation mechanism, and the resulting binary contains all the code it needs. This approach works well with compiled languages and can be integrated into existing build processes.</p><p>That said, it does come with trade-offs. It requires access to the source code and build system, which makes it impractical for instrumenting third-party applications or libraries. It also demands more sophisticated tooling to manipulate the AST correctly and consistently, adding complexity to the build pipeline and potentially requiring changes to your CI/CD workflows.</p><p>To try this out yourself, take a look at the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io#compile-time-instrumentation-go">Go compile-time example</a> from the lab.</p><p>If you’d like to see actively used implementations of compile-time instrumentation, you can take a look into the <a href="https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation?ref=causely-blog.ghost.io">OpenTelemetry Go Compile Instrumentation</a> project.</p><h3 id="ebpf-instrumentation-kernel-level-observability">eBPF Instrumentation: Kernel-Level Observability</h3><p><a href="https://ebpf.io/?ref=causely-blog.ghost.io">eBPF</a> (extended Berkeley Packet Filter) represents a fundamentally different approach to automatic instrumentation. Instead of modifying application code or bytecode, eBPF works at the kernel level, attaching probes to function entry and exit points in the running application.</p><p>eBPF programs are small, safe programs that run in the kernel and can observe system calls, function calls, and other events. For automatic instrumentation, we use uprobes (user-space probes) to attach to specific functions in our application.</p><pre><code class="language-shell">#!/usr/bin/env bpftrace

uprobe:/app/fibonacci:main.fibonacci
{
&nbsp;&nbsp;&nbsp; @start[tid] = nsecs;
}

uretprobe:/app/fibonacci:main.fibonacci /@start[tid]/
{
&nbsp;&nbsp;&nbsp; $delta = nsecs - @start[tid];
&nbsp;&nbsp;&nbsp; printf("fibonacci() duration: %d ns\n", $delta);
&nbsp;&nbsp;&nbsp; delete(@start[tid]);
}</code></pre><p>This <a href="https://github.com/bpftrace/bpftrace?ref=causely-blog.ghost.io">bpftrace</a> script attaches a probe to the function in our application. When the function is called, it records the start time. When the function returns, it calculates the duration and prints the result.</p><p>eBPF instrumentation is language-agnostic and works with any language running on Linux. It provides deep system-level observability without requiring any modifications to the application code or build process. The overhead is minimal since the instrumentation runs in the kernel.</p><p>However, eBPF instrumentation has some limitations. It requires Linux and root privileges to run, making it less suitable for containerized environments or applications that can’t run with elevated permissions.</p><p>For real-world use cases, bpftrace is just one of many eBPF tools available. While it’s excellent for learning and prototyping, production environments typically use more sophisticated frameworks like <a href="https://github.com/iovisor/bcc?ref=causely-blog.ghost.io">BCC</a> (BPF Compiler Collection) or <a href="https://github.com/libbpf/libbpf?ref=causely-blog.ghost.io">libbpf</a>, which provide better performance, more features, and stronger safety guarantees.</p><p>To try this out yourself, take a look at the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io#ebpf-based-instrumentation-go">Go eBPF example</a> from the lab.</p><p>If you’d like to see actively used implementations of eBPF instrumentation, you can take a look into the <a href="https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation?ref=causely-blog.ghost.io">OpenTelemetry eBPF Instrumentation</a> project (short “OBI”), which is the outcome of <a href="https://github.com/open-telemetry/community/issues/2406?ref=causely-blog.ghost.io">the donation of Beyla by Grafana</a>.</p><h3 
id="language-runtime-apis-native-instrumentation-support">Language Runtime APIs: Native Instrumentation Support</h3><p>Some languages provide built-in APIs for instrumentation, offering a more integrated approach. <a href="https://github.com/php/php-src/blob/PHP-8.0/Zend/zend_observer.h?ref=causely-blog.ghost.io">PHP’s Observer API</a>, introduced in PHP 8.0, is a prime example of this approach.</p><p>The Observer API allows C extensions to hook into the PHP engine’s execution flow at the Zend engine level. This provides deep visibility into PHP application behavior without requiring code modifications.</p><pre><code class="language-cpp">static void observer_begin(zend_execute_data *execute_data) {
&nbsp;&nbsp;&nbsp; if (execute_data-&gt;func &amp;&amp; execute_data-&gt;func-&gt;common.function_name) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; const char *function_name = ZSTR_VAL(execute_data-&gt;func-&gt;common.function_name);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (strcmp(function_name, "fib") == 0) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; start_time = clock();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
&nbsp;&nbsp;&nbsp; }
}

static void observer_end(zend_execute_data *execute_data, zval *retval) {
&nbsp;&nbsp;&nbsp; if (execute_data-&gt;func &amp;&amp; execute_data-&gt;func-&gt;common.function_name) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; const char *function_name = ZSTR_VAL(execute_data-&gt;func-&gt;common.function_name);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (strcmp(function_name, "fib") == 0) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; clock_t end_time = clock();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; double duration = (double)(end_time - start_time) / CLOCKS_PER_SEC * 1000;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; php_printf("Function %s() took %.2f ms\n", function_name, duration);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
&nbsp;&nbsp;&nbsp; }
}</code></pre><p>The Observer API provides a clean, supported way to add instrumentation to PHP applications. It operates at the language runtime level, similar to how other languages implement their instrumentation APIs. This approach is efficient and well-integrated with the language ecosystem.</p><p>However, it requires writing C extensions, which adds complexity and makes it less accessible to developers who aren’t familiar with C or PHP’s internal APIs. It’s also specific to PHP, so the knowledge doesn’t transfer to other languages.</p><p>To try this out yourself, take a look at the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io#php-observer-api-php">PHP Observer API example</a> from the lab.</p><p>If you’d like to see actively used implementations of runtime API instrumentation, you can take a look into the instrumentation libraries provided by OpenTelemetry for <a href="https://github.com/open-telemetry/opentelemetry-php-contrib/tree/main/src/Instrumentation?ref=causely-blog.ghost.io">PHP</a>.</p><h2 id="a-note-on-context-propagation">A Note on Context Propagation</h2><p>While we've covered the core techniques of automatic instrumentation, there's an important aspect we haven't discussed: <a href="https://opentelemetry.io/docs/concepts/context-propagation/?ref=causely-blog.ghost.io" rel="noreferrer">context propagation</a>. This involves injecting trace context information (trace IDs, span IDs) into HTTP headers, message metadata, and other communication channels to enable distributed tracing across service boundaries.</p><p>Unlike the purely observational techniques we've explored, context propagation actively modifies your application's behavior by altering data transmitted across service boundaries. 
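</p><p>As a rough illustration of the injection side, here is a minimal, hand-rolled sketch of the W3C Trace Context <code>traceparent</code> header. This is not the OpenTelemetry propagator API, and the trace and span IDs are made-up placeholder values.</p><pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
)

// injectTraceContext stamps a W3C Trace Context "traceparent" header onto an
// outgoing request so the downstream service can attach its spans to the same
// trace. The header format is version-traceid-parentid-flags.
func injectTraceContext(req *http.Request, traceID, spanID string) {
    req.Header.Set("traceparent", fmt.Sprintf("00-%s-%s-01", traceID, spanID))
}

func main() {
    req, _ := http.NewRequest("GET", "http://downstream.example/orders", nil)
    // Placeholder IDs; a real propagator generates and manages these.
    injectTraceContext(req, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
    fmt.Println(req.Header.Get("traceparent"))
}</code></pre><p>An instrumentation library has to perform this kind of injection, and the matching extraction on the receiving side, inside every client and server integration it supports.</p><p>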
This introduces additional complexity that deserves its own dedicated blog post.</p><h2 id="automatic-instrumentation-as-telemetry-quality-assurance">Automatic Instrumentation as Telemetry Quality Assurance</h2><p>At <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a>, we rely on automatic instrumentation to ensure consistent, high-quality telemetry, even when customers haven't instrumented their code manually. Our agent comes with automatic instrumentation powered by Grafana Beyla (now also known as OpenTelemetry eBPF Instrumentation), leveraging <a href="https://docs.causely.ai/telemetry-sources/ebpf/?ref=causely-blog.ghost.io" rel="noreferrer">eBPF</a> and <a href="https://docs.causely.ai/telemetry-sources/opentelemetry/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry</a> out-of-the-box to ensure fast time-to-value and dependable insights from day one.</p><h2 id="conclusion">Conclusion</h2><p>We've explored the core techniques behind automatic instrumentation, from monkey patching to bytecode instrumentation to eBPF probes. Each approach leverages the unique characteristics of different programming languages and runtime environments.</p><p>These techniques power production observability tools like OpenTelemetry, enabling developers to quickly add telemetry without modifying source code. 
The most successful observability strategies combine automatic and manual instrumentation: automatic instrumentation provides broad coverage for common patterns, while manual instrumentation captures business-specific metrics.</p><p>If you'd like to try out these techniques yourself, you can use the <a href="https://github.com/causely-oss/automatic-instrumentation-lab?ref=causely-blog.ghost.io" rel="noopener">Automatic Instrumentation Lab</a>.</p><p>If you're interested in contributing to these technologies, consider getting involved with <a href="https://github.com/open-telemetry/community/?ref=causely-blog.ghost.io#special-interest-groups">OpenTelemetry's various Special Interest Groups</a> (SIGs).</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Pinpointing the Code Change Causing Performance Issues]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-pinpointing-the-code-change-causing-performance-issues</link>
      <guid>https://causely.ai/blog/causely-feature-demo-pinpointing-the-code-change-causing-performance-issues</guid>
      <pubDate>Fri, 25 Jul 2025 15:43:42 GMT</pubDate>
      <description><![CDATA[In this short video, we show how Causely pinpoints the exact code change that triggered cascading performance issues — without requiring you to sift through logs or build custom dashboards.]]></description>
      <dc:creator>Anson McCook</dc:creator>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/07/image--10-.png" type="image/png" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/hvJDWHkxieg?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely Code Change Demo">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>When dozens of teams are deploying code across services and environments, even a small change can quietly spiral into system-wide latency, retries, and customer pain. </p><p>And when things go sideways, your team is left scrambling:&nbsp;</p><blockquote><em>Was it the payments API? <br>A config change in the cluster? <br>A new version of Postgres?</em>&nbsp;</blockquote><p>In most tools, you’re chasing alerts across dashboards trying to guess your way to the root cause. That’s why we built Causely — to surface the answer before the war room even begins.</p><p>In this short video, we show how Causely pinpoints the exact code change that triggered cascading performance issues — without requiring you to sift through logs or build custom dashboards. Our <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Engine</a> continuously maps relationships between telemetry, deploys, and system behavior to deliver real-time root cause insight. So when your team asks “what changed?”, Causely already knows — and shows you what to fix, what was impacted, and who needs to take action.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Signal in the Storm: Why Chasing More Data Misses the Point]]></title>
      <link>https://causely.ai/blog/the-signal-in-the-storm</link>
      <guid>https://causely.ai/blog/the-signal-in-the-storm</guid>
      <pubDate>Tue, 22 Jul 2025 18:50:24 GMT</pubDate>
      <description><![CDATA[More telemetry doesn’t guarantee more understanding. In many cases, it gives you the illusion of control while silently eroding your ability to reason about the system.]]></description>
      <dc:creator>Endre Sara</dc:creator>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/07/Screenshot-2025-07-22-at-2.49.11---PM.png" type="image/png" />
      <content:encoded><![CDATA[<p>As OpenTelemetry adoption has exploded, so has the volume of telemetry data moving through modern observability pipelines. But despite collecting more logs, metrics, and traces than ever before, teams are still struggling to answer the most basic questions during incidents: <em>What broke? Where? And why?</em>&nbsp;</p><p>In my recent talk at OpenTelemetry Community Day, <em>“</em><a href="https://www.youtube.com/watch?v=Gy38gx-7phA&ref=causely-blog.ghost.io" rel="noreferrer"><em>T<u>he Signal in the Storm: Practical Strategies for Managing Telemetry Overload</u></em></a><em>,”</em> I laid out a different path forward, one focused not on volume, but on meaning.&nbsp;</p><p>More telemetry doesn’t guarantee more understanding. In many cases, it gives you the illusion of control while silently eroding your ability to reason about the system. That illusion becomes expensive, especially when telemetry pipelines are optimized for ingestion, not insight.&nbsp;</p><h2 id="observability-needs-a-better-model">Observability Needs a Better Model&nbsp;</h2><p>Traditional observability relies on emitting and aggregating raw signals (spans, logs, and metrics), then querying across that pile post-hoc. That model assumes the data will be useful <em>after</em> something goes wrong. But today’s distributed, dynamic, multi-tenant AI-driven systems don’t give you that luxury. The <a href="https://www.causely.ai/blog/be-smarter-about-observability-data?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>cost of collecting all raw signals</u></a> without semantics and without context is too high.&nbsp;</p><p>We need a shift: from streams of telemetry to structured, semantic representations of how systems behave. That starts by modeling the actual components of the system (entities) and the relationships between them. Not as metadata bolted onto spans, but as first-class signals. 
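</p><p>To make that concrete, here is a deliberately simplified sketch of what it means to report topology as a first-class signal; the types and names are illustrative only, not the Entities SIG data model.</p><pre><code class="language-go">package main

import "fmt"

// Entity and Relationship describe the system's components and how they are
// connected, reported directly rather than inferred later from attributes
// scattered across individual spans.
type Entity struct {
    Type string // e.g. "service", "pod"
    ID   string
}

type Relationship struct {
    From, To Entity
    Kind     string // e.g. "runs_on", "calls"
}

func main() {
    checkout := Entity{Type: "service", ID: "checkout"}
    pod := Entity{Type: "pod", ID: "checkout-7d9f"}
    rel := Relationship{From: checkout, To: pod, Kind: "runs_on"}
    fmt.Printf("%s/%s %s %s/%s\n", rel.From.Type, rel.From.ID, rel.Kind, rel.To.Type, rel.To.ID)
}</code></pre><p>Treating the topology itself as a signal is what lets anomalies be interpreted in context rather than reverse-engineered from span metadata.</p><p>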
This is the work being advanced by the <a href="https://github.com/open-telemetry/community?ref=causely-blog.ghost.io#special-interest-groups" rel="noreferrer noopener"><u>OpenTelemetry Entities SIG</u></a>, and it's central to how we think about observability at <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causely</u></a>.&nbsp;</p><h2 id="what-it-looks-like-in-practice">What It Looks Like in Practice&nbsp;</h2><p>At Causely, we’ve been applying these ideas in production. We observe system behavior, and instead of centralizing all that raw output, we extract anomalies in the context of semantic entities and relationship graphs at the edge. This gives us fine-grained insight without overwhelming the pipeline.&nbsp;</p><p>The impact is real:&nbsp;</p><ul><li><strong>Less data, more clarity</strong>: structured signals replace noisy aggregates&nbsp;</li><li><strong>Better performance and cost-efficiency</strong>: telemetry becomes lean and targeted&nbsp;</li><li><strong>Stronger privacy</strong>: raw, user-level data never needs to leave the cluster&nbsp;</li><li><strong>Faster debugging</strong>: understanding is built-in, not reverse-engineered&nbsp;</li></ul><p>The goal isn’t to eliminate data; it’s to <strong>collect with intent</strong>. 
To make the system itself understandable, not just observable.&nbsp;</p><h2 id="watch-the-talk">Watch the Talk&nbsp;</h2><p>In the session, I go deep into what this looks like technically, including:&nbsp;</p><ul><li>How teams are adapting their telemetry strategies to manage scale and cost&nbsp;</li><li>The role of entities, ontologies, and semantic modeling in modern observability&nbsp;</li><li>Why centralized data lakes aren’t a sustainable long-term answer&nbsp;</li><li>What we’ve learned building systems that can reason about their own behavior&nbsp;</li></ul><p>If you’ve felt the limits of traditional observability, and you're looking for a more scalable, reliable, and thoughtful path forward, I hope you’ll check it out.&nbsp;</p><p><a href="https://static.sched.com/hosted_files/otelopenobservabilityna25/75/signal-in-storm-observabilitysummit-2025.pptx?ref=causely-blog.ghost.io" rel="noreferrer">See the slides</a>, or watch the recording:</p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/Gy38gx-7phA?si=SrH4C2ysZMB6QhI-" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[When Everything Is Instrumented, and You Still Don’t Know What’s Broken]]></title>
      <link>https://causely.ai/blog/when-everything-is-instrumented-and-you-still-dont-know-whats-broken</link>
      <guid>https://causely.ai/blog/when-everything-is-instrumented-and-you-still-dont-know-whats-broken</guid>
      <pubDate>Mon, 14 Jul 2025 15:09:42 GMT</pubDate>
      <description><![CDATA[In 'Rethinking Reliability for Distributed Systems,' Endre Sara shared a common story: a large-scale customer, running mature microservices in Kubernetes with full observability coverage, still struggles to understand what’s broken during a high-stakes business event.]]></description>
      <dc:creator>Ben Yemini</dc:creator>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/07/Screenshot-2025-07-14-at-11.00.08---AM.png" type="image/png" />
      <content:encoded><![CDATA[<h2 id="why-microservices-need-causal-reasoning-not-just-observability"><em>Why microservices need causal reasoning, not just observability</em>&nbsp;</h2><p></p><p>You did the right things.&nbsp;</p><ul><li>Your microservices are fully instrumented.&nbsp;</li><li>You’ve got distributed tracing and a modern observability stack.&nbsp;&nbsp;</li><li>You even built custom dashboards.&nbsp;</li></ul><p>But during a major production incident, it still took hours to figure out what was wrong.&nbsp;</p><p>Sound familiar?&nbsp;</p><p>In our recent webinar, <em>'Rethinking Reliability for Distributed Systems</em>,' Causely co-founder <a href="https://www.linkedin.com/in/endresara/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Endre Sara</u></a> shared a story we hear far too often: a large-scale customer, running mature microservices in Kubernetes with full observability coverage, still struggles to understand what’s broken during a high-stakes business event.&nbsp;</p><h2 id="why-microservices-break-differently"><strong>Why microservices break differently</strong>&nbsp;</h2><p>Distributed systems aren't just complex, they're dynamic. Services spin up and down. Async data flows hide causal relationships. Teams own different pieces. Dashboards fill up with symptoms, not answers.&nbsp;</p><p>That's what happened to a large enterprise team Endre worked with. They had:&nbsp;</p><ul><li>Mature Kubernetes operations&nbsp;</li><li>Kafka for async comms&nbsp;</li><li>Comprehensive tracing + telemetry&nbsp;</li><li>A seasoned SRE team&nbsp;</li></ul><p>And still, they couldn't find the root cause of a high-stakes incident.&nbsp;</p><p>The problem wasn't a lack of data. 
It was a lack of <a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>causal reasoning</u></a>.&nbsp;&nbsp;</p><h2 id="watch-the-recording-rethinking-reliability-for-distributed-systems"><strong>Watch the recording: Rethinking Reliability for Distributed Systems</strong>&nbsp;</h2><p>In this session, Endre walks through:&nbsp;</p><ul><li>Why microservices environments <a href="https://www.causely.ai/blog/be-smarter-about-observability-data?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>overwhelm traditional observability</u></a>&nbsp;</li><li>How causal reasoning changes incident response&nbsp;</li><li>What teams can do to move from firefighting to foresight&nbsp;</li></ul><p>Whether you’re drowning in alerts or struggling to explain why something broke, this talk offers a clear new perspective and a path forward.&nbsp;&nbsp;</p>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/iuroPvbvDk8?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely | Rethinking Reliability for Distributed Systems">
  </iframe>
</div>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Using Ask Causely to Transform Incident Response]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-using-ask-causely-to-transform-incident-response</link>
      <guid>https://causely.ai/blog/causely-feature-demo-using-ask-causely-to-transform-incident-response</guid>
      <pubDate>Thu, 10 Jul 2025 18:39:56 GMT</pubDate>
      <description><![CDATA[In this short demo, we show how Ask Causely shifts incident response from a fire drill to a focused, high-context workflow.]]></description>
      <dc:creator>Anson McCook</dc:creator>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/07/Screenshot-2025-07-10-at-2.35.11---PM-1.png" type="image/png" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/SbpVWUjRyfU?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely | Ask Causely Demo">
  </iframe>
</div>
<!--kg-card-end: html-->
<p>When something breaks in production, most tools make you start with a question:&nbsp;What’s going on?&nbsp;But with&nbsp;<strong>Ask Causely</strong>, you start with the answer. </p><p>Powered by our real-time <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Engine</a>, Causely continuously analyzes your telemetry and system topology to surface the root cause the moment an issue begins — no prompt required. Unlike other AI assistants that reactively dig through logs or metrics after you ask, Causely already knows what’s wrong, why it’s happening, and what’s likely to break next.</p><p>In this short demo, we show how Ask Causely shifts incident response from a fire drill to a focused, high-context workflow. Whether you're on-call, in a war room, or mid-debugging, Ask Causely can tell you who owns the failing service, what downstream systems are impacted, and what to do next — all from Slack, Teams, or your browser. It’s observability that works like a teammate, not just a tool. </p><p>Watch the video to see how your team can go from alerts to action in seconds.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Making Observability Work: From Hype to Causal Insight]]></title>
      <link>https://causely.ai/blog/making-observability-work-from-hype-to-causal-insight</link>
      <guid>https://causely.ai/blog/making-observability-work-from-hype-to-causal-insight</guid>
      <pubDate>Thu, 03 Jul 2025 15:10:34 GMT</pubDate>
      <description><![CDATA[A few weeks back, I joined Charity Majors, Paige Cruz, Avi Freedman, Shahar Azulay, and Adam LaGreca for a roundtable on the state of modern observability. It was an honest conversation about where we are, what’s broken, and where things are heading. You can read the full summary on The New Stack. This exchange inspired me to write down my thoughts and to expand on them.]]></description>
      <dc:creator>Severin Neumann</dc:creator>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/07/Screenshot-2025-07-03-at-9.23.15---AM-1-1-1-1-1.png" type="image/png" />
      <content:encoded><![CDATA[<p>A few weeks back, I joined Charity Majors, Paige Cruz, Avi Freedman, Shahar Azulay, and Adam LaGreca for a roundtable on the state of modern observability. It was an honest conversation about where we are, what’s broken, and where things are heading. You can read the <a href="https://thenewstack.io/the-modern-observability-roundtable-ai-rising-costs-and-opentelemetry?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>full summary on The New Stack</u></a>. This exchange inspired me to write down my thoughts and to expand on them.&nbsp;</p><h2 id="let%E2%80%99s-not-rename-observability-%E2%80%94-let%E2%80%99s-make-it-work">Let’s Not Rename Observability — Let’s Make It Work&nbsp;</h2><p>Every few months, a new term pops up: understandability, explainability, controllability. And sure, we all want better systems. But do we really need new language? Or do we need better outcomes?&nbsp;</p><p>As I said in the panel: six people will have twelve opinions on what observability means. But the real point is this: users don’t care what we call it! They want to catch root causes in production early and minimize the business impact during an incident. And that requires systems that deliver causal, actionable insights that allow teams to return their software to a healthy state quickly.&nbsp;</p><p>Renaming observability and packaging it as something else is not conducive to getting to the true outcome everyone wants – they are just words without the proper action and change to back it up. We saw this before, when “monitoring” was replaced with “observability” with no actual change.&nbsp;</p><p>Observability should support the full software lifecycle, from design to incident to fix. If it’s not helping you move faster <em>and</em> safer, then it’s just more dashboards. And no one wants that. 
Let’s not get stuck in terminology and instead get back to building systems that help us move.&nbsp;</p><h2 id="value-over-volume">Value Over Volume&nbsp;</h2><p>One of the loudest themes from the roundtable: cost. Observability spend is skyrocketing, while the value people get is… questionable.&nbsp;</p><p>We are stuck in a cycle of collecting everything, just in case! But volume does not equal value. More logs, more traces, and more storage don’t solve problems; they mostly just add noise.&nbsp;</p><p>Any vendor claiming to innovate in observability has to answer this question: how do you shift the focus from collecting more data to delivering only useful insights?&nbsp;</p><p>Causely’s answer: we do mediation at the cluster (i.e., edge-based processing) and only send distilled insights to the cloud. Our system doesn’t aim to collect all the data. It aims to understand what’s wrong, fast. That means we’re not just watching metrics, we’re diagnosing causal chains. That’s how we make observability affordable and, more importantly, useful.&nbsp;</p><h2 id="from-optional-to-invisible-the-future-of-opentelemetry">From Optional to Invisible: The Future of OpenTelemetry&nbsp;</h2><p>During the roundtable I said that OpenTelemetry will have won when people use it without realizing it: when the libraries, frameworks, and programming languages we use every day come with out-of-the-box integration. Developers will use it like any other core feature of their language, like if-statements, variables, and comments.&nbsp;</p><p>That said, this vision is more of a stretch goal than a milestone on the horizon. In many ways, OpenTelemetry has already “won.” Making code observable, whether through OpenTelemetry or another system, is no longer optional; it is expected. 
Observability has become a baseline capability, and OpenTelemetry helped set that standard.&nbsp;</p><h2 id="smart-automation-not-ai-hype">Smart Automation, Not AI Hype&nbsp;</h2><p>A few years ago, AIOps promised to make sense of our systems with artificial intelligence and mostly delivered confusion. The hype faded, the promises didn’t hold up, and most teams were left with “smart” alert suppressors or dashboards that looked impressive but didn’t help when it counted.&nbsp;</p><p>Today’s AI wave is louder: LLMs that summarize incidents, tools that promise auto-remediation with natural language, anomaly detectors wrapped in glossy UI. But most of them suffer from the same core flaw: <a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability%22%20/l%20%22why-traditional-approaches-fall-short?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>LLMs are general-purpose tools: they excel at pattern recognition but lack the deep, real-time causal reasoning needed to adapt to novel, dynamic environments</u></a>.&nbsp;</p><p>At Causely, we’re not building a chatbot. We’re not doing log clustering or time-series correlation. We’re building a causal reasoning system.&nbsp;</p><p>Our system encodes known failure modes as structured models. That means it can infer root causes even when the triggering signal isn’t directly observable. It doesn’t just suppress noise; it explains it. It doesn’t just summarize symptoms; it traces causality.&nbsp;</p><p>This isn’t black-box “AI.” It’s smart, explainable automation rooted in mathematics — built to help engineers understand <em>why</em> things happen and what to do next.&nbsp;</p><p>And the goal isn’t to replace humans! It’s to get them out of the loop for the boring stuff: the CPU spike, the misconfigured downstream service, the memory leak that shows up every Tuesday. These aren’t mysteries. They’re patterns. 
And they can be handled automatically.&nbsp;</p><p>That way, engineers can spend their time doing what they love: building.&nbsp;</p>
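The idea of encoding known failure modes as structured causal models, then inferring a root cause that was never directly observed, can be made concrete with a toy sketch. This is purely illustrative and not Causely's implementation: the root causes, symptoms, and probabilities below are all invented, and the inference is brute-force Bayesian updating over simple noisy-OR links.

```python
# Toy illustration (invented numbers, not Causely's actual model):
# infer the most likely root cause from observed symptoms with a
# tiny causal Bayesian network, enumerated by brute force.
from itertools import product

# Hypothetical prior probabilities that each root cause is active.
PRIORS = {"db_lock_contention": 0.02, "memory_leak": 0.01}

# P(symptom fires | cause is active) -- one noisy-OR link per edge.
CAUSAL_LINKS = {
    "slow_responses": {"db_lock_contention": 0.9, "memory_leak": 0.4},
    "pod_restarts":   {"db_lock_contention": 0.1, "memory_leak": 0.8},
}

def symptom_prob(symptom, active):
    """Noisy-OR: the symptom fires unless every active cause fails to trigger it."""
    p_none = 1.0
    for cause, p in CAUSAL_LINKS[symptom].items():
        if cause in active:
            p_none *= 1.0 - p
    return 1.0 - p_none

def posterior(observed):
    """Posterior over sets of active root causes, given symptom observations."""
    scores = {}
    causes = list(PRIORS)
    for bits in product([False, True], repeat=len(causes)):
        active = {c for c, on in zip(causes, bits) if on}
        joint = 1.0
        for c in causes:  # prior over which causes are active
            joint *= PRIORS[c] if c in active else 1.0 - PRIORS[c]
        for s, state in observed.items():  # likelihood of each observation
            p = symptom_prob(s, active)
            joint *= p if state else 1.0 - p
        scores[frozenset(active)] = joint
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

post = posterior({"slow_responses": True, "pod_restarts": True})
best = max(post, key=post.get)
print(sorted(best), round(post[best], 3))
```

Even in this tiny example, the posterior concentrates on a root cause that was never directly observed — only its downstream symptoms were. That is the essential difference between inferring causes from effects and merely correlating alerts.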
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/wmugy8chF1A?si=3dOwreDILwBNFuZ5" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Modern Observability Roundtable">
  </iframe>
</div>

<!--kg-card-end: html-->
<h2 id="closing-thoughts">Closing Thoughts&nbsp;</h2><p>The roundtable showed there’s still strong alignment across the industry: observability remains essential, but it needs to deliver more than just data. We need clearer value, smarter automation, and systems that help us move faster, not just monitor more.&nbsp;</p><p>Thanks to Adam, Charity, Paige, Shahar, and Avi for a thoughtful and honest discussion. It’s good to see real progress, and even better to debate what is coming next.&nbsp;</p><p></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Feature Demo: Unlock Root Cause Analysis in Grafana]]></title>
      <link>https://causely.ai/blog/causely-feature-demo-unlock-root-cause-analysis-in-grafana</link>
      <guid>https://causely.ai/blog/causely-feature-demo-unlock-root-cause-analysis-in-grafana</guid>
      <pubDate>Fri, 27 Jun 2025 17:18:41 GMT</pubDate>
      <description><![CDATA[Grafana gives teams the power to visualize everything - but on Day 0, when your dashboards are live and alerts start firing, what your team really needs is clarity. That’s why we built the new Causely plugin for Grafana. In just minutes, Causely connects to your telemetry sources and begins surfacing the root cause of performance degradations - right inside your existing dashboards. No code changes. No sidecars. Just answers.

In this video, you’ll see how Causely helps teams cut thro]]></description>
      <author>Anson McCook</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/06/Screenshot-2025-06-27-at-10.46.24---AM-2.png" type="image/png" />
      <content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/D6Ps1VoGHvw?rel=0" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" 
          allowfullscreen 
          title="Causely Grafana Plugin Demo">
  </iframe>
</div>


<!--kg-card-end: html-->
<p>Grafana gives teams the power to visualize everything - but on&nbsp;Day 0, when your dashboards are live and alerts start firing, what your team really needs is clarity. That’s why we built the new Causely plugin for Grafana. In just minutes, Causely connects to your telemetry sources and begins surfacing the&nbsp;<em>root cause</em>&nbsp;of performance degradations - right inside your existing dashboards. No code changes. No sidecars. Just answers.</p><p>In this video, you’ll see how Causely helps teams cut through the noise of modern microservice environments by pinpointing&nbsp;<em>why</em>&nbsp;systems are breaking, not just where. Our Causal Reasoning Engine builds a real-time cause-and-effect map of your services - so instead of drowning in symptoms, your team is focused on what matters most. Watch the video to see how Causely helps you deliver real insight from Day 0 - and drive better reliability from Day 1.</p><p></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Alarm Suppression is Not Root Cause Analysis]]></title>
      <link>https://causely.ai/blog/alarm-suppression-is-not-root-cause-analysis</link>
      <guid>https://causely.ai/blog/alarm-suppression-is-not-root-cause-analysis</guid>
      <pubDate>Mon, 02 Jun 2025 19:05:33 GMT</pubDate>
      <description><![CDATA[“Root Cause Analysis” (RCA) is one of the most overloaded terms in modern engineering. Some call a tagged log line RCA. Others label time-series correlation dashboards or AI-generated summaries as RCA. Some reduce noise by filtering or hiding secondary and cascading alarms. And recently large language models (LLMs) have entered the scene, offering natural-language explanations for whatever just broke.  

But here is the problem: none of these are actually solving the Root Cause Analysis problem.]]></description>
      <author>Dhairya Dalal</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/catherine-hughes-PkEQHH6R7Eg-unsplash.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>“Root Cause Analysis” (RCA) is one of the most overloaded terms in modern engineering. Some call a tagged log line RCA. Others label time-series correlation dashboards or AI-generated summaries as RCA. Some reduce noise by filtering or hiding secondary and cascading alarms. And recently, large language models (LLMs) have entered the scene, offering natural-language explanations for whatever just broke.&nbsp;&nbsp;</p><p>But here is the problem: none of these are actually solving the Root Cause Analysis problem. Alarm suppression is NOT Root Cause Analysis.&nbsp;&nbsp;</p><p>For teams operating modern, distributed systems - microservices, data pipelines, container orchestration, multi-cloud dependencies - these limitations aren’t minor. They make it impossible to reason clearly about why performance degradations and failures happen, and how to prevent them.&nbsp;<br>&nbsp;<br>For example, at a&nbsp;financial services company, a cascading Kafka issue was misdiagnosed for hours as a frontend memory spike. The result? Missed SLOs and three teams paged on a Saturday.&nbsp;</p><h3 id="what-these-tools-miss">What These Tools Miss&nbsp;</h3><p>Today's “RCA” tools suffer from several critical limitations:&nbsp;</p><ul><li>They can’t infer the root cause unless it’s already present in the observed alarms – i.e., the output must be in the input.&nbsp;</li><li>They don't explain <em>why</em> the symptoms occurred – they just hide redundant signals.&nbsp;</li><li>They require perfect, complete signals to function reliably.&nbsp;&nbsp;</li><li>They often produce misleading or spurious outputs.&nbsp;</li></ul><p>At Causely, we’re not building a log aggregator, a correlation engine, an alarm suppressor, or a chatbot. We are building a causal reasoning system. 
To understand how this approach is different and why it matters, we first need to re-establish what “root cause analysis” is meant to solve.&nbsp;&nbsp;</p><h3 id="what-is-the-root-cause-analysis-problem">What is the Root Cause Analysis Problem?&nbsp;</h3><p>&nbsp;In managed cloud environments, the RCA problem requires the identification of the most likely cause of observed symptoms, based on a structured understanding of the environment and causal interdependencies between services. Put simply: it’s about using what you <em>know</em> about your system to explain what you <em>see</em>—not just matching alerts to patterns, but reasoning through cause and effect.&nbsp;</p><h2 id="what-most-rca-tools-actually-do"><strong>What Most RCA Tools Actually Do</strong>&nbsp;</h2><p>Most tools that claim to perform RCA fall into one of three categories:&nbsp;</p><ul><li><strong>Postmortem Narratives</strong>: These are after-the-fact writeups, often shaped more by human bias than data, constructed to provide a retrospective of what went wrong.&nbsp;&nbsp;</li><li><strong>Correlation Engines</strong>: These systems surface anomalies and related signals during incidents but confuse correlation with causation. They provide visibility into <em>what </em>things happened around the same time, but they don’t know <em>why</em>.&nbsp;&nbsp;&nbsp;</li><li><strong>LLM-Powered Assistants</strong>: These interfaces can produce plausible-sounding explanations by summarizing the data they have access to, but they often generate spurious or unverifiable answers.&nbsp;&nbsp;</li></ul><p>While all three of these approaches can be useful in some ways, none of them utilize structured causal knowledge and reasoning to solve the RCA problem. That’s the missing piece.&nbsp;</p><h2 id="what-actual-causal-analysis-looks-like"><strong>What Actual Causal Analysis Looks Like</strong>&nbsp;</h2><p>To be clear, causal analysis is not about finding "what’s weird." 
It’s about inferring <em>what the root cause is</em>. At Causely, our reasoning platform is built on three foundational principles:&nbsp;</p><h3 id="1-root-causes-are-explicitly-defined"><strong>1. Root Causes Are Explicitly Defined</strong>&nbsp;</h3><p>A root cause is an underlying issue that results in multiple degradations and disruptions in the managed environment. Formally, a root cause is defined by causes and effects, where the cause is the underlying issue (e.g., a DB experiencing inefficient locking), and the&nbsp;effects are the disruptions it creates in the environment (e.g., degraded service response times).&nbsp;&nbsp;</p><p>Our system monitors your environment and identifies which anomalous behaviors (such as high error rates, slow responses, or service crashes) are symptoms, using metrics gathered from telemetry and observability tools.&nbsp;</p><p>Causely represents each root cause as a closure - a signature uniquely defined by a specific set of expected symptoms. Each root cause and its closure are automatically generated from causal knowledge and the discovered topology. Utilizing Causal Bayesian Networks, Causely can effectively and accurately infer root causes by reasoning over the relationships between the root causes and symptoms, rather than relying on simple mapping or correlations.&nbsp;&nbsp;</p><p>Causely constructs precise causal graphs that show not just what broke, but why. This allows issues to be resolved faster and more efficiently.&nbsp;</p><h3 id="2-causal-reasoning-is-bayesian"><strong>2. Causal Reasoning Is Bayesian</strong>&nbsp;</h3><p>Using causal Bayesian networks, Causely infers the root cause from observed symptoms, even when the observations are incomplete or noisy. Why Bayesian networks? 
Because you can’t assume perfect signals and should always assume that the observed symptoms will be noisy:&nbsp;</p><ul><li>Symptoms aren’t always fully observable.&nbsp;</li><li>Real systems behave unpredictably.&nbsp;</li><li>And we need to reason probabilistically under uncertainty.&nbsp;</li></ul><p>Causely uses Bayesian causal graphs to represent possible root causes and their effects, assigning probabilities to capture uncertainty in real-world systems. The prior probabilities in our models are defined by experts with decades of experience in distributed systems, microservices, data pipelines, and container orchestration. During active incidents, Causely uses observed symptoms to calculate posterior probabilities over possible root causes. These probabilities are then used to identify the most likely root cause, enabling teams to respond quickly and decisively.&nbsp;&nbsp;</p><h3 id="3-causal-graphs-are-customer-specific"><strong>3. Causal Graphs Are Customer-Specific</strong>&nbsp;</h3><p>Our causal models are not static or generalized templates. Causely automatically constructs topologically grounded causal graphs that are specific to the customer’s environment and dynamically adapt as environment services and dependencies change. 
The causal graphs map environment-specific causal dependencies and represent how root causes propagate and manifest themselves across services.&nbsp;&nbsp;</p><h2 id="benefits-of-the-causely-approach"><strong>Benefits of the Causely Approach</strong>&nbsp;</h2><p>Causely conducts root cause analysis in a principled way to ensure:&nbsp;</p><h3 id="high-precision">High Precision&nbsp;</h3><ul><li>Root causes are curated and relevant to managed cloud environments.&nbsp;</li><li>Causal structures mirror actual deployment topologies.&nbsp;</li><li>False positives are mitigated by never guessing outside the defined causal space.&nbsp;</li></ul><h3 id="generalizability-rapid-deployment">Generalizability &amp; Rapid Deployment&nbsp;</h3><ul><li>Bayesian methods identify root causes even with sparse observations.&nbsp;</li><li>Topologies are grounded in causal graphs&nbsp;to ensure root cause accuracy.&nbsp;</li><li>Dynamic topology updates with real-time telemetry data let us adapt to each customer’s specific patterns.&nbsp;</li></ul><h3 id="predictive-power">Predictive Power&nbsp;</h3><ul><li>Causal graphs are constructed a priori to ensure the reasoning engine can predict which downstream symptoms and disruptions may emerge when monitoring active root causes.&nbsp;&nbsp;</li><li>Causely’s causal graphs enable corrective interventions before all root cause symptoms manifest.&nbsp;</li><li>In addition to being used for real-time operations, the causal reasoning system can also be used to anticipate future failures and help prevent them. Causely identifies critical services and their failure risks based on causal pathways.&nbsp;&nbsp;</li></ul><h2 id="what-about-llms"><strong>What About LLMs?</strong>&nbsp;</h2><h3 id="strengths-of-llm-based-rca">Strengths of LLM-Based RCA&nbsp;</h3><p>LLM-based approaches to RCA are gaining traction. 
They offer unique strengths:&nbsp;</p><ul><li><strong>Natural Language Interface:</strong> You can ask follow-up questions in plain English and get responses without needing to write a query or scan dashboards.&nbsp;</li><li><strong>Trained on Broad Knowledge</strong>: LLMs draw from a vast, pre-trained corpus. Most LLMs’ knowledge spans Stack Overflow, GitHub issues, and decades of online technical discourse. This breadth allows them to generate plausible explanations across diverse technologies.&nbsp;</li><li><strong>Rapid Response: </strong>Most LLMs will respond within seconds and can complete tedious tasks quickly.&nbsp;</li></ul><p>But these benefits come with real tradeoffs.&nbsp;</p><h3 id="limitations-of-llm-based-rca">Limitations of LLM-Based RCA&nbsp;</h3><ul><li><strong>Spurious Causes</strong>: LLMs often draw conclusions that appear coherent but are factually incorrect and contextually invalid - due to hallucinations, logical inconsistencies, and insufficient understanding of the managed environment.&nbsp;</li><li><strong>Unprincipled Reasoning</strong>: LLMs mimic the language of reasoning without performing structured inference. Research shows that LLMs suffer from content effects, where prior biases interfere with logical reasoning and result in reasoning fallacies.&nbsp;&nbsp;</li><li><strong>Causal Identification Failures</strong>: Research shows LLMs systematically struggle with causal prediction, especially in dynamic settings, due to the causal sufficiency problem.&nbsp;</li></ul><p>While LLMs are quickly gaining traction, they remain limited in accurately identifying root causes and reasoning in complex environments. Causely combines the best of both worlds by using LLMs responsibly to support natural language conversations, while grounding root cause analysis in structured causal models to ensure precision and accuracy.&nbsp;</p><h2 id="rca-isn%E2%80%99t-a-buzzword-it%E2%80%99s-a-well-defined-problem"><strong>RCA Isn’t a Buzzword. 
It’s a well-defined problem.</strong>&nbsp;</h2><p>If you are calling something a “root cause,” you should be able to show how it <em>caused</em> the observed effects - not just that it co-occurred or provide an explanation that sounds plausible.&nbsp;</p><p>At Causely we solve the RCA problem by being structured, explainable, and rooted in decades of engineering expertise. We don’t guess. We infer. We don’t react. We reason.&nbsp;</p><p>Let’s stop diluting RCA into dashboards and chatbots. Let’s build systems that actually understand why things break and how to prevent future failures.&nbsp;<br>&nbsp;<br>If you’re tired of dashboards that guess and chatbots that bluff, it’s time to reason instead. Start your journey with Causely today.&nbsp;</p><p><u>👉</u><a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer noopener">Access our sandbox and free trial <u>environment</u></a>&nbsp;</p><p><u>👉</u><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Reach out for a customized demo</u></a>&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[If Planes Can Fly Themselves Then Why Can’t IT Management be Autonomous?]]></title>
      <link>https://causely.ai/blog/if-planes-can-fly-themselves-then-why-cant-it-management-be-autonomous</link>
      <guid>https://causely.ai/blog/if-planes-can-fly-themselves-then-why-cant-it-management-be-autonomous</guid>
      <pubDate>Fri, 16 May 2025 14:59:00 GMT</pubDate>
      <description><![CDATA[When it comes to observability and IT operations, our goal should be to get humans out of the loop as much as possible.]]></description>
      <author>Shmuel Kliger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/05/244762305_4421815224563670_2098469086385192273_n.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Reposted with permission from </em><a href="https://devops.com/if-planes-can-fly-themselves-then-why-cant-it-management-be-autonomous/?ref=causely-blog.ghost.io" rel="noreferrer"><em>DevOps.com</em></a></p><p>When it comes to&nbsp;<a href="https://devops.com/webinars/ai-and-the-future-of-software-development/?ref=causely-blog.ghost.io" rel="noopener">discussing how AI will impact the future of software development</a>&nbsp;and IT management, most vendors hold on to the objective that it’s important to keep the human in the loop. They are afraid to publicly acknowledge what’s always been true — that machines are better than humans at many things. Meanwhile, that list of things continues to grow.</p><p>When it comes to&nbsp;<a href="https://devops.com/next-generation-observability-combining-opentelemetry-and-ai-for-proactive-incident-management/?ref=causely-blog.ghost.io" rel="noopener">observability and IT operations</a>, our goal should be to get humans out of the loop as much as possible. In a world run by software, where SLAs guarantee 99.999% uptime, even one incident in a month — which requires human intervention — is enough downtime to violate your commitments to your customers.</p><p>With the rise of distributed, complex applications exacerbated by more and more AI-generated code, it is difficult for traditional IT operations&nbsp;to keep up across&nbsp;troubleshooting, provisioning infrastructure, performance monitoring, security, etc. Developers now deploy code 70% faster using AI, according to the&nbsp;<a href="https://c212.net/c/link/?t=0&l=en&o=4398647-1&h=4055757585&u=https%3A%2F%2Fdora.dev%2Fresearch%2F2024%2Fdora-report%2F&a=latest+DORA+report&ref=causely-blog.ghost.io" rel="noopener">latest DORA report</a>&nbsp;— you can’t have enough SREs and security personnel to keep up with this kind of activity. 
The companies that succeed will leverage technology to get humans out of the way as quickly as possible.&nbsp;</p><h2 id="failed-promises-of-aiops">Failed Promises of AIOps&nbsp;</h2><p>It is worth stating the obvious that there’s a difference between where things are today and where they are heading in the future. As someone who has been in IT management for decades, I have seen all the buzzwords as well as both&nbsp;marketing nonsense&nbsp;and genuine progress.&nbsp;&nbsp;</p><p>The term ‘AIOps’ was coined by Gartner in 2016 to address the increasing complexity and data volume in IT environments, aiming to automate processes such as event correlation, anomaly detection and causality determination. But as it turned out, many of the vendors who claimed to offer AIOps were nothing more than empty shells when you looked under the hood.&nbsp;&nbsp;</p><p>It was essentially the same AI/ML that had been used for a decade beforehand, branded in a new way and making outsized claims that didn’t map to reality. We saw tech giants making acquisitions of point solutions, and then bundling them under the ‘AIOps’ category because it was trendy to do so and they had nowhere else better to put them.&nbsp;</p><p>But I would argue that many of the companies that continued to trumpet these claims without actually delivering on the promises eventually suffered a hit to their reputation. We still aren’t at that stage where machines are acting autonomously to solve complex problems across performance, reliability and security. But they are doing more than ever before, and there’s every reason to believe we will reach that goal in the future.&nbsp;</p><h2 id="the-future-of-autonomous-reliability">The Future of Autonomous Reliability&nbsp;</h2><p>Autonomous reliability platforms of the future will not only surface actionable insights, but they will also be competent enough to make autonomous decisions without human intervention. And why should this be impossible? 
If planes can mostly fly themselves, why can’t IT management become autonomous?&nbsp;</p><p>The decades-long trend of collecting more and more data in the name of observability isn’t rendering autonomous service reliability. Nor is it feeding that data into a machine and magically hoping for answers. Machines trained on yesterday’s patterns might explain what went wrong in the past, but they can’t make the real-time decisions required to keep systems running, especially in dynamic, cloud-native environments.&nbsp;</p><p>What we need is a paradigm shift from data collection to causal understanding. By capturing causal knowledge as part of an ontology, we can reason about cause-and-effect in complex, ever-changing systems — the key is to move beyond reactive alerting into autonomous reliability. Unlike the growing industry trend of offloading alerts to LLMs, causal reasoning gives us the context and clarity needed to take real control.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[From Dashboards to Decisions: Introducing the Causely Plugin for Grafana]]></title>
      <link>https://causely.ai/blog/from-dashboards-to-decisions-introducing-the-causely-plugin-for-grafana</link>
      <guid>https://causely.ai/blog/from-dashboards-to-decisions-introducing-the-causely-plugin-for-grafana</guid>
      <pubDate>Tue, 06 May 2025 12:34:59 GMT</pubDate>
      <description><![CDATA[With Causely, you can see the why behind what’s happening without having to leave your Grafana interface.]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/05/causely-dashboard.png" type="image/png" />
      <content:encoded><![CDATA[<p>Grafana provides engineering teams with a clear lens into their systems, enabling them to surface logs, metrics, and traces in one unified place. For example, you could create a mission control dashboard, helping teams monitor what’s happening across services.&nbsp;</p><p>But even the clearest dashboard doesn’t give you what you need when something breaks.&nbsp;</p><p>Observability allows you to see what is happening, but what about understanding why it is happening?&nbsp;</p><p>That’s where Causely comes in. Our new plugin embeds root cause intelligence directly into your Grafana dashboards — so you can see the why behind what’s happening without leaving your Grafana interface, and seamlessly shift from awareness to understanding and from incident response to proactive improvement, all in a single view.&nbsp;</p><h3 id="when-observability-stops-at-what"><strong>When Observability Stops at What</strong>&nbsp;</h3><p>Modern engineering teams operate in high-stakes, high-complexity environments:&nbsp;&nbsp;&nbsp;</p><ul><li>Dozens (or hundreds) of microservices&nbsp;&nbsp;&nbsp;</li><li>CI/CD pipelines constantly pushing new code&nbsp;&nbsp;&nbsp;</li><li>SLOs that leave no margin for delay&nbsp;&nbsp;&nbsp;</li></ul><p>The result? Even with powerful observability tools like Grafana, most teams still end up in incident management mode: war rooms, escalations, and Slack threads at 2 a.m.&nbsp;&nbsp;&nbsp;</p><p>Instead of building, you’re firefighting. 
Instead of improving systems, you're guessing at symptoms.&nbsp;</p><p>Grafana tells you <strong>what</strong> happened.&nbsp;&nbsp;&nbsp;</p><p>Causely shows you <strong>why</strong> — and what to do next.&nbsp;</p><h3 id="why-grafana-causely-is-different">&nbsp;<strong>Why Grafana + Causely Is Different</strong>&nbsp;</h3><p>This integration was built to help teams detect and truly understand problems—in context and real time.&nbsp;</p><p>By embedding Causely’s reasoning engine inside Grafana, teams can now:&nbsp;</p><ul><li>See the root causes for all active anomalies across your environment, not just alerts&nbsp;</li><li>Identify the highest-priority root cause impacting your services right now&nbsp;</li><li>Spot the root causes putting your SLOs at risk — before they’re breached&nbsp;</li><li>Reduce mean time to detect (MTTD) and resolve (MTTR) by surfacing not just data but decisions&nbsp;</li></ul><p>Causely continuously analyzes the relationships between services, symptoms, and anomalies — and uses causal reasoning to infer what’s driving those symptoms, all showcased natively in Grafana with the plugin.&nbsp;&nbsp;</p><h3 id="why-we-chose-beyla"><strong>Why We Chose Beyla</strong>&nbsp;</h3><p>What’s the superpower behind this integration? It’s what we chose to run under the hood.&nbsp;</p><p>Causely connects to your Kubernetes environment using Grafana Beyla, an open-source, eBPF-based instrumentation agent. 
It gives us deep visibility into your workloads — without sidecars, code changes, or custom configs.&nbsp;</p><p>We chose Beyla because it enables our customers scaling on Kubernetes to gain value immediately and understand the root causes of issues impacting their applications.&nbsp;Beyla enables quick time-to-value, minimal friction, and zero disruption to developer workflows.&nbsp;</p><p>An important distinction here: Beyla isn’t part of the plugin itself — it’s the component that connects to your environment and continuously monitors your services. The plugin then brings that intelligence into Grafana, making it actionable for your team.&nbsp;</p><h3 id="what-you-get-from-combining-causely-and-grafana"><strong>What You Get from Combining Causely and Grafana</strong>&nbsp;</h3><p>Bringing Causely into Grafana means you're gaining critical visibility when you need it most (no longer flying blind!). Together, they create a powerful loop: observability meets root cause analysis - inside your team's workflows.&nbsp;</p><p>With Causely + Grafana, you get:&nbsp;</p><ul><li>Instant visibility into root causes for all active anomalies and performance issues across your environment&nbsp;</li><li>The most urgent root cause degrading your services, surfaced right inside your Grafana dashboards&nbsp;</li><li>Proactive identification of issues putting your SLOs at risk - before violations occur&nbsp;</li><li>Context-rich visualizations and alert annotations that go beyond detection to show you what's broken, why it's happening, and the steps to take to resolve the issue&nbsp;</li></ul><h3 id="integrated-with-grafana-alertmanager">Integrated with Grafana Alertmanager&nbsp;</h3><p>Causely's root cause alerts can be pushed directly into your existing Alertmanager workflows - so your teams don't just get notified that something's wrong; they get a head start on fixing it. 
That means fewer escalations, faster triage, and better SLO performance.&nbsp;</p><p>Causely's reasoning engine powers all of this. It continuously maps how symptoms propagate through your services and precisely identifies the underlying causes.&nbsp;</p><p>As mentioned, this is all enabled in minutes, thanks to Grafana Beyla. This open-source eBPF-based instrumentation layer lets Causely connect to your Kubernetes workloads without code changes, sidecars, or complex configurations. Beyla lets Causely start understanding your environment immediately - so you get value quickly without interrupting your team's flow.&nbsp;</p><h3 id="how-to-use-it"><strong>How to Use It</strong>&nbsp;</h3><p>Getting started is straightforward:&nbsp;</p><ul><li>Install the Causely Plugin:&nbsp;&nbsp;&nbsp;</li></ul><p>&nbsp;&nbsp; You can find it in the Grafana <a href="https://grafana.com/grafana/plugins/esara-causely-app/?ref=causely-blog.ghost.io" rel="noreferrer noopener">data source plugin catalog</a> or install it via <a href="https://github.com/esara/grafana?ref=causely-blog.ghost.io" rel="noreferrer noopener">GitHub</a>.&nbsp;</p><ul><li>Link to Causely:&nbsp;</li></ul><p>Connect your Causely deployment to your Grafana instance. No changes to your instrumentation are needed, as Causely uses Beyla to automatically ingest telemetry from your Kubernetes workloads.&nbsp;</p><ul><li>Visualize, Understand, Resolve:&nbsp;&nbsp;&nbsp;</li></ul><p>Add the Causely panel to your dashboards. 
See the root causes, understand the impact, and get ahead of issues before they become escalations.&nbsp;</p><p>For complete setup steps, check out the <a href="https://grafana.com/grafana/plugins/esara-causely-app/?tab=installation&ref=causely-blog.ghost.io" rel="noreferrer noopener">Causely plugin docs</a>.&nbsp;&nbsp;</p><h3 id="start-seeing-root-causes-not-just-symptoms"><strong>Start Seeing Root Causes, Not Just Symptoms&nbsp;</strong>&nbsp;</h3><p>The Causely Plugin for Grafana has officially launched and is ready for use. If you rely on Grafana to monitor your services, this is your next step in making observability genuinely actionable.&nbsp;</p><p><a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Start Free Trial</u></a></p><p><a href="https://docs.causely.ai/index?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Explore Documentation</u></a>&nbsp;</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Request a Demo</u></a>&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[From Visibility to Control - Key Takeaways from a Fireside Chat with Viktor Farcic & Shmuel Kliger]]></title>
      <link>https://causely.ai/blog/from-visibility-to-control-key-takeaways-from-a-fireside-chat-with-viktor-farcic-shmuel-kliger</link>
      <guid>https://causely.ai/blog/from-visibility-to-control-key-takeaways-from-a-fireside-chat-with-viktor-farcic-shmuel-kliger</guid>
      <pubDate>Mon, 05 May 2025 19:03:53 GMT</pubDate>
      <description><![CDATA[“You actually cannot do meaningful reasoning especially when it comes to root cause analysis with LLMs or machine learning alone. You need more than that.”
-Shmuel Kliger, Founder of Causely]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/05/Untitled--Facebook-Post--3-3.png" type="image/jpeg" />
      <content:encoded><![CDATA[<blockquote>“You actually cannot do meaningful reasoning especially when it comes to root cause analysis with LLMs or machine learning alone. You need more than that.”<br>-Shmuel Kliger, Founder of Causely</blockquote>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://www.youtube.com/embed/O46ru2HDyBI" 
          style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" 
          frameborder="0" allowfullscreen></iframe>
</div>
<!--kg-card-end: html-->
<p>Modern observability tools help ensure engineers have all the data they could hope for at their disposal. But these tools also create a certain amount of “observability overload”: they do not understand whether an observed anomaly is a real problem and, more importantly, they cannot on their own accurately infer the <em>why</em> (i.e. the root cause) behind the <em>what</em> (i.e. the observed symptoms).&nbsp;&nbsp;</p><p>This was the focus of a recent conversation hosted by 10KMedia’s Adam LaGreca with guests Viktor Farcic (DevOps Toolkit) and Shmuel Kliger (founder of Causely). They explored why traditional observability approaches fall short on their own and what it will take to move beyond visibility toward real operational control.&nbsp;</p><h3 id="the-signal-problem-no-ones-solving"><strong>The Signal Problem No One's Solving</strong>&nbsp;</h3><p>The current state of reliability is defined by too much data and not enough clarity. Alerts fire constantly and are often ignored. Dashboards contain all the data you could wish for, if only you had the time to explore them and draw meaning from it. What’s missing isn’t more data; it’s understanding. Without an understanding of cause and effect, teams are left guessing and leaning heavily on whichever domain experts they can grab at the moment.&nbsp;&nbsp;</p><h3 id="causal-reasoning-as-a-path-forward"><strong>Causal Reasoning as a Path Forward</strong>&nbsp;</h3><p>The conversation centered on a new approach, causal reasoning, which moves away from bottom-up analysis of mountains of metrics, logs, and traces. By understanding service dependencies and the cascading effects that load and code changes have on distributed systems, causal reasoning offers a way to eliminate the noise and focus on what truly matters for assuring reliable application performance. 
This helps reduce alert fatigue, speeds up troubleshooting, and frees engineers from the toil of reliability.&nbsp;</p><h3 id="redefining-the-role-of-ai"><strong>Redefining the Role of AI</strong>&nbsp;</h3><p>While it makes logical sense to try to apply GenAI to the mountains of observability data we accumulate, leveraging LLMs without proper context will simply generate more noise at scale. By using causality to pinpoint the root cause of observed anomalies, a structured understanding of how systems behave can be used to get more efficient and effective outcomes from leveraging LLMs to automate operational work.&nbsp;&nbsp;</p><h3 id="the-future-of-reliability"><strong>The Future of Reliability</strong>&nbsp;</h3><p>Looking ahead, the vision is clear: fewer dashboards, fewer rabbit holes, and fewer hours lost to manual debugging. It’s a shift from observability to assurance. And for teams operating at the speed and scale of today’s distributed architectures, it’s a shift that can’t come soon enough.&nbsp;</p><p>🔍 <strong>Want to see this approach in action?</strong>&nbsp;<br>Check out the Causely sandbox or start a conversation with us: <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>www.causely.ai</u></a>&nbsp;</p><p>&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Building a Reasoning Platform, Together]]></title>
      <link>https://causely.ai/blog/building-a-reasoning-platform-together</link>
      <guid>https://causely.ai/blog/building-a-reasoning-platform-together</guid>
      <pubDate>Fri, 02 May 2025 15:41:30 GMT</pubDate>
      <description><![CDATA[A version upgrade. A schema change. And suddenly, a critical service stalls. MySQL 8’s hidden metadata locking behavior has tripped up even the most prepared teams. We captured this knowledge — and now, Causely can pinpoint it.

If you’ve learned about how Causely works, you already know that our Causal Reasoning Platform includes a built-in causal knowledge base. This knowledge base guides system behavior by capturing the potential root causes in your environment and the symptoms they may cause]]></description>
      <author>Enlin Xu</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/05/iStock-515855048.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><strong>A version upgrade. A schema change. And suddenly, a critical service stalls. MySQL 8’s hidden metadata locking behavior has tripped up even the most prepared teams. We captured this knowledge — and now, Causely can pinpoint it.</strong></p><p>If you’ve learned about&nbsp;<a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io"><u>how Causely works</u></a>, you already know that our Causal Reasoning Platform includes a built-in causal knowledge base. This knowledge base guides system behavior by capturing the potential root causes in your environment and the symptoms they may cause. We are constantly exploring ways to expand that knowledge base, and one key source of inspiration for this is the work we do in the real world with real users of our platform.</p><p>One of the most rewarding parts of my job is collaborating with awesome engineers who understand the value of our system and share ideas with me about how to make it better. Every customer scenario we encounter strengthens this knowledge base as we work through a four-step process:</p><p><strong>Learn</strong>: Understand the root causes and observable symptoms of the scenario</p><p><strong>Generalize</strong>: Capture the root causes and symptoms in the causality knowledge base</p><p><strong>Implement</strong>: Develop the mediation to discover and monitor the required information</p><p><strong>Deploy</strong>: Apply broadly and help all users benefit from this added knowledge</p><p>Whether it’s a metadata lock cascade in MySQL 8 or a Kubernetes resource noisy neighbor, this collaborative approach ensures that when one team faces a problem, the entire community can benefit from the expansion of our knowledge base. 
The following article is one such example of how this all played out in the real world with a real customer.</p><h3 id="modern-database-upgrades-can-be-painful">Modern Database Upgrades Can Be Painful</h3><p>Last month I had a discussion with one of the engineering leaders at Yext, Peter Rimshnick. We explored a challenge many teams face: unexpected database locking in MySQL 8. Peter shared an incident where a routine schema change caused a bit of unexpected downtime. It turns out this scenario was tied to MySQL 8’s nuanced metadata locking behavior.</p><h3 id="metadata-locking-in-mysql-8-what-changed">Metadata Locking in MySQL 8: What Changed?</h3><p><a href="https://dev.mysql.com/doc/refman/8.4/en/metadata-locking.html?ref=causely-blog.ghost.io"><u>MySQL 8 introduced critical improvements, but one underappreciated change in its locking mechanism impacts teams daily</u></a>. Before MySQL 8, a DDL statement (e.g., ALTER TABLE) locked only the target table. In MySQL 8, DDL operations now extend metadata locks to tables linked by foreign keys.</p><p>Imagine this scenario:</p><p><strong>A Migration Runs:</strong> A DROP COLUMN on the <strong>users</strong> table requests an exclusive MDL.</p><p><strong>Dependencies Ignited:</strong> MySQL 8 locks the <strong>profiles</strong> table (linked via foreign key).</p><p><strong>Queries Back Up:</strong> Reads/writes on both tables time out after lock_wait_timeout.</p><p><strong>Symptoms Spread:</strong> APIs fail, dashboards freeze, teams chase false leads.</p><p><strong>The Hidden Cost:</strong> Engineers manually trace foreign keys; customers see unrelated errors.</p><h3 id="how-can-causely-help">How Can Causely Help?</h3><p>Causely discovers dependencies such as users ↔ profiles, and observes how clients interact with the tables. 
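To make the fan-out concrete, here is a small, hypothetical sketch of how a DDL's metadata lock can spread across foreign-key links. The table names are illustrative, and the one-directional walk is a simplification (MySQL actually locks related tables in both directions of a foreign-key relationship):

```go
package main

import "fmt"

// Hypothetical schema: profiles and orders each hold a foreign key to users.
// fkEdges maps each table to the tables linked to it by foreign keys.
var fkEdges = map[string][]string{
	"users":    {"profiles", "orders"},
	"profiles": {},
	"orders":   {},
}

// lockedByDDL returns every table whose metadata lock a DDL statement on
// target would acquire under MySQL 8 semantics: the target itself plus all
// tables reachable through foreign-key links (breadth-first walk).
func lockedByDDL(target string, edges map[string][]string) []string {
	seen := map[string]bool{target: true}
	queue := []string{target}
	var locked []string
	for len(queue) > 0 {
		t := queue[0]
		queue = queue[1:]
		locked = append(locked, t)
		for _, linked := range edges[t] {
			if !seen[linked] {
				seen[linked] = true
				queue = append(queue, linked)
			}
		}
	}
	return locked
}

func main() {
	// A DROP COLUMN on users also ends up locking profiles and orders.
	fmt.Println(lockedByDDL("users", fkEdges))
}
```

This is exactly the kind of dependency walk that engineers otherwise do by hand when they trace foreign keys during an incident.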
Based on the symptoms it detects, it infers that the root cause is the DDL Locking during the database migration.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/05/Screenshot-2025-04-22-at-6.12.27-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1053" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/05/Screenshot-2025-04-22-at-6.12.27-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/05/Screenshot-2025-04-22-at-6.12.27-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/05/Screenshot-2025-04-22-at-6.12.27-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/05/Screenshot-2025-04-22-at-6.12.27-PM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Figure 1: Causely discovers Database Tables with dependent entities</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/05/Screenshot-2025-04-22-at-6.13.21-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1059" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/05/Screenshot-2025-04-22-at-6.13.21-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/05/Screenshot-2025-04-22-at-6.13.21-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/05/Screenshot-2025-04-22-at-6.13.21-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/05/Screenshot-2025-04-22-at-6.13.21-PM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;"> Figure 2: Causely discovers Database Tables and its clients</span></figcaption></figure><p>Our platform can now infer root causes like DDL locking based on observed symptoms:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://causely-blog.ghost.io/content/images/2025/05/Screenshot-2025-04-22-at-6.15.26-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1047" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/05/Screenshot-2025-04-22-at-6.15.26-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/05/Screenshot-2025-04-22-at-6.15.26-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/05/Screenshot-2025-04-22-at-6.15.26-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/05/Screenshot-2025-04-22-at-6.15.26-PM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;"> Figure 3: DDL Excessive Lock inferred by Causely with observed symptoms </span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/05/Screenshot-2025-04-22-at-6.15.37-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1058" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/05/Screenshot-2025-04-22-at-6.15.37-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/05/Screenshot-2025-04-22-at-6.15.37-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/05/Screenshot-2025-04-22-at-6.15.37-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/05/Screenshot-2025-04-22-at-6.15.37-PM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;"> Figure 4: Causality view of how DDL Excessive Lock propagates</span></figcaption></figure><p>With Causely’s automated “DDL Excessive Lock” detection, engineers instantly pinpoint stalled schema changes—no more manual foreign-key tracing. This MySQL metadata-locking issue is just <a href="https://docs.causely.ai/root-causes/overview/?ref=causely-blog.ghost.io"><u>one of the many root causes we deliver out of the box</u></a>. 
Explore the full library of insights to see how Causely can help you build reliable, resilient data pipelines.</p><h3 id="the-network-effect-of-shared-learning">The Network Effect of Shared Learning</h3><p>This is how modern observability evolves: <strong>real problems solved once, scaled to many</strong>. Every collaboration like Yext’s isn’t just a fix; it’s a force multiplier that makes the platform more knowledgeable and eliminates manual troubleshooting. Every new root cause we learn from strengthens the entire platform. Join a community of engineers building a smarter, more resilient future.</p><p>&nbsp;<strong>For engineers tired of playing whack-a-mole with outages</strong>:</p><ul><li><a href="https://www.causely.ai/product?ref=causely-blog.ghost.io"><u>Learn more about Causely</u></a></li><li>See Causely in action: <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io"><u>Request a demo</u></a></li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causal Reasoning: The Missing Piece to Service Reliability]]></title>
      <link>https://causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability</link>
      <guid>https://causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability</guid>
      <pubDate>Tue, 22 Apr 2025 16:30:37 GMT</pubDate>
      <description><![CDATA[Assuring service reliability is the most critical goal of IT. It was never easy, and it is getting increasingly complex as businesses require greater speed, agility, and scalability to stay competitive and respond quickly to changing market demands. These needs are driving the adoption of microservices architectures, enabling organizations to build and deploy applications with increased flexibility, resilience, and efficiency at scale.  

But there are no free lunches - this adoption comes with a]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/04/ChatGPT-Image-Apr-22--2025-at-12_23_07-PM.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Assuring service reliability is the most critical goal of IT. It was never easy, and it is getting increasingly complex as businesses require greater speed, agility, and scalability to stay competitive and respond quickly to changing market demands. These needs are driving the adoption of microservices architectures, enabling organizations to build and deploy applications with increased flexibility, resilience, and efficiency at scale.&nbsp;&nbsp;</p><p>But there are no free lunches - this adoption comes with a cost. As organizations adopt microservices, they encounter new operational challenges. Microservices architectures are dynamic, ever-evolving and loosely coupled, with intricate dependencies and interactions between services. Although built with standardized building blocks and common patterns, the emergent behavior of each system is unique and constantly changing as components are rewritten, upgraded, or replaced. The simplicity of small, loosely coupled services can quickly become overshadowed by the complexity of managing hundreds (or thousands) of interdependent components.&nbsp;</p><p>Continuing the decades-long trend in Observability of collecting more and more data won’t get us to the desired state of assuring service reliability. Furthermore, feeding this data to a machine and hoping the machine will magically generate the answers required to continuously assure service reliability is a false hope. Machines trained on yesterday's data may be able to understand the past but cannot make the real-time decisions necessary to continuously assure service levels, especially given the dynamic nature of cloud-native environments.&nbsp;</p><p>At Causely, we believe that the key to overcoming these challenges lies in <strong>causal knowledge</strong> captured as part of an <strong>ontology</strong>, to enable reasoning about cause-and-effect relationships in complex systems. 
This stands in contrast to the industry's growing reliance on simply sending alerts to Large Language Models (LLMs), and we believe <strong>causal reasoning</strong> is the critical missing piece for <strong>autonomous service reliability</strong>.&nbsp;</p><h2 id="why-traditional-approaches-fall-short">Why Traditional Approaches Fall Short&nbsp;</h2><p>Some organizations attempt to address reliability by dumping vast amounts of raw, unstructured telemetry into ML algorithms in the hope of surfacing meaningful patterns. When anomalies are detected, these observations are sometimes passed to LLMs to generate plausible explanations. While this can help contextualize events, it often falls short where it matters most. LLMs, by design, are general-purpose tools trained on broad, historical data. They excel at pattern recognition and language generation but lack the deep, real-time causal reasoning needed to adapt to novel, dynamic environments. They may describe "what" happened, but they struggle to uncover "why" it happened and "what" to do about it in a new, unseen context. Even when LLMs are fine-tuned on telemetry patterns, they generalize across environments. But reliability failures are context specific. What broke in one deployment doesn’t explain a novel failure in another deployment.&nbsp;</p><h3 id="example-1-%E2%80%93-misleading-latency-correlations-across-services"><strong>Example 1 – Misleading Latency Correlations Across Services:</strong>&nbsp;</h3><p>In a microservices-based e-commerce platform, a simultaneous latency spike was observed in the checkout, inventory, and payment services. Observability tools showed strong correlations between these services, leading engineers to suspect the inventory service as the root cause. However, the real issue was a slow database query in the product-catalog service, which affected the three components above. 
The correlation misled the team into focusing on the wrong area, demonstrating the limitations of correlation without causal context.&nbsp;</p><h3 id="example-2-%E2%80%93-misinterpreting-memory-spikes-as-service-defects"><strong>Example 2 – Misinterpreting Memory Spikes as Service Defects:</strong>&nbsp;</h3><p>An alert for repeated pod restarts in the recommendation engine, accompanied by memory spikes across backend services, led engineers to suspect a memory leak. Observability tools flagged backend services as anomalous based on correlated metrics. However, the actual root cause was a recent frontend change that dramatically increased the frequency and size of incoming requests. The backend services were merely reacting to an upstream trigger. This highlights how correlation can obscure the true origin of a problem, especially when external factors are involved.&nbsp;</p><p>Relying on statistical correlation alone can be dangerous. Correlations can be misleading without causal grounding, leading teams to chase false positives, miss root causes, or implement ineffective remediations.&nbsp;</p><h2 id="beyond-monitoring-towards-autonomous-operations">Beyond Monitoring: Towards Autonomous Operations&nbsp;</h2><p>Traditional observability focuses on "what" happened. Causal Reasoning focuses on “why”. Causal Reasoning captures, represents, understands and analyzes cause-and-effect relationships and uses these, among other inferences, to automatically infer root causes based on observed anomalies. By embracing Causal Reasoning, organizations can move beyond the reactive model of monitoring and alerting to a world of <strong>autonomous operations</strong>, where systems can diagnose and heal themselves with minimal human intervention. This is essential for achieving the promise of resilient, scalable, always-on cloud-native applications.&nbsp;</p><p>Causal Reasoning is driven by <strong>ontology</strong>. 
An <strong>ontology</strong> is a formal model that defines:&nbsp;</p><ul><li>The types of entities, attributes and relationships in a domain, including root causes and symptoms&nbsp;</li><li>The relationships that can exist between entities, including the causality relationships between root causes and symptoms, and attribute dependencies&nbsp;</li><li>The behaviors or constraints (e.g., "a pod can be scheduled on one node", "a pod can have multiple containers")&nbsp;</li></ul><p>It’s like the grammar and vocabulary for talking about a subject.&nbsp;</p><p>Causal Reasoning uses a <strong>knowledge graph</strong> to organize the real-world information based on the ontology:&nbsp;</p><ul><li>Nodes are specific instances of entities (e.g., checkout-service, inventory-service, payment-service, production-database)&nbsp;</li><li>Edges are actual relationships between the instances (e.g., checkout-service -&gt; depends_on -&gt; production-database)&nbsp;</li><li>Edges can also represent attribute dependencies (e.g. “calls to a user-facing API invoke the backend GRPC method”, “backend GRPC method invocation produces async messages on a specific topic”)&nbsp;</li><li>Metadata: e.g. CPU usage, error logs, deployment time, configs&nbsp;</li></ul><p>The knowledge graph is the filled-out version of the ontology, populated with facts. The discovered topology of a microservices application is a knowledge graph, which describes the application components and their relationships using an ontology. Like a <strong>semantic network</strong>, it describes what is, but it doesn’t say anything about <strong>why</strong> or <strong>what will happen if</strong> something changes.&nbsp;</p><p>Using the ontology and the knowledge graph, Causal Reasoning automatically generates a <strong>causal graph</strong>. 
A causal graph is a directed acyclic graph (DAG) with a focus on <strong>why things happen</strong>:&nbsp;</p><ul><li>Nodes are specific causes and observations&nbsp;</li><li>Directed edges represent <strong>causal links</strong>, not just association&nbsp;</li><li>Example: DatabaseMalfunction -&gt; causes -&gt; ClientServiceErrors&nbsp;</li><li>Allows you to ask "<strong>what if</strong>" questions:&nbsp;</li><li><em>What happens to service errors if the database is recovered?</em>&nbsp;</li></ul><p>In short, a knowledge graph describes what is connected to what, while a causal graph describes what causes what.&nbsp;</p><h2 id="causal-reasoning-engineering-intelligence-into-service-operations">Causal Reasoning: Engineering Intelligence into Service Operations&nbsp;</h2><p>The causal knowledge in the ontology captures essential system behaviors and relationships without getting lost in the weeds. Driven by the causal knowledge, causal reasoning enables engineering teams to focus on what matters. Instead of reacting to every blip on a dashboard, causal reasoning drives a <em>top-down</em> focus on the critical causes that impact service reliability.&nbsp;</p><p>Using causal reasoning, we can:&nbsp;</p><ul><li>Understand why a service's performance degraded, not just that it did.&nbsp;</li><li>Infer the root causes instead of guessing based on symptoms.&nbsp;</li><li>Develop proactive, preventive strategies instead of reactive firefighting.&nbsp;</li></ul><p>Causal reasoning empowers teams to diagnose, remediate, and even <strong>predict and prevent</strong> risks to service-level objectives (SLOs) with clarity and confidence.&nbsp;</p><h2 id="causely-a-purpose-built-autonomous-reliability-system">Causely: A Purpose-built Autonomous Reliability System&nbsp;</h2><p>At Causely, we’ve developed a purpose-built Causal Reasoning Platform for service reliability. 
The key tenets of the platform are:&nbsp;</p><ul><li><strong>Causal model</strong>: an ontology describing cloud-native application environments, including the causal knowledge of root causes, symptoms and causality between them&nbsp;</li><li><strong>Topology graph</strong>: a knowledge graph of the specific managed environment. The graph is generated automatically by discovering the managed environment.&nbsp;</li><li><strong>Abductive inference engine</strong>: an engine that automatically generates a causality graph from the ontology and the topology graph and uses it in real time to infer root causes based on observed symptoms/anomalies&nbsp;</li></ul><p>Causely solves a problem that no other vendor is solving by delivering:&nbsp;</p><ul><li><strong>Clarity in Complexity:</strong> Our models scale with your systems, maintaining meaningful insights even as your architecture evolves and grows more intricate.&nbsp;</li><li><strong>Actionable Insights:</strong> We don't just flag anomalies—we infer root causes and deliver clear, prioritized paths to resolution.&nbsp;</li><li><strong>Proactive Prevention:</strong> With an ontology and causal reasoning, we spot risks before they become incidents, shifting organizations from reactive to proactive.&nbsp;</li><li><strong>Seamless Integration:</strong> Our platform integrates with your workflows and CI/CD pipelines, delivering instant value without requiring manual retraining or rule-writing.&nbsp;</li></ul><h2 id="conclusion">Conclusion&nbsp;</h2><p>We do see LLMs as a powerful technology that can benefit from being used in tandem with causal reasoning; stay tuned for more on that.</p><p>Causal Reasoning is the foundation of Causely’s solution. 
It empowers organizations to cut through complexity, deliver precise insights, prevent downtime, and free engineers to focus on what matters most: innovation.&nbsp;</p><p>Let us show you the power of <a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>causal reasoning</u></a> and cross-organizational collaboration in cloud-native environments. See Causely for yourself. <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a> or <a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>start your free trial</u></a> now.&nbsp;</p><p>#cloudnative #servicereliability #causalreasoning #abstraction #siteReliabilityEngineering #sre #DevOps #observability #AI #ML&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[How We Used Causely to Solve a Crashing Bug in Our Own App—Fast]]></title>
      <link>https://causely.ai/blog/how-we-used-causely-to-solve-a-crashing-bug-in-our-own-app-fast-2</link>
      <guid>https://causely.ai/blog/how-we-used-causely-to-solve-a-crashing-bug-in-our-own-app-fast-2</guid>
      <pubDate>Thu, 17 Apr 2025 17:16:38 GMT</pubDate>
      <description><![CDATA[At Causely, we don’t just ship software – we run a reasoning platform designed to detect, diagnose, and resolve failure conditions with minimal human intervention. Our own cloud-native application runs in a highly distributed environment, with dozens of interdependent microservices communicating in real-time. It’s complex, dynamic, and constantly evolving—just like the environments our customers run. 

Recently, we encountered an issue that perfectly illustrates the value of Causely’s Causal Rea]]></description>
      <author>Christine Miller</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/04/iStock-901208180-3.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>At Causely, we don’t just ship software – we run a reasoning platform designed to detect, diagnose, and resolve failure conditions with minimal human intervention. Our own cloud-native application runs in a highly distributed environment, with dozens of interdependent microservices communicating in real-time. It’s complex, dynamic, and constantly evolving—just like the environments our customers run.&nbsp;</p><p>Recently, we encountered an issue that perfectly illustrates the value of Causely’s Causal Reasoning Platform in action.&nbsp;&nbsp;</p><h2 id="a-crash-that-didn%E2%80%99t-want-to-be-found">A Crash That Didn’t Want to Be Found&nbsp;</h2><p>While reviewing our staging environment— which uses our existing OpenTelemetry instrumentation to identify root causes in our production deployment of Causely — I noticed something disturbing: a frequent crash failure had been quietly recurring over six days in one of our analytics microservices.&nbsp;</p><p>This type of issue is notoriously difficult to detect. The logs were noisy. The failures were intermittent. The symptoms appeared scattered—some in one analysis module, others in a different component entirely. This issue would’ve gone unnoticed in most environments or been chalked up to transient behavior. But Causely identified the cause. Not as an alert or a spike in a dashboard – but as an active root cause that is impacting our system. It wasn’t just showing us individual failures. 
It was showing us causality - the propagation behind the symptoms.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/2025/04/Screenshot-2025-04-17-at-12.53.48-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1140" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/04/Screenshot-2025-04-17-at-12.53.48-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/04/Screenshot-2025-04-17-at-12.53.48-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/04/Screenshot-2025-04-17-at-12.53.48-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/04/Screenshot-2025-04-17-at-12.53.48-PM.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Different subcomponents—analysis-3, analysis-5—were failing in seemingly unrelated ways. One had high error rates; another had frequent panics. Without Causely, we would’ve been chasing these symptoms in isolation. Logging into pods, checking dashboards, comparing timelines. A manual, time-consuming wild goose chase.&nbsp;</p><h2 id="what-causely-did-differently">What Causely Did Differently&nbsp;</h2><p>Causely automatically correlated these separate symptoms across services and time, inferring a single root cause: a concurrency issue in a shared in-memory map used across multiple analysis workers.&nbsp;</p><p>The issue? A concurrent write-read condition that would only manifest under higher-volume conditions – something nearly impossible to catch in development or test environments. 
But Causely connected the dots through symptom propagation and causal relationships, allowing us to zero in on the problem fast.&nbsp;</p><h2 id="technical-deep-dive-the-bug-behind-the-crash">Technical Deep Dive: The Bug Behind the Crash&nbsp;</h2><p>The crash was caused by concurrent access to a Go map, which is unsafe for concurrent reads and writes without explicit synchronization.&nbsp;</p><p>We were using a plain <em>map[string]SomeStruct</em> to store intermediate analysis results across worker goroutines. This map was being:&nbsp;</p><ul><li>Written to by one goroutine collecting event evidence in real time&nbsp;</li><li>Read from simultaneously by another goroutine responsible for emitting outputs&nbsp;</li></ul><p>Under low volume, this worked fine. But under sustained load in production, the race condition emerged—resulting in fatal runtime errors like:&nbsp;</p><p><em>fatal error: concurrent map read and map write</em>&nbsp;</p><p>&nbsp;This is a classic Go gotcha. The fix was straightforward: we wrapped the map with a <em>sync.RWMutex</em> to protect both reads and writes, ensuring thread safety. 
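</p><p>A minimal sketch of the pattern (the type and field names here are illustrative stand-ins, not our actual code):</p>

```go
package main

import (
	"fmt"
	"sync"
)

// Result stands in for the intermediate analysis value; the real
// struct is internal and not shown in this post.
type Result struct{ Score float64 }

// SafeResults guards a plain map with a sync.RWMutex so one goroutine
// can write while another reads, avoiding the fatal
// "concurrent map read and map write" error.
type SafeResults struct {
	mu sync.RWMutex
	m  map[string]Result
}

func NewSafeResults() *SafeResults {
	return &SafeResults{m: make(map[string]Result)}
}

// Set takes the exclusive lock for writes.
func (s *SafeResults) Set(key string, r Result) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = r
}

// Get takes the shared lock, so concurrent readers don't block each other.
func (s *SafeResults) Get(key string) (Result, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.m[key]
	return r, ok
}

func main() {
	res := NewSafeResults()
	var wg sync.WaitGroup
	wg.Add(2)
	// One writer and one reader running concurrently, as in the bug.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			res.Set("evidence", Result{Score: float64(i)})
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			res.Get("evidence")
		}
	}()
	wg.Wait()
	r, ok := res.Get("evidence")
	fmt.Println(ok, r.Score) // prints "true 999"
}
```

<p>Running this with Go’s race detector (<code>go run -race</code>) before and after adding the mutex is a quick way to confirm the fix.</p><p>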
Alternatively, we could have used a <em>sync.Map</em>, but since our access patterns were relatively simple and performance-sensitive, the mutex approach was more appropriate.&nbsp;</p><p>This was one of those bugs where:&nbsp;</p><ul><li>It was impossible to reproduce locally&nbsp;</li><li>The crash symptoms varied depending on which goroutine accessed the map first&nbsp;</li><li>And the logs, while technically correct, gave no useful hint unless you already suspected the concurrency issue&nbsp;</li></ul><p>&nbsp;Without Causely surfacing the underlying causality across services and across time, this would have remained an intermittent ghost bug.&nbsp;</p><h2 id="from-root-cause-to-resolution-%E2%80%93-in-hours-not-days">From Root Cause to Resolution – In Hours, Not Days&nbsp;</h2><p>Thanks to the inference Causely provided, we identified the concurrency bug, coded a fix, and shipped it – all within a couple of hours.&nbsp;</p><p>This wasn’t just about speed. It was about precision. Without Causely, our team might have:&nbsp;</p><ul><li>Spent days troubleshooting unrelated alerts and errors&nbsp;</li><li>Overlooked the root cause due to insufficient signal&nbsp;</li><li>Delayed resolving an issue that was actively degrading reliability&nbsp;</li></ul><h2 id="why-this-matters">Why This Matters&nbsp;</h2><p>Crashes like these don’t always surface in ways traditional monitoring or modern observability systems can catch. They emerge over time, appear intermittent, and produce symptoms that span components.&nbsp;</p><p>Causely gave us what observability tools couldn’t: contextual understanding. Observability tools show you context. Causely helps you understand it. It didn’t just tell us something was broken. It told us what was breaking, why, and what needed to change.&nbsp;</p><p>This is the power of causal reasoning over correlation.&nbsp;</p><p>We built Causely to operate in the most demanding environments – and we hold ourselves to that same standard. 
This incident was a clear example of how our platform enables us to move faster and with more confidence, and to resolve issues before they impact users.&nbsp;</p><p>And that’s exactly the kind of reliability we aim to bring to every engineering team.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Eating Our Own Dog Food: Causely’s Journey with OpenTelemetry]]></title>
      <link>https://causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai</link>
      <guid>https://causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai</guid>
      <pubDate>Tue, 25 Mar 2025 09:25:00 GMT</pubDate>
      <description><![CDATA[Implementing OpenTelemetry at the core of our observability strategy for Causely’s SaaS product was a natural decision. This post shares context on our rationale and how the combination of OpenTelemetry and causal reasoning underpin our platform.]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/02/2-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Implementing OpenTelemetry at the core of our observability strategy for <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely’s SaaS product</a> was a natural decision. In this article, I'll share some background on our rationale and how the combination of OpenTelemetry and causal reasoning addresses several critical requirements that allow us to scale our services more efficiently.</p><h2 id="avoiding-common-observability-pitfalls">Avoiding common observability pitfalls</h2><p>We already know – based on <a href="https://www.causely.ai/company?ref=causely-blog.ghost.io" rel="noreferrer">decades of experience</a> working in and with operations teams in the most challenging environments – that bridging the gap between the vast ocean of observability data and actionable insights has been and continues to be a <a href="https://www.causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters?ref=causely-blog.ghost.io" rel="noreferrer">major pain point</a>. This is especially true in the complex world of cloud-native applications.</p><h3 id="missing-application-insights">Missing application insights</h3><p><a href="https://www.oreilly.com/library/view/observability-engineering/9781492076438/ch01.html?ref=causely-blog.ghost.io" rel="noopener">Application observability</a> remains an elusive beast for many, especially in complex microservices architectures. While infrastructure monitoring has become readily available, neglecting application data paints an incomplete picture, hindering effective troubleshooting and operations.</p><h3 id="siloed-solutions">Siloed solutions</h3><p>Traditional observability solutions have relied on siloed, proprietary agents and data sources, leading to fragmented visibility across teams and technologies. 
This makes it difficult to understand the complete picture of service composition and dependencies.</p><p>To me, this is like trying to solve a puzzle with missing pieces – that’s essentially the problem that many DevOps teams face today – piecing together a picture of how microservices, serverless functions, databases, and other elements interact with one another, and with the underlying infrastructure and cloud services they run on. This makes it <a href="https://www.causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer">hard to collaborate and troubleshoot</a>; it's a struggle to pinpoint the root cause of performance issues or outages.</p><h3 id="vendor-lock-in">Vendor lock-in</h3><p>Many vendors’ products also lock customers’ data into their cloud services. This can result in customers paying through the nose, because licensing costs are predicated on the volume of data that is being collected and stored in the service providers’ backend SaaS. It can also be very hard to exit these services once <a href="https://www.infoworld.com/article/3623721/cloud-lock-in-is-real.html?ref=causely-blog.ghost.io" rel="noopener">locked in</a>.</p><p>These are all pitfalls we wanted to avoid at Causely as we set out to build our Causal Reasoning Platform.</p><h2 id="the-pillars-of-our-observability-architecture-pointed-us-to-opentelemetry">The pillars of our observability architecture pointed us to OpenTelemetry</h2><p><a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noopener">OpenTelemetry</a> provides us with a path to break free from these limitations, establishing a common framework that transcends programming languages and platforms that we are using to build our services, and satisfying the requirements laid out in the pillars of our observability architecture:</p><h3 id="precise-instrumentation">Precise instrumentation</h3><p>OpenTelemetry offers <a 
href="https://youtu.be/RuyUXBOdjGI?feature=shared&ref=causely-blog.ghost.io" rel="noopener">automatic instrumentation</a> options that minimize the manual code modifications we need to make and streamline the integration of our internal observability capabilities into our chosen backend applications.</p><h3 id="unified-picture">Unified picture</h3><p>By providing a standardized data model powered by <a href="https://opentelemetry.io/docs/concepts/semantic-conventions/?ref=causely-blog.ghost.io" rel="noopener">semantic conventions</a>, OpenTelemetry enables us to paint an end-to-end picture of how all of our services are composed, including application and infrastructure dependencies. We can also gain access to critical telemetry information, utilizing this semantically consistent data across multiple backend microservices even when written in different languages.</p><h3 id="vendor-neutral-data-management">Vendor-neutral data management</h3><p>OpenTelemetry allows us to avoid locking our application data into third-party vendors’ services by decoupling it from proprietary vendor formats. This gives us the freedom to keep choosing the best tools based on the value they provide. If something new comes along that we want to exploit, we can easily plug it into our architecture.</p><h3 id="resource-optimized-observability">Resource-optimized observability</h3><p>With OpenTelemetry, we can take a <a href="https://www.causely.ai/blog/be-smarter-about-observability-data?ref=causely-blog.ghost.io" rel="noreferrer">top-down approach</a> to data collection, starting with the problems we are looking to solve and eliminating unnecessary information. 
Doing so minimizes our storage costs and optimizes the compute resources we need to support our observability pipeline.</p><p>We believe that following these pillars and building our Causal Reasoning Platform on top of OpenTelemetry propels our product’s performance, enables rock-solid reliability, and ensures consistent service experiences for our customers as we scale our business. We also minimize ongoing operational costs, creating a win-win for us and our customers.</p><h2 id="opentelemetry-causal-analysis-scaling-for-performance-and-cost-efficiency">OpenTelemetry + causal analysis: scaling for performance and cost efficiency</h2><p>Ultimately, observability aims to illuminate the behavior of distributed systems, enabling proactive maintenance and swift troubleshooting. Yet isolated failures manifest as cascading symptoms across interconnected services.</p><p>While OpenTelemetry enables back-end applications to use this data to provide a unified picture in maps, graphs and dashboards, the job of figuring out the cause and effect in the correlated data <a href="https://www.causely.ai/blog/devops-may-have-cheated-death-but-do-we-all-need-to-work-for-the-king-of-the-underworld/?ref=causely-blog.ghost.io">still requires highly skilled resources</a>. This process can also be very time-consuming, tying up personnel across multiple teams, with ownership for different elements of overall services.</p><p>There is a lot of noise in the industry right now about how AI and LLMs are going to magically come to the rescue, but reality paints a different picture. 
All of the solutions available in the market today focus on correlating data versus uncovering a direct understanding of causal relationships between problems and the symptoms they cause, leaving DevOps teams with noise, not answers.</p><p>Traditional AI and LLMs also require <a href="https://www.snowflake.com/guides/what-large-language-model-and-what-can-llms-do-data-science/?ref=causely-blog.ghost.io" rel="noopener">massive amounts of data</a> as input for training and learning behaviors on a continuous basis. This is data that ultimately ends up being transferred and stored in some form of SaaS. Processing these large datasets is very <a href="https://towardsdatascience.com/behind-the-millions-estimating-the-scale-of-large-language-models-97bd7287fb6b?ref=causely-blog.ghost.io" rel="noopener">computationally intensive</a>. This all translates into significant cost overheads for the SaaS providers as customer datasets grow over time – costs that ultimately result in ever-increasing bills for customers.</p><h3 id="at-causely-were-taking-a-different-approach">At Causely, we're taking a different approach</h3><p>Our causal reasoning software provides operations and engineering teams with an understanding of the “why”, which is crucial for effective and timely troubleshooting and decision-making.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2023/06/millionways-causality-chain.png" class="kg-image" alt="Application: Database Connection Noisy Neighbor causing service and infrastructure symptoms" loading="lazy" width="744" height="231" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2023/06/millionways-causality-chain.png 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2023/06/millionways-causality-chain.png 744w" sizes="(min-width: 720px) 720px"></figure>
<!--kg-card-begin: html-->
<span style="font-size: 8pt;"><em>Example causality chain: Database Connection Noisy Neighbor causing service and infrastructure symptoms</em></span>
<!--kg-card-end: html-->
<p>Our Causal Reasoning Platform uses predefined models of how problems behave and propagate. When combined with real-time information about a system’s specific structure, Causely computes a map linking all potential problems to their observable symptoms.</p><p>This map acts as a reference guide, eliminating the need to analyze massive datasets every time the platform encounters an issue. Think of it as checking a dictionary instead of reading an entire encyclopedia.</p><p>The bottom line is, in contrast to traditional AI, Causely operates on a much smaller dataset, requires far fewer resources for computation, and provides more meaningful actionable insights, all of which translate into lower ongoing operational costs and profitable growth.</p><h2 id="summing-it-up">Summing it up</h2><p>There’s massive potential for causal analysis and OpenTelemetry to come together to tackle the limitations of traditional AI to get to the “why.”  This is what we’re building at Causely – <a href="https://www.causely.ai/blog/otel-actionable-insight?ref=causely-blog.ghost.io" rel="noreferrer">check out our recent integration news</a>. Doing so delivers numerous benefits:</p><ul><li><strong>Less time on Ops, more time on Dev:</strong> OpenTelemetry provides standardized data while Causely analyzes it to automate the root cause analysis (RCA) process, which will significantly reduce the time our DevOps teams have to spend on troubleshooting.</li><li><strong>Instant gratification, no training lag:</strong> We can eliminate AI’s slow learning curve, because Causely leverages OpenTelemetry’s semantic language and the Causal Reasoning Platform’s domain knowledge of cause and effect to deliver actionable results, right out of the box without massive amounts of data and with no training lag!</li><li><strong>Small data, lean computation, big impact:</strong> Unlike traditional AI’s data gluttony and significant computational overheads, Causely thrives on targeted data streams. 
OpenTelemetry’s smart filtering keeps the information flow lean, allowing Causely to identify the root causes with a significantly smaller dataset and compute footprint.</li><li><strong>Fast root cause identification:</strong> Traditional AI might tell us “<a href="https://www.kdnuggets.com/2019/01/dr-data-ice-cream-linked-shark-attacks.html?ref=causely-blog.ghost.io" rel="noopener">ice cream sales and shark attacks rise together</a>,” but causal reasoning reveals the truth – it’s the summer heat, not the sharks, driving both! By understanding cause-and-effect relationships, Causely cuts through the noise and identifies the root causes behind performance degradation and service malfunctions.</li></ul><p>Having these capabilities is critical if we want to move beyond the labor-intensive processes associated with how RCA is performed in DevOps today, and eventually achieve autonomous service reliability. This is why we are eating our own dog food and using Causely as part of our tech stack to manage the services we provide to customers.</p><p>Want to learn more about our <a href="https://www.causely.ai/blog/otel-actionable-insight?ref=causely-blog.ghost.io" rel="noreferrer">integration with OpenTelemetry</a> or see if Causely can help you build better, more reliable cloud-native applications? <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">Book a meeting</a> with the Causely team. We'd love to chat!  </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/03/sharks-400x400-1-1.png" class="kg-image" alt="" loading="lazy" width="304" height="277"><figcaption><span style="white-space: pre-wrap;">AI might tell us that sharks love sugar, but causal reasoning reveals the truth!</span></figcaption></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Ep15 - Ask Me Anything About DevOps, Cloud, Kubernetes, Platform Engineering,... w/Endre Sara]]></title>
      <link>https://causely.ai/blog/devopstoolkit-ep15</link>
      <guid>https://causely.ai/blog/devopstoolkit-ep15</guid>
      <pubDate>Thu, 20 Mar 2025 20:30:54 GMT</pubDate>
      <description><![CDATA[In this DevOps Toolkit episode, Endre Sara joins Viktor Farcic for an Ask Me Anything session.]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2025-03-20-at-4.27.09-PM-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>In this DevOps Toolkit episode, Endre Sara joins Viktor Farcic for an Ask Me Anything session. Attendees are invited to ask anything about DevOps, Cloud, Kubernetes, Platform Engineering, containers, or anything else. </p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/lK0Hh47YUc8?si=gNg73Mq_4Jpqc96s" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Using OpenTelemetry and the OTel Collector for Logs, Metrics, and Traces]]></title>
      <link>https://causely.ai/blog/using-opentelemetry-and-the-otel-collector-for-logs-metrics-and-traces</link>
      <guid>https://causely.ai/blog/using-opentelemetry-and-the-otel-collector-for-logs-metrics-and-traces</guid>
      <pubDate>Thu, 13 Mar 2025 16:03:00 GMT</pubDate>
      <description><![CDATA[This production-focused guide offers an understanding of what OpenTelemetry is, its core components, and a detailed look at the OTel Collector.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/02/otelcollector-lesswhite.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>OpenTelemetry (fondly known as OTel) is an open-source project that provides a unified set of APIs, libraries, agents, and instrumentation to capture and export logs, metrics, and traces from applications. The project’s goal is to standardize observability across various services and applications, enabling better monitoring and troubleshooting.</p><p><a href="https://www.causely.ai/company?ref=causely-blog.ghost.io" rel="noreferrer">Our team</a> at Causely has adopted OpenTelemetry within our own platform, and we recently <a href="https://www.causely.ai/blog/otel-actionable-insight?ref=causely-blog.ghost.io" rel="noreferrer">announced a formal integration</a> with OpenTelemetry, which prompted us to share a production-focused guide. Our goal is to help developers, DevOps engineers, software engineers, and SREs understand what OpenTelemetry is, its core components, and a detailed look at the OpenTelemetry Collector (OTel Collector). This background will help you use OTel and the OTel Collector as part of a comprehensive strategy to monitor and observe applications.</p><h2 id="what-data-does-opentelemetry-collect">What Data Does OpenTelemetry Collect?</h2><p>There are 3 types of data that are gathered by OpenTelemetry using the OTel Collector: logs, metrics, and traces.</p><h3 id="logs">Logs</h3><p>Logs are records of events that occur within an application. They provide a detailed account of what happened, when it happened, and any relevant data associated with the event. Logs are helpful for debugging and understanding the behavior of applications.</p><p>OpenTelemetry collects and exports logs, providing insights into events and errors that occur within the system. 
For example, if a user reports a slow response time in a specific feature of the application, engineers can use OpenTelemetry logs to trace back the events leading up to the reported issue.</p><h3 id="metrics">Metrics</h3><p>Metrics are quantitative data that measure the performance and health of an application. Metrics help in tracking system behavior and identifying trends over time. OpenTelemetry collects metrics data, which helps in tracking resource usage, system performance, and identifying anomalies.</p><p>For instance, if a spike in CPU usage is detected using OpenTelemetry metrics, engineers can investigate the potential issue using the OTel data collected and make necessary adjustments to optimize performance.</p><p>Developers use OpenTelemetry metrics to see granular resource utilization data, which helps understand how the application is functioning under different conditions.</p><h3 id="traces">Traces</h3><p>Traces provide a detailed view of request flows within a distributed system. Traces help understand the execution path, diagnose application behaviors, and see the interactions between different services.</p><p>For example, if a user reports slow response times on a website, developers can use trace data to help better identify which service is experiencing issues. Traces can also help in debugging issues such as failed requests or errors by providing a step-by-step view of how requests are processed through the system.</p><h2 id="introduction-to-otel-collector">Introduction to OTel Collector</h2><p>You can deploy the OTel Collector as a standalone agent or as a sidecar alongside your application. The OTel Collector also includes some helpful features for sampling, filtering, and transforming data before sending it to a monitoring backend.</p><h3 id="how-it-works">How it Works</h3><p>The OTel Collector works by receiving telemetry data from many different sources, processing it based on configured pipelines, and exporting it to chosen backends. 
This modular architecture allows for customization and scalability.</p><blockquote><strong>The OTel Collector acts as a central data pipeline for collecting, processing, and exporting telemetry data (metrics, logs, traces) within an </strong><a href="https://www.techtarget.com/searchitoperations/tip/Top-observability-tools?ref=causely-blog.ghost.io" rel="noopener"><strong>observability stack</strong></a><strong>.</strong></blockquote><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/otelcollector-e1720626204557.png" class="kg-image" alt="" loading="lazy" width="940" height="548" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/otelcollector-e1720626204557.png 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/otelcollector-e1720626204557.png 940w" sizes="(min-width: 720px) 720px"></figure>
<!--kg-card-begin: html-->
<span style="font-size: 10pt;"><em>Image source: opentelemetry.io</em></span>
<!--kg-card-end: html-->
<p>Here’s a technical breakdown:</p><h3 id="data-ingestion">Data Ingestion:</h3><ul><li>Leverages pluggable receivers for specific data sources (e.g., <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/redisreceiver/README.md?ref=causely-blog.ghost.io" rel="noopener">Redis receiver</a>, <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/mysqlreceiver/README.md?ref=causely-blog.ghost.io" rel="noopener">MySQL receiver</a>).</li><li>Receivers can be configured for specific endpoints, authentication, and data collection parameters.</li><li>Supports various data formats (e.g., native application instrumentation libraries, vendor-specific formats) through receiver implementations.</li></ul><h3 id="data-processing">Data Processing:</h3>
<!--kg-card-begin: html-->
<ul>
<li>Processors can be chained to manipulate the collected data before export.</li>
<li>Common processing functions include:
<ul>
<li><strong>Batching:</strong> Improves efficiency by sending data in aggregates.</li>
<li><strong>Filtering:</strong> Selects specific data based on criteria.</li>
<li><strong>Sampling:</strong> Reduces data volume by statistically sampling telemetry.</li>
<li><strong>Enrichment:</strong> Adds contextual information to the data.</li>
</ul>
</li>
</ul>
<!--kg-card-end: html-->
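<p>As an illustrative sketch, the four processing functions above map onto collector processors roughly like this (the instance names after the slash are arbitrary labels, and the values are placeholders to adapt to your environment):</p>

```yaml
processors:
  batch:                      # batching: send data in aggregates
    send_batch_size: 1024
    timeout: 5s
  filter/drop-debug-logs:     # filtering: drop log records below WARN
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
  probabilistic_sampler:      # sampling: keep a statistical subset of traces
    sampling_percentage: 15
  attributes/add-env:         # enrichment: attach contextual attributes
    actions:
      - key: deployment.environment
        value: production
        action: upsert
```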
<h3 id="data-export">Data Export:</h3><ul><li>Utilizes exporters to send the processed data to backend systems.</li><li>Exporters are available for various observability backends (e.g., <a href="https://www.jaegertracing.io/?ref=causely-blog.ghost.io" rel="noopener">Jaeger</a>, <a href="https://zipkin.io/?ref=causely-blog.ghost.io" rel="noopener">Zipkin</a>, <a href="https://prometheus.io/?ref=causely-blog.ghost.io" rel="noopener">Prometheus</a>).</li><li>Exporter configurations specify the destination endpoint and data format for the backend system.</li></ul><h3 id="internal-representation">Internal Representation:</h3><ul><li>Leverages OpenTelemetry’s internal Protobuf data format (pdata) for efficient data handling.</li><li>Receivers translate source-specific data formats into pdata format for processing.</li><li>Exporters translate pdata format into the backend system’s expected data format.</li></ul><h3 id="scalability-and-configurability">Scalability and Configurability:</h3><ul><li>Designed for horizontal scaling by deploying multiple collector instances.</li><li>Configuration files written in YAML allow for dynamic configuration of receivers, processors, and exporters.</li><li>Supports running as an agent on individual hosts or as a standalone service.</li></ul><blockquote><strong>The OTel Collector is format-agnostic and flexible, built to work with various backend observability systems.</strong></blockquote><h2 id="setting-up-the-opentelemetry-otel-collector">Setting up the OpenTelemetry (OTel) Collector</h2><p>Starting with OpenTelemetry for your new system is a straightforward process that takes only a few steps:</p><ol><li><strong>Download the OTel Collector:</strong> Obtain the latest version from the official <a href="https://opentelemetry.io/docs/collector/?ref=causely-blog.ghost.io" rel="noopener">OpenTelemetry website</a> or your preferred package manager.</li><li><strong>Configure the OTel Collector:</strong> Edit the configuration file to define 
data sources and export destinations.</li><li><strong>Run the OTel Collector:</strong> Start the Collector to begin collecting and processing telemetry data.</li></ol><p>Keep in mind that the example we will show here is relatively simple. A large scale production implementation will require fine-tuning to ensure optimal results. Make sure to follow your OS-specific instructions to deploy and run the OTel collector.</p><p>Next, we need to configure some exporters for your application stack.</p><h2 id="integration-with-popular-tools-and-platforms">Integration with Popular Tools and Platforms</h2><p>Let’s use an example system running a multi-tier web application using <a href="https://github.com/nginx?ref=causely-blog.ghost.io" rel="noopener">NGINX</a>, <a href="https://www.mysql.com/?ref=causely-blog.ghost.io" rel="noopener">MySQL</a>, and <a href="https://redis.io/?ref=causely-blog.ghost.io" rel="noopener">Redis</a>. Each source platform will have some application-specific configuration parameters.</p><h3 id="configuring-receivers">Configuring Receivers</h3>
<!--kg-card-begin: html-->
<h4><span style="font-size: 12pt;">redisreceiver:</span></h4>
<!--kg-card-end: html-->
<ul><li>Replace <code>receiver_name</code> with <code>redisreceiver</code></li><li>Set <code>endpoint</code> to the port where your Redis server is listening (default: 6379)</li><li>You can configure additional options like authentication and collection intervals in the receiver configuration. Refer to the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/redisreceiver/README.md?ref=causely-blog.ghost.io" rel="noopener">official documentation</a> for details.</li></ul>
<!--kg-card-begin: html-->
<h4><span style="font-size: 12pt;">mysqlreceiver:</span></h4>
<!--kg-card-end: html-->
<ul><li>Replace <code>receiver_name</code> with <code>mysqlreceiver</code></li><li>Set endpoint to the connection string for your MySQL server (e.g., <code>mysql://user:password@localhost:3306/database</code>)</li><li>Similar to Redis receiver, you can configure authentication and collection intervals. Refer to the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/mysqlreceiver/README.md?ref=causely-blog.ghost.io" rel="noopener">documentation</a> for details.</li></ul>
<!--kg-card-begin: html-->
<h4><span style="font-size: 12pt;">nginxreceiver:</span></h4>
<!--kg-card-end: html-->
<ul><li>Replace <code>receiver_name</code> with <code>nginxreceiver</code></li><li>Set <code>endpoint</code> to the URL of your NGINX <code>stub_status</code> page (default: <code>http://localhost:80/status</code>), which the receiver scrapes for metrics.</li><li>You can configure what metrics to collect and scraping intervals in the receiver configuration. Refer to the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/nginxreceiver/README.md?ref=causely-blog.ghost.io" rel="noopener">documentation</a> for details.</li></ul><p>The OpenTelemetry Collector can export data to multiple providers including Prometheus, Jaeger, Zipkin, and, of course, <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causely</a>. This flexibility allows users to leverage their existing tools while adopting OpenTelemetry.</p><h3 id="configuring-exporters">Configuring Exporters</h3><p>Replace <code>exporter_name</code> with the actual exporter type for your external system. Here are some common options:</p><ul><li><code>jaeger</code> for <a href="https://www.jaegertracing.io/docs/1.18/opentelemetry/?ref=causely-blog.ghost.io" rel="noopener">Jaeger backend</a></li><li><code>zipkin</code> for <a href="https://opentelemetry.io/docs/languages/js/exporters/?ref=causely-blog.ghost.io" rel="noopener">Zipkin backend</a></li><li><code>otlp/causely</code> for Causely backend</li><li>There are exporters for many other systems as well. Refer to the <a href="https://opentelemetry.io/docs/languages/js/exporters/?ref=causely-blog.ghost.io" rel="noopener">documentation</a> for a complete list.</li></ul><p>Set <code>endpoint</code> to the URL of your external system where you want to send the collected telemetry data. 
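</p><p>Putting receivers, processors, and exporters together, a collector configuration for this example stack might look roughly like the following sketch (all endpoints, credentials, and backend addresses are placeholders to adapt to your environment):</p>

```yaml
receivers:
  redis:
    endpoint: "localhost:6379"
    collection_interval: 10s
  mysql:
    endpoint: "localhost:3306"
    username: otel
    password: ${env:MYSQL_PASSWORD}
    collection_interval: 10s
  nginx:
    endpoint: "http://localhost:80/status"

processors:
  batch: {}

exporters:
  prometheus:                       # scrape endpoint exposed for Prometheus
    endpoint: "0.0.0.0:8889"
  otlp:                             # OTLP/gRPC backend (e.g., Jaeger or Causely)
    endpoint: "backend.example.com:4317"

service:
  pipelines:
    metrics:
      receivers: [redis, mysql, nginx]
      processors: [batch]
      exporters: [prometheus, otlp]
```

<p>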
You might need to configure additional options specific to the chosen exporter (e.g., authentication for Jaeger).</p><p>There is also a growing list of supporting <a href="https://opentelemetry.io/ecosystem/vendors/?ref=causely-blog.ghost.io" rel="noopener">vendors who consume OpenTelemetry data</a>.</p><h2 id="conclusion">Conclusion</h2><p>OpenTelemetry provides a standardized approach to collecting and exporting logs, metrics, and traces. Implementing OpenTelemetry and the OTel Collector offers a scalable and flexible solution for managing telemetry data, making it a popular and effective tool for modern applications.</p><p>You can use OpenTelemetry as part of your monitoring and observability practice to gather data that can drive a better understanding of the state of your applications. The most valuable part of OpenTelemetry is the ability to ingest the data for deeper analysis.</p><h2 id="how-causely-works-with-opentelemetry">How Causely Works with OpenTelemetry</h2><p>At Causely, we use OpenTelemetry as one of many data sources to assure autonomous service reliability for our clients. OpenTelemetry data is ingested by our <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Platform</a>, which is a model-driven AI system that automatically instruments your environment, defines SLOs, and pinpoints risk to reliability – without manual setup. By cutting through observability overload, Causely eliminates complex troubleshooting and ensures services meet reliability expectations. Engineering teams instantly improve service reliability, reduce downtime, and lower operational costs – freeing them to focus on innovation. </p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Shmuel Kliger on Causely’s Integration with OpenTelemetry]]></title>
      <link>https://causely.ai/blog/shmuel-kliger-on-causelys-integration-with-opentelemetry</link>
      <guid>https://causely.ai/blog/shmuel-kliger-on-causelys-integration-with-opentelemetry</guid>
      <pubDate>Wed, 05 Mar 2025 21:19:56 GMT</pubDate>
      <description><![CDATA[Shmuel talks with Techstrong.tv's Alan Shimel about Causely launching its integration with OpenTelemetry, which has redefined observability by standardizing how telemetry data is collected and processed.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/10/techstrong-1.png" type="image/png" />
      <content:encoded><![CDATA[<p>Shmuel talks with Techstrong.tv's Alan Shimel about Causely launching its integration with OpenTelemetry, which has redefined observability by standardizing how telemetry data is collected and processed.</p><p>Click below to watch the interview, or check it out on Techstrong.tv <a href="https://techstrong.tv/videos/interviews/shmuel-kliger-on-causelys-integration-with-opentelemetry?ref=causely-blog.ghost.io" rel="noreferrer">here</a>.</p><figure class="kg-card kg-image-card"><a href="https://techstrong.tv/videos/interviews/shmuel-kliger-on-causelys-integration-with-opentelemetry?ref=causely-blog.ghost.io"><img src="https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2025-03-05-at-4.14.58-PM.png" class="kg-image" alt="" loading="lazy" width="1102" height="619" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/03/Screenshot-2025-03-05-at-4.14.58-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/03/Screenshot-2025-03-05-at-4.14.58-PM.png 1000w, https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2025-03-05-at-4.14.58-PM.png 1102w" sizes="(min-width: 720px) 720px"></a></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Point Your OpenTelemetry Data at Causely to Find What Matters in the Noise]]></title>
      <link>https://causely.ai/blog/point-your-opentelemetry-data-at-causely-to-find-what-matters-in-the-noise</link>
      <guid>https://causely.ai/blog/point-your-opentelemetry-data-at-causely-to-find-what-matters-in-the-noise</guid>
      <pubDate>Wed, 05 Mar 2025 20:31:00 GMT</pubDate>
      <description><![CDATA[Causely is a new player on the observability scene. The main problem their platform addresses is that modern teams are drowning in too many alerts and too much data coming from multiple observability solutions across open-source and 3rd party vendors.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/11/techtimes-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><a href="http://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>&nbsp;is a new player on the observability scene. The main problem their platform addresses is that modern teams are drowning in too many alerts and too much data coming from multiple observability solutions across open-source and 3rd party vendors. </p><p><em>View the full article by Carl Williams on </em><a href="https://www.techtimes.com/articles/309501/20250226/point-your-opentelemetry-data-causely-find-what-matters-noise.htm?ref=causely-blog.ghost.io" rel="noreferrer"><em>Tech Times</em></a><em>. </em></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Announces OpenTelemetry Integration: Solving Observability Overload For Developers]]></title>
      <link>https://causely.ai/blog/causely-announces-opentelemetry-integration-solving-observability-overload-for-developers</link>
      <guid>https://causely.ai/blog/causely-announces-opentelemetry-integration-solving-observability-overload-for-developers</guid>
      <pubDate>Wed, 05 Mar 2025 17:44:14 GMT</pubDate>
      <description><![CDATA[Causely is announcing its integration with OpenTelemetry, bringing a fresh approach to observability that cuts through the noise and surfaces only what matters.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2025-03-05-at-12.43.08-PM.png" type="image/png" />
      <content:encoded><![CDATA[<p>View the original post on <a href="https://tfir.io/causely-announces-opentelemetry-integration-solving-observability-overload-for-developers/?ref=causely-blog.ghost.io" rel="noreferrer">TFIR.io</a>.</p><p><em>Causely is announcing its integration with OpenTelemetry, bringing a fresh approach to observability that cuts through the noise and surfaces only what matters</em>.</p>
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/GsoUIckMPCw?si=C3shhXWepUqlRIPK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
<p>In today’s world of modern cloud applications,&nbsp;<a href="https://tfir.io/tag/observability/?ref=causely-blog.ghost.io" rel="noopener">observability</a>&nbsp;is both a necessity and a challenge. With countless services interacting dynamically, engineering teams are often flooded with logs, traces, and metrics, making it difficult to extract meaningful insights when something goes wrong. Traditional observability approaches rely on collecting everything—but the sheer volume of data can be overwhelming, turning troubleshooting into a time-consuming and frustrating process.</p><p><a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>&nbsp;is here to change that. Today, the company is announcing its integration with<a href="https://tfir.io/?s=opentelemetry&ref=causely-blog.ghost.io" rel="noopener">&nbsp;OpenTelemetry</a>, bringing a fresh approach to observability that cuts through the noise and surfaces only what matters.</p><p><a href="https://www.linkedin.com/in/yyemini?ref=causely-blog.ghost.io" rel="noopener nofollow">Yotam Yemini</a>, CEO of&nbsp;<a href="https://tfir.io/tag/causely/?ref=causely-blog.ghost.io" rel="noopener">Causely</a>, highlights that&nbsp;<a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noopener">OpenTelemetry</a>&nbsp;adoption often requires significant effort, as teams must determine which data to collect, how to sample it, and how to extract useful insights. “Causely is focused on cutting through that noise to pinpoint what matters,” Yemini says.&nbsp;<a href="https://tfir.io/tag/causely/?ref=causely-blog.ghost.io" rel="nofollow">Causely</a>&nbsp;does this by automatically filtering out unnecessary data and providing a structured understanding of system dependencies and potential failure points. 
The integration makes OpenTelemetry more practical and efficient, ensuring engineers receive only the most relevant insights.</p><p><strong>The Challenge of Observability Today</strong></p><p>Modern applications generate an immense amount of telemetry data, whether it’s from microservices, distributed systems, or cloud infrastructure. While OpenTelemetry has become the de facto standard for collecting and transmitting this data, its adoption often comes with significant hurdles. Engineers frequently struggle with:</p><ul><li>Too much data, too little insight – Logs and traces pile up, making it difficult to identify real issues.</li><li>Slow&nbsp;<a href="https://tfir.io/tag/root/?ref=causely-blog.ghost.io" rel="nofollow">root</a>&nbsp;cause analysis – Teams spend too much time manually sifting through observability data.</li><li>Uncertain data collection strategies – Debates about sampling vs. full collection leave engineers guessing about what’s really needed.</li></ul><p>The result? An overload of information but very few actionable insights.</p><p><strong>A Smarter Approach to Observability</strong></p><p>Causely’s Causal Reasoning Platform takes a fundamentally different approach. Instead of just presenting raw telemetry data, it analyzes dependencies and causal relationships between different components of an application. 
This means that instead of looking at an overwhelming sea of logs, engineers can quickly pinpoint what is causing an issue and how it propagates through the system.</p><p>With the new OpenTelemetry integration, Causely makes it even easier for teams to:</p><ul><li>Seamlessly ingest OpenTelemetry traces and metrics without any complicated setup.</li><li>Automatically filter out irrelevant data and highlight the most critical insights.</li><li>Predict potential failures before they happen, allowing teams to be proactive instead of reactive.</li></ul><p>By integrating with OpenTelemetry, Causely aims to bridge the gap between raw telemetry collection and intelligent observability. Teams can proactively manage service reliability, prevent latency issues, and optimize performance, making observability more effective and manageable.</p><p><strong>How it Works</strong></p><p>Causely provides two easy ways to integrate with OpenTelemetry. If your team is already using OpenTelemetry, you can simply point your existing collector at Causely. Within seconds, the system begins processing the telemetry data and identifying key dependencies, bottlenecks, and potential risks.</p><p>For teams that haven’t yet adopted OpenTelemetry, Causely offers an alternative: auto-instrumentation using<a href="https://ebpf.io/?ref=causely-blog.ghost.io" rel="noopener">&nbsp;eBPF</a>. This allows teams to collect the necessary telemetry data without having to manually modify their applications. The best part? It all runs in your own cloud environment, ensuring data privacy and&nbsp;<a href="https://tfir.io/?s=security&ref=causely-blog.ghost.io" rel="noopener">security</a>&nbsp;while keeping infrastructure overhead minimal.</p><p><strong>Beyond OpenTelemetry: Observability without Limits</strong></p><p>While this new integration enhances how teams use OpenTelemetry, Causely isn’t limited to a single observability standard. 
The platform is designed to work with any telemetry source, whether proprietary or open-source, ensuring flexibility for engineering teams regardless of their stack.</p><p>By decoupling the intelligence layer from data collection, Causely provides a unified way to analyze and act on observability data, no matter where it comes from. This approach helps reduce the complexity caused by vendor sprawl, allowing teams to focus on application performance rather than managing observability infrastructure.</p><p><strong>Who Benefits from Causely?</strong></p><p>Causely is built for any engineering team dealing with complex, microservices-driven applications. Whether you’re operating a large-scale SaaS platform, managing a&nbsp;<a href="https://tfir.io/tag/fintech/?ref=causely-blog.ghost.io" rel="nofollow">fintech</a>&nbsp;infrastructure, or ensuring the reliability of a healthcare application, Causely helps keep things running smoothly by automating root cause detection and providing predictive insights.</p><p>Companies already leveraging Causely include publicly traded SaaS providers, financial institutions, telecom giants, and healthcare tech firms—all benefiting from the ability to reduce incident resolution times and improve service reliability.</p><p><strong>Get Started with Causely</strong></p><p>Observability doesn’t have to be overwhelming. With Causely’s OpenTelemetry integration, teams can move beyond raw data collection and gain instant, actionable insights that make troubleshooting faster and easier.</p><p>For those interested in trying it out, Causely offers a free trial where you can explore a demo environment or set up a real-world deployment in minutes. 
To get started, visit<a href="https://causely.ai/?ref=causely-blog.ghost.io" rel="noopener">&nbsp;causely.ai</a>&nbsp;and see how causal reasoning can transform your observability strategy.</p><p>Instead of spending hours digging through logs, let Causely do the heavy lifting—so you can focus on building great applications.</p><p><strong>Guest:&nbsp;</strong><a href="https://www.linkedin.com/in/yyemini?ref=causely-blog.ghost.io" rel="noopener nofollow"><strong>Yotam Yemini</strong></a><br><strong>Company:&nbsp;</strong><a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener"><strong>Causely</strong></a><br><strong>Show:&nbsp;</strong><a href="https://tfir.io/category/content-type/video/lets-talk/?ref=causely-blog.ghost.io" rel="noopener"><strong>Let’s Talk</strong></a></p><p><strong><em>Summary is written by&nbsp;</em></strong><a href="mailto:emilylnicholls@gmail.com"><strong><em>Emily Nicholls</em></strong></a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Why there needs to be a paradigm shift in observability]]></title>
      <link>https://causely.ai/blog/why-there-needs-to-be-a-paradigm-shift-in-observability</link>
      <guid>https://causely.ai/blog/why-there-needs-to-be-a-paradigm-shift-in-observability</guid>
      <pubDate>Wed, 05 Mar 2025 17:37:21 GMT</pubDate>
      <description><![CDATA[View the original article on CIODive.

I’ve spent over three decades in IT Operations. Despite all the talk of transformation, many of the fundamental challenges remain unchanged, or have even worsened. The rise of modern DevOps and observability promised to revolutionize how we monitor and maintain systems, but in reality, we’ve simply scaled up the same old problems. More data, more dashboards, and more alerts haven’t led to better outcomes.  
 
The core issue? Our approach to observability ha]]></description>
      <author>Shmuel Kliger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Causely-Screenshot-1.png" type="image/png" />
      <content:encoded><![CDATA[<p>View the original article on <a href="https://www.ciodive.com/spons/why-there-needs-to-be-a-paradigm-shift-in-observability/741526/?ref=causely-blog.ghost.io" rel="noreferrer">CIODive</a>.</p><p>I’ve spent over three decades in IT Operations. Despite all the talk of transformation, many of the fundamental challenges remain unchanged, or have even worsened. The rise of modern DevOps and observability promised to revolutionize how we monitor and maintain systems, but in reality, we’ve simply scaled up the same old problems. More data, more dashboards, and more alerts haven’t led to better outcomes. &nbsp;<br>&nbsp;<br>The core issue? Our approach to observability has been misguided. We need a paradigm shift from a bottom-up approach to a top-down approach. Instead of collecting everything from the bottom up and hoping we find insights in the data, we need to start with the purpose, the insights we are looking for, and collect only the data that will help us infer them.&nbsp;</p><h2 id="part-1-looking-back-at-the-days-of-itops">Part 1: Looking Back at the Days of ITOps&nbsp;</h2><p>From the early days of IT Operations, we have been focusing on keeping the lights on. In those days, systems were monolithic, monitoring was rudimentary, and troubleshooting often involved sifting through log files for hours on end. A major incident meant a war room filled with engineers manually correlating data, trying to pinpoint the root cause of an outage.&nbsp;</p><p>As infrastructure grew more complex, the industry responded by layering on more tools, with each promising to make troubleshooting easier. But in practice, these tools often just created more dashboards, more logs, and more alerts, leading to information overload. 
IT Operations needed to evolve, which gave rise to modern DevOps, but we are still in the early innings of the game.&nbsp;&nbsp;</p><h2 id="part-2-the-problem-today-with-observability">Part 2: The Problem Today with Observability&nbsp;</h2><p>Observability was supposed to solve these challenges by giving teams a deeper understanding of their systems. The idea was that by collecting and analyzing vast amounts of telemetry – metrics, events, logs, and traces – organizations would gain better insights and respond to issues faster.&nbsp;</p><p>Instead, we’ve seen an explosion of complexity. Today’s observability landscape is defined by:&nbsp;</p><ul><li><strong>Too much noise:&nbsp;</strong>The sheer volume of logs and alerts makes it nearly impossible for engineers to separate meaningful signals from the flood of data.&nbsp;</li><li><strong>Too many tools:</strong>&nbsp;Companies rely on a fragmented ecosystem of monitoring, logging, and tracing solutions, leading to silos and inefficiencies.&nbsp;</li><li><strong>Too much manual troubleshooting:</strong>&nbsp;Despite all this data, engineers still spend most of their time manually diagnosing incidents, correlating logs, and responding to false positives.&nbsp;</li></ul><p>The promise of observability has not been fully realized because organizations are still focusing on data collection instead of shifting their focus to intelligent data processing.&nbsp;</p><h2 id="part-3-causely%E2%80%99s-paradigm-shift">Part 3: Causely’s Paradigm Shift&nbsp;</h2><p>At&nbsp;<a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noreferrer noopener">Causely</a>, we believe that observability must be disrupted and go through a paradigm shift from bottom-up data collection to top-down, purpose-driven analytics that saves engineers from needing to spend hours sifting through data to figure out “the why behind the what.” We reject the notion that engineers will always need to spend time drilling through dashboards and making 
sense of the data coming from their tools. Instead, we believe that implementing the proper abstraction layer would allow systems to self-manage and take humans out of the troubleshooting loop. Instead of using tools that provide information for the engineers to analyze, the engineers should deploy systems that can actually make autonomous decisions.&nbsp;<br><br>This journey to autonomous service reliability is built on a number of core principles that I outlined in a&nbsp;<a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer noopener">recent blog</a>. This shift from passively collecting data to relying on a system that actively interprets and acts on these insights is the future of observability.&nbsp;</p><h2 id="part-4-the-rise-of-ai-and-agentic-ai">Part 4: The Rise of AI and Agentic AI&nbsp;</h2><p>We are on the cusp of a new era where Agentic AI can fundamentally reshape IT Operations. Instead of engineers reacting to alerts, AI-driven systems will proactively maintain service reliability, predicting and preventing failures before they occur.&nbsp;</p><p>I’m not talking about simple alerting rules or anomaly detection; I’m talking about true agentic AI that continuously analyzes system behavior and adapts in real time to prevent service degradation.&nbsp;&nbsp;</p><p>Causely is positioned to lead this transformation, helping organizations shift from reactive troubleshooting to proactive, AI-driven service reliability. This isn’t just a step forward in observability—it’s a paradigm shift in how we think about IT Operations.&nbsp;</p><h2 id="conclusion">Conclusion&nbsp;</h2><p>For 30+ years, we’ve been fighting the same battles in IT Operations, just at a larger scale. It’s time to rethink observability, move beyond endless data collection, and embrace a future where AI-driven automation ensures service reliability with minimal human intervention. 
Organizations that adopt this new paradigm won’t just reduce downtime and incident costs—they’ll free their engineers to focus on what truly matters: building the future.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Episode 60: Shmuel Kliger, Founder of Causely]]></title>
      <link>https://causely.ai/blog/episode-60-shmuel-kliger-founder-of-causely</link>
      <guid>https://causely.ai/blog/episode-60-shmuel-kliger-founder-of-causely</guid>
      <pubDate>Wed, 05 Mar 2025 16:56:37 GMT</pubDate>
      <description><![CDATA[In this 10KMedia Podcast interview, Adam sits down with Shmuel to discuss the problems with traditional observability, the importance of OpenTelemetry, and how Causely is helping teams find the signal in the noise.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2025-03-05-at-12.16.38-PM.png" type="image/png" />
      <content:encoded><![CDATA[<p>Adam sits down with Shmuel to discuss the problems with traditional observability, the importance of OpenTelemetry, and how Causely is helping teams find the signal in the noise.</p><p>Check out the interview below or view it <a href="https://open.spotify.com/episode/69Xb72sLigpMhzTPCQBH61?si=db155c40a66a4a1c&ref=causely-blog.ghost.io" rel="noreferrer">on Spotify</a>.</p>
<!--kg-card-begin: html-->
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/69Xb72sLigpMhzTPCQBH61?utm_source=generator" width="100%" height="352" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
<!--kg-card-end: html-->
]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Launches New Integration with OpenTelemetry, Cutting Through the Observability Noise and Pinpointing What Matters]]></title>
      <link>https://causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters</link>
      <guid>https://causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters</guid>
      <pubDate>Wed, 05 Mar 2025 15:08:24 GMT</pubDate>
      <description><![CDATA[Causely, the causal reasoning platform for modern engineering teams, today launches a native integration with OpenTelemetry.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Causely-Screenshot.png" type="image/png" />
      <content:encoded><![CDATA[<p><strong>NEW YORK --(</strong><a href="https://www.businesswire.com/news/home/20250305519212/en/Causely-Launches-New-Integration-with-OpenTelemetry-Cutting-Through-the-Observability-Noise-and-Pinpointing-What-Matters?ref=causely-blog.ghost.io" rel="noreferrer"><strong>BUSINESS WIRE</strong></a><strong>)--</strong><a href="http://www.causely.ai/?ref=causely-blog.ghost.io"><u>Causely</u></a>, the causal reasoning platform for modern engineering teams, today launches a native integration with OpenTelemetry. OpenTelemetry has redefined observability by standardizing how telemetry data is collected and processed; however, the overwhelming volume of metrics, logs, and traces collected across an organization's cloud and microservices architecture can overwhelm teams, slow down workflows, inflate costs, and make it difficult for engineers to pinpoint root causes. Causely utilizes built-in causal models and advanced analytics to cut through the observability noise and automatically pinpoint what matters based on service impact.</p><p>“For years, the IT industry has struggled to make sense of the overwhelming amounts of data coming from dozens of observability platforms and monitoring tools,” said Yotam Yemini, CEO of Causely. “OpenTelemetry is an important paradigm shift that gives teams a more flexible and vendor-neutral way to collect and use telemetry data, but it also exacerbates the big data problem of modern DevOps: deriving actionable insights from overwhelming amounts of telemetry data.”</p><p>Causely installs in cloud-native environments in under a minute and immediately bridges the gap between the overwhelming amount of observability data and the actionable insights that can be found within that data, maximizing the potential benefits of OpenTelemetry. 
Teams who have yet to adopt OpenTelemetry can leverage Causely’s eBPF-based auto-instrumentation to start collecting traces instantly.&nbsp;</p><p>The Causely system works by automatically mapping an application’s topology and service dependencies, then applying a finite set of likely root causes to this data. This novel approach to observability is counter to traditional tooling and methods that encourage businesses to collect as much data as possible – a situation that hasn’t fundamentally changed in decades – which then requires human troubleshooting to respond to alerts, make sense of patterns, identify root cause, and ultimately determine the best action for remediation.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdWf6lE7XycuITVFQY7HZSwrRCpfAVw3bDPPcuvQ8r9ujWTlDrYPmovpdvuAOyw9uI6EAGRQyEDBOz4KTcHgHKXy3QP9qn2PzG16hX4Jck37BTn0RA0W58ctylEKjU4d6a3i432?key=Bx4z2HMwbfUO7MEodl2OWS2G" class="kg-image" alt="" loading="lazy" width="907" height="510"></figure><p>“Traditionally an alert fires and we have to throw a bunch of subject matter experts at the problem to comb through all of the telemetry and troubleshoot why these things are going on,” said Matt Titmus, Engineering Manager at Yext. “I’m enthusiastic about Causely because it gets us to the root cause much faster and also helps us be more proactive."</p><p>Causely was founded by Shmuel Kliger who has been developing systems for IT Operations for over two decades. He was also the founder of Turbonomic (acquired by IBM) and the CTO of SMARTS (acquired by EMC), bringing together technical experience with a track record of successfully scaling companies.</p><p><strong><u>About Causely</u></strong></p><p><a href="http://www.causely.ai/?ref=causely-blog.ghost.io"><u>Causely</u></a> leverages causal reasoning to cut through the observability noise and pinpoint what matters. 
Engineers are overwhelmed by too many tools, alerts, and data coming from their existing observability solutions. Causely automatically surfaces only the most critical risks to service reliability, enabling businesses to minimize operational overhead and maintain reliability without manual troubleshooting.</p><p><strong><u>Contact</u></strong></p><p>Adam LaGreca<br>Founder of 10KMedia<br><a href="mailto:adam@10kmedia.co" rel="noreferrer">adam@10kmedia.co</a></p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Launching our new integration with OpenTelemetry]]></title>
      <link>https://causely.ai/blog/otel-actionable-insight</link>
      <guid>https://causely.ai/blog/otel-actionable-insight</guid>
      <pubDate>Tue, 04 Mar 2025 21:30:26 GMT</pubDate>
      <description><![CDATA[Bridging the gap between observability data and actionable insight]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/03/Causely-OTel.png" type="image/png" />
      <content:encoded><![CDATA[<h2 id="bridging-the-gap-between-observability-data-and-actionable-insight">Bridging the gap between observability data and actionable insight</h2><p>Observability has become a cornerstone of application reliability and performance. As systems grow more complex—spanning microservices, third-party APIs, and asynchronous messaging patterns—the ability to monitor and debug these systems is both a necessity and a challenge.&nbsp;</p><p>OpenTelemetry (OTEL) has emerged as a powerful, open source framework that standardizes the collection of telemetry data across distributed systems. It promises unprecedented visibility into logs, metrics, and traces, empowering engineers to identify issues and optimize performance across multiple languages, technologies and cloud environments.&nbsp;</p><p>But with great visibility comes a hidden cost. While OTEL democratizes observability, <a href="https://www.causely.ai/blog/be-smarter-about-observability-data?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>it also exacerbates the “big data problem”</u></a> of modern DevOps.&nbsp;</p><p>This is where Causely comes in—<a href="https://www.causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters?ref=causely-blog.ghost.io" rel="noreferrer">today, we announced a new integration</a> with OTEL that bridges the gap between OTEL's data deluge and actionable insights. In this post, we’ll explore the strengths and limitations of OpenTelemetry, the challenges it introduces, and how Causely transforms raw telemetry into precise, cost-effective analytics.&nbsp;</p><h2 id="the-opentelemetry-opportunity">The OpenTelemetry opportunity&nbsp;</h2><p>Microservices are a tangled web of interdependencies that communicate over REST or gRPC. Asynchronous systems like Kafka shuttle messages between loosely coupled services. Infrastructure dynamically scales resources to meet demand. 
Observability has become the glue that holds these systems together, enabling engineers to monitor performance, troubleshoot issues, and ensure reliability.&nbsp;</p><p>At the heart of the observability revolution is OpenTelemetry (OTEL), an open-source standard that unifies the instrumentation and collection of telemetry data across logs, metrics, and traces. Its modular architecture, community-driven development, and broad compatibility with existing observability tools have made OTEL the de facto choice for modern DevOps teams.&nbsp;</p><h3 id="what-does-opentelemetry-do">What does OpenTelemetry do?&nbsp;</h3><p>OpenTelemetry provides APIs, SDKs, and tools to capture three primary types of telemetry data:&nbsp;</p><ol><li><strong>Logs</strong>: Detailed, timestamped records of system events (e.g., errors, warnings, and custom events).&nbsp;</li><li><strong>Metrics</strong>: Quantitative measurements of system health and performance (e.g., CPU usage, request latency, error rates).&nbsp;</li><li><strong>Traces</strong>: End-to-end views of requests flowing through distributed systems, mapping dependencies and execution paths.&nbsp;</li></ol><p>With OTEL, engineers can instrument their code to emit these telemetry signals, use an <a href="https://www.causely.ai/blog/using-opentelemetry-and-the-otel-collector-for-logs-metrics-and-traces?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry Collector</a> to aggregate and process the data, and export it to observability backends like Prometheus, Tempo, or Elasticsearch.&nbsp;</p><h3 id="why-opentelemetry-is-a-game-changer">Why OpenTelemetry is a game-changer&nbsp;</h3><p>OpenTelemetry addresses a critical pain point in observability: fragmentation. Historically, different tools and platforms required unique instrumentation libraries, making it difficult to standardize observability across an organization. 
OTEL simplifies this by providing:&nbsp;</p><ul><li><strong>Vendor-Agnostic Instrumentation</strong>: A single API to instrument applications regardless of the backend.&nbsp;</li><li><strong>Centralized Data Collection</strong>: The OpenTelemetry Collector serves as a pluggable data pipeline, consolidating telemetry from various sources.&nbsp;</li><li><strong>Interoperability</strong>: Native support for popular backends like Prometheus, Tempo, and other vendors, allowing teams to integrate OTEL into their existing observability stack.&nbsp;</li></ul><h3 id="technical-example-debugging-latency-issues">Technical example: Debugging latency issues&nbsp;</h3><p>Consider a microservices-based e-commerce application experiencing high latency during checkout. With OTEL traces, you can get a lot of information about the performance of this service, but it is hard to find out what is responsible for the latency. For example: <a href="https://github.com/esara/robot-shop/blob/instrumentation/dispatch/main.go?ref=causely-blog.ghost.io#L172" rel="noreferrer noopener"><u>https://github.com/esara/robot-shop/blob/instrumentation/dispatch/main.go#L172</u></a></p>
<pre><code class="language-go">// Excerpt; see the linked file for the full function.
func processOrder(headers map[string]interface{}, order []byte) {
	start := time.Now()
	log.Printf("processing order %s\n", order)
	tracer := otel.Tracer("dispatch")

	// headers is map[string]interface{}
	// carrier is map[string]string
	carrier := make(propagation.MapCarrier)
	// convert by copying k, v
	for k, v := range headers {
		carrier[k] = v.(string)
	}

	ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier)

	opts := []oteltrace.SpanStartOption{
		oteltrace.WithSpanKind(oteltrace.SpanKindConsumer),
	}
	ctx, span := tracer.Start(ctx, "processOrder", opts...)
	defer span.End()

	span.SetAttributes(
		semconv.MessagingOperationReceive,
		semconv.MessagingDestinationName("orders"),
		semconv.MessagingRabbitmqDestinationRoutingKey("orders"),
		semconv.MessagingSystem("rabbitmq"),
		semconv.NetAppProtocolName("AMQP"),
	)</code></pre>
<p>By exporting these traces to a backend like Tempo, engineers can visualize the request flow and identify bottlenecks, such as consuming messages from RabbitMQ in the dispatch service and inserting an order in a MongoDB database.&nbsp;</p><h2 id="the-big-data-problem-of-observability">The Big Data problem of observability&nbsp;</h2><p>OpenTelemetry’s ability to capture detailed telemetry data is a double-edged sword. While it empowers engineers with unprecedented visibility into their systems, it also introduces challenges that can hinder the very goals observability aims to achieve. The sheer volume of data collected—logs, metrics, and traces from thousands of microservices—can overwhelm infrastructure, slow down workflows, inflate costs, and, most importantly, drown engineers in data. This “big data problem” of observability is a natural consequence of OpenTelemetry’s strengths but must be addressed to make the most of its potential.&nbsp;</p><h3 id="opentelemetry-collects-a-lot-of-data">OpenTelemetry collects a lot of data&nbsp;</h3><p>At its core, OpenTelemetry is designed to be exhaustive. 
This design ensures engineers can instrument their systems to capture every possible detail. For example:</p><ul><li>A high-traffic e-commerce site might generate logs for every HTTP request, metrics for CPU and memory usage, and traces for each request spanning multiple services.</li><li>OpenTelemetry auto-instrumentation libraries make it easy to instrument HTTP, gRPC, messaging, database, and caching libraries across languages, but they generate metrics and traces for every call between every microservice, managed service, database, and third-party API.</li></ul><p>Consider a production environment running thousands of microservices, each processing hundreds of requests per second. Using OpenTelemetry:</p><ul><li><strong>Logs</strong>: A single request might generate dozens of log entries, resulting in millions of logs per minute.</li><li><strong>Metrics</strong>: Resource utilization metrics are emitted periodically, adding continuous streams of quantitative data.</li><li><strong>Traces</strong>: Distributed traces can contain hundreds of spans, each adding its own metadata.</li></ul><p>While this level of detail is invaluable for debugging and optimization, it quickly scales beyond what many teams are prepared to manage. 
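</p><p>To make that scale concrete, here is a back-of-envelope estimate as a small Go program. All of the traffic figures are illustrative assumptions, not measurements from any real system.</p>

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions for a mid-sized cloud-native estate.
	services := 1000       // number of microservices
	reqPerSecPerSvc := 100 // requests each service handles per second
	logsPerReq := 20       // log entries emitted per request
	tracesPerSec := 5000   // end-to-end requests entering the system per second
	spansPerTrace := 200   // spans in a typical distributed trace

	logsPerMinute := services * reqPerSecPerSvc * logsPerReq * 60
	spansPerMinute := tracesPerSec * spansPerTrace * 60

	fmt.Printf("logs/minute:  %d\n", logsPerMinute)  // 120,000,000
	fmt.Printf("spans/minute: %d\n", spansPerMinute) // 60,000,000
}
```

<p>Even with these conservative assumptions, the pipeline must absorb well over a hundred million telemetry items per minute before anyone has asked a single question of the data.</p><p>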
The amount of data makes it difficult to <a href="https://www.causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>troubleshoot</u></a> problems, manage <a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>escalations</u></a>, be proactive about <a href="https://www.causely.ai/blog/be-proactive-about-reliability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>deploying new code</u></a>, and <a href="https://www.causely.ai/blog/be-more-proactive-about-changes-to-your-environment?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>plan for future investments</u></a>.&nbsp;</p><h3 id="the-cost-of-data">The cost of data&nbsp;</h3><p>The problem with this massive volume of telemetry data isn’t just about storage; it’s also about processing and time-to-insight. Let’s break it down:&nbsp;</p><ul><li><strong>Networking Costs</strong>: Transmitting telemetry data from distributed systems, microservices, or edge devices to central storage or processing locations incurs significant bandwidth usage. This can result in substantial networking costs, especially for real-time telemetry pipelines or when dealing with geographically dispersed infrastructure.&nbsp;</li><li><strong>Storage Costs</strong>: Logs, metrics, and traces consume vast amounts of storage, often requiring specialized solutions like Elasticsearch, Amazon S3, or Prometheus’s TSDB. These systems must scale horizontally, adding significant operational overhead.&nbsp;</li><li><strong>Compute Costs</strong>: Telemetry data needs to be parsed, indexed, queried, and analyzed. Complex queries, such as joining multiple traces to identify bottlenecks, can place a heavy burden on compute resources.&nbsp;</li><li><strong>Time Costs</strong>: During a high-severity incident, every second counts. Pinpointing the root cause is like looking for a needle in a haystack. 
With OpenTelemetry, the haystack is much bigger, making the task harder and longer.&nbsp;&nbsp;</li></ul><h3 id="time-to-insight-delays">Time-to-insight delays&nbsp;</h3><p>Imagine a scenario where an outage occurs in a distributed system. An engineer might start by querying logs for errors, then switch to metrics to identify anomalies, and finally inspect traces to pinpoint the failing service. Each query takes time, and engineers often waste effort chasing irrelevant leads. This delay increases Mean Time to Detect (MTTD) and <a href="https://www.causely.ai/blog/mttr-meaning?ref=causely-blog.ghost.io" rel="noreferrer">Mean Time to Resolve</a> (MTTR), directly impacting uptime and user satisfaction.&nbsp;</p><h3 id="noise-vs-signal">Noise vs. signal&nbsp;</h3><p>Another challenge is separating the signal (useful insights) from the noise (redundant or irrelevant data). With OTEL:&nbsp;</p><ul><li><strong>Logs </strong>can be overly verbose, capturing routine events that clutter debugging efforts.&nbsp;</li><li><strong>Metrics </strong>might lack the context needed to tie resource anomalies back to specific root causes.&nbsp;</li><li><strong>Traces </strong>can become overwhelming in high-traffic systems, with thousands of spans providing more detail than is actionable.&nbsp;</li></ul><p>While OTEL excels at capturing data, it doesn’t inherently prioritize it. This creates a bottleneck for engineers who need actionable insights quickly.&nbsp;</p><h2 id="the-need-for-top-down-analytics">The need for top-down analytics&nbsp;</h2><p>Along with the benefits of modern observability tooling come challenges that need to be addressed. OpenTelemetry (OTEL) may unify telemetry data collection, but its bottom-up approach leaves teams drowning in redundant metrics, irrelevant logs, and sprawling traces. 
Without a clear purpose, teams end up collecting everything “just in case,” overwhelming engineers with noise and diluting the actionable insights needed to keep systems running.&nbsp;</p><p>A top-down approach to observability flips the script. Instead of starting with what data is available, it begins with defining the goals: root cause analysis, SLO compliance, or performance optimization. By focusing on purpose, teams can build the analytics required to achieve those goals and then collect only the data necessary to power those insights.&nbsp;&nbsp;</p><p>For example:&nbsp;</p><ul><li>If the goal is root cause analysis, focus on traces that map dependencies across microservices, rather than capturing every granular log.&nbsp;</li><li>If the goal is performance optimization, prioritize metrics that highlight latency bottlenecks over exhaustive resource utilization data.&nbsp;</li></ul><p>This shift reduces noise, minimizes data storage and processing costs, and accelerates time-to-insight.&nbsp;</p><h3 id="the-cost-of-ignoring-purpose">The cost of ignoring purpose&nbsp;</h3><p>The current approach to observability is plagued by fragmentation. Point tools like APMs, native Kubernetes instrumentation, and cloud-specific monitors operate in silos, each with its own data model and semantics. This forces engineers to manually correlate information across dashboards, increasing time to resolution and undermining efficiency. 
Over time, the storage, compute, and human costs of managing fragmented data become unsustainable.</p><p>Ask yourself:</p><blockquote>How much of your telemetry data is redundant or irrelevant?</blockquote><blockquote>Are your engineers spending more time troubleshooting tools than resolving incidents?</blockquote><blockquote>Is your observability stack delivering insights or merely adding complexity?</blockquote><p>Without a unified purpose and targeted analytics, observability becomes another “big data problem,” and your total cost of ownership (TCO) spirals out of control.</p><h2 id="causely-can-help">Causely can help</h2><p>Causely transforms OpenTelemetry’s raw telemetry data into actionable insights by applying a top-down, purpose-driven approach. Instead of drowning in logs, metrics, and traces, Causely’s platform leverages built-in causal models and advanced analytics to automatically pinpoint root causes, prioritize issues based on service impact, and predict potential failures before they occur. This turns observability from a reactive big data challenge into a practice that continuously assures application reliability and performance.</p><h3 id="how-causely-brings-focus">How Causely brings focus</h3><p>Causely’s platform addresses these challenges head-on. Its causal reasoning starts with defining what matters: actionable insights to keep systems performing reliably and efficiently. Using built-in causal models and top-down analytics, Causely automatically pinpoints root causes and eliminates noise. 
<a href="https://www.causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters?ref=causely-blog.ghost.io" rel="noreferrer">By integrating with OTEL</a> and other telemetry sources, Causely ensures that only the most critical data is collected, processed, and presented in real time.</p><p>For example:</p><ul><li>In a microservices architecture, Causely maps dependencies and pinpoints the root cause of cascading failures, reducing MTTR.</li><li>Similarly, with async messaging systems like Kafka, Causely pinpoints the bottlenecks that cause consumer lag or delivery failures with actionable context, ensuring faster resolution.</li><li>In cases where third-party software is the root cause of issues, Causely pinpoints it by analyzing service impact.</li></ul><p>This approach not only reduces the TCO of observability but also ensures teams can focus on delivering value rather than managing data.</p><h3 id="how-causely-works-with-opentelemetry">How Causely works with OpenTelemetry</h3><p>The Causely Reasoning Platform is a model-driven, <a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer">purpose-built Agentic AI system</a> delivering multiple AI workers built on a common data model.</p><p>Causely integrates seamlessly with OpenTelemetry, using its telemetry streams as input while applying context and intelligence to deliver precise, actionable outputs. Here’s how Causely solves common observability challenges:</p><ul><li><strong>Automated topology discovery</strong>: Causely automatically builds a dependency map of your entire environment, identifying how applications, services, and infrastructure components interact. 
OpenTelemetry’s traces provide raw data, but Causely’s topology discovery transforms it into a visual graph that highlights critical paths and dependencies.</li><li><strong>Root cause analysis in real time</strong>: Using causal models, Causely automatically maps all potential root causes to the observable symptoms they may cause. Causely uses this mapping in real time to automatically pinpoint the root causes based on the observed symptoms, prioritizing those that directly impact SLOs. For instance, when request latency spikes are detected across multiple services, Causely pinpoints whether the spikes stem from a database query (and which database), a messaging queue (and which queue), or an external API (and which one), reducing MTTD and MTTR.</li><li><strong>Proactive prevention</strong>: Beyond solving problems, Causely helps prevent them. Its analytics can simulate “what-if” scenarios to predict the impact of configuration changes, workload spikes, or infrastructure upgrades. For example, Causely can warn you if scaling down a Kubernetes node pool might lead to resource contention under expected load.</li></ul><h3 id="example-1-causely-otel-and-microservices">Example 1: Causely, OTEL, and microservices</h3><p>In a distributed e-commerce platform, a checkout service experiences intermittent failures. OpenTelemetry traces capture the flow of requests, but the data alone doesn’t explain the root cause. Causely’s causal models analyze the traces and identify that a dependent payment service is timing out due to a slow database query. This insight allows the team to address the issue without wasting time on manual debugging.</p><h3 id="example-2-causely-otel-and-third-party-software">Example 2: Causely, OTEL, and third-party software</h3><p>A team using a third-party CRM API notices degraded response times during peak hours. 
OpenTelemetry provides metrics showing increased latency, but engineers are left guessing whether the issue lies with their application or the external service. Causely reasons about the API latency and third-party requests and identifies that the CRM is rate-limiting requests, prompting the team to implement retry logic.</p><h3 id="example-3-causely-otel-and-async-messaging-with-kafka">Example 3: Causely, OTEL, and async messaging with Kafka</h3><p>A Kafka-based event pipeline shows sporadic delays in message processing. While OpenTelemetry traces highlight lagging consumers, they don’t explain why. Causely, reasoning about the behavior of the consumer microservices, identifies the root cause in the application’s mutex locking, which is slowing consumption. The engineering team can focus on improving the locking of the data structure without the messaging infrastructure team having to scale up resources and waste time debugging Kafka.</p><h3 id="reducing-the-big-data-burden">Reducing the big data burden</h3><p>Causely’s approach minimizes the data burden by focusing on relevance. Unlike traditional observability stacks that collect and store massive volumes of telemetry data, Causely processes raw metrics and traces locally, pushing only relevant context (e.g., topology and symptoms) to its backend analytics. This reduces storage and compute costs while ensuring engineers get the insights they need, without delay.</p><h2 id="conclusion-transforming-observability-with-causely">Conclusion: Transforming observability with Causely</h2><p>OpenTelemetry has redefined observability by standardizing how telemetry data is collected and processed, but its bottom-up approach leaves teams overwhelmed by the sheer volume of logs, metrics, and traces. Observability shouldn’t be about how much data you collect—it’s about how much insight you can gain to keep your systems running efficiently. 
Without clear prioritization and contextual insights, the observability stack can quickly become a costly burden—both in terms of infrastructure and engineering time.&nbsp;</p><p>Causely integrates seamlessly with OpenTelemetry and helps bring order to the chaos, empowering teams to make smarter, faster decisions that directly impact reliability and user experience. Causely uses causal models, automated topology discovery and real-time analytics to pinpoint root causes, prevent incidents, and optimize performance. This reduces noise, eliminates unnecessary data collection, and allows teams to focus on delivering reliable systems rather than managing observability overhead.&nbsp;</p><p>Ready to move beyond data overload and transform your observability strategy? <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a demo</u></a> or <a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer">start your free trial</a> to see how Causely can help you take control of your telemetry data and build more reliable cloud-native applications.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Journey from Actionable Analytics to Autonomous Service Reliability & Agentic AI]]></title>
      <link>https://causely.ai/blog/capabilities-causal-analysis</link>
      <guid>https://causely.ai/blog/capabilities-causal-analysis</guid>
      <pubDate>Mon, 03 Feb 2025 17:09:00 GMT</pubDate>
      <description><![CDATA[We’ll introduce the 6 common components and 7 AI Workers of our Causal Reasoning Platform, explaining how the platform works to enable autonomous service reliability.]]></description>
      <author>Shmuel Kliger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causal-reasoning-platform.webp" type="image/webp" />
      <content:encoded><![CDATA[<p>Autonomous Service Reliability is a nirvana we have been trying to get to for several decades. <a href="https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)?ref=causely-blog.ghost.io" rel="noreferrer">John McCarthy</a>, one of the original pioneers of AI, proposed in 1961 the idea of self-repairing computer programs. Three decades later (in the 1990s), tech giants like IBM pushed the concepts of “<a href="https://www.ibm.com/docs/en/db2/11.5?topic=servers-autonomic-computing-overview&ref=causely-blog.ghost.io" rel="noreferrer">autonomic computing</a>” and “<a href="https://www.networkcomputing.com/network-infrastructure/ibm-gets-autonomic-with-self-healing-software?ref=causely-blog.ghost.io" rel="noreferrer">self-healing IT systems</a>.”&nbsp;</p><p>As an industry, we have made progress in many areas. But when it comes to technology operations and application management fundamentals, we are far from the desired state. No matter what new terms and buzz words people use, the industry still has a way to go. The reality is that no magic black box or new AI trend will get us there on its own.&nbsp;</p><p>Autonomous Service Reliability requires a system that autonomically keeps all applications performing and meeting their objectives while satisfying their operational constraints (i.e. the “Desired State”). 
To continuously maintain the Desired State, the system needs to:&nbsp;&nbsp;</p><ul><li><strong>Assess</strong> whether all the applications are in their Desired State&nbsp;</li><li><strong>Pinpoint the root cause(s) </strong>and identify the actions that will get the applications that are not in their Desired State back to their Desired State&nbsp;&nbsp;</li><li><strong>Determine what actions</strong> will prevent applications from getting out of their Desired State in the first place&nbsp;</li><li><strong>Continuously assess</strong> environment trends to identify what actions should be taken to prevent deviation from the Desired State&nbsp;</li></ul><p>These are the goals of our <a href="https://causely.ai/product/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causal Reasoning Platform</u></a>, which&nbsp;is a model-driven, purpose-built Agentic AI system that includes multiple AI Workers built on a common data model. These AI Workers collaborate seamlessly to continuously assure application reliability and performance. Each of these workers utilizes specific analytics and they all share the common components of the Causal Reasoning Platform.&nbsp;&nbsp;</p><p>There are seven AI Workers delivered by our Causal Reasoning Platform, sharing six common components. In this post, we’ll first introduce the six common components followed by a description of the seven workers, explaining what each worker is doing and how the platform works.&nbsp;&nbsp;</p><h2 id="causal-models"><strong>Causal Models</strong></h2><p>The Causal Reasoning Platform is driven by Causal Models. Causely is delivered with built-in Causal Models that capture the root causes that can degrade application performance. These Causal Models enable Causely to automatically pinpoint root causes as soon as it is deployed in an environment with zero configuration. 
</p><p>There are at least a few important details to highlight about these Causal Models:&nbsp;&nbsp;</p><ul><li><strong>They capture potential root causes</strong> in a broad range of entities including applications, databases, caches, messaging, load balancers, DNS, compute, storage, and more.&nbsp;</li><li><strong>They describe how the root causes will propagate </strong>across the entire environment and what symptoms may be observed when each of the root causes occurs.&nbsp;&nbsp;</li><li><strong>They are completely independent</strong> from any specific environment and are applicable to any modern application environment.&nbsp;&nbsp;</li></ul><h2 id="attribute-dependency-models"><strong>Attribute Dependency Models</strong></h2><p>Causely is delivered with built-in Attribute Dependency Models that extend the Causal Models to capture the dependencies between attributes across entities and the constraints attributes must satisfy. These Attribute Dependency Models enable Causely to automatically correlate performance trends across the entire environment, figure out the Desired State (as described earlier) and&nbsp;the actions to keep the environment in that state.&nbsp;</p><p>There are at least a few important details to highlight about these Attribute Dependency Models: &nbsp;</p><ul><li><strong>They can capture attribute dependencies</strong> in a broad range of entities including services, applications, databases, caches, messaging, load balancers, DNS, compute, storage, and more.&nbsp;</li><li><strong>They describe the functions </strong>between the attributes, but more importantly the functions can be learned. &nbsp;</li><li><strong>They describe the desired state </strong>in terms of the applications' goals and the constraints they should operate within. 
&nbsp;</li><li><strong>They are completely independent</strong> from any specific environment and are applicable to any modern application environment.</li></ul><h2 id="automatic-topology-discovery"><strong>Automatic Topology Discovery</strong></h2><p>Cloud-native environments are a tangled web of applications and services layered over complex and dynamic infrastructure. Causely automatically discovers all the entities in the environment including the applications, services, databases, caches, messaging, load balancers, compute, storage, etc., as well as how they all relate to each other. </p><p>For each discovered entity, Causely automatically discovers its:&nbsp;&nbsp;</p><ul><li><strong>Connectivity</strong>&nbsp;- the entities it is connected to and the entities it is communicating with horizontally&nbsp;&nbsp;</li><li><strong>Layering</strong>&nbsp;- the entities it is vertically layered over or underlying&nbsp;</li><li><strong>Composition</strong>&nbsp;- what the entity itself is composed of&nbsp;&nbsp;</li></ul><p>Causely automatically stitches all of these relationships together to generate a&nbsp;Topology Graph, which is a clear dependency map of the entire environment. 
This&nbsp;Topology Graph updates continuously in real time, accurately representing the&nbsp;current state of the environment at all times.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/product-hero.png" class="kg-image" alt="" loading="lazy" width="1001" height="650" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/product-hero.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/01/product-hero.png 1000w, https://causely-blog.ghost.io/content/images/2025/01/product-hero.png 1001w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely delivers automatic topology discovery </span></figcaption></figure><h2 id="automatic-causality-mapping-generation"><strong>Automatic Causality Mapping Generation</strong></h2><p>Using the out-of-the-box Causal Models and the Topology Graph as described above, Causely automatically generates a causal mapping between all the possible root causes and the symptoms each of them may cause, along with the probability that each symptom would be observed when the root cause occurs.&nbsp;</p><p>Causely automatically generates two data structures to capture the causality mapping:</p><ul><li>A <strong>Causality Graph</strong> is a directed acyclic graph (DAG), where the nodes are root causes and symptoms and the edges represent the causality, i.e., an edge from node A to node B means that A may cause B. The edges are labeled with the probability of the causality.</li><li>A <strong>Codebook </strong>is a table where the columns represent the root causes and the rows represent the symptoms. Each column is a vector of probabilities defining a unique signature of the root cause. 
A cell in the vector represents the probability that the root cause may cause the symptom.</li></ul><h2 id="automatic-attribute-dependency-graph-generation"><strong>Automatic Attribute Dependency Graph Generation</strong></h2><p>Using the out-of-the-box Attributes Dependency Model and the Topology Graph as described above, Causely automatically generates an Attribute Dependency Graph.</p><p>The Attribute Dependency Graph is a directed acyclic graph (DAG) where:</p><ul><li>The nodes are attributes.</li><li>The edges represent a dependency between the attributes. For example, an edge from attribute A to attribute B means that the value of B is a function of attribute A.</li><li>The edges are labeled with the functions. The functions can be defined in the Attributes Dependency Model or can be learned if they are not defined in the Model.</li><li>The nodes representing attributes that must satisfy a constraint will be decorated with the constraint the attribute must satisfy.</li></ul><h2 id="contextual-presentation"><strong>Contextual Presentation</strong></h2><p>We believe explainability is a critical capability for AI-driven systems to demonstrate. For this purpose, the system presents its work intuitively in the Causely UI. This enables users to see the root causes, related symptoms, and service impacts, and to initiate actions. 
These insights can also be sent to external systems to initiate incident response workflows as well as to notify teams who are responsible for taking action and/or those whose services are impacted.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/cta-image.png" class="kg-image" alt="" loading="lazy" width="1083" height="640" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/cta-image.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/01/cta-image.png 1000w, https://causely-blog.ghost.io/content/images/2025/01/cta-image.png 1083w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Insights from Causely are presented in a visual UI</span></figcaption></figure><p>The Models, the automated topology discovery, and the automatic generation of the Causality Mapping and the Attribute Dependency Graph empower multiple AI workers that together deliver an autonomous application reliability system that continuously assures application performance.</p><h2 id="root-cause-analysis-rca-worker"><strong>Root Cause Analysis (RCA) Worker</strong> </h2><p>The RCA Worker uses the Codebook described above to automatically pinpoint root causes based on observed symptoms in real time. No configuration is required for the worker to immediately pinpoint a broad set of root causes (100+), ranging from application malfunctions to service congestion to infrastructure bottlenecks.<br><br>In any given environment, there can be tens of thousands of different root causes that may cause hundreds of thousands of symptoms. Causely prevents SLO violations by detangling this mess and pinpointing the root cause putting your SLOs at risk and driving remediation actions before SLOs are violated. 
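</p><p>To build intuition for the Codebook matching just described, here is a toy sketch in Go. The symptom names, probability signatures, and the nearest-signature distance metric are all invented for illustration; Causely’s actual analytics are far richer than this.</p>

```go
package main

import (
	"fmt"
	"math"
)

// A toy Codebook: each root cause has a vector of probabilities, one per
// symptom (high_latency, error_rate, queue_lag), forming its unique signature.
var codebook = map[string][]float64{
	"db_slow_query":   {0.9, 0.3, 0.1},
	"broker_overload": {0.6, 0.2, 0.9},
	"bad_deploy":      {0.4, 0.9, 0.1},
}

// bestMatch returns the root cause whose signature is closest
// (squared Euclidean distance) to the observed symptom vector.
func bestMatch(observed []float64) string {
	best, bestDist := "", math.MaxFloat64
	for cause, sig := range codebook {
		d := 0.0
		for i := range sig {
			diff := sig[i] - observed[i]
			d += diff * diff
		}
		if d < bestDist {
			best, bestDist = cause, d
		}
	}
	return best
}

func main() {
	// Observed: high latency, few errors, severe queue lag.
	fmt.Println(bestMatch([]float64{0.7, 0.1, 1.0})) // broker_overload
}
```

<p>In a real environment the Codebook spans vastly more root causes and symptoms, but the principle is the same: the observed symptoms select the root-cause signature they match best.</p><p>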
For example, Causely proactively pinpoints if a software update changes performance behaviors for dependent services before those services are impacted.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/prevent-slo.webp" class="kg-image" alt="" loading="lazy" width="876" height="651" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/prevent-slo.webp 600w, https://causely-blog.ghost.io/content/images/2025/01/prevent-slo.webp 876w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely automates root cause analysis</span></figcaption></figure><h2 id="performance-analysis-worker"><strong>Performance Analysis Worker</strong></h2><p>The Performance Analysis Worker uses the Attribute Dependency Graph and Causality Graph to analyze microservices performance bottleneck propagation by automatically learning, based on your data:</p><ul><li><strong>The correlation between the loads on services</strong>, i.e., how a change in load of one cascades and impacts the loads on other services;</li><li><strong>The correlation between service latencies</strong>, i.e., how latency of one cascades and impacts the latencies of other services; and</li><li><strong>The likelihood</strong> a service or resource bottleneck may cause performance degradations on dependent services.</li></ul><h2 id="constraint-analysis-worker"><strong>Constraint Analysis Worker</strong></h2><p>The Constraint Analysis Worker uses the Attribute Dependency Graph decorated with performance goals like throughput and latency, as well as capacity and/or cost constraints, to automatically compute the Desired State of the environment and to figure out what actions need to be taken to assure the goals are accomplished while satisfying the defined constraints.</p><h2 id="prevention-analysis-worker"><strong>Prevention 
Analysis Worker</strong></h2><p>The Prevention Analysis Worker uses the Causality Graph and the Attribute Dependency Graph to enable prevention analysis. Teams are empowered to analyze the potential impacts or problems of changes. </p><p>Teams can ask “what if” questions to: </p><ul><li>Understand the services that may be degraded if a potential problem were to occur </li><li>Understand the impact a planned change may have on services </li></ul><p>In doing so, teams can support planning of service/architecture changes, maintenance activities, and service resiliency improvements, and assure that none of these cause unexpected outages that may dramatically impact the business. </p><h2 id="predictive-analysis-worker"><strong>Predictive Analysis Worker</strong></h2><p>The Predictive Analysis Worker uses machine learning (ML) together with the Causality Graph and the Attribute Dependency Graph for predictive analysis. Causely uses:</p><ul><li>ML to analyze the performance behavior of a small subset of attributes, e.g., some service loads, to predict their trends.</li><li>The Attribute Dependency Graph and the predicted trends to predict the state of the environment, i.e., the state of all the attributes.</li><li>The Causality Graph and the predicted future state to pinpoint potential bottlenecks and suggest actions that may prevent bottlenecks.</li></ul><p>In doing so, Causely pinpoints the actions required to prevent future degradations, SLO violations, or constraint violations.</p><h2 id="service-impact-analysis-worker"><strong>Service Impact Analysis Worker</strong></h2><p>The Service Impact Analysis Worker uses the Causality Graph to automatically analyze the impact of the root causes on SLOs, prioritizing the root causes based on the violated SLOs and those that are at risk. Causely automatically defines standard SLOs (based on latency and error rate) and uses machine learning to improve its anomaly detection over time. 
However, SLO definitions that already exist in another system can easily be incorporated in place of Causely’s default settings.&nbsp;</p><h2 id="postmortem-analysis-worker"><strong>Postmortem Analysis Worker</strong></h2><p>The Postmortem Analysis Worker uses the Causality Graph to save the relevant context of prior incidents to enable postmortem analysis. Causely saves the root cause, the Causality Graph of the root cause, the symptoms in the Causality Graph, and the relevant attribute trends.&nbsp;Teams can review prior incidents and see clear explanations of why these occurred and what the effect was, simplifying the process of postmortems and enabling actions to be taken to avoid recurrences.&nbsp;&nbsp;</p><h2 id="see-causely-for-yourself">See Causely for Yourself!</h2><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a>&nbsp;and let us show you how to transform the state of escalations and cross-organizational collaboration in cloud-native environments, or <a href="https://auth.causely.app/oauth/account/sign-up?ref=causely-blog.ghost.io" rel="noreferrer">start your free trial</a> now.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[In 2025, I resolve to be smarter about observability data]]></title>
      <link>https://causely.ai/blog/be-smarter-about-observability-data</link>
      <guid>https://causely.ai/blog/be-smarter-about-observability-data</guid>
      <pubDate>Fri, 17 Jan 2025 16:40:54 GMT</pubDate>
      <description><![CDATA[Collecting “more data” has been the defining characteristic of observability practices and tools for the last few decades. But over-collection creates inefficiencies, noise, and cost without adding meaningful value. This trajectory must and can be changed.]]></description>
      <author>Shmuel Kliger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causely_blog5_infographic.png" type="image/jpeg" />
      <content:encoded><![CDATA[<h2 id="observability-isn%E2%80%99t-about-how-much-data-you-collect%E2%80%94it%E2%80%99s-about-how-much-insight-you-can-gain-to-keep-your-systems-running">Observability isn’t about how much data you collect—it’s about how much insight you can gain to keep your systems running.&nbsp;</h2><p>Collecting “more data” has been the defining characteristic of observability practices and tools for the last few decades. The allure of multiple tools that instrument virtually every layer of the stack, from applications to infrastructure and beyond, traps us into believing that more data equals more insights. But the reality is often the opposite: over-collection creates inefficiencies, noise, and cost without adding meaningful value. As I said during a recent <a href="https://www.causely.ai/blog/dr-shmuel-kliger-on-causely-causal-ai-and-the-challenging-journey-to-application-health?ref=causely-blog.ghost.io" rel="noreferrer">podcast interview</a>, it may make you less blind, but not necessarily smarter.&nbsp;&nbsp;&nbsp;</p><p>Instead of clarifying what’s happening in your systems, excessive data leads to confusion. Teams are inundated with redundant metrics, irrelevant logs, and false-positive alerts. Each new piece of data requires engineers to interpret, process, and correlate, pulling focus away from actionable insights. As a result, observability becomes a “big data” problem, where the volume of information collected outweighs its usefulness.&nbsp;</p><p>We always collected too much data, but microservices architectures and the new emerging instrumentation, <a href="https://ebpf.io/?ref=causely-blog.ghost.io" rel="noreferrer">eBPF</a> and <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry</a>, exacerbate the problem. 
With tens of thousands of microservices, each collecting even a few metrics every few seconds, the volume adds up very quickly to millions of data points, overloading processing pipelines and consuming terabytes of storage. The realization that we need application instrumentation to get insights into the higher layers of the stack and the interactions between services led to the adoption of eBPF and OpenTelemetry, but with these came a tsunami of traces that overwhelms observability tools and makes them practically useless.&nbsp;</p><p>This trajectory must and can be changed.&nbsp;</p><h1 id="when-collecting-data-offers-diminishing-returns">When collecting data offers diminishing returns&nbsp;</h1><p>Collecting data for its own sake turns observability into a “big data problem” and distracts from its intended purpose: keeping systems up and running. As a result, you don’t simply apply a band-aid to a bullet hole; you end up applying band-aids upon band-aids upon band-aids. Each new big data problem results in more big data solutions, which result in more big data problems, and the cycle continues.&nbsp;</p><h2 id="lack-of-purpose-creates-noise">Lack of purpose creates noise&nbsp;</h2><p>Observability data is often collected without a clear sense of purpose. Teams instrument everything they can, thinking it might be useful someday, but this creates a flood of irrelevant information. The result is alert fatigue, where low-priority or false-positive alerts drown out the critical signals engineers need to act on. In Kubernetes environments, for example, native instrumentation combined with third-party tools often generates more metrics than teams can effectively use, adding complexity instead of clarity.&nbsp;</p><p>Earlier this week, we wrote about the <a href="https://www.causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>difficulties of troubleshooting problems</u></a> in production. 
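To put the earlier claim that a few metrics per service quickly become millions of data points in concrete terms, here is a back-of-the-envelope sketch; the fleet size, metric count, sample interval, and bytes-per-sample figure are purely illustrative assumptions:

```python
# Back-of-the-envelope metric volume estimate (all numbers are illustrative).
services = 20_000         # microservices in the fleet
metrics_per_service = 5   # "even a few metrics" per service
interval_s = 10           # one sample every few seconds

samples_per_day = services * metrics_per_service * (86_400 // interval_s)
print(f"{samples_per_day:,} data points per day")  # 864,000,000 data points per day

# At a rough ~100 bytes per stored sample, that is tens of gigabytes of raw
# samples per day, i.e. multiple terabytes per month, before traces or logs.
raw_gb_per_day = samples_per_day * 100 / 1e9
print(f"~{raw_gb_per_day:.0f} GB per day")  # ~86 GB per day
```

Even these modest assumptions land in the hundreds of millions of samples per day, which is why per-datapoint pricing and storage overhead escalate so quickly.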
Excessive data only makes the haystack larger; it doesn’t help you find the needle any faster.&nbsp;</p><p><strong>Ask yourself: what percentage of the collected and stored data is ever being looked at? What is the cost of processing and storing this data?</strong>&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/observability-noise.png" class="kg-image" alt="" loading="lazy" width="1024" height="512" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/observability-noise.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/01/observability-noise.png 1000w, https://causely-blog.ghost.io/content/images/2025/01/observability-noise.png 1024w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Observability tooling is creating alert fatigue. Source: </span><a href="https://digma.ai/coding-with-java-observability/?ref=causely-blog.ghost.io"><span style="white-space: pre-wrap;">https://digma.ai/coding-with-java-observability/</span></a></figcaption></figure><h2 id="fragmentation-of-observability-tools">Fragmentation of observability tools&nbsp;</h2><p><a href="https://www.site24x7.com/learn/observability-tool-fragmentation.html?ref=causely-blog.ghost.io" rel="noreferrer">Observability is highly fragmented</a>, with different tools collecting and processing data in silos. Each technology or layer of the application stack is monitored by a point tool that monitors that specific technology or layer. The applications may be monitored by an <a href="https://thectoclub.com/tools/best-application-monitoring-software/?ref=causely-blog.ghost.io" rel="noreferrer">APM tool</a> like Datadog, New Relic, or Dynatrace. The services layer may be monitored by eBPF and OpenTelemetry. The Kubernetes infrastructure may be monitored by the native Kubernetes instrumentation. 
Cloud services may be monitored by cloud providers’ tools like <a href="https://aws.amazon.com/cloudwatch/?ref=causely-blog.ghost.io" rel="noreferrer">CloudWatch</a>.&nbsp; And the list goes on.&nbsp;</p><p>Some of these tools collect metrics, some collect traces, and others ingest logs or events. Regardless of whether these point tools come from different vendors or are provided by a single vendor, each one of them has its own data model. They lack not only a unified data model but also a common semantic model to interpret and analyze the data across the entire stack and to provide actionable insights. Engineers are left to manually correlate information across dashboards, increasing time to resolution. This fragmentation creates inefficiencies that undermine the very purpose of observability: delivering actionable insights.&nbsp;</p><p>Ask yourself: </p><blockquote>How many different point tools are you using? <br>Which one do you use when and why?&nbsp; <br>How much of the data is redundant?&nbsp;</blockquote><h2 id="overhead-of-storing-and-processing-data">Overhead of storing and processing data&nbsp;</h2><p>Storing and analyzing vast amounts of observability data comes at a cost—both financial and operational. Many tools charge by data volume, making it cost-prohibitive to retain all collected data. Even for teams with generous budgets, processing and querying such large datasets consumes time and resources, often delivering diminishing returns.&nbsp;</p><p>Ask yourself: </p><blockquote>Are you in the business of collecting observability data or are you in the business of keeping your systems running reliably and efficiently?&nbsp;</blockquote><h1 id="a-top-down-approach-to-observability">A top-down approach to observability&nbsp;</h1><p>The solution isn’t more data—it’s the right data. A top-down approach flips the traditional model of observability on its head. 
Instead of starting with what data is available, start with the purpose: what are you trying to achieve? </p><p><strong>Define the goals first</strong>—whether it’s root cause analysis, SLO compliance, or performance optimization—and then work down to identify the data needed to support those goals.&nbsp;</p><p><strong>Building the right analytics is critical. </strong>Analytics should be purpose-built to deliver insights that directly address the goals you’ve defined. For example, if the goal is to identify bottlenecks, focus on analytics that analyze service latencies instead of drowning in utilization metrics.&nbsp;</p><p>Finally, <strong>only collect the data you need. </strong>With the purpose and analytics in place, you can focus on gathering metrics and logs that directly contribute to actionable insights. This reduces noise, eliminates unnecessary alerts, and lowers the overhead of storing and processing data.&nbsp;</p><h1 id="causely-helps-you-focus-on-what-matters">Causely helps you focus on what matters&nbsp;</h1><p>Our <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Platform</a> is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. It is designed to minimize data collection while maximizing insight. (You can learn more about how it works <a href="https://www.causely.ai/blog/10-capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer">here</a>.) </p><p>As a result, Causely is laser-focused on collecting only the metrics, traces, and logs required as input for the above analytics. This dramatically reduces the amount of data collected by Causely. Furthermore, all the raw metrics and traces are processed locally in real time, and none of it is pushed out to the cloud. The only information pushed to the back-end analytics in the cloud is the topology and symptoms. 
The only data stored in the cloud are the pinpointed root causes and the contextual information associated with the root causes. This minimizes Causely's TCO!&nbsp;</p><h1 id="conclusion">Conclusion&nbsp;</h1><p>Collecting more data doesn’t guarantee better observability. In fact, it often creates more problems than it solves. A top-down approach—starting with purpose, building targeted analytics, and focusing on necessary data—streamlines observability and makes it actionable. Platforms like Causely enable this approach, helping teams move beyond the big data trap and focus on delivering real value. Observability isn’t about how much data you collect—it’s about how much insight you can gain.&nbsp;</p><p>This week, we covered numerous ways teams can shift their posture from reactive infrastructure fixes to more proactive observability. Organizations that are more proactive will enjoy several benefits, including:&nbsp;</p><ul><li><a href="https://www.causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Spending less time troubleshooting</u></a>&nbsp;</li><li><a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Preventing escalations</u></a> from devolving into finger-pointing and blame&nbsp;</li><li><a href="https://www.causely.ai/blog/be-proactive-about-reliability?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Predicting the impact of changes</u></a> before they are deployed&nbsp;</li><li><a href="https://www.causely.ai/blog/be-more-proactive-about-changes-to-your-environment?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Planning for future environment</u></a> and infrastructure changes&nbsp;</li></ul><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a> and let us show you how you can bridge the gap between development and 
observability to build better, more reliable cloud-native applications.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[In 2025, I resolve to make my application environment more resilient]]></title>
      <link>https://causely.ai/blog/be-more-proactive-about-changes-to-your-environment</link>
      <guid>https://causely.ai/blog/be-more-proactive-about-changes-to-your-environment</guid>
      <pubDate>Thu, 16 Jan 2025 14:54:03 GMT</pubDate>
      <description><![CDATA[By identifying potential risks in real time, predicting future demand, and adapting resources dynamically, teams can maintain reliability even under extreme conditions. This isn’t about eliminating unpredictability; it’s about building systems that respond intelligently to it.]]></description>
      <author>Endre Sara</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causely_blog4_infographic.png" type="image/jpeg" />
      <content:encoded><![CDATA[<h2 id="what-teams-can-do-in-2025-to-be-proactive-and-predict-changes-they%E2%80%99ll-need-to-make-to-their-environment">What teams can do in 2025 to be proactive and predict changes they’ll need to make to their environment&nbsp;</h2><p>Yesterday, we talked about <a href="https://www.causely.ai/blog/be-proactive-about-reliability?ref=causely-blog.ghost.io" rel="noreferrer">being more proactive</a> about the code our developers write, anticipating how code changes will impact our systems and infrastructure more broadly. Now, we turn our attention to planning ahead for external factors that could affect our systems.&nbsp;&nbsp;</p><p>Environmental changes are inevitable in any complex system. Whether it’s a sudden traffic spike, an unexpected hardware failure, or gradual infrastructure degradation, external pressures can quickly destabilize even the most carefully designed environments. The challenge lies in anticipating these changes before they disrupt services. In distributed systems, where <a href="https://www.causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues?ref=causely-blog.ghost.io" rel="noreferrer">small disruptions can ripple</a> across dependencies, waiting until issues surface is simply not an option.&nbsp;</p><p>Proactive observability provides a necessary safeguard against these challenges. By identifying potential risks in real time, predicting future demand, and adapting resources dynamically, teams can maintain reliability even under extreme conditions. 
This isn’t about eliminating unpredictability; it’s about building systems that respond intelligently to it.&nbsp;</p><h1 id="challenges-to-being-more-proactive-about-environment-changes">Challenges to being more proactive about environment changes&nbsp;</h1><p>Three core challenges make proactive management of environmental changes difficult: predicting load trajectories, gaining forward-looking visibility into the broader system, and scaling resources efficiently. While various tools exist to address parts of these problems, they often fall short in providing the level of predictability, visibility, and adaptability that modern systems require.&nbsp;</p><h2 id="predicting-future-load-is-difficult">Predicting future load is difficult&nbsp;</h2><p>Predicting how traffic patterns will evolve is one of the most complex aspects of managing modern systems. Seasonal events, marketing campaigns, or even unforeseen surges can create spikes in demand that exceed system capacity.&nbsp;&nbsp;</p><p>When managing an increased number of users or handling heavier loads, developers must evaluate how the system handles scaling demands. This involves monitoring how many API calls are made to backend services for each user request and modeling the anticipated growth in API usage based on projected user activity. Monitoring tools can provide real-time data to understand current usage patterns, but they cannot predict the behavior of dependent services. Load-testing tools like Grafana Labs' <a href="https://k6.io/?ref=causely-blog.ghost.io" rel="noreferrer">k6</a> and <a href="https://jmeter.apache.org/?ref=causely-blog.ghost.io" rel="noreferrer">Apache JMeter</a> are commonly used during development to simulate expected traffic patterns. These tools are useful for pre-deployment stress testing but are static by design—they can’t account for real-time changes in traffic behavior or external variables like regional events or user behavior shifts. 
As a result, engineers often find themselves underprepared, scrambling to adjust capacity once demand has already outpaced infrastructure.&nbsp;</p><h2 id="forward-looking-visibility-into-evolving-systems-is-limited">Forward-looking visibility into evolving systems is limited&nbsp;</h2><p>Proactively managing environmental changes requires visibility not just into what’s happening now but into how the system is likely to evolve. For instance, if a service is handling a steadily increasing workload, engineers need to predict how soon it will hit a performance bottleneck and which dependent services may be affected.&nbsp;&nbsp;</p><p>For systems using messaging architectures, such as event-driven designs, it’s crucial to evaluate the rate at which messages are published and consumed. Developers need to assess whether consumers can process messages fast enough to keep up with the increased production rate. This may involve checking the throughput of message queues (e.g., <a href="https://kafka.apache.org/?ref=causely-blog.ghost.io" rel="noreferrer">Kafka</a>, <a href="https://www.rabbitmq.com/?ref=causely-blog.ghost.io" rel="noreferrer">RabbitMQ</a>) and ensuring sufficient worker threads or instances are available to process the messages without creating backlogs.&nbsp;</p><p>Database performance is another critical factor. As the number of user requests grows, so does the frequency of database queries. Developers must analyze the types of queries executed, the tables affected, and whether the database schema or indexes need optimization. Load testing can simulate increased query volume to predict the impact on database performance and resource usage. 
This analysis helps in determining whether the existing database can handle the load with vertical or horizontal scaling or if architectural changes, such as <a href="https://aws.amazon.com/what-is/database-sharding/?ref=causely-blog.ghost.io#:~:text=Database%20sharding%20splits%20a%20single,original%20database's%20schema%20or%20design." rel="noreferrer">database sharding</a> or <a href="https://nadtakan-futhoem.medium.com/what-is-read-replicas-d5f15e6c3b21?ref=causely-blog.ghost.io" rel="noreferrer">read replicas</a>, are necessary.&nbsp;</p><p>Current tools often provide snapshots of real-time metrics but lack the ability to project those metrics forward into the future.&nbsp;</p><p><a href="https://www.datadoghq.com/?ref=causely-blog.ghost.io" rel="noreferrer">Datadog</a> and <a href="https://newrelic.com/?ref=causely-blog.ghost.io" rel="noreferrer">New Relic</a> offer dashboards that display system dependencies and performance under current conditions, but they focus on reactive analysis. Similarly, <a href="https://prometheus.io/?ref=causely-blog.ghost.io" rel="noreferrer">Prometheus</a> provides excellent real-time monitoring for resource metrics like CPU or memory utilization but does not include predictive capabilities for identifying future stress points.&nbsp;&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/grafana_prometheus_dashboard.png" class="kg-image" alt="" loading="lazy" width="1184" height="914" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/grafana_prometheus_dashboard.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/01/grafana_prometheus_dashboard.png 1000w, https://causely-blog.ghost.io/content/images/2025/01/grafana_prometheus_dashboard.png 1184w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example Prometheus dashboard. 
Source: </span><a href="https://prometheus.io/docs/visualization/grafana/?ref=causely-blog.ghost.io"><span style="white-space: pre-wrap;">https://prometheus.io/docs/visualization/grafana/</span></a></figcaption></figure><p>This lack of forward-looking insight forces engineers into a reactive posture, leaving them vulnerable to issues that could have been anticipated and avoided.&nbsp;</p><h2 id="scaling-resource-efficiency-is-complicated">Scaling resource efficiency is complicated&nbsp;</h2><p>Scaling infrastructure to meet demand is both a reliability and cost challenge. Reactive scaling, where systems respond only after resource thresholds are exceeded, risks under-provisioning during critical moments, leading to downtime. On the other hand, over-provisioning wastes resources and drives up operational costs.&nbsp;&nbsp;</p><p>Tools like <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=causely-blog.ghost.io" rel="noreferrer">Kubernetes Horizontal Pod Autoscaler</a> (HPA) and <a href="https://aws.amazon.com/autoscaling/?ref=causely-blog.ghost.io" rel="noreferrer">AWS Auto Scaling</a> automate scaling based on resource metrics like CPU or memory utilization. While these tools are helpful, they are fundamentally reactive—they adjust only after usage patterns have shifted. They lack the predictive capabilities to anticipate future demand and scale resources proactively, leaving teams vulnerable to sudden surges or inefficiencies.&nbsp;</p><h1 id="what-we-require-to-be-more-proactive">What we require to be more proactive&nbsp;</h1><p>Proactively managing environmental changes requires observability systems that go beyond traditional monitoring and alerting. These systems must not only detect anomalies in real time but also provide predictive insights and adaptive workflows that empower teams to anticipate and mitigate risks before they impact users. 
Here are the critical capabilities needed to achieve this:&nbsp;</p><ul><li><strong>Real-time anomaly detection at scale</strong>. Identifying traffic surges, resource bottlenecks, or unexpected failures as they occur. Systems should process vast amounts of data with minimal latency, identifying deviations from normal patterns. Early detection prevents small issues from spiraling into widespread outages.&nbsp;</li><li><strong>Predictive analytics for load and capacity needs</strong>. Beyond real-time data, teams also need analytics that use historical and live data to forecast future system behavior. These systems should forecast where and when additional capacity will be needed, including across long-term patterns such as seasonal demand spikes. Predictive insights allow teams to allocate resources proactively.&nbsp;</li><li><strong>Integrated resource management</strong>. Armed with data about what to do, systems should be deeply integrated so that teams can act quickly on their predictive insights.&nbsp;</li><li><strong>Comprehensive system modeling</strong>. Teams need the ability to test the impact of environmental changes before they happen. This includes simulating traffic surges, dependency failures, or configuration changes in a controlled environment to identify potential risks and validate resilience.&nbsp;</li><li><strong>Clear, actionable insight</strong>. Systems must provide insights that are not only detailed but also actionable. This includes translating complex system behavior into clear recommendations that engineers and decision-makers can act on quickly. Think context-rich alerts and dashboards that highlight the root cause of anomalies, the potential impact of changes, and the recommended actions to mitigate them.&nbsp;</li></ul><p>Achieving true proactive observability requires more than just reacting to anomalies. 
It demands systems that predict, adapt, and empower teams to address environmental changes before they become critical. With these capabilities in place, organizations can ensure their systems remain stable and reliable, no matter what external pressures arise.&nbsp;</p><h1 id="causely-delivers-a-proactive-observability-system">Causely delivers a proactive observability system&nbsp;</h1><p>Our <a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Platform</a> is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. You can learn more about its capabilities that uniquely help teams proactively anticipate and plan for environment changes <a href="https://www.causely.ai/blog/10-capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer">here</a>.</p><h1 id="conclusion">Conclusion&nbsp;</h1><p>Environmental changes like traffic spikes or hardware degradation are inevitable, but their impact doesn’t have to mean downtime. Reacting to problems after they occur leaves teams in firefighting mode, scrambling to restore stability. The key to reliability isn’t just responding quickly—it’s anticipating risks and adapting before issues arise.&nbsp;</p><p>Proactive observability makes this possible. With real-time anomaly detection, predictive analytics, and adaptive workflows, teams can prevent disruptions and maintain stability under changing conditions. By adopting tools and practices that prioritize foresight over reaction, engineers can shift from firefighting to proactive system management, ensuring their systems remain reliable no matter what comes next.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[In 2025, I resolve to be proactive about reliability]]></title>
      <link>https://causely.ai/blog/be-proactive-about-reliability</link>
      <guid>https://causely.ai/blog/be-proactive-about-reliability</guid>
      <pubDate>Wed, 15 Jan 2025 17:15:40 GMT</pubDate>
      <description><![CDATA[Making changes to production environments is one of the riskiest parts of managing complex systems. In 2025, let's transform how changes are made, empowering teams to anticipate risks, validate decisions, and protect system stability—all before the first line of code is deployed.]]></description>
      <author>Enlin Xu</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causely_blog3_infographic.png" type="image/jpeg" />
      <content:encoded><![CDATA[<h2 id="what-developers-can-do-in-2025-to-be-proactive-and-prevent-incidents-before-they-happen-without-sacrificing-development-time">What developers can do in 2025 to be proactive and prevent incidents before they happen, without sacrificing development time&nbsp;</h2><p>So far this week, we’ve talked about the <a href="https://www.causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer">difficulties involved with troubleshooting</a> and the <a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?ref=causely-blog.ghost.io" rel="noreferrer">complexity of dealing with escalations</a>. What if we could head off these problems before they even happen?&nbsp;</p><p>Making changes to production environments is one of the riskiest parts of managing complex systems. Even a small, seemingly harmless tweak—a configuration update, a database schema adjustment, or a scaling decision—can have unintended consequences. These changes ripple across interconnected services, and without the right tools, it’s nearly impossible to predict their impact.&nbsp;</p><p>The business needs speed and agility in introducing new capabilities and features, but without sacrificing reliability, performance, and predictability. Hence, engineers need to know how a new feature, or even a minor change, will affect performance, reliability, and SLO compliance before it goes live. But existing observability tools lack the required capabilities to provide useful insights that enable engineers to safely deploy new features. The result? 
Reduced productivity, slower feature development, and reactive firefighting when changes go wrong, all leading to downtime, stress, and diminished user trust.&nbsp;</p><p>Recently, we worked with a customer who sought to shift their reactive posture to one that more aggressively seeks out problems before they happen.&nbsp; With this customer, a bug within one of their microservices (the data producer) caused the producer to stop updating the <a href="https://kafka.apache.org/intro?ref=causely-blog.ghost.io" rel="noreferrer">Kafka topic</a> with new events.&nbsp; This created a backlog of events for all the consumers of this topic.&nbsp; As a result, their customers were looking at stale data.&nbsp; Problems like this lead to poor customer experience and revenue loss, so many customers need to adopt a preventative mode of operations. </p><p>This post explores how this trend can be disrupted by providing the analytics and the reasoning capabilities to transform how changes are made, empowering teams to anticipate risks, validate decisions, and protect system stability—all before the first line of code is deployed.&nbsp;</p><h1 id="being-proactive-is-easier-said-than-done">Being proactive is easier said than done&nbsp;</h1><p>Change management is a process designed to ensure changes are effective, resolve existing issues, and maintain system stability without introducing new problems. At its core, this process requires a deep understanding of both the system’s dependencies and its state before and after a change.&nbsp;&nbsp;</p><p>Production changes are risky because engineers typically lack sufficient visibility into how changes will impact the behavior of entire systems. 
What seems like an innocuous change to an environment file or an API endpoint could have far-reaching ramifications that aren’t always obvious to the developer.&nbsp;</p><p>While <a href="https://medium.com/agileinsider/a-decade-of-expertise-navigating-the-evolutionary-path-of-observability-technologies-c6607efbd9d4?ref=causely-blog.ghost.io" rel="noreferrer">observability tools have come a long way</a> in helping teams monitor systems, they lack the analytics required to understand, analyze, and predict the reliability and performance behavior of cloud-native systems. As a result, engineers are left to “deploy and hope for the best”... and get a 3 AM call when things don’t work as expected. And, while we live in a veritable renaissance of developer tooling, most of these tools focus on developer productivity, not developer understanding of whole systems and the consequences of changes made to one component or service.&nbsp;</p><h2 id="it%E2%80%99s-hard-to-predict-the-impact-of-code-changes">It’s hard to predict the impact of code changes&nbsp;</h2><p>When planning a change, the priority, besides adding new functionality, is to confirm whether it addresses the specific service degradations or issues it was designed to resolve. Equally important is ensuring that the change does not introduce new regressions or service degradations.&nbsp;</p><p>Achieving these goals requires a comprehensive understanding of the system’s architecture, particularly the north-south (layering) dependencies and the east-west (service-to-service) interactions. Beyond mapping the topology, it is crucial to understand the data flow within the system—how data is processed, transmitted, and consumed—because these flows often reveal hidden interdependencies and potential impact areas.&nbsp;&nbsp;</p><p>Even minor configuration changes can create cascading failures in distributed systems. 
For instance, adjusting the scaling parameters of an application might inadvertently overload a backend database, causing performance degradation across services. Engineers often rely on experience, intuition, or manual testing, but these methods can’t account for the full complexity of modern environments.&nbsp;</p><h2 id="unpredictable-performance-behavior-of-microservices">Unpredictable performance behavior of microservices&nbsp;&nbsp;</h2><p>As we discussed in our <a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?ref=causely-blog.ghost.io" rel="noreferrer">post yesterday</a>, loosely coupled microservices communicate with each other and share resources. But which services depend on which? And what resources are shared by which services? These dependencies are continuously changing and, in many cases, unpredictable.&nbsp;&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/Screenshot-2025-01-15-at-10.33.19-AM.png" class="kg-image" alt="" loading="lazy" width="878" height="659" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/Screenshot-2025-01-15-at-10.33.19-AM.png 600w, https://causely-blog.ghost.io/content/images/2025/01/Screenshot-2025-01-15-at-10.33.19-AM.png 878w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Microservices architectures are complex. Source: </span><a href="https://www.slideshare.net/slideshow/microservices-the-right-way/51115560?ref=causely-blog.ghost.io#12"><span style="white-space: pre-wrap;">https://www.slideshare.net/slideshow/microservices-the-right-way/51115560#12</span></a></figcaption></figure><p>A congested database may cause performance degradations of some services that are accessing the database. But which one will be degraded? Hard to know. Depends. Which services are accessing which tables through what APIs? Are all tables or APIs impacted by the bottleneck? 
Which other services depend on the services that are degraded? Are all of them going to be degraded?&nbsp;&nbsp;</p><p>These are very difficult questions to answer. As a result, analyzing, predicting, and even just understanding the performance behavior of each service is very difficult. Furthermore, using existing brittle observability tools to diagnose how a bottleneck cascades across services is practically impossible.&nbsp;&nbsp;</p><h2 id="there%E2%80%99s-a-lack-of-%E2%80%9Cwhat-if%E2%80%9D-analysis-tools-for-testing-resilience">There’s a lack of “what-if” analysis tools for testing resilience&nbsp;</h2><p>Even though it’s important to simulate and test the impact of changes before deployment, the tools currently available are sorely lacking. Chaos engineering tools like <a href="https://www.gremlin.com/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Gremlin</u></a> and <a href="https://netflix.github.io/chaosmonkey/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Chaos Monkey</u></a> simulate failures, but don’t evaluate the impact of configuration changes. Tools like <a href="https://www.honeycomb.io/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Honeycomb</u></a> provide event-driven observability, but don’t help much with simulating what will happen with new builds. Inherently, if the tools can’t analyze the performance behavior of the services, they can’t support any “what-if” analysis.&nbsp;</p><p>Developer tools are inherently focused on the “build and deploy” phases of the software lifecycle, meaning they prioritize pre-deployment validation over predictive insights. 
They don’t provide answers to critical questions like: “How will this change impact my service’s reliability or my system’s SLOs?” or “Will this deployment create new bottlenecks?”&nbsp;</p><p>Predictive insights require correlating historical data, real-time metrics, dependency graphs, and, most importantly, a deep understanding of microservices’ performance behaviors. Developer tools simply aren’t built to ingest or analyze this kind of data at the system level.&nbsp;</p><h2 id="developer-and-operations-tools-today-are-both-insufficient">Developer and operations tools today are both insufficient&nbsp;</h2><p>Developer tools are essential for building functional, secure, and deployable code, but they are fundamentally designed for a different domain than observability. Developer tools focus on ensuring “what” is built and deployed correctly, while observability tools aim to identify “when” something is happening in production. The two domains overlap but rarely address the full picture.&nbsp;</p><p>Bridging this gap often involves integrating developer workflows—such as CI/CD pipelines—with observability systems. While this integration can surface useful metrics and automate parts of the release process, it still leaves a critical blind spot: understanding “why” something is happening. Neither traditional developer tools nor current observability platforms are designed to address the complexity of dynamic, real-world systems.&nbsp;</p><p>To answer the “why,” you need a purpose-built system to unravel the interactions, dependencies, and behaviors that drive modern production environments.&nbsp;</p><h1 id="building-for-reliability">Building for reliability&nbsp;</h1><p>Building reliable, performant applications was never easy, but it has become much harder. 
As David Shergilashvili correctly states in his recent post <a href="https://www.linkedin.com/pulse/microservices-bottlenecks-david-shergilashvili-zsujf?utm_source=share&utm_medium=member_ios&utm_campaign=share_via" rel="noreferrer noopener"><u>Microservices Bottlenecks</u></a>, “In modern distributed systems, microservices architecture introduces complex performance dynamics that require deep understanding. Due to their distributed nature, service independence, and complex interaction patterns, microservices systems' performance characteristics differ fundamentally from monolithic applications.”&nbsp;</p><p>Collecting data and presenting it to developers in pretty dashboards, with little or no built-in analytics to provide meaningful insights, won’t get us to reliable distributed microservices applications.&nbsp;</p><p>To accurately assess the impact of a change, the state of the system must be assessed both before and after the change is implemented. This involves monitoring key indicators such as system health, performance trends, anomaly patterns, threshold violations, and service-level degradations. These metrics provide a baseline for evaluating whether the change resolves known issues and whether it introduces new ones. However, the ultimate goal goes beyond metrics; it is to confirm that the known root causes of issues are addressed and that no new root causes emerge post-change.&nbsp;&nbsp;</p><p>We need to build systems that enable engineers to introduce new features quickly, efficiently, and, most importantly, safely, i.e., without risking the reliability and performance of their applications. Reasoning platforms with built-in analytics need to provide actionable insights that anticipate implications and prevent issues. These systems must have the following capabilities:&nbsp;</p><ul><li><strong>Service dependencies. </strong>Be able to capture, represent and auto-discover service dependencies. 
Without this knowledge, a developer can’t anticipate which services may be impacted by a change.&nbsp;</li><li><strong>Causality knowledge</strong>. Be aware of all potential failures and the symptoms they may cause. Without these insights, a developer can’t assess the risk of introducing a change. Is the application code I am about to change a single point of failure? What will happen if, for some reason, my new code doesn’t perform as expected?&nbsp;</li><li><strong>Performance analysis. </strong>Understanding the <a href="https://causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues/?ref=causely-blog.ghost.io" rel="noreferrer"><u>blast radius of performance bottlenecks</u></a>. Without a deep understanding of the interactions between the microservices as well as their shared resources, developers can’t predict all possible situations that may occur in production. Even the best tester can’t cover all possible scenarios. How can you know how a noisy neighbor will impact your application in production?&nbsp;&nbsp;</li><li><strong>What-if</strong>. The ability to assess load fluctuations before they happen, or the potential impacts of code changes to be introduced. Is your service ready for the predicted Black Friday traffic? What about the dependent upstream service?&nbsp;</li></ul><h1 id="causely-can-help-you-build-better-cloud-native-applications">Causely can help you build better cloud-native applications&nbsp;</h1><p>Our <a href="https://causely.ai/product/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causal Reasoning Platform</u></a> is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. You can learn more about what makes it unique <a href="https://www.causely.ai/blog/10-capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noreferrer">here</a>. 
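The “what-if” capability in the list above can be illustrated with a toy capacity check: project each service’s load under a forecast traffic multiplier and flag anything that would saturate, plus the callers that depend on it. All service names and numbers below are made up for illustration.

```python
# Toy "what-if" load check. Services, capacities (req/s), and the dependency
# edges (caller -> callee) are illustrative, not a real API.

capacity = {"gateway": 1000, "checkout": 400, "payments": 350}
current_load = {"gateway": 600, "checkout": 250, "payments": 240}
calls = {"gateway": ["checkout"], "checkout": ["payments"], "payments": []}

def what_if(multiplier):
    """Project load under a traffic spike; return services that would saturate."""
    projected = {s: current_load[s] * multiplier for s in current_load}
    return sorted(s for s in projected if projected[s] > capacity[s])

def upstream_of(saturated):
    """Callers that depend on a saturated service are also at risk."""
    return sorted({s for s, deps in calls.items() if any(d in saturated for d in deps)})

# A 1.5x Black Friday spike: payments (240 -> 360 req/s) exceeds its 350 req/s capacity,
# which also puts its caller, checkout, at risk.
print(what_if(1.5))               # ['payments']
print(upstream_of(what_if(1.5)))  # ['checkout']
```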
</p><p>With Causely, you can go from reactive troubleshooting to proactive design, development, and deployment of reliable cloud-native applications.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/prevent-slo-1.webp" class="kg-image" alt="" loading="lazy" width="876" height="651" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/prevent-slo-1.webp 600w, https://causely-blog.ghost.io/content/images/2025/01/prevent-slo-1.webp 876w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely automates root cause analysis, empowering proactive design, development and deployment of reliable cloud-native apps</span></figcaption></figure><h1 id="conclusion">Conclusion&nbsp;</h1><p>Making changes to production environments shouldn’t be a guessing game. Without the right systems, even the smallest change can introduce risks that compromise reliability, breach SLOs, and erode user trust. Reactive approaches to troubleshooting only address problems after they occur, leaving engineers stuck firefighting rather than innovating.&nbsp;</p><p>A purpose-built reasoning platform can flip this narrative. By incorporating predictive analytics, “what-if” modeling, and automated risk detection, teams can anticipate issues before they arise and deploy with confidence. Platforms like Causely empower engineers with the insights they need to write reliable code, validate changes, and ensure system stability at every step.&nbsp;</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a> and let us show you how you can bridge the gap between development and observability to build better, more reliable cloud-native applications.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[In 2025, I resolve to eliminate escalations and finger pointing]]></title>
      <link>https://causely.ai/blog/eliminate-escalations-and-finger-pointing</link>
      <guid>https://causely.ai/blog/eliminate-escalations-and-finger-pointing</guid>
      <pubDate>Tue, 14 Jan 2025 14:50:44 GMT</pubDate>
      <description><![CDATA[Explore the challenges of multi-team escalations, and the capabilities needed to address them. We’ll show how observability can be transformed to make escalations less contentious and more productive.]]></description>
      <author>Steffen Geißinger</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causely_blog2_infographic.png" type="image/png" />
      <content:encoded><![CDATA[<h2 id="make-escalations-less-about-blame-and-more-about-progress">Make escalations less about blame and more about progress&nbsp;</h2><p>Microservices architectures introduce complex, dynamic dependencies between loosely coupled components. In turn, these dependencies lead to complex, hard-to-predict interactions. In these environments, any resource bottleneck, service bottleneck, or malfunction will cascade and affect multiple services, crossing team boundaries. As a result, the response often spirals into a chaotic mix of war rooms, heated Slack threads, and finger-pointing. The problem isn’t just technical—it’s structural. Without a clear understanding of dependencies and ownership, every team spends more time defending their work than solving the issue. It’s a waste of effort that undermines collaboration and prolongs downtime.&nbsp;</p><p><a href="https://causely.ai/blog/spend-less-time-troubleshooting?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Yesterday, we resolved to spend less time troubleshooting in 2025.</u></a>&nbsp;</p><p>Troubleshooting and escalation are closely intertwined. A single unresolved bottleneck can ripple outward, forcing multiple teams into reactive mode as they struggle to isolate the true root cause. This dynamic creates inefficiencies and delays, with teams often focusing on band-aiding symptoms instead of remediating and solving the root causes. 
To eliminate this friction, we need systems that do more than detect anomalies—they must provide a seamless view of dependencies, understand and analyze the performance behaviors of the microservices, assign ownership intelligently, and guide engineers toward resolution with precision and context.&nbsp;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/escalation_chatgpt-1.png" class="kg-image" alt="" loading="lazy" width="1784" height="863" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/escalation_chatgpt-1.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/01/escalation_chatgpt-1.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/01/escalation_chatgpt-1.png 1600w, https://causely-blog.ghost.io/content/images/2025/01/escalation_chatgpt-1.png 1784w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The complexity of escalations in SRE and DevOps orgs, according to ChatGPT</span></figcaption></figure><p>Take, for example, an application developer who notices high request duration for users who are trying to interact with their application. This application communicates with many different services, and it happens to run within a container environment on public cloud infrastructure.&nbsp; There are more than 50 possible root causes that might be causing the high request duration issue.&nbsp; That developer would need to investigate garbage collection issues, disk congestion, app-locking problems, and node congestion among many other potential root causes until accurately determining that a congested database is the source of their problem.&nbsp; The only proper way to determine root cause is by considering all the cause-and-effect relationships between all the possible root causes and the symptoms they may cause. 
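Reasoning over cause-and-effect relationships, as described above, can be sketched as a tiny “codebook” lookup: each candidate root cause maps to the symptoms it is expected to produce, and candidates are ranked by how well they explain what was actually observed. The causes, symptoms, and scoring below are illustrative, not Causely’s actual model.

```python
# Sketch of codebook-style root cause ranking: score each candidate cause by
# how well its expected symptoms match the observed ones. Entries are hypothetical.

codebook = {
    "congested_database": {"high_request_duration", "slow_queries", "db_cpu_high"},
    "gc_pressure":        {"high_request_duration", "gc_pause_spikes"},
    "node_congestion":    {"high_request_duration", "node_cpu_high", "pod_evictions"},
}

def rank_causes(observed):
    """Rank causes by Dice/F1-style overlap between expected and observed symptoms."""
    scores = {}
    for cause, expected in codebook.items():
        hit = len(expected & observed)
        # Reward explained symptoms; penalize expected-but-missing and unexplained ones.
        scores[cause] = 2 * hit / (len(expected) + len(observed))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

observed = {"high_request_duration", "slow_queries", "db_cpu_high"}
print(rank_causes(observed)[0][0])   # 'congested_database'
```

A real causal engine weighs per-symptom probabilities over a live topology; the value of even this toy version is that the ranking considers all candidates at once instead of investigating them one by one.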
This process can often take hours or days before the correct root cause is pinpointed, resulting in a variety of <a href="https://causely.ai/blog/fools-gold-or-future-fixer-can-ai-powered-causality-crack-the-rca-code-for-cloud-native-applications/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>business consequences</u></a> (unhappy users, missed SLOs, SLA violations, etc.).&nbsp;</p><p>In this post, we’ll explore the challenges of multi-team escalations, and the capabilities needed to address them. From automated dependency mapping to explainable triage workflows, we’ll show how observability can be transformed from chaos into clarity, making escalations less contentious and far more productive.&nbsp;</p><h1 id="escalations-can-cripple-teams">Escalations can cripple teams&nbsp;</h1><p>Escalations create inefficiencies that extend downtime, frustrate teams, and waste resources. These inefficiencies stem from a combination of structural and technical gaps in how dependencies are understood, root causes are isolated, and ownership is assigned. Here are some of the key challenges that make escalations so painful today:&nbsp;</p><ul><li>There is a lack of cross-team visibility into dependencies&nbsp;</li><li>It can be hard to predict or analyze the performance behaviors of loosely coupled dependent microservices&nbsp;&nbsp;</li><li>It can be difficult to isolate the root cause among all affected services&nbsp;</li><li><a href="https://www.gartner.com/reviews/market/observability-platforms?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Legacy observability tools</u></a> must be stitched together to provide even partial visibility into issues&nbsp;</li></ul><h2 id="lack-of-cross-team-visibility">Lack of cross-team visibility&nbsp;</h2><p>Microservices architectures are complex and full of deeply interconnected components. An issue in one can cascade into others. 
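The cascade effect can be made concrete with a small dependency graph: everything that transitively depends on a faulty component is in its blast radius. The graph and service names below are illustrative.

```python
from collections import deque

# Illustrative dependency graph: edges point from a service to what it calls.
deps = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog":  ["inventory"],
    "payments": ["db"],
    "inventory": ["db"],
}

def blast_radius(faulty):
    """Services that transitively depend on the faulty component (BFS upstream)."""
    # Invert the edges so we can walk from the fault toward its dependents.
    dependents = {}
    for svc, callees in deps.items():
        for c in callees:
            dependents.setdefault(c, []).append(svc)
    seen, queue = set(), deque([faulty])
    while queue:
        node = queue.popleft()
        for d in dependents.get(node, []):
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return sorted(seen)

# A congested database can degrade every service above it.
print(blast_radius("db"))   # ['catalog', 'checkout', 'frontend', 'inventory', 'payments']
```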
Without clear visibility into these dependencies, teams are left guessing which components are impacted and which team should take ownership.&nbsp;</p><p>Your favorite observability tools help you visualize dependencies, but the maps they produce lack real-time accuracy and can quickly become outdated in environments with frequent changes. Some tools are great for aggregating logs, but don’t offer much insight into service relationships. Engineers are often left to piece together dependencies manually.&nbsp;</p><h2 id="unpredictable-performance-behavior-of-microservices">Unpredictable performance behavior of microservices&nbsp;&nbsp;</h2><p>Loosely coupled microservices communicate with each other and share resources. But which services depend on which? And what resources are shared by which services? These dependencies are continuously changing and, in many cases, unpredictable. &nbsp;</p><p>A congested database may cause performance degradations of some services that are accessing the database. But which one will be degraded? Hard to know. Depends. Which services are accessing which tables through what APIs? Are all tables or APIs impacted by the bottleneck? Which other services depend on the services that are degraded? Are all of them going to be degraded? These are very difficult questions to answer.&nbsp;&nbsp;</p><p>As a result, predicting, understanding and analyzing the performance behavior of each service is very difficult. Using existing brittle observability tools to diagnose how a bottleneck cascades across services is practically impossible.&nbsp;&nbsp;</p><h2 id="difficulty-identifying-root-causes-among-all-affected-services">Difficulty identifying root causes among all affected services&nbsp;</h2><p>Determining what’s a cause and what’s a symptom can be an incredibly time-consuming aspect of troubleshooting and escalations. 
Further, the person or team identifying a problem may well be looking at only their <a href="https://www.cuemath.com/calculus/local-maximum-and-minimum/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>local maxima</u></a>: the part of the system they work on or are directly affected by. They often don’t see the full picture of all intertwined systems. Identifying the root cause among all affected services can be inordinately difficult.&nbsp;</p><p>Even if you have tools that are excellent for visualizing time-series data, you must still rely on engineers to manually correlate metrics. APM tools can help you examine application performance but require significant manual effort to link symptoms to underlying causes, especially in microservices-based, cloud-native applications.&nbsp;</p><h2 id="legacy-observability-tooling-only-gives-you-partial-functionality">Legacy observability tooling only gives you partial functionality&nbsp;</h2><p>While both established and up-and-coming tools offer valuable capabilities, they often address only one part of the problem, leaving critical gaps. Dependency visibility, performance analysis and root cause isolation need to be integrated seamlessly to reduce the chaos of escalations. Today’s tools, however, are fragmented, requiring engineers to bridge the gaps manually, costing valuable time and effort during incidents. Solving these problems demands a holistic approach that ties all these elements together in real time.&nbsp;</p><h1 id="how-escalations-should-be-handled">How escalations should be handled&nbsp;</h1><p>Escalations have negative consequences for organizations of all sizes. Let’s work together to build systems that render escalations less about blame and more about opportunities to foster trust and collaboration. These systems will have the following capabilities:&nbsp;</p><ul><li><strong>End-to-end discovery of service dependencies</strong>. 
Automatically discover and maintain, in real time, a complete view of how systems interact.&nbsp;</li><li><strong>Workflow integration that directs root cause resolution to the correct team</strong>. Use the tools your organization has already invested in to turn root causes into actions for the correct team, reducing delays caused by miscommunication.&nbsp;</li><li><strong>Performance analysis of bottleneck propagation. </strong>Provide insight into how bottlenecks cascade across services.&nbsp;</li><li><strong>Detailed identification of root causes across an entire system</strong>. Empower engineers to act confidently without over-reliance on senior team members.&nbsp;</li></ul><p>With these new systems, escalations can result in positive business outcomes:&nbsp;</p><ul><li>Deep, real-time understanding of complete microservices systems&nbsp;</li><li>Better collaboration between teams to remediate issues as they arise&nbsp;</li><li>More innovation, less time in war rooms&nbsp;</li><li>Empowered engineering teams that solve problems instead of pointing fingers&nbsp;</li></ul><h1 id="causely-helps-you-handle-escalations-quickly-and-confidently">Causely helps you handle escalations quickly and confidently&nbsp;</h1><p>Our <a href="https://causely.ai/product/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Causal Reasoning Platform</u></a> is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. It includes several features to help you understand issues and handle escalations efficiently:&nbsp;</p><ul><li><strong>Out-of-the-box Causal Models</strong>. Causely is delivered with built-in causality knowledge capturing the common root causes that can occur in cloud-native environments. This causality knowledge enables Causely to automatically pinpoint root causes out-of-the-box as soon as it is deployed in an environment. 
There are at least a few important details to share about this causality knowledge:&nbsp;&nbsp;<ul><li>It captures potential root causes in a broad range of entities including applications, databases, caches, messaging, load balancers, DNS, compute, storage, and more.&nbsp;</li><li>It describes how the root causes will propagate across the entire environment and what symptoms may be observed when each of the root causes occurs.&nbsp;&nbsp;</li><li>It is completely independent of any specific environment and is applicable to any cloud-native application environment.&nbsp;&nbsp;</li></ul></li><li><strong>Automatic topology discovery</strong>. Cloud-native environments are a tangled web of applications and services layered over complex and dynamic infrastructure. Causely automatically discovers all the entities in the environment including the applications, services, databases, caches, messaging, load balancers, compute, storage, etc., as well as how they all relate to each other. For each discovered entity, Causely automatically discovers its:&nbsp;&nbsp;<ul><li><strong>Connectivity</strong> - the entities it is connected to and the entities it is communicating with horizontally&nbsp;&nbsp;</li><li><strong>Layering</strong> - the entities it is vertically layered over or underlying&nbsp;</li><li><strong>Composition</strong> - what the entity itself is composed of&nbsp;&nbsp;<br><br>Causely automatically stitches all of these relationships together to generate a Topology Graph, which is a clear dependency map of the entire environment. This Topology Graph updates continuously in real time, accurately representing the current state of the environment at all times.&nbsp;</li></ul></li><li><strong>Root cause analysis</strong>. 
Using the out-of-the-box Causal Models and the Topology Graph as described above, Causely automatically generates a causal mapping between all the possible root causes and the symptoms each of them may cause, along with the probability that each symptom would be observed when the root cause occurs. Causely uses this causal mapping to automatically pinpoint root causes based on observed symptoms in real time. No configuration is required for Causely to immediately pinpoint a broad set of root causes (100+), ranging from application malfunctions to service congestion to infrastructure bottlenecks.&nbsp;&nbsp;<br><br>In any given environment, there can be tens of thousands of different root causes that may cause hundreds of thousands of symptoms. Causely prevents SLO violations by detangling this mess and pinpointing the root cause putting your SLOs at risk and driving remediation actions before SLOs are violated. For example, Causely proactively pinpoints if a software update changes performance behaviors for dependent services before those services are impacted.&nbsp;</li><li><strong>Performance analysis. </strong>Causely analyzes microservices performance bottleneck propagation by automatically learning, based on your data:&nbsp;&nbsp;<ul><li>the correlation between the loads on services, i.e., how a change in load of one cascades and impacts the loads on other services;&nbsp;</li><li>the correlation between service latencies, i.e., how latency of one cascades and impacts the latencies of other services; and&nbsp;&nbsp;</li><li>the likelihood a service or resource bottleneck may cause performance degradations on dependent services.&nbsp;</li></ul></li><li><strong>Constraints analysis. 
</strong>Causely uses performance goals like throughput and latency, and capacity or cost constraints, and automatically figures out what actions need to be taken to assure the goals are accomplished while satisfying the constraints.&nbsp;&nbsp;</li><li><strong>Prevention analysis</strong>. Teams can also ask “what-if” questions to understand the impact that potential problems might have if they were to occur, supporting the planning of service/architecture changes, maintenance activities, and service resiliency improvements.</li><li><strong>Predictive analysis. </strong>Causely automatically analyzes performance trends and pinpoints the actions required to prevent future degradations, SLO violations, or breached constraints.&nbsp;&nbsp;</li><li><strong>Service impact analysis</strong>. Causely automatically analyzes the impact of the root causes on SLOs, prioritizing the root causes based on the violated SLOs and those that are at risk. Causely automatically defines standard SLOs (based on latency and error rate) and uses machine learning to improve its anomaly detection over time. However, SLO definitions that already exist in another system can easily be incorporated in place of Causely’s default settings.&nbsp;</li><li><strong>Contextual presentation</strong>. Results are intuitively presented in the Causely UI, enabling users to see the root causes, related symptoms, and the service impacts, and to initiate remedial actions. The results can also be sent to external systems to alert teams who are responsible for remediating root cause problems, to notify teams whose services are impacted, and to initiate incident response workflows.&nbsp;</li><li><strong>Postmortem analysis</strong>. 
Teams can also review prior incidents and see clear explanations of why these occurred and what the effect was, simplifying the process of postmortems, enabling actions to be taken to avoid re-occurrences.&nbsp;&nbsp;</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/01/better-alignment.webp" class="kg-image" alt="" loading="lazy" width="876" height="651" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/better-alignment.webp 600w, https://causely-blog.ghost.io/content/images/2025/01/better-alignment.webp 876w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Causely improves alignment between teams with a 360-degree view of service topology and cause-and-effect relationships</span></figcaption></figure><h1 id="conclusion">Conclusion&nbsp;</h1><p>Escalations don’t need to devolve into chaos, finger pointing, and frayed relationships. They can be opportunities for teams to solve real problems together. The key is having dependable, real-time information on service dependencies and root causes of problems. Armed with the right information, teams can work efficiently and collaboratively to maintain system reliability.&nbsp;</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a> and let us show you how to transform the state of escalations and cross-organizational collaboration in cloud-native environments.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[In 2025, I resolve to spend less time troubleshooting]]></title>
      <link>https://causely.ai/blog/spend-less-time-troubleshooting</link>
      <guid>https://causely.ai/blog/spend-less-time-troubleshooting</guid>
      <pubDate>Mon, 13 Jan 2025 14:46:48 GMT</pubDate>
      <description><![CDATA[SREs and developers can make troubleshooting more manageable in 2025 by adopting systems that solve the root cause analysis problem.]]></description>
      <author>Christine Miller</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/causely_blog1_infographic--1-.png" type="image/png" />
      <content:encoded><![CDATA[<h2 id="what-sres-and-developers-can-do-in-2025-to-make-troubleshooting-more-manageable">What SREs and developers can do in 2025 to make troubleshooting more manageable&nbsp;</h2><p>Troubleshooting is an unavoidable part of life for SREs and developers alike, and it often feels like an endless grind. The moment a failure occurs, the clock starts ticking. If the failure impacts a mission critical application, every second counts. Outages can cost hours of wasted productivity, to say nothing of lost revenue and pricing concessions when you’ve violated an SLO. Pinpointing the root cause requires sifting through piles of logs, metrics that blur together, and false positives. Troubleshooting becomes a search for a needle in a haystack, and to make things even more complex, the needle may not even be in the haystack. Furthermore, when the failure originates in your scope of control, the pressure intensifies—you’re expected to resolve it quickly, minimize downtime, and restore service without disrupting the rest of your work. It’s a reactive process, and it’s draining.&nbsp;&nbsp;</p><p>But it doesn’t have to be this way. By adopting systems that solve the root cause analysis problem and automate troubleshooting, you can shift troubleshooting from a time-consuming, heavy-lifting chore to a streamlined task. Automated root cause analysis cuts through the noise and pinpoints the issue in no time.&nbsp;&nbsp;&nbsp;</p><p>With the right approach, troubleshooting becomes a quick, manageable part of your day, freeing you to focus on building systems that don’t just react better but fail less often.&nbsp;</p><h1 id="what-do-we-mean-by-troubleshooting">What do we mean by troubleshooting?&nbsp;</h1><p>In a distributed microservices environment, troubleshooting often begins with an alert from the monitoring system or user feedback about degraded performance, such as increased latency or error rates. 
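The alert that kicks off this process is often just a check of recent latency against a historical baseline. A minimal sketch (the sample numbers and the 2x threshold are arbitrary assumptions):

```python
import statistics

def latency_alert(baseline_ms, recent_ms, factor=2.0):
    """Alert when the recent tail latency exceeds the baseline median by `factor`."""
    baseline = statistics.median(baseline_ms)
    # Approximate p95 of the recent window by index into the sorted samples.
    recent_p95 = sorted(recent_ms)[int(0.95 * (len(recent_ms) - 1))]
    return recent_p95 > factor * baseline

# Normal week: median ~100 ms. Today the tail has blown out past 200 ms.
baseline = [95, 100, 102, 98, 101, 99, 100]
today = [100, 105, 110, 240, 260, 300, 120]
print(latency_alert(baseline, today))   # True
```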
Typically, these issues are first observed in the service exposed to end users, such as an API gateway or frontend service. However, the root cause often lies deeper within the service architecture, making initial diagnosis challenging. The development team must begin by confirming the scope of the issue, correlating the alert with specific user-reported problems to identify whether it is isolated or systemic.&nbsp;</p><p>The next step involves tracing the source of the alert within the service ecosystem. Using distributed tracing tools like <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>OpenTelemetry,</u></a> the team tracks requests as they propagate through various microservices, identifying where bottlenecks or failures occur. Concurrently, a service dependency map, often visualized through monitoring platforms, provides a bird’s-eye view of interactions between services, databases, caches, and other dependencies, helping to pinpoint potential hotspots in the architecture.&nbsp;</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/2025/01/data-src-image-fc0c1bee-f2bc-45ff-a83e-cc91d60b343b.gif" class="kg-image" alt="" loading="lazy" width="960" height="566" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/01/data-src-image-fc0c1bee-f2bc-45ff-a83e-cc91d60b343b.gif 600w, https://causely-blog.ghost.io/content/images/2025/01/data-src-image-fc0c1bee-f2bc-45ff-a83e-cc91d60b343b.gif 960w" sizes="(min-width: 720px) 720px"></figure><p><em>Example service dependency map. Source: </em><a href="https://grafana.com/grafana/plugins/novatec-sdg-panel/?ref=causely-blog.ghost.io" rel="noreferrer noopener"><em><u>Grafana</u></em></a>&nbsp;</p><p>&nbsp;</p><p>Once the potential hotspots are identified, developers turn to metrics and logs for further insights. 
Resource utilization metrics, such as CPU, memory, and disk I/O, are analyzed to detect bottlenecks, while logs reveal specific errors or anomalies like timeouts or failed database queries. This analysis attempts to correlate symptoms with the timeline of the issue, offering clues to its origin. Often, the team experiments with quick fixes, such as scaling up CPU, memory, or storage for the affected services or infrastructure. While these adjustments might temporarily relieve symptoms, they rarely address the root cause and must be rolled back if ineffective.&nbsp;&nbsp;</p><p>When resource adjustments fail, a deeper dive into the affected components is necessary. Distributed traces provide detailed insights into slow transactions or failures, highlighting which services or calls are problematic. Developers then use continuous profiling tools to examine runtime data for each service, identifying resource-intensive methods, excessive memory allocations, or inefficient call paths. This granular analysis helps uncover inefficiencies or regressions in code performance.&nbsp;</p><p>If the issue involves a database, further investigation focuses on query performance. <a href="https://www.techtarget.com/searchdatamanagement/tip/Evaluate-and-choose-from-the-top-data-profiling-tools?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Database profiling tools</u></a> are used to analyze query execution times, frequency, and data volume. Developers assess whether queries are taking longer than usual, retrieving excessive data, or being executed too frequently. This step often reveals issues such as missing indexes, inefficient joins, or unoptimized queries, which could be contributing to overall service degradation. 
By iteratively analyzing and addressing these factors, the root cause of the problem is eventually identified and resolved, restoring system stability and performance.&nbsp;</p><p>Troubleshooting is reactive, time-consuming, and exhausting.&nbsp; Developers should be focusing their time and energy (and their company’s investment) on innovation, yet troubleshooting forces them to turn their attention elsewhere.&nbsp;</p><p>Troubleshooting doesn’t have to dominate your role; with the right systems, it can become efficient and manageable.&nbsp;</p><p>&nbsp;</p><h1 id="troubleshooting-is-hard">Troubleshooting is hard&nbsp;</h1><p>When trying to find the root cause of service or application outages or degradations, developers face numerous challenges:&nbsp;</p><ul><li>It’s hard to pinpoint which service is the source of the degradations amid a flood of alerts&nbsp;</li><li>It’s hard to diagnose and remediate the root cause&nbsp;</li><li>It’s hard to see the forest for the trees&nbsp; </li></ul><h2 id="it%E2%80%99s-hard-to-pinpoint-which-service-is-the-source-of-the-degradations-amid-a-flood-of-alerts">It’s hard to pinpoint which service is the source of the degradations amid a flood of alerts&nbsp;</h2><p>Failures propagate and amplify through the environment. A congested database or a congested resource will cause application starvation and service degradation that cascades throughout the system. Even if you deploy an observability tool to monitor the database or the resource, you may observe nothing on the database or the resource. And if you deploy an observability tool to monitor the applications and services, you will be flooded with alerts about application starvation and service degradations. Given the flood of alerts, pinpointing the bottleneck is very complex. As described above, it entails a time-consuming, heavy-lifting manual process under pressure.&nbsp;&nbsp;&nbsp;</p><p>The more observability tools you deploy, the more data you collect and the harder the problem gets. 
More alerts, more noise, more data you need to sift through. This is a journey to nowhere, a trajectory you want to reverse.&nbsp;</p><h2 id="diagnosing-and-remediating-the-root-cause-of-a-problem-is-hard">Diagnosing and remediating the root cause of a problem is hard&nbsp;</h2><p>Pinpointing the congested service, database or resource is hard and inefficient. Even if you know <em>where</em> the root cause is, you may not know <em>what</em> the root cause is. Without knowing what the root cause is, you can’t know what to remediate nor how to remediate.&nbsp;&nbsp;</p><p>Whether pinpointing <em>where</em> the bottleneck is or pinpointing <em>what</em> the root cause is, engineers rely on manual workflows to sift through logs, metrics, and traces. While new observability tools have emerged over the past decade focusing on cloud-native application infrastructure, and the traditional old guards have expanded their coverage to monitor the new technology landscape, <a href="https://www.youtube.com/watch?v=rs-5SYlCj80&ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>neither has solved the problem</u></a>. Some may do a better job than others in correlating anomalies or slicing and dicing the information for you, but they leave it to you to diagnose and pinpoint the root cause, leaving the hardest part unsolved. Furthermore, most of them require time consuming setup and configuration, deep expertise to operate, and deep domain knowledge to realize their benefits.&nbsp;</p><p>In practice, this means engineers are still performing most of the diagnostic work manually. The tools may be powerful, even elegant, but they don’t address the core challenge: <a href="https://causely.ai/blog/the-rising-cost-of-digital-incidents-understanding-and-mitigating-outage-impact?ref=causely-blog.ghost.io" rel="noreferrer"><u>diagnosing and remediating root causes remains a slow, resource-intensive process</u></a>, particularly when time is of the essence during an incident. 
These gaps prolong resolution times, increase stress, and reduce time for proactive system improvements.&nbsp;</p><h2 id="it%E2%80%99s-hard-to-see-the-forest-from-the-trees">It’s hard to see the forest for the trees&nbsp;</h2><p>Once the engineers turn to dashboards to investigate further, they stare at dashboards created by bottom-up tools. These tools collect a lot of data (often at great cost) and present this data in their dashboards without regard to the purpose of the information and the problem that needs to be solved. Engineers sift through metrics, logs, and time-series data, trying to understand context, composition, and dependencies so they can manually piece together patterns and correlations. This is highly labor-intensive and drives the engineer to get lost in the weeds without understanding the big picture of how the business or the service is impacted.&nbsp; Are service level objectives (SLOs) being violated? Are SLOs at risk?&nbsp;</p><p>Take your favorite observability tool. It probably excels at visualizing time-series data. However, it requires engineers to manually connect trends across dashboards and services, which can be especially challenging in distributed systems. Similarly, application performance management (APM) tools provide rich metrics and infrastructure insights, but the sheer volume of data presented in their dashboards can overwhelm users, making it difficult to focus on the most relevant information.&nbsp;&nbsp;</p><p>These tools, while powerful, often fall short in helping engineers see the forest for the trees. Instead of guiding engineers toward the right priorities and actionable insights about the broader system or the root cause, or even better, automatically pinpointing the root cause and remediating, they frequently amplify the noise. Irrelevant data, ambiguous relationships, and false positives force engineers to wade through excessive details, wasting time and delaying resolution. 
The lack of a top-down perspective makes it harder to understand how symptoms connect to underlying problems, leaving engineers stuck in the weeds.&nbsp;</p><h2 id="the-negative-consequences-of-troubleshooting-today">The negative consequences of troubleshooting today&nbsp;</h2><p>The way troubleshooting is done today has serious ramifications for organizations, teams, and individuals. It affects business outcomes and quality of life.&nbsp;</p><h3 id="failing-to-meet-the-slas">Failing to meet the SLAs&nbsp;</h3><p>Whether the goal is 5-nines, 4-nines, or even only 3-nines, if we continue to troubleshoot manually, we will never meet these SLAs. The table below illustrates how much downtime each availability target allows per year, quarter, month, week, and day.&nbsp;&nbsp;</p>
<!--kg-card-begin: html-->
<table border="1">
  <thead>
    <tr>
      <th>Availability %</th>
      <th>Downtime per year</th>
      <th>Downtime per quarter</th>
      <th>Downtime per month</th>
      <th>Downtime per week</th>
      <th>Downtime per day (24 hours)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>90% ("one nine")</td>
      <td>36.53 days</td>
      <td>9.13 days</td>
      <td>73.05 hours</td>
      <td>16.80 hours</td>
      <td>2.40 hours</td>
    </tr>
    <tr>
      <td>99% ("two nines")</td>
      <td>3.65 days</td>
      <td>21.9 hours</td>
      <td>7.31 hours</td>
      <td>1.68 hours</td>
      <td>14.40 minutes</td>
    </tr>
    <tr>
      <td>99.9% ("three nines")</td>
      <td>8.77 hours</td>
      <td>2.19 hours</td>
      <td>43.83 minutes</td>
      <td>10.08 minutes</td>
      <td>1.44 minutes</td>
    </tr>
    <tr>
      <td>99.99% ("four nines")</td>
      <td>52.60 minutes</td>
      <td>13.15 minutes</td>
      <td>4.38 minutes</td>
      <td>1.01 minutes</td>
      <td>8.64 seconds</td>
    </tr>
    <tr>
      <td>99.999% ("five nines")</td>
      <td>5.26 minutes</td>
      <td>1.31 minutes</td>
      <td>26.30 seconds</td>
      <td>6.05 seconds</td>
      <td>864.00 milliseconds</td>
    </tr>
  </tbody>
</table>
<!--kg-card-end: html-->
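<p>The downtime budgets above follow directly from the availability percentage: allowed downtime = (1 − availability/100) × period length, using a 365.25-day year (8,766 hours) as in the table. A minimal Python sketch (the <code>downtime_minutes</code> helper is ours, purely for illustration):</p>

```python
# Allowed downtime for a given availability target.
# Uses the 365.25-day year (8,766 hours) the table above is based on.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    return (1 - availability_pct / 100) * period_minutes

# Three nines over one month (1/12 of a year):
print(round(downtime_minutes(99.9, MINUTES_PER_YEAR / 12), 2))  # 43.83
# Four nines over a full year:
print(round(downtime_minutes(99.99, MINUTES_PER_YEAR), 2))      # 52.6
```

<p>Both results match the table: 43.83 minutes per month at three nines, and 52.60 minutes per year at four nines.</p>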
<p>Source: <a href="https://en.wikipedia.org/wiki/High_availability?ref=causely-blog.ghost.io#Percentage_calculation" rel="noreferrer">High Availability, Wikipedia</a></p><p>3-nines means 99.9% uptime—in other words, all services are performing reliably at least 99.9% of the time. So, if any of the services is degraded for more than 43.83 minutes in a month, the 3-nines SLA is not met. Because of the length of time manual troubleshooting entails, a single incident in the month will cause us to miss delivering on a 3-nines SLA. And 3-nines is not even so great!&nbsp;</p><h3 id="high-mean-time-to-detect-and-resolve-mttdmttr">High Mean Time to Detect and Resolve (MTTD/MTTR)&nbsp;</h3><p>The longer it takes to detect and resolve an issue, <a href="https://causely.ai/blog/real-time-data-modern-uxs-the-power-and-the-peril-when-things-go-wrong?ref=causely-blog.ghost.io" rel="noreferrer"><u>the greater the impact</u></a> on customers and the business. Traditional troubleshooting workflows, which often rely on reactive and manual processes, are inherently slow. Engineers are forced to navigate through an overwhelming volume of alerts, sift through logs, and correlate metrics without clear guidance. This delay can lead to:&nbsp;</p><ul><li>Prolonged outages that damage user trust and satisfaction.&nbsp;</li><li>Breaches of service level objectives (SLOs), which can result in financial penalties for organizations with stringent service level agreements (SLAs).&nbsp;</li><li>Snowballing effects, where unresolved issues trigger secondary failures, compounding the problem and making resolution even more challenging.&nbsp;</li></ul><h3 id="individual-stress-and-burnout-from-constant-reactive-tasks">Individual stress and burnout from constant reactive tasks&nbsp;</h3><p>The reactive nature of troubleshooting takes a significant toll on individual engineers. When every incident feels like a race against the clock, the pressure to resolve issues quickly can become overwhelming. 
Engineers often work under constant stress, juggling:&nbsp;</p><ul><li>Interruptions to their regular work, leading to disrupted schedules and decreased productivity.&nbsp;</li><li>Escalations where they are expected to step in as subject matter experts, often during nights or weekends.&nbsp;</li><li>Repeated exposure to alert noise, which can cause decision fatigue and desensitization to critical alerts.&nbsp;</li></ul><p>This relentless pace contributes to burnout. All it takes is a few hours of perusing the <a href="https://www.reddit.com/r/sre/?ref=causely-blog.ghost.io" rel="noreferrer noopener">/r/sre<u> subreddit</u></a> to see that burnout is a very common issue among SREs and developers tasked with maintaining system reliability. Burnout not only affects individuals but also leads to higher attrition rates, disrupting team continuity and increasing hiring and training costs.&nbsp;</p><h3 id="reduced-time-for-proactive-reliability-engineering">Reduced time for proactive reliability engineering&nbsp;</h3><p>Troubleshooting dominates the time and energy of engineering teams, leaving little room for <a href="https://ashkapow.medium.com/balancing-proactive-and-reactive-tasks-as-an-sre-ed7a4966dd0a?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>proactive reliability</u></a> initiatives. As we will see later this week, proactive reliability engineering has extraordinary promise for the entire company: product/engineering, operations, business leaders. But instead of focusing on preventing incidents, engineers are stuck in a reactive loop. 
This trade-off results in:&nbsp;</p><ul><li>Delayed implementation of improvements that could enhance system stability and scalability.&nbsp;</li><li>Accumulation of technical debt.&nbsp;</li><li>A vicious cycle where the lack of proactive work increases the likelihood of future incidents, perpetuating the troubleshooting burden.&nbsp;</li></ul><p>By constantly reacting to problems rather than proactively addressing underlying issues, teams lose the ability to innovate and build resilient systems. This dynamic not only affects engineering morale but also has broader implications for an organization’s ability to compete and adapt in fast-paced markets.&nbsp;</p><h1 id="how-troubleshooting-should-look">How troubleshooting should look&nbsp;</h1><p>If we all recognize that the state of the art of troubleshooting is awful today, let’s work together to imagine a future where troubleshooting is routine and fast:&nbsp;</p><ul><li><strong>Systems automatically pinpoint root causes within your domain quickly and accurately</strong>. Modern troubleshooting workflows must prioritize speed and precision. Systems should go beyond flagging symptoms and directly <a href="https://causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io" rel="noreferrer"><u>pinpoint the underlying cause</u></a> within your domain.&nbsp;</li><li><strong>Actionable information provides necessary context upfront</strong>. Systems need to focus on identifying the actions and ideally automating the automatable.&nbsp;</li><li><strong>Troubleshooting workflows are streamlined</strong>. Workflows should be intuitive and efficient, designed to minimize context switching and maximize focus with unified dashboards that integrate with your operational workflows.&nbsp;</li></ul><p>&nbsp;These systems must have certain capabilities to be effective:&nbsp;</p><ul><li><strong>Causality. 
</strong>The ability to capture, represent, understand and analyze cause and effect relations.&nbsp;</li><li><strong>Reasoning. </strong>Generic analytics that can reason about causality and automatically pinpoint root causes based on observed symptoms.&nbsp;</li><li><strong>Automatic topology discovery. </strong>The ability to automatically discover the environment, the entities and the relationships between them.&nbsp;</li></ul><p>With these systems, proper troubleshooting can drive positive business outcomes, such as:&nbsp;</p><ul><li><strong>Delivering on SLOs and meeting SLAs. </strong>Reduce the number of incidents.&nbsp;</li><li><strong>Faster issue resolution, minimizing downtime</strong>. Reduce mean time to detect (MTTD) and mean time to resolve or recover (<a href="https://causely.ai/blog/mttr-meaning?ref=causely-blog.ghost.io" rel="noreferrer"><u>MTTR</u></a>), keeping systems operational and minimizing the impact on users.&nbsp;</li><li><strong>Improved productivity by reducing time spent on reactive tasks</strong>. Enable engineers to focus on high-value innovation.&nbsp;</li></ul><h1 id="causely-automates-troubleshooting">Causely automates troubleshooting&nbsp;</h1><p>Our <a href="https://causely.ai/product/?ref=causely-blog.ghost.io" rel="noreferrer">Causal Reasoning Platform</a> is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. It is designed to make troubleshooting much simpler and more effective by providing:&nbsp;&nbsp;</p><ul><li><strong>Out-of-the-box Causal Models</strong>. Causely is delivered with built-in causality knowledge capturing the common root causes that can occur in cloud-native environments. This causality knowledge enables Causely to automatically pinpoint root causes out-of-the-box as soon as it is deployed in an environment. 
There are at least a few important details to share about this causality knowledge:&nbsp;&nbsp;<ul><li>It captures potential root causes in a broad range of entities including applications, databases, caches, messaging, load balancers, DNS compute, storage, and more.&nbsp;</li><li>It describes how the root causes will propagate across the entire environment and what symptoms may be observed when each of the root causes occurs.&nbsp;&nbsp;</li><li>It is completely independent from any specific environment and is applicable to any cloud-native application environment.&nbsp;&nbsp;</li></ul></li><li><strong>Automatic topology discovery</strong>. Cloud-native environments are a tangled web of applications and services layered over complex and dynamic infrastructure. Causely automatically discovers all the entities in the environment including the applications, services, databases, caches, messaging, load balancers, compute, storage, etc., as well as how they all relate to each other. For each discovered entity, Causely automatically discovers its:&nbsp;&nbsp;<ul><li><strong>Connectivity </strong>- the entities it is connected to and the entities it is communicating with horizontally&nbsp;&nbsp;</li><li><strong>Layering</strong> - the entities it is vertically layered over or underlying&nbsp;</li><li><strong>Composition</strong> - what the entity itself is composed of&nbsp;</li></ul></li></ul><p>Causely automatically stitches all of these relationships together to generate a Topology Graph, which is a clear dependency map of the entire environment. This Topology Graph updates continuously in real time, accurately representing the current state of the environment at all times.&nbsp;</p><ul><li><strong>Root cause analysis</strong>. 
Using the out-of-the-box Causal Models and the Topology Graph as described above, Causely automatically generates a causal mapping between all the possible root causes and the symptoms each of them may cause, along with the probability that each symptom would be observed when the root cause occurs. Causely uses this causal mapping to automatically pinpoint root causes based on observed symptoms in real time.<strong> </strong>No configuration is required for Causely to immediately pinpoint a broad set of root causes (100+), ranging from applications malfunctioning to services congestion to infrastructure bottlenecks.&nbsp;&nbsp;</li></ul><p>In any given environment, there can be tens of thousands of different root causes that may cause hundreds of thousands of symptoms. Causely prevents SLO violations by detangling this mess, pinpointing the root cause that’s putting your SLOs at risk, and driving remediation actions before SLOs are violated. For example, Causely proactively pinpoints if a software update changes performance behaviors for dependent services before those services are impacted.&nbsp;</p><ul><li><strong>Service impact analysis</strong>. Causely automatically analyzes the impact of the root causes on SLOs, prioritizing the root causes based on the violated SLOs and the ones that are at risk. Causely automatically defines standard SLOs (based on latency and error rate) and uses machine learning to improve its anomaly detection over time. However, environments that already have SLO definitions in another system can easily be incorporated in place of Causely’s default settings.&nbsp;</li><li><strong>Contextual presentation</strong>. The results are intuitively presented in the Causely UI, enabling users to see the root causes, related symptoms, the service impacts and initiate remedial actions. 
The results can also be sent to external systems to alert teams who are responsible for remediating root cause problems, to notify teams whose services are impacted, and to initiate incident response workflows.&nbsp;</li><li><strong>Prevention analysis</strong>. Teams can also ask “what if” questions to understand the impact that potential problems might have if they were to occur, supporting the planning of service/architecture changes and maintenance activities and improving the resilience of services. &nbsp;</li><li><strong>Postmortem analysis</strong>. Teams can also review prior incidents and see clear explanations of why these occurred and what the effect was, simplifying the process of postmortems and enabling actions to be taken to avoid recurrences. &nbsp;</li></ul><h2 id="conclusion">Conclusion&nbsp;</h2><p>Troubleshooting doesn’t have to be a developer’s or SRE’s nightmare when the right systems are in place. Empower yourself with the only system that solves the root cause analysis problem to make troubleshooting a small, manageable part of your job.&nbsp;</p><p><a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer noopener"><u>Book a meeting with the Causely team</u></a> and let us show you how to stop troubleshooting and consistently meet your reliability expectations in cloud-native environments.&nbsp;</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The O11ys 2024 – The Winners!]]></title>
      <link>https://causely.ai/blog/the-o11ys-2024-the-winners</link>
      <guid>https://causely.ai/blog/the-o11ys-2024-the-winners</guid>
      <pubDate>Thu, 02 Jan 2025 19:48:53 GMT</pubDate>
      <description><![CDATA[Read the Observability 360 announcement of all The O11ys 2024 winners.  Best Use of AI Winner: Causely Many observability systems now claim to support Root Cause Analysis. At the same time though, most of these systems use algorithms – admittedly, advanced…]]></description>
      <author>Karina Babcock</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2025/01/the-ollys-2024.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Read the <a href="https://observability-360.com/article/ViewArticle?id=o11ys-observability-awards-2024&ref=causely-blog.ghost.io" rel="noopener">Observability 360</a> announcement of all The O11ys 2024 winners.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/images/articles/o11ys-2024/olly-icon-sm-trans.png" class="kg-image" alt="o11y image" loading="lazy"></figure><p>Best Use of AI</p><h3 id="winner-causely">Winner: <a href="https://causely-blog.ghost.io/home/">Causely</a></h3><p>Many observability systems now claim to support Root Cause Analysis. At the same time though, most of these systems use algorithms – admittedly, advanced algorithms, which are based, fundamentally, on correlation rather than causation. For us Causely stands out as a system which truly embeds Causal AI in its reasoning and can therefore genuinely go beyond correlation and make more intelligent analysis of system behaviour.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Humans of OpenTelemetry]]></title>
      <link>https://causely.ai/blog/the-humans-of-opentelemetry</link>
      <guid>https://causely.ai/blog/the-humans-of-opentelemetry</guid>
      <pubDate>Thu, 19 Dec 2024 04:50:00 GMT</pubDate>
      <description><![CDATA[Adriana Villela (Dynatrace) and Reese Lee (New Relic) interviewed Causely Co-founder Endre Sara, along with several other OpenTelemetry users and contributors, during KubeCon NA 2024.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/02/opentelemetry-horizontal-color-1.svg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><a href="https://github.com/avillela?ref=causely-blog.ghost.io" rel="noopener">Adriana Villela</a>&nbsp;(Dynatrace) and <a href="https://github.com/reese-lee?ref=causely-blog.ghost.io" rel="noreferrer">Reese Lee</a> (New Relic)&nbsp;interviewed Causely Co-founder Endre Sara, along with several other OpenTelemetry users and contributors, during KubeCon NA 2024. </p><p>Their full recap is available on the <a href="https://opentelemetry.io/blog/2024/humans-of-otel-na-2024/?ref=causely-blog.ghost.io" rel="noreferrer">OpenTelemetry blog</a>, and you can watch the full recording below.  </p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/TIMgKXCeiyQ?start=223&amp;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="Humans of OTel - KubeCon NA 2024"></iframe></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Tackling CPU Throttling in Kubernetes for Better Application Performance]]></title>
      <link>https://causely.ai/blog/tackling-cpu-throttling-in-kubernetes</link>
      <guid>https://causely.ai/blog/tackling-cpu-throttling-in-kubernetes</guid>
      <pubDate>Wed, 27 Nov 2024 19:26:29 GMT</pubDate>
      <description><![CDATA[CPU throttling is a frequent challenge in containerized environments, particularly for resource-intensive applications. It happens when a container surpasses its allocated CPU limits, prompting the scheduler to restrict CPU usage. While this mechanism ensures fair resource sharing, it can significantly impact performance if not properly managed…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/11/cpu-throttling.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>CPU throttling is a frequent challenge in containerized environments, particularly for resource-intensive applications. It happens when a container surpasses its allocated CPU limits, prompting the scheduler to restrict CPU usage. While this mechanism ensures fair resource sharing, it can significantly impact performance if not properly managed. CPU throttling can be a major obstacle for applications like web APIs, video streaming platforms, and gaming servers. Addressing this issue involves two key steps: identifying throttling and implementing effective solutions.</p><h2 id="what-is-cpu-throttling">What is CPU Throttling?</h2><p>CPU throttling in containers occurs due to resource constraints set by control groups (cgroups). Kubernetes and other container orchestrators rely on cgroups to enforce resource limits. When a container attempts to use more CPU than its assigned quota, it gets throttled, delaying execution of tasks. (When containers have CPU limits defined, they will be converted to a cgroup CPU quota.)</p><p>Working on the vendor side for over a decade, I have seen the impact CPU throttling can have on different services across many industries.  Here are three top-of-mind examples, both from my days at <a href="https://www.ibm.com/products/turbonomic?ref=causely-blog.ghost.io" rel="noopener">Turbonomic</a> and from recent conversations with customers at <a href="https://causely-blog.ghost.io/home/">Causely:</a></p><h3 id="financial-systems">Financial Systems</h3><!--kg-card-begin: html--><ul>
<li><strong>Example:</strong> A stock trading platform uses containers to handle real-time market data feeds and execute trades. Throttling during peak trading hours delays data processing, potentially causing missed opportunities or incorrect order placements.</li>
<li><strong>Impact:</strong> Missed deadlines for transaction processing.</li>
</ul><!--kg-card-end: html--><h3 id="gaming-servers">Gaming Servers</h3><!--kg-card-begin: html--><ul>
<li><strong>Example:</strong> Online multiplayer games hosted in containers experience throttling, leading to delayed responses (lag) during gameplay. Players may experience slow rendering of in-game actions or disconnections during high traffic.</li>
<li><strong>Impact:</strong> Latency and poor user experience.</li>
</ul><!--kg-card-end: html--><h3 id="video-streaming-platforms">Video Streaming Platforms</h3><ul><li><strong>Example: </strong>A video-on-demand service runs encoding jobs in containers to transcode videos. Throttling increases encoding times, leading to delayed content availability or poor streaming quality for users.</li><li><strong>Impact:</strong> Degraded video quality and buffering issues.</li></ul><h2 id="how-to-identify-cpu-throttling">How to Identify CPU Throttling</h2><p>It’s often difficult to catch CPU throttling because it can happen even when the host CPU usage is low. It’s critical to have the right level of monitoring set up in order to see CPU throttling when it happens, or even better, before it becomes a problem.</p><h3 id="monitor-your-cgroup-metrics"><strong>Monitor Your cgroup Metrics</strong></h3><p>Linux cgroups provide detailed metrics about CPU usage and throttling. Look for the <code>cpu.stat</code> file within the container’s cgroup directory (usually under <code>/sys/fs/cgroup</code>).<br>Within the <code>cpu.stat</code> file there are three key metrics:</p><ul><li><code>nr_throttled</code>: Number of CPU periods in which the container was throttled.</li><li><code>throttled_time</code>: Total time spent throttled, in nanoseconds (cgroup v1; cgroup v2 reports <code>throttled_usec</code> in microseconds).</li><li><code>nr_periods</code>: Total CPU allocation periods.</li></ul><p>Example:<br><code>cat /sys/fs/cgroup/cpu/cpu.stat</code></p><p>Output:<br><code>nr_periods 12345</code><br><code>nr_throttled 543</code><br><code>throttled_time 987654321</code></p><p>If <code>nr_throttled</code> is high relative to <code>nr_periods</code>, or <code>throttled_time</code> keeps growing, then you have CPU throttling on your container.</p><h3 id="monitor-container-orchestration-metrics">Monitor Container Orchestration Metrics</h3><p>If you’re running Kubernetes, you can use the <a href="https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod/?ref=causely-blog.ghost.io" rel="noopener">kubectl top pod</a> 
command to get metric data on the most heavily utilized pods.  Try the command below to get metrics for a pod and all of its containers:<br><code>kubectl top pod --containers</code><br>This is a manual process: you will need to compare the reported CPU usage against the limits defined in the pod’s resource config, which <code>kubectl describe pod</code> will show you.  It also assumes you already know which pod is misbehaving, and when an application issue arises it usually takes time to drill down to the component that is performing poorly.  The Kubernetes Metrics Server must be installed for commands like <code>kubectl top</code> to work.  Beyond spot checks, metrics like <code>container_cpu_cfs_throttled_periods_total</code> and <code>container_cpu_cfs_periods_total</code> offer valuable insight into CPU usage and throttling.</p><h3 id="application-performance-metrics">Application Performance Metrics</h3><p>Although they can be expensive, <a href="https://www.g2.com/categories/application-performance-monitoring-apm?ref=causely-blog.ghost.io" rel="noopener">application performance monitoring (APM) tools</a> provide invaluable, detailed visibility into CPU throttling that can help uncover the issue. These tools can often track throttling over time, identify exactly when it first occurred, and, in some cases, even predict future throttling trends based on usage patterns. Many organizations use a combination of monitoring tools to get a comprehensive view of their systems.
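</p><p>The cgroup counters described earlier can also be turned into a quick signal without extra tooling. A minimal sketch (assuming cgroup v1 semantics; the helper names are my own):</p>

```shell
#!/bin/sh
# Helpers for interpreting cpu.stat counters (cgroup v1; names are illustrative).

# throttle_pct NR_THROTTLED NR_PERIODS -> integer percent of periods throttled
throttle_pct() {
  [ "$2" -gt 0 ] || { echo 0; return; }
  echo $(( 100 * $1 / $2 ))
}

# read_stat FILE KEY -> value of KEY in a cpu.stat-style file
read_stat() {
  awk -v k="$2" '$1 == k { print $2 }' "$1"
}

# With the example output shown earlier (543 throttled of 12345 periods):
throttle_pct 543 12345   # prints 4
```

<p>On a live container, feed <code>throttle_pct</code> the real counters read from <code>cpu.stat</code>; a sustained double-digit percentage of throttled periods is usually worth investigating.</p><p>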
APM tools also highlight the symptoms of CPU throttling, which may manifest as:</p><ul><li><strong>Prolonged request durations</strong>, leading to slower application response times.</li><li><strong>Decreased throughput</strong>, resulting in fewer transactions or tasks processed within a given timeframe.</li><li><strong>Irregular CPU usage patterns</strong>, which can signal performance instability or inefficiencies.</li></ul><p>By combining the capabilities of APM tools with metrics collected from Kubernetes, teams can proactively address CPU throttling and ensure optimal application performance.</p><h2 id="best-practices-to-manage-cpu-throttling-in-kubernetes">Best Practices to Manage CPU Throttling in Kubernetes</h2><p>There are many ways to fix CPU throttling and even a few ways to prevent it. The most common root causes are overcommitted nodes and misconfigured CPU limits.  Below are some ways to fix CPU throttling when it occurs and some best practices to avoid it in the future.</p><h3 id="adjust-cpu-limits">Adjust CPU Limits</h3><p>Update resource limits in your container or pod configuration, as in the example below.  Among the customers I have worked with, the usual approach is to set the limit just above peak usage over the last 30, 60, or even 90 days.  For non-critical workloads I have seen a few companies set this limit to 80% of max usage, and a few companies use more advanced techniques like calculating percentiles:</p><pre><code>resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"</code></pre><ul><li>Increase the <code>limits.cpu</code> value to reduce throttling frequency.</li><li>Set <code>requests.cpu</code> to ensure better performance during contention.
Note that if you do not set the request, Kubernetes will automatically set the request to the limit.</li></ul><h3 id="use-autoscaling-like-horizontal-pod-autoscaler">Use Autoscaling like Horizontal Pod Autoscaler</h3><p>The <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=causely-blog.ghost.io" rel="noopener">Horizontal Pod Autoscaler</a> (HPA) in Kubernetes helps address CPU throttling by dynamically adjusting the number of pods in a deployment based on real-time resource usage.  Resources like CPU and memory are monitored, and when certain thresholds are met, the HPA kicks in to provision more pods.  During idle periods it will also scale down the number of pods to help you run more efficiently.  By distributing the workload across more pods, the HPA reduces the CPU demands on individual pods, thereby mitigating CPU throttling.</p><pre><code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:   # the workload to scale (names here are illustrative)
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80</code></pre><p>In this example, if average CPU utilization across all pods exceeds 80%, the HPA adds pods as necessary within the bounds of 2 to 10 replicas (<code>minReplicas</code> and <code>maxReplicas</code>). Note that <code>autoscaling/v2</code> expresses the target as <code>target.averageUtilization</code>; the older <code>targetAverageUtilization</code> field belongs to the deprecated <code>v2beta1</code> API.</p><h3 id="analyze-node-resource-allocation">Analyze Node Resource Allocation</h3><p>Check overall CPU availability on nodes using a describe command:<br><code>kubectl describe node &lt;node-name&gt;</code><br>Ensure nodes aren’t overcommitted. Use taints and tolerations to control scheduling and ensure high-priority workloads run on dedicated nodes.  Overcommitted nodes run the risk of not having CPU available.  If containers’ combined requests exceed the node’s allocatable CPU, you are going to run into scheduling problems.
Or if limits are set too high relative to the node’s available CPU and the workload suddenly increases, you are going to have contention.</p><h3 id="tweak-cpu-cfs-settings">Tweak CPU CFS Settings</h3><p>Containers use the <a href="https://docs.kernel.org/scheduler/sched-design-CFS.html?ref=causely-blog.ghost.io" rel="noopener">Completely Fair Scheduler</a> (CFS) by default. The CFS in Kubernetes is a mechanism inherited from the Linux kernel that enforces CPU usage limits on containers. It works by using two key parameters from Linux cgroups: <code>cpu.cfs_quota_us</code> and <code>cpu.cfs_period_us</code>. These parameters allow Kubernetes to control the amount of CPU time a container can use over a specific period:</p><ul><li><code>cpu.cfs_quota_us</code>: Maximum microseconds of CPU time allowed per period.</li><li><code>cpu.cfs_period_us</code>: Length of a scheduling period in microseconds.</li></ul><p>To reduce CPU throttling, increase <code>cpu.cfs_quota_us</code> to provide more CPU time (here, 200,000 µs per period – two full CPUs with the default 100 ms period):<br><code>echo 200000 &gt; /sys/fs/cgroup/cpu/cpu.cfs_quota_us</code><br>Be careful with this adjustment, though, as it can lead to overcommitment: if too many tasks are scheduled and each container is allowed more CPU time, you will create contention, delays, and throttling elsewhere.  Experiment in dev or test clusters before you make any changes to prod.</p><h3 id="use-cpu-pinning">Use CPU Pinning</h3><p>This is more of an edge case, but instead of using CPU shares and limits, pin containers to specific CPUs for predictable performance.  The Kubernetes CPU Manager controls how CPUs are allocated to containers.  To enable CPU pinning, the <a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/?ref=causely-blog.ghost.io#static-policy" rel="noopener">static CPU manager policy</a> must be used, which provides exclusive CPU allocation.
Enable the policy in the kubelet configuration file:<br><code>cpuManagerPolicy: static</code><br>With the policy set to “static,” containers in Guaranteed QoS pods that request whole CPUs are allocated exclusive cores: Kubernetes assigns the container specific CPUs, and it runs only on those cores.  The big challenges with CPU pinning are overhead and scalability.  Managing pinned workloads requires detailed planning to avoid fragmentation and underutilization.  CPU pinning is good for workloads that are sensitive to CPU throttling but not ideal for volatile and dynamic workloads.</p><h2 id="cpu-throttling-is-a-double-edged-sword-in-kubernetes">CPU Throttling is a Double-Edged Sword in Kubernetes</h2><p>While CPU throttling plays a crucial role in resource management and stability, it can also hinder application performance if not managed correctly. By understanding how CPU throttling works and implementing best practices, you can optimize your Kubernetes environment, ensuring efficient resource use and enhanced application performance. As Kubernetes continues to grow and evolve, keeping a close eye on resource management will be key to maintaining robust and responsive applications.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[What’s a “Service Owner” and how can they improve application reliability?]]></title>
      <link>https://causely.ai/blog/whats-a-service-owner-and-how-can-they-improve-application-reliability</link>
      <guid>https://causely.ai/blog/whats-a-service-owner-and-how-can-they-improve-application-reliability</guid>
      <pubDate>Mon, 18 Nov 2024 13:46:50 GMT</pubDate>
      <description><![CDATA[Assuring application reliability is a persistent challenge faced by every IT organization, complicated by rapid technology evolution and the increased emphasis on lean engineering.  One trend among progressive companies is to designate a “Service Owner” who is responsible for making…]]></description>
      <author>Yotam Yemini</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/11/pexels-pixabay-276347.webp" type="image/jpeg" />
<content:encoded><![CDATA[<p>Assuring application reliability is a persistent challenge faced by every IT organization, complicated by rapid technology evolution and the increased emphasis on lean engineering.  One trend among progressive companies is to designate a “Service Owner” who is responsible for making sure applications meet their objectives for uptime and customer satisfaction.</p><p>In this post, we’ll explain what it means to be a Service Owner, outline key responsibilities associated with the role, and offer advice for companies looking to build a culture of service ownership.</p><h2 id="how-the-service-owner-came-to-be">How the Service Owner came to be</h2><p>When the <a href="https://devops.com/the-origins-of-devops-whats-in-a-name/?ref=causely-blog.ghost.io" rel="noopener">DevOps movement took off</a> around 2010, it promised to fix the issues with fragmented teams and inefficient software lifecycle management that had been hindering application performance and reliability for years.  This new wave of IT fostered cross-team collaboration, communication, and transparency as a way to accelerate software delivery and provide a higher quality of service (QoS).</p><p>But something was still missing: accountability across the end-to-end application lifecycle, from development to roadmap design to customer satisfaction. Hence the emergence of the <strong>Service Owner</strong>.</p><h2 id="so-what-is-a-service-owner">So, what is a Service Owner?</h2><p>Many IT teams today, especially those using microservice architectures, employ multiple Service Owners.  The exact definition varies slightly from company to company, but most would define a Service Owner (SO) as <strong>the person who is responsible for making sure the application or service meets its designated service level objectives (SLOs)</strong>.
Implicit in this definition: Service Owners are accountable for the end-to-end lifecycle of applications, from development through production performance monitoring, to ensure uptime and customer satisfaction.</p><p>For a little more context, here’s how <a href="https://wiki.en.it-processmaps.com/index.php/ITIL_Roles?ref=causely-blog.ghost.io#Service_Owner" rel="noopener">ITIL defines</a> the role:</p><ul><li>The Service Owner is responsible for delivering a particular service within the agreed service levels.</li><li>Typically, the Service Owner acts as the counterpart of the <a href="https://wiki.en.it-processmaps.com/index.php/ITIL_Roles?ref=causely-blog.ghost.io#Service_Level_Manager" rel="noopener">Service Level Manager</a> when negotiating Operational Level Agreements (OLAs).</li><li>Often, this role will lead a team of technical specialists or an internal support unit.</li></ul><p>Usually, Service Owners are aligned to a specific product line for the business. Their formal title may be Product Owner or Engineering Manager. For example, imagine a healthcare tech company that sells solutions to businesses and hospitals.  They have five different products within their offering suite:</p><ul><li>AI platform</li><li>Online therapy product</li><li>Analytics product</li><li>Telehealth service</li><li>Financial management product</li></ul><p>A Service Owner would be assigned to each of these product lines and assume full responsibility for its application lifecycle management and <a href="https://newsletter.pragmaticengineer.com/p/reliability-engineering?ref=causely-blog.ghost.io" rel="noopener">reliability engineering</a>.
A Service Owner role can also be broken down by customer, by services within the products, or even by mobile vs desktop applications.</p><h2 id="responsibilities-of-the-service-owner">Responsibilities of the Service Owner</h2><p>Service Owners’ responsibilities tend to vary slightly from company to company but they often include:</p><ol><li><strong>Design and Roadmap:</strong> Writing the code and overseeing the design and implementation of the service, ensuring it functions properly and is scalable. This includes managing the ongoing balance and prioritization of releasing new features vs. improving the reliability of existing functionality.</li><li><strong>Maintenance and Support:</strong> Ensuring that the service is properly maintained, updated, and supported over its lifecycle.  This is where creating and adhering to SLOs comes into play.</li><li><strong>Performance Monitoring:</strong> Monitoring the performance and reliability of the service, by implementing metrics and logging to track its health and flag when things break.  Performance Monitoring also includes implementing proactive monitoring to prevent downtime.</li><li><strong>Collaboration:</strong> Working closely with other teams, such as Product Management, Sales, Platform Engineering, etc. 
to align the service with the business goals.</li><li><strong>Documentation:</strong> Creating and maintaining comprehensive documentation for the service, including things like APIs, user guides, and architecture diagrams.</li><li><strong>Governance and Compliance:</strong> Ensuring that the service adheres to relevant policies, standards, and regulatory requirements.</li><li><strong>Stakeholder Communication:</strong> Acting as a point of contact for stakeholders, addressing their needs and concerns regarding the service.</li></ol><p>Service Owners work closely with technical experts such as SREs, DevOps, or Developers to maintain service reliability, though they each have distinct roles and tool preferences:</p><ul><li>Service Owners typically use collaboration and project management tools like Jira or Asana and monitor high-level metrics on observability dashboards like Grafana.</li><li>Technical Experts handle incident resolution and service reliability, relying primarily on observability and incident management tools like PagerDuty.</li></ul><p>Tools like ServiceNow, Atlassian’s Service Management (OpsGenie, Jira), and PagerDuty’s workflow orchestration attempt to bridge the gap between these two roles by providing a unified space for planning, alerting, diagnosis, and response. This enables Service Owners and technical experts to operate more effectively together, allowing engineering teams to enhance alignment, transparency, and accountability.</p><h2 id="code-it-ship-it-own-it">Code it, ship it, own it</h2><p>A service owner’s job goes beyond writing and compiling code and bug fixes.  They are responsible for their applications and services after they’ve been shipped to production.  When Service Owners own the lifecycle, organizations see improved QoS and faster <a href="https://causely-blog.ghost.io/mttr-meaning/">MTTR</a>.</p><p>If they wrote the code, they know how to fix it.  
If something breaks, they are the first responders and take accountability for failures.  They have deeper knowledge of the issues within their service and application, so they can diagnose and fix problems fastest, and in turn they are held accountable for downtime.  No more finger pointing!</p><p>One thing service owners MUST do is align themselves with their customers.  They must understand what the customers’ needs and expectations are, and then design the code and roadmap, and establish SLOs, accordingly.</p><p>The benefits of this approach? Products directly solve customer pain, and in most cases, services are delivered faster to customers.  It reminds me of a funny meme I saw years ago — it couldn’t be more accurate.  This situation is exactly what Service Owners are preventing!</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/11/service_owner_meme.webp" class="kg-image" alt="A six-panel comic showing variations of a tree swing, highlighting differences in user requests, analyst views, design, programming, desires, and final outcome." loading="lazy" width="466" height="516"></figure><p><em>Source: <a href="https://www.reddit.com/r/ProgrammerHumor/comments/cwo1oj/true_story/?ref=causely-blog.ghost.io" rel="noopener">Reddit</a></em></p><h2 id="building-a-culture-of-service-ownership">Building a culture of service ownership</h2><p>Since service ownership is a culture and not a tool, it needs to grow over time; it can’t happen overnight, no matter how much pressure the business puts on IT.
There are ways to actively foster a shift in mentality, so teams thrive in an environment where they have more responsibility and where better services are being delivered to the customer.</p><p>Based on my conversations with leaders in IT, these are some common best practices for building a culture of service ownership.</p><ol><li><strong>Promote collaboration: </strong>Encourage open communication between teams—development, operations, and business units. Regular cross-functional meetings and collaborative tools can help break down silos faster and foster a shared understanding of service objectives.</li><li><strong>Establish a customer-first mentality: </strong>As mentioned before, service ownership will not thrive if everyone has different ideas and goals.  Establishing a common goal like focusing on customers’ needs can align teams. If customer satisfaction is the north star, companies will have more satisfied customers, which means bigger checks 😉.  Defining customer specific SLOs is an excellent way to keep everyone aligned on the mission.  SLOs on latency, number of customer tickets/escalations, and even uptime are some standard ones I see most Service Owners using.</li><li><strong>Embrace failure: </strong>Taking accountability and responsibility for something that directly impacts a business can be scary. That’s probably the single biggest reason why most software engineers are hesitant to adopt a service ownership role.  If leadership fosters a culture where failure is seen as progress and not regression, then it becomes more appetizing to developers.  No one wants to lose their job over a silly mistake, but they need to learn from these slips and drive towards a more reliable application architecture.</li></ol><p>Building a culture of service ownership in IT requires a deliberate and consistent approach. By defining roles, fostering collaboration, and empowering teams, IT can create an environment where service ownership is continuously improved. 
This culture not only enhances QoS but also drives innovation and responsiveness, ultimately benefiting the health of any business.</p><hr><h2 id="faqs">FAQs</h2><ul><li><strong>What is application lifecycle management? </strong><br>Application lifecycle management is the end-to-end process of developing, building, deploying, and managing software applications over time to ensure consistent and ongoing quality, reliability, and resilience.</li><li><strong>What is reliability engineering? </strong><br>Reliability engineering is the practice of ensuring that applications, products, or systems function without failure. Reliability engineers focus on proactively identifying potential failures to determine their root cause and mitigation strategies before they happen.</li><li><strong>What is a Service Owner? </strong><br>A Service Owner is someone who is responsible for meeting agreed service levels. They usually own the overall engineering, management, and governance of a service’s lifecycle.</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Watch out! Sharks at KubeCon]]></title>
      <link>https://causely.ai/blog/watch-out-sharks-at-kubecon</link>
      <guid>https://causely.ai/blog/watch-out-sharks-at-kubecon</guid>
      <pubDate>Tue, 05 Nov 2024 15:38:18 GMT</pubDate>
      <description><![CDATA[Based on my LinkedIn news feed, it must be that time of year when thousands of open source enthusiasts congregate to talk tech at various parties, dinners, and other networking events surrounding KubeCon. In fact, we’re hosting a couple of…]]></description>
      <author>Prashant Sridharan</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/11/causely-blog-featured-image-1.png" type="image/jpeg" />
<content:encoded><![CDATA[<p>Based on my LinkedIn news feed, it must be that time of year when thousands of open source enthusiasts congregate to talk tech at various parties, dinners, and other networking events surrounding <a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/?ref=causely-blog.ghost.io" rel="noopener">KubeCon</a>. In fact, we’re hosting a couple of our own: You can <a href="https://lu.ma/d8dz3bal?ref=causely-blog.ghost.io" rel="noopener">register here for our happy hour</a> or <a href="https://lu.ma/gumzlja1?ref=causely-blog.ghost.io" rel="noopener">here for our dinner</a>.</p><p>And if you pay attention, you might start to notice some buzz building around one event in particular. Word on the KubeStreet is this will be one of those IYKYK type things, so I’m here to help you get in the know (cue Jaws music 🦈 🎶).</p><p>First, we need to set a bit of a foundation about the evolution of DevOps and what’s missing from the Observability market.</p><h2 id="the-beginning-of-devops">The beginning of DevOps</h2><p><a href="https://www.jedi.be/?ref=causely-blog.ghost.io" rel="noopener">Patrick Debois</a>, a tech consultant based in Belgium, is often credited with popularizing the term <em>DevOps</em>, because, in 2009, he organized the first <a href="https://devopsdays.org/?ref=causely-blog.ghost.io" rel="noopener">DevOpsDays</a> conference.
However, the truth is the roots of DevOps extend further back.</p><p>Earlier that same year, John Allspaw and Paul Hammond’s influential talk at the Velocity Conference, <em><a href="https://www.slideshare.net/slideshow/10-deploys-per-day-dev-and-ops-cooperation-at-flickr/1628368?ref=causely-blog.ghost.io" rel="noopener">10+ Deploys per Day: Dev and Ops Cooperation at Flickr</a></em> highlighted the importance of collaboration between development and operations to enable frequent, reliable deployments.</p><p>We can go back even further though, as the groundwork for DevOps was already being laid in 2001. Back then, The <a href="https://agilemanifesto.org/?ref=causely-blog.ghost.io" rel="noopener"><em>Agile Manifesto</em> </a>championed collaboration and iterative development, which are principles that would later shape the DevOps ethos. Meanwhile, tools like <a href="https://www.puppet.com/?ref=causely-blog.ghost.io" rel="noopener">Puppet</a>, <a href="https://www.chef.io/?ref=causely-blog.ghost.io" rel="noopener">Chef</a>, and <a href="https://www.jenkins.io/?ref=causely-blog.ghost.io" rel="noopener">Jenkins</a> were gaining traction in the oughts, helping bridge the gap between development and operations teams.</p><p>These influences helped shape what is commonly known as DevOps, which is a movement that champions collaboration, cultural transformation, and tooling integration to achieve high-frequency, reliable software delivery.</p><h2 id="then-came-observability">Then came Observability</h2><p>As part of this movement, engineering teams started to adopt microservices architecture as a software development method of breaking applications down into small, independent services that interact with each other to perform the job of the application. 
This method became popular because of the speed and ease it adds to the software development process.</p><p>On the other hand, a microservices approach makes it exponentially more difficult to diagnose, remediate, and prevent application performance problems. The growing popularity of DevOps practices and microservices architecture paved the way for advancements in monitoring and its evolution into what many know today as observability.</p><p>As DevOps led teams to release faster and decompose applications into microservices, traditional monitoring – which relies on static alerts based on isolated metrics – could no longer keep up with the resulting complexity. Observability tries to address this gap by offering a more holistic view to help engineers pinpoint issues, understand dependencies, and continuously improve system performance.</p><p>Companies large and small appreciate how observability tools provide their engineering teams with a <em>single pane of glass</em>, nifty dashboards, and helpful data correlation features, but the more discerning engineers have come to view this market of tools as more of a <em>single glass of pain</em>.</p><h2 id="cool-so-what-does-this-have-to-do-with-sharks">Cool, so what does this have to do with sharks?</h2><p>Suppose I told you that ice cream consumption and shark attacks were correlated. From this you might infer that <em>sharks love sugar</em>. Of course, you’ve probably heard this example used before to make the point that correlation does not equal causation. It’s certainly useful to understand that two metrics are temporally correlated, but it’s misleading when observability vendors claim this is root cause analysis, and in some scenarios it can hurt more than it helps.</p><p>At Causely, we value the observability market and believe it’s provided a good start for engineers who want to reduce <a href="https://www.causely.ai/blog/mttr-meaning/?ref=causely-blog.ghost.io">MTTD and MTTR</a>.
The more discerning engineers are coming to realize they will only reach the next frontier of autonomous service reliability if they start to look at the problem from a different, top-down perspective.</p><p>We’ll be happy to tell you more about this if you’re lucky enough to find the shark at the show next week. Here’s a hint: <em>she’ll be holding something sweet. </em>🍨 🥞</p><p>Additional information about our presence at KubeCon is <a href="https://causely-blog.ghost.io/kubecon/">here</a>.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Observability talks sure to make waves at KubeCon]]></title>
      <link>https://causely.ai/blog/observability-talks-sure-to-make-waves-at-kubecon-2024</link>
      <guid>https://causely.ai/blog/observability-talks-sure-to-make-waves-at-kubecon-2024</guid>
      <pubDate>Thu, 31 Oct 2024 19:52:58 GMT</pubDate>
      <description><![CDATA[KubeCon North America 2024 is around the corner! This year I’m especially excited, as it’s my first KubeCon since we launched Causely. The energy at KubeCon is unmatched, and it’s a great opportunity to catch up with familiar faces and make new…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/10/waves-1536x864-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p><a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/?ref=causely-blog.ghost.io" rel="noopener">KubeCon North America 2024</a> is around the corner! This year I’m especially excited, as it’s my first KubeCon since we launched <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a>. The energy at KubeCon is unmatched, and it’s a great opportunity to catch up with familiar faces and make new connections in the community.</p><p>Cloud-native and open source technologies are foundational to everything we’re building at Causely, and I’m looking forward to diving into the latest developments with observability tools that will shape the reliability of countless modern applications. We’re excited about how observability tools, such as <a href="https://prometheus.io/?ref=causely-blog.ghost.io" rel="noopener">Prometheus</a>, <a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noopener">OpenTelemetry</a>, <a href="https://github.com/grafana/beyla?ref=causely-blog.ghost.io" rel="noopener">Grafana Beyla</a>, and <a href="https://odigos.io/?ref=causely-blog.ghost.io" rel="noopener">Odigos</a>, are improving how systems are monitored and understood. These tools will underpin the reliability engineering strategy of countless modern application environments.</p><p>In this post, I’ll unpack some of the exciting innovations happening in the open source observability space, and highlight specific talks at KubeCon 2024 I’m looking forward to attending. Add them to your schedule if this is an area you’re following too, and I’ll see you there!</p><h2 id="what-s-new-in-the-observability-landscape">What’s New in the Observability Landscape</h2><p>The observability space is evolving rapidly. Several key developments with OpenTelemetry and <a href="https://ebpf.io/?ref=causely-blog.ghost.io" rel="noopener">eBPF</a> are reshaping how we approach monitoring and tracing in distributed systems. 
This helps make metric collection and auto-instrumentation easier and more flexible than ever.</p><h3 id="prometheus-and-opentelemetry-are-even-better-together">Prometheus and OpenTelemetry are even better together</h3><p>The collaboration between Prometheus and OpenTelemetry is delivering a more cohesive experience for capturing system metrics. Here’s what’s new:</p><ul><li><strong>Using the OTel Collector’s Prometheus Receiver:</strong> OpenTelemetry now allows for streamlined ingestion of Prometheus metrics, creating a more unified data pipeline.</li><li><strong>Exploring OTel-Native Metric Collection Options:</strong> For those deeply involved in Kubernetes monitoring, tools like the Kubernetes Cluster Receiver and the Kubeletstats Receiver provide robust options for flexible metric collection.</li></ul><p>These integrations create a powerful foundation for system insights that can fuel everything from troubleshooting to strategic infrastructure optimizations.</p><h3 id="odigos-makes-migration-to-opentelemetry-easy">Odigos makes migration to OpenTelemetry easy</h3><p>Transitioning from proprietary observability tools can be challenging, and if you’re making that move to OpenTelemetry, Odigos is a great enabler.
Here’s why:</p><ul><li><strong>Simplified Migration:</strong> Odigos takes the complexity out of moving to OpenTelemetry, helping reduce the friction and downtime that can accompany large-scale tooling changes.</li><li><strong>Access to Enhanced Capabilities:</strong> With OpenTelemetry, you gain access to a broader ecosystem, opening doors to new integrations and data visualizations that enrich your observability setup.</li></ul><h3 id="streamline-ebpf-based-auto-instrumentation">Streamline eBPF-based auto-instrumentation</h3><p>Combining Grafana Beyla with <a href="https://grafana.com/docs/alloy/latest/?ref=causely-blog.ghost.io" rel="noopener">Grafana Alloy</a> makes eBPF-powered observability easy and minimally intrusive. Whether you’re working with standalone systems or Kubernetes clusters, this integration provides high-precision monitoring capabilities without heavy overhead.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/10/screenshot-2024-10-31-at-11-45-18-am.png" class="kg-image" alt="A group of people smiling with arms raised in front of a booth at a conference. Text reads: &quot;KubeCon CloudNativeCon North America 2024, November 12-15, Salt Lake City, Utah." loading="lazy" width="620" height="248" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/10/screenshot-2024-10-31-at-11-45-18-am.png 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/10/screenshot-2024-10-31-at-11-45-18-am.png 620w"></figure><h2 id="kubecon-2024-deep-dives-and-hands-on-sessions">KubeCon 2024 Deep Dives and Hands-On Sessions</h2><p>The schedule at KubeCon North America 2024 will feature several sessions dedicated to observability. 
Here are some highlights I’m looking forward to:</p><ol><li><a href="https://sched.co/1iW8e?ref=causely-blog.ghost.io" rel="noopener"><strong>OpenTelemetry: The OpenTelemetry Hero’s Journey – Working with Open Source Observability </strong></a><br><em>Takeaway:</em> Dive into the current capabilities of OpenTelemetry, including correlated metrics, traces, and logs for a complete observability picture. The talk also addresses gaps in today’s open source observability tools.</li><li><a href="https://sched.co/1iW8h?ref=causely-blog.ghost.io" rel="noopener"><strong>Inspektor Gadget: eBPF for Observability, Made Easy and Approachable </strong></a><br><em>Takeaway:</em> This project lightning talk will explain how Inspektor Gadget simplifies eBPF distribution and deployment, allowing you to build fast, efficient data collection pipelines that plug into popular observability tools.</li><li><a href="https://sched.co/1i7lA?ref=causely-blog.ghost.io" rel="noopener"><strong>Optimizing LLM Performance in Kubernetes with OpenTelemetry</strong></a><br><em>Takeaway: </em>Speakers Ashok Chandrasekar (Google) and Liudmila Molkova (Microsoft) will help you gain practical insights into observing and optimizing Large Language Model (LLM) deployments on Kubernetes. This session covers everything from client and server tracing with OpenTelemetry to advanced autoscaling strategies.</li><li><a href="https://sched.co/1i7lJ?ref=causely-blog.ghost.io" rel="noopener"><strong>Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry</strong></a><br><em>Takeaway: </em>Speakers Kruthika Prasanna Simha and Charlie Le (Apple) will help attendees learn how to correlate data across metrics, traces, and logs, with exemplars. 
This session demonstrates practical visualization techniques in Grafana, making it easy to move from high-level metrics down to detailed traces.</li><li><a href="https://sched.co/1i7pF?ref=causely-blog.ghost.io" rel="noopener"><strong>Now You See Me: Tame MTTR with Real-Time Anomaly Detection</strong></a><br><em>Takeaway: </em>Speakers Kruthika Prasanna Simha and Raj Bhensadadia (Apple) will dive into the latest in real-time anomaly detection, with insights into applying machine learning to time series data in cloud-native environments.</li></ol><h2 id="find-causely-at-kubecon-2024-">Find Causely at KubeCon 2024!</h2><p>If the above topics are also interesting to you, we’d love to meet up. At Causely, we’re passionate about helping organizations unlock the full potential of their observability data to continuously assure service reliability. We believe that flying with overwhelming volumes of observability data is just as bad as flying blind.</p><p>Here’s where you can <a href="https://causely-blog.ghost.io/kubecon/">find us at KubeCon</a>:</p><ul><li><strong>🥂 Come to our happy hour!</strong> We’re co-hosting a <a href="https://lu.ma/d8dz3bal?ref=causely-blog.ghost.io" rel="noopener">happy hour</a> with friends from <a href="https://www.nvidia.com/en-us/?ref=causely-blog.ghost.io" rel="noopener">NVIDIA</a>, <a href="https://alma.security/?ref=causely-blog.ghost.io" rel="noopener">Alma Security</a>, <a href="https://edera.dev/?ref=causely-blog.ghost.io" rel="noopener">Edera</a>, and <a href="https://645ventures.com/?ref=causely-blog.ghost.io" rel="noopener">645 Ventures</a> on November 12th.</li><li><strong>🦈 Find the shark!</strong> Look for a shark at the show on the afternoon of November 13th. Here’s a hint: <em>she’ll be holding something sweet.</em></li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causal Reasoning Software with Causely’s Francis Cordon]]></title>
      <link>https://causely.ai/blog/causal-reasoning-software-with-causelys-francis-cordon</link>
      <guid>https://causely.ai/blog/causal-reasoning-software-with-causelys-francis-cordon</guid>
      <pubDate>Mon, 30 Sep 2024 20:18:41 GMT</pubDate>
      <description><![CDATA[]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/01/Screenshot-2024-12-13-at-11.50.00-AM.png" type="image/jpeg" />
      <content:encoded><![CDATA[]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The use of eBPF – in Netflix, GPU infrastructure, Windows programs and more]]></title>
      <link>https://causely.ai/blog/the-use-of-ebpf-in-netflix-gpu-infrastructure-windows-programs-and-more</link>
      <guid>https://causely.ai/blog/the-use-of-ebpf-in-netflix-gpu-infrastructure-windows-programs-and-more</guid>
      <pubDate>Wed, 25 Sep 2024 16:51:26 GMT</pubDate>
      <description><![CDATA[Takeaways from eBPF Summit 2024 How are organizations applying eBPF to solve real problems in observability, security, profiling, and networking? It’s a question I’ve found myself asking as I work in and around the observability space – and I was pleasantly…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/social-preview-ebpf-summit.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<!--kg-card-begin: html--><h2><img class="wp-image-1705 alignright" src="https://storage.ghost.io/c/f4/30/f4302d25-b539-4f59-b4cb-e3882f8f5026/content/files/wp-content/uploads/2024/09/ebpf-summit-1.xml" alt="eBPF Summit 2024" width="382" height="247"><!--kg-card-begin: html--><span style="color: #000000; font-size: 18pt;">Takeaways from eBPF Summit 2024</span><!--kg-card-end: html--></h2><!--kg-card-end: html--><p>How are organizations applying <a href="https://ebpf.io/?ref=causely-blog.ghost.io" rel="noopener">eBPF</a> to solve real problems in observability, security, profiling, and networking? It’s a question I’ve found myself asking as I work in and around the observability space – and I was pleasantly surprised when <a href="https://isovalent.com/?ref=causely-blog.ghost.io" rel="noopener">Isovalent</a>’s recent <a href="https://ebpf.io/summit-2024/?ref=causely-blog.ghost.io" rel="noopener">eBPF Summit</a> provided some answers.</p><p>For those new to eBPF, it’s an open source technology that empowers observability practices. Many organizations and vendors have adopted it as a data source (including <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a>, where we use it to enhance our instrumentation for Kubernetes).</p><p>Many of the eBPF sessions highlighted real challenges companies faced and how they used eBPF to overcome them. 
In the spirit of helping others, my cliff notes and key takeaways from eBPF Summit are below.</p><!--kg-card-begin: html--><h2><!--kg-card-begin: html--><span style="font-size: 18pt;">Organizations like Netflix and Datadog are using eBPF in new, creative ways</span><!--kg-card-end: html--></h2><!--kg-card-end: html--><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">The use of eBPF in Netflix</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><p>One of the keynote presentations was <a href="https://youtu.be/Pkz65BJHN2M?si=8sV-B_E8WgHbm4-o&ref=causely-blog.ghost.io" rel="noopener">delivered by Shweta Saraf</a>, who described specific problems Netflix overcame using eBPF, such as noisy neighbors. This is a common problem faced by many companies with cloud-native environments.</p><!--kg-card-begin: html--><div id="attachment_1713" style="width: 310px" class="wp-caption alignleft"><a href="https://youtu.be/Pkz65BJHN2M?si=sewwpeDTj8GwQgX9&ref=causely-blog.ghost.io" target="_blank" rel="noopener"><img aria-describedby="caption-attachment-1713" class="wp-image-1713 size-medium" src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/screenshot-2024-09-25-at-12-06-36-pm-1.png" alt="Shweta Saraf described Netflix's use cases for eBPF" width="300" height="150"></a><p id="caption-attachment-1713" class="wp-caption-text"><!--kg-card-begin: html--><span style="font-size: 8pt;"><em>Shweta Saraf described Netflix’s use cases for eBPF</em></span><!--kg-card-end: html--></p></div><!--kg-card-end: html--><p>Netflix uses eBPF to measure how long processes spend waiting to be scheduled onto a CPU. When processes wait too long, it usually indicates a performance bottleneck on CPU resources — like CPU throttling or over-allocation. 
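To make the idea concrete, here is a minimal userspace sketch of the aggregation step: flagging processes whose average run-queue wait exceeds a threshold. (This is purely illustrative — Netflix's actual detection runs in-kernel via eBPF, and the process names and threshold here are made up.)

```python
from collections import defaultdict

def noisy_neighbor_candidates(runq_events, threshold_us=10_000):
    """Aggregate per-process run-queue wait samples (as an eBPF program
    might emit them) and flag processes whose average wait before being
    scheduled exceeds `threshold_us` -- a hint of CPU contention.

    `runq_events` is a list of (process_name, wait_microseconds) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [sum of waits, sample count]
    for name, wait_us in runq_events:
        totals[name][0] += wait_us
        totals[name][1] += 1
    # Return names whose mean wait exceeds the threshold, sorted for stability.
    return sorted(
        name for name, (total, count) in totals.items() if total / count > threshold_us
    )

events = [("encoder", 25_000), ("encoder", 18_000), ("api", 900), ("api", 1_100)]
print(noisy_neighbor_candidates(events))  # ['encoder']
```

In practice the sampling itself is the hard part — which is exactly what the in-kernel eBPF instrumentation provides cheaply.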
(Netflix’s compute and performance team released a <a href="https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd?ref=causely-blog.ghost.io" rel="noopener">blog</a> recently with much more detail on the subject.) In solving the noisy neighbor problem, the Netflix team also created a tool called bpftop, which measures the CPU usage of the eBPF programs they instrumented.</p><p>The Netflix team released <a href="https://github.com/Netflix/bpftop?ref=causely-blog.ghost.io" rel="noopener">bpftop</a> for the community to use, and it will ultimately help organizations implement efficient eBPF programs. It is especially useful when an eBPF program misbehaves, allowing teams to quickly identify how much overhead each program adds. We have come full circle: <strong><em>monitoring our monitoring programs</em></strong> 😁.</p><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">The use of eBPF in Datadog</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><p>Another use case for eBPF – and one that can be easily overlooked – is in chaos engineering. <a href="https://www.linkedin.com/in/scottgerring/?originalSubdomain=ch&ref=causely-blog.ghost.io" rel="noopener">Scott Gerring</a>, a technical advocate at Datadog, shared his experience on the matter. 
This quote resonated with me: <em>“with eBPF… we have this universal language of destruction”</em> – controlled destruction, that is.</p><!--kg-card-begin: html--><div id="attachment_1715" style="width: 310px" class="wp-caption alignright"><a href="https://youtu.be/_5Zabryx0nE?si=n3vlWdW-wBygE_Fr&ref=causely-blog.ghost.io" target="_blank" rel="noopener"><img aria-describedby="caption-attachment-1715" class="wp-image-1715 size-medium" src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/screenshot-2024-09-25-at-12-23-33-pm-1.png" alt="Scott Gerring discussed eBPF's use in Datadog" width="300" height="144"></a><p id="caption-attachment-1715" class="wp-caption-text"><!--kg-card-begin: html--><span style="font-size: 8pt;"><em>Scott Gerring discussed eBPF’s use in Datadog</em></span><!--kg-card-end: html--></p></div><!--kg-card-end: html--><p>The benefit of eBPF is that we can inject failures into cloud-native systems without having to rewrite the code of an application. Interestingly, there are open source projects out there for chaos engineering that already use eBPF, such as <a href="https://github.com/chaos-mesh/chaos-mesh?ref=causely-blog.ghost.io" rel="noopener">ChaosMesh</a>.</p><p>Scott listed a few examples, like kernel probes attached to the openat system call that cause access-denied errors for 50% of calls made by processes a user selects or defines, or using the traffic control subsystem to drop packets on the sockets of processes marked for failure.</p><!--kg-card-begin: html--><h2><!--kg-card-begin: html--><span style="font-size: 18pt;">eBPF will underpin AI development</span><!--kg-card-end: html--></h2><!--kg-card-end: html--><p>Isovalent Co-founder and CTO <a href="https://www.linkedin.com/in/thomas-graf-73104547/?ref=causely-blog.ghost.io" rel="noopener">Thomas Graf</a> presented the eBPF roadmap and what he is most excited about. 
Notably: eBPF will deliver value in enabling the GPU and DPU infrastructure wave fueled by AI. AI is undoubtedly one of the hottest topics in tech right now. Many companies are using GPUs and DPUs to accelerate AI and ML (Machine Learning) tasks, because CPUs cannot deliver the processing power demanded by today’s AI models.</p><!--kg-card-begin: html--><div id="attachment_1714" style="width: 310px" class="wp-caption alignleft"><a href="https://youtu.be/oVoW5BUBRJk?si=MbFhH8fbtRMd-eQk&ref=causely-blog.ghost.io" target="_blank" rel="noopener"><img aria-describedby="caption-attachment-1714" class="wp-image-1714 size-medium" src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/screenshot-2024-09-25-at-12-20-30-pm-1.png" alt="Thomas Graf talked about the value of eBPF in enabling GPU and DPU infrastructures" width="300" height="144"></a><p id="caption-attachment-1714" class="wp-caption-text"><!--kg-card-begin: html--><span style="font-size: 8pt;"><em>Thomas Graf talked about the value of eBPF in enabling GPU and DPU infrastructures</em></span><!--kg-card-end: html--></p></div><!--kg-card-end: html--><p>As Tom mentioned, whether the AI wave produces anything meaningful is up for debate, but companies will undoubtedly try, and they will make significant investments in GPUs and DPUs along the way. The capabilities of eBPF will be applied to this new wave of infrastructure in the same manner they were for CPUs.</p><p>GPUs and DPUs are expensive, so companies do not want to waste processing power on programs that drive up utilization. The efficiency of eBPF programs can help maximize the performance of costly GPUs. For example, eBPF can be used for GPU profiling by hooking into GPU events such as memory, sync, and kernel launches. 
Unlocking this type of data helps teams understand which GPU kernels are used most frequently, improving the efficiency of AI development.</p><!--kg-card-begin: html--><h2><!--kg-card-begin: html--><span style="font-size: 18pt;">eBPF support for Windows is growing</span><!--kg-card-end: html--></h2><!--kg-card-end: html--><p>Another interesting milestone in eBPF’s journey is support for Windows. In fact, there is a growing GitHub repository for eBPF programs on Windows today:<a href="https://github.com/microsoft/ebpf-for-windows?ref=causely-blog.ghost.io" rel="noopener"> https://github.com/microsoft/ebpf-for-windows</a></p><p>The project supports Windows 10 or later and Windows Server 2019 or later, and while there is not yet feature parity with Linux, there is a lot of development in this space. The community is hard at work porting over the same tooling eBPF has on Linux, but it is a challenging endeavor, as the hook points for Linux eBPF components (like Just-In-Time compilation or eBPF bytecode signatures) differ on Windows.</p><p>It will be exciting to watch the same networking, security, and observability eBPF capabilities on Linux become available for Windows.</p><!--kg-card-begin: html--><h2><!--kg-card-begin: html--><span style="font-size: 18pt;">The need for better observability is fueling eBPF ecosystem growth</span><!--kg-card-end: html--></h2><!--kg-card-end: html--><p>eBPF tools have been created by the community for both application and infrastructure use cases. There are nine major application projects and over 30 exciting emerging ones. Notably, while there are a few production-ready runtimes and tools within the infrastructure ecosystem (like Linux and the LLVM compiler), there are many emerging projects such as eBPF for Windows.</p><p>With a user base across Meta, Apple, Capital One, LinkedIn, and Walmart (just to name a few), we can expect the number of eBPF projects to grow considerably in the coming years. 
The overall number of projects is forecast to reach triple digits by the end of 2025.</p><p>One of the top catalysts for growth? The urgent need for better observability. Of all the topics at last year’s <a href="https://www.cncf.io/kubecon-cloudnativecon-events/?ref=causely-blog.ghost.io" rel="noopener">KubeCon</a> in Chicago, observability ranked the highest, beating competing topics like cost and automation. As with any other tool, eBPF lets organizations gather a lot of data, but the “why” is important. Are you using that data to create more noise and more alerts, or are you applying it to get to the root cause of the problems that surface?</p><p>It is exciting to watch the eBPF community develop and implement creative new ways to use eBPF, and the 2024 eBPF Summit was (and still is) an excellent source of real-world eBPF use cases and community-generated tooling.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The “R” in MTTR: Repair or Recover? What’s the difference?]]></title>
      <link>https://causely.ai/blog/mttr-meaning</link>
      <guid>https://causely.ai/blog/mttr-meaning</guid>
      <pubDate>Tue, 17 Sep 2024 18:24:54 GMT</pubDate>
      <description><![CDATA[Finding meaning in a world of acronyms There are so many ways to measure application reliability today, with hundreds of key performance indicators (KPIs) to measure availability, error rates, user experiences, and quality of service (QoS). Yet every organization I…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/mttx-e1726595097344-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<h2 id="finding-meaning-in-a-world-of-acronyms">Finding meaning in a world of acronyms</h2><p>There are so many ways to measure application reliability today, with hundreds of key performance indicators (KPIs) to measure availability, error rates, user experiences, and quality of service (QoS). Yet every organization I speak with struggles to effectively use these metrics.  Some applications and services require custom metrics around reliability while others can be measured with just uptime vs. downtime.</p><p>In my role at <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a>, I work with companies every day who are trying to improve the reliability, resiliency, and agility of their applications. One method of measuring reliability that I keep a close eye on is MTT(XYZ).  Yes, I made that up, but it’s meant to capture all the different variations of mean time to “X” out there.  We have MTTR, MTTI, MTTF, MTTA, MTBF, MTTD, and the list keeps going.  In fact, some of these acronyms have multiple definitions.  The one whose meaning I want to discuss today is MTTR.</p><h2 id="so-what-s-the-meaning-of-mttr-anyway">So, what’s the meaning of MTTR anyway?</h2><p>Before cloud-native applications, MTTR meant one thing – Mean Time to Repair. It’s a metric focused on how quickly an organization can respond to and fix problems that cause downtime or performance degradation.  It’s simple to calculate too:</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/screenshot-2024-09-17-at-1-39-26-pm.png" class="kg-image" alt="MTTR meaning: How to calculate" loading="lazy" width="480" height="117"></figure><p>Total time spent on repairs is the length of time IT spends fixing issues, and number of repairs is the number of times a fix has been implemented.  Some organizations look at this over a week or a month in production. 
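As a quick illustration of the calculation just described — a sketch with hypothetical repair durations, not a Causely utility:

```python
from datetime import timedelta

def mean_time_to_repair(repair_durations):
    """MTTR = total time spent on repairs / number of repairs."""
    if not repair_durations:
        raise ValueError("need at least one repair to compute MTTR")
    total = sum(repair_durations, timedelta())  # sum timedeltas from zero
    return total / len(repair_durations)

# Three production fixes over the past month (made-up values).
repairs = [timedelta(minutes=45), timedelta(minutes=90), timedelta(minutes=30)]
print(mean_time_to_repair(repairs))  # 0:55:00
```

The same arithmetic works whether you track durations per week, per month, or per service — the only judgment call is which incidents count as "repairs."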
It’s a great metric to understand how resilient your system is and how quickly the team can fix a known issue. Unfortunately, data suggests that most IT organizations’ <a href="https://www.cncf.io/blog/2024/04/18/the-challenges-of-rising-mttr-and-what-to-do/?ref=causely-blog.ghost.io" rel="noopener">MTTR is increasing</a> every year, despite massive investments in the observability stack.</p><p>For monolithic applications, MTTR has historically been an excellent measurement; as soon as a fix is applied, the entire application is usually back online and performing well. Now that IT is moving toward serverless and cloud-native applications, it is a much different story. When a failure occurs in Kubernetes – where there are many different containers, services, applications, and more, all communicating in real time – the entire system can take much longer to <em><strong>recover</strong></em>.</p><h2 id="the-new-mttr-mean-time-to-recover">The new MTTR: Mean Time to Recover</h2><p>I am seeing more and more organizations redefine the meaning of MTTR from “mean time to repair” to “mean time to <em><strong>recover</strong></em>.” <em><strong>Recover</strong></em> means that not only is everything back online, but the system is performing well and satisfying any QoS or SLAs <em><strong>AND</strong></em> a preventative approach has been implemented.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/mttx-e1726595097344.png" class="kg-image" alt loading="lazy" width="1920" height="975" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/09/mttx-e1726595097344.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/09/mttx-e1726595097344.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/wp-content/uploads/2024/09/mttx-e1726595097344.png 1600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/mttx-e1726595097344.png 1920w" sizes="(min-width: 720px) 720px"></figure><p>For example, take a common problem within Kubernetes: a pod enters a CrashLoopBackOff state. There are many reasons why a pod might continuously restart, including deployment errors, resource constraints, DNS resolution errors, missing K8s dependencies, etc. But let’s say you completed your investigation and found that your pod did not have sufficient memory and was therefore crashing and restarting. So you increased the limit on the container or the deployment, and the pod(s) seemed to be running fine for a bit… but wait, one just got evicted.</p><p>The node now has increased memory usage and pods are being evicted. Or what if we have now created noisy neighbors, with that pod “stealing” resources like memory from others on the same node? This is why organizations are moving away from <em><strong>repair</strong></em>: sometimes a fix brings everything back online, but that doesn’t mean the system is healthy. “Repaired” can be a subjective term. Furthermore, sometimes the fix is merely a band-aid, and the problem returns hours, days, or weeks later.</p><p>Waiting for the entire application system to become healthy and applying a preventative measure will give us better insight into reliability. It is a more accurate way to measure how long it takes from a failure event to a healthy environment. After all, just because something is online does not mean it is performing well. The tricky issue here is: How do you measure “healthy”? In other words, how do we know the entire system is healthy and our preventative patch is truly preventing problems? There are some good QoS benchmarks like response time or transactions per second, but there is usually some difficulty in defining these thresholds. 
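One hedged way to operationalize “healthy” is to require that a QoS benchmark, such as response time, holds for a sustained window rather than a single sample. A minimal sketch in Python — the 200 ms threshold and five-sample window are illustrative, not prescribed values:

```python
def recovered(latency_samples_ms, threshold_ms=200, window=5):
    """Treat the service as recovered only when the last `window`
    response-time samples all meet the QoS threshold -- not merely
    when the first healthy sample appears after a fix."""
    if len(latency_samples_ms) < window:
        return False  # not enough evidence yet to declare recovery
    return all(s <= threshold_ms for s in latency_samples_ms[-window:])

# After a fix, latency briefly looks fine, then degrades, then settles:
samples = [950, 180, 170, 620, 710, 150, 160, 155, 140, 158]
print(recovered(samples))  # True: the last five samples meet the 200 ms target
```

A single healthy sample at index 1 would have declared victory prematurely; the sustained-window check only fires once the system stays within QoS.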
An improvement in MTBF (mean time between failures) is another good benchmark to test whether your preventative approach is working.</p><h2 id="how-can-we-improve-mean-time-to-recover">How can we improve Mean Time to Recover?</h2><p>There are many ways to improve system recovery, and ultimately the best way to improve MTTR is to improve all the MTT(XYZ) that come before it on incident management timelines.</p><ul><li><strong>Automation:</strong> Automating tasks like ticket creation, assigning incidents to appropriate teams, and, probably most importantly, automating the fix can all help reduce the time from problem identification to recovery. But the more an organization scrutinizes every single change and configuration, the longer it takes to implement a fix. Becoming less strict drives faster results.</li><li><strong>Well-defined Performance Benchmarks:</strong> Lots of customers I speak with have a couple of KPIs they track, but the more specific the better. For example, instead of making a blanket statement that every application needs a response time of 200ms or less, set these metrics on an app-by-app basis.</li><li><strong>Chaos Engineering:</strong> This is an often-overlooked <a href="https://www.splunk.com/en_us/blog/learn/chaos-engineering.html?ref=causely-blog.ghost.io#:~:text=Improve%20failure%20recovery%20%2D%20Since%20chaos%20tests,engineering%20enhances%20failure%20recovery%20and%20reduces%20downtime" rel="noopener">methodology to improve recovery rate</a>. Practicing and simulating failures helps improve how quickly we can react, troubleshoot, and apply a fix. It does take a lot of time, though, so it is not an easy strategy to adhere to.</li><li><strong>Faster Alerting Mechanisms:</strong> This is simple: The faster we get notified of a problem, the quicker we can fix it. We need not just to identify the symptoms but also to quickly find the root cause. 
I see many companies try to set up proactive alerts, but they often get more smoke than fire.</li><li><strong>Knowledge Base:</strong> This was so helpful for me in a previous role. Building a KB in a system like Atlassian Confluence, SharePoint, or Jira can help immensely in the troubleshooting process. The KB needs to be searchable and continuously updated as the environment evolves. Being able to search for a specific string from an error message within a KB can immediately highlight not just a root cause but also a fix.</li></ul><p>To summarize, MTTR is a metric that needs to capture the state of a system from the moment of failure until the entire system is healthy again. This is a much more accurate representation of how fast we recover from a problem, and how resilient the application architecture is. MTTR is a principle that extends beyond the world of IT; its applications exist in security, mechanics, even healthcare. Just remember, a good surgeon is not only measured by how fast they can repair a broken bone, but by how fast the patient recovers.</p><blockquote><em>Improving application resilience and reliability is something we spend a lot of time thinking about at Causely. We’d love to hear how you’re handling this today, and what metric you’ve found most useful toward this goal. Comment here or <a href="https://www.causely.ai/?ref=causely-blog.ghost.io#contact">contact us</a> with your thoughts!</em></blockquote>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Intelligence Augmentation: An Important Step in the Journey to Continuous Application Reliability]]></title>
      <link>https://causely.ai/blog/intelligence-augmentation-an-important-step-in-the-journey-to-continuous-application-reliability</link>
      <guid>https://causely.ai/blog/intelligence-augmentation-an-important-step-in-the-journey-to-continuous-application-reliability</guid>
      <pubDate>Wed, 11 Sep 2024 01:53:27 GMT</pubDate>
      <description><![CDATA[In an article that I published nearly two years ago titled Are Humans Actually Underrated, I talked about how technology can be used to augment human intelligence to empower humans to work better, smarter and faster. The notion that technology…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/supercharging-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/supercharging.jpg" class="kg-image" alt loading="lazy" width="647" height="336" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/09/supercharging.jpg 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/supercharging.jpg 647w"></figure><p>In an article that I published nearly two years ago titled <a href="https://www.linkedin.com/pulse/humans-actually-underrated-andrew-mallaband/?trackingId=E6%2FWQPtkSk67IjPHc7awGA%3D%3D&lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BZQeG1oz%2BTF21r%2BjGSeF%2FAg%3D%3D&ref=causely-blog.ghost.io" rel="noopener">Are Humans Actually Underrated</a>, I talked about how technology can be used to augment human intelligence to empower humans to work better, smarter and faster.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/augmented-intel-1.jpg" class="kg-image" alt loading="lazy" width="1272" height="584" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/09/augmented-intel-1.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/09/augmented-intel-1.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/augmented-intel-1.jpg 1272w" sizes="(min-width: 720px) 720px"></figure><p><br>The notion that technology can enhance human capabilities is far from novel. Often termed Intelligence Augmentation, Intelligence Amplification, Cognitive Augmentation, or Machine Augmented Intelligence, this concept revolves around leveraging information technology to bolster human intellect. 
Its roots trace back to the 1950s and 60s, a testament to its enduring relevance.</p><p>From the humble mouse and graphical user interface to the ubiquitous iPhone and the cutting-edge advancements in Artificial Intelligence like ChatGPT, Intelligence Augmentation has been steadily evolving. These tools and platforms serve as tangible examples of how technology can be harnessed to augment our cognitive abilities and propel us towards greater efficiency and innovation.</p><p>An area of scientific development closely aligned with Intelligence Augmentation is the field of Causal Reasoning. Before diving into this, it’s essential to underscore the fundamental importance of causality. Understanding why things happen, not just what happened, is the cornerstone of effective problem-solving, decision-making, and innovation.</p><h2 id="humans-crave-causality">Humans Crave Causality</h2><p>Our innate curiosity drives us to seek explanations for the world around us. This deep-rooted desire to understand cause-and-effect relationships is fundamental to human cognition. Here’s why:</p><p><strong>Survival:</strong> At the most basic level it all boils down to survival. By understanding cause-and-effect, we can learn what actions lead to positive outcomes (food, shelter, safety) and avoid negative ones (danger, illness, death).</p><p><strong>Learning:</strong> Understanding cause-and-effect is fundamental to learning and acquiring knowledge. We learn by observing and making connections between events, forming a mental model of how the world works.</p><p><strong>Prediction:</strong> Being able to predict what will happen allows us to plan for the future and make informed choices. We can anticipate the consequences of our actions and prepare for them.</p><p><strong>Problem-solving:</strong> Cause-and-effect is crucial for solving problems efficiently. 
By identifying the cause of an issue, we can develop solutions that address the root cause rather than just treating the symptoms.</p><p><strong>Scientific Discovery:</strong> This innate desire to understand causality drives scientific inquiry. By seeking cause-and-effect relationships, we can unravel the mysteries of the universe and develop new technologies.</p><p><strong>Technological Advancement:</strong> Technology thrives on our ability to understand cause-and-effect. From inventing tools to building machines, understanding how things work allows us to manipulate the world around us.</p><p><strong>Societal Progress:</strong> When we understand the causes of social issues, we can develop solutions to address them. Understanding cause-and-effect fosters cooperation and allows us to build a better future for ourselves and future generations.</p><h2 id="understanding-cause-effect-in-the-digital-world">Understanding Cause &amp; Effect In The Digital World</h2><p>In the complex digital age, this craving for causality remains as potent as ever. Nowhere is this more evident than in the world of cloud native applications. These intricate systems, composed of interconnected microservices and distributed components, can be challenging to manage and troubleshoot. When things go wrong, pinpointing the root cause can be akin to searching for a needle in a haystack.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/haystack-1.jpg" class="kg-image" alt loading="lazy" width="403" height="265"></figure><p>This is increasingly important today because so many businesses rely on modern applications and real time data to conduct their daily business. Delays, missing data and malfunctions can have a crippling effect on business processes and customer experiences which in turn can have significant financial consequences.</p><p>Understanding causality in this context is paramount. 
It’s the difference between reacting to symptoms and addressing the underlying issue. For instance, a sudden spike in error rates might be attributed to increased traffic. While this might be a contributing factor, the root cause could lie in a misconfigured database, a network latency issue, or a bug in a specific microservice. Without a clear understanding of the causal relationships between these components, resolving the problem becomes a matter of trial and error.</p><p>Today, site reliability engineers (SREs) and developers, tasked with ensuring the reliability and performance of cloud native systems, rely heavily on causal reasoning. They do this by constructing mental models of how different system components interact, anticipating potential failure points, and developing strategies to mitigate risks.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/rca-1.jpg" class="kg-image" alt loading="lazy" width="369" height="153"></figure><p>When incidents occur, SREs and developers work together, employing a systematic approach to identify the causal chain, from the initial trigger to the eventual impact on users. 
Organizations rely heavily on their knowledge to implement effective remediation steps and prevent future occurrences.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/triage-1.jpg" class="kg-image" alt loading="lazy" width="205" height="198"></figure><p>In the intricate world of cloud native applications, where complexity reigns, this innate ability to connect cause and effect is essential for building resilient, high-performing systems.</p><h2 id="the-crucial-role-of-observability-in-understanding-causality-in-cloud-native-systems">The Crucial Role of Observability in Understanding Causality in Cloud Native Systems</h2><p><a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">OpenTelemetry</a> and its ecosystem of observability tools provide a window into the complex world of cloud native systems. By collecting and analyzing vast amounts of data, engineers can gain valuable insights into system behavior. However, understanding why something happened – establishing causality – remains a significant challenge.</p><p>The inability to rapidly pinpoint the root cause of issues is a costly affair. A recent <a href="https://www.pagerduty.com/resources/learn/cost-of-downtime/?ref=causely-blog.ghost.io" rel="noopener">PagerDuty customer survey</a> revealed that the average time to resolve digital incidents is a staggering 175 minutes. This delay degrades service reliability, erodes customer satisfaction and revenue, and consumes significant engineering cycles in the process. 
This time-consuming process often leaves engineering teams overwhelmed and stuck in firefighting mode.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/pagerduty-1.jpg" class="kg-image" alt loading="lazy" width="1258" height="320" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/09/pagerduty-1.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/09/pagerduty-1.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/09/pagerduty-1.jpg 1258w" sizes="(min-width: 720px) 720px"></figure><p>To drive substantial improvements in system reliability and performance, organizations must accelerate their ability to understand causality. This requires a fundamental shift in how we approach observability. By investing in advanced analytics that can reason about causality, we can empower engineers to quickly identify root causes and their effects so they can prioritize what is important and implement effective solutions.</p><h2 id="augmenting-human-ingenuity-with-causal-reasoning">Augmenting Human Ingenuity with Causal Reasoning</h2><p>In this regard, causal reasoning software like <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causely</a> represents a quantum leap forward in the evolution of human-machine collaboration. <a href="https://www.causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai/?ref=causely-blog.ghost.io">By combining this capability with OpenTelemetry</a>, the arduous task of causal reasoning can be automated, liberating SREs and developers from the firefighting cycle. 
Instead of being perpetually mired in troubleshooting, they can dedicate more cognitive resources to innovation and strategic problem-solving.</p><p>Imagine these professionals equipped with the ability to process vast quantities of observability data in mere seconds, unveiling intricate causal relationships that would otherwise remain hidden. This is the power of causal reasoning software built to amplify reliability engineering processes. Such platforms augment human intelligence, transforming SREs and developers from reactive problem solvers into proactive architects of system reliability.</p><p>By accelerating incident resolution from today’s averages (175 minutes, as documented in the PagerDuty customer survey) to mere minutes, these platforms not only enhance customer satisfaction but also unlock significant potential for business growth. With freed-up time, teams can focus on developing new features, improving system performance, and preventing future issues. Moreover, the insights derived from causal reasoning software can be leveraged to proactively identify vulnerabilities and optimize system performance, elevating the overall reliability and resilience of cloud native architectures.</p><p>The convergence of human ingenuity and machine intelligence, embodied in causal reasoning software, is ushering in a new era of problem-solving. 
This powerful combination enables us to tackle unprecedented challenges with unparalleled speed, accuracy, and innovation.</p><p>In the context of reliability engineering, the combination of OpenTelemetry and causal reasoning software offers a significant opportunity to accelerate progress towards continuous application reliability.</p><hr><h2 id="related-resources">Related resources</h2><ul><li><a href="https://www.causely.ai/blog/explainability-the-black-box-dilemma-in-the-real-world/?ref=causely-blog.ghost.io">Read the blog</a>: Explainability: The Black Box Dilemma in the Real World</li><li><a href="https://www.causely.ai/video/mission-impossible-cracking-the-code-of-complex-tracing-data/?ref=causely-blog.ghost.io">Watch the video</a>: See how Causely leverages OpenTelemetry</li><li><a href="https://www.causely.ai/resources/experience-causely/?ref=causely-blog.ghost.io">Take the interactive tour</a>: Experience Causely first-hand</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Preventing Out-of-Memory (OOM) Kills in Kubernetes: Tips for Optimizing Container Memory Management]]></title>
      <link>https://causely.ai/blog/kubernetes-oom-killer-tips</link>
      <guid>https://causely.ai/blog/kubernetes-oom-killer-tips</guid>
      <pubDate>Wed, 28 Aug 2024 20:24:25 GMT</pubDate>
      <description><![CDATA[Running containerized applications at scale with Kubernetes demands careful resource management. One very complicated but common challenge is preventing Out-of-Memory (OOM) kills, which occur when a container’s memory consumption surpasses its allocated limit. This brutal termination by the Kubernet]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/preventing-oom-kills-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Running containerized applications at scale with Kubernetes demands careful resource management. One common but complicated challenge is preventing Out-of-Memory (OOM) kills, which occur when a container’s memory consumption surpasses its allocated limit. This abrupt termination by the Linux kernel’s OOM killer disrupts application stability and can affect application availability and the health of your overall environment.</p><p>In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.</p><p>Before diving in, it’s worth noting that OOM kills represent one symptom that can have a variety of root causes. It’s important for organizations to implement a system that solves the root cause analysis problem with speed and accuracy, allowing reliability engineering teams to respond rapidly, and to potentially prevent these occurrences in the first place.</p><h2 id="deep-dive-into-an-oom-kill">Deep dive into an OOM kill</h2><p>An Out-Of-Memory (OOM) kill in <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">Kubernetes</a> occurs when a container exceeds its memory limit, causing the Linux kernel’s OOM killer to terminate the container. This impacts application stability and requires immediate attention.</p><p>Several factors can trigger OOM kills in your Kubernetes environment, including:</p><ul><li><strong>Memory limits exceeded:</strong> This is the most common culprit. If a container consistently pushes past its designated memory ceiling, the OOM killer steps in to prevent a system-wide meltdown.</li><li><strong>Memory leaks:</strong> Applications can develop memory leaks over time, where they allocate memory but fail to release it properly. 
This hidden, unexpected growth eventually leads to OOM kills.</li><li><strong>Resource overcommitment:</strong> Co-locating too many resource-hungry pods onto a single node can deplete available memory. When the combined memory usage exceeds capacity, the OOM killer springs into action.</li><li><strong>Bursting workloads:</strong> Applications with spiky workloads can experience sudden memory surges that breach their limits, triggering OOM kills.</li></ul><p>As an example, a web server with a memory-leak bug may gradually consume more and more memory until the OOM killer intervenes to prevent a crash.</p><p>Another case could be when a Kubernetes cluster over-commits resources by scheduling too many pods on a single node. The OOM killer may need to step in to free up memory and ensure system stability.</p><h2 id="the-devastating-effects-of-oom-kills-why-they-matter">The devastating effects of OOM kills: Why they matter</h2><p>OOM kills aren’t benign, routine events. They can trigger a cascade of negative consequences for your applications and the overall health of the cluster, such as:</p><ul><li><strong>Application downtime:</strong> When a container is OOM-killed, it abruptly terminates, causing immediate application downtime. Users may experience service disruptions and outages.</li><li><strong>Data loss:</strong> Applications that rely on in-memory data or stateful sessions risk losing critical information during an OOM kill.</li><li><strong>Performance degradation:</strong> Frequent OOM kills force containers to restart repeatedly. This constant churn degrades overall application performance and user experience.</li><li><strong>Service disruption:</strong> Applications often interact with each other. 
An OOM kill in one container can disrupt inter-service communication, causing cascading failures and broader service outages.</li></ul><p>If a container running a critical database service experiences an OOM kill, it could result in data loss and corruption. This leads to service disruptions for other containers that rely on the database for information, causing cascading failures across the entire application ecosystem.</p><h2 id="combating-oom-kills">Combating OOM kills</h2><p>There are a few different tactics to combat OOM kills in an attempt to operate a memory-efficient Kubernetes environment.</p><h3 id="set-appropriate-resource-requests-and-limits">Set appropriate resource requests and limits</h3><p>For example, you can set a memory request of 200Mi and a memory limit of 300Mi for a particular container in your Kubernetes deployment. Requests ensure the container gets at least 200Mi of memory, while limits cap it at 300Mi to prevent excessive consumption.</p><pre><code class="language-yaml">resources:
  requests:
    memory: "200Mi"
  limits:
    memory: "300Mi"</code></pre><p>While this may mitigate potential memory use issues, it is a very manual process and does not take advantage of the dynamic scheduling Kubernetes can provide. It also doesn’t solve the underlying issue, which may be a code-level problem triggering memory leaks or failed GC processes.</p><h3 id="transition-to-autoscaling">Transition to autoscaling</h3><p>Leveraging <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">autoscaling</a> capabilities is the primary way to make resource allocation dynamic. There are two autoscaling methods:</p><ul><li><strong>Vertical Pod Autoscaling (VPA):</strong> <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/?ref=causely-blog.ghost.io" rel="noopener">VPA</a> dynamically adjusts resource limits based on real-time memory usage patterns. This ensures containers have enough memory to function but avoids over-provisioning.</li><li><strong>Horizontal Pod Autoscaling (HPA):</strong> <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/?ref=causely-blog.ghost.io" rel="noopener">HPA</a> scales the number of pods running your application up or down based on memory utilization. This distributes memory usage across multiple pods, preventing any single pod from exceeding its limit. The following HPA configuration shows an example of scaling based on memory usage:</li></ul><pre><code class="language-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
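    # Scale on the built-in memory resource metric: the HPA adds replicas when
    # average utilization across pods exceeds the target percentage of each
    # pod's memory *request*, and scales back in when utilization drops.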
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80</code></pre><h3 id="monitor-memory-usage">Monitor memory usage</h3><p>Proactive monitoring is key. For instance, you can configure <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">Prometheus</a> to scrape memory metrics from your Kubernetes pods every 15 seconds and set up <a href="https://grafana.com/?ref=causely-blog.ghost.io" rel="noopener">Grafana</a> dashboards to visualize memory usage trends over time. Additionally, you can create alerts in Prometheus to trigger notifications when memory usage exceeds a certain threshold.</p><h3 id="optimize-application-memory-usage">Optimize application memory usage</h3><p>Don’t underestimate the power of code optimization. Address memory leaks within your applications and implement memory-efficient data structures to minimize memory consumption.</p><h3 id="pool-disruption-budgets-pdb">Pod Disruption Budgets (PDB)</h3><p>When deploying updates, <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">PDBs</a> ensure a minimum number of pods remain available, even during rollouts. This mitigates the risk of widespread OOM kills during deployments. Here is a PDB configuration example that helps ensure minimum pod availability.</p><pre><code class="language-yaml">apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: my-app</code></pre><h3 id="manage-node-resources">Manage node resources</h3><p>You can apply a node selector to ensure that a memory-intensive pod is only scheduled on nodes with a minimum of 8GB of memory. Additionally, you can use taints and tolerations to dedicate specific nodes with high memory capacity for memory-hungry applications, preventing OOM kills due to resource constraints.</p><pre><code class="language-json">nodeSelector:
  memory-tier: high  # hypothetical label applied to nodes with at least 8GB of memory
tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"</code></pre><h3 id="use-qos-classes">Use QoS classes</h3><p>Kubernetes offers Quality of Service (<a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/?ref=causely-blog.ghost.io" rel="noopener">QoS</a>) classes that prioritize resource allocation for critical applications. Assign the highest class, Guaranteed, to applications that can least tolerate OOM kills. Here is a sample resource configuration with QoS parameters:</p><pre><code class="language-yaml">resources:
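  # Because requests equal limits for every resource on every container, the
  # pod is assigned the Guaranteed QoS class, making it the last candidate
  # for OOM kills and node-pressure eviction.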
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"</code></pre><p>These are a few potential strategies to help prevent OOM kills. The challenge comes with the frequency with which they can occur, and the risk to your applications when they happen.</p><p>As you can imagine, it’s not feasible to manually manage resource utilization while guaranteeing the stability and performance of the containerized applications in your Kubernetes environment.</p><h2 id="manual-thresholds-rigidity-and-risk">Manual thresholds = Rigidity and risk</h2><p>These techniques can help reduce the risk of OOM kills. The issue is not entirely solved, though. By setting manual thresholds and limits, you’re removing many of the dynamic advantages of Kubernetes.</p><p>A more ideal way to solve the OOM kill problem is to use adaptive, dynamic resource allocation. Even if you get resource allocation right on initial deployment, many changing factors affect how your application consumes resources. There is also a risk because application and resource issues don’t just affect one pod or one container. Resource issues can reach every part of the cluster and degrade the other running applications and services.</p><h2 id="which-strategy-works-best-to-prevent-oom-kills">Which strategy works best to prevent OOM kills?</h2><p>Vertical Pod Autoscaling (VPA) and Horizontal Pod Autoscaling (HPA) are common strategies used to manage resource limits in Kubernetes containers. VPA adjusts resource limits based on real-time memory usage patterns, while HPA scales pods based on memory utilization.</p><p>Monitoring with tools like Prometheus may help with the troubleshooting of memory usage trends. Optimizing application memory usage is no easy feat because it’s especially challenging to identify whether it is infrastructure or code causing the problem.</p><p>Pod Disruption Budgets (PDB) may help ensure a minimum number of pods remain available during deployments, while node resources can be managed using node selectors and taints. 
Quality of Service (QoS) classes prioritize resource allocation for critical applications.</p><p>One thing is certain: OOM kills are a common and costly challenge to manage using traditional monitoring tools and methods.</p><p>At <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a>, we’re focused on applying causal reasoning software to help organizations keep applications healthy and resilient. By automating root cause analysis, issues like OOM kills can be resolved in seconds, and unintended consequences of new releases or application changes can be avoided.</p><h2 id="related-resources">Related resources</h2><ul><li><a href="https://www.causely.ai/blog/understanding-kubernetes-readiness-probe-to-ensure-your-applications-availability/?ref=causely-blog.ghost.io">Read the blog</a>: Understanding the Kubernetes Readiness Probe: A Tool for Application Health</li><li><a href="https://www.causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io">Read the blog</a>: Bridging the gap between observability and automation with causal reasoning</li></ul><p>Watch our webinar on Causal AI and why DevOps teams need it:</p><figure class="kg-card kg-embed-card"><iframe width="200" height="150" src="https://www.youtube.com/embed/pSl4pGCczOU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" title="What is Causal AI and why do DevOps teams need it?"></iframe></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely brings on a new CEO to accelerate growth]]></title>
      <link>https://causely.ai/blog/causely-brings-on-a-new-ceo-to-accelerate-growth</link>
      <guid>https://causely.ai/blog/causely-brings-on-a-new-ceo-to-accelerate-growth</guid>
      <pubDate>Thu, 22 Aug 2024 07:50:15 GMT</pubDate>
      <description><![CDATA[Yotam Yemini joins Causely as CEO after departing Cisco and previously leading go-to-market efforts at Oort, Quantum Metric, and IBM Turbonomic   Thursday, August 22, 2024 – Today, Causely is excited to welcome Yotam Yemini as the company’s Chief Executive…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/yotam-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<h2 id="yotam-yemini-joins-causely-as-ceo-after-departing-cisco-and-previously-leading-go-to-market-efforts-at-oort-quantum-metric-and-ibm-turbonomic"><em>Yotam Yemini joins Causely as CEO after departing Cisco and previously leading go-to-market efforts at Oort, Quantum Metric, and IBM Turbonomic</em></h2><p><strong>Thursday, August 22, 2024</strong> – Today, Causely is excited to welcome Yotam Yemini as the company’s Chief Executive Officer. In this role, Yotam will be instrumental in helping Causely fulfill its mission to enable continuous application reliability and streamline the software development lifecycle for modern applications.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/yotam-1.png" class="kg-image" alt loading="lazy" width="255" height="255"></figure><!--kg-card-begin: html--><span style="font-size: 10pt;">Yotam Yemini, CEO of Causely</span><!--kg-card-end: html--><p>“We are excited to welcome Yotam to the team,” said Causely Founder Shmuel Kliger. “Yotam brings an exceptional track record of building and scaling go-to-market strategy. His leadership will be pivotal as we deliver our <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causal Reasoning Platform</a> to Engineering and DevOps teams.”</p><p>The news comes on the heels of strong market signals from its early access program and the hiring of Francis Cordón as Chief Customer Officer in July. Francis is a domain expert and seasoned technology leader. His experience includes customer success leadership at Quantum Metric and IBM Turbonomic, resiliency and performance architecture at BNY Mellon, and sales engineering at Dynatrace.</p><p>“Causely’s causal reasoning software is a well-timed and much needed innovation for the tech industry,” said Alex Sukennik, CIO, Semrush. 
“The team behind Causely is uniquely suited to help engineering teams assure service levels for business-critical applications.”</p><p>In addition to today’s news, the company shares its deepest gratitude to Causely Founder Ellen Rubin for her work building the company, early product and team over the past few years. We wish Ellen the best as she moves on to her next adventures as board member, investor, advisor, and builder in the Boston startup ecosystem.</p><p>See Causely first-hand through a self-guided tour at <a href="https://www.causely.ai/resources/experience-causely/?ref=causely-blog.ghost.io">causely.io/resources/experience-causely</a> or sign up today at <a href="https://www.causely.ai/trial?ref=causely-blog.ghost.io">causely.io/trial</a>.</p><hr><h3 id="about-causely">About Causely</h3><p>Causely is the leading provider of causal reasoning software, which enables continuous application reliability and streamlines the software development lifecycle for modern applications. Whereas reliability engineering today tends to be overly complex and labor-intensive, Causely amplifies engineering productivity through its patent-pending Causal Reasoning Platform. The platform identifies cause and effect relationships in runtime to automate the process of root cause and impact analysis. This drastically shortens mean-time-to-repair (MTTR), reduces the number of incidents that occur, and empowers engineering teams to build more resilient applications and business services. Causely, Inc. is a remote-first company headquartered in New York City. Visit <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">causely.io</a> to learn more.</p><h3 id="media-contact">Media Contact</h3><p>Karina Babcock</p><p>kbabcock@causely.io</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The Rising Cost of Digital Incidents: Understanding and Mitigating Outage Impact]]></title>
      <link>https://causely.ai/blog/the-rising-cost-of-digital-incidents-understanding-and-mitigating-outage-impact</link>
      <guid>https://causely.ai/blog/the-rising-cost-of-digital-incidents-understanding-and-mitigating-outage-impact</guid>
      <pubDate>Thu, 08 Aug 2024 18:36:25 GMT</pubDate>
      <description><![CDATA[Digital disruptions have reached alarming levels. Incident response in modern application environments is frequent, time-consuming and labor intensive. Our team has first-hand experience dealing with the far-reaching impacts of these disruptions and outages, having spent decades in IT Ops….]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/mitigating-outage-impact-1.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Digital disruptions have reached alarming levels. Incident response in modern application environments is frequent, time-consuming and labor intensive. <a href="https://www.causely.ai/about/?ref=causely-blog.ghost.io">Our team has first-hand experience</a> dealing with the far-reaching impacts of these disruptions and outages, having spent decades in IT Ops.</p><p><a href="https://www.pagerduty.com/resources/learn/cost-of-downtime/?ref=causely-blog.ghost.io" rel="noopener">PagerDuty recently published a study</a><sup>1</sup> that shines a light on how broken our existing incident response systems and practices are. The recent <a href="https://www.forbes.com/sites/kateoflahertyuk/2024/08/07/crowdstrike-reveals-what-happened-why-and-whats-changed/?ref=causely-blog.ghost.io" rel="noopener">Crowdstrike debacle</a> is further evidence of this. Even with all the investment in observability, AI Ops, automation, and playbooks, things aren’t improving. In some ways, they’re actually worse; we’re collecting more and more data and we’re overloaded with tooling, creating confusion between users and teams who struggle to understand the holistic environment and all of its interdependencies.</p><p>With a <strong>mean resolution time of 175 minutes</strong>, each customer-impacting digital incident costs both time and money. The industry needs to reset and revisit current processes so we can evolve and change the trajectory.</p><h2 id="the-impact-of-outages-and-application-downtime">The impact of outages and application downtime</h2><p><strong>Outages erode customer trust.</strong> 90% of IT leaders report that disruptions have reduced customer confidence. Protecting sensitive data, ensuring swift service restoration, and providing real-time customer updates are essential for maintaining trust when digital incidents happen. Thorough, action-oriented postmortems are critical post-incident to prevent recurrences. 
And – at risk of reinforcing the obvious – IT organizations need to put operational practices in place to prevent outages from happening in the first place.</p><p><strong>Yet even though IT leaders understand the implications for customer confidence, incident frequency continues to rise.</strong> 59% of IT leaders report an increase in customer-impacting incidents, and it’s not going to get better unless we change the way we observe and mitigate problems in our applications.</p><h2 id="automation-can-help-but-adoption-is-slow">Automation can help, but adoption is slow</h2><p>Despite the growing threat, many organizations are lagging behind in incident response automation:</p><ul><li>Over 70% of IT leaders report that key incident response tasks are not yet fully automated.</li><li>38% of responders’ time is spent dealing with manual incident response processes.</li><li>Organizations with manual processes take on average 3 hours 58 minutes to resolve customer-impacting incidents, compared to 2 hours 40 minutes for those with automated processes.</li></ul><p>It doesn’t take an IT expert to know that responders spending more than a third of their time on manual processes is a waste of resources. And those that have automated operations are still taking almost 3 hours to resolve incidents. <strong>Why is incident response still so slow?</strong></p><p>It’s not just about process automation. We also need to accelerate decision automation, driven by a deep understanding of the state of applications and infrastructure.</p><h2 id="causal-reasoning-the-missing-link">Causal reasoning: The missing link</h2><p>Causal reasoning technology promises a bridge between observability and automated incident response. We're referring to causal reasoning software that applies machine learning to automatically capture cause and effect relationships. 
This has the potential to help Dev and Ops teams better plan for changes to code, configurations or load patterns, so they can stay focused on achieving service-level and business objectives instead of firefighting.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/screenshot-2024-08-08-at-1-54-56-pm-1.png" class="kg-image" alt="Incident response tasks that aren't automated" loading="lazy" width="1190" height="675" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/08/screenshot-2024-08-08-at-1-54-56-pm-1.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/08/screenshot-2024-08-08-at-1-54-56-pm-1.png 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/screenshot-2024-08-08-at-1-54-56-pm-1.png 1190w" sizes="(min-width: 720px) 720px"></figure>
<!--kg-card-begin: html-->
<span style="font-size: 10pt;">Source: <a href="https://www.pagerduty.com/resources/learn/cost-of-downtime/?ref=causely-blog.ghost.io" target="_blank" rel="noopener">PagerDuty</a></span>
<!--kg-card-end: html-->
<p>With causal reasoning technology, many of the incident response tasks that are currently manual can be automated:</p><ul><li>When service entities are degraded or failing and affecting other entities that make up business services, causal reasoning software surfaces the relationship between the problem and the symptoms it’s causing.</li><li>The team with responsibility for the failing or degraded service is immediately notified so they can get to work resolving the problem. Some problems can be remediated automatically.</li><li>Notifications can be sent to end users and other stakeholders, letting them know that their services are affected along with an explanation for why this occurred and when things will be back to normal.</li><li>Postmortem documentation is automatically generated.</li><li>There are no more complex triage processes that would otherwise require multiple teams and managers to orchestrate. Incidents and outages are reduced and root cause analysis is automated, so DevOps teams spend less time troubleshooting and more time shipping code.</li></ul><h2 id="introducing-causely">Introducing Causely</h2><p>This potential to transform the way DevOps teams work is why we built Causely. Our <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causal Reasoning Platform</a> automatically pinpoints the root cause of observed symptoms based on real-time, dynamic data across the entire application environment. Causely transforms incident response and improves <a href="https://www.causely.ai/blog/mttr-meaning?ref=causely-blog.ghost.io" rel="noreferrer">mean time to resolution</a> (MTTR), so DevOps teams focus on building new services and innovations that propel the business forward.</p><p>By automatically understanding cause-and-effect relationships in application environments, Causely also enables predictive maintenance and better overall operational resilience. 
It can help to prevent outages and identify the root cause of potential issues before they escalate.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/screenshot-2024-08-08-at-2-43-39-pm-1.png" class="kg-image" alt="" loading="lazy" width="839" height="402" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/08/screenshot-2024-08-08-at-2-43-39-pm-1.png 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/screenshot-2024-08-08-at-2-43-39-pm-1.png 839w" sizes="(min-width: 720px) 720px"></figure><p>Here’s how it works, at a high level:</p><ol><li>Our Causal Reasoning Platform is shipped with out-of-the-box <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#causal-models"><strong>Causal Models</strong></a> that drive the platform’s behavior.</li><li>Once deployed, Causely automatically discovers the application environment and generates a <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#topology-graph"><strong>Topology Graph</strong></a> of it.</li><li>A <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#causality-graph"><strong>Causality Graph</strong></a> is generated by instantiating the Causal Models with the Topology Graph to reflect cause and effect relationships between the root causes and their symptoms, specific to that environment at that point in time.</li><li>A <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#codebook"><strong>Codebook</strong></a> is generated from the Causality Graph.</li><li>Using the Codebook, our <strong>Causal Reasoning Platform</strong> automatically and continuously pinpoints the root cause of issues.</li></ol><p>Users can dig into incidents, understand their root causes, take remediation steps, and proactively plan for new releases and application changes – all within Causely.</p><p>This decreases downtime, enhances operational 
efficiency, and improves customer trust long-term.</p><h2 id="it%E2%80%99s-time-for-a-new-approach">It’s time for a new approach</h2><p>It’s time to shift from manual to automated incident response. Causely can help teams prevent outages, reduce risk, cut costs, and build sustainable customer trust.</p><p>Don’t hesitate to <a href="https://www.causely.ai/?ref=causely-blog.ghost.io#contact">contact us</a> about how to bring automation into your organization, or you can <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">see Causely for yourself</a>.</p><hr>
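<p>To make the Codebook step concrete, here is a minimal, purely hypothetical sketch (not Causely’s implementation; the root causes and symptom names are invented for illustration): each candidate root cause predicts a set of symptoms, and candidates are ranked by how well their predicted signature matches what is actually observed.</p>

```python
# Toy codebook: each hypothetical root cause maps to the set of
# symptoms it would produce across the topology.
CODEBOOK = {
    "db_disk_full":     {"db_write_errors", "order_service_5xx", "queue_backlog"},
    "cache_down":       {"cache_timeouts", "order_service_latency"},
    "payment_api_down": {"payment_errors", "order_service_5xx"},
}

def rank_root_causes(observed: set[str]) -> list[tuple[str, float]]:
    """Rank candidate causes by Jaccard similarity between the symptoms
    they predict and the symptoms actually observed."""
    scores = [
        (cause, len(predicted & observed) / len(predicted | observed))
        for cause, predicted in CODEBOOK.items()
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)

observed = {"db_write_errors", "order_service_5xx", "queue_backlog"}
print(rank_root_causes(observed)[0][0])  # db_disk_full
```

<p>A real platform derives such signatures automatically from causal models and a live topology graph; the point of the sketch is only that diagnosis becomes a matching problem rather than manual triage.</p>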
<!--kg-card-begin: html-->
<span style="font-size: 10pt;"><sup>1 </sup>“Customer impacting incidents increased by 43% during the past year- each incident costs nearly $800,000.” PagerDuty. (2024, June 26). https://www.pagerduty.com/resources/learn/cost-of-downtime/&nbsp;</span>
<!--kg-card-end: html-->
<h2 id="related-resources">Related resources</h2><ul><li><a href="https://www.causely.ai/blog/devops-may-have-cheated-death-but-do-we-all-need-to-work-for-the-king-of-the-underworld/?ref=causely-blog.ghost.io">Read the blog:</a> DevOps may have cheated death, but do we all need to work for the king of the underworld?</li><li>Learn about our <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causal Reasoning Platform</a></li><li><a href="https://www.causely.ai/resources/experience-causely/?ref=causely-blog.ghost.io">See Causely</a> for yourself</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Explainability: The Black Box Dilemma in the Real World]]></title>
      <link>https://causely.ai/blog/explainability-the-black-box-dilemma-in-the-real-world</link>
      <guid>https://causely.ai/blog/explainability-the-black-box-dilemma-in-the-real-world</guid>
      <pubDate>Wed, 07 Aug 2024 20:34:35 GMT</pubDate>
      <description><![CDATA[The software industry is at a crossroads. I believe those who embrace explainability as a key part of their strategy will emerge as leaders. Those who resist will risk losing customer confidence and market share. The time for obfuscation is…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/08/xai-header-1.png" type="image/png" />
      <content:encoded><![CDATA[<p>The software industry is at a crossroads. I believe those who embrace explainability as a key part of their strategy will emerge as leaders. Those who resist will risk losing customer confidence and market share. The time for obfuscation is over. The era of explainability has begun.</p><h2 id="what-is-the-black-box-dilemma">What Is The Black Box Dilemma?</h2><p>Imagine a masterful illusionist, their acts so breathtakingly deceptive that the secrets behind them remain utterly concealed. Today’s software is much the same. We marvel at its abilities to converse, diagnose, drive, and defend, yet the inner workings often remain shrouded in mystery. This is often referred to as the “black box” problem.</p><figure class="kg-card kg-image-card"><img src="https://media.licdn.com/dms/image/D4E12AQG266zahtDZZA/article-inline_image-shrink_1500_2232/0/1723060776101?e=1728518400&amp;v=beta&amp;t=YTdQdMQm3xbKpNvsuZa5TUSiRAimi80gWwX8evmDtkM" class="kg-image" alt loading="lazy" width="374" height="255"></figure><p>The recent CrowdStrike incident is also a stark reminder of the risks of this opacity. A simple software update, intended to enhance security, inadvertently caused widespread system crashes. It’s as if the magician’s assistant accidentally dropped the secret prop, revealing the illusion for what it truly was – an error prone process with no resilience. Had organizations understood the intricacies of CrowdStrike’s software release process, they might have been better equipped to mitigate risks and prevent the disruptions that followed.</p><p>This incident, coupled with the rapid advancements in AI, underscores the critical importance of explainability. Understanding the entire lifecycle of software – from conception to operation – is no longer optional but imperative. 
It is the cornerstone of trust, a shield against catastrophic failures, and an important foundation for accountability.</p><p>As our world becomes increasingly reliant on complex systems, understanding their inner workings is no longer a luxury but a necessity. Explainability acts as a key to unlocking the black box, revealing the logic and reasoning behind complex outputs. By shedding light on the decision-making processes of software, AI, and other sophisticated systems, we foster trust, accountability, and a deeper comprehension of their impact.</p><h2 id="the-path-forward-cultivating-explainability-in-software">The Path Forward: Cultivating Explainability in Software</h2><p>Achieving explainability demands a comprehensive approach that addresses several critical dimensions.</p><ul><li><strong>Software Centric Reasoning and Ethical Considerations:</strong> Can the system’s decision-making process be transparently articulated, justified, and aligned with ethical principles? Explainability is essential for building trust and ensuring that systems used to support decision making operate fairly and responsibly.</li><li><strong>Effective Communication and User Experience:</strong> Is the system able to communicate its reasoning clearly and understandably to both technical and non-technical audiences? Effective communication enhances collaboration, knowledge sharing, and user satisfaction by empowering users to make informed decisions.</li><li><strong>Robust Data Privacy and Security:</strong> How can sensitive information be protected while preserving the transparency necessary for explainability? Rigorous data handling and protection are crucial for safeguarding privacy and maintaining trust in the system.</li><li><strong>Regulatory Compliance and Continuous Improvement: </strong>Can the system effectively demonstrate adherence to relevant regulations and industry standards for explainability? 
Explainability is a dynamic process requiring ongoing evaluation, refinement, and adaptation to stay aligned with the evolving regulatory landscape.</li></ul><p>By prioritizing these interconnected elements, software vendors and engineering teams can create solutions where explainability is not merely a feature, but a cornerstone of trust, reliability, and competitive advantage.</p><figure class="kg-card kg-image-card"><img src="https://media.licdn.com/dms/image/D4E12AQF0Nfltsk7Y_w/article-inline_image-shrink_1000_1488/0/1723060776069?e=1728518400&amp;v=beta&amp;t=Jml_jTyXKkjh7z0Fehc7nctsIoW2atQFKPgN2VRtimU" class="kg-image" alt loading="lazy" width="298" height="203"></figure><h3 id="an-example-of-explainability-in-action-with-causely">An Example of Explainability in Action with Causely</h3><p>Causely is a pioneer in applying causal reasoning to revolutionize cloud-native application reliability. <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Our platform</a> empowers operations teams to rapidly identify and resolve the root causes of service disruptions, preventing issues before they impact business processes and customers. This enables dramatic reductions in Mean Time to Repair (MTTR), minimizing business disruptions and safeguarding customer experiences.</p><p>Causely also uses its Causal Reasoning Platform to manage its SaaS offering, detecting and responding to service disruptions, and ensuring swift resolution with minimal impact. 
You can learn more about this in <a href="https://www.linkedin.com/in/endresara/?ref=causely-blog.ghost.io" rel="noopener">Endre Sara</a>‘s article “<a href="https://www.causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai/?ref=causely-blog.ghost.io">Eating Our Own Dogfood: Causely’s Journey with OpenTelemetry &amp; Causal AI</a>”.</p><p>Causal Reasoning, often referred to as Causal AI, is a specialized field in computer science dedicated to uncovering the underlying cause-and-effect relationships within complex data. As the foundation of explainable AI, <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2020.513474/full?ref=causely-blog.ghost.io" rel="noopener">it surpasses the limitations of traditional statistical methods</a>, which frequently produce misleading correlations.</p><p>Unlike opaque black-box models, causal reasoning illuminates the precise mechanisms driving outcomes, providing transparent and actionable insights. When understanding the <em>why</em> behind an event is equally important as predicting the <em>what</em>, causal reasoning offers superior clarity and reliability.</p><p>By understanding causation, we transcend mere observation to gain predictive and interventional power over complex systems.</p><p>Causely is built on a deep-rooted foundation of causal reasoning. The founding team, <a href="https://www.linkedin.com/posts/andrew-mallaband-88b1b7_the-genesis-of-causal-reasoning-in-it-operations-activity-7213538946411044864-0K-_??ref=causely-blog.ghost.io" rel="noopener">pioneers in applying this science to IT operations</a>, led the charge at System Management Arts (SMARTS). SMARTS revolutionized root cause analysis for distributed systems and networks, empowering more than 1500 global enterprises and service providers to ensure the reliability of mission-critical IT services. 
Their groundbreaking work earned industry accolades and solidified SMARTS as a leader in the field.</p><p>Explainability is a cornerstone of the Causal Reasoning Platform from Causely. The company is committed to transparently communicating how its software arrives at its conclusions, encompassing both the underlying mechanisms used in Causal Reasoning and the practical application within organizations’ operational workflows.</p><h2 id="explainable-operations-enhancing-workflow-efficiency-for-continuous-application-reliability">Explainable Operations: Enhancing Workflow Efficiency for Continuous Application Reliability</h2><p>Causely converts raw observability data into actionable insights by pinpointing root causes and their cascading effects within complex application and infrastructure environments.</p><p>Today, incident response is a complex, resource-intensive process of triage and troubleshooting that often diverts critical teams from strategic initiatives. This reactive approach hampers innovation, erodes efficiency, and can lead to substantial financial losses and reputational damage when application reliability is not continuously assured.</p><p>The complex, interconnected nature of cloud-native environments magnifies the impact: cascading disruptions can propagate across services when root cause problems occur.</p><p>By automating the identification and explanation of cause-and-effect relationships, Causely accelerates incident response. Relevant teams responsible for root cause problems receive immediate alerts, complete with detailed explanations of the cause &amp; effect, empowering them to prioritize remediation based on impact. 
Simultaneously, teams whose services are impacted gain insights into the root causes and who is responsible for resolution, enabling proactive risk mitigation without the need for extensive troubleshooting.</p><p>For certain types of root cause problems, it may also be possible to automate remediation entirely, without human intervention.</p><p>By maintaining historical records of the cause and effect of past root cause problems and identifying recurring patterns, Causely enables reliability engineering teams to anticipate future potential problems and implement targeted mitigation strategies.</p><p>Causely’s ability to explain the effect of potential degradations and failures before they even happen, through “what if” analysis, also empowers reliability engineering teams to identify single points of failure and changes in load patterns, and assess the impact of planned changes on related applications, business processes, and customers.</p><p><strong>The result?</strong> Through explainability, organizations can dramatically reduce MTTR, improve business continuity, and increase cycles for development and innovation. Causely turns reactive troubleshooting into proactive prevention, ensuring application reliability can be continuously assured. This short video tells the story.</p><figure class="kg-card kg-embed-card"><iframe title="YouTube video player" src="https://www.youtube.com/embed/5ofO9ParE-0?si=Sd8xLLBxniiK4MYQ" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe></figure><h2 id="unveiling-causely-how-our-platform-delivers-actionable-insights">Unveiling Causely: How Our Platform Delivers Actionable Insights</h2><p>Given the historical challenges and elusive nature of automating root cause analysis in IT operations, customer skepticism is warranted in this field. 
This problem has long been considered the “holy grail” of the industry, with countless vendors falling short of delivering a viable solution.</p><p>As a consequence, Causely has found that it is important to prioritize transparency and explainability around how our Causal Reasoning Platform works and produces the results described earlier.</p><p>Much has been written about this — <a href="https://www.causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues/?ref=causely-blog.ghost.io">learn more here</a>.</p><p>This approach is grounded in sound scientific principles, making it both effective and comprehensible.</p><p>Beyond understanding how the platform works, customers also value transparency around data handling. In this regard, our approach to data management offers unique benefits in terms of data privacy and data management cost savings. You can learn more about this <a href="https://www.causely.ai/causely-platform-security-architecture/?ref=causely-blog.ghost.io">here</a>.</p><h2 id="in-summary">In Summary</h2><p>Explainability is the cornerstone of Causely’s mission. As they advance their technology, their dedication to transparency and understanding will only grow stronger. Don’t hesitate to visit the website or reach out to me or members of the Causely team to learn more about the approach and to <a href="https://www.causely.ai/resources/experience-causely/?ref=causely-blog.ghost.io">experience Causely</a> firsthand.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Understanding the Kubernetes Readiness Probe: A Tool for Application Health]]></title>
      <link>https://causely.ai/blog/understanding-kubernetes-readiness-probe-to-ensure-your-applications-availability</link>
      <guid>https://causely.ai/blog/understanding-kubernetes-readiness-probe-to-ensure-your-applications-availability</guid>
      <pubDate>Tue, 23 Jul 2024 15:36:34 GMT</pubDate>
      <description><![CDATA[Application reliability is a dynamic challenge, especially in cloud-native environments. Ensuring that your applications are running smoothly is make-or-break when it comes to user experience. One essential tool for this is the Kubernetes readiness probe. This blog will explore the…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/k8s-readiness-probe-1-2.png" type="image/png" />
      <content:encoded><![CDATA[<p>Application reliability is a dynamic challenge, especially in cloud-native environments. Ensuring that your applications are running smoothly is make-or-break when it comes to user experience. One essential tool for this is the <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">Kubernetes readiness probe</a>. This blog will explore the concept of a readiness probe, explaining how it works and why it’s a key component for managing your Kubernetes clusters.</p><h2 id="what-is-a-kubernetes-readiness-probe">What is a Kubernetes Readiness Probe?</h2><p>A readiness probe is essentially a check that Kubernetes performs on a container to ensure that it is ready to serve traffic. This check is needed to prevent traffic from being directed to containers that aren’t fully operational or are still in the process of starting up.</p><p>By using readiness probes, Kubernetes can manage the flow of traffic to only those containers that are fully prepared to handle requests, thereby improving the overall stability and performance of the application.</p><p>Readiness probes also help in preventing unnecessary disruptions and downtime by only including healthy containers in the load balancing process. This is an essential part of a comprehensive SRE operational practice for maintaining the health and efficiency of your Kubernetes clusters.</p><h2 id="how-readiness-probes-work">How Readiness Probes Work</h2><p>Readiness probes are configured in the pod specification and can be of three types:</p><ol><li><strong>HTTP Probes:</strong> These probes send an HTTP request to a specified endpoint. If the response is successful, the container is considered ready.</li><li><strong>TCP Probes:</strong> These probes attempt to open a TCP connection to a specified port. 
If the connection is successful, the container is considered ready.</li><li><strong>Command Probes:</strong> These probes execute a command inside the container. If the command returns a zero exit status, the container is considered ready.</li></ol><p>Below is an example demonstrating how to configure a readiness probe in a Kubernetes deployment:</p>
<!--kg-card-begin: html-->
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: readiness-example
spec:
  containers:
  - name: readiness-container
    image: your-image
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10</code></pre>
<!--kg-card-end: html-->
<p>This YAML file defines the Kubernetes pod with a readiness probe configured based on the following parameters:</p>
<!--kg-card-begin: html-->
<ol>
<li><strong>apiVersion: v1</strong> – Specifies the API version used for the configuration.</li>
<li><strong>kind: Pod</strong> – Indicates that this configuration is for a Pod.</li>
<li><strong>metadata:</strong>
<ul>
<li><strong>name: readiness-example</strong> – Sets the name of the Pod to “readiness-example.”</li>
</ul>
</li>
<li><strong>spec</strong> – Describes the desired state of the Pod.
<ul>
<li><strong>containers:</strong>
<ul>
<li><strong>name: readiness-container</strong> – Names the container within the Pod as “readiness-container.”</li>
<li><strong>image: your-image</strong> – Specifies the container image to use, named “your-image.”</li>
<li><strong>readinessProbe</strong> – Configures a readiness probe to check if the container is ready to receive traffic.
<ul>
<li><strong>httpGet:</strong>
<ul>
<li><strong>path: /healthz</strong> – Sends an HTTP GET request to the <span style="color: #03bc85;">/healthz</span> path.</li>
<li><strong>port: 8080</strong> – Targets port 8080 for the HTTP GET request.</li>
</ul>
</li>
<li><strong>initialDelaySeconds: 5</strong> – Waits 5 seconds before performing the first probe after the container starts.</li>
<li><strong>periodSeconds: 10</strong> – Repeats the probe every 10 seconds.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<!--kg-card-end: html-->
<p>This relatively simple configuration creates a Pod named “readiness-example” with a single container running “your-image.” It includes a readiness probe that checks the <code>/healthz</code> endpoint on port 8080, starting 5 seconds after the container launches and repeating every 10 seconds to determine if the container is ready to accept traffic.</p><h2 id="importance-of-readiness-probes">Importance of Readiness Probes</h2><p>The goal is to make sure you can prevent traffic from being directed to a container that is still starting up or experiencing issues. This helps maintain the overall stability and reliability of your application by only sending traffic to containers that are ready to handle it.</p><p>Readiness probes can be used in conjunction with <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">liveness probes</a> to further enhance the health checking capabilities of your containers.</p><p>Readiness probes are important for a few reasons:</p><ul><li><strong>Prevent traffic to unready pods:</strong> They ensure that only ready pods receive traffic, preventing downtime and errors.</li><li><strong>Facilitate smooth rolling updates:</strong> By making sure new pods are ready before sending traffic to them.</li><li><strong>Enhanced application stability:</strong> They can help with the overall stability and reliability of your application by managing traffic flow based on pod readiness.</li></ul><p>Remember that your readiness probes only check for availability, and don’t understand why a container is not available. Readiness probe failure is a symptom that can manifest from many root causes. It’s important to know their purpose and limitations before you rely too heavily on them for overall application health.</p>
<!--kg-card-begin: html-->
<span style="color: #4338a6;"><em><strong>Related: Causely solves the root cause analysis problem, applying Causal AI to DevOps. <span style="color: #03bc85;"><a style="color: #03bc85;" href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Learn about our Causal Reasoning Platform</a>.</span></strong></em></span>
<!--kg-card-end: html-->
<h2 id="best-practices-for-configuring-readiness-probes">Best Practices for Configuring Readiness Probes</h2><p>To make the most of Kubernetes readiness probes, consider the following practices:</p>
<!--kg-card-begin: html-->
<ol>
<li><strong>Define Clear Health Endpoints:</strong> Ensure your application exposes a clear and reliable health endpoint.</li>
<li><strong>Set Appropriate Timing:</strong> Configure <span style="color: #03bc85;">initialDelaySeconds</span> and <span style="color: #03bc85;">periodSeconds</span> based on your application’s startup and response time.</li>
<li><strong>Monitor and Adjust:</strong> Continuously monitor the performance and adjust the probe configurations as needed.</li>
</ol>
<!--kg-card-end: html-->
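<p>These practices can be combined in a command probe that gates readiness on an external dependency. The following is a hypothetical sketch: <code>pg_isready</code> is PostgreSQL’s client-side connection checker, and the host name is a placeholder.</p>

```yaml
readinessProbe:
  exec:
    # Succeeds (exit 0) only when the database accepts connections;
    # substitute whatever command verifies your own dependency.
    command: ["sh", "-c", "pg_isready -h db.example.internal -p 5432"]
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```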
<p>For example, if your application requires a database connection to be fully established before it can serve requests, you can set up a readiness probe that checks for the availability of the database connection.</p><p>By configuring the initialDelaySeconds and periodSeconds appropriately, you can ensure that your application is only considered ready once the database connection is fully established. This will help prevent any potential issues or errors that may occur if the application is not fully prepared to handle incoming requests.</p><h2 id="limitations-of-readiness-probes">Limitations of Readiness Probes</h2><p>Readiness probes are handy, but they only check for the availability of a specific resource and do not take into account the overall health of the application. This means that even if the database connection is established, there could still be other issues within the application that may prevent it from properly serving requests.</p><p>Additionally, readiness probes do not automatically restart the application if it fails the check, so it is important to monitor the results and take appropriate action if necessary. Readiness probes are still a valuable tool for ensuring the stability and reliability of your application in a Kubernetes environment, even with these limitations.</p><h2 id="troubleshooting-kubernetes-readiness-probes-common-issues-and-solutions">Troubleshooting Kubernetes Readiness Probes: Common Issues and Solutions</h2>
<!--kg-card-begin: html-->
<h3><span style="font-size: 14pt;">Slow Container Start-up</span></h3>
<!--kg-card-end: html-->
<p><strong>Problem:</strong> If your container’s initialization tasks exceed the <code>initialDelaySeconds</code> of the readiness probe, the probe may fail.</p><p><strong>Solution:</strong> Increase the<code> initialDelaySeconds</code> to give the container enough time to start and complete its initialization. Additionally, optimize the startup process of your container to reduce the time required to become ready.</p>
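<p>As an illustrative sketch (the numbers are placeholders to be tuned per application), the timing fields for a slow-starting container might look like:</p>

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # allow slow initialization to finish first
  timeoutSeconds: 5         # each probe attempt fails after 5 seconds
  failureThreshold: 6       # consecutive failures before marking the pod unready
```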
<!--kg-card-begin: html-->
<h3><span style="font-size: 14pt;">Unready Services or Endpoints</span></h3>
<!--kg-card-end: html-->
<p><strong>Problem:</strong> If your container relies on external services or dependencies (e.g., a database) that aren’t ready when the readiness probe runs, it can fail. Race conditions may also occur if your application’s initialization depends on external factors.</p><p><strong>Solution:</strong> Ensure that external services or dependencies are ready before the container starts. Use tools like Helm Hooks or init containers to coordinate the readiness of these components with your application. Implement synchronization mechanisms in your application to handle race conditions, such as using locks, retry mechanisms, or coordination with external components.</p>
<!--kg-card-begin: html-->
<h3><span style="font-size: 14pt;">Misconfiguration of the Readiness Probe</span></h3>
<!--kg-card-end: html-->
<p><strong>Problem:</strong> Misconfigured readiness probes, such as incorrect paths or ports, can cause probe failures.</p><p><strong>Solution:</strong> Double-check the readiness probe configuration in your Pod’s YAML file. Ensure the path, port, and other parameters are correctly specified.</p>
<!--kg-card-begin: html-->
<h3><span style="font-size: 14pt;">Application Errors or Bugs</span></h3>
<!--kg-card-end: html-->
<p><strong>Problem:</strong> Application bugs or issues, such as unhandled exceptions, misconfigurations, or problems with external dependencies, can prevent it from becoming ready, leading to probe failures.</p><p><strong>Solution:</strong> Debug and resolve application issues. Review application logs and error messages to identify the problems preventing the application from becoming ready. Fix any bugs or misconfigurations in your application code or deployment.</p>
<!--kg-card-begin: html-->
<h3><span style="font-size: 14pt;">Insufficient Resources</span></h3>
<!--kg-card-end: html-->
<p><strong>Problem:</strong> If your container is running with resource constraints (CPU or memory limits), it might not have the resources it needs to become ready, especially under heavy loads.</p><p><strong>Solution:</strong> Adjust the resource limits to provide the container with the necessary resources. You may also need to optimize your application to use resources more efficiently.</p>
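<p>As a hedged example (the values are illustrative, not recommendations), resource requests and limits are declared alongside the probe in the container spec:</p>

```yaml
containers:
- name: readiness-container
  image: your-image
  resources:
    requests:              # guaranteed baseline used for scheduling
      cpu: 250m
      memory: 256Mi
    limits:                # hard ceiling; too-low limits can starve startup
      cpu: 500m
      memory: 512Mi
```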
<h3><span style="font-size: 14pt;">Conflicts Between Probes</span></h3>
<p><strong>Problem:</strong> Misconfigured liveness and readiness probes might interfere with each other, causing unexpected behavior.</p><p><strong>Solution:</strong> Ensure that your probes are configured correctly and serve their intended purposes. Make sure that the settings of both probes do not conflict with each other.</p>
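<p>One way to keep the two probes from interfering is to give them distinct endpoints and timing, as in this sketch (endpoint names and timing values are illustrative):</p>

```yaml
# Fragment of a Pod spec: liveness and readiness use separate endpoints
# and schedules so they cannot work at cross purposes.
containers:
  - name: app
    image: my-app:latest         # placeholder image
    livenessProbe:
      httpGet:
        path: /livez             # "is the process alive?"
        port: 8080
      initialDelaySeconds: 15    # wait long enough to avoid restart loops
      periodSeconds: 20
    readinessProbe:
      httpGet:
        path: /readyz            # "can it serve traffic right now?"
        port: 8080
      initialDelaySeconds: 5     # readiness can start checking sooner
      periodSeconds: 10
```

<p>An overly aggressive liveness probe that restarts the container before the readiness probe ever passes is a common form of this conflict.</p>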
<h3><span style="font-size: 14pt;">Cluster-Level Problems</span></h3>
<p><strong>Problem:</strong> Kubernetes cluster issues, such as <a href="https://www.causely.ai/resources/glossary-cloud-native-technologies/?ref=causely-blog.ghost.io">kubelet</a> or networking problems, can result in probe failures.</p><p><strong>Solution:</strong> Monitor your cluster for any issues or anomalies and address them according to Kubernetes best practices. Ensure that the kubelet and other components are running smoothly.</p><p>These are common issues to keep an eye out for. Watch for problems that the readiness probes are not surfacing or that might be preventing them from acting as expected.</p><h2 id="summary">Summary</h2><p>Ensuring that your applications are healthy and ready to serve traffic is necessary for maximizing uptime. The Kubernetes readiness probe is one helpful tool for managing Kubernetes clusters; it should be part of a comprehensive Kubernetes operations plan.</p><p>Readiness probes can be configured in pod specifications as HTTP, TCP, or command probes. They help prevent disruptions and downtime by ensuring only healthy containers are included in the load-balancing process.</p><p>By keeping traffic away from unready pods, they also enable smooth rolling updates and enhance application stability. Good practices include defining clear health endpoints, setting appropriate probe timing, and monitoring and adjusting configurations over time.</p><p>Don’t forget that readiness probes have clear limitations: they only check for the availability of a specific resource and do not automatically restart the application if it fails the check. A Kubernetes readiness probe failure is merely a symptom that can be attributed to many root causes. 
To automate root cause analysis across your entire Kubernetes environment, check out <a href="https://www.causely.ai/platform/causely-for-cloud-native-applications/?ref=causely-blog.ghost.io">Causely for Cloud-Native Applications</a>.</p><hr><h2 id="related-resources">Related resources</h2><ul><li><a href="https://www.causely.ai/podcast/webinar/what-is-causal-ai-why-do-devops-need-it/?ref=causely-blog.ghost.io">Webinar</a>: What is Causal AI and why do DevOps teams need it?</li><li><a href="https://www.causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io">Blog:</a> Bridging the gap between observability and automation with causal reasoning</li><li><a href="https://www.causely.ai/video/causely-overview/?ref=causely-blog.ghost.io">Product Overview</a>: Causely for Cloud-Native Applications</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Beyond the Blast Radius: Demystifying and Mitigating Cascading Microservice Issues]]></title>
      <link>https://causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues</link>
      <guid>https://causely.ai/blog/beyond-the-blast-radius-demystifying-and-mitigating-cascading-microservice-issues</guid>
      <pubDate>Mon, 15 Jul 2024 12:51:40 GMT</pubDate>
      <description><![CDATA[Microservices architectures offer many benefits, but they also introduce new challenges. One such challenge is the cascading effect of simple failures. A seemingly minor issue in one microservice can quickly snowball, impacting other services and ultimately disrupting user experience. The…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/beyond-blast-image-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Microservices architectures offer many benefits, but they also introduce new challenges. One such challenge is the cascading effect of simple failures. A seemingly minor issue in one microservice can quickly snowball, impacting other services and ultimately disrupting user experience.</p><figure class="kg-card kg-embed-card"><iframe title="YouTube video player" src="https://www.youtube.com/embed/0-FTUuVud68?si=hElveRhchEolNME9" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe></figure><h2 id="the-domino-effect-from-certificate-expiry-to-user-frustration">The Domino Effect: From Certificate Expiry to User Frustration</h2><p>Imagine a scenario where a microservice’s certificate expires. This seemingly trivial issue prevents it from communicating with others. This disruption creates a ripple effect:</p><ul><li><strong>Microservice Certificate Expiry:</strong> The seemingly minor issue is a certificate going past its expiration date.</li><li><strong>Communication Breakdown:</strong> This expired certificate throws a wrench into the works, preventing the microservice from securely communicating with other dependent services. It’s like the microservice is suddenly speaking a different language that the others can’t understand.</li><li><strong>Dependent Service Unavailability:</strong> Since the communication fails, dependent services can no longer access the data or functionality provided by the failing microservice. Imagine a domino not receiving the push because the first one didn’t fall.</li><li><strong>Errors and Outages:</strong> This lack of access leads to errors within dependent services. They might malfunction or crash entirely, causing outages – the domino effect starts picking up speed.</li><li><strong>User Frustration (500 Errors):</strong> Ultimately, these outages translate to error messages for the end users. 
They might see cryptic “500 errors” or experience the dreaded “service unavailable” message – the domino effect reaches the end user, who experiences the frustration.</li></ul><h2 id="the-challenge-untangling-the-web-of-issues">The Challenge: Untangling the Web of Issues</h2><p>Cascading failures pose a significant challenge due to the following reasons:</p><ul><li><strong>Network Effect:</strong> The root cause gets obscured by the chain reaction of failures, making it difficult to pinpoint the source.</li><li><strong>Escalation Frenzy:</strong> Customer complaints trigger incident tickets, leading to a flurry of investigations across multiple teams (DevOps Teams, Service Desk, customer support, etc.).</li><li><strong>Resource Drain:</strong> Troubleshooting consumes valuable time from developers, SREs, and support personnel, diverting them from core tasks.</li><li><strong>Hidden Costs:</strong> The financial impact of lost productivity and customer dissatisfaction often goes unquantified.</li></ul><h2 id="beyond-certificate-expiry-the-blast-radius-of-microservice-issues">Beyond Certificate Expiry: The Blast Radius of Microservice Issues</h2><p>Certificate expiry is just one example. 
Other issues with similar cascading effects include:</p><ul><li><strong>Noisy Neighbors: </strong>A resource-intensive microservice can degrade performance for others sharing the same resources (databases, applications), which in turn impacts other services that depend on them.</li><li><strong>Code Bugs:</strong> Code errors within a microservice can lead to unexpected behavior and downstream impacts.</li><li><strong>Communication Bottlenecks:</strong> Congestion or malfunctions in inter-service communication channels disrupt data flow and service availability.</li><li><strong>Third-Party Woes:</strong> Outages or performance issues in third-party SaaS services integrated with your microservices can create a ripple effect.</li></ul><h2 id="platform-pain-points-when-infrastructure-falters">Platform Pain Points: When Infrastructure Falters</h2><p>The impact can extend beyond individual microservices. Platform-level issues can also trigger cascading effects:</p><ul><li><strong>Load Balancer Misconfigurations:</strong> Incorrectly configured load balancers can disrupt service delivery to clients and dependent services.</li><li><strong>Container Cluster Chaos:</strong> Problems within Kubernetes pods and nodes can lead to application failures and service disruptions.</li></ul><h2 id="blast-radius-and-asynchronous-communication-the-data-lag-challenge">Blast Radius and Asynchronous Communication: The Data Lag Challenge</h2><p>Synchronous communication provides immediate feedback, allowing the sender to know if the message was received successfully. In contrast, <a href="https://www.causely.ai/video/causely-for-asynchronous-communication/?ref=causely-blog.ghost.io">asynchronous communication introduces a layer of complexity</a>:</p><ul><li><strong>Unpredictable Delivery:</strong> Messages may experience varying delays or, in extreme cases, be lost entirely. 
This lack of real-time confirmation makes it difficult to track the message flow and pinpoint the exact location of a breakdown.</li><li><strong>Limited Visibility:</strong> Unlike synchronous communication where a response is readily available, troubleshooting asynchronous issues requires additional effort. You may only have user complaints as a starting point, which can be a delayed and incomplete indicator of the problem.</li></ul><p>Delays or lost messages in asynchronous communication can result from several root causes:</p>
<h3><span style="font-size: 14pt;">Microservice Issues:</span></h3>
<ul><li><strong>Congestion:</strong> A microservice overloaded with tasks may struggle to process or send messages promptly, leading to delays.</li><li><strong>Failures:</strong> A malfunctioning microservice may be entirely unable to process or send messages, disrupting the flow of data.</li></ul>
<h3><span style="font-size: 14pt;">Messaging Layer Issues:</span></h3>
<p>Problems within the messaging layer itself can also cause disruptions:</p><ul><li><strong>Congestion:</strong> Congestion in message brokers, clusters, or cache instances can lead to delays in message delivery.</li><li><strong>Malfunctions:</strong> Malfunctions within the messaging layer can cause messages to be lost entirely.</li></ul><h2 id="the-cause-effect-engine-unveiling-the-root-of-microservice-disruptions-in-real-time">The Cause &amp; Effect Engine: Unveiling the Root of Microservice Disruptions in Real Time</h2><p>So what can we do to tame this chaos?</p><p>Imagine a system that acts like a detective for your application services. It understands all of the cause-and-effect relationships within your complex architecture. It does this by automatically discovering and analyzing your environment to maintain an up-to-date picture of services, infrastructure and dependencies, and from this it computes a dynamic knowledge base of root causes and the effects they will have.</p><p>This knowledge is automatically computed in a <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#causality-graph">Causality Graph</a> that depicts all of the relationships between the potential root causes that could occur and the symptoms they may cause. In an environment with thousands of entities, it might represent hundreds of thousands of problems and the set of symptoms each one will cause.</p><p>A separate data structure is derived from this called a “<a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io#codebook">Codebook</a>”. 
This table is like a giant symptom checker, mapping all the potential root causes (problems) to the symptoms (errors) they might trigger.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/andrew-causely-platform-graphic.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="923" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/andrew-causely-platform-graphic.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/andrew-causely-platform-graphic.jpg 1000w, https://causely-blog.ghost.io/content/images/size/w1600/wp-content/uploads/2024/07/andrew-causely-platform-graphic.jpg 1600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/andrew-causely-platform-graphic.jpg 2218w" sizes="(min-width: 720px) 720px"></figure><p>Hence, each root cause in the Codebook has a unique signature, a vector of <em>m</em> probabilities, that uniquely identifies the root cause. Using the Codebook, the system quickly searches and pinpoints the root causes based on the observed symptoms.</p><p>The Causality Graph and Codebook are constantly updated as application services and infrastructure evolve. 
This ensures the knowledge in the Causality Graph and Codebook stays relevant and adapts to changes.</p><p>These powerful capabilities enable:</p><ul><li><strong>Machine Speed Root Cause Identification:</strong> Unlike traditional troubleshooting, the engine can pinpoint the culprit in real time, saving valuable time and resources.</li><li><strong>Prioritization Based on Business Impact: </strong>By revealing the effects of specific root causes on related services, problem resolution can be prioritized.</li><li><strong>Reduced Costs:</strong> Faster resolution minimizes downtime and associated costs.</li><li><strong>Improved Collaboration: </strong>Teams responsible for failing services receive immediate notifications and can visualize a Causality Graph explaining the issue’s origin and impact. This streamlines communication and prioritizes remediation efforts based on the effect the root cause problem has.</li><li><strong>Automated Actions:</strong> In specific cases, the engine can even trigger automated fixes based on the root cause type.</li><li><strong>Empowered Teams:</strong> Teams affected by the problem are notified but relieved of troubleshooting burdens. They can focus on workarounds or mitigating downstream effects, enhancing overall system resilience.</li></ul><p>The system represents a significant leap forward in managing cloud native applications. By facilitating real-time root cause analysis and intelligent automation, it empowers teams to proactively address disruptions and ensure the smooth operation of their applications.</p><p>The knowledge in the system is not just relevant to optimize the <a href="https://www.causely.ai/platform/causely-for-cloud-native-applications/use-cases/?ref=causely-blog.ghost.io">incident response</a> process. 
It is also valuable for performing “what if” analysis to understand the impact that future changes and planned maintenance will have, so that steps can be taken to proactively mitigate the risks of these activities.</p><p>Through its understanding of cause and effect, it can also play a role in business continuity planning, enabling teams to identify single points of failure in complex services to improve service resilience.</p><p>The system can also be used to streamline the process of incident postmortems because it contains the history of previous root cause problems, why they occurred and what the effect was — their causality. This avoids the complexity and time involved in reconstructing what happened and enables mitigating steps to be taken to avoid recurrences.</p><h2 id="the-types-of-root-cause-problems-their-effects">The Types of Root Cause Problems &amp; Their Effects</h2><p>The system computes its causal knowledge based on Causal Models. These describe how root cause problems propagate symptoms along relationships to dependent entities, independently of a given environment. 
This knowledge is instantiated through service and infrastructure auto discovery to create the Causal Graph and Codebook.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/causal-models-amgraphic.jpg" class="kg-image" alt="" loading="lazy" width="2000" height="875" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/causal-models-amgraphic.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/causal-models-amgraphic.jpg 1000w, https://causely-blog.ghost.io/content/images/size/w1600/wp-content/uploads/2024/07/causal-models-amgraphic.jpg 1600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/causal-models-amgraphic.jpg 2044w" sizes="(min-width: 720px) 720px"></figure><p>Examples of these <a href="https://www.causely.ai/platform/causely-for-cloud-native-applications/problem-coverage/?ref=causely-blog.ghost.io">types of root cause problems</a> that are modeled in the system include:</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problem-coverage-amgraphic.jpg" class="kg-image" alt="" loading="lazy" width="1166" height="674" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/problem-coverage-amgraphic.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/problem-coverage-amgraphic.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problem-coverage-amgraphic.jpg 1166w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-2-1.jpg" class="kg-image" alt="" loading="lazy" width="1184" height="664" 
srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/problemcoverage-amgraphic-2-1.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/problemcoverage-amgraphic-2-1.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-2-1.jpg 1184w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-3-1.jpg" class="kg-image" alt="" loading="lazy" width="1226" height="641" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/problemcoverage-amgraphic-3-1.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/problemcoverage-amgraphic-3-1.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-3-1.jpg 1226w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-4-1.jpg" class="kg-image" alt="" loading="lazy" width="1348" height="584" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/07/problemcoverage-amgraphic-4-1.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/07/problemcoverage-amgraphic-4-1.jpg 1000w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/07/problemcoverage-amgraphic-4-1.jpg 1348w" sizes="(min-width: 720px) 720px"></figure><h2 id="science-fiction-or-reality">Science Fiction or Reality</h2><p>The inventions behind the system go back to the 1990s and were groundbreaking at the time; they still are. They were successfully deployed, at scale, by some of the largest telcos, system integrators and Fortune 500 companies in the early 2000s. 
You can read about the original inventions <a href="https://www.linkedin.com/posts/andrew-mallaband-88b1b7_high-speed-event-correlation-activity-7213538946411044864-MMI-?utm_source=share&utm_medium=member_desktop&lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BODdn9qH6R%2Fmfff24XPwZsg%3D%3D" rel="noopener">here</a>.</p><p>Today, the problems these inventions set out to address have not changed, and the adoption of cloud-native technologies has only heightened the need for a solution. As real-time data has become pervasive in today’s application architectures, <strong>every second of service disruption is a lost business opportunity.</strong></p><p>Causely has engineered these inventions into a modern, commercially available platform to address the challenges of assuring continuous application reliability in the cloud-native world. The <a href="https://www.causely.ai/company?ref=causely-blog.ghost.io" rel="noreferrer">founding engineering team at Causely</a> were the creators of the tech behind two high-growth companies: SMARTS and Turbonomic.</p><p>If you would like to learn more about this, <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noreferrer">book a meeting</a>&nbsp;with the Causely team. We'd love to chat!</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Causely Overview]]></title>
      <link>https://causely.ai/blog/causely-overview</link>
      <guid>https://causely.ai/blog/causely-overview</guid>
      <pubDate>Thu, 13 Jun 2024 14:12:52 GMT</pubDate>
      <description><![CDATA[Causely assures continuous reliability of cloud applications. Causely for Cloud-Native Applications, built on our Causal Reasoning Platform, automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. This means that we can detect]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/screenshot-2024-06-13-at-10-11-09-am.png" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Causely assures continuous reliability of cloud applications. <a href="https://www.causely.ai/platform/causely-for-cloud-native-applications/?ref=causely-blog.ghost.io">Causely for Cloud-Native Applications</a>, built on our <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causal Reasoning Platform</a>, automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment. This means that we can detect, remediate and even prevent problems that result in service impact. With Causely, Dev and Ops teams are better equipped to plan for ongoing changes to code, configurations or load patterns, and they stay focused on achieving service-level and business objectives instead of firefighting.</p><p>Watch the video to see Causely in action, or take the product for a <a href="https://www.causely.ai/resources/experience-causely/?ref=causely-blog.ghost.io">self-guided tour</a>.</p><figure class="kg-card kg-embed-card"><iframe title="Causely Overview" src="https://player.vimeo.com/video/962337550?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="750" height="421.88" frameborder="0"></iframe></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[The State of AI in Observability Today]]></title>
      <link>https://causely.ai/blog/the-state-of-ai-in-observability-today</link>
      <guid>https://causely.ai/blog/the-state-of-ai-in-observability-today</guid>
      <pubDate>Mon, 10 Jun 2024 20:27:00 GMT</pubDate>
      <description><![CDATA[Reposted with permission from Observability 360]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/2025/09/ai-general-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p><em>Reposted with permission from </em><a href="https://observability-360.com/article/ViewArticle?id=ai-in-observability&ref=causely-blog.ghost.io" rel="noreferrer"><em>Observability 360</em></a></p><h4 id="a-brief-round-up">A Brief Round-Up</h4><p>The recent advances in AI and Machine Learning open up enormous possibilities for the observability sector. Observability backends ingest vast amounts of telemetry and therefore have an ocean of raw data to mine for rich analytics and diagnostics. In addition to this, LLMs themselves have become first-class IT citizens in many organisations and are therefore also a subject for observability. Indeed, OpenTelemetry have now released a set of&nbsp;<a href="https://www.linkedin.com/pulse/opentelemetry-semantic-conventions-generative-ai-drew-robbins-eguhc/?ref=causely-blog.ghost.io">semantic conventions</a>&nbsp;for working with Generative AI.</p><p>Observability is one of the most dynamic and competitive sectors in the IT industry and it is no surprise that vendors have begun to incorporate AI features into their platforms. It goes without saying that AI is a broad term and is used in different ways - it is generally used to cover both machine learning as well as generative AI. In this article we will briefly survey a number of leading platforms and products in the observability market and look at some of the ways they have incorporated AI capabilities into their products.</p><h4 id="ibm-instanaintelligent-remediation">IBM Instana - Intelligent Remediation</h4><p>IBM has a long and illustrious track record in the fields of AI and Machine Learning - including the 1997 victory of Deep Blue over chess grandmaster Garry Kasparov and IBM Watson’s winning turn on the US TV quiz Jeopardy in 2011. 
Watson has now been superseded by WatsonX, and this is the engine which powers the Intelligent Remediation feature in IBM’s&nbsp;<a href="https://www.ibm.com/products/instana?ref=causely-blog.ghost.io">Instana observability platform</a>. Intelligent Remediation is a preview technology which continuously monitors a system for faults and anomalies. As well as drawing upon system telemetry, it also uses expert knowledge for causal analysis and then suggests remediations. The remediations can be implemented using pre-built actions selected from a catalogue. As well as the Remediation feature, Instana also has AI-driven capabilities for summarising, diagnostics and making recommendations.</p><h4 id="logzioanomaly-detection">Logz.Io - Anomaly Detection</h4><p><a href="https://logz.io/?ref=causely-blog.ghost.io">Logz.Io</a>&nbsp;is a popular full-stack observability platform built on top of open source technologies such as OpenSearch, Prometheus and Jaeger. Whilst their platform has been equipped with AI tooling for some time, the company are circumspect in not overplaying the current capabilities of AI. Whilst they recognise that AI can assist in areas such as reducing noise and summarising incidents, they do not make any claims in terms of causal analysis or remediation. You can learn more on the company’s AI posture from this&nbsp;<a href="https://logz.io/blog/anomaly-detection-application-observability?ref=causely-blog.ghost.io">really illuminating webinar</a>.</p><p>The Logz.Io platform ships with an Observability IQ Assistant, which harnesses AI to support natural language querying and chat-based analytics on your telemetry data. The most powerful AI feature in the Logz.Io platform though, is probably the Anomaly Detection tooling that is integrated into the App 360 module. One problem with anomaly detection is that it is not business-aware, and, if not applied carefully, it may end up creating yet more alert fatigue. 
To combat this, Logz.io Anomaly Detection allows users to target critical services and take a more SLO-driven approach.</p><h4 id="elasticsupercharging-search">Elastic - Supercharging Search</h4><p>The ELK Stack has been at the forefront of the log aggregation and analytics space for many years. Despite controversies over licensing, Elastic is still a hugely popular and influential product. Highly powerful search capabilities are at the core of its product offering and it is no surprise that this is a domain where Elastic seeks to differentiate itself from other platforms in terms of its AI tooling. It seems though, that Elastic’s ambitions extend far beyond log searching and it is positioning itself as a first-choice platform for advanced corporate data analytics.</p><p>The centrepiece of this vision is the&nbsp;<a href="https://www.elastic.co/generative-ai/search-ai-lake?ref=causely-blog.ghost.io">Search AI Lake</a>, which incorporates RAG, search and security functions and is built on a cloud-native architecture. The company claims that this enables search over vast volumes of data at high speed and low cost. A quick glance at Elastic's&nbsp;<a href="https://ir.elastic.co/news/news-details/2024/Elastic-Reports-Fourth-Quarter-and-Fiscal-2024-Financial-Results/default.aspx?ref=causely-blog.ghost.io">latest financial report</a>&nbsp;really highlights the strategic importance of AI to the company. Pretty much every item listed in the Product Innovations and Updates section is AI-related. Search is obviously an area with great potential for AI and other vendors such as AWS have also incorporated AI into their search functionality. 
At the moment, this technology is still at the experimental stage, but when it matures, natural language search over telemetry data will be a huge win for making observability systems accessible and of value across the enterprise.</p><h4 id="new-relicobservability-for-llms">New Relic - Observability for LLMs</h4><p>The AI revolution poses a two-fold challenge for observability vendors. As well as harnessing AI to create more powerful systems, they also need to extend their functional scope to provide insights into the LLM functionality that customers are building into their systems. New Relic were the first major vendor to add LLM monitoring to their stack - although Datadog and Elastic have now followed suit. The New Relic&nbsp;<a href="https://docs.newrelic.com/whats-new/2024/03/whats-new-03-28-aimonitoringga/?ref=causely-blog.ghost.io">AI monitoring</a>&nbsp;product will check for “bias, toxicity, and hallucinations” as well as identifying processing bottlenecks and scanning for potential vulnerabilities. As well as the usual APM signals, the tool also captures AI-specific metrics such as response quality and token counts. Sending data to LLMs can obviously represent a potential security issue, so the system also includes safeguards for protecting sensitive data.</p><h4 id="grafana">Grafana</h4><p>Grafana's initial application of AI to their stack concentrated on reducing toil and providing 'delight' for the user. This entailed functionality such as generating incident summaries or providing automated suggestions for names and titles of panels and other objects. Recently though, they have started to ramp up their AI features. One of the most notable of these is AI-powered insights for continuous profiling. Flame graphs are a great tool; they can, however, be visually very dense, and it can take some time to unpack all the data and identify root causes and bottlenecks. 
The Grafana Cloud Profiles tool now supports an&nbsp;<a href="https://grafana.com/blog/2024/05/15/ai-powered-insights-for-continuous-profiling-introducing-flame-graph-ai-in-grafana-cloud?ref=causely-blog.ghost.io">AI-powered flame graph reader</a>&nbsp;to speed up and simplify diagnostics and analysis.</p><h4 id="causely">Causely</h4><p>Whilst most of the systems in this review harness AI to complement their existing stack,&nbsp;<a href="https://www.causely.io/?ref=causely-blog.ghost.io">Causely</a>&nbsp;is built on AI from the ground up. As the name suggests, it uses Causal AI - built on expert systems knowledge - to carry out root cause analysis as well as predictive diagnostics. This contrasts with most other systems whose root cause analysis is actually powered by correlation and inference - which are less reliable approaches. Causely is not a full stack system, instead it plugs in to your existing stack. If you are interested in digging deeper into Causely and causal AI then take a look at our&nbsp;<a href="https://observability-360.com/article/ViewArticle?id=causely-causal-ai&ref=causely-blog.ghost.io">recent feature article</a>.</p><h4 id="open-source">Open Source</h4><p>It is not only the large vendors who are harnessing AI capabilities to build new products and features.&nbsp;<a href="https://k8sgpt.ai/?ref=causely-blog.ghost.io">K8sGPT</a>&nbsp;is an open source project aiming to ease the burden for K8S admins by tapping into AI backends for assistance with diagnostics. Like much AI-based tooling, it works as a co-pilot rather than an autonomous operator. The tool is built on a set of Analysers which map to K8S resources such as pods, nodes, services etc and continually scan your cluster, looking for errors. 
It then sends a digest of the error context to the backend AI (it doesn’t have to be OpenAI) and presents the potential fixes to the user.</p><p><a href="https://observability-360.com/article/Langtrace%20AI?ref=causely-blog.ghost.io">Langtrace AI</a>&nbsp;is an open source tool offering observability for LLM apps. It can be self-hosted, but there is also a SaaS version of the product. It provides full OpenTelemetry tracing support, along with metrics around costs, accuracy and latency. It offers support for the Pinecone and ChromaDB vector databases and integrates with OpenAI and Anthropic LLMs. There is also an integration for viewing your traces in SigNoz. There is an ambitious list of new features on the project’s backlog, and it is likely to evolve quickly.</p><h4 id="conclusion">Conclusion</h4><p>It is still early days in terms of the incorporation of AI capabilities in observability systems. However, there are a few trends emerging, with AI features coalescing around:</p><ul><li>reducing toil</li><li>causal analysis/anomaly detection</li><li>natural language search</li><li>LLM observability</li></ul><p>Some vendors, such as&nbsp;<a href="https://cleric.io/blog/introducing-cleric?ref=causely-blog.ghost.io">Cleric</a>, have made some very bold claims about creating an AI Site Reliability Engineer, and others have spoken about "closing the loop", but, in reality, this is a long way off. The changes we are seeing are incremental rather than transformational. Innovations such as natural language search will make tasks such as querying easier and more accessible, but these are assistive technologies. As the author of a recent article on the&nbsp;<a href="https://incident.io/blog/ai?ref=causely-blog.ghost.io">Incident.io blog</a>&nbsp;put it, AI is best envisaged as an exoskeleton rather than a robot.</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Real-time Data & Modern UXs: The Power and the Peril When Things Go Wrong]]></title>
      <link>https://causely.ai/blog/real-time-data-modern-uxs-the-power-and-the-peril-when-things-go-wrong</link>
      <guid>https://causely.ai/blog/real-time-data-modern-uxs-the-power-and-the-peril-when-things-go-wrong</guid>
      <pubDate>Fri, 07 Jun 2024 14:14:02 GMT</pubDate>
      <description><![CDATA[Imagine a world where user experiences adapt to you in real time. Personalized recommendations appear before you even think of them, updates happen instantaneously, and interactions flow seamlessly. This captivating world is powered by real-time data, the lifeblood of modern…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/real-time-header.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Imagine a world where user experiences adapt to you in real time. Personalized recommendations appear before you even think of them, updates happen instantaneously, and interactions flow seamlessly. This captivating world is powered by real-time data, the lifeblood of modern applications.</p><p>But this power comes at a cost. The intricate <a href="https://thenewstack.io/how-to-build-a-scalable-platform-architecture-for-real-time-data/?ref=causely-blog.ghost.io" rel="noopener">architecture behind real-time services</a> can make troubleshooting issues a nightmare. Organizations that rely on real-time data to deliver products and services face a critical challenge: ensuring data is delivered fresh and on time. Missing data or delays can cripple the user experience and demand resolutions within minutes, if not seconds.</p><p>This article delves into the world of real-time data challenges. We’ll explore the business settings where real-time data is king, highlighting the potential consequences of issues. Then I will introduce a novel approach that injects automation into the troubleshooting process, saving valuable time and resources, but most importantly mitigating the business impact when problems arise.</p><h2 id="lags-missing-data-the-hidden-disruptors-across-industries">Lags &amp; Missing Data: The Hidden Disruptors Across Industries</h2><p>Lags and missing data can be silent assassins, causing unseen disruptions that ripple through various industries. Let’s dig into the specific ways these issues can impact different business sectors.</p><!--kg-card-begin: html--><h3><img class="size-full wp-image-1002 aligncenter" src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/real-time-impacts-icons.jpg" alt="Disruptions in real-time data can cause business impact" width="1218" height="182"><br>
<!--kg-card-begin: html--><span style="font-size: 14pt;">Financial markets</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Trading:</strong> In high-frequency trading, even milliseconds of delay can mean the difference between a profitable and losing trade. <a href="https://www.investopedia.com/terms/r/real_time.asp?ref=causely-blog.ghost.io" rel="noopener">Real-time data on market movements</a> is crucial for making informed trading decisions.</li><li><strong>Fraud detection:</strong> <a href="https://www.confluent.io/use-case/fraud-detection/?ref=causely-blog.ghost.io" rel="noopener">Real-time monitoring</a> of transactions allows financial institutions to identify and prevent fraudulent activity as it happens. Delays in data can give fraudsters a window of opportunity.</li><li><strong>Risk management:</strong> Real-time data on market volatility, creditworthiness, and other factors helps businesses assess and <a href="https://www.10xbanking.com/insights/how-real-time-data-is-reshaping-the-credit-card-industry?ref=causely-blog.ghost.io" rel="noopener">manage risk effectively</a>. Delays can lead to inaccurate risk assessments and potentially large losses.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;"><strong>Supply chain management</strong></span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Inventory management:</strong> <a href="https://cogsy.com/inventory-management/real-time-inventory/?ref=causely-blog.ghost.io" rel="noopener">Real-time data on inventory levels</a> helps businesses avoid stockouts and optimize inventory costs. 
Delays can lead to overstocking or understocking, impacting customer satisfaction and profitability.</li><li><strong>Logistics and transportation:</strong> <a href="https://www.freightmango.com/blog/top-9-real-time-tracking-benefits-freight-logistics/?ref=causely-blog.ghost.io" rel="noopener">Real-time tracking of shipments</a> allows companies to optimize delivery routes, improve efficiency, and provide accurate delivery estimates to customers. Delays can disrupt logistics and lead to dissatisfied customers.</li><li><strong>Demand forecasting:</strong> Real-time data on customer behavior and sales trends allows businesses to <a href="https://www.deskera.com/blog/real-time-demand-forecasting/?ref=causely-blog.ghost.io" rel="noopener">forecast demand accurately</a>. Delays can lead to inaccurate forecasts and production issues.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Customer service</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Live chat and phone support:</strong> Real-time access to customer data allows <a href="https://www.forbes.com/sites/homaycotte/2015/06/16/5-ways-improve-customer-service-with-real-time-data-and-real-time-responses/?sh=5ad77a2c6974&ref=causely-blog.ghost.io" rel="noopener">support agents to personalize</a><a href="https://www.forbes.com/sites/homaycotte/2015/06/16/5-ways-improve-customer-service-with-real-time-data-and-real-time-responses/?sh=5ad77a2c6974&ref=causely-blog.ghost.io" rel="noopener"> interactions</a> and resolve issues quickly. Delays can lead to frustration and longer resolution times.</li><li><strong>Social media monitoring:</strong> <a href="https://medium.com/quikai/what-are-the-benefits-of-real-time-sentiment-analysis-009aad4b424e?ref=causely-blog.ghost.io" rel="noopener">Real-time tracking of customer sentiment</a> on social media allows businesses to address concerns and build brand reputation. 
Delays can lead to negative feedback spreading before it’s addressed.</li><li><strong>Personalization:</strong> <a href="https://www.sitecore.com/resources/omnichannel-personalization/what-is-real-time-personalization?ref=causely-blog.ghost.io" rel="noopener">Real-time data on customer preferences</a> allows businesses to personalize website experiences, product recommendations, and marketing campaigns. Delays can limit the effectiveness of these efforts.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Manufacturing</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Machine monitoring:</strong> <a href="https://guidewheel.com/en/machine-monitoring?ref=causely-blog.ghost.io" rel="noopener">Real-time monitoring</a> of machine performance allows for predictive maintenance, preventing costly downtime. Delays can lead to unexpected breakdowns and production delays.</li><li><strong>Quality control:</strong> <a href="https://www.automation.com/en-us/articles/january-2024/future-quality-control-manufacturing-facilities?ref=causely-blog.ghost.io" rel="noopener">Real-time data on product quality</a> allows for immediate identification and correction of defects. Delays can lead to defective products reaching customers.</li><li><strong>Process optimization:</strong> <a href="https://www.dataparc.com/blog/manufacturing-process-opimization/?ref=causely-blog.ghost.io" rel="noopener">Real-time data on production processes</a> allows for continuous improvement and optimization. 
Delays can limit the ability to identify and address inefficiencies.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Other examples</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Online gaming:</strong> Real-time data is <a href="https://redpanda.com/blog/game-development-streaming-data-platform?ref=causely-blog.ghost.io" rel="noopener">crucial for smooth gameplay</a> and a fair playing field. Delays can lead to lag, disconnects, and frustration for players.</li><li><strong>Healthcare:</strong> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10681793/?ref=causely-blog.ghost.io" rel="noopener">Real-time monitoring</a> of vital signs and patient data allows for faster diagnosis and treatment. Delays can have serious consequences for patient care.</li><li><strong>Energy management:</strong> <a href="https://episensor.com/knowledge-base/what-are-the-key-benefits-of-real-time-energy-monitoring/?ref=causely-blog.ghost.io" rel="noopener">Real-time data on energy</a> consumption allows businesses and utilities to optimize energy use and reduce costs. Delays can lead to inefficient energy usage and higher costs.</li><li><strong>Cybersecurity:</strong> Real-time data is the <a href="https://www.fortinet.com/resources/cyberglossary/cybersecurity-analytics?ref=causely-blog.ghost.io" rel="noopener">backbone of modern cybersecurity</a>, enabling rapid threat detection, effective incident response, and accurate security analytics. However, delays in the ability to see and understand this data can create critical gaps in your defenses. From attackers having more time to exploit vulnerabilities to outdated security controls and hindered automated responses, data lags can significantly compromise your ability to effectively combat cyber threats.</li></ul><p>As we’ve seen, the consequences of lags and missing data can be far-reaching. 
From lost profits in financial markets to frustrated customers and operational inefficiencies, these issues pose a significant threat to business success. Having the capability to identify the root cause and its impact, and to remediate issues with precision and speed, is imperative to mitigate the business impact.</p><!--kg-card-begin: html--><span style="color: #4338a6;">Causely automatically captures cause and effect relationships based on real-time, dynamic data across the entire application environment.</span><!--kg-card-end: html--><p><a href="https://www.causely.ai/demo/?ref=causely-blog.ghost.io">Request a demo</a> to see it in action.</p><h2 id="the-delicate-dance-a-web-of-services-and-hidden-culprits">The Delicate Dance: A Web of Services and Hidden Culprits</h2><p>Modern user experiences that leverage real-time data rely on complex chains of interdependent services – a delicate dance of microservices, databases, messaging platforms, and virtualized compute infrastructure. A malfunction in any one element can create a ripple effect, impacting the freshness and availability of data for users. 
This translates to frustrating delays, lags, or even complete UX failures.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/microservices-complexity.png" class="kg-image" alt="microservices environments are complex" loading="lazy" width="333" height="333"></figure><p>Let’s delve into the hidden culprits behind these issues and see how seemingly minor bottlenecks can snowball into major UX problems:</p><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Slowdown Domino with Degraded Microservice</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Scenario:</strong> A microservice responsible for product recommendations experiences high latency due to increased user traffic and internal performance degradation (e.g., memory leak, code inefficiency).</li><li><strong>Impact 1:</strong> The overloaded and degraded microservice takes significantly longer to process requests and respond to the database.</li><li><strong>Impact 2:</strong> The database, waiting for the slow microservice response, experiences delays in retrieving product information.</li><li><strong>Impact 3:</strong> Due to the degradation, the microservice might also have issues sending messages efficiently to the message queue. These messages contain updates on product availability, user preferences, or other relevant data for generating recommendations.</li><li><strong>Impact 4:</strong> Messages pile up in the queue due to slow processing by the microservice, causing delays in delivering updates to other microservices responsible for presenting information to the user.</li><li><strong>Impact 5:</strong> The cache, not receiving timely updates from the slow microservice and the message queue, relies on potentially outdated data.</li><li><strong>User Impact:</strong> Users experience significant delays in seeing product recommendations. 
The recommendations themselves might be inaccurate or irrelevant due to outdated data in the cache, hindering the user experience and potentially leading to missed sales opportunities. Additionally, users might see inconsistencies between product information displayed on different pages (due to some parts relying on the cache and others waiting for updates from the slow microservice).</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Message Queue Backup</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Scenario:</strong> A sudden spike in user activity overwhelms the message queue handling communication between microservices.</li><li><strong>Impact 1:</strong> Messages pile up in the queue, causing delays in communication between microservices.</li><li><strong>Impact 2:</strong> Downstream microservices waiting for messages experience delays in processing user actions.</li><li><strong>Impact 3:</strong> The cache, not receiving updates from slow microservices, might provide outdated information.</li><li><strong>User Impact:</strong> Users experience lags in various functionalities – for example, slow loading times for product pages, delayed updates in shopping carts, or sluggish responsiveness when performing actions.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Cache Miss Cascade</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Scenario:</strong> A cache experiences a high rate of cache misses due to frequently changing data (e.g., real-time stock availability).</li><li><strong>Impact 1:</strong> The microservice needs to constantly retrieve data from the database, increasing the load on the database server.</li><li><strong>Impact 2:</strong> The database, overloaded with requests from the cache, experiences performance degradation.</li><li><strong>Impact 3:</strong> The slow database response times further contribute 
to cache misses, creating a feedback loop.</li><li><strong>User Impact:</strong> Users experience frequent delays as the system struggles to retrieve data for every request, leading to a sluggish and unresponsive user experience.</li></ul><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Kubernetes Lag</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><ul><li><strong>Scenario:</strong> A resource bottleneck occurs within the Kubernetes cluster, limiting the processing power available to microservices.</li><li><strong>Impact 1:</strong> Microservices experience slow response times due to limited resources.</li><li><strong>Impact 2:</strong> Delays in microservice communication and processing cascade throughout the service chain.</li><li><strong>Impact 3:</strong> The cache might become stale due to slow updates, and message queues could experience delays.</li><li><strong>User Impact:</strong> Users experience lags across various functionalities, from slow page loads and unresponsive buttons to delayed updates in real-time data like stock levels or live chat messages.</li></ul><p>Even with advanced monitoring tools, pinpointing the root cause of these and other issues can be a <a href="https://www.causely.ai/blog/devops-may-have-cheated-death-but-do-we-all-need-to-work-for-the-king-of-the-underworld/?ref=causely-blog.ghost.io">time-consuming detective hunt</a>. The triage &amp; troubleshooting process often requires a team effort, bringing together experts from various disciplines. Together, they sift through massive amounts of observability data – traces, metrics, logs, and the results of diagnostic tests – to piece together the evidence and draw the right conclusions so they can accurately determine the cause and effect. 
The speed and accuracy of the process are largely determined by the skills of the people available when issues arise.</p><p>Only when the root cause is understood can the responsible team make informed decisions to resolve the problem and restore reliable service.</p><h2 id="transforming-incident-response-automation-of-the-triage-troubleshooting-process">Transforming Incident Response: Automation of the Triage &amp; Troubleshooting Process</h2><p>Traditional methods of incident response, often relying on manual triage and troubleshooting, can be slow, inefficient, and prone to human error. This is where automation comes in, particularly with the advancements in Artificial Intelligence (AI). Specifically, a subfield of AI called <a href="https://ssir.org/articles/entry/the_case_for_causal_ai?ref=causely-blog.ghost.io" rel="noopener">Causal AI</a> presents a revolutionary approach to transforming incident response.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/before-after-causal-ai.jpg" class="kg-image" alt="what troubleshooting looks like before and after causal AI " loading="lazy" width="1746" height="850" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/06/before-after-causal-ai.jpg 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/06/before-after-causal-ai.jpg 1000w, https://causely-blog.ghost.io/content/images/size/w1600/wp-content/uploads/2024/06/before-after-causal-ai.jpg 1600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/06/before-after-causal-ai.jpg 1746w" sizes="(min-width: 720px) 720px"></figure><p>Causal AI goes beyond correlation, directly revealing cause-and-effect relationships between incidents and their root causes. 
In an environment where services rely on real-time data and fast resolution is critical, Causal AI offers significant benefits:</p><ul><li><strong>Automated Triage:</strong> Causal AI analyzes alerts and events to prioritize incidents based on severity and impact. It can also pinpoint the responsible teams, freeing resources from chasing false positives.</li><li><strong>Machine Speed Root Cause Identification:</strong> By analyzing causal relationships, Causal AI quickly identifies the root cause, enabling quicker remediation and minimizing damage.</li><li><strong>Smarter Decisions:</strong> A clear understanding of the causal chain empowers teams to make informed decisions for efficient incident resolution.</li></ul><p><a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a> is leading the way in applying Causal AI to incident response for modern cloud-native applications. Causely’s technology utilizes <a href="https://www.causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io">causal reasoning</a> to automate triage and troubleshooting, significantly reducing resolution times and mitigating business impact. Additionally, Causal AI streamlines post-incident analysis by automatically documenting the causal chain.</p><p>Beyond reactive incident response, Causal AI offers proactive capabilities that focus on measures to reduce the probability of future incidents and service disruptions, through improved hygiene, predictions and “what if” analysis.</p><p><a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">The solution</a> is built for the modern world that incorporates real-time data, applications that communicate synchronously and asynchronously, and leverage modern cloud building blocks (databases, caching, messaging &amp; streaming platforms and Kubernetes).</p><p>This is just the beginning of the transformative impact Causal AI is having on incident response. 
As the technology evolves, we can expect even more advancements that will further streamline and strengthen organizations’ ability to continuously assure the reliability of applications.</p><p>If you would like to learn more about Causal AI and its applications in the world of real-time data and cloud-native applications, don’t hesitate to reach out.</p><p>You may also want to check out an <a href="https://www.causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai/?ref=causely-blog.ghost.io">article by Endre Sara</a> which explains how Causely is using Causely to manage its own SaaS service, which is built around a real-time data architecture.</p><hr><h2 id="related-resources">Related Resources</h2><ul><li><a href="https://www.causely.ai/podcast/webinar/what-is-causal-ai-why-do-devops-need-it/?ref=causely-blog.ghost.io">Watch the on-demand webinar:</a> What is Causal AI and why do DevOps teams need it?</li><li><a href="https://www.causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io">Read the blog:</a> Bridging the gap between observability and automation with causal reasoning</li><li><a href="https://www.causely.ai/demo/?ref=causely-blog.ghost.io">See causal AI in action:</a> Request a demo of Causely</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Crossing the Chasm, Revisited]]></title>
      <link>https://causely.ai/blog/crossing-the-chasm-revisited</link>
      <guid>https://causely.ai/blog/crossing-the-chasm-revisited</guid>
      <pubDate>Thu, 30 May 2024 03:11:47 GMT</pubDate>
      <description><![CDATA[Sometimes there’s a single book (or movie, podcast or Broadway show) that seems to define a particular time in your life. In my professional life, Geoffrey Moore’s Crossing the Chasm has always been that book. When I started my career…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/david-lusvardi-svmhnqutaty-unsplash.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>Sometimes there’s a single book (or movie, podcast or Broadway show) that seems to define a particular time in your life. In my professional life, Geoffrey Moore’s <em><a href="https://en.wikipedia.org/wiki/Crossing_the_Chasm?ref=causely-blog.ghost.io" rel="noopener">Crossing the Chasm</a></em> has always been that book. When I started my career as VP Marketing in the 1990s, this was the absolute bible for early-stage B2B startups launching new products. Fast forward to today, and people still refer to it as a touchstone. Even as go-to-market motions have evolved and become more agile and data-driven, the need to identify a beachhead market entry point and solve early-adopter pain points fully before expanding to the mainstream market has remained relevant and true. I still use the <a href="https://www.elevatorpitchessentials.com/essays/CrossingTheChasmElevatorPitchTemplate.html?ref=causely-blog.ghost.io" rel="noopener">positioning framework</a> for every new product and company.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/geoffrey-moores-book-crossing-the-chasm-2-is-a-variation-on-the-technology-adoption.png" class="kg-image" alt="The gap between early adopters and early majority" loading="lazy" width="850" height="293" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/05/geoffrey-moores-book-crossing-the-chasm-2-is-a-variation-on-the-technology-adoption.png 600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/geoffrey-moores-book-crossing-the-chasm-2-is-a-variation-on-the-technology-adoption.png 850w" sizes="(min-width: 720px) 720px"></figure><p><em>Graphic from “Crossing the Chasm” showing the gap between early adopters and early majority. 
<a href="https://www.researchgate.net/figure/Geoffrey-Moores-book-Crossing-The-Chasm-2-is-a-variation-on-the-technology-adoption_fig2_345944786?ref=causely-blog.ghost.io" rel="noopener">Image source: Patel, Neeral &amp; Patlas, Michael &amp; Mafeld, Sebastian. (2020).</a></em></p><p>Recently, while hosting the Causely team at their beautiful new offices for our quarterly meetup, our investors at <a href="https://www.645ventures.com/?ref=causely-blog.ghost.io" rel="noopener">645 Ventures</a> gave everyone a copy of the latest edition of <em>Crossing the Chasm</em>. It was an opportunity for me to review the basic concepts. Re-reading it brought back years of memories of startups past and made me think about the book in a new context: how have Moore’s fundamental arguments withstood the decades of technology trends I’ve experienced personally? Specifically, what does “crossing the chasm” actually mean when new product adoption can be so different from one technology shift to another?</p><h2 id="a-quick-refresher">A Quick Refresher</h2><p>One of Moore’s key insights is that innovators and early adopters are willing to try a new product and work with a new company because it meets some specific needs – innovators love being first to try cool things, and early adopters see new technology as a way to solve problems not currently being met by existing providers. These innovators/early adopters then share their experiences with others in their organizations and industries, who trust and respect their knowledge. This allows the company to reach a broader market over time, cross the chasm and begin adoption by the early majority. Many years can go by during this process, much venture funding will be spent, and still the company may only have penetrated a small percentage of the market. 
Only years later (and with many twists and turns) will the company reach the late majority and finally the laggards.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/david-lusvardi-svmhnqutaty-unsplash.jpg" class="kg-image" alt loading="lazy" width="353" height="353"></figure><p><em>Photo by <a href="https://unsplash.com/@lusvardi?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash" rel="noopener">David Lusvardi</a> on <a href="https://unsplash.com/photos/person-standing-near-edge-of-rocky-mountain-SVMHNQUtatY?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash" rel="noopener">Unsplash</a></em></p><h2 id="the-chasm-looks-different-over-time">The Chasm Looks Different Over Time</h2><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Netezza and Data Warehousing</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><p>I started to think about this in terms of technology shifts that I’ve lived through. Earlier in my career I had the good fortune to be part of a company that crossed the chasm: <a href="https://en.wikipedia.org/wiki/Netezza?ref=causely-blog.ghost.io" rel="noopener">Netezza</a>. We built a data warehousing system that was 100x the performance of existing solutions from IBM and Oracle, at half the cost. While this was clearly a breakthrough product, the data warehousing industry had not changed in any meaningful way for over a decade and the database admins who ran the existing solutions were in no rush to try something new or different, for all the usual reasons. Within 10 years we created a new category, the “data warehouse appliance.” We gained traction first with some true innovators and then with early adopters who brought the product into their data centers, proved the value and then used it more widely as a platform. 
However, “crossing the chasm” took many more years – we had a couple of hundred customers at the time of IPO – and only once the company was acquired by IBM did more mainstream adopters become ready to buy (since <a href="https://www.forbes.com/sites/duenablomstrom1/2018/11/30/nobody-gets-fired-for-buying-ibm-but-they-should/?sh=1e0f110048fc&ref=causely-blog.ghost.io" rel="noopener">no one ever gets fired for buying IBM</a>, etc). The product was so good that it remained in the market for over 20 years until the cloud revolution changed things, but it’s hard to argue that it ever gained broad market adoption compared with more traditional data warehouses.</p><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">Datadog and Cloud Observability</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><p>A second example, which is closer to the market <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">my current company</a> operates in, is Datadog in the observability space. Fueled by the cloud computing revolution (which itself was in the process of crossing the chasm when <a href="https://www.forbes.com/companies/datadog/?sh=5b81e6149e03&ref=causely-blog.ghost.io" rel="noopener">Datadog was founded in 2010</a>), Datadog rode the new technology wave and solved old problems of IT operations management for new cloud-based applications. While this is not necessarily creating a new category, the company moved very quickly from early cloud innovators and adopters to mainstream customers, rocketing to around <a href="https://www.saastr.com/5-interesting-learnings-from-datadog-at-1-2-billion-in-arr/?ref=causely-blog.ghost.io" rel="noopener">$1B in revenues in 10 years</a>. What’s more impressive is that Datadog has become the de facto standard for cloud application observability; today the company has almost 30,000 customers and is still growing quickly in the “early majority” part of the observability market. 
Depending on which market size numbers you use, Datadog has already crossed the chasm or is well underway, with plenty of room to expand with “late majority” customers.</p><!--kg-card-begin: html--><h3><!--kg-card-begin: html--><span style="font-size: 14pt;">OpenAI and GenAI</span><!--kg-card-end: html--></h3><!--kg-card-end: html--><p>Finally, think about the market adoption in the current GenAI revolution. <a href="https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/?ref=causely-blog.ghost.io" rel="noopener">100 million users adopted ChatGPT</a> within two months of its initial release in late 2022 and OpenAI claims that over <a href="https://inside.com/ai/posts/openai-pitching-corporate-focused-chatgpt-to-fortune-500-420774?ref=causely-blog.ghost.io" rel="noopener">80% of F500 companies</a> are already using it. No need for me to add more statistics here (e.g., comparisons vs adoptions of internet and social technologies) – it’s clear that this is one of the fastest chasm-crossings in history, although it’s not yet clear how companies plan to use the new AI products even as they adopt them. The speed and market confusion make it hard to envision what crossing the chasm will mean for mainstream adopters and how the technology will fully solve a specific set of problems.</p><h2 id="defining-success-as-you-cross">Defining Success as You Cross</h2><p>Thinking through these examples made me realize some things I hadn’t understood earlier in my career:</p><ul><li>It’s easy to confuse a large financial outcome (through IPO or acquisition) with “crossing the chasm”, since the assumption is often that you’ve had enough market success for the outcome. In fact, these are not necessarily related issues. 
It’s possible to have a large $ acquisition or even a successful IPO (<a href="https://www.theregister.com/2007/07/21/netezza_ipo_up/?ref=causely-blog.ghost.io" rel="noopener">as Netezza did</a>) without having yet crossed to mainstream adoption.</li><li>The market and technology trends that surround and support a new company and product can lead to very different experiences in crossing the chasm: You can have a breakthrough and exciting product in a slow-moving market without major technology tailwinds (e.g., data warehousing in the early 2000s) but you can also have a huge tailwind like cloud computing that drives a new product to more mainstream adoption within 10 years (e.g., Datadog’s cloud-based observability). Or you can have a hyper-growth technology shift like GenAI that shrinks the entire process into a few years, leaving the early and mainstream adopters jumbled together and trying to determine how to turn the new products into something truly useful.</li><li>It can be hard to tell if you’ve really crossed the chasm since people think of many metrics that indicate adoption: % of customers in the total addressable market (Moore defines a <a href="https://thinkinsights.net/strategy/crossing-the-chasm/?ref=causely-blog.ghost.io" rel="noopener">bell curve with percentages for each stage</a>, but I’ve rarely seen people use these strictly), number of monthly active users, revenue market share, penetration within enterprise accounts, etc. Also at the early majority phase, the company can see so much excitement from early customers and analysts (“We’re a leader in the Gartner Magic Quadrant!”) that founders can confuse market awareness and marketing “noise” with true adoption by customers that are waiting for more proof points and additional product capabilities that weren’t as critical for the early adopters. 
It’s important to keep your eye on these requirements to avoid stalling out once you’ve reached the other side of the chasm.</li></ul><p>I would love to hear from other founders who have made this journey! Please share your thoughts on lessons learned and how you’re thinking about the chasm in the new AI-centric world.</p><hr><h2 id="related-resources">Related Resources</h2><ul><li><a href="https://www.causely.ai/blog/dont-forget-these-3-things-when-starting-a-cloud-venture/?ref=causely-blog.ghost.io">Don’t Forget These 3 Things When Starting a Cloud Venture</a></li><li><a href="https://www.causely.ai/blog/are-you-ready-to-eat-your-own-dogfood/?ref=causely-blog.ghost.io">Are You Ready to Eat Your Own Dogfood?</a></li><li><a href="https://www.causely.ai/blog/building-startup-culture-isnt-like-it-used-to-be/?ref=causely-blog.ghost.io">Building Startup Culture Isn’t Like It Used To Be</a></li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Bridging the Gap Between Observability and Automation with Causal Reasoning]]></title>
      <link>https://causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning</link>
      <guid>https://causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning</guid>
      <pubDate>Wed, 22 May 2024 18:53:56 GMT</pubDate>
      <description><![CDATA[Observability has become a growing ecosystem and a common buzzword. Increasing visibility with observability and monitoring tools is helpful, but stopping at visibility isn’t enough. Observability lacks causal reasoning and relies mostly on people to connect application issues with potential…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/bridging-the-gap-between-observability-automation.png" type="image/png" />
      <content:encoded><![CDATA[<p>Observability has become a growing ecosystem and a common buzzword. Increasing visibility with observability and monitoring tools is helpful, but stopping at visibility isn’t enough. Observability lacks causal reasoning and relies mostly on people to connect application issues with potential causes.</p><h2 id="causal-reasoning-solves-a-problem-that-observability-can-t">Causal reasoning solves a problem that observability can’t</h2><p>Combining observability with causal reasoning can revolutionize automated troubleshooting and boost application health. By pinpointing the “why” behind issues, causal reasoning reduces human error and labor.</p><p>This triggers a lot of questions from application owners and developers, including:</p><ul><li>What is observability?</li><li>What is the difference between causal reasoning and observability?</li><li>How does knowing causality increase performance and efficiency?</li></ul><p>Let’s explore these questions to see how observability pairs with causal reasoning for automated troubleshooting and more resilient application health.</p><h2 id="what-is-observability">What is Observability?</h2><p>Observability can be described as observing the state of a system based on its outputs. The three common sources for observability data are logs, metrics, and traces.</p><ul><li>Logs provide detailed records of ordered events.</li><li>Metrics offer quantitative but unordered data on performance.</li><li>Traces show the journey of specific requests through the system.</li></ul><p>The goal of observability is to provide insight into system behavior and performance to help identify and resolve any issues that are happening. However, traditional monitoring tools are “observing” and reporting in silos.</p><blockquote><em>“Observability is not control. 
Not being blind doesn’t make you smarter.”</em> – Shmuel Kliger, Causely founder in our recent <a href="https://www.causely.ai/podcast/dr-shmuel-kliger-on-causely-causal-ai-and-the-challenging-journey-to-application-health/?ref=causely-blog.ghost.io">podcast interview</a></blockquote><p>Unfortunately, this falls short of the goal above and requires <a href="https://www.causely.ai/blog/devops-may-have-cheated-death-but-do-we-all-need-to-work-for-the-king-of-the-underworld/?ref=causely-blog.ghost.io">tremendous human effort</a> to connect alerts, logs, and anecdotal application knowledge with possible root cause issues.</p><p>For example, if a website experiences a sudden spike in traffic and starts to slow down, observability tools can show logs of specific requests and provide metrics on server response times. Furthermore, engineers digging around inside these tools may be able to piece together the flow of traffic through different components of the system to identify candidate bottlenecks.</p><p>The detailed information can help engineers identify and address the root cause of the performance degradation. But we are forced to rely on human and anecdotal knowledge to augment observability. This human touch may provide guiding information and understanding that machines alone are not able to match today, but that <a href="https://www.turing.com/blog/devops-burnout-causes-prevention/?ref=causely-blog.ghost.io" rel="noopener">comes at the cost</a> of increased labor, staff burnout, and lost productivity.</p><h2 id="data-is-not-knowledge">Data is not knowledge</h2><p>Observability tools collect and analyze large amounts of data. 
This has created a new wave of challenges among IT operations teams and SREs, who are now left trying to solve a costly and complex <a href="https://www.datamation.com/big-data/state-of-observability-review-2024/?ref=causely-blog.ghost.io" rel="noopener">big data problem</a>.</p><p>Tool sprawl, where each observability tool offers a unique piece of the puzzle, makes this situation worse. For example, if an organization invests in multiple observability tools that each offer different data insights, the result can be a fragmented and overwhelming system that hinders rather than enhances a holistic understanding of system performance.</p><p>This wastes resources on managing multiple tools and increases the likelihood of errors caused by the complexity of integrating and analyzing data from various sources, ultimately undermining the original goal of improving observability.</p><h2 id="data-is-not-action">Data is not action</h2><p>Even with a comprehensive observability practice, the fundamental issue remains: how do you use observability data to improve the overall system? The problem is not a lack of information at your fingertips. The problem is relying on people and processes to interpret, correlate, and then decide what to do based on this data.</p><p>You need to be able to analyze and make informed decisions in order to troubleshoot effectively and assure continuous application performance. Once again, we find ourselves leaving the decisions and action plans to team members, which is a cost and a risk to the business.</p><h2 id="causal-reasoning-cause-and-effect">Causal reasoning: cause and effect</h2><p>Analysis is essential to understanding the root cause of issues and making informed decisions to improve the overall system. 
By diving deep into the data and identifying patterns, trends, and correlations, organizations can proactively address potential issues before they escalate into major problems.</p><p>Causal reasoning uses available data to determine the cause of events, identifying whether code, resources, or infrastructure are the root cause of an issue, so teams can address the underlying problem rather than just its symptoms.</p><p>For example, a software development team may have been alerted about transaction slowness in their application. Is this a database availability problem? Have there been infrastructure issues that could be affecting database performance?</p><p>When you make changes based on observed behavior, it’s extremely important to consider how these changes will affect other applications and systems. Changes made without the full context are risky.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/causely-kubernetes-service-problem-view-db-connections-noisy-neighbor.png" class="kg-image" alt loading="lazy" width="599" height="353"></figure><p>Figure 1: A PostgreSQL-based application experiencing database congestion</p><p>Applying causal reasoning to the observed environment shows that a recent update to the application code is causing crashes for users during specific transactions. The code update may have introduced inefficient database calls, which is affecting the performance of the application. That change can also go far beyond just the individual application.</p><p>If a company updates its software without fully understanding how it interacts with other systems, the result can be technical issues that disrupt operations and lead to costly downtime. 
This is especially challenging in shared infrastructure where noisy neighbors can affect every adjacent application.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/causely-problem-view.gif" class="kg-image" alt loading="lazy" width="1920" height="1080" srcset="https://causely-blog.ghost.io/content/images/size/w600/wp-content/uploads/2024/05/causely-problem-view.gif 600w, https://causely-blog.ghost.io/content/images/size/w1000/wp-content/uploads/2024/05/causely-problem-view.gif 1000w, https://causely-blog.ghost.io/content/images/size/w1600/wp-content/uploads/2024/05/causely-problem-view.gif 1600w, https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/causely-problem-view.gif 1920w" sizes="(min-width: 720px) 720px"></figure><p>Figure 2: Symptoms, causes, and impact determination</p><p>This is an illustration showing how causal AI software can connect the problem to active symptoms, while understanding the likelihood of each potential cause. This is causal reasoning in action that also understands the effect on the rest of the environment as we evaluate potential resolutions.</p><p>Now that we have causal reasoning for the true root cause, we can go even further by introducing remediation steps.</p><h2 id="automated-remediation-and-system-reliability">Automated remediation and system reliability</h2><p>Automated remediation involves the real-time detection and resolution of issues without the need for human intervention. Automated remediation plays an indispensable role in reducing downtime, enhancing system reliability, and resolving issues before they affect users.</p><p>Yet, implementing automated remediation presents challenges, including the potential for unintended consequences like incorrect fixes that could worsen issues. 
Causal reasoning takes more information into account to drive the decision about root cause, impact, remediation options, and the effect of initiating those remediation options.</p><p>This is why a whole-environment view combined with real-time causal analysis is required to troubleshoot and take remedial actions safely, while also reducing the labor and effort required of operations teams.</p><h2 id="prioritizing-action-over-visibility">Prioritizing action over visibility</h2><p>Observability is a core component of how we monitor modern systems. Extending beyond observability with causal reasoning, impact determination, and automated remediation is the missing key to reducing human error and labor.</p><p>To move toward automation, you need trustworthy, data-driven decisions that are based on a real-time understanding of the impact of behavioral changes in your systems. Those decisions can be used to trigger automation and the orchestration of actions, ultimately leading to increased efficiency and productivity in operations.</p><p>Automated remediation can resolve issues before they escalate, and potentially before they occur at all. The path to automated remediation requires an in-depth understanding of the system’s components and how they behave together as a whole.</p><p>Integrating observability with automated remediation empowers organizations to boost their application performance and reliability. It’s important to assess your observability practices and incorporate causal reasoning to strengthen reliability and efficiency. 
The result is increased customer satisfaction, IT team satisfaction, and risk reduction.</p><figure class="kg-card kg-image-card"><a href="https://www.causely.ai/demo/?ref=causely-blog.ghost.io"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/see-causal-reasoning-in-action-1-e1716405335260.png" class="kg-image" alt loading="lazy" width="332" height="152"></a></figure><hr><h2 id="related-resources">Related resources</h2><ul><li>What is causal AI and why do DevOps teams need it? <a href="https://www.causely.ai/podcast/webinar/what-is-causal-ai-why-do-devops-need-it/?ref=causely-blog.ghost.io">Watch the webinar.</a></li><li>Moving beyond traditional RCA in DevOps: <a href="https://www.causely.ai/blog/moving-beyond-traditional-rca-in-devops/?ref=causely-blog.ghost.io">Read the blog</a>.</li><li>Assure application reliability with Causely: <a href="https://www.causely.ai/video/assure-application-reliability-with-causely/?ref=causely-blog.ghost.io">See the product.</a></li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[What is Causal AI & why do DevOps teams need it?]]></title>
      <link>https://causely.ai/blog/what-is-causal-ai-why-do-devops-need-it</link>
      <guid>https://causely.ai/blog/what-is-causal-ai-why-do-devops-need-it</guid>
      <pubDate>Wed, 01 May 2024 14:53:01 GMT</pubDate>
      <description><![CDATA[Causal AI can help IT and DevOps professionals be more productive, freeing hours of time spent troubleshooting so they can instead focus on building new applications. But when applying Causal AI to IT use cases, there are several domain-specific intricacies…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/05/screenshot-2024-05-01-at-10-46-11-am.png" type="image/png" />
      <content:encoded><![CDATA[<p>Causal AI can help IT and DevOps professionals be more productive, freeing hours of time spent troubleshooting so they can instead focus on building new applications. But when applying Causal AI to IT use cases, there are several domain-specific intricacies that practitioners and developers must be mindful of.</p><p>The relationships between application and infrastructure components are complex and constantly evolving, which means relationships and related entities are dynamically changing too. It’s important not to conflate correlation with causation, or to assume that all application issues stem from infrastructure limitations.</p><p>In this webinar, Endre Sara defines Causal AI, explains what it means for IT, and talks through specific use cases where it can help IT and DevOps practitioners be more efficient.</p><p>We’ll dive into practical implementations, best practices, and lessons learned when applying Causal AI to IT. Viewers will leave with tangible ideas about how Causal AI can help them improve productivity and concrete next steps for getting started.</p><figure class="kg-card kg-embed-card"><iframe title="YouTube video player" src="https://www.youtube.com/embed/pSl4pGCczOU?si=TyQUgd29YTrxTTPW" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe></figure><h2 id="tight-on-time-check-out-these-highlights">Tight on time? Check out these highlights</h2><ul><li><a href="https://www.youtube.com/clip/Ugkx5r5W1AJPHdg6g_SXGRg3Ixihssp0tl3X?ref=causely-blog.ghost.io" rel="noopener">What is root cause and what is it not?</a> Endre defines what we mean by “root cause” and how to know you’ve correctly identified it.</li><li><a href="https://youtube.com/clip/Ugkx79tr_UA9R4la5DcA4smtXFQohZUxuuUa?si=bmu2q2VyiKfE6Y4D&ref=causely-blog.ghost.io" rel="noopener">How do you install Causely?</a> What resources does it demand? Endre shows how easy it is.</li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Building Startup Culture Isn’t Like It Used To Be]]></title>
      <link>https://causely.ai/blog/building-startup-culture-isnt-like-it-used-to-be</link>
      <guid>https://causely.ai/blog/building-startup-culture-isnt-like-it-used-to-be</guid>
      <pubDate>Wed, 24 Apr 2024 18:05:29 GMT</pubDate>
      <description><![CDATA[When does culture get established in a startup? I’d say the company’s DNA is set during the first year or two, and the founding team should do everything possible to make this culture intentional vs a series of disconnected decisions….]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/04/image-from-ios-1-scaled-1.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>When does culture get established in a startup? I’d say the company’s DNA is set during the first year or two, and the founding team should do everything possible to make this culture intentional vs a series of disconnected decisions. Over the years, I’ve seen many great startup cultures that led to successful products and outcomes (and others that were hobbled from the beginning by poor DNA). However, as we plan for our upcoming <a href="https://www.causely.ai/?ref=causely-blog.ghost.io">Causely</a> quarterly team meetup in New York City, I’m struck by how things have changed in culture-building since my previous ventures.</p><h2 id="startup-culture-used-to-happen-organically">Startup culture used to happen organically</h2><p>Back in the day, we took a small office space, gathered the initial team and started designing and building. Our first few months were often in the incubator space at one of our early investors. This was a fun and formative experience, at least until we got big enough to be kicked out (“you’re funded, now get your own space!”). Sitting with a small group crowded around a table and sharing ideas with each other on all topics may not have been very comfortable or even efficient. But it did create a foundational culture based on jokes, stories and decisions we would refer back to for years to come. Also, it established the extremely open and non-hierarchical cultural norms we wanted to encourage as we added people.</p><p>Once we hit initial critical mass and needed more space for breakouts or private discussions, it was off to the Boston real estate market to see what could possibly be both affordable and reasonable for commutes. The more basic the space, the better in many ways, since it emphasized the need to run lean and spend money only on the things that mattered – hiring, engineering tools, early sales and marketing investments, etc. 
But most important was to spend on things that would encourage the team to get to know each other and build trust. Lunches, dinners, parties, local activities were all important, as was having the right snacks, drinks and games in the kitchen area to encourage people to hang out together (it’s amazing how much the snacks matter).</p><figure class="kg-card kg-image-card"><img src="https://www.causely.ai/wp-content/uploads/2024/04/Image-from-iOS-1.jpg" class="kg-image" alt loading="lazy" width="369" height="277"></figure><!--kg-card-begin: html--><span style="font-size: 8pt;"><em>Building culture in a startup requires in-person get-togethers</em></span><!--kg-card-end: html--><h2 id="the-new-normal">The new normal</h2><p>Fast forward to now, post-Covid and all the changes that have occurred in office space and working remotely. Causely is a remote-from-birth company, with people scattered around the US and a couple internationally. I would never have considered building a company this way before Covid, but when <a href="https://www.linkedin.com/in/shmuel-kliger-1a91963/?ref=causely-blog.ghost.io" rel="noopener">Shmuel</a> and I decided to start the company, it just didn’t seem that big an issue anymore. We committed ourselves to making the extra effort required to build culture remotely, and talked about it frequently with the <a href="https://www.causely.ai/about/?ref=causely-blog.ghost.io">early team and investors</a>.</p><!--kg-card-begin: html--><span style="color: #4338a6;">PS: We’re hiring! 
Want to help shape the Causely culture?</span><!--kg-card-end: html--><p><a href="https://www.causely.ai/about/?ref=causely-blog.ghost.io">Check out our open roles.</a></p><p>In my experiences hanging out with the local Boston tech community and hearing stories from other entrepreneurs, I’ve noticed some of the following trends (which I believe are typical for the startup world):</p><ul><li><strong>Most companies have one or more offices</strong> that people come to weekly, but not daily; attendance varies by team and is tied to days of the week that the team is gathering for meetings or planning. Peak days are Tues-Thurs but even then, attendance may vary widely.</li><li><strong>Senior managers show up more frequently</strong> to encourage their teams to come in, but they don’t typically enforce scheduling.</li><li><strong>The office has become more a place to build social and mentoring relationships</strong> and less about getting work done, which may honestly be more efficient from home.</li><li><strong>Employees like to come in, and more junior staff in particular benefit</strong> from in-person interaction with peers and managers, as well as having a separate workspace from their living space. But the flexibility of working remotely is very hard to give up and is something people value.</li><li><strong>Gathering the entire company together</strong> regularly (and smaller groups in local offices/meetups) is much more important than it used to be for creating a company-wide culture and helping people build relationships with others in different teams and functional areas.</li></ul><p>Given this new normal, I’ve been wondering where this takes us for the next generation of startup companies. It matters to me that people have a shared sense of the company’s vision and feel bound to each other on a company mission. 
Without this, joining a startup loses a big element of its appeal and it becomes harder to do the challenging, creative, exhausting and sometimes nutty things it takes to launch and scale. There are only so many hours anyone can spend on Zoom before fatigue sets in. And it’s harder to have the casual and serendipitous exchanges that used to generate new ideas and energize long-running brainstorming discussions.</p><h2 id="know-where-you-want-to-go-before-you-start">Know where you want to go before you start</h2><p>Building culture in the current startup world requires intention. Here are some things I think are critical to doing this well. I would love to hear about things that are working for other entrepreneurs!</p><ol><li><strong>Founders:</strong> spend more time sharing your vision on team calls and 1:1 with new hires – this is the “glue” that holds the company together.</li><li><strong>Managers:</strong> schedule more frequent open-ended, 1:1 calls to chat about what’s on people’s minds and hear ideas on any topic. Leave open blocks of time on your weekly calendar so people can “drop by” for a “visit.”</li><li><strong>Encourage local meetups</strong> as often as practical – make it easy for local teams to get together where and when they want.</li><li><strong>Invest in your all-team meetups</strong>, and make these as fun and engaging as possible. (We’ve tried packed agendas with all-day presentations and realized that this was too much scheduling). 
Leave time for casual hangouts and open discussions while people are working or catching up on email/Slack.</li><li><strong>Do even more sharing</strong> of information about the company updates and priorities – there’s no way for people to hear these informally, so more communications are needed and repetition is good 🙂</li><li><strong>Encourage newer/younger employees</strong> to share their work and ideas with the rest of the team – it’s too easy for them to lack feedback or mentoring and to lose engagement.</li><li><strong>Consider what you will do in an urgent situation</strong> that requires team coordination: simulations and reviews of processes are much more important than in the past.</li></ol><p>There’s no silver bullet to building great company culture, but instead a wide range of approaches that need to be tried and tested iteratively. These approaches also change as the company grows – building cross-functional knowledge and creativity requires all the above but even more leadership by the founders and management team (and a commitment to traveling regularly between locations to share knowledge). Recruiting, already such a critical element of building culture, now has an added dimension: will the person being hired succeed in this particular culture without many of the supporting structures they used to have? Will they thrive and help build bridges between roles and teams?</p><p>It’s easy to lose sight of the overall picture and trends amidst the day-to-day urgency, so it’s important to take a moment when you’re starting the company to actually write down what you want your company culture to be. Then check it as you grow and make updates as you see what’s working and where there are gaps. 
The founding team still sets the direction, but today more explicit and creative efforts are needed to stay on track and create a cultural “mesh” that scales.</p><hr><h2 id="related-reading">Related reading</h2><ul><li><a href="https://www.causely.ai/blog/dont-forget-these-3-things-when-starting-a-cloud-venture/?ref=causely-blog.ghost.io">Don’t forget these 3 things when starting a cloud venture</a></li><li><a href="https://www.causely.ai/blog/why-do-this-startup-thing-all-over-again-our-reasons-for-creating-causely/?ref=causely-blog.ghost.io">Why do this startup thing all over again? Our reasons for creating Causely</a></li><li><a href="https://www.causely.ai/blog/are-you-ready-to-eat-your-own-dogfood/?ref=causely-blog.ghost.io">Are you ready to eat your own dogfood?</a></li></ul>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Assure application reliability with Causely]]></title>
      <link>https://causely.ai/blog/assure-application-reliability-with-causely</link>
      <guid>https://causely.ai/blog/assure-application-reliability-with-causely</guid>
      <pubDate>Mon, 22 Apr 2024 20:43:21 GMT</pubDate>
      <description><![CDATA[In this video, we’ll show how easy it is to continuously assure application reliability using Causely’s causal AI platform.   In a modern production microservices environment, the number of alerts from observability tooling can quickly amount to hundreds or even…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/04/screenshot-2024-04-22-at-4-46-12-pm.png" type="image/png" />
      <content:encoded><![CDATA[<p>In this video, we’ll show how easy it is to continuously assure application reliability using Causely’s causal reasoning platform.</p>
<!--kg-card-begin: html-->
<div style="position: relative; padding-bottom: 64.67065868263472%; height: 0;"><iframe style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" src="https://www.loom.com/embed/393746b718aa4ddaacd7e34796638f6e?sid=31b4c94f-18ed-4429-9bad-ef10e6ba77ef" frameborder="0" allowfullscreen="allowfullscreen"></iframe></div>
<!--kg-card-end: html-->
<p>In a modern production microservices environment, the number of alerts from observability tooling can quickly amount to hundreds or even thousands, and it’s extremely difficult to understand how all these alerts relate to each other and to the actual root cause. At Causely, we believe these overwhelming alerts should be consumed by software, and root cause analysis should be conducted at machine speed.</p><p>Our <a href="https://www.causely.ai/platform/?ref=causely-blog.ghost.io">Causal Reasoning Platform</a> automatically associates active alerts with their root cause, drives remedial actions, and enables review of historical problems as well. This information streamlines post-mortem analysis, frees DevOps time from complex, manual processes, and helps IT teams plan for upcoming changes that will impact their environment.</p><p>Causely installs in minutes and is <a href="https://www.causely.ai/security?ref=causely-blog.ghost.io" rel="noreferrer">SOC 2 compliant</a>. Share your troubleshooting stories below or <a href="https://www.causely.ai/demo/?ref=causely-blog.ghost.io">request a live demo</a> – we’d love to see how Causely can help!</p>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Cause and Effect: Solving the Observability Conundrum]]></title>
      <link>https://causely.ai/blog/cause-and-effect-solving-the-observability-conundrum</link>
      <guid>https://causely.ai/blog/cause-and-effect-solving-the-observability-conundrum</guid>
      <pubDate>Thu, 18 Apr 2024 16:18:38 GMT</pubDate>
      <description><![CDATA[The pressure on application teams has never been greater. Whether for Cloud-Native Apps, Hybrid Cloud, IoT, or other critical business services, these teams are accountable for solving problems quickly and effectively, regardless of growing complexity. The good news? There’s a…]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/04/screenshot-2024-04-30-at-10-45-39-pm.png" type="image/png" />
      <content:encoded><![CDATA[<p>The pressure on application teams has never been greater. Whether for Cloud-Native Apps, Hybrid Cloud, IoT, or other critical business services, these teams are accountable for solving problems quickly and effectively, regardless of growing complexity. The good news? There’s a whole new array of tools and technologies to help enable application monitoring and troubleshooting. Observability vendors are everywhere, and the maturation of machine learning is changing the game. The bad news? It’s still largely up to these teams to put it all together. Check out this episode of InsideAnalysis to learn how Causal AI can solve this challenge. As the name suggests, this technology focuses on extracting signal from the noise of observability streams to dynamically perform root cause analysis and even fix problems automatically.</p><p>Tune in to hear host Eric Kavanagh interview Ellen Rubin of Causely as they explore how this fascinating new technology works.</p><figure class="kg-card kg-embed-card"><iframe title="YouTube video player" src="https://www.youtube.com/embed/NsIcU8KMVTo?si=vUiJ61fImUo2huK-" width="560" height="315" frameborder="0" allowfullscreen="allowfullscreen"></iframe></figure>]]></content:encoded>
    </item>
  
    <item>
      <title><![CDATA[Fools Gold or Future Fixer: Can AI-powered Causality Crack the RCA Code for Cloud Native Applications?]]></title>
      <link>https://causely.ai/blog/fools-gold-or-future-fixer-can-ai-powered-causality-crack-the-rca-code-for-cloud-native-applications</link>
      <guid>https://causely.ai/blog/fools-gold-or-future-fixer-can-ai-powered-causality-crack-the-rca-code-for-cloud-native-applications</guid>
      <pubDate>Mon, 08 Apr 2024 17:37:21 GMT</pubDate>
      <description><![CDATA[Applying AI to determine causality in an automated Root Cause Analysis solution sounds like the Holy Grail. It’s easier said than done.]]></description>
      <author>Causely</author>
      <enclosure url="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/03/image-from-ios.jpg" type="image/jpeg" />
      <content:encoded><![CDATA[<p>The idea of applying AI to determine causality in an automated Root Cause Analysis solution sounds like the Holy Grail, but it’s easier said than done. There’s a lot of misinformation surrounding RCA solutions. This article cuts through the confusion and provides a clear picture. I will outline the essential functionalities needed for automated root cause analysis. Not only will I define these capabilities, but I will also showcase some examples to demonstrate their impact.</p><p>By the end, you’ll have a clearer understanding of what a robust RCA solution powered by causal AI can offer and how it can empower your IT team to better navigate the complexities of your cloud-native environment and, most importantly, dramatically reduce MTTx.</p><h2 id="the-rise-and-fall-of-the-automated-root-cause-analysis-holy-grail">The Rise (and Fall) of the Automated Root Cause Analysis Holy Grail</h2><p>Modern organizations are tethered to technology. IT systems, once monolithic and predictable, have fractured into a dynamic web of cloud-native applications. This shift towards agility and scalability has come at a cost: unprecedented complexity.</p><p>Troubleshooting these intricate ecosystems is a constant struggle for DevOps teams. Pinpointing the root cause of performance issues and malfunctions can feel like navigating a labyrinth – a seemingly endless path of interconnected components, each with the potential to be the culprit.</p><p>For years, automating Root Cause Analysis (RCA) has been the elusive “Holy Grail” for service assurance, as the <a href="https://www.causely.ai/blog/time-to-rethink-devops-economics-the-path-to-sustainable-success/?ref=causely-blog.ghost.io">business consequences</a> of poorly performing systems are undeniable, especially as organizations become increasingly reliant on digital platforms.</p><p>Despite its importance, commercially available solutions for automated RCA remain scarce. 
While some hyperscalers and large enterprises have the resources and capital to attempt to develop in-house solutions to address the challenge (<a href="https://www.capitalone.com/tech/machine-learning/automated-detection-diagnosis-remediation-of-application-failure/?ref=causely-blog.ghost.io" rel="noopener">like Capital One’s example</a>), these capabilities are out of reach for most organizations.</p>
<!--kg-card-begin: html-->
<span style="color: #4338a6;"><strong><em>See how Causely can help your organization eliminate human troubleshooting. <a href="https://www.causely.ai/try?ref=causely-blog.ghost.io">Request a demo</a> of the Causal AI platform.&nbsp;</em></strong></span>
<!--kg-card-end: html-->
<h2 id="beyond-service-status-unraveling-the-cause-and-effect-relations-in-cloud-native-applications">Beyond Service Status: Unraveling the Cause-and-Effect Relations in Cloud Native Applications</h2><p>Highly distributed systems, regardless of technology, are vulnerable to failures that cascade and impact interconnected components. Cloud-native environments, due to their complex web of dependencies, are especially prone to this domino effect. Imagine a single malfunction in a microservice triggering a chain reaction that disrupts related microservices. Similarly, a database issue can ripple outwards, affecting its clients and, in turn, everything that relies on them.</p><p>The same applies to infrastructure services like Kubernetes, Kafka, and RabbitMQ. Problems in these platforms might not always be immediately obvious from the symptoms they cause within their own domain. Furthermore, symptoms manifest themselves within the applications they support. The problem can then propagate further to related applications, creating a situation where the root cause and the symptoms it causes are separated by several layers.</p><p>Although many observability tools offer maps and graphs to visualize infrastructure and application health, these can become overwhelming during service disruptions and outages. While a sea of red icons in a topology map might highlight one or more issues, it fails to illuminate cause-and-effect relationships. Users are then left to decipher the complex interplay of problems and symptoms to work out the root cause. This is even harder when multiple root causes with overlapping symptoms are present.</p><figure class="kg-card kg-image-card"><img src="https://causely-blog.ghost.io/content/images/wp-content/uploads/2024/04/fools-gold-2.jpg" class="kg-image" alt="While topology maps show the status of services, they leave their users to interpret cause &amp; effect" loading="lazy" width="382" height="317"></figure>
<!--kg-card-begin: html-->
<span style="font-size: 8pt;"><em>While topology maps show the status of services, they leave their users to interpret cause &amp; effect</em></span>
<!--kg-card-end: html-->
<p>In addition to topology-based correlation, DevOps teams may also have experience with other types of correlation, including event deduplication, time-based correlation, and path-based analysis, all of which attempt to reduce the noise in observability data. Don’t lose sight of the fact that this is just correlation, not root cause analysis, and correlation does not equal causation. This subject is covered further in a previous article I published, <a href="https://www.causely.ai/blog/unveiling-the-causal-revolution-in-observability/?ref=causely-blog.ghost.io">Unveiling The Causal Revolution in Observability</a>.</p><p>The Holy Grail of troubleshooting lies in understanding causality. Moving beyond topology maps and graphs, we need solutions that represent causality, depicting the complex chains of cause-and-effect relationships with clear lines of responsibility. Precise root cause identification that clearly explains the relationship between root causes and the symptoms they cause, spanning the technology domains that support application service composition, empowers DevOps teams to:</p><ul><li><strong>Accelerate Resolution:</strong> By pinpointing the exact source of the issue and the symptoms it causes, responsible teams are notified instantly and can prioritize fixes based on a clear understanding of the magnitude of the problem. This laser focus translates to faster resolution times.</li><li><strong>Minimize Triage:</strong> Teams managing impacted services are spared the burden of extensive troubleshooting. They can receive immediate notification of the issue’s origin, impact, and ownership, eliminating unnecessary investigation and streamlining recovery.</li><li><strong>Enhance Collaboration:</strong> With a clear understanding of complex chains of cause-and-effect relationships, teams can collaborate more effectively. 
The root cause owner can concentrate on fixing the issue, while impacted service teams can implement mitigating measures to minimize downstream effects.</li><li><strong>Automate Responses:</strong> Understanding cause and effect is also an enabler for automated workflows. This might include automatically notifying relevant teams through collaboration tools, notification systems and the service desk, as well as triggering remedial actions based on the identified problem.</li></ul><h2 id="bringing-this-to-life-with-real-world-examples">Bringing This to Life with Real World Examples</h2><p>The following examples will showcase the concept of causality relations, illustrating the precise relationships between root cause problems and the symptoms they trigger in interrelated components that make up application services.</p><p>This knowledge is crucial for several reasons. First, it allows for targeted notifications. By understanding the cause-and-effect sequences, the right teams can be swiftly alerted when issues arise, enabling faster resolution. Second, service owners impacted by problems can pinpoint the responsible parties. This clarity empowers them to take mitigating actions within their own services whenever possible and not waste time troubleshooting issues that fall outside of their area of responsibility.</p>
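<p>To make the difference between correlation and causal matching concrete, here is a minimal Python sketch. Everything in it – the service names, symptoms, and the causality model itself – is an invented illustration, not Causely’s actual implementation: each hypothetical root cause is mapped to the set of symptoms it would propagate through the dependency graph, and the observed symptoms are scored against those predictions.</p>

```python
# Hypothetical causality model: each candidate root cause maps to the
# symptoms it would propagate through the dependency graph if active.
# All names below are illustrative, not from any real system.
CAUSALITY_MODEL = {
    "pod/cpu_congestion": {"checkout:high_latency", "cart:high_latency"},
    "db/noisy_neighbor": {"checkout:high_latency", "orders:high_error_rate"},
    "producer/bug": {"feed:consumer_lag", "ui:stale_data"},
}

def rank_root_causes(observed: set[str]) -> list[tuple[str, float]]:
    """Score each hypothesis by how well its predicted symptoms match
    the observed ones (Jaccard similarity), best match first."""
    scores = []
    for cause, predicted in CAUSALITY_MODEL.items():
        union = predicted | observed
        score = len(predicted & observed) / len(union) if union else 0.0
        scores.append((cause, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Two services show high latency; the Pod-level CPU congestion
# hypothesis explains both symptoms, so it ranks first.
print(rank_root_causes({"checkout:high_latency", "cart:high_latency"}))
```

<p>The point of the sketch is that hypotheses are ranked by how well they <em>explain</em> the observed symptoms, rather than by how often two alerts happen to fire together – which is exactly what separates causal reasoning from correlation.</p>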
<h3 id="infra-problem-impacting-multiple-services">Infra Problem Impacting Multiple Services</h3>
<p>In this example, CPU congestion in a <a href="https://kubernetes.io/docs/concepts/workloads/pods/?ref=causely-blog.ghost.io" rel="noopener">Kubernetes Pod</a> is the root cause, causing symptoms – high latency – in the application services it is hosting. In turn, this results in high latency in other application services. In this situation, the causal relationships are clearly explained.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/03/Screenshot-2024-03-19-at-6.20.42-PM.png" class="kg-image" alt="" loading="lazy" width="2000" height="1062" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/03/Screenshot-2024-03-19-at-6.20.42-PM.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/03/Screenshot-2024-03-19-at-6.20.42-PM.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/03/Screenshot-2024-03-19-at-6.20.42-PM.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/03/Screenshot-2024-03-19-at-6.20.42-PM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example causality graph showing CPU congestion in Causely</span></figcaption></figure>
<h3 id="a-microservice-hiccup-leads-to-consumer-lag">A Microservice Hiccup Leads to Consumer Lag</h3>
<p>Imagine you’re relying on a real-time data feed, but the information you see is outdated. In this scenario, a bug within a microservice (the data producer) disrupts its ability to send updates. This creates a backlog of events, causing downstream consumers (the services that use the data) to fall behind. As a result, users and customers end up seeing stale data, impacting the overall user experience and potentially leading to inaccurate decisions. Very often, the first time DevOps teams find out about these types of issues is when end users and customers complain about the service experience.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/03/inefficient-locking-1.png" class="kg-image" alt="" loading="lazy" width="2000" height="735" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/03/inefficient-locking-1.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/03/inefficient-locking-1.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/03/inefficient-locking-1.png 1600w, https://causely-blog.ghost.io/content/images/size/w2400/2025/03/inefficient-locking-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example causality graph showing inefficient locking in Causely</span></figcaption></figure>
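<p>As a rough, self-contained illustration of what “falling behind” means (the offsets below are made up, and no real messaging-system API is used), consumer lag per partition is the broker’s latest log-end offset minus the consumer group’s last committed offset; a steadily growing total is the early signal that a producer-side backlog is turning into stale data downstream.</p>

```python
# Illustrative consumer-lag arithmetic with invented offsets; in a real
# deployment these numbers would come from the messaging system itself.

def consumer_lag(log_end_offsets: dict[int, int],
                 committed_offsets: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: latest offset on the broker minus the offset
    the consumer group has committed (0 if nothing committed yet)."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1200, 1: 900}, {0: 1100, 1: 900})
print(lag)                # per-partition backlog
print(sum(lag.values()))  # total events the consumers are behind
```

<p>Watching the trend of that total over time – not a single snapshot – is what distinguishes a transient blip from a producer-side root cause.</p>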
<h3 id="database-problems">Database Problems</h3>
<p>In this example, the clients of a database are experiencing performance issues because one of the clients is issuing queries that are particularly resource-intensive. Symptoms of this include:</p><ul><li><strong>Slow query response times:</strong> Other queries submitted to the database take significantly longer to execute.</li><li><strong>Increased wait times for resources:</strong> Applications using the database experience high error rates as they wait for resources like CPU or disk access that are being heavily utilized by the resource-intensive queries.</li><li><strong>Database connection timeouts:</strong> If the database becomes overloaded due to the resource-intensive queries, applications might experience timeouts when trying to connect.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://causely-blog.ghost.io/content/images/2025/03/millionways-causality-chain.png" class="kg-image" alt="" loading="lazy" width="1862" height="578" srcset="https://causely-blog.ghost.io/content/images/size/w600/2025/03/millionways-causality-chain.png 600w, https://causely-blog.ghost.io/content/images/size/w1000/2025/03/millionways-causality-chain.png 1000w, https://causely-blog.ghost.io/content/images/size/w1600/2025/03/millionways-causality-chain.png 1600w, https://causely-blog.ghost.io/content/images/2025/03/millionways-causality-chain.png 1862w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example causality graph showing noisy neighbor in Causely</span></figcaption></figure><h2 id="summing-up">Summing Up</h2><p>Cloud-native systems bring agility and scalability, but troubleshooting can be a nightmare. Here’s what you need to conquer Root Cause Analysis (RCA) in this complex world:</p><ul><li><strong>Automated Analysis:</strong> Move beyond time-consuming manual RCA. 
Effective solutions automate data collection and analysis to pinpoint cause-and-effect relationships swiftly.</li><li><strong>Causal Reasoning:</strong> Don’t settle for mere correlations. True RCA tools understand causal chains, clearly and accurately explaining “why” things happen and the impact that they have.</li><li><strong>Dynamic Learning:</strong> Cloud-native environments are living ecosystems. RCA solutions must continuously learn and adapt to maintain accuracy as the landscape changes.</li><li><strong>Abstraction:</strong> Cut through the complexity. Effective RCA tools provide a clear view, hiding unnecessary details and highlighting crucial troubleshooting information.</li><li><strong>Time Travel:</strong> Post-incident analysis requires clear explanations. Go back in time to understand why problems occurred and the impact they had.</li><li><strong>Hypothesis:</strong> Understand the impact that degradation or failures in application services and infrastructure will have before they happen.</li></ul><p>These capabilities unlock significant benefits:</p><ul><li><strong>Faster Mean Time to Resolution (MTTR):</strong> Get back to business quickly.</li><li><strong>More Efficient Use of Resources:</strong> Eliminate wasted time chasing the symptoms of problems and get to the root cause immediately.</li><li><strong>Free Up Expert Resources From Troubleshooting:</strong> Empower less specialized teams to take ownership of the work.</li><li><strong>Improved Collaboration:</strong> Foster teamwork because everyone understands the cause-and-effect chain.</li><li><strong>Reduced Costs &amp; Disruptions:</strong> Save money and minimize business interruptions.</li><li><strong>Enhanced Innovation &amp; Employee Satisfaction:</strong> Free up resources for innovation and create a smoother work environment.</li><li><strong>Improved Resilience:</strong> Take action now to prevent problems that could impact application performance and availability in the future.</li></ul><p>If you would like to avoid the glitter of “Fools Gold” and get to the Holy Grail of service assurance with automated Root Cause Analysis, don’t hesitate to reach out to <a href="https://www.linkedin.com/in/andrew-mallaband-88b1b7/?ref=causely-blog.ghost.io" rel="noopener">me</a> directly, or contact the team at <a href="https://www.causely.ai/?ref=causely-blog.ghost.io" rel="noopener">Causely</a> today to discuss your challenges and discover how they can help you.</p>]]></content:encoded>
    </item>
  
  </channel>
</rss>