SLO-Based Alerting in OpenObserve

What You’ll Learn (for SREs and Developers):

  1. What Service Level Objectives (SLOs) are: measurable reliability targets
  2. How to define and select meaningful SLOs using SLIs and error budgets
  3. How alerts based on SLOs reduce noise and improve relevance
  4. How to create and automate burn rate alerts in OpenObserve using SQL
  5. Practical SQL examples for error rates, latency, and request throughput

Introduction

Monitoring tools often fire hundreds of alerts for things like high CPU, slow responses, and disk usage. But most of these don’t answer the real question: “Is the user experience impacted?”

That’s what Service Level Objectives (SLOs) are for.

SLOs let you focus on reliability goals tied to real user expectations, not just infrastructure signals. Instead of reacting to everything that could go wrong, you set targets for what must go right, based on your business and real user expectations.

SLO Basics: From Services to Reliability Targets

A service is anything your users rely on, like your website, login system, billing API, or background jobs. Each one is expected to function reliably.

To measure how reliably a service performs, we use Service Level Indicators (SLIs). These are metrics like:

  • Success rate (e.g., % of 2xx responses)
  • Error rate (e.g., % of 5xx)
  • Latency (e.g., 95th percentile response time)
  • Availability (e.g., uptime %)

An SLO is a target you set for an SLI over a time window. For example, "99.9% of requests should succeed over 7 days."
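
For instance, a success-rate SLI can be computed straight from HTTP status codes. The sketch below is illustrative only; it assumes a stream of HTTP spans or logs with an http_status_code field, so swap in your own stream and field names:

-- Request throughput and success rate over the selected time range
SELECT
  COUNT(*) AS total_requests,
  SUM(CASE WHEN http_status_code < 500 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS success_rate_pct
FROM <STREAM_NAME>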

SLOs are reliability targets that guide operational focus. Instead of chasing infrastructure metrics like CPU or memory usage, SLOs help you focus on what matters: whether the service is available, responsive, and not throwing errors.

For example, your database service might show high CPU usage. That alone doesn’t matter to the service owner unless users are seeing slow queries or failed transactions. SLOs let you ignore noisy alerts and focus only when actual user impact is at risk.

Understanding Error Budgets

Every SLO implies an error budget: the small, acceptable margin for failure. For example, if your SLO is 99.9%, you have an error budget of 0.1% of requests failing in a given time window.

This error budget isn’t just for alerts. It’s a decision-making tool:

  • Can we release this new feature?
    → If we’re within budget, we might accept some risk.
  • Should we pause feature work and invest in reliability?
    → If we’re burning the error budget too quickly, yes.
  • Is this operational risk acceptable right now?
    → Depends on how much of the budget is left and how fast we’re consuming it.

You alert not because the SLO is breached, but when you’re burning through the error budget too fast.
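
To make that concrete: a 99.9% SLO over a 7-day window leaves an error budget of 0.1%, roughly the equivalent of 10 minutes of full downtime per week (0.1% of 10,080 minutes). If the service suddenly starts failing 1% of requests, it is burning that budget 10 times faster than allowed and would exhaust the entire week’s budget in under a day. Burn-rate alerting is designed to catch exactly this kind of acceleration.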

Note: An SLO is not an alert by itself; it’s a long-term reliability target. Alerts are derived from how fast or how often you're deviating from that SLO.

Let’s see how SLOs play out with a real-world example.

SLO Workflow in OpenObserve: Demo

Prerequisites

Before we dive into the demo, make sure you have the following set up:

  • A running OpenObserve instance (cloud or self-hosted)
  • Trace data from your service already flowing into OpenObserve (for example, via the OpenTelemetry Collector)
  • An alert destination (such as Slack) configured for notifications

The scenario: Your users are complaining about slow login experiences. Let's build a complete SLO monitoring system in OpenObserve that tracks, alerts, and helps debug latency issues.

We'll walk through the entire workflow:

Define SLO → Build Dashboard → Create Alerts → Debug Issues when at risk.

Where to Define SLOs in OpenObserve?

OpenObserve doesn’t have a built-in SLO object. You define an SLO by writing a SQL query that evaluates whether your service is meeting the target. You can then:

  • Add the query result to a dashboard to track trends
  • Set up a scheduled alert to trigger when the condition fails

In short, the SLO lives in the query and the alert logic you create, giving you full control over how it's defined and enforced.

Step 1: Setting Your Latency SLO

Business Context: Users expect login to feel responsive. Research shows anything above 500ms feels sluggish.

Sample trace data (already flowing into OpenObserve):

Note: OpenTelemetry collects traces in a nested JSON structure (e.g., resource.attributes, scopeSpans.spans). OpenObserve automatically flattens these fields during ingestion for easier querying. The flattening depth is controlled by the environment variable ZO_FLATTEN_LEVEL (default: 3). That’s why in the example below you see simple keys like service_name instead of deeply nested ones.

{
  "trace_id": "abc123abc123abc123abc123abc123ab",
  "span_id": "def456def456def4",
  "operation_name": "POST /login",
  "start_time": 1754664691403452700,
  "end_time": 1754664691409971500,
  "duration": 518600,
  "http_method": "POST",
  "http_status_code": 200,
  "http_url": "http://authservice.local/login",
  "service_name": "auth-service",
  "span_kind": 3,
  "span_status": "UNSET",
  "status_code": 0,
  "status_message": ""
}

Sample traces in OpenObserve UI

SLO: "95th percentile login response time should stay under 500ms over any 7-day rolling window"

(Note: This threshold is just an example. SLOs vary based on business impact, user expectations, and service risk tolerance.)

Why P95? Averages hide problems. P95 tells us 95% of users get a response faster than this threshold, catching tail latency issues that affect real users.

Step 2: Track your SLO

You need to verify if your service is meeting its defined performance targets.

  1. In the OpenObserve UI, go to the Streams section, select your traces stream, and click the Explore icon.
     Exploring Traces streams in OpenObserve

  2. Use the Include term icon to filter by the relevant service, operation name, and status code.
     Filtering in OpenObserve using the Include Term option

  3. After filtering, use a SQL function to calculate the 95th percentile of response time and check whether the target is met.
     Running a SQL query to check SLO compliance in OpenObserve

Note:

  • We're filtering for status_code < 400 because we want to consider only successful or redirected requests (i.e., not client or server errors) when evaluating latency SLOs.
  • The duration field is in microseconds in this case, so we divide by 1000 to convert the result to milliseconds for comparison with the 500ms SLO target.
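
A query along these lines returns the current P95 in milliseconds. This is a minimal sketch that assumes the stream and field names from the sample above; replace <STREAM_NAME> with your traces stream:

-- P95 login latency in milliseconds for successful requests
-- duration is in microseconds here, hence the division by 1000
SELECT
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400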

Here, p95_latency is below 500 ms, so our SLO target is not at risk. But do we need to rerun the query every time we want to check compliance?

Well, no. Instead, we can create a dashboard.

Step 3: Build Your SLO Dashboard

Creating a dashboard helps visualize and observe SLO compliance trends over time without repeating manual checks.

  1. Navigate to OpenObserve → Dashboard and create a new dashboard.
     Creating a new dashboard in OpenObserve

  2. Next, add panels to your dashboard.
     Adding panels to a dashboard in OpenObserve

Panel 1: SLO Status Summary

  • Select Chart type -> Markdown
  • Add a Markdown text with your SLO definition:
## Login Latency SLO

- **Target**: P95 < 500ms over 7 day rolling window
- **Error Budget**: 5% of requests can exceed 500ms
- **Business Impact**: Latency >500ms correlates with 12% higher bounce rate
- **Owner**: Backend Team
- **Escalation**: #backend-oncall

Creating markdown panel in OpenObserve Dashboard

This provides a clear, at-a-glance summary of the SLO, helping teams align on objectives, impact, and ownership.

Panel 2: SLI - Latency Trend Over Time

Create a line chart to capture your latency trends over time.

  • Select the chart type as Line chart
  • Select stream type and stream name
  • Add the timestamp on the x-axis and the P95 of response time (duration) on the y-axis
  • Add filters based on operation_name and status_code
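
If your panel editor supports custom SQL, a sketch like the following (reusing the stream and field assumptions from the earlier query, with an adjustable histogram() interval) plots P95 latency per hour:

-- P95 latency per hour for the login endpoint
SELECT
  histogram(_timestamp, '1 hour') AS time_bucket,
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY time_bucket
ORDER BY time_bucket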

Creating line chart to check SLI trends in OpenObserve Dashboard

Panel 3: Current SLO Compliance

Create a gauge chart which shows real-time compliance with the latency SLO to quickly identify if the service is meeting performance targets.

  • Select Chart type -> Gauge Chart
  • Select stream type and stream name.
  • Filter based on operation name, and status code.
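
One way to feed the gauge, assuming the same stream and fields as before, is the share of requests that meet the 500 ms target:

-- Percentage of login requests completing under 500 ms
SELECT
  SUM(CASE WHEN duration / 1000.0 < 500 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_within_target
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400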

Creating gauge chart to check SLO compliance in OpenObserve Dashboard

Your final dashboard may look something like this:

Sample SLO Dashboard in OpenObserve

You can add more panels and charts based on your needs.

Use the dashboard settings to change the default time range to 7 days, so you don’t have to adjust it manually every time you visit the dashboard:

Dashboard settings in OpenObserve

Pro tip: Set the dashboard to auto-refresh every 5 minutes so it stays current.

So far, the dashboard tells you what's happening, but someone still has to look at it. That doesn't scale.

To truly defend your SLO, you need to:

  • Detect when latency is violating your target
  • Get notified before it impacts users or burns through your error budget

Step 4: Create SLO Alerts

Alerts turn SLO breaches into immediate signals, so your team can act before SLAs (Service Level Agreements) or the user experience is impacted.

Setting SLO Breach Alert

  1. Navigate to OpenObserve → Alerts → Add Alert.
     Adding alerts in OpenObserve

  2. Fill in the Alert Setup details:

    • Give the alert a meaningful name
    • Choose the stream type and select the corresponding stream from the dropdown
    • Set the alert type to Scheduled Alert, since we want to evaluate data aggregated over a 1-hour window instead of triggering alerts in real time

Filling in alert details

  3. Configure the alert settings and select the destination where you want to receive the alert notification.
     Configuring scheduled alert settings in OpenObserve
     The Alerts in OpenObserve documentation provides details on the alert parameters, or you can click the i icon for a quick summary.
     Information about alert configurations in OpenObserve

  4. Next, set the conditions for the alert.

    • Select SQL Mode under the Conditions section on the Alerts page
    • Click View Editor to open the SQL query editor for defining alert conditions
      SQL Mode for alert conditions in OpenObserve
    • Paste the query:
SELECT
  service_name,
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY service_name
HAVING APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) > 500

This query checks whether the 95th percentile latency for the /login endpoint in the auth-service exceeds the SLO threshold of 500 ms. (The duration field here is in microseconds, so we divide by 1000; skip the conversion if your latency field is already in milliseconds.)

  • The HAVING clause returns rows only when the P95 exceeds 500 ms, i.e., only when the SLO is violated, making it ideal for alerting.

You can set the message template as:

🚨 Login Latency SLO BREACH
P95: {p95_latency_ms}ms (target: <500ms)
Dashboard: https://your-openobserve.com/login-slo


Slack alert notifications using OpenObserve Scheduled alerts

Note: While threshold-based alerts (like P95 latency > 500 ms) work well, teams aiming for more resilient and user-centric alerting often use burn rate alerts. Burn rate is the speed at which you're consuming your SLO's error budget, and burn-rate alerts can be tuned separately for fast and slow incidents.
For a deeper understanding, see Google's Site Reliability Workbook (Chapter 6: Alerting on SLOs)
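
As a sketch, reusing the stream and fields from the earlier query and the 5% error budget from the dashboard panel, a fast-burn alert evaluated over a 1-hour window could look like the following. The 14.4 multiplier is the Workbook's example threshold for a 30-day SLO; scale it to your own window and budget.

-- Share of login requests slower than 500 ms, expressed as a burn rate
-- against a 5% error budget; returns a row only on a fast burn (> 14.4x)
SELECT
  service_name,
  SUM(CASE WHEN duration / 1000.0 > 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS slow_fraction,
  (SUM(CASE WHEN duration / 1000.0 > 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)) / 0.05 AS burn_rate
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY service_name
HAVING (SUM(CASE WHEN duration / 1000.0 > 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)) / 0.05 > 14.4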

Step 5: Debug SLO Violations

Once an alert fires, you need to figure out why.

Focus on these four angles:

  • Impact Scope: Is it a tail latency issue (only P95 is bad) or widespread (P50 is also slow)?
  • Who’s Affected: Are certain users, regions, or clients consistently slower?
  • When It Started: Use time charts to find exactly when latency spiked.
  • Likely Cause: Correlate with logs or upstream errors to narrow down root causes.
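
To check the impact scope and pinpoint when the spike began, a query like this, again assuming the earlier stream and fields, compares P50 and P95 in 5-minute buckets:

-- P50 vs P95 latency in 5-minute buckets
SELECT
  histogram(_timestamp, '5 minutes') AS time_bucket,
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.50) AS p50_latency_ms,
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
GROUP BY time_bucket
ORDER BY time_bucket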

Pro tip: Use drilldown to link to dashboards/logs for quick triage when an alert hits.

Drilldowns feature of OpenObserve
Drilldowns options in OpenObserve

Takeaways

SLOs aren’t just theory; they’re your best defense against alert fatigue.

By focusing on what users actually care about, and combining that with OpenObserve’s SQL-powered alerting, you can:

  • Cut down on noisy, one-off threshold alerts
  • Surface only the issues that threaten real reliability
  • Empower devs and SREs to share one clear standard of "good enough"
  • Start small, and scale your SLO coverage as your systems grow

OpenObserve gives you the flexibility to express these goals in code and turn them into actionable alerts.

Get Started with OpenObserve Today!

Sign up for a 14-day cloud trial. Check out our GitHub repository for self-hosting and contribution opportunities.

About the Author

Simran Kumari

LinkedIn

Passionate about observability, AI systems, and cloud-native tools. All in on DevOps and improving the developer experience.
