SLOs: When the Number on Your Dashboard Actually Does Something
Your service is “99.9% available.” The status page says so. So does the SLA your sales team signed. But nobody can tell you what that number actually measures. Nobody knows what happens when it drops below the target. And nobody can say whether the alert that woke your on-call engineer last night had anything to do with it.
That number is a speed limit sign on a road with no speedometer, no police, and no consequences. It exists. It lives in a dashboard. It occasionally surfaces in an executive slide. And nothing changes, because nothing forces anyone to act on it.
- SLOs are engineering tools, not contractual obligations. An SLO without an error budget policy is just a number on a dashboard. A speed limit with no speedometer. The policy is what changes behavior.
- Averages lie about user experience. A service with 200ms average latency and a 4-second p99 tells two completely different reliability stories depending on which metric you report. Average temperature in a room with one end on fire.
- Error budgets make the velocity-versus-reliability tradeoff explicit. A spending account for unreliability. When the budget is healthy, ship fast. When it burns, fix reliability. No debate. No politics.
- Burn rate alerting replaces threshold alerting. Instead of paging when error rate crosses 1%, page when the error budget is being consumed fast enough to run out within hours.
- Product must co-own the target, or the target is fiction. A reliability number set by engineers alone is a wish. One negotiated with product is a commitment.
SLAs, SLOs, SLIs: Three Letters, Three Different Things
Teams use SLA, SLO, and SLI interchangeably, and every conversation goes in circles. Three different concepts, one pile of acronyms.
An SLI (Service Level Indicator) is a measurement. The speedometer. It answers one question: “What does this feel like for the user right now?” What share of requests succeed. What share come back under 300ms. What share of search results are fresh. SLIs are best expressed as a ratio of good events to total events, between 0% and 100%.
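As a minimal sketch (the function name and the zero-traffic edge case are my own choices, not from any particular tooling), an availability SLI reduces to good events over total events:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the availability SLI as a percentage in [0, 100]."""
    if total_events == 0:
        return 100.0  # no traffic: nothing has failed
    return 100.0 * good_events / total_events

# 999,000 successful requests out of 1,000,000 -> a 99.9% SLI
print(availability_sli(999_000, 1_000_000))
```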
An SLO (Service Level Objective) is a target set against an SLI. “99.9% of requests will succeed over a rolling 30-day window.” The SLO defines what “good enough” looks like. Not perfect. Good enough.
An SLA (Service Level Agreement) is a contract. It carries financial consequences. If the vendor breaches the SLA, the customer gets credits, refunds, or the right to terminate. SLAs are negotiated by legal and sales, not engineering.
SLOs must always be stricter than SLAs. If your SLA promises 99.5% and your SLO targets 99.9%, you have a buffer. The internal alarm trips long before the contractual breach. Set them equal and you lose that buffer entirely. By the time anyone notices a problem, the SLA violation has already happened and the credits are owed.
What the Industry Gets Wrong About SLOs
“Our SLA is 99.9%, so our SLO is 99.9%.” An SLO equal to the SLA means there is zero buffer. Every dip in reliability immediately risks contractual penalties. SLOs should be meaningfully stricter than SLAs so that internal mechanisms kick in before external consequences do. If your SLA is 99.9%, your SLO should be at least 99.95%.
“We need five nines.” 99.999% availability means 26 seconds of total downtime per month. A single deployment that takes 30 seconds to roll back has already blown the budget. Most services do not need this, and the engineering cost of achieving it is exponential. Each additional nine typically costs an order of magnitude more in infrastructure and operational complexity. The right target for most user-facing services is 99.9% to 99.95%.
“We measure availability by checking if the server is up.” A synthetic health check that returns 200 OK tells you the process is running. It tells you nothing about whether users can actually complete the actions they came to perform. SLIs must measure user-facing behavior, not infrastructure liveness.
Choosing the Right SLIs
Get the SLI wrong and everything built on it - SLO, error budget, alerting - reflects a world that doesn’t match what users see.
Most teams reach for infrastructure metrics: CPU utilization, memory pressure, disk I/O. Useful for capacity planning. Terrible as SLIs. A service can run at 90% CPU and serve every request happily. It can also idle at 20% while returning errors because a downstream dependency is down.
Good SLIs measure the boundary between your system and the user.
| SLI Type | What It Measures | Good For | Example |
|---|---|---|---|
| Availability | Proportion of successful responses | Request-driven services | 99.9% of HTTP requests return non-5xx |
| Latency | Response time at a percentile | User-facing APIs, search | 99% of requests under 300ms |
| Correctness | Proportion of correct results | Data pipelines, ML inference | 99.99% of pricing calculations match source of truth |
| Freshness | Data age relative to source | Dashboards, search indexes | 99.5% of queries see data less than 5 minutes old |
Don’t: Use average latency as an SLI. A service with 50ms average and 4-second p99 looks healthy on the average but is miserable for the unluckiest 1% of users.
Do: Use percentile-based latency SLIs. p99 latency under 300ms captures the experience of all but the most extreme outliers. Track p50 for typical experience and p99 for worst-case.
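A minimal way to compute percentile SLIs from raw latency samples, using the nearest-rank method in pure Python (production monitoring systems typically approximate percentiles from histograms instead of sorting raw samples):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]          # clamp for p == 0

# Nine fast requests and one 4-second outlier (latencies in ms)
latencies = [120, 95, 180, 4000, 110, 130, 105, 90, 100, 115]
print(percentile(latencies, 50))   # 110  -> typical experience looks fine
print(percentile(latencies, 99))   # 4000 -> the tail tells the real story
```

The average of those samples is 504.5ms, which describes the experience of no actual user.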
Error Budgets: The Decision Framework That Changes Behavior
An SLO without a policy is a number on a dashboard. The error budget gives it teeth.
The math is simple. A 99.9% SLO over a 30-day rolling window means 0.1% allowed failure. In a system serving one million requests per day, that budget is 1,000 failed requests per day, or roughly 30,000 over the window. In terms of downtime, 0.1% of 30 days is 43.2 minutes.
That 43.2 minutes (or 30,000 failed requests) is the error budget. It belongs jointly to product and engineering. Product wants to spend it on velocity. Ship features, run experiments, accept some risk. Engineering wants to conserve it. The budget makes the argument specific: “We have 38 minutes left this month. Do we deploy this risky migration, or wait until the window resets?”
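The arithmetic above, sketched in code (function and field names are illustrative):

```python
def error_budget(slo: float, window_days: int = 30,
                 requests_per_day: int = 1_000_000) -> dict:
    """Translate an SLO into an error budget in requests and downtime minutes."""
    allowed_failure = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return {
        "failed_requests": allowed_failure * requests_per_day * window_days,
        "downtime_minutes": allowed_failure * window_days * 24 * 60,
    }

budget = error_budget(0.999)
print(budget["failed_requests"])   # ~30,000 requests over the 30-day window
print(budget["downtime_minutes"])  # ~43.2 minutes
```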
Error budgets only work when they have teeth. A policy that says “when the budget runs out, we should probably focus on reliability” changes nothing. A policy that says “when the budget drops below 25%, all feature work stops and the team focuses exclusively on reliability until the budget recovers to 50%” changes everything. The difference is enforcement, and enforcement requires product leadership to co-sign the policy before the first incident, not during one.
| Budget Health | Engineering Behavior | Product Behavior |
|---|---|---|
| Healthy (> 50% remaining) | Ship freely, experiment with new architectures | Push features, run A/B tests, tolerate some churn |
| Caution (25-50%) | Require rollback plans, avoid high-risk changes | Prioritize lower-risk features, defer large migrations |
| Critical (< 25%) | Reliability-only sprint, no feature deployments | Defer all feature requests, support reliability work |
| Exhausted (0%) | Full freeze until budget recovers past 50% threshold | Accept timeline delay, communicate externally if needed |
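The table above can be encoded so tooling and humans agree on the current tier. A sketch using the table's thresholds (the function name and exact boundary handling are my own):

```python
def budget_tier(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to the policy tier."""
    if remaining_fraction <= 0.0:
        return "exhausted: full freeze until budget recovers past 50%"
    if remaining_fraction < 0.25:
        return "critical: reliability-only sprint, no feature deployments"
    if remaining_fraction <= 0.50:
        return "caution: require rollback plans, avoid high-risk changes"
    return "healthy: ship freely"

print(budget_tier(0.62))  # healthy tier
print(budget_tier(0.20))  # critical tier
```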
Burn-Rate Alerting: Stop Paging on the Wrong Things
Traditional threshold alerting pages your on-call when error rate crosses a fixed boundary. Error rate above 1%? Page. Latency above 500ms? Page. But a 1.5% error rate for three seconds is not the same as 1.5% for three hours. The threshold alert fires identically for both.
Burn-rate alerting asks a different question: “At the current rate of errors, how fast is the error budget being consumed?” A burn rate of 1x means you will exactly exhaust the budget at the end of the 30-day window. Nothing to worry about. A burn rate of 14.4x means the budget will be gone in roughly two days. That warrants attention. A burn rate of 720x means the budget burns out in one hour. Page immediately.
The Google SRE Workbook formalizes this into multi-window burn-rate alerts. A fast window (short lookback, say 5 minutes) catches rapid budget consumption. A slow window (longer lookback, say 60 minutes) filters out transient spikes that resolve on their own. Both conditions must be true to fire.
```yaml
# Multi-window burn-rate alert configuration
# Pages only when BOTH windows confirm the burn
alerts:
  - name: "high-burn-rate-page"
    # 14.4x burn rate = budget exhausted in ~2 days
    condition:
      fast_window: 5m        # Recent error rate over 5 minutes
      fast_threshold: 14.4
      slow_window: 60m       # Sustained error rate over 1 hour
      slow_threshold: 14.4
    severity: page           # Wake someone up
  - name: "medium-burn-rate-ticket"
    # 3x burn rate = budget exhausted in ~10 days
    condition:
      fast_window: 30m
      fast_threshold: 3
      slow_window: 6h
      slow_threshold: 3
    severity: ticket         # File a ticket, don't page
```
Fewer false pages, faster detection of real incidents, and on-call rotations that don’t burn people out chasing phantoms. Burn-rate alerting kills most of the noisy threshold alerts and catches genuine problems sooner. Monitoring built on burn rates gives on-call engineers something they rarely have: quiet nights that actually mean everything is fine.
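The two-window AND condition in the config above is cheap to evaluate. This sketch assumes you can query the observed error rate over each lookback window; the function name and parameters are illustrative:

```python
def should_page(err_fast: float, err_slow: float, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Fire only when BOTH lookback windows exceed the burn-rate threshold."""
    allowed = 1.0 - slo  # allowed error rate, e.g. 0.001
    return (err_fast / allowed >= threshold and
            err_slow / allowed >= threshold)

# A brief spike: fast window hot, slow window calm -> no page
print(should_page(err_fast=0.05, err_slow=0.004))  # False
# Sustained outage: both windows confirm the burn -> page
print(should_page(err_fast=0.05, err_slow=0.02))   # True
```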
How to Calculate Burn Rate from SLO Parameters
The burn rate formula is: burn_rate = (error_rate_observed / error_rate_allowed).
For a 99.9% SLO, the allowed error rate is 0.1% (0.001). If the observed error rate over the fast window is 1.44% (0.0144), the burn rate is 0.0144 / 0.001 = 14.4x.
To determine the time until budget exhaustion: time_remaining = slo_window / burn_rate. At 14.4x burn rate with a 30-day window: 30 / 14.4 = ~2.08 days. This gives the on-call engineer a concrete timeline for how urgent the response needs to be.
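The two formulas above, sketched in code (names are my own):

```python
def burn_rate(observed_error_rate: float, allowed_error_rate: float) -> float:
    """How many times faster than sustainable the budget is burning."""
    return observed_error_rate / allowed_error_rate

def days_until_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """At the current burn rate, days until the budget is fully spent."""
    return window_days / rate

rate = burn_rate(0.0144, 0.001)
print(rate)                          # ~14.4
print(days_until_exhaustion(rate))   # ~2.08 days
```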
Common burn-rate thresholds:
- 14.4x (budget gone in ~2 days): page immediately
- 6x (budget gone in ~5 days): page during business hours
- 3x (budget gone in ~10 days): ticket for next sprint
- 1x (budget tracks exactly to exhaustion): informational, no action
Multi-Signal SLOs
A service can return 100% successful responses, all of them taking 10 seconds. Availability SLO: met. User experience: terrible.
Real reliability means looking at multiple SLIs together. Not a single number (weighted averages hide problems the same way averaging hides latency outliers) but a set of SLOs where all must be met at the same time.
A request that succeeds but takes 4 seconds counts against the latency budget. Fast and successful but returns stale data? Correctness budget. Each SLI has its own budget, and the most constrained budget drives the team’s priorities.
Most services need two to four SLOs. More than that and the signal becomes noise. The typical starting set:
- Availability: proportion of requests that don’t return server errors
- Latency: response time at p99 (or p95 for less latency-sensitive services)
- Correctness (if applicable): data accuracy, computation correctness, consistency with source of truth
Freshness is the fourth dimension, relevant for search indexes, dashboards, and data pipelines where staleness is a distinct failure mode from incorrectness.
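One way to express “all SLOs must hold at once” in code. The dataclass and the SLO names are illustrative; the point is the `all(...)` semantics rather than any weighted average:

```python
from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float    # e.g. 0.999
    observed: float  # measured SLI over the window

    def met(self) -> bool:
        return self.observed >= self.target

def service_healthy(slos: list[Slo]) -> bool:
    """A multi-signal SLO holds only when every component SLO holds."""
    return all(s.met() for s in slos)

slos = [
    Slo("availability", target=0.999, observed=0.9995),
    Slo("latency_p99_under_300ms", target=0.99, observed=0.981),
    Slo("freshness_under_5m", target=0.995, observed=0.997),
]
print(service_healthy(slos))                  # False: the latency SLO missed
print([s.name for s in slos if not s.met()])  # the most constrained budget
```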
The Organizational Half
SLOs are as much organizational as technical. Perfect instrumentation and elegant alerts mean nothing if nobody changes what they do when the budget runs low. A readiness checklist:
- Product and engineering leadership have jointly agreed on SLO targets for the top three user-facing services
- Error budget policy document exists with explicit actions for each budget threshold
- SLO dashboards are visible to both product and engineering teams
- On-call rotation has access to error budget status in their incident response tooling
- At least one retrospective has been conducted after an error budget breach
The hardest organizational shift: product managers must accept that error budgets are not engineering’s problem alone. When the budget is healthy, engineering ships fast and product benefits. When the budget burns, product pays by deferring features. This is the deal. It only works if product co-owns the target from the beginning.
Organizations with mature SRE practices embed SLO reviews into sprint planning. Error budget status goes on the agenda before the feature backlog. Budget below threshold? Reliability sprint. No negotiation.
When SLOs Don’t Apply
Not every system needs a formal SLO. The overhead of defining SLIs, setting targets, building dashboards, configuring burn-rate alerts, and writing error budget policies is real. For systems where it’s not worth the overhead, simpler approaches work fine.
| SLOs add value | Simpler monitoring is fine |
|---|---|
| User-facing services with external customers | Internal batch jobs with retry logic |
| Services in the critical path of revenue transactions | Dev/staging environments |
| Platform services consumed by many internal teams | One-off migration scripts |
| Services with contractual SLA obligations | Services with a single team as the only consumer |
The test: if a reliability problem in this service would change someone’s behavior (halt a deployment, trigger an incident, escalate to leadership), an SLO formalizes that decision trigger. If a failure just means “wait and retry,” a basic health check suffices.
Getting Started: The 30-Day Playbook
SLO adoption does not require new tooling. Most observability stacks already collect the data needed for SLIs. The gap is usually in the framing, not the instrumentation.
Week 1: Pick three services and define SLIs. Instrument availability and latency at the edge. For each SLI, record a baseline over seven days. Do not set targets yet. Measure first.
Week 2: Set SLO targets based on baseline data. If your availability SLI shows 99.95% over the baseline week, a 99.9% SLO gives you breathing room. If your p99 latency is 280ms, a 300ms target is too tight. Set it at 400ms and tighten later as you improve the system.
Week 3: Build error budget dashboards and configure burn-rate alerts. Show remaining budget as a percentage and as absolute time/requests. Configure a 14.4x burn-rate page alert and a 3x ticket alert. Disable or mute the threshold alerts they replace.
Week 4: Write the error budget policy and get product sign-off. Define what happens at each budget threshold. Present it as a joint commitment, not an engineering diktat. Run a tabletop exercise: “The budget just hit 20%. What happens next?”
The SLO will be wrong on the first attempt. Too tight, too loose, missing a failure mode. The policy thresholds will fire at the wrong times. This is expected. SLOs are calibrated through experience, not designed in a vacuum.
Same status page. Same “99.9% available” claim. But now it measures what users actually experience, feeds an error budget the team tracks weekly, and drives alerts that fire on genuine problems while staying quiet during transient blips. When the budget runs low, the conversation shifts from “should we prioritize reliability?” to “the policy says we do.” That number on the dashboard stopped being decorative. It started doing something.