<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Guides on Grafana Labs</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/</link><description>Recent content in Guides on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/grafana/v12.4/alerting/guides/index.xml" rel="self" type="application/rss+xml"/><item><title>Best practices</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/best-practices/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/best-practices/</guid><content><![CDATA[&lt;h1 id=&#34;alerting-best-practices&#34;&gt;Alerting best practices&lt;/h1&gt;
&lt;p&gt;Designing and configuring an effective alerting system takes time. This guide focuses on building alerting systems that scale with real-world operations.&lt;/p&gt;
&lt;p&gt;The practices described here are intentionally high-level and apply regardless of tooling. Whether you use Prometheus, Grafana Alerting, or another stack, the same constraints apply: complex systems, imperfect signals, and humans on call.&lt;/p&gt;
&lt;p&gt;Alerting is never finished. It evolves with incidents, organizational changes, and the systems it’s meant to protect.&lt;/p&gt;
&lt;h2 id=&#34;prioritize-symptoms-but-dont-ignore-infrastructure-signals&#34;&gt;Prioritize symptoms, but don’t ignore infrastructure signals&lt;/h2&gt;
&lt;p&gt;Alerts should primarily detect user-facing failures, not internal component behavior. Users don&amp;rsquo;t care that a pod restarted; they care when the application is slow or failing. Symptom-based alerts tie directly to user impact.&lt;/p&gt;
&lt;p&gt;Reliability metrics that impact users—latency, errors, availability—are better paging signals than infrastructure events or internal errors.&lt;/p&gt;
&lt;p&gt;That said, infrastructure signals still matter. They can act as early warning indicators and are often useful when alerting maturity is low. A sustained spike in CPU or memory usage might not justify a page, but it can help explain or anticipate symptom-based failures.&lt;/p&gt;
&lt;p&gt;Infrastructure alerts tend to be noisy and are often ignored when treated like paging signals. They are usually better suited for lower-severity channels such as dashboards, alert lists, or non-paging destinations like a dedicated Slack channel, where they can be monitored without interrupting on-call.&lt;/p&gt;
&lt;p&gt;The key is to balance both as your alerting matures. Use infrastructure alerts to support diagnosis and prevention, not as a replacement for symptom-based alerts.&lt;/p&gt;
&lt;h2 id=&#34;escalate-priority-based-on-confidence&#34;&gt;Escalate priority based on confidence&lt;/h2&gt;
&lt;p&gt;Alert priority is often tied to user impact and the urgency to respond, but confidence should determine when escalation is necessary.&lt;/p&gt;
&lt;p&gt;In this context, escalation defines how responders are notified as confidence grows. This can include increasing alert priority, widening notification, paging additional responders, or opening an incident once intervention is clearly required.&lt;/p&gt;
&lt;p&gt;Early signals are often ambiguous, and confidence in a non-transient failure is usually low. Paging too early creates noise; paging too late means users are impacted for longer before anyone acts. A small or sudden increase in latency may not justify immediate action, but it can indicate a failure in progress.&lt;/p&gt;
&lt;p&gt;Confidence increases as signals become stronger or begin to correlate.&lt;/p&gt;
&lt;p&gt;Escalation is justified when issues are sustained or reinforced by multiple signals. For example, high latency combined with a rising error rate, or the same event firing over a sustained period. These patterns reduce the chance of transient noise and increase the likelihood of real impact.&lt;/p&gt;
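&lt;p&gt;As an illustration, an alert condition can require both symptoms at once before escalating (the metric names and thresholds here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Escalate only when high p95 latency and a rising error rate coincide.
(histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) &amp;gt; 2)
and
(sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; 0.05)&lt;/code&gt;&lt;/pre&gt;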
&lt;p&gt;Use confidence in user impact to drive escalation and avoid unnecessary pages.&lt;/p&gt;
&lt;h2 id=&#34;scope-alerts-for-scalability-and-actionability&#34;&gt;Scope alerts for scalability and actionability&lt;/h2&gt;
&lt;p&gt;In distributed systems, avoid creating separate alert rules for every host, service, or endpoint. Instead, define alert rules that scale automatically using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/examples/multi-dimensional-alerts/&#34;&gt;multi-dimensional alert rules&lt;/a&gt;. This reduces rule duplication and allows alerting to scale as the system grows.&lt;/p&gt;
&lt;p&gt;Start simple. Default to a single dimension such as &lt;code&gt;service&lt;/code&gt; or &lt;code&gt;endpoint&lt;/code&gt; to keep alerts manageable. Add dimensions only when they improve actionability. For example, when omitting a dimension like &lt;code&gt;region&lt;/code&gt; hides failures or leaves responders without enough information to act quickly.&lt;/p&gt;
&lt;p&gt;Additional dimensions like &lt;code&gt;region&lt;/code&gt; or &lt;code&gt;instance&lt;/code&gt; can help identify the root cause, but more isn&amp;rsquo;t always better.&lt;/p&gt;
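&lt;p&gt;For example, a single multi-dimensional rule can alert per service instead of requiring one rule for each service (metric and label names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# One rule; one alert instance per service with an error ratio above 5%.
sum by (service) (rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m]))
  / sum by (service) (rate(http_requests_total[5m])) &amp;gt; 0.05&lt;/code&gt;&lt;/pre&gt;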
&lt;h2 id=&#34;design-alerts-for-first-responders-and-clear-actions&#34;&gt;Design alerts for first responders and clear actions&lt;/h2&gt;
&lt;p&gt;Alerts should be designed for the first responder, not the person who created the alert. Anyone on call should be able to understand what&amp;rsquo;s wrong and what to do next without deep knowledge of the system or alert configuration.&lt;/p&gt;
&lt;p&gt;Avoid vague alerts that force responders to spend time figuring out context. Every alert should clearly explain why it exists, what triggered it, and how to investigate. Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/annotation-label/#annotations&#34;&gt;annotations&lt;/a&gt; to link to relevant dashboards and runbooks, which are essential for faster resolution.&lt;/p&gt;
&lt;p&gt;Alerts should indicate a real problem and be actionable, even if the impact is low. Informational alerts add noise without improving reliability.&lt;/p&gt;
&lt;p&gt;If no action is possible, it shouldn&amp;rsquo;t be an alert—consider using a dashboard instead. Over time, alerts behave like technical debt: easy to create, costly to maintain, and hard to remove.&lt;/p&gt;
&lt;p&gt;Review alerts often and remove those that don’t lead to action.&lt;/p&gt;
&lt;h2 id=&#34;alerts-should-have-an-owner-and-system-scope&#34;&gt;Alerts should have an owner and system scope&lt;/h2&gt;
&lt;p&gt;Alerts without ownership are often ignored. Every alert must have an owner: a team responsible for maintaining the alert and responding when it fires.&lt;/p&gt;
&lt;p&gt;Alerts must also define a system scope, such as a service or infrastructure component. Scope provides organizational context and connects alerts with ownership. Defining clear scopes is easier when services are treated as first-class entities, and organizations are built around service ownership.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;/docs/grafana-cloud/alerting-and-irm/service-center/&#34;&gt;Service Center in Grafana Cloud&lt;/a&gt; can help operate a service-oriented view of your system and align alert scope with ownership.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;After scope, ownership, and alert priority are defined, routing determines where alerts go and how they escalate. &lt;strong&gt;Notification routing is as important as the alerts&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Alerts should be delivered to the right team and channel based on priority, ownership, and team workflows. Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/notification-policies/&#34;&gt;notification policies&lt;/a&gt; to define a routing tree that matches the context of your service or scope:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define a parent policy for default routing within the scope.&lt;/li&gt;
&lt;li&gt;Define nested policies for specific cases or higher-priority issues.&lt;/li&gt;
&lt;/ul&gt;
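&lt;p&gt;As a sketch, a provisioned routing tree with a parent policy and a nested higher-priority route might look like this (receiver names and label values are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: 1
policies:
  - orgId: 1
    receiver: checkout-team-slack       # parent: default routing for the scope
    group_by: [alertname]
    routes:
      - receiver: checkout-team-oncall  # nested: page on-call for critical issues
        object_matchers:
          - [&amp;#34;severity&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;critical&amp;#34;]&lt;/code&gt;&lt;/pre&gt;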
&lt;h2 id=&#34;prevent-notification-overload-with-alert-grouping&#34;&gt;Prevent notification overload with alert grouping&lt;/h2&gt;
&lt;p&gt;Without alert grouping, responders can receive many notifications for the same underlying problem.&lt;/p&gt;
&lt;p&gt;For example, a database failure can trigger several alerts at once, such as increased latency, higher error rates, and internal errors. Paging separately for each symptom quickly turns into notification spam, even though there is a single root cause.&lt;/p&gt;
&lt;p&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/group-alert-notifications/&#34;&gt;Notification grouping&lt;/a&gt; consolidates related alerts into a single notification. Instead of receiving multiple pages for the same issue, responders get one alert that represents the incident and includes all related firing alerts.&lt;/p&gt;
&lt;p&gt;Grouping should follow operational boundaries such as service or owner, as defined by notification policies. Downstream or cascading failures should be grouped together so they surface as one issue rather than many.&lt;/p&gt;
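&lt;p&gt;In a notification policy, the grouping keys and timers control how related alerts are batched. A typical configuration, sketched with illustrative values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;group_by: [service]  # one notification per service, not one per alert
group_wait: 30s      # wait for related alerts before the first notification
group_interval: 5m   # batch new alerts that join an existing group
repeat_interval: 4h  # re-send while the group keeps firing&lt;/code&gt;&lt;/pre&gt;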
&lt;h2 id=&#34;mitigate-flapping-alerts&#34;&gt;Mitigate flapping alerts&lt;/h2&gt;
&lt;p&gt;Short-lived failure spikes often trigger alerts that auto-resolve quickly. Alerting on transient failures creates noise and leads responders to ignore them.&lt;/p&gt;
&lt;p&gt;Require issues to persist before alerting. Set a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/#pending-period&#34;&gt;pending period&lt;/a&gt; to define how long a condition must remain true before firing. For example, instead of alerting immediately on a high error rate, require it to stay above the threshold for several minutes.&lt;/p&gt;
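&lt;p&gt;In Prometheus rule syntax, the pending period is the &lt;code&gt;for&lt;/code&gt; field (the expression and threshold are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; 0.05
        for: 5m  # must stay true for 5 minutes before firing&lt;/code&gt;&lt;/pre&gt;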
&lt;p&gt;Also, stabilize alerts by tuning query ranges and aggregations. Using raw data makes alerts sensitive to noise. Instead, evaluate over a time window and aggregate the data to smooth short spikes.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Reacts to transient spikes. Avoid this.
cpu_usage &amp;gt; 90

# Smooth fluctuations.
avg_over_time(cpu_usage[5m]) &amp;gt; 90&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For latency and error-based alerts, percentiles are often more useful than averages:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;quantile_over_time(0.95, http_duration_seconds[5m]) &amp;gt; 3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, avoid rapid resolve-and-fire notifications by using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/#keep-firing-for&#34;&gt;&lt;code&gt;keep_firing_for&lt;/code&gt;&lt;/a&gt; or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#recovery-threshold&#34;&gt;recovery thresholds&lt;/a&gt; to keep alerts active briefly during recovery. Both options reduce flapping and unnecessary notifications.&lt;/p&gt;
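&lt;p&gt;In Prometheus rule syntax, &lt;code&gt;keep_firing_for&lt;/code&gt; is set per rule (available since Prometheus 2.42; the values here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;- alert: HighCPUUsage
  expr: avg_over_time(cpu_usage[5m]) &amp;gt; 90
  for: 5m               # condition must hold before the alert fires
  keep_firing_for: 10m  # stay firing through brief recoveries&lt;/code&gt;&lt;/pre&gt;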
&lt;h2 id=&#34;graduate-symptom-based-alerts-into-slos&#34;&gt;Graduate symptom-based alerts into SLOs&lt;/h2&gt;
&lt;p&gt;When a symptom-based alert fires frequently, it usually indicates a reliability concern that should be measured and managed more deliberately. This is often a sign that the alert could evolve into an &lt;a href=&#34;/docs/grafana-cloud/alerting-and-irm/slo/&#34;&gt;SLO&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Traditional alerts create pressure to react immediately, while error budgets introduce a buffer of time to act, changing how urgency is handled. Alerts can then be defined in terms of error budget burn rate rather than reacting to every minor deviation.&lt;/p&gt;
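&lt;p&gt;For instance, a burn-rate condition divides the observed error ratio by the error budget of the SLO. The sketch below assumes a 99.9% objective (0.1% budget) and uses the common fast-burn multiplier of 14.4 from the SRE literature; metric names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires when the error budget burns 14.4x faster than sustainable.
(
  sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001 &amp;gt; 14.4&lt;/code&gt;&lt;/pre&gt;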
&lt;p&gt;SLOs also align distinct teams around common reliability goals by providing a shared definition of what &amp;ldquo;good&amp;rdquo; looks like. They help consolidate multiple symptom alerts into a single user-facing objective.&lt;/p&gt;
&lt;p&gt;For example, instead of several teams alerting on high latency, a single SLO can be used across teams to capture overall API performance.&lt;/p&gt;
&lt;h2 id=&#34;integrate-alerting-into-incident-post-mortems&#34;&gt;Integrate alerting into incident post-mortems&lt;/h2&gt;
&lt;p&gt;Every incident is an opportunity to improve alerting. After each incident, evaluate whether alerts helped responders act quickly or added unnecessary noise.&lt;/p&gt;
&lt;p&gt;Assess which alerts fired, and how they influenced incident response. Review whether alerts triggered too late, too early, or without enough context, and adjust thresholds, priority, or escalation based on what actually happened.&lt;/p&gt;
&lt;p&gt;Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/configure-notifications/create-silence/&#34;&gt;silences&lt;/a&gt; during active incidents to reduce repeated notifications, but scope them carefully to avoid silencing unrelated alerts.&lt;/p&gt;
&lt;p&gt;Post-mortems should evaluate alerts with root causes and lessons learned. If responders lacked key information during the incident, enrich alerts with additional context, dashboards, or better guidance.&lt;/p&gt;
&lt;h2 id=&#34;alerts-should-be-continuously-improved&#34;&gt;Alerts should be continuously improved&lt;/h2&gt;
&lt;p&gt;Alerting is an iterative process. Alerts that aren’t reviewed and refined lose effectiveness as systems and traffic patterns change.&lt;/p&gt;
&lt;p&gt;Schedule regular reviews of existing alerts. Remove alerts that don’t lead to action, and tune alerts or thresholds that fire too often without providing useful signal. Reduce false positives to combat alert fatigue.&lt;/p&gt;
&lt;p&gt;Prioritize clarity and simplicity in alert design. Simpler alerts are easier to understand, maintain, and trust under pressure. Favor fewer high-quality, actionable alerts over a large number of low-value ones.&lt;/p&gt;
&lt;p&gt;Use dashboards and observability tools for investigation, not alerts.&lt;/p&gt;


]]></content><description>&lt;h1 id="alerting-best-practices">Alerting best practices&lt;/h1>
&lt;p>Designing and configuring an effective alerting system takes time. This guide focuses on building alerting systems that scale with real-world operations.&lt;/p>
&lt;p>The practices described here are intentionally high-level and apply regardless of tooling. Whether you use Prometheus, Grafana Alerting, or another stack, the same constraints apply: complex systems, imperfect signals, and humans on call.&lt;/p></description></item><item><title>Handle connectivity errors in alerts</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/connectivity-errors/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/connectivity-errors/</guid><content><![CDATA[&lt;h1 id=&#34;handle-connectivity-errors-in-alerts&#34;&gt;Handle connectivity errors in alerts&lt;/h1&gt;
&lt;p&gt;Connectivity issues are a common cause of misleading alerts or unnoticed failures.&lt;/p&gt;
&lt;p&gt;There could be a number of reasons for these errors. Maybe your target went offline, or Prometheus couldn&amp;rsquo;t scrape it. Or maybe your alert query failed because its target timed out or the network went down. These situations might look similar, but require different considerations in your alerting setup.&lt;/p&gt;
&lt;p&gt;This guide walks through how to detect and handle these types of failures, whether you&amp;rsquo;re writing alert rules in Prometheus, using Grafana Alerting, or combining both. It covers both availability monitoring and alert query failures, and outlines strategies to improve the reliability of your alerts.&lt;/p&gt;
&lt;h2 id=&#34;understand-connectivity-issues-in-alerts&#34;&gt;Understand connectivity issues in alerts&lt;/h2&gt;
&lt;p&gt;Typically, connectivity issues fall into a few common scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Servers or containers crashed or were shut down.&lt;/li&gt;
&lt;li&gt;Service overload or timeout.&lt;/li&gt;
&lt;li&gt;Misconfigured authentication or incorrect permissions.&lt;/li&gt;
&lt;li&gt;Network issues like DNS problems or ISP outages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When we talk about connectivity errors in alerting, we’re usually referring to one of two use cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Your target is down or unreachable.&lt;/strong&gt;&lt;br /&gt;
The service crashed, the host was down, or a firewall or DNS issue blocked the connection. These are &lt;strong&gt;availability problems&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Your alert query failed.&lt;/strong&gt;&lt;br /&gt;
The alert couldn’t evaluate its query—maybe because the data source timed out or the query was invalid. These are &lt;strong&gt;execution errors&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It helps to separate these cases early, because they behave differently and require different strategies.&lt;/p&gt;
&lt;p&gt;Keep in mind that most alert rules don’t hit the target directly. They query metrics from a monitoring system like Prometheus, which scrapes data from your actual infrastructure or application. That gives us two typical alerting setups where connectivity issues can show up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert rule → Target&lt;/strong&gt;&lt;br /&gt;
For example, an alert rule querying an external data source like a database.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert rule → Prometheus ← Target&lt;/strong&gt;&lt;br /&gt;
More common in observability stacks. For instance, Prometheus scrapes a node or container, and the alert rule queries the metrics later.&lt;/p&gt;
&lt;p&gt;In this second setup, you can run into connectivity issues on either side. If Prometheus fails to scrape the target, your alert rule might not fire, even though something is likely wrong.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;detect-target-availability-with-the-prometheus-up-metric&#34;&gt;Detect target availability with the Prometheus &lt;code&gt;up&lt;/code&gt; metric&lt;/h2&gt;
&lt;p&gt;Prometheus scrapes metrics from its targets regularly, following the &lt;a href=&#34;https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;scrape_interval&lt;/code&gt;&lt;/a&gt; period. The default scrape interval is 60 seconds, a common choice in practice.&lt;/p&gt;
&lt;p&gt;Prometheus provides a built-in metric called &lt;code&gt;up&lt;/code&gt; for every scrape target, a simple indicator of whether the last scrape succeeded:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;up == 1&lt;/code&gt;: Your target is reachable; Prometheus collected the target metrics as expected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;up == 0&lt;/code&gt;: Prometheus couldn&amp;rsquo;t reach your target—indicating possible downtime or network errors.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A typical PromQL expression for an alert rule to detect when a target becomes unreachable is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;However, this alert rule can be noisy, because a single failed scrape fires the alert. To reduce noise, add a delay:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0 for: 5m&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;for&lt;/code&gt; option in Prometheus (or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;pending period&lt;/a&gt; in Grafana) delays the alert until the condition has been true for the full duration.&lt;/p&gt;
&lt;p&gt;In this example, waiting for 5 minutes means a single scrape error won&amp;rsquo;t fire the alert. Since Prometheus scrapes metrics every minute by default, the alert only fires after five consecutive failures.&lt;/p&gt;
&lt;p&gt;However, this kind of &lt;code&gt;up&lt;/code&gt; alert has a few potential pitfalls:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Failures can slip between scrape intervals&lt;/strong&gt;: An outage that starts and ends between two evaluations goes undetected. You could shorten the &lt;code&gt;for&lt;/code&gt; duration, but then transient scrape failures are more likely to trigger false alarms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intermittent recoveries reset the &lt;code&gt;for&lt;/code&gt; timer&lt;/strong&gt;: A single successful scrape resets the alert timer, which masks intermittent outages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Brief connectivity drops are common in real-world environments, so expect some flakiness in &lt;code&gt;up&lt;/code&gt; alerts. For example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Scrape result (&lt;code&gt;up&lt;/code&gt;)&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert rule evaluation&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer starts&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer continues&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer continues&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;03:00 &lt;code&gt;up == 1&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Successful scrape resets timer&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;04:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer starts again&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;05:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No alert yet; timer hasn’t reached the &lt;code&gt;for&lt;/code&gt; duration&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;The longer the period, the more likely this is to happen.&lt;/p&gt;
&lt;p&gt;A single recovery resets the timer, which is why &lt;code&gt;up == 0 for: 5m&lt;/code&gt; can be unreliable. Even if the target is down most of the time, the alert never fires, leaving you unaware of a persistent issue.&lt;/p&gt;
&lt;h3 id=&#34;use-avg_over_time-to-smooth-signal&#34;&gt;Use &lt;code&gt;avg_over_time&lt;/code&gt; to smooth signal&lt;/h3&gt;
&lt;p&gt;One way to work around these issues is to smooth the signal by averaging the &lt;code&gt;up&lt;/code&gt; metric over a similar or longer period:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;avg_over_time(up[10m]) &amp;lt; 0.8&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert rule fires when the target is unreachable for more than 20% of the last 10 minutes, rather than looking for consecutive scrape failures. With a one-minute scrape interval, three or more failed scrapes within the last 10 minutes now trigger the alert.&lt;/p&gt;
&lt;p&gt;Since this query uses a threshold and time window to control accuracy, you can now lower the &lt;code&gt;for&lt;/code&gt; duration (or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;pending period&lt;/a&gt; in Grafana) to something shorter—&lt;code&gt;0m&lt;/code&gt; or &lt;code&gt;1m&lt;/code&gt;—so the alert fires faster.&lt;/p&gt;
&lt;p&gt;This approach gives you more flexibility in detecting real crashes or network issues. As always, adjust the threshold and period based on your noise tolerance and how critical the target is.&lt;/p&gt;
&lt;h3 id=&#34;use-synthetic-checks-to-monitor-external-availability&#34;&gt;Use synthetic checks to monitor external availability&lt;/h3&gt;
&lt;p&gt;Prometheus often runs inside the same network as the targets it monitors. That means Prometheus might be able to reach a target that users outside the network cannot.&lt;/p&gt;
&lt;p&gt;Firewalls, DNS misconfigurations, or other network issues might block public traffic while Prometheus scrapes &lt;code&gt;up&lt;/code&gt; successfully.&lt;/p&gt;
&lt;p&gt;This is where synthetic monitoring helps. Tools like the &lt;a href=&#34;https://github.com/prometheus/blackbox_exporter&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Blackbox Exporter&lt;/a&gt; let you continuously verify whether a service is available and reachable from outside your network—not just internally.&lt;/p&gt;
&lt;p&gt;The Blackbox Exporter exposes the results of these checks as metrics, which Prometheus can scrape like any other target. For example, the &lt;code&gt;probe_success&lt;/code&gt; metric reports whether the probe was able to reach the service. The setup looks like this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alert rules → Prometheus ← Blackbox Exporter (external probe) → Target&lt;/strong&gt;&lt;/p&gt;
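&lt;p&gt;A typical Prometheus scrape configuration for the Blackbox Exporter rewrites each target into a probe parameter (the exporter address and probed URL are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]          # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com   # the service probed from outside
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # where the exporter listens&lt;/code&gt;&lt;/pre&gt;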
&lt;p&gt;To detect when a service isn’t reachable externally, you can define an alert using the &lt;code&gt;probe_success&lt;/code&gt; metric:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;probe_success == 0 for: 5m&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert fires when the probe has failed continuously for 5 minutes—indicating that the service couldn’t be reached from the outside.&lt;/p&gt;
&lt;p&gt;You can then combine internal and external checks to detect connectivity errors more reliably. The following alert fires when the internal scrape fails or the service is externally unreachable:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0 or probe_success == 0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;As with the &lt;code&gt;up&lt;/code&gt; metric, you might want to smooth this out using &lt;code&gt;avg_over_time()&lt;/code&gt; for more robust detection. The smoothed version might look like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;avg_over_time(up[10m]) &amp;lt; 0.8 or avg_over_time(probe_success[10m]) &amp;lt; 0.8&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert fires when Prometheus couldn&amp;rsquo;t scrape the target successfully for more than 20% of the past 10 minutes, or when the external probes have been failing more than 20% of the time. This smoothing technique can be applied to any binary availability signal.&lt;/p&gt;
&lt;h2 id=&#34;manage-offline-hosts&#34;&gt;Manage offline hosts&lt;/h2&gt;
&lt;p&gt;In many setups, Prometheus scrapes multiple hosts under the same target, such as a fleet of servers or containers behind a common job label. It’s common for one host to go offline while the others continue to report metrics normally.&lt;/p&gt;
&lt;p&gt;If your alert only checks the general &lt;code&gt;up&lt;/code&gt; metric without breaking it down by labels (like &lt;code&gt;instance&lt;/code&gt;, &lt;code&gt;host&lt;/code&gt;, or &lt;code&gt;pod&lt;/code&gt;), you might miss when a host stops reporting. For example, an alert that looks only at the aggregated status of all instances will likely fail to catch when individual instances go missing.&lt;/p&gt;
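&lt;p&gt;The difference is visible in the query itself (the job name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Aggregated: fires only when half the fleet is down; one offline host goes unnoticed. Avoid.
avg(up{job=&amp;#34;node&amp;#34;}) &amp;lt; 0.5

# Per-instance: creates one alert instance for each unreachable host.
up{job=&amp;#34;node&amp;#34;} == 0&lt;/code&gt;&lt;/pre&gt;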
&lt;p&gt;This isn&amp;rsquo;t a connectivity error in this context — it’s not that the alert or Prometheus can&amp;rsquo;t reach anything, it’s that one or more specific targets have gone silent. These kinds of problems aren’t caught by &lt;code&gt;up == 0&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;For these cases, see the complementary 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/missing-data/&#34;&gt;guide on handling missing data&lt;/a&gt; — it covers common scenarios where the alert queries return no data at all, or where only some targets stop reporting. These aren&amp;rsquo;t full availability failures or execution errors, but they can still lead to blind spots in alert detection.&lt;/p&gt;
&lt;h2 id=&#34;handle-query-errors-in-grafana-alerting&#34;&gt;Handle query errors in Grafana Alerting&lt;/h2&gt;
&lt;p&gt;Not all connectivity issues come from targets going offline. Sometimes, the alert rule fails when querying its target. These aren’t availability problems—they’re query execution errors: maybe the data source timed out, the network dropped, or the query was invalid.&lt;/p&gt;
&lt;p&gt;These errors lead to broken alerts. But they come from a different part of the stack: between the alert rule and the data source, not between the data source (for example, Prometheus) and its target.&lt;/p&gt;
&lt;p&gt;This difference matters. Availability issues are typically handled using metrics like &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt;, but execution errors require a different setup.&lt;/p&gt;
&lt;p&gt;Grafana Alerting has built-in handling for execution errors, regardless of the data source. That includes Prometheus, and others like Graphite, InfluxDB, PostgreSQL, etc. By default, Grafana Alerting automatically handles query errors so you don’t miss critical failures. When an alert rule fails to execute, Grafana fires a special &lt;code&gt;DatasourceError&lt;/code&gt; alert.&lt;/p&gt;
&lt;p&gt;You can configure this behavior depending on how critical the alert is and on whether you already have other alerts detecting the issue. In 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#modify-the-no-data-or-error-state&#34;&gt;&lt;strong&gt;Configure no data and error handling&lt;/strong&gt;&lt;/a&gt;, click &lt;strong&gt;Alert state if execution error or timeout&lt;/strong&gt;, and choose the desired option for the alert:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Error (default)&lt;/strong&gt;: Triggers a separate &lt;code&gt;DatasourceError&lt;/code&gt; alert. This default ensures alert rules always inform about query errors but can create noise.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alerting&lt;/strong&gt;: Treats the error as if the alert condition is firing. Grafana transitions all existing instances for that rule to the &lt;code&gt;Alerting&lt;/code&gt; state.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal&lt;/strong&gt;: Ignores the query error and transitions all alert instances to the &lt;code&gt;Normal&lt;/code&gt; state. This is useful if the error isn’t critical or if you already have other alerts detecting connectivity issues.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Last State&lt;/strong&gt;: Keeps the previous state until the query succeeds again. Suitable for unstable environments to avoid flapping alerts.&lt;/p&gt;
&lt;figure
      class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
      style=&#34;max-width: 500px;&#34;
      itemprop=&#34;associatedMedia&#34;
      itemscope=&#34;&#34;
      itemtype=&#34;http://schema.org/ImageObject&#34;
    &gt;&lt;a
          class=&#34;lightbox-link&#34;
          href=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;
          itemprop=&#34;contentUrl&#34;
        &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
            class=&#34;lazyload &#34;
            data-src=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;data-srcset=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=320 320w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=550 550w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=750 750w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=900 900w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1040 1040w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1240 1240w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;A screenshot of the `Configure error handling` option in Grafana Alerting.&#34;width=&#34;477&#34;height=&#34;338&#34;/&gt;
          &lt;noscript&gt;
            &lt;img
              src=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;
              alt=&#34;A screenshot of the `Configure error handling` option in Grafana Alerting.&#34;width=&#34;477&#34;height=&#34;338&#34;/&gt;
          &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This applies even when alert rules query Prometheus itself—not just external data sources.&lt;/p&gt;
&lt;h3 id=&#34;design-alerts-for-connectivity-errors&#34;&gt;Design alerts for connectivity errors&lt;/h3&gt;
&lt;p&gt;In practice, start by deciding whether you want to create explicit alert rules — for example, using &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt; — to detect when a target is down or has connectivity issues.&lt;/p&gt;
&lt;p&gt;Then, for each alert rule, choose the error-handling behavior based on whether you already have dedicated connectivity alerts, the stability of the target, and how critical the alert is. Prioritize alerts based on symptom severity rather than just infrastructure signals that might not impact users.&lt;/p&gt;
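&lt;p&gt;For instance, rather than paging on a single failed scrape with &lt;code&gt;up == 0&lt;/code&gt;, you can smooth the signal so that only sustained outages fire (the window, threshold, and &lt;code&gt;job&lt;/code&gt; value here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires only if the target failed most scrapes over the last 5 minutes
avg_over_time(up{job=&amp;#34;my-service&amp;#34;}[5m]) &amp;lt; 0.5&lt;/code&gt;&lt;/pre&gt;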
&lt;h3 id=&#34;reduce-redundant-error-notifications&#34;&gt;Reduce redundant error notifications&lt;/h3&gt;
&lt;p&gt;A single data source error can cause multiple alerts to fire simultaneously, bombarding you with notifications and generating significant noise.&lt;/p&gt;
&lt;p&gt;As described previously, you can control the error-handling behavior for Grafana alerts. The &lt;strong&gt;Keep Last State&lt;/strong&gt; or &lt;strong&gt;Normal&lt;/strong&gt; option prevents alerts from firing and helps avoid redundant alerts, especially for services already covered by &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;When using the default behavior, a single connectivity error will likely trigger multiple &lt;code&gt;DatasourceError&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;These alerts are separate from the original alerts—they’re not just a different state of the original alert. They fire immediately, ignore the pending period, and don’t inherit all the labels. This can catch you off guard if you expect them to behave like the original alerts.&lt;/p&gt;
&lt;p&gt;Treat these alerts differently from the original alerts, and implement dedicated strategies for their notifications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reduce duplicate notifications by grouping &lt;code&gt;DatasourceError&lt;/code&gt; alerts. Use the &lt;code&gt;datasource_uid&lt;/code&gt; label to group errors from the same data source.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Route &lt;code&gt;DatasourceError&lt;/code&gt; alerts separately, sending them to different teams or channels depending on their impact and urgency.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For details on how to configure grouping and routing, refer to 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;handling notifications&lt;/a&gt; and 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#no-data-and-error-alerts&#34;&gt;&lt;code&gt;No Data&lt;/code&gt; and &lt;code&gt;Error&lt;/code&gt; alerts&lt;/a&gt; documentation.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Connectivity issues are among the most common causes of noisy or misleading alerts. This guide covered two distinct types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Availability issues&lt;/strong&gt;, where the target itself is down or unreachable (e.g., due to a crash or network failure).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query execution errors&lt;/strong&gt;, where the alert rule can&amp;rsquo;t reach its data source (e.g., due to timeouts, invalid queries, or data source outages).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These problems come from different parts of your stack, and each requires its own techniques. Prometheus and Grafana allow you to detect them, and combining distinct techniques can make your alerts more resilient.&lt;/p&gt;
&lt;p&gt;With Prometheus, avoid relying solely on &lt;code&gt;up == 0&lt;/code&gt;. Smooth queries to account for intermittent failures, and use synthetic monitoring to detect reachability issues from outside your network.&lt;/p&gt;
&lt;p&gt;In Grafana Alerting, configure error handling explicitly. Not all alerts are equal or have the same urgency. Tune the error-handling behavior based on the reliability and severity of the alerts and whether you already have alerts dedicated to connectivity problems.&lt;/p&gt;
&lt;p&gt;And don’t forget the third case: &lt;strong&gt;missing data&lt;/strong&gt;. If only one host from a fleet silently disappears, you might not get alerted. If you&amp;rsquo;re dealing with individual instances that stopped reporting data, see the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/missing-data/&#34;&gt;Guide on handling missing data&lt;/a&gt; to continue exploring this topic.&lt;/p&gt;
]]></content><description>&lt;h1 id="handle-connectivity-errors-in-alerts">Handle connectivity errors in alerts&lt;/h1>
&lt;p>Connectivity issues are a common cause of misleading alerts or unnoticed failures.&lt;/p>
&lt;p>There could be a number of reasons for these errors. Maybe your target went offline, or Prometheus couldn&amp;rsquo;t scrape it. Or maybe your alert query failed because its target timed out or the network went down. These situations might look similar, but require different considerations in your alerting setup.&lt;/p></description></item><item><title>Handle missing data in Grafana Alerting</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/missing-data/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/missing-data/</guid><content><![CDATA[&lt;h1 id=&#34;handle-missing-data-in-grafana-alerting&#34;&gt;Handle missing data in Grafana Alerting&lt;/h1&gt;
&lt;p&gt;Missing data, which occurs when a target stops reporting metric data, is one of the most common issues when troubleshooting alerts. In cloud-native environments, this happens all the time: pods or nodes scale down to match demand, or an entire job quietly disappears.&lt;/p&gt;
&lt;p&gt;When this happens, alerts won’t fire, and you might not notice the system has stopped reporting.&lt;/p&gt;
&lt;p&gt;Sometimes it&amp;rsquo;s just a lack of data from a few instances. Other times, it&amp;rsquo;s a connectivity issue where the entire target is unreachable.&lt;/p&gt;
&lt;p&gt;This guide covers different scenarios where the underlying data is missing and shows how to design your alerts to act on those cases. If you&amp;rsquo;re troubleshooting an unreachable host or a network failure, see the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;Handle connectivity errors documentation&lt;/a&gt; as well.&lt;/p&gt;
&lt;h2 id=&#34;no-data-vs-missing-series&#34;&gt;No Data vs. Missing Series&lt;/h2&gt;
&lt;p&gt;There are a few common reasons why an instance stops reporting data, similar to 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;connectivity errors&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Host crash: The system is down, and Prometheus stops scraping the target.&lt;/li&gt;
&lt;li&gt;Temporary network failures: Intermittent scrape failures cause data gaps.&lt;/li&gt;
&lt;li&gt;Deployment changes: Decommissioning, Kubernetes pod eviction, or scaling down resources.&lt;/li&gt;
&lt;li&gt;Ephemeral workloads: Metrics intentionally stop reporting.&lt;/li&gt;
&lt;li&gt;And more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first thing to understand is the difference between a query failure (or connectivity error), &lt;em&gt;No Data&lt;/em&gt;, and a &lt;em&gt;Missing Series&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Alert queries often return multiple time series — one per instance, pod, region, or label combination. This is known as a &lt;strong&gt;multi-dimensional alert&lt;/strong&gt;, meaning a single alert rule can trigger multiple alert instances (alerts).&lt;/p&gt;
&lt;p&gt;For example, imagine a recorded metric, &lt;code&gt;http_request_latency_seconds&lt;/code&gt;, that reports request latency in seconds for each region where the application is deployed. The query returns one series per region — for instance, &lt;code&gt;region1&lt;/code&gt; and &lt;code&gt;region2&lt;/code&gt; — and generates only two alert instances. In this scenario, you may experience:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connectivity Error&lt;/strong&gt; if the alert rule query fails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Data&lt;/strong&gt; if the query runs successfully but returns no data at all.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing Series&lt;/strong&gt; if one or more specific series, which previously returned data, are missing, but other series still return data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In both &lt;em&gt;No Data&lt;/em&gt; and &lt;em&gt;Missing Series&lt;/em&gt; cases, the query still technically &amp;ldquo;works&amp;rdquo;, but the alert won’t fire unless you explicitly configure it to handle these situations.&lt;/p&gt;
&lt;p&gt;The following tables illustrate both scenarios using the previous example, with an alert that triggers if the latency exceeds 2 seconds in any region: &lt;code&gt;avg_over_time(http_request_latency_seconds[5m]) &amp;gt; 2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No Data Scenario:&lt;/strong&gt; The query returns no data for any series:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No Data ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No Data ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ No Alert (Silent Failure)&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;&lt;strong&gt;Missing Series Scenario:&lt;/strong&gt; Only a specific series (&lt;code&gt;region2&lt;/code&gt;) disappears:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.6s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Missing Series ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ No Alert (Silent Failure)&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;In both cases, something broke silently.&lt;/p&gt;
&lt;h2 id=&#34;detect-missing-data-in-prometheus&#34;&gt;Detect missing data in Prometheus&lt;/h2&gt;
&lt;p&gt;Prometheus doesn&amp;rsquo;t fire alerts when a query returns no data. It simply assumes there was nothing to report, just as with query errors. Missing data won’t trigger existing alerts unless you explicitly check for it.&lt;/p&gt;
&lt;p&gt;In Prometheus, a common way to catch missing data is to use the &lt;code&gt;absent_over_time&lt;/code&gt; function:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;absent_over_time(http_request_latency_seconds[5m]) == 1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This triggers when all series for &lt;code&gt;http_request_latency_seconds&lt;/code&gt; are absent for 5 minutes — catching the &lt;em&gt;No Data&lt;/em&gt; case when the entire metric disappears.&lt;/p&gt;
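&lt;p&gt;If you prefer a single rule that covers both the threshold and the absence case, one common Prometheus pattern is to combine the two expressions with &lt;code&gt;or&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires on high latency, or when the metric disappears entirely
avg_over_time(http_request_latency_seconds[5m]) &amp;gt; 2
or
absent_over_time(http_request_latency_seconds[5m]) == 1&lt;/code&gt;&lt;/pre&gt;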
&lt;p&gt;However, &lt;code&gt;absent_over_time()&lt;/code&gt; can’t detect which specific series are missing since it doesn’t preserve labels. The alert won’t tell you which series stopped reporting, only that the query returns no data.&lt;/p&gt;
&lt;p&gt;If you want to check for missing data per-region or label, you can specify the label in the alert query as follows:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promQL&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;# Detect missing data in region1
absent_over_time(http_request_latency_seconds{region=&amp;#34;region1&amp;#34;}[5m]) == 1

# Detect missing data in region2
absent_over_time(http_request_latency_seconds{region=&amp;#34;region2&amp;#34;}[5m]) == 1&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But this doesn&amp;rsquo;t scale well. Hard-coding a query for each label set is brittle, especially in dynamic cloud environments where instances can appear or disappear at any time.&lt;/p&gt;
&lt;p&gt;To detect when a specific target has disappeared, see &lt;strong&gt;Evict alert instances for missing series&lt;/strong&gt; below for details on how Grafana handles this case and how to set up detection.&lt;/p&gt;
&lt;h2 id=&#34;manage-no-data-issues-in-grafana-alerts&#34;&gt;Manage No Data issues in Grafana alerts&lt;/h2&gt;
&lt;p&gt;While Prometheus provides functions like &lt;code&gt;absent_over_time()&lt;/code&gt; to detect missing data, not all data sources available to Grafana alerts — such as Graphite, InfluxDB, or PostgreSQL — support a similar function.&lt;/p&gt;
&lt;p&gt;To handle this, Grafana Alerting implements built-in &lt;code&gt;No Data&lt;/code&gt; state logic, so you don’t need to detect missing data with &lt;code&gt;absent_*&lt;/code&gt; queries. Instead, you can configure in the alert rule settings how alerts behave when no data is returned.&lt;/p&gt;
&lt;p&gt;Similar to error handling, Grafana triggers a special &lt;em&gt;No data&lt;/em&gt; alert by default and lets you control this behavior. In 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#modify-the-no-data-or-error-state&#34;&gt;&lt;strong&gt;Configure no data and error handling&lt;/strong&gt;&lt;/a&gt;, click &lt;strong&gt;Alert state if no data or all values are null&lt;/strong&gt;, and choose one of the following options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No Data (default):&lt;/strong&gt; Triggers a new &lt;code&gt;DatasourceNoData&lt;/code&gt; alert, treating &lt;em&gt;No data&lt;/em&gt; as a specific problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alerting:&lt;/strong&gt; Transitions each existing alert instance into the &lt;code&gt;Alerting&lt;/code&gt; state when data disappears.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal:&lt;/strong&gt; Ignores missing data and transitions all instances to the &lt;code&gt;Normal&lt;/code&gt; state. Useful when receiving intermittent data, such as from experimental services, sporadic actions, or periodic reports.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Last State:&lt;/strong&gt; Leaves the alert in its previous state until the data returns. This is common in environments where brief metric gaps happen regularly, like with flaky exporters or noisy environments.&lt;/p&gt;
&lt;figure
      class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
      style=&#34;max-width: 500px;&#34;
      itemprop=&#34;associatedMedia&#34;
      itemscope=&#34;&#34;
      itemtype=&#34;http://schema.org/ImageObject&#34;
    &gt;&lt;a
          class=&#34;lightbox-link&#34;
          href=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;
          itemprop=&#34;contentUrl&#34;
        &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
            class=&#34;lazyload &#34;
            data-src=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;data-srcset=&#34;/media/docs/alerting/alert-rule-configure-no-data.png?w=320 320w, /media/docs/alerting/alert-rule-configure-no-data.png?w=550 550w, /media/docs/alerting/alert-rule-configure-no-data.png?w=750 750w, /media/docs/alerting/alert-rule-configure-no-data.png?w=900 900w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1040 1040w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1240 1240w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;A screenshot of the `Configure no data handling` option in Grafana Alerting.&#34;width=&#34;520&#34;height=&#34;307&#34;/&gt;
          &lt;noscript&gt;
            &lt;img
              src=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;
              alt=&#34;A screenshot of the `Configure no data handling` option in Grafana Alerting.&#34;width=&#34;520&#34;height=&#34;307&#34;/&gt;
          &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;manage-datasourcenodata-notifications&#34;&gt;Manage DatasourceNoData notifications&lt;/h3&gt;
&lt;p&gt;When Grafana triggers a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#no-data-and-error-alerts&#34;&gt;NoData alert&lt;/a&gt;, it creates a distinct alert instance, separate from the original alert instance. These alerts behave differently:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They use a dedicated &lt;code&gt;alertname: DatasourceNoData&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;They don’t inherit all the labels from the original alert instances.&lt;/li&gt;
&lt;li&gt;They trigger immediately, ignoring the pending period.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because of this, &lt;code&gt;DatasourceNoData&lt;/code&gt; alerts might require a dedicated setup to handle their notifications. For general recommendations, see 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/#reduce-redundant-error-notifications&#34;&gt;Reduce redundant error notifications&lt;/a&gt; — similar practices can apply to &lt;em&gt;NoData&lt;/em&gt; alerts.&lt;/p&gt;
&lt;h2 id=&#34;evict-alert-instances-for-missing-series&#34;&gt;Evict alert instances for missing series&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;MissingSeries&lt;/em&gt; occurs when only some series disappear but not all. This case is subtle, but important.&lt;/p&gt;
&lt;p&gt;Grafana marks missing series as 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/stale-alert-instances/&#34;&gt;&lt;strong&gt;stale&lt;/strong&gt;&lt;/a&gt; after two evaluation intervals and triggers the alert instance eviction process. Here’s what happens under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alert instances with missing data keep their last state for two evaluation intervals.&lt;/li&gt;
&lt;li&gt;If the data is still missing after that:
&lt;ul&gt;
&lt;li&gt;Grafana adds the annotation &lt;code&gt;grafana_state_reason: MissingSeries&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The alert instance transitions to the &lt;code&gt;Normal&lt;/code&gt; state.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;resolved notification&lt;/strong&gt; is sent if the alert was previously firing.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;alert instance is removed&lt;/strong&gt; from the Grafana UI.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an alert instance becomes stale, you’ll find it in the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/monitor-status/view-alert-state-history/&#34;&gt;alert history&lt;/a&gt; as &lt;code&gt;Normal (Missing Series)&lt;/code&gt; before it disappears. This table shows the eviction process from the previous example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🟢 No Alerts&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;3s 🔴 &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;3s 🔴 &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🔴🔴 Alert instances triggered for both regions&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.6s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;(MissingSeries)&lt;/code&gt;⚠️ &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt; ️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🔴 Region2 missing, state maintained.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;03:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;(MissingSeries)&lt;/code&gt; &lt;br&gt; &lt;code&gt;Normal&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🟢 &lt;code&gt;region2&lt;/code&gt; was resolved, 📩 notification sent, and instance evicted.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;04:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;—&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢 No Alerts. &lt;code&gt;region2&lt;/code&gt; was evicted.&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;h3 id=&#34;why-doesnt-missingseries-match-no-data-behavior&#34;&gt;Why doesn’t MissingSeries match No Data behavior?&lt;/h3&gt;
&lt;p&gt;In dynamic environments — such as autoscaling groups, ephemeral pods, or spot instances — series naturally come and go. &lt;strong&gt;MissingSeries&lt;/strong&gt; normally signals infrastructure or deployment changes.&lt;/p&gt;
&lt;p&gt;By default, &lt;strong&gt;No Data&lt;/strong&gt; triggers an alert to indicate a potential problem.&lt;/p&gt;
&lt;p&gt;The eviction process for &lt;strong&gt;MissingSeries&lt;/strong&gt; is designed to prevent alert flapping when a pod or instance disappears, reducing alert noise.&lt;/p&gt;
&lt;p&gt;In environments with frequent scale events, prioritize symptom-based alerts over individual infrastructure signals and use aggregate alerts unless you explicitly need to track individual instances.&lt;/p&gt;
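&lt;p&gt;For example, an aggregated version of the earlier latency rule alerts on the overall symptom rather than per region, so individual series appearing or disappearing doesn’t cause flapping (the threshold is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# One alert instance for overall latency, regardless of how many regions report
avg(avg_over_time(http_request_latency_seconds[5m])) &amp;gt; 2&lt;/code&gt;&lt;/pre&gt;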
&lt;h3 id=&#34;handle-missingseries-notifications&#34;&gt;Handle MissingSeries notifications&lt;/h3&gt;
&lt;p&gt;A stale alert instance triggers a &lt;strong&gt;resolved notification&lt;/strong&gt; if it transitions from a firing state (such as &lt;code&gt;Alerting&lt;/code&gt;, &lt;code&gt;No Data&lt;/code&gt;, or &lt;code&gt;Error&lt;/code&gt;) to &lt;code&gt;Normal&lt;/code&gt;. In that case, the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#grafana_state_reason-for-troubleshooting&#34;&gt;&lt;code&gt;grafana_state_reason&lt;/code&gt; annotation&lt;/a&gt; is set to &lt;strong&gt;MissingSeries&lt;/strong&gt;, indicating that the alert wasn’t resolved by recovery but evicted because the series data went missing.&lt;/p&gt;
&lt;p&gt;Recognizing these notifications helps you handle them appropriately. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Display the &lt;code&gt;grafana_state_reason&lt;/code&gt; annotation to clearly identify &lt;strong&gt;MissingSeries&lt;/strong&gt; alerts.&lt;/li&gt;
&lt;li&gt;Or use the &lt;code&gt;grafana_state_reason&lt;/code&gt; annotation to process these alerts differently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also, review these notifications to confirm whether something broke or if the alert was unnecessary. To reduce noise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Silence or mute alerts during planned maintenance or rollouts.&lt;/li&gt;
&lt;li&gt;Adjust alert rules to avoid triggering on series you expect to come and go, and use aggregated alerts instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;detect-missing-series-in-prometheus&#34;&gt;Detect missing series in Prometheus&lt;/h3&gt;
&lt;p&gt;Previously, an example showed how to detect missing data for a specific label, such as &lt;code&gt;region&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promQL&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;# Detect missing data in region1
absent_over_time(http_request_latency_seconds{region=&amp;#34;region1&amp;#34;}[5m]) == 1

# Detect missing data in region2
absent_over_time(http_request_latency_seconds{region=&amp;#34;region2&amp;#34;}[5m]) == 1&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, this approach doesn’t scale well because it requires hardcoding all possible &lt;code&gt;region&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;As an alternative, you can create an alert rule that detects missing series dynamically using the &lt;code&gt;present_over_time&lt;/code&gt; function:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;present_over_time(http_request_latency_seconds{}[24h])
unless
present_over_time(http_request_latency_seconds{}[10m])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, if you want to group by a label such as region:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;group(present_over_time(http_request_latency_seconds{}[24h])) by (region)
unless
group(present_over_time(http_request_latency_seconds{}[10m])) by (region)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query finds regions (or other targets) that were present at any time in the past 24 hours but have not been present in the past 10 minutes. The alert rule then triggers an alert instance for each missing region. You can apply the same technique to any label or target dimension.&lt;/p&gt;
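&lt;p&gt;For example, the same pattern can detect individual scrape targets that disappear. The following sketch is illustrative and assumes a hypothetical &lt;code&gt;node&lt;/code&gt; job, grouping by the &lt;code&gt;instance&lt;/code&gt; label instead of &lt;code&gt;region&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Instances that reported in the past 24 hours
# but not in the past 10 minutes
group(present_over_time(up{job=&amp;#34;node&amp;#34;}[24h])) by (instance)
unless
group(present_over_time(up{job=&amp;#34;node&amp;#34;}[10m])) by (instance)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each result row becomes one alert instance for one missing &lt;code&gt;instance&lt;/code&gt; value, without hardcoding any instance names in the rule.&lt;/p&gt;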
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Missing data isn’t always a failure. It’s a common scenario in dynamic environments when certain targets stop reporting.&lt;/p&gt;
&lt;p&gt;Grafana Alerting handles each of these scenarios differently. Here’s how to think about it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understand &lt;code&gt;DatasourceNoData&lt;/code&gt; and &lt;code&gt;MissingSeries&lt;/code&gt; notifications, since they don’t behave like regular alerts.&lt;/li&gt;
&lt;li&gt;Use Grafana’s &lt;em&gt;No Data&lt;/em&gt; handling options to define what happens when a query returns nothing.&lt;/li&gt;
&lt;li&gt;When a &lt;em&gt;No Data&lt;/em&gt; result is expected and harmless, consider rewriting the query to always return data. For example, in Prometheus, use &lt;code&gt;your_metric_query OR on() vector(0)&lt;/code&gt; to return &lt;code&gt;0&lt;/code&gt; when &lt;code&gt;your_metric_query&lt;/code&gt; returns nothing.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;absent_over_time()&lt;/code&gt; or &lt;code&gt;present_over_time()&lt;/code&gt; in Prometheus to detect when a metric or target disappears.&lt;/li&gt;
&lt;li&gt;If data is frequently missing due to scrape delays, use techniques to account for data delays:
&lt;ul&gt;
&lt;li&gt;Adjust the &lt;strong&gt;Time Range&lt;/strong&gt; query option in Grafana to evaluate slightly behind real time (e.g., set &lt;strong&gt;To&lt;/strong&gt; to &lt;code&gt;now-1m&lt;/code&gt;) to account for late data points.&lt;/li&gt;
&lt;li&gt;In Prometheus, you can use &lt;code&gt;last_over_time(metric_name[10m])&lt;/code&gt; to pick the most recent sample within a given window.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Don’t alert on every instance by default. In dynamic environments, it’s better to aggregate and alert on symptoms — unless a missing individual instance directly impacts users.&lt;/li&gt;
&lt;li&gt;If you’re getting too much noise from disappearing data, consider adjusting alerts, using &lt;code&gt;Keep Last State&lt;/code&gt;, or routing those alerts differently.&lt;/li&gt;
&lt;li&gt;For connectivity issues involving alert query failures, see the sibling guide: 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;Handling connectivity errors in Grafana Alerting&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
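&lt;p&gt;To illustrate two of the techniques above, the fallback-to-zero and scrape-delay patterns can be sketched in PromQL. The metric names here are placeholders, not part of any specific setup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fallback: return 0 instead of No Data
# when the query matches no series
sum(rate(http_requests_total[5m])) OR on() vector(0)

# Tolerate scrape delays: use the most recent sample
# observed within the last 10 minutes
last_over_time(http_request_latency_seconds[10m])&lt;/code&gt;&lt;/pre&gt;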
]]></content><description>&lt;h1 id="handle-missing-data-in-grafana-alerting">Handle missing data in Grafana Alerting&lt;/h1>
&lt;p>Missing data, which occurs when a target stops reporting metrics, is one of the most common issues when troubleshooting alerts. In cloud-native environments, this happens all the time: pods or nodes scale down to match demand, or an entire job quietly disappears.&lt;/p></description></item></channel></rss>