<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Guides on Grafana Labs</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/</link><description>Recent content in Guides on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/grafana/v12.4/alerting/guides/index.xml" rel="self" type="application/rss+xml"/><item><title>Best practices</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/best-practices/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/best-practices/</guid><content><![CDATA[&lt;h1 id=&#34;alerting-best-practices&#34;&gt;Alerting best practices&lt;/h1&gt;
&lt;p&gt;Designing and configuring an effective alerting system takes time. This guide focuses on building alerting systems that scale with real-world operations.&lt;/p&gt;
&lt;p&gt;The practices described here are intentionally high-level and apply regardless of tooling. Whether you use Prometheus, Grafana Alerting, or another stack, the same constraints apply: complex systems, imperfect signals, and humans on call.&lt;/p&gt;
&lt;p&gt;Alerting is never finished. It evolves with incidents, organizational changes, and the systems it’s meant to protect.&lt;/p&gt;
&lt;h2 id=&#34;prioritize-symptoms-but-dont-ignore-infrastructure-signals&#34;&gt;Prioritize symptoms, but don’t ignore infrastructure signals&lt;/h2&gt;
&lt;p&gt;Alerts should primarily detect user-facing failures, not internal component behavior. Users don&amp;rsquo;t care that a pod restarted; they care when the application is slow or failing. Symptom-based alerts tie directly to user impact.&lt;/p&gt;
&lt;p&gt;Reliability metrics that impact users—latency, errors, availability—are better paging signals than infrastructure events or internal errors.&lt;/p&gt;
&lt;p&gt;That said, infrastructure signals still matter. They can act as early warning indicators and are often useful when alerting maturity is low. A sustained spike in CPU or memory usage might not justify a page, but it can help explain or anticipate symptom-based failures.&lt;/p&gt;
&lt;p&gt;Infrastructure alerts tend to be noisy and are often ignored when treated like paging signals. They are usually better suited for lower-severity channels such as dashboards, alert lists, or non-paging destinations like a dedicated Slack channel, where they can be monitored without interrupting on-call.&lt;/p&gt;
&lt;p&gt;The key is to balance both as your alerting matures. Use infrastructure alerts to support diagnosis and prevention, not as a replacement for symptom-based alerts.&lt;/p&gt;
&lt;h2 id=&#34;escalate-priority-based-on-confidence&#34;&gt;Escalate priority based on confidence&lt;/h2&gt;
&lt;p&gt;Alert priority is often tied to user impact and the urgency to respond, but confidence should determine when escalation is necessary.&lt;/p&gt;
&lt;p&gt;In this context, escalation defines how responders are notified as confidence grows. This can include increasing alert priority, widening notification, paging additional responders, or opening an incident once intervention is clearly required.&lt;/p&gt;
&lt;p&gt;Early signals are often ambiguous, and confidence in a non-transient failure is usually low. Paging too early creates noise; paging too late means users are impacted for longer before anyone acts. A small or sudden increase in latency may not justify immediate action, but it can indicate a failure in progress.&lt;/p&gt;
&lt;p&gt;Confidence increases as signals become stronger or begin to correlate.&lt;/p&gt;
&lt;p&gt;Escalation is justified when issues are sustained or reinforced by multiple signals. For example, high latency combined with a rising error rate, or the same event firing over a sustained period. These patterns reduce the chance of transient noise and increase the likelihood of real impact.&lt;/p&gt;
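&lt;p&gt;As an illustration, an alert condition can require both symptoms at once before escalating (the metric names and thresholds here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Escalate only when high p95 latency and a rising error rate coincide.
(histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) &amp;gt; 2)
and
(sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; 0.05)&lt;/code&gt;&lt;/pre&gt;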
&lt;p&gt;Use confidence in user impact to drive escalation and avoid unnecessary pages.&lt;/p&gt;
&lt;h2 id=&#34;scope-alerts-for-scalability-and-actionability&#34;&gt;Scope alerts for scalability and actionability&lt;/h2&gt;
&lt;p&gt;In distributed systems, avoid creating separate alert rules for every host, service, or endpoint. Instead, define alert rules that scale automatically using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/examples/multi-dimensional-alerts/&#34;&gt;multi-dimensional alert rules&lt;/a&gt;. This reduces rule duplication and allows alerting to scale as the system grows.&lt;/p&gt;
&lt;p&gt;Start simple. Default to a single dimension such as &lt;code&gt;service&lt;/code&gt; or &lt;code&gt;endpoint&lt;/code&gt; to keep alerts manageable. Add dimensions only when they improve actionability. For example, when omitting a dimension like &lt;code&gt;region&lt;/code&gt; hides failures or leaves responders without enough information to act quickly.&lt;/p&gt;
&lt;p&gt;Additional dimensions like &lt;code&gt;region&lt;/code&gt; or &lt;code&gt;instance&lt;/code&gt; can help identify the root cause, but more isn&amp;rsquo;t always better.&lt;/p&gt;
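&lt;p&gt;For example, a single multi-dimensional rule can alert per service instead of requiring one rule for each service (metric and label names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# One rule; one alert instance per service with an error ratio above 5%.
sum by (service) (rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m]))
  / sum by (service) (rate(http_requests_total[5m])) &amp;gt; 0.05&lt;/code&gt;&lt;/pre&gt;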
&lt;h2 id=&#34;design-alerts-for-first-responders-and-clear-actions&#34;&gt;Design alerts for first responders and clear actions&lt;/h2&gt;
&lt;p&gt;Alerts should be designed for the first responder, not the person who created the alert. Anyone on call should be able to understand what&amp;rsquo;s wrong and what to do next without deep knowledge of the system or alert configuration.&lt;/p&gt;
&lt;p&gt;Avoid vague alerts that force responders to spend time figuring out context. Every alert should clearly explain why it exists, what triggered it, and how to investigate. Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/annotation-label/#annotations&#34;&gt;annotations&lt;/a&gt; to link to relevant dashboards and runbooks, which are essential for faster resolution.&lt;/p&gt;
&lt;p&gt;Alerts should indicate a real problem and be actionable, even if the impact is low. Informational alerts add noise without improving reliability.&lt;/p&gt;
&lt;p&gt;If no action is possible, it shouldn&amp;rsquo;t be an alert—consider using a dashboard instead. Over time, alerts behave like technical debt: easy to create, costly to maintain, and hard to remove.&lt;/p&gt;
&lt;p&gt;Review alerts often and remove those that don’t lead to action.&lt;/p&gt;
&lt;h2 id=&#34;alerts-should-have-an-owner-and-system-scope&#34;&gt;Alerts should have an owner and system scope&lt;/h2&gt;
&lt;p&gt;Alerts without ownership are often ignored. Every alert must have an owner: a team responsible for maintaining the alert and responding when it fires.&lt;/p&gt;
&lt;p&gt;Alerts must also define a system scope, such as a service or infrastructure component. Scope provides organizational context and connects alerts with ownership. Defining clear scopes is easier when services are treated as first-class entities, and organizations are built around service ownership.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;/docs/grafana-cloud/alerting-and-irm/service-center/&#34;&gt;Service Center in Grafana Cloud&lt;/a&gt; can help operate a service-oriented view of your system and align alert scope with ownership.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;After scope, ownership, and alert priority are defined, routing determines where alerts go and how they escalate. &lt;strong&gt;Notification routing is as important as the alerts&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Alerts should be delivered to the right team and channel based on priority, ownership, and team workflows. Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/notification-policies/&#34;&gt;notification policies&lt;/a&gt; to define a routing tree that matches the context of your service or scope:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define a parent policy for default routing within the scope.&lt;/li&gt;
&lt;li&gt;Define nested policies for specific cases or higher-priority issues.&lt;/li&gt;
&lt;/ul&gt;
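&lt;p&gt;As a sketch, a provisioned routing tree with a parent policy and a nested higher-priority route might look like this (receiver names and label values are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: 1
policies:
  - orgId: 1
    receiver: checkout-team-slack       # parent: default routing for the scope
    group_by: [alertname]
    routes:
      - receiver: checkout-team-oncall  # nested: page on-call for critical issues
        object_matchers:
          - [&amp;#34;severity&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;critical&amp;#34;]&lt;/code&gt;&lt;/pre&gt;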
&lt;h2 id=&#34;prevent-notification-overload-with-alert-grouping&#34;&gt;Prevent notification overload with alert grouping&lt;/h2&gt;
&lt;p&gt;Without alert grouping, responders can receive many notifications for the same underlying problem.&lt;/p&gt;
&lt;p&gt;For example, a database failure can trigger several alerts at once, such as increased latency, higher error rates, and internal errors. Paging separately for each symptom quickly turns into notification spam, even though there is a single root cause.&lt;/p&gt;
&lt;p&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/group-alert-notifications/&#34;&gt;Notification grouping&lt;/a&gt; consolidates related alerts into a single notification. Instead of receiving multiple pages for the same issue, responders get one alert that represents the incident and includes all related firing alerts.&lt;/p&gt;
&lt;p&gt;Grouping should follow operational boundaries such as service or owner, as defined by notification policies. Downstream or cascading failures should be grouped together so they surface as one issue rather than many.&lt;/p&gt;
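&lt;p&gt;In a notification policy, the grouping keys and timers control how related alerts are batched. A typical configuration, sketched with illustrative values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;group_by: [service]  # one notification per service, not one per alert
group_wait: 30s      # wait for related alerts before the first notification
group_interval: 5m   # batch new alerts that join an existing group
repeat_interval: 4h  # re-send while the group keeps firing&lt;/code&gt;&lt;/pre&gt;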
&lt;h2 id=&#34;mitigate-flapping-alerts&#34;&gt;Mitigate flapping alerts&lt;/h2&gt;
&lt;p&gt;Short-lived failure spikes often trigger alerts that auto-resolve quickly. Alerting on transient failures creates noise and leads responders to ignore them.&lt;/p&gt;
&lt;p&gt;Require issues to persist before alerting. Set a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/#pending-period&#34;&gt;pending period&lt;/a&gt; to define how long a condition must remain true before firing. For example, instead of alerting immediately on a high error rate, require it to stay above the threshold for several minutes.&lt;/p&gt;
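&lt;p&gt;In Prometheus rule syntax, the pending period is the &lt;code&gt;for&lt;/code&gt; field (the expression and threshold are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; 0.05
        for: 5m  # must stay true for 5 minutes before firing&lt;/code&gt;&lt;/pre&gt;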
&lt;p&gt;Also, stabilize alerts by tuning query ranges and aggregations. Using raw data makes alerts sensitive to noise. Instead, evaluate over a time window and aggregate the data to smooth short spikes.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Reacts to transient spikes. Avoid this.
cpu_usage &amp;gt; 90

# Smooth fluctuations.
avg_over_time(cpu_usage[5m]) &amp;gt; 90&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For latency and error-based alerts, percentiles are often more useful than averages:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;quantile_over_time(0.95, http_duration_seconds[5m]) &amp;gt; 3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, avoid rapid resolve-and-fire notifications by using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/#keep-firing-for&#34;&gt;&lt;code&gt;keep_firing_for&lt;/code&gt;&lt;/a&gt; or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#recovery-threshold&#34;&gt;recovery thresholds&lt;/a&gt; to keep alerts active briefly during recovery. Both options reduce flapping and unnecessary notifications.&lt;/p&gt;
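&lt;p&gt;In Prometheus rule syntax, &lt;code&gt;keep_firing_for&lt;/code&gt; is set per rule (available since Prometheus 2.42; the values here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;- alert: HighCPUUsage
  expr: avg_over_time(cpu_usage[5m]) &amp;gt; 90
  for: 5m               # condition must hold before the alert fires
  keep_firing_for: 10m  # stay firing through brief recoveries&lt;/code&gt;&lt;/pre&gt;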
&lt;h2 id=&#34;graduate-symptom-based-alerts-into-slos&#34;&gt;Graduate symptom-based alerts into SLOs&lt;/h2&gt;
&lt;p&gt;When a symptom-based alert fires frequently, it usually indicates a reliability concern that should be measured and managed more deliberately. This is often a sign that the alert could evolve into an &lt;a href=&#34;/docs/grafana-cloud/alerting-and-irm/slo/&#34;&gt;SLO&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Traditional alerts create pressure to react immediately, while error budgets introduce a buffer of time to act, changing how urgency is handled. Alerts can then be defined in terms of error budget burn rate rather than reacting to every minor deviation.&lt;/p&gt;
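&lt;p&gt;For instance, a burn-rate condition divides the observed error ratio by the error budget of the SLO. The sketch below assumes a 99.9% objective (0.1% budget) and uses the common fast-burn multiplier of 14.4 from the SRE literature; metric names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires when the error budget burns 14.4x faster than sustainable.
(
  sum(rate(http_requests_total{status=~&amp;#34;5..&amp;#34;}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001 &amp;gt; 14.4&lt;/code&gt;&lt;/pre&gt;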
&lt;p&gt;SLOs also align distinct teams around common reliability goals by providing a shared definition of what &amp;ldquo;good&amp;rdquo; looks like. They help consolidate multiple symptom alerts into a single user-facing objective.&lt;/p&gt;
&lt;p&gt;For example, instead of several teams alerting on high latency, a single SLO can be used across teams to capture overall API performance.&lt;/p&gt;
&lt;h2 id=&#34;integrate-alerting-into-incident-post-mortems&#34;&gt;Integrate alerting into incident post-mortems&lt;/h2&gt;
&lt;p&gt;Every incident is an opportunity to improve alerting. After each incident, evaluate whether alerts helped responders act quickly or added unnecessary noise.&lt;/p&gt;
&lt;p&gt;Assess which alerts fired, and how they influenced incident response. Review whether alerts triggered too late, too early, or without enough context, and adjust thresholds, priority, or escalation based on what actually happened.&lt;/p&gt;
&lt;p&gt;Use 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/configure-notifications/create-silence/&#34;&gt;silences&lt;/a&gt; during active incidents to reduce repeated notifications, but scope them carefully to avoid silencing unrelated alerts.&lt;/p&gt;
&lt;p&gt;Post-mortems should evaluate alerts with root causes and lessons learned. If responders lacked key information during the incident, enrich alerts with additional context, dashboards, or better guidance.&lt;/p&gt;
&lt;h2 id=&#34;alerts-should-be-continuously-improved&#34;&gt;Alerts should be continuously improved&lt;/h2&gt;
&lt;p&gt;Alerting is an iterative process. Alerts that aren’t reviewed and refined lose effectiveness as systems and traffic patterns change.&lt;/p&gt;
&lt;p&gt;Schedule regular reviews of existing alerts. Remove alerts that don’t lead to action, and tune alerts or thresholds that fire too often without providing useful signal. Reduce false positives to combat alert fatigue.&lt;/p&gt;
&lt;p&gt;Prioritize clarity and simplicity in alert design. Simpler alerts are easier to understand, maintain, and trust under pressure. Favor fewer high-quality, actionable alerts over a large number of low-value ones.&lt;/p&gt;
&lt;p&gt;Use dashboards and observability tools for investigation, not alerts.&lt;/p&gt;


]]></content><description>&lt;h1 id="alerting-best-practices">Alerting best practices&lt;/h1>
&lt;p>Designing and configuring an effective alerting system takes time. This guide focuses on building alerting systems that scale with real-world operations.&lt;/p>
&lt;p>The practices described here are intentionally high-level and apply regardless of tooling. Whether you use Prometheus, Grafana Alerting, or another stack, the same constraints apply: complex systems, imperfect signals, and humans on call.&lt;/p></description></item><item><title>Handle connectivity errors in alerts</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/connectivity-errors/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/connectivity-errors/</guid><content><![CDATA[&lt;h1 id=&#34;handle-connectivity-errors-in-alerts&#34;&gt;Handle connectivity errors in alerts&lt;/h1&gt;
&lt;p&gt;Connectivity issues are a common cause of misleading alerts or unnoticed failures.&lt;/p&gt;
&lt;p&gt;There could be a number of reasons for these errors. Maybe your target went offline, or Prometheus couldn&amp;rsquo;t scrape it. Or maybe your alert query failed because its target timed out or the network went down. These situations might look similar, but require different considerations in your alerting setup.&lt;/p&gt;
&lt;p&gt;This guide walks through how to detect and handle these types of failures, whether you&amp;rsquo;re writing alert rules in Prometheus, using Grafana Alerting, or combining both. It covers both availability monitoring and alert query failures, and outlines strategies to improve the reliability of your alerts.&lt;/p&gt;
&lt;h2 id=&#34;understand-connectivity-issues-in-alerts&#34;&gt;Understand connectivity issues in alerts&lt;/h2&gt;
&lt;p&gt;Typically, connectivity issues fall into a few common scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Servers or containers crashed or were shut down.&lt;/li&gt;
&lt;li&gt;Service overload or timeout.&lt;/li&gt;
&lt;li&gt;Misconfigured authentication or incorrect permissions.&lt;/li&gt;
&lt;li&gt;Network issues like DNS problems or ISP outages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When we talk about connectivity errors in alerting, we’re usually referring to one of two use cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Your target is down or unreachable.&lt;/strong&gt;&lt;br /&gt;
The service crashed, the host was down, or a firewall or DNS issue blocked the connection. These are &lt;strong&gt;availability problems&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Your alert query failed.&lt;/strong&gt;&lt;br /&gt;
The alert couldn’t evaluate its query—maybe because the data source timed out or the query was invalid. These are &lt;strong&gt;execution errors&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It helps to separate these cases early, because they behave differently and require different strategies.&lt;/p&gt;
&lt;p&gt;Keep in mind that most alert rules don’t hit the target directly. They query metrics from a monitoring system like Prometheus, which scrapes data from your actual infrastructure or application. That gives us two typical alerting setups where connectivity issues can show up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert rule → Target&lt;/strong&gt;&lt;br /&gt;
For example, an alert rule querying an external data source like a database.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alert rule → Prometheus ← Target&lt;/strong&gt;&lt;br /&gt;
More common in observability stacks. For instance, Prometheus scrapes a node or container, and the alert rule queries the metrics later.&lt;/p&gt;
&lt;p&gt;In this second setup, you can run into connectivity issues on either side. If Prometheus fails to scrape the target, your alert rule might not fire, even though something is likely wrong.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;detect-target-availability-with-the-prometheus-up-metric&#34;&gt;Detect target availability with the Prometheus &lt;code&gt;up&lt;/code&gt; metric&lt;/h2&gt;
&lt;p&gt;Prometheus scrapes metrics from its targets regularly, following the &lt;a href=&#34;https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;scrape_interval&lt;/code&gt;&lt;/a&gt; period. The default scrape interval is 60 seconds, a common choice in practice.&lt;/p&gt;
&lt;p&gt;Prometheus provides a built-in metric called &lt;code&gt;up&lt;/code&gt; for every scrape target, a simple indicator of whether the last scrape succeeded:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;up == 1&lt;/code&gt;: Your target is reachable; Prometheus collected the target metrics as expected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;up == 0&lt;/code&gt;: Prometheus couldn&amp;rsquo;t reach your target—indicating possible downtime or network errors.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A typical PromQL expression for an alert rule to detect when a target becomes unreachable is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;However, this alert rule can be noisy, because a single failed scrape fires the alert. To reduce noise, add a delay:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0 for: 5m&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;for&lt;/code&gt; option in Prometheus (or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;pending period&lt;/a&gt; in Grafana) delays the alert until the condition has been true for the full duration.&lt;/p&gt;
&lt;p&gt;In this example, waiting for 5 minutes means a single scrape error won&amp;rsquo;t fire the alert. Since Prometheus scrapes metrics every minute by default, the alert only fires after five consecutive failures.&lt;/p&gt;
&lt;p&gt;However, this kind of &lt;code&gt;up&lt;/code&gt; alert has a few potential pitfalls:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Failures can slip between scrape intervals&lt;/strong&gt;: An outage that starts and ends between two evaluations goes undetected. You could shorten the &lt;code&gt;for&lt;/code&gt; duration, but then transient scrape failures are more likely to trigger false alarms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intermittent recoveries reset the &lt;code&gt;for&lt;/code&gt; timer&lt;/strong&gt;: A single successful scrape resets the alert timer, which masks intermittent outages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Brief connectivity drops are common in real-world environments, so expect some flakiness in &lt;code&gt;up&lt;/code&gt; alerts. For example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Scrape result (&lt;code&gt;up&lt;/code&gt;)&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert rule evaluation&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer starts&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer continues&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer continues&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;03:00 &lt;code&gt;up == 1&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Successful scrape resets timer&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;04:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Timer starts again&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;05:00 &lt;code&gt;up == 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No alert yet; timer hasn’t reached the &lt;code&gt;for&lt;/code&gt; duration&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;The longer the period, the more likely this is to happen.&lt;/p&gt;
&lt;p&gt;A single recovery resets the timer, which is why &lt;code&gt;up == 0 for: 5m&lt;/code&gt; can be unreliable. Even if the target is down most of the time, the alert never fires, leaving you unaware of a persistent issue.&lt;/p&gt;
&lt;h3 id=&#34;use-avg_over_time-to-smooth-signal&#34;&gt;Use &lt;code&gt;avg_over_time&lt;/code&gt; to smooth signal&lt;/h3&gt;
&lt;p&gt;One way to work around these issues is to smooth the signal by averaging the &lt;code&gt;up&lt;/code&gt; metric over a similar or longer period:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;avg_over_time(up[10m]) &amp;lt; 0.8&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert rule fires when the target is unreachable for more than 20% of the last 10 minutes, rather than looking for consecutive scrape failures. With a one-minute scrape interval, three or more failed scrapes within the last 10 minutes now trigger the alert.&lt;/p&gt;
&lt;p&gt;Since this query uses a threshold and time window to control accuracy, you can now lower the &lt;code&gt;for&lt;/code&gt; duration (or 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;pending period&lt;/a&gt; in Grafana) to something shorter—&lt;code&gt;0m&lt;/code&gt; or &lt;code&gt;1m&lt;/code&gt;—so the alert fires faster.&lt;/p&gt;
&lt;p&gt;This approach gives you more flexibility in detecting real crashes or network issues. As always, adjust the threshold and period based on your noise tolerance and how critical the target is.&lt;/p&gt;
&lt;h3 id=&#34;use-synthetic-checks-to-monitor-external-availability&#34;&gt;Use synthetic checks to monitor external availability&lt;/h3&gt;
&lt;p&gt;Prometheus often runs inside the same network as the targets it monitors. That means Prometheus might be able to reach a target that users outside the network cannot.&lt;/p&gt;
&lt;p&gt;Firewalls, DNS misconfigurations, or other network issues might block public traffic while Prometheus scrapes &lt;code&gt;up&lt;/code&gt; successfully.&lt;/p&gt;
&lt;p&gt;This is where synthetic monitoring helps. Tools like the &lt;a href=&#34;https://github.com/prometheus/blackbox_exporter&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Blackbox Exporter&lt;/a&gt; let you continuously verify whether a service is available and reachable from outside your network—not just internally.&lt;/p&gt;
&lt;p&gt;The Blackbox Exporter exposes the results of these checks as metrics, which Prometheus can scrape like any other target. For example, the &lt;code&gt;probe_success&lt;/code&gt; metric reports whether the probe was able to reach the service. The setup looks like this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alert rules → Prometheus ← Blackbox Exporter (external probe) → Target&lt;/strong&gt;&lt;/p&gt;
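&lt;p&gt;A typical Prometheus scrape configuration for the Blackbox Exporter rewrites each target into a probe parameter (the exporter address and probed URL are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]          # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com   # the service probed from outside
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # where the exporter listens&lt;/code&gt;&lt;/pre&gt;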
&lt;p&gt;To detect when a service isn’t reachable externally, you can define an alert using the &lt;code&gt;probe_success&lt;/code&gt; metric:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;probe_success == 0 for: 5m&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert fires when the probe has failed continuously for 5 minutes—indicating that the service couldn’t be reached from the outside.&lt;/p&gt;
&lt;p&gt;You can then combine internal and external checks to detect connectivity errors more reliably. The following alert fires when the internal scrape fails or the service is externally unreachable:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;up == 0 or probe_success == 0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;As with the &lt;code&gt;up&lt;/code&gt; metric, you might want to smooth this out using &lt;code&gt;avg_over_time()&lt;/code&gt; for more robust detection. The smoothed version might look like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;avg_over_time(up[10m]) &amp;lt; 0.8 or avg_over_time(probe_success[10m]) &amp;lt; 0.8&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This alert fires when Prometheus couldn&amp;rsquo;t scrape the target successfully for more than 20% of the past 10 minutes, or when the external probes have been failing more than 20% of the time. This smoothing technique can be applied to any binary availability signal.&lt;/p&gt;
&lt;h2 id=&#34;manage-offline-hosts&#34;&gt;Manage offline hosts&lt;/h2&gt;
&lt;p&gt;In many setups, Prometheus scrapes multiple hosts under the same target, such as a fleet of servers or containers behind a common job label. It’s common for one host to go offline while the others continue to report metrics normally.&lt;/p&gt;
&lt;p&gt;If your alert only checks the general &lt;code&gt;up&lt;/code&gt; metric without breaking it down by labels (like &lt;code&gt;instance&lt;/code&gt;, &lt;code&gt;host&lt;/code&gt;, or &lt;code&gt;pod&lt;/code&gt;), you might miss when a host stops reporting. For example, an alert that looks only at the aggregated status of all instances will likely fail to catch when individual instances go missing.&lt;/p&gt;
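&lt;p&gt;The difference is visible in the query itself (the job name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Aggregated: fires only when half the fleet is down; one offline host goes unnoticed. Avoid.
avg(up{job=&amp;#34;node&amp;#34;}) &amp;lt; 0.5

# Per-instance: creates one alert instance for each unreachable host.
up{job=&amp;#34;node&amp;#34;} == 0&lt;/code&gt;&lt;/pre&gt;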
&lt;p&gt;This isn&amp;rsquo;t a connectivity error in this context — it’s not that the alert or Prometheus can&amp;rsquo;t reach anything, it’s that one or more specific targets have gone silent. These kinds of problems aren’t caught by &lt;code&gt;up == 0&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;For these cases, see the complementary 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/missing-data/&#34;&gt;guide on handling missing data&lt;/a&gt; — it covers common scenarios where the alert queries return no data at all, or where only some targets stop reporting. These aren&amp;rsquo;t full availability failures or execution errors, but they can still lead to blind spots in alert detection.&lt;/p&gt;
&lt;h2 id=&#34;handle-query-errors-in-grafana-alerting&#34;&gt;Handle query errors in Grafana Alerting&lt;/h2&gt;
&lt;p&gt;Not all connectivity issues come from targets going offline. Sometimes, the alert rule fails when querying its target. These aren’t availability problems—they’re query execution errors: maybe the data source timed out, the network dropped, or the query was invalid.&lt;/p&gt;
&lt;p&gt;These errors lead to broken alerts. But they come from a different part of the stack: between the alert rule and the data source, not between the data source (for example, Prometheus) and its target.&lt;/p&gt;
&lt;p&gt;This difference matters. Availability issues are typically handled using metrics like &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt;, but execution errors require a different setup.&lt;/p&gt;
&lt;p&gt;Grafana Alerting has built-in handling for execution errors, regardless of the data source. That includes Prometheus, and others like Graphite, InfluxDB, PostgreSQL, etc. By default, Grafana Alerting automatically handles query errors so you don’t miss critical failures. When an alert rule fails to execute, Grafana fires a special &lt;code&gt;DatasourceError&lt;/code&gt; alert.&lt;/p&gt;
&lt;p&gt;You can configure this behavior depending on how critical the alert is and on whether you already have other alerts detecting the issue. In 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#modify-the-no-data-or-error-state&#34;&gt;&lt;strong&gt;Configure no data and error handling&lt;/strong&gt;&lt;/a&gt;, click &lt;strong&gt;Alert state if execution error or timeout&lt;/strong&gt;, and choose the desired option for the alert:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Error (default)&lt;/strong&gt;: Triggers a separate &lt;code&gt;DatasourceError&lt;/code&gt; alert. This default ensures alert rules always inform about query errors but can create noise.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alerting&lt;/strong&gt;: Treats the error as if the alert condition is firing. Grafana transitions all existing instances for that rule to the &lt;code&gt;Alerting&lt;/code&gt; state.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal&lt;/strong&gt;: Ignores the query error and transitions all alert instances to the &lt;code&gt;Normal&lt;/code&gt; state. This is useful if the error isn’t critical or if you already have other alerts detecting connectivity issues.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Last State&lt;/strong&gt;: Keeps the previous state until the query succeeds again. Suitable for unstable environments to avoid flapping alerts.&lt;/p&gt;
&lt;figure
      class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
      style=&#34;max-width: 500px;&#34;
      itemprop=&#34;associatedMedia&#34;
      itemscope=&#34;&#34;
      itemtype=&#34;http://schema.org/ImageObject&#34;
    &gt;&lt;a
          class=&#34;lightbox-link&#34;
          href=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;
          itemprop=&#34;contentUrl&#34;
        &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
            class=&#34;lazyload &#34;
            data-src=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;data-srcset=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=320 320w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=550 550w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=750 750w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=900 900w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1040 1040w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1240 1240w, /media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;A screenshot of the `Configure error handling` option in Grafana Alerting.&#34;width=&#34;477&#34;height=&#34;338&#34;/&gt;
          &lt;noscript&gt;
            &lt;img
              src=&#34;/media/docs/alerting/alert-rule-configure-no-data-and-error-v2.png&#34;
              alt=&#34;A screenshot of the `Configure error handling` option in Grafana Alerting.&#34;width=&#34;477&#34;height=&#34;338&#34;/&gt;
          &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This applies even when alert rules query Prometheus itself—not just external data sources.&lt;/p&gt;
&lt;h3 id=&#34;design-alerts-for-connectivity-errors&#34;&gt;Design alerts for connectivity errors&lt;/h3&gt;
&lt;p&gt;In practice, start by deciding whether you want to create explicit alert rules — for example, using &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt; — to detect when a target is down or has connectivity issues.&lt;/p&gt;
&lt;p&gt;Then, for each alert rule, choose the error-handling behavior based on whether you already have dedicated connectivity alerts, the stability of the target, and how critical the alert is. Prioritize alerts based on symptom severity rather than just infrastructure signals that might not impact users.&lt;/p&gt;
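&lt;p&gt;For instance, rather than paging on a single failed scrape with &lt;code&gt;up == 0&lt;/code&gt;, you can smooth the signal so that only sustained outages fire (the window, threshold, and &lt;code&gt;job&lt;/code&gt; value here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires only if the target failed most scrapes over the last 5 minutes
avg_over_time(up{job=&amp;#34;my-service&amp;#34;}[5m]) &amp;lt; 0.5&lt;/code&gt;&lt;/pre&gt;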
&lt;h3 id=&#34;reduce-redundant-error-notifications&#34;&gt;Reduce redundant error notifications&lt;/h3&gt;
&lt;p&gt;A single data source error can cause multiple alerts to fire simultaneously, bombarding you with notifications and generating significant noise.&lt;/p&gt;
&lt;p&gt;As described previously, you can control the error-handling behavior for Grafana alerts. The &lt;strong&gt;Keep Last State&lt;/strong&gt; or &lt;strong&gt;Normal&lt;/strong&gt; option prevents alerts from firing and helps avoid redundant alerts, especially for services already covered by &lt;code&gt;up&lt;/code&gt; or &lt;code&gt;probe_success&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;When using the default behavior, a single connectivity error will likely trigger multiple &lt;code&gt;DatasourceError&lt;/code&gt; alerts.&lt;/p&gt;
&lt;p&gt;These alerts are separate from the original alerts—they’re not just a different state of the original alert. They fire immediately, ignore the pending period, and don’t inherit all the labels. This can catch you off guard if you expect them to behave like the original alerts.&lt;/p&gt;
&lt;p&gt;Treat these alerts differently from the original alerts, and implement dedicated strategies for their notifications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reduce duplicate notifications by grouping &lt;code&gt;DatasourceError&lt;/code&gt; alerts. Use the &lt;code&gt;datasource_uid&lt;/code&gt; label to group errors from the same data source.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Route &lt;code&gt;DatasourceError&lt;/code&gt; alerts separately, sending them to different teams or channels depending on their impact and urgency.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For details on how to configure grouping and routing, refer to 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/&#34;&gt;handling notifications&lt;/a&gt; and 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#no-data-and-error-alerts&#34;&gt;&lt;code&gt;No Data&lt;/code&gt; and &lt;code&gt;Error&lt;/code&gt; alerts&lt;/a&gt; documentation.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Connectivity issues are among the most common causes of noisy or misleading alerts. This guide covered two distinct types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Availability issues&lt;/strong&gt;, where the target itself is down or unreachable (e.g., due to a crash or network failure).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query execution errors&lt;/strong&gt;, where the alert rule can&amp;rsquo;t reach its data source (e.g., due to timeouts, invalid queries, or data source outages).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These problems come from different parts of your stack, and each requires its own techniques. Prometheus and Grafana allow you to detect them, and combining distinct techniques can make your alerts more resilient.&lt;/p&gt;
&lt;p&gt;With Prometheus, avoid relying solely on &lt;code&gt;up == 0&lt;/code&gt;. Smooth queries to account for intermittent failures, and use synthetic monitoring to detect reachability issues from outside your network.&lt;/p&gt;
&lt;p&gt;In Grafana Alerting, configure error handling explicitly. Not all alerts are equal or have the same urgency. Tune the error-handling behavior based on the reliability and severity of the alerts and whether you already have alerts dedicated to connectivity problems.&lt;/p&gt;
&lt;p&gt;And don’t forget the third case: &lt;strong&gt;missing data&lt;/strong&gt;. If only one host from a fleet silently disappears, you might not get alerted. If you&amp;rsquo;re dealing with individual instances that stopped reporting data, see the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/missing-data/&#34;&gt;Guide on handling missing data&lt;/a&gt; to continue exploring this topic.&lt;/p&gt;
]]></content><description>&lt;h1 id="handle-connectivity-errors-in-alerts">Handle connectivity errors in alerts&lt;/h1>
&lt;p>Connectivity issues are a common cause of misleading alerts or unnoticed failures.&lt;/p>
&lt;p>There could be a number of reasons for these errors. Maybe your target went offline, or Prometheus couldn&amp;rsquo;t scrape it. Or maybe your alert query failed because its target timed out or the network went down. These situations might look similar, but require different considerations in your alerting setup.&lt;/p></description></item><item><title>Handle missing data in Grafana Alerting</title><link>https://grafana.com/docs/grafana/v12.4/alerting/guides/missing-data/</link><pubDate>Fri, 03 Apr 2026 19:43:06 +0000</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/guides/missing-data/</guid><content><![CDATA[&lt;h1 id=&#34;handle-missing-data-in-grafana-alerting&#34;&gt;Handle missing data in Grafana Alerting&lt;/h1&gt;
&lt;p&gt;Missing data, which occurs when a target stops reporting metric data, is one of the most common issues when troubleshooting alerts. In cloud-native environments, this happens all the time: pods or nodes scale down to match demand, or an entire job quietly disappears.&lt;/p&gt;
&lt;p&gt;When this happens, alerts won’t fire, and you might not notice the system has stopped reporting.&lt;/p&gt;
&lt;p&gt;Sometimes it&amp;rsquo;s just a lack of data from a few instances. Other times, it&amp;rsquo;s a connectivity issue where the entire target is unreachable.&lt;/p&gt;
&lt;p&gt;This guide covers different scenarios where the underlying data is missing and shows how to design your alerts to act on those cases. If you&amp;rsquo;re troubleshooting an unreachable host or a network failure, see the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;Handle connectivity errors documentation&lt;/a&gt; as well.&lt;/p&gt;
&lt;h2 id=&#34;no-data-vs-missing-series&#34;&gt;No Data vs. Missing Series&lt;/h2&gt;
&lt;p&gt;There are a few common reasons why an instance stops reporting data, similar to 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;connectivity errors&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Host crash: The system is down, and Prometheus stops scraping the target.&lt;/li&gt;
&lt;li&gt;Temporary network failures: Intermittent scrape failures cause data gaps.&lt;/li&gt;
&lt;li&gt;Deployment changes: Decommissioning, Kubernetes pod eviction, or scaling down resources.&lt;/li&gt;
&lt;li&gt;Ephemeral workloads: Metrics intentionally stop reporting.&lt;/li&gt;
&lt;li&gt;And more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first thing to understand is the difference between a query failure (or connectivity error), &lt;em&gt;No Data&lt;/em&gt;, and a &lt;em&gt;Missing Series&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Alert queries often return multiple time series — one per instance, pod, region, or label combination. This is known as a &lt;strong&gt;multi-dimensional alert&lt;/strong&gt;, meaning a single alert rule can trigger multiple alert instances (alerts).&lt;/p&gt;
&lt;p&gt;For example, imagine a recorded metric, &lt;code&gt;http_request_latency_seconds&lt;/code&gt;, that reports request latency in seconds for each region where the application is deployed. The query returns one series per region — for instance, &lt;code&gt;region1&lt;/code&gt; and &lt;code&gt;region2&lt;/code&gt; — and generates only two alert instances. In this scenario, you may experience:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connectivity Error&lt;/strong&gt; if the alert rule query fails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Data&lt;/strong&gt; if the query runs successfully but returns no data at all.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing Series&lt;/strong&gt; if one or more specific series, which previously returned data, are missing, but other series still return data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In both &lt;em&gt;No Data&lt;/em&gt; and &lt;em&gt;Missing Series&lt;/em&gt; cases, the query still technically &amp;ldquo;works&amp;rdquo;, but the alert won’t fire unless you explicitly configure it to handle these situations.&lt;/p&gt;
&lt;p&gt;The following tables illustrate both scenarios using the previous example, with an alert that triggers if the latency exceeds 2 seconds in any region: &lt;code&gt;avg_over_time(http_request_latency_seconds[5m]) &amp;gt; 2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No Data Scenario:&lt;/strong&gt; The query returns no data for any series:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No Data ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;No Data ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ No Alert (Silent Failure)&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;&lt;strong&gt;Missing Series Scenario:&lt;/strong&gt; Only a specific series (&lt;code&gt;region2&lt;/code&gt;) disappears:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.6s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Missing Series ⚠️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ No Alert (Silent Failure)&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;✅ No Alert&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;In both cases, something broke silently.&lt;/p&gt;
&lt;h2 id=&#34;detect-missing-data-in-prometheus&#34;&gt;Detect missing data in Prometheus&lt;/h2&gt;
&lt;p&gt;Prometheus doesn&amp;rsquo;t fire alerts when a query returns no data. It simply assumes there was nothing to report, just as with query errors. Missing data won’t trigger existing alerts unless you explicitly check for it.&lt;/p&gt;
&lt;p&gt;In Prometheus, a common way to catch missing data is to use the &lt;code&gt;absent_over_time&lt;/code&gt; function:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;absent_over_time(http_request_latency_seconds[5m]) == 1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This triggers when all series for &lt;code&gt;http_request_latency_seconds&lt;/code&gt; are absent for 5 minutes — catching the &lt;em&gt;No Data&lt;/em&gt; case when the entire metric disappears.&lt;/p&gt;
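&lt;p&gt;If you prefer a single rule that covers both the threshold and the absence case, one common Prometheus pattern is to combine the two expressions with &lt;code&gt;or&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fires on high latency, or when the metric disappears entirely
avg_over_time(http_request_latency_seconds[5m]) &amp;gt; 2
or
absent_over_time(http_request_latency_seconds[5m]) == 1&lt;/code&gt;&lt;/pre&gt;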
&lt;p&gt;However, &lt;code&gt;absent_over_time()&lt;/code&gt; can’t detect which specific series are missing since it doesn’t preserve labels. The alert won’t tell you which series stopped reporting, only that the query returns no data.&lt;/p&gt;
&lt;p&gt;If you want to check for missing data per-region or label, you can specify the label in the alert query as follows:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promQL&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;# Detect missing data in region1
absent_over_time(http_request_latency_seconds{region=&amp;#34;region1&amp;#34;}[5m]) == 1

# Detect missing data in region2
absent_over_time(http_request_latency_seconds{region=&amp;#34;region2&amp;#34;}[5m]) == 1&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But this doesn&amp;rsquo;t scale well. Hard-coding a query for each label set is brittle, especially in dynamic cloud environments where instances can appear or disappear at any time.&lt;/p&gt;
&lt;p&gt;To detect when a specific target has disappeared, see &lt;strong&gt;Evict alert instances for missing series&lt;/strong&gt; below for details on how Grafana handles this case and how to set up detection.&lt;/p&gt;
&lt;h2 id=&#34;manage-no-data-issues-in-grafana-alerts&#34;&gt;Manage No Data issues in Grafana alerts&lt;/h2&gt;
&lt;p&gt;While Prometheus provides functions like &lt;code&gt;absent_over_time()&lt;/code&gt; to detect missing data, not all data sources available to Grafana alerts — such as Graphite, InfluxDB, or PostgreSQL — support a similar function.&lt;/p&gt;
&lt;p&gt;To handle this, Grafana Alerting implements built-in &lt;code&gt;No Data&lt;/code&gt; state logic, so you don’t need to detect missing data with &lt;code&gt;absent_*&lt;/code&gt; queries. Instead, you can configure in the alert rule settings how alerts behave when no data is returned.&lt;/p&gt;
&lt;p&gt;Similar to error handling, Grafana triggers a special &lt;em&gt;No data&lt;/em&gt; alert by default and lets you control this behavior. In 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#modify-the-no-data-or-error-state&#34;&gt;&lt;strong&gt;Configure no data and error handling&lt;/strong&gt;&lt;/a&gt;, click &lt;strong&gt;Alert state if no data or all values are null&lt;/strong&gt;, and choose one of the following options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No Data (default):&lt;/strong&gt; Triggers a new &lt;code&gt;DatasourceNoData&lt;/code&gt; alert, treating &lt;em&gt;No data&lt;/em&gt; as a specific problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alerting:&lt;/strong&gt; Transitions each existing alert instance into the &lt;code&gt;Alerting&lt;/code&gt; state when data disappears.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Normal:&lt;/strong&gt; Ignores missing data and transitions all instances to the &lt;code&gt;Normal&lt;/code&gt; state. Useful when receiving intermittent data, such as from experimental services, sporadic actions, or periodic reports.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Last State:&lt;/strong&gt; Leaves the alert in its previous state until the data returns. This is common in environments where brief metric gaps happen regularly, like with flaky exporters or noisy environments.&lt;/p&gt;
&lt;figure
      class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
      style=&#34;max-width: 500px;&#34;
      itemprop=&#34;associatedMedia&#34;
      itemscope=&#34;&#34;
      itemtype=&#34;http://schema.org/ImageObject&#34;
    &gt;&lt;a
          class=&#34;lightbox-link&#34;
          href=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;
          itemprop=&#34;contentUrl&#34;
        &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
            class=&#34;lazyload &#34;
            data-src=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;data-srcset=&#34;/media/docs/alerting/alert-rule-configure-no-data.png?w=320 320w, /media/docs/alerting/alert-rule-configure-no-data.png?w=550 550w, /media/docs/alerting/alert-rule-configure-no-data.png?w=750 750w, /media/docs/alerting/alert-rule-configure-no-data.png?w=900 900w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1040 1040w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1240 1240w, /media/docs/alerting/alert-rule-configure-no-data.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;A screenshot of the `Configure no data handling` option in Grafana Alerting.&#34;width=&#34;520&#34;height=&#34;307&#34;/&gt;
          &lt;noscript&gt;
            &lt;img
              src=&#34;/media/docs/alerting/alert-rule-configure-no-data.png&#34;
              alt=&#34;A screenshot of the `Configure no data handling` option in Grafana Alerting.&#34;width=&#34;520&#34;height=&#34;307&#34;/&gt;
          &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;manage-datasourcenodata-notifications&#34;&gt;Manage DatasourceNoData notifications&lt;/h3&gt;
&lt;p&gt;When Grafana triggers a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#no-data-and-error-alerts&#34;&gt;NoData alert&lt;/a&gt;, it creates a distinct alert instance, separate from the original alert instance. These alerts behave differently:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They use a dedicated &lt;code&gt;alertname: DatasourceNoData&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;They don’t inherit all the labels from the original alert instances.&lt;/li&gt;
&lt;li&gt;They trigger immediately, ignoring the pending period.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because of this, &lt;code&gt;DatasourceNoData&lt;/code&gt; alerts might require a dedicated setup to handle their notifications. For general recommendations, see 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/#reduce-redundant-error-notifications&#34;&gt;Reduce redundant error notifications&lt;/a&gt; — similar practices can apply to &lt;em&gt;NoData&lt;/em&gt; alerts.&lt;/p&gt;
&lt;h2 id=&#34;evict-alert-instances-for-missing-series&#34;&gt;Evict alert instances for missing series&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;MissingSeries&lt;/em&gt; occurs when only some series disappear but not all. This case is subtle, but important.&lt;/p&gt;
&lt;p&gt;Grafana marks missing series as 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/stale-alert-instances/&#34;&gt;&lt;strong&gt;stale&lt;/strong&gt;&lt;/a&gt; after two evaluation intervals and triggers the alert instance eviction process. Here’s what happens under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alert instances with missing data keep their last state for two evaluation intervals.&lt;/li&gt;
&lt;li&gt;If the data is still missing after that:
&lt;ul&gt;
&lt;li&gt;Grafana adds the annotation &lt;code&gt;grafana_state_reason: MissingSeries&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The alert instance transitions to the &lt;code&gt;Normal&lt;/code&gt; state.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;resolved notification&lt;/strong&gt; is sent if the alert was previously firing.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;alert instance is removed&lt;/strong&gt; from the Grafana UI.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an alert instance becomes stale, you’ll find it in the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/monitor-status/view-alert-state-history/&#34;&gt;alert history&lt;/a&gt; as &lt;code&gt;Normal (Missing Series)&lt;/code&gt; before it disappears. This table shows the eviction process from the previous example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region1&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;region2&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert triggered&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;00:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.5s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🟢 No Alerts&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;01:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;3s 🔴 &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;3s 🔴 &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🔴🔴 Alert instances triggered for both regions&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;02:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.6s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;(MissingSeries)&lt;/code&gt;⚠️ &lt;br&gt; &lt;code&gt;Alerting&lt;/code&gt; ️&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🔴 Region2 missing, state maintained.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;03:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;(MissingSeries)&lt;/code&gt; &lt;br&gt; &lt;code&gt;Normal&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢🟢 &lt;code&gt;region2&lt;/code&gt; was resolved, 📩 notification sent, and instance evicted.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;04:00&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;1.4s 🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;—&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢 No Alerts. &lt;code&gt;region2&lt;/code&gt; was evicted.&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;h3 id=&#34;why-doesnt-missingseries-match-no-data-behavior&#34;&gt;Why doesn’t MissingSeries match No Data behavior?&lt;/h3&gt;
&lt;p&gt;In dynamic environments — such as autoscaling groups, ephemeral pods, or spot instances — series naturally come and go. &lt;strong&gt;MissingSeries&lt;/strong&gt; normally signals infrastructure or deployment changes.&lt;/p&gt;
&lt;p&gt;By default, &lt;strong&gt;No Data&lt;/strong&gt; triggers an alert to indicate a potential problem.&lt;/p&gt;
&lt;p&gt;The eviction process for &lt;strong&gt;MissingSeries&lt;/strong&gt; is designed to prevent alert flapping when a pod or instance disappears, reducing alert noise.&lt;/p&gt;
&lt;p&gt;In environments with frequent scale events, prioritize symptom-based alerts over individual infrastructure signals and use aggregate alerts unless you explicitly need to track individual instances.&lt;/p&gt;
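&lt;p&gt;For example, an aggregated version of the earlier latency rule alerts on the overall symptom rather than per region, so individual series appearing or disappearing doesn’t cause flapping (the threshold is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# One alert instance for overall latency, regardless of how many regions report
avg(avg_over_time(http_request_latency_seconds[5m])) &amp;gt; 2&lt;/code&gt;&lt;/pre&gt;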
&lt;h3 id=&#34;handle-missingseries-notifications&#34;&gt;Handle MissingSeries notifications&lt;/h3&gt;
&lt;p&gt;A stale alert instance triggers a &lt;strong&gt;resolved notification&lt;/strong&gt; if it transitions from a firing state (such as &lt;code&gt;Alerting&lt;/code&gt;, &lt;code&gt;No Data&lt;/code&gt;, or &lt;code&gt;Error&lt;/code&gt;) to &lt;code&gt;Normal&lt;/code&gt;. In that case, the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/nodata-and-error-states/#grafana_state_reason-for-troubleshooting&#34;&gt;&lt;code&gt;grafana_state_reason&lt;/code&gt; annotation&lt;/a&gt; is set to &lt;strong&gt;MissingSeries&lt;/strong&gt;, indicating that the alert wasn’t resolved by recovery but evicted because the series data went missing.&lt;/p&gt;
&lt;p&gt;Recognizing these notifications helps you handle them appropriately. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Display the &lt;code&gt;grafana_state_reason&lt;/code&gt; annotation to clearly identify &lt;strong&gt;MissingSeries&lt;/strong&gt; alerts.&lt;/li&gt;
&lt;li&gt;Or use the &lt;code&gt;grafana_state_reason&lt;/code&gt; annotation to process these alerts differently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also, review these notifications to confirm whether something broke or if the alert was unnecessary. To reduce noise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Silence or mute alerts during planned maintenance or rollouts.&lt;/li&gt;
&lt;li&gt;Adjust alert rules to avoid triggering on series you expect to come and go, and use aggregated alerts instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;detect-missing-series-in-prometheus&#34;&gt;Detect missing series in Prometheus&lt;/h3&gt;
&lt;p&gt;Previously, an example showed how to detect missing data for a specific label, such as &lt;code&gt;region&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promQL&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;# Detect missing data in region1
absent_over_time(http_request_latency_seconds{region=&amp;#34;region1&amp;#34;}[5m]) == 1

# Detect missing data in region2
absent_over_time(http_request_latency_seconds{region=&amp;#34;region2&amp;#34;}[5m]) == 1&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;However, this approach doesn’t scale well because it requires hardcoding all possible &lt;code&gt;region&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;As an alternative, you can create an alert rule that detects missing series dynamically using the &lt;code&gt;present_over_time&lt;/code&gt; function:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;present_over_time(http_request_latency_seconds{}[24h])
unless
present_over_time(http_request_latency_seconds{}[10m])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, if you want to group by a label such as region:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;group(present_over_time(http_request_latency_seconds{}[24h])) by (region)
unless
group(present_over_time(http_request_latency_seconds{}[10m])) by (region)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query finds regions (or other targets) that were present at any time in the past 24 hours but have not been present in the past 10 minutes. The alert rule then triggers an alert instance for each missing region. You can apply the same technique to any label or target dimension.&lt;/p&gt;
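&lt;p&gt;For example, the same pattern can detect individual scrape targets that disappear. The following sketch is illustrative and assumes a hypothetical &lt;code&gt;node&lt;/code&gt; job, grouping by the &lt;code&gt;instance&lt;/code&gt; label instead of &lt;code&gt;region&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Instances that reported in the past 24 hours
# but not in the past 10 minutes
group(present_over_time(up{job=&amp;#34;node&amp;#34;}[24h])) by (instance)
unless
group(present_over_time(up{job=&amp;#34;node&amp;#34;}[10m])) by (instance)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each result row becomes one alert instance for one missing &lt;code&gt;instance&lt;/code&gt; value, without hardcoding any instance names in the rule.&lt;/p&gt;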
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Missing data isn’t always a failure. It’s a common scenario in dynamic environments when certain targets stop reporting.&lt;/p&gt;
&lt;p&gt;Grafana Alerting handles each of these scenarios differently. Here’s how to think about it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understand &lt;code&gt;DatasourceNoData&lt;/code&gt; and &lt;code&gt;MissingSeries&lt;/code&gt; notifications, since they don’t behave like regular alerts.&lt;/li&gt;
&lt;li&gt;Use Grafana’s &lt;em&gt;No Data&lt;/em&gt; handling options to define what happens when a query returns nothing.&lt;/li&gt;
&lt;li&gt;When a &lt;em&gt;No Data&lt;/em&gt; result is expected and harmless, consider rewriting the query to always return data. For example, in Prometheus, use &lt;code&gt;your_metric_query OR on() vector(0)&lt;/code&gt; to return &lt;code&gt;0&lt;/code&gt; when &lt;code&gt;your_metric_query&lt;/code&gt; returns nothing.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;absent_over_time()&lt;/code&gt; or &lt;code&gt;present_over_time()&lt;/code&gt; in Prometheus to detect when a metric or target disappears.&lt;/li&gt;
&lt;li&gt;If data is frequently missing due to scrape delays, use techniques to account for data delays:
&lt;ul&gt;
&lt;li&gt;Adjust the &lt;strong&gt;Time Range&lt;/strong&gt; query option in Grafana to evaluate slightly behind real time (e.g., set &lt;strong&gt;To&lt;/strong&gt; to &lt;code&gt;now-1m&lt;/code&gt;) to account for late data points.&lt;/li&gt;
&lt;li&gt;In Prometheus, you can use &lt;code&gt;last_over_time(metric_name[10m])&lt;/code&gt; to pick the most recent sample within a given window.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Don’t alert on every instance by default. In dynamic environments, it’s better to aggregate and alert on symptoms — unless a missing individual instance directly impacts users.&lt;/li&gt;
&lt;li&gt;If you’re getting too much noise from disappearing data, consider adjusting alerts, using &lt;code&gt;Keep Last State&lt;/code&gt;, or routing those alerts differently.&lt;/li&gt;
&lt;li&gt;For connectivity issues involving alert query failures, see the sibling guide: 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/connectivity-errors/&#34;&gt;Handling connectivity errors in Grafana Alerting&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
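&lt;p&gt;To illustrate two of the techniques above, the fallback-to-zero and scrape-delay patterns can be sketched in PromQL. The metric names here are placeholders, not part of any specific setup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;# Fallback: return 0 instead of No Data
# when the query matches no series
sum(rate(http_requests_total[5m])) OR on() vector(0)

# Tolerate scrape delays: use the most recent sample
# observed within the last 10 minutes
last_over_time(http_request_latency_seconds[10m])&lt;/code&gt;&lt;/pre&gt;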
]]></content><description>&lt;h1 id="handle-missing-data-in-grafana-alerting">Handle missing data in Grafana Alerting&lt;/h1>
&lt;p>Missing data, which occurs when a target stops reporting metrics, is one of the most common issues when troubleshooting alerts. In cloud-native environments, this happens all the time: pods or nodes scale down to match demand, or an entire job quietly disappears.&lt;/p></description></item></channel></rss>