<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Examples on Grafana Labs</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/</link><description>Recent content in Examples on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/grafana/v12.4/alerting/examples/index.xml" rel="self" type="application/rss+xml"/><item><title>Example of multi-dimensional alerts on time series data</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/multi-dimensional-alerts/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/multi-dimensional-alerts/</guid><content><![CDATA[&lt;h1 id=&#34;example-of-multi-dimensional-alerts-on-time-series-data&#34;&gt;Example of multi-dimensional alerts on time series data&lt;/h1&gt;
&lt;p&gt;This example shows how a single alert rule can generate multiple alert instances — one for each label set (or time series). This is called &lt;strong&gt;multi-dimensional alerting&lt;/strong&gt;: one alert rule, many alert instances.&lt;/p&gt;
&lt;p&gt;In Prometheus, each unique combination of labels defines a distinct time series. Grafana Alerting uses the same model: each label set is evaluated independently, and a separate alert instance is created for each series.&lt;/p&gt;
&lt;p&gt;This pattern is common in dynamic environments when monitoring a group of components like multiple CPUs, containers, or per-host availability. Instead of defining individual alert rules or aggregated alerts, you alert on &lt;em&gt;each dimension&lt;/em&gt; — so you can detect particular issues and include that level of detail in notifications.&lt;/p&gt;
&lt;p&gt;For example, a query returns one series per CPU:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;&lt;code&gt;cpu&lt;/code&gt; label value&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;CPU percent usage&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-0&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;95&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-1&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;30&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-2&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;With a threshold of &lt;code&gt;&amp;gt; 80&lt;/code&gt;, this would trigger two alert instances: one for &lt;code&gt;cpu-0&lt;/code&gt; and one for &lt;code&gt;cpu-2&lt;/code&gt;.&lt;/p&gt;
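&lt;p&gt;The evaluation model can be sketched in a few lines of Python. This is an illustrative simulation of per-series threshold evaluation, not how Grafana itself is implemented; the series values mirror the table above.&lt;/p&gt;

```python
# Illustrative only: one rule, one alert instance per label set,
# each evaluated independently against the same threshold.
series = [
    ({"cpu": "cpu-0"}, 95),
    ({"cpu": "cpu-1"}, 30),
    ({"cpu": "cpu-2"}, 85),
]

def evaluate(series, threshold=80):
    # Return one (labels, value, state) tuple per series: the alert instances.
    return [
        (labels, value, "Firing" if value > threshold else "Normal")
        for labels, value in series
    ]

for labels, value, state in evaluate(series):
    print(labels, value, state)
```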
&lt;h2 id=&#34;examples-overview&#34;&gt;Examples overview&lt;/h2&gt;
&lt;p&gt;Imagine you want to trigger alerts when CPU usage goes above 80%, and you want to track each CPU core independently.&lt;/p&gt;
&lt;p&gt;You can use a Prometheus query like this:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sum by(cpu) (
  rate(node_cpu_seconds_total{mode!=&amp;#34;idle&amp;#34;}[1m])
)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This query returns the active CPU usage rate per CPU core, averaged over the past minute.&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;CPU core&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Active usage rate&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-0&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;95&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-1&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;30&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;cpu-2&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;This produces one series per CPU core.&lt;/p&gt;
&lt;p&gt;When Grafana Alerting evaluates the query, it creates an individual alert instance for each returned series.&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert instance&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Value&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-0&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;95&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-1&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;30&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-2&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;With a threshold condition like &lt;code&gt;$A &amp;gt; 80&lt;/code&gt;, Grafana evaluates each instance separately and fires alerts only where the condition is met:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert instance&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Value&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;State&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-0&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;95&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-1&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;30&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Normal&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;{cpu=&amp;ldquo;cpu-2&amp;rdquo;}&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Multi-dimensional alerts help you surface issues on individual components: problems that might be missed when alerting on aggregated data (like total CPU usage).&lt;/p&gt;
&lt;p&gt;Each alert instance targets a specific component, identified by its unique label set. This makes alerts more specific and actionable. For example, you can set a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/annotation-label/#annotations&#34;&gt;&lt;code&gt;summary&lt;/code&gt; annotation&lt;/a&gt; in your alert rule that identifies the affected CPU:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;High CPU usage on {{$labels.cpu}}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In the previous example, the two firing alert instances would display summaries indicating the affected CPUs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High CPU usage on &lt;code&gt;cpu-0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;High CPU usage on &lt;code&gt;cpu-2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
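&lt;p&gt;How such a summary expands per instance can be sketched as follows. The placeholder syntax matches the annotation above, but the rendering function is a simplified stand-in for the Go-based templating Grafana actually uses.&lt;/p&gt;

```python
import re

def render(template, labels):
    # Replace {{ $labels.<name> }} placeholders with the instance's
    # label values. Simplified stand-in, illustrative only.
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda m: labels[m.group(1)],
        template,
    )

summary = "High CPU usage on {{$labels.cpu}}"
for labels in ({"cpu": "cpu-0"}, {"cpu": "cpu-2"}):
    print(render(summary, labels))
```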
&lt;h2 id=&#34;try-it-with-testdata&#34;&gt;Try it with TestData&lt;/h2&gt;
&lt;p&gt;You can quickly experiment with multi-dimensional alerts using the 
    &lt;a href=&#34;/docs/grafana/v12.4/datasources/testdata/&#34;&gt;&lt;strong&gt;TestData&lt;/strong&gt; data source&lt;/a&gt;, which can generate multiple random time series.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;strong&gt;TestData&lt;/strong&gt; data source through the &lt;strong&gt;Connections&lt;/strong&gt; menu.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Go to &lt;strong&gt;Alerting&lt;/strong&gt; and create an alert rule.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Select &lt;strong&gt;TestData&lt;/strong&gt; as the data source.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure the TestData scenario:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scenario: &lt;strong&gt;Random Walk&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Labels: &lt;code&gt;cpu=cpu-$seriesIndex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Series count: 3&lt;/li&gt;
&lt;li&gt;Min: 70, Max: 100&lt;/li&gt;
&lt;li&gt;Spread: 2&lt;/li&gt;
&lt;/ul&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link&#34;
           href=&#34;/media/docs/alerting/testdata-random-series-v2.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload &#34;
             data-src=&#34;/media/docs/alerting/testdata-random-series-v2.png&#34;data-srcset=&#34;/media/docs/alerting/testdata-random-series-v2.png?w=320 320w, /media/docs/alerting/testdata-random-series-v2.png?w=550 550w, /media/docs/alerting/testdata-random-series-v2.png?w=750 750w, /media/docs/alerting/testdata-random-series-v2.png?w=900 900w, /media/docs/alerting/testdata-random-series-v2.png?w=1040 1040w, /media/docs/alerting/testdata-random-series-v2.png?w=1240 1240w, /media/docs/alerting/testdata-random-series-v2.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Generating random time series data using the TestData data source&#34;width=&#34;1165&#34;height=&#34;676&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/testdata-random-series-v2.png&#34;
               alt=&#34;Generating random time series data using the TestData data source&#34;width=&#34;1165&#34;height=&#34;676&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;reduce-time-series-data-for-comparison&#34;&gt;Reduce time series data for comparison&lt;/h2&gt;
&lt;p&gt;The example returns three time series, as shown above, with values across the selected time range.&lt;/p&gt;
&lt;p&gt;To alert on each series, you need to reduce the time series to a single value that the alert condition can evaluate and determine the alert instance state.&lt;/p&gt;
&lt;p&gt;Grafana Alerting provides several ways to reduce time series data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data source query functions&lt;/strong&gt;. The earlier example used the Prometheus &lt;code&gt;sum&lt;/code&gt; function to sum the rate results by &lt;code&gt;cpu&lt;/code&gt;, producing a single value per CPU core.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduce expression&lt;/strong&gt;. In the query and condition section, Grafana provides the &lt;code&gt;Reduce&lt;/code&gt; expression to aggregate time series data.
&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Default mode&lt;/strong&gt;, the &lt;strong&gt;When&lt;/strong&gt; input selects a reducer (like &lt;code&gt;last&lt;/code&gt;, &lt;code&gt;mean&lt;/code&gt;, or &lt;code&gt;min&lt;/code&gt;), and the threshold compares that reduced value.&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Advanced mode&lt;/strong&gt;, you can add the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#reduce&#34;&gt;&lt;strong&gt;Reduce&lt;/strong&gt; expression&lt;/a&gt; (e.g., &lt;code&gt;last()&lt;/code&gt;, &lt;code&gt;mean()&lt;/code&gt;) before defining the threshold (alert condition).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
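&lt;p&gt;The reduce-then-threshold flow can be sketched as follows. The reducer names mirror the options listed above; the sample values are made up for illustration.&lt;/p&gt;

```python
def reduce_series(samples, mode="mean"):
    # Collapse a series of samples to the single value the threshold compares.
    reducers = {
        "last": lambda s: s[-1],
        "mean": lambda s: sum(s) / len(s),
        "min": min,
        "max": max,
    }
    return reducers[mode](samples)

samples = {"cpu-0": [92, 95, 98], "cpu-1": [28, 30, 32]}
for cpu, values in samples.items():
    reduced = reduce_series(values, "mean")
    print(cpu, reduced, "Firing" if reduced > 80 else "Normal")
```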
&lt;p&gt;For demo purposes, this example uses the &lt;strong&gt;Advanced mode&lt;/strong&gt; with a &lt;strong&gt;Reduce&lt;/strong&gt; expression:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Toggle &lt;strong&gt;Advanced mode&lt;/strong&gt; in the top right section of the query panel to enable adding additional expressions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;strong&gt;Reduce&lt;/strong&gt; expression using a function like &lt;code&gt;mean()&lt;/code&gt; to reduce each time series to a single value.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Define the alert condition using a &lt;strong&gt;Threshold&lt;/strong&gt; like &lt;code&gt;$reducer &amp;gt; 80&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Click &lt;strong&gt;Preview&lt;/strong&gt; to evaluate the alert rule.&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link captioned&#34;
           href=&#34;/media/docs/alerting/using-expressions-with-multiple-series.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload mb-0&#34;
             data-src=&#34;/media/docs/alerting/using-expressions-with-multiple-series.png&#34;data-srcset=&#34;/media/docs/alerting/using-expressions-with-multiple-series.png?w=320 320w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=550 550w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=750 750w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=900 900w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=1040 1040w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=1240 1240w, /media/docs/alerting/using-expressions-with-multiple-series.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert preview using a Reduce expression and a threshold condition&#34;width=&#34;1049&#34;height=&#34;416&#34;title=&#34;The alert condition evaluates the reduced value for each alert instance and shows whether each instance is Firing or Normal.&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/using-expressions-with-multiple-series.png&#34;
               alt=&#34;Alert preview using a Reduce expression and a threshold condition&#34;width=&#34;1049&#34;height=&#34;416&#34;title=&#34;The alert condition evaluates the reduced value for each alert instance and shows whether each instance is Firing or Normal.&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;figcaption class=&#34;w-100p caption text-gray-13  &#34;&gt;The alert condition evaluates the reduced value for each alert instance and shows whether each instance is Firing or Normal.&lt;/figcaption&gt;&lt;/a&gt;&lt;/figure&gt;


&lt;div class=&#34;admonition admonition-tip&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Tip&lt;/p&gt;&lt;p&gt;You can explore this &lt;strong&gt;&lt;a href=&#34;https://play.grafana.org/alerting/grafana/multi-dimensional-alerts/view?tech=docs&amp;amp;pg=alerting-examples&amp;amp;plcmt=callout-tip&amp;amp;cta=alert-multi-dimensional&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;alerting example in Grafana Play&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Open the example to view alert evaluation results, generated alert instances, the alert history timeline, and alert rule details.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;p&gt;This example shows how Grafana Alerting implements a multi-dimensional alerting model (one rule, many alert instances) and why reducing time series data to a single value is required for evaluation.&lt;/p&gt;
&lt;p&gt;For additional learning resources, check out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt2/&#34;&gt;Get started tutorial – Create multi-dimensional alerts and route them&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/table-data/&#34;&gt;Example of alerting on tabular data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="example-of-multi-dimensional-alerts-on-time-series-data">Example of multi-dimensional alerts on time series data&lt;/h1>
&lt;p>This example shows how a single alert rule can generate multiple alert instances — one for each label set (or time series). This is called &lt;strong>multi-dimensional alerting&lt;/strong>: one alert rule, many alert instances.&lt;/p></description></item><item><title>Example of alerting on tabular data</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/table-data/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/table-data/</guid><content><![CDATA[&lt;h1 id=&#34;example-of-alerting-on-tabular-data&#34;&gt;Example of alerting on tabular data&lt;/h1&gt;
&lt;p&gt;Not all data sources return time series data. SQL databases, CSV files, and some APIs often return results as rows or arrays of columns or fields — commonly referred to as tabular data.&lt;/p&gt;
&lt;p&gt;This example shows how to create an alert rule using data in table format. Grafana treats each row as a separate alert instance, as long as the data meets the expected format.&lt;/p&gt;
&lt;h2 id=&#34;how-grafana-alerting-evaluates-tabular-data&#34;&gt;How Grafana Alerting evaluates tabular data&lt;/h2&gt;
&lt;p&gt;When a query returns data in table format, Grafana transforms each row into a separate alert instance.&lt;/p&gt;
&lt;p&gt;To evaluate each row (alert instance), it expects:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Exactly one numeric column.&lt;/strong&gt; This is the value used to evaluate the alert condition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-numeric columns.&lt;/strong&gt; These columns define the label set: each column name becomes a label name, and the cell value becomes the label value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unique label sets per row.&lt;/strong&gt; Each row must be uniquely identifiable by its labels. This ensures each row represents a distinct alert instance.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class=&#34;admonition admonition-caution&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Caution&lt;/p&gt;&lt;p&gt;All three conditions must be met; otherwise, Grafana can’t evaluate the table data and the rule fails.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;
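&lt;p&gt;The three requirements can be expressed as a small validation sketch. This illustrates the rules above; it is not the actual check Grafana performs, and the function name is made up.&lt;/p&gt;

```python
def validate_table(columns, rows):
    # Rule 1: exactly one numeric column (the value to evaluate).
    numeric = [
        c for c in columns
        if all(isinstance(r[c], (int, float)) for r in rows)
    ]
    if len(numeric) != 1:
        raise ValueError("expected exactly one numeric column")
    # Rule 2: the remaining columns form each row's label set.
    label_cols = [c for c in columns if c != numeric[0]]
    # Rule 3: every row's label set must be unique.
    label_sets = {tuple((c, r[c]) for c in label_cols) for r in rows}
    if len(label_sets) != len(rows):
        raise ValueError("label sets must be unique per row")
    return numeric[0], label_cols

rows = [
    {"Host": "web1", "Disk": "/etc", "PercentFree": 3},
    {"Host": "web2", "Disk": "/var", "PercentFree": 4},
]
print(validate_table(["Host", "Disk", "PercentFree"], rows))
```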

&lt;h2 id=&#34;example-overview&#34;&gt;Example overview&lt;/h2&gt;
&lt;p&gt;Imagine you store disk usage in a &lt;code&gt;DiskSpace&lt;/code&gt; table and you want to trigger alerts when the available space drops below 5%.&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Time&lt;/th&gt;
              &lt;th&gt;Host&lt;/th&gt;
              &lt;th&gt;Disk&lt;/th&gt;
              &lt;th&gt;PercentFree&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;2021-06-07&lt;/td&gt;
              &lt;td&gt;web1&lt;/td&gt;
              &lt;td&gt;/etc&lt;/td&gt;
              &lt;td&gt;3&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;2021-06-07&lt;/td&gt;
              &lt;td&gt;web2&lt;/td&gt;
              &lt;td&gt;/var&lt;/td&gt;
              &lt;td&gt;4&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;2021-06-07&lt;/td&gt;
              &lt;td&gt;web3&lt;/td&gt;
              &lt;td&gt;/var&lt;/td&gt;
              &lt;td&gt;8&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;To calculate the average free space per Host and Disk, you can use &lt;code&gt;$__timeFilter&lt;/code&gt; to filter by time without returning the time column to Grafana:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;SQL&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-sql&#34;&gt;SELECT
  Host,
  Disk,
  AVG(PercentFree) AS PercentFree
FROM DiskSpace
WHERE $__timeFilter(Time)
GROUP BY Host, Disk&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This query returns the following table response:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Host&lt;/th&gt;
              &lt;th&gt;Disk&lt;/th&gt;
              &lt;th&gt;PercentFree&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;web1&lt;/td&gt;
              &lt;td&gt;/etc&lt;/td&gt;
              &lt;td&gt;3&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;web2&lt;/td&gt;
              &lt;td&gt;/var&lt;/td&gt;
              &lt;td&gt;4&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;web3&lt;/td&gt;
              &lt;td&gt;/var&lt;/td&gt;
              &lt;td&gt;8&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;When Alerting evaluates the query response, the data is transformed into three alert instances as previously detailed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The numeric column becomes the value for the alert condition.&lt;/li&gt;
&lt;li&gt;Additional columns define the label set for each alert instance.&lt;/li&gt;
&lt;/ul&gt;
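&lt;p&gt;This row-to-instance transformation, together with a condition like &lt;code&gt;$A &amp;lt; 5&lt;/code&gt;, can be sketched as follows. The data mirrors the query result above; the logic is illustrative only.&lt;/p&gt;

```python
def to_instances(rows, threshold=5):
    # Non-numeric cells become labels; the numeric cell is the value that
    # the "< threshold" condition evaluates for each instance.
    instances = []
    for row in rows:
        labels = {k: v for k, v in row.items() if isinstance(v, str)}
        value = next(v for v in row.values() if isinstance(v, (int, float)))
        state = "Firing" if value < threshold else "Normal"
        instances.append((labels, value, state))
    return instances

rows = [
    {"Host": "web1", "Disk": "/etc", "PercentFree": 3},
    {"Host": "web2", "Disk": "/var", "PercentFree": 4},
    {"Host": "web3", "Disk": "/var", "PercentFree": 8},
]
for labels, value, state in to_instances(rows):
    print(labels, value, state)
```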
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Alert instance&lt;/th&gt;
              &lt;th&gt;Value&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web1&amp;quot;, Disk=&amp;quot;/etc&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;3&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web2&amp;quot;, Disk=&amp;quot;/var&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;4&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web3&amp;quot;, Disk=&amp;quot;/var&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;8&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Finally, an alert condition that checks for less than 5% of free space (&lt;code&gt;$A &amp;lt; 5&lt;/code&gt;) would result in two alert instances firing:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Alert instance&lt;/th&gt;
              &lt;th&gt;Value&lt;/th&gt;
              &lt;th&gt;State&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web1&amp;quot;, Disk=&amp;quot;/etc&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;3&lt;/td&gt;
              &lt;td&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web2&amp;quot;, Disk=&amp;quot;/var&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;4&lt;/td&gt;
              &lt;td&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;{Host=&amp;quot;web3&amp;quot;, Disk=&amp;quot;/var&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;8&lt;/td&gt;
              &lt;td&gt;Normal&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;h2 id=&#34;try-it-with-testdata&#34;&gt;Try it with TestData&lt;/h2&gt;
&lt;p&gt;To test this quickly, you can simulate the table using the 
    &lt;a href=&#34;/docs/grafana/v12.4/datasources/testdata/&#34;&gt;&lt;strong&gt;TestData&lt;/strong&gt; data source&lt;/a&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;strong&gt;TestData&lt;/strong&gt; data source through the &lt;strong&gt;Connections&lt;/strong&gt; menu.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Go to &lt;strong&gt;Alerting&lt;/strong&gt; and create an alert rule.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Select &lt;strong&gt;TestData&lt;/strong&gt; as the data source.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;From &lt;strong&gt;Scenario&lt;/strong&gt;, select &lt;strong&gt;CSV Content&lt;/strong&gt; and paste this CSV:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Bash&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-bash&#34;&gt;host, disk, percentFree
web1, /etc, 3
web2, /var, 4
web3, /var, 8&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set a condition like &lt;code&gt;$A &amp;lt; 5&lt;/code&gt; and &lt;strong&gt;Preview&lt;/strong&gt; the alert.&lt;/p&gt;
&lt;p&gt;Grafana evaluates the table data and fires the first two alert instances.&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link&#34;
           href=&#34;/media/docs/alerting/example-table-data-preview.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload &#34;
             data-src=&#34;/media/docs/alerting/example-table-data-preview.png&#34;data-srcset=&#34;/media/docs/alerting/example-table-data-preview.png?w=320 320w, /media/docs/alerting/example-table-data-preview.png?w=550 550w, /media/docs/alerting/example-table-data-preview.png?w=750 750w, /media/docs/alerting/example-table-data-preview.png?w=900 900w, /media/docs/alerting/example-table-data-preview.png?w=1040 1040w, /media/docs/alerting/example-table-data-preview.png?w=1240 1240w, /media/docs/alerting/example-table-data-preview.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert preview with tabular data using the TestData data source&#34;width=&#34;1080&#34;height=&#34;881&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/example-table-data-preview.png&#34;
               alt=&#34;Alert preview with tabular data using the TestData data source&#34;width=&#34;1080&#34;height=&#34;881&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;


&lt;div class=&#34;admonition admonition-tip&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Tip&lt;/p&gt;&lt;p&gt;You can explore this &lt;strong&gt;&lt;a href=&#34;https://play.grafana.org/alerting/grafana/tabular-data/view?tech=docs&amp;amp;pg=alerting-examples&amp;amp;plcmt=callout-tip&amp;amp;cta=alert-tabular-data&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;alerting example in Grafana Play&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Open the example to view alert evaluation results, generated alert instances, the alert history timeline, and alert rule details.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;csv-data-with-infinity&#34;&gt;CSV data with Infinity&lt;/h2&gt;
&lt;p&gt;When the &lt;a href=&#34;/docs/plugins/yesoreyeram-infinity-datasource/latest/csv/&#34;&gt;Infinity plugin fetches CSV data&lt;/a&gt;, all columns are parsed and returned as strings. By default, this causes the query expression to fail in Alerting.&lt;/p&gt;
&lt;p&gt;To make it work, you need to format the CSV data as &lt;a href=&#34;#how-grafana-alerting-evaluates-tabular-data&#34;&gt;expected by Grafana Alerting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the query editor, specify the column names and their types to ensure that only one column is treated as a number.&lt;/p&gt;
&lt;figure
    class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
    style=&#34;max-width: 750px;&#34;
    itemprop=&#34;associatedMedia&#34;
    itemscope=&#34;&#34;
    itemtype=&#34;http://schema.org/ImageObject&#34;
  &gt;&lt;a
        class=&#34;lightbox-link&#34;
        href=&#34;/media/docs/alerting/example-table-data-infinity-csv-data.png&#34;
        itemprop=&#34;contentUrl&#34;
      &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
          class=&#34;lazyload &#34;
          data-src=&#34;/media/docs/alerting/example-table-data-infinity-csv-data.png&#34;data-srcset=&#34;/media/docs/alerting/example-table-data-infinity-csv-data.png?w=320 320w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=550 550w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=750 750w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=900 900w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=1040 1040w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=1240 1240w, /media/docs/alerting/example-table-data-infinity-csv-data.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Using the Infinity data source plugin to fetch CSV data in Alerting&#34;width=&#34;930&#34;height=&#34;960&#34;/&gt;
        &lt;noscript&gt;
          &lt;img
            src=&#34;/media/docs/alerting/example-table-data-infinity-csv-data.png&#34;
            alt=&#34;Using the Infinity data source plugin to fetch CSV data in Alerting&#34;width=&#34;930&#34;height=&#34;960&#34;/&gt;
        &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;h2 id=&#34;differences-with-time-series-data&#34;&gt;Differences with time series data&lt;/h2&gt;
&lt;p&gt;Working with time series is similar—each series is treated as a separate alert instance, based on its label set.&lt;/p&gt;
&lt;p&gt;The key difference is the data format:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time series data&lt;/strong&gt; contains multiple values over time, each with its own timestamp.
To evaluate the alert condition, alert rules &lt;strong&gt;must reduce each series to a single number&lt;/strong&gt; using a function like &lt;code&gt;last()&lt;/code&gt;, &lt;code&gt;avg()&lt;/code&gt;, or &lt;code&gt;max()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tabular data&lt;/strong&gt; doesn’t require reduction, as each row contains only a single numeric value used to evaluate the alert condition.&lt;/li&gt;
&lt;/ul&gt;
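The reduction step described above can be sketched in Python (label sets and sample values are hypothetical, and this is an illustration, not Grafana internals): each series is collapsed to a single number with a function such as `last()`, and one alert instance fires per label set that crosses the threshold.

```python
# Sketch of multi-dimensional evaluation: reduce each series to one
# number, then compare that number against the alert threshold.
# Label sets and utilization samples below are hypothetical.
series = {
    ("cpu", "0"): [0.91, 0.94, 0.97],
    ("cpu", "1"): [0.20, 0.25, 0.22],
}

reducers = {
    "last": lambda values: values[-1],
    "avg": lambda values: sum(values) / len(values),
    "max": lambda values: max(values),
}

threshold = 0.9  # fire when the reduced value exceeds 90% utilization
reduce_fn = reducers["last"]

# One alert instance per label set whose reduced value crosses the threshold.
firing = {
    labels: reduce_fn(values)
    for labels, values in series.items()
    if reduce_fn(values) > threshold
}
print(firing)  # only the ("cpu", "0") series fires
```

Tabular data skips the reducer entirely: each row already carries the single numeric value compared against the threshold.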
&lt;p&gt;For comparison, see the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional time series data example&lt;/a&gt;.&lt;/p&gt;
]]></content><description>&lt;h1 id="example-of-alerting-on-tabular-data">Example of alerting on tabular data&lt;/h1>
&lt;p>Not all data sources return time series data. SQL databases, CSV files, and some APIs often return results as rows or arrays of columns or fields — commonly referred to as tabular data.&lt;/p></description></item><item><title>Trace-based alerts</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/trace-based-alerts/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/trace-based-alerts/</guid><content><![CDATA[&lt;h1 id=&#34;examples-of-trace-based-alerts&#34;&gt;Examples of trace-based alerts&lt;/h1&gt;
&lt;p&gt;Metrics are the foundation of most alerting systems. They are usually the first signal that something is wrong, but they don’t always indicate &lt;em&gt;where&lt;/em&gt; or &lt;em&gt;why&lt;/em&gt; a failure occurs.&lt;/p&gt;
&lt;p&gt;Traces fill that gap by showing the complete path a request takes through your system. They map the workflows across services, indicating where the request slows down or fails.&lt;/p&gt;
&lt;figure
    class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
    style=&#34;max-width: 750px;&#34;
    itemprop=&#34;associatedMedia&#34;
    itemscope=&#34;&#34;
    itemtype=&#34;http://schema.org/ImageObject&#34;
  &gt;&lt;a
        class=&#34;lightbox-link&#34;
        href=&#34;/media/docs/alerting/screenshot-traces-visualization-11.5.png&#34;
        itemprop=&#34;contentUrl&#34;
      &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
          class=&#34;lazyload &#34;
          data-src=&#34;/media/docs/alerting/screenshot-traces-visualization-11.5.png&#34;data-srcset=&#34;/media/docs/alerting/screenshot-traces-visualization-11.5.png?w=320 320w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=550 550w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=750 750w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=900 900w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=1040 1040w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=1240 1240w, /media/docs/alerting/screenshot-traces-visualization-11.5.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Trace view&#34;width=&#34;804&#34;height=&#34;411&#34;/&gt;
        &lt;noscript&gt;
          &lt;img
            src=&#34;/media/docs/alerting/screenshot-traces-visualization-11.5.png&#34;
            alt=&#34;Trace view&#34;width=&#34;804&#34;height=&#34;411&#34;/&gt;
        &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;Traces attribute duration and errors directly to specific services and spans, helping you pinpoint the affected component and its scope. With this additional context, alerting on tracing data can help you &lt;strong&gt;identify root causes faster&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You can create trace-based alerts in Grafana Alerting using two main approaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Querying metrics generated from tracing data.&lt;/li&gt;
&lt;li&gt;Using TraceQL, a query language for traces available in Grafana Tempo.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This guide provides introductory examples and distinct approaches for setting up &lt;strong&gt;trace-based alerts&lt;/strong&gt; in Grafana. Tracing data is commonly collected using &lt;strong&gt;OpenTelemetry (OTel)&lt;/strong&gt; instrumentation. OTel allows you to integrate trace data from a wide range of applications and environments into Grafana.&lt;/p&gt;
&lt;h2 id=&#34;alerting-on-span-metrics&#34;&gt;Alerting on span metrics&lt;/h2&gt;
&lt;p&gt;OpenTelemetry provides processors that convert tracing data into Prometheus-style metrics.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;service graph&lt;/strong&gt; and &lt;strong&gt;span metrics&lt;/strong&gt; processors are the standard options in Alloy and Tempo to generate Prometheus metrics from traces. They can generate the rate, error, and duration (RED) metrics from sampled spans.&lt;/p&gt;
&lt;p&gt;You can then create alert rules that query metrics derived from traces.&lt;/p&gt;
&lt;figure
    class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
    style=&#34;max-width: 750px;&#34;
    itemprop=&#34;associatedMedia&#34;
    itemscope=&#34;&#34;
    itemtype=&#34;http://schema.org/ImageObject&#34;
  &gt;&lt;a
        class=&#34;lightbox-link&#34;
        href=&#34;/media/docs/alerting/why-trace-based-metrics.png&#34;
        itemprop=&#34;contentUrl&#34;
      &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
          class=&#34;lazyload &#34;
          data-src=&#34;/media/docs/alerting/why-trace-based-metrics.png&#34;data-srcset=&#34;/media/docs/alerting/why-trace-based-metrics.png?w=320 320w, /media/docs/alerting/why-trace-based-metrics.png?w=550 550w, /media/docs/alerting/why-trace-based-metrics.png?w=750 750w, /media/docs/alerting/why-trace-based-metrics.png?w=900 900w, /media/docs/alerting/why-trace-based-metrics.png?w=1040 1040w, /media/docs/alerting/why-trace-based-metrics.png?w=1240 1240w, /media/docs/alerting/why-trace-based-metrics.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Why metrics if you have traces?&#34;width=&#34;2158&#34;height=&#34;942&#34;/&gt;
        &lt;noscript&gt;
          &lt;img
            src=&#34;/media/docs/alerting/why-trace-based-metrics.png&#34;
            alt=&#34;Why metrics if you have traces?&#34;width=&#34;2158&#34;height=&#34;942&#34;/&gt;
        &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/metrics-from-traces/service_graphs/&#34;&gt;Service graph metrics&lt;/a&gt; focus on inter-service communication and dependency health. They measure the calls between services, helping Grafana to infer the service topology. However, they measure only the interaction between two services—they don’t include the internal processing time of the client service.&lt;/p&gt;
&lt;p&gt;You can use service graph metrics to detect infrastructure issues such as network degradation or service mesh problems.&lt;/p&gt;
&lt;p&gt;For trace-based alerts, we recommend using &lt;a href=&#34;/docs/tempo/latest/metrics-from-traces/span-metrics/&#34;&gt;span metrics&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Span metrics&lt;/strong&gt; measure the total processing time of a service request: capturing what happens inside the service, not just the communication between services. They include the time spent on internal processing and waiting on downstream calls, providing an &lt;strong&gt;end-to-end picture of service performance&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Depending on which generator you use, the following span metrics are produced:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Span metrics generator&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Metric name&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Prometheus metric type&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;a href=&#34;/docs/alloy/latest/reference/components/otelcol/otelcol.connector.spanmetrics/&#34;&gt;Alloy&lt;/a&gt; and &lt;a href=&#34;https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;OTEL span metrics connector&lt;/a&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;traces_span_metrics_calls_total&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Counter&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Total count of the span&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;traces_span_metrics_duration_seconds&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Histogram (native or classic)&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Duration of the span&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;a href=&#34;/docs/tempo/latest/metrics-from-traces/span-metrics/span-metrics-metrics-generator/&#34;&gt;Tempo&lt;/a&gt; and &lt;a href=&#34;/docs/grafana-cloud/monitor-applications/application-observability/setup/metrics-labels/&#34;&gt;Grafana Cloud Application Observability&lt;/a&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;traces_spanmetrics_calls_total&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Counter&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Total count of the span&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;traces_spanmetrics_latency&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Histogram (native or classic)&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Duration of the span&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;traces_spanmetrics_size_total&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Counter&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Total size of spans ingested&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;By default, each metric includes the following labels: &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;span_name&lt;/code&gt;, &lt;code&gt;span_kind&lt;/code&gt;, &lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;status_message&lt;/code&gt;, &lt;code&gt;job&lt;/code&gt;, and &lt;code&gt;instance&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In the metrics generator, you can customize how traces are converted into metrics by configuring histograms, exemplars, metric dimensions, and other options.&lt;/p&gt;
&lt;p&gt;The following examples assume that span metrics have already been generated using one of these options or an alternative.&lt;/p&gt;
&lt;h3 id=&#34;detect-slow-span-operations&#34;&gt;Detect slow span operations&lt;/h3&gt;
&lt;p&gt;This example shows how to define an alert rule that detects when operations handled by a service become slow.&lt;/p&gt;
&lt;p&gt;Before looking at the query, it’s useful to review a few &lt;a href=&#34;/docs/tempo/latest/introduction/trace-structure/&#34;&gt;trace elements&lt;/a&gt; that shape how it works:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A trace represents a single request or transaction as it flows through multiple spans and services. A span refers to a specific operation within a service.&lt;/li&gt;
&lt;li&gt;Each span includes the operation name (&lt;code&gt;span_name&lt;/code&gt;) and its duration (the metric value), as well as additional fields like &lt;a href=&#34;https://opentelemetry.io/docs/concepts/signals/traces/#span-status&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;span status&lt;/a&gt; (&lt;code&gt;status_code&lt;/code&gt;) and &lt;a href=&#34;https://opentelemetry.io/docs/concepts/signals/traces/#span-kind&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;span kind&lt;/a&gt; (&lt;code&gt;span_kind&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;A server span represents work performed on the receiving side of a request, while a client span represents the outbound call (parent span) waiting for a response (client → server).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To detect slow inbound operations within a specific service, you can define an alert rule that detects when the percentile latency of server spans exceeds a threshold. For example:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Detect when the 95th percentile latency of requests (excluding errors) exceeds 2 seconds.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&#34;using-native-histograms&#34;&gt;Using native histograms&lt;/h4&gt;
&lt;p&gt;The following PromQL query uses the &lt;code&gt;traces_span_metrics_duration_seconds&lt;/code&gt; native histogram metric to define the alert rule query.&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;histogram_quantile(0.95,
 sum by (span_name) (
   rate(traces_span_metrics_duration_seconds{
     service_name=&amp;#34;&amp;lt;SERVICE_NAME&amp;gt;&amp;#34;,
     span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;,
     status_code!=&amp;#34;STATUS_CODE_ERROR&amp;#34;
   }[10m])
 )
) &amp;gt; 2&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here’s the query breakdown:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;traces_span_metrics_duration_seconds&lt;/code&gt;
It’s a native histogram produced from spans using Alloy or the OTEL collector. The metric is filtered by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;service_name=&amp;quot;&amp;lt;SERVICE_NAME&amp;gt;&amp;quot;&lt;/code&gt; targets a particular service.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;span_kind=&amp;quot;SPAN_KIND_SERVER&amp;quot;&lt;/code&gt; selects spans handling inbound requests.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status_code!=&amp;quot;STATUS_CODE_ERROR&amp;quot;&lt;/code&gt; excludes spans that ended with errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;You should query &lt;code&gt;traces_spanmetrics_latency&lt;/code&gt; when using other span metric generators.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rate(...[10m])&lt;/code&gt;
Converts the histogram into a per-second histogram over the last 10 minutes (the distribution of spans per second during that period).
This makes the time window explicit and ensures latencies can be calculated over the last 10 minutes using &lt;code&gt;histogram_*&lt;/code&gt; functions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;sum by (span_name)( … )&lt;/code&gt;
Merges all series that share the same &lt;code&gt;span_name&lt;/code&gt;. This creates a &lt;a href=&#34;/docs/grafana/latest/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multidimensional alert&lt;/a&gt; that generates one alert instance per span name (operation).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;histogram_quantile(0.95, ...)&lt;/code&gt;
Calculates p95 latency from the histogram after applying the rate.
The query runs as an &lt;strong&gt;instant Prometheus query&lt;/strong&gt;, returning a single value for the 10-minute window.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; 2&lt;/code&gt;
Defines the threshold condition. It returns only series whose p95 latency exceeds 2 seconds.
Alternatively, you can set this threshold as a Grafana Alerting expression in the UI, as shown in the following screenshot.&lt;/p&gt;
&lt;figure
      class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
      style=&#34;max-width: 750px;&#34;
      itemprop=&#34;associatedMedia&#34;
      itemscope=&#34;&#34;
      itemtype=&#34;http://schema.org/ImageObject&#34;
    &gt;&lt;a
          class=&#34;lightbox-link captioned&#34;
          href=&#34;/media/docs/alerting/trace-based-alertrule-screenshot.png&#34;
          itemprop=&#34;contentUrl&#34;
        &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
            class=&#34;lazyload mb-0&#34;
            data-src=&#34;/media/docs/alerting/trace-based-alertrule-screenshot.png&#34;data-srcset=&#34;/media/docs/alerting/trace-based-alertrule-screenshot.png?w=320 320w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=550 550w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=750 750w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=900 900w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=1040 1040w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=1240 1240w, /media/docs/alerting/trace-based-alertrule-screenshot.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert rule querying span metrics and using threshold expression&#34;width=&#34;1289&#34;height=&#34;937&#34;title=&#34;Alert rule querying span metrics and using threshold expression&#34;/&gt;
          &lt;noscript&gt;
            &lt;img
              src=&#34;/media/docs/alerting/trace-based-alertrule-screenshot.png&#34;
              alt=&#34;Alert rule querying span metrics and using threshold expression&#34;width=&#34;1289&#34;height=&#34;937&#34;title=&#34;Alert rule querying span metrics and using threshold expression&#34;/&gt;
          &lt;/noscript&gt;&lt;/div&gt;&lt;figcaption class=&#34;w-100p caption text-gray-13  &#34;&gt;Alert rule querying span metrics and using threshold expression&lt;/figcaption&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;using-classic-histograms&#34;&gt;Using classic histograms&lt;/h4&gt;
&lt;p&gt;Native histograms have only been stable in Prometheus since v3.8.0, so your span metric generator may still create classic histograms for latency span metrics, either &lt;code&gt;traces_span_metrics_duration_seconds&lt;/code&gt; or &lt;code&gt;traces_spanmetrics_latency&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When using classic histograms, the underlying metric is the same, but the exposition format changes. A classic histogram represents the distribution with fixed buckets and exposes three metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_bucket&lt;/code&gt;: cumulative buckets of the observations.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_sum&lt;/code&gt;: total sum of all observed values.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_count&lt;/code&gt;: count of observed values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To calculate percentiles accurately around a particular threshold (for example, &lt;code&gt;2s&lt;/code&gt;), you have to configure the classic histogram with an explicit bucket at that boundary, such as:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;[&amp;#34;100ms&amp;#34;, &amp;#34;250ms&amp;#34;, &amp;#34;1s&amp;#34;, &amp;#34;2s&amp;#34;, &amp;#34;5s&amp;#34;]&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;otelcol.connector.spanmetrics&lt;/code&gt; component configures the buckets using the &lt;a href=&#34;/docs/alloy/latest/reference/components/otelcol/otelcol.connector.spanmetrics/#explicit&#34;&gt;&lt;code&gt;explicit&lt;/code&gt; block&lt;/a&gt;. The metrics-generator in Tempo configures them with the &lt;a href=&#34;/docs/tempo/latest/configuration/#metrics-generator&#34;&gt;&lt;code&gt;span_metrics.histogram_buckets&lt;/code&gt; setting&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the equivalent PromQL for classic histograms:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;histogram_quantile(0.95,
 sum by (span_name, le) (
   rate(traces_span_metrics_duration_seconds_bucket{
     service_name=&amp;#34;&amp;lt;SERVICE_NAME&amp;gt;&amp;#34;,
     span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;,
     status_code!=&amp;#34;STATUS_CODE_ERROR&amp;#34;
   }[10m])
 )
) &amp;gt; 2&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Key differences compared with the native histograms example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You must configure a histogram bucket matching the desired threshold (for example, &lt;code&gt;2s&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;You must query the &lt;code&gt;_bucket&lt;/code&gt; metric, not the base metric.&lt;/li&gt;
&lt;li&gt;You must include &lt;code&gt;le&lt;/code&gt; in the &lt;code&gt;sum by (…)&lt;/code&gt; grouping for &lt;code&gt;histogram_quantile&lt;/code&gt; calculation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything else remains the same.&lt;/p&gt;
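To see why a bucket boundary at the threshold matters, here is a small Python sketch of the linear interpolation that `histogram_quantile` applies to cumulative classic-histogram buckets. The bucket bounds and cumulative counts are hypothetical, chosen to match the example bucket layout above:

```python
# Estimate a quantile from cumulative classic-histogram buckets using
# linear interpolation inside the bucket that contains the target rank.
# Bucket upper bounds (seconds) and cumulative counts are hypothetical.
buckets = [(0.1, 40), (0.25, 70), (1.0, 90), (2.0, 96), (5.0, 100)]

def quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:
            # Interpolate linearly inside the bucket holding the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

p95 = quantile(0.95, buckets)
# rank 95 falls in the (1.0, 2.0] bucket, so the p95 estimate is
# 1.0 + (95 - 90) / (96 - 90) * (2.0 - 1.0), roughly 1.83 seconds.
```

Without a bucket boundary at `2s`, the interpolation would spread the estimate across a wider bucket, making the `> 2` comparison much less precise.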


&lt;div class=&#34;admonition admonition-note&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Note&lt;/p&gt;&lt;p&gt;The alert rules in these examples create &lt;a href=&#34;/docs/grafana/latest/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional alerts&lt;/a&gt;: one alert instance for each distinct span name.&lt;/p&gt;
&lt;p&gt;Dynamic span routes such as &lt;code&gt;/product/1234&lt;/code&gt; can create separate metric dimensions and alerts for each unique span, which can significantly impact metric costs and performance for large volumes.&lt;/p&gt;
&lt;p&gt;To prevent high-cardinality data, normalize dynamic routes like &lt;code&gt;/product/{id}&lt;/code&gt; using semantic attributes such as &lt;a href=&#34;https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;http.route&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;https://opentelemetry.io/docs/specs/semconv/registry/attributes/url/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;url.template&lt;/code&gt;&lt;/a&gt;, and limit dimensions to low-cardinality fields such as &lt;code&gt;service_name&lt;/code&gt;, &lt;code&gt;status_code&lt;/code&gt;, or &lt;code&gt;http_method&lt;/code&gt;.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;h3 id=&#34;detect-high-error-rate&#34;&gt;Detect high error rate&lt;/h3&gt;
&lt;p&gt;This example defines an alert rule that detects when the error rate for any operation exceeds 20%. You can use error rate alerts like this to identify increases in request errors, such as 5xx responses or internal failures.&lt;/p&gt;
&lt;p&gt;The following query calculates the fraction of failed server spans for each service and operation.&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;(
  sum by (service, span_name) (
    rate(traces_span_metrics_calls_total{
      span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;,
      status_code=&amp;#34;STATUS_CODE_ERROR&amp;#34;
    }[10m])
  )
/
  sum by (service, span_name) (
    rate(traces_span_metrics_calls_total{
      span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;
    }[10m])
  )
) &amp;gt; 0.2&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here’s the query breakdown:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;traces_span_metrics_calls_total&lt;/code&gt;
A counter metric produced from spans that tracks the number of completed span operations.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;span_kind=&amp;quot;SPAN_KIND_SERVER&amp;quot;&lt;/code&gt; selects spans handling inbound requests.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status_code=&amp;quot;STATUS_CODE_ERROR&amp;quot;&lt;/code&gt; selects only spans that ended in error.&lt;/li&gt;
&lt;li&gt;Omitting the &lt;code&gt;status_code&lt;/code&gt; filter in the denominator includes all spans, returning the total span count.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Check whether your metric generator instead creates the &lt;code&gt;traces_spanmetrics_calls_total&lt;/code&gt; metric, and adjust the metric name.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rate(...[10m])&lt;/code&gt;
Converts the cumulative counter into a per-second rate over the last 10 minutes (the number of spans per second during that period).
This makes the time window explicit and ensures the error rate is calculated over the last 10 minutes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;sum by (service, span_name)( … )&lt;/code&gt;
Aggregates per service and operation, creating one alert instance for each &lt;code&gt;(service, span_name)&lt;/code&gt; combination.
This is a &lt;a href=&#34;/docs/grafana/latest/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multidimensional alert&lt;/a&gt; that applies to all services, helping identify which service and corresponding operation is failing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;sum by () (...) / sum by () (...)&lt;/code&gt;
Divides failed spans by total spans to calculate the error rate per operation.
The result is a ratio between &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;, where &lt;code&gt;1&lt;/code&gt; means all operations failed.
The query runs as an &lt;strong&gt;instant Prometheus query&lt;/strong&gt;, returning a single value for the 10-minute window.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; 0.2&lt;/code&gt;
Defines the threshold condition. It returns only series whose error rate is higher than 20% of spans.
Alternatively, you can set this threshold as a Grafana Alerting expression in the UI.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
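The arithmetic behind this condition can be sketched in Python; the per-operation span counts for the 10-minute window are hypothetical:

```python
# Error rate per (service, span_name): failed spans divided by total
# spans, with one alert instance per combination above the threshold.
# The services, span names, and counts below are made-up examples.
spans = {
    # (service, span_name): (error_spans, total_spans)
    ("checkout", "POST /orders"): (120, 400),
    ("checkout", "GET /orders"): (10, 500),
}

threshold = 0.2  # fire above a 20% error rate

firing = {
    key: errors / total
    for key, (errors, total) in spans.items()
    if errors / total > threshold
}
print(firing)  # only POST /orders fires: 120 / 400 = 0.3
```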
&lt;h3 id=&#34;enable-traffic-guardrails&#34;&gt;Enable traffic guardrails&lt;/h3&gt;
&lt;p&gt;When traffic is very low, even a single slow or failing request can trigger an alert.&lt;/p&gt;
&lt;p&gt;To avoid these types of false positives during low-traffic periods, you can include a &lt;strong&gt;minimum traffic condition&lt;/strong&gt; in your alert rule queries. For example:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum by (service, span_name)(
  increase(traces_span_metrics_calls_total{
    span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;
  }[10m])
) &amp;gt; 300&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This query returns only the &lt;code&gt;(service, span_name)&lt;/code&gt; combinations that handled more than 300 requests in the 10-minute period.&lt;/p&gt;
&lt;p&gt;This minimum level of traffic helps prevent false positives, ensuring the alert evaluates a significant number of spans before triggering.&lt;/p&gt;
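&lt;p&gt;Because &lt;code&gt;increase(x[10m])&lt;/code&gt; is equivalent to &lt;code&gt;rate(x[10m]) * 600&lt;/code&gt;, the same guardrail can also be written with &lt;code&gt;rate()&lt;/code&gt;, for consistency with the error-rate query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;sum by (service, span_name) (
  rate(traces_span_metrics_calls_total{
    span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;
  }[10m])
) * 600 &amp;gt; 300&lt;/code&gt;&lt;/pre&gt;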
&lt;p&gt;You can combine this traffic condition with the &lt;strong&gt;error-rate&lt;/strong&gt; query to ensure alerts fire only when both conditions are met:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;promql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;((
  sum by (service, span_name) (
    rate(traces_span_metrics_calls_total{
      span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;,
      status_code=&amp;#34;STATUS_CODE_ERROR&amp;#34;
    }[10m])
  )
/
  sum by (service, span_name) (
    rate(traces_span_metrics_calls_total{
      span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;
    }[10m])
  )
) &amp;gt; 0.2)
and
(
  sum by (service, span_name) (
    increase(traces_span_metrics_calls_total{
      span_kind=&amp;#34;SPAN_KIND_SERVER&amp;#34;
    }[10m])
  ) &amp;gt; 300
)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For a given span, the alert fires when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;error rate exceeds 20%&lt;/strong&gt; over the last 10 minutes.&lt;/li&gt;
&lt;li&gt;The span &lt;strong&gt;handled at least 300 requests&lt;/strong&gt; over the last 10 minutes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Alternatively&lt;/strong&gt;, you can split the alert into separate queries and combine them using a math expression as the threshold. In the example below, &lt;code&gt;$ErrorRateCondition&lt;/code&gt; is the Grafana reference for the error-rate query, and &lt;code&gt;$TrafficCondition&lt;/code&gt; is the reference for the traffic query.&lt;/p&gt;
&lt;figure
    class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
    style=&#34;max-width: 500px;&#34;
    itemprop=&#34;associatedMedia&#34;
    itemscope=&#34;&#34;
    itemtype=&#34;http://schema.org/ImageObject&#34;
  &gt;&lt;a
        class=&#34;lightbox-link&#34;
        href=&#34;/media/docs/alerting/traffic-guardrail-with-separate-queries.png&#34;
        itemprop=&#34;contentUrl&#34;
      &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
          class=&#34;lazyload &#34;
          data-src=&#34;/media/docs/alerting/traffic-guardrail-with-separate-queries.png&#34;data-srcset=&#34;/media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=320 320w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=550 550w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=750 750w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=900 900w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=1040 1040w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=1240 1240w, /media/docs/alerting/traffic-guardrail-with-separate-queries.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert rule with threshold based on two queries&#34;width=&#34;658&#34;height=&#34;250&#34;/&gt;
        &lt;noscript&gt;
          &lt;img
            src=&#34;/media/docs/alerting/traffic-guardrail-with-separate-queries.png&#34;
            alt=&#34;Alert rule with threshold based on two queries&#34;width=&#34;658&#34;height=&#34;250&#34;/&gt;
        &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;In this case, you must ensure both queries group by the same labels.&lt;/p&gt;
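&lt;p&gt;A sketch of the combined Math expression, assuming each referenced query returns the raw ratio and count without thresholds:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-promql&#34;&gt;($ErrorRateCondition &amp;gt; 0.2) &amp;amp;&amp;amp; ($TrafficCondition &amp;gt; 300)&lt;/code&gt;&lt;/pre&gt;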
&lt;p&gt;The advantage of this approach is that you can observe the results of both independent queries. You can then access the query results through the &lt;a href=&#34;/docs/grafana/latest/alerting/alerting-rules/templates/reference/#values&#34;&gt;&lt;code&gt;$values&lt;/code&gt; variable&lt;/a&gt; and display them in notifications or use them in custom labels.&lt;/p&gt;
&lt;p&gt;A potential drawback of splitting queries is that each query runs separately. This increases backend load and can affect query performance, especially in environments with a large number of active alerts.&lt;/p&gt;
&lt;p&gt;You can apply this traffic guardrail pattern to any alert rule.&lt;/p&gt;
&lt;h3 id=&#34;consider-sampling&#34;&gt;Consider sampling&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/set-up-for-tracing/instrument-send/set-up-collector/tail-sampling/&#34;&gt;Sampling&lt;/a&gt; is a technique that reduces the number of collected spans to save costs. There are two main strategies, which can be combined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Head sampling&lt;/strong&gt;: The decision to record or drop a span is made when the trace begins. The condition can be configured probabilistically (a percentage of traces) or by filtering out certain operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tail sampling&lt;/strong&gt;: The decision is made after the trace completes. This allows sampling more interesting operations, such as slow or failing requests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With &lt;strong&gt;head sampling&lt;/strong&gt;, alerting on span metrics should be done with caution, since span metrics will represent only a subset of all traces.&lt;/p&gt;
&lt;p&gt;With &lt;strong&gt;tail sampling&lt;/strong&gt;, it’s important to generate span metrics before a sampling decision is made. &lt;a href=&#34;/docs/grafana-cloud/adaptive-telemetry/adaptive-traces/&#34;&gt;Grafana Cloud Adaptive Traces&lt;/a&gt; handles this automatically. With Alloy or the OpenTelemetry Collector, make sure the SpanMetrics connector runs before the filtering or &lt;a href=&#34;/docs/alloy/latest/reference/components/otelcol/otelcol.processor.tail_sampling/&#34;&gt;tail sampling processor&lt;/a&gt;.&lt;/p&gt;
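&lt;p&gt;In Alloy, the ordering can be expressed by fanning traces out to the SpanMetrics connector alongside the tail sampler. The following is a minimal sketch; the component labels and surrounding pipeline are assumptions to adapt to your setup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-alloy&#34;&gt;otelcol.receiver.otlp &amp;#34;default&amp;#34; {
  grpc {}
  output {
    // Send traces to the spanmetrics connector and the tail sampler in parallel,
    // so span metrics reflect all traces, not only the sampled subset.
    traces = [
      otelcol.connector.spanmetrics.default.input,
      otelcol.processor.tail_sampling.default.input,
    ]
  }
}&lt;/code&gt;&lt;/pre&gt;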
&lt;h2 id=&#34;using-traceql&#34;&gt;Using TraceQL&lt;/h2&gt;
&lt;p&gt;TraceQL is a query language for searching and filtering traces in Grafana Tempo, which uses a syntax similar to &lt;code&gt;PromQL&lt;/code&gt; and &lt;code&gt;LogQL&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With TraceQL, you can skip converting tracing data into span metrics and query raw trace data directly. It provides more flexible filtering based on the trace structure, attributes, or resource metadata, and it can detect issues faster because it doesn&amp;rsquo;t wait for metric generation.&lt;/p&gt;
&lt;p&gt;TraceQL isn&amp;rsquo;t suitable for all scenarios. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inadequate for long-term analysis&lt;/strong&gt;
Trace data has a significantly shorter retention period than metrics. For historical monitoring, convert key tracing data into metrics so that important signals persist beyond the trace retention window.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inadequate for alerting after sampling&lt;/strong&gt;
TraceQL can only query traces that are actually stored in Tempo. If sampling drops a large portion of traces, TraceQL-based alerts may miss real issues. Refer to &lt;a href=&#34;#consider-sampling&#34;&gt;consider sampling&lt;/a&gt; for guidance on how to generate span metrics before sampling.&lt;/p&gt;


&lt;div class=&#34;admonition admonition-caution&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Caution&lt;/p&gt;&lt;p&gt;TraceQL alerting is available in Grafana v12.1 or higher, supported as an &lt;a href=&#34;/docs/release-life-cycle/&#34;&gt;experimental feature&lt;/a&gt;.
Engineering and on-call support isn&amp;rsquo;t available. Documentation is either limited or not provided outside of code comments. No SLA is provided.&lt;/p&gt;
&lt;p&gt;While TraceQL can be powerful for exploring and detecting issues directly from trace data, &lt;strong&gt;alerting with TraceQL shouldn&amp;rsquo;t be used in production environments yet&lt;/strong&gt;. For now, use it only for testing and experimentation.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following example demonstrates how to recreate the previous &lt;strong&gt;alert rule that detected slow span operations&lt;/strong&gt; using TraceQL.&lt;/p&gt;
&lt;p&gt;Follow these steps to create the alert:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Enable TraceQL alerting&lt;/p&gt;
&lt;p&gt;To use TraceQL in alerts, you must enable the &lt;a href=&#34;/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles&#34;&gt;&lt;strong&gt;&lt;code&gt;tempoAlerting&lt;/code&gt;&lt;/strong&gt; feature flag in your Grafana configuration&lt;/a&gt;.&lt;/p&gt;
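&lt;p&gt;In self-managed Grafana, this might look like the following in your configuration file (a sketch; use whichever configuration method applies to your deployment):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-ini&#34;&gt;[feature_toggles]
tempoAlerting = true&lt;/code&gt;&lt;/pre&gt;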
&lt;p&gt;If you use Grafana Cloud, contact Support to enable TraceQL alerting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure the alert query&lt;/p&gt;
&lt;p&gt;In your alert rule, select the &lt;strong&gt;Tempo&lt;/strong&gt; data source, then convert the original PromQL query into the equivalent TraceQL query:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;traceql&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-traceql&#34;&gt;{status != error &amp;amp;&amp;amp; kind = server &amp;amp;&amp;amp; .service.name = &amp;#34;&amp;lt;SERVICE_NAME&amp;gt;&amp;#34;}
| quantile_over_time(duration, .95) by (name)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For a given service, this query calculates the &lt;strong&gt;p95 latency&lt;/strong&gt; for all server spans, excluding errors, and groups them by span name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure the time range&lt;/p&gt;
&lt;p&gt;Currently, TraceQL alerting supports only range queries.
To define the time window, set the query time range to &lt;strong&gt;the last 10 minutes&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;From: &lt;code&gt;now-10m&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To: &lt;code&gt;now&lt;/code&gt;&lt;/p&gt;
&lt;figure
         class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
         style=&#34;max-width: 750px;&#34;
         itemprop=&#34;associatedMedia&#34;
         itemscope=&#34;&#34;
         itemtype=&#34;http://schema.org/ImageObject&#34;
       &gt;&lt;a
             class=&#34;lightbox-link&#34;
             href=&#34;/media/docs/alerting/traceql-alert-configure-time-range.png&#34;
             itemprop=&#34;contentUrl&#34;
           &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
               class=&#34;lazyload &#34;
               data-src=&#34;/media/docs/alerting/traceql-alert-configure-time-range.png&#34;data-srcset=&#34;/media/docs/alerting/traceql-alert-configure-time-range.png?w=320 320w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=550 550w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=750 750w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=900 900w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=1040 1040w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=1240 1240w, /media/docs/alerting/traceql-alert-configure-time-range.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Time range configuration for TraceQL alert rule&#34;width=&#34;933&#34;height=&#34;579&#34;/&gt;
             &lt;noscript&gt;
               &lt;img
                 src=&#34;/media/docs/alerting/traceql-alert-configure-time-range.png&#34;
                 alt=&#34;Time range configuration for TraceQL alert rule&#34;width=&#34;933&#34;height=&#34;579&#34;/&gt;
             &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a reducer expression.&lt;/p&gt;
&lt;p&gt;Range queries return time series data, not a single value. The alert rule must then &lt;strong&gt;reduce&lt;/strong&gt; time series data to a single numeric value before comparing it against a threshold.&lt;/p&gt;
&lt;p&gt;Add a &lt;strong&gt;Reduce&lt;/strong&gt; expression to convert the query results into a single value.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the threshold condition.&lt;/p&gt;
&lt;p&gt;Create a &lt;strong&gt;Threshold&lt;/strong&gt; expression to fire when the p95 latency exceeds 2 seconds: &lt;strong&gt;$B &amp;gt; 2&lt;/strong&gt;.&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link&#34;
           href=&#34;/media/docs/alerting/traceql-alert-configure-threshold.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload &#34;
             data-src=&#34;/media/docs/alerting/traceql-alert-configure-threshold.png&#34;data-srcset=&#34;/media/docs/alerting/traceql-alert-configure-threshold.png?w=320 320w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=550 550w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=750 750w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=900 900w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=1040 1040w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=1240 1240w, /media/docs/alerting/traceql-alert-configure-threshold.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert rule configuration showing reducer and threshold expressions for TraceQL query&#34;width=&#34;939&#34;height=&#34;321&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/traceql-alert-configure-threshold.png&#34;
               alt=&#34;Alert rule configuration showing reducer and threshold expressions for TraceQL query&#34;width=&#34;939&#34;height=&#34;321&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This final alert detects when the p95 latency of the server spans for a particular service (excluding errors) exceeds 2 seconds, using raw trace data instead of span metrics.&lt;/p&gt;
&lt;h2 id=&#34;additional-resources&#34;&gt;Additional resources&lt;/h2&gt;
&lt;p&gt;To explore related topics and expand the examples in this guide, see the following resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/introduction/trace-structure/&#34;&gt;Trace structure&lt;/a&gt;: Learn how traces and spans are structured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/&#34;&gt;Grafana Tempo documentation&lt;/a&gt;: Full reference for Grafana’s open source tracing backend.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/metrics-from-traces/span-metrics/span-metrics-metrics-generator/&#34;&gt;Span metrics using the metrics generator in Tempo&lt;/a&gt;: Generate span metrics directly from traces with Tempo’s built-in metrics generator.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/metrics-from-traces/span-metrics/span-metrics-alloy/&#34;&gt;Span metrics using Grafana Alloy&lt;/a&gt;: Configure Alloy to export span metrics from OpenTelemetry (OTel) traces.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/grafana/latest/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;Multi-dimensional alerts&lt;/a&gt;: Learn how to trigger multiple alert instances per alert rule like in these examples.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/grafana-cloud/alerting-and-irm/slo/&#34;&gt;Grafana SLO documentation&lt;/a&gt;: Use span metrics to define Service Level Objectives (SLOs) in Grafana.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;/docs/tempo/latest/set-up-for-tracing/instrument-send/set-up-collector/tail-sampling/#sampling&#34;&gt;Trace sampling&lt;/a&gt;: Explore sampling strategies and configuration in Grafana Tempo.&lt;/p&gt;


&lt;div class=&#34;admonition admonition-note&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Note&lt;/p&gt;&lt;p&gt;OpenTelemetry instrumentations can record metrics independently of spans.&lt;/p&gt;
&lt;p&gt;These &lt;a href=&#34;https://opentelemetry.io/docs/specs/semconv/general/metrics/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;OTel metrics&lt;/a&gt; are not derived from traces and are not affected by sampling. They can serve as an alternative to span-derived metrics.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="examples-of-trace-based-alerts">Examples of trace-based alerts&lt;/h1>
&lt;p>Metrics are the foundation of most alerting systems. They are usually the first signal that something is wrong, but they don’t always indicate &lt;em>where&lt;/em> or &lt;em>why&lt;/em> a failure occurs.&lt;/p></description></item><item><title>Example of dynamic labels in alert instances</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/dynamic-labels/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/dynamic-labels/</guid><content><![CDATA[&lt;h1 id=&#34;example-of-dynamic-labels-in-alert-instances&#34;&gt;Example of dynamic labels in alert instances&lt;/h1&gt;
&lt;p&gt;Labels are essential for scaling your alerting setup. They define metadata like &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, or &lt;code&gt;environment&lt;/code&gt;, which you can use for alert routing.&lt;/p&gt;
&lt;p&gt;A label like &lt;code&gt;severity=&amp;quot;critical&amp;quot;&lt;/code&gt; can be set statically in the alert rule configuration, or dynamically based on a query value such as the current free disk space. Dynamic labels &lt;strong&gt;adjust label values at runtime&lt;/strong&gt;, allowing you to reuse the same alert rule across different scenarios.&lt;/p&gt;
&lt;p&gt;This example shows how to define dynamic labels based on query values, along with key behavior to keep in mind when using them.&lt;/p&gt;
&lt;p&gt;First, it&amp;rsquo;s important to understand how Grafana Alerting treats 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/annotation-label/#labels&#34;&gt;labels&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;alert-instances-are-defined-by-labels&#34;&gt;Alert instances are defined by labels&lt;/h2&gt;
&lt;p&gt;Each alert rule creates a separate alert instance for every unique combination of labels.&lt;/p&gt;
&lt;p&gt;This is called 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional alerts&lt;/a&gt;: one rule, many instances—&lt;strong&gt;one per unique label set&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For example, a rule that queries CPU usage per host might return multiple series (or dimensions):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, instance=&amp;quot;prod-server-1&amp;quot; }&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, instance=&amp;quot;prod-server-2&amp;quot; }&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, instance=&amp;quot;prod-server-3&amp;quot; }&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each unique label combination defines a distinct alert instance, with its own evaluation state and potential notifications.&lt;/p&gt;
&lt;p&gt;The full label set of an alert instance can include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Labels from the query result (e.g., &lt;code&gt;instance&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Auto-generated labels (e.g., &lt;code&gt;alertname&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;User-defined labels from the rule configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;user-defined-labels&#34;&gt;User-defined labels&lt;/h2&gt;
&lt;p&gt;As shown earlier, alert instances automatically include labels from the query result, such as &lt;code&gt;instance&lt;/code&gt; or &lt;code&gt;job&lt;/code&gt;. To add more context or control alert routing, you can define &lt;em&gt;user-defined labels&lt;/em&gt; in the alert rule configuration:&lt;/p&gt;
&lt;figure
    class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
    style=&#34;max-width: 750px;&#34;
    itemprop=&#34;associatedMedia&#34;
    itemscope=&#34;&#34;
    itemtype=&#34;http://schema.org/ImageObject&#34;
  &gt;&lt;a
        class=&#34;lightbox-link&#34;
        href=&#34;/media/docs/alerting/example-dynamic-labels-edit-labels-v3.png&#34;
        itemprop=&#34;contentUrl&#34;
      &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
          class=&#34;lazyload &#34;
          data-src=&#34;/media/docs/alerting/example-dynamic-labels-edit-labels-v3.png&#34;data-srcset=&#34;/media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=320 320w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=550 550w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=750 750w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=900 900w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=1040 1040w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=1240 1240w, /media/docs/alerting/example-dynamic-labels-edit-labels-v3.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Edit labels UI in the alert rule configuration.&#34;width=&#34;933&#34;height=&#34;380&#34;/&gt;
        &lt;noscript&gt;
          &lt;img
            src=&#34;/media/docs/alerting/example-dynamic-labels-edit-labels-v3.png&#34;
            alt=&#34;Edit labels UI in the alert rule configuration.&#34;width=&#34;933&#34;height=&#34;380&#34;/&gt;
        &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;p&gt;User-defined labels can be either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fixed labels&lt;/strong&gt;: These have the same value for every alert instance. They are often used to include common metadata, such as team ownership.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Templated labels&lt;/strong&gt;: These calculate their values based on the query result at evaluation time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;templated-labels&#34;&gt;Templated labels&lt;/h2&gt;
&lt;p&gt;Templated labels evaluate their values dynamically, based on the query result. This allows the label value to vary per alert instance.&lt;/p&gt;
&lt;p&gt;Use templated labels to inject additional context into alerts. To learn about syntax and use cases, refer to 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/alerting-rules/templates/&#34;&gt;Template annotations and labels&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can define templated labels that produce either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A fixed value per alert instance.&lt;/li&gt;
&lt;li&gt;A dynamic value per alert instance that changes based on the last query result.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;fixed-values-per-alert-instance&#34;&gt;Fixed values per alert instance&lt;/h3&gt;
&lt;p&gt;You can use a known label value to enrich the alert with additional metadata not present in existing labels. For example, you can map the &lt;code&gt;instance&lt;/code&gt; label to an &lt;code&gt;env&lt;/code&gt; label that represents the deployment environment:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Go&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;{{- if eq $labels.instance &amp;#34;prod-server-1&amp;#34; -}}production
{{- else if eq $labels.instance &amp;#34;stag-server-1&amp;#34; -}}staging
{{- else -}}development
{{- end -}}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This produces alert instances like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, instance=&amp;quot;prod-server-1&amp;quot;, env=&amp;quot;production&amp;quot;}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, instance=&amp;quot;stag-server-1&amp;quot;, env=&amp;quot;staging&amp;quot;}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this example, the &lt;code&gt;env&lt;/code&gt; label is fixed for each alert instance and does not change during its lifecycle.&lt;/p&gt;
&lt;h3 id=&#34;dynamic-values-per-alert-instance&#34;&gt;Dynamic values per alert instance&lt;/h3&gt;
&lt;p&gt;You can define a label whose value depends on the numeric result of a query—mapping it to a predefined set of options. This is useful for representing &lt;code&gt;severity&lt;/code&gt; levels within a single alert rule.&lt;/p&gt;
&lt;p&gt;Instead of defining three separate rules like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;CPU ≥ 90&lt;/em&gt; → &lt;code&gt;severity=critical&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;CPU ≥ 80&lt;/em&gt; → &lt;code&gt;severity=warning&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;CPU ≥ 70&lt;/em&gt; → &lt;code&gt;severity=minor&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can define a single rule and assign &lt;code&gt;severity&lt;/code&gt; dynamically using a template:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Go&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;{{/* $values.B.Value refers to the numeric result from query B */}}
{{- if gt $values.B.Value 90.0 -}}critical
{{- else if gt $values.B.Value 80.0 -}}warning
{{- else if gt $values.B.Value 70.0 -}}minor
{{- else -}}none
{{- end -}}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This pattern lets you express multiple alerting scenarios in a single rule, while still routing based on the &lt;code&gt;severity&lt;/code&gt; label value.&lt;/p&gt;
&lt;h2 id=&#34;example-overview&#34;&gt;Example overview&lt;/h2&gt;
&lt;p&gt;In the previous severity template, you can set the alert condition to &lt;code&gt;$B &amp;gt; 70&lt;/code&gt; to prevent firing when &lt;code&gt;severity=none&lt;/code&gt;, and then use the &lt;code&gt;severity&lt;/code&gt; label to route distinct alert instances to different contact points.&lt;/p&gt;
&lt;p&gt;For example, configure a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/notification-policies/&#34;&gt;notification policy&lt;/a&gt; that matches &lt;code&gt;alertname=&amp;quot;ServerHighCPU&amp;quot;&lt;/code&gt; with the following child policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;severity=critical&lt;/code&gt; → escalate to an incident response and management solution (IRM).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;severity=warning&lt;/code&gt; → send to the team&amp;rsquo;s Slack channel.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;severity=minor&lt;/code&gt; → send to a non-urgent queue or log-only dashboard.&lt;/li&gt;
&lt;/ul&gt;
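&lt;p&gt;If you provision notification policies from files, the routing above might be sketched as follows. The receiver names are placeholders for your own contact points:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: 1
policies:
  - orgId: 1
    receiver: default-contact
    routes:
      - object_matchers:
          - [&amp;#34;alertname&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;ServerHighCPU&amp;#34;]
        routes:
          - receiver: irm-escalation
            object_matchers:
              - [&amp;#34;severity&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;critical&amp;#34;]
          - receiver: team-slack
            object_matchers:
              - [&amp;#34;severity&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;warning&amp;#34;]
          - receiver: non-urgent-queue
            object_matchers:
              - [&amp;#34;severity&amp;#34;, &amp;#34;=&amp;#34;, &amp;#34;minor&amp;#34;]&lt;/code&gt;&lt;/pre&gt;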
&lt;p&gt;The resulting alerting flow might look like this:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;$B query&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert instance&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Routed to&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t1&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;65&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, severity=&amp;quot;none&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;Not firing&lt;/code&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t2&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;75&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, severity=&amp;quot;minor&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Non-urgent queue&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t3&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, severity=&amp;quot;warning&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Team Slack channel&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t4&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;95&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{alertname=&amp;quot;ServerHighCPU&amp;quot;, severity=&amp;quot;critical&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;IRM escalation chain&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;This alerting setup allows you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a single rule for multiple severity levels.&lt;/li&gt;
&lt;li&gt;Route alerts dynamically using the label value.&lt;/li&gt;
&lt;li&gt;Simplify alert rule maintenance and avoid duplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, dynamic labels can introduce unexpected behavior when label values change. The next section explains this.&lt;/p&gt;
&lt;h2 id=&#34;caveat-a-label-change-affects-a-distinct-alert-instance&#34;&gt;Caveat: a label change affects a distinct alert instance&lt;/h2&gt;
&lt;p&gt;Remember: &lt;strong&gt;alert instances are defined by their labels&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If a dynamic label changes between evaluations, the new value defines a separate alert instance.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what happens if &lt;code&gt;severity&lt;/code&gt; changes from &lt;code&gt;minor&lt;/code&gt; to &lt;code&gt;warning&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The instance with &lt;code&gt;severity=&amp;quot;minor&amp;quot;&lt;/code&gt; disappears → it becomes a missing series.&lt;/li&gt;
&lt;li&gt;A new instance with &lt;code&gt;severity=&amp;quot;warning&amp;quot;&lt;/code&gt; appears → it starts from scratch.&lt;/li&gt;
&lt;li&gt;After two evaluations without data, the &lt;code&gt;minor&lt;/code&gt; instance is &lt;strong&gt;resolved and evicted&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
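&lt;p&gt;The identity rule behind these steps can be sketched in Python. This is an illustration of the behavior described above, not Grafana internals: instances are keyed by their full label set, so a changed label value creates a new key instead of updating the old one.&lt;/p&gt;

```python
# Illustrative sketch: an alert instance is identified by its full label set,
# so a changed label value produces a *new* instance rather than updating
# the old one. The old instance stops receiving data and eventually resolves.
instances = {}

def evaluate(labels: dict) -> tuple:
    key = tuple(sorted(labels.items()))  # the label set *is* the identity
    instances[key] = "firing"
    return key

k1 = evaluate({"alertname": "ServerHighCPU", "severity": "minor"})
k2 = evaluate({"alertname": "ServerHighCPU", "severity": "warning"})

# Two distinct instances now exist; the "minor" one is no longer updated
# and is later resolved and evicted as a missing series.
print(len(instances))  # 2
```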
&lt;p&gt;Here’s a sequence example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Time&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Query value&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Instance &lt;code&gt;severity=&amp;quot;none&amp;quot;&lt;/code&gt;&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Instance &lt;code&gt;severity=&amp;quot;minor&amp;quot;&lt;/code&gt;&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Instance &lt;code&gt;severity=&amp;quot;warning&amp;quot;&lt;/code&gt;&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t0&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t1&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;75&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🔴 📩&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t2&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ MissingSeries&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🔴 📩&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t3&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;85&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ MissingSeries&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🔴&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t4&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;50&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;📩 Resolved and evicted&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ MissingSeries&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t5&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;50&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;⚠️ MissingSeries&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;t6&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;50&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;🟢&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;📩 Resolved and evicted&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Learn more about this behavior in 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/stale-alert-instances/&#34;&gt;Stale alert instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this example, the &lt;code&gt;minor&lt;/code&gt; and &lt;code&gt;warning&lt;/code&gt; alerts likely represent the same underlying issue, but Grafana treats them as distinct alert instances. As a result, this scenario generates two firing notifications and two resolved notifications, one for each instance.&lt;/p&gt;
&lt;p&gt;This behavior is important to keep in mind when dynamic label values change frequently.&lt;/p&gt;
&lt;p&gt;It can cause alert instances to fire and resolve in short intervals, resulting in &lt;strong&gt;noisy and confusing notifications&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&#34;try-it-with-testdata&#34;&gt;Try it with TestData&lt;/h2&gt;
&lt;p&gt;You can replicate this scenario using the 
    &lt;a href=&#34;/docs/grafana/v12.4/datasources/testdata/&#34;&gt;TestData data source&lt;/a&gt; to simulate an unstable signal—like monitoring a noisy sensor.&lt;/p&gt;
&lt;p&gt;This setup reproduces label flapping and shows how dynamic label values affect alert instance behavior.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;strong&gt;TestData&lt;/strong&gt; data source through the &lt;strong&gt;Connections&lt;/strong&gt; menu.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an alert rule.&lt;/p&gt;
&lt;p&gt;Navigate to &lt;strong&gt;Alerting&lt;/strong&gt; → &lt;strong&gt;Alert rules&lt;/strong&gt; and click &lt;strong&gt;New alert rule&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Simulate a query (&lt;code&gt;$A&lt;/code&gt;) that returns a noisy signal.&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;TestData&lt;/strong&gt; as the data source and configure the scenario.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scenario: Random Walk&lt;/li&gt;
&lt;li&gt;Series count: 1&lt;/li&gt;
&lt;li&gt;Start value: 51&lt;/li&gt;
&lt;li&gt;Min: 50, Max: 100&lt;/li&gt;
&lt;li&gt;Spread: 100 (ensures large changes between consecutive data points)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add an expression.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type: Reduce&lt;/li&gt;
&lt;li&gt;Input: A&lt;/li&gt;
&lt;li&gt;Function: Last (to get the most recent value)&lt;/li&gt;
&lt;li&gt;Name: B&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Define the alert condition.&lt;/p&gt;
&lt;p&gt;Use a threshold like &lt;code&gt;$B &amp;gt;= 50&lt;/code&gt; (it always fires).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Click &lt;strong&gt;Edit Labels&lt;/strong&gt; to add a dynamic label.&lt;/p&gt;
&lt;p&gt;Create a new label &lt;code&gt;severity&lt;/code&gt; and set its value to the following:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Go&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;{{/* $values.B.Value refers to the numeric result from query B */}}
{{- if gt $values.B.Value 90.0 -}}P1
{{- else if gt $values.B.Value 80.0 -}}P2
{{- else if gt $values.B.Value 70.0 -}}P3
{{- else if gt $values.B.Value 60.0 -}}P4
{{- else if gt $values.B.Value 50.0 -}}P5
{{- else -}}none
{{- end -}}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set evaluation behavior.&lt;/p&gt;
&lt;p&gt;Set a short evaluation interval (e.g., &lt;code&gt;10s&lt;/code&gt;) to quickly observe label flapping and alert instance transitions in the history.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Preview alert routing to verify the label template.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Configure notifications&lt;/strong&gt;, toggle &lt;strong&gt;Advanced options&lt;/strong&gt;.&lt;br /&gt;
Click &lt;strong&gt;Preview routing&lt;/strong&gt; and check the value of the &lt;code&gt;severity&lt;/code&gt; label:&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link captioned&#34;
           href=&#34;/media/docs/alerting/example-dynamic-labels-preview-label.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload mb-0&#34;
             data-src=&#34;/media/docs/alerting/example-dynamic-labels-preview-label.png&#34;data-srcset=&#34;/media/docs/alerting/example-dynamic-labels-preview-label.png?w=320 320w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=550 550w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=750 750w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=900 900w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=1040 1040w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=1240 1240w, /media/docs/alerting/example-dynamic-labels-preview-label.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Preview routing multiple times to verify how label values change over time.&#34;width=&#34;1007&#34;height=&#34;298&#34;title=&#34;Preview routing multiple times to verify how label values change over time.&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/example-dynamic-labels-preview-label.png&#34;
               alt=&#34;Preview routing multiple times to verify how label values change over time.&#34;width=&#34;1007&#34;height=&#34;298&#34;title=&#34;Preview routing multiple times to verify how label values change over time.&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;figcaption class=&#34;w-100p caption text-gray-13  &#34;&gt;Preview routing multiple times to verify how label values change over time.&lt;/figcaption&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Observe alert state changes.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Save rule and exit&lt;/strong&gt;, and open the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/monitor-status/view-alert-state-history/&#34;&gt;alert history view&lt;/a&gt; to see how changes in &lt;code&gt;severity&lt;/code&gt; affect the state of distinct alert instances.&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link captioned&#34;
           href=&#34;/media/docs/alerting/example-dynamic-labels-alert-history-page.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload mb-0&#34;
             data-src=&#34;/media/docs/alerting/example-dynamic-labels-alert-history-page.png&#34;data-srcset=&#34;/media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=320 320w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=550 550w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=750 750w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=900 900w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=1040 1040w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=1240 1240w, /media/docs/alerting/example-dynamic-labels-alert-history-page.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;You can find multiple transitions over time as the label value fluctuates.&#34;width=&#34;810&#34;height=&#34;419&#34;title=&#34;You can find multiple transitions over time as the label value fluctuates.&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/example-dynamic-labels-alert-history-page.png&#34;
               alt=&#34;You can find multiple transitions over time as the label value fluctuates.&#34;width=&#34;810&#34;height=&#34;419&#34;title=&#34;You can find multiple transitions over time as the label value fluctuates.&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;figcaption class=&#34;w-100p caption text-gray-13  &#34;&gt;You can find multiple transitions over time as the label value fluctuates.&lt;/figcaption&gt;&lt;/a&gt;&lt;/figure&gt;


&lt;div class=&#34;admonition admonition-tip&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Tip&lt;/p&gt;&lt;p&gt;You can explore this &lt;strong&gt;&lt;a href=&#34;https://play.grafana.org/alerting/grafana/dynamic-label/view?tech=docs&amp;amp;pg=alerting-examples&amp;amp;plcmt=callout-tip&amp;amp;cta=alert-dynamic-labels&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;alerting example in Grafana Play&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Open the example to view alert evaluation results, generated alert instances, the alert history timeline, and alert rule details.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ol&gt;
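&lt;p&gt;The label template in step 6 maps the most recent query value to a severity band. The same banding logic, sketched in Python for clarity (the thresholds mirror the template; this is an illustration, not how Grafana evaluates templates):&lt;/p&gt;

```python
# Illustrative sketch of the severity banding used by the label template.
# Narrow 10-point bands on a noisy signal make the severity label change
# on almost every evaluation, which is exactly what this example provokes.
def severity(value: float) -> str:
    if value > 90: return "P1"
    if value > 80: return "P2"
    if value > 70: return "P3"
    if value > 60: return "P4"
    if value > 50: return "P5"
    return "none"

print(severity(85))  # P2
print(severity(42))  # none
```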
&lt;h2 id=&#34;considerations&#34;&gt;Considerations&lt;/h2&gt;
&lt;p&gt;Dynamic labels let you reuse a single alert rule across multiple escalation scenarios—but they also introduce complexity. When the label value depends on a noisy metric and changes frequently, it can lead to flapping alert instances and excessive notifications.&lt;/p&gt;
&lt;p&gt;These alerts often require tuning to stay reliable and benefit from continuous review. To get the most out of this pattern, consider the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tune evaluation settings and queries for stability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Increase the 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/&#34;&gt;evaluation interval and pending period&lt;/a&gt; to reduce the frequency of state changes. Additionally, consider smoothing metrics with functions like &lt;code&gt;avg_over_time&lt;/code&gt; to reduce flapping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use wider threshold bands&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Define broader ranges in your label template logic to prevent label switching caused by small value changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disable resolved notifications&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When labels change frequently and alerts resolve quickly, you can reduce the number of notifications by disabling resolved notifications at the contact point.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disable the Missing series evaluations setting&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/stale-alert-instances/&#34;&gt;Missing series evaluations setting&lt;/a&gt; (default: 2) defines how many intervals without data are allowed before resolving an instance. Consider disabling it if it&amp;rsquo;s unnecessary for your use case, as it can complicate alert troubleshooting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preserve context across related alerts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ensure alert metadata includes enough information to help correlate related alerts during investigation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use separate alert rules and static labels when simpler&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In some cases, defining separate rules with static labels may be easier to manage than one complex dynamic rule. This also allows you to customize alert queries for each specific case.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a list of additional resources related to this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;Multi-dimensional alerting example&lt;/a&gt; – Explore how Grafana creates separate alert instances for each unique set of labels.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/annotation-label/#labels&#34;&gt;Labels&lt;/a&gt; – Learn about the different types of labels and how they define alert instances.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/alerting-rules/templates/&#34;&gt;Template labels in alert rules&lt;/a&gt; – Use templating to set label values dynamically based on query results.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rule-evaluation/stale-alert-instances/&#34;&gt;Stale alert instances&lt;/a&gt; – Understand how Grafana resolves and removes stale alert instances.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/missing-data/&#34;&gt;Handle missing data&lt;/a&gt; – Learn how Grafana distinguishes between missing series and &lt;code&gt;NoData&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/notifications/notification-policies/&#34;&gt;Notification policies and routing&lt;/a&gt; – Create multiple notification policies to route alerts based on label values like &lt;code&gt;severity&lt;/code&gt; or &lt;code&gt;team&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://play.grafana.org/alerting/grafana/dynamic-label/view?tech=docs&amp;amp;pg=alerting-examples&amp;amp;plcmt=learn-more&amp;amp;cta=alert-dynamic-labels&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Dynamic label example in Grafana Play&lt;/a&gt; - View this example in Grafana Play to explore alert instances and state transitions with dynamic labels.&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="example-of-dynamic-labels-in-alert-instances">Example of dynamic labels in alert instances&lt;/h1>
&lt;p>Labels are essential for scaling your alerting setup. They define metadata like &lt;code>severity&lt;/code>, &lt;code>team&lt;/code>, &lt;code>category&lt;/code>, or &lt;code>environment&lt;/code>, which you can use for alert routing.&lt;/p></description></item><item><title>Example of dynamic thresholds per dimension</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/dynamic-thresholds/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/dynamic-thresholds/</guid><content><![CDATA[&lt;h1 id=&#34;example-of-dynamic-thresholds-per-dimension&#34;&gt;Example of dynamic thresholds per dimension&lt;/h1&gt;
&lt;p&gt;In Grafana Alerting, each alert rule supports only one condition expression.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s enough in many cases—most alerts use a fixed numeric threshold like &lt;code&gt;latency &amp;gt; 3s&lt;/code&gt; or &lt;code&gt;error_rate &amp;gt; 5%&lt;/code&gt; to determine their state.&lt;/p&gt;
&lt;p&gt;As your alerting setup grows, you may find that different targets require different threshold values.&lt;/p&gt;
&lt;p&gt;Instead of duplicating alert rules, you can assign a &lt;strong&gt;different threshold value to each target&lt;/strong&gt;—while keeping the same condition. This simplifies alert maintenance.&lt;/p&gt;
&lt;p&gt;This example shows how to do that using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional alerts&lt;/a&gt; and a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#math&#34;&gt;Math expression&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;example-overview&#34;&gt;Example overview&lt;/h2&gt;
&lt;p&gt;You&amp;rsquo;re monitoring latency across multiple API services. Initially, you want to get alerted if the 95th percentile latency (&lt;code&gt;p95_api_latency&lt;/code&gt;) exceeds 3 seconds, so your alert rule uses a single static threshold:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;p95_api_latency &amp;gt; 3&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But the team quickly finds that some services require stricter thresholds. For example, latency for payment APIs should stay under 1.5s, while background jobs can tolerate up to 5s. The team establishes different thresholds per service:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;p95_api_latency{service=&amp;quot;checkout-api&amp;quot;}&lt;/code&gt;: must stay under &lt;code&gt;1.5s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;p95_api_latency{service=&amp;quot;auth-api&amp;quot;}&lt;/code&gt;: also strict, &lt;code&gt;1.5s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;p95_api_latency{service=&amp;quot;catalog-api&amp;quot;}&lt;/code&gt;: less critical, &lt;code&gt;3s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;p95_api_latency{service=&amp;quot;async-tasks&amp;quot;}&lt;/code&gt;: background jobs can tolerate up to &lt;code&gt;5s&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You want to avoid creating one alert rule per service—this is harder to maintain.&lt;/p&gt;
&lt;p&gt;In Grafana Alerting, you can define one alert rule that monitors multiple similar components, as in this scenario. This is called 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional alerts&lt;/a&gt;: one alert rule, many alert instances—&lt;strong&gt;one per unique label set&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But there&amp;rsquo;s an issue: Grafana supports only &lt;strong&gt;one alert condition per rule&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;One alert rule
├─ One condition (e.g., $A &amp;gt; 3)
│  └─ Applies to all returned series in $A
│     ├─ {service=&amp;#34;checkout-api&amp;#34;}
│     ├─ {service=&amp;#34;auth-api&amp;#34;}
│     ├─ {service=&amp;#34;catalog-api&amp;#34;}
│     └─ {service=&amp;#34;async-tasks&amp;#34;}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To evaluate per-service thresholds, you need a distinct threshold value for each returned series.&lt;/p&gt;
&lt;h2 id=&#34;dynamic-thresholds-using-a-math-expression&#34;&gt;Dynamic thresholds using a Math expression&lt;/h2&gt;
&lt;p&gt;You can create a dynamic alert condition by operating on two queries with a 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#math&#34;&gt;Math expression&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$A&lt;/code&gt; for query results (e.g., &lt;code&gt;p95_api_latency&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$B&lt;/code&gt; for per-service thresholds (from CSV data or another query).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$A &amp;gt; $B&lt;/code&gt; is the &lt;em&gt;Math&lt;/em&gt; expression that defines the alert condition.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Grafana evaluates the &lt;em&gt;Math&lt;/em&gt; expression &lt;strong&gt;per series&lt;/strong&gt;, by joining series from &lt;code&gt;$A&lt;/code&gt; and &lt;code&gt;$B&lt;/code&gt; based on their shared labels before applying the expression.&lt;/p&gt;
&lt;p&gt;Here’s an example of an arithmetic operation:&lt;/p&gt;


&lt;div data-shared=&#34;alerts/math-example.md&#34;&gt;
            &lt;ul&gt;
&lt;li&gt;&lt;code&gt;$A&lt;/code&gt; returns series &lt;code&gt;{host=&amp;quot;web01&amp;quot;} 30&lt;/code&gt; and &lt;code&gt;{host=&amp;quot;web02&amp;quot;} 20&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$B&lt;/code&gt; returns series &lt;code&gt;{host=&amp;quot;web01&amp;quot;} 10&lt;/code&gt; and &lt;code&gt;{host=&amp;quot;web02&amp;quot;} 0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$A &#43; $B&lt;/code&gt; returns &lt;code&gt;{host=&amp;quot;web01&amp;quot;} 40&lt;/code&gt; and &lt;code&gt;{host=&amp;quot;web02&amp;quot;} 20&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
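&lt;p&gt;The join-then-operate behavior above can be sketched in Python. This is an illustration of the matching rule, not Grafana&amp;rsquo;s implementation: series from both queries are paired by their shared labels, and the operation is applied pairwise.&lt;/p&gt;

```python
# Illustrative sketch: a Math expression joins series from two queries by
# their label sets, then applies the operation to each matched pair.
# Series with no match in the other query are dropped from the union.
def math_op(a: dict, b: dict, op):
    # a and b map a frozenset of (label, value) pairs -> numeric value
    return {labels: op(a[labels], b[labels]) for labels in a.keys() & b.keys()}

A = {frozenset({("host", "web01")}): 30, frozenset({("host", "web02")}): 20}
B = {frozenset({("host", "web01")}): 10, frozenset({("host", "web02")}): 0}

result = math_op(A, B, lambda x, y: x + y)  # $A + $B
print(result[frozenset({("host", "web01")})])  # 40
print(result[frozenset({("host", "web02")})])  # 20
```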

        
&lt;p&gt;In practice, you must align your threshold input with the label sets returned by your alert query.&lt;/p&gt;
&lt;p&gt;The following table illustrates how a per-service threshold is evaluated in the previous example:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;$A: p95 latency query&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;$B: threshold value&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;$C: $A&amp;gt;$B&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;State&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;checkout-api&amp;quot;} 3&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;checkout-api&amp;quot;} 1.5&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;checkout-api&amp;quot;} 1&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Firing&lt;/strong&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;auth-api&amp;quot;} 1&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;auth-api&amp;quot;} 1.5&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;auth-api&amp;quot;} 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;catalog-api&amp;quot;} 2&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;catalog-api&amp;quot;} 3&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;catalog-api&amp;quot;} 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;async-tasks&amp;quot;} 3&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;async-tasks&amp;quot;} 5&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{service=&amp;quot;async-tasks&amp;quot;} 0&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$A&lt;/code&gt; comes from the &lt;code&gt;p95_api_latency&lt;/code&gt; query.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$B&lt;/code&gt; is manually defined with a threshold value for each series in &lt;code&gt;$A&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The alert condition compares &lt;code&gt;$A&amp;gt;$B&lt;/code&gt; using a &lt;em&gt;Math&lt;/em&gt; relational operator (e.g., &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;==&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt;) that joins series by matching labels.&lt;/li&gt;
&lt;li&gt;Grafana evaluates the alert condition and sets the firing state where the condition is true.&lt;/li&gt;
&lt;/ul&gt;
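&lt;p&gt;A minimal Python sketch of how the per-service condition resolves. The latency and threshold values mirror the example above; the result is illustrative only, not Grafana&amp;rsquo;s evaluation engine.&lt;/p&gt;

```python
# Illustrative sketch: evaluate $A > $B per series. A result of 1 means the
# alert condition is met (Firing); 0 means it is not (Normal).
latency   = {"checkout-api": 3, "auth-api": 1, "catalog-api": 2, "async-tasks": 3}
threshold = {"checkout-api": 1.5, "auth-api": 1.5, "catalog-api": 3, "async-tasks": 5}

condition = {svc: int(latency[svc] > threshold[svc]) for svc in latency}
state = {svc: ("Firing" if met else "Normal") for svc, met in condition.items()}

print(state["checkout-api"])  # Firing
print(state["auth-api"])      # Normal
```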
&lt;p&gt;The &lt;em&gt;Math&lt;/em&gt; expression works as long as each series in &lt;code&gt;$A&lt;/code&gt; can be matched with exactly one series in &lt;code&gt;$B&lt;/code&gt;; the label sets must align to produce a one-to-one match between the two queries.&lt;/p&gt;


&lt;div class=&#34;admonition admonition-caution&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Caution&lt;/p&gt;&lt;p&gt;If a series in one query doesn’t match any series in the other, it’s excluded from the result and a warning message is displayed:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;1 items &lt;strong&gt;dropped from union(s)&lt;/strong&gt;: [&amp;quot;$A &amp;gt; $B&amp;quot;: ($B: {service=payment-api})]&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Labels in both series don’t need to be identical&lt;/strong&gt;. If one label set is a subset of the other, the series can still join. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$A&lt;/code&gt; returns series &lt;code&gt;{host=&amp;quot;web01&amp;quot;, job=&amp;quot;event&amp;quot;}&lt;/code&gt; 30 and &lt;code&gt;{host=&amp;quot;web02&amp;quot;, job=&amp;quot;event&amp;quot;}&lt;/code&gt; 20.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$B&lt;/code&gt; returns series &lt;code&gt;{host=&amp;quot;web01&amp;quot;}&lt;/code&gt; 10 and &lt;code&gt;{host=&amp;quot;web02&amp;quot;}&lt;/code&gt; 0.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$A&lt;/code&gt; &#43; &lt;code&gt;$B&lt;/code&gt; returns &lt;code&gt;{host=&amp;quot;web01&amp;quot;, job=&amp;quot;event&amp;quot;}&lt;/code&gt; 40 and &lt;code&gt;{host=&amp;quot;web02&amp;quot;, job=&amp;quot;event&amp;quot;}&lt;/code&gt; 20.&lt;/li&gt;
&lt;/ul&gt;
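&lt;p&gt;The same join applies to the alert condition. As an illustration using the series above, the &lt;em&gt;Math&lt;/em&gt; expression &lt;code&gt;$A &amp;gt; $B&lt;/code&gt; compares each matched pair and returns &lt;code&gt;1&lt;/code&gt; (true) or &lt;code&gt;0&lt;/code&gt; (false) per series:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-none&#34;&gt;$A &amp;gt; $B
{host=&amp;#34;web01&amp;#34;, job=&amp;#34;event&amp;#34;} 1   (30 &amp;gt; 10)
{host=&amp;#34;web02&amp;#34;, job=&amp;#34;event&amp;#34;} 1   (20 &amp;gt; 0)&lt;/code&gt;&lt;/pre&gt;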
&lt;h2 id=&#34;try-it-with-testdata&#34;&gt;Try it with TestData&lt;/h2&gt;
&lt;p&gt;You can use the 
    &lt;a href=&#34;/docs/grafana/v12.4/datasources/testdata/&#34;&gt;TestData data source&lt;/a&gt; to replicate this example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;strong&gt;TestData&lt;/strong&gt; data source through the &lt;strong&gt;Connections&lt;/strong&gt; menu.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an alert rule.&lt;/p&gt;
&lt;p&gt;Navigate to &lt;strong&gt;Alerting&lt;/strong&gt; → &lt;strong&gt;Alert rules&lt;/strong&gt; and click &lt;strong&gt;New alert rule&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Simulate a query (&lt;code&gt;$A&lt;/code&gt;) that returns latencies for each service.&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;TestData&lt;/strong&gt; as the data source and configure the scenario.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Scenario: Random Walk&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alias: latency&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Labels: service=api-$seriesIndex&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Series count: 4&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start value: 1&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Min: 1, Max: 4&lt;/p&gt;
&lt;p&gt;This uses &lt;code&gt;$seriesIndex&lt;/code&gt; to assign unique service labels: &lt;code&gt;api-0&lt;/code&gt;, &lt;code&gt;api-1&lt;/code&gt;, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link&#34;
           href=&#34;/media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload &#34;
             data-src=&#34;/media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png&#34;data-srcset=&#34;/media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=320 320w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=550 550w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=750 750w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=900 900w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=1040 1040w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=1240 1240w, /media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;TestData data source returns 4 series to simulate latencies for distinct API services.&#34;width=&#34;2330&#34;height=&#34;1348&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/example-dynamic-thresholds-latency-series-v2.png&#34;
               alt=&#34;TestData data source returns 4 series to simulate latencies for distinct API services.&#34;width=&#34;2330&#34;height=&#34;1348&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;/a&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Define per-service thresholds with static data.&lt;/p&gt;
&lt;p&gt;Add a new query (&lt;code&gt;$B&lt;/code&gt;) and select &lt;strong&gt;TestData&lt;/strong&gt; as the data source.&lt;/p&gt;
&lt;p&gt;From &lt;strong&gt;Scenario&lt;/strong&gt;, select &lt;strong&gt;CSV Content&lt;/strong&gt; and paste this CSV:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt; service,value
 api-0,1.5
 api-1,1.5
 api-2,3
 api-3,5&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;service&lt;/code&gt; column must match the labels from &lt;code&gt;$A&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;value&lt;/code&gt; column is a numeric value used for the alert comparison.&lt;/p&gt;
&lt;p&gt;For details on CSV format requirements, see 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/table-data/&#34;&gt;table data examples&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a new &lt;strong&gt;Reduce&lt;/strong&gt; expression (&lt;code&gt;$C&lt;/code&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type: Reduce&lt;/li&gt;
&lt;li&gt;Input: A&lt;/li&gt;
&lt;li&gt;Function: Mean&lt;/li&gt;
&lt;li&gt;Name: C&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This calculates the average latency for each service: &lt;code&gt;api-0&lt;/code&gt;, &lt;code&gt;api-1&lt;/code&gt;, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a new &lt;strong&gt;Math&lt;/strong&gt; expression.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type: Math&lt;/li&gt;
&lt;li&gt;Expression: &lt;code&gt;$C &amp;gt; $B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Set this expression as the &lt;strong&gt;alert condition&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This fires if the average latency (&lt;code&gt;$C&lt;/code&gt;) exceeds the threshold from &lt;code&gt;$B&lt;/code&gt; for any service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preview&lt;/strong&gt; the alert.&lt;/p&gt;
&lt;figure
       class=&#34;figure-wrapper figure-wrapper__lightbox w-100p &#34;
       style=&#34;max-width: 750px;&#34;
       itemprop=&#34;associatedMedia&#34;
       itemscope=&#34;&#34;
       itemtype=&#34;http://schema.org/ImageObject&#34;
     &gt;&lt;a
           class=&#34;lightbox-link captioned&#34;
           href=&#34;/media/docs/alerting/example-dynamic-thresholds-preview-v3.png&#34;
           itemprop=&#34;contentUrl&#34;
         &gt;&lt;div class=&#34;img-wrapper w-100p h-auto&#34;&gt;&lt;img
             class=&#34;lazyload mb-0&#34;
             data-src=&#34;/media/docs/alerting/example-dynamic-thresholds-preview-v3.png&#34;data-srcset=&#34;/media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=320 320w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=550 550w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=750 750w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=900 900w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=1040 1040w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=1240 1240w, /media/docs/alerting/example-dynamic-thresholds-preview-v3.png?w=1920 1920w&#34;data-sizes=&#34;auto&#34;alt=&#34;Alert preview evaluating multiple series with distinct threshold values&#34;width=&#34;1175&#34;height=&#34;873&#34;title=&#34;Alert preview evaluating multiple series with distinct threshold values&#34;/&gt;
           &lt;noscript&gt;
             &lt;img
               src=&#34;/media/docs/alerting/example-dynamic-thresholds-preview-v3.png&#34;
               alt=&#34;Alert preview evaluating multiple series with distinct threshold values&#34;width=&#34;1175&#34;height=&#34;873&#34;title=&#34;Alert preview evaluating multiple series with distinct threshold values&#34;/&gt;
           &lt;/noscript&gt;&lt;/div&gt;&lt;figcaption class=&#34;w-100p caption text-gray-13  &#34;&gt;Alert preview evaluating multiple series with distinct threshold values&lt;/figcaption&gt;&lt;/a&gt;&lt;/figure&gt;


&lt;div class=&#34;admonition admonition-tip&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Tip&lt;/p&gt;&lt;p&gt;You can explore this &lt;strong&gt;&lt;a href=&#34;https://play.grafana.org/alerting/grafana/dynamic-thresholds/view?tech=docs&amp;amp;pg=alerting-examples&amp;amp;plcmt=callout-tip&amp;amp;cta=alert-dynamic-thresholds&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;alerting example in Grafana Play&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Open the example to view alert evaluation results, generated alert instances, the alert history timeline, and alert rule details.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;other-use-cases&#34;&gt;Other use cases&lt;/h2&gt;
&lt;p&gt;This example showed how to build a single alert rule with different thresholds per series using 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;multi-dimensional alerts&lt;/a&gt; and 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/fundamentals/alert-rules/queries-conditions/#math&#34;&gt;Math expressions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This approach scales well when monitoring similar components with distinct reliability goals.&lt;/p&gt;
&lt;p&gt;By aligning series from two queries, you can apply a dynamic threshold—one value per label set—without duplicating rules.&lt;/p&gt;
&lt;p&gt;While this example uses static CSV content to define thresholds, the same technique works in other scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dynamic thresholds from queries or recording rules&lt;/strong&gt;: Fetch threshold values from a real-time query, or from 
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/alerting-rules/create-recording-rules/&#34;&gt;custom recording rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Combine multiple conditions&lt;/strong&gt;: Build more advanced threshold logic by combining multiple conditions—such as latency, error rate, or traffic volume.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, you can define a PromQL expression that sets a latency threshold which adjusts based on traffic—allowing higher response times during periods of high load.&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;(
  // Fires when p95 latency &amp;gt; 2s during usual traffic (≤ 1000 req/s)
  service:latency:p95 &amp;gt; 2 and service:request_rate:rate1m &amp;lt;= 1000
)
or
(
  // Fires when p95 latency &amp;gt; 4s during high traffic (&amp;gt; 1000 req/s)
  service:latency:p95 &amp;gt; 4 and service:request_rate:rate1m &amp;gt; 1000
)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
]]></content><description>&lt;h1 id="example-of-dynamic-thresholds-per-dimension">Example of dynamic thresholds per dimension&lt;/h1>
&lt;p>In Grafana Alerting, each alert rule supports only one condition expression.&lt;/p>
&lt;p>That&amp;rsquo;s enough in many cases—most alerts use a fixed numeric threshold like &lt;code>latency &amp;gt; 3s&lt;/code> or &lt;code>error_rate &amp;gt; 5%&lt;/code> to determine their state.&lt;/p></description></item><item><title>Examples of high-cardinality alerts</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/high-cardinality-alerts/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/high-cardinality-alerts/</guid><content><![CDATA[&lt;h1 id=&#34;examples-of-high-cardinality-alerts&#34;&gt;Examples of high-cardinality alerts&lt;/h1&gt;
&lt;p&gt;In Prometheus and Mimir, metrics are stored as time series, where each unique set of labels defines a distinct series.&lt;/p&gt;
&lt;p&gt;A large number of unique series (&lt;em&gt;high cardinality&lt;/em&gt;) can overload your metrics backend, slow down dashboard and alert queries, and quickly increase your observability costs or exceed the limits of your Grafana Cloud plan.&lt;/p&gt;
&lt;p&gt;These examples show how to detect and alert on early signs of high cardinality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Total active series near limits&lt;/strong&gt;: detect when your Prometheus, Mimir, or Grafana Cloud Metrics instance approaches soft or hard limits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Series increase per metric or scope:&lt;/strong&gt; fine-tune detection to identify growth in a particular metric or domain.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sudden series growth&lt;/strong&gt;: detect runaway cardinality increases caused by misconfigured exporters or new deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High ingestion rate&lt;/strong&gt;: detect when too many samples per second are being ingested, even if the total series count is stable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use these alert patterns to act on high-cardinality growth, and consider implementing &lt;a href=&#34;/docs/grafana-cloud/adaptive-telemetry/adaptive-metrics/introduction/&#34;&gt;Adaptive Metrics recommendations&lt;/a&gt; to keep your observability costs under control.&lt;/p&gt;
&lt;h2 id=&#34;choose-metrics-to-monitor-active-series&#34;&gt;Choose metrics to monitor active series&lt;/h2&gt;
&lt;p&gt;First, identify which metric reports the number of active time series.&lt;/p&gt;
&lt;p&gt;Prometheus, Mimir, and Grafana Cloud expose this information differently:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Environment&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Metric&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Reports the number of active series currently stored in memory (the head block) of a single Prometheus instance. It includes series that have stopped receiving samples for up to one hour.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Grafana Cloud&lt;/strong&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;grafanacloud_instance_active_series&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Tracks the number of &lt;a href=&#34;/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/metrics-invoice/&#34;&gt;active series in your Grafana Cloud Metrics backend (Mimir&lt;/a&gt;).&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;strong&gt;Prometheus or Mimir&lt;/strong&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;count({__name__!=&amp;quot;&amp;quot;})&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Counts the number of series with recent samples by scanning the TSDB index. This query is expensive and should be exposed through a recording rule.&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;h2 id=&#34;detect-total-active-series-near-limits&#34;&gt;Detect total active series near limits&lt;/h2&gt;
&lt;p&gt;A high number of active series increases memory usage and can impact performance. Grafana Cloud enforces usage limits to prevent your instance from running into these performance issues.&lt;/p&gt;
&lt;p&gt;In Prometheus, you can alert when the total number of active series exceeds a threshold:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;prometheus_tsdb_head_series &amp;gt; 1.5e6&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This fires when the instance exceeds 1.5 million active series.&lt;br /&gt;
Adjust the threshold based on the capacity of your Prometheus host and observed load.&lt;/p&gt;
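&lt;p&gt;In a Prometheus-style rule file, the same check can be expressed as an alerting rule. The following is a sketch; the rule name, &lt;code&gt;for&lt;/code&gt; duration, and severity label are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: cardinality
    rules:
      - alert: TooManyActiveSeries
        expr: prometheus_tsdb_head_series &amp;gt; 1.5e6
        # Require the condition to hold for 15 minutes to avoid flapping
        for: 15m
        labels:
          severity: warning&lt;/code&gt;&lt;/pre&gt;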
&lt;p&gt;In Grafana Cloud, use the &lt;code&gt;grafanacloud_instance_active_series&lt;/code&gt; metric to monitor active series across your managed Mimir backend:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;grafanacloud_instance_active_series &amp;gt; 1.5e6&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Grafana Cloud also provides account-level limits through the &lt;code&gt;grafanacloud_instance_metrics_limits&lt;/code&gt; metric.&lt;/p&gt;
&lt;p&gt;For more robust alerting, you can compare your current usage to the &lt;code&gt;max_global_series_per_user&lt;/code&gt; limit:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;(
  grafanacloud_instance_active_series
  / on (id)
    grafanacloud_instance_metrics_limits{limit_name=&amp;#34;max_global_series_per_user&amp;#34;}
)
* on (id) group_left(name) grafanacloud_instance_info
&amp;gt; 0.9&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;grafanacloud_instance_active_series&lt;/code&gt;&lt;br /&gt;
Returns the current number of active series for each Mimir (Prometheus) instance (&lt;code&gt;id&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;/ on (id) grafanacloud_instance_metrics_limits{limit_name=&amp;quot;max_global_series_per_user&amp;quot;}&lt;/code&gt;&lt;br /&gt;
Divides current usage by the account limit to calculate a utilization ratio between 0 and 1 (where &lt;code&gt;1&lt;/code&gt; means the limit is reached).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;* on (id) group_left(name) grafanacloud_instance_info&lt;/code&gt;&lt;br /&gt;
Joins instance metadata to display the instance name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; 0.9&lt;/code&gt;&lt;br /&gt;
Defines the threshold condition to fire when usage exceeds 90% of the limit.&lt;br /&gt;
Adjust this value according to your alert goal. Alternatively, you can set the threshold as a Grafana Alerting expression in the UI.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
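&lt;p&gt;As a Prometheus-style alerting rule, the utilization check might look like this (a sketch; the rule name, duration, and annotation wording are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: series-limits
    rules:
      - alert: ActiveSeriesNearLimit
        expr: |
          (
            grafanacloud_instance_active_series
            / on (id)
              grafanacloud_instance_metrics_limits{limit_name=&amp;#34;max_global_series_per_user&amp;#34;}
          )
          * on (id) group_left(name) grafanacloud_instance_info
          &amp;gt; 0.9
        for: 10m
        annotations:
          # $value holds the utilization ratio computed by the expression
          summary: &amp;#34;{{ $labels.name }} is at {{ $value | humanizePercentage }} of its series limit&amp;#34;&lt;/code&gt;&lt;/pre&gt;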
&lt;h2 id=&#34;detect-high-cardinality-per-metric&#34;&gt;Detect high-cardinality per metric&lt;/h2&gt;
&lt;p&gt;Instead of monitoring the total number of active series, you can fine-tune alerts to detect high cardinality within a specific scope — for example, by filtering on certain namespaces, services, or metrics known to generate many label combinations.&lt;/p&gt;
&lt;p&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;Multi-dimensional alerts&lt;/a&gt; let you evaluate each metric independently, so you can identify which metric is responsible for the label explosion instead of only tracking the overall total.&lt;/p&gt;
&lt;p&gt;You can apply label filters, or use &lt;code&gt;{__name__=~&amp;quot;regex&amp;quot;}&lt;/code&gt; to select specific metrics. Then, use &lt;code&gt;count by (__name__)&lt;/code&gt; to group results per metric name.&lt;/p&gt;
&lt;p&gt;Because the &lt;code&gt;__name__&lt;/code&gt; selector scans the entire TSDB index, it’s recommended to evaluate this query through a &lt;strong&gt;recording rule&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;# Only HTTP/RPC-style metrics
active_series_per_metric:http_rpc =
label_replace(
  count by (__name__) ({__name__=~&amp;#34;http_.*|rpc_.*&amp;#34;}),
  &amp;#34;metric&amp;#34;, &amp;#34;$1&amp;#34;, &amp;#34;__name__&amp;#34;, &amp;#34;(.*)&amp;#34;
)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This recording rule stores the number of active series per metric name.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;count by (__name__) ({__name__=~&amp;quot;http_.*|rpc_.*&amp;quot;})&lt;/code&gt;&lt;br /&gt;
Counts the number of active series per metric matching the &lt;code&gt;http_.*&lt;/code&gt; or &lt;code&gt;rpc_.*&lt;/code&gt; regex.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;label_replace(..., &amp;quot;metric&amp;quot;, &amp;quot;$1&amp;quot;, &amp;quot;__name__&amp;quot;, &amp;quot;(.*)&amp;quot;)&lt;/code&gt;&lt;br /&gt;
Copies the metric name into a new label called &lt;code&gt;metric&lt;/code&gt;.&lt;br /&gt;
This enables generating one alert instance per metric because &lt;code&gt;__name__&lt;/code&gt; is not treated as a regular label.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adjust the threshold and recording rule scope based on the label usage and normal behavior of your observed metrics.&lt;/p&gt;
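&lt;p&gt;In Prometheus or Mimir, the expression can be registered in a rule file. The following sketch assumes a 5-minute evaluation interval; adjust it to your environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: cardinality-recording
    interval: 5m
    rules:
      - record: active_series_per_metric:http_rpc
        expr: |
          label_replace(
            count by (__name__) ({__name__=~&amp;#34;http_.*|rpc_.*&amp;#34;}),
            &amp;#34;metric&amp;#34;, &amp;#34;$1&amp;#34;, &amp;#34;__name__&amp;#34;, &amp;#34;(.*)&amp;#34;
          )&lt;/code&gt;&lt;/pre&gt;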
&lt;p&gt;After the recording rule is available, you can define this multi-dimensional alert rule as follows:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;active_series_per_metric:http_rpc &amp;gt; 100&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Grafana Alerting evaluates each row (or time series) returned by the &lt;code&gt;active_series_per_metric:http_rpc&lt;/code&gt; recording rule as a separate alert instance, producing independent alert instance states:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;button-div&#34;&gt;
      &lt;button class=&#34;expand-table-btn&#34;&gt;Expand table&lt;/button&gt;
    &lt;/div&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Alert instance&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;Value&lt;/th&gt;
              &lt;th style=&#34;text-align: left&#34;&gt;State&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{metric=&amp;quot;http_requests_total&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;320&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{metric=&amp;quot;rpc_client_calls_total&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;45&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Normal&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;&lt;code&gt;{metric=&amp;quot;rpc_server_errors_total&amp;quot;}&lt;/code&gt;&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;110&lt;/td&gt;
              &lt;td style=&#34;text-align: left&#34;&gt;Firing&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Each metric name (&lt;code&gt;__name__&lt;/code&gt;) becomes a separate alert instance, so you immediately see which metric exceeds the expected limit.&lt;/p&gt;
&lt;h2 id=&#34;detect-sudden-cardinality-growth&#34;&gt;Detect sudden cardinality growth&lt;/h2&gt;
&lt;p&gt;Even if the number of active series stays within safe limits, a sudden increase can signal a misbehaving exporter, a new deployment, or an unexpected label explosion. Detecting these spikes early can help you prevent potential issues, or simply flag deployment changes that might need adjustment.&lt;/p&gt;
&lt;p&gt;You can use any of the metrics from the previous examples to track short-term changes in the total active series:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;delta(active_series_metric[10m]) &amp;gt; 1000&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This alert fires when the number of active series increases by more than &lt;strong&gt;1000&lt;/strong&gt; within the last 10 minutes.&lt;br /&gt;
Adjust the time window (for example, &lt;code&gt;[5m]&lt;/code&gt; or &lt;code&gt;[30m]&lt;/code&gt;) and threshold to match your environment’s normal variability.&lt;/p&gt;
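&lt;p&gt;An absolute delta can be noisy across instances of different sizes. As an alternative sketch, you can alert on relative growth by comparing the current value to an earlier one with &lt;code&gt;offset&lt;/code&gt; (here using &lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt; as the metric):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Fires when active series grew by more than 20% over the last hour
prometheus_tsdb_head_series
  / (prometheus_tsdb_head_series offset 1h)
&amp;gt; 1.2&lt;/code&gt;&lt;/pre&gt;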
&lt;h2 id=&#34;detect-high-ingestion-rate&#34;&gt;Detect high ingestion rate&lt;/h2&gt;
&lt;p&gt;Even if label cardinality remains under control, a high ingestion rate can affect Prometheus performance or increase observability costs.&lt;/p&gt;
&lt;p&gt;In Prometheus, this usually happens when scrapes occur too frequently or when exporters generate large numbers of samples in short intervals.&lt;/p&gt;
&lt;p&gt;To find an appropriate alert threshold, start by monitoring normal ingestion peaks, then set the threshold below the point where scrapes or WAL operations begin to slow down.&lt;/p&gt;
&lt;p&gt;In Prometheus, use the &lt;code&gt;prometheus_tsdb_head_samples_appended_total&lt;/code&gt; counter to measure the rate at which samples are appended to storage:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;rate(prometheus_tsdb_head_samples_appended_total[10m]) &amp;gt; 1e5&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The alert rule query returns the average ingestion rate per second over the last 10 minutes and fires when the value exceeds 100,000 samples per second.&lt;/p&gt;
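&lt;p&gt;To estimate a suitable threshold, you can first chart the historical peak of the same expression. The following query is an illustrative sketch: adjust the 7-day lookback and the subquery resolution to match your retention and scrape interval.&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;max_over_time(rate(prometheus_tsdb_head_samples_appended_total[10m])[7d:10m])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;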
&lt;p&gt;In Grafana Cloud, use the &lt;code&gt;grafanacloud_instance_samples_per_second&lt;/code&gt; metric to monitor total ingestion rate of your Mimir instances:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;grafanacloud_instance_samples_per_second
  * on (id) group_left(name) grafanacloud_instance_info
&amp;gt; 1e5&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;on (id) group_left(name)&lt;/code&gt; join copies the &lt;code&gt;name&lt;/code&gt; label from &lt;code&gt;grafanacloud_instance_info&lt;/code&gt; onto each result, so notifications identify the affected instance by name.&lt;/p&gt;
&lt;p&gt;Grafana Cloud metrics limits are also based on &lt;a href=&#34;/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/metrics-invoice/&#34;&gt;data points per minute (DPM)&lt;/a&gt;: the number of samples sent per minute across all your active series.&lt;/p&gt;
&lt;p&gt;To monitor when your actual data-point rate approaches your DPM limit, you can compare total ingestion to your plan’s DPM limit:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;(
  (grafanacloud_instance_samples_per_second * 60)
  / grafanacloud_org_metrics_included_dpm_per_series
)
&amp;gt; 0.9&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;(grafanacloud_instance_samples_per_second * 60)&lt;/code&gt;&lt;br /&gt;
Converts the ingestion rate from data points per second to data points per minute (DPM).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;/ grafanacloud_org_metrics_included_dpm_per_series&lt;/code&gt;&lt;br /&gt;
Divides current DPM usage by the DPM limit to calculate a utilization ratio between 0 and 1 (where &lt;code&gt;1&lt;/code&gt; means the limit is reached).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; 0.9&lt;/code&gt;&lt;br /&gt;
Defines the threshold condition to fire when the usage exceeds 90% of the DPM limit.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This alert helps you detect when your organization’s ingestion rate is rising toward its limits. Use ingestion rate alerts to detect workload spikes, exporter misconfigurations, or rapid increases in ingestion volume and cost.&lt;/p&gt;
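&lt;p&gt;When the utilization alert fires, a follow-up query can show which instances contribute most to ingestion. This is an illustrative sketch that reuses the same Grafana Cloud metrics as above:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;topk(5,
  grafanacloud_instance_samples_per_second
    * on (id) group_left(name) grafanacloud_instance_info
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;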
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;p&gt;Here’s a list of additional resources related to this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
    &lt;a href=&#34;/docs/grafana/v12.4/alerting/best-practices/multi-dimensional-alerts/&#34;&gt;Multi-dimensional alerting example&lt;/a&gt; – Learn how Grafana creates separate alert instances for each unique label set.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/metrics-invoice/&#34;&gt;Understand Grafana Cloud active series and DPM&lt;/a&gt; – See how active series and data points per minute (DPM) are used to calculate metrics usage in Grafana Cloud.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/grafana-cloud/cost-management-and-billing/usage-cost-alerts/&#34;&gt;Create Grafana Cloud usage alerts&lt;/a&gt; – Set up alerts when your usage or costs approach your predefined limits.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/mimir/latest/manage/run-production-environment/planning-capacity/&#34;&gt;Plan capacity for Mimir&lt;/a&gt; – Learn how to plan ingestion rate and memory capacity for Mimir or Prometheus environments.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/grafana-cloud/adaptive-telemetry/adaptive-metrics/introduction/&#34;&gt;Adaptive Metrics recommendations&lt;/a&gt; – Use Adaptive Metrics to automatically reduce high-cardinality metrics and control observability costs.&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="examples-of-high-cardinality-alerts">Examples of high-cardinality alerts&lt;/h1>
&lt;p>In Prometheus and Mimir, metrics are stored as time series, where each unique set of labels defines a distinct series.&lt;/p>
&lt;p>A large number of unique series (&lt;em>high cardinality&lt;/em>) can overload your metrics backend, slow down dashboard and alert queries, and quickly increase your observability costs or exceed the limits of your Grafana Cloud plan.&lt;/p></description></item><item><title>Grafana Alerting tutorials</title><link>https://grafana.com/docs/grafana/v12.4/alerting/examples/tutorials/</link><pubDate>Fri, 03 Apr 2026 12:35:46 -0500</pubDate><guid>https://grafana.com/docs/grafana/v12.4/alerting/examples/tutorials/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-alerting-tutorials&#34;&gt;Grafana Alerting tutorials&lt;/h1&gt;
&lt;p&gt;This section provides step-by-step tutorials to help you learn Grafana Alerting and explore key features through practical, easy-to-follow examples.&lt;/p&gt;
&lt;h2 id=&#34;get-started-with-grafana-alerting&#34;&gt;Get started with Grafana Alerting&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started/&#34;&gt;Create and receive your first alert&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt2/&#34;&gt;Create multi-dimensional alerts and route them&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt3/&#34;&gt;Group alert notifications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt4/&#34;&gt;Template your alert notifications&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;additional-tutorials&#34;&gt;Additional tutorials&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt5/&#34;&gt;Route alerts using dynamic labels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/alerting-get-started-pt6/&#34;&gt;Link alerts to visualizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/create-alerts-with-logs/&#34;&gt;Create alerts with log data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/tutorials/create-alerts-from-flux-queries/&#34;&gt;Create alerts with InfluxDB and Flux queries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="grafana-alerting-tutorials">Grafana Alerting tutorials&lt;/h1>
&lt;p>This section provides step-by-step tutorials to help you learn Grafana Alerting and explore key features through practical, easy-to-follow examples.&lt;/p>
&lt;h2 id="get-started-with-grafana-alerting">Get started with Grafana Alerting&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="/tutorials/alerting-get-started/">Create and receive your first alert&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/alerting-get-started-pt2/">Create multi-dimensional alerts and route them&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/alerting-get-started-pt3/">Group alert notifications&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/alerting-get-started-pt4/">Template your alert notifications&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="additional-tutorials">Additional tutorials&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="/tutorials/alerting-get-started-pt5/">Route alerts using dynamic labels&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/alerting-get-started-pt6/">Link alerts to visualizations&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/create-alerts-with-logs/">Create alerts with log data&lt;/a>&lt;/li>
&lt;li>&lt;a href="/tutorials/create-alerts-from-flux-queries/">Create alerts with InfluxDB and Flux queries&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>