Service graphs on Grafana Labs

Enable service graphs

Fri, 03 Apr 2026 19:43:06 +0000

Service graphs are generated in Tempo and pushed to a metrics storage backend, such as Prometheus. Grafana can then render them as a graph. You need all three components (Tempo, a metrics backend, and Grafana) to fully use service graphs.

Note
Cardinality can pose a problem when you have lots of services. To learn more about cardinality and how to perform a dry run of the metrics generator, see the Cardinality documentation.

Enable service graphs in Tempo or Grafana Enterprise Traces (GET)

To enable service graphs in Tempo or GET, enable the metrics generator and add an overrides section that enables the service-graphs generator. For more information, refer to the configuration details.
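
As a rough illustration of the two steps above, a Tempo configuration fragment might look like the following. The WAL path and the remote-write URL are placeholders, and the layout of the overrides block is an assumption that depends on the Tempo version (older releases use a flat metrics_generator_processors key instead of the nested form shown here), so check it against the configuration reference for your release.

```yaml
# Sketch only: the path and URL below are placeholders, not defaults.
metrics_generator:
  storage:
    # Local write-ahead log used before metrics are remote-written.
    path: /var/tempo/generator/wal
    remote_write:
      # Assumed Prometheus-compatible remote-write endpoint.
      - url: http://prometheus:9090/api/v1/write

overrides:
  defaults:
    metrics_generator:
      # Enable the service graphs processor for all tenants.
      processors: [service-graphs]
```

If the target is a plain Prometheus server, it typically has to be started with its remote-write receiver enabled before it accepts these writes.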

To enable service graphs when using Grafana Agent, refer to the Grafana Agent and service graphs documentation.

Enable service graphs in Grafana

Note
Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden under the feature toggle tempoServiceGraph.

Configure a Tempo data source’s service graphs by linking to the Prometheus backend where metrics are being sent:

apiVersion: 1
datasources:
  # Prometheus backend where metrics are sent
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: <prometheus-url>
    jsonData:
      httpMethod: GET
    version: 1
  - name: Tempo
    type: tempo
    uid: tempo
    url: <tempo-url>
    jsonData:
      httpMethod: GET
      serviceMap:
        datasourceUid: 'prometheus'
    version: 1

Estimate cardinality from traces

Cardinality can pose a problem when you have lots of services. There isn’t a direct formula or solution to this issue. The following guide should help estimate the cardinality that the feature will generate.

For more information on cardinality, refer to the Cardinality documentation.

How to estimate the cardinality

The number of edges depends on the number of nodes in the system and the direction of the requests between them. Let’s call each edge a hop. Every hop is a unique combination of client + server labels.

For example:

A system with 3 nodes (A, B, C) in which A only calls B and B only calls C has 2 hops (A → B, B → C).
A system with 3 nodes (A, B, C) that all call each other (that is, every link is bidirectional) has 6 hops (A → B, B → A, B → C, C → B, A → C, C → A).

We can’t calculate the number of hops automatically from the number of nodes alone, but for a connected system it falls between #services - 1 (a simple chain) and #services * (#services - 1) (every service calls every other service).

If we know the number of hops in a system, we can calculate the cardinality of the generated service graphs:

  traces_service_graph_request_total: #hops
  traces_service_graph_request_failed_total: #hops
  traces_service_graph_request_server_seconds: 3 buckets * #hops
  traces_service_graph_request_client_seconds: 3 buckets * #hops
  traces_service_graph_unpaired_spans_total: #services (absolute worst case)
  traces_service_graph_dropped_spans_total: #services (absolute worst case)

Finally, we get the following cardinality estimation:

  Sum: 8 * #hops + 2 * #services
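
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the estimate, not part of Tempo: the function names count_hops and estimate_series are hypothetical, and the default of 3 buckets is taken from the breakdown above rather than from any particular histogram configuration.

```python
def count_hops(edges):
    """Count unique directed (client, server) pairs."""
    return len(set(edges))


def estimate_series(hops, services, buckets=3):
    """Worst-case series estimate: 8 * hops + 2 * services for 3 buckets."""
    request_counters = 2 * hops              # request_total + request_failed_total
    latency_histograms = 2 * buckets * hops  # server_seconds + client_seconds
    span_counters = 2 * services             # unpaired + dropped spans (worst case)
    return request_counters + latency_histograms + span_counters


# The two 3-node examples from this section.
chain = [("A", "B"), ("B", "C")]
mesh = [(c, s) for c in "ABC" for s in "ABC" if c != s]

print(estimate_series(count_hops(chain), services=3))  # 8 * 2 + 2 * 3 = 22
print(estimate_series(count_hops(mesh), services=3))   # 8 * 6 + 2 * 3 = 54
```

Passing a different buckets value shows how quickly the histogram series dominate the total as the number of hops grows.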

Note
To estimate the number of metrics, refer to the Dry run metrics generator documentation.

Service graph metrics queries

A collection of useful PromQL queries for service graphs.

In most cases, users want to see a visual representation of their service graph. Grafana uses the service graph metrics created by Tempo and builds that visual for the user. However, in some cases, users may want to interact with the metrics that define that service graph directly. They may want to, for example, programmatically analyze how their services are interconnected and build downstream applications that use this information.

To help with this, we’ve provided a collection of useful PromQL queries that can be used to explore service graph metrics.

Instant queries

An instant query returns a single value at the end of the selected time range. Instant queries are quicker to execute, and it is often easier to understand their results, so they are preferable in scenarios such as the following:

Connectivity between services

Show the total number of calls in the last 7 days for every client/server pair:

sum(increase(traces_service_graph_request_server_seconds_count{}[7d])) by (server, client) > 0

If you’d like to only see when a single service is the server:

sum(increase(traces_service_graph_request_server_seconds_count{server="foo"}[7d])) by (client) > 0

If you’d like to only see when a single service is the client:

sum(increase(traces_service_graph_request_server_seconds_count{client="foo"}[7d])) by (server) > 0

In all of the above queries, you can adjust the interval to change the amount of time this is calculated for. So if you wanted the same analysis done over one day:

sum(increase(traces_service_graph_request_server_seconds_count{}[1d])) by (server, client) > 0
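
For programmatic use, the connectivity queries above can be sent to the instant-query endpoint of the Prometheus HTTP API (/api/v1/query). The following is a minimal sketch that assumes a Prometheus-compatible backend reachable without authentication; the helper names, the base URL, and the sample payload are hypothetical and only illustrate the response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The 7-day connectivity query from this section.
CONNECTIVITY_QUERY = (
    "sum(increase(traces_service_graph_request_server_seconds_count{}[7d])) "
    "by (server, client) > 0"
)


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_edges(payload):
    """Turn an instant-vector response into {(client, server): calls}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    edges = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        edges[(labels.get("client", ""), labels.get("server", ""))] = float(value)
    return edges


def fetch_edges(base_url, promql=CONNECTIVITY_QUERY):
    """Run the query against a live backend (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as response:
        return parse_edges(json.load(response))


# Offline demonstration with a hand-written payload in the API's response shape.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"client": "frontend", "server": "checkout"},
                "value": [1700000000, "42"],
            }
        ],
    },
}
print(parse_edges(sample_payload))
```

Calling fetch_edges with the address of your metrics backend returns the same mapping for live data, which can then feed a downstream dependency analysis.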

Range queries

Range queries are useful for seeing how service graph data changes over a time range instead of at a single point in time.

Rates over time between services

Taking two of the queries above, we can request the rate over time that any given service acted as the client or server:

sum(rate(traces_service_graph_request_server_seconds_count{server="foo"}[5m])) by (client) > 0

sum(rate(traces_service_graph_request_server_seconds_count{client="foo"}[5m])) by (server) > 0

Notice that the interval dropped to 5m. This way the rate is only calculated over the past 5 minutes, which creates a more responsive graph.

Latency percentiles over time between services

This query gives the latency quantiles for the rate above. Use it to see how latency between any two services changes over time. In the following query, the .9 means that we’re calculating the 90th percentile. Adjust this value to calculate a different percentile (for example, .5 for p50, .95 for p95, or .99 for p99).

histogram_quantile(.9, sum(rate(traces_service_graph_request_server_seconds_bucket{client="foo"}[5m])) by (server, le))