Grafana Mimir advanced architecture on Grafana Labs

Grafana Mimir deployment modes

Wed, 03 Jun 2026 09:01:40 +0200

Grafana Mimir offers two deployment modes to accommodate different operational requirements and scale needs. Choose the deployment mode that best fits your use case:

Monolithic mode: Run all components in a single process for simple deployments.
Microservices mode: Deploy components separately for maximum scalability and flexibility.

Configure the deployment mode using the -target parameter, which you can set via CLI flag or YAML configuration.

About monolithic mode

Monolithic mode runs all required components in a single process. This mode is ideal for getting started or running Grafana Mimir in a development environment.

To enable monolithic mode, set -target=all.

To see the complete list of components that run in monolithic mode, use the -modules flag:

./mimir -modules

This diagram shows how Mimir works in monolithic mode:

Scale monolithic mode

You can horizontally scale monolithic mode by deploying multiple Mimir binaries with -target=all. This approach, shown in the following diagram, provides high availability and increased scale without the configuration complexity of microservices mode.

Note
Because monolithic mode requires scaling all Grafana Mimir components together, this deployment mode isn’t recommended for large-scale deployments.

About microservices mode

Microservices mode deploys each component in separate processes, enabling independent scaling and creating granular failure domains. Microservices mode is recommended for production environments.

Note
Even though the read path (query-frontend, querier, and store-gateway) runs separately from the write path (distributor and ingester), a healthy ring is typically required for successful queries. If the write components (distributor or ingester) are unavailable or unhealthy, the ring health check may fail, causing read queries to fail. Complete isolation of read versus write availability requires careful configuration of ring settings and failure tolerance.

Specifically, the querier consults the hash ring to discover ingesters before reading recent data from them. If ingesters are unhealthy or unavailable, the ring reflects that state and the querier may be unable to satisfy queries for recent data. Achieving true read/write isolation requires zone-aware replication and careful ring configuration so that the loss of write-path components does not reduce the ring below the quorum needed for reads. For more information, refer to Configuring zone-aware replication.

The following diagrams show how Mimir works in microservices mode using ingest storage and classic architectures. For more information about the two supported architectures in Grafana Mimir, refer to Grafana Mimir architecture.

Ingest storage architecture:

Classic architecture:

In microservices mode, each Grafana Mimir process is invoked with its -target parameter set to a specific Grafana Mimir component (for example, -target=ingester or -target=distributor). To get a working Grafana Mimir instance, you must deploy every required component. For more information about each of the Grafana Mimir components, refer to Grafana Mimir advanced architecture.

To deploy Grafana Mimir in microservices mode, use Kubernetes and the mimir-distributed Helm chart.

About Grafana Mimir network ports

Grafana Mimir uses several network ports for communication between its internal components, for communication with external services like Prometheus and Grafana, and for overall cluster operation. Proper port configuration is crucial for setting up your Mimir cluster, configuring firewalls, and ensuring secure communication between Mimir components and integrated tools.

The ports required to run Grafana Mimir can vary slightly depending on your deployment mode and whether you’re using additional components like Grafana or a load balancer.

The following table shows the default ports that are fundamental to operating Mimir, whether in a monolithic or distributed setup. You can update these values in your Mimir configuration.

Port	Function	Related components	Description
8080	HTTP API / remote write	All Mimir components	This is the main entry point for Prometheus to remote-write metrics to Mimir through the Distributor and for Grafana and Prometheus to query data through the Querier or Query-frontend. This port is not typically exposed, as Grafana Mimir generally runs behind an Nginx proxy, the GEM gateway, or Kubernetes ingress.
9095	Internal gRPC communication	All Mimir components	Used for high-performance communication between different Mimir components, such as Distributor to Ingester, or Querier to Ingester. This communication is essential for distributed deployments.
7946	Memberlist / Gossip	All Mimir components	Used for service discovery and maintaining the consistent hash ring that allows Mimir components to find and communicate with each other. This process is critical for high availability and scaling.

Grafana Mimir components

Grafana Mimir includes a set of components that interact to form a cluster.

Grafana Mimir binary index-header

To query series inside blocks from object storage, the store-gateway must obtain information about each block index. To obtain the required information, the store-gateway builds an index-header for each block and stores it on local disk.

The store-gateway uses GET byte-range requests to build the index-header, which contains specific sections of the block’s index. The store-gateway uses the index-header at query time.

Because downloading specific sections of the original block’s index is a computationally easy operation, the index-header is not uploaded to the object storage. If the index-header is not available on local disk, store-gateway instances (or the same instance after a rolling update completes without a persistent disk) re-build the index-header from the original block’s index.

Format (version 1)

The index-header is a subset of the block index and contains:

Symbol Table: Used to unintern string values
Posting Offset Table: Used to look up postings

The following example shows the format of the index-header file that is located in each block’s store-gateway local directory. It is terminated by a table of contents that serves as an entry point into the index.

┌─────────────────────────────┬───────────────────────────────┐
│    magic(0xBAAAD792) <4b>   │      version(1) <1 byte>      │
├─────────────────────────────┬───────────────────────────────┤
│  index version(2) <1 byte>  │ index PostingOffsetTable <8b> │
├─────────────────────────────┴───────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────┐ │
│ │      Symbol Table (exact copy from original index)      │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │      Posting Offset Table (exact copy from index)       │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │                          TOC                            │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
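The fixed-size fields at the top of the layout can be read with a few lines of code. The following is a minimal sketch, not the store-gateway's actual reader: it assumes big-endian byte order and that the four fields are packed back to back with no padding. The function name `parse_index_header_prefix` and the sample offset value are hypothetical.

```python
import struct

MAGIC = 0xBAAAD792
# Assumption: big-endian, no padding.
# I = magic (4 bytes), B = version (1 byte),
# B = index version (1 byte), Q = PostingOffsetTable offset (8 bytes).
PREFIX_FORMAT = ">IBBQ"
PREFIX_SIZE = struct.calcsize(PREFIX_FORMAT)


def parse_index_header_prefix(data: bytes) -> dict:
    """Decode the fixed-size fields at the start of an index-header."""
    if len(data) < PREFIX_SIZE:
        raise ValueError("index-header is too short")
    magic, version, index_version, posting_offset = struct.unpack_from(
        PREFIX_FORMAT, data
    )
    if magic != MAGIC:
        raise ValueError(f"unexpected magic 0x{magic:08X}")
    return {
        "version": version,
        "index_version": index_version,
        "posting_offset_table": posting_offset,
    }


# Build a synthetic prefix (the offset 4096 is made up) and parse it back.
sample = struct.pack(PREFIX_FORMAT, MAGIC, 1, 2, 4096)
header = parse_index_header_prefix(sample)
print(header)
```

Checking the magic number first lets a reader reject a truncated or unrelated file before it interprets any offsets.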

Grafana Mimir bucket index

The bucket index is a per-tenant file that contains the list of blocks and block deletion marks in the storage. The bucket index is stored in the backend object storage, is periodically updated by the compactor, and used by queriers, store-gateways, and rulers (in internal operational mode) to discover blocks in the storage.

Benefits

The querier, store-gateway and ruler must have an almost¹ up-to-date view of the storage bucket, in order to find the right blocks to look up at query time (querier) and to load a block’s index-header (store-gateway). Because of this, they need to periodically scan the bucket to look for new blocks uploaded by ingesters or compactors, and blocks deleted (or marked for deletion) by compactors.

When the bucket index is enabled, the querier, store-gateway, and ruler periodically look up the per-tenant bucket index instead of scanning the bucket via list objects operations.

This provides the following benefits:

Reduced number of API calls to the object storage by querier and store-gateway
No “list objects” storage API calls performed by querier and store-gateway
The querier is up and running immediately after the startup, so there is no need to run an initial bucket scan

Structure of the index

The bucket-index.json.gz contains:

blocks
List of complete blocks of a tenant, including blocks marked for deletion. Partial blocks are excluded from the index.
block_deletion_marks
List of block deletion marks.
updated_at
A Unix timestamp, with second precision, of the last time the index was updated and written to the storage.
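To make the structure concrete, the following sketch reads a gzip-compressed JSON document with the three top-level fields listed above. The per-entry field name `block_id` and the sample values are assumptions made for illustration; the real entries might carry different or additional fields.

```python
import gzip
import json


def load_bucket_index(raw: bytes) -> dict:
    """Decompress and decode the content of a bucket-index.json.gz file."""
    return json.loads(gzip.decompress(raw))


# Synthetic index; "block_id" is an assumed field name.
sample = gzip.compress(
    json.dumps(
        {
            "blocks": [{"block_id": "block-a"}, {"block_id": "block-b"}],
            "block_deletion_marks": [{"block_id": "block-b"}],
            "updated_at": 1700000000,
        }
    ).encode("utf-8")
)

index = load_bucket_index(sample)
marked = {mark["block_id"] for mark in index["block_deletion_marks"]}
# Blocks that are listed in the index and not marked for deletion.
not_marked = [b["block_id"] for b in index["blocks"] if b["block_id"] not in marked]
print(not_marked, index["updated_at"])
```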

How it gets updated

The compactor periodically scans the bucket and uploads an updated bucket index to the storage. You can configure the frequency with which the bucket index is updated via -compactor.cleanup-interval.

The use of the bucket index is optional, but the index is built and updated by the compactor even if -blocks-storage.bucket-store.bucket-index.enabled=false. This behavior ensures that the bucket index for any tenant exists and that query result consistency is guaranteed if a Grafana Mimir cluster operator enables the bucket index in a live cluster. The overhead introduced by keeping the bucket index updated is not significant.

How it’s used by the querier

At query time the querier and ruler determine whether the bucket index for the tenant has already been loaded to memory. If not, the querier and ruler download it from the storage and cache it.

Because the bucket index is a small file, downloading it lazily doesn’t have a significant impact on first query performance, but it does allow a querier to get up and running without pre-downloading every tenant’s bucket index. In addition, if the metadata cache is enabled, the bucket index is cached for a short time in a shared cache, which reduces the latency and number of API calls to the object storage in case multiple queriers and rulers fetch the same tenant’s bucket index within a short time.

While in-memory, a background process keeps the bucket index updated periodically so that subsequent queries from the same tenant to the same querier instance use the cached (and periodically updated) bucket index.

The following configuration options determine bucket index update intervals:

-blocks-storage.bucket-store.sync-interval
This option configures how frequently a cached bucket index is refreshed.
-blocks-storage.bucket-store.bucket-index.update-on-error-interval
If downloading a bucket index fails, the failure is cached for a short time so that the backend storage doesn’t experience a large volume of storage requests. This option configures the frequency with which the bucket store attempts to load a failed bucket index.

If a bucket index is unused for the amount of time configured via -blocks-storage.bucket-store.bucket-index.idle-timeout (for example, if a querier instance is not receiving any query from the tenant), the querier removes it from memory and stops updating it at regular intervals. This is useful for tenants that are resharded to different queriers when shuffle sharding is enabled.

At query time the querier and ruler determine how old a bucket index is based on its updated_at field. The query fails if the bucket index is older than the period configured via -blocks-storage.bucket-store.bucket-index.max-stale-period. This circuit breaker ensures queriers and rulers do not return any partial query results due to a stale view over the long-term storage.
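The staleness check described above reduces to a comparison between the index age and the configured period. This is a minimal sketch of that comparison, not the querier's code; the function name and the one-hour value are hypothetical.

```python
def is_bucket_index_stale(updated_at: int, now: int, max_stale_period_s: int) -> bool:
    """Return True when the index is older than the allowed stale period.

    updated_at and now are Unix timestamps in seconds.
    """
    return now - updated_at > max_stale_period_s


MAX_STALE_PERIOD_S = 3600  # hypothetical value for max-stale-period

fresh = is_bucket_index_stale(updated_at=1_700_000_000, now=1_700_000_600,
                              max_stale_period_s=MAX_STALE_PERIOD_S)
stale = is_bucket_index_stale(updated_at=1_700_000_000, now=1_700_007_200,
                              max_stale_period_s=MAX_STALE_PERIOD_S)
print(fresh, stale)
```

In this sketch, a query against the second index would be rejected rather than answered from a possibly incomplete view of the storage.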

How it’s used by the store-gateway

The store-gateway, at startup and periodically, fetches the bucket index for each tenant that belongs to its shard, and uses it as the source of truth for the blocks and deletion marks in the storage. This removes the need to periodically scan the bucket to discover blocks belonging to its shard.

¹ Ingesters regularly add new blocks to the bucket as they offload data to long-term storage, and compactors subsequently compact these blocks and mark the original blocks for deletion. Actual deletion happens after the delay value that is associated with the parameter -compactor.deletion-delay. An attempt to fetch a deleted block will lead to failure of the query. Therefore, in this context, an almost up-to-date view is a view that’s outdated by less than the value of -compactor.deletion-delay.

Grafana Mimir hash rings

Hash rings are a distributed consistent hashing scheme that Grafana Mimir uses for sharding, replication, and service discovery.

The following Mimir features are built on top of hash rings:

Service discovery: Instances can discover each other by looking up which peers are registered in the ring.
Health check: Instances periodically send a heartbeat to the ring to signal that they are healthy. An instance is considered unhealthy if it misses heartbeats for a configured period.
Zone-aware replication: Optionally replicate data across failure domains for high availability. For more information, refer to Configure zone-aware replication.
Shuffle sharding: Optionally limit the blast radius of failures in a multi-tenant cluster by isolating tenants. For more information, refer to Configure shuffle sharding.

How the hash ring is used for sharding

The primary use of hash rings in Mimir is to consistently shard data, such as time series, and workloads, such as compaction jobs, without a central coordinator or single point of failure.

Each of the following Mimir components joins its own dedicated hash ring for sharding:

Ingesters: Shard and replicate series.
Compactors: Shard compaction jobs.
Store-gateways: Shard blocks to query from long-term storage.
(Optional) Rulers: Shard rule groups to evaluate.
(Optional) Alertmanagers: Shard tenants.

A hash ring is a data structure that represents the data space as 32-bit unsigned integers. Each instance of a Mimir component owns a set of token ranges that define which portion of the data space it is responsible for.

The data or workload to be sharded is hashed using a function that returns a 32-bit unsigned integer, called a token. The instance that owns that token handles the data.

When an instance starts, it generates a fixed number of tokens and registers them in the ring. A token is owned by the instance that registered the smallest token value greater than the token being looked up. The lookup wraps around to zero after (2^32)-1.

Hash rings provide consistent hashing. When an instance joins or leaves the ring, only a small, bounded portion of data moves. On average, only n/m tokens move, where n is the total number of tokens (32-bit unsigned integer) and m is the number of instances that are registered in the ring.
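The bounded movement of data can be observed in a small simulation. The sketch below is an illustration under stated assumptions, not Mimir's ring implementation: tokens are drawn at random, each instance registers 128 tokens (a made-up number), and ownership follows the rule described above (smallest registered token greater than the lookup token, with wraparound).

```python
import bisect
import random

TOKEN_SPACE = 2**32


def build_ring(instances, tokens_per_instance, rng):
    """Return a sorted list of (token, instance) pairs with random tokens."""
    return sorted(
        (rng.randrange(TOKEN_SPACE), name)
        for name in instances
        for _ in range(tokens_per_instance)
    )


def make_lookup(ring):
    """Return a function mapping a lookup token to the owning instance."""
    tokens = [token for token, _ in ring]

    def owner(lookup_token):
        # First registered token strictly greater than the lookup token,
        # wrapping around to the start of the ring when there is none.
        i = bisect.bisect_right(tokens, lookup_token)
        return ring[i % len(ring)][1]

    return owner


rng = random.Random(42)
ring_before = build_ring(["a", "b", "c"], 128, rng)
ring_after = sorted(ring_before + build_ring(["d"], 128, rng))

owner_before = make_lookup(ring_before)
owner_after = make_lookup(ring_after)

keys = [rng.randrange(TOKEN_SPACE) for _ in range(10_000)]
moved = [k for k in keys if owner_before(k) != owner_after(k)]
print(f"{len(moved) / len(keys):.1%} of keys changed owner")
```

Every key that changes owner moves to the new instance `d`; keys never move between the existing instances, which is the property that keeps resharding cheap.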

How series sharding works

The most important hash ring in Grafana Mimir is the one used to shard series. The implementation details depend on the configured architecture.

Series sharding in ingest storage architecture

Note
This guidance applies to ingest storage architecture. For more information about the supported architectures in Grafana Mimir, refer to Grafana Mimir architecture.

In ingest storage architecture, distributors shard incoming series across Kafka partitions. Each series is assigned to a single Kafka partition. Replication is handled by Kafka.

Ingesters own Kafka partitions, consuming the series written to the partitions they own and making those series available for querying. Each ingester owns one partition, but multiple ingesters can own the same partition for high availability.

Series sharding in ingest storage architecture relies on two hash rings that work together:

Partitions ring
Ingesters ring

Write path

The partitions ring is the source of truth for the Kafka partitions that Grafana Mimir currently uses. Each partition owns a range of tokens used to shard series among partitions and includes the unique identifiers of the ingesters that own that partition.

When a distributor receives a write request containing series data:

It hashes each series using the fnv32a hashing function.
It looks up the resulting token in the partitions ring to determine the Kafka partition for that series.
It writes the series to the matching Kafka partition.

A write request is considered successful when all series in the request are successfully committed to Kafka.
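The hash-then-lookup steps above can be sketched as follows. The `fnv32a` function is the standard 32-bit FNV-1a hash. Everything else is an assumption made for illustration: the way a series is serialized to bytes before hashing, the three-partition ring, and its token values do not come from Mimir.

```python
import bisect


def fnv32a(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


def partition_for(series_key: str, ring) -> int:
    """Find the partition owning the token of a series.

    ring is a sorted list of (token, partition_id) pairs.
    """
    tokens = [token for token, _ in ring]
    i = bisect.bisect_right(tokens, fnv32a(series_key.encode("utf-8")))
    return ring[i % len(ring)][1]


# Hypothetical partitions ring with one token per partition.
PARTITIONS_RING = [(2**30, 0), (2**31, 1), (3 * 2**30, 2)]

token = fnv32a(b"a")
partition = partition_for("a", PARTITIONS_RING)
print(hex(token), partition)
```

Here the token of the key `a` is larger than every registered token, so the lookup wraps around to the first partition in the ring.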

Read path

The ingesters ring is the source of truth for all ingesters currently running in the Grafana Mimir cluster and is used for service discovery. Each ingester registers itself in the ring and periodically updates its heartbeat.

Queriers watch the ingesters ring to identify healthy ingesters and their IP addresses. When a querier receives a query:

It looks up the partitions ring to find which partitions contain the relevant data.
It looks up the ingesters ring to find which ingesters own those partitions.
It fetches the matching series by contacting the ingesters that own the partitions.

In ingest storage architecture, consistency is guaranteed with a quorum of 1. Each partition needs to be queried only once. If multiple ingesters own the same partition, the querier fetches data from only one of the healthy ingesters for that partition.

Partitions ring lifecycle

A partition in the ring can be in one of the following states:

Pending: No writes or reads are allowed.
Active: The partition is in read-write mode.
Inactive: The partition is in read-only mode.

Partitions are not live components and cannot register themselves in the ring. Their lifecycle is managed by ingesters. Each ingester manages the lifecycle of the partition it owns.

When ingesters are scaled out, new partitions are added to the ring. When ingesters are scaled in, their partitions are removed from the ring through a decommissioning procedure.

Partition creation and activation

When an ingester starts up, it checks whether the partition it owns already exists in the ring. If the partition does not exist, the ingester creates it in the Pending state and adds itself as the partition owner.

This is the initial state for a new partition, allowing time for additional ingesters to join as owners and for ring changes to propagate across instances. While a partition is in the Pending state, distributors cannot write to it, and queriers cannot read from it.

After the partition has at least one owner and remains in Pending for longer than a configured grace period, the ingester transitions it to the Active state. When a partition is Active, distributors can write to it, and queriers must read from it. This is the normal operational state of a partition.

Partition decommissioning and downscaling

Grafana’s Kubernetes Rollout Operator manages partition and ingester downscaling.

When an ingester is marked for termination due to a downscaling event, the rollout operator invokes the “prepare delayed downscale endpoint” API exposed by the ingester. This API switches the partition from Active to Inactive.

When a partition is Inactive, distributors can no longer write to it, but queriers must still read from it. The partition remains in this state until it is safe to stop querying the ingester, specifically, when the data has become available for querying from long-term object storage.

Once the grace period passes, the rollout operator invokes a second API exposed by the ingester, the “prepare shutdown endpoint”. This API removes the ingester as a partition owner from the ring. If the partition has no remaining owners, it is then removed from the ring entirely.

Finally, the rollout operator terminates the ingester pod, completing the safe downscaling procedure.
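The partition lifecycle described in this section can be summarized as a small state machine. This is a sketch of the rules as stated in the text, not the ingester's implementation; it models only the Pending to Active and Active to Inactive transitions, and all names are hypothetical.

```python
from enum import Enum


class PartitionState(Enum):
    PENDING = "Pending"
    ACTIVE = "Active"
    INACTIVE = "Inactive"


# Distributors write only to Active partitions.
WRITABLE = {PartitionState.ACTIVE}
# Queriers read from Active and Inactive partitions.
READABLE = {PartitionState.ACTIVE, PartitionState.INACTIVE}

# Transitions described in the text. Removal from the ring after
# Inactive is not a state, so it is not modeled here.
NEXT_STATE = {
    PartitionState.PENDING: PartitionState.ACTIVE,
    PartitionState.ACTIVE: PartitionState.INACTIVE,
}


def can_write(state: PartitionState) -> bool:
    return state in WRITABLE


def can_read(state: PartitionState) -> bool:
    return state in READABLE


def advance(state: PartitionState) -> PartitionState:
    """Move a partition to the next state in its lifecycle."""
    if state not in NEXT_STATE:
        raise ValueError(f"no transition defined from {state.value}")
    return NEXT_STATE[state]


state = PartitionState.PENDING
for _ in range(2):
    state = advance(state)
print(state.value, can_write(state), can_read(state))
```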

Series sharding in classic architecture

Note
This guidance applies to classic architecture. For more information about the supported architectures in Grafana Mimir, refer to Grafana Mimir architecture.

In classic architecture, distributors shard and replicate the incoming series among ingesters.

Each ingester joins the ingesters hash ring and owns a subset of token ranges. When a distributor receives a write request containing series data, it hashes each series using the fnv32a hashing function. It then looks up the resulting token in the ingesters hash ring to find the authoritative owner and replicates the series to the next RF - 1 ingesters in the ring (where RF is the replication factor, 3 by default).

Then the distributor writes the series to the RF ingesters owning the series itself. A write request is considered successful when each series is written to a quorum of ingesters. With a replication factor of 3, a quorum is reached when at least 2 ingesters successfully receive each series.
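As a quick check of the quorum arithmetic, the sketch below assumes a simple majority quorum, which matches the stated case of 2 out of 3. The general formula is an assumption here, and the function names are hypothetical.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest number of ingesters that forms a majority (assumed rule)."""
    return replication_factor // 2 + 1


def series_write_succeeded(successful_writes: int, replication_factor: int) -> bool:
    """A series counts as written once a quorum of ingesters received it."""
    return successful_writes >= write_quorum(replication_factor)


print(write_quorum(3))
print(series_write_succeeded(2, 3), series_write_succeeded(1, 3))
```

With a replication factor of 3, one ingester can fail a write without failing the request.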

To illustrate, consider four ingesters and a token space from 0 to 9:

Ingester #1 is registered in the ring with the token 2.
Ingester #2 is registered in the ring with the token 4.
Ingester #3 is registered in the ring with the token 6.
Ingester #4 is registered in the ring with the token 9.

A distributor receives an incoming sample for the series {__name__="cpu_seconds_total",instance="1.1.1.1"}. It hashes the series’ labels, and the result of the hashing function is the token 3.

To find which ingester owns token 3, the distributor looks up the token 3 in the ingesters ring and finds the ingester that is registered with the smallest token larger than 3. The ingester #2, which is registered with token 4, is the authoritative owner of the series {__name__="cpu_seconds_total",instance="1.1.1.1"}.

By default, Grafana Mimir replicates each series to three ingesters. After finding the authoritative owner of the series, the distributor continues to walk the ring clockwise to find the remaining two instances where the series should be replicated. In this example, the series is replicated to Ingester #3 and Ingester #4.
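The worked example can be reproduced in a few lines. This sketch uses the four tokens from the example and walks the ring clockwise from the authoritative owner; the instance names and the `replicas_for` function are hypothetical, and details such as zone awareness are left out.

```python
import bisect

# (token, instance) pairs from the example, sorted by token.
RING = [
    (2, "ingester-1"),
    (4, "ingester-2"),
    (6, "ingester-3"),
    (9, "ingester-4"),
]


def replicas_for(token: int, ring, replication_factor: int = 3):
    """Return the owner of a token followed by the next distinct instances."""
    tokens = [t for t, _ in ring]
    # Index of the smallest registered token greater than the lookup token.
    start = bisect.bisect_right(tokens, token)
    picked = []
    for step in range(len(ring)):
        instance = ring[(start + step) % len(ring)][1]
        if instance not in picked:
            picked.append(instance)
        if len(picked) == replication_factor:
            break
    return picked


print(replicas_for(3, RING))
```

For token 3 this yields ingester-2 as the authoritative owner, followed by ingester-3 and ingester-4. A lookup for token 9 finds no larger token and wraps around to ingester-1.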

How the hash ring is used for service discovery

Grafana Mimir also uses the ring for built-in service discovery. Since instances register themselves in their ring and periodically send heartbeats, it’s convenient to use the hash ring for internal service discovery as well.

When the hash ring is used exclusively for service discovery, rather than sharding, instances don’t register tokens in the ring. Instead, they only register their presence and periodically update a heartbeat timestamp. When other instances need to find the healthy instances of a given component, they look up the ring to find the instances that have successfully updated the heartbeat timestamp in the ring.

The Grafana Mimir components using the ring for service discovery or coordination are:

Distributors: Enforce global rate limits as local limits by dividing the global limit by the number of healthy distributor instances. For more information, refer to Rate limiting.
Query-schedulers: Allow query-frontends and queriers to discover available schedulers.
(Optional) Overrides-exporters: Self-elect a leader among replicas to export high-cardinality metrics. No strict leader election is required.
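The heartbeat-based discovery and the distributor rate-limit division can be sketched together. This is an illustration only: the heartbeat timeout, the timestamps, and the function names are assumptions, and the real ring stores more state than a timestamp per instance.

```python
def healthy_instances(heartbeats: dict, now: int, heartbeat_timeout_s: int) -> list:
    """Return instances whose last heartbeat is recent enough.

    heartbeats maps an instance name to its last heartbeat (Unix seconds).
    """
    return [
        name
        for name, last_heartbeat in heartbeats.items()
        if now - last_heartbeat <= heartbeat_timeout_s
    ]


def local_rate_limit(global_limit: float, healthy_count: int) -> float:
    """Split a global limit evenly across healthy distributors."""
    if healthy_count < 1:
        raise ValueError("at least one healthy distributor is required")
    return global_limit / healthy_count


# Made-up heartbeat timestamps; distributor-3 stopped updating its heartbeat.
heartbeats = {"distributor-1": 1000, "distributor-2": 995, "distributor-3": 900}
healthy = healthy_instances(heartbeats, now=1005, heartbeat_timeout_s=60)
limit = local_rate_limit(100_000, len(healthy))
print(healthy, limit)
```

When an instance stops sending heartbeats, the remaining distributors each enforce a larger share of the global limit.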

Hash ring data structures need to be shared between Grafana Mimir instances. To propagate changes to a given hash ring, Grafana Mimir uses a key-value store. You can configure the key-value store independently for the hash rings of different components.

For more information, refer to Grafana Mimir key-value store.

Grafana Mimir key-value store

A key-value (KV) store is a database that stores data indexed by key. Grafana Mimir requires a key-value store for the following features:

Hash rings
(Optional) Distributor high-availability tracker

Supported key-value store backends

Grafana Mimir supports the following key-value (KV) store backends:

Gossip-based memberlist protocol (default)
Consul
Etcd

Gossip-based memberlist protocol (default)

By default, Grafana Mimir instances use a Gossip-based protocol to join a memberlist cluster. The data is shared between the instances using peer-to-peer communication and no external dependency is required.

We recommend that you use memberlist to run Grafana Mimir.

To configure memberlist, refer to configuring hash rings.

Note
The Gossip-based memberlist protocol isn’t supported for the optional distributor high-availability tracker.

Consul

Grafana Mimir supports Consul as a backend KV store. If you want to use Consul, you must install it. The Grafana Mimir installation does not include Consul.

To configure Consul, refer to configuring hash rings.

Etcd

Grafana Mimir supports etcd as a backend KV store. If you want to use etcd, you must install it. The Grafana Mimir installation does not include etcd.

To configure etcd, refer to configuring hash rings.

Grafana Mimir memberlist and gossip protocol

Memberlist is a Go library that manages cluster membership, node failure detection, and message passing using a gossip-based protocol. Memberlist is eventually consistent and network partitions are partially tolerated by attempting to communicate to potentially dead nodes through multiple routes.

By default, Grafana Mimir uses memberlist to implement a key-value (KV) store to share the hash ring data structures between instances.

When using a memberlist-based KV store, each instance maintains a copy of the hash rings. Each Mimir instance updates a hash ring locally and uses memberlist to propagate the changes to other instances. Updates generated locally and updates received from other instances are merged together to form the current state of the ring on the instance.

To configure memberlist, refer to configuring hash rings.

How memberlist propagates hash ring changes

When using a memberlist-based KV store, every Grafana Mimir instance propagates the hash ring data structures to other instances using the following techniques:

Propagating only the differences introduced in recent changes.
Propagating the full hash ring data structure.

Every -memberlist.gossip-interval, an instance randomly selects a subset of all Grafana Mimir cluster instances and sends the latest changes to the selected instances. The size of the subset is configured by -memberlist.gossip-nodes. This operation is performed frequently and it’s the primary technique used to propagate changes.

In addition, every -memberlist.pullpush-interval, an instance randomly selects another instance in the Grafana Mimir cluster and transfers the full content of the KV store, including all hash rings (unless -memberlist.pullpush-interval is zero, which disables this behavior). After this operation is complete, the two instances hold the same KV store content. This operation is computationally more expensive, and as a result, it’s performed less frequently. The operation ensures that the hash rings periodically reconcile to a common state.

Grafana Mimir query sharding

Mimir includes the ability to run a single query across multiple machines. This is achieved by breaking the dataset into smaller pieces. These smaller pieces are called shards. Each shard then gets queried in a partial query, and those partial queries are distributed by the query-frontend to run on different queriers in parallel. The results of those partial queries are aggregated by the query-frontend to return the full query result.

Query sharding is applied on the query and query_range APIs only.

Query sharding at a glance

Not all queries are shardable. Even when the full query is not shardable, the inner parts of the query could still be shardable.

In particular, associative aggregations (like sum, min, max, count, avg) are shardable, while some query functions (like absent, absent_over_time, histogram_quantile, sort_desc, sort) are not.

The following examples use a shard count of 3. All the partial queries that include a label selector __query_shard__ are executed in parallel. The concat() annotation is used to show when partial query results are concatenated/merged by the query-frontend.

Example 1: Full query is shardable

sum(rate(metric[1m]))

Is executed as (assuming a shard count of 3):

sum(
  concat(
    sum(rate(metric{__query_shard__="1_of_3"}[1m]))
    sum(rate(metric{__query_shard__="2_of_3"}[1m]))
    sum(rate(metric{__query_shard__="3_of_3"}[1m]))
  )
)

Example 2: Inner part is shardable

histogram_quantile(0.99, sum by(le) (rate(metric[1m])))

Is executed as (assuming a shard count of 3):

histogram_quantile(0.99, sum by(le) (
  concat(
    sum by(le) (rate(metric{__query_shard__="1_of_3"}[1m]))
    sum by(le) (rate(metric{__query_shard__="2_of_3"}[1m]))
    sum by(le) (rate(metric{__query_shard__="3_of_3"}[1m]))
  )
))

Example 3: Query with two shardable portions

sum(rate(failed[1m])) / sum(rate(total[1m]))

Is executed as (assuming a shard count of 3):

sum(
  concat(
    sum (rate(failed{__query_shard__="1_of_3"}[1m]))
    sum (rate(failed{__query_shard__="2_of_3"}[1m]))
    sum (rate(failed{__query_shard__="3_of_3"}[1m]))
  )
)
/
sum(
  concat(
    sum (rate(total{__query_shard__="1_of_3"}[1m]))
    sum (rate(total{__query_shard__="2_of_3"}[1m]))
    sum (rate(total{__query_shard__="3_of_3"}[1m]))
  )
)
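The reason a sum can be sharded is that summing the per-shard sums gives the same result as summing all series at once. The sketch below demonstrates this with a shard count of 3. The series values are made up, shard indexes are zero-based here (unlike the 1_of_3 labels), and assigning series to shards with `zlib.crc32` is an assumption for illustration, not the hash that Mimir uses.

```python
import zlib

TOTAL_SHARDS = 3


def shard_of(series: str, total_shards: int) -> int:
    """Assign a series to exactly one shard (illustrative hash)."""
    return zlib.crc32(series.encode("utf-8")) % total_shards


# Made-up per-series rate values.
rates = {
    'metric{instance="a"}': 4,
    'metric{instance="b"}': 7,
    'metric{instance="c"}': 1,
    'metric{instance="d"}': 9,
    'metric{instance="e"}': 3,
}


def partial_sum(shard_index: int) -> int:
    """Evaluate the inner sum() over the series of a single shard."""
    return sum(
        value
        for series, value in rates.items()
        if shard_of(series, TOTAL_SHARDS) == shard_index
    )


# concat(): collect the partial results, then apply the outer sum().
partials = [partial_sum(i) for i in range(TOTAL_SHARDS)]
sharded_result = sum(partials)
unsharded_result = sum(rates.values())
print(partials, sharded_result, unsharded_result)
```

The equality holds for any assignment of series to shards, as long as each series lands in exactly one shard.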

How to enable query sharding

To enable query sharding, opt in by setting -query-frontend.parallelize-shardable-queries to true.

Each shardable portion of a query is split into -query-frontend.query-sharding-total-shards partial queries. If a query has multiple inner portions that can be sharded, each portion is sharded -query-frontend.query-sharding-total-shards times. In some cases, this could lead to an explosion of queries. For this reason, the total number of partial queries that a single input query can generate is capped by a hard limit, which defaults to 128 and which you can change via -query-frontend.query-sharding-max-sharded-queries.

When running a query over a large time range and -query-frontend.split-queries-by-interval is enabled, the -query-frontend.query-sharding-max-sharded-queries limit applies on the total number of queries which have been split by time (first) and by shards (second).

As an example, if -query-frontend.query-sharding-max-sharded-queries=128 and -query-frontend.split-queries-by-interval=24h, and you run a query over 8 days, each daily query will have a max of 128 / 8 days = 16 partial queries per day.
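The arithmetic in this example can be written out as a small helper. This is a sketch of the calculation only; the function name is hypothetical, and the use of integer division and of rounding the number of splits up are assumptions about how uneven cases would be handled.

```python
import math


def max_partial_queries_per_split(
    max_sharded_queries: int, query_range_hours: float, split_interval_hours: float
) -> int:
    """Divide the partial-query budget across the time splits of a query."""
    splits = math.ceil(query_range_hours / split_interval_hours)
    return max_sharded_queries // splits


# 8-day query, split by 24h, with a budget of 128 partial queries.
per_day = max_partial_queries_per_split(128, query_range_hours=8 * 24,
                                        split_interval_hours=24)
print(per_day)
```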

After you enable query sharding in a microservices deployment, the query-frontends start aggregating the results of the partial queries. For this reason, it is important to also configure the following PromQL engine-specific parameters on the query-frontend:

-querier.max-concurrent
-querier.timeout
-querier.max-samples
-querier.default-evaluation-interval
-querier.lookback-delta

Operational considerations

Splitting a single query into sharded queries increases the quantity of queries that must be processed. Parallelization decreases the query processing time, but increases the load on querier components and their underlying data stores (ingesters for recent data and store-gateway for historic data). The caching layer for chunks and indexes will also experience an increased load.

We also recommend increasing the maximum number of queries scheduled in parallel by the query-frontend, by multiplying the previously set value of -querier.max-query-parallelism by -query-frontend.query-sharding-total-shards.

Cardinality estimation for query sharding (experimental)

When the number of parallel sharded queries increases, so does the load on the queriers and their dependencies. Therefore, to balance the tradeoff, shard queries only as much as necessary. Queries that return more series, such as those that are of high cardinality, need to fetch more data and should therefore be split into a larger number of shards. Queries that return few or no series should be executed with fewer or no shards at all. When determining the number of shards to use for a given query, the sharding logic can optionally take into account the cardinality (number of series) observed during previous executions of the same query for similar time ranges.

To enable this feature, set -query-frontend.query-sharding-target-series-per-shard to a value representing roughly how many series each shard should fetch, and configure the results cache via the -query-frontend.results-cache.* flags. This is necessary even when results caching is disabled, as the estimates are stored in the same cache that’s used for query result caching. The value that you set for this flag is one of several parameters that the sharding logic uses to determine the appropriate number of shards for a query. Therefore, it is a target rather than a strict limit, and the actual number of series fetched per shard might exceed it. This is likely to happen when the cardinality of a query changes rapidly within a short period of time.

Estimates for query cardinality are only ever used to reduce the number of shards compared to the case when cardinality estimation is disabled. Other parameters that limit the total number of shards, such as -query-frontend.query-sharding-total-shards, will still provide an upper bound for the number of shards even when cardinality estimation is enabled and would suggest the use of a higher number of shards.
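The relationship between the estimate, the target series per shard, and the upper bound can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, and it considers only the two flags named above, whereas the text notes that the real sharding logic weighs several parameters.

```python
import math
from typing import Optional


def shards_for_query(total_shards: int,
                     target_series_per_shard: int,
                     estimated_series: Optional[int]) -> int:
    """Illustrative only: pick a shard count from a cardinality estimate."""
    # Without an estimate (or with estimation disabled), use the configured total.
    if estimated_series is None or target_series_per_shard <= 0:
        return total_shards
    # Enough shards so that each one fetches roughly the target number of series.
    wanted = math.ceil(estimated_series / target_series_per_shard)
    # Estimates only ever reduce the shard count; total_shards stays the upper bound.
    return max(1, min(total_shards, wanted))


print(shards_for_query(16, 2500, None))       # no estimate: 16
print(shards_for_query(16, 2500, 10_000))     # low cardinality: 4
print(shards_for_query(16, 2500, 1_000_000))  # capped by total shards: 16
```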

The histogram metric cortex_query_frontend_cardinality_estimation_difference tracks the difference between the estimated and actual number of series fetched.

Verification

Query statistics

The query statistics logged by the query-frontend allow you to check whether query sharding was used for an individual query. The sharded_queries field contains the number of partial queries executed in parallel.

When sharded_queries is 0, either the query is not shardable or query sharding is disabled for the cluster or the tenant. This is a log line of an unshardable query:

sharded_queries=0  param_query="absent(up{job=\"my-service\"})"

When sharded_queries matches the configured shard count, query sharding is operational and the query has only a single leg (assuming time splitting is disabled or the query doesn’t span across multiple days). The following log line represents that case with a shard count of 16:

sharded_queries=16 query="sum(rate(prometheus_engine_queries[5m]))"

When sharded_queries is a multiple of the configured shard count, query sharding is operational and the query has multiple legs (assuming time splitting is disabled or the query doesn’t span across multiple days). The following log line shows a query with two legs and with a configured shard count of 16:

sharded_queries=32 query="sum(rate(prometheus_engine_queries{engine=\"ruler\"}[5m]))/sum(rate(prometheus_engine_queries[5m]))"
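The three cases above can be checked programmatically. The following sketch parses the sharded_queries field from a log line and classifies it against a configured shard count. The function name and the returned labels are hypothetical, and the log lines are the abbreviated examples from this section rather than complete query-frontend log output.

```python
import re


def classify_sharding(log_line: str, shard_count: int) -> str:
    """Illustrative only: interpret the sharded_queries field of a query stats line."""
    match = re.search(r"\bsharded_queries=(\d+)", log_line)
    if match is None:
        return "unknown"
    sharded = int(match.group(1))
    if sharded == 0:
        # Not shardable, or sharding disabled for the cluster or tenant.
        return "not sharded"
    if sharded % shard_count == 0:
        # One leg per multiple of the configured shard count.
        return f"sharded, {sharded // shard_count} leg(s)"
    return "sharded, not a multiple of the shard count"


lines = [
    r'sharded_queries=0  param_query="absent(up{job=\"my-service\"})"',
    r'sharded_queries=16 query="sum(rate(prometheus_engine_queries[5m]))"',
    r'sharded_queries=32 query="sum(rate(a[5m]))/sum(rate(b[5m]))"',
]
for line in lines:
    print(classify_sharding(line, shard_count=16))
```

A count that is not a multiple of the shard count can still be legitimate, for example when time splitting or cardinality estimation changes the number of partial queries.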

The query-frontend also exposes metrics, which can be useful to understand the query workload’s parallelism as a whole.

You can run the following query to get the ratio of queries which have been successfully sharded:

sum(rate(cortex_frontend_query_sharding_rewrites_succeeded_total[$__rate_interval])) /
sum(rate(cortex_frontend_query_sharding_rewrites_attempted_total[$__rate_interval]))

The histogram cortex_frontend_sharded_queries_per_query shows how many sharded subqueries are generated per query.

Grafana Mimir query engine

Wed, 03 Jun 2026 09:01:40 +0200


The Mimir Query Engine (MQE) is an alternative to Prometheus’ query engine. You can use it in queriers to evaluate PromQL queries.

MQE produces equivalent results to Prometheus’ engine, generally uses less memory and CPU than Prometheus’ engine, and evaluates queries at least as fast, if not faster. It supports all stable PromQL features and transparently falls back to Prometheus' engine for queries that use unsupported features.

How to enable MQE

MQE is enabled by default. To disable it, either set the -querier.query-engine=prometheus CLI flag on queriers or set the equivalent YAML configuration file option.

Fallback to Prometheus’ engine

By default, MQE falls back to Prometheus’ engine for any queries that use unsupported features.

To disable this behaviour, either set the -querier.enable-query-engine-fallback=false CLI flag on queriers, or set the equivalent YAML configuration file option. If fallback is disabled and MQE receives a query it does not support, then the query fails.

To force a query supported by MQE to use Prometheus’ engine, add the X-Mimir-Force-Prometheus-Engine: true HTTP header to the query request. This header only has an effect if fallback is enabled.
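As a sketch of how a client might set this header, the following Python builds (but does not send) an instant query request using only the standard library. The base URL, the /prometheus/api/v1/query path, the X-Scope-OrgID tenant header, and the tenant name are assumptions about a typical Mimir setup and are not stated in this section; adjust them to match your deployment.

```python
import urllib.parse
import urllib.request


def build_forced_prometheus_query(base_url: str, promql: str,
                                  tenant_id: str) -> urllib.request.Request:
    """Illustrative only: build a query request that asks for Prometheus' engine."""
    params = urllib.parse.urlencode({"query": promql})
    # Assumed endpoint path; verify it against your deployment.
    request = urllib.request.Request(f"{base_url}/prometheus/api/v1/query?{params}")
    # Assumed tenant header for multi-tenant deployments.
    request.add_header("X-Scope-OrgID", tenant_id)
    # Header described above; only has an effect if fallback is enabled.
    request.add_header("X-Mimir-Force-Prometheus-Engine", "true")
    return request


request = build_forced_prometheus_query(
    "http://localhost:8080", "sum(rate(foo[1m]))", "demo")
print(request.full_url)
print(request.header_items())
```

To actually send the request, pass it to urllib.request.urlopen.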

Query memory consumption limit

MQE supports enforcing a per-query memory consumption limit. This allows you to ensure that a single memory-hungry query cannot monopolize a large proportion of available memory in a querier, or cause it to exhaust all available memory and crash.

While evaluating a query, MQE estimates the memory consumed by the query, such as memory used for the final result and any intermediate calculations, and stops the query with an err-mimir-max-estimated-memory-consumption-per-query error if the estimate exceeds the configured limit.

The estimate is based on the memory consumed by samples currently held in memory for query evaluation. This includes both raw samples decoded from chunks, and samples held in memory as intermediate results of calculations or as the final result. It also includes some other large sources of memory consumption for intermediate results.

This estimate has the following limitations:

It doesn’t consider the memory consumed by series labels.
It doesn’t consider the memory consumed by chunks that are currently in memory. However, the maximum chunks and maximum chunks bytes limits continue to be enforced.
It makes an assumption about the memory consumed by each native histogram, rather than accurately calculating the memory consumed by each histogram.

By default, no limit is enforced. To configure the default limit for all tenants, either set the -querier.max-estimated-memory-consumption-per-query CLI flag or set the equivalent YAML configuration file option. You can override this default limit on a per-tenant basis by setting max_estimated_memory_consumption_per_query for that tenant. Setting the limit to 0 disables it.

The limit is not enforced for queries that run through Prometheus’ engine, and setting the limit has no impact if MQE is disabled or if the query falls back to Prometheus’ engine.
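The estimate-and-abort behavior described in this section can be modeled with a small tracker. This is a conceptual sketch only: the class names, method names, and error text are hypothetical and do not reflect MQE's internal implementation. It shows a running estimate that grows and shrinks as samples are held and released, a check against the limit, and a limit of 0 meaning that no limit is enforced.

```python
class MemoryLimitExceeded(Exception):
    """Hypothetical error standing in for err-mimir-max-estimated-memory-consumption-per-query."""


class EstimatedMemoryTracker:
    """Illustrative only: track a per-query memory estimate against a limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes  # 0 disables the limit
        self.current_bytes = 0

    def increase(self, size_bytes: int) -> None:
        new_total = self.current_bytes + size_bytes
        # Stop the query as soon as the estimate would exceed the limit.
        if self.limit_bytes > 0 and new_total > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"estimated {new_total} bytes exceeds limit of {self.limit_bytes} bytes")
        self.current_bytes = new_total

    def decrease(self, size_bytes: int) -> None:
        # Samples that are no longer held in memory reduce the estimate.
        self.current_bytes = max(0, self.current_bytes - size_bytes)


tracker = EstimatedMemoryTracker(limit_bytes=100)
tracker.increase(60)   # raw samples decoded from chunks
tracker.decrease(30)   # some intermediate results released
tracker.increase(70)   # exactly at the limit, still allowed
try:
    tracker.increase(1)
except MemoryLimitExceeded as err:
    print("query stopped:", err)
```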

Known differences compared to Prometheus’ engine

The following are known differences between MQE and Prometheus’ engine:

Binary operations that produce no series

When MQE evaluates a binary operation (for example +, -, /, and, or or), it uses the series labels on both sides to check whether the operation will produce no series, or whether some series from one side of the operation can be skipped.

If MQE can determine that a binary operation produces no series based on the series labels on both sides, it skips evaluating both sides. For example, if the query is foo / bar, and foo selects a single series foo{env="test"}, and bar selects a single series bar{env="prod"}, then the query cannot produce any series and so the data for each side is not evaluated.
If MQE can determine that some series on one side will not match anything on the other side, it will skip evaluating the series that do not match the other side. For example, if the query is foo / on (env) bar, and foo has series foo{env="1", region="a"} and foo{env="2", region="a"} and bar has bar{env="1", cluster="x"}, bar{env="3", cluster="x"} and bar{env="3", cluster="y"}, MQE will ignore the env="3" series from bar.
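The label-based check in the second case can be sketched with plain dictionaries. This is a conceptual model and not MQE's implementation: the function names are hypothetical, only on (...) matching is handled, and the sketch reports unmatched series on both sides, which goes slightly beyond the example above (that example only mentions the env="3" series from bar).

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def match_key(series: Labels, on: List[str]) -> Tuple[str, ...]:
    """Build the match group key from the labels listed in on (...)."""
    return tuple(series.get(name, "") for name in on)


def series_worth_evaluating(left: List[Labels], right: List[Labels],
                            on: List[str]) -> Tuple[List[Labels], List[Labels]]:
    """Illustrative only: keep the series whose match group exists on both sides."""
    shared = ({match_key(s, on) for s in left}
              & {match_key(s, on) for s in right})
    keep_left = [s for s in left if match_key(s, on) in shared]
    keep_right = [s for s in right if match_key(s, on) in shared]
    return keep_left, keep_right


# Series from the example query: foo / on (env) bar
foo = [{"__name__": "foo", "env": "1", "region": "a"},
       {"__name__": "foo", "env": "2", "region": "a"}]
bar = [{"__name__": "bar", "env": "1", "cluster": "x"},
       {"__name__": "bar", "env": "3", "cluster": "x"},
       {"__name__": "bar", "env": "3", "cluster": "y"}]

keep_foo, keep_bar = series_worth_evaluating(foo, bar, on=["env"])
print(keep_foo)  # only the env="1" series
print(keep_bar)  # only the env="1" series; both env="3" series are dropped
```

If both returned lists are empty, the operation cannot produce any series, which corresponds to the first case above.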

This has some noticeable side effects, including:

The querier might log the message aborted stream because query was cancelled: context canceled: query execution finished, because streaming data from ingesters and store-gateways is aborted without reading all the data when that data isn’t needed.
Some annotations aren’t emitted. For example, if the query above was rate(foo[1m]) / sum(rate(bar[1m])), Prometheus’ engine emits annotations such as metric might not be a counter, name does not end in _total/_sum/_count/_bucket: "foo" and metric might not be a counter, name does not end in _total/_sum/_count/_bucket: "bar". In contrast, MQE doesn’t emit these annotations, as they are only emitted during the evaluation of the series data, and not during the evaluation of the series labels.
MQE doesn’t return found duplicate series for the match group errors if a match group has no series on one side but multiple series on the other side, and those series have samples that conflict with each other.

topk and bottomk

MQE and Prometheus’ engine can produce different results for queries that use topk and bottomk if different series have samples with the same values. Prometheus’ engine does not have deterministic behavior in this case and can select different series on each evaluation of the query. MQE’s implementation differs from that of Prometheus’ engine, which can also lead to different results.
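The following toy example shows why tied sample values make the result depend on implementation details. It does not model how either engine selects series; it only demonstrates that a top-1 selection over tied values returns whichever tied series is encountered first. The function name and the series names are hypothetical.

```python
from typing import List, Tuple


def top1(samples: List[Tuple[str, float]]) -> str:
    """Illustrative only: return the series name with the highest value."""
    # Python's max() returns the first maximal item it encounters.
    return max(samples, key=lambda sample: sample[1])[0]


samples = [("series_a", 5.0), ("series_b", 5.0), ("series_c", 1.0)]

print(top1(samples))                  # series_a
print(top1(list(reversed(samples))))  # series_b: same data, different order
```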
