Grafana Labs blog on Grafana Labs

AI-assisted testing, extensions updates, and more: k6 2.0 is here

Théo Crevon — Tue, 12 May 2026 16:46:59

For years, teams have relied on k6 to take a more proactive approach to performance testing, ensuring they can catch issues early and deliver more reliable user experiences. That approach has helped make k6 one of the most widely used performance testing tools in the open source community today, with more than 30k stars on GitHub.

Last year, we introduced k6 1.0, a major release that brought TypeScript support, native extensions, revamped test insights, and production-grade stability guarantees.

Now, we’ve reached another milestone for the OSS project: k6 2.0 is generally available.

This latest release builds on k6 1.0 to better support faster, more automated software delivery lifecycles. We’ve introduced AI-assisted testing workflows, broader Playwright compatibility in the browser module, a new Assertions API, and more. Overall, the release makes it easier to author, validate, automate, and scale performance tests, especially as AI becomes a more integral part of your development workflows.

Even with these advancements, existing k6 users should still feel right at home: scripts, checks, thresholds, scenarios, and CI/CD workflows remain core to the testing experience.

Read on to learn more about what's new, and be sure to check out the k6 2.0 talk from GrafanaCON 2026 in the video below.

AI-assisted workflows for faster, scalable testing

AI is changing how software gets written. Developers can generate, refactor, and review code faster than ever, but faster output also raises the bar for validation.

As more teams bring AI assistants into their development workflows, testing needs to become easier to author and automate, and easier for both humans and agents to interpret. k6 2.0 is built around that shift: it helps teams create tests faster, express expectations more clearly, and scale validation from local development to production-like environments.

The release includes four new commands that enable deeper integration with AI workflows and help teams use k6 programmatically:

k6 x agent helps developers bootstrap agentic testing workflows in AI coding assistants like Claude Code, Codex, Cursor, and more. It sets up the configuration, skills, and references an agent needs to use k6 to write correct, idiomatic, and modern tests; turn requirements and expectations into a testing strategy; and build out a test suite.

k6 x mcp exposes k6 through a built-in Model Context Protocol server, giving compatible agents the tools and resources they need to work effectively with k6. Agents can validate and run scripts, inspect results, iterate quickly on the tests they write, and tap into k6 resources and best practices along the way.
k6 x docs gives agents and developers CLI access to k6 documentation, API references, and examples without leaving the session or resorting to web searches.
k6 x explore lets agents and developers browse the k6 extension registry from the CLI, filtering by type or tier and surfacing the imports, subcommands, and outputs each extension provides. Combined with automatic extension resolution, agents can discover the right extension for a testing scenario and pull it into a script without leaving the session.

These commands also reflect how k6 2.0 extends beyond test scripts. They are built on the same subcommand extension model now available to extension authors, which we’ll cover in the next section.

Extensions updates to expand the reach of k6

Extensions help you extend core k6 functionality with new features to support your specific reliability testing needs. The 2.0 release expands on extensions in multiple ways: it provides a consolidated catalog of official and community extensions, makes it easier to test more systems and protocols, and introduces a way to extend the k6 CLI itself.

A curated extensions catalog

In k6, official extensions are those owned and maintained by Grafana Labs, with defined compatibility expectations and support across a range of k6 versions. Community extensions are built and maintained by k6 contributors and members of our OSS community.

With k6 2.0, these extensions are consolidated into a single catalog that makes it easier to discover and use them, and more clearly defines the boundaries between them. Community extensions, for example, are clearly identified as community-maintained and must follow registry requirements before being included.

This distinction matters. Extensions can add new protocols, clients, outputs, and CLI workflows to k6, so teams need to understand what is maintained by Grafana Labs, what is maintained by the community, and what guarantees apply before adding an extension to their testing workflows.

The catalog also gives extension authors a clearer path to contribute. Public community extensions can be submitted for inclusion if they meet the registry requirements, including documentation, build instructions, usage guidance, and k6 version compatibility.

Test more systems and protocols

Modern systems consist of so much more than HTTP services and browser frontends. Teams also need to test databases, message queues, streaming APIs, DNS, event-driven systems, and other infrastructure components that sit on the critical path.

Official extensions maintained by Grafana Labs, including k6/x/faker, k6/x/mqtt, k6/x/sql, and k6/x/dns, sit alongside community extensions like k6/x/sse and k6/x/kafka to help with these needs.

For cataloged extensions that support automatic resolution, you can reference the extension in your script and let k6 handle the rest. For custom extensions or extensions outside automatic resolution, xk6 is still available.

xk6 as an extension development toolbox

Extensions are only as healthy as the tooling around them. In k6 2.0, xk6 grows from a custom k6 build tool into a full extension development toolbox.

Extension authors can scaffold a new project from official templates with xk6 new, build and run k6 with an in-development extension in one step, check a project against the registry's compliance requirements with xk6 lint, and run a suite of k6 scripts against the extension with xk6 test, reporting results in TAP or CTRF JSON for CI/CD pipelines.

The result is a shorter path from idea to a published, catalog-ready extension, and a consistent baseline of quality across official and community extensions alike.

Subcommand extensions

Not every extension needs to be something you import in a test script. k6 2.0 introduces subcommand extensions, a new way to add custom commands under the k6 x namespace.

This means teams can build workflows around test authoring, environment setup, documentation, result processing, mocks, internal tooling, or anything else they need close to the k6 runtime.

We’re already using this model internally at Grafana Labs: k6 x agent, k6 x mcp, k6 x docs, and k6 x explore are all built as subcommand extensions. The same mechanism that powers these AI-assisted workflows is now available to extension authors.

Writing familiar browser and assertion tests

k6 2.0 significantly expands compatibility between the k6 browser module and the Playwright API, making it easier for teams to apply existing browser testing knowledge and adapt existing Playwright tests to k6.

This is important because browser testing is often where functional correctness, user experience, and performance meet. With a more familiar API surface, teams can progress more easily from “does this user flow work?” to “how does this user flow behave under load?”

k6 2.0 also introduces a new Assertions API. The expect() API brings a Playwright-inspired assertion style to k6 scripts, with expressive matchers for both protocol and browser testing.

Assertions come in two forms:

Non-retrying assertions, which evaluate whether a condition is true immediately. They’re useful for static values such as HTTP status codes, response headers, JSON payloads, and configuration.

import http from 'k6/http';
import { expect } from 'https://jslib.k6.io/k6-testing/0.6.1/index.js';

export default function () {
  const response = http.get('https://quickpizza.grafana.com/');
  expect(response.status).toBe(200);
  expect(response.body).toBeDefined();
}

Auto-retrying assertions, which hold the execution of the test until a condition becomes true or a timeout is reached. They’re especially useful for browser tests where elements may take time to appear, update, or become interactive.

import { browser } from 'k6/browser';
import { expect } from 'https://jslib.k6.io/k6-testing/0.6.1/index.js';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: {
        browser: {
          type: 'chromium'
        }
      },
    },
  },
};

export default async function () {
  const page = await browser.newPage();
  await page.goto('https://quickpizza.grafana.com/');
  await expect(page.locator("h1")).toContainText("Welcome to QuickPizza!");
}

Assertions complement existing k6 checks. Checks are still a great fit for load testing because they continue execution and emit metrics for threshold evaluation. Assertions are designed for use cases where a failed expectation should stop the test because the scenario is no longer valid.
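
To make the difference concrete, here is a minimal plain-JavaScript model of the two behaviors. The helpers `recordCheck`, `assertEqual`, and `runIteration` are hypothetical stand-ins written for this illustration; they are not the k6 `check()` or `expect()` implementations. The check-like helper records a result and lets the iteration continue, while the assertion-like helper throws and ends it.

```javascript
// Simplified stand-ins (hypothetical, not k6 APIs) that model the two behaviors.

// Check-like: record the outcome and keep going.
function recordCheck(results, name, condition) {
  const passed = Boolean(condition);
  results.push({ name, passed });
  return passed;
}

// Assertion-like: stop the iteration as soon as an expectation fails.
function assertEqual(actual, expected, message) {
  if (actual !== expected) {
    throw new Error(`${message}: expected ${expected}, got ${actual}`);
  }
}

// One simulated iteration against a fake response object.
function runIteration(response) {
  const results = [];
  let completed = false;
  try {
    // Failed checks are recorded, but execution continues.
    recordCheck(results, 'status is 200', response.status === 200);
    recordCheck(results, 'body is not empty', response.body.length > 0);
    // If login failed, the remaining steps are meaningless, so assert instead.
    assertEqual(response.loggedIn, true, 'login precondition');
    completed = true;
  } catch (err) {
    // The thrown assertion ends the iteration; keep its message for reporting.
    results.push({ name: err.message, passed: false });
  }
  return { results, completed };
}

const outcome = runIteration({ status: 500, body: 'oops', loggedIn: false });
console.log(outcome.completed, outcome.results);
```

In this model the failed status check does not prevent the body check from running, but the failed precondition prevents the iteration from completing, which mirrors the split described above.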

From AI-authored tests to production-scale validation

A locally run test is a useful starting point for evaluating performance. But as teams bring testing into AI-assisted workflows and CI/CD pipelines, results need to be machine-readable and test execution needs to scale beyond a single machine.

k6 2.0 adds a new JSON summary output, making end-of-test results easier for CI/CD systems and AI agents to consume. Instead of scraping terminal output, tools can read structured results and make decisions based on them.

For real-time observability, native OpenTelemetry output makes it easier to analyze k6 results alongside the application telemetry teams already use.

And for teams that need production-scale load, k6 Operator 1.0 is now stable. The operator lets teams run distributed k6 tests on Kubernetes, closer to the environments where their applications already run.

Getting started with k6 2.0

Here are a few ways to try k6 2.0 today:

Initialize: Set up AI-assisted test authoring with k6 x agent.
Adapt: Convert or migrate existing Playwright tests with expanded browser module compatibility.
Extend: Explore official and community extensions for protocols and systems beyond HTTP.
Scale: Run distributed tests on Kubernetes with k6 Operator 1.0.

Thank you to the k6 community!

To everyone in the community who contributed features, filed issues, fixed bugs, wrote extensions, tested early builds, or pushed for more reliable software: thank you. k6 2.0 would not be possible without you.

You can learn more in our k6 documentation, and we’d love to hear what you think on GitHub.

Happy testing!

Grafana Cloud is the easiest way to get started with Grafana k6 and performance testing. We have a generous forever-free tier and plans for every use case. Sign up for free now!

Eliminate noisy log lines with Adaptive Logs drop rules

Steven Dungan — Thu, 07 May 2026 17:09:18

Most platform and observability teams have logs they know are noise. These could be throwaway health check logs, forgotten DEBUG logs, or verbose INFO logs from little-used services that only serve to inflate your bill.

Regardless of what they contain and why they're there in the first place, the hard part is getting rid of them. Centralized teams want to easily and quickly prevent these logs from being ingested, without having to work with toilsome infrastructure change management to do so. There hasn't been a simple way to use Grafana Cloud to drop them—until now.

With the new drop rules feature in Adaptive Logs, now in public preview, you can define your own rules to drop low-value logs before they are written to Grafana Cloud Logs, reducing noise and saving money right away.

This update brings Adaptive Logs the same capability already available in Adaptive Metrics and Adaptive Traces: supplementing our intelligent optimization recommendations with your own custom rules for dropping data.

How drop rules help you reduce waste with Adaptive Logs

With each drop rule, you can create logic using any combination of log labels, detected log levels, or line content to drop logs before they are written to Grafana Cloud Logs. Check out our documentation to learn how it works.

Here are a few examples of things you can do with drop rules:

Drop logs by level. Drop noisy DEBUG logs that eat into your logging budget.
Sample chatty, repetitive logs. Adaptive Logs drop rules allow you to specify a drop percentage, effectively allowing you to sample noisy logs you don’t want to entirely discard.
Target a specific noisy producer of logs. A particular service may have just started emitting lots of high-volume, low-value logs. Target these logs by specifying a label selector in combination with other criteria such as a log level or text string.

How drop rules fit into Adaptive Logs

Drop rules are one of three mechanisms Adaptive Logs uses to manage log volume. When a log line arrives in Grafana Cloud, it's evaluated in this order:

Exemptions: Protected logs pass through untouched. If a log line matches an exemption, no sampling is applied.
Drop rules: Evaluated in priority order. The first matching rule applies its drop rate.
Patterns: Optimization recommendations can be applied to remaining log lines that weren't exempted or filtered using drop rules.
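
The evaluation order above can be sketched as a small function. This is an illustrative model only, not how Adaptive Logs is implemented: the field names (`exemptions`, `dropRules`, `priority`, `selector`, `dropRate`), the exact-match selector logic, and the convention that a lower `priority` number is evaluated first are all assumptions made for the example.

```javascript
// Illustrative model only: field names and matching logic are assumptions.

// A selector matches when every key/value pair it lists equals the line's value.
function matches(selector, line) {
  return Object.entries(selector).every(([key, value]) => line[key] === value);
}

// Returns which mechanism decided the line's fate and whether it is kept.
// `random` is injectable so the sampling step can be tested deterministically.
function evaluate(line, config, random = Math.random) {
  // 1. Exemptions: protected logs pass through untouched.
  if (config.exemptions.some((selector) => matches(selector, line))) {
    return { decidedBy: 'exemption', kept: true };
  }
  // 2. Drop rules: evaluated in priority order; the first match applies its drop rate.
  const rules = [...config.dropRules].sort((a, b) => a.priority - b.priority);
  const rule = rules.find((r) => matches(r.selector, line));
  if (rule) {
    return { decidedBy: `drop rule "${rule.name}"`, kept: random() >= rule.dropRate };
  }
  // 3. Patterns: remaining lines are handed to optimization recommendations,
  //    so this sketch does not decide whether they are kept.
  return { decidedBy: 'patterns', kept: null };
}

const config = {
  exemptions: [{ service: 'payments', level: 'error' }],
  dropRules: [
    { name: 'sample batch job', priority: 2, selector: { service: 'batch' }, dropRate: 0.9 },
    { name: 'drop debug', priority: 1, selector: { level: 'debug' }, dropRate: 1.0 },
  ],
};

// Matches both rules, but "drop debug" has the higher priority and wins.
console.log(evaluate({ service: 'batch', level: 'debug' }, config));
```

A drop rate of 1.0 always drops the line in this model, while 0.9 keeps roughly one line in ten, which is the sampling behavior described for chatty logs.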

Drop rules, recommendations, and exemptions: a complete system

Drop rules are one piece of a complete log cost management system in Adaptive Logs. Each mechanism serves a different purpose:

Drop rules eliminate known noise. Your platform team knows health check logs don't need to be stored in Grafana Cloud. A single rule with a 100% drop rate enforces that standard across every service, without requiring individual teams to change their logging configuration.

Drop rules apply sampling to specific workloads. A batch processing job generates repetitive log output. Target it with a stream selector and a 90% drop rate to keep a representative sample without the full volume.

Recommendations surface waste you haven't identified. Adaptive Logs analyzes 15 days of query behavior and surfaces patterns you're not using. This is where optimization recommendations excel—finding the waste that no one on your team has noticed yet.

Exemptions retain known business-critical logs. Your compliance team requires full retention of audit logs. Your SRE team needs every error log from payment processing. Exemptions protect specific log streams from any drop rules or recommendations—a targeted way to retain the logs you can't afford to lose.

Pause Adaptive Logs for full visibility. When you need more visibility, such as during an incident or a deployment, you can pause Adaptive Logs to selectively override drops for one hour. Combined with persistent exemptions, your team gets full visibility exactly when it matters.

Drop rules handle the noise you already know about. Pattern-based recommendations surface the waste you haven't spotted yet. Exemptions protect the logs you can't afford to lose. Together, they give your platform team complete control over log volume and cost.

How to get started

Drop rules are available now in public preview in Grafana Cloud. This is an admin experience designed for platform and observability teams who manage log ingestion centrally.

To create your first rule, navigate to Adaptive Telemetry → Adaptive Logs → Drop rules tab → Create a new drop rule.

Next, you need to configure your criteria. We recommend starting simple. A good first rule: drop 90% of debug-level logs from a known-noisy service. The rule takes effect immediately. Monitor impact from the drop rules page, where you can see how much volume each rule is dropping.

Since our public preview release is designed for admins, drop rules require the Adaptive Logs admin role (plugins:grafana-adaptivelogs-app:admin), which is granted by default to Grafana Admins and Org Admins. This keeps rule creation and management in the hands of the team responsible for your organization's observability costs and policies.

And if you're already using Adaptive Logs recommendations, drop rules are available in the same UI. Open the drop rules tab to start combining both approaches today.

Additionally, you can check out gcx, our new agent-friendly CLI, to create and manage drop rules.

Adaptive Logs is included at no additional cost on all Grafana Cloud tiers. For more information on drop rules or any of the other capabilities discussed here, check out our documentation for Adaptive Logs and Adaptive Telemetry.

Troubleshoot performance issues faster with the new Grafana Assistant integration for Database Observability

Jeremy Heller — Wed, 06 May 2026 15:08:55

So your database is slow. Now what?

Grafana Cloud Database Observability already gives you visibility into your SQL queries with RED metrics, individual execution samples, wait event breakdowns, table schemas, and visual explain plans. But visibility is just the starting point.

You can see that a query's P99 latency spiked, but what should you do about it? You can see wait events like wait/synch/mutex/innodb firing, but what does that actually mean?

Thankfully, you can now use the new Grafana Assistant integration for Database Observability to find those answers more easily and quickly than ever. You get the power of AI, coupled with the depth of Grafana Cloud’s observability capabilities, available every time you investigate a query.

The best part: you don't have to worry about assembling context, explaining schema, or describing time ranges. The assistant isn't working from a copy of your SQL pasted into a separate AI tool. Instead, it runs queries against your actual Prometheus and Loki data sources, in the time window you're looking at, with your real table schemas, indexes, and execution plans already loaded.

Each tab has purpose-built analysis actions designed by database engineers rather than generic prompts. Every analysis is based on real data from your database and provides specific advice. Your query text and schema metadata are used only for the current analysis, and are not stored or used for model training.

Prompts for tackling common database issues

To illustrate how effective this integration is, let's walk through some examples of how the assistant helps you quickly solve some common problems.

Yes, you can still freely prompt directly in the assistant chat box the same way you normally would, but we've built out-of-the-box AI buttons to provide a guided experience for tackling slow or degraded queries, or for getting recommendations on changes.

Why is this query slow?

You've found the offending query. It's in the overview, its duration is spiking, and its error rate is climbing. You click into it and see specific time-series performance data.

The data is all there, but the diagnosis isn't obvious. Is it a bad join or lock contention? A table scan that wasn't a problem until the data grew?

Click the button to open the assistant with the predefined prompt.

It goes to work, querying both Loki and Prometheus for the selected time window and synthesizing the results into a single health assessment. Duration is spiking because the number of rows examined is 50 times the number of rows returned, which means most of the work is wasted on filtering. The P99 is 12x the median, which means the problem is intermittent, not constant. CPU time is healthy, but wait events are eating 40% of execution time.
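
The three ratios in that assessment are straightforward to reproduce from raw execution samples. The sketch below uses made-up samples and a nearest-rank percentile; the field names (`durationMs`, `rowsExamined`, `rowsReturned`, `waitMs`) are assumptions for illustration and not the Database Observability schema, and the assistant's own method may differ.

```javascript
// Nearest-rank percentile over a list of numbers (p between 0 and 100).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function sum(values) {
  return values.reduce((total, value) => total + value, 0);
}

// Summarize samples into the three ratios discussed above.
function assess(samples) {
  const durations = samples.map((s) => s.durationMs);
  return {
    // Rows scanned for every row actually returned.
    examinedPerReturned:
      sum(samples.map((s) => s.rowsExamined)) / sum(samples.map((s) => s.rowsReturned)),
    // A large gap between P99 and the median points to an intermittent problem.
    p99OverMedian: percentile(durations, 99) / percentile(durations, 50),
    // Share of total execution time spent in wait events.
    waitShare: sum(samples.map((s) => s.waitMs)) / sum(durations),
  };
}

// Hypothetical samples: nine fast executions and one slow outlier.
const samples = [
  ...Array.from({ length: 9 }, () => ({
    durationMs: 10, rowsExamined: 400, rowsReturned: 10, waitMs: 2,
  })),
  { durationMs: 120, rowsExamined: 1400, rowsReturned: 10, waitMs: 66 },
];

console.log(assess(samples));
```

With these invented numbers the function reports 50 rows examined per row returned, a P99 that is 12 times the median, and a 40% wait share, matching the shape of the assessment described above.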

The last point is crucial. Wait events have names like wait/synch/mutex/innodb or io/table/sql/handler. These names aren’t self-explanatory, but the assistant is still able to understand them and lets you know:

"During this wait, the database is physically reading data from disk because the requested rows aren't in the buffer pool. This is happening because the query performs a sequential scan on the orders table, which has 1.2 million rows and no index on order_date."

It connects the metric (40% wait time) to the cause (sequential scan, missing index) to the table (orders) to the column (order_date) all in one response, using your actual schema and execution plan. You’ve saved yourself a lot of time and avoided going down a rabbit hole.

What should I actually change?

Sometimes you aren't sure what to ask. To help in those situations, Assistant produces specific, testable SQL like a CREATE INDEX statement with:

The columns in the right order
An explanation of why that column order matters for your query's WHERE clause and JOIN conditions
A note about the write-performance trade-off

These recommendations are dialect-specific. For a PostgreSQL query, the assistant might suggest a partial index on a filtered subset; for the same pattern in MySQL, a prefix index with appropriate key length. Plus, documentation links point to the right vendor docs, not generic SQL references.

Note: We recommend reviewing the suggested changes in a staging environment before applying them to production.

When the fix is a query rewrite rather than an index modification, the assistant analyzes the step-by-step breakdown of how the database processes your query (the EXPLAIN plan) and identifies the operations consuming the most cost.

For example, it might spot a nested loop join that should be a hash join, a sort operation that could be eliminated with a composite index, or a subquery that would be faster as a CTE. Each bottleneck comes with a fix, and each fix comes with a verification step: an EXPLAIN command you can run afterward to confirm the plan actually changed.

The recommendations are based on your schema. The assistant knows which indexes already exist on your tables, which foreign key relationships are defined (and which are missing), and how the database's execution plan currently uses that infrastructure. If your indexes are already well-designed, it says so and doesn't invent problems.

Is this getting worse?

Some queries aren't broken; they're degrading. The P50 duration is fine today, but it's 20% slower than it was last week. No one noticed because there was no single incident, just a slow creep.

The assistant analyzes individual execution samples and surfaces the patterns. It compares extremes directly: in one investigation, the fastest execution took 12 ms and examined 200 rows, while the slowest took 3.4 seconds and examined 180,000 rows. Same query, same schema, but roughly 280x longer.

The assistant highlights the difference between fast and slow executions: rows examined, wait event breakdown, and timing. From there, it can identify likely causes.

For example, it can identify a parameter value that hits a larger data partition, lock contention that only appears under load, or a plan change triggered by stale statistics. The result is a full diagnosis based on factual data rather than a guess based on the query text alone.
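
The fastest-versus-slowest comparison quoted above is simple arithmetic, and working it through shows why rows examined is the first thing to look at. The snippet below uses only the figures from that investigation; the object and field names are chosen for illustration.

```javascript
// Figures quoted in the investigation above.
const fastest = { durationMs: 12, rowsExamined: 200 };
const slowest = { durationMs: 3400, rowsExamined: 180000 };

// Ratio of slowest to fastest for one numeric field.
function ratio(field) {
  return slowest[field] / fastest[field];
}

const durationRatio = ratio('durationMs'); // about 283, the "280x" in the text
const rowsRatio = ratio('rowsExamined');   // 900

// Time spent per examined row in each execution, in microseconds.
const usPerRowFast = (fastest.durationMs * 1000) / fastest.rowsExamined;
const usPerRowSlow = (slowest.durationMs * 1000) / slowest.rowsExamined;

console.log(`duration ratio: ${durationRatio.toFixed(0)}x`);
console.log(`rows examined ratio: ${rowsRatio}x`);
console.log(`per-row cost: ${usPerRowFast.toFixed(1)} us vs ${usPerRowSlow.toFixed(1)} us`);
```

The slow execution examines 900 times as many rows while taking about 283 times as long, so its per-row cost is actually lower. In this example the slowdown comes from how much data is scanned, not from each row becoming more expensive, which is consistent with causes such as a parameter value that hits a larger partition.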

All your context, for your entire team, in one place

The assistant is also positioned to help entire teams. Every conversation can be shared with team members, so a developer can send the assistant's analysis to a DBA for review before applying an index change, or attach it to a pull request or incident ticket as supporting evidence.

And it's right there when you need it. Click the button on any tab. Get a diagnosis grounded in your live metrics, a fix tailored to your schema, and the ability to share the conversation with team members.

Get started today

The AI Assistant is now generally available in Grafana Cloud Database Observability. For those of you who used the previous AI Helper during the preview phase, we think you'll find this new Assistant integration more comprehensive. Beyond query assistance, it also helps you understand explain plans and table schemas, and it supports follow-on conversations that you can share with your team.

To get started, navigate to a query's detail view and look for the Assistant button on the query performance, query samples, wait events, table schema, or explain plan tabs.

For setup instructions and supported databases, check out the Database Observability documentation.

Faster fixes, less context sharing: how Grafana Assistant learns your infrastructure before you even ask

William Dumont — Thu, 30 Apr 2026 17:59:49

When an unexpected alert fires these days, most engineers' first move is to ask their AI assistant for help. You ask why your checkout service is slow and the assistant gets to work, but it can't get any meaningful insights, at least not quickly, without the proper guidance. So, the next thing you know, you're sharing details about your existing data sources, the services you have running, how they connect, which labels and metrics matter, and on and on.

Every conversation starts from scratch, and that discovery process eats into the time you actually need for troubleshooting.

But if you're using Grafana Assistant, our agentic observability assistant, you can skip right over all that context and dive straight into troubleshooting. Assistant doesn't learn about your environment on demand. Instead, it studies your infrastructure ahead of time and builds a persistent knowledge base. That way, by the time you ask your first question, it already knows what's running, how it's connected, and where to look.

Fostering a knowledge base to jumpstart incident response

Because Assistant automatically builds and maintains a knowledge base about your environment, it already knows what services you run, how they connect, which metrics and labels matter, where the logs live, and how things are deployed.

Think of it as giving the assistant a map of your world before it starts answering questions.

As a result, conversations become faster and more accurate. When you ask about a service, the assistant doesn't need to fumble through data source discovery. It already knows that your payment system talks to three downstream services, that its latency metrics live in a specific Prometheus data source, and that its logs are structured JSON in Loki.

When an incident hits, speed matters. Having that context preloaded can shave valuable minutes off of your response time even if you're experienced with the system. But this functionality is especially powerful for teams where not everyone has the full picture of the infrastructure. A developer investigating an issue in their service can ask about upstream dependencies and get accurate answers, even if they've never looked at those systems before.

How does it work?

Assistant runs this infrastructure memory in the background with zero configuration. A swarm of AI agents does the heavy lifting:

Data source discovery: The system identifies all connected Prometheus, Loki, and Tempo data sources in your Grafana Cloud stack.
Metrics scans: Agents query your Prometheus data sources in parallel to find services, deployments, and infrastructure components.
Enrichments via logs and traces: Loki and Tempo data sources get correlated with their corresponding metrics, adding context about log formats, trace structures, and service dependencies.
Structured knowledge generation: For each discovered service group, agents produce documentation covering five areas: what the service is, its key metrics and labels, how it's deployed, what it depends on, and how its logs are structured.

This knowledge is stored as searchable chunks in a vector database, so when you or the assistant need information about a specific service, it can be retrieved in milliseconds through semantic search.
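
For readers unfamiliar with semantic search, here is a minimal sketch of the retrieval idea: each chunk carries an embedding vector, and a query is answered by ranking chunks by cosine similarity to the query's embedding. It is a toy model, not Assistant's implementation. The chunk fields and the three-dimensional vectors are invented for the example, and a real vector database would use embeddings from a model and an index instead of a linear scan.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function search(chunks, queryEmbedding, k) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosine(chunk.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy 3-dimensional embeddings (hypothetical values).
const chunks = [
  { service: 'payments', topic: 'key metrics', embedding: [0.9, 0.1, 0.0] },
  { service: 'payments', topic: 'log structure', embedding: [0.1, 0.9, 0.1] },
  { service: 'checkout', topic: 'dependencies', embedding: [0.2, 0.2, 0.9] },
];

// A query whose embedding sits closest to the "key metrics" chunk.
const top = search(chunks, [0.8, 0.2, 0.1], 1);
console.log(top[0].service, top[0].topic);
```

Because retrieval is a similarity lookup over precomputed chunks, no discovery work has to happen at question time, which is why stored knowledge can be fetched quickly.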

The whole process refreshes automatically on a weekly cadence, so your assistant's understanding of your infrastructure stays current as your environment evolves.

What does the assistant actually learn?

For every service group it discovers, Assistant captures five categories of knowledge:

Identity and purpose: What the service is, what it does, which namespace and cluster it belongs to, and what technology stack it uses
Key metrics: The metric names and labels relevant to the service, including golden signals such as latency, error rate, traffic, and saturation. Not generic guesses, but the actual metric names from your Prometheus data sources
Deployment topology: Kubernetes resources, replica counts, scaling configurations, and container details
Dependencies: Upstream and downstream service connections, database and cache relationships, message queue interactions, and external integrations
Log structure: Available log labels and their values, detected log formats (JSON, logfmt, or unstructured), common patterns, and extracted field names

This is the kind of context that makes the difference between an assistant that gives you a generic answer and one that gives you the right answer for your environment.

You don't have to do anything

This isn't a feature you configure, enable, or maintain. It runs automatically for all Grafana Cloud customers who use Assistant. There are no setup steps, no configuration files, no scheduled jobs to manage.

Your existing telemetry data is the input. The assistant reads what's already in your Prometheus, Loki, and Tempo data sources and builds its understanding from there. If you have metrics, you get this infrastructure memory capability.

You can review what the assistant has learned by navigating to the Assistant settings and browsing the discovered service groups. You can also trigger a manual scan if you want to refresh the knowledge base ahead of the next automatic cycle.

Assistant also respects your organization's access controls. Each memory is linked to the data sources used to generate it, so users only see knowledge derived from data sources they have permission to access.
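
One way to picture this kind of filtering is shown below. The sketch assumes a memory is visible only when the user can access every data source it was derived from. That policy, along with the `serviceGroup` and `dataSources` field names and the data source names, is an assumption for illustration; the source does not specify how Assistant resolves partially accessible memories.

```javascript
// A memory is visible only if the user can read every data source behind it
// (an assumed policy for this sketch).
function visibleMemories(memories, allowedDataSources) {
  const allowed = new Set(allowedDataSources);
  return memories.filter((memory) => memory.dataSources.every((ds) => allowed.has(ds)));
}

// Hypothetical memories, each linked to the data sources used to generate it.
const memories = [
  { serviceGroup: 'payments', dataSources: ['prom-prod', 'loki-prod'] },
  { serviceGroup: 'internal-tools', dataSources: ['prom-internal'] },
  { serviceGroup: 'checkout', dataSources: ['prom-prod', 'tempo-prod'] },
];

// This user can read the production Prometheus and Loki data sources only.
const visible = visibleMemories(memories, ['prom-prod', 'loki-prod']);
console.log(visible.map((memory) => memory.serviceGroup));
```

Under this assumed policy the user sees the payments memory but not the checkout one, because part of the checkout memory came from a tracing data source they cannot access.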

A foundation for smarter conversations

This is one of those features that works best when you don't notice it. You ask a question, you get a precise answer that references the right metrics, the right labels, the right data sources. You don't have to wonder whether the assistant actually understands your environment. It does, because it already mapped it.

For us, this is a step toward an assistant that genuinely understands the infrastructure it's helping you observe and knows your system well enough to ask the right questions on its own.

Get observability in the terminal, for you and your agents, with the gcx CLI tool

Ward Bekker — Tue, 28 Apr 2026 19:54:52

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too.

Engineers today spend much of their day working via the command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

Moreover, agents introduce a new visibility gap: they can see your code on your machine, but they're blind to what's going on in your production environment. They don't see the latency spike on checkout. They don't know whether you're actually hitting your SLOs. They write code based on what could happen rather than what is actually happening.

To address these issues, we've launched the public preview of gcx, the new Grafana Cloud CLI. gcx brings Grafana Cloud and Grafana Assistant directly into your terminal—and to the agentic coding environment running inside it—so you can spot and resolve incidents in minutes instead of hours.

From greenfield to full observability in minutes

gcx is built to do the heavy lifting. Most services start out with no instrumentation, no alerts, and no SLOs. That's the normal state of things, and gcx treats it as a starting line rather than a blocker.

All you have to do is point your agent at the service and ask it to bring it up to standard. gcx exposes the primitives it needs across the full observability lifecycle:

Instrumentation: Wire OpenTelemetry into the codebase; validate that metrics, logs, and traces are flowing; confirm the data is landing in the right metrics, logs, and traces backends, all from the terminal.
Alerting, SLOs, and synthetic checks: Generate alert rules from the signals your service actually emits. Define an SLO against a real latency or availability indicator and push it live. Stand up synthetic probes so users aren't the ones reporting the outage.
Frontend Observability, Application Observability, and Kubernetes Monitoring: Onboard a Faro-instrumented frontend, create the app, and manage sourcemaps so stack traces are readable. Onboard backend services and Kubernetes infrastructure with Instrumentation Hub.
Everything as code: Pull dashboards, alerts, SLOs, and checks as files. Edit them locally with your agent. Push them back. Open a deep link into Grafana Cloud the moment a human needs to look.

With all this in one place, what used to be a multi-day ticket becomes a one-agent session.

Why this matters for agents

Being able to access Grafana via the command line and manage your observability as code is great, but the real power of gcx shines through when you give your agents access to it.

Without production context, an agent is pattern-matching on source files and hoping to find the right answer. With gcx, the same agent can read the state of the running system and make more informed decisions based on actual production observations rather than assumptions.

As a result, the shape of the conversation changes. Here are just a few examples of the questions you might ask as a user, along with how the agent interprets and responds to those requests by using gcx:

"Why did this endpoint get slower this week?" The agent pulls traces and latency histograms, not just the diff.
"Is my new query efficient?" The agent runs the PromQL query against the actual metrics backend instance and iterates on the result.
"Are we meeting the SLO for checkout?" The agent reads the SLO definition and the current burn rate before writing a line.
"This alert is noisy, fix it." The agent inspects the rule, the firing history, and the related dashboards, then proposes a tuned threshold.

How gcx behaves when an agent drives it

It's become clear that agentic coding tools belong in the terminal. CLIs match how models actually reason—text in, text out, stable exit codes—and they compose with every credential and config the developer already has. CLI-driven agents also tend to be cheaper per task and more reliable than equivalent GUI-driven setups.

gcx is built for this world: it is designed with agents in mind, helping them respond faster to CLI output while reducing token burn.

Agents already know how to run git, kubectl, and go test. gcx fits in the same slot, and its defaults are tuned for the case where an LLM is the caller.

Two things matter most when an LLM (instead of a human at a keyboard) is the caller:

Every command emits JSON or YAML via --output, with field names that stay stable across versions. That way, an agent parsing today's response will still work next month.
Exit codes and error shapes are documented and consistent, so an agent can branch on failure and recover on its own instead of guessing from a stderr string.
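To illustrate why structured output and stable exit codes matter, here is a minimal sketch of how an agent harness might interpret the result of any CLI call. It is not gcx's API: the `interpretCliResult` helper, the result shape (modeled on the `status`, `stdout`, and `stderr` fields that Node's `child_process.spawnSync` returns), and the sample payloads are all assumptions for illustration. Refer to gcx's own documentation for its actual exit codes and error shapes.

```javascript
// Hypothetical helper: decide what to do with a finished CLI call.
// `result` mirrors the fields Node's child_process.spawnSync returns
// (status, stdout, stderr); nothing here is specific to gcx.
function interpretCliResult(result) {
  if (result.status !== 0) {
    // Branch on the exit code rather than pattern-matching stderr text.
    return { ok: false, exitCode: result.status, detail: result.stderr.trim() };
  }
  try {
    // Structured output: parse once, then read fields by name.
    return { ok: true, data: JSON.parse(result.stdout) };
  } catch (err) {
    // Exit code 0 with unparseable output is still a failure for the caller.
    return { ok: false, exitCode: 0, detail: `invalid JSON: ${err.message}` };
  }
}

// Sample payloads are made up for the example.
const success = interpretCliResult({
  status: 0,
  stdout: '{"items":[{"name":"checkout"}]}',
  stderr: "",
});
const failure = interpretCliResult({ status: 2, stdout: "", stderr: "not found\n" });
console.log(success.ok, success.data.items[0].name);
console.log(failure.ok, failure.exitCode, failure.detail);
```

Because the caller only depends on the exit code and on named fields, changes to human-readable wording do not break it, which is the property the two points above describe.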

Beyond that, gcx is tuned for the agent case in the ways you'd expect: it auto-detects when it's being driven by Claude Code, Cursor, or similar, and it drops spinners, truncation, and other human-friendly noise (you can also force this behavior with GCX_AGENT_MODE=true).

It ships a machine-readable catalog of its own commands and flags, so agents can discover capabilities at runtime instead of guessing from stale training data. For example, it will find commands that result in destructive operations, which require explicit confirmation to reduce agent mistakes. And kubectl-style named contexts let an agent juggle several stacks in one session without mutating global state.

The agent calls gcx the way it already calls git or kubectl: run the command, read the output, move on. No wrapper, no shim, no bespoke integration layer.

Try gcx today, and get a jumpstart with custom observability skills

To get started, install gcx from https://github.com/grafana/gcx. From there, point it at your Grafana Cloud stack, hand it to your coding agent, and start fixing struggling services in minutes.

And while capable agents can work out most gcx workflows on their own, we also include a bundle of portable agent skills to accelerate tasks that come up often.

Skills are specialized instructions designed to guide AI agents, and the gcx agent skills cover observability setup, alert investigation, SLO management and investigations, synthetic check investigations and more. They work in any harness that follows the .agents skill convention, including Claude Code, and they can be installed with one command.

$ gcx skills install --all

Next, run gcx skills list to see the complete list of available skills.

And from there, you can start putting gcx to use, helping to reduce alert noise, keep resource cost under control, and catch production issues earlier. The important shift that comes as a result: the agent writing the code now has the same production view as the on-call engineer.

Secure performance testing at scale: Introducing secrets management for Grafana Cloud k6

Facundo Batista — Tue, 28 Apr 2026 14:47:13

To simulate real user behavior, performance tests often rely on API keys, tokens, or credentials to interact with real systems. But as your testing suite grows, this sensitive data can start to sprawl across scripts, configs, and environments, increasing the risk of exposure and making tests harder to manage and maintain.

To address this challenge, we’re rolling out secrets management for Grafana Cloud k6, the fully managed performance testing platform powered by k6 OSS. Secrets management allows you to securely store and use sensitive values in your load tests. This means if your tests rely on API tokens, credentials, or any other confidential data, you no longer need to hardcode them into your scripts or pass them around manually.

With secrets management, secrets are stored centrally in Grafana Cloud and injected into your tests at runtime. This keeps your scripts clean, avoids accidental leaks in version control, and makes it easier to reuse the same test across environments.

Here’s a look at how to get started.

Getting started: How to manage secrets from the Grafana Cloud UI

Secrets can be created and managed directly from the Grafana Cloud web UI. To access them, navigate to Testing & synthetics > Performance > Settings, and open the Secrets tab from the menu.

From this interface, you can perform all the basic lifecycle operations:

Create secrets by providing a name, description, and value. The name is how the secret will be referenced in your tests, and the value is the sensitive data itself. Once saved, the secret becomes immediately available to your tests. For each secret, you can also write a description (e.g., to explain the secret’s purpose) and use labels to help with organization.
Edit secrets to modify their values, descriptions, or labels. Note that editing a secret does not reveal its current value; instead, you provide a new value that replaces the previous one. This ensures that secrets are never exposed through the UI after they are initially set. If you need to rotate credentials, you can simply overwrite the existing secret with a new value.
Delete secrets that are no longer needed.

A key design principle here is that secret values are write-only in the UI. After creation, they cannot be read back or displayed. This prevents accidental exposure through screenshots, screen sharing, or casual inspection, and aligns with common security practices.

Using secrets in your Grafana Cloud k6 tests

Once your secrets are defined, using them in your tests is simple. Grafana Cloud k6 provides a dedicated module, k6/secrets, which allows you to retrieve secret values at runtime.

You can import the module and access a secret by its name:

import { check } from 'k6';
import http from 'k6/http';
import secrets from 'k6/secrets';

export default async function main() {
  // Fetch the secret by name at runtime; the value is never stored in the script.
  const apiToken = await secrets.get('api-token');
  const headers = {
    Authorization: `Bearer ${apiToken}`,
  };
  // Logged secret values are redacted automatically in the test output.
  console.log('Headers: ' + JSON.stringify(headers));
  const res = http.get('https://example.com/api', { headers: headers });
  check(res, { 'get executions status is 200': (r) => r.status === 200 });
}

In this example, the secret "api-token" is fetched when the test runs and used as part of an HTTP request. From the script's perspective, the returned value behaves like a regular string, so you can use it anywhere you would normally use a variable.

This makes it easy to integrate secrets into existing scripts without major refactoring. You can gradually replace hardcoded values or environment variables with secrets managed in Grafana Cloud.

Secrets are also protected during test execution. If a secret is accidentally logged, its value will not be exposed in the logs. Instead, it will be redacted automatically. This reduces the risk of leaking sensitive data through debugging output or test results. Combined with write-only storage in the UI, this ensures that secrets remain protected throughout their lifecycle: from creation, to usage in tests, to observability outputs.

Learn more

Secrets management in Grafana Cloud k6 is available now in public preview. The feature is also generally available in Grafana Cloud Synthetic Monitoring, a black box monitoring solution powered by k6 that lets you proactively assess system reliability and performance.

To learn more, please visit our documentation.

Grafana Cloud is the easiest way to get started with k6 and performance testing. We have a generous forever-free tier and plans for every use case. Sign up for free now!

Customize preconfigured views for AWS, Azure, and Google Cloud with Cloud Provider Observability in Grafana Cloud

Ana Ivanov — Mon, 27 Apr 2026 13:56:31

Part of what makes Cloud Provider Observability in Grafana Cloud really useful is that it gives you prebuilt dashboards and drill-downs for AWS, Azure, and Google Cloud. Out of the box you get service overviews, instance-level views, and quick links to explore your data.

However, you might already have dashboards you trust, want a view tailored to your team’s workflow, or need to change which panels show up when you drill into a single instance. The good news: you can now customize all of that without leaving the app.

This post walks through three ways to make service views your own: connecting an existing dashboard, creating one with AI and wiring it in, and editing the cloud provider instance drill-down views that appear in Cloud Provider Observability, Database Observability, the entity graph, and elsewhere.

The benefits of customization

With this new feature, you get three key capabilities:

Quick links and default dashboard: Whatever you set on the "Configure" page (preconfigured or a custom dashboard as default) is what users see when they open that service from the services tab, entity graph, or other entry points. Custom dashboards you add become extra quick links.
Instance drill-down: The panels and queries you configure under “Customize the panels…” are exactly what render in the instance-level view everywhere that view is used (Cloud Provider Observability, Database Observability, entity graph, etc.).
AI-generated dashboards: Created with the right variables and methodology, then added like any other custom dashboard and optionally set as default, so they fit into the same workflows and debugging paths.

Together, these options let you keep using the premade out-of-the-box views where they fit, plug in your own or AI-generated dashboards where you want a different “front door,” and tailor the per-instance drill-down so the same view is used consistently across observability surfaces.

One place to customize: the configure page

Customization for a given cloud service (e.g., Amazon RDS, GCP Cloud SQL, Azure Virtual Machines) lives on that service’s configure page.

On the "Services" tab, click Configure for the service you want to edit. There you’ll see:

Preconfigured dashboard: The built-in, out-of-the-box view for that service
Custom dashboards: Dashboards you’ve added as quick links, with one marked as default
Explore-style links for metrics and Grafana Metrics Drilldown.

Everything you add or change here is saved per service and reused wherever that service is shown in Grafana (services, entity graph, Database Observability, etc.).

1. Connect an existing dashboard

If you already have a dashboard that fits a service (e.g., your internal RDS or Lambda view), you can attach it as a quick link and optionally make it the default view for that service.

On the configure page for that service, find the section titled “Customize your quick links and add new ones to your custom dashboards.”
Under “Select a dashboard”, choose a dashboard from your stack and click Add.
The new dashboard appears in the table. Use “Set as default” to make it the one used when opening that service from the services tab, entity graph, or other entry points.
Click Save.

Your custom dashboard then appears in the quick links at the top of the service page, and if it’s the default, it becomes the main view for that service across the app. The preconfigured out-of-the-box dashboard stays available; you’re adding options and choosing which one is primary.

2. Create a dashboard with AI and use it in the app

If you don’t already have a dashboard for a service and don't want to start from scratch, the "Generate with AI" flow can create one for you (with the right variables and RED/USE-style panels). Use it from the same configure page.

On the configure page, in the same “Customize your quick links…” section, click “Generate with AI”.
You get a ready-to-use dashboard for that service (with the right variables and RED/USE-style panels). Save it.
Use the link Grafana Assistant provides in the chat to navigate back to the configure page.
Add your newly created dashboard to the app: select it in “Select a dashboard”, click Add, and optionally set it as default, then Save.

You then have an AI-generated dashboard that’s part of your service’s quick links and can serve as the default view for that service everywhere in the app—and you can update the dashboard anytime.

3. Edit the instance drill-down view

When you click through to a single instance (e.g., one RDS instance, one Lambda function, one VM), the app shows an instance drill-down view: a set of panels and queries that are either the built-in set or a custom layout you define. You can change which panels appear, add custom queries, reorder panels, and control units and legend formats—all from the configure page.

On the configure page for the service, scroll to “Customize the panels that will be displayed in the drilldown instance view.”
Turn on the toggle to enable custom panels.
Use the selection list to choose which metrics/queries are available; you can search, select/deselect all, and add custom queries. Each selected query can be assigned to panels.
In the panel grid, you can reorder panels via drag-and-drop, edit panel titles, change units, and adjust legend format and aggregation per query. You can also move queries between panels or into new panels.
Click Save when done.

Once saved, this layout is used whenever that service’s instance drill-down is shown—whether you opened it from the Cloud Provider Observability services tab, Database Observability, the entity graph, or from an alert. So you get one place to define what you see when you drill into one resource and it stays consistent across the app.

Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale

Christian Simon — Tue, 21 Apr 2026 07:20:39

Continuous profiling is becoming a standard part of the observability stack, and for good reason. It's the only signal that tells you why your code is slow or expensive, not just that it is. Metrics tell you CPU usage is high. Logs tell you a request was slow. Traces tell you which service is the bottleneck. But only a profile tells you which function, on which line, is burning the cycles.

As systems grow more complex, that level of visibility becomes essential. OpenTelemetry recently declared its Profiles signal as alpha, marking a clear step toward profiling becoming a first-class observability signal.

Now, we’re also taking a next step with the release of Pyroscope 2.0, a ground-up rearchitecture of our open source continuous profiling database. It’s designed to make continuous profiling more cost-effective at scale, and with native support for OpenTelemetry Protocol (OTLP) profiling, you can start ingesting profiles using the emerging standard today.

The case for always-on profiling

Before we get into what's new in Pyroscope 2.0, it's worth talking about why continuous profiling matters, especially because the payoff is larger than most teams realize.

Cut infrastructure costs with data, not guesswork

Cloud spend is one of the biggest line items in engineering budgets, and a significant part of it is CPU and memory. Teams routinely overprovision because they don't have fine-grained visibility into what's actually consuming resources.

Continuous profiling changes that equation. When you can see exactly which functions are responsible for CPU and memory consumption—across every service, in production, over time—you can make targeted optimizations instead of throwing hardware at the problem.

Faster root cause analysis

When an incident hits, the first question is always why. Metrics and traces narrow the blast radius; you know which service, which endpoint, and maybe which deployment introduced the regression. But the last mile of root cause analysis is where teams lose hours.

With continuous profiling, that last mile shrinks to minutes. You can compare a profile from before and after the regression, diff them, and see exactly which code paths changed. No reproducing in staging, no adding ad-hoc logging, and no guessing.

Understand latency at the code level

While distributed tracing tells you where wall clock time is spent, profiling tells you where the CPU spends that time. Together, they close the observability gap. A trace might show that your auth service added 200ms to a request, while a profile shows you that 150ms of that was in a regex compilation that could be cached.

This is especially powerful for tail latency, where the p99 spikes are hard to reproduce and harder to diagnose. Continuous profiling captures these moments as they happen, so you don't have to rely on luck with a debugger.

Pyroscope 2.0: a closer look at what’s new

The original Pyroscope architecture was based on Cortex, which is the same foundation that the Mimir and Loki projects started with. It worked, but it carried overhead that made large-scale continuous profiling expensive to run and operationally heavy.

All three projects have since outgrown that foundation. Mimir recently redesigned its architecture to eliminate write-path replication, decouple reads from writes, and make object storage the single source of truth. Pyroscope 2.0 applies similar architectural principles, adapted for the unique characteristics of profiling data: large payloads, heavy symbolic information, and bursty query patterns. The result is a system that's dramatically cheaper, faster, and simpler to operate.

Profiling at scale without the cost penalty

The v1 architecture for Pyroscope replicated every profile three times on the write path. For a signal where a single profile can be tens of megabytes, that 3x amplification added up fast. Pyroscope 2.0 eliminates write-path replication entirely, so each profile is written exactly once to object storage.

But the bigger win is data co-location. Profiles from the same service are now stored close together, which means symbolic information like function names, source locations, and stack traces (often 60% or more of a profile's size) is deduplicated and kept within as few objects as required by the service’s data volume. In our production environment, this reduced the symbol storage footprint by up to 95%.
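The effect of co-location on symbol storage can be sketched with a toy string table. This is not Pyroscope's storage format: `buildSymbolTable` and the sample stacks are hypothetical. The sketch only shows why profiles from the same service, which share most function names, deduplicate well when stored together.

```javascript
// Toy model: each profile is a list of stack traces, and each stack trace is
// a list of function names. Store every distinct name once and have stacks
// reference it by index.
function buildSymbolTable(profiles) {
  const symbols = [];        // distinct function names, stored once
  const indexOf = new Map(); // name -> position in `symbols`
  const encoded = profiles.map((stacks) =>
    stacks.map((stack) =>
      stack.map((name) => {
        if (!indexOf.has(name)) {
          indexOf.set(name, symbols.length);
          symbols.push(name);
        }
        return indexOf.get(name);
      })
    )
  );
  return { symbols, encoded };
}

// Two profiles from the same (made-up) service share most frames.
const profiles = [
  [["main", "handleRequest", "parseJSON"], ["main", "handleRequest", "queryDB"]],
  [["main", "handleRequest", "parseJSON"], ["main", "gc"]],
];
const { symbols, encoded } = buildSymbolTable(profiles);
const totalNames = profiles.flat(2).length;
console.log(`${totalNames} name references, ${symbols.length} stored symbols`);
```

In this example, 11 name references collapse to 5 stored symbols. The more profiles of one service share a table, the smaller the share of storage spent on repeated names.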

For teams that avoided continuous profiling because of storage and compute costs, these architectural changes make it practical to run profiling at scale.

Query performance that matches the workflow

Profiling queries are inherently expensive due to the sheer volume of data involved. Each pod continuously emits stack trace samples, so querying 100 pods over 12 hours means scanning and merging hundreds of millions of samples, which can require hundreds of CPU-seconds of processing.
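As a rough check on that order of magnitude, the arithmetic can be written out. The sampling rate below is an assumption for illustration (100 stack samples per second per pod, roughly a 100 Hz CPU sampler on one busy core); real rates depend on the profiler and the workload.

```javascript
// Assumption for illustration: each pod emits 100 stack samples per second.
const samplesPerSecondPerPod = 100;
const pods = 100;
const hours = 12;

// pods x seconds in the window x samples per second
const totalSamples = pods * hours * 3600 * samplesPerSecondPerPod;
console.log(totalSamples.toLocaleString("en-US")); // 432,000,000
```

Under that assumption, a single 12-hour query over 100 pods touches about 432 million samples, which is consistent with the "hundreds of millions" figure above.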

In v1, this work happened inside stateful components that couldn't scale elastically; you had to reserve capacity for peak query load, even if that capacity sat idle 99% of the time.

Pyroscope 2.0 makes the entire read path stateless. Any querier can process any query, and queriers scale up and down based on demand. You pay for query compute when you're actually querying instead of all the time.

This matters because profiling has a bursty access pattern. There is essentially no base load; nobody is polling profiles on a dashboard every 30 seconds. But when an incident happens, multiple engineers start running heavy queries simultaneously. And increasingly, LLM-powered agents are querying profiling data autonomously as part of automated investigations, adding significant traffic. With stateless queriers, the system can handle these spikes gracefully without paying for idle capacity the rest of the time.

Operational simplicity

Fewer stateful components means fewer things that can break and faster recovery when they do. Rollouts that took 8-12 hours in v1 now complete in minutes. The segment writer is diskless. The store-gateway is gone. The operational surface area is significantly smaller.

For teams running Pyroscope themselves, this is the difference between "we need a dedicated person to operate this" and "it just runs."

Pressure-tested in Grafana Cloud

Grafana Cloud Profiles, our hosted continuous profiling tool powered by Pyroscope, has been running Pyroscope 2.0 in production since April 2025. We rolled it out to every region by September, and have since processed 19.5PB of profiling data. The challenges we set out to fix, including wasteful replication, coupled read/write paths, and slow rollouts, are measurably gone.

If you're a Grafana Cloud Profiles user, the migration has already happened. This release brings the same production-proven architecture to the open source community.

A foundation for new capabilities

Beyond the operational improvements, the cleaner architecture in Pyroscope 2.0 enables features that simply weren't feasible in v1, including:

Metrics from profiles: aggregate profiling data into fleet-wide metrics to compare resource consumption across services, versions, or deployments without querying individual profiles.
Individual profile inspection: drill into a single profile instance rather than only viewing aggregates.
Heatmap queries: visualize profile distributions over time to spot patterns and outliers.
Richer query types: the stateless read path and cleaner data model make it possible to build new analysis capabilities without touching every component in the system.

Getting started

Pyroscope 2.0 is available now. If you're upgrading from v1, the key change is that object storage is required for distributed deployments, as it's the single source of truth for all profile data.

For step-by-step migration instructions, please reference our Pyroscope 2.0 migration guide. You can also learn more in our release notes.

To learn more about all the announcements coming out of GrafanaCON 2026, read our GrafanaCON announcements blog post.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Yasir Ekinci — Tue, 21 Apr 2026 06:43:10

Evaluating agents is hard. Verifying observability tasks is harder.

Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

To help the Grafana community navigate this new world of AI-assisted observability, we’re open sourcing grafana/o11y-bench, a benchmark for evaluating AI agents on observability workflows. It runs agents against a real Grafana stack with access to Grafana MCP server and grades them on a set of observability tasks within that environment.

o11y-bench is built on Harbor, an open source framework released by the creators of Terminal Bench that standardizes environments for benchmarking agents against sets of focused tasks. The benchmark we developed focuses on the workflows that actually matter in practice: querying metrics, logs, and traces; investigating incidents; and making targeted dashboard changes.

Why observability needs its own benchmark

Observability isn't just another straightforward agent tool-calling problem. Observability tasks such as root-cause investigations or dashboard creation often depend on the interaction between large amounts of metrics, logs, traces, time ranges, and saved application state. And that collection of variables makes it harder to tell whether an agent actually got the work right. For example, a query can be syntactically valid and still select the wrong series; a dashboard can render and still be saved incorrectly.

To properly evaluate AI systems today, benchmark tasks and simulated environments must reflect reality. o11y-bench runs agents against a real Grafana stack and evaluates them on a set of focused criteria simulating the complexity of a modern observability stack.

This type of standardized measurement matters for Grafana users because it helps you tell the difference between an agent that looks helpful in a demo and one you can trust in a real workflow. In observability, the dangerous mistakes are often the subtle ones.

And by open sourcing the tasks, environment, grading logic, and results, we want this to be inspectable, reproducible, and open to challenge. We are also hoping that these tasks can help the next generation of models improve their observability-related skills.

Open source, open testing

Built on Harbor, o11y-bench allows you to run your model, agent harness, or any combination of the two in a sandboxed environment alongside a Grafana Docker container with synthetic metrics, logs, and traces present. It’s as simple as running the following task to get started:

mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --agent opencode

This command will kick off the benchmark over just one task (query-cpu-metrics) and output the results to the /jobs folder, where you can inspect the agent trajectory, see the LLM-as-a-judge and heuristic scoring, and understand how your agent or model performed.

Our goal with o11y-bench is to engage the community to see what's possible. We have kicked off the leaderboard with a set of base frontier models, but we welcome new combinations of agent harnesses, model configurations, and experimentation to push agent capabilities in observability forward.

What tasks o11y-bench tests

The first public release of o11y-bench includes 63 tasks across various observability workflows:

Prometheus and PromQL tasks
Loki and LogQL tasks
Tempo and TraceQL tasks
Multi-step incident investigations
Dashboard editing and repair tasks

The tasks we have curated aim to be deterministic enough to grade reliably, but rich enough to produce real failure modes. For instance, take a problem from the Prometheus query category, promql-retry-backlog-triage:

“We think the payment incident may have built up retries behind the scenes. Over roughly the last six hours, which service showed the highest retry/backlog depth, about how high did it get, and does the next-worst service look like a smaller spillover or a comparable primary problem?”

To a human familiar with the system, this problem seems relatively straightforward. However, we noticed high-thinking or token-heavy agents would spin their wheels by gathering too much information about the system, wasting tokens and timing out. On the other hand, more focused agents were able to zero in on the proper queries and diagnose the system quickly and accurately.

While a high-thinking agent may get there in the end, the metrics included with o11y-bench also allow us to examine cost, token usage, and overall performance rather than just a "0" or "1" answer, providing actionable insights on agents and models we may want to use for these types of scenarios.

Why verifying observability work is hard

Coming up with observability tasks that sufficiently test an agent is only part of the assessment process. You also need to be able to verify that those tasks are accurately completed.

If a user asks an agent to investigate latency, compare error rates, or update a dashboard, simply getting a final answer that looks good isn't good enough. For many query tasks, we run a reference Prometheus or Loki query against the same stack the agent saw, then compare that value to what the model actually cited. For dashboard tasks, we inspect the saved Grafana state and, when needed, execute the saved panel query and compare it against a reference query for the same case.

We start with outcomes. For the explanation itself, we still score the response, but we pair that with verifiable facts from the environment rather than treating fluent prose as evidence.

Two simple examples:

If a model says “the p95 latency was about 2.3 seconds,” the verifier can run the reference query against the same Prometheus or Loki data and check whether that number is actually supported.
If a model says it fixed a dashboard, the verifier can inspect the saved panel JSON, bind the expected variable values, execute the saved query, and compare the result to a reference query for the same case.
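The first example can be sketched as a simple numeric comparison. This is not the benchmark's grading code: the `isSupported` helper and the 10% relative tolerance are assumptions, chosen here because answers are often phrased as "about 2.3 seconds" rather than as exact values.

```javascript
// Hypothetical check: is a value the agent cited supported by the result of
// the reference query? The default tolerance is an illustrative assumption.
function isSupported(cited, reference, relTolerance = 0.1) {
  if (!Number.isFinite(cited) || !Number.isFinite(reference)) return false;
  if (reference === 0) return cited === 0; // avoid dividing by zero
  return Math.abs(cited - reference) / Math.abs(reference) <= relTolerance;
}

console.log(isSupported(2.3, 2.27)); // true: within 10% of the reference
console.log(isSupported(2.3, 3.1));  // false: the cited number is not supported
```

The point of the design is that the reference value comes from the same data the agent saw, so a fluent but unsupported answer fails the check.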

Our general grading philosophy is to always check against the ground truth of what the agent actually did, not just what it said. In practice, that is the difference between an agent that looks convincing in a transcript and one you can trust in a real investigation.

Measuring reliability vs. best-of-three success

Two headline scores are used in this benchmark to evaluate model performance:

Pass^3: A measure of consistency, computed as the average benchmark score across three runs
Pass@3: A measure of best-of-three success, indicating whether the model solved the task at least once across three attempts

Note: Each metric has its value, but their individual usefulness will depend on the use case. For the purposes of this exercise, we care more about consistency, so Pass^3 takes priority in the rankings. Further reading on agent eval methodology and metrics can be found on the Anthropic blog.
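A minimal sketch of the two scores, following the definitions above, makes the difference concrete. The function names and the sample results are hypothetical, attempts are simplified to pass (1) or fail (0), and the benchmark's own scoring code may differ in detail.

```javascript
// Each task has three attempts, scored 1 (passed) or 0 (failed).
function passHat3(attempts) {
  // Consistency: average score across the three runs.
  return attempts.reduce((sum, s) => sum + s, 0) / attempts.length;
}

function passAt3(attempts) {
  // Best-of-three: 1 if any attempt passed, else 0.
  return attempts.some((s) => s === 1) ? 1 : 0;
}

function summarize(tasks) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    passHat3: mean(tasks.map(passHat3)),
    passAt3: mean(tasks.map(passAt3)),
  };
}

// Made-up results for four tasks.
const results = [[1, 1, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]];
console.log(summarize(results));
```

With these made-up results, Pass@3 is 0.75 while Pass^3 is about 0.42: three of four tasks were solved at least once, but only one was solved every time.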

It's interesting to note how each model family performs on each metric, as different leaders emerge depending on which measure of success you look at.

The results

The initial launch suite covered 29 model variants on 63 tasks (at three attempts each) for a total of 5,481 trials.

Using Pass^3 as the headline metric:

Opus 4.7 with reasoning turned off led the launch run
Opus 4.7 with reasoning set to high came in second; interestingly, it was less consistent at Pass^3 but had a higher Pass@3 score

Qwen 3.6 Plus performed the best of the frontier open source models we tested, even beating some of the smaller Sonnet and GPT models.

The main takeaway is that reliability is what truly separates the top models. Many models could get a task right at least once across three attempts. Far fewer could do it consistently. That gap is exactly why we treat reliability as the main benchmark signal.

“Got it right once” and “gets it right consistently” are not the same thing, especially in observability, where a subtle mistake can send an engineer down the wrong path. Mean score is still useful for debugging tasks and graders, but it is not a good headline metric for agent trust.

A per-category view sharpened that picture further. Grafana API tasks were close to saturated, and Prometheus was relatively strong. Tempo and Loki sat in the middle. Dashboarding remains the hardest area, not because it is the only thing that matters, but because it combines state, query correctness, variable wiring, and saved behavior in ways that are easy to get almost right.

Here, Pass@3 means “got an individual task right at least once across three tries,” while a perfect Pass^3 means “got it right all three times.” The gap between those two is one of the main things the benchmark is trying to expose.

Try it yourself

The quickest way to get started is to head to the grafana/o11y-bench repo, clone it locally, and follow the README.

From there, you can run individual tasks, full suites, and comparison reports against any model or agent harness available through Harbor and LiteLLM.

If you try o11y-bench, we’d be interested to see how it holds up across more models, agent setups, and independent reproductions, as well as what it suggests for future benchmark revisions. Submit contributions to the HuggingFace leaderboard per the contributing guide or open an issue in the benchmark repo for feedback or discussions.

For more on this and the other updates from GrafanaCON 2026, check out our announcement blog. And for more information on Grafana Cloud AI, including FAQs about Assistant and our other AI capabilities, check out our AI observability page.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

Maurice Rochau — Tue, 21 Apr 2026 06:35:04

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development.

But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Agents make decisions, call tools, generate content, and interact with users, services, and applications in ways that traditional observability isn't designed to handle. As organizations shift from cloud native to AI native, it's increasingly clear that agent chats and sessions need to be treated as first-class signals alongside the rest of your more traditional telemetry.

To address this emerging gap, we're launching AI Observability in Grafana Cloud. Available now in public preview, AI Observability actually started as an internal hackathon project designed to address some of our own agentic challenges. Since then, we've heard from lots of customers dealing with similar problems, so we decided to take what we learned and turn it into a complete solution for teams running agents in production, helping them understand what their AI is doing, how well it’s doing it, and where issues are emerging.

How AI Observability can help you today

Traditional observability gives you signals like CPU usage, request latency, and error rates. Those are important. But they don’t tell you if your agent is being helpful, hallucinating, or quietly degrading over time.

That's why we've built AI Observability in Grafana Cloud to help teams:

Observe AI agent behavior in real time, including inputs, outputs, and execution flows
Continuously evaluate outputs, with alerts for issues such as low-quality responses, policy violations, or anomalous behavior
Surface risk earlier, including potential data exposure or misuse (for example, leaked credentials or abnormal usage patterns)
Elevate agent sessions and conversations to first-class telemetry signals and correlate them in the same environment where applications are observed

AI Observability connects agents directly to traces, tool calls, token usage, costs, and (live) evaluations. And it does it all in the same Grafana Cloud environment where you observe the rest of your systems.

This gives you true end-to-end signals—agentic or otherwise. So the next time something looks off, you won't just see a spike in latency. You'll also be able to open the exact conversation, inspect what happened, and understand why.

Instrument once, understand everything with open standards

AI Observability is OpenTelemetry-compatible, so it fits naturally into existing observability setups. You instrument your app once using a thin SDK, and AI Observability automatically captures:

Generations and conversations
Model and provider metadata
Tool usage
Latency and token metrics
Cost signals

From there, everything becomes queryable and explorable in one place. You can also filter by model or provider, time range, labels, or environments. This is particularly helpful if you have different providers, where the same model can function differently in different environments.

AI Observability automatically classifies and catalogs agent versions for you. If you change an agent’s system prompt or its tool set, a new agent version is created that you can inspect separately. This helps you find the best performing agent you have and spot specific problems agents might have.
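
One way to picture this kind of versioning is a fingerprint derived from the system prompt and the tool set. The sketch below is only an illustration of the idea: the function name, the hashing scheme, and the choice to ignore tool order are assumptions, not a description of how AI Observability identifies versions internally.

```python
import hashlib
import json
from typing import List


def agent_version(system_prompt: str, tools: List[str]) -> str:
    """Derive a short, stable fingerprint from the prompt and the tool set."""
    # Sorting the tools means that reordering them does not create a new version.
    payload = json.dumps(
        {"system_prompt": system_prompt, "tools": sorted(tools)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical agent definitions.
v1 = agent_version("You are a helpful SRE assistant.", ["query_metrics", "query_logs"])
v2 = agent_version("You are a helpful SRE assistant.", ["query_logs", "query_metrics"])
v3 = agent_version("You are a concise SRE assistant.", ["query_metrics", "query_logs"])

print(v1 == v2)  # same prompt and tool set
print(v1 == v3)  # prompt changed
```

With a scheme like this, any edit to the prompt or any added or removed tool yields a different identifier, so results can be grouped and compared per version.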

Agents can break subtly; AI Observability gives you the context to know why

One of the hardest parts of running AI in production is that issues are often subtle. Nothing crashes. No alerts fire. But something is off and your users complain. Responses are getting longer and less useful, costs are creeping up, quality is slowly degrading, users are losing trust.

AI Observability is designed for this exact problem. You can drill into any conversation and see the full thread: tool calls and execution traces, token usage and cost breakdown, scores, ratings, and annotations.

This is critical for debugging your agents. It helps you understand whether specific agents struggle with specific models, or what impact your latest release had. You can also see where your tokens are going and, as a result, where your money is going. You can see if certain operations are expensive, or if certain tools are slow or struggle with certain tasks, which in turn adds to your costs.

And if you want additional debugging support or advice on how to improve your agents, you can ask for help from Grafana Assistant using natural language. Because it can correlate your AI data with all your other telemetry signals, it can help with all sorts of use cases.

For example, it can see how much time you're spending on compute, or what's causing spikes in latency, and you can find out how that ties back to your AI. Essentially, it takes the power of our existing full-stack observability platform and extends it to the next generation of applications your business relies on.

Get alerts when they matter

AI Observability can also help assess your AI's accuracy, which can become a major challenge when you have multiple agents running at scale.

You can use LLM-as-a-judge, heuristics, or regex to detect undesirable outputs. And because AI Observability natively integrates with Grafana Alerting, you can get notified when your agents misbehave. Taken together, this allows you to treat agents like the rest of your services and infrastructure.
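
To give a sense of what regex and heuristic checks can look like, here is a small standalone sketch. It does not use any AI Observability API: the pattern names, the word limit, and the `evaluate` function are hypothetical, and the two patterns only cover the common AWS access key ID format and PEM private key headers.

```python
import re
from typing import List

# Regex checks: flag outputs that appear to contain credentials.
CREDENTIAL_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# Heuristic check: flag responses that are unusually long (hypothetical limit).
MAX_WORDS = 300


def evaluate(output: str) -> List[str]:
    """Return the names of every check the agent output fails."""
    findings = [
        name for name, pattern in CREDENTIAL_PATTERNS.items() if pattern.search(output)
    ]
    if len(output.split()) > MAX_WORDS:
        findings.append("too_long")
    return findings


print(evaluate("Your key is AKIAABCDEFGHIJKLMNOP"))
print(evaluate("Latency looks normal over the last hour."))
```

Counting findings like these over time is the sort of signal an alert rule could then be built on.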

You can even combine this with Assistant skills and agent-specific runbooks. That way, when you get paged about increased toxicity from your agent, you can ask Assistant to inspect the conversation, read your runbook, and offer remediation strategies.

AI Observability, from the team that built Grafana Assistant

We know about the challenge of observing agents because we've lived through it. When we started building Assistant, we looked at lots of frameworks, but they just weren’t detailed enough for how we wanted to monitor our agents.

We quickly realized we needed to build our own in-house solution. Assistant has been very well received by our users (so popular, in fact, that we're making it available in new and exciting ways), and a big part of that success has been our ability to closely monitor the feedback loop we built so we can keep track of how performance and customer behavior evolve.

That success led us to think we should make this available to our customers. So, during a recent hackathon, the Assistant team took what it learned from monitoring agents, prompt engineering, keeping track of agent versions, handling tools, and more, and baked it into a solution that enables you to monitor agents at a large scale, too.

As with Assistant, AI Observability is at the center of continuous innovation. We've already shipped new features, including user annotations and streamlined alerting, after working with customers during the private preview. And we're excited to continue to innovate as we open the solution to a wider audience.

Starting today, you can find AI Observability in Grafana Cloud and start using it right away. Start with the demo mode to get a feel for how it works with some example data. And when you’re ready, you can hook up your own agents and start analyzing them.