Grafana Cloud
GenAI evaluations configuration
Configure OpenLIT evaluations to monitor AI model quality, safety, and performance with customizable thresholds, providers, and evaluation parameters.
Basic configuration
Provider selection
Choose between OpenAI and Anthropic for evaluation services:
Python
import openlit
# OpenAI-based evaluations (default)
evals = openlit.evals.All(provider="openai")
# Anthropic-based evaluations
evals = openlit.evals.All(provider="anthropic")
API key configuration
Set your evaluation provider API key:
Python
import os
# Option 1: Environment variable (recommended)
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"
# Option 2: Direct parameter
evals = openlit.evals.All(provider="openai", api_key="your-api-key")
Model selection
Specify which model to use for evaluations:
Python
# OpenAI models
evals = openlit.evals.All(provider="openai", model="gpt-4o")
evals = openlit.evals.All(provider="openai", model="gpt-4o-mini")
# Anthropic models
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-sonnet-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-haiku-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-opus-20240229")
Advanced configuration
Threshold scoring
Configure the score threshold for determining evaluation verdicts:
Python
import openlit
# Default threshold is 0.5
evals = openlit.evals.All(provider="openai", threshold_score=0.7)
# Different thresholds for different evaluation metrics
hallucination_eval = openlit.evals.Hallucination(threshold_score=0.6)
bias_eval = openlit.evals.Bias(threshold_score=0.8)
toxicity_eval = openlit.evals.Toxicity(threshold_score=0.9)
Custom base URL
For custom API endpoints or proxies:
Python
import openlit
# Custom OpenAI-compatible endpoint
evals = openlit.evals.All(
    provider="openai",
    base_url="https://your-custom-endpoint.com/v1"
)
Custom categories
Add custom evaluation categories beyond the defaults:
Python
import openlit
# Add custom categories for specialized detection
custom_categories = {
    "spam_detection": "Identify promotional or spam content",
    "factual_verification": "Verify claims against known facts",
    "technical_accuracy": "Check technical information correctness"
}
evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)
Metrics collection
Enable OpenTelemetry metrics collection for evaluations:
Python
import openlit
# Initialize OpenLIT for metrics collection first
openlit.init()
# Enable metrics collection for evaluations
evals = openlit.evals.All(
    provider="openai",
    collect_metrics=True
)
# Example inputs to evaluate (placeholder values)
prompt = "What is the capital of France?"
contexts = ["Paris is the capital of France."]
text = "The capital of France is Paris."
# Evaluation metrics are now sent to Grafana Cloud
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
Evaluation-specific configuration
Hallucination detection
Python
import openlit
hallucination_detector = openlit.evals.Hallucination(
    provider="openai",
    model="gpt-4o",
    threshold_score=0.6,
    custom_categories={
        "medical_misinformation": "Incorrect medical or health information",
        "historical_inaccuracy": "Incorrect historical facts or dates"
    },
    collect_metrics=True
)
# Usage with detailed context
result = hallucination_detector.measure(
    prompt="Explain the discovery of penicillin",
    contexts=[
        "Alexander Fleming discovered penicillin in 1928",
        "Penicillin was discovered accidentally when Fleming noticed mold killing bacteria"
    ],
    text="Fleming invented penicillin in 1925 as a deliberate research project"
)
Bias detection
Python
import openlit
bias_detector = openlit.evals.Bias(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    threshold_score=0.7,
    custom_categories={
        "professional_bias": "Stereotypes about professional roles",
        "geographic_bias": "Assumptions based on geographic location"
    },
    collect_metrics=True
)
# Usage for bias detection
result = bias_detector.measure(
    prompt="Describe a typical nurse",
    text="Nurses are usually women who are very caring and emotional"
)
Toxicity detection
Python
import openlit
toxicity_detector = openlit.evals.Toxicity(
    provider="openai",
    model="gpt-4o-mini",
    threshold_score=0.8,
    custom_categories={
        "cyberbullying": "Online harassment or bullying behavior",
        "discriminatory_language": "Language that discriminates against groups"
    },
    collect_metrics=True
)
# Usage for toxicity detection
result = toxicity_detector.measure(
    prompt="Provide feedback on this comment",
    text="Your opinion is worthless and you should be ashamed"
)
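Each detector compares its evaluation score against the configured threshold_score to produce a verdict. As a minimal sketch of that comparison (the function name and return values here are illustrative assumptions, not OpenLIT's actual result schema):

```python
# Illustrative only: how a threshold_score typically turns a raw
# evaluation score into a pass/fail verdict.
def to_verdict(score: float, threshold_score: float = 0.5) -> str:
    """Scores at or above the threshold flag the text ("yes" verdict)."""
    return "yes" if score >= threshold_score else "no"

# A score of 0.85 against the 0.8 toxicity threshold above is flagged
print(to_verdict(0.85, threshold_score=0.8))  # yes
print(to_verdict(0.42, threshold_score=0.8))  # no
```

Raising threshold_score therefore makes a detector stricter about flagging: only higher-confidence findings produce a failing verdict.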