AI Config Online Evaluations
Automatically score AI Config responses using LLM-as-a-judge methodology.
Prerequisites
- •LaunchDarkly SDK initialized (see aiconfig-sdk)
- •AI Config in completion mode (judges don't work with agent mode)
- •Judges enabled in LaunchDarkly UI (AI Configs → your config → Variations → Attach judges)
Core Concepts
Three built-in judges are available:
- •Accuracy - Scores 0.0-1.0 for correctness
- •Relevance - Scores 0.0-1.0 for addressing the request
- •Toxicity - Scores 0.0-1.0 where lower is safer
Judges evaluate asynchronously (1-2 minute delay). Results appear in the Monitoring tab.
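Accuracy and Relevance get better as they rise, while Toxicity gets better as it falls, so any check on the scores has to treat the scales differently. The following is a minimal sketch, assuming you already have the scores as plain floats (for example, read from the Monitoring tab). The `THRESHOLDS` values and the `flag_scores` helper are hypothetical and are not part of the LaunchDarkly SDK.

```python
# Hypothetical thresholds; tune them for your own application.
# "min" means the score must be at least the limit, "max" at most the limit.
THRESHOLDS = {
    "accuracy": ("min", 0.7),
    "relevance": ("min", 0.7),
    "toxicity": ("max", 0.2),  # lower is safer
}


def flag_scores(scores: dict[str, float]) -> list[str]:
    """Return the names of judges whose score falls outside its threshold."""
    flagged = []
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside 0.0-1.0")
        rule = THRESHOLDS.get(name)
        if rule is None:
            continue  # unknown judge: no threshold defined, so skip it
        direction, limit = rule
        if direction == "min" and score < limit:
            flagged.append(name)
        elif direction == "max" and score > limit:
            flagged.append(name)
    return flagged
```

Calling `flag_scores({"accuracy": 0.4, "relevance": 0.8, "toxicity": 0.5})` would flag accuracy (too low) and toxicity (too high).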
SDK: Check Judge Configuration
python
import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai.client import LDAIClient, AICompletionConfigDefault

# Initialize the base SDK and wrap it in the AI client (see aiconfig-sdk)
ldclient.set_config(Config("your-sdk-key"))
ld_client = ldclient.get()
ai_client = LDAIClient(ld_client)


def check_judges(ai_client, config_key: str, user_id: str):
    """Check which judges are attached to a config."""
    context = Context.builder(user_id).build()
    # Use a disabled config as the fallback default
    config = ai_client.completion_config(
        config_key,
        context,
        AICompletionConfigDefault(enabled=False),
        {}
    )

    if config.judge_configuration and config.judge_configuration.judges:
        print("[OK] Judges attached:")
        for judge in config.judge_configuration.judges:
            # Percent formatting rounds; int(rate * 100) would truncate 0.29 to 28
            print(f"  - {judge.key}: {judge.sampling_rate:.0%} sampling")
    else:
        print("[INFO] No judges configured")

    return config.judge_configuration
SDK: Automatic Evaluation with create_chat
For automatic judge evaluation, use the create_chat() method. This handles the full conversation flow and triggers judges automatically.
Important: create_chat() passes model parameters directly to the provider. LaunchDarkly uses camelCase (maxTokens), but OpenAI expects snake_case (max_tokens). If your variation has maxTokens set, create_chat() will fail with OpenAI. Either:
- •Omit maxTokens from the variation's model parameters, OR
- •Use completion_config() + track_openai_metrics() instead (but judges won't auto-evaluate)
python
from ldclient import Context
from ldai.client import AICompletionConfigDefault, ModelConfig, ProviderConfig, LDMessage


async def generate_with_automatic_evaluation(ai_client, config_key: str, user_id: str, prompt: str):
    """Generate AI response with automatic judge evaluation using create_chat."""
    context = Context.builder(user_id).build()

    # The default below is used only if the config cannot be evaluated
    chat = await ai_client.create_chat(
        config_key,
        context,
        AICompletionConfigDefault(
            enabled=True,
            model=ModelConfig("gpt-4"),
            provider=ProviderConfig("openai"),
            messages=[LDMessage(role='system', content='You are a helpful assistant.')]
        )
    )

    # Bail out if no chat instance was created
    if not chat:
        return None

    # Invoke chat - judges evaluate automatically (1-2 min delay)
    response = await chat.invoke(prompt)

    # Results appear in Monitoring tab as:
    # $ld:ai:judge:accuracy, $ld:ai:judge:relevance, $ld:ai:judge:toxicity
    return response.message.content
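If you take the second route from the note above and call the provider yourself, the camelCase parameter names need translating before they reach OpenAI. Here is a minimal sketch of that translation. The helpers `camel_to_snake` and `to_provider_params` are hypothetical names, not SDK functions, and the only mapping the note confirms is maxTokens to max_tokens. Check any other parameter against the provider's documentation before relying on a pure case conversion.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as 'maxTokens' to 'max_tokens'."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def to_provider_params(params: dict) -> dict:
    """Return a copy of params with every top-level key in snake_case."""
    return {camel_to_snake(key): value for key, value in params.items()}
```

With this sketch, `to_provider_params({"maxTokens": 256, "temperature": 0.2})` yields `{"max_tokens": 256, "temperature": 0.2}`, which could then be passed as keyword arguments to the provider call. Judges still won't auto-evaluate on this path.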
Sampling Rate Guidelines
Configure sampling rates in the LaunchDarkly UI:
| Environment | Rate | Use Case |
|---|---|---|
| Development | 100% | Full evaluation for testing |
| Staging | 50% | Validation coverage |
| Production (initial) | 10% | Start conservatively |
| Production (stable) | 20% | Ongoing monitoring |
| Critical features | 30% | Important flows |
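To see what a given rate means for judge volume (and so for token spend), a rough estimate can help before changing the setting. The sketch below assumes each judge samples responses independently at its own configured rate. That is an assumption about how sampling behaves, not something the SDK guarantees. `expected_judge_calls` is a hypothetical helper, and the actual sampling is done by LaunchDarkly, not by this code.

```python
def expected_judge_calls(requests: int, rates: dict[str, float]) -> dict[str, int]:
    """Estimate how many evaluations each judge runs for a number of requests."""
    estimates = {}
    for judge, rate in rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"sampling rate for {judge} must be between 0 and 1")
        # Round to a whole number of calls; this is an expectation, not a guarantee.
        estimates[judge] = round(requests * rate)
    return estimates
```

For example, 10,000 production requests with accuracy and relevance at 10% and toxicity at 20% would come to roughly 1,000, 1,000, and 2,000 judge calls respectively.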
Viewing Results
- •Go to AI Configs in LaunchDarkly
- •Select your config
- •Click Monitoring tab
- •View judge scores by variation and time range
Best Practices
- •Completion Mode Only - Judges don't work with agent mode configs
- •Async Results - Evaluation takes 1-2 minutes; don't wait for results
- •Monitor Costs - Judge evaluations use LLM tokens
- •Start Low - Begin with 10% sampling, increase as needed
- •Flush Events - Call ld_client.flush() in serverless environments
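The flush advice matters because a serverless runtime may freeze or stop the process as soon as the handler returns, before buffered events are sent. One way to make the flush hard to forget is a small wrapper, sketched here. `run_and_flush` and the `Flushable` protocol are hypothetical names. The only SDK call assumed is `flush()` on the client, as named in the list above.

```python
from typing import Any, Callable, Protocol


class Flushable(Protocol):
    """Anything with a flush() method, such as the LaunchDarkly client."""

    def flush(self) -> None: ...


def run_and_flush(client: Flushable, handler: Callable[[], Any]) -> Any:
    """Run handler, then flush pending events even if handler raises."""
    try:
        return handler()
    finally:
        client.flush()
```

In a function entry point this might look like `return run_and_flush(ld_client, lambda: handle(event))`, where `handle` stands in for your own request logic.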
Related Skills
- •aiconfig-sdk - SDK setup and config retrieval
- •aiconfig-ai-metrics - Automatic AI metrics tracking
- •aiconfig-variations - Manage variations