AgentSkillsCN

production-monitoring

为GenAI代理进行持续的生产监控——注册评分器并进行抽样、按需评估()、评分器生命周期、追踪归档、指标回填。在设置生产监控、创建注册评分器、实施抽样策略、归档追踪,或回填历史指标时使用此功能。可通过“监控”“生产”“评分器”“抽样”“评估”“追踪归档”“回填”等触发器来启动该操作。

SKILL.md
--- frontmatter
name: production-monitoring
description: >
  Continuous production monitoring for GenAI agents - registered scorers with sampling,
  on-demand assess(), scorer lifecycle, trace archival, metric backfill. Use when setting
  up production monitoring, creating registered scorers, implementing sampling strategies,
  archiving traces, or backfilling historical metrics. Triggers on "monitoring", "production",
  "scorer", "sampling", "assess", "trace archival", "backfill".
license: Apache-2.0
metadata:
  author: prashanth subrahmanyam
  version: "1.0.0"
  domain: genai-agents
  role: worker
  pipeline_stage: 9
  pipeline_stage_name: genai-agents
  called_by:
    - genai-agents-setup
  standalone: true
  last_verified: "2026-02-07"
  volatility: high
  upstream_sources:
    - name: "ai-dev-kit"
      repo: "databricks-solutions/ai-dev-kit"
      paths:
        - "databricks-skills/agent-bricks/SKILL.md"
      relationship: "reference"
      last_synced: "2026-02-09"
      sync_commit: "97a3637"

Production Monitoring Patterns

Production-grade patterns for continuous monitoring of GenAI agents in production using MLflow registered scorers, on-demand assessment, trace archival, and metric backfill.

When to Use

  • Setting up continuous production monitoring for GenAI agents
  • Creating registered scorers with sampling strategies
  • Implementing on-demand evaluation workflows
  • Archiving MLflow traces for analysis
  • Backfilling historical metrics
  • Troubleshooting production scorer failures

Two Approaches: Registered Scorers vs On-Demand assess()

ApproachUse CaseSamplingCost
Registered ScorersContinuous monitoringConfigurable (1-100%)Higher (continuous)
On-Demand assess()Periodic evaluation100% (when run)Lower (on-demand)

Registered Scorers (Continuous)

python
from mlflow.models import scorer
from mlflow.metrics import Score

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # Scorer logic
    return Score(value=0.95, rationale="...")

# Register scorer
mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=0.1  # 10% sampling
)

# Start scorer (begins continuous monitoring)
scorer_instance = mlflow.models.start_scorer("safety_scorer")

For complete registered scorer patterns, see: references/registered-scorers.md

On-Demand assess()

python
# Evaluate specific traces on-demand
results = mlflow.genai.assess(
    traces=traces_df,
    evaluators=[safety_scorer, relevance_scorer]
)

Use when:

  • Periodic evaluation (weekly, monthly)
  • Ad-hoc analysis
  • Cost optimization (evaluate only when needed)

⚠️ CRITICAL: Immutable Scorer Pattern

MANDATORY: Scorers must be immutable - scorer = scorer.start() pattern.

❌ WRONG: Mutable Scorer

python
scorer = mlflow.models.register_scorer(...)
scorer.start()  # ❌ Modifies scorer object
scorer.stop()   # ❌ May not work correctly

✅ CORRECT: Immutable Pattern

python
scorer = mlflow.models.register_scorer(...)
scorer_instance = scorer.start()  # ✅ Returns new instance
scorer_instance.stop()  # ✅ Works correctly

Why this matters:

  • Scorer lifecycle operations require immutable pattern
  • Prevents state corruption
  • Enables proper start/stop/delete operations

For complete immutable patterns, see: references/registered-scorers.md


Registered Scorer Quick Pattern with Sampling

python
from mlflow.models import scorer
from mlflow.metrics import Score

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    """Safety scorer with 10% sampling."""
    response_text = _extract_response_text(outputs)
    # ... scoring logic
    return Score(value=0.95, rationale="Safe response")

# Register with sampling
safety_scorer_registered = mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=0.1  # 10% of traces evaluated
)

# Start continuous monitoring
safety_scorer_instance = safety_scorer_registered.start()

Sampling rates by scorer type:

  • Safety: 100% (critical)
  • Relevance: 10-20% (moderate volume)
  • Guidelines: 5-10% (lower priority)
  • Custom: 1-5% (cost optimization)

Custom Heuristic Scorer Pattern (Fast, 100% Sampling)

python
@scorer
def fast_heuristic_scorer(inputs, outputs, expectations=None):
    """
    Fast heuristic scorer - runs on 100% of traces.
    
    Use for lightweight checks (length, format, keywords).
    """
    response_text = _extract_response_text(outputs)
    
    # Fast checks
    score = 1.0
    issues = []
    
    if len(response_text) < 10:
        score -= 0.5
        issues.append("Response too short")
    
    if "ERROR" in response_text.upper():
        score -= 0.3
        issues.append("Contains error keyword")
    
    return Score(
        value=max(0.0, score),
        rationale="; ".join(issues) if issues else "Passed heuristic checks"
    )

On-Demand assess() Pattern

python
def evaluate_production_traces(
    start_time: datetime,
    end_time: datetime,
    evaluators: list
):
    """
    Evaluate production traces on-demand.
    
    Useful for periodic evaluation or ad-hoc analysis.
    """
    # Query traces from time range
    traces_df = query_traces(start_time, end_time)
    
    # Run evaluation
    results = mlflow.genai.assess(
        traces=traces_df,
        evaluators=evaluators,
        experiment_name="/Shared/health_monitor_agent/production_eval"
    )
    
    return results

Trace Archival Quick Setup

python
import mlflow

# Enable trace archival to Unity Catalog
mlflow.enable_databricks_trace_archival(
    catalog="catalog",
    schema="schema",
    table="agent_traces"
)

# Traces automatically archived to Delta table
# Query: SELECT * FROM catalog.schema.agent_traces

For complete trace archival patterns, see: references/trace-archival.md


Production vs Dev Evaluation Comparison

AspectDevelopmentProduction
Methodmlflow.genai.evaluate()Registered scorers + assess()
Sampling100% (full dataset)Configurable (1-100%)
FrequencyOn-demandContinuous + periodic
CostLow (one-time)Higher (continuous)
DatasetStatic eval datasetLive production traces
ThresholdsPre-deployment gatesContinuous monitoring

Unified Scorer Definitions Concept

Define scorers once, use in both development and production:

python
# Define scorer (reusable)
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # ... scorer logic
    return Score(value=0.95, rationale="...")

# Use in development evaluation
results = mlflow.genai.evaluate(
    model=model_uri,
    data=eval_dataset,
    evaluators=[safety_scorer]
)

# Use in production (register + start)
safety_registered = mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=1.0
)
safety_registered.start()

❌/✅ Patterns

Immutable Pattern

python
# ✅ CORRECT: Immutable pattern
scorer = mlflow.models.register_scorer(...)
scorer_instance = scorer.start()  # Returns new instance
scorer_instance.stop()

# ❌ WRONG: Mutable pattern
scorer = mlflow.models.register_scorer(...)
scorer.start()  # Modifies scorer object
scorer.stop()   # May not work

External Imports in Scorers

python
# ✅ CORRECT: Import helpers inside scorer
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    from evaluation_helpers import _extract_response_text
    response_text = _extract_response_text(outputs)
    # ...

# ❌ WRONG: External imports at module level
from evaluation_helpers import _extract_response_text  # ❌ May fail in production

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    response_text = _extract_response_text(outputs)
    # ...

Validation Checklist

Before deploying production monitoring:

Registered Scorers

  • Immutable pattern used (scorer = scorer.start())
  • Sampling rates configured appropriately
  • Scorer lifecycle managed (register/start/stop/delete)
  • External imports inside scorer functions

On-Demand Assessment

  • assess() function implemented for periodic evaluation
  • Trace querying patterns implemented
  • Results logged to experiments

Trace Archival

  • Trace archival enabled
  • Delta table schema configured
  • Query patterns for archived traces

Metric Backfill

  • Backfill workflow implemented (if needed)
  • Historical analysis patterns defined

Reference Files

  • references/registered-scorers.md - Complete registered scorer patterns with lifecycle management
  • references/trace-archival.md - Trace archival setup and querying patterns
  • references/metric-backfill.md - Backfill patterns for historical metrics
  • references/monitoring-dashboard-queries.md - SQL queries for monitoring dashboards
  • scripts/register_production_scorers.py - Complete script to register all production scorers

References

Official Documentation

Related Skills

  • mlflow-genai-evaluation - Development evaluation patterns
  • deployment-automation - Deployment workflows
  • responses-agent-patterns - ResponsesAgent implementation

Version History

DateChanges
Feb 6, 2026Initial version: Production monitoring with immutable scorer pattern