
drift-detection

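Statistical and quality drift detection for LLM applications. Use when monitoring model quality degradation, input distribution shifts, or output pattern changes over time.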

SKILL.md
--- frontmatter
name: drift-detection
description: Statistical and quality drift detection for LLM applications. Use when monitoring model quality degradation, input distribution shifts, or output pattern changes over time.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit
tags: [drift, monitoring, quality, statistical, psi, langfuse, evidently, 2026]
user-invocable: false

Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.

Overview

  • Detecting input distribution drift (data drift)
  • Monitoring output quality degradation (concept drift)
  • Implementing statistical methods (PSI, KS, KL divergence)
  • Setting up dynamic thresholds with moving averages
  • Integrating Langfuse scores with drift analysis

Quick Reference

Population Stability Index (PSI)

python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
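    # Use one set of bin edges for both samples; separate histograms would
    # bin each sample over its own range and make the proportions incomparable
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero and log(0) in empty bins
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)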

# Usage (baseline_scores, current_scores, and alert() are placeholders supplied by the caller)
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")

EWMA Dynamic Threshold

python
import numpy as np

class EWMADriftDetector:
    """Exponential Weighted Moving Average for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma

        # Calculate asymptotic (steady-state) control limits around the baseline
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor

        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl
        }
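
As a usage sketch for the detector above, the following standalone function applies the same update rule and control limits to a short series of scores and reports the first index at which the EWMA leaves the control band. The function name `first_ewma_alarm` and the sample scores are illustrative assumptions, not part of the skill.

```python
import math
from typing import Optional, Sequence


def first_ewma_alarm(
    values: Sequence[float],
    baseline_mean: float,
    baseline_std: float,
    lambda_param: float = 0.2,
    L: float = 3.0,
) -> Optional[int]:
    """Return the index of the first value whose EWMA breaches the limits, or None."""
    # Asymptotic control limits, same formula as EWMADriftDetector.update
    factor = math.sqrt(lambda_param / (2 - lambda_param))
    ucl = baseline_mean + L * baseline_std * factor
    lcl = baseline_mean - L * baseline_std * factor

    ewma = None
    for i, value in enumerate(values):
        # Seed with the first observation, then smooth each new value in
        ewma = value if ewma is None else lambda_param * value + (1 - lambda_param) * ewma
        if ewma > ucl or ewma < lcl:
            return i
    return None


# Hypothetical daily quality scores: stable, then a sustained drop
scores = [0.86, 0.84, 0.85, 0.70, 0.70, 0.70]
alarm_index = first_ewma_alarm(scores, baseline_mean=0.85, baseline_std=0.05)
print(alarm_index)
```

With these values the single low score at index 3 does not trip the alarm; the EWMA crosses the lower limit only once the drop persists. That delay is the trade-off of smoothing: fewer false alarms from one-off dips, slower reaction to abrupt changes.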

Langfuse Score Trend Monitoring

python
from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1):
    """Compare recent quality scores against baseline."""

    # Fetch recent scores
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1)
    )

    # Fetch baseline scores
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1)
    )

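    # Assumption: fetch_scores returns a response object whose .data holds the
    # score list (Langfuse Python SDK v2); adjust for other SDK versions
    current_mean = np.mean([s.value for s in current_scores.data])
    baseline_mean = np.mean([s.value for s in baseline_scores.data])

    # Relative drop versus baseline (undefined if either window is empty
    # or the baseline mean is zero)
    drift_pct = (baseline_mean - current_mean) / baseline_mean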

    if drift_pct > threshold_drop:
        return {"drift": True, "drop_pct": drift_pct}
    return {"drift": False, "drop_pct": drift_pct}
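
The comparison step above can be separated from the Langfuse calls so it is testable on plain lists. The sketch below does that and guards the two cases where the ratio is undefined (an empty window or a zero baseline). The helper name `relative_drop` and its `None` return convention are assumptions made for illustration.

```python
from statistics import fmean
from typing import Optional, Sequence


def relative_drop(
    baseline_values: Sequence[float],
    current_values: Sequence[float],
    threshold_drop: float = 0.1,
) -> Optional[dict]:
    """Return the drift verdict, or None when the drop cannot be computed."""
    # An empty window or a zero baseline makes the ratio undefined
    if not baseline_values or not current_values:
        return None
    baseline_mean = fmean(baseline_values)
    if baseline_mean == 0:
        return None

    drop_pct = (baseline_mean - fmean(current_values)) / baseline_mean
    return {"drift": drop_pct > threshold_drop, "drop_pct": drop_pct}


# Hypothetical score values: baseline mean 0.8, current mean 0.6
result = relative_drop([0.8, 0.8, 0.8], [0.6, 0.6])
print(result)
```

Returning `None` instead of a NaN verdict lets the caller distinguish "no drift" from "not enough data to decide".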

Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Statistical method | PSI for production (stable), KS for small samples |
| Threshold strategy | Dynamic (95th percentile of historical) over static |
| Baseline window | 7-30 days rolling window |
| Alert priority | Performance metrics > distribution metrics |
| Tool stack | Langfuse (traces) + Evidently/Phoenix (drift analysis) |
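
For the small-sample case, a minimal sketch of the two-sample Kolmogorov-Smirnov statistic is shown below. It is a plain-Python illustration of what the statistic measures; in practice `scipy.stats.ks_2samp` computes the same statistic together with a p-value. The function name `ks_statistic` and the sample values are assumptions.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical score samples: the current window is shifted upward
baseline = [0.1, 0.2, 0.3, 0.4]
current = [0.3, 0.4, 0.5, 0.6]
ks = ks_statistic(baseline, current)
print(ks)
```

Unlike PSI, the statistic needs no binning, which is why it holds up better when only a few dozen observations are available.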

PSI Threshold Guidelines

| PSI Value | Interpretation | Action |
|-----------|----------------|--------|
| < 0.1 | No significant drift | Monitor |
| 0.1 - 0.25 | Moderate drift | Investigate |
| >= 0.25 | Significant drift | Alert + Action |
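
To make the thresholds concrete, the sketch below computes PSI directly from per-bin proportions and maps the result to the action in the table. The helper names and the four-bin proportions are hypothetical, and the PSI helper assumes every proportion is positive (clip them first, as `calculate_psi` does, if a bin can be empty).

```python
import math
from typing import Sequence


def psi_from_proportions(expected_pct: Sequence[float], actual_pct: Sequence[float]) -> float:
    """PSI from per-bin proportions that already share the same bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_pct, actual_pct))


def psi_action(psi: float) -> str:
    """Map a PSI value to the action in the guideline table."""
    if psi < 0.1:
        return "Monitor"
    if psi < 0.25:
        return "Investigate"
    return "Alert + Action"


# Hypothetical four-bin example: baseline is uniform, current is skewed upward
expected = [0.25, 0.25, 0.25, 0.25]
actual = [0.10, 0.20, 0.30, 0.40]
psi = psi_from_proportions(expected, actual)
print(round(psi, 3), psi_action(psi))
```

Here the PSI comes out at roughly 0.228, which falls in the moderate band: worth investigating, not yet an alert.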

Anti-Patterns

python
# ❌ NEVER use static thresholds without context
if psi > 0.2:  # May cause alert fatigue
    alert()

# ❌ NEVER retrain on time schedule alone
schedule.every(7).days.do(retrain)  # Wasteful if no drift

# ✅ ALWAYS use dynamic thresholds
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()

# ✅ ALWAYS correlate with performance metrics
if psi > threshold and quality_score < baseline:
    trigger_evaluation()
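
The two recommended patterns can be combined into one decision function. The sketch below is a dependency-free version: `percentile` uses the same linear interpolation that `np.percentile` applies by default, and `should_trigger_evaluation` is a hypothetical name for the combined check. The historical PSI values are made up for illustration.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, with q in [0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def should_trigger_evaluation(
    psi: float,
    historical_psi: Sequence[float],
    quality_score: float,
    baseline_quality: float,
) -> bool:
    """Trigger only when PSI is unusually high and quality has actually dropped."""
    threshold = percentile(historical_psi, 95)
    return psi > threshold and quality_score < baseline_quality


# Hypothetical history of PSI values from earlier monitoring windows
history = [0.02, 0.03, 0.04, 0.05, 0.06]
print(should_trigger_evaluation(0.07, history, quality_score=0.70, baseline_quality=0.80))
```

Requiring both conditions means a distribution shift that leaves quality intact is logged but does not page anyone.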

Detailed Documentation

| Resource | Description |
|----------|-------------|
| references/statistical-methods.md | PSI, KS, KL divergence, Wasserstein comparison |
| references/embedding-drift.md | Arize Phoenix, cluster monitoring, semantic drift |
| references/ewma-baselines.md | Moving averages, dynamic thresholds, control charts |
| references/langfuse-evidently-integration.md | Combined pipeline pattern |
| checklists/drift-detection-setup-checklist.md | Implementation checklist |

Related Skills

  • langfuse-observability - Score tracking for drift analysis
  • llm-evaluation - Quality metrics that feed drift detection
  • quality-gates - Threshold enforcement
  • observability-monitoring - General monitoring patterns

Capability Details

psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison
Solves:

  • Detect distribution shifts in LLM inputs/outputs
  • Production-grade drift monitoring
  • Stable drift metric for large datasets

embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix
Solves:

  • Detect semantic changes in text data
  • Monitor RAG retrieval quality
  • Track embedding space shifts
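
One simple way to track an embedding space shift is to compare the centroid of a baseline batch with the centroid of a current batch. The sketch below does this with cosine distance in plain Python. It is an assumption-level illustration of the centroid idea, not the method used by Arize Phoenix, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def centroid_drift(baseline: Sequence[Vector], current: Sequence[Vector]) -> float:
    """Cosine distance between the baseline and current centroids."""
    return cosine_distance(centroid(baseline), centroid(current))


# Toy 2-D "embeddings": the current batch points in an orthogonal direction
baseline_batch = [[1.0, 0.0], [1.0, 0.0]]
current_batch = [[0.0, 1.0], [0.0, 1.0]]
print(centroid_drift(baseline_batch, current_batch))
```

A centroid comparison is cheap but coarse: it misses changes that move clusters around without moving the overall mean, which is where cluster-level monitoring becomes useful.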

quality-regression

Keywords: quality drift, score degradation, trend, moving average
Solves:

  • Detect LLM quality degradation over time
  • Compare against historical baselines
  • Early warning for model issues

dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart
Solves:

  • Reduce alert fatigue with adaptive thresholds
  • Statistical process control for LLMs
  • Context-aware drift alerting

canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift
Solves:

  • Track consistency with fixed test inputs
  • Detect behavioral changes in LLMs
  • Regression testing for model updates
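
A canary check can be sketched as re-running a fixed set of prompts and comparing each output against a stored reference. The example below assumes the model is reachable through a plain `generate(prompt) -> str` callable and uses `difflib.SequenceMatcher` as a crude character-level similarity; a real setup would more likely use an embedding or judge-based metric. `run_canaries`, `fake_model`, and the prompts are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def run_canaries(
    generate: Callable[[str], str],
    references: Dict[str, str],
    min_similarity: float = 0.8,
) -> List[dict]:
    """Re-run fixed prompts and flag outputs that diverge from stored references."""
    results = []
    for prompt, reference in references.items():
        output = generate(prompt)
        # Character-level similarity in [0, 1]; a stand-in for a semantic metric
        similarity = SequenceMatcher(None, reference, output).ratio()
        results.append(
            {"prompt": prompt, "similarity": similarity, "changed": similarity < min_similarity}
        )
    return results


# Hypothetical stand-in for a model call; replace with the real client
def fake_model(prompt: str) -> str:
    return {"What is 2 + 2?": "4"}.get(prompt, "I am not sure.")


references = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
report = run_canaries(fake_model, references)
for row in report:
    print(row["prompt"], row["changed"])
```

Because the inputs never change, any flagged canary points to a change in the model or prompt pipeline, not in user traffic.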