AgentSkillsCN

langsmith-evaluator

适用于任何关于创建评估器的问题。涵盖自定义指标的创建、以 LLM 作为 Judge 评估器、基于代码的评估器,以及将评估逻辑上传至 LangSmith。同时包含评估器的基本使用方法,用于开展评估工作。

SKILL.md
--- frontmatter
name: langsmith-evaluator
description: Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Includes basic usage of evaluators to run evaluations.

LangSmith Evaluator

Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).

Setup

Environment Variables

bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

Dependencies

bash
pip install langsmith langchain-openai python-dotenv

Evaluator Format

Evaluators support two function signatures:

Method 1: Dict Parameters (For running evaluations locally):

python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None

    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }

Method 2: Run/Example Parameters (For uploading to LangSmith):

python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")

    return {
        "metric_name": 0.85,      # Metric name as key directly
        "comment": "Reason..."    # Optional explanation
    }

LLM as Judge Evaluators

Use structured output for reliable grading:

python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

# Configure model with structured output
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get('expected_response', '')
    agent_output = run["outputs"].get('expected_response', '')

    prompt = f"""Expected: {expected}
Agent Output: {agent_output}
Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])

    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }

Common Metrics: Completeness, correctness, helpfulness, professionalism

Custom Code Evaluators

Exact Match

python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }

Trajectory Validation

python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected

    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))

    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }

Single Step Validation

python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()

        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }

Running Evaluations

python
from langsmith import Client

client = Client()

# Define your agent function
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}

# Run evaluation
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",              # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)

Upload Evaluators to LangSmith

The upload script is a utility tool to deploy your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.

Navigate to skills/langsmith-evaluator/scripts/ to upload evaluators.

Important: LangSmith API requires evaluators to use (run, example) signature where:

  • run: dict with run["outputs"] containing agent outputs
  • example: dict with example["outputs"] containing expected outputs

Create Evaluator File

python
# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }

Upload

bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
  --name "Trajectory Match" \
  --function trajectory_match \
  --dataset "Skills: Trajectory" \
  --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
  --name "Trajectory Match" \
  --function trajectory_match \
  --replace --yes

Options:

  • --name - Display name in LangSmith
  • --function - Function name to extract
  • --dataset - Target dataset name
  • --project - Target project name
  • --sample-rate - Sampling rate (0.0-1.0)
  • --replace - Replace if exists (will prompt for confirmation)
  • --yes - Skip confirmation prompts for replace/delete operations

IMPORTANT - Safety Prompts:

  • The script prompts for confirmation before any destructive operations (delete, replace)
  • ALWAYS respect these prompts - wait for user input before proceeding
  • NEVER use --yes flag unless the user explicitly requests it
  • The --yes flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

Best Practices

  1. Use structured output for LLM judges - More reliable than parsing free-text
  2. Match evaluator to dataset type
    • Final Response → LLM as Judge for quality, Custom Code for format
    • Single Step → Custom Code for exact match
    • Trajectory → Custom Code for sequence/efficiency
  3. Combine multiple evaluators - Run both subjective (LLM) and objective (code)
  4. Use async for LLM judges - Enables parallel evaluation, much faster
  5. Test evaluators independently - Validate on known good/bad examples first
  6. Upload to LangSmith - Automatic evaluation on new runs

Example Workflow

bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
  --name "Exact Match" \
  --function exact_match \
  --dataset "Skills: Final Response" \
  --replace

# 3. Evaluator runs automatically on new dataset runs

Resources