promptfoo

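Promptfoo evaluation framework for testing and comparing LLM outputs. Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions.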

SKILL.md
---
name: promptfoo
description: |
  Promptfoo evaluation framework for testing and comparing LLM outputs.
  Use when writing eval configs, creating test cases, debugging eval runs, or working with assertions.
allowed-tools:
  - Bash(npx promptfoo:*)
  - Bash(npm run evals:*)
  - WebFetch(domain:www.promptfoo.dev)
---

Promptfoo

Promptfoo is a CLI tool for testing and comparing LLM outputs.

Config File

The CLI auto-discovers promptfooconfig.yaml in the current directory. Use -c path for other locations.

Supported extensions: .yaml, .json, .js
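
For the .js extension, a config is typically a module that exports the config object. The following is a minimal sketch, assuming a CommonJS project (an ES module project would use export default instead); it uses the same top-level keys described under Configuration, and the file paths and test case are illustrative placeholders.

```javascript
// promptfooconfig.js: a sketch of a config expressed as a CommonJS module.
// Paths and the test case are placeholders, not part of any real project.
const config = {
  description: 'What this eval tests',
  prompts: ['file://prompt.txt'],
  providers: ['anthropic:messages:claude-sonnet-4-5-20250929'],
  tests: [
    {
      description: 'What this case tests',
      vars: { variable: 'value' },
      assert: [{ type: 'contains', value: 'expected substring' }],
    },
  ],
};

module.exports = config;
```

A JavaScript config is mainly useful when test cases need to be generated or shared programmatically; for static configs the YAML form is usually easier to read.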

Configuration

yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"

prompts:
  - file://prompt.txt
  - |
    Inline prompt with {{variable}} substitution

providers:
  - anthropic:messages:claude-sonnet-4-5-20250929

defaultTest:
  options:
    provider:
      config:
        temperature: 0.0
        max_tokens: 4096

tests:
  - description: "What this case tests"
    vars:
      variable: "value"
      from_file: file://data/input.txt
    assert:
      - type: contains
        value: "expected substring"

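# Alternatively, replace the inline list above with tests loaded from a file.
# A YAML mapping may contain only one `tests` key, so use one form or the other.
# tests: file://cases/all.yaml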

outputPath: ./results.json

evaluateOptions:
  maxConcurrency: 4

Provider IDs

| Model | ID |
|---|---|
| Opus 4.5 | anthropic:messages:claude-opus-4-5-20251101 |
| Sonnet 4.5 | anthropic:messages:claude-sonnet-4-5-20250929 |
| Haiku 4.5 | anthropic:messages:claude-haiku-4-5-20251001 |

Provider config: temperature, max_tokens, top_p, top_k, tools, tool_choice

Prompts

  • file://path.txt — load from file (path relative to config)
  • Inline string with {{variable}} Nunjucks substitution
  • Chat format via JSON: [{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]
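
A chat-format prompt has to be valid JSON, and a stray quote or trailing comma is an easy mistake to make. The sketch below is a hypothetical helper (validateChatPrompt is not part of promptfoo) that checks a prompt string before an eval run. It assumes plain-text messages, where every entry has a string role and a string content.

```javascript
// Hypothetical pre-flight check for a chat-format prompt (not a promptfoo API).
// Assumes every message carries a string "role" and a string "content".
function validateChatPrompt(text) {
  const messages = JSON.parse(text); // throws SyntaxError on malformed JSON
  if (!Array.isArray(messages)) {
    throw new TypeError('chat prompt must be a JSON array of messages');
  }
  messages.forEach((m, i) => {
    if (!m || typeof m.role !== 'string' || typeof m.content !== 'string') {
      throw new TypeError(`message ${i} needs string "role" and "content" fields`);
    }
  });
  return messages;
}

// The {{input}} placeholder stays inside the JSON string and is filled in per test.
const prompt = JSON.stringify([
  { role: 'system', content: 'You are a concise assistant.' },
  { role: 'user', content: '{{input}}' },
]);

const roles = validateChatPrompt(prompt).map((m) => m.role);
console.log(roles.join(','));
```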

Assertion Types

| Type | Use | Value |
|---|---|---|
| contains | Substring match | "expected text" |
| icontains | Case-insensitive substring | "expected text" |
| equals | Exact match | "exact value" |
| regex | Pattern match | "\\d{4}-\\d{2}-\\d{2}" |
| is-json | Valid JSON output | |
| contains-json | Output contains JSON | |
| starts-with | Prefix match | "prefix" |
| cost | Max cost | threshold: 0.01 |
| latency | Max response time (ms) | threshold: 5000 |
| javascript | Custom JS expression | output.includes('x') |
| python | Custom Python | file://check.py:fn_name |
| llm-rubric | LLM-as-judge | rubric text |
| similar | Semantic similarity | value: "text", threshold: 0.8 |
| model-graded-factuality | Fact checking | |

Prefix any assertion with not- to negate (e.g., not-contains).

llm-rubric

Uses an LLM to grade output against a rubric:

yaml
assert:
  - type: llm-rubric
    value: |
      The response should:
      - Mention at least 3 factors
      - Include specific examples
    threshold: 0.7
    provider: anthropic:messages:claude-sonnet-4-5-20250929

javascript

Use an inline expression, or a multi-line function body that ends with an explicit return. Both can read output (a string) and context (an object that includes vars and prompt):

yaml
assert:
  - type: javascript
    value: output.length > 100 && output.includes('route')
  - type: javascript
    value: |
      const data = JSON.parse(output);
      return data.calories >= 200 && data.calories <= 300;
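
Longer checks can live in their own file instead of inline YAML. The sketch below assumes the assertion is referenced with value: file://calories.js and that the module exports a function taking (output, context); the file name and the checkCalories function are hypothetical. It returns an object with pass, score, and reason rather than a bare boolean, so a failing case explains itself, and it treats unparseable output as a failure instead of throwing.

```javascript
// calories.js: hypothetical assertion module, referenced from the config as
//   - type: javascript
//     value: file://calories.js
function checkCalories(output, context) {
  let data;
  try {
    data = JSON.parse(output);
  } catch (err) {
    // Malformed output fails the assertion with a readable reason.
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  const calories = data?.calories;
  const inRange = typeof calories === 'number' && calories >= 200 && calories <= 300;

  return {
    pass: inRange,
    score: inRange ? 1 : 0,
    reason: inRange
      ? 'calories is within 200-300'
      : `calories is missing or out of range: ${calories}`,
  };
}

module.exports = checkCalories;
```

Keeping the function in a file also makes it possible to unit test the check on sample outputs without calling a model.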

Test Organization

Split cases into separate files and reference them:

yaml
tests:
  - file://cases/basic.yaml
  - file://cases/edge-cases.yaml

Each case file contains a YAML array of test objects.

CLI

bash
npx promptfoo eval                         # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml  # Specific config
npx promptfoo eval --filter-metadata key=v # Filter tests
npx promptfoo view                         # Web UI for results
npx promptfoo cache clear                  # Clear result cache

References

Consult the configuration reference and Anthropic provider docs for full details.