AgentSkillsCN

run-test

执行 Freeplay 测试运行,以针对数据集评估提示模板与智能体的表现。在运行测试前务必征得用户同意——测试运行会向 LLM 提供商发起 API 调用,可能产生相应费用。当用户希望进行测试、针对特定数据集执行测试、启动评估流程,或触发测试运行时,此技能尤为适用。切勿用于分析已有测试结果(请改用“测试运行分析”)或管理数据集(请改用“数据集管理”)。

SKILL.md
--- frontmatter
name: run-test
description: Execute Freeplay test runs to evaluate prompt templates and agents against datasets. Always confirm with the user before running — test runs make API calls to LLM providers and may incur costs. Use when the user wants to run a test, execute tests against a dataset, kick off evaluations, or trigger a test run. Do NOT use for analyzing existing test results (use test-run-analysis) or managing datasets (use dataset-management).

Run Freeplay Test

IMPORTANT: Confirmation Required

Always ask for user confirmation before executing any test run. Test runs make API calls to Freeplay and LLM providers, which may incur costs. Never execute a test without explicit user consent.


This skill helps users execute Freeplay test runs via the SDK/API to evaluate their AI features against datasets.

Note: This skill is for SDK/API-based test runs only, not UI-based testing from the Freeplay dashboard.

When to use this skill

  • "Run a test for my prompt"
  • "Execute the test run for customer support"
  • "Test my agent against the golden dataset"
  • "Run evaluation tests"
  • "Kick off a test run"
  • "I want to test my prompt changes"

Workflow

Step 1: Identify the Test Run Code Path

Search the codebase for existing test run implementation. Look for:

bash
# Search patterns (run these to find test run implementations)
grep -r "test_runs.create" --include="*.py" .
grep -r "fp_client.test_runs" --include="*.py" .
grep -r "freeplay.*test" --include="*.py" .
grep -r "TestRun" --include="*.py" .

The key API call to look for is:

Python SDK:

python
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist="Dataset Name",
    name="Test Run Name"
)

TypeScript SDK:

typescript
const testRun = await fpClient.testRuns.create({
    projectId: projectId,
    testlist: "Dataset Name",
    name: "Test Run Name"
});

Step 2: If No Test Run Code Exists

If the search returns no results, inform the user:

I couldn't find an existing test run script in your codebase. To run Freeplay test runs, you'll need to set up a test runner script first.

Documentation to get started:

Would you like me to help you understand what's needed to set up a test runner?

Do NOT attempt to write the test runner for them. This requires understanding their specific:

  • Pipeline architecture
  • Dataset structure
  • Evaluation criteria
  • Environment configuration

Step 3: If Test Run Code Exists

Once you've found the test run implementation:

  1. Identify the entry point - Find the script/function that initiates test runs
  2. Check for required environment variables:
    • FREEPLAY_API_KEY
    • FREEPLAY_BASE_URL (default: https://app.freeplay.ai)
    • OPENAI_API_KEY (or other LLM provider keys)

NOTE Project ID can be provided by user or discovered via list_projects()

  1. Determine how to run it:

    bash
    # Common patterns
    python run_tests.py
    python -m tests.run_freeplay_tests
    npm run test:freeplay
    pytest tests/test_freeplay.py
    
  2. Ask the user for any required parameters:

    • Dataset/testlist name
    • Test run name
    • Prompt template to test
    • Environment (if applicable)

Step 4: Execute with Consent

Before running, confirm with the user:

I found the test runner at [path]. This will:

  • Run tests against the [dataset] dataset
  • Test the [prompt/agent]
  • Make API calls to Freeplay and your LLM provider

Ready to execute?

Only proceed with explicit consent.

Step 5: Handle Results

On Success:

  • Report the test run completed
  • Provide the test run ID for reviewing results
  • Share any immediate metrics if available in output
  • Always suggest using the test-run-analysis skill to analyze results and suggest improvements to the prompt or agent based on evaluation metrics

On Failure:

  • Capture the error output
  • Identify common issues:
    • Missing environment variables
    • Invalid API keys
    • Dataset not found
    • Network/connectivity issues
    • Rate limiting
  • Help debug with the user

Example Test Run Script Structure

For reference, a typical component-level test run script looks like:

python
from freeplay import Freeplay, RecordPayload
from openai import OpenAI
import os
from scripts.secrets import SecretString

# Initialize clients (use SecretString to prevent accidental logging)
api_key = SecretString(os.environ.get("FREEPLAY_API_KEY"))
fp_client = Freeplay(
    api_key=api_key.get(),
    api_base=os.environ.get("FREEPLAY_BASE_URL", "https://app.freeplay.ai")
)
openai_client = OpenAI()

project_id = "<project-id>"  # Provided by user or discovered via list_projects()

# Create test run
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist="Golden Set",
    name="My Test Run"
)

# Get prompt template
template_prompt = fp_client.prompts.get(
    project_id=project_id,
    template_name="my-prompt",
    environment="latest"
)

# Process each test case
for test_case in test_run.test_cases:
    formatted_prompt = template_prompt.bind(test_case.variables).format()

    # Call LLM
    response = openai_client.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages=formatted_prompt.llm_prompt,
        **formatted_prompt.prompt_info.model_parameters
    )

    # Record results back to Freeplay
    fp_client.recordings.create(RecordPayload(
        # ... recording configuration
    ))

print(f"Test run completed: {test_run.id}")

Environment Variables

Required variables that must be set:

  • FREEPLAY_API_KEY - Freeplay API key
  • FREEPLAY_BASE_URL - Freeplay API URL (default: https://app.freeplay.ai)
  • LLM provider keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY)

Project ID can come from:

  • User specification
  • MCP list_projects() tool to discover available projects

Tips

  • Always search the codebase first before assuming no test runner exists
  • Check common locations: scripts/, tests/, tools/, root directory
  • Look for files named: run_test*.py, test_runner.py, freeplay_test*.py
  • Check package.json scripts or pyproject.toml for test commands
  • Environment variables may be in .env, .env.local, or CI configuration