AgentSkillsCN

dataset-management

创建、管理和维护 Freeplay 数据集(包括提示数据集与智能体数据集)。在执行任何写入操作前,务必事先征得用户同意。当用户希望新建数据集、添加测试用例、更新数据集内容、管理数据集中的各项条目、从 CSV 或 JSONL 文件中导入测试数据、构建黄金标准数据集,或打造评估用数据集时,可使用此技能。切勿用于运行测试(请使用 run-test)或分析测试结果(请使用 test-run-analysis)。

SKILL.md
--- frontmatter
name: dataset-management
description: Create, manage, and curate Freeplay datasets (prompt datasets and agent datasets). Always confirm with the user before any write operations. Use when the user wants to create a new dataset, add test cases, update datasets, manage dataset content, import test data from CSV or JSONL, create golden sets, or build evaluation datasets. Do NOT use for running tests (use run-test) or analyzing test results (use test-run-analysis).

Freeplay Dataset Management

IMPORTANT: Safety Guidelines

Confirmation Required: Always ask for user confirmation before performing any write operations (creating datasets, adding test cases, updating datasets, updating test cases).

No Deletion Operations: Do NOT perform any deletion operations (deleting datasets or test cases). If the user requests deletion, inform them that deletion must be done manually through the Freeplay UI or API directly. This skill does not support deletion actions to prevent accidental data loss.

Manage test datasets for evaluating prompts and agents in Freeplay.

Critical: Two Dataset Types

Freeplay has two distinct dataset types with different API endpoints:

  1. Prompt Datasets - Test individual prompt templates → See prompt-datasets.md
  2. Agent Datasets - Test complete agent workflows → See agent-datasets.md

ALWAYS confirm which type the user needs. The APIs are not interchangeable.

Quick decision guide:

  • User mentions "prompt", "template", or "component" → Prompt Dataset
  • User mentions "agent", "workflow", or "end-to-end" → Agent Dataset

Quick Start

Setup (required for all operations)

python
import requests
import os
from scripts.secrets import SecretString

api_key = SecretString(os.environ.get("FREEPLAY_API_KEY"))
headers = {
    "Authorization": f"Bearer {api_key.get()}",
    "Content-Type": "application/json"
}
project_id = "<project-id>"  # See note below on how to resolve this
base = f"{os.environ.get('FREEPLAY_BASE_URL', 'https://app.freeplay.ai')}/api/v2/projects/{project_id}"

All examples assume this setup. Required environment variables:

Project ID can come from:

  • User specification
  • MCP list_projects() tool to discover available projects

Create a Prompt Dataset

python
response = requests.post(
    f"{base}/prompt-datasets",
    headers=headers,
    json={
        "name": "my-golden-set",
        "description": "Golden test cases",
        "input_names": ["question", "context"]
    }
)
dataset_id = response.json()["id"]  # Status: 201 Created

Create an Agent Dataset

python
response = requests.post(
    f"{base}/agent-datasets",
    headers=headers,
    json={
        "name": "my-agent-tests",
        "description": "Agent workflow test cases"
    }
)
dataset_id = response.json()["id"]  # Status: 201 Created

Add Test Cases (both types)

python
# For prompt datasets: /prompt-datasets/{id}/test-cases/bulk
# For agent datasets: /agent-datasets/{id}/test-cases/bulk

test_cases = [
    {
        "inputs": {"question": "What is your refund policy?"},
        "output": "Expected response...",
        "metadata": {"category": "refunds"}
    }
]

response = requests.post(
    f"{base}/prompt-datasets/{dataset_id}/test-cases/bulk",
    headers=headers,
    json={"data": test_cases}  # Key is "data", not "test_cases"
)
# Status: 201 Created, max 100 items per request

Common Operations

Import from CSV/JSONL

Use the utility script for bulk imports:

bash
python scripts/import_testcases.py \
  --file test_cases.csv \
  --dataset-id ds_abc123 \
  --type prompt  # or "agent"

Handles batching automatically (100 per request). See script help for full options:

bash
python scripts/import_testcases.py --help

CSV format:

  • Columns starting with inputs. become input fields
  • output column becomes expected output
  • Other columns become metadata

Example:

csv
inputs.question,inputs.context,output,category,priority
"What is...","User context","Expected...","refunds","high"

List Datasets

python
# Prompt datasets
response = requests.get(f"{base}/prompt-datasets", headers=headers)
datasets = response.json().get('data', [])

# Agent datasets
response = requests.get(f"{base}/agent-datasets", headers=headers)
datasets = response.json().get('data', [])

Get Test Cases

python
# Replace prompt-datasets with agent-datasets for agent datasets
response = requests.get(
    f"{base}/prompt-datasets/{dataset_id}/test-cases",
    headers=headers
)
test_cases = response.json().get('data', [])

Update Test Case

python
response = requests.patch(
    f"{base}/prompt-datasets/{dataset_id}/test-cases/{test_case_id}",
    headers=headers,
    json={
        "inputs": {"question": "Updated question"},
        "output": "Updated output",
        "metadata": {"updated": True}
    }
)
# Status: 200 OK

Workflow Patterns

For complex operations, use the checklist pattern:

code
Dataset Creation with Verification:
- [ ] Step 1: Create dataset (verify 201 status)
- [ ] Step 2: Add test cases (verify 201 status)
- [ ] Step 3: Retrieve test cases to confirm count
- [ ] Step 4: Verify test case structure

See examples.md for complete workflows with verification steps.

Reference Documentation

Complete API references:

Quick links by task:

Utility Scripts

scripts/import_testcases.py - Import test cases from CSV or JSONL files

  • Handles batching automatically (100 per request)
  • Supports both prompt and agent datasets
  • Provides progress reporting and error handling

scripts/batch_operations.py - Reusable batch operation functions

  • Use in custom scripts for programmatic batch operations
  • Functions: batch_create_test_cases()

Common Errors

ErrorCauseSolution
404 Path Not FoundWrong endpoint path or dataset typeVerify /api/v2/ base and correct type (prompt-datasets or agent-datasets)
Bulk limit exceeded>100 items in single requestUse scripts/import_testcases.py or batch manually
UnauthorizedInvalid or missing API keyCheck FREEPLAY_API_KEY environment variable
Incompatible test caseInput keys don't match prompt variablesVerify inputs keys match prompt template variables

Best Practices

  1. Verify operations: Always check status codes and counts after operations
  2. Use descriptive names: Dataset names should indicate purpose (e.g., "refunds-edge-cases")
  3. Include metadata: Tag test cases for filtering (category, priority, source)
  4. Provide expected outputs: Required for meaningful evaluations
  5. Batch large operations: Use scripts for >10 items
  6. Organize by scenario: Create focused datasets rather than one large dataset
  7. Match input names: For prompt datasets, ensure inputs match template variables

Related Skills

After creating datasets:

  • Run tests using the run-test skill
  • Analyze results using the test-run-analysis skill
  • Check deployment status using the get_deployed_prompt_versions MCP tool

Freeplay Documentation