AgentSkillsCN

data-designer

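Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when a user asks to create synthetic data, generate datasets, produce fake/mock data, generate test data or training data, or perform any other data generation task. Supports multiple output formats, including CSV, JSON, JSONL, and Parquet. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).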

SKILL.md
--- frontmatter
name: data-designer
description: Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).

Data Designer

Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.

Workflow

  1. Clarify requirements - Ask about purpose, columns, size, format
  2. Create schema - Write dataset_schema.json defining columns
  3. Generate preview - Run batch_generator.py for 3-5 rows
  4. Iterate - Refine based on feedback
  5. Generate full dataset - Batch generate, then merge
  6. Deliver - Export to requested format

Column Types

Statistical Samplers (No LLM)

| Type | Description | Key Params |
| --- | --- | --- |
| category | Weighted random choice | values, weights |
| subcategory | Hierarchical (parent-based) | mapping, category |
| uniform | Uniform distribution | low, high, dtype |
| gaussian | Normal distribution | mean, std, min_val, max_val |
| bernoulli | Binary probability | p, true_value, false_value |
| poisson | Poisson distribution | mean |
| datetime | Random dates | start, end, format |
| person | Synthetic personas | fields, age_range, locale |
| uuid | Unique IDs | prefix, format |
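
To make the sampler semantics concrete, here is a minimal sketch of three of the sampler types using only Python's standard library. It is not the code in batch_generator.py. The function names are hypothetical, and clamping to min_val/max_val is an assumption about how the gaussian bounds behave (the real script could just as well resample out-of-range draws).

```python
import random

def sample_category(rng, values, weights=None):
    """Weighted random choice over `values` (uniform if weights is None)."""
    return rng.choices(values, weights=weights, k=1)[0]

def sample_gaussian(rng, mean, std, min_val=None, max_val=None):
    """Draw from a normal distribution, then clamp into [min_val, max_val].

    Clamping is an assumption made for this sketch.
    """
    x = rng.gauss(mean, std)
    if min_val is not None:
        x = max(min_val, x)
    if max_val is not None:
        x = min(max_val, x)
    return x

def sample_bernoulli(rng, p, true_value=True, false_value=False):
    """Return true_value with probability p, otherwise false_value."""
    return true_value if rng.random() < p else false_value

# A dedicated Random instance keeps the draws reproducible for a given seed.
rng = random.Random(42)
rows = [
    {
        "category": sample_category(rng, ["A", "B"], [0.6, 0.4]),
        "score": sample_gaussian(rng, mean=3.0, std=1.0, min_val=1.0, max_val=5.0),
        "flag": sample_bernoulli(rng, p=0.3),
    }
    for _ in range(5)
]
```

Passing one `random.Random` instance through every sampler, rather than calling the module-level functions, keeps all columns on the same seeded stream.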

LLM Columns (Claude generates)

| Type | Description |
| --- | --- |
| llm_text | Free-form text |
| llm_code | Code with syntax validation |
| llm_structured | JSON matching schema |
| llm_judge | Quality scoring |
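
The "syntax validation" step for llm_code columns could look roughly like the sketch below when the generated code is Python. This is an illustration, not the skill's actual validator: the helper name is hypothetical, and it only checks that the source parses, not that it runs correctly.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; the code is never executed."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True

good = is_valid_python("def add(a, b):\n    return a + b\n")
bad = is_valid_python("def add(a, b:\n    return a + b\n")
```

Parsing with `ast` rather than executing the code keeps validation safe for untrusted generated output.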

Schema Format

Create dataset_schema.json:

json
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}

For full schema reference: references/schema.md
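
Before generating anything, it can help to sanity-check the schema. The sketch below assumes only the fields visible in the example above (name, type, depends_on). The validate_schema helper is hypothetical and is not part of the skill's scripts; references/schema.md remains the authoritative description.

```python
def validate_schema(schema: dict) -> list:
    """Return a list of problems; an empty list means these basic checks passed."""
    columns = schema.get("columns", [])
    errors = []
    if not columns:
        errors.append("schema defines no columns")
    names = [col.get("name") for col in columns]
    if len(names) != len(set(names)):
        errors.append("column names must be unique")
    for col in columns:
        label = col.get("name", "<unnamed>")
        for key in ("name", "type"):
            if key not in col:
                errors.append(f"column {label!r} is missing {key!r}")
        # Every dependency must point at a column defined in the same schema.
        for dep in col.get("depends_on", []):
            if dep not in names:
                errors.append(f"column {label!r} depends on unknown column {dep!r}")
    return errors

schema = {
    "name": "dataset_name",
    "seed": 42,
    "columns": [
        {"name": "category", "type": "category",
         "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
        {"name": "text", "type": "llm_text",
         "prompt": "Write about {{ category }}.", "depends_on": ["category"]},
    ],
    "output": {"format": "csv", "filename": "output"},
}
problems = validate_schema(schema)
```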

Jinja2 Templating

Reference columns in prompts:

code
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.

Supports variables ({{ var }}), attribute access ({{ obj.field }}), conditionals ({% if %}), and filters.
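
To show how a row's values end up in a prompt, here is a deliberately small stand-in for the substitution step. It is not Jinja2: it handles only {{ var }} and {{ obj.field }} with a regular expression, so it runs without third-party packages, and it does not support {% if %} or filters. The render_prompt name and the sample row values are made up for illustration.

```python
import re

# Matches {{ name }} and dotted paths such as {{ customer.first_name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def render_prompt(template: str, row: dict) -> str:
    """Replace each placeholder with the matching (possibly nested) row value."""
    def lookup(match):
        value = row
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the column is missing
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)

row = {"rating": 4, "product_name": "Trail Mug", "customer": {"first_name": "Ana"}}
prompt = render_prompt(
    "Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.",
    row,
)
# prompt == "Write a 4-star review for Trail Mug by Ana."
```

A missing column raises KeyError here, which surfaces a bad depends_on list early instead of sending a half-filled prompt to the model.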

Scripts

Generate Data

bash
# Preview
python scripts/batch_generator.py --schema dataset_schema.json --rows 5 --output preview.json --preview

# Full generation
python scripts/batch_generator.py --schema dataset_schema.json --rows 100 --batch-size 20 --output batches/

Merge & Export

bash
python scripts/merger.py --input batches/ --output dataset.csv --flatten

Formats: csv, json, jsonl, parquet
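
As a rough picture of what merging with --flatten involves, the sketch below combines batch files and flattens nested objects into dotted column names before writing CSV. It is not the code in merger.py. It assumes each batch file is a JSON array of row objects and that nested keys are joined with a dot; both are assumptions, and the function names are hypothetical.

```python
import csv
import json
from pathlib import Path

def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Turn {"customer": {"first_name": "Ana"}} into {"customer.first_name": "Ana"}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def merge_batches(batch_dir, out_csv) -> int:
    """Merge every *.json batch in batch_dir into one CSV; return the row count."""
    rows = []
    for path in sorted(Path(batch_dir).glob("*.json")):
        batch = json.loads(path.read_text(encoding="utf-8"))
        rows.extend(flatten(record) for record in batch)
    # Preserve first-seen column order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Sorting the batch paths keeps the merged row order stable between runs.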

Generation Strategy

  1. Sampler columns first - Produced by the Python scripts with no LLM calls, so this step is fast
  2. LLM columns in dependency order - Topological sort by depends_on
  3. Batch processing - Generate in batches of 20-50 for large datasets
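
The dependency ordering in step 2 can be sketched with the standard library's graphlib module (Python 3.9+). The generation_order helper is hypothetical; it assumes each column lists its prerequisites under depends_on, as in the schema example.

```python
from graphlib import TopologicalSorter

def generation_order(columns):
    """Return column names so that every column comes after its dependencies."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    # static_order() raises graphlib.CycleError if columns depend on each other.
    return list(TopologicalSorter(graph).static_order())

columns = [
    {"name": "score", "type": "llm_judge", "depends_on": ["text"]},
    {"name": "text", "type": "llm_text", "depends_on": ["category"]},
    {"name": "category", "type": "category"},
]
order = generation_order(columns)
```

For this chain of dependencies the only valid order is category, then text, then score, regardless of the order in which the columns are declared.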

For LLM columns, Claude generates directly:

  • Render Jinja2 prompt with row data
  • Generate content
  • Validate if configured
  • Retry on failure (max 3)
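
The generate, validate, retry loop above can be pictured as follows. This is a sketch under two assumptions: "max 3" is read as three attempts in total, and generation and validation are plain callables. Here the generator is a stub that replays canned drafts, since in the skill Claude produces the content directly.

```python
def generate_with_retry(generate, validate, prompt, max_attempts=3):
    """Call generate(prompt) until validate accepts the result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt)
        if validate(candidate):
            return candidate, attempt
    raise RuntimeError(f"no valid output after {max_attempts} attempts")

# Stub generator: two drafts with syntax errors, then a valid one.
drafts = iter(["def f(:", "def f(x) return x", "def f(x):\n    return x\n"])

def fake_generate(prompt):
    return next(drafts)

def compiles(source):
    """Validator for Python code: True if the source compiles."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError:
        return False
    return True

result, attempts = generate_with_retry(fake_generate, compiles, "Write an identity function.")
```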

Examples

Simple:

"Generate 50 product reviews with ratings 1-5"

Complex:

"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"

Code:

"Generate 100 Python functions with description, code (validated), tests"

Tips

  • Use seed for reproducibility
  • Preview first, then scale
  • Keep LLM prompts specific
  • Use subcategory for correlated data
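
Two of these tips, seeding and subcategory, can be illustrated together. In the sketch below the mapping ties each subcategory to its parent value so the two columns stay correlated, and a fixed seed reproduces the same rows. The mapping contents and function names are invented for illustration, and uniform choice within a parent is an assumption.

```python
import random

# Hypothetical parent -> children mapping for a subcategory column.
MAPPING = {
    "billing": ["refund", "invoice"],
    "technical": ["login", "crash"],
}

def sample_subcategory(rng, parent_value, mapping):
    """Pick a child value that is valid for the already-sampled parent."""
    return rng.choice(mapping[parent_value])

def make_rows(seed, n):
    rng = random.Random(seed)  # same seed, same sequence of draws
    rows = []
    for _ in range(n):
        category = rng.choice(list(MAPPING))
        rows.append({"category": category,
                     "topic": sample_subcategory(rng, category, MAPPING)})
    return rows

first = make_rows(42, 10)
second = make_rows(42, 10)
```

Sampling the two columns independently would instead produce rows such as a billing ticket about a crash, which is what subcategory is meant to prevent.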

Attribution

Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).