Data Designer
Generate synthetic datasets by combining statistical samplers with Claude's LLM capabilities. No external API keys are required.
Workflow
- Clarify requirements - Ask about purpose, columns, size, format
- Create schema - Write dataset_schema.json defining columns
- Generate preview - Run batch_generator.py for 3-5 rows
- Iterate - Refine based on feedback
- Generate full dataset - Batch generate, then merge
- Deliver - Export to requested format
Column Types
Statistical Samplers (No LLM)
| Type | Description | Key Params |
|---|---|---|
| category | Weighted random choice | values, weights |
| subcategory | Hierarchical (parent-based) | mapping, category |
| uniform | Uniform distribution | low, high, dtype |
| gaussian | Normal distribution | mean, std, min_val, max_val |
| bernoulli | Binary probability | p, true_value, false_value |
| poisson | Poisson distribution | mean |
| datetime | Random dates | start, end, format |
| person | Synthetic personas | fields, age_range, locale |
| uuid | Unique IDs | prefix, format |
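As a rough illustration of what a few of the sampler columns above do, the following sketch draws them with Python's standard `random` module. It is not the skill's actual sampler code: the function names (`sample_category`, `sample_subcategory`, `sample_gaussian`, `sample_rows`) and the example values are hypothetical, and only the parameter names (`values`, `weights`, `mapping`, `mean`, `std`, `min_val`, `max_val`) come from the table. Clipping the gaussian draw to `min_val`/`max_val` is an assumption about how the bounds are applied.

```python
import random

def sample_category(rng, values, weights=None):
    """Weighted random choice over `values`."""
    return rng.choices(values, weights=weights, k=1)[0]

def sample_subcategory(rng, parent_value, mapping):
    """Pick a child value conditioned on the already-sampled parent."""
    return rng.choice(mapping[parent_value])

def sample_gaussian(rng, mean, std, min_val=None, max_val=None):
    """Normal draw, clipped to [min_val, max_val] when bounds are given (assumed behavior)."""
    x = rng.gauss(mean, std)
    if min_val is not None:
        x = max(min_val, x)
    if max_val is not None:
        x = min(max_val, x)
    return x

def sample_rows(n, seed):
    rng = random.Random(seed)  # one seeded generator makes runs repeatable
    # Hypothetical parent -> children mapping for the subcategory column.
    mapping = {"billing": ["refund", "invoice"], "technical": ["bug", "outage"]}
    rows = []
    for _ in range(n):
        category = sample_category(rng, ["billing", "technical"], [0.6, 0.4])
        rows.append({
            "category": category,
            "issue": sample_subcategory(rng, category, mapping),
            "priority": sample_gaussian(rng, mean=3, std=1, min_val=1, max_val=5),
        })
    return rows

rows = sample_rows(3, seed=42)
```

Calling `sample_rows` twice with the same seed returns identical rows, which is the property the schema's `seed` field relies on.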
LLM Columns (Claude generates)
| Type | Description |
|---|---|
| llm_text | Free-form text |
| llm_code | Code with syntax validation |
| llm_structured | JSON matching schema |
| llm_judge | Quality scoring |
Schema Format
Create dataset_schema.json:
{
"name": "dataset_name",
"seed": 42,
"columns": [
{"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
{"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
],
"output": {"format": "csv", "filename": "output"}
}
For full schema reference: references/schema.md
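Before generating anything, it can help to sanity-check a schema like the one above. The sketch below is a minimal, hypothetical checker (`check_schema` is not one of the skill's scripts). It assumes only the fields visible in the example: `columns`, `name`, `params`, and `depends_on`.

```python
import json

SCHEMA_TEXT = """
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category",
     "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text",
     "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
"""

def check_schema(schema):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    names = [col["name"] for col in schema["columns"]]
    if len(names) != len(set(names)):
        problems.append("duplicate column names")
    for col in schema["columns"]:
        # Every dependency must name a column defined in the same schema.
        for dep in col.get("depends_on", []):
            if dep not in names:
                problems.append(f"{col['name']} depends on unknown column {dep}")
        # A weighted choice needs exactly one weight per value.
        params = col.get("params", {})
        if "weights" in params and len(params["weights"]) != len(params.get("values", [])):
            problems.append(f"{col['name']}: weights and values differ in length")
    return problems

schema = json.loads(SCHEMA_TEXT)
print(check_schema(schema))  # prints []
```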
Jinja2 Templating
Reference columns in prompts:
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
Supports: {{ var }}, {{ obj.field }}, {% if %}, filters
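To show how the `{{ var }}` and `{{ obj.field }}` forms resolve against a row, here is a small standard-library stand-in. It is deliberately not Jinja2: it handles only plain and dotted placeholders, with no `{% if %}` blocks or filters. The helper names (`resolve`, `render_prompt`) and the row values are hypothetical.

```python
import re

# Matches {{ name }} or {{ obj.field }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def resolve(path, row):
    """Follow a dotted path such as 'customer.first_name' through nested dicts."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value

def render_prompt(template, row):
    """Substitute every placeholder with the matching value from the row."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), row)), template)

row = {"rating": 4, "product_name": "Trail Mug", "customer": {"first_name": "Ana"}}
template = "Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}."
print(render_prompt(template, row))
# prints: Write a 4-star review for Trail Mug by Ana.
```

A placeholder that names a column missing from the row raises `KeyError` here, which surfaces a missing `depends_on` entry early.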
Scripts
Generate Data
# Preview
python scripts/batch_generator.py --schema schema.json --rows 5 --output preview.json --preview
# Full generation
python scripts/batch_generator.py --schema schema.json --rows 100 --batch-size 20 --output batches/
Merge & Export
python scripts/merger.py --input batches/ --output dataset.csv --flatten
Formats: csv, json, jsonl, parquet
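The merge step with `--flatten` can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of `merger.py`: it assumes each batch is a list of row dicts, that flattening joins nested keys with a dot, and it works in memory rather than reading the `batches/` directory. The names `flatten` and `merge_to_csv` are hypothetical.

```python
import csv
import io

def flatten(row, parent="", sep="."):
    """Collapse nested dicts to one level, e.g. customer -> email becomes 'customer.email'."""
    flat = {}
    for key, value in row.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def merge_to_csv(batches):
    """Concatenate batches of rows, flatten each row, and return CSV text."""
    rows = [flatten(row) for batch in batches for row in batch]
    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

batches = [
    [{"ticket_id": "t-1", "customer": {"name": "Ana", "email": "ana@example.com"}}],
    [{"ticket_id": "t-2", "customer": {"name": "Bo", "email": "bo@example.com"}}],
]
csv_text = merge_to_csv(batches)
```

Flattening matters mainly for tabular targets such as csv, where a nested `person` or `llm_structured` value has no natural single cell.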
Generation Strategy
- Sampler columns first - Python scripts, fast
- LLM columns in dependency order - Topological sort by depends_on
- Batch processing - Generate in batches of 20-50 for large datasets
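The dependency ordering above can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is one possible way to do it, not necessarily how `batch_generator.py` does; the `generation_order` helper and the example columns are hypothetical.

```python
from graphlib import TopologicalSorter

def generation_order(columns):
    """Order column names so each column comes after everything in its depends_on."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    return list(TopologicalSorter(graph).static_order())

columns = [
    {"name": "review", "type": "llm_text", "depends_on": ["rating", "product_name"]},
    {"name": "score", "type": "llm_judge", "depends_on": ["review"]},
    {"name": "rating", "type": "uniform"},
    {"name": "product_name", "type": "category"},
]
order = generation_order(columns)
```

Here `rating` and `product_name` come out before `review`, and `review` before `score`. A circular `depends_on` chain raises `graphlib.CycleError` instead of producing an order.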
For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
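The generate, validate, retry loop above might look roughly like this. It is a sketch only: `generate` is a hypothetical callable standing in for Claude producing the content, `validate_json` is an example validator, and "max 3" is read here as three attempts in total, which is an assumption.

```python
import json

MAX_ATTEMPTS = 3  # assumed reading of "max 3": three attempts in total

def generate_cell(prompt, generate, validate=None, max_attempts=MAX_ATTEMPTS):
    """Call generate(prompt) until validate accepts the result or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        content = generate(prompt)
        if validate is None:
            return content  # no validation configured: accept the first result
        try:
            validate(content)
            return content
        except ValueError as err:
            last_error = err  # remember why this attempt failed, then retry
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")

def validate_json(text):
    """Example validator: raises ValueError (JSONDecodeError) if text is not JSON."""
    json.loads(text)

# Stand-in generator that fails once, then returns valid JSON.
replies = iter(["not json", '{"ok": true}'])
result = generate_cell("Return JSON.", lambda prompt: next(replies), validate_json)
print(result)  # prints {"ok": true}
```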
Examples
Simple:
"Generate 50 product reviews with ratings 1-5"
Complex:
"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"
Code:
"Generate 100 Python functions with description, code (validated), tests"
Tips
- Use seed for reproducibility
- Preview first, then scale
- Keep LLM prompts specific
- Use subcategory for correlated data
Attribution
Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).