AgentSkillsCN

ruvector-agentic-synth

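High-performance synthetic data generator for AI/ML training, RAG evaluation, and agentic workflow testing. Use it to generate training datasets, create test data for RAG pipelines, build evaluation benchmarks, or synthesize realistic agent conversation data.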

SKILL.md
---
name: ruvector-agentic-synth
description: "High-performance synthetic data generator for AI/ML training, RAG evaluation, and agentic workflow testing. Use when generating training datasets, creating test data for RAG pipelines, building evaluation benchmarks, or synthesizing realistic agent conversation data."
---

@ruvector/agentic-synth

High-performance synthetic data generator designed for AI/ML training, RAG system evaluation, and agentic workflow testing. Generates realistic text, embeddings, Q&A pairs, conversations, and structured datasets.

Quick Reference

| Task | Code |
|------|------|
| Install | `npx @ruvector/agentic-synth@latest` |
| Create generator | `new SynthGenerator(config)` |
| Generate QA pairs | `gen.generateQA(documents, options)` |
| Generate embeddings | `gen.generateEmbeddings(options)` |
| Generate conversations | `gen.generateConversations(options)` |
| Generate dataset | `gen.generateDataset(options)` |

Installation

```bash
npx @ruvector/agentic-synth@latest
```

Quick Start

```typescript
import {
  SynthGenerator,
  QAGenerator,
  ConversationGenerator,
  DatasetGenerator,
} from '@ruvector/agentic-synth';

// A fixed seed makes runs reproducible
const gen = new SynthGenerator({ seed: 42 });

// Source documents to derive questions from (placeholder content; replace with your own)
const documents: string[] = [
  'Requests are retried on failure. The max retry count is 3.',
];

// Generate Q&A pairs from documents (for RAG evaluation)
const qaPairs = await gen.generateQA(documents, {
  count: 100,
  difficulty: 'mixed',
  includeNegatives: true,
});

console.log(qaPairs[0]);
// { question: "What is the max retry count?",
//   answer: "The max retry count is 3",
//   context: "...",
//   difficulty: "easy",
//   isNegative: false }

// Generate synthetic embeddings (for testing vector search)
const embeddings = gen.generateEmbeddings({
  count: 10_000,
  dimensions: 1536,
  clusters: 50,
  noise: 0.1,
});

// Generate agent conversations (for training/eval)
const conversations = await gen.generateConversations({
  count: 50,
  turns: 5,
  agents: ['user', 'assistant'],
  topics: ['coding', 'debugging', 'architecture'],
});

// Generate structured dataset
const dataset = gen.generateDataset({
  count: 1000,
  schema: {
    name: { type: 'name' },
    email: { type: 'email' },
    score: { type: 'float', min: 0, max: 1 },
    category: { type: 'enum', values: ['A', 'B', 'C'] },
    embedding: { type: 'vector', dimensions: 384 },
  },
});
```

Core API

SynthGenerator

Main generator with all synthesis capabilities.

```typescript
// config is an optional SynthConfig
const gen = new SynthGenerator(config);
```

SynthConfig:

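| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| seed | `number` | random | Reproducibility seed |
| locale | `string` | `'en'` | Data locale |
| model | `string` | - | LLM for text generation |
| batchSize | `number` | `100` | Generation batch size |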
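
The `seed` parameter implies that a fixed seed yields the same output on every run. How the package seeds its generators is not documented here, so the sketch below only illustrates the general idea using mulberry32, a small general-purpose PRNG. The helpers `mulberry32` and `sample` are illustrative names and are not part of `@ruvector/agentic-synth`.

```typescript
// Illustrative only: a generic seeded PRNG, not the package's internal generator.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Map the 32-bit result into [0, 1)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw `n` values from a generator created with the given seed.
function sample(seed: number, n: number): number[] {
  const rand = mulberry32(seed);
  const out: number[] = [];
  for (let i = 0; i < n; i++) out.push(rand());
  return out;
}

const runA = sample(42, 3);
const runB = sample(42, 3);

// Same seed, same sequence
console.log(runA.every((v, i) => v === runB[i]));
```

When comparing two runs of a synthetic-data pipeline, pinning the seed this way removes random variation as a source of differences.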

gen.generateQA(documents, options)

Generate question-answer pairs from source documents for RAG evaluation.

```typescript
gen.generateQA(
  documents: string[],
  options: QAOptions
): Promise<QAPair[]>
```

QAOptions:

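| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| count | `number` | `100` | Number of pairs |
| difficulty | `'easy' \| 'medium' \| 'hard' \| 'mixed'` | `'mixed'` | Question difficulty |
| includeNegatives | `boolean` | `false` | Include unanswerable questions |
| negativeRatio | `number` | `0.2` | Ratio of negatives |
| chunkSize | `number` | `512` | Context chunk size |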

QAPair:

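| Field | Type | Description |
|-------|------|-------------|
| question | `string` | Generated question |
| answer | `string` | Expected answer |
| context | `string` | Source context chunk |
| difficulty | `string` | Difficulty level |
| isNegative | `boolean` | Whether unanswerable |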
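
As a rough illustration of how `QAPair` records could feed a RAG evaluation, the sketch below scores a retriever by checking whether any retrieved chunk contains the pair's source context. The `Retriever` type, `evaluateRetrieval`, the substring-match criterion, and the toy data (including empty `answer` and `context` on the negative pair) are all assumptions made for this example. The local `QAPair` interface simply mirrors the fields in the table above.

```typescript
// Local mirror of the QAPair fields listed above, declared so the sketch is self-contained.
interface QAPair {
  question: string;
  answer: string;
  context: string;
  difficulty: string;
  isNegative: boolean;
}

// Hypothetical retriever signature: returns the top-k chunks for a question.
type Retriever = (question: string, k: number) => string[];

interface RetrievalReport {
  answerable: number; // pairs with isNegative === false
  hits: number;       // answerable pairs whose context was retrieved
  hitRate: number;    // hits / answerable (0 when nothing is answerable)
}

// Counts a hit when any retrieved chunk contains the pair's source context.
// Negative pairs are skipped because they have no correct chunk to find.
function evaluateRetrieval(pairs: QAPair[], retrieve: Retriever, k = 5): RetrievalReport {
  let answerable = 0;
  let hits = 0;
  for (const pair of pairs) {
    if (pair.isNegative) continue;
    answerable += 1;
    const chunks = retrieve(pair.question, k);
    if (chunks.some((chunk) => chunk.indexOf(pair.context) !== -1)) hits += 1;
  }
  return { answerable, hits, hitRate: answerable === 0 ? 0 : hits / answerable };
}

// Toy data and a toy retriever that ignores the question.
const samplePairs: QAPair[] = [
  { question: "What is the max retry count?", answer: "The max retry count is 3",
    context: "max retry count is 3", difficulty: "easy", isNegative: false },
  { question: "What is the default timeout?", answer: "30 seconds",
    context: "default timeout is 30 seconds", difficulty: "easy", isNegative: false },
  { question: "Who founded the company?", answer: "",
    context: "", difficulty: "hard", isNegative: true },
];
const corpus = ["The max retry count is 3 by default.", "Logs rotate daily."];
const naiveRetriever: Retriever = (_question, k) => corpus.slice(0, k);

const report = evaluateRetrieval(samplePairs, naiveRetriever);
console.log(report); // { answerable: 2, hits: 1, hitRate: 0.5 }
```

Negative pairs are better scored separately, for example by checking that the downstream model declines to answer them.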

gen.generateEmbeddings(options)

Generate synthetic embedding vectors with cluster structure.

```typescript
gen.generateEmbeddings(options: EmbeddingOptions): EmbeddingDataset
```

EmbeddingOptions:

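| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| count | `number` | `1000` | Number of vectors |
| dimensions | `number` | `384` | Vector dimensions |
| clusters | `number` | `10` | Number of clusters |
| noise | `number` | `0.1` | Gaussian noise level |
| normalize | `boolean` | `true` | L2 normalize |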

EmbeddingDataset:

| Field | Type | Description |
|-------|------|-------------|
| vectors | `Float32Array[]` | Generated embeddings |
| labels | `number[]` | Cluster assignments |
| centroids | `Float32Array[]` | Cluster centers |
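
To make the options concrete, here is a minimal sketch of one common way to build clustered synthetic vectors: pick random centroids, add Gaussian noise around them, and L2-normalize the result. It is not the package's implementation. The function names, the Box-Muller sampler, and the round-robin cluster assignment are choices made for this example.

```typescript
// Illustrative sketch of clustered synthetic embeddings; not the package's implementation.
interface ClusteredVectors {
  vectors: Float32Array[];
  labels: number[];
  centroids: Float32Array[];
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rand: () => number): number {
  const u = 1 - rand(); // shift to (0, 1] so Math.log never receives 0
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Scale a vector in place to unit L2 norm (left unchanged if its norm is 0).
function l2Normalize(vec: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vec.length; i++) sumSquares += vec[i] * vec[i];
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let i = 0; i < vec.length; i++) vec[i] /= norm;
  }
  return vec;
}

function makeClusteredVectors(
  count: number,
  dimensions: number,
  clusters: number,
  noise: number,
  rand: () => number = Math.random,
): ClusteredVectors {
  // Random unit-length centroids
  const centroids: Float32Array[] = [];
  for (let c = 0; c < clusters; c++) {
    const centroid = new Float32Array(dimensions);
    for (let d = 0; d < dimensions; d++) centroid[d] = gaussian(rand);
    centroids.push(l2Normalize(centroid));
  }

  const vectors: Float32Array[] = [];
  const labels: number[] = [];
  for (let i = 0; i < count; i++) {
    const label = i % clusters; // round-robin assignment keeps clusters balanced
    const centroid = centroids[label];
    const vec = new Float32Array(dimensions);
    for (let d = 0; d < dimensions; d++) vec[d] = centroid[d] + noise * gaussian(rand);
    vectors.push(l2Normalize(vec));
    labels.push(label);
  }
  return { vectors, labels, centroids };
}

const demo = makeClusteredVectors(100, 8, 4, 0.1);
console.log(demo.vectors.length, demo.centroids.length); // 100 4
```

Because `labels` records which centroid each vector was drawn around, a vector index can be checked by confirming that nearest-neighbor queries mostly return vectors with the same label.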

gen.generateConversations(options)

Generate multi-turn agent conversations.

```typescript
gen.generateConversations(options: ConversationOptions): Promise<Conversation[]>
```

ConversationOptions:

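| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| count | `number` | `10` | Number of conversations |
| turns | `number` | `5` | Turns per conversation |
| agents | `string[]` | `['user', 'assistant']` | Participant roles |
| topics | `string[]` | `['general']` | Conversation topics |
| style | `'formal' \| 'casual' \| 'technical'` | `'technical'` | Conversation style |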

Conversation:

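| Field | Type | Description |
|-------|------|-------------|
| id | `string` | Conversation ID |
| turns | `Turn[]` | Array of `{ role, content, timestamp }` objects |
| topic | `string` | Topic label |
| metadata | `Record<string, unknown>` | Extra metadata |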
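
For training or evaluation pipelines, conversations are often flattened into one JSON object per line. The sketch below shows one way to do that for records shaped like `Conversation`. The `toJsonl` helper, the `messages` output layout, and the string type for `timestamp` are assumptions for this example; the source does not specify the timestamp type or any export format.

```typescript
// Local mirrors of the shapes described above, declared so the sketch is self-contained.
interface Turn {
  role: string;
  content: string;
  timestamp: string; // assumed to be a string; the type is not specified in the source
}

interface Conversation {
  id: string;
  turns: Turn[];
  topic: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: one JSON object per line, keeping only role and content per turn.
function toJsonl(conversations: Conversation[]): string {
  return conversations
    .map((conv) =>
      JSON.stringify({
        id: conv.id,
        topic: conv.topic,
        messages: conv.turns.map((turn) => ({ role: turn.role, content: turn.content })),
      }),
    )
    .join("\n");
}

// Toy conversation for demonstration
const sampleConversation: Conversation = {
  id: "conv-001",
  topic: "debugging",
  turns: [
    { role: "user", content: "My test fails intermittently.", timestamp: "2024-01-01T00:00:00Z" },
    { role: "assistant", content: "Does it depend on execution order?", timestamp: "2024-01-01T00:00:05Z" },
  ],
  metadata: {},
};

const jsonl = toJsonl([sampleConversation, sampleConversation]);
console.log(jsonl.split("\n").length); // 2
```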

gen.generateDataset(options)

Generate structured tabular data.

```typescript
gen.generateDataset(options: DatasetOptions): Record<string, unknown>[]
```

DatasetOptions:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| count | `number` | `1000` | Row count |
| schema | `SchemaSpec` | required | Column definitions |

Schema field types:

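| Type | Parameters | Description |
|------|------------|-------------|
| `'name'` | - | Random person name |
| `'email'` | - | Random email |
| `'text'` | `{ minLength?, maxLength? }` | Random text |
| `'int'` | `{ min?, max? }` | Random integer |
| `'float'` | `{ min?, max? }` | Random float |
| `'enum'` | `{ values: string[] }` | Random from set |
| `'bool'` | `{ probability? }` | Random boolean |
| `'date'` | `{ from?, to? }` | Random date |
| `'vector'` | `{ dimensions }` | Random vector |
| `'uuid'` | - | Random UUID |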
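
The schema format maps naturally onto a discriminated union keyed by `type`. The sketch below models four of the field types that way and generates rows from them, to show how a schema-driven generator can be structured. `FieldSpec`, `generateValue`, `generateRows`, and the fallback defaults (0 to 100 for `int`, 0 to 1 for `float`, 0.5 for `bool`) are assumptions for this example, not the package's `SchemaSpec` or its actual defaults.

```typescript
// Illustrative subset of the field types above; not the package's SchemaSpec.
type FieldSpec =
  | { type: "int"; min?: number; max?: number }
  | { type: "float"; min?: number; max?: number }
  | { type: "enum"; values: string[] }
  | { type: "bool"; probability?: number };

type Schema = Record<string, FieldSpec>;

// Produce one value for a field. Fallback defaults here are assumed, not documented.
function generateValue(spec: FieldSpec, rand: () => number): unknown {
  switch (spec.type) {
    case "int": {
      const min = spec.min ?? 0;
      const max = spec.max ?? 100;
      return Math.floor(min + rand() * (max - min + 1)); // inclusive of both bounds
    }
    case "float": {
      const min = spec.min ?? 0;
      const max = spec.max ?? 1;
      return min + rand() * (max - min); // half-open range [min, max)
    }
    case "enum":
      return spec.values[Math.floor(rand() * spec.values.length)];
    case "bool":
      return rand() < (spec.probability ?? 0.5);
  }
}

function generateRows(
  schema: Schema,
  count: number,
  rand: () => number = Math.random,
): Record<string, unknown>[] {
  const columns = Object.keys(schema);
  const rows: Record<string, unknown>[] = [];
  for (let i = 0; i < count; i++) {
    const row: Record<string, unknown> = {};
    for (const column of columns) row[column] = generateValue(schema[column], rand);
    rows.push(row);
  }
  return rows;
}

const rows = generateRows(
  {
    score: { type: "float", min: 0, max: 1 },
    category: { type: "enum", values: ["A", "B", "C"] },
    retries: { type: "int", min: 1, max: 3 },
    active: { type: "bool", probability: 0.8 },
  },
  50,
);
console.log(rows.length); // 50
```

Passing a seeded generator as `rand` instead of `Math.random` would make the rows reproducible.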

gen.generateText(options)

Generate synthetic text paragraphs.

```typescript
gen.generateText(options: TextOptions): Promise<string[]>
```

TextOptions:

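| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| count | `number` | `10` | Number of paragraphs |
| topic | `string` | `'general'` | Topic |
| minLength | `number` | `50` | Minimum words per paragraph |
| maxLength | `number` | `200` | Maximum words per paragraph |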

CLI Usage

```bash
# Generate QA pairs
npx @ruvector/agentic-synth qa --input docs/ --count 100 --output qa.json

# Generate embeddings
npx @ruvector/agentic-synth embeddings --count 10000 --dims 384 --output embeds.npy

# Generate conversations
npx @ruvector/agentic-synth conversations --count 50 --turns 5 --output convos.json

# Generate dataset
npx @ruvector/agentic-synth dataset --count 1000 --schema schema.json --output data.csv
```
