@ruvector/agentic-synth

High-performance synthetic data generator designed for AI/ML training, RAG system evaluation, and agentic workflow testing. Generates realistic text, embeddings, Q&A pairs, conversations, and structured datasets.

Quick Reference

Task	Code
Install	`npx @ruvector/agentic-synth@latest`
Create generator	`new SynthGenerator(config)`
Generate QA pairs	`gen.generateQA(documents, options)`
Generate embeddings	`gen.generateEmbeddings(options)`
Generate conversations	`gen.generateConversations(options)`
Generate dataset	`gen.generateDataset(options)`

Installation

bash

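# Add the library to a project (required for the imports used below)
npm install @ruvector/agentic-synth

# Or run the CLI directly without installing
npx @ruvector/agentic-synth@latest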

Quick Start

typescript

import {
  SynthGenerator,
  QAGenerator,
  ConversationGenerator,
  DatasetGenerator,
} from '@ruvector/agentic-synth';

const gen = new SynthGenerator({ seed: 42 });

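// Source documents to generate questions from (placeholder content)
const documents: string[] = [
  'The client retries failed requests up to 3 times.',
];

// Generate Q&A pairs from documents (for RAG evaluation)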
const qaPairs = await gen.generateQA(documents, {
  count: 100,
  difficulty: 'mixed',
  includeNegatives: true,
});

console.log(qaPairs[0]);
// { question: "What is the max retry count?",
//   answer: "The max retry count is 3",
//   context: "...",
//   difficulty: "easy",
//   isNegative: false }

// Generate synthetic embeddings (for testing vector search)
const embeddings = gen.generateEmbeddings({
  count: 10_000,
  dimensions: 1536,
  clusters: 50,
  noise: 0.1,
});

// Generate agent conversations (for training/eval)
const conversations = await gen.generateConversations({
  count: 50,
  turns: 5,
  agents: ['user', 'assistant'],
  topics: ['coding', 'debugging', 'architecture'],
});

// Generate structured dataset
const dataset = gen.generateDataset({
  count: 1000,
  schema: {
    name: { type: 'name' },
    email: { type: 'email' },
    score: { type: 'float', min: 0, max: 1 },
    category: { type: 'enum', values: ['A', 'B', 'C'] },
    embedding: { type: 'vector', dimensions: 384 },
  },
});

Core API

SynthGenerator

Main generator with all synthesis capabilities.

typescript

new SynthGenerator(config?: SynthConfig): SynthGenerator

SynthConfig:

Parameter	Type	Default	Description
`seed`	`number`	random	Reproducibility seed
`locale`	`string`	`'en'`	Data locale
`model`	`string`	-	LLM for text generation
`batchSize`	`number`	`100`	Generation batch size
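
To make the `seed` parameter concrete, the sketch below shows what a reproducibility seed usually implies: the same seed yields the same pseudo-random sequence. It is a standalone illustration using the well-known mulberry32 generator; `mulberry32` and `sample` are names chosen here, and nothing in this README says which generator the library uses internally.

```typescript
// Illustrative seeded PRNG (mulberry32). This is an assumption for the
// example, not the generator used inside @ruvector/agentic-synth.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Draw `n` values from a fresh generator created with `seed`.
function sample(seed: number, n: number): number[] {
  const rand = mulberry32(seed);
  return Array.from({ length: n }, () => rand());
}

const runA = sample(42, 5);
const runB = sample(42, 5);
const runC = sample(43, 5);

// Same seed gives an identical sequence; a different seed gives a different one.
const sameSeedMatches = runA.every((value, i) => value === runB[i]);
const differentSeedDiffers = runA.some((value, i) => value !== runC[i]);
console.log({ sameSeedMatches, differentSeedDiffers });
```

In practice this is why a fixed `seed` is useful in test suites: a failing case can be regenerated exactly.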

gen.generateQA(documents, options)

Generate question-answer pairs from source documents for RAG evaluation.

typescript

gen.generateQA(
  documents: string[],
  options: QAOptions
): Promise<QAPair[]>

QAOptions:

Parameter	Type	Default	Description
`count`	`number`	`100`	Number of pairs
`difficulty`	`'easy' | 'medium' | 'hard' | 'mixed'`	`'mixed'`	Question difficulty
`includeNegatives`	`boolean`	`false`	Include unanswerable questions
`negativeRatio`	`number`	`0.2`	Ratio of negatives
`chunkSize`	`number`	`512`	Context chunk size

QAPair:

Field	Type	Description
`question`	`string`	Generated question
`answer`	`string`	Expected answer
`context`	`string`	Source context chunk
`difficulty`	`string`	Difficulty level
`isNegative`	`boolean`	Whether unanswerable
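
As a rough sketch of how these fields might be used in a RAG evaluation, the example below scores a system separately on answerable questions and on negatives. The `QAPair` interface mirrors the table above. Everything else is hypothetical: `AnswerFn`, `scoreQA`, and `toySystem` are illustrative names, the sample pairs are hand-written rather than generator output, and the empty `answer` on the negative pair is an assumption, since this README does not say what the library puts there.

```typescript
// Shape taken from the QAPair table above.
interface QAPair {
  question: string;
  answer: string;
  context: string;
  difficulty: string;
  isNegative: boolean;
}

// Hypothetical system under test: returns an answer, or null to abstain.
type AnswerFn = (question: string) => string | null;

interface Score {
  answerableAccuracy: number; // exact-match rate on answerable questions
  abstainRate: number; // share of negatives the system declined to answer
}

function normalizeText(text: string): string {
  return text.trim().toLowerCase();
}

function scoreQA(pairs: QAPair[], answerFn: AnswerFn): Score {
  let answerable = 0;
  let correct = 0;
  let negatives = 0;
  let abstained = 0;
  for (const pair of pairs) {
    const predicted = answerFn(pair.question);
    if (pair.isNegative) {
      negatives++;
      if (predicted === null) abstained++;
    } else {
      answerable++;
      if (predicted !== null && normalizeText(predicted) === normalizeText(pair.answer)) {
        correct++;
      }
    }
  }
  return {
    answerableAccuracy: answerable > 0 ? correct / answerable : 0,
    abstainRate: negatives > 0 ? abstained / negatives : 0,
  };
}

// Hand-written sample pairs for illustration only (not generator output).
const samplePairs: QAPair[] = [
  { question: 'What is the max retry count?', answer: 'The max retry count is 3',
    context: 'Retries are capped at 3.', difficulty: 'easy', isNegative: false },
  { question: 'What is the default timeout?', answer: '30 seconds',
    context: 'Requests time out after 30 seconds.', difficulty: 'easy', isNegative: false },
  { question: 'Who wrote the scheduler?', answer: '', // assumed empty for negatives
    context: 'Retries are capped at 3.', difficulty: 'hard', isNegative: true },
];

// Toy system: knows two answers (one of them wrong) and abstains otherwise.
const toyAnswers = new Map<string, string>([
  ['What is the max retry count?', 'the max retry count is 3'],
  ['What is the default timeout?', '10 seconds'],
]);
const toySystem: AnswerFn = (question) => toyAnswers.get(question) ?? null;

const qaScore = scoreQA(samplePairs, toySystem);
console.log(qaScore); // { answerableAccuracy: 0.5, abstainRate: 1 }
```

Exact match is the simplest possible metric; a real evaluation would likely swap in a fuzzier comparison.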

gen.generateEmbeddings(options)

Generate synthetic embedding vectors with cluster structure.

typescript

gen.generateEmbeddings(options: EmbeddingOptions): EmbeddingDataset

EmbeddingOptions:

Parameter	Type	Default	Description
`count`	`number`	`1000`	Number of vectors
`dimensions`	`number`	`384`	Vector dimensions
`clusters`	`number`	`10`	Number of clusters
`noise`	`number`	`0.1`	Gaussian noise level
`normalize`	`boolean`	`true`	L2 normalize

EmbeddingDataset:

Field	Type	Description
`vectors`	`Float32Array[]`	Generated embeddings
`labels`	`number[]`	Cluster assignments
`centroids`	`Float32Array[]`	Cluster centers
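
To illustrate how `clusters`, `noise`, and `normalize` could interact, here is a minimal standalone sketch of one common way to build clustered synthetic embeddings: pick random unit centroids, add Gaussian noise around them, then L2-normalize. This is an assumption about the general technique, not the library's actual algorithm; `sketchClusteredEmbeddings` and its helpers are hypothetical names, and the sketch uses unseeded `Math.random` for brevity.

```typescript
// Shape taken from the EmbeddingDataset table above.
interface EmbeddingDataset {
  vectors: Float32Array[];
  labels: number[];
  centroids: Float32Array[];
}

// Standard normal sample via the Box-Muller transform.
function gaussian(): number {
  const u = 1 - Math.random(); // in (0, 1], so Math.log(u) is finite
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Scale a vector to unit L2 norm (zero vectors are returned unchanged).
function l2Normalize(vec: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vec.length; i++) sumSquares += vec[i] * vec[i];
  const norm = Math.sqrt(sumSquares);
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

// Vector of independent Gaussian components with the given standard deviation.
function randomVector(dimensions: number, scale: number): Float32Array {
  const vec = new Float32Array(dimensions);
  for (let i = 0; i < dimensions; i++) vec[i] = scale * gaussian();
  return vec;
}

function sketchClusteredEmbeddings(
  count: number,
  dimensions: number,
  clusters: number,
  noise: number,
  normalize = true,
): EmbeddingDataset {
  const centroids: Float32Array[] = [];
  for (let c = 0; c < clusters; c++) {
    centroids.push(l2Normalize(randomVector(dimensions, 1)));
  }
  const vectors: Float32Array[] = [];
  const labels: number[] = [];
  for (let i = 0; i < count; i++) {
    const label = i % clusters; // round-robin keeps clusters balanced
    const offset = randomVector(dimensions, noise);
    const point = centroids[label].map((x, d) => x + offset[d]);
    vectors.push(normalize ? l2Normalize(point) : point);
    labels.push(label);
  }
  return { vectors, labels, centroids };
}

const demo = sketchClusteredEmbeddings(200, 64, 4, 0.1);
console.log(demo.vectors.length, demo.vectors[0].length, demo.centroids.length); // 200 64 4
```

With a known `labels` array, a vector index under test can be checked by asking whether nearest neighbours tend to share a cluster label.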

gen.generateConversations(options)

Generate multi-turn agent conversations.

typescript

gen.generateConversations(options: ConversationOptions): Promise<Conversation[]>

ConversationOptions:

Parameter	Type	Default	Description
`count`	`number`	`10`	Number of conversations
`turns`	`number`	`5`	Turns per conversation
`agents`	`string[]`	`['user', 'assistant']`	Participant roles
`topics`	`string[]`	`['general']`	Conversation topics
`style`	`'formal' | 'casual' | 'technical'`	`'technical'`	Conversation style

Conversation:

Field	Type	Description
`id`	`string`	Conversation ID
`turns`	`Turn[]`	`[{ role, content, timestamp }]`
`topic`	`string`	Topic label
`metadata`	`Record<string, unknown>`	Extra metadata
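
Since conversations are meant for training and evaluation, one plausible next step is turning each one into prompt/response examples. The sketch below assumes the `Conversation` and `Turn` shapes from the table above. The `timestamp` type is a guess because the table does not specify it, `toTrainingExamples` and `TrainingExample` are hypothetical names, and the sample conversation is hand-written.

```typescript
// Shapes taken from the Conversation table above.
interface Turn {
  role: string;
  content: string;
  timestamp: number | string; // assumption: the table does not give a type
}

interface Conversation {
  id: string;
  turns: Turn[];
  topic: string;
  metadata: Record<string, unknown>;
}

// One example per target-role turn; the prompt is everything said before it.
interface TrainingExample {
  conversationId: string;
  prompt: string;
  response: string;
}

function toTrainingExamples(conv: Conversation, targetRole = 'assistant'): TrainingExample[] {
  const examples: TrainingExample[] = [];
  const history: string[] = [];
  for (const turn of conv.turns) {
    if (turn.role === targetRole && history.length > 0) {
      examples.push({
        conversationId: conv.id,
        prompt: history.join('\n'),
        response: turn.content,
      });
    }
    history.push(`${turn.role}: ${turn.content}`);
  }
  return examples;
}

// Hand-written sample for illustration only (not generator output).
const sampleConversation: Conversation = {
  id: 'conv-1',
  topic: 'debugging',
  metadata: {},
  turns: [
    { role: 'user', content: 'My test fails with a timeout.', timestamp: 0 },
    { role: 'assistant', content: 'Which test runner are you using?', timestamp: 1 },
    { role: 'user', content: 'Jest.', timestamp: 2 },
    { role: 'assistant', content: 'Try raising the per-test timeout.', timestamp: 3 },
  ],
};

const trainingExamples = toTrainingExamples(sampleConversation);
console.log(trainingExamples.length); // 2
```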

gen.generateDataset(options)

Generate structured tabular data.

typescript

gen.generateDataset(options: DatasetOptions): Record<string, unknown>[]

DatasetOptions:

Parameter	Type	Default	Description
`count`	`number`	`1000`	Row count
`schema`	`SchemaSpec`	required	Column definitions

Schema field types:

Type	Parameters	Description
`'name'`	-	Random person name
`'email'`	-	Random email
`'text'`	`{ minLength?, maxLength? }`	Random text
`'int'`	`{ min?, max? }`	Random integer
`'float'`	`{ min?, max? }`	Random float
`'enum'`	`{ values: string[] }`	Random from set
`'bool'`	`{ probability? }`	Random boolean
`'date'`	`{ from?, to? }`	Random date
`'vector'`	`{ dimensions }`	Random vector
`'uuid'`	-	Random UUID
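
The following sketch shows how a schema of this kind can drive row generation, using only four of the field types listed above (`'int'`, `'float'`, `'enum'`, `'bool'`). It is a standalone illustration, not the library's implementation: `sketchDataset` and `generateField` are hypothetical names, and the fallback defaults (such as `0` to `100` for integers, an inclusive integer `max`, and `0.5` for `probability`) are assumptions that the table does not document.

```typescript
// Subset of the schema field types from the table above.
type FieldSpec =
  | { type: 'int'; min?: number; max?: number }
  | { type: 'float'; min?: number; max?: number }
  | { type: 'enum'; values: string[] }
  | { type: 'bool'; probability?: number };

type SchemaSpec = Record<string, FieldSpec>;
type Row = Record<string, unknown>;

// Produce one value for a field. `rand` must return values in [0, 1).
function generateField(spec: FieldSpec, rand: () => number): unknown {
  switch (spec.type) {
    case 'int': {
      const min = spec.min ?? 0; // assumed default
      const max = spec.max ?? 100; // assumed default, treated as inclusive
      return min + Math.floor(rand() * (max - min + 1));
    }
    case 'float': {
      const min = spec.min ?? 0; // assumed default
      const max = spec.max ?? 1; // assumed default
      return min + rand() * (max - min);
    }
    case 'enum':
      return spec.values[Math.floor(rand() * spec.values.length)];
    case 'bool':
      return rand() < (spec.probability ?? 0.5); // assumed default
  }
}

// Build `count` rows, generating every column defined in the schema.
function sketchDataset(
  count: number,
  schema: SchemaSpec,
  rand: () => number = Math.random,
): Row[] {
  return Array.from({ length: count }, () => {
    const row: Row = {};
    for (const [name, spec] of Object.entries(schema)) {
      row[name] = generateField(spec, rand);
    }
    return row;
  });
}

const sketchRows = sketchDataset(5, {
  score: { type: 'float', min: 0, max: 1 },
  category: { type: 'enum', values: ['A', 'B', 'C'] },
  attempts: { type: 'int', min: 1, max: 3 },
  active: { type: 'bool', probability: 0.8 },
});
console.log(sketchRows);
```

Passing a seeded function as `rand` instead of `Math.random` would make the rows reproducible.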

gen.generateText(options)

Generate synthetic text paragraphs.

typescript

gen.generateText(options: TextOptions): Promise<string[]>

TextOptions:

Parameter	Type	Default	Description
`count`	`number`	`10`	Paragraphs
`topic`	`string`	`'general'`	Topic
`minLength`	`number`	`50`	Min words
`maxLength`	`number`	`200`	Max words
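
Because `minLength` and `maxLength` are expressed in words, a quick sanity check on generated paragraphs can be sketched as below. The helper names `wordCount` and `outOfBounds` are hypothetical, and two details are assumptions: words are counted by splitting on whitespace, and both bounds are treated as inclusive.

```typescript
// Count whitespace-separated words; an empty or blank string has zero words.
function wordCount(paragraph: string): number {
  const trimmed = paragraph.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Return the indexes of paragraphs whose word count falls outside
// [minLength, maxLength]. Defaults mirror the TextOptions table above.
function outOfBounds(paragraphs: string[], minLength = 50, maxLength = 200): number[] {
  const flagged: number[] = [];
  paragraphs.forEach((paragraph, index) => {
    const words = wordCount(paragraph);
    if (words < minLength || words > maxLength) flagged.push(index);
  });
  return flagged;
}

// Hand-written strings for illustration only (not generator output).
const sampleParagraphs = ['one two three', 'alpha beta', ''];
const flaggedIndexes = outOfBounds(sampleParagraphs, 3, 10);
console.log(flaggedIndexes); // [ 1, 2 ]
```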

CLI Usage

bash

# Generate QA pairs
npx @ruvector/agentic-synth qa --input docs/ --count 100 --output qa.json

# Generate embeddings
npx @ruvector/agentic-synth embeddings --count 10000 --dims 384 --output embeds.npy

# Generate conversations
npx @ruvector/agentic-synth conversations --count 50 --turns 5 --output convos.json

# Generate dataset
npx @ruvector/agentic-synth dataset --count 1000 --schema schema.json --output data.csv

References

• API Reference
• npm