ruvector-ruvllm-wasm

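WASM bindings for browser-based LLM inference, with WebGPU acceleration, quantized model loading, and streaming generation. Use it to run LLM inference in the browser, deploy language models to edge devices, build offline-capable AI chat systems, or add client-side text generation to web applications.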

SKILL.md
---
name: "ruvector-ruvllm-wasm"
description: "WASM bindings for browser-based LLM inference with WebGPU acceleration, quantized model loading, and streaming generation. Use when running LLM inference in browsers, deploying language models to edge devices, building offline-capable AI chat, or adding client-side text generation to web applications."
---

@ruvector/ruvllm-wasm

WebAssembly bindings for browser-native LLM inference, enabling text generation, embeddings, and streaming completions directly in the browser with WebGPU acceleration and quantized model support.

Quick Reference

| Task | Code |
|---|---|
| Install | `npx @ruvector/ruvllm-wasm@latest` |
| Import (Node) | `import { WasmLLM } from '@ruvector/ruvllm-wasm';` |
| Import (Browser) | `import init, { WasmLLM } from '@ruvector/ruvllm-wasm'; await init();` |
| Create | `const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4' });` |
| Generate | `const text = await llm.generate('Hello');` |
| Stream | `for await (const tok of llm.stream(prompt)) { ... }` |

Installation

Install: `npx @ruvector/ruvllm-wasm@latest`. See the Installation Guide for the full ecosystem.

Key API

WasmLLM

Main WASM-accelerated LLM inference engine.

Node.js:

typescript
import { WasmLLM } from '@ruvector/ruvllm-wasm';

const llm = new WasmLLM({
  model: 'tinyllama-1.1b-q4',
  maxTokens: 256,
});

Browser:

typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init(); // Initialize WASM module

const llm = new WasmLLM({
  model: 'tinyllama-1.1b-q4',
  maxTokens: 256,
  webgpu: true,
});
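
Hard-coding `webgpu: true` may not suit browsers that lack WebGPU. The sketch below derives the flag from a feature check instead. `supportsWebGPU` and `buildOptions` are hypothetical helpers, not part of the package, and how `WasmLLM` behaves when `webgpu` is requested but unavailable is not documented here.

```typescript
// Minimal shape of the object we probe. WebGPU is exposed as `navigator.gpu`.
interface NavigatorLike {
  gpu?: unknown;
}

// Returns true when the environment advertises WebGPU.
function supportsWebGPU(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
}

// Hypothetical helper: build constructor options with `webgpu` set from detection.
function buildOptions(nav: NavigatorLike | undefined) {
  return {
    model: 'tinyllama-1.1b-q4',
    maxTokens: 256,
    webgpu: supportsWebGPU(nav),
  };
}
```

In a browser this would be used as `new WasmLLM(buildOptions(navigator))`. The presence of `navigator.gpu` does not guarantee a usable adapter, since `navigator.gpu.requestAdapter()` can still resolve to `null`, so a stricter check may be worth adding.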

Constructor Options:

| Option | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | required | Model identifier or URL |
| `maxTokens` | `number` | `256` | Maximum generation tokens |
| `temperature` | `number` | `0.7` | Sampling temperature |
| `topP` | `number` | `0.9` | Nucleus sampling threshold |
| `topK` | `number` | `40` | Top-K sampling |
| `repetitionPenalty` | `number` | `1.1` | Repetition penalty |
| `contextLength` | `number` | `2048` | Maximum context window |
| `webgpu` | `boolean` | `false` | Enable WebGPU acceleration |
| `quantization` | `string` | `'q4'` | Quantization: `'q4'`, `'q8'`, `'f16'`, `'f32'` |
| `threads` | `number` | `navigator.hardwareConcurrency` | WASM threads |
| `simd` | `boolean` | `true` | Use WASM SIMD instructions |
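
To make the interaction between `temperature`, `topK`, and `topP` concrete, here is a minimal sketch of the conventional sampling pipeline: temperature-scaled softmax, top-K truncation, then nucleus (top-P) truncation. It illustrates the general technique only. The function names are hypothetical, and the order and details inside the WASM engine may differ.

```typescript
interface SamplingOptions {
  temperature: number; // must be > 0
  topK: number;
  topP: number;
}

interface Candidate {
  id: number; // token id (index into the logits array)
  p: number;  // probability after filtering and renormalization
}

// Apply temperature, top-K, and top-P to raw logits (assumed non-empty).
function filterCandidates(logits: number[], opts: SamplingOptions): Candidate[] {
  // Temperature-scaled softmax; subtract the max for numerical stability.
  const scaled = logits.map((l) => l / opts.temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Rank by probability, highest first, and keep at most topK candidates.
  const ranked = exps
    .map((e, id) => ({ id, p: e / sum }))
    .sort((a, b) => b.p - a.p)
    .slice(0, Math.max(1, opts.topK));

  // Keep the smallest prefix whose cumulative probability reaches topP.
  const kept: Candidate[] = [];
  let cumulative = 0;
  for (const c of ranked) {
    kept.push(c);
    cumulative += c.p;
    if (cumulative >= opts.topP) break;
  }

  // Renormalize so the surviving probabilities sum to 1.
  const keptSum = kept.reduce((a, c) => a + c.p, 0);
  return kept.map((c) => ({ id: c.id, p: c.p / keptSum }));
}

// Draw one token id. `rand` is injectable so a draw can be made deterministic.
function sampleToken(
  logits: number[],
  opts: SamplingOptions,
  rand: () => number = Math.random,
): number {
  let r = rand();
  let last = -1;
  for (const c of filterCandidates(logits, opts)) {
    last = c.id;
    r -= c.p;
    if (r <= 0) return c.id;
  }
  return last; // guards against floating-point rounding
}
```

Lower `temperature` sharpens the distribution before filtering, so fewer candidates are needed to reach the `topP` threshold. `repetitionPenalty` is omitted here because its exact formulation varies between implementations.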

Methods:

| Method | Returns | Description |
|---|---|---|
| `generate(prompt, opts?)` | `Promise<string>` | Generate text completion |
| `stream(prompt, opts?)` | `AsyncIterableIterator<string>` | Stream tokens |
| `embed(text)` | `Promise<Float32Array>` | Generate text embedding |
| `tokenize(text)` | `Uint32Array` | Tokenize text |
| `detokenize(tokens)` | `string` | Convert tokens to text |
| `loadModel(source)` | `Promise<void>` | Load or switch model |
| `getMemoryUsage()` | `MemoryInfo` | WASM memory stats |
| `dispose()` | `void` | Free all WASM memory |
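
Because `stream` is listed as returning an async iterator of strings, it can be consumed with `for await`. The sketch below collects tokens into a single string and stops early at a caller-side cap. `collectStream` is a hypothetical helper, and `mockStream` is a stand-in for `llm.stream(prompt)` so the example runs without the package.

```typescript
// Collect tokens from any async token source into one string,
// stopping after `maxTokens` tokens. `onToken` is called once per token.
async function collectStream(
  source: AsyncIterable<string>,
  maxTokens: number,
  onToken: (token: string) => void = () => {},
): Promise<string> {
  let text = '';
  let count = 0;
  if (maxTokens <= 0) return text;
  for await (const token of source) {
    text += token;
    onToken(token);
    count += 1;
    // Breaking out of for-await calls the iterator's return(),
    // which gives the source a chance to clean up.
    if (count >= maxTokens) break;
  }
  return text;
}

// Stand-in for llm.stream(prompt); used here for illustration only.
async function* mockStream(tokens: string[]): AsyncGenerator<string> {
  for (const t of tokens) yield t;
}
```

With a real engine the call would look like `await collectStream(llm.stream(prompt), 64, (t) => render(t))`, where `render` is whatever UI update the application needs.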

WasmTokenizer

Standalone WASM tokenizer.

typescript
import init, { WasmTokenizer } from '@ruvector/ruvllm-wasm';
await init();

const tokenizer = new WasmTokenizer({ model: 'tinyllama-1.1b-q4' });
const tokens = tokenizer.encode('Hello, world!');
const text = tokenizer.decode(tokens);

Methods:

| Method | Returns | Description |
|---|---|---|
| `encode(text)` | `Uint32Array` | Encode text to tokens |
| `decode(tokens)` | `string` | Decode tokens to text |
| `vocabSize()` | `number` | Vocabulary size |
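
One common use of a standalone tokenizer is checking that a prompt fits the context window before generation. The sketch below keeps the most recent tokens that fit within `contextLength` while reserving room for `maxTokens` of output. `fitToContext` is a hypothetical helper, and the keep-the-tail policy is an assumption, not the engine's documented behavior.

```typescript
// Trim a token sequence so that prompt + generated tokens fit the context window.
// Keeps the most recent tokens (the tail) and drops the oldest ones.
function fitToContext(
  tokens: Uint32Array,
  contextLength: number,
  maxTokens: number,
): Uint32Array {
  const budget = contextLength - maxTokens; // tokens available for the prompt
  if (budget <= 0) {
    throw new RangeError('maxTokens must be smaller than contextLength');
  }
  if (tokens.length <= budget) return tokens;
  // subarray returns a view over the same memory; no copy is made.
  return tokens.subarray(tokens.length - budget);
}
```

With the defaults listed for `WasmLLM` (`contextLength` 2048, `maxTokens` 256) the prompt budget is 1792 tokens, so `fitToContext(tokenizer.encode(prompt), 2048, 256)` would pass through anything shorter unchanged.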

WasmEmbedder

Dedicated embedding generator.

typescript
import init, { WasmEmbedder } from '@ruvector/ruvllm-wasm';
await init();

const embedder = new WasmEmbedder({ model: 'all-minilm-l6-q4' });
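const embedding = await embedder.embed('search query');
const other = await embedder.embed('another query');

// Compare two embeddings produced by the same model
const similarity = embedder.cosineSimilarity(embedding, other);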

Constructor Options:

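| Option | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | required | Embedding model identifier |
| `dim` | `number` | model default | Output embedding dimension |
| `normalize` | `boolean` | `true` | L2-normalize embeddings |
| `pooling` | `string` | `'mean'` | Pooling: `'mean'`, `'cls'`, `'max'` |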
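
For readers unfamiliar with the `pooling` and `normalize` options, the sketch below shows what mean pooling and L2 normalization conventionally compute. These are plain TypeScript illustrations with hypothetical names, not the package's internal code.

```typescript
// Mean pooling: average per-token vectors into a single sentence vector.
function meanPool(tokenVectors: Float32Array[]): Float32Array {
  if (tokenVectors.length === 0) {
    throw new Error('need at least one token vector');
  }
  const dim = tokenVectors[0].length;
  const out = new Float32Array(dim);
  for (const v of tokenVectors) {
    for (let i = 0; i < dim; i++) out[i] += v[i];
  }
  for (let i = 0; i < dim; i++) out[i] /= tokenVectors.length;
  return out;
}

// L2 normalization: scale a vector to unit length.
function l2Normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  if (norm === 0) return new Float32Array(v); // leave the zero vector unchanged
  return v.map((x) => x / norm);
}
```

By comparison, `'cls'` pooling conventionally takes the first token's vector and `'max'` takes the element-wise maximum. Normalizing to unit length makes cosine similarity equal to a plain dot product.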

Methods:

| Method | Returns | Description |
|---|---|---|
| `embed(text)` | `Promise<Float32Array>` | Single text embedding |
| `embedBatch(texts)` | `Promise<Float32Array[]>` | Batch embeddings |
| `cosineSimilarity(a, b)` | `number` | Cosine similarity |
| `dispose()` | `void` | Free WASM memory |
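
As a reference for what `cosineSimilarity(a, b)` is expected to return, here is the standard formula, dot(a, b) / (|a| |b|), written in plain TypeScript. This is a sketch of the definition rather than the package's implementation. Returning 0 for a zero-length vector is a choice made here, not documented behavior.

```typescript
// Cosine similarity between two vectors of equal dimension.
// Result is in [-1, 1]; 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error('dimension mismatch');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

When both inputs are already L2-normalized, which is the default according to the `normalize` option, the denominator is 1 and the result reduces to the dot product.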

Common Patterns

Browser Chat Interface

typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init();
const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4', webgpu: true });

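// Example prompt; in a real app, read this from an input field
const userInput = 'Explain WebAssembly in one sentence.';

const outputEl = document.getElementById('output');
if (!outputEl) throw new Error('Missing #output element');

// Append each token to the page as it arrives
for await (const token of llm.stream(userInput)) {
  outputEl.textContent += token;
}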

Offline-Capable PWA

typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init();

// Cache the model with the Cache Storage API on first load
const cache = await caches.open('llm-models');
let modelData = await cache.match('/models/tinyllama-q4.bin');
if (!modelData) {
  modelData = await fetch('/models/tinyllama-q4.bin');
  // Do not cache a failed download
  if (!modelData.ok) throw new Error(`Model download failed: ${modelData.status}`);
  await cache.put('/models/tinyllama-q4.bin', modelData.clone());
}

const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4' });
await llm.loadModel(await modelData.arrayBuffer());
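
The cache-first logic above can be pulled into a reusable function with the cache and fetch passed in, which makes it testable outside a browser. This is a sketch. `loadModelBytes`, `CacheLike`, and `ResponseLike` are hypothetical names that mirror the parts of the browser `Cache` and `Response` objects used here.

```typescript
// The subset of Response this helper relies on.
interface ResponseLike {
  ok: boolean;
  clone(): ResponseLike;
  arrayBuffer(): Promise<ArrayBuffer>;
}

// The subset of Cache this helper relies on.
interface CacheLike {
  match(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}

// Return the model bytes from the cache when present;
// otherwise download them, store a copy, and return them.
async function loadModelBytes(
  url: string,
  cache: CacheLike,
  fetchFn: (url: string) => Promise<ResponseLike>,
): Promise<ArrayBuffer> {
  const cached = await cache.match(url);
  if (cached) return cached.arrayBuffer();

  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`failed to fetch model: ${url}`);
  }
  // A response body can be read only once, so cache a clone.
  await cache.put(url, response.clone());
  return response.arrayBuffer();
}
```

In a browser the call would be `await llm.loadModel(await loadModelBytes('/models/tinyllama-q4.bin', await caches.open('llm-models'), fetch))`, assuming, as the snippet above does, that `loadModel` accepts an `ArrayBuffer`.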

Semantic Search in Browser

typescript
import init, { WasmEmbedder } from '@ruvector/ruvllm-wasm';

await init();
const embedder = new WasmEmbedder({ model: 'all-minilm-l6-q4' });

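// Example inputs; replace with your own corpus and query
const documents = ['How to bake bread', 'Intro to WebAssembly', 'GPU programming basics'];
const searchQuery = 'running code in the browser';

const docEmbeddings = await embedder.embedBatch(documents);
const queryEmb = await embedder.embed(searchQuery);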

const scores = docEmbeddings.map((emb, i) => ({
  index: i,
  score: embedder.cosineSimilarity(queryEmb, emb),
}));
scores.sort((a, b) => b.score - a.score);
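
After scoring, a search UI usually needs only the best few matches, mapped back to their source text. The sketch below takes the documents and their similarity scores and returns the top results above a threshold. `topHits` and `SearchHit` are hypothetical names, and the default threshold of -1 (the minimum possible cosine similarity) simply disables filtering.

```typescript
interface SearchHit {
  index: number; // position of the document in the input array
  score: number; // similarity to the query
  text: string;  // the document itself
}

// Return up to `limit` documents with score >= minScore, best first.
function topHits(
  documents: string[],
  scores: number[],
  limit: number,
  minScore: number = -1,
): SearchHit[] {
  return scores
    .map((score, index) => ({ index, score, text: documents[index] }))
    .filter((hit) => hit.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

The `scores` array here would come from mapping `embedder.cosineSimilarity(queryEmb, emb)` over the document embeddings, in the same order as `documents`.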

RAN DDD Context

Bounded Context: Learning
