@ruvector/ruvllm-wasm
WebAssembly bindings for browser-native LLM inference, enabling text generation, embeddings, and streaming completions directly in the browser with WebGPU acceleration and quantized model support.
Quick Reference
| Task | Code |
|---|---|
| Install | npm install @ruvector/ruvllm-wasm |
| Import (Node) | import { WasmLLM } from '@ruvector/ruvllm-wasm'; |
| Import (Browser) | import init, { WasmLLM } from '@ruvector/ruvllm-wasm'; await init(); |
| Create | const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4' }); |
| Generate | const text = await llm.generate('Hello'); |
| Stream | for await (const tok of llm.stream(prompt)) { ... } |
Installation
Install: npm install @ruvector/ruvllm-wasm
See Installation Guide for the full ecosystem.
Key API
WasmLLM
Main WASM-accelerated LLM inference engine.
Node.js:
```typescript
import { WasmLLM } from '@ruvector/ruvllm-wasm';

const llm = new WasmLLM({
  model: 'tinyllama-1.1b-q4',
  maxTokens: 256,
});
```
Browser:
```typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init(); // Initialize the WASM module before constructing anything

const llm = new WasmLLM({
  model: 'tinyllama-1.1b-q4',
  maxTokens: 256,
  webgpu: true,
});
```
Constructor Options:
| Option | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model identifier or URL |
| maxTokens | number | 256 | Maximum number of tokens to generate |
| temperature | number | 0.7 | Sampling temperature |
| topP | number | 0.9 | Nucleus sampling threshold |
| topK | number | 40 | Top-K sampling |
| repetitionPenalty | number | 1.1 | Repetition penalty |
| contextLength | number | 2048 | Maximum context window, in tokens |
| webgpu | boolean | false | Enable WebGPU acceleration |
| quantization | string | 'q4' | Quantization: 'q4', 'q8', 'f16', 'f32' |
| threads | number | navigator.hardwareConcurrency | Number of WASM threads |
| simd | boolean | true | Use WASM SIMD instructions |
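The webgpu and threads options depend on what the browser exposes, so it can help to derive them at runtime. The following is a minimal sketch, not part of the package: `detectRuntimeOptions`, `NavigatorLike`, and the `maxThreads` cap are hypothetical names introduced here, and only the option names `webgpu` and `threads` come from the table above.

```typescript
// Hypothetical helper: derive runtime-dependent options from a navigator-like object.
interface NavigatorLike {
  gpu?: unknown;                // present when the browser exposes WebGPU
  hardwareConcurrency?: number; // logical core count, if the browser reports it
}

interface RuntimeOptions {
  webgpu: boolean;
  threads: number;
}

function detectRuntimeOptions(nav: NavigatorLike, maxThreads = 8): RuntimeOptions {
  const cores = nav.hardwareConcurrency ?? 1;
  return {
    // Browsers with WebGPU support expose it as navigator.gpu.
    webgpu: nav.gpu !== undefined,
    // Clamp to [1, maxThreads]; the cap of 8 is an arbitrary assumption.
    threads: Math.max(1, Math.min(cores, maxThreads)),
  };
}
```

In a browser the result could be spread into the constructor options, for example `{ model: 'tinyllama-1.1b-q4', ...detectRuntimeOptions(navigator) }`. The presence of `navigator.gpu` does not guarantee that a usable adapter exists, so a fallback to `webgpu: false` may still be needed.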
Methods:
| Method | Returns | Description |
|---|---|---|
| generate(prompt, opts?) | Promise<string> | Generate a text completion |
| stream(prompt, opts?) | AsyncIterableIterator<string> | Stream tokens as they are generated |
| embed(text) | Promise<Float32Array> | Generate a text embedding |
| tokenize(text) | Uint32Array | Convert text to tokens |
| detokenize(tokens) | string | Convert tokens to text |
| loadModel(source) | Promise<void> | Load or switch the model |
| getMemoryUsage() | MemoryInfo | WASM memory statistics |
| dispose() | void | Free all WASM memory |
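Because dispose() frees WASM memory explicitly, it is easy to leak memory when generation throws. One way to guard against that is a small try/finally wrapper. This is a sketch under assumptions: `withDisposable` and `HasDispose` are hypothetical names, and the only thing taken from the table above is that the object has a `dispose()` method.

```typescript
// Hypothetical shape: anything with a dispose() method, such as the classes documented here.
interface HasDispose {
  dispose(): void;
}

// Runs `use` with the resource and always calls dispose(), even if `use` throws.
async function withDisposable<T extends HasDispose, R>(
  resource: T,
  use: (resource: T) => Promise<R>,
): Promise<R> {
  try {
    return await use(resource);
  } finally {
    resource.dispose();
  }
}
```

With a real instance this might be called as `await withDisposable(llm, (m) => m.generate('Hello'))`. The resource must not be reused after the wrapper returns.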
WasmTokenizer
Standalone WASM tokenizer.
```typescript
import init, { WasmTokenizer } from '@ruvector/ruvllm-wasm';

await init();

const tokenizer = new WasmTokenizer({ model: 'tinyllama-1.1b-q4' });
const tokens = tokenizer.encode('Hello, world!'); // Uint32Array of token IDs
const text = tokenizer.decode(tokens);            // back to a string
```
Methods:
| Method | Returns | Description |
|---|---|---|
| encode(text) | Uint32Array | Encode text to tokens |
| decode(tokens) | string | Decode tokens to text |
| vocabSize() | number | Vocabulary size |
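A common use of a standalone tokenizer is keeping a prompt within the context window before generation. The sketch below assumes only the `encode`/`decode` shapes listed above. `TokenizerLike` and `truncateToBudget` are hypothetical names, and keeping the most recent tokens is one possible policy among several.

```typescript
// Hypothetical structural type matching the encode/decode methods listed above.
interface TokenizerLike {
  encode(text: string): Uint32Array;
  decode(tokens: Uint32Array): string;
}

// Trims a prompt so that prompt tokens plus maxTokens fit in contextLength.
// Keeps the most recent tokens and drops the oldest ones.
function truncateToBudget(
  tokenizer: TokenizerLike,
  prompt: string,
  contextLength: number,
  maxTokens: number,
): string {
  const budget = contextLength - maxTokens;
  if (budget <= 0) {
    throw new RangeError("maxTokens must be smaller than contextLength");
  }
  const tokens = tokenizer.encode(prompt);
  if (tokens.length <= budget) return prompt;
  return tokenizer.decode(tokens.slice(tokens.length - budget));
}
```

With the defaults from the options table (contextLength 2048, maxTokens 256), this leaves 1792 tokens for the prompt.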
WasmEmbedder
Dedicated embedding generator.
```typescript
import init, { WasmEmbedder } from '@ruvector/ruvllm-wasm';

await init();

const embedder = new WasmEmbedder({ model: 'all-minilm-l6-q4' });
const queryEmbedding = await embedder.embed('search query');
const docEmbedding = await embedder.embed('a document to compare against');
const similarity = embedder.cosineSimilarity(queryEmbedding, docEmbedding);
```
Constructor Options:
| Option | Type | Default | Description |
|---|---|---|---|
| model | string | required | Embedding model identifier |
| dim | number | model default | Output embedding dimension |
| normalize | boolean | true | L2-normalize embeddings |
| pooling | string | 'mean' | Pooling: 'mean', 'cls', 'max' |
Methods:
| Method | Returns | Description |
|---|---|---|
| embed(text) | Promise<Float32Array> | Single text embedding |
| embedBatch(texts) | Promise<Float32Array[]> | Batch embeddings |
| cosineSimilarity(a, b) | number | Cosine similarity between two embeddings |
| dispose() | void | Free WASM memory |
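For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain TypeScript version below illustrates the formula and is not the package's implementation, which may differ (for example by assuming pre-normalized inputs). Returning 0 for a zero-length vector is a choice made for this sketch.

```typescript
// Reference implementation of cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new RangeError("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; report 0 instead of dividing by zero.
  return denom === 0 ? 0 : dot / denom;
}
```

When embeddings are already L2-normalized (the default for `normalize` above), the denominator is approximately 1 and the score reduces to the dot product.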
Common Patterns
Browser Chat Interface
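```typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init();
const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4', webgpu: true });

const outputEl = document.getElementById('output');
if (!outputEl) throw new Error('Missing #output element');

const userInput = 'Explain WebAssembly in one sentence.'; // e.g. read from a text field

// Append each token to the page as soon as it is produced
for await (const token of llm.stream(userInput)) {
  outputEl.textContent += token;
}
```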
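A chat UI usually also needs a way to stop generation midway, for example from a "Stop" button. The sketch below works on any `AsyncIterable<string>`, which matches the documented return type of stream(). `renderStream`, `onUpdate`, and `shouldStop` are hypothetical names introduced here, and whether breaking out of the loop also halts work inside the WASM module is not confirmed by this document.

```typescript
// Consumes a token stream, reports the accumulated text, and supports early exit.
async function renderStream(
  tokens: AsyncIterable<string>,
  onUpdate: (textSoFar: string) => void,
  shouldStop: () => boolean = () => false,
): Promise<string> {
  let text = "";
  for await (const token of tokens) {
    // Breaking out of for-await asks the iterator to finish via its return() method.
    if (shouldStop()) break;
    text += token;
    onUpdate(text);
  }
  return text;
}
```

In the browser example above, `onUpdate` could be `(t) => { outputEl.textContent = t; }` and `shouldStop` could read a flag set by a button's click handler.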
Offline-Capable PWA
```typescript
import init, { WasmLLM } from '@ruvector/ruvllm-wasm';

await init();

// Cache the model with the Cache Storage API on first load
const MODEL_URL = '/models/tinyllama-q4.bin';
const cache = await caches.open('llm-models');
let modelData = await cache.match(MODEL_URL);
if (!modelData) {
  modelData = await fetch(MODEL_URL);
  if (!modelData.ok) throw new Error(`Model download failed: ${modelData.status}`);
  await cache.put(MODEL_URL, modelData.clone());
}

const llm = new WasmLLM({ model: 'tinyllama-1.1b-q4' });
await llm.loadModel(await modelData.arrayBuffer());
```
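The cache-first logic above can be separated from the browser APIs so that it is testable outside a browser. This is a sketch under assumptions: `ByteStore` and `loadModelBytes` are hypothetical names, and in a real page the store would wrap `caches.open(...)` while `download` would wrap `fetch` with a status check.

```typescript
// Hypothetical storage abstraction; in a browser it could wrap the Cache Storage API.
interface ByteStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  put(key: string, bytes: ArrayBuffer): Promise<void>;
}

// Returns cached bytes when present; otherwise downloads once and stores the result.
async function loadModelBytes(
  url: string,
  store: ByteStore,
  download: (url: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached;
  const bytes = await download(url);
  // Store only after a successful download so failures are never cached.
  await store.put(url, bytes);
  return bytes;
}
```

The returned buffer can then be passed to loadModel(), as in the example above.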
Semantic Search in Browser
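```typescript
import init, { WasmEmbedder } from '@ruvector/ruvllm-wasm';

await init();
const embedder = new WasmEmbedder({ model: 'all-minilm-l6-q4' });

// Sample inputs; replace with your own corpus and query
const documents = ['WebAssembly runs in the browser.', 'WebGPU accelerates inference.'];
const searchQuery = 'GPU acceleration';

const docEmbeddings = await embedder.embedBatch(documents);
const queryEmb = await embedder.embed(searchQuery);

// Score every document against the query, then rank best-first
const scores = docEmbeddings.map((emb, i) => ({
  index: i,
  score: embedder.cosineSimilarity(queryEmb, emb),
}));
scores.sort((a, b) => b.score - a.score);
```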
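Search results are usually limited to the best few matches above some relevance floor. The helper below is a sketch that operates on `{ index, score }` objects of the same shape as the ranking above. `topK`, `Scored`, and the `minScore` parameter are hypothetical names, not part of the package.

```typescript
// Same shape as the objects built in the ranking above.
interface Scored {
  index: number;
  score: number;
}

// Returns up to k results with score >= minScore, best first, without mutating the input.
function topK(scores: readonly Scored[], k: number, minScore = -Infinity): Scored[] {
  return scores
    .filter((s) => s.score >= minScore) // filter() returns a new array, so sort() is safe
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}
```

The matching document texts can be recovered by mapping each result through its `index`, for example `topK(scores, 3).map((s) => documents[s.index])`.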
RAN DDD Context
Bounded Context: Learning
References
- API reference: see references/commands.md
- Full README
- npm