@ruvector/ruvllm-cli
CLI for local LLM inference, benchmarking, and model management. Run quantized models locally with Metal (macOS) and CUDA (Linux/Windows) GPU acceleration. Supports GGUF format models, HTTP serving, and performance benchmarking.
Quick Command Reference
| Task | Command |
|---|---|
| Show help | npx @ruvector/ruvllm-cli@latest --help |
| Run inference | npx @ruvector/ruvllm-cli@latest run --model <path> |
| Chat mode | npx @ruvector/ruvllm-cli@latest chat --model <path> |
| Download model | npx @ruvector/ruvllm-cli@latest download <model-id> |
| List models | npx @ruvector/ruvllm-cli@latest models list |
| Serve model | npx @ruvector/ruvllm-cli@latest serve --model <path> |
| Benchmark | npx @ruvector/ruvllm-cli@latest bench --model <path> |
| Model info | npx @ruvector/ruvllm-cli@latest info <model-path> |
Installation
Install: npx @ruvector/ruvllm-cli@latest
See Installation Guide for the full ecosystem.
Core Commands
run
Run inference on a prompt.
npx @ruvector/ruvllm-cli@latest run --model <path> --prompt "Your prompt"
Options: --model <path>, --prompt <text>, --max-tokens <n>, --temperature <f>, --gpu, --threads <n>
chat
Interactive chat mode.
npx @ruvector/ruvllm-cli@latest chat --model <path>
Options: --model <path>, --system <prompt>, --temperature <f>, --gpu
download
Download models from Hugging Face Hub.
npx @ruvector/ruvllm-cli@latest download <model-id> [--quantization q4_k_m]
models
Model management.
npx @ruvector/ruvllm-cli@latest models list npx @ruvector/ruvllm-cli@latest models info <name> npx @ruvector/ruvllm-cli@latest models delete <name>
serve
Serve model via HTTP API (OpenAI-compatible).
npx @ruvector/ruvllm-cli@latest serve --model <path> --port 8080
bench
Benchmark inference performance.
npx @ruvector/ruvllm-cli@latest bench --model <path> [--iterations <n>]
Common Patterns
Quick Chat
npx @ruvector/ruvllm-cli@latest download TheBloke/Llama-2-7B-GGUF --quantization q4_k_m npx @ruvector/ruvllm-cli@latest chat --model llama-2-7b-q4_k_m.gguf --gpu
API Server
npx @ruvector/ruvllm-cli@latest serve --model ./model.gguf --port 8080 --gpu
# Then: curl http://localhost:8080/v1/chat/completions -d '{"messages": [...]}'
RAN DDD Context
Bounded Context: Learning
References
- •Command reference: See references/commands.md
- •Full README
- •npm