AgentSkillsCN

@ruvector/ruvllm-cli

本地 LLM 推理 CLI,支持 Metal/CUDA 加速、模型管理以及基准测试。适用于在本地运行 LLM 推理、下载并管理 GGUF 模型、对推理性能进行基准测试、通过 HTTP API 提供模型服务,或在不同量化级别之间对比模型输出。

SKILL.md
--- frontmatter
name: "@ruvector/ruvllm-cli"
description: "CLI for local LLM inference with Metal/CUDA acceleration, model management, and benchmarking. Use when running local LLM inference, downloading and managing GGUF models, benchmarking inference performance, serving models via HTTP API, or comparing model outputs across different quantization levels."

@ruvector/ruvllm-cli

CLI for local LLM inference, benchmarking, and model management. Run quantized models locally with Metal (macOS) and CUDA (Linux/Windows) GPU acceleration. Supports GGUF format models, HTTP serving, and performance benchmarking.

Quick Command Reference

TaskCommand
Show helpnpx @ruvector/ruvllm-cli@latest --help
Run inferencenpx @ruvector/ruvllm-cli@latest run --model <path>
Chat modenpx @ruvector/ruvllm-cli@latest chat --model <path>
Download modelnpx @ruvector/ruvllm-cli@latest download <model-id>
List modelsnpx @ruvector/ruvllm-cli@latest models list
Serve modelnpx @ruvector/ruvllm-cli@latest serve --model <path>
Benchmarknpx @ruvector/ruvllm-cli@latest bench --model <path>
Model infonpx @ruvector/ruvllm-cli@latest info <model-path>

Installation

Install: npx @ruvector/ruvllm-cli@latest See Installation Guide for the full ecosystem.

Core Commands

run

Run inference on a prompt.

bash
npx @ruvector/ruvllm-cli@latest run --model <path> --prompt "Your prompt"

Options: --model <path>, --prompt <text>, --max-tokens <n>, --temperature <f>, --gpu, --threads <n>

chat

Interactive chat mode.

bash
npx @ruvector/ruvllm-cli@latest chat --model <path>

Options: --model <path>, --system <prompt>, --temperature <f>, --gpu

download

Download models from Hugging Face Hub.

bash
npx @ruvector/ruvllm-cli@latest download <model-id> [--quantization q4_k_m]

models

Model management.

bash
npx @ruvector/ruvllm-cli@latest models list
npx @ruvector/ruvllm-cli@latest models info <name>
npx @ruvector/ruvllm-cli@latest models delete <name>

serve

Serve model via HTTP API (OpenAI-compatible).

bash
npx @ruvector/ruvllm-cli@latest serve --model <path> --port 8080

bench

Benchmark inference performance.

bash
npx @ruvector/ruvllm-cli@latest bench --model <path> [--iterations <n>]

Common Patterns

Quick Chat

bash
npx @ruvector/ruvllm-cli@latest download TheBloke/Llama-2-7B-GGUF --quantization q4_k_m
npx @ruvector/ruvllm-cli@latest chat --model llama-2-7b-q4_k_m.gguf --gpu

API Server

bash
npx @ruvector/ruvllm-cli@latest serve --model ./model.gguf --port 8080 --gpu
# Then: curl http://localhost:8080/v1/chat/completions -d '{"messages": [...]}'

RAN DDD Context

Bounded Context: Learning

References