local-llm

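Manage local Ollama LLM models for development and testing. Use this skill when you need to run local models, configure Ollama, switch between speed-first and quality-first models, optimize VRAM usage, troubleshoot model performance, create Modelfiles, or integrate local LLMs into applications through the OpenAI-compatible API. Triggers: Ollama, local model, local LLM, VRAM, GPU memory, model speed, inference, Modelfile, llama.cpp, quantization.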

SKILL.md
--- frontmatter
name: local-llm
description: "Manage local Ollama LLM models for development and testing. Use when: running local models, configuring Ollama, switching between fast/quality models, optimizing VRAM usage, troubleshooting model performance, creating Modelfiles, or integrating local LLMs with applications via OpenAI-compatible API. Triggers: ollama, local model, local LLM, VRAM, GPU memory, model speed, inference, Modelfile, llama.cpp, quantization."

Local LLM Management

This skill provides comprehensive guidance for managing local LLM models via Ollama for development, testing, and production use.

Overview

Ollama provides a simple way to run large language models locally with:

  • Easy model management (pull, run, stop)
  • OpenAI-compatible API at http://localhost:11434/v1
  • GPU acceleration with automatic VRAM management
  • Custom model creation via Modelfiles

Two-Tier Model Strategy

For efficient development workflows, use a two-tier approach:

| Tier | Purpose | Model Size | Speed | Use Case |
| --- | --- | --- | --- | --- |
| Fast | Iteration | 3-4B params | 30+ tok/s | Sanity tests, CI/CD, rapid prototyping |
| Quality | Validation | 7-14B params | 5-15 tok/s | E2E testing, complex reasoning, final QA |
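
One way to wire the two tiers into test tooling is a small lookup from task type to tier, and from tier to model tag. The sketch below is illustrative only: the task names, the `LLM_FAST_MODEL` and `LLM_QUALITY_MODEL` environment variables, and the default tags are assumptions made for this example, not anything Ollama defines.

```python
import os

# Hypothetical task names mapped to the tiers from the table above.
TASK_TIER = {
    "sanity": "fast",
    "ci": "fast",
    "prototype": "fast",
    "e2e": "quality",
    "reasoning": "quality",
    "final-qa": "quality",
}


def model_for_task(task, env=None):
    """Return the model tag for a task; unknown tasks fall back to the fast tier."""
    env = os.environ if env is None else env
    tier = TASK_TIER.get(task, "fast")
    if tier == "fast":
        # Placeholder default tag; replace with a model you have pulled.
        return env.get("LLM_FAST_MODEL", "my-fast-model")
    return env.get("LLM_QUALITY_MODEL", "my-quality-model")
```

Defaulting unknown tasks to the fast tier keeps accidental typos cheap; a stricter setup might raise an error instead.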

Recommended Model Combinations

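| GPU VRAM | Fast Model | Quality Model |
| --- | --- | --- |
| 4 GB | Qwen2.5 3B Q4 | Phi-3 Mini 3.8B Q4 |
| 6 GB | Qwen3 4B Q4 | Llama 3.1 8B Q4 |
| 8 GB | Llama 3.2 3B Q8 | DeepSeek-R1 8B Q4 |
| 12+ GB | Qwen2.5 7B Q4 | Qwen2.5 14B Q4 |

Model families ship only in specific sizes, so check the exact tag in the Ollama library before pulling.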

Create custom models from the templates in the templates/ directory.

Quick Reference Commands

Running Models

bash
# Interactive chat
ollama run <model-name>

# Single prompt (non-interactive)
echo "Your prompt" | ollama run <model-name>

# With verbose output (shows speed metrics)
echo "Your prompt" | ollama run <model-name> --verbose

Model Management

bash
# List all models
ollama list

# Check currently loaded models (VRAM usage)
ollama ps

# Unload model to free VRAM
ollama stop <model-name>

# View model details
ollama show <model-name>

# View Modelfile configuration
ollama show <model-name> --modelfile

# Delete a model
ollama rm <model-name>

# Pull/update a model
ollama pull <model-name>

Service Management

bash
# Check Ollama service status
systemctl status ollama

# Restart Ollama service
sudo systemctl restart ollama

# View Ollama logs
journalctl -u ollama -f

# Manual start (if service not running)
ollama serve

OpenAI-Compatible API

Ollama provides an OpenAI-compatible API at http://localhost:11434/v1.

Python Integration

python
import openai

# Configure client for local Ollama
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not validated locally
)

# Chat completion
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Your prompt"}],
    temperature=0.6
)
print(response.choices[0].message.content)

Environment Variables

bash
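# Standard configuration
export OLLAMA_HOST=http://localhost:11434
# Read by the openai Python client (v1 and later)
export OPENAI_BASE_URL=http://localhost:11434/v1
# Legacy name, still read by some older tools
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
# Not read by the SDK itself; your application must pass it as the model name
export OPENAI_MODEL=your-model-name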
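
An application can read these variables with sensible fallbacks. The following is a minimal sketch, assuming `OLLAMA_HOST` holds a full URL as in the example above; `load_llm_settings` is a hypothetical helper name, and `OPENAI_MODEL` is treated as an application-level convention rather than something the OpenAI SDK reads.

```python
import os


def load_llm_settings(env=None):
    """Collect connection settings from environment variables with local defaults."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    # Prefer OPENAI_BASE_URL (read by the v1 openai client), then the legacy name.
    base_url = env.get("OPENAI_BASE_URL") or env.get("OPENAI_API_BASE") or host + "/v1"
    return {
        "host": host,
        "base_url": base_url,
        "api_key": env.get("OPENAI_API_KEY", "ollama"),
        "model": env.get("OPENAI_MODEL", "your-model-name"),
    }
```

The returned `base_url`, `api_key`, and `model` values can be passed straight to the client constructor and request shown under Python Integration.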

cURL Examples

bash
# Test API availability
curl http://localhost:11434/v1/models

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
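
The same chat completion call can be made from Python without installing the `openai` package. This is a sketch using only the standard library; `build_chat_request` and `send_chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request


def build_chat_request(model, prompt, base_url="http://localhost:11434/v1"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With Ollama running, `send_chat(build_chat_request("your-model-name", "Hello"))` should return the reply as a string. Separating request construction from sending makes the payload easy to inspect in tests.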

Creating Custom Models

Custom models are created using Modelfiles. See templates/ for ready-to-use templates.

Modelfile Structure

dockerfile
# Base model
FROM model-name:tag

# Parameters
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1

# System prompt
SYSTEM """Your system prompt here"""

# Stop sequences
PARAMETER stop "<|end|>"

Create Model from Modelfile

bash
ollama create my-model -f /path/to/Modelfile
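
When several variants of a model are needed, the Modelfile text can be generated rather than hand-edited. The sketch below assumes the structure shown above (FROM, PARAMETER, SYSTEM, stop sequences); `render_modelfile` is a hypothetical helper, and it does not escape quotes inside the system prompt or stop strings.

```python
def render_modelfile(base, params=None, system=None, stops=()):
    """Render Modelfile text from a base model, parameters, system prompt, and stop sequences."""
    lines = [f"FROM {base}"]
    for name, value in (params or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes allow multi-line system prompts.
        lines.append(f'SYSTEM """{system}"""')
    for stop in stops:
        lines.append(f'PARAMETER stop "{stop}"')
    return "\n".join(lines) + "\n"
```

Write the result to a file and pass its path to `ollama create my-model -f`, as in the command above.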

Key Parameters

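| Parameter | Description | Default | Recommended Range |
| --- | --- | --- | --- |
| temperature | Randomness (0 = deterministic) | 0.8 | 0.3-0.7 for code, 0.6-0.8 for chat |
| top_p | Nucleus sampling threshold | 0.9 | 0.9-0.95 |
| top_k | Token selection pool size | 40 | 20-100 |
| num_ctx | Context window (tokens) | 2048 (version dependent) | 4096-8192 (VRAM dependent) |
| repeat_penalty | Penalty for repeated tokens | 1.1 | 1.0-1.2 |
| num_gpu | Layers offloaded to the GPU (-1 = chosen automatically) | -1 | -1 unless you need to force a layer count |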
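
These parameters do not have to be baked into a Modelfile. Ollama's native `/api/chat` endpoint accepts an `options` object using the same parameter names, which is convenient for trying values before committing them. The sketch below only builds the request payload; `chat_payload` and the `KNOWN_OPTIONS` allow-list are assumptions made for this example.

```python
# Parameter names from the table above; anything else is rejected so typos surface early.
KNOWN_OPTIONS = {"temperature", "top_p", "top_k", "num_ctx", "repeat_penalty", "num_gpu"}


def chat_payload(model, prompt, **options):
    """Build a JSON-serializable body for POST /api/chat with per-request options."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": options,
    }
```

For example, `chat_payload("your-model-name", "Hello", temperature=0.3, num_ctx=4096)` overrides two values for that request only.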

Performance Optimization

VRAM Guidelines

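| GPU VRAM | Max Model (Q4) | Recommended Context |
| --- | --- | --- |
| 4 GB | 3-4B | 2048-4096 |
| 6 GB | 7-8B | 4096-8192 |
| 8 GB | 8-9B | 8192 |
| 12 GB | 13-14B | 8192-16384 |
| 24 GB | 30-34B | 16384-32768 |

These limits assume the whole model stays in VRAM. Larger models still run with partial CPU offload, but generation is much slower.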
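
A back-of-the-envelope estimate helps when a model is not listed. The sketch below assumes roughly 4.5 bits per weight for Q4 quantization and a flat 1 GB allowance for the KV cache and runtime buffers; both numbers are assumptions, and real usage varies with the quantization variant, context length, and model architecture.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.0):
    """Very rough VRAM need in GB: quantized weights plus a flat overhead allowance."""
    # params (billions) * bits per weight / 8 bits per byte gives gigabytes of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion, vram_gb, **kwargs):
    """Return True if the estimate fits within the given VRAM budget."""
    return estimate_vram_gb(params_billion, **kwargs) <= vram_gb
```

Under these assumptions an 8B model at Q4 needs about 5.5 GB. Treat the result as a first filter and confirm with `ollama ps` after loading.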

Speed vs Quality Trade-offs

| Adjustment | Speed Impact | Quality Impact |
| --- | --- | --- |
| Smaller model (4B vs 8B) | +3-5x faster | Lower reasoning |
| Lower quantization (Q4 vs Q8) | +20% faster | Slight quality loss |
| Smaller context (4K vs 8K) | +15% faster | Less context awareness |
| Disable thinking mode | +40% faster | No chain-of-thought |

Troubleshooting

Common Issues

Model won't load / CUDA out of memory

bash
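# Unload every loaded model (ollama stop takes one model name at a time)
ollama ps | awk 'NR>1 {print $1}' | xargs -r -n1 ollama stop
sudo systemctl restart ollama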

Slow generation speed

  • Check nvidia-smi for GPU utilization
  • Ensure model fits in VRAM (ollama ps)
  • Use a smaller model or more aggressive quantization (e.g. Q4 instead of Q8)
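
The VRAM check in the list above can also be scripted against the native API. This sketch assumes the `/api/ps` response contains a `models` list whose entries carry `name`, `size`, and `size_vram` fields; verify that shape against your Ollama version. `fetch_ps` and `vram_report` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_ps(base_url="http://localhost:11434", timeout=5):
    """Fetch and parse the list of currently loaded models."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/api/ps", timeout=timeout) as resp:
        return json.load(resp)


def vram_report(ps_response):
    """Return (model name, fraction of the model resident in VRAM) for each loaded model."""
    report = []
    for model in ps_response.get("models", []):
        size = model.get("size", 0)
        in_vram = model.get("size_vram", 0)
        fraction = in_vram / size if size else 0.0
        report.append((model.get("name", "?"), round(fraction, 2)))
    return report
```

A fraction below 1.0 suggests part of the model is running on the CPU, which is a common cause of slow generation.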

API connection refused

bash
curl http://localhost:11434/api/tags
ollama serve  # Start if needed
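
The same availability check can run from Python, for example at the start of a test suite. This is a minimal sketch; `ollama_is_up` is a hypothetical helper, and the `fetch` parameter exists only so the network call can be swapped out in tests.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", fetch=urllib.request.urlopen, timeout=2):
    """Return True if the Ollama server answers /api/tags with HTTP 200."""
    try:
        with fetch(base_url.rstrip("/") + "/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and HTTP errors all count as "not up".
        return False
```

A test runner might skip LLM-dependent tests when this returns False rather than failing them outright.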

Model not found

bash
ollama list
ollama pull model-name

Performance Diagnostics

bash
# Test with verbose output
echo "test" | ollama run model-name --verbose 2>&1 | grep -E "(eval rate|total duration)"

# Monitor GPU during inference
watch -n 1 nvidia-smi
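
Generation speed can also be computed from an API response instead of grepping CLI output. The sketch assumes a non-streaming response from the native `/api/generate` or `/api/chat` endpoint, which reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds); `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(response):
    """Compute generation speed from a parsed native API response."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    # eval_duration is in nanoseconds, so scale the token count by 1e9.
    return response.get("eval_count", 0) * 1e9 / duration_ns
```

Logging this value per request makes it easy to compare models or parameter changes over time.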

Recommended Workflow

  1. Development: Use fast model (4B) for rapid iteration
  2. Pre-commit: Run sanity tests with fast model
  3. CI/CD: Use fast model for pipeline tests
  4. Validation: Run full tests with quality model (overnight if needed)
  5. Release: Final validation with reasoning model

File Locations

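| Item | Path |
| --- | --- |
| Ollama binary | /usr/local/bin/ollama |
| Model storage | ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when running as the systemd service) |
| Service config | /etc/systemd/system/ollama.service |
| Custom Modelfiles | $HOME/ or project directory |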

Available Templates

See the templates/ directory for ready-to-use Modelfiles:

  • fast-model.Modelfile - Quick iteration (4B base)
  • reasoning-model.Modelfile - Quality validation (8B reasoning)
  • code-generation.Modelfile - Code-focused tasks
  • json-output.Modelfile - Structured data generation
  • analysis.Modelfile - Scientific analysis and reasoning