AgentSkillsCN

ollama-optimizer

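Optimizes Ollama configuration for the current device to get the most out of its performance. When users ask to "optimize Ollama", "configure Ollama", "speed up Ollama", "tune LLM performance", "set up a local LLM", "fix Ollama performance issues", or report "Ollama running slow", or when they want to maximize inference speed, reduce memory usage, or choose the model best suited to their hardware, this tool provides expert guidance. It analyzes the system's hardware (GPU, RAM, CPU) and gives tailored optimization recommendations.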

SKILL.md
---
name: ollama-optimizer
description: Optimize Ollama configuration for maximum performance on the current machine. Use when asked to "optimize Ollama", "configure Ollama", "speed up Ollama", "tune LLM performance", "setup local LLM", "fix Ollama performance", "Ollama running slow", or when users want to maximize inference speed, reduce memory usage, or select appropriate models for their hardware. Analyzes system hardware (GPU, RAM, CPU) and provides tailored recommendations.
---

Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

Workflow

Phase 1: System Detection

Run the detection script to gather hardware information:

bash
python3 scripts/detect_system.py

Parse the JSON output to identify:

  • OS and version
  • CPU model and core count
  • Total RAM / unified memory
  • GPU type, VRAM, and driver version
  • Current Ollama installation and environment variables
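
The parsing step could look roughly like the sketch below. The JSON schema of scripts/detect_system.py is not shown here, so every key name (`os`, `cpu`, `ram_gb`, `gpu`, `ollama`) is a hypothetical placeholder to be adjusted to the script's real output.

```python
import json


def summarize_detection(raw_json: str) -> dict:
    """Pull the fields listed above out of the detection output.

    All key names used here are hypothetical; rename them to match
    whatever scripts/detect_system.py actually emits.
    """
    data = json.loads(raw_json)
    cpu = data.get("cpu") or {}
    gpu = data.get("gpu") or {}  # a missing or null GPU entry means CPU-only
    ollama = data.get("ollama") or {}
    return {
        "os": data.get("os", "unknown"),
        "cpu_model": cpu.get("model", "unknown"),
        "cpu_cores": cpu.get("cores"),
        "ram_gb": data.get("ram_gb"),
        "gpu_type": gpu.get("type"),
        "vram_gb": gpu.get("vram_gb"),
        "has_gpu": bool(gpu),
        "ollama_env": ollama.get("env", {}),
    }


# Made-up sample payload, not real script output.
sample = '{"os": "Linux", "cpu": {"model": "example-cpu", "cores": 8}, "ram_gb": 32, "gpu": null}'
summary = summarize_detection(sample)
print(summary["has_gpu"], summary["cpu_cores"])
```

Treating a null or absent GPU entry as CPU-only keeps the later tier decision from failing on machines without a discrete GPU.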

Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:

Hardware Tier Classification:

| Tier | Criteria | Max Model | Key Optimizations |
|---|---|---|---|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
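
As a rough illustration, the tier lookup can be written as a small function. The tiers above leave gaps (for example between 8 and 10 GB of VRAM); this sketch assumes such values fall into the tier below, which is a conservative guess and not a rule stated in this skill. The function name is hypothetical.

```python
def classify_tier(vram_gb):
    """Map detected VRAM in GB to (tier name, max model size).

    Pass None when no GPU was detected. Values that land in a gap
    between tiers are assigned to the lower tier (an assumption).
    """
    if vram_gb is None:
        return ("CPU-only", "3B")
    if vram_gb < 6:
        return ("Low VRAM", "3B")
    if vram_gb < 10:
        return ("Entry", "8B")
    if vram_gb < 16:
        return ("Prosumer", "14B")
    if vram_gb < 48:
        return ("Workstation", "32B")
    return ("High-end", "70B+")


for vram in (None, 4, 8, 12, 24, 48):
    print(vram, classify_tier(vram))
```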

Apple Silicon Special Case:

  • Unified memory = shared CPU/GPU RAM
  • 8GB Mac → treat as 6GB VRAM tier
  • 16GB Mac → treat as 12GB VRAM tier
  • 32GB+ Mac → treat as workstation tier
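
A minimal sketch of the Apple Silicon mapping follows. The list only names 8GB, 16GB, and 32GB+ machines, so the handling of in-between sizes (rounded down to the nearest listed size) and of Macs under 8GB is an assumption, and the function name is hypothetical.

```python
def apple_silicon_tier(unified_gb):
    """Return the hardware tier to use for a Mac with unified memory."""
    if unified_gb >= 32:
        return "Workstation"
    if unified_gb >= 16:
        return "Prosumer"  # treated like 12 GB of VRAM
    if unified_gb >= 8:
        return "Entry"     # treated like 6 GB of VRAM
    return "Low VRAM"      # assumption: not covered by the list above


for memory in (8, 16, 24, 32, 64):
    print(memory, apple_silicon_tier(memory))
```

The discount relative to raw memory size reflects that unified memory is shared with the OS and other applications.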

Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

1. System Overview

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits the system to roughly 8B models").

2. Dependency Assessment

List what's needed based on the platform:

  • macOS: Ollama only (Metal automatic)
  • Linux NVIDIA: Ollama + NVIDIA driver 450+
  • Linux AMD: Ollama + ROCm 5.0+
  • Windows: Ollama + NVIDIA driver 452+

3. Configuration Recommendations

Essential environment variables:

bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints; takes effect only with flash attention enabled
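
The variable selection can be sketched as a function of detected VRAM. The 12GB cutoff comes from the comment above; treating "severe constraints" as under 6GB is an assumption borrowed from the Low VRAM tier, and both function names are hypothetical.

```python
def recommend_env(vram_gb):
    """Return recommended Ollama environment variables as a dict.

    vram_gb is the detected VRAM in GB, or None for CPU-only systems.
    """
    env = {"OLLAMA_FLASH_ATTENTION": "1"}  # always recommended
    if vram_gb is not None:
        if vram_gb < 6:        # assumed meaning of "severe constraints"
            env["OLLAMA_KV_CACHE_TYPE"] = "q4_0"
        elif vram_gb < 12:     # memory-constrained systems
            env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"
    return env


def as_exports(env):
    """Render the dict as shell export lines for the generated guide."""
    return [f"export {name}={value}" for name, value in env.items()]


print("\n".join(as_exports(recommend_env(8))))
```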

Model selection guidance:

  • Recommend specific models from ollama list output
  • Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
  • Warn if current models exceed hardware capacity
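
One way to back the capacity warning with numbers is a rough size estimate: parameter count times bits per weight, plus a fixed allowance for the KV cache and runtime. The bits-per-weight figures and the 1.5GB overhead below are approximations chosen for illustration, not values taken from this skill or from Ollama, and the function names are hypothetical.

```python
# Approximate average bits per weight for common GGUF quantizations.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}


def estimate_model_gb(params_billion, quant="Q4_K_M", overhead_gb=1.5):
    """Rough memory footprint in GB: weights plus an assumed fixed overhead."""
    weights_gb = params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb


def fits(params_billion, available_gb, quant="Q4_K_M"):
    """True if the estimated footprint fits in the available memory."""
    return estimate_model_gb(params_billion, quant) <= available_gb


for size in (3, 8, 14, 32):
    print(f"{size}B at Q4_K_M: ~{estimate_model_gb(size):.1f} GB, fits in 8 GB: {fits(size, 8)}")
```

Long contexts grow the KV cache well beyond a fixed allowance, so the estimate is best read as a lower bound.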

Modelfile tuning (when needed):

code
PARAMETER num_gpu <layers>    # Partial offload for limited VRAM
PARAMETER num_thread <cores>  # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>      # Reduce context for memory savings
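
A small helper can render such a Modelfile from the chosen values. This is a sketch: the function name is hypothetical, and the base model and numbers in the example are placeholders, not recommendations.

```python
def render_modelfile(base_model, num_gpu=None, num_thread=None, num_ctx=None):
    """Build Modelfile text, emitting only the parameters that were set."""
    lines = [f"FROM {base_model}"]
    for name, value in (
        ("num_gpu", num_gpu),        # layers offloaded to the GPU
        ("num_thread", num_thread),  # CPU threads (physical cores)
        ("num_ctx", num_ctx),        # context window size in tokens
    ):
        if value is not None:
            lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile("llama3.2:3b", num_thread=4, num_ctx=4096)
print(modelfile_text)
```

Saving the text as Modelfile and running `ollama create <name> -f Modelfile` registers the tuned variant under a new model name.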

4. Execution Checklist

Provide copy-paste commands in order:

  1. Set environment variables
  2. Restart Ollama service
  3. Pull recommended models
  4. Test with ollama run <model> --verbose

5. Verification Commands

bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
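
The contents of scripts/benchmark_ollama.py are not shown here. As an illustration of the kind of metric a benchmark can report, the sketch below computes generation throughput from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's REST API returns with a completed response. The function name and the sample values are made up.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the final metrics of an Ollama API response."""
    eval_count = response["eval_count"]           # tokens generated
    eval_duration_ns = response["eval_duration"]  # generation time in ns
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / eval_duration_ns * 1e9


# Made-up sample: 100 tokens generated in 2 seconds.
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Comparing this figure before and after a configuration change gives a direct check on whether the optimization helped.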

Reference Files

Output Format

Generate an ollama-optimization-guide.md file in the current directory with:

markdown
# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>

Quick Optimization Commands

For users who want immediate results without full analysis:

macOS (Apple Silicon):

bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast

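Linux/Windows with 8GB NVIDIA GPU (export is shell syntax for Linux; on Windows, set the same variables as user environment variables):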

bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M

CPU-only systems:

bash
# An invalid GPU ID hides NVIDIA GPUs and forces CPU inference
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
# Create Modelfile with: PARAMETER num_thread 4  (use your physical core count)