Local LLM Integration Skill

Deploy and integrate local LLMs with Ollama, LocalAI, and Home Assistant for privacy-focused voice assistants and automation.

Activation Triggers

Activate this skill when:

  • Setting up Ollama or LocalAI
  • Configuring local voice assistants
  • Integrating LLMs with Home Assistant
  • Optimizing local model performance
  • Building LLM-powered automations

Ollama Installation

Ubuntu/Debian

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start as service
sudo systemctl enable ollama
sudo systemctl start ollama

# Pull models
ollama pull llama3.2:3b
ollama pull fixt/home-3b-v3  # HA-optimized
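Once the service is running and the models are pulled, the install can be checked from Python. The sketch below queries the `/api/tags` endpoint, which lists locally available models. The helper names `model_names` and `list_local_models` are illustrative, and the base URL assumes the default port used throughout this document.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of locally pulled models.

    Raises urllib.error.URLError if the Ollama server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling `list_local_models()` after the pulls above should return a list that includes `llama3.2:3b`. A `URLError` usually means the service has not started.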

Docker

yaml
# docker-compose.yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    # GPU support (NVIDIA)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Ollama API

Generate Completion

python
import httpx

async def generate(prompt: str, model: str = "llama3.2:3b") -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.7,
                    "num_ctx": 2048,
                    "top_p": 0.9
                }
            },
            timeout=60.0
        )
        return response.json()["response"]

Chat Completion

python
async def chat(messages: list, model: str = "llama3.2:3b") -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": False
            },
            timeout=60.0
        )
        return response.json()["message"]["content"]

# Usage (run inside an async function or an asyncio-aware REPL)
response = await chat([
    {"role": "system", "content": "You are a helpful home assistant."},
    {"role": "user", "content": "Turn on the living room lights."}
])
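The chat endpoint keeps no conversation state, so the caller resends the full `messages` list on every turn. With a small context window, long conversations eventually overflow it. One possible mitigation, sketched here with a hypothetical helper `trim_history`, is to keep the system prompt and only the most recent turns:

```python
def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep all system messages plus the last `max_turns` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + rest[-max_turns:]
```

Passing `trim_history(history)` instead of `history` to `chat()` bounds the prompt size, at the cost of the model forgetting older turns.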

Streaming

python
import json

import httpx

async def stream_generate(prompt: str, model: str = "llama3.2:3b"):
    # "stream" defaults to true, so the reply arrives as one JSON object per line
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt},
            timeout=60.0
        ) as response:
            async for line in response.aiter_lines():
                if line:
                    chunk = json.loads(line)
                    yield chunk.get("response", "")
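The streamed body is newline-delimited JSON. Each line carries a `response` fragment, and the final object has `"done": true`. The helper below, named `collect_stream` for illustration, shows how those lines reassemble into the full text. It takes any iterable of lines, so it can be tried without a running server.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)
```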

Home Assistant Integration

Ollama Conversation Agent

yaml
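# configuration.yaml (illustrative keys; the built-in Ollama integration is configured from the UI under Settings > Devices & services)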
ollama:
  url: http://localhost:11434
  model: llama3.2:3b
  context_window: 4096
  keep_alive: 5m
  prompt_template: |
    You are a helpful home assistant AI. You can control smart home devices.

    When asked to control devices, respond with the action you're taking.
    Be concise and helpful.

conversation:
  - platform: ollama
    name: Local Assistant

Home-LLM Integration

yaml
# For the home-llm custom component
# Install via HACS

# configuration.yaml
home_llm:
  backend: ollama
  model: fixt/home-3b-v3
  url: http://localhost:11434
  max_tokens: 256
  temperature: 0.3

Custom HA Agent with Function Calling

python
import json
import re
from homeassistant.core import HomeAssistant

# Literal braces in the JSON example are doubled so str.format() leaves them intact.
SYSTEM_PROMPT = """You are a home automation AI assistant.

When the user asks to control a device, respond with a JSON action block:
```json
{{"action": "service_call", "domain": "light", "service": "turn_on", "target": {{"entity_id": "light.living_room"}}, "data": {{"brightness_pct": 100}}}}
```

For information queries, respond naturally. For device control, always include the JSON block.

Available entities:
{entities}
"""

async def process_command(
    hass: HomeAssistant,
    user_input: str,
    model: str = "llama3.2:3b"
) -> str:
    # Get available entities
    entities = []
    for state in hass.states.async_all():
        if state.domain in ["light", "switch", "climate", "cover", "lock"]:
            entities.append(f"- {state.entity_id}: {state.name}")

    # Cap the list at 50 entities to keep the prompt short
    system_prompt = SYSTEM_PROMPT.format(entities="\n".join(entities[:50]))

    # Call Ollama (chat() is the helper defined under "Chat Completion")
    response = await chat([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input}
    ], model=model)

    # Extract and execute JSON actions
    json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL)
    if json_match:
        try:
            action = json.loads(json_match.group(1))
            if action.get("action") == "service_call":
                await hass.services.async_call(
                    action["domain"],
                    action["service"],
                    action.get("data", {}),
                    target=action.get("target")
                )
        except Exception as e:
            return f"Error executing action: {e}"

    return response
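Because the agent executes whatever service call the model emits, it may be worth validating the parsed action before handing it to Home Assistant. The following is a minimal sketch under stated assumptions. `ALLOWED_SERVICES`, `extract_action`, and `is_allowed` are hypothetical names, and the allowlist contents are only an example. It accepts a call only if the service is allowlisted and every target entity is known.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built indirectly so this example contains no literal fence

# Hypothetical allowlist: domain -> services the model is permitted to call.
ALLOWED_SERVICES = {
    "light": {"turn_on", "turn_off", "toggle"},
    "switch": {"turn_on", "turn_off", "toggle"},
    "cover": {"open_cover", "close_cover", "stop_cover"},
}

ACTION_RE = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def extract_action(response: str) -> Optional[dict]:
    """Return the first fenced JSON object in the model reply, or None."""
    match = ACTION_RE.search(response)
    if not match:
        return None
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) else None


def is_allowed(action: dict, known_entities: set) -> bool:
    """Accept only allowlisted services aimed at entities that exist."""
    if action.get("action") != "service_call":
        return False
    domain, service = action.get("domain"), action.get("service")
    if not isinstance(domain, str) or not isinstance(service, str):
        return False
    if service not in ALLOWED_SERVICES.get(domain, set()):
        return False
    target = action.get("target")
    entity_ids = target.get("entity_id", []) if isinstance(target, dict) else []
    if isinstance(entity_ids, str):
        entity_ids = [entity_ids]
    # Every target must be a known entity in the same domain as the service.
    return bool(entity_ids) and all(
        isinstance(e, str) and e in known_entities and e.startswith(domain + ".")
        for e in entity_ids
    )
```

In `process_command`, the set of known entities could be built from the same `hass.states.async_all()` loop, and the service call skipped whenever `is_allowed` returns `False`. Leaving `lock` out of the allowlist is a deliberate choice in this sketch, since a hallucinated unlock is costlier than a hallucinated light toggle.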

## Model Recommendations

| Use Case | Model | RAM | VRAM | Speed |
|----------|-------|-----|------|-------|
| Fast responses | llama3.2:1b | 2GB | 2GB | Very Fast |
| Voice assistant | llama3.2:3b | 4GB | 4GB | Fast |
| HA control | fixt/home-3b-v3 | 4GB | 4GB | Fast |
| General chat | llama3.1:8b | 8GB | 8GB | Medium |
| Complex tasks | mistral:7b | 8GB | 8GB | Medium |
| Reasoning | deepseek-r1:7b | 8GB | 8GB | Slow |

## Custom Modelfile

```dockerfile
# ha-assistant.modelfile
FROM llama3.2:3b

# System prompt for HA
SYSTEM """You are a helpful home automation assistant.

When asked to control devices, provide clear confirmation of actions.
When asked about device states, check current status and report accurately.
Be concise and helpful. Avoid unnecessary explanations.

Format device control responses as:
"Done! [What was changed]"

Format status queries as:
"The [device] is currently [state]."
"""

# Optimize for fast responses
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
PARAMETER stop "<|eot_id|>"

# Template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
```

bash
# Create the model
ollama create ha-assistant -f ha-assistant.modelfile

# Test it
ollama run ha-assistant "Turn on the kitchen lights"

Performance Optimization

GPU Configuration

bash
# Check GPU availability
nvidia-smi

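# Set the number of layers offloaded to the GPU (num_gpu is a per-request model option)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello", "stream": false, "options": {"num_gpu": 35}}'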

# For AMD GPUs
export HSA_OVERRIDE_GFX_VERSION=10.3.0

Memory Management

bash
# Reserve VRAM headroom per GPU (value in bytes; 1 GiB here)
export OLLAMA_GPU_OVERHEAD=1073741824

# Keep model in memory
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "keep_alive": "10m"}'

Quantization

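| Format | Size | Speed | Quality |
|--------|------|-------|---------|
| Q4_0 | Smallest | Fastest | Lower |
| Q4_K_M | Small | Fast | Good |
| Q5_K_M | Medium | Medium | Better |
| Q8_0 | Large | Slower | Best |
| F16 | Largest | Slowest | Original |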
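To turn the size column into rough numbers, the weight footprint can be estimated as parameters × bits per weight ÷ 8. In the sketch below, the Q4_0 and Q8_0 figures follow from the GGUF block layout (32 weights plus one 16-bit scale per block). The K-quant figures are approximate averages that vary by model and are included as assumptions. The estimate covers weights only, not the KV cache or runtime overhead.

```python
# Approximate bits stored per weight for each format.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,     # (32 * 4 + 16) / 32
    "Q4_K_M": 4.85,  # approximate, mixed-precision scheme
    "Q5_K_M": 5.7,   # approximate, mixed-precision scheme
    "Q8_0": 8.5,     # (32 * 8 + 16) / 32
    "F16": 16.0,
}


def weights_size_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9
```

For a 3-billion-parameter model this gives about 1.7 GB at Q4_0 and 6 GB at F16, which is why the 4-bit formats are the usual choice on small GPUs.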

LocalAI Alternative

yaml
# docker-compose.yaml
services:
  localai:
    image: localai/localai:latest-aio-cpu
    container_name: localai
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    environment:
      - MODELS_PATH=/models

LocalAI provides an OpenAI-compatible API:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "user", "content": "Turn on the lights"}
    ]
)
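Under the client library, the OpenAI-compatible API is a plain JSON POST to `/chat/completions`, and the reply text sits at `choices[0].message.content`. The sketch below builds that request with the standard library so the wire format is visible. `BACKENDS`, `build_chat_request`, and `send_chat` are illustrative names. The Ollama entry assumes a default local install, which exposes a similar compatibility endpoint under `/v1`.

```python
import json
import urllib.request

# Base URLs of OpenAI-compatible endpoints (assumed default local ports).
BACKENDS = {
    "localai": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
}


def build_chat_request(backend: str, model: str, messages: list) -> urllib.request.Request:
    """Build the POST request for an OpenAI-style chat completion."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because both backends accept the same request shape, switching between them only changes the base URL and the model name.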

Troubleshooting

| Issue | Solution |
|-------|----------|
| Model not loading | Check VRAM, use smaller quantization |
| Slow responses | Enable GPU, reduce context length |
| Out of memory | Use Q4 quantization, reduce batch |
| Connection refused | Check ollama service status |
| Timeout errors | Increase client timeout, use streaming |
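The two client-side rows of the table (connection refused and timeouts) can be detected in code. As a rough sketch, the hypothetical helper `troubleshooting_hint` below maps a caught exception to the matching hint. It assumes a `urllib`-based client, where connection failures arrive wrapped in `URLError`.

```python
import socket
import urllib.error


def troubleshooting_hint(exc: BaseException) -> str:
    """Map a client-side exception to the matching troubleshooting hint."""
    # urllib wraps low-level failures in URLError; unwrap to the real cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, ConnectionRefusedError):
        return "Connection refused: check the ollama service status"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "Timeout: increase the client timeout or use streaming"
    return "No matching entry in the troubleshooting table"
```

A caller might wrap its request in `try`/`except Exception as exc` and log `troubleshooting_hint(exc)` alongside the original error.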