Local LLM Integration Skill
Deploy and integrate local LLMs with Ollama, LocalAI, and Home Assistant for privacy-focused voice assistants and automation.
Activation Triggers
Activate this skill when:
- •Setting up Ollama or LocalAI
- •Configuring local voice assistants
- •Integrating LLMs with Home Assistant
- •Optimizing local model performance
- •Building LLM-powered automations
Ollama Installation
Ubuntu/Debian
bash
# Install Ollama curl -fsSL https://ollama.com/install.sh | sh # Start as service sudo systemctl enable ollama sudo systemctl start ollama # Pull models ollama pull llama3.2:3b ollama pull fixt/home-3b-v3 # HA-optimized
Docker
yaml
# docker-compose.yaml
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ./ollama:/root/.ollama
# GPU support (NVIDIA)
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Ollama API
Generate Completion
python
import httpx
async def generate(prompt: str, model: str = "llama3.2:3b") -> str:
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.7,
"num_ctx": 2048,
"top_p": 0.9
}
},
timeout=60.0
)
return response.json()["response"]
Chat Completion
python
async def chat(messages: list, model: str = "llama3.2:3b") -> str:
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:11434/api/chat",
json={
"model": model,
"messages": messages,
"stream": False
},
timeout=60.0
)
return response.json()["message"]["content"]
# Usage
response = await chat([
{"role": "system", "content": "You are a helpful home assistant."},
{"role": "user", "content": "Turn on the living room lights."}
])
Streaming
python
async def stream_generate(prompt: str, model: str = "llama3.2:3b"):
async with httpx.AsyncClient() as client:
async with client.stream(
"POST",
"http://localhost:11434/api/generate",
json={"model": model, "prompt": prompt},
timeout=60.0
) as response:
async for line in response.aiter_lines():
if line:
chunk = json.loads(line)
yield chunk.get("response", "")
Home Assistant Integration
Ollama Conversation Agent
yaml
# configuration.yaml
ollama:
url: http://localhost:11434
model: llama3.2:3b
context_window: 4096
keep_alive: 5m
prompt_template: |
You are a helpful home assistant AI. You can control smart home devices.
When asked to control devices, respond with the action you're taking.
Be concise and helpful.
conversation:
- platform: ollama
name: Local Assistant
Home-LLM Integration
yaml
# For the home-llm custom component # Install via HACS # configuration.yaml home_llm: backend: ollama model: fixt/home-3b-v3 url: http://localhost:11434 max_tokens: 256 temperature: 0.3
Custom HA Agent with Function Calling
python
import json
import re
from homeassistant.core import HomeAssistant
SYSTEM_PROMPT = """You are a home automation AI assistant.
When the user asks to control a device, respond with a JSON action block:
```json
{"action": "service_call", "domain": "light", "service": "turn_on", "target": {"entity_id": "light.living_room"}, "data": {"brightness_pct": 100}}
For information queries, respond naturally. For device control, always include the JSON block.
Available entities: {entities} """
async def process_command( hass: HomeAssistant, user_input: str, model: str = "llama3.2:3b" ) -> str: # Get available entities entities = [] for state in hass.states.async_all(): if state.domain in ["light", "switch", "climate", "cover", "lock"]: entities.append(f"- {state.entity_id}: {state.name}")
code
system_prompt = SYSTEM_PROMPT.format(entities="\n".join(entities[:50]))
# Call Ollama
response = await chat([
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_input}
], model=model)
# Extract and execute JSON actions
json_match = re.search(r'```json\s*({.*?})\s*```', response, re.DOTALL)
if json_match:
try:
action = json.loads(json_match.group(1))
if action.get("action") == "service_call":
await hass.services.async_call(
action["domain"],
action["service"],
action.get("data", {}),
target=action.get("target")
)
except Exception as e:
return f"Error executing action: {e}"
return response
code
## Model Recommendations
| Use Case | Model | RAM | VRAM | Speed |
|----------|-------|-----|------|-------|
| Fast responses | llama3.2:1b | 2GB | 2GB | Very Fast |
| Voice assistant | llama3.2:3b | 4GB | 4GB | Fast |
| HA control | fixt/home-3b-v3 | 4GB | 4GB | Fast |
| General chat | llama3.2:8b | 8GB | 8GB | Medium |
| Complex tasks | mistral:7b | 8GB | 8GB | Medium |
| Reasoning | deepseek-r1:7b | 8GB | 8GB | Slow |
## Custom Modelfile
```dockerfile
# ha-assistant.modelfile
FROM llama3.2:3b
# System prompt for HA
SYSTEM """You are a helpful home automation assistant.
When asked to control devices, provide clear confirmation of actions.
When asked about device states, check current status and report accurately.
Be concise and helpful. Avoid unnecessary explanations.
Format device control responses as:
"Done! [What was changed]"
Format status queries as:
"The [device] is currently [state]."
"""
# Optimize for fast responses
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
PARAMETER stop "<|eot_id|>"
# Template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
bash
# Create the model ollama create ha-assistant -f ha-assistant.modelfile # Test it ollama run ha-assistant "Turn on the kitchen lights"
Performance Optimization
GPU Configuration
bash
# Check GPU availability nvidia-smi # Set GPU layers in Ollama export OLLAMA_NUM_GPU=35 # For AMD GPUs export HSA_OVERRIDE_GFX_VERSION=10.3.0
Memory Management
bash
# Limit VRAM usage
export OLLAMA_GPU_MEMORY_FRACTION=0.8
# Keep model in memory
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2:3b", "keep_alive": "10m"}'
Quantization
| Format | Size | Speed | Quality |
|---|---|---|---|
| Q4_0 | Smallest | Fastest | Lower |
| Q4_K_M | Small | Fast | Good |
| Q5_K_M | Medium | Medium | Better |
| Q8_0 | Large | Slower | Best |
| F16 | Largest | Slowest | Original |
LocalAI Alternative
yaml
# docker-compose.yaml
services:
localai:
image: localai/localai:latest-aio-cpu
container_name: localai
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ./models:/models
environment:
- MODELS_PATH=/models
LocalAI provides OpenAI-compatible API:
python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="llama3.2:3b",
messages=[
{"role": "user", "content": "Turn on the lights"}
]
)
Troubleshooting
| Issue | Solution |
|---|---|
| Model not loading | Check VRAM, use smaller quantization |
| Slow responses | Enable GPU, reduce context length |
| Out of memory | Use Q4 quantization, reduce batch |
| Connection refused | Check ollama service status |
| Timeout errors | Increase client timeout, use streaming |