Ollama LLM & Prompt Engineering

Name: skill-ollama-llm-prompts
Rating: 76
Author: Rubenvalrom

Architecture Decision

Why Ollama?

•Local inference: No API dependencies, full data privacy, no token costs
•Model flexibility: Pull/swap models (Llama 2, Mistral, Neural Chat) without code changes
•Integration: Simple HTTP API, easy to containerize, Kubernetes-ready
•Development: Instant feedback loop, no rate limits

Supported Models (balance accuracy vs. speed):

•llama2 (7B): General purpose, good UX recommendations. ~6GB VRAM
•mistral (7B): Best quality-to-speed ratio. ~7GB VRAM
•neural-chat (7B): Conversational, good for assistant. ~6GB VRAM
•nomic-embed-text: For embeddings (pgvector), ~500MB

Model Selection Criteria:

Use Case	Model	Reason
Extract expediente metadata	Mistral 7B	Accurate JSON output, fast
Virtual assistant chat	Llama 2 Chat	Conversational, 60 token/sec
Budget forecasting	Llama 2	Mathematical reasoning OK
Embebddings (semantic search)	nomic-embed-text	Fast, 1536-dim vectors

Quick Start

bash

# Pull model (one-time, auto on first request)
ollama pull mistral

# Verify Ollama running
curl http://localhost:11434/api/tags

python

import requests
import json

def call_ollama(prompt: str, model: str = "mistral") -> str:
    url = "http://ollama:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
    }
    
    response = requests.post(url, json=payload, timeout=30)
    return response.json()["response"]

# Example: Extract metadata
ocr_text = "Solicitud de licencia de construcción..."
prompt = f"Extract: solicitante, tipo trámite, monto\n\nDocumento:\n{ocr_text}"
result = call_ollama(prompt, "mistral")
print(result)

Prompt Templates

See references/prompt-templates.md:

•Auto-rellenado de expedientes (OCR → JSON metadata)
•Asistente virtual (context-aware recommendations)
•Predicción de tendencias (time-series forecasting)
•Extracción de cláusulas legales

RAG Strategy

See references/rag-patterns.md:

•Retrieval: pgvector for semantic search + simple text BM25
•Augmentation: Precedent documents + regulatory references
•Generation: Context-aware prompts with retrieved examples
•Fallback: When no relevant docs found, use generic prompt

Reliability Patterns

See references/reliability-patterns.md:

•Retry logic with exponential backoff
•Request timeouts (30s default)
•Graceful degradation (return cached/default on failure)
•Structured output validation (JSON schema check)

Scripts

Client library:

bash

python scripts/ollama_client.py --model mistral --prompt "What is machine learning?"

Prompt evaluation framework:

bash

pytest scripts/test_prompts.py -v
# Runs test cases with different models, measures accuracy

Model benchmarking:

bash

python scripts/benchmark_models.py --models mistral llama2 neural-chat
# Outputs: latency, accuracy, VRAM usage

Decisions

•MVP models: Mistral (text) + nomic-embed (vectors), both 7B to fit on single GPU
•Streaming vs. Complete: Use complete (stream=False) for MVP, simpler handling
•Structured output: Prompt for JSON, validate with Pydantic, fallback to named entities
•Caching: Cache embeddings in pgvector, NOT raw LLM outputs (too large)
•Cost estimation: Ollama local = no API costs, but infra/compute cost (single GPU ~$500/year)