Hailo AI Services Architecture

This skill provides pragmatic patterns for AI service APIs that leverage Hailo-10H on Raspberry Pi 5.

Core Principle: Stand on Proven Shoulders

Adopt existing standards: Don't invent new APIs. Use proven protocols and conventions:

•LLM inference: Ollama API (adopted by thousands of projects)
•Object detection: Standard bbox JSON formats (COCO, YOLO conventions)
•Pose estimation: Standard keypoint formats (COCO pose, OpenPose conventions)
•Speech/TTS: Whisper API patterns, standard audio formats (WAV, MP3)
•OCR: Standard text detection/recognition formats
•Depth estimation: Standard depth map formats (PNG, NumPy arrays)
•General REST: Standard HTTP methods, JSON, conventional status codes
•Health checks: Simple GET /health or /api/health

Benefit: Works with existing client tools, familiar to users, reduces code to write and maintain.

These are personal projects and art installations—reuse over reinvention.

Python Runtime Strategy

For Python-based services: Use isolated virtual environments in /opt/hailo-service-name/venv (see Raspberry Pi skill for detailed rationale and alternatives).

Why it matters for AI services:

•Heavy dependencies: HailoRT Python bindings, OpenCV, NumPy, ML frameworks
•Version pinning: AI libraries evolve rapidly; lock versions for stability
•Multiple services: Different services may need incompatible package versions
•Clean uninstall: Remove entire /opt/hailo-service-name/ directory

hailo-ollama exception: Wraps Ollama binary (not Python); no venv needed.

Service Types from hailo-apps

The hailo-apps submodule provides 20+ AI applications across vision and GenAI domains. Each can be wrapped as a systemd service.

Architecture Note: Hailo-10H supports concurrent services — multiple models can run simultaneously. Design services to:

•Load models at startup and keep them resident (startup latency ~10-30s is costly)
•Support graceful unload via API endpoint (for memory management)
•Budget memory across all active services (~5-6GB total available)

LLM Inference (hailo-ollama) — Starting Point

Purpose: Expose LLM inference via Ollama API as a systemd service

Pattern:

•Ollama-compatible REST endpoints (/api/tags, /api/chat, /api/generate)
•Model lifecycle management, port 11434
•Why first? Ollama is the most widely adopted local LLM standard

Computer Vision Services (from hailo-apps)

Available applications ready for service deployment:

Object Detection:

•YOLO (v5, v6, v8, v11), SSD, CenterNet variants
•Standard COCO bbox format output
•Potential API: /api/detect endpoint with image upload

Pose Estimation:

•YOLOv8 pose models (13+ keypoints per person)
•Standard COCO pose keypoint format
•Potential API: /api/pose endpoint returning joint coordinates

Instance Segmentation:

•YOLOv5/v8 segmentation models
•Per-instance pixel masks
•Potential API: /api/segment with mask output

Depth Estimation:

•Monocular (scdepthv3) and stereo (stereonet)
•Depth map output (PNG or NumPy)
•Potential API: /api/depth returning depth maps

OCR:

•PaddleOCR text detection + recognition
•Standard text/bbox JSON output
•Potential API: /api/ocr following Tesseract conventions

Face Recognition:

•Face detection + embedding comparison
•Standard face detection bbox format
•Potential API: /api/faces with recognition results

Other Vision:

•Object Detection services:**

bash

curl -X POST http://localhost:8080/api/detect \
  -F "image=@photo.jpg" \
  -H "Content-Type: multipart/form-data"
# Returns COCO-format bboxes: [{"label": "person", "confidence": 0.95, "bbox": [x, y, w, h]}]

Pose Estimation services:

bash

curl -X POST http://localhost:8081/api/pose \
  -F "image=@person.jpg"
# Returns keypoints: [{"person_id": 0, "keypoints": [{"name": "nose", "x": 120, "y": 80}...]}]

Speech-to-Text services (Whisper API format):

bash

curl -X POST http://localhost:8082/api/transcribe \
  -F "audio=@speech.wav" \
  -F "language=en"
# Returns: {"text": "Hello world", "language": "en", "duration": 2.5}

Text-to-Speech services (Piper TTS):

bash

curl -X POST http://localhost:8083/api/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "en_US-amy-low"}' \
  --output speech.wav
# Returns audio file (WAV format)

General principles for all services:

•Adopt the existing API standard for that AI domain (Ollama, Whisper, COCO, etc.)
•Use standard HTTP conventions (200 OK, 400 Bad Request, 404 Not Found, 503 Service Unavailable)
•JSON request/response bodies; multipart for file uploads
•Health check endpoint (e.g., /api/health, /health
•Potential API: /api/transcribe with audio upload

Voice Assistant:

•End-to-end speech → LLM → TTS pipeline
•WebSocket or streaming API
•Potential API: /api/voice streaming endpoint

Vision-Language Models:

•Image + text reasoning (VLM chat)
•Standard multimodal input format
•Potential API: /api/vlm with image + prompt

Service Deployment Pattern

For each AI capability:

•Identify existing API standard for that domain (if one exists)
•Wrap hailo-apps application in REST API server
•Deploy as systemd service with health checks
•Use Hailo-10 acceleration where available
•Standard error handling and logging to journald

Batch Processing Service (Future)

Purpose: Queue-based inference for non-interactive workloads

Characteristics:

•Consume tasks from queue (file, message broker, API)
•Process in batches for efficiency
•Store results to database or file system
•Lower latency requirements than interactive API

API Design Patterns

Follow the Standard for Your Domain

Ollama-based services (LLM):

bash

curl http://localhost:11434/api/tags  # Standard Ollama endpoint

Object Detection services:

bash

curl -X POST http://localhost:8080/api/detect \
  -F "image=@photo.jpg" \
  -H "Content-Type: multipart/form-data"
# Returns COCO-format bboxes: [{"label": "person", "confidence": 0.95, "bbox": [x, y, w, h]}]

Pose Estimation services:

bash

curl -X POST http://localhost:8081/api/pose \
  -F "image=@person.jpg"
# Returns keypoints: [{"person_id": 0, "keypoints": [{"name": "nose", "x": 120, "y": 80}...]}]

Speech-to-Text services (Whisper API format):

bash

curl -X POST http://localhost:8082/api/transcribe \
  -F "audio=@speech.wav" \
  -F "language=en"
# Returns: {"text": "Hello world", "language": "en", "duration": 2.5}

Text-to-Speech services (Piper TTS):

bash

curl -X POST http://localhost:8083/api/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "en_US-amy-low"}' \
  --output speech.wav
# Returns audio file (WAV format)

General principles for all services:

•Adopt the existing API standard for that AI domain (Ollama, Whisper, COCO, etc.)
•Use standard HTTP conventions (200 OK, 400 Bad Request, 404 Not Found, 503 Service Unavailable)
•JSON request/response bodies; multipart for file uploads
•Health check endpoint (e.g., /api/health, /health)

Pragmatic Error Handling

Keep it simple:

json

{
  "error": "Model not found",
  "detail": "Available models: neural-chat, mistral"
}

No need for custom error codes if standard HTTP status codes work.

Async Request/Response (optional)

For long-running inference:

code

POST /api/infer
Body: {"model": "llama2", "prompt": "..."}
Response: {"job_id": "abc123", "status": "queued"}

GET /api/infer/abc123
Response: {"status": "complete", "result": "..."}

Model Management

Model Lifecycle Strategy

Persistent Loading (Preferred):

•Load models at service startup; keep resident in memory
•Avoids 10-30s startup latency on each inference request
•Suitable for services with regular usage patterns

Graceful Unloading (When Needed):

•Expose /api/unload endpoint to release model memory
•Useful for memory-constrained scenarios or infrequent usage
•Service should support reload on next inference request

Concurrent Services:

•Multiple services (e.g., LLM + STT + vision) can run simultaneously
•Total memory budget: ~5-6GB across all services
•Example allocation: Ollama 2GB + Whisper 1GB + YOLO 1GB + OS overhead

Ollama Model Paths

bash

# Default model storage
~hailo/.ollama/models/                 # User home
/var/lib/ollama/models/                # System service

# Model structure
/var/lib/ollama/models/manifests/<namespace>/<model>:<tag>
/var/lib/ollama/models/blobs/           # Actual model files

Model Lifecycle API Patterns

Recommended pattern for persistent services:

bash

# Service startup: Load model into memory (happens once)
# - Model remains resident until service stop or explicit unload
# - First inference after startup: ~10-30s (model loading)
# - Subsequent inferences: <1s (model already loaded)

# List loaded models
curl http://localhost:11434/api/tags
# Returns: {"models": [{"name": "llama2", "size": "4.1GB", "loaded": true}]}

# Inference (uses already-loaded model)
curl -X POST http://localhost:11434/api/chat -d '{"model": "llama2", "messages": [...]}'
# Fast response (model already in memory)

# Graceful unload (optional, for memory management)
curl -X POST http://localhost:11434/api/unload -d '{"model": "llama2"}'
# Releases memory; next inference will reload (slow)

# Health check includes model status
curl http://localhost:11434/api/health
# Returns: {"status": "ok", "models_loaded": ["llama2"], "memory_used_mb": 4200}

Design principle: Services should keep models loaded by default. Only unload when:

•User explicitly requests via /api/unload endpoint
•Memory pressure detected (optional: auto-unload least-recently-used)
•Service shutdown (graceful cleanup)

Model Selection for Raspberry Pi

Recommended models for single-service deployment:

•neural-chat (7B, ~4GB) - Good for dedicated LLM service
•mistral (7B, ~4GB) - Faster inference, good general use
•orca-mini (3B, ~2GB) - Lowest memory, good for multi-service setups

For concurrent multi-service scenarios:

•Use smaller models to fit memory budget
•Example: orca-mini (2GB) + Whisper-tiny (512MB) + YOLO (512MB) = ~3GB total
•Monitor memory with: free -h and /api/health endpoints

Avoid:

•13B+ models when running multiple services (exceed total RAM budget)
•Frequent model reloading (adds 10-30s latency each time)

Resource Management

Memory Budgeting

code

Total Pi 5 RAM:              8 GB
Reserved for OS:             1 GB
Available for services:      ~7 GB

Ollama process overhead:     ~300 MB
Model cache (llama2 7B):     ~5 GB
Inference workspace:         ~1 GB buffer

Action: Set MemoryLimit=6G in systemd unit to prevent swap thrashing.

CPU Performance Considerations

code

Hailo-10 offloads:     Tensor operations (inference)
CPU still handles:     Tokenization, sampling, post-processing

Typical CPU load:      30-50% during active inference
Thermal idle:          35-45°C (with passive cooling)
Thermal active use:    50-65°C (with active fan)

Action: Monitor CPU usage in journalctl logs; if sustained >80%, reduce batch size or model size.

Thermal Management

bash

# Monitor service + system temperature
while true; do
  echo "$(date): $(vcgencmd measure_temp)"
  sleep 5
done

# Log thermal events to journalctl
# Modify service unit:
# ExecStartPost=/usr/local/bin/log-thermal-baseline.sh

Throttle recovery:

•If temperature >80°C: CPU governor reduces frequency
•Service may emit 503 errors during throttle
•Normal operation resumes as temperature drops
•Implement retry logic in clients

Deployment Checklist

Common Patterns

Graceful Shutdown

python

# Python service code
import signal

def signal_handler(sig, frame):
    print("Shutting down gracefully...")
    # Save model state, close Hailo device connection
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)

bash

# systemd unit
TimeoutStopSec=30
KillSignal=SIGTERM

Dependency on System Setup

bash

# In installer, verify prerequisites:
if ! command -v hailortcli &> /dev/null; then
    echo "ERROR: Hailo driver not installed. Run system setup first."
    exit 1
fi

if ! [ -e /dev/hailo0 ]; then
    echo "ERROR: /dev/hailo0 not found. Reboot after hailo-h10-all install."
    exit 1
fi

Model Pre-caching

bash

# Installer: Pre-download and load model to avoid cold-start delays
su - hailo -s /bin/bash -c "/usr/local/bin/ollama pull neural-chat"

# Service startup script loads model into memory:
# - systemd ExecStartPost can trigger initial model load
# - Or service startup code loads default model
# - Subsequent requests use already-loaded model (fast)

Managing Multiple Concurrent Services

bash

# Example multi-service deployment on Pi 5 (8GB RAM):
# - hailo-ollama: 2GB (LLM inference)
# - hailo-whisper: 1GB (STT)
# - hailo-yolo: 512MB (object detection)
# - OS + overhead: 1.5GB
# Total: ~5GB (safe with headroom)

# Monitor memory across services:
free -h
systemctl status hailo-ollama hailo-whisper hailo-yolo

# Each service tracks its own memory via /api/health:
curl http://localhost:11434/api/health  # Ollama
curl http://localhost:8082/api/health   # Whisper
curl http://localhost:8080/api/health   # YOLO

Service Integration Patterns & Debugging

Understanding Systemd vs Manual Execution

Key principle: Services running under systemd have constrained environments that differ from manual execution. What works when you run application may fail as a service.

Common differences:

•Environment variables: Limited to what's explicitly set in unit file
•Working directory: Set by WorkingDirectory=, not your shell's $PWD
•User context: Runs as service user (e.g., hailo-ollama), not root or your user
•PATH: May not include /usr/local/bin or other custom paths
•Home directory: Service user's home (/var/lib/service-name), not /root or /home/you
•Resource discovery: Applications may search different paths under different users

Debugging strategy:

•

Test as the service user first:

bash

# Become the service user and test manually
sudo -u hailo-ollama bash
cd /var/lib/hailo-ollama
/usr/bin/hailo-ollama  # Run exactly as service will

•

Match the service environment:

bash

# Export the exact environment from unit file
sudo -u hailo-ollama \
  XDG_DATA_HOME=/var/lib \
  XDG_CONFIG_HOME=/etc/xdg \
  bash -c '/usr/bin/hailo-ollama'

•

Inspect the running service environment:

bash

# Check environment variables
sudo cat /proc/$(pgrep service-name)/environ | tr '\0' '\n'

# Check mount points (for bind mounts)
sudo cat /proc/$(pgrep service-name)/mountinfo | grep service-name

# Check working directory
ls -la /proc/$(pgrep service-name)/cwd

Resource Discovery & Package Integration

Problem pattern: Application works manually but service can't find resources (models, configs, manifests).

Discovery mechanisms to investigate:

•

Hardcoded paths:

bash

# Search binary for path strings
strings /usr/bin/application | grep -E "(usr|var|etc|share)"

•

Environment-based paths (XDG, HOME, etc.):

bash

# Trace file access during startup
strace -e openat,access /usr/bin/application 2>&1 | grep -E "(model|config|manifest)"

•

Config file references:

bash

# Check installed config files
dpkg -L package-name | grep -E "\.(json|yaml|conf|ini)$"

Common solutions:

Option 1: Configuration (preferred when supported):

ini

# systemd unit
Environment=MODEL_PATH=/usr/share/hailo-models

Option 2: Bind mount (when app searches fixed user paths):

ini

# Mount package resources into service's writable state
BindReadOnlyPaths=/usr/share/package-resources:/var/lib/service-name/resources

Why bind mounts over symlinks:

•Symlinks fail when scanners check d_type: Directory scanners using readdir() see DT_LNK for symlinks, DT_DIR for real directories. Many apps skip non-directory entries.
•Bind mounts are transparent: Appear as real directories to all applications
•Per-service scope: Mount exists only in service's namespace, no system-wide changes
•Package update safe: Contents automatically reflect package upgrades

Option 3: Copy resources (last resort):

bash

# In installer: copy package resources to service directory
cp -r /usr/share/package-resources/* /var/lib/service-name/resources/
chown -R service-user:service-user /var/lib/service-name/resources/

Downsides: Duplicates data, stale after package upgrades, requires re-copy logic.

Debugging with strace

When to use: Application fails in service but works manually; need to see what it's actually accessing.

bash

# Trace file operations during startup
sudo systemctl stop service-name
strace -f -e openat,access,stat /usr/bin/application 2>&1 | tee /tmp/strace.log

# Common patterns to look for:
grep -E "ENOENT|EACCES" /tmp/strace.log  # File not found or permission denied
grep "manifest" /tmp/strace.log           # Resource discovery paths
grep "\.so" /tmp/strace.log               # Library loading issues

Attach to running service:

bash

# Trace a running process
pid=$(pgrep service-name)
sudo strace -p $pid -e openat -f 2>&1 | grep -i resource

Testing Service Integration

Progressive verification checklist:

•

Package resources installed:

bash

dpkg -L package-name | grep -E "models|manifests|configs"
ls -la /usr/share/package-name/

•

Service user can read resources:

bash

sudo -u service-user ls -la /usr/share/package-resources/
sudo -u service-user cat /usr/share/package-resources/model.hef

•

Manual execution as service user:

bash

sudo systemctl stop service-name
sudo -u service-user bash -c 'cd /var/lib/service-name && /usr/bin/application'
# Verify application finds resources

•

Service execution:

bash

sudo systemctl start service-name
sudo systemctl status service-name
sudo journalctl -u service-name -n 50

•

API validation:

bash

curl http://localhost:PORT/api/health
curl http://localhost:PORT/api/list-resources

Common Integration Issues

Issue: "Resources not found" in service but work manually

Cause: Application searches paths relative to $HOME, $XDG_DATA_HOME, or current directory.

Debug:

bash

# Compare environments
env | grep -E "HOME|XDG|PATH" > /tmp/manual.env
sudo cat /proc/$(pgrep service-name)/environ | tr '\0' '\n' | grep -E "HOME|XDG|PATH" > /tmp/service.env
diff /tmp/manual.env /tmp/service.env

Solutions:

•Set missing environment variables in unit file
•Use BindReadOnlyPaths to mount resources into expected location
•Provide config file with explicit paths

Issue: "Permission denied" accessing /dev/hailo0

Cause: Service user not in Hailo device group.

Fix:

bash

stat -c '%G' /dev/hailo0  # Check device group (usually 'hailo' or 'root')
sudo usermod -aG hailo service-user
sudo systemctl restart service-name

Issue: Models load on first run but not after package upgrade

Cause: Resources copied during install rather than referenced from package location.

Fix: Use bind mounts or config paths pointing to /usr/share instead of copying.

Case Study: XDG Base Directory Spec

Scenario: Application uses XDG Base Directory specification but only searches XDG_DATA_HOME, not XDG_DATA_DIRS.

Symptoms:

•Manual execution finds resources in /usr/share/application/
•Service execution fails to find same resources
•XDG_DATA_DIRS includes /usr/share but doesn't help

Root cause: Application follows XDG spec incompletely—only checks user data location, not system data directories.

Solution: Use BindReadOnlyPaths:

ini

# systemd unit
Environment=XDG_DATA_HOME=/var/lib
BindReadOnlyPaths=/usr/share/application/resources:/var/lib/application/resources

Lesson: Don't assume applications follow specs completely. Verify actual behavior with strace.

See hailo-ollama MANIFEST_DISCOVERY.md for detailed case study.

Monitoring & Debugging

systemd Service Logs

bash

# Real-time logs
sudo journalctl -u hailo-ollama.service -f

# Debug level logs (if configured)
sudo journalctl -u hailo-ollama.service -p debug

# Continuous output with timestamps
sudo journalctl -u hailo-ollama.service -o short-iso -f

API Testing

bash

# Health
curl http://localhost:11434/api/health | jq

# List models
curl http://localhost:11434/api/tags | jq

# Chat inference
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neural-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }' | jq

Thermal & Resource Monitoring

bash

# During active inference, monitor in separate terminal
watch -n 1 'echo "Temp: $(vcgencmd measure_temp)" && \
  free -h && \
  ps aux | grep hailo-ollama'

Reference:

•Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
•Hailo documentation: https://www.raspberrypi.com/documentation/computers/ai.html