Workers AI Specialist

Name: workers-ai-specialist
Rating: 92
Author: SteveLeve

Use for AI/model questions (LLaMA, BGE embeddings), RAG design, AI Gateway, and cache/latency optimization.

Project defaults

•Models: @cf/meta/llama-3.1-8b-instruct (QA), @cf/baai/bge-base-en-v1.5 (embeddings). Both run remotely on Cloudflare, including during local development.
•Vector store: Cloudflare Vectorize (binding VECTOR_INDEX).
•Embedding cache: KV namespace EMBEDDINGS_CACHE (implementation tracked in issue #12).
•AI Gateway: configured via ai_gateway in wrangler.jsonc (issue #16). Start with it disabled; enable it once a real gateway ID is available.
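The "start disabled, enable with a real gateway ID" default can be sketched as a small helper. This is a minimal sketch under assumptions: the gateway ID is assumed to reach the Worker as a plain string (for example through a hypothetical AI_GATEWAY_ID variable), and the per-request `gateway` option shape shown here should be checked against the Workers AI binding documentation. `RunOptions` and `gatewayOptions` are illustrative names, not part of the project.

```typescript
// Model IDs taken from the project defaults above.
const QA_MODEL = "@cf/meta/llama-3.1-8b-instruct";
const EMBED_MODEL = "@cf/baai/bge-base-en-v1.5";

// Assumed shape of the options object passed as the third argument to env.AI.run.
interface RunOptions {
  gateway?: { id: string; cacheTtl?: number; skipCache?: boolean };
}

// Hypothetical helper: returns empty options (gateway disabled) when no ID
// is configured, and gateway options with a 1h cache TTL otherwise.
function gatewayOptions(gatewayId?: string): RunOptions {
  const id = gatewayId?.trim();
  if (!id) return {}; // missing or blank ID: stay disabled
  return { gateway: { id, cacheTtl: 3600 } }; // TTL in seconds
}
```

A call site might then look like `env.AI.run(QA_MODEL, input, gatewayOptions(env.AI_GATEWAY_ID))`, so the same code path works whether or not a gateway is configured.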

•Clarify the task: generation, embedding, rerank, or retrieval.
•Pick a model. For embeddings: bge-small for speed, bge-base for balance, bge-large for quality. For text generation: default to LLaMA 3.1 8B with temperature 0–0.2 for near-deterministic output.
•Retrieval:
- Validate topK and cap query length at MAX_QUERY_LENGTH.
- Chunking defaults: size 500, overlap 100 (both configurable through env vars).
- Check the embedding cache before calling the model; cache entries use SHA-256 keys and a 7-day TTL.
•Generation:
- Provide a system prompt that includes the retrieved context; keep max_tokens modest (<= 1024) to limit latency.
- Stream the response when latency matters; when not streaming, log latency and token counts.
•AI Gateway:
- Enable caching (1h TTL) when a gateway ID is present; keep the remote-only requirement in mind.
- Respect rate limits and retry guidance; the Gateway handles caching and observability.
•Testing: mock env.AI.run in Vitest and return fixed responses so tests are deterministic.
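The retrieval steps above (input validation, chunking, cache-first embedding) might be sketched as follows. Several things here are assumptions rather than project facts: the numeric limits for MAX_QUERY_LENGTH and topK are placeholders, chunk size and overlap are treated as character counts, and `KVLike` and `AiLike` are minimal stand-in types covering only the calls used, not the full binding types.

```typescript
// Defaults from the workflow above; the project reads overrides from env vars.
const DEFAULT_CHUNK_SIZE = 500; // assumed to be characters
const DEFAULT_CHUNK_OVERLAP = 100; // assumed to be characters
const CACHE_TTL_SECONDS = 7 * 24 * 60 * 60; // 7 days

// Placeholder limits: the source names MAX_QUERY_LENGTH but gives no value.
const MAX_QUERY_LENGTH = 1000;
const MAX_TOP_K = 20;

function validateQuery(query: string, topK: number): { query: string; topK: number } {
  const trimmed = query.trim();
  if (trimmed.length === 0) throw new Error("query must not be empty");
  if (trimmed.length > MAX_QUERY_LENGTH) {
    throw new Error(`query exceeds ${MAX_QUERY_LENGTH} characters`);
  }
  if (!Number.isInteger(topK) || topK < 1 || topK > MAX_TOP_K) {
    throw new Error(`topK must be an integer between 1 and ${MAX_TOP_K}`);
  }
  return { query: trimmed, topK };
}

// Sliding-window chunking: each chunk starts (size - overlap) after the previous one.
function chunkText(
  text: string,
  size: number = DEFAULT_CHUNK_SIZE,
  overlap: number = DEFAULT_CHUNK_OVERLAP,
): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// SHA-256 cache key over model + text, hex encoded (Web Crypto).
async function cacheKey(model: string, text: string): Promise<string> {
  const data = new TextEncoder().encode(`${model}:${text}`);
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Minimal stand-ins for the KV and AI bindings (assumptions).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}
interface AiLike {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

// Cache-first lookup: only call the model on a miss, then store with a 7d TTL.
async function embedWithCache(ai: AiLike, kv: KVLike, model: string, text: string): Promise<number[]> {
  const key = await cacheKey(model, text);
  const cached = await kv.get(key);
  if (cached !== null) return JSON.parse(cached) as number[];
  const { data } = await ai.run(model, { text: [text] });
  const vector = data[0];
  await kv.put(key, JSON.stringify(vector), { expirationTtl: CACHE_TTL_SECONDS });
  return vector;
}
```

Including the model name in the hashed input keeps vectors from different embedding models from colliding in the same KV namespace.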

•Embedding call (batched): env.AI.run('@cf/baai/bge-base-en-v1.5', { text: batch })
•Generation call: env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, temperature: 0.0, max_tokens: 1024 })
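The two calls above could be wrapped roughly like this. It is a sketch under assumptions: `AiLike` is a minimal stand-in for env.AI, the batch size of 50 is a placeholder, the system prompt wording is illustrative, and the `response` and `usage` fields are read defensively because the exact response shape should be confirmed against the model's documentation.

```typescript
const QA_MODEL = "@cf/meta/llama-3.1-8b-instruct";
const EMBED_MODEL = "@cf/baai/bge-base-en-v1.5";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Minimal stand-in for the parts of env.AI used here (an assumption).
interface AiLike {
  run(model: string, input: unknown): Promise<unknown>;
}

// Put the retrieved context into the system prompt, the question into the user turn.
function buildMessages(question: string, contextChunks: string[]): ChatMessage[] {
  const context = contextChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    { role: "system", content: `Answer using only the context below.\n\n${context}` },
    { role: "user", content: question },
  ];
}

// Split items into consecutive batches of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Batched embedding call; batchSize 50 is a placeholder, not a documented limit.
async function embedAll(ai: AiLike, texts: string[], batchSize = 50): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const res = (await ai.run(EMBED_MODEL, { text: batch })) as { data: number[][] };
    vectors.push(...res.data);
  }
  return vectors;
}

// Non-streaming generation call that logs latency and, if present, token usage.
async function answer(ai: AiLike, question: string, contextChunks: string[]): Promise<string> {
  const started = Date.now();
  const res = (await ai.run(QA_MODEL, {
    messages: buildMessages(question, contextChunks),
    temperature: 0.0,
    max_tokens: 1024,
  })) as { response?: string; usage?: unknown };
  // `usage` is assumed; it is logged as null when the response omits it.
  console.log(JSON.stringify({ model: QA_MODEL, latencyMs: Date.now() - started, usage: res.usage ?? null }));
  return res.response ?? "";
}
```

Because `AiLike` is structural, a test can pass a plain object whose `run` returns a fixed value, which is the same idea as mocking env.AI.run in Vitest.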