llama.cpp C API Guide
Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
Overview
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:
- •Complete API Reference: All non-deprecated functions organized by category
- •Common Workflows: Working examples for typical use cases
- •Best Practices: Patterns for efficient and correct API usage
Quick Start
See references/workflows.md for complete working examples. Basic workflow:
1. llama_backend_init() - Initialize backend
2. llama_model_load_from_file() - Load model
3. llama_init_from_model() - Create context
4. llama_tokenize() - Convert text to tokens
5. llama_decode() - Process tokens
6. llama_sampler_sample() - Sample next token
7. Cleanup in reverse order
When to Use This Skill
Use this skill when:
- •API Lookup: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
- •Code Generation: You're writing C code that uses llama.cpp
- •Workflow Guidance: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
- •Advanced Features: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
- •Migration: You're updating code from deprecated functions to current API
Core Concepts
Key Objects
- •llama_model: Loaded model weights and architecture
- •llama_context: Inference state (KV cache, compute buffers)
- •llama_batch: Input tokens and positions for processing
- •llama_sampler: Token sampling configuration
- •llama_vocab: Vocabulary and tokenizer
- •llama_memory_t: KV cache memory handle
Typical Flow
1. Initialize: llama_backend_init()
2. Load Model: llama_model_load_from_file()
3. Create Context: llama_init_from_model()
4. Tokenize: llama_tokenize()
5. Process: llama_encode() or llama_decode()
6. Sample: llama_sampler_sample()
7. Generate: Repeat steps 5-6
8. Cleanup: Free in reverse order
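The control flow of steps 5-6 can be sketched without linking against llama.cpp. The block below is a minimal sketch, not the library's own code: gen_ops, generate(), and the toy_* functions are hypothetical names introduced for illustration. The three callbacks stand in for llama_decode(), llama_sampler_sample(), and llama_vocab_is_eog(), whose real signatures take additional arguments (context, batch, vocab).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int32_t token_t; /* stand-in for llama_token */

/* Hypothetical callbacks standing in for the real API calls. */
typedef struct {
    int     (*decode)(void *state, const token_t *tokens, int32_t n_tokens); /* 0 = success */
    token_t (*sample)(void *state);
    bool    (*is_eog)(void *state, token_t tok);
    void     *state;
} gen_ops;

/* Runs the decode/sample loop. Returns the number of tokens written to
 * out, or -1 if a decode call reports failure. */
int32_t generate(const gen_ops *ops, const token_t *prompt, int32_t n_prompt,
                 token_t *out, int32_t max_new)
{
    assert(ops && prompt && out && n_prompt > 0 && max_new >= 0);

    /* Step 5: process the whole prompt in one call. */
    if (ops->decode(ops->state, prompt, n_prompt) != 0) return -1;

    int32_t n_out = 0;
    while (n_out < max_new) {
        /* Step 6: sample the next token and stop on end-of-generation. */
        token_t tok = ops->sample(ops->state);
        if (ops->is_eog(ops->state, tok)) break;
        out[n_out++] = tok;

        /* Step 5 again: feed the single new token back in. */
        if (ops->decode(ops->state, &tok, 1) != 0) return -1;
    }
    return n_out;
}

/* Toy backend so the loop can run: it "predicts" last token + 1 and
 * treats any token >= eog as end-of-generation. */
typedef struct { token_t last; token_t eog; } toy_state;

int toy_decode(void *state, const token_t *tokens, int32_t n_tokens)
{
    toy_state *s = state;
    if (n_tokens <= 0) return -1;
    s->last = tokens[n_tokens - 1];
    return 0;
}

token_t toy_sample(void *state)
{
    toy_state *s = state;
    return s->last + 1;
}

bool toy_is_eog(void *state, token_t tok)
{
    toy_state *s = state;
    return tok >= s->eog;
}
```

In real code the callbacks would be replaced by direct calls on a llama_context, llama_sampler, and llama_vocab, and the loop would also need to stay within the context size reported by llama_n_ctx().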
API Reference
The complete API reference is split across 6 files so that each category can be loaded on its own. Start with references/api-core.md, which links to all other sections.
API Files:
- •api-core.md (220 lines) - Initialization, parameters, model loading
- •api-model-info.md (193 lines) - Model properties, architecture detection (new)
- •api-context.md (412 lines) - Context, memory (KV cache), state management
- •api-inference.md (417 lines) - Batch operations, inference, tokenization, chat
- •api-sampling.md (490 lines) - All 26+ sampling strategies (incl. adaptive-p) + backend sampling API
- •api-advanced.md (359 lines) - LoRA adapters, performance, training
Total: 197 active functions (b7942) across 6 organized files
Quick Function Lookup
Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See references/api-core.md and the files it links to for all 197 function signatures and detailed usage.
Common Workflows
See references/workflows.md for 13 complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
Best Practices
See references/workflows.md for detailed best practices. Key points:
- •Always use default parameter functions (llama_model_default_params(), etc.)
- •Check return values for errors
- •Free resources in reverse order of creation
- •Handle dynamic buffer sizes for tokenization
- •Query actual context size after creation (llama_n_ctx())
- •Check for end-of-generation with llama_vocab_is_eog()
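Two of the points above, checking return values and freeing in reverse order of creation, are commonly combined in C with a goto-based cleanup ladder. The following is a minimal sketch of that idiom only: trace, acquire(), release(), and run_session() are hypothetical helpers that record call order instead of touching llama.cpp, and the comments name the real calls each step would correspond to.

```c
#include <assert.h>
#include <stdbool.h>

/* Records call order: acquisitions in upper case, releases in lower case. */
typedef struct {
    char order[16];
    int  n;
    char fail_at; /* step that should fail, or 0 for none */
} trace;

static void note(trace *t, char c)
{
    assert(t->n < (int) sizeof t->order - 1);
    t->order[t->n++] = c;
    t->order[t->n] = '\0';
}

/* Simulated acquisition: fails (and records nothing) at t->fail_at. */
bool acquire(trace *t, char c)
{
    if (c == t->fail_at) return false;
    note(t, c);
    return true;
}

void release(trace *t, char c) { note(t, c); }

/* Returns 0 on success, -1 if any acquisition fails. Each failure jumps
 * to the label that releases only what was already acquired. */
int run_session(trace *t)
{
    int rc = -1;
    if (!acquire(t, 'B')) return rc;          /* llama_backend_init() */
    if (!acquire(t, 'M')) goto free_backend;  /* llama_model_load_from_file() */
    if (!acquire(t, 'C')) goto free_model;    /* llama_init_from_model() */
    if (!acquire(t, 'S')) goto free_context;  /* sampler creation */

    rc = 0; /* tokenize, decode, and sample would happen here */

    release(t, 's');                          /* llama_sampler_free() */
free_context:
    release(t, 'c');                          /* llama_free() */
free_model:
    release(t, 'm');                          /* llama_model_free() */
free_backend:
    release(t, 'b');                          /* llama_backend_free() */
    return rc;
}
```

A successful run records "BMCSscmb"; a run with fail_at set to 'C' records "BMmb", so the context and sampler are never released without having been created.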
Common Patterns
The most common patterns are the end-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.
Troubleshooting
Common Issues
Model loading fails:
- •Verify file path and GGUF format validity
- •Check available RAM/VRAM for model size
- •Reduce n_gpu_layers if GPU memory is insufficient
Tokenization returns negative value:
- •The buffer is too small. The return value is the negated required token count (-n), so reallocate the buffer to hold n tokens and retry
- •See the tokenization pattern in Common Patterns
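The retry step can be sketched as follows. This is a minimal illustration under stated assumptions, not llama.cpp code: stub_tokenize() is a hypothetical stand-in that emits one token per byte and follows the same return convention (token count on success, negated required count when the buffer is too small), and tokenize_alloc() is a hypothetical helper name. The real llama_tokenize() additionally takes a llama_vocab pointer and the add_special and parse_special flags.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int32_t token_t; /* stand-in for llama_token */

/* Stand-in tokenizer: one token per byte. Returns the token count, or
 * the negated required count if n_tokens_max is too small. */
int32_t stub_tokenize(const char *text, int32_t text_len,
                      token_t *tokens, int32_t n_tokens_max)
{
    if (text_len > n_tokens_max) return -text_len;
    for (int32_t i = 0; i < text_len; i++) {
        tokens[i] = (unsigned char) text[i];
    }
    return text_len;
}

/* Tokenizes text into a heap buffer, growing it once if the initial
 * capacity is too small. On success returns the token count and stores
 * the buffer in *out (caller frees). Returns -1 on failure. */
int32_t tokenize_alloc(const char *text, int32_t initial_cap, token_t **out)
{
    assert(text && out && initial_cap >= 0);

    int32_t text_len = (int32_t) strlen(text);
    int32_t cap = initial_cap;
    token_t *buf = malloc((size_t) (cap > 0 ? cap : 1) * sizeof *buf);
    if (!buf) return -1;

    int32_t n = stub_tokenize(text, text_len, buf, cap);
    if (n < 0) {
        /* Negative result: -n is the number of tokens actually needed. */
        cap = -n;
        token_t *grown = realloc(buf, (size_t) cap * sizeof *grown);
        if (!grown) { free(buf); return -1; }
        buf = grown;

        n = stub_tokenize(text, text_len, buf, cap);
        if (n < 0) { free(buf); return -1; }
    }

    *out = buf;
    return n;
}
```

Calling tokenize_alloc("hello", 2, &toks) with this stub first receives -5, grows the buffer to five entries, and succeeds on the second attempt.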
Decode/encode returns non-zero:
- •Verify batch initialization (llama_batch_get_one() or llama_batch_init())
- •Check context capacity (llama_n_ctx())
- •Ensure positions are within the context window
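The capacity check can be done before submitting a batch. The sketch below uses two hypothetical helpers: batch_fits() compares the tokens already in the context plus the new batch against the value returned by llama_n_ctx(), and decode_status() maps a return code to a short description. The meaning of the individual codes is an assumption taken from the comments in llama.h and should be confirmed against the version in use; this guide itself only states that non-zero indicates a problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n_tokens new tokens, appended after n_past existing ones,
 * still fit in a context of n_ctx positions. */
bool batch_fits(uint32_t n_ctx, int32_t n_past, int32_t n_tokens)
{
    assert(n_past >= 0 && n_tokens >= 0);
    return (uint64_t) n_past + (uint64_t) n_tokens <= (uint64_t) n_ctx;
}

/* Describes a decode return code. Assumed mapping, see the note above. */
const char *decode_status(int32_t rc)
{
    if (rc == 0)  return "success";
    if (rc == 1)  return "no KV cache slot for the batch";
    if (rc == 2)  return "aborted";
    if (rc == -1) return "invalid input batch";
    if (rc < -1)  return "fatal error";
    return "other warning";
}
```

If batch_fits() returns false, the usual options are a smaller batch, a larger context, or clearing part of the memory before decoding.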
Silent failures / no output:
- •Check if llama_vocab_is_eog() returns true immediately
- •Verify sampler initialization
- •Enable logging: llama_log_set()
Performance issues:
- •Increase n_threads for CPU
- •Set n_gpu_layers for GPU offloading
- •Use larger n_batch for prompts
- •See Performance & Utilities
Sliding Window Attention (SWA) issues:
- •If using Mistral-style models with SWA, set ctx_params.swa_full = true to access tokens beyond the attention window
- •Call llama_model_n_swa(model) to detect the SWA size and configuration needs
- •Symptom: decode errors when token positions exceed the window size
Per-sequence state errors:
- •Ensure the sequence ID matches when loading: llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
- •Verify the token buffer is large enough for the loaded tokens
- •Check that the sequence wasn't cleared or removed before loading state
Model type detection:
- •Use llama_model_has_encoder() before assuming decoder-only architecture
- •For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
- •Encoder-decoder models require an llama_encode() then llama_decode() workflow
For advanced issues: https://github.com/ggerganov/llama.cpp/discussions
Resources
- •API Reference (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading:
- •api-core.md - Initialization, parameters, model loading
- •api-model-info.md - Model properties, architecture detection
- •api-context.md - Context, memory, state management
- •api-inference.md - Batch, inference, tokenization, chat
- •api-sampling.md - All 26+ sampling strategies + backend sampling API
- •api-advanced.md - LoRA, performance, training
- •references/workflows.md (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).
Key Differences from Deprecated API
If you're updating old code:
- •Use llama_model_load_from_file() instead of llama_load_model_from_file()
- •Use llama_model_free() instead of llama_free_model()
- •Use llama_init_from_model() instead of llama_new_context_with_model()
- •Use llama_vocab_*() functions instead of llama_token_*()
- •Use llama_state_*() functions instead of deprecated state functions
See the API reference for complete mappings.