llama.cpp C API Guide

Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.

Overview

llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:

•Complete API Reference: All non-deprecated functions organized by category
•Common Workflows: Working examples for typical use cases
•Best Practices: Patterns for efficient and correct API usage

Quick Start

See references/workflows.md for complete working examples. Basic workflow:

•llama_backend_init() - Initialize backend
•llama_model_load_from_file() - Load model
•llama_init_from_model() - Create context
•llama_tokenize() - Convert text to tokens
•llama_decode() - Process tokens
•llama_sampler_sample() - Sample next token
•Cleanup in reverse order

When to Use This Skill

Use this skill when:

•API Lookup: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
•Code Generation: You're writing C code that uses llama.cpp
•Workflow Guidance: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
•Advanced Features: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
•Migration: You're updating code from deprecated functions to current API

Core Concepts

Key Objects

•llama_model: Loaded model weights and architecture
•llama_context: Inference state (KV cache, compute buffers)
•llama_batch: Input tokens and positions for processing
•llama_sampler: Token sampling configuration
•llama_vocab: Vocabulary and tokenizer
•llama_memory_t: KV cache memory handle

Typical Flow

•Initialize: llama_backend_init()
•Load Model: llama_model_load_from_file()
•Create Context: llama_init_from_model()
•Tokenize: llama_tokenize()
•Process: llama_encode() or llama_decode()
•Sample: llama_sampler_sample()
•Generate: Repeat the Process and Sample steps (steps 5-6 in this list)
•Cleanup: Free in reverse order
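
The Process/Sample/Generate part of this flow can be sketched without linking llama.cpp by standing in for the three calls with function pointers. Everything named here (gen_ops, generate, and the toy_* model) is hypothetical and exists only for illustration. In real code the callbacks would wrap llama_decode(), llama_sampler_sample(), and llama_vocab_is_eog().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three calls in the loop. In real code they
   would wrap llama_decode(), llama_sampler_sample() and llama_vocab_is_eog(). */
typedef struct {
    int     (*decode)(void *state, const int32_t *tokens, int32_t n_tokens); /* 0 on success */
    int32_t (*sample)(void *state);
    bool    (*is_eog)(void *state, int32_t token);
    void     *state;
} gen_ops;

/* Process the prompt, then sample and feed back one token at a time.
   Returns the number of tokens written to out, or -1 if a decode fails. */
int32_t generate(const gen_ops *ops, const int32_t *prompt, int32_t n_prompt,
                 int32_t *out, int32_t n_predict)
{
    assert(ops != NULL && prompt != NULL && out != NULL);
    assert(n_prompt > 0 && n_predict >= 0);

    if (ops->decode(ops->state, prompt, n_prompt) != 0) return -1; /* Process */

    int32_t n_out = 0;
    while (n_out < n_predict) {
        int32_t tok = ops->sample(ops->state);                     /* Sample */
        if (ops->is_eog(ops->state, tok)) break;                   /* end of generation */
        out[n_out++] = tok;
        if (ops->decode(ops->state, &tok, 1) != 0) return -1;      /* feed the token back */
    }
    return n_out;
}

/* Toy model for demonstration only: predicts (last token + 1), ends at TOY_EOG. */
enum { TOY_EOG = 5 };
typedef struct { int32_t last; } toy_state;

int toy_decode(void *s, const int32_t *t, int32_t n)
{
    ((toy_state *)s)->last = t[n - 1];
    return 0;
}

int32_t toy_sample(void *s) { return ((toy_state *)s)->last + 1; }

bool toy_is_eog(void *s, int32_t tok) { (void)s; return tok == TOY_EOG; }
```

With the toy model and the prompt {1, 2}, generate() writes tokens 3 and 4 and then stops because the next sampled token is TOY_EOG. The loop checks for end of generation before storing or decoding the token, so the end marker never reaches the output buffer.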

API Reference

The complete API reference is split across 6 files for efficient targeted loading. Start with references/api-core.md, which links to all other sections.

API Files:

•api-core.md (220 lines) - Initialization, parameters, model loading
•api-model-info.md (193 lines) - Model properties, architecture detection
•api-context.md (412 lines) - Context, memory (KV cache), state management
•api-inference.md (417 lines) - Batch operations, inference, tokenization, chat
•api-sampling.md (490 lines) - All 26+ sampling strategies (incl. adaptive-p) + backend sampling API
•api-advanced.md (359 lines) - LoRA adapters, performance, training

Total: 197 active functions (b7942) across 6 organized files

Quick Function Lookup

Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()

See references/api-core.md and the files it links to for all 197 function signatures and detailed usage.

Common Workflows

See references/workflows.md for 13 complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.

Best Practices

See references/workflows.md for detailed best practices. Key points:

•Always use default parameter functions (llama_model_default_params(), etc.)
•Check return values for errors
•Free resources in reverse order of creation
•Handle dynamic buffer sizes for tokenization
•Query actual context size after creation (llama_n_ctx())
•Check for end-of-generation with llama_vocab_is_eog()

Common Patterns

Common patterns include the end-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.

Troubleshooting

Common Issues

Model loading fails:

•Verify file path and GGUF format validity
•Check available RAM/VRAM for model size
•Reduce n_gpu_layers if GPU memory insufficient

Tokenization returns negative value:

•Buffer too small; the return value n is the negated number of tokens required, so reallocate to hold -n tokens and retry
•See tokenization pattern in Common Patterns
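
The retry on a negative return value can be sketched as below. This is a minimal illustration, not the library's own helper: tokenize_fn, tokenize_with_retry, and demo_byte_tokenizer are hypothetical names, and the callback only mimics the return convention described above. Real code would call llama_tokenize() inside the callback.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback with the return convention described above:
   the token count on success, or the negated required count when
   n_max is too small. Real code would wrap llama_tokenize() here. */
typedef int32_t (*tokenize_fn)(const char *text, int32_t text_len,
                               int32_t *tokens, int32_t n_max, void *user);

/* Tokenize into a heap buffer, growing it once if the first call reports
   that it was too small. Returns 0 on success, -1 on failure; the caller
   frees *out_tokens. */
int tokenize_with_retry(tokenize_fn fn, void *user,
                        const char *text, int32_t text_len,
                        int32_t **out_tokens, int32_t *out_n)
{
    assert(fn != NULL && text != NULL && out_tokens != NULL && out_n != NULL);

    int32_t cap = 8; /* deliberately small first guess */
    int32_t *buf = malloc((size_t)cap * sizeof *buf);
    if (buf == NULL) return -1;

    int32_t n = fn(text, text_len, buf, cap, user);
    if (n < 0) {
        if (n == INT32_MIN) { free(buf); return -1; } /* -n is not representable */
        cap = -n;                                     /* required token count */
        int32_t *grown = realloc(buf, (size_t)cap * sizeof *buf);
        if (grown == NULL) { free(buf); return -1; }
        buf = grown;
        n = fn(text, text_len, buf, cap, user);
        if (n < 0) { free(buf); return -1; }          /* still failing: give up */
    }

    *out_tokens = buf;
    *out_n = n;
    return 0;
}

/* Toy tokenizer for demonstration only: one token per byte. */
int32_t demo_byte_tokenizer(const char *text, int32_t text_len,
                            int32_t *tokens, int32_t n_max, void *user)
{
    (void)user;
    if (text_len > n_max) return -text_len;
    for (int32_t i = 0; i < text_len; i++) tokens[i] = (unsigned char)text[i];
    return text_len;
}
```

Starting from a small buffer forces the retry path for any input longer than 8 tokens, which makes the pattern easy to exercise. A production version would more likely start from an estimate based on the text length.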

Decode/encode returns non-zero:

•Verify batch initialization (llama_batch_get_one() or llama_batch_init())
•Check context capacity (llama_n_ctx())
•Ensure positions within context window
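
The capacity check can be reduced to simple arithmetic done before each call. The sketch below assumes a single sequence whose positions run from 0 to n_past - 1, with n_ctx taken from llama_n_ctx(). The helper names batch_fits_context and next_chunk_len are hypothetical and not part of llama.cpp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n_new more tokens fit after n_past tokens in a context of size
   n_ctx (n_ctx as reported by llama_n_ctx()). */
bool batch_fits_context(uint32_t n_ctx, int32_t n_past, int32_t n_new)
{
    assert(n_past >= 0 && n_new >= 0);
    /* Widen before adding so the sum cannot overflow. */
    return (uint64_t)n_past + (uint64_t)n_new <= (uint64_t)n_ctx;
}

/* Length of the next prompt chunk when submitting at most n_batch tokens
   per call; returns 0 once the whole prompt has been submitted. */
int32_t next_chunk_len(int32_t n_total, int32_t n_done, int32_t n_batch)
{
    assert(n_batch > 0 && n_done >= 0 && n_done <= n_total);
    int32_t remaining = n_total - n_done;
    return remaining < n_batch ? remaining : n_batch;
}
```

For example, with n_ctx = 4096 and 4000 tokens already processed, a batch of 96 tokens fits and a batch of 97 does not. A 1000-token prompt submitted with n_batch = 512 goes in as chunks of 512 and 488 tokens.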

Silent failures / no output:

•Check if llama_vocab_is_eog() immediately returns true
•Verify sampler initialization
•Enable logging: llama_log_set()

Performance issues:

•Increase n_threads for CPU
•Set n_gpu_layers for GPU offloading
•Use larger n_batch for prompts
•See Performance & Utilities in api-advanced.md

Sliding Window Attention (SWA) issues:

•If using Mistral-style models with SWA, set ctx_params.swa_full = true to access tokens beyond the attention window
•Call llama_model_n_swa(model) to detect the SWA size and configuration needs
•Symptoms: Token positions beyond the window size cause decode errors

Per-sequence state errors:

•Ensure sequence ID matches when loading: llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
•Verify token buffer is large enough for loaded tokens
•Check sequence wasn't cleared or removed before loading state

Model type detection:

•Use llama_model_has_encoder() before assuming decoder-only architecture
•For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
•Encoder-decoder models require llama_encode() then llama_decode() workflow

For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

Resources

•API Reference (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading:
  •api-core.md - Initialization, parameters, model loading
  •api-model-info.md - Model properties, architecture detection
  •api-context.md - Context, memory, state management
  •api-inference.md - Batch, inference, tokenization, chat
  •api-sampling.md - All 25+ sampling strategies + backend sampling API
  •api-advanced.md - LoRA, performance, training
•references/workflows.md (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).

Key Differences from Deprecated API

If you're updating old code:

•Use llama_model_load_from_file() instead of llama_load_model_from_file()
•Use llama_model_free() instead of llama_free_model()
•Use llama_init_from_model() instead of llama_new_context_with_model()
•Use llama_vocab_*() functions instead of llama_token_*()
•Use llama_state_*() functions instead of deprecated state functions

See the API reference for complete mappings.