Feature Flag Strategy
Managing 20+ feature combinations, platform-specific builds, and conditional compilation in Kreuzberg's polyglot architecture
Feature Matrix Overview
Location: Cargo.toml (workspace), crates/kreuzberg/Cargo.toml, FEATURE_MATRIX.md
Kreuzberg manages 20+ interdependent features across 9 language bindings:
Rust Core Crate (crates/kreuzberg)
├── OCR Backends
│ ├── tesseract (default)
│ ├── tesseract-static (bundled binary)
│ └── ocr-minimal (text extraction only)
├── Document Formats
│ ├── pdf (default, with PDFium)
│ ├── pdf-minimal (text only)
│ ├── office (DOCX, XLSX, PPTX)
│ └── office-minimal (text only)
├── AI/ML Features
│ ├── embeddings (ONNX Runtime + fastembed)
│ ├── keywords-yake (YAKE keyword extraction)
│ ├── keywords-rake (RAKE keyword extraction)
│ └── language-detection (fast-langdetect)
├── API & Server
│ ├── api (Axum server, REST endpoints)
│ ├── mcp (Model Context Protocol server)
│ └── tokio-runtime (full async runtime)
├── Platform Support
│ ├── python-bindings (PyO3)
│ ├── ruby-bindings (Rutie)
│ ├── php-bindings (Extism)
│ ├── node-bindings (NAPI-RS)
│ └── wasm (wasm32-unknown-unknown target)
└── Development
├── otel (OpenTelemetry tracing)
├── bench (Criterion benchmarks)
└── dev-tools (debugging, profiling)
Cargo.toml Feature Definition
Location: Cargo.toml (workspace root), crates/kreuzberg/Cargo.toml
[package] name = "kreuzberg" version = "4.0.0-rc.22" [features] # Default: Minimal + essential features default = ["api", "tesseract", "pdf", "office", "embeddings", "keywords-yake"] # OCR Backends (mutually exclusive recommendation) tesseract = ["dep:kreuzberg-tesseract"] tesseract-static = ["tesseract", "kreuzberg-tesseract/bundled"] ocr-minimal = [] # No OCR backend; image extraction only # Document Format Support pdf = ["dep:pdfium-render"] pdf-minimal = [] # Text extraction from PDFs only office = [] # DOCX, XLSX, PPTX support office-minimal = [] # Metadata extraction only # AI/ML Features embeddings = ["dep:fastembed", "dep:ort"] # Requires ONNX Runtime keywords-yake = ["dep:yake"] keywords-rake = ["dep:rake"] language-detection = ["dep:fast-langdetect"] # Server & API api = ["dep:axum", "dep:tower", "dep:tower-http", "tokio-runtime"] mcp = [] # Model Context Protocol server mode # Runtime tokio-runtime = ["dep:tokio"] # Full async runtime lite-runtime = [] # Minimal embedded runtime # Platform Bindings python-bindings = ["dep:pyo3"] ruby-bindings = ["dep:rutie"] php-bindings = ["dep:extism"] node-bindings = ["dep:napi-rs"] wasm = [] # wasm32-unknown-unknown target # Observability otel = ["dep:opentelemetry"] # Build & Testing bench = ["criterion"]
Platform-Specific Feature Combinations
1. Core Rust Library (crates/kreuzberg)
Typical Build:
cargo build --release --features "api,tesseract,pdf,office,embeddings,keywords-yake,language-detection"
Minimal Build (for embedded):
cargo build --release --no-default-features --features "pdf-minimal,ocr-minimal"
Static Tesseract (Docker/reproducible):
cargo build --release --features "tesseract-static,api"
2. Python Binding (crates/kreuzberg-py)
Location: crates/kreuzberg-py/Cargo.toml, packages/python/
[dependencies]
kreuzberg = { path = "../kreuzberg", features = [
"python-bindings",
"api",
"tesseract",
"pdf",
"office",
"embeddings",
"keywords-yake",
"language-detection",
"otel", # Optional tracing support
] }
[features]
default = []
full = ["kreuzberg/embeddings", "kreuzberg/keywords-rake"]
lite = ["kreuzberg/ocr-minimal", "kreuzberg/pdf-minimal"]
Build Matrix (via maturin):
# .github/workflows/python-wheels.yml python-versions: [3.8, 3.9, 3.10, 3.11, 3.12] platforms: [x86_64-unknown-linux-gnu, x86_64-apple-darwin, aarch64-apple-darwin, x86_64-pc-windows-msvc] features: - default (full features) - lite (minimal embeddings/ocr) - staticlibs (bundled Tesseract)
3. Node.js Binding (crates/kreuzberg-node)
Location: crates/kreuzberg-node/Cargo.toml
[dependencies]
kreuzberg = { path = "../kreuzberg", features = [
"node-bindings",
"api",
"tesseract",
"pdf",
"office",
"embeddings",
"keywords-yake",
"language-detection",
] }
[features]
# Node specific
napi-rs = ["dep:napi-rs"]
Runtime Detection (JavaScript):
// packages/typescript/node/src/index.ts
import addon from '../native/index.node';
// Feature detection at runtime
export function getCapabilities(): Capabilities {
return {
ocr: addon.hasOcr(), // true if tesseract feature enabled
embeddings: addon.hasEmbeddings(), // true if embeddings feature enabled
keywords: addon.hasKeywords(), // true if keyword extraction available
pdf: addon.hasPdf(), // true if pdf feature enabled
office: addon.hasOffice(), // true if office feature enabled
};
}
4. WebAssembly (crates/kreuzberg-wasm)
Location: crates/kreuzberg-wasm/Cargo.toml
[dependencies]
kreuzberg = { path = "../kreuzberg", features = [
"wasm",
"pdf-minimal", # Light PDF parsing
"ocr-minimal", # No Tesseract (browser-unsafe)
# NO embeddings (ONNX Runtime incompatible with WASM)
# NO keywords (language models too large for browser)
], default-features = false }
[lib]
crate-type = ["cdylib"]
[profile.release]
opt-level = "z" # Minimize binary size
lto = true
codegen-units = 1
Feature Limitations (WASM):
- •NO OCR (Tesseract not available in browsers)
- •NO Embeddings (ONNX Runtime incompatible)
- •NO Keywords extraction (YAKE/RAKE models too large)
- •Limited PDF support (use
pdf-minimalfor client-side parsing) - •Table extraction only (no reconstruction)
5. Go Binding (packages/go/v4)
No Cargo.toml - Uses kreuzberg-ffi C binding. Feature flags applied at Rust compile time.
# Compile Rust FFI with all features cd crates/kreuzberg-ffi cargo build --release --features "api,tesseract,pdf,office,embeddings,keywords-yake,language-detection" # Go then imports C header # packages/go/v4/binding.go includes <kreuzberg.h>
6. Other Bindings (Ruby, PHP, Elixir, C#, Java)
All use FFI/bindings to pre-compiled Rust core → feature flags set at Rust build time, not in language-specific config.
Conditional Compilation Patterns
Location-Based Feature Gating
Location: crates/kreuzberg/src/lib.rs, module structure
// src/lib.rs #[cfg(feature = "api")] pub mod api; #[cfg(feature = "mcp")] pub mod mcp; #[cfg(feature = "embeddings")] pub mod embeddings; #[cfg(any(feature = "keywords-yake", feature = "keywords-rake"))] pub mod keywords; // Only include extraction format if feature enabled #[cfg(feature = "pdf")] pub mod pdf; #[cfg(feature = "office")] pub mod office_extractors;
Function-Level Feature Gating
// src/extraction/image.rs
pub async fn extract_text_from_image(
image_data: &[u8],
config: &ImageConfig,
) -> Result<ExtractionResult> {
// ... common image processing ...
// OCR is optional
#[cfg(feature = "tesseract")]
{
if config.enable_ocr {
return ocr_text_extraction(image_data, config).await;
}
}
// Fallback to direct text extraction (no OCR)
Ok(ExtractionResult {
content: extract_text_direct(image_data)?,
..Default::default()
})
}
Runtime Feature Detection
// src/core/config_validation.rs
pub fn validate_config(config: &ExtractionConfig) -> Result<ValidationReport> {
let mut warnings = Vec::new();
// Check if requested feature is compiled in
#[cfg(not(feature = "embeddings"))]
{
if config.chunking.as_ref().map_or(false, |c| c.embedding.is_some()) {
warnings.push("Embeddings requested but not compiled in. Set feature 'embeddings'".to_string());
}
}
#[cfg(not(feature = "tesseract"))]
{
if config.image.as_ref().map_or(false, |c| c.enable_ocr) {
warnings.push("OCR requested but tesseract feature not enabled".to_string());
}
}
Ok(ValidationReport { warnings, ..Default::default() })
}
Build Target Configuration
Supported Targets
# Cargo.toml - [profile.*.package] overrides [profile.release.package.kreuzberg-wasm] opt-level = "z" # Minimize WASM size codegen-units = 1 lto = true [profile.release] lto = "thin" # Balance between compile time and performance opt-level = 3 strip = true codegen-units = 1
Target-Specific Builds
Cross-compilation:
# Compile for ARM64 on macOS cargo build --release --target aarch64-apple-darwin # Compile for Linux on macOS (requires cross) cross build --release --target x86_64-unknown-linux-gnu # Windows MSVC (for embeddings support) cargo build --release --target x86_64-pc-windows-msvc # WebAssembly wasm-pack build --target web crates/kreuzberg-wasm --release
Docker Multi-Stage Build
Location: docker/Dockerfile
# Stage 1: Build with all features FROM rust:1.91 as builder WORKDIR /build COPY . . RUN cargo build --release --features "api,tesseract-static,pdf,office,embeddings,keywords-yake,language-detection,otel" # Stage 2: Minimal runtime FROM debian:bookworm-slim COPY --from=builder /build/target/release/kreuzberg-api /usr/local/bin/ RUN apt-get update && apt-get install -y libssl3 && rm -rf /var/lib/apt/lists/* ENTRYPOINT ["kreuzberg-api"]
Feature Flag Testing Strategy
Location: tests/, .github/workflows/
Matrix Testing
# .github/workflows/feature-tests.yml
strategy:
matrix:
features:
- "default"
- "api,tesseract,pdf,office,embeddings"
- "pdf-minimal,ocr-minimal" # Minimal build
- "api,tesseract,language-detection" # No embeddings
- "api,keywords-yake,keywords-rake" # No PDF
os: [ubuntu-latest, macos-latest, windows-latest]
steps:
- run: cargo test --features "${{ matrix.features }}" --release
- run: cargo build --target wasm32-unknown-unknown --no-default-features
Feature Availability Tests
// tests/feature_detection.rs
#[test]
fn test_embeddings_available() {
#[cfg(feature = "embeddings")]
{
assert!(can_use_embeddings());
}
#[cfg(not(feature = "embeddings"))]
{
assert!(must_skip_embeddings_tests());
}
}
#[test]
fn test_ocr_backends_available() {
#[cfg(feature = "tesseract")]
{
assert!(ocr_processor().is_ok());
}
#[cfg(all(not(feature = "tesseract"), not(feature = "ocr-minimal")))]
{
assert_eq!(extract_image_text().unwrap().len(), 0); // No OCR
}
}
Language Binding Feature Exposure
Python
# kreuzberg/__init__.pyi (type stubs)
import sys
if sys.version_info >= (3, 10):
from typing import TypeAlias
else:
from typing_extensions import TypeAlias
# Feature availability
HAS_EMBEDDINGS: bool
HAS_KEYWORDS: bool
HAS_OCR: bool
HAS_PDF: bool
def get_features() -> dict[str, bool]:
"""Return which features are compiled in"""
...
TypeScript/Node.js
// @kreuzberg/node
export interface Capabilities {
ocr: boolean;
embeddings: boolean;
keywords: boolean;
pdf: boolean;
office: boolean;
mcp: boolean;
}
export function getCapabilities(): Capabilities {
// Runtime detection via NAPI-RS
}
Critical Rules
- •Never mix conflicting features - e.g.,
ocr-minimal+tesseractshould error at compile time - •Always provide feature diagnostics - Config validation must warn if feature unavailable
- •Default to maximum feature set - Unless embedded/minimal specifically requested
- •Test all feature combinations - Matrix testing in CI catches regressions
- •Document feature dependencies:
- •
embeddingsrequires ONNX Runtime installed - •
tesseract-staticincludes Tesseract binary (larger binary) - •
apirequires tokio runtime - •WASM incompatible with embeddings, keywords, OCR
- •
- •Version gate features carefully - Python 3.8 vs 3.12 differ in typing
- •CI/CD gate features by platform - Windows MSVC required for embeddings, WASM only on web
Related Skills
- •extraction-pipeline-patterns - Feature-gated extraction paths
- •ocr-backend-management - Feature gates for different OCR backends
- •api-server-patterns -
apifeature gating Axum server - •chunking-embeddings - Feature gating embeddings/FastEmbed
- •mcp-protocol-integration - MCP feature compilation