LogChef Log Analysis
Query production logs using the LogChef CLI for incident investigation.
Data Safety Rules
Logs contain sensitive data. Follow these rules to avoid loading excessive data into context:
- Always use `--limit` for `logchef query` commands (max 50 rows by default)
- Prefer SQL aggregations - use `COUNT()`, `GROUP BY` instead of pulling raw logs
- Ask before broadening searches - If the initial query returns no results, explicitly ask the user for permission before:
  - Extending the time window (e.g., 30m → 1h → 2h)
  - Removing or relaxing filters (e.g., dropping namespace filter)
  - Searching across all namespaces instead of a specific one
- For large datasets, pipe to file and sample:
bash
logchef query '...' --limit 500 --output jsonl > /tmp/logs.jsonl
head -20 /tmp/logs.jsonl   # Sample first 20 lines
wc -l /tmp/logs.jsonl      # Check total count
- Never pull unbounded raw logs - always aggregate or limit first
- Start with counts, then drill down to samples only when needed
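The row cap in the rules above can also be enforced mechanically rather than by memory. The following is a minimal sketch, assuming a hypothetical helper named `clamp_limit` (it is not part of the LogChef CLI) and a cap of 50 rows taken from the first rule:

```bash
# Hypothetical helper, not a LogChef command: clamp a requested row limit
# so raw-log pulls destined for context never exceed MAX_CONTEXT_ROWS.
MAX_CONTEXT_ROWS=50

clamp_limit() {
  local requested="${1:-$MAX_CONTEXT_ROWS}"
  # Anything that is not a plain non-negative integer falls back to the cap.
  case "$requested" in
    ''|*[!0-9]*) requested="$MAX_CONTEXT_ROWS" ;;
  esac
  # Zero and values above the cap are also replaced by the cap.
  if [ "$requested" -lt 1 ] || [ "$requested" -gt "$MAX_CONTEXT_ROWS" ]; then
    requested="$MAX_CONTEXT_ROWS"
  fi
  printf '%s\n' "$requested"
}
```

For example, passing `--limit "$(clamp_limit 500)"` to `logchef query` would send `--limit 50`. Pulls that are redirected to a file can bypass the helper and use a larger limit, as in the file-and-sample rule.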
Quick Reference
| Command | Use Case |
|---|---|
| `logchef sql "..."` | SQL queries (aggregations, counts, time series) |
| `logchef query '...'` | Filter queries (sample logs, grep-style) |
Required Parameters
Always include these flags (get values from logchef config show):
- `-t <team>` - Team name or ID
- `-S <source>` - Source name, `database.table_name`, or ID
Or set defaults:
bash
logchef config set team "my-team"
logchef config set source "my-source"
Note: If a user mentions a namespace (e.g., "rms namespace"), filter by the namespace field in your query (e.g., namespace="rms"), not by changing the source.
Command Reference
code
logchef
├── query <QUERY> # LogChefQL query (filter logs)
│ ├── -s, --since <15m|1h> # Relative time
│ ├── --from/--to <TIME> # Absolute time range
│ ├── -t, --team <TEAM> # Team ID/name
│ ├── -S, --source <SOURCE> # Source ID/name
│ ├── -l, --limit <N> # Row limit (ALWAYS USE)
│ ├── --output <FORMAT> # text|json|jsonl|table
│ ├── --show-sql # Show generated SQL
│ └── --timeout <SECS> # Query timeout [default: 30]
│
├── sql <SQL> # Raw SQL query
│ ├── -t, --team <TEAM>
│ ├── -S, --source <SOURCE>
│ ├── --output <FORMAT> # text|json|jsonl|table
│ └── --timeout <SECS> # [default: 30]
│
├── collections [NAME] # List or run saved collections
│ ├── (no args) # List available collections
│ ├── <NAME> # Run named collection
│ ├── -s, --since <TIME> # Override time range
│ ├── -l, --limit <N> # Override limit
│ ├── -V, --var <K=V> # Variable overrides
│ └── --output <FORMAT> # text|json|jsonl|table|list
│
├── config # Manage CLI configuration
│ ├── show # Show current context config
│ ├── set <KEY> <VALUE> # Set config value (team, source)
│ ├── list # List all contexts
│ ├── use <NAME> # Switch context
│ ├── rename <OLD> <NEW> # Rename context
│ ├── delete <NAME> # Delete context
│ └── path # Show config file path
│
├── auth # Authentication
│ ├── (no args) # Login interactively
│ ├── --status # Check auth status
│ └── -l, --logout # Logout
│
└── Global options (all commands):
├── -c, --context <CTX> # Use specific context
├── --server <URL> # Override server URL
├── --token <TOKEN> # Override auth token
├── --no-highlight # Disable highlighting
└── -d, --debug # Debug mode
Time Formats
bash
# Relative time (recommended - avoids timezone issues)
--since 1h
--since 15m
--since 24h

# Absolute time with explicit timezone (ISO 8601)
--from "2026-01-22T09:15:00+05:30" --to "2026-01-22T10:00:00+05:30"
--from "2026-01-22T09:15:00Z" --to "2026-01-22T10:00:00Z"

# Absolute time without timezone (uses server's configured timezone)
--from "2026-01-22 09:15:00" --to "2026-01-22 10:00:00"
Timezone handling:
- Infer user's timezone from system (`date +%Z`) or ask if unclear
- Use ISO 8601 with offset (e.g., `+05:30`, `Z`) for precision
- Relative times (`--since`) are timezone-agnostic and preferred
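When an absolute range is unavoidable, the timestamps can be generated rather than typed by hand. This is a small sketch using only the standard `date` and `sed` utilities; the function names `iso_utc_now` and `iso_local_now` are hypothetical:

```bash
# Current time in UTC, ISO 8601 with the "Z" suffix.
iso_utc_now() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Current local time with a numeric offset. date's %z prints the offset
# as +0530, so sed inserts the colon to produce +05:30.
iso_local_now() {
  date +%Y-%m-%dT%H:%M:%S%z | sed -E 's/([+-][0-9]{2})([0-9]{2})$/\1:\2/'
}
```

Either value could then be passed as `--to "$(iso_utc_now)"`. Computing a start time such as "30 minutes ago" depends on the local `date` implementation, so it is left out of the sketch.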
LogChefQL Syntax
LogChefQL is a simple query language for filtering logs.
Operators
| Operator | Description | Example |
|---|---|---|
| `=` | Exact match | `level="error"` |
| `!=` | Not equal | `status!=200` |
| `~` | Contains/regex (case-insensitive) | `msg~"timeout"` |
| `!~` | Does not contain | `msg!~"expected"` |
| `>` | Greater than | `status>400` |
| `<` | Less than | `response_time<100` |
| `>=` | Greater or equal | `severity>=3` |
| `<=` | Less or equal | `count<=10` |
Boolean Operators
- `and` - Both conditions must match
- `or` - Either condition matches
- `()` - Grouping for precedence
Examples
bash
# Exact match (quoted value)
level="error"

# Exact match (unquoted value)
level=error

# Contains/regex match
msg~"timeout"

# Negation
msg!~"noise pattern"

# Combined conditions
level="error" and service="api"

# OR conditions
level="error" or level="warn"

# Grouping
(level="error" or level="warn") and service="api"

# Field selection with pipe
level="error" | timestamp msg service
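Filters like these are often assembled from user-supplied values, for instance a namespace and a search term. The sketch below assumes a hypothetical helper named `build_filter`; because LogChefQL's escaping rules for embedded quotes are not covered here, the helper simply refuses values that contain a double quote instead of guessing how to escape them:

```bash
# Hypothetical helper: build a LogChefQL filter that scopes a message
# search to one namespace, using the = and ~ operators and "and".
build_filter() {
  local namespace="$1" pattern="$2"
  # Refuse embedded double quotes; the escaping rules are an unknown here.
  case "$namespace$pattern" in
    *'"'*)
      echo "build_filter: values must not contain double quotes" >&2
      return 1
      ;;
  esac
  printf 'namespace="%s" and msg~"%s"\n' "$namespace" "$pattern"
}
```

Calling `build_filter rms timeout` prints `namespace="rms" and msg~"timeout"`, which can be passed to `logchef query` together with `--limit`.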
Common Patterns
1. Log Volume Over Time
bash
logchef sql "SELECT toStartOfMinute(_timestamp) as ts, count() as logs FROM DATABASE.TABLE WHERE _timestamp >= 'YYYY-MM-DD HH:MM:SS' AND _timestamp <= 'YYYY-MM-DD HH:MM:SS' GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE
2. Error Count by Minute
bash
logchef sql "SELECT toStartOfMinute(_timestamp) as ts, count() as errors FROM DATABASE.TABLE WHERE _timestamp >= 'YYYY-MM-DD HH:MM:SS' AND _timestamp <= 'YYYY-MM-DD HH:MM:SS' AND msg ILIKE '%error%' GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE
3. Sample Actual Logs (Always Use --limit)
bash
# ALWAYS include --limit to avoid pulling too much data
logchef query 'service="my-service" and msg~"pattern"' \
  -t TEAM -S SOURCE \
  --from "YYYY-MM-DD HH:MM:SS" \
  --to "YYYY-MM-DD HH:MM:SS" \
  --limit 20
4. List Distinct Values
bash
logchef sql "SELECT DISTINCT service FROM DATABASE.TABLE WHERE _timestamp >= now() - INTERVAL 1 HOUR LIMIT 50" -t TEAM -S SOURCE
5. High Resolution (30-Second Granularity)
bash
logchef sql "SELECT toStartOfInterval(_timestamp, INTERVAL 30 SECOND) as ts, count() as logs FROM DATABASE.TABLE WHERE _timestamp >= 'YYYY-MM-DD HH:MM:SS' AND _timestamp <= 'YYYY-MM-DD HH:MM:SS' GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE
6. Multiple Conditions with countIf
bash
logchef sql "SELECT toStartOfMinute(_timestamp) as ts, countIf(msg ILIKE '%error%') as errors, countIf(msg ILIKE '%timeout%') as timeouts, countIf(msg ILIKE '%connection%refused%') as conn_refused FROM DATABASE.TABLE WHERE _timestamp >= 'YYYY-MM-DD HH:MM:SS' AND _timestamp <= 'YYYY-MM-DD HH:MM:SS' GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE
7. Log Level Distribution
bash
logchef sql "SELECT level, count() as cnt FROM DATABASE.TABLE WHERE _timestamp >= now() - INTERVAL 1 HOUR GROUP BY level ORDER BY cnt DESC" -t TEAM -S SOURCE
8. Error Messages Breakdown
bash
logchef sql "SELECT extractAll(msg, 'error[: ]([^,\n]+)')[1] as error_type, count() as cnt FROM DATABASE.TABLE WHERE _timestamp >= now() - INTERVAL 1 HOUR AND msg ILIKE '%error%' GROUP BY error_type ORDER BY cnt DESC LIMIT 20" -t TEAM -S SOURCE
Investigation Workflows
1. Initial Triage
bash
# Get log volume pattern
logchef sql "SELECT toStartOfMinute(_timestamp) as ts, count() as logs
  FROM DATABASE.TABLE
  WHERE _timestamp >= 'START_TIME'
    AND _timestamp <= 'END_TIME'
  GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE

# Look for cliff (sudden drop) or spike (sudden increase)
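Spotting a cliff or spike by eye gets harder as the window grows. As a rough sketch, the per-minute counts can be scanned for large minute-to-minute changes. It assumes the counts have already been saved as tab-separated `timestamp<TAB>count` lines (the CLI's exact output layout is not specified here, so that conversion is left open), and both the helper name `flag_jumps` and the default factor of 3 are arbitrary choices:

```bash
# Hypothetical helper: read "timestamp<TAB>count" lines on stdin and flag
# rows whose count rose or fell by at least the given factor (default 3)
# compared with the previous row.
flag_jumps() {
  local factor="${1:-3}"
  awk -F '\t' -v f="$factor" '
    NR > 1 {
      if ($2 > 0 && $2 >= prev * f)
        print $1 "\tspike\t" prev " -> " $2
      else if (prev > 0 && $2 * f <= prev)
        print $1 "\tcliff\t" prev " -> " $2
    }
    { prev = $2 }
  '
}
```

A flagged minute is only a starting point for drilling down; a fixed ratio will be noisy on low-volume sources.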
2. Error Analysis
bash
# Count errors by minute
logchef sql "SELECT toStartOfMinute(_timestamp) as ts, count() as errors
  FROM DATABASE.TABLE
  WHERE _timestamp >= 'START_TIME'
    AND _timestamp <= 'END_TIME'
    AND msg ILIKE '%error%'
  GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE

# Sample actual errors (always limit raw log pulls)
logchef query 'msg~"error"' \
  -t TEAM -S SOURCE \
  --from "START_TIME" \
  --to "END_TIME" \
  --limit 20  # Never omit --limit
3. Cross-Service Correlation
bash
# Check multiple services at once
logchef sql "SELECT toStartOfMinute(_timestamp) as ts,
countIf(service='api') as api,
countIf(service='web') as web,
countIf(service='worker') as worker
FROM DATABASE.TABLE
WHERE service IN ('api', 'web', 'worker')
AND _timestamp >= 'START_TIME'
AND _timestamp <= 'END_TIME'
AND msg ILIKE '%error%'
GROUP BY ts ORDER BY ts" -t TEAM -S SOURCE
4. Host-Level Analysis
bash
# Errors by host
logchef sql "SELECT host, count() as errors
  FROM DATABASE.TABLE
  WHERE _timestamp >= 'START_TIME'
    AND _timestamp <= 'END_TIME'
    AND msg ILIKE '%error%'
  GROUP BY host ORDER BY errors DESC" -t TEAM -S SOURCE
5. Large Result Sets (File + Sample Pattern)
When you need to examine more logs than is safe to load into context:
bash
# Step 1: Save to file with reasonable limit
logchef query 'level="error"' \
  -t TEAM -S SOURCE \
  --since 1h \
  --limit 1000 \
  --output jsonl > /tmp/errors.jsonl

# Step 2: Check how many we got
wc -l /tmp/errors.jsonl

# Step 3: Sample for context (only load what's needed)
head -30 /tmp/errors.jsonl

# Step 4: Search within the file if needed
grep "specific_pattern" /tmp/errors.jsonl | head -20
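`head` only shows the start of the file, which can be unrepresentative when the saved logs span a long window. As an alternative for the sampling step, the sketch below takes evenly spaced lines from across the file. The helper name `sample_jsonl` is hypothetical, and it treats the file as plain lines without parsing the JSON:

```bash
# Hypothetical helper: print at most WANT lines (default 20) spread
# evenly across FILE instead of only its first lines.
sample_jsonl() {
  local file="$1" want="${2:-20}"
  local total step
  if [ ! -r "$file" ]; then
    echo "sample_jsonl: cannot read $file" >&2
    return 1
  fi
  case "$want" in
    ''|*[!0-9]*|0)
      echo "sample_jsonl: sample size must be a positive integer" >&2
      return 1
      ;;
  esac
  total=$(awk 'END { print NR }' "$file")
  # Ceiling division, so that stepping never yields more than WANT lines.
  step=$(( (total + want - 1) / want ))
  if [ "$step" -lt 1 ]; then
    step=1
  fi
  awk -v step="$step" -v want="$want" \
    '(NR - 1) % step == 0 && n < want { print; n++ }' "$file"
}
```

For instance, `sample_jsonl /tmp/errors.jsonl 30` loads no more than 30 lines into context regardless of how many were saved.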
Common Gotchas
| Issue | Solution |
|---|---|
| Query timeout | Narrow time window, add more filters |
| No results | Check field names, verify time range, ask user before widening time window |
| Wrong timestamp | Use _timestamp (check your schema) |
| Regex not working | Use ILIKE '%pattern%' in SQL, msg~"pattern" in query |
| Case sensitive | Use ILIKE (case-insensitive) instead of LIKE |
| Performance | Always include time filter first |
| Syntax error | Use = not : for field matching |
| Empty results with --since | Ask user before expanding time range; explain what was tried and propose alternatives |
| Field not found (e.g. user_id) | Fields like user_id may be embedded in JSON msg field - use msg~"value" instead of user_id="value" |
SQL Functions Reference
sql
-- Time bucketing
toStartOfMinute(_timestamp)                         -- 1-minute buckets
toStartOfFiveMinutes(_timestamp)                    -- 5-minute buckets
toStartOfHour(_timestamp)                           -- Hourly buckets
toStartOfInterval(_timestamp, INTERVAL 30 SECOND)   -- Custom interval

-- Conditional counting
countIf(condition)
sumIf(column, condition)

-- String matching
msg ILIKE '%pattern%'    -- Case-insensitive contains
msg LIKE '%pattern%'     -- Case-sensitive contains
match(msg, 'regex')      -- Regex match

-- Extraction
extractAll(msg, 'pattern')[1]   -- Extract regex group
substring(msg, 1, 100)          -- First 100 chars
Output Formats
bash
# Default text output with highlighting
logchef query 'level="error"' --limit 20

# JSON output (for jq processing)
logchef query 'level="error"' --limit 20 --output json | jq '.logs[] | .msg'

# JSON Lines (one object per line)
logchef query 'level="error"' --limit 20 --output jsonl | jq '.msg'

# Disable highlighting for piping
logchef query 'level="error"' --limit 20 --no-highlight | grep "pattern"

# Show generated SQL
logchef query 'level="error"' --limit 20 --show-sql
Performance Tips
- Always filter by time first - LogChef uses time-based partitioning
- Narrow time windows - Start with 15 minutes, expand if needed
- Filter early - Add service/level filters to reduce scan scope
- Always use `--limit` - Never pull unbounded raw logs (max 50 for context)
- Aggregate before retrieving - Use SQL `COUNT()`/`GROUP BY` to analyze, only sample raw logs when needed
- Pipe large results to file - Use `> /tmp/logs.jsonl` then `head` to sample
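Starting narrow and expanding works best when each step is explicit and is proposed to the user rather than applied silently. The sketch below assumes a hypothetical helper named `next_window` and a fixed ladder of 15m → 30m → 1h → 2h, combining the 15-minute starting point with the example steps from the data safety rules; it only prints the next candidate and never runs a query:

```bash
# Hypothetical helper: given the current --since value, print the next
# wider window to propose to the user. It returns non-zero when there is
# no further step, so widening past 2h always needs a deliberate decision.
next_window() {
  case "$1" in
    15m) echo 30m ;;
    30m) echo 1h ;;
    1h)  echo 2h ;;
    *)   return 1 ;;
  esac
}
```

A caller would show the printed value to the user, explain what was already tried, and rerun with the wider `--since` only after getting permission.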