SQLite for Structured Data
STOP - Before You Start
Before writing any data analysis code, answer these questions:
- Will I query this data more than once? → Use SQLite
- Do I need GROUP BY, COUNT, AVG, or JOIN? → Use SQLite
- Am I about to write Python/jq parsing code? → Use SQLite instead
- Is the dataset >100 records? → Use SQLite
If you answered YES to any question above, use SQLite. Don't write custom parsing code.
Core Principle
SQLite is just a file - no server, no setup, zero dependencies. Use it when you'd otherwise write custom parsing code or re-process data for each query.
Common Misconception
❌ "Databases are too complex for small datasets"
Reality: SQLite = simpler than writing JSON parsing code.
# This is "complex" (custom code for every query):
cat data.json | jq '.[] | select(.status=="failed")' | jq -r '.error_type' | sort | uniq -c

# This is "simple" (SQL does the work):
sqlite3 data.db "SELECT error_type, COUNT(*) FROM errors WHERE status='failed' GROUP BY error_type"
Setup cost: sqlite3 file.db - that's it. It's just a file like JSON.
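To see what "zero setup" means in practice, here is a minimal sketch (demo.db is an arbitrary throwaway file name):

# Create a database, add a row, query it - one binary, one ordinary file
sqlite3 demo.db "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT);"
sqlite3 demo.db "INSERT INTO notes (body) VALUES ('hello');"
sqlite3 demo.db "SELECT * FROM notes;"
ls -l demo.db   # just a file on disk; rm it and it's gone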
When to Use SQLite
Use when ANY of these apply:
- >100 records - JSON/grep becomes unwieldy
- Multiple aggregations - Need to GROUP BY, COUNT, AVG, etc.
- Multiple queries - Will ask follow-up questions about same data
- Correlation needed - Joining data from multiple sources
- State tracking - Need queryable progress/status over time
When NOT to Use SQLite
Don't use when ALL of these are true:
- <50 records total
- Single simple query
- No aggregations needed
- Won't have follow-up questions
→ For tiny datasets with simple access, JSON/grep is fine.
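For example, a one-off check over a handful of records needs nothing more than a single jq call (small.json is a hypothetical file):

# One question, tiny file - jq is enough
jq -r '.[] | select(.status=="failed") | .name' small.json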
Red Flags - Use SQLite Instead
STOP and use SQLite if you're about to:
- Write Python/Node code to parse JSON/CSV for analysis
- Run same jq/grep command with slight variations
- Write custom aggregation logic (COUNT, AVG, GROUP BY in code)
- Manually correlate data by timestamps or IDs
- Create temp files to store intermediate results
- Process same data multiple times for different questions
All of these mean: Load into SQLite once, query with SQL.
Decision Flow
digraph when_sqlite {
"Structured data to analyze?" [shape=diamond];
"Multiple queries or aggregations?" [shape=diamond];
"Use SQLite" [shape=box, style=filled, fillcolor=lightgreen];
"Python/jq/grep fine" [shape=box];
"Structured data to analyze?" -> "Multiple queries or aggregations?" [label="yes"];
"Structured data to analyze?" -> "Python/jq/grep fine" [label="no"];
"Multiple queries or aggregations?" -> "Use SQLite" [label="yes (>1 query)\nor data >100 records"];
"Multiple queries or aggregations?" -> "Python/jq/grep fine" [label="no (<50 records,\n1 simple query)"];
}
The Threshold
| Scenario | Tool | Why |
|---|---|---|
| 50 test results, one-time summary | Python/jq | Fast, appropriate |
| 200+ test results, find flaky tests | SQLite | GROUP BY simpler than code |
| 3 log files, correlate by time | SQLite | JOIN simpler than manual grep |
| Track 1000+ file processing state | SQLite | Queries beat JSON parsing |
Rule of thumb: If you're writing parsing code or re-processing data → use SQLite instead.
Available Tools
# Built-in command
sqlite3 ~/.claude-logs/project.db

# User has sqlite-utils installed (even easier)
sqlite-utils insert data.db table_name data.json --pk=id
sqlite-utils query data.db "SELECT * FROM table_name"
Quick Start
Store Location
~/.claude-logs/<project-name>.db # Persists across sessions
Basic Workflow
# 1. Connect (create ~/.claude-logs first if it doesn't exist)
mkdir -p ~/.claude-logs
PROJECT=$(basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
sqlite3 ~/.claude-logs/$PROJECT.db
-- 2. Create table (first time, inside the sqlite3 shell)
CREATE TABLE results (
id INTEGER PRIMARY KEY,
name TEXT,
status TEXT,
duration_ms INTEGER
);
-- 3. Load data
INSERT INTO results (name, status, duration_ms)
SELECT json_extract(value, '$.name'),
json_extract(value, '$.status'),
json_extract(value, '$.duration_ms')
FROM json_each(readfile('data.json'));
-- 4. Query (SQL does the work)
SELECT status, COUNT(*), AVG(duration_ms)
FROM results
GROUP BY status;
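Steps 2-4 run inside the sqlite3 shell, but you don't have to stay interactive. Once the data is loaded, follow-up questions are one-line commands from bash (a sketch reusing $PROJECT from step 1 and the results table from step 2):

# Ad-hoc follow-ups - no re-loading, no re-parsing
sqlite3 ~/.claude-logs/$PROJECT.db "SELECT name FROM results WHERE status='failed';"
sqlite3 ~/.claude-logs/$PROJECT.db "SELECT name, duration_ms FROM results ORDER BY duration_ms DESC LIMIT 5;"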
Python vs SQL: Side-by-Side
Task: Find error types and their counts from test results.
Python Approach (What agents default to):
import json
from collections import defaultdict

# Load and parse
with open('test-results.json') as f:
    data = json.load(f)

# Custom aggregation logic
errors_by_type = defaultdict(int)
for record in data:
    if record['status'] == 'failed' and record['error_message']:
        # Extract error type (custom parsing)
        error_type = record['error_message'].split(':')[0]
        errors_by_type[error_type] += 1

# Sort and display (more custom code)
sorted_errors = sorted(errors_by_type.items(), key=lambda x: x[1], reverse=True)
for error_type, count in sorted_errors:
    print(f"{error_type}: {count}")
Lines of code: 15+ lines of custom logic
SQLite Approach (Simpler):
# Load once (sqlite-utils builds the results table from the JSON keys)
sqlite-utils insert data.db results test-results.json
# Query (SQL does aggregation)
sqlite3 data.db "
SELECT substr(error_message, 1, instr(error_message, ':')-1) as error_type,
COUNT(*) as count
FROM results
WHERE status='failed' AND error_message IS NOT NULL
GROUP BY error_type
ORDER BY count DESC
"
Lines of code: 2 commands (load once, query many times)
Key difference: With Python, you write parsing/aggregation logic. With SQL, you write what you want and SQL does it.
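The gap widens on follow-up questions: with the script you edit and re-run Python, while with the data already sitting in data.db, each new question is one more query (sketch):

# Follow-up question, answered without touching the raw JSON again
sqlite3 data.db "SELECT status, COUNT(*) FROM results GROUP BY status"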
Real Examples from Baseline
Example 1: Test Analysis (224 results)
Without SQLite: Wrote Python code, loaded to memory, custom aggregation logic, data discarded.
With SQLite:
CREATE TABLE test_runs (test_name TEXT, status TEXT, duration_ms INT, run_id TEXT);
-- Load once
INSERT INTO test_runs SELECT ...;
-- Query many times (no re-processing!)
-- Find flaky tests
SELECT test_name,
SUM(CASE WHEN status='pass' THEN 1 ELSE 0 END) as passes,
SUM(CASE WHEN status='fail' THEN 1 ELSE 0 END) as fails
FROM test_runs
GROUP BY test_name
HAVING passes > 0 AND fails > 0;
-- Slowest tests
SELECT test_name, AVG(duration_ms) FROM test_runs GROUP BY test_name ORDER BY AVG(duration_ms) DESC LIMIT 10;
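The elided load step depends on your input format. One plausible version, assuming the 224 results sit as a JSON array in a hypothetical test-results.json with matching field names:

-- Run inside the sqlite3 shell; readfile() is provided by the shell
INSERT INTO test_runs (test_name, status, duration_ms, run_id)
SELECT json_extract(value, '$.test_name'),
       json_extract(value, '$.status'),
       json_extract(value, '$.duration_ms'),
       json_extract(value, '$.run_id')
FROM json_each(readfile('test-results.json'));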
Example 2: Error Correlation (3 log files)
Without SQLite: Used jq/grep/pipes, re-processed for each question, manual timestamp correlation.
With SQLite:
CREATE TABLE errors (timestamp TEXT, service TEXT, error_type TEXT, message TEXT);

-- Load all 3 files once
INSERT INTO errors SELECT ...;

-- Find cross-service failures (JOIN is easier than grep)
SELECT e1.timestamp, e1.service, e2.service
FROM errors e1
JOIN errors e2
  ON datetime(e1.timestamp) BETWEEN datetime(e2.timestamp, '-5 minutes')
                                AND datetime(e2.timestamp, '+5 minutes')
WHERE e1.service != e2.service;

-- Error frequency
SELECT error_type, COUNT(*) FROM errors GROUP BY error_type ORDER BY COUNT(*) DESC;
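Again, the load is format-dependent. If each service happens to write newline-delimited JSON with timestamp, service, error_type, and message fields, one possible load (hypothetical file and database names) is:

# Load all three log files into one errors table
for f in api.log worker.log payments.log; do
  sqlite-utils insert errors.db errors "$f" --nl
done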
Example 3: File Processing State (50 files)
Without SQLite: Wrote custom JSON parsing code, manual state updates, custom query logic.
With SQLite:
CREATE TABLE files (name TEXT PRIMARY KEY, status TEXT, error TEXT, completed_at TEXT);
-- Initialize
INSERT INTO files (name, status) VALUES ('file1.txt', 'pending');
-- Update state
UPDATE files SET status='completed', completed_at=datetime('now') WHERE name='file1.txt';
-- Queries (no custom code!)
SELECT COUNT(*) FROM files WHERE status='completed';
SELECT * FROM files WHERE status='failed';
SELECT * FROM files WHERE status='pending' LIMIT 1; -- next file
Note: For only 50 files with simple queries, JSON was actually fine! Use SQLite when you need complex queries or >100 files.
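When you do cross that threshold, seeding and driving the state table from the shell takes only a few lines (a sketch; assumes the work items are the *.txt files in the current directory and $PROJECT is set as in the Quick Start):

DB=~/.claude-logs/$PROJECT.db
# Seed one pending row per file (INSERT OR IGNORE makes this re-runnable)
for f in *.txt; do
  sqlite3 "$DB" "INSERT OR IGNORE INTO files (name, status) VALUES ('$f', 'pending');"
done
# Grab the next file to work on
NEXT=$(sqlite3 "$DB" "SELECT name FROM files WHERE status='pending' LIMIT 1;")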
Key Mindset Shift
From: "Process data once and done"
To: "Make data queryable"
Benefits:
- User asks a follow-up → data already loaded
- Try different analyses → no re-processing
- State persists across sessions
- SQL handles complexity → you write less code
Common Mistakes
| Mistake | Fix |
|---|---|
| "Too complex for my small dataset" | SQLite = just a file. Try it. |
| Writing JSON parsing code | Use SQL SELECT instead |
| Re-running jq/grep for each query | Load once with INSERT, query with SELECT |
| Assuming "database" = "server" | SQLite has no server, it's a local file |
Red Flags - Consider SQLite
- Writing custom aggregation code
- Running same grep/jq command with slight variations
- Manually correlating data from multiple files
- Thinking "I wish I could just query this"
All of these mean: Use SQLite instead.