Database Management
Decision guidance for PostgreSQL, DuckDB, and Parquet in hybrid storage architectures.
Contents
- •When to use which database
- •PostgreSQL quick reference
- •DuckDB quick reference
- •Parquet quick reference
- •PGVector quick reference
- •Cross-database conventions
- •Performance debugging checklist
When to use which database
| Workload | Use | Why |
|---|---|---|
| Transactional (CRUD, users, sessions) | PostgreSQL | ACID, row-level locking, indexes |
| Analytical (aggregations, scans) | DuckDB | Columnar, vectorized, parallel |
| Data storage/interchange | Parquet | Compressed, columnar, portable |
| Metadata + relationships | PostgreSQL | Foreign keys, constraints |
| Ad-hoc exploration | DuckDB | Fast on Parquet, no ETL needed |
| Time-series with point lookups | PostgreSQL + partitioning | Partition pruning + indexes |
| Time-series analytics | DuckDB on Parquet | Scan performance |
| Vector similarity search | PostgreSQL + PGVector | HNSW/IVFFlat indexes, hybrid search |
| RAG / semantic search | PostgreSQL + PGVector | Embeddings + metadata in same DB |
Hybrid pattern example:
- •PostgreSQL: transactional data, relationships, users (metadata)
- •DuckDB + Parquet: analytical content, aggregations, time-series
PostgreSQL quick reference
Use for: Metadata, relationships, OLTP workloads, anything needing ACID.
Key decisions:
- •Partition tables >100M rows or with retention requirements
- •Index columns in WHERE/JOIN clauses, not everything
- •Tune autovacuum for high-churn tables
See references/postgres-architecture.md for maintenance patterns. See references/postgres-querying.md for advanced query techniques.
DuckDB quick reference
Use for: Analytics, aggregations, Parquet queries, data exploration.
Key decisions:
- •Prefer Parquet files over CSV (10-100x faster)
- •Let DuckDB auto-parallelize; don't micro-optimize
- •For remote data, increase threads beyond CPU count
See references/duckdb-architecture.md for storage and parallelism. See references/duckdb-querying.md for DuckDB-specific SQL features.
Parquet quick reference
Use for: Storing analytical data, data interchange, columnar compression.
Key decisions:
- •Target 128MB-1GB file sizes
- •Partition by low-to-moderate cardinality columns (date, region)
- •Sort by columns used in filters for better pruning
See references/parquet-architecture.md for file design. See references/parquet-querying.md for query optimization.
PGVector quick reference
Use for: Similarity search, RAG applications, semantic search, recommendations.
Key decisions:
- •HNSW for low-latency, high-recall (default choice)
- •IVFFlat for memory-constrained or batch-updated data
- •Use iterative scan for filtered queries
- •Consider hybrid search (vector + keyword) for 8-15% accuracy boost
See references/pgvector-architecture.md for index configuration. See references/pgvector-querying.md for hybrid search and filtering.
Cross-database conventions
Naming
| Convention | Example | Applies to |
|---|---|---|
| snake_case tables | dataset_jobs | All |
| snake_case columns | created_at | All |
| Singular table names | dataset not datasets | PostgreSQL |
| Plural for collections | datasets/ directory | Parquet files |
Normalization decisions
| Pattern | When to normalize | When to denormalize |
|---|---|---|
| Lookup tables | PostgreSQL, changes frequently | DuckDB/Parquet, static data |
| Repeated values | PostgreSQL, storage matters | Parquet, compression handles it |
| Joins at query time | PostgreSQL, complex relationships | Parquet, pre-join for analytics |
Timestamps
- •Store as UTC always
- •PostgreSQL:
TIMESTAMPTZ - •Parquet:
TIMESTAMPwithisAdjustedToUTC=true - •DuckDB: reads both correctly
Performance debugging checklist
PostgreSQL slow query
- •Run
EXPLAIN (ANALYZE, BUFFERS)on the query - •Check for sequential scans on large tables
- •Verify indexes exist on filter/join columns
- •Check
pg_stat_user_tablesfor bloat (dead tuples) - •Review
work_memif seeing disk sorts
DuckDB slow query
- •Check if reading CSV instead of Parquet
- •Verify not doing
SELECT *on remote data - •Check thread count matches workload
- •Look for unnecessary type conversions
Parquet slow reads
- •Verify predicate pushdown is working (check query plan)
- •Check file sizes (too small = overhead, too large = no parallelism)
- •Confirm data is sorted by filter columns
- •Look for high-cardinality partition keys (too many small files)
PGVector slow search
- •Verify index exists and is being used (EXPLAIN)
- •Check
ef_search(HNSW) orprobes(IVFFlat) settings - •Enable iterative scan for filtered queries
- •Check if IVFFlat recall degraded (rebuild index if heavily updated)
- •Consider partial indexes for common filters