AgentSkillsCN

mastering-postgresql

使用Python开发PostgreSQL,支持全文检索(tsvector、tsquery、通过pg_search实现BM25)、向量相似度计算(pgvector,结合HNSW/IVFFlat)、JSONB与数组索引,以及生产环境部署。当您需要构建搜索功能、存储AI嵌入向量、查询向量相似度、优化PostgreSQL索引,或部署至AWS RDS/Aurora、GCP Cloud SQL/AlloyDB,或Azure时,可使用此功能。涵盖psycopg2、psycopg3、asyncpg、SQLAlchemy集成、Docker开发环境搭建,以及索引选择策略。触发语句包括:“PostgreSQL搜索”、“pgvector”、“BM25 PostgreSQL”、“JSONB索引”、“psycopg”、“asyncpg”、“PostgreSQL Docker”、“AlloyDB向量”。不涉及DBA管理(备份、复制、用户管理)、MySQL/MongoDB/Redis、Schema设计理论,以及存储过程。

SKILL.md
--- frontmatter
name: mastering-postgresql
description: PostgreSQL development for Python with full-text search (tsvector, tsquery, BM25 via pg_search), vector similarity (pgvector with HNSW/IVFFlat), JSONB and array indexing, and production deployment. Use when creating search features, storing AI embeddings, querying vector similarity, optimizing PostgreSQL indexes, or deploying to AWS RDS/Aurora, GCP Cloud SQL/AlloyDB, or Azure. Covers psycopg2, psycopg3, asyncpg, SQLAlchemy integration, Docker development setup, and index selection strategies. Triggers are Use "PostgreSQL search", "pgvector", "BM25 postgres", "JSONB index", "psycopg", "asyncpg", "PostgreSQL Docker", "AlloyDB vector". Does NOT cover - DBA administration (backup, replication, users), MySQL/MongoDB/Redis, schema design theory, stored procedures.
allowed-tools:
  - Read
  - Bash
  - Write

PostgreSQL Python Development

Build search, vector similarity, and data-intensive applications with PostgreSQL and Python.

Quick Reference

TaskGo To
Docker/local setupsetup-and-docker.md
Full-text search & BM25search-fulltext.md
pgvector & JSONB indexingsearch-vectors-json.md
Python drivers & poolspython-drivers.md
Python query patternspython-queries.md
AWS RDS/Auroracloud-aws.md
GCP Cloud SQL/AlloyDBcloud-gcp.md
Azure Flexible Servercloud-azure.md
Neon & Supabasecloud-serverless.md
Cloud common (pooling, config)cloud-common.md

When NOT to Use This Skill

  • DBA tasks: Backup strategies, replication setup, user management, security hardening
  • Other databases: MySQL, MongoDB, Redis, Elasticsearch-specific queries
  • Schema design: Normalization theory, data modeling patterns
  • Stored procedures: PL/pgSQL function development
  • Application frameworks: Django ORM specifics, FastAPI integration details

Quick Start Checklist

Copy this checklist to track progress:

code
Setup Progress:
- [ ] Docker environment running (docker-compose up -d)
- [ ] Connected to database (psql or Python)
- [ ] Extensions created (pgvector, pg_trgm)
- [ ] Table created with search_vector and embedding columns
- [ ] GIN index on search_vector created
- [ ] HNSW index on embedding created
- [ ] Test full-text query returns results
- [ ] Test vector query returns results

Quick Start: Search + Vectors in 5 Minutes

1. Start PostgreSQL with pgvector

yaml
# docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_PASSWORD: devpass
    ports: ["5432:5432"]
    volumes: [pgdata:/var/lib/postgresql/data]
volumes:
  pgdata:
bash
docker-compose up -d

# Verify container is running:
docker-compose ps
# Expected: postgres service with status "Up"

2. Enable Extensions

sql
CREATE EXTENSION vector;      -- pgvector for embeddings
CREATE EXTENSION pg_trgm;     -- Trigram for fuzzy search

-- Verify extensions installed:
SELECT extname, extversion FROM pg_extension 
WHERE extname IN ('vector', 'pg_trgm');
-- Expected: 2 rows with version numbers

3. Create Searchable Table with Vectors

sql
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    embedding vector(1536),
    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(content, '')), 'B')
    ) STORED
);

-- Create indexes
CREATE INDEX idx_docs_search ON documents USING GIN (search_vector);
CREATE INDEX idx_docs_embedding ON documents USING hnsw (embedding vector_cosine_ops);

-- Verify indexes created:
SELECT indexname FROM pg_indexes WHERE tablename = 'documents';
-- Expected: idx_docs_search, idx_docs_embedding, documents_pkey

4. Query from Python

python
import asyncpg

async def search(pool, query: str, embedding: list[float], limit: int = 10):
    return await pool.fetch("""
        SELECT id, title,
               ts_rank(search_vector, websearch_to_tsquery('english', $1)) AS text_rank,
               embedding <=> $2::vector AS vector_dist
        FROM documents
        WHERE search_vector @@ websearch_to_tsquery('english', $1)
        ORDER BY vector_dist
        LIMIT $3
    """, query, embedding, limit)

# Verify connection works:
# pool = await asyncpg.create_pool('postgresql://postgres:devpass@localhost/postgres')
# rows = await pool.fetch("SELECT 1 AS test")
# assert rows[0]['test'] == 1

Decision Trees

Which Search Approach?

code
Need search? ─┬─► Exact keyword match ──────► B-tree index + WHERE =
              │
              ├─► Full-text search (FTS) ───► tsvector + GIN + ts_rank
              │
              ├─► Relevance like Google ────► pg_search BM25 (ParadeDB)
              │
              ├─► Typo tolerance ───────────► pg_trgm + similarity()
              │
              ├─► Semantic/AI search ───────► pgvector + embeddings
              │
              └─► Hybrid (keywords + semantic) ► Combine tsvector + pgvector

Which Vector Index?

code
Vector index? ─┬─► Dataset < 100K rows ────► No index (exact search OK)
               │
               ├─► Need best recall ────────► HNSW (slower build, fast query)
               │
               ├─► Fast index build ────────► IVFFlat (needs data first)
               │
               ├─► On AlloyDB ──────────────► ScaNN (Google optimized)
               │
               ├─► On Azure ────────────────► pg_diskann (32x less memory)
               │
               ├─► Billions of vectors ─────► VectorChord vchordrq (self-host)
               │
               └─► Dimensions > 2000 ───────► halfvec or binary quantization

Which Python Library?

code
Python lib? ──┬─► Sync, simple, stable ─────► psycopg2
              │
              ├─► Async + modern features ──► psycopg3
              │
              ├─► Max async performance ────► asyncpg
              │
              └─► ORM needed ───────────────► SQLAlchemy + asyncpg/psycopg

Which Index Type for Column?

code
Column type? ─┬─► Scalar (int, text, timestamp) ─► B-tree (default)
              │
              ├─► JSONB ────────────────────────┬► GIN (general queries)
              │                                 └► GIN jsonb_path_ops (@> only)
              │
              ├─► Array ────────────────────────► GIN
              │
              ├─► tsvector ─────────────────────► GIN (or GiST for updates)
              │
              ├─► vector ───────────────────────► HNSW or IVFFlat
              │
              └─► Range / Geometric ────────────► GiST

Common Patterns

For implementation details, see the reference files:

Index Tuning Quick Reference

HNSW Parameters

ParameterDefaultGuidance
m16Higher = better recall, more memory. 12-48 typical
ef_construction64Higher = better index quality, slower build. 64-200
hnsw.ef_search40Set at query time. Higher = better recall, slower
sql
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=100);
SET hnsw.ef_search = 100;  -- Before querying

-- Verify setting applied:
SHOW hnsw.ef_search;

IVFFlat Parameters

ParameterGuidance
listssqrt(rows) for <1M rows; rows/1000 for >1M
ivfflat.probesStart at sqrt(lists), increase for recall
sql
CREATE INDEX ON docs USING ivfflat (embedding vector_l2_ops) WITH (lists=100);
SET ivfflat.probes = 10;

Troubleshooting Quick Reference

SymptomLikely CauseFix
Seq Scan on indexed columnStats outdatedANALYZE tablename;
Vector search slowNo index or low ef_searchCreate HNSW index, increase ef_search
Poor vector recallIVFFlat probes too lowIncrease ivfflat.probes
FTS not matchingWrong language configCheck to_tsvector('english', ...)
Index not usedQuery doesn't match opsVerify operator class matches query
Connection timeoutPool exhaustedIncrease pool size or fix leaks
Extension not foundNot installedCREATE EXTENSION name;
HNSW build OOMInsufficient memoryIncrease maintenance_work_mem
Filtered queries return few resultsFiltering after index scanEnable hnsw.iterative_scan
Connection drops in productionNo health checkingUse check=ConnectionPool.check_connection
Scaling past 100M vectorspgvector limitsConsider VectorChord vchordrq

For detailed troubleshooting, see search-vectors-json.md.

Script Usage

bash
pip install -r scripts/requirements.txt  # Install dependencies first
ScriptPurposeWhen to Use
setup_extensions.pyInstall pgvector, pg_trgm extensionsInitial database setup
create_search_tables.pyCreate tables with search_vector and embedding columnsAfter extensions installed
health_check.pyCheck index health, bloat, and performanceDiagnosing slow queries
vector_search.py --demoDemonstrate vector similarity queriesLearning pgvector patterns
bulk_insert.pyHigh-performance data loadingImporting large datasets
fts_examples.pyFull-text search query examplesLearning FTS syntax
connection_pool.pyConnection pooling patternsProduction deployments

Example:

bash
python scripts/setup_extensions.py --host localhost --dbname mydb
python scripts/create_search_tables.py --host localhost --dbname mydb
python scripts/health_check.py --host localhost --dbname mydb

Cloud Quick Reference

ProviderpgvectorBM25 SupportConnection Pooling
AWS RDS/Aurora0.8.0pg_textsearch (preview)RDS Proxy
GCP Cloud SQL0.8.0pg_textsearch (preview)Cloud SQL Proxy
GCP AlloyDB0.8.0 + ScaNNpg_textsearch (preview)Built-in
Azure Flexible0.8.0 + pg_diskannpg_textsearch (preview)Built-in PgBouncer
Neonpg_searchBuilt-in
Supabasepg_searchBuilt-in

Serverless options: Neon (scale-to-zero, instant branching) and Supabase (BaaS with auth/real-time) are ideal for dev/test and startups. See cloud-serverless.md.

BM25 Options:

  • pg_search (ParadeDB): Production-ready, self-host or ParadeDB managed service
  • pg_textsearch (TigerData): Preview status, available on managed PostgreSQL services

See provider-specific files for setup commands: AWS | GCP | Azure

Reference Files

Load these for detailed implementation guidance:

ReferenceLoad When
setup-and-docker.mdDocker setup, extension installation, postgresql.conf tuning
search-fulltext.mdFull-text search (FTS), BM25 setup, trigram fuzzy search
search-vectors-json.mdpgvector tuning, JSONB/array indexing, maintenance
python-drivers.mdpsycopg2/psycopg3/asyncpg, connection pools, SQLAlchemy
python-queries.mdBulk inserts, FTS queries, vector queries, JSONB operations
cloud-aws.mdAWS RDS/Aurora setup, RDS Proxy
cloud-gcp.mdGCP Cloud SQL/AlloyDB, ScaNN indexes
cloud-azure.mdAzure Flexible Server, pg_diskann
cloud-serverless.mdNeon, Supabase (scale-to-zero, branching)
cloud-common.mdExtension matrix, pooling, production config, costs