PostgreSQL Databases (Giratina)
Server: 192.168.1.100 PostgreSQL Version: 17.6 Location: /mnt/raid6/databases/postgresql/17/main Storage: 13TB RAID6 array (4.9TB available)
Database Summary
| Database | Size | Records | Primary Use |
|---|---|---|---|
| pmc_fulltext | 227 GB | 4.5M articles, 255M citations | Medical/scientific research |
| legal_db | 81 GB | 9.2M documents | Legal document search |
| stackoverflow_complete | 53 GB | 18.5M solutions | Programming Q&A |
| arxiv_papers | 16 GB | 2.8M papers | Scientific preprints |
| TOTAL | 377 GB | ~340M records |
pmc_fulltext (227 GB) - PubMed Central Research
Medical and scientific research articles with full citation network.
Stats
- •4,483,066 articles with full-text (68 GB raw text)
- •255,138,874 citations - massive citation network
- •88,560,660 author records
- •35 tables including network analysis, co-citation graphs
- •Query performance: 23ms for indexed queries
Key Tables
sql
-- Articles with full text SELECT id, title, abstract, full_text FROM articles LIMIT 10; -- Citation network SELECT * FROM citations WHERE source_pmid = '12345678'; -- Author search SELECT * FROM authors WHERE name ILIKE '%Smith%';
Example Queries
sql
-- Find articles by keyword in title SELECT id, title, pub_date FROM articles WHERE title ILIKE '%cancer%' AND pub_date > '2020-01-01' LIMIT 100; -- Get citation count for an article SELECT COUNT(*) FROM citations WHERE target_pmid = '12345678'; -- Find co-authors SELECT DISTINCT a2.name FROM authors a1 JOIN article_authors aa1 ON a1.id = aa1.author_id JOIN article_authors aa2 ON aa1.article_id = aa2.article_id JOIN authors a2 ON aa2.author_id = a2.id WHERE a1.name = 'John Smith' AND a2.name != 'John Smith';
legal_db (81 GB) - Legal Documents
Legal documents from the "Pile of Law" dataset with excellent full-text search.
Stats
- •9,206,663 documents (124 GB raw → 11 GB stored, 91% compression)
- •70 GB of indexes (full-text search, trigrams, hash indexes)
- •Query performance: 1ms for simple lookups
Document Categories
| Category | Count |
|---|---|
| Caselaw | 5.9M |
| Miscellaneous | 2.2M |
| Patents | 543K |
| Contracts | 353K |
Key Tables
sql
-- Main documents table
SELECT id, title, category, text FROM documents LIMIT 10;
-- Full-text search
SELECT id, title FROM documents
WHERE to_tsvector('english', text) @@ to_tsquery('contract & breach');
Example Queries
sql
-- Search by category
SELECT id, title, LEFT(text, 500) FROM documents
WHERE category = 'caselaw'
LIMIT 100;
-- Full-text search with ranking
SELECT id, title, ts_rank(to_tsvector('english', text), query) AS rank
FROM documents, to_tsquery('intellectual & property') query
WHERE to_tsvector('english', text) @@ query
ORDER BY rank DESC
LIMIT 20;
-- Find patents
SELECT id, title FROM documents
WHERE category = 'patents' AND title ILIKE '%machine learning%';
stackoverflow_complete (53 GB) - Programming Solutions
Programming Q&A from Stack Overflow.
Stats
- •18,563,455 solutions
- •12,718,160 unique questions
- •12 programming languages tracked
- •Full-text search enabled on answers
- •Query performance: ~26 seconds for complex queries (needs optimization)
Top Languages
| Language | Solutions |
|---|---|
| JavaScript | 3.4M |
| Java | 2.6M |
| Python | 1.9M |
| C# | 1.5M |
| PHP | 1.2M |
Key Tables
sql
-- Questions SELECT id, title, tags, score FROM questions LIMIT 10; -- Answers/Solutions SELECT id, question_id, body, score FROM answers LIMIT 10;
Example Queries
sql
-- Find Python questions by keyword SELECT id, title, score FROM questions WHERE tags LIKE '%python%' AND title ILIKE '%pandas%' ORDER BY score DESC LIMIT 20; -- Get top answers for a question SELECT body, score FROM answers WHERE question_id = 12345 ORDER BY score DESC; -- Find high-scoring solutions SELECT q.title, a.body, a.score FROM answers a JOIN questions q ON a.question_id = q.id WHERE q.tags LIKE '%python%' AND a.score > 100 ORDER BY a.score DESC LIMIT 10;
arxiv_papers (16 GB) - Research Papers
Scientific preprints from ArXiv.
Stats
- •2,770,235 papers
- •2.2 GB of abstracts with full-text search
- •Multiple full-text search indexes
- •Query performance: 23ms for category searches
Key Tables
sql
-- Papers SELECT id, title, abstract, categories, authors FROM papers LIMIT 10;
Example Queries
sql
-- Search by category (cs.AI, physics, math, etc.)
SELECT id, title, authors FROM papers
WHERE categories LIKE '%cs.AI%'
ORDER BY id DESC
LIMIT 100;
-- Full-text search in abstracts
SELECT id, title FROM papers
WHERE to_tsvector('english', abstract) @@ to_tsquery('transformer & attention')
LIMIT 50;
-- Find papers by author
SELECT id, title, categories FROM papers
WHERE authors ILIKE '%Hinton%';
-- Recent ML papers
SELECT id, title, abstract FROM papers
WHERE categories LIKE '%cs.LG%'
ORDER BY id DESC
LIMIT 20;
Connection Details
User Credentials
Read-Only User (Recommended for Scripts)
- •Username:
readonly - •Password:
ReadOnly2025 - •Permissions: SELECT only (cannot modify data)
Admin User (Local Only)
- •Username:
postgres - •Network access: Not configured (use locally on Giratina)
From Giratina (localhost)
bash
psql -U postgres -d pmc_fulltext psql -U postgres -d legal_db psql -U postgres -d stackoverflow_complete psql -U postgres -d arxiv_papers
From Other Nodes (192.168.1.x)
bash
# Read-only access (recommended for all network connections) PGPASSWORD='ReadOnly2025' psql -h 192.168.1.100 -U readonly -d pmc_fulltext
Connection String
code
postgresql://readonly:ReadOnly2025@192.168.1.100:5432/DATABASE_NAME
Note: No special characters in password - exclamation marks cause md5 auth failures with psycopg2.
Code Examples
Python (psycopg2)
python
import psycopg2
conn = psycopg2.connect(
host="192.168.1.100",
port=5432,
database="pmc_fulltext",
user="readonly",
password="ReadOnly2025"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM articles")
print(cursor.fetchone())
Python (SQLAlchemy + Pandas)
python
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine(
"postgresql://readonly:ReadOnly2025@192.168.1.100:5432/legal_db"
)
df = pd.read_sql("SELECT * FROM documents LIMIT 1000", engine)
Node.js (pg)
javascript
const { Client } = require('pg');
const client = new Client({
host: '192.168.1.100',
port: 5432,
database: 'stackoverflow_complete',
user: 'readonly',
password: 'ReadOnly2025'
});
await client.connect();
const res = await client.query('SELECT COUNT(*) FROM solutions');
console.log(res.rows);
Server-Side Cursors (Large Datasets)
python
import psycopg2
conn = psycopg2.connect(
host="192.168.1.100",
database="pmc_fulltext",
user="readonly",
password="ReadOnly2025"
)
# Named cursor = server-side cursor (streams results)
cursor = conn.cursor(name='fetch_large_data')
cursor.execute("SELECT * FROM citations WHERE cited_year > 2020")
# Fetch in batches to avoid memory issues
while True:
rows = cursor.fetchmany(10000) # 10k rows at a time
if not rows:
break
for row in rows:
process(row)
Network Access
Local Network (192.168.1.x) ✅
Direct connection works from any cluster node:
- •Silvally (192.168.1.52)
- •Hoopa (192.168.1.79)
- •Victini (192.168.1.115)
- •Talon (192.168.1.7)
- •Any laptop/desktop on the network
External Access
Use Cloudflare Tunnel:
bash
cloudflared access tcp --hostname giratina-db.built-simple.ai --url localhost:5432 & psql -h localhost -p 5432 -U readonly -d pmc_fulltext
Performance Tips
| Operation | Speed | Notes |
|---|---|---|
| Simple SELECT by ID | ~1ms | Using indexes |
| Category filter | ~23ms | Indexed columns |
| Full-text search | Fast | GIN indexes enabled |
| Complex joins | Varies | Use EXPLAIN ANALYZE |
Best Practices:
- •Use
readonlyuser for analysis scripts - •Fetch only needed columns (avoid
SELECT *) - •Filter on server side, not in Python
- •Use server-side cursors for large results
- •Use parallel processing for big jobs
Backup Files (on RAID6)
| File | Size |
|---|---|
| legal_db.sql.gz | 40 GB |
| legal_documents_20250828_095834.parquet.7z | 78 GB |
| legal_master_duckdb.db.7z | 39 GB |