PostgreSQL Expert
PostgreSQL is a full-featured RDBMS with a sophisticated internals model that rewards deep understanding. Performance problems and subtle bugs almost always trace back to MVCC semantics, bloat accumulation, or suboptimal index selection. This skill covers the mental models and concrete techniques that separate competent Postgres users from true experts.
Core Mental Model
PostgreSQL uses Multi-Version Concurrency Control (MVCC): every row has xmin and xmax transaction IDs that define its visible lifetime. Writers never block readers — each transaction sees a snapshot of the database taken at its start. This elegance comes at a cost: dead tuples accumulate and must be physically reclaimed by VACUUM. Connection count is a first-class resource because each connection spawns a backend process. Index choice is never arbitrary — each type is optimized for a fundamentally different access pattern.
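The xmin/xmax machinery is directly observable through hidden system columns; a quick way to watch MVCC in action (the accounts table here is illustrative):

```sql
-- xmin = txid that created this row version, xmax = txid that deleted/locked it
SELECT xmin, xmax, ctid, id, balance
FROM accounts
WHERE id = 1;

-- Compare against the current transaction ID
SELECT txid_current();

-- After an UPDATE from another session, re-running the first query shows a
-- new ctid and xmin: the old version is now a dead tuple awaiting VACUUM.
```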
MVCC and Transaction Isolation
Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use When |
| --- | --- | --- | --- | --- |
| READ COMMITTED (default) | ✗ | Possible | Possible | Most OLTP workloads |
| REPEATABLE READ | ✗ | ✗ | ✗* | Reports, analytics queries |
| SERIALIZABLE | ✗ | ✗ | ✗ | Financial transactions, strict consistency |
*Postgres REPEATABLE READ prevents phantoms too — stronger than SQL standard requires.
-- Example: serializable transaction for double-spend prevention
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- Postgres will serialize concurrent transactions touching same rows
-- One will succeed, others get ERROR: could not serialize access
UPDATE accounts SET balance = balance - $2 WHERE id = $1;
COMMIT;
Locking Hierarchy
-- Advisory locks: application-level distributed coordination
-- Useful when you don't want to lock actual rows
SELECT pg_try_advisory_lock(hashtext('job_processor_' || job_id::text))
FROM jobs WHERE status = 'pending' LIMIT 1;
-- Release when done
SELECT pg_advisory_unlock(hashtext('job_processor_' || job_id::text));
-- Session-level vs transaction-level advisory locks
-- pg_advisory_lock → held until explicit release or session end
-- pg_advisory_xact_lock → released automatically at transaction end
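Advisory locks aside, ordinary lock waits are best diagnosed from pg_stat_activity; a sketch using pg_blocking_pids() (available since 9.6):

```sql
-- Sessions that are currently blocked, and the PIDs blocking them
SELECT
    a.pid,
    pg_blocking_pids(a.pid) AS blocked_by,
    a.wait_event_type,
    a.state,
    left(a.query, 60) AS query_head
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```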
Vacuum Mechanics and Bloat
How Bloat Accumulates
Every UPDATE writes a new row version and marks the old one dead; every DELETE marks rows dead. Plain VACUUM makes dead-tuple space reusable within the table but rarely returns it to the OS; VACUUM FULL rewrites the entire table and does shrink it, but requires an ACCESS EXCLUSIVE lock (use sparingly).
-- Identify bloated tables
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
n_dead_tup,
n_live_tup,
round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
last_autovacuum,
last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
Autovacuum Tuning
Default autovacuum settings are conservative. High-write tables need aggressive tuning:
-- Per-table autovacuum override (preferred; least invasive)
ALTER TABLE high_write_table SET (
autovacuum_vacuum_scale_factor = 0.01, -- 1% dead tuples triggers vacuum (default 20%)
autovacuum_analyze_scale_factor = 0.005, -- 0.5% change triggers analyze
    autovacuum_vacuum_cost_delay = 2,        -- milliseconds (default 2ms since PG12, 20ms before)
autovacuum_vacuum_threshold = 50 -- minimum rows before triggering
);
-- postgresql.conf global tuning for heavy-write systems
-- autovacuum_max_workers = 6 (default 3)
-- autovacuum_vacuum_cost_delay = 2ms (default 2ms since PG12; higher values throttle IO more)
-- autovacuum_vacuum_scale_factor = 0.02
-- Force manual vacuum on a specific table
VACUUM (VERBOSE, ANALYZE) high_write_table;
-- Vacuum full (rewrites table — exclusive lock, use with caution)
-- Better alternative: pg_repack extension (online repack, no exclusive lock)
VACUUM FULL high_write_table;
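A vacuum already in flight can be watched live through pg_stat_progress_vacuum (9.6+), which helps decide whether to wait or intervene:

```sql
SELECT
    pid,
    relid::regclass AS table_name,
    phase,
    heap_blks_scanned,
    heap_blks_total,
    round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum;
```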
Connection Pooling with PgBouncer
PostgreSQL forks a new OS process per connection (~5-10MB RAM each). At 500+ connections, context switching dominates. PgBouncer multiplexes many client connections onto a small pool of server connections.
Pool Modes
| Mode | Connection reuse | Safe for | Unsupported |
| --- | --- | --- | --- |
| Transaction (recommended) | After each transaction | Most apps | Session state: SET, session-level prepared statements, LISTEN/NOTIFY |
| Session | After client disconnects | Legacy apps, SET-heavy code | High concurrency |
| Statement | After each statement | Simple autocommit queries | Multi-statement transactions |
# pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000 # Total clients PgBouncer accepts
default_pool_size = 25 # Server connections per database/user pair
reserve_pool_size = 5 # Extra connections for bursts
reserve_pool_timeout = 5 # Seconds before using reserve pool
server_idle_timeout = 600 # Idle server connections closed after 10m
server_reset_query = DISCARD ALL # Resets SET vars between clients in session mode
auth_type = scram-sha-256
listen_addr = 0.0.0.0
listen_port = 5432
Rule of thumb: default_pool_size = (core_count * 2) + effective_spindle_count. For a 4-core NVMe server: roughly 10-20 connections per database.
Table Partitioning
When Partitioning Helps
- Table exceeds ~50GB and most queries filter on the partition key
- You need cheap bulk deletes (drop partition vs DELETE)
- Archiving old data to cold storage
Partition Strategies
-- RANGE partitioning (best for time-series, append-heavy data)
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_q1 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE events_2024_q2 PARTITION OF events
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Automate with pg_partman extension
SELECT partman.create_parent(
p_parent_table => 'public.events',
p_control => 'created_at',
p_type => 'native',
p_interval => 'monthly'
);
-- LIST partitioning (best for known categorical values)
CREATE TABLE orders (
id BIGSERIAL,
region TEXT NOT NULL,
total NUMERIC
) PARTITION BY LIST (region);
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('DE', 'FR', 'UK');
CREATE TABLE orders_apac PARTITION OF orders FOR VALUES IN ('AU', 'SG', 'JP');
-- HASH partitioning (best for even distribution, no natural partition key)
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username TEXT
) PARTITION BY HASH (id);
CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_p1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE users_p2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE users_p3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);
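Whatever the strategy, confirm the planner actually prunes partitions with EXPLAIN; a query on the events table above that filters on the partition key should touch only the matching partition:

```sql
-- Expect only events_2024_q1 in the plan, not a scan of every partition
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM events
WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01';

-- Pruning is on by default since PG11; verify with:
SHOW enable_partition_pruning;
```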
Index Types
| Index Type | Best For | Avoid When |
| --- | --- | --- |
| B-tree (default) | Equality, ranges, ORDER BY, <, > | Full-text search, geometric |
| GIN | Arrays, JSONB keys, full-text (tsvector) | Frequently updated columns (slow inserts) |
| GiST | Geometric data, ranges, nearest-neighbor | Simple equality lookups |
| BRIN | Huge append-only tables with naturally ordered data | Random inserts, non-correlated data |
| Hash | Equality only, slightly faster than B-tree for = | Range queries |
-- GIN index for JSONB queries
CREATE INDEX idx_payload_gin ON events USING GIN (payload);
-- Supports: payload @> '{"status": "active"}', payload ? 'key'
-- GIN with jsonb_path_ops (smaller, faster for @> operator only)
CREATE INDEX idx_payload_path ON events USING GIN (payload jsonb_path_ops);
-- Partial index (index only subset — smaller, faster)
CREATE INDEX idx_active_users ON users (email) WHERE deleted_at IS NULL;
-- Expression index
CREATE INDEX idx_email_lower ON users (lower(email));
-- Query must use: WHERE lower(email) = '[email protected]'
-- BRIN for time-series (tiny index, good enough for append-only)
CREATE INDEX idx_events_created_brin ON events USING BRIN (created_at)
WITH (pages_per_range = 128);
-- Covering index (include non-key columns to avoid heap fetch)
CREATE INDEX idx_orders_covering ON orders (customer_id)
INCLUDE (total, status, created_at);
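Every index taxes writes, so audit usage periodically; a sketch against pg_stat_user_indexes that flags never-scanned indexes as drop candidates:

```sql
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```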
pg_stat_statements for Query Analysis
-- Enable extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf (needs a server restart)
-- Find slowest queries by total time
SELECT
round(total_exec_time::numeric, 2) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS avg_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
rows,
round(rows::numeric / calls, 1) AS rows_per_call,
query
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 20;
-- Find queries with high variance (inconsistent performance)
SELECT
query,
calls,
round(mean_exec_time::numeric, 2) AS avg_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round(stddev_exec_time / NULLIF(mean_exec_time, 0) * 100, 1) AS cv_pct
FROM pg_stat_statements
WHERE calls > 100 AND mean_exec_time > 10
ORDER BY cv_pct DESC
LIMIT 20;
Logical Replication and Slots
-- Publisher side
-- postgresql.conf: wal_level = logical
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- Or all tables:
CREATE PUBLICATION my_pub FOR ALL TABLES;
-- Subscriber side (different Postgres instance)
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=publisher_host dbname=mydb user=repl_user password=secret'
PUBLICATION my_pub;
-- Monitor replication lag
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS replication_lag
FROM pg_replication_slots;
-- DANGER: Unused replication slots cause WAL accumulation — monitor slot lag!
-- Alert if lag > 1GB
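Finding and removing a dead slot looks like this (the slot name is illustrative; dropping a slot is irreversible for that subscriber):

```sql
-- Inactive slots pin WAL on disk indefinitely
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;

-- Drop a slot whose consumer is confirmed gone
SELECT pg_drop_replication_slot('stale_slot_name');
```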
Row-Level Security (RLS)
-- Enable RLS on table
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY; -- applies to table owner too
-- Policy: users see only their own documents
CREATE POLICY documents_isolation ON documents
USING (owner_id = current_setting('app.current_user_id')::bigint);
-- Policy: allow insert only for own records
CREATE POLICY documents_insert ON documents
FOR INSERT
WITH CHECK (owner_id = current_setting('app.current_user_id')::bigint);
-- Application sets context before queries
SET LOCAL app.current_user_id = '42';
-- Or via connection parameter:
-- options='-c app.current_user_id=42'
-- Admin bypass (use security definer function or bypass RLS role)
CREATE ROLE admin_role BYPASSRLS;
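Policies are easy to get subtly wrong, so inspect and test them from a non-privileged role (app_user here is illustrative):

```sql
-- List policies defined on the table
SELECT policyname, cmd, qual, with_check
FROM pg_policies
WHERE tablename = 'documents';

-- Exercise the policy as the application role
SET ROLE app_user;
SET app.current_user_id = '42';
SELECT count(*) FROM documents;  -- counts only rows with owner_id = 42
RESET ROLE;
```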
WAL and Point-in-Time Recovery
# postgresql.conf settings for PITR
# wal_level = replica (minimum for base backup)
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://my-wal-bucket/wal/%f'
# restore_command = 'aws s3 cp s3://my-wal-bucket/wal/%f %p'
# Take a base backup
pg_basebackup -h localhost -U replication_user -D /backup/base -Ft -z -P
# PG12+: recovery settings go in postgresql.conf plus an empty recovery.signal file (recovery.conf is obsolete)
# restore_command = 'aws s3 cp s3://my-wal-bucket/wal/%f %p'
# recovery_target_time = '2024-01-15 14:30:00'
# recovery_target_action = 'promote'
Useful Extensions
-- pg_trgm: trigram similarity for fuzzy search
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_name_trgm ON users USING GIN (name gin_trgm_ops);
SELECT name FROM users WHERE name % 'Jon Smth' ORDER BY similarity(name, 'Jon Smth') DESC LIMIT 5;
-- unaccent: accent-insensitive search
CREATE EXTENSION unaccent;
-- unaccent() is STABLE, not IMMUTABLE, so it cannot be indexed directly; wrap it first
CREATE FUNCTION immutable_unaccent(text) RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT unaccent($1) $$;
CREATE INDEX idx_name_unaccent ON users USING GIN (immutable_unaccent(name) gin_trgm_ops);
-- uuid-ossp: UUID generation
CREATE EXTENSION "uuid-ossp";
SELECT uuid_generate_v4(); -- random UUID
SELECT uuid_generate_v1mc(); -- time-based, MAC-address variant
-- pgcrypto: password hashing, encryption
CREATE EXTENSION pgcrypto;
UPDATE users SET password_hash = crypt('plaintext_pw', gen_salt('bf', 10));
SELECT crypt('attempt', password_hash) = password_hash FROM users WHERE id = 1;
Anti-Patterns
-- ❌ SELECT * in production queries — fetches all columns, hurts index-only scans
SELECT * FROM orders WHERE customer_id = 1;
-- ✅ Name columns explicitly
SELECT id, total, status FROM orders WHERE customer_id = 1;
-- ❌ NOT IN with subquery (NULL values cause silent failures)
SELECT id FROM orders WHERE customer_id NOT IN (SELECT id FROM deleted_users);
-- ✅ Use NOT EXISTS or LEFT JOIN / IS NULL
SELECT o.id FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM deleted_users d WHERE d.id = o.customer_id);
-- ❌ OR across different columns often defeats a single index scan (planner may use BitmapOr or fall back to seq scan)
SELECT * FROM orders WHERE customer_id = 1 OR status = 'pending';
-- ✅ UNION ALL (two separate indexed scans)
SELECT * FROM orders WHERE customer_id = 1
UNION ALL
SELECT * FROM orders WHERE status = 'pending' AND customer_id != 1;
-- ❌ Functions on indexed columns prevent index use
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- ✅ Expression index matching the exact function call
CREATE INDEX idx_email_lower ON users (LOWER(email));
-- ❌ Unbounded UPDATE on large tables — creates huge write contention
UPDATE large_table SET processed = true WHERE status = 'new';
-- ✅ Batch updates
DO $$
DECLARE
  updated INT;
BEGIN
  LOOP
    UPDATE large_table SET processed = true
    WHERE id IN (SELECT id FROM large_table
                 WHERE status = 'new' AND NOT processed  -- exclude already-updated rows or the loop never ends
                 LIMIT 1000);
    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    PERFORM pg_sleep(0.1); -- let vacuum and replication breathe
  END LOOP;
END $$;
Quick Reference
Index Selection:
equality + range → B-tree (default)
JSONB containment (@>) → GIN with jsonb_path_ops
JSONB existence (?) → GIN (default ops)
full-text search → GIN on tsvector column
fuzzy/similarity (pg_trgm) → GIN with gin_trgm_ops
geometry / spatial → GiST
huge append-only time col → BRIN
include extra cols → B-tree with INCLUDE
Partition Strategy:
time-series (logs, events) → RANGE on timestamp
known regions/types → LIST
even distribution, no key → HASH
Vacuum Emergency:
check bloat → pg_stat_user_tables.n_dead_tup
force vacuum → VACUUM (VERBOSE, ANALYZE) table_name
online repack (no lock) → pg_repack -t table_name
Connection Math:
pool size ≈ (cpu_cores * 2) + disk_count
max_connections ≈ pool_size + 10 (superuser overhead)
Isolation Level:
default OLTP → READ COMMITTED
read-your-own-writes report → REPEATABLE READ
financial, no lost updates → SERIALIZABLE