
postgresql-expert

Deep expertise in PostgreSQL internals, performance tuning, and production operations. Covers MVCC, vacuum mechanics, connection pooling, partitioning, replication, index types, WAL, point-in-time recovery, row-level security, and extensions.

MoltbotDen
Data & Analytics

PostgreSQL Expert

PostgreSQL is a full-featured RDBMS with a sophisticated internals model that rewards deep understanding. Performance problems and subtle bugs almost always trace back to MVCC semantics, bloat accumulation, or suboptimal index selection. This skill covers the mental models and concrete techniques that separate competent Postgres users from true experts.

Core Mental Model

PostgreSQL uses Multi-Version Concurrency Control (MVCC): every row has xmin and xmax transaction IDs that define its visible lifetime. Writers never block readers — each transaction sees a snapshot of the database taken at its start. This elegance comes at a cost: dead tuples accumulate and must be physically reclaimed by VACUUM. Connection count is a first-class resource because each connection spawns a backend process. Index choice is never arbitrary — each type is optimized for a fundamentally different access pattern.
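The xmin/xmax columns are real, queryable system columns, so MVCC can be observed directly. A quick sketch (the `accounts` table is assumed to exist):

```sql
-- xmin: transaction that created this row version
-- xmax: transaction that deleted or updated it (0 if the version is still live)
-- ctid: physical location (page, tuple); changes on every UPDATE
SELECT xmin, xmax, ctid, id FROM accounts LIMIT 5;

-- Current transaction ID, for comparison against xmin/xmax
SELECT txid_current();
```

Running the first query before and after an UPDATE in another session shows the new row version appear with a fresh xmin and a new ctid.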


MVCC and Transaction Isolation

Isolation Levels

Level                    | Dirty Read   | Non-Repeatable Read | Phantom Read | Use When
READ COMMITTED (default) | Not possible | Possible            | Possible     | Most OLTP workloads
REPEATABLE READ          | Not possible | Prevented           | Prevented*   | Reports, analytics queries
SERIALIZABLE             | Not possible | Prevented           | Prevented    | Financial transactions, strict consistency

(Dirty reads are never possible in Postgres at any level.)
*Postgres REPEATABLE READ prevents phantoms too, which is stronger than the SQL standard requires.

-- Example: serializable transaction for double-spend prevention
BEGIN ISOLATION LEVEL SERIALIZABLE;

-- FOR UPDATE is optional under SERIALIZABLE: it makes a conflicting
-- transaction block here instead of failing later at COMMIT
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- Concurrent transactions touching the same rows are serialized:
-- one succeeds, the others fail with
--   ERROR: could not serialize access due to concurrent update
-- and must be retried by the application

UPDATE accounts SET balance = balance - $2 WHERE id = $1;
COMMIT;

Locking Hierarchy

-- Advisory locks: application-level distributed coordination
-- Useful when you don't want to lock actual rows
SELECT pg_try_advisory_lock(hashtext('job_processor_' || job_id::text))
FROM jobs WHERE status = 'pending' LIMIT 1;

-- Release when done (the lock is keyed by the hash value, not tied to any
-- row; job_id must be supplied by the application here)
SELECT pg_advisory_unlock(hashtext('job_processor_' || job_id::text));

-- Session-level vs transaction-level advisory locks
-- pg_advisory_lock        → held until explicit release or session end
-- pg_advisory_xact_lock   → released automatically at transaction end
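For one-off critical sections such as a migration step, the transaction-scoped variant is usually the safer choice because it cannot leak past a crash or an early return. A minimal sketch (the key `42` is an arbitrary application-chosen integer):

```sql
BEGIN;
-- Blocks until the lock is free; released automatically at COMMIT or ROLLBACK
SELECT pg_advisory_xact_lock(42);
-- ... run the critical section, e.g. an idempotent migration step ...
COMMIT;  -- lock released here; no explicit unlock call exists or is needed
```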

Vacuum Mechanics and Bloat

How Bloat Accumulates

Every UPDATE writes a new row version and marks the old one dead. Every DELETE marks rows dead. Plain VACUUM marks dead-tuple space reusable for future writes but rarely shrinks the file on disk; VACUUM FULL rewrites the entire table and returns space to the OS (requires an ACCESS EXCLUSIVE lock, so use it sparingly).
-- Identify bloated tables
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
  last_autovacuum,
  last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Autovacuum Tuning

Default autovacuum settings are conservative. High-write tables need aggressive tuning:
-- Per-table autovacuum override (preferred: least invasive)
ALTER TABLE high_write_table SET (
  autovacuum_vacuum_scale_factor   = 0.01,   -- 1% dead tuples triggers vacuum (default 20%)
  autovacuum_analyze_scale_factor  = 0.005,  -- 0.5% change triggers analyze
  autovacuum_vacuum_cost_delay     = 2,      -- ms; default is 2ms since PG12 (20ms before)
  autovacuum_vacuum_threshold      = 50      -- minimum rows before triggering
);

-- postgresql.conf global tuning for heavy-write systems
-- autovacuum_max_workers = 6            (default 3)
-- autovacuum_vacuum_cost_delay = 2ms    (default 2ms since PG12, 20ms before; higher = more IO throttling)
-- autovacuum_vacuum_scale_factor = 0.02
-- Force manual vacuum on a specific table
VACUUM (VERBOSE, ANALYZE) high_write_table;

-- Vacuum full (rewrites table — exclusive lock, use with caution)
-- Better alternative: pg_repack extension (online repack, no exclusive lock)
VACUUM FULL high_write_table;
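To watch a vacuum while it runs (manual or autovacuum), the `pg_stat_progress_vacuum` view (PG 9.6+) reports per-phase progress:

```sql
-- One row per backend currently vacuuming
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       round(100.0 * p.heap_blks_scanned / NULLIF(p.heap_blks_total, 0), 1) AS pct_done
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```

A vacuum stuck in the "vacuuming indexes" phase for a long time is a hint that the table has too many indexes or that maintenance_work_mem is too small.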

Connection Pooling with PgBouncer

PostgreSQL forks a new OS process per connection (~5-10MB RAM each). At 500+ connections, context switching dominates. PgBouncer multiplexes many client connections onto a small pool of server connections.
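Before adding a pooler, it is worth measuring what the existing connections are actually doing; a large `idle` count is the classic sign that pooling will pay off:

```sql
-- Breakdown of client backends by state (idle, active, idle in transaction, ...)
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;
```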

Pool Modes

Mode                      | Connection reuse         | Safe for                    | Unsupported
Transaction (recommended) | After each transaction   | Most apps                   | Session-level SET, prepared statements (before PgBouncer 1.21's max_prepared_statements)
Session                   | After client disconnects | Legacy apps, SET-heavy apps | High concurrency (no multiplexing benefit)
Statement                 | After each statement     | Simple queries only         | Multi-statement transactions

# pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
pool_mode            = transaction
max_client_conn      = 1000         # Total clients PgBouncer accepts
default_pool_size    = 25           # Server connections per database/user pair
reserve_pool_size    = 5            # Extra connections for bursts
reserve_pool_timeout = 5            # Seconds before using reserve pool
server_idle_timeout  = 600          # Idle server connections closed after 10m
server_reset_query   = DISCARD ALL  # Resets SET vars between clients in session mode
auth_type            = scram-sha-256
listen_addr          = 0.0.0.0
listen_port          = 5432
Rule of thumb: default_pool_size = (core_count * 2) + effective_spindle_count. For a 4-core, NVMe server: ~10-20 connections per database.

Table Partitioning

When Partitioning Helps

  • Table exceeds ~50GB and most queries filter on the partition key
  • You need cheap bulk deletes (drop partition vs DELETE)
  • Archiving old data to cold storage
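The cheap-bulk-delete point is the big operational win: dropping a partition is a near-instant metadata operation, while an equivalent DELETE writes a dead tuple per row and then needs vacuuming. A sketch, assuming the partitioned `events` table and partition names used below:

```sql
-- Detach first so the drop doesn't wait on queries running against the parent
ALTER TABLE events DETACH PARTITION events_2024_q1;
DROP TABLE events_2024_q1;

-- Compare: this writes a dead tuple for every row and bloats the table
-- DELETE FROM events WHERE created_at < '2024-04-01';
```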

Partition Strategies

-- RANGE partitioning (best for time-series, append-heavy data)
CREATE TABLE events (
  id         BIGSERIAL,
  created_at TIMESTAMPTZ NOT NULL,
  payload    JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_q1 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE events_2024_q2 PARTITION OF events
  FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Automate with pg_partman (4.x syntax shown; 5.x drops p_type and
-- takes an interval like p_interval => '1 month')
SELECT partman.create_parent(
  p_parent_table => 'public.events',
  p_control      => 'created_at',
  p_type         => 'native',
  p_interval     => 'monthly'
);

-- LIST partitioning (best for known categorical values)
CREATE TABLE orders (
  id     BIGSERIAL,
  region TEXT NOT NULL,
  total  NUMERIC
) PARTITION BY LIST (region);

CREATE TABLE orders_us   PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu   PARTITION OF orders FOR VALUES IN ('DE', 'FR', 'UK');
CREATE TABLE orders_apac PARTITION OF orders FOR VALUES IN ('AU', 'SG', 'JP');

-- HASH partitioning (best for even distribution, no natural partition key)
CREATE TABLE users (
  id       BIGSERIAL PRIMARY KEY,
  username TEXT
) PARTITION BY HASH (id);

CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_p1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE users_p2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE users_p3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);
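Partitioning only helps if the planner can prune: the partition key must appear in the WHERE clause in a form the planner can see through. EXPLAIN confirms it, using the `events` table from above:

```sql
-- Prunes: only the matching quarterly partition is scanned
EXPLAIN SELECT count(*) FROM events
WHERE created_at >= '2024-01-15' AND created_at < '2024-02-01';

-- Does not prune: the key is wrapped in a function, so every partition is scanned
EXPLAIN SELECT count(*) FROM events
WHERE date_trunc('day', created_at) = '2024-01-15';
```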

Index Types

Index Type       | Best For                                            | Avoid When
B-tree (default) | Equality, ranges, ORDER BY, <, >                    | Full-text search, geometric data
GIN              | Arrays, JSONB keys, full-text (tsvector)            | Frequently updated columns (slow inserts)
GiST             | Geometric data, ranges, nearest-neighbor            | Simple equality lookups
BRIN             | Huge append-only tables with naturally ordered data | Random inserts, non-correlated data
Hash             | Equality only, slightly faster than B-tree for =    | Range queries

-- GIN index for JSONB queries
CREATE INDEX idx_payload_gin ON events USING GIN (payload);
-- Supports: payload @> '{"status": "active"}', payload ? 'key'

-- GIN with jsonb_path_ops (smaller, faster for @> operator only)
CREATE INDEX idx_payload_path ON events USING GIN (payload jsonb_path_ops);

-- Partial index (index only subset — smaller, faster)
CREATE INDEX idx_active_users ON users (email) WHERE deleted_at IS NULL;

-- Expression index
CREATE INDEX idx_email_lower ON users (lower(email));
-- Query must use: WHERE lower(email) = '[email protected]'

-- BRIN for time-series (tiny index, good enough for append-only)
CREATE INDEX idx_events_created_brin ON events USING BRIN (created_at)
  WITH (pages_per_range = 128);

-- Covering index (include non-key columns to avoid heap fetch)
CREATE INDEX idx_orders_covering ON orders (customer_id)
  INCLUDE (total, status, created_at);
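Index choice cuts both ways: every index slows writes and competes for cache. `pg_stat_user_indexes` shows which ones actually earn their keep:

```sql
-- Indexes that are never scanned are pure write overhead; candidates for removal
SELECT schemaname, relname, indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC
LIMIT 20;
```

Check idx_scan over a representative window (counters accumulate since the last stats reset) before dropping anything, and remember that unique indexes enforce constraints even when never scanned.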

pg_stat_statements for Query Analysis

-- Enable extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements'

-- Find slowest queries by total time
SELECT
  round(total_exec_time::numeric, 2) AS total_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  rows,
  round(rows::numeric / calls, 1) AS rows_per_call,
  query
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 20;

-- Find queries with high variance (inconsistent performance)
SELECT
  query,
  calls,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  round(stddev_exec_time / NULLIF(mean_exec_time, 0) * 100, 1) AS cv_pct
FROM pg_stat_statements
WHERE calls > 100 AND mean_exec_time > 10
ORDER BY cv_pct DESC
LIMIT 20;
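Counters accumulate since the last reset, so old noise can drown out a recent regression. Resetting before a benchmark or deploy window gives clean numbers:

```sql
-- Clears all pg_stat_statements counters (requires appropriate privileges)
SELECT pg_stat_statements_reset();

-- PG14+: when were the stats last reset?
SELECT stats_reset FROM pg_stat_statements_info;
```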

Logical Replication and Slots

-- Publisher side
-- postgresql.conf: wal_level = logical
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- Or all tables:
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- Subscriber side (different Postgres instance)
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher_host dbname=mydb user=repl_user password=secret'
  PUBLICATION my_pub;

-- Monitor replication lag
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS replication_lag
FROM pg_replication_slots;

-- DANGER: Unused replication slots cause WAL accumulation — monitor slot lag!
-- Alert if lag > 1GB
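If a subscriber is gone for good, dropping its slot is the only way to let the retained WAL be recycled. A sketch (the slot name here is hypothetical):

```sql
-- Find inactive slots first
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

-- Then drop the stale one; this discards the WAL the slot was retaining
SELECT pg_drop_replication_slot('stale_subscriber_slot');
```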

Row-Level Security (RLS)

-- Enable RLS on table
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY; -- applies to table owner too

-- Policy: users see only their own documents
CREATE POLICY documents_isolation ON documents
  USING (owner_id = current_setting('app.current_user_id')::bigint);

-- Policy: allow insert only for own records
CREATE POLICY documents_insert ON documents
  FOR INSERT
  WITH CHECK (owner_id = current_setting('app.current_user_id')::bigint);

-- Application sets context before queries (SET LOCAL only takes effect
-- inside a transaction block; plain SET persists for the session)
SET LOCAL app.current_user_id = '42';
-- Or via connection parameter:
-- options='-c app.current_user_id=42'

-- Admin bypass (use security definer function or bypass RLS role)
CREATE ROLE admin_role BYPASSRLS;
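A quick way to verify a policy behaves as intended is to impersonate an application role from a superuser session (the `app_user` role and the data here are assumptions):

```sql
SET ROLE app_user;                     -- a role without BYPASSRLS, not the table owner
BEGIN;
SET LOCAL app.current_user_id = '42';
SELECT count(*) FROM documents;        -- should count only rows with owner_id = 42
COMMIT;
RESET ROLE;
```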

WAL and Point-in-Time Recovery

# postgresql.conf settings for PITR
# wal_level = replica          (minimum for base backup)
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://my-wal-bucket/wal/%f'
# restore_command = 'aws s3 cp s3://my-wal-bucket/wal/%f %p'

# Take a base backup
pg_basebackup -h localhost -U replication_user -D /backup/base -Ft -z -P

# PITR restore settings: recovery.conf before PG12; from PG12 onward they
# live in postgresql.conf plus an empty recovery.signal file in the data dir
# restore_command = 'aws s3 cp s3://my-wal-bucket/wal/%f %p'
# recovery_target_time = '2024-01-15 14:30:00'
# recovery_target_action = 'promote'
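Archiving silently falling behind is the usual way PITR setups rot; `pg_stat_archiver` makes failures visible:

```sql
SELECT archived_count,
       last_archived_wal,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
-- Alert if failed_count is rising or last_archived_wal stops advancing
```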

Useful Extensions

-- pg_trgm: trigram similarity for fuzzy search
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_name_trgm ON users USING GIN (name gin_trgm_ops);
SELECT name FROM users WHERE name % 'Jon Smth' ORDER BY similarity(name, 'Jon Smth') DESC LIMIT 5;

-- unaccent: accent-insensitive search
CREATE EXTENSION unaccent;
-- unaccent() is STABLE, not IMMUTABLE, so it cannot appear directly in an
-- index expression; wrap it in an IMMUTABLE function first
CREATE FUNCTION immutable_unaccent(text) RETURNS text
  AS $$ SELECT unaccent($1) $$ LANGUAGE sql IMMUTABLE;
CREATE INDEX idx_name_unaccent ON users USING GIN (immutable_unaccent(name) gin_trgm_ops);

-- uuid-ossp: UUID generation
CREATE EXTENSION "uuid-ossp";
SELECT uuid_generate_v4();  -- random UUID
SELECT uuid_generate_v1mc(); -- time-based, MAC-address variant

-- pgcrypto: password hashing, encryption
CREATE EXTENSION pgcrypto;
UPDATE users SET password_hash = crypt('plaintext_pw', gen_salt('bf', 10));
SELECT crypt('attempt', password_hash) = password_hash FROM users WHERE id = 1;

Anti-Patterns

-- ❌ SELECT * in production queries — fetches all columns, hurts index-only scans
SELECT * FROM orders WHERE customer_id = 1;
-- ✅ Name columns explicitly
SELECT id, total, status FROM orders WHERE customer_id = 1;

-- ❌ NOT IN with subquery (NULL values cause silent failures)
SELECT id FROM orders WHERE customer_id NOT IN (SELECT id FROM deleted_users);
-- ✅ Use NOT EXISTS or LEFT JOIN / IS NULL
SELECT o.id FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM deleted_users d WHERE d.id = o.customer_id);

-- ❌ OR across different columns often defeats a single index
-- (Postgres may fall back to a seq scan, or at best a BitmapOr of two scans)
SELECT * FROM orders WHERE customer_id = 1 OR status = 'pending';
-- ✅ UNION ALL (two separate indexed scans)
SELECT * FROM orders WHERE customer_id = 1
UNION ALL
SELECT * FROM orders WHERE status = 'pending' AND customer_id != 1;

-- ❌ Functions on indexed columns prevent index use
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- ✅ Expression index matching the exact function call
CREATE INDEX idx_email_lower ON users (LOWER(email));

-- ❌ Unbounded UPDATE on large tables: huge write contention and bloat
UPDATE large_table SET processed = true WHERE status = 'new';
-- ✅ Batch updates
DO $$
DECLARE
  updated INT;
BEGIN
  LOOP
    UPDATE large_table SET processed = true
    WHERE id IN (
      -- must exclude already-processed rows, or the loop never terminates
      SELECT id FROM large_table
      WHERE status = 'new' AND NOT processed
      LIMIT 1000
    );
    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    PERFORM pg_sleep(0.1); -- breathe between batches
  END LOOP;
END $$;

Quick Reference

Index Selection:
  equality + range              → B-tree (default)
  JSONB containment (@>)        → GIN with jsonb_path_ops
  JSONB existence (?)           → GIN (default ops)
  full-text search              → GIN on tsvector column
  fuzzy/similarity (pg_trgm)   → GIN with gin_trgm_ops
  geometry / spatial            → GiST
  huge append-only time col     → BRIN
  include extra cols            → B-tree with INCLUDE

Partition Strategy:
  time-series (logs, events)    → RANGE on timestamp
  known regions/types           → LIST
  even distribution, no key     → HASH

Vacuum Emergency:
  check bloat                   → pg_stat_user_tables.n_dead_tup
  force vacuum                  → VACUUM (VERBOSE, ANALYZE) table_name
  online repack (no lock)       → pg_repack -t table_name

Connection Math:
  pool size ≈ (cpu_cores * 2) + disk_count
  max_connections ≈ pool_size + 10 (superuser overhead)

Isolation Level:
  default OLTP                  → READ COMMITTED
  read-your-own-writes report   → REPEATABLE READ
  financial, no lost updates    → SERIALIZABLE
