Why this matters
As a Data Analyst, your queries must be fast and reliable. Tables grow, rows get updated/deleted, and the optimizer relies on statistics to pick efficient plans. Routine maintenance keeps your dashboards responsive and your analyses correct.
- Speed up slow dashboards by removing dead rows that bloat tables.
- Keep query plans accurate with fresh column statistics.
- Prevent outages caused by transaction ID wraparound (in systems like PostgreSQL).
Who this is for
- Data Analysts and Analytics Engineers running SQL on shared warehouses or OLTP replicas.
- Anyone who owns analytical tables and needs predictable query performance.
Prerequisites
- Comfortable with basic SQL (SELECT, WHERE, joins) and reading EXPLAIN output.
- Access to a PostgreSQL database where you can run maintenance commands.
Concept explained simply
VACUUM: Reclaims space from deleted/updated rows (dead tuples) and keeps transaction IDs healthy. Normal VACUUM is online and concurrent; VACUUM FULL rewrites the table, frees more space, but requires an exclusive lock.
ANALYZE: Scans a sample of your table's rows to estimate value distributions, null fractions, and distinct counts. The optimizer uses these statistics to choose join orders and index usage efficiently.
Mental model
- Imagine a library: VACUUM removes discarded books taking shelf space, so aisles stay clear.
- ANALYZE updates the catalog’s knowledge of which books are popular or rare, so the librarian can fetch them quickly.
- Autovacuum (in PostgreSQL) is the nightly cleaning crew. You still need to spot-check busy sections and occasionally do a deeper clean.
Key commands and views
- VACUUM; or VACUUM (VERBOSE); reclaims space from dead tuples while the table stays online.
- VACUUM ANALYZE; reclaims space and refreshes stats together after large changes.
- VACUUM FULL; deep compaction via a full table rewrite; frees more space but requires an exclusive lock (plan for downtime).
- ANALYZE; or ANALYZE schema.table; or ANALYZE table (column1, column2); refreshes statistics.
- ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = ..., autovacuum_vacuum_threshold = ...); tunes when autovacuum triggers for one table.
- ALTER TABLE ... ALTER COLUMN col SET STATISTICS 100-1000; increases sampling for skewed columns.
Helpful monitoring queries (PostgreSQL)
-- Dead tuples and last maintenance
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- Current autovacuum/vacuum activity
SELECT * FROM pg_stat_progress_vacuum;
-- Check settings relevant to vacuum
SELECT name, setting FROM pg_settings
WHERE name IN ('autovacuum', 'autovacuum_vacuum_scale_factor', 'autovacuum_vacuum_threshold',
'autovacuum_analyze_scale_factor', 'autovacuum_analyze_threshold', 'default_statistics_target')
ORDER BY name;
-- Table bloat hint (approx; use with caution)
-- Strategy: Compare relpages (table) vs index pages, and track growth over time.
SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
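Dead-tuple counts alone do not show on-disk growth. A rough complement (a sketch using the same pg_stat_user_tables view plus the standard size functions) is to snapshot table and total relation sizes periodically and watch for growth that outpaces live rows:

```sql
-- Rough size tracking: table size vs total size (table + indexes + TOAST).
-- Snapshot this periodically; growth outpacing n_live_tup suggests bloat.
SELECT relname,
       pg_size_pretty(pg_relation_size(relid))       AS table_size,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```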
Worked examples
1) Remove dead tuples and refresh stats after a batch delete
- Inspect dead tuples:
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'sales';
- Run maintenance:
VACUUM ANALYZE sales;
- Confirm improvement:
SELECT relname, n_live_tup, n_dead_tup, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'sales';
Expected: n_dead_tup drops close to zero; last_analyze updates.
2) Improve plans on a skewed column with higher statistics
-- Column 'user_id' is heavily skewed (a few users generate most events)
ALTER TABLE web_events ALTER COLUMN user_id SET STATISTICS 500;
ANALYZE web_events(user_id);
-- Compare estimates vs actuals on a known filter
EXPLAIN ANALYZE
SELECT * FROM web_events WHERE user_id = 42;
Goal: The planner better estimates selectivity and may pick a more efficient index plan.
3) Watch autovacuum progress on a large table
-- In another session, a large update/delete runs on 'orders'
-- Here, you observe progress:
SELECT relid::regclass AS table_name, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
FROM pg_stat_progress_vacuum
WHERE relid::regclass::text = 'orders';
Use this to time manual VACUUMs and avoid clashes with peak hours.
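To make the progress view easier to read at a glance, the raw block counts can be turned into a percentage (a small sketch over the same view):

```sql
-- Percentage of heap blocks scanned so far; NULLIF guards against
-- division by zero before heap_blks_total is populated.
SELECT relid::regclass AS table_name,
       phase,
       round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum;
```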
Practice: try it yourself
Complete the exercises below. They match the Exercises section at the bottom of this page.
Common mistakes and how to self-check
- Confusing VACUUM with VACUUM FULL. Use FULL only when you plan downtime and need compaction via rewrite. Self-check: Does your task require an exclusive lock?
- Skipping ANALYZE after big changes. Fresh data without fresh stats leads to bad plans. Self-check: Did row counts change by millions? Run ANALYZE.
- Disabling autovacuum globally. This can cause table bloat and transaction ID wraparound. Self-check: Keep autovacuum on; tune per-table if needed.
- Over-tuning stats on every column. Higher targets increase ANALYZE time. Self-check: Only raise for columns used in WHERE/JOIN with skew.
- Vacuuming during peak traffic. Regular VACUUM runs concurrently, but it still consumes I/O and CPU. Self-check: Schedule during low-load windows.
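Before raising more statistics targets, it helps to audit which columns already have one set. The per-column setting lives in pg_attribute (web_events here is a placeholder table name):

```sql
-- Per-column statistics targets; -1 (NULL on newer PostgreSQL versions)
-- means the column falls back to default_statistics_target.
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'web_events'::regclass
  AND attnum > 0
  AND NOT attisdropped;
```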
Practical projects
- Maintenance playbook: Create a weekly script with queries that list top 20 tables by dead tuples, run targeted VACUUM ANALYZE, and log before/after metrics.
- Stats tuning pilot: Pick 3 slow queries. Identify one selective column each, raise its statistics target, ANALYZE, and capture EXPLAIN ANALYZE before/after.
- Autovacuum tuning: For a hot table, set autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold to trigger earlier, then validate with pg_stat_user_tables over a week.
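As a sketch of the per-table settings (orders is a placeholder; the numbers are illustrative and should be sized to your table's churn rate):

```sql
-- Trigger autovacuum after 1000 dead rows plus 1% of the table,
-- instead of the global default of 50 rows plus 20%.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);

-- Revert to the global defaults later if needed:
-- ALTER TABLE orders RESET (autovacuum_vacuum_scale_factor,
--                           autovacuum_vacuum_threshold);
```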
Learning path
- Refresh SQL basics: SELECT, joins, indexes, and EXPLAIN.
- Learn VACUUM, ANALYZE, and when to use FULL.
- Practice monitoring: pg_stat_user_tables, pg_stat_progress_vacuum, and relevant settings.
- Tune per-table autovacuum and column statistics for skewed data.
- Automate: build a simple weekly checklist or script.
Next steps
- Take the Quick Test to confirm understanding.
- Automate a weekly maintenance report for your team.
- Document table-specific tuning choices and revisit monthly.
Exercises
Exercise 1: Targeted VACUUM ANALYZE
Goal: Reduce dead tuples and refresh stats for one table you own.
- Find a candidate with many dead tuples.
- Run VACUUM ANALYZE.
- Confirm improvement.
-- 1) Identify a table (example: orders)
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- 2) Maintain it
VACUUM ANALYZE orders;
-- 3) Verify
SELECT relname, n_dead_tup, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
Exercise 2: Stats tuning for a skewed column
Goal: Improve plan accuracy for a frequently filtered column.
- Pick a table and column commonly used in WHERE/JOIN (example: events.country_code).
- Raise column statistics and ANALYZE just that column.
- Compare EXPLAIN ANALYZE before/after on a representative filter.
-- Before
EXPLAIN ANALYZE SELECT * FROM events WHERE country_code = 'US';
-- Tune
ALTER TABLE events ALTER COLUMN country_code SET STATISTICS 400;
ANALYZE events(country_code);
-- After
EXPLAIN ANALYZE SELECT * FROM events WHERE country_code = 'US';
Mini challenge
You have a 200M-row table with frequent deletes. Autovacuum keeps it usable but queries are slowly degrading. Propose a plan with timing that includes: monitoring dead tuples daily, targeted VACUUM ANALYZE windows, and a rare, scheduled VACUUM FULL with expected downtime. Write the plan as 3-5 steps.