Why this matters
As a Data Analyst, you often fix data quality issues, refresh reference tables, and prepare datasets for reporting. DML (Data Manipulation Language) lets you do that safely and efficiently:
- Insert: load new records (e.g., new customers from a landing table).
- Update: correct values (e.g., mark shipped orders, fix category mappings).
- Delete: remove bad data (e.g., test rows or duplicates).
- Merge: upsert in one step—insert new rows, update existing rows.
Real tasks you might do
- Append monthly transactions into a fact table.
- Update customer segments after a new scoring model.
- Delete duplicate signups created by a bug.
- Merge product price changes from a vendor feed.
Who this is for
- Data Analysts who already write SELECT queries and want to change data safely.
- Anyone tasked with preparing or correcting tables for dashboards, reports, or models.
Prerequisites
- Comfort with SELECT, WHERE, JOIN, and GROUP BY.
- Basic understanding of primary keys and constraints (unique, not null).
- Know your environment: do you have permission to write to tables? Is auto-commit on?
Concept explained simply
DML are verbs for rows:
- INSERT adds new rows.
- UPDATE changes existing rows.
- DELETE removes rows.
- MERGE decides row-by-row whether to UPDATE or INSERT (and sometimes DELETE).
Almost always, you combine them with a WHERE or a join condition so you modify exactly the rows you intend.
Mental model
Imagine a spreadsheet with a unique key column. INSERT adds new lines, UPDATE edits cells for matching keys, DELETE removes lines, MERGE reads a change list and applies each row: if the key exists, edit; if not, add.
Safety first: transactions and WHERE
- Preview first: run a SELECT that returns the same rows you intend to UPDATE or DELETE.
- Start a transaction when possible:
BEGIN; -- your UPDATE/DELETE -- SELECT to verify results ROLLBACK; -- or COMMIT when you're 100% sure - Always include a WHERE (or matching ON) when changing data. No WHERE = change all rows.
- If supported, use RETURNING to see what changed.
Dialect notes
- MERGE is supported in many systems. Some engines use alternatives like INSERT ... ON CONFLICT/ON DUPLICATE KEY UPDATE for upserts.
- UPDATE ... FROM syntax varies slightly by database.
- Always test on a small subset before running on full tables.
Core syntax cheat sheet
-- INSERT
INSERT INTO schema.table (col1, col2)
VALUES (val1, val2);
-- INSERT from SELECT
INSERT INTO target_table (col1, col2)
SELECT col1, col2 FROM source_table;
-- UPDATE (basic)
UPDATE target t
SET col = new_value
WHERE condition;
-- UPDATE with join (typical pattern)
UPDATE target t
SET col = s.new_col
FROM source s
WHERE t.key = s.key;
-- DELETE
DELETE FROM target
WHERE condition;
-- MERGE (ANSI-style)
MERGE INTO target t
USING source s
ON t.key = s.key
WHEN MATCHED THEN
UPDATE SET t.col = s.col
WHEN NOT MATCHED THEN
INSERT (key, col) VALUES (s.key, s.col);
Using RETURNING (when available)
UPDATE orders
SET status = 'shipped'
WHERE shipped_at IS NOT NULL AND status <> 'shipped'
RETURNING id, status;
Worked examples
Example 1 — Insert only new customers
Goal: append customers that are not already in the table by email.
-- Sample tables
CREATE TABLE IF NOT EXISTS customers (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE,
name TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS staging_new_customers (
email TEXT,
name TEXT
);
-- Reset sample data (optional)
DELETE FROM staging_new_customers; DELETE FROM customers;
INSERT INTO customers (email, name) VALUES
('alice@example.com','Alice'),
('bob@example.com','Bob');
INSERT INTO staging_new_customers (email, name) VALUES
('carol@example.com','Carol'),
('alice@example.com','Alice A.');
-- Insert only those not present
INSERT INTO customers (email, name)
SELECT s.email, s.name
FROM staging_new_customers s
LEFT JOIN customers c ON c.email = s.email
WHERE c.email IS NULL;
-- Preview results
SELECT email, name FROM customers ORDER BY email;
Result: carol@example.com is inserted; alice@example.com is skipped.
Example 2 — Update shipped orders
Goal: set status='shipped' for orders with a shipped_at timestamp.
CREATE TABLE IF NOT EXISTS orders (
id INT PRIMARY KEY,
customer_id INT,
status TEXT,
shipped_at TIMESTAMP
);
DELETE FROM orders;
INSERT INTO orders (id, customer_id, status, shipped_at) VALUES
(1, 1, 'pending', '2023-01-02'),
(2, 2, 'pending', '2023-01-03'),
(3, 1, 'pending', NULL);
UPDATE orders
SET status = 'shipped'
WHERE shipped_at IS NOT NULL
AND status <> 'shipped';
SELECT id, status FROM orders ORDER BY id;
Result: rows 1 and 2 become shipped; row 3 remains pending.
Example 3 — Delete duplicates, keep the smallest id
Goal: remove duplicate emails, keeping the earliest id.
WITH ranked AS (
SELECT id, email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM customers
)
DELETE FROM customers
USING ranked r
WHERE customers.id = r.id
AND r.rn > 1;
Tip: Preview with the CTE SELECT first to see which rows have rn > 1.
Example 4 — MERGE product prices (upsert)
Goal: update existing product prices and insert new SKUs.
CREATE TABLE IF NOT EXISTS products (
sku TEXT PRIMARY KEY,
name TEXT,
price NUMERIC(10,2)
);
CREATE TABLE IF NOT EXISTS staging_prices (
sku TEXT,
price NUMERIC(10,2)
);
DELETE FROM products; DELETE FROM staging_prices;
INSERT INTO products (sku, name, price) VALUES
('P1','Pen',10.00),
('P2','Notebook',15.00);
INSERT INTO staging_prices (sku, price) VALUES
('P2',17.00),
('P3',12.00);
MERGE INTO products p
USING staging_prices s
ON p.sku = s.sku
WHEN MATCHED THEN
UPDATE SET price = s.price
WHEN NOT MATCHED THEN
INSERT (sku, name, price) VALUES (s.sku, COALESCE(p.name,'New'), s.price);
SELECT sku, price FROM products ORDER BY sku;
Result: P2 price updated to 17.00; P3 inserted at 12.00.
Exercises
Use these small schemas; run the setup in each exercise once. Then attempt the task. Check off steps as you go.
- I ran the setup statements.
- I previewed affected rows with SELECT before changing data.
- I wrapped changes in a transaction (if supported) and verified results.
Exercise 1 — Insert new customers without duplicates (EX1)
Setup
CREATE TABLE IF NOT EXISTS customers (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE,
name TEXT
);
CREATE TABLE IF NOT EXISTS staging_new_customers (
email TEXT,
name TEXT
);
DELETE FROM customers; DELETE FROM staging_new_customers;
INSERT INTO customers (email, name) VALUES
('alice@example.com','Alice'),
('bob@example.com','Bob');
INSERT INTO staging_new_customers (email, name) VALUES
('carol@example.com','Carol'),
('alice@example.com','Alice New');
Task: Insert from staging_new_customers into customers but only for emails not already present. Return inserted rows if possible.
Hint
- Think LEFT JOIN ... WHERE target.email IS NULL or NOT EXISTS.
- RETURNING can show what was inserted.
Exercise 2 — Update order status to shipped (EX2)
Setup
CREATE TABLE IF NOT EXISTS orders (
id INT PRIMARY KEY,
customer_id INT,
status TEXT,
shipped_at TIMESTAMP
);
DELETE FROM orders;
INSERT INTO orders (id, customer_id, status, shipped_at) VALUES
(1, 1, 'pending', '2023-01-02'),
(2, 2, 'pending', '2023-01-03'),
(3, 1, 'pending', NULL);
Task: Update orders so that any row with shipped_at IS NOT NULL has status='shipped', but avoid re-updating already 'shipped' rows.
Hint
- WHERE shipped_at IS NOT NULL AND status <> 'shipped'
- Add RETURNING if supported to verify.
Exercise 3 — MERGE product prices (EX3)
Setup
CREATE TABLE IF NOT EXISTS products (
sku TEXT PRIMARY KEY,
name TEXT,
price NUMERIC(10,2)
);
CREATE TABLE IF NOT EXISTS staging_prices (
sku TEXT,
price NUMERIC(10,2)
);
DELETE FROM products; DELETE FROM staging_prices;
INSERT INTO products (sku, name, price) VALUES
('P1','Pen',10.00),
('P2','Notebook',15.00);
INSERT INTO staging_prices (sku, price) VALUES
('P2',17.00),
('P3',12.00);
Task: MERGE staging_prices into products to update price for matching SKUs and insert missing SKUs. Preview results with SELECT afterward.
Hint
- ON p.sku = s.sku; WHEN MATCHED UPDATE; WHEN NOT MATCHED INSERT
- Set a default name like 'New' when inserting.
Common mistakes and self-check
- No WHERE on UPDATE/DELETE: use SELECT preview to ensure the row count is what you expect.
- Wrong join in UPDATE/MERGE: verify the join keys are unique; run a SELECT count to detect duplicates.
- Overwriting with NULLs: in UPDATE, use COALESCE(s.col, t.col) if the source may have NULLs you don’t want to apply.
- Violating constraints: check for duplicate keys before INSERT or use upsert patterns.
- Auto-commit surprises: test with a transaction on a small subset.
Self-check mini-list
- I can preview affected rows with a SELECT that mimics my WHERE/join.
- I can explain why my UPDATE won’t multiply rows.
- I know how to undo changes (ROLLBACK) in my environment.
Practical projects
- Customer dedup pipeline: load signups into a clean customers table with INSERT-from-SELECT and a NOT EXISTS filter.
- Order state refresher: nightly UPDATE of order statuses based on event timestamps, with a change log table capturing RETURNING output.
- Reference price upsert: weekly MERGE vendor prices into products; alert when prices change over 20%.
Mini challenge
Create a table product_categories(sku, category) with a unique constraint on sku. Build a staging table with some new categories and some existing. Write a MERGE that:
- Updates category for existing SKUs only if the new category is not NULL.
- Inserts rows for new SKUs when category is provided.
Hint
Use WHEN MATCHED THEN UPDATE SET category = COALESCE(s.category, t.category) and WHEN NOT MATCHED THEN INSERT ... WHERE s.category IS NOT NULL (supported in some dialects) or filter via USING subquery.
Learning path
- Before: SELECT and JOIN fundamentals; constraints and keys.
- This lesson: DML—INSERT, UPDATE, DELETE, MERGE; safety patterns.
- Next: Transactions, isolation levels, and auditing changes; incremental ETL patterns.
Next steps
- Re-run the worked examples on your own sandbox.
- Complete the exercises below; then take the quick test.
- Tip: The test is available to everyone; only logged-in users get saved progress.