How to learn Code Review Habits For Analytics for Version Control Basics in BI Analyst for free

Why this matters

In analytics, a small logic error can change numbers in dashboards, cause bad decisions, and break trust. Code reviews help you catch logic, performance, and data quality issues before they reach stakeholders. You will use reviews for SQL models (dbt or warehouse SQL), ELT scripts, notebooks, metrics definitions, and dashboard queries.

Reduce data defects: confirm joins, filters, time windows, and aggregations.
Improve performance: avoid unnecessary scans, fix joins and predicates.
Create shared understanding: document assumptions and edge cases.
Raise team quality: consistent patterns, tests, and naming conventions.

Concept explained simply

Think of a code review as a pre-flight checklist for your analysis. You are verifying that the code does what the author intended, with correct data and acceptable cost, and that future readers can maintain it.

Mental model: PREP

Purpose: Is the business question and metric definition clear?
Reproducibility: Can someone else run this and get the same result?
Efficiency: Is the query as simple and performant as it reasonably can be?
Protection: Are edge cases, nulls, and data contracts handled?

What to focus on first vs later

First pass: correctness (joins, filters, aggregations) and risk.
Second pass: performance and readability.
Final pass: docs, tests, naming, and commit hygiene.

Essential habits for analytics reviews

Read the PR description: scope, datasets touched, and validation steps.
Open the diff, scan file-by-file, start with the highest-impact models or queries.
Re-run validation queries from the PR locally or in a safe environment.
Check row counts before/after joins and filters; spot unexpected changes.
Verify metric definitions against the source of truth (e.g., definition doc).
Look for deterministic filtering and time windows (no accidental time drift).
Watch for null handling, deduping, and idempotency (safe reruns).
Prefer smaller PRs. If large, request splitting by logical parts.

Step-by-step review workflow

1) Understand intent

Read: What problem is solved? Which tables/models are modified? What should change in outputs?

2) Scan the diff

Locate risky areas: joins, where clauses, window functions, and group by changes.

3) Validate logic

Check join types and keys, aggregation correctness, and time filters. Confirm metric formulas.

4) Test with samples

Run sample queries: row counts, distinct keys, null proportions, and before/after diffs.

5) Assess performance

Look for unnecessary DISTINCT, functions on join keys, and non-sargable predicates.

6) Check data contracts

Columns added/removed, types changed, and downstream impact. Are migrations/backfills safe?

7) Documentation & tests

PR description, inline comments, dataset notes, and tests/queries to validate.

8) Approve with notes

Summarize what you checked; request changes if uncertainty remains.

Worked examples

Example 1 — Join logic and filters

-- Proposed change
SELECT o.order_id, c.email, SUM(o.amount) AS revenue
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.is_active = TRUE
  AND o.created_at BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1,2;

Issues to spot:

LEFT JOIN with a WHERE filter on the right table acts like an INNER JOIN. Move the predicate into the JOIN or switch to INNER JOIN intentionally.
Time window is fixed; is that expected or should it be parameterized (e.g., last full month)?
Aggregation: confirm revenue definition (gross, net, refunds?).

Suggested fix

SELECT o.order_id, c.email, SUM(o.amount) AS revenue
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id AND c.is_active = TRUE
WHERE o.created_at >= DATE '2024-01-01'
  AND o.created_at < DATE '2024-02-01'
GROUP BY 1,2;

Example 2 — DISTINCT vs GROUP BY

-- Proposed change
WITH jan AS (
  SELECT DISTINCT id, customer_id, amount
  FROM orders
  WHERE DATE_TRUNC('month', created_at) = DATE '2024-01-01'
)
SELECT id, customer_id, amount FROM jan;

Issues to spot:

DISTINCT can hide duplicates without explaining why they exist. Prefer explicit deduping with keys and rules (e.g., ROW_NUMBER() over a partition).
DATE_TRUNC comparison to a date literal is fragile; use a start/end boundary for determinism.

Suggested fix

WITH jan AS (
  SELECT * EXCLUDE rn FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM orders
    WHERE created_at >= DATE '2024-01-01' AND created_at < DATE '2024-02-01'
  ) t
  WHERE rn = 1
)
SELECT id, customer_id, amount FROM jan;

Example 3 — Metrics correctness

-- Proposed change: 'Active users' metric
SELECT COUNT(DISTINCT user_id) AS active_users
FROM events
WHERE event_name = 'session_start'
  AND event_time >= CURRENT_DATE - 7;

Questions to ask:

Does "active" require any event or specific events?
Should the window be a full week (Mon–Sun) or rolling 7 days?
Timezone alignment needed? Are bots filtered?

Better definition (example)

-- Example only; confirm with stakeholders
SELECT COUNT(DISTINCT user_id) AS active_users_7d
FROM events
WHERE event_time >= DATE_TRUNC('day', CURRENT_TIMESTAMP AT TIME ZONE 'UTC') - INTERVAL '7 days'
  AND is_bot = FALSE;

Review checklists

Pull Request (PR) checklist for analytics

Clear goal and scope stated
Datasets/models affected listed
Validation steps included (row counts, sample outputs)
Join keys and join types justified
Aggregations match metric definitions
Time windows deterministic and documented
Null handling and deduping explicit
Performance considerations noted (predicates, scans)
Downstream impact and migration/backfill plan
Rollback plan described
Tests or validation queries attached
Naming, style, and comments consistent

Self-check while reviewing

Can I explain what changes in outputs and why?
Did I run at least two sanity queries (row count delta, distinct keys)?
Did I check at least one edge case (nulls, missing foreign keys)?
Am I comfortable owning the result if merged?

Common mistakes and how to self-check

Focusing on style only: Always verify logic and data changes with simple queries.
Merging large PRs without isolating risk: Request splitting by model or step.
Relying on DISTINCT to "fix" duplicates: Investigate root cause and dedupe explicitly.
Filtering on right-table columns after LEFT JOIN: Move conditions to the JOIN or use INNER JOIN.
Using non-deterministic time filters: Prefer start/end boundaries over vague relative functions.
Skipping downstream checks: Scan dashboards or models that consume the changed outputs.

Exercises

These mirror the interactive exercises below. Do them here, then submit in the Exercises panel.

Exercise 1 — Review a SQL diff

Read the diff and write at least five review comments focused on correctness, performance, or clarity.

Show diff

-- BEFORE
SELECT o.id, c.email, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1,2;

-- AFTER
WITH orders_jan AS (
  SELECT DISTINCT o.id, o.customer_id, o.amount, o.created_at
  FROM orders o
  WHERE DATE_TRUNC('month', o.created_at) = DATE '2024-01-01'
)
SELECT o.id, c.email, o.amount::FLOAT AS revenue
FROM orders_jan o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.is_active = TRUE;

Write 5–8 specific review comments. Note assumptions or missing validation.

Example solution

See the solution in the Exercises panel for a model answer you can compare with.

Exercise 2 — Draft a team PR checklist

Create a concise checklist your team can paste into PR descriptions for analytics work.

Keep it under 12 items; prioritize correctness checks.
Add 2–3 validation queries your team should run.

Example solution

See the solution in the Exercises panel for a reference template.

Practical projects

Turn one existing large PR into two smaller PRs and write risk-focused descriptions.
Create a shared PR template with sections for goal, datasets touched, validation, and rollback.
Run a 20-minute pair review on a complex join; each person explains the other’s code.

Who this is for

BI Analysts and Analytics Engineers working with SQL, dbt, notebooks, and dashboards.
Anyone contributing to analytics repositories and pull requests.

Prerequisites

Basic Git (branch, commit, push, PR) and reading diffs.
Intermediate SQL (joins, group by, window functions).

Learning path

Start: Review basics and PREP mental model.
Practice: Do Exercises 1–2 with real code from a sandbox dataset.
Apply: Use the PR checklist on your next team change.
Confirm: Take the quick test to check understanding.

Next steps

Adopt the checklist in your repo’s PR template.
Add a small set of validation queries to each PR you open.
Schedule a weekly 15-minute review huddle for tricky changes.

Mini challenge

Pick a metric your team uses (e.g., Active Users) and write a one-paragraph definition, plus two SQL queries: one to compute it and one to validate it (sanity check). Share with a peer for review comments.

Try it — Quick Test

The quick test is available to everyone. Only logged-in users will have their progress saved.

Instructions

Read the BEFORE/AFTER code and write 5–8 review comments that address correctness, performance, or clarity. Use bullets. Prioritize high-risk items.

Diff to review

-- BEFORE
SELECT o.id, c.email, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1,2;

-- AFTER
WITH orders_jan AS (
  SELECT DISTINCT o.id, o.customer_id, o.amount, o.created_at
  FROM orders o
  WHERE DATE_TRUNC('month', o.created_at) = DATE '2024-01-01'
)
SELECT o.id, c.email, o.amount::FLOAT AS revenue
FROM orders_jan o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.is_active = TRUE;

Call out any logic changes (join type, filters, aggregations, time windows).
Suggest validation queries (row count deltas, null checks, key uniqueness).
Propose documentation lines for the PR description.

Menu

Code Review Habits For Analytics

Table of Contents