Why this matters
As a Machine Learning Engineer, you will find that your models are only as good as the data they receive. ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) determine how raw data becomes reliable features for training and serving. You will design batch and streaming dataflows, avoid target leakage, backfill history, and make pipelines idempotent and observable. These are daily, real-world responsibilities.
- Real task: Build nightly features (e.g., 7-day spend) for a churn model.
- Real task: Stream fraud signals to an online feature store with millisecond latency.
- Real task: Recompute features for the last 90 days after a schema change, without creating duplicates.
Concept explained simply
ETL means you clean/transform data before loading it into the warehouse or lake. ELT means you load raw data first, then transform inside the warehouse/lake with SQL or distributed compute. Both aim to produce trustworthy, time-consistent features.
- ETL: good when you must standardize data on the way in, or keep the warehouse lean.
- ELT: good when your warehouse/lake is powerful and you want flexible, versioned transformations in SQL.
Mental model
Think of data as ingredients:
- ETL: you prep (wash, cut) ingredients in the kitchen, then store ready-to-cook boxes.
- ELT: you store raw ingredients, and do prep right before cooking with powerful tools.
In ML, the final dish is a feature table aligned to a specific time (as_of_date). Everything you do must respect time to avoid seeing the future.
Deep dive: When ETL vs ELT?
- Choose ETL when upstream systems are messy and you must enforce strict contracts before storage, or when data egress costs make pre-aggregation cheaper.
- Choose ELT when you have a strong warehouse/lakehouse (SQL, scalable compute), want versioned transformations, and easier backfills.
Key concepts and decisions
- Batch vs Streaming: Batch for daily features; Streaming for near-real-time signals.
- File formats: Prefer columnar (Parquet) for analytics; use Avro/JSON for logs or CDC.
- Partitioning: Commonly by event_date; for features, also partition by as_of_date for reproducible training sets.
- Idempotency: Reruns produce the same result. Use partition overwrite, stable keys, and deterministic transforms.
- Backfills: Recompute historical partitions when logic changes. Use versions (feature_version, transformation_version).
- CDC (Change Data Capture): Capture inserts/updates/deletes without full reloads to keep features fresh (see the merge sketch after this list).
- Schema evolution: Add columns compatibly; validate schemas; fail fast on breaking changes.
- Time correctness: For training, only use events with timestamps <= the as_of_date; align training and serving logic to prevent skew.
- Monitoring: Track freshness, null rates, distribution shifts, and late data percentages.
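To make CDC and idempotent upserts concrete, here is a minimal merge sketch. The change feed raw_users_changes (with an op column of 'I'/'U'/'D' and a change_ts for ordering), the target refined.users, and its columns are all illustrative, and the sketch assumes an engine that supports MERGE (Snowflake, BigQuery, Databricks, or PostgreSQL 15+). Replaying the same batch of changes leaves the target unchanged.
Sketch (SQL)
MERGE INTO refined.users AS t
USING (
  -- Keep only the latest change per user in this batch (deduplicates replays).
  SELECT * FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY change_ts DESC) AS rn
    FROM raw_users_changes AS c
  ) ranked
  WHERE rn = 1
) AS s
ON t.user_id = s.user_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET email = s.email, plan = s.plan, updated_at = s.change_ts
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (user_id, email, plan, updated_at)
  VALUES (s.user_id, s.email, s.plan, s.change_ts);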
Worked examples
Example 1: Nightly ETL for a 7-day average purchase feature
- Extract: Pull yesterday's increment of transactions from the OLTP replica into the raw layer; the 7-day window reads from this accumulated history.
- Transform: Filter successful payments; compute rolling 7-day average per user up to as_of_date.
- Load: Write to a feature table partitioned by as_of_date in Parquet; overwrite that partition (an idempotent-load sketch follows the SQL below).
Sketch (SQL-like)
WITH filtered AS (
SELECT user_id, amount, event_ts::date AS event_date
FROM raw_transactions
WHERE status = 'success' AND event_ts::date <= :as_of_date
),
seven_day AS (
SELECT user_id,
:as_of_date AS as_of_date,
AVG(amount) FILTER (WHERE event_date > :as_of_date - INTERVAL '7 days') AS avg_amount_7d
FROM filtered
GROUP BY user_id
)
SELECT * FROM seven_day;
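The Load step can be made idempotent by replacing exactly one partition per run. A minimal sketch, assuming the result of the query above has been staged in a hypothetical table staging.user_spend_daily and the target is a hypothetical features.user_spend; engines with partition overwrite (e.g., Spark SQL's INSERT OVERWRITE ... PARTITION) collapse this into a single statement.
Sketch (SQL)
-- Replace only the :as_of_date partition; reruns leave the row count unchanged.
BEGIN;
DELETE FROM features.user_spend
WHERE as_of_date = :as_of_date;
INSERT INTO features.user_spend (user_id, as_of_date, avg_amount_7d)
SELECT user_id, as_of_date, avg_amount_7d
FROM staging.user_spend_daily
WHERE as_of_date = :as_of_date;
COMMIT;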
Example 2: ELT in a warehouse (raw first, transform later)
- Load: Land raw clickstream JSON into a raw schema.
- Transform: Create SQL models to parse JSON, sessionize by user (see the sessionization sketch below), and compute 24h active flags.
- Version: Tag models with transformation_version and rebuild affected downstream features; backfill partitions.
Sketch (SQL)
CREATE OR REPLACE TABLE features.user_active_24h AS
SELECT user_id,
:as_of_date AS as_of_date,
CASE WHEN MAX(event_time) > :as_of_date - INTERVAL '24 hours' THEN TRUE ELSE FALSE END AS active_24h
FROM refined.clicks
WHERE event_time <= :as_of_date
GROUP BY user_id;
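The sessionization step mentioned above is usually a window-function model of its own. A minimal sketch, assuming refined.clicks carries user_id and event_time, and using a 30-minute inactivity gap as an illustrative session boundary.
Sketch (SQL)
WITH ordered AS (
  SELECT user_id,
         event_time,
         LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
  FROM refined.clicks
  WHERE event_time <= :as_of_date
),
flagged AS (
  -- A new session starts on the first event or after a 30-minute gap.
  SELECT user_id,
         event_time,
         CASE WHEN prev_time IS NULL
                OR event_time - prev_time > INTERVAL '30 minutes'
              THEN 1 ELSE 0 END AS new_session
  FROM ordered
)
SELECT user_id,
       event_time,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged;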
Example 3: Streaming ELT to online + offline feature stores
- Ingest: Stream events from a queue; parse and validate.
- Transform: Maintain per-user rolling counts in a stateful stream job.
- Serve: Write the latest value to an online store (low latency) and append events to offline storage for training backfills.
Operational notes
- Use exactly-once semantics or idempotent upserts keyed by (user_id, feature_name); a sketch follows.
- Checkpoint progress; alert on lag or high late-event ratio.
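If the online store is SQL-addressable (for example, a PostgreSQL-backed key-value table; the table and columns below are illustrative, and a unique constraint on (user_id, feature_name) is assumed), an idempotent upsert can look like this. The updated_at guard also drops out-of-order or replayed events.
Sketch (SQL)
INSERT INTO online_store.user_features AS f (user_id, feature_name, feature_value, updated_at)
VALUES (:user_id, 'high_risk_txn_count_1h', :feature_value, :event_ts)
ON CONFLICT (user_id, feature_name) DO UPDATE
SET feature_value = EXCLUDED.feature_value,
    updated_at    = EXCLUDED.updated_at
WHERE f.updated_at < EXCLUDED.updated_at;  -- ignore late or replayed events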
Hands-on exercises
These exercises mirror the practical tasks covered in this lesson.
Exercise 1 (matches: Build a simple ETL feature aggregation)
Compute a 7-day average transaction amount per user as of 2025-01-10. Only count status = "success".
transactions:
user_id,event_ts,amount,status
u1,2025-01-05T10:00:00,10,success
u1,2025-01-07T09:00:00,30,success
u1,2025-01-09T15:00:00,20,failed
u2,2025-01-03T12:00:00,5,success
u2,2025-01-08T08:00:00,15,success
- Window: 7 days inclusive up to as_of_date (2025-01-10), i.e., 2025-01-04 to 2025-01-10.
- Output columns: user_id, as_of_date, avg_txn_amount_7d
Exercise 2 (matches: Fix leakage with as-of filtering)
Count each user's support tickets in the 30 days before 2025-03-01. Ensure no events on or after 2025-03-01 are counted.
support_tickets:
user_id,event_ts
u1,2025-02-10T11:00:00
u1,2025-03-05T08:00:00 (should NOT be counted)
u2,2025-02-20T16:00:00
- Window: 30 days back from 2025-03-01 (i.e., 2025-01-30 inclusive) to 2025-03-01 exclusive.
- Output columns: user_id, as_of_date, ticket_count_30d
Pipeline readiness checklist
- Partitioned outputs by as_of_date.
- Deterministic transformations and idempotent writes (overwrite or upsert by keys).
- Schema validation with clear error on breaking changes.
- Data quality: null-rate and range checks on key features (a check sketch follows this checklist).
- Time filters applied correctly (no future data for training).
- Logging: row counts in/out, durations, late event rate.
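As one concrete quality gate, a single query can check volume, null rate, and value ranges for the partition just written; the orchestrator fails the run if any check is out of bounds. The table name (reusing the hypothetical features.user_spend from the earlier load sketch) and the thresholds are illustrative.
Sketch (SQL)
SELECT COUNT(*)                                               AS row_count,
       AVG(CASE WHEN avg_amount_7d IS NULL THEN 1 ELSE 0 END) AS null_rate_avg_amount_7d,
       MIN(avg_amount_7d)                                      AS min_avg_amount_7d,
       MAX(avg_amount_7d)                                      AS max_avg_amount_7d
FROM features.user_spend
WHERE as_of_date = :as_of_date;
-- Fail the load if row_count = 0, the null rate exceeds an agreed threshold
-- (e.g., 0.05), or min/max fall outside expected ranges.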
Common mistakes and self-check
- Target leakage: Using events after the label/as_of_date. Self-check: sample a few users and manually verify the latest event time used.
- Non-idempotent writes: Appending duplicates on rerun. Self-check: rerun the same partition; the row count must not increase (a duplicate-key check appears after this list).
- Schema drift ignored: New columns or types break downstream. Self-check: enforce schemas and contract tests.
- Partitioning mismatch: Training joins fail due to different time grains. Self-check: join by user_id and as_of_date consistently.
- Training-serving skew: Different logic online vs offline. Self-check: keep shared transformations or validated parity tests.
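Two of these self-checks can be scripted directly against the hypothetical feature and raw tables used above; :sample_user_id is a placeholder parameter.
Sketch (SQL)
-- Idempotency check: after rerunning a partition, duplicate keys must not exist (expect zero rows).
SELECT user_id, as_of_date, COUNT(*) AS n
FROM features.user_spend
GROUP BY user_id, as_of_date
HAVING COUNT(*) > 1;

-- Leakage spot check: list a sampled user's most recent events and manually confirm
-- the feature only used those with event_ts::date <= :as_of_date.
SELECT user_id, event_ts, amount, status
FROM raw_transactions
WHERE user_id = :sample_user_id
ORDER BY event_ts DESC
LIMIT 20;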
Practical projects
- Project 1: Batch feature store seed
- Build nightly features: avg_amount_7d, purchases_30d, active_24h.
- Partition by as_of_date; add quality checks and a backfill script for 90 days (a set-based backfill sketch follows the project list).
- Deliver: reproducible SQL or PySpark jobs and a data dictionary.
- Project 2: Streaming fraud signals
- Ingest transactions stream; maintain rolling count of high-risk events per user.
- Write to an online store (upsert) and offline lake (append) with exactly-once or idempotent semantics.
- Deliver: latency dashboard and drift alert thresholds.
- Project 3: ELT warehouse transformations
- Load raw JSON to a warehouse, build layered views: raw → refined → features.
- Add transformation_version and run a controlled backfill.
- Deliver: change log and performance benchmarks.
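For the 90-day backfill in Project 1 (and the controlled backfill in Project 3), one set-based approach is to join a date spine to the raw events, recompute every as_of_date in a single pass, and then overwrite the affected partitions with the same idempotent write used for nightly runs. A minimal sketch: generate_series is PostgreSQL-specific (other engines would use a calendar or date-spine table), and :backfill_start/:backfill_end are parameters.
Sketch (SQL)
WITH dates AS (
  SELECT d::date AS as_of_date
  FROM generate_series(:backfill_start::date, :backfill_end::date, INTERVAL '1 day') AS g(d)
),
txns AS (
  SELECT user_id, amount, event_ts::date AS event_date
  FROM raw_transactions
  WHERE status = 'success'
),
backfilled AS (
  SELECT t.user_id,
         d.as_of_date,
         AVG(t.amount) AS avg_amount_7d
  FROM dates AS d
  JOIN txns AS t
    ON t.event_date <= d.as_of_date
   AND t.event_date > d.as_of_date - INTERVAL '7 days'
  GROUP BY t.user_id, d.as_of_date
)
SELECT * FROM backfilled;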
Who this is for
- ML Engineers and Data Scientists who need reliable, time-correct features.
- Data Engineers standardizing pipelines for ML training and serving.
- Analysts moving from ad-hoc SQL to production-grade feature builds.
Prerequisites
- Comfortable with SQL and at least one data processing tool (e.g., Spark or a warehouse engine).
- Basic understanding of model training and evaluation.
- Familiarity with batch scheduling and data storage formats (CSV, Parquet).
Learning path
- Master time correctness (as_of_date), idempotency, and partitioning.
- Practice ETL in a compute engine and ELT in a warehouse.
- Add monitoring: freshness, nulls, distribution checks.
- Implement backfills and CDC handling.
- Unify training and serving logic to avoid skew.
Next steps
- Complete the exercises below and validate against expected outputs.
- Take the quick test to confirm understanding.
- Start a small project from the list and iterate with monitoring and backfills.
Mini challenge
Design a minimal DAG for nightly features: extract → transform → validate → load → notify. Specify partition keys, idempotent write strategy, and how you will backfill the last 60 days safely.
Hint
- One partition per as_of_date; overwrite or upsert by (user_id, as_of_date).
- Validation gates the load step; backfills loop over dates with the same logic.