
ETL ELT Development

Learn ETL/ELT Development for Data Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 8, 2026 | Updated: January 8, 2026

Why ETL/ELT matters for Data Engineers

ETL/ELT Development turns raw data into reliable, analytics-ready datasets. As a Data Engineer, you design how data is ingested, cleaned, modeled, and delivered to stakeholders. You’ll implement batch or streaming pipelines, ensure data quality, handle late-arriving records, and make transformations efficient and testable.

  • ETL (Extract-Transform-Load): Transform before loading into target.
  • ELT (Extract-Load-Transform): Load raw data first, then transform inside the warehouse/lakehouse.

Modern teams often prefer ELT for scalability and simplicity, but both patterns remain useful.
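
As a quick illustration of the ELT order of operations, here is a minimal sketch assuming a generic warehouse SQL dialect; the raw and staging table names match the worked examples below, and the bulk-load command itself is engine-specific and not shown.

-- ELT step 1: land source data untouched in the raw layer
-- (use your engine's bulk-load command, e.g. a COPY statement; syntax varies)

-- ELT step 2: transform inside the warehouse, after the load
create or replace table staging.orders as
select cast(order_id as bigint) as order_id,
       cast(order_ts as timestamp) as order_ts,
       cast(total_amount as numeric(12,2)) as total_amount
from raw.orders;

-- ETL would instead clean and reshape the data in an external process
-- and load only the already-transformed result into the target.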

Who this is for

  • Aspiring or junior Data Engineers building their first production-grade pipelines.
  • Analysts transitioning to engineering roles and formalizing SQL transformations.
  • Engineers who want reproducible, testable data models for analytics and ML.

Prerequisites

  • Comfortable SQL: joins, window functions, CTEs, aggregation.
  • Basic Python (or another scripting language) for orchestration and utilities.
  • Familiarity with a data warehouse or lakehouse (e.g., columnar storage concepts, partitions).
  • Understanding of batch scheduling and job runs.

Learning path

  1. Set up layers: Organize raw → staging → mart zones with clear naming and ownership (see the schema sketch after this list).
  2. Model for analytics: Build facts and dimensions that answer real business questions.
  3. Make transforms reusable: Parametric SQL patterns, macros, or views for consistency.
  4. Handle change and lateness: Watermarks, deduplication, and upserts.
  5. Partition and cluster: Choose keys aligned to query patterns and data volume.
  6. Recovery-ready pipelines: Idempotent design, reprocessing windows, and backfills.
  7. Validate and monitor: Row counts, null checks, freshness tests, and simple data contracts.
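
A minimal sketch of step 1, assuming the warehouse supports schemas as namespaces (the schema and example table names are illustrative):

-- One schema per layer keeps ownership and naming obvious
create schema if not exists raw;      -- landed as-is from sources
create schema if not exists staging;  -- typed, cleaned, deduplicated
create schema if not exists mart;     -- business-facing facts and dimensions

-- Consistent naming: <layer>.<entity> or <layer>.<source>_<entity>
-- e.g. raw.shop_orders, staging.orders, mart.fact_orders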

Worked examples

Example 1 — Simple ELT from raw to mart

Goal: Load raw.orders into staging, then produce mart dimension and fact tables (staging.customers is assumed to exist already).

-- 1) Staging: normalize types and basic cleaning
create or replace table staging.orders as
select
  cast(order_id as bigint) as order_id,
  cast(customer_id as bigint) as customer_id,
  cast(order_ts as timestamp) as order_ts,
  cast(total_amount as numeric(12,2)) as total_amount,
  trim(status) as status,
  ingestion_ts
from raw.orders;

-- 2) Dimension: customers (one row per customer; any_value picks an arbitrary attribute value)
create or replace table mart.dim_customer as
select customer_id,
       any_value(customer_name) as customer_name,
       any_value(country) as country
from staging.customers
group by customer_id;

-- 3) Fact: orders (join with customer dim for convenience if needed)
create or replace table mart.fact_orders as
select o.order_id,
       o.customer_id,
       d.country,
       o.order_ts,
       o.total_amount,
       o.status
from staging.orders o
left join mart.dim_customer d using (customer_id);

Note: Keep dimensions and facts as separate layers; avoid mixing business logic for different domains in one transformation.

Example 2 — Idempotent upsert with deduplication

Goal: Merge incremental records from staging into a mart table. Remove duplicates using row_number and a hash for change detection.

-- De-duplicate incoming records by natural key (order_id) and most recent ingestion_ts
with incoming as (
  select *,
         row_number() over (partition by order_id order by ingestion_ts desc) as rn,
         md5(concat_ws('|', order_id::text, status, total_amount::text)) as rec_hash
  from staging.orders_incremental
), cleaned as (
  select * from incoming where rn = 1
)
merge into mart.fact_orders t
using cleaned s
on (t.order_id = s.order_id)
when matched and t.rec_hash <> s.rec_hash then
  update set status = s.status,
             total_amount = s.total_amount,
             order_ts = s.order_ts,
             rec_hash = s.rec_hash
when not matched then
  insert (order_id, customer_id, order_ts, total_amount, status, rec_hash)
  values (s.order_id, s.customer_id, s.order_ts, s.total_amount, s.status, s.rec_hash);

Tip: Store a rec_hash in the target to avoid unnecessary updates.
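
If the target was created without a change hash (as in Example 1), one way to add and backfill it, assuming the same Postgres-style hashing expression used above:

-- Add the change-detection column to the target
alter table mart.fact_orders add column rec_hash text;

-- Backfill it once so the first incremental merge compares like-for-like
update mart.fact_orders
set rec_hash = md5(concat_ws('|', order_id::text, status, total_amount::text));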

Example 3 — Handling late-arriving data with a watermark

Goal: Process new data but allow a grace period for late records.

-- Control table stores the last processed event_time
-- control.watermarks(event_name, last_event_time)

-- 1) Read watermark and define a safety window (e.g., 2 days)
with ctrl as (
  select coalesce(max(last_event_time), timestamp '1970-01-01') as wm
  from control.watermarks
  where event_name = 'orders'
), scoped as (
  select s.*
  from staging.orders s, ctrl
  where s.order_ts >= ctrl.wm - interval '2 days'
)
, dedup as (
  select *,
         row_number() over (partition by order_id order by ingestion_ts desc) as rn
  from scoped
)
select * from dedup where rn = 1;

-- Upsert dedup into mart as in Example 2, then update watermark to max(order_ts)

Use a small overlap window to reprocess potential late events safely.
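
Once the upsert has committed, advance the watermark; a sketch assuming control.watermarks keeps one row per event_name:

-- Advance the watermark only after the merge has succeeded
update control.watermarks
set last_event_time = (select max(order_ts) from staging.orders)
where event_name = 'orders';

-- Seed the row on the very first run
insert into control.watermarks (event_name, last_event_time)
select 'orders', max(order_ts)
from staging.orders
where not exists (select 1 from control.watermarks where event_name = 'orders');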

Example 4 — Partitioning and pruning for performance

Goal: Partition a large fact by event_date to reduce scan costs and speed queries.

-- Example partitioning by date (syntax varies by engine)
create or replace table mart.fact_orders_partitioned (
  order_id bigint,
  customer_id bigint,
  event_date date,
  order_ts timestamp,
  total_amount numeric(12,2),
  status text
)
-- partition by (event_date)
;

-- Write queries with partition filters
select sum(total_amount)
from mart.fact_orders_partitioned
where event_date between date '2025-01-01' and date '2025-01-31';

Always include partition filters in scheduled queries and transformations.

Example 5 — Business logic: order status derivation

Goal: Implement consistent rules for a derived status across systems.

create or replace table mart.fact_orders_enriched as
select f.*,
  case
    when f.status in ('cancelled','refunded') then 'NEGATIVE_OUTCOME'
    when f.total_amount >= 1000 then 'HIGH_VALUE'
    when f.total_amount between 100 and 999.99 then 'STANDARD_VALUE'
    else 'LOW_VALUE'
  end as order_segment
from mart.fact_orders f;

Document logic with comments and tests. Keep mapping rules centralized in a small reference table when they change frequently.
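
A hedged sketch of moving the amount thresholds into a reference table (the ref schema and table name are illustrative):

-- The reference table owns the thresholds, so rule changes need no code edits
create or replace table ref.order_segment_rules (
  segment text,
  min_amount numeric(12,2),
  max_amount numeric(12,2)
);

insert into ref.order_segment_rules values
  ('HIGH_VALUE',     1000, null),
  ('STANDARD_VALUE',  100, 999.99),
  ('LOW_VALUE',         0, 99.99);

-- Join against the rules instead of hard-coding the case expression;
-- status-based rules (cancelled/refunded) can stay in SQL or get their own table
select f.*, r.segment as order_segment
from mart.fact_orders f
left join ref.order_segment_rules r
  on f.total_amount >= r.min_amount
 and (r.max_amount is null or f.total_amount <= r.max_amount);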

Drills and exercises

  • Create staging tables for two raw sources, normalizing types and trimming text fields.
  • Write a SQL transformation that removes duplicates using row_number and keeps the latest by ingestion timestamp.
  • Implement a merge-based upsert for an incremental load.
  • Add a watermark table and re-run with a two-day overlap to capture late events.
  • Partition a large table by date and run a query with partition pruning.
  • Write tests: row count checks, not null on keys, and acceptable value ranges (see the sample tests after this list).
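
Simple tests can be plain queries that return zero rows when the data is healthy; a sketch against the tables from the worked examples:

-- Test 1: no null keys (should return zero rows)
select count(*) as null_keys
from mart.fact_orders
where order_id is null
having count(*) > 0;

-- Test 2: no duplicate keys (should return zero rows)
select order_id, count(*) as n
from mart.fact_orders
group by order_id
having count(*) > 1;

-- Test 3: values within acceptable ranges (should return zero rows)
select *
from mart.fact_orders
where total_amount < 0 or order_ts > current_timestamp;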

Common mistakes and debugging tips

  • Mixing raw and business logic in one step: Separate raw cleanup (staging) from business rules (mart).
  • Forgetting idempotency: Ensure re-runs don’t duplicate rows; use dedup and upserts.
  • No partition filter: Large scans cause slow queries; always filter by partitions.
  • Missing late data: Use watermarks with overlap windows to recapture late arrivals.
  • Unstable keys: Define deterministic primary or compound keys; avoid relying on non-unique fields.
  • Silent schema drift: Add checks for new/removed columns and default values (a sample drift check follows this list).
  • Unbounded updates: Limit update sets to changed rows via a change hash to reduce load.
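
One way to catch schema drift is to compare live columns against an expected list; a sketch using information_schema, which most warehouses expose in some form (the expected column list matches staging.orders from Example 1):

-- Flag columns added or removed upstream (should return zero rows)
with expected as (
  select column_name from (values
    ('order_id'), ('customer_id'), ('order_ts'),
    ('total_amount'), ('status'), ('ingestion_ts')
  ) as t(column_name)
), actual as (
  select column_name
  from information_schema.columns
  where table_schema = 'staging' and table_name = 'orders'
)
select coalesce(e.column_name, a.column_name) as column_name,
       case when e.column_name is null then 'unexpected new column'
            else 'expected column missing' end as issue
from expected e
full outer join actual a using (column_name)
where e.column_name is null or a.column_name is null;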

Mini project: Orders analytics mart

Build a small ELT flow for an e-commerce dataset with customers, orders, and order_items.

  • Layers: raw → staging → mart.
  • Transformations: clean types, deduplicate, enrich with derived fields (e.g., order segment).
  • Models: dim_customer, fact_orders, fact_order_items, and a daily_sales_summary.
  • Incremental logic: watermark on order_ts, merge-based upsert for fact tables.
  • Partitioning: partition facts by event_date (derived from order_ts).

Acceptance checks (a sample SQL version follows the list):

  • No duplicate order_id in fact_orders.
  • Total of order_items equals order totals within ±0.5% (tolerate rounding).
  • No nulls in primary keys; dates within expected ranges.
  • Daily summary matches the sum of facts when filtered by date.
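
The first two checks as queries, assuming fact_order_items carries an order_id and a line_amount column (names are illustrative):

-- Check 1: no duplicate order_id in fact_orders (should return zero rows)
select order_id, count(*) as n
from mart.fact_orders
group by order_id
having count(*) > 1;

-- Check 2: item totals reconcile with order totals within ±0.5%
select o.order_id, o.total_amount, i.items_total
from mart.fact_orders o
join (
  select order_id, sum(line_amount) as items_total  -- line_amount is assumed
  from mart.fact_order_items
  group by order_id
) i using (order_id)
where abs(o.total_amount - i.items_total) > 0.005 * abs(o.total_amount);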

Practical projects to solidify skills

  • Marketing attribution mart: join ad impressions, clicks, and conversions; handle late conversions and assign first/last-touch attribution.
  • Subscription revenue mart: implement MRR/ARR metrics, churn flags, and cohort tables with slowly changing dimensions.
  • Operational SLA dashboard: build a freshness table tracking when critical sources last updated and highlight delays.

Next steps

  • Add data quality tests and simple anomaly checks (e.g., sudden drops to zero).
  • Introduce parameterized transformations so the same logic can run for multiple regions or brands.
  • Automate backfills with a date range input and ensure outputs are idempotent (a backfill sketch follows this list).
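
A minimal backfill pattern in SQL: delete the affected date range and rebuild it, so re-running the same range produces the same result. The :start_date and :end_date parameters are placeholders for whatever your orchestrator injects.

-- Idempotent backfill for the closed date range [:start_date, :end_date]
delete from mart.fact_orders_partitioned
where event_date between :start_date and :end_date;

insert into mart.fact_orders_partitioned
select order_id,
       customer_id,
       cast(order_ts as date) as event_date,
       order_ts,
       total_amount,
       status
from staging.orders
where cast(order_ts as date) between :start_date and :end_date;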

Subskills

  • Transform Layers: Raw, Staging, Mart — structure the pipeline with clear responsibilities and naming.
  • Reusable SQL Transformations — standardize CTE patterns, views, and macros for repeatable logic.
  • Data Modeling For Analytics — design facts and dimensions for fast, reliable queries.
  • Handling Late Arriving Data — use watermarks and overlaps to capture late events safely.
  • Deduplication And Upserts — keep one truth via windowed dedup and merge operations.
  • Business Logic Implementation — centralize rules with tests and reference tables.
  • Partitioning Strategies Basics — choose keys that match query filters to reduce scans.
  • Reprocessing And Recovery Patterns — build idempotent, backfillable transformations.

ETL ELT Development — Skill Exam

This exam checks practical ETL/ELT knowledge: layers, modeling, dedup/upserts, late data, partitioning, and recovery. Everyone can take it for free. Only logged-in users have their progress and results saved. Rules: closed-book, no external tools required. Choose the best answers. Some questions are multiple-select. Aim for 70% to pass.

13 questions, 70% to pass
