Why ETL/ELT matters for Data Engineers
ETL/ELT Development turns raw data into reliable, analytics-ready datasets. As a Data Engineer, you design how data is ingested, cleaned, modeled, and delivered to stakeholders. You’ll implement batch or streaming pipelines, ensure data quality, handle late-arriving records, and make transformations efficient and testable.
- ETL (Extract-Transform-Load): Transform data before loading it into the target.
- ELT (Extract-Load-Transform): Load raw data first, then transform inside the warehouse/lakehouse.
Modern teams often prefer ELT for scalability and simplicity, but both patterns remain useful.
Who this is for
- Aspiring or junior Data Engineers building their first production-grade pipelines.
- Analysts transitioning to engineering roles and formalizing SQL transformations.
- Engineers who want reproducible, testable data models for analytics and ML.
Prerequisites
- Comfortable SQL: joins, window functions, CTEs, aggregation.
- Basic Python (or another scripting language) for orchestration and utilities.
- Familiarity with a data warehouse or lakehouse (e.g., columnar storage concepts, partitions).
- Understanding of batch scheduling and job runs.
Learning path
- Set up layers: Organize raw → staging → mart zones with clear naming and ownership.
- Model for analytics: Build facts and dimensions that answer real business questions.
- Make transforms reusable: Parametric SQL patterns, macros, or views for consistency (see the sketch after this list).
- Handle change and lateness: Watermarks, deduplication, and upserts.
- Partition and cluster: Choose keys aligned to query patterns and data volume.
- Recovery-ready pipelines: Idempotent design, reprocessing windows, and backfills.
- Validate and monitor: Row counts, null checks, freshness tests, and simple data contracts.
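For the reuse step, the simplest portable pattern is to put shared cleaning logic behind a view (or a dbt-style macro) so every downstream model reads one definition. A minimal sketch; staging.v_orders_clean, mart.daily_order_counts, and mart.open_orders are illustrative names:
-- Shared cleaning logic defined once
create or replace view staging.v_orders_clean as
select cast(order_id as bigint) as order_id,
       cast(customer_id as bigint) as customer_id,
       cast(order_ts as timestamp) as order_ts,
       cast(total_amount as numeric(12,2)) as total_amount,
       lower(trim(status)) as status,
       ingestion_ts
from raw.orders;
-- Downstream models reuse the view instead of repeating the casts
create or replace table mart.daily_order_counts as
select cast(order_ts as date) as order_date, count(*) as orders
from staging.v_orders_clean
group by cast(order_ts as date);
create or replace table mart.open_orders as
select * from staging.v_orders_clean where status = 'open';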
Worked examples
Example 1 — Simple ELT from raw to mart
Goal: Load raw.orders to staging, then produce mart fact and dimension tables.
-- 1) Staging: normalize types and basic cleaning
create or replace table staging.orders as
select
cast(order_id as bigint) as order_id,
cast(customer_id as bigint) as customer_id,
cast(order_ts as timestamp) as order_ts,
cast(total_amount as numeric(12,2)) as total_amount,
trim(status) as status,
ingestion_ts
from raw.orders;
-- 2) Dimension: customers (any_value keeps one attribute value per customer;
--    use row_number over an ingestion timestamp if you need the most recent)
create or replace table mart.dim_customer as
select customer_id,
any_value(customer_name) as customer_name,
any_value(country) as country
from staging.customers
group by customer_id;
-- 3) Fact: orders (join with customer dim for convenience if needed)
create or replace table mart.fact_orders as
select o.order_id,
o.customer_id,
d.country,
o.order_ts,
o.total_amount,
o.status
from staging.orders o
left join mart.dim_customer d using (customer_id);
Note: Keep dimensions and facts as separate models; avoid mixing business logic from different domains in one transformation.
Example 2 — Idempotent upsert with deduplication
Goal: Merge incremental records from staging into a mart table. Remove duplicates using row_number and a hash for change detection.
-- De-duplicate incoming records by natural key (order_id) and most recent ingestion_ts
with incoming as (
select *,
row_number() over (partition by order_id order by ingestion_ts desc) as rn,
md5(concat_ws('|', order_id::text, status, total_amount::text)) as rec_hash
from staging.orders_incremental
), cleaned as (
select * from incoming where rn = 1
)
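-- Note: some engines do not allow a CTE before MERGE; if yours does not,
-- move the dedup logic into a subquery inside the USING clause.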
merge into mart.fact_orders t
using cleaned s
on (t.order_id = s.order_id)
when matched and t.rec_hash <> s.rec_hash then
update set status = s.status,
total_amount = s.total_amount,
order_ts = s.order_ts,
rec_hash = s.rec_hash
when not matched then
insert (order_id, customer_id, order_ts, total_amount, status, rec_hash)
values (s.order_id, s.customer_id, s.order_ts, s.total_amount, s.status, s.rec_hash);
Tip: Store rec_hash as a column in the target table so unchanged rows are skipped instead of being rewritten on every run.
Example 3 — Handling late-arriving data with a watermark
Goal: Process new data but allow a grace period for late records.
-- Control table stores the last processed event_time
-- control.watermarks(event_name, last_event_time)
-- 1) Read watermark and define a safety window (e.g., 2 days)
with ctrl as (
select coalesce(max(last_event_time), timestamp '1970-01-01') as wm
from control.watermarks
where event_name = 'orders'
), scoped as (
select s.*
from staging.orders s, ctrl
where s.order_ts >= wm - interval '2 days'
)
, dedup as (
select *,
row_number() over (partition by order_id order by ingestion_ts desc) as rn
from scoped
)
select * from dedup where rn = 1;
-- Upsert dedup into mart as in Example 2, then advance the watermark to max(order_ts) (see the sketch below)
Use a small overlap window to reprocess potential late events safely.
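After the merge commits, advance the stored watermark so the next run starts from the right point. A minimal sketch against the control.watermarks table above (assumes one row per event_name; seed it with an insert on the first run):
-- Advance the watermark only after the upsert has committed successfully
update control.watermarks
set last_event_time = (select max(order_ts) from staging.orders)
where event_name = 'orders';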
Example 4 — Partitioning and pruning for performance
Goal: Partition a large fact by event_date to reduce scan costs and speed queries.
-- Example partitioning by date (syntax varies by engine)
create or replace table mart.fact_orders_partitioned (
order_id bigint,
customer_id bigint,
event_date date,
order_ts timestamp,
total_amount numeric(12,2),
status text
)
-- partition by (event_date)
;
-- Write queries with partition filters
select sum(total_amount)
from mart.fact_orders_partitioned
where event_date between date '2025-01-01' and date '2025-01-31';
Always include partition filters in scheduled queries and transformations.
Example 5 — Business logic: order status derivation
Goal: Implement consistent rules for a derived status across systems.
create or replace table mart.fact_orders_enriched as
select f.*,
case
when f.status in ('cancelled','refunded') then 'NEGATIVE_OUTCOME'
when f.total_amount >= 1000 then 'HIGH_VALUE'
when f.total_amount between 100 and 999.99 then 'STANDARD_VALUE'
else 'LOW_VALUE'
end as order_segment
from mart.fact_orders f;
Document logic with comments and tests. Keep mapping rules centralized in a small reference table when they change frequently.
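If the thresholds change often, the mapping can live in a reference table instead of a hard-coded case expression. A sketch assuming a hypothetical ref.order_segment_rules(segment, min_amount, max_amount) table with non-overlapping ranges:
-- Each order matches at most one rule because the amount ranges do not overlap
create or replace table mart.fact_orders_enriched as
select f.*,
       case
         when f.status in ('cancelled','refunded') then 'NEGATIVE_OUTCOME'
         else coalesce(r.segment, 'LOW_VALUE')
       end as order_segment
from mart.fact_orders f
left join ref.order_segment_rules r
  on f.total_amount >= r.min_amount
 and f.total_amount < r.max_amount;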
Drills and exercises
- Create staging tables for two raw sources, normalizing types and trimming text fields.
- Write a SQL transformation that removes duplicates using row_number and keeps the latest by ingestion timestamp.
- Implement a merge-based upsert for an incremental load.
- Add a watermark table and re-run with a two-day overlap to capture late events.
- Partition a large table by date and run a query with partition pruning.
- Write tests: row count checks, not null on keys, and acceptable value ranges.
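For the last drill, tests can start as plain queries that return zero rows when the data is healthy (thresholds here are illustrative):
-- Uniqueness: duplicate order_id values (should return zero rows)
select order_id, count(*) as n
from mart.fact_orders
group by order_id
having count(*) > 1;
-- Not null on keys (should return zero rows)
select * from mart.fact_orders where order_id is null or customer_id is null;
-- Freshness: returns a row only if the newest order is older than a day
select max(order_ts) as latest_order
from mart.fact_orders
having max(order_ts) < current_timestamp - interval '1 day';
-- Acceptable value range (should return zero rows)
select * from mart.fact_orders where total_amount < 0;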
Common mistakes and debugging tips
- Mixing raw and business logic in one step: Separate raw cleanup (staging) from business rules (mart).
- Forgetting idempotency: Ensure re-runs don’t duplicate rows; use dedup and upserts.
- No partition filter: Large scans cause slow queries; always filter by partitions.
- Missing late data: Use watermarks with overlap windows to recapture late arrivals.
- Unstable keys: Define deterministic primary or compound keys; avoid relying on non-unique fields.
- Silent schema drift: Add checks for new/removed columns and default values (see the sketch after this list).
- Unbounded updates: Limit update sets to changed rows via a change hash to reduce load.
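For the schema-drift check above, one lightweight approach is to compare the live columns of a source against an expected list via information_schema (available in most engines; the expected list is hard-coded here for illustration):
-- Report columns that appeared in or disappeared from raw.orders
with expected as (
    select column_name
    from (values ('order_id'), ('customer_id'), ('order_ts'),
                 ('total_amount'), ('status'), ('ingestion_ts')) as e(column_name)
), actual as (
    select column_name
    from information_schema.columns
    where table_schema = 'raw' and table_name = 'orders'
)
select 'new_column' as issue, column_name from actual
where column_name not in (select column_name from expected)
union all
select 'missing_column' as issue, column_name from expected
where column_name not in (select column_name from actual);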
Mini project: Orders analytics mart
Build a small ELT flow for an e-commerce dataset with customers, orders, and order_items.
- Layers: raw → staging → mart.
- Transformations: clean types, deduplicate, enrich with derived fields (e.g., order segment).
- Models: dim_customer, fact_orders, fact_order_items, and a daily_sales_summary.
- Incremental logic: watermark on order_ts, merge-based upsert for fact tables.
- Partitioning: partition facts by event_date (derived from order_ts).
Acceptance checks:
- No duplicate order_id in fact_orders.
- Total of order_items equals order totals within ±0.5% (tolerate rounding; see the check after this list).
- No nulls in primary keys; dates within expected ranges.
- Daily summary matches the sum of facts when filtered by date.
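The item-total check can be expressed directly in SQL. A sketch assuming fact_order_items carries order_id, quantity, and unit_price:
-- Orders whose item-level total drifts more than 0.5% from the order total
with item_totals as (
    select order_id, sum(quantity * unit_price) as items_total
    from mart.fact_order_items
    group by order_id
)
select o.order_id, o.total_amount, i.items_total
from mart.fact_orders o
join item_totals i using (order_id)
where abs(o.total_amount - i.items_total) > 0.005 * abs(o.total_amount);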
Practical projects to solidify skills
- Marketing attribution mart: join ad impressions, clicks, and conversions; handle late conversions and assign first/last-touch attribution.
- Subscription revenue mart: implement MRR/ARR metrics, churn flags, and cohort tables with slowly changing dimensions.
- Operational SLA dashboard: build a freshness table tracking when critical sources last updated and highlight delays.
Next steps
- Add data quality tests and simple anomaly checks (e.g., sudden drops to zero).
- Introduce parameterized transformations so the same logic can run for multiple regions or brands.
- Automate backfills with a date range input and ensure outputs are idempotent.
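A date-range backfill stays idempotent if each run rebuilds exactly the partitions in scope. A sketch against mart.fact_orders_partitioned from Example 4; :start_date and :end_date stand for parameters supplied by the orchestrator:
-- 1) Clear only the partitions being rebuilt
delete from mart.fact_orders_partitioned
where event_date between :start_date and :end_date;
-- 2) Rebuild them from staging; re-running yields the same result
insert into mart.fact_orders_partitioned
select order_id,
       customer_id,
       cast(order_ts as date) as event_date,
       order_ts,
       total_amount,
       status
from staging.orders
where cast(order_ts as date) between :start_date and :end_date;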
Subskills
- Transform Layers: Raw, Staging, Mart — structure the pipeline with clear responsibilities and naming.
- Reusable SQL Transformations — standardize CTE patterns, views, and macros for repeatable logic.
- Data Modeling For Analytics — design facts and dimensions for fast, reliable queries.
- Handling Late Arriving Data — use watermarks and overlaps to capture late events safely.
- Deduplication And Upserts — keep one truth via windowed dedup and merge operations.
- Business Logic Implementation — centralize rules with tests and reference tables.
- Partitioning Strategies Basics — choose keys that match query filters to reduce scans.
- Reprocessing And Recovery Patterns — build idempotent, backfillable transformations.