Why ETL/ELT matters for Data Engineers
ETL/ELT Development turns raw data into reliable, analytics-ready datasets. As a Data Engineer, you design how data is ingested, cleaned, modeled, and delivered to stakeholders. You’ll implement batch or streaming pipelines, ensure data quality, handle late-arriving records, and make transformations efficient and testable.
- ETL (Extract-Transform-Load): Transform data before loading it into the target.
- ELT (Extract-Load-Transform): Load raw data first, then transform inside the warehouse/lakehouse.
Modern teams often prefer ELT for scalability and simplicity, but both patterns remain useful.
Who this is for
- Aspiring or junior Data Engineers building their first production-grade pipelines.
- Analysts transitioning to engineering roles and formalizing SQL transformations.
- Engineers who want reproducible, testable data models for analytics and ML.
Prerequisites
- Comfortable SQL: joins, window functions, CTEs, aggregation.
- Basic Python (or another scripting language) for orchestration and utilities.
- Familiarity with a data warehouse or lakehouse (e.g., columnar storage concepts, partitions).
- Understanding of batch scheduling and job runs.
Learning path
- Set up layers: Organize raw → staging → mart zones with clear naming and ownership.
- Model for analytics: Build facts and dimensions that answer real business questions.
- Make transforms reusable: Parametric SQL patterns, macros, or views for consistency (see the sketch after this list).
- Handle change and lateness: Watermarks, deduplication, and upserts.
- Partition and cluster: Choose keys aligned to query patterns and data volume.
- Recovery-ready pipelines: Idempotent design, reprocessing windows, and backfills.
- Validate and monitor: Row counts, null checks, freshness tests, and simple data contracts.
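For the reuse step, the simplest portable pattern is to put shared cleaning logic behind a view (or a dbt-style macro) so every downstream model reads one definition. A minimal sketch; staging.v_orders_clean, mart.daily_order_counts, and mart.open_orders are illustrative names:
-- Shared cleaning logic defined once
create or replace view staging.v_orders_clean as
select cast(order_id as bigint) as order_id,
       cast(customer_id as bigint) as customer_id,
       cast(order_ts as timestamp) as order_ts,
       cast(total_amount as numeric(12,2)) as total_amount,
       lower(trim(status)) as status,
       ingestion_ts
from raw.orders;
-- Downstream models reuse the view instead of repeating the casts
create or replace table mart.daily_order_counts as
select cast(order_ts as date) as order_date, count(*) as orders
from staging.v_orders_clean
group by cast(order_ts as date);
create or replace table mart.open_orders as
select * from staging.v_orders_clean where status = 'open';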
Worked examples
Example 1 — Simple ELT from raw to mart
Goal: Load raw.orders to staging, then produce mart fact and dimension tables.
-- 1) Staging: normalize types and basic cleaning
create or replace table staging.orders as
select
cast(order_id as bigint) as order_id,
cast(customer_id as bigint) as customer_id,
cast(order_ts as timestamp) as order_ts,
cast(total_amount as numeric(12,2)) as total_amount,
trim(status) as status,
ingestion_ts
from raw.orders;
-- 2) Dimension: customers (any_value keeps one attribute value per customer;
--    use row_number over an ingestion timestamp if you need the most recent)
create or replace table mart.dim_customer as
select customer_id,
any_value(customer_name) as customer_name,
any_value(country) as country
from staging.customers
group by customer_id;
-- 3) Fact: orders (join with customer dim for convenience if needed)
create or replace table mart.fact_orders as
select o.order_id,
o.customer_id,
d.country,
o.order_ts,
o.total_amount,
o.status
from staging.orders o
left join mart.dim_customer d using (customer_id);
Note: Keep dimensions and facts as separate models; avoid mixing business logic from different domains in one transformation.
Example 2 — Idempotent upsert with deduplication
Goal: Merge incremental records from staging into a mart table. Remove duplicates using row_number and a hash for change detection.
-- De-duplicate incoming records by natural key (order_id) and most recent ingestion_ts
with incoming as (
select *,
row_number() over (partition by order_id order by ingestion_ts desc) as rn,
md5(concat_ws('|', order_id::text, status, total_amount::text)) as rec_hash
from staging.orders_incremental
), cleaned as (
select * from incoming where rn = 1
)
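-- Note: some engines do not allow a CTE before MERGE; if yours does not,
-- move the dedup logic into a subquery inside the USING clause.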
merge into mart.fact_orders t
using cleaned s
on (t.order_id = s.order_id)
when matched and t.rec_hash <> s.rec_hash then
update set status = s.status,
total_amount = s.total_amount,
order_ts = s.order_ts,
rec_hash = s.rec_hash
when not matched then
insert (order_id, customer_id, order_ts, total_amount, status, rec_hash)
values (s.order_id, s.customer_id, s.order_ts, s.total_amount, s.status, s.rec_hash);
Tip: Store rec_hash as a column in the target table so unchanged rows are skipped instead of being rewritten on every run.
Example 3 — Handling late-arriving data with a watermark
Goal: Process new data but allow a grace period for late records.
-- Control table stores the last processed event_time
-- control.watermarks(event_name, last_event_time)
-- 1) Read watermark and define a safety window (e.g., 2 days)
with ctrl as (
select coalesce(max(last_event_time), timestamp '1970-01-01') as wm
from control.watermarks
where event_name = 'orders'
), scoped as (
select s.*
from staging.orders s, ctrl
where s.order_ts >= wm - interval '2 days'
)
, dedup as (
select *,
row_number() over (partition by order_id order by ingestion_ts desc) as rn
from scoped
)
select * from dedup where rn = 1;
-- Upsert dedup into mart as in Example 2, then advance the watermark to max(order_ts) (see the sketch below)
Use a small overlap window to reprocess potential late events safely.
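After the merge commits, advance the stored watermark so the next run starts from the right point. A minimal sketch against the control.watermarks table above (assumes one row per event_name; seed it with an insert on the first run):
-- Advance the watermark only after the upsert has committed successfully
update control.watermarks
set last_event_time = (select max(order_ts) from staging.orders)
where event_name = 'orders';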
Example 4 — Partitioning and pruning for performance
Goal: Partition a large fact by event_date to reduce scan costs and speed queries.
-- Example partitioning by date (syntax varies by engine)
create or replace table mart.fact_orders_partitioned (
order_id bigint,
customer_id bigint,
event_date date,
order_ts timestamp,
total_amount numeric(12,2),
status text
)
-- partition by (event_date)
;
-- Write queries with partition filters
select sum(total_amount)
from mart.fact_orders_partitioned
where event_date between date '2025-01-01' and date '2025-01-31';
Always include partition filters in scheduled queries and transformations.
Example 5 — Business logic: order status derivation
Goal: Implement consistent rules for a derived status across systems.
create or replace table mart.fact_orders_enriched as
select f.*,
case
when f.status in ('cancelled','refunded') then 'NEGATIVE_OUTCOME'
when f.total_amount >= 1000 then 'HIGH_VALUE'
when f.total_amount between 100 and 999.99 then 'STANDARD_VALUE'
else 'LOW_VALUE'
end as order_segment
from mart.fact_orders f;
Document logic with comments and tests. Keep mapping rules centralized in a small reference table when they change frequently.
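If the thresholds change often, the mapping can live in a reference table instead of a hard-coded case expression. A sketch assuming a hypothetical ref.order_segment_rules(segment, min_amount, max_amount) table with non-overlapping ranges:
-- Each order matches at most one rule because the amount ranges do not overlap
create or replace table mart.fact_orders_enriched as
select f.*,
       case
         when f.status in ('cancelled','refunded') then 'NEGATIVE_OUTCOME'
         else coalesce(r.segment, 'LOW_VALUE')
       end as order_segment
from mart.fact_orders f
left join ref.order_segment_rules r
  on f.total_amount >= r.min_amount
 and f.total_amount < r.max_amount;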
Drills and exercises
- Create staging tables for two raw sources, normalizing types and trimming text fields.
- Write a SQL transformation that removes duplicates using row_number and keeps the latest by ingestion timestamp.
- Implement a merge-based upsert for an incremental load.
- Add a watermark table and re-run with a two-day overlap to capture late events.
- Partition a large table by date and run a query with partition pruning.
- Write tests: row count checks, not null on keys, and acceptable value ranges.
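For the last drill, tests can start as plain queries that return zero rows when the data is healthy (thresholds here are illustrative):
-- Uniqueness: duplicate order_id values (should return zero rows)
select order_id, count(*) as n
from mart.fact_orders
group by order_id
having count(*) > 1;
-- Not null on keys (should return zero rows)
select * from mart.fact_orders where order_id is null or customer_id is null;
-- Freshness: returns a row only if the newest order is older than a day
select max(order_ts) as latest_order
from mart.fact_orders
having max(order_ts) < current_timestamp - interval '1 day';
-- Acceptable value range (should return zero rows)
select * from mart.fact_orders where total_amount < 0;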
Common mistakes and debugging tips
- Mixing raw and business logic in one step: Separate raw cleanup (staging) from business rules (mart).
- Forgetting idempotency: Ensure re-runs don’t duplicate rows; use dedup and upserts.
- No partition filter: Large scans cause slow queries; always filter by partitions.
- Missing late data: Use watermarks with overlap windows to recapture late arrivals.
- Unstable keys: Define deterministic primary or compound keys; avoid relying on non-unique fields.
- Silent schema drift: Add checks for new/removed columns and default values (see the sketch after this list).
- Unbounded updates: Limit update sets to changed rows via a change hash to reduce load.
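For the schema-drift check above, one lightweight approach is to compare the live columns of a source against an expected list via information_schema (available in most engines; the expected list is hard-coded here for illustration):
-- Report columns that appeared in or disappeared from raw.orders
with expected as (
    select column_name
    from (values ('order_id'), ('customer_id'), ('order_ts'),
                 ('total_amount'), ('status'), ('ingestion_ts')) as e(column_name)
), actual as (
    select column_name
    from information_schema.columns
    where table_schema = 'raw' and table_name = 'orders'
)
select 'new_column' as issue, column_name from actual
where column_name not in (select column_name from expected)
union all
select 'missing_column' as issue, column_name from expected
where column_name not in (select column_name from actual);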
Mini project: Orders analytics mart
Build a small ELT flow for an e-commerce dataset with customers, orders, and order_items.
- Layers: raw → staging → mart.
- Transformations: clean types, deduplicate, enrich with derived fields (e.g., order segment).
- Models: dim_customer, fact_orders, fact_order_items, and a daily_sales_summary.
- Incremental logic: watermark on order_ts, merge-based upsert for fact tables.
- Partitioning: partition facts by event_date (derived from order_ts).
Acceptance checks:
- No duplicate order_id in fact_orders.
- Total of order_items equals order totals within ±0.5% (tolerate rounding; see the check after this list).
- No nulls in primary keys; dates within expected ranges.
- Daily summary matches the sum of facts when filtered by date.
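The item-total check can be expressed directly in SQL. A sketch assuming fact_order_items carries order_id, quantity, and unit_price:
-- Orders whose item-level total drifts more than 0.5% from the order total
with item_totals as (
    select order_id, sum(quantity * unit_price) as items_total
    from mart.fact_order_items
    group by order_id
)
select o.order_id, o.total_amount, i.items_total
from mart.fact_orders o
join item_totals i using (order_id)
where abs(o.total_amount - i.items_total) > 0.005 * abs(o.total_amount);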
Practical projects to solidify skills
- Marketing attribution mart: join ad impressions, clicks, and conversions; handle late conversions and assign first/last-touch attribution.
- Subscription revenue mart: implement MRR/ARR metrics, churn flags, and cohort tables with slowly changing dimensions.
- Operational SLA dashboard: build a freshness table tracking when critical sources last updated and highlight delays.
Next steps
- Add data quality tests and simple anomaly checks (e.g., sudden drops to zero).
- Introduce parameterized transformations so the same logic can run for multiple regions or brands.
- Automate backfills with a date range input and ensure outputs are idempotent.
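A date-range backfill stays idempotent if each run rebuilds exactly the partitions in scope. A sketch against mart.fact_orders_partitioned from Example 4; :start_date and :end_date stand for parameters supplied by the orchestrator:
-- 1) Clear only the partitions being rebuilt
delete from mart.fact_orders_partitioned
where event_date between :start_date and :end_date;
-- 2) Rebuild them from staging; re-running yields the same result
insert into mart.fact_orders_partitioned
select order_id,
       customer_id,
       cast(order_ts as date) as event_date,
       order_ts,
       total_amount,
       status
from staging.orders
where cast(order_ts as date) between :start_date and :end_date;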
Subskills
- Transform Layers: Raw, Staging, Mart — structure the pipeline with clear responsibilities and naming.
- Reusable SQL Transformations — standardize CTE patterns, views, and macros for repeatable logic.
- Data Modeling For Analytics — design facts and dimensions for fast, reliable queries.
- Handling Late Arriving Data — use watermarks and overlaps to capture late events safely.
- Deduplication And Upserts — keep one truth via windowed dedup and merge operations.
- Business Logic Implementation — centralize rules with tests and reference tables.
- Partitioning Strategies Basics — choose keys that match query filters to reduce scans.
- Reprocessing And Recovery Patterns — build idempotent, backfillable transformations.