Integration Architecture (ETL/ELT) for Data Architects
As a Data Architect, you choose how data moves from source systems to analytics platforms. ETL (transform before load) and ELT (load then transform) are core patterns you will combine with batch and streaming, CDC, data contracts, schema evolution, and standardized ingestion frameworks. This skill helps you design reliable, scalable, and observable integrations.
Why this matters in the Data Architect role
- Unlocks trustworthy analytics by defining ingestion SLAs, lineage, and quality gates.
- Controls cost and performance by choosing ETL vs ELT, batch vs streaming, and partitioning.
- Reduces risk with idempotent design, CDC, backfills, and schema evolution strategies.
- Accelerates delivery with reusable ingestion frameworks and data contracts.
Who this is for
- Aspiring and practicing Data Architects aligning data flows to business needs.
- Senior data engineers preparing to own platform-wide integration patterns.
- Analytics engineers needing robust ingestion and transformation designs.
Prerequisites
- Comfort with SQL (joins, window functions, MERGE/UPSERT).
- Basic Python or another scripting language for orchestration.
- Familiarity with data warehouses/lakes and message/batch processing.
- Understanding of data modeling (dimensions, facts, slowly changing dimensions).
Learning path
Milestone 1 — Foundation: ETL vs ELT, when to choose
- Compare compute locations, cost/performance, governance, and team skills.
- Define staging zones for both patterns.
- Outcome: articulate a decision matrix for ETL vs ELT.
Milestone 2 — Source integration patterns
- Learn file, API, database, and event-stream ingestion patterns.
- Define authentication, pagination, partitioning, and scheduling.
- Outcome: a checklist for adding a new source safely.
Milestone 3 — Incremental loading and CDC
- Implement high-water marks and change data capture.
- Design idempotent upserts and late-arriving data handling.
- Outcome: repeatable incremental loads with correctness guarantees.
Milestone 4 — Batch vs streaming
- Choose SLAs and latency budgets; implement micro-batches or streaming.
- Plan ordering, watermarking, and replay strategies.
- Outcome: blueprint for both daily batch and near-real-time feeds.
Milestone 5 — Reliability and evolution
- Design retries, backoffs, dead-letter queues, and alerting.
- Handle schema evolution with compatibility rules and versioning.
- Outcome: resilient pipelines that self-heal or fail fast with context.
Milestone 6 — Data contracts and standardized frameworks
- Define producer/consumer expectations: fields, types, SLAs, semantics, quality checks.
- Create templates and metadata-driven ingestion.
- Outcome: scalable ingestion with consistent observability.
Worked examples
Example 1 — ETL with staging and SCD Type 2
Goal: transform customer records before loading to the warehouse; track history.
-- Staging table (landed daily from source ETL job)
CREATE TABLE stage_customers (
customer_id STRING,
name STRING,
email STRING,
updated_at TIMESTAMP
);
-- Dimension table with SCD2
CREATE TABLE dim_customer (
customer_id STRING,
name STRING,
email STRING,
effective_from TIMESTAMP,
effective_to TIMESTAMP,
is_current BOOLEAN
);
-- Close out changed records, then insert new versions.
-- (A single MERGE cannot both expire a matched current row and insert its
-- replacement, so the SCD2 update is applied in two statements.)
MERGE INTO dim_customer d
USING stage_customers s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.name <> s.name OR d.email <> s.email)
THEN UPDATE SET d.effective_to = s.updated_at, d.is_current = FALSE;
-- Insert a current version for every new or just-expired customer
INSERT INTO dim_customer (
  customer_id, name, email, effective_from, effective_to, is_current
)
SELECT s.customer_id, s.name, s.email, s.updated_at,
       TIMESTAMP '9999-12-31 00:00:00', TRUE
FROM stage_customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
Example 2 — ELT for clickstream
Goal: land raw click events in a landing table quickly with minimal parsing; transform inside the warehouse.
-- Raw landing
CREATE TABLE raw_clicks (
event_time TIMESTAMP,
user_id STRING,
url STRING,
referrer STRING,
ua STRING
);
-- Transform in-warehouse
CREATE TABLE fct_sessions AS
SELECT
user_id,
DATE_TRUNC('hour', event_time) AS session_hour,
COUNT(*) AS hits,
APPROX_COUNT_DISTINCT(url) AS pages
FROM raw_clicks
GROUP BY user_id, DATE_TRUNC('hour', event_time);
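The load half of ELT stays deliberately thin: parse as little as possible and land the events; the query above does the heavy lifting in the warehouse. Below is a minimal sketch of that load step, assuming newline-delimited JSON files and a hypothetical warehouse client exposing an insert_rows bulk-load method (swap in your platform's COPY or bulk API).
# Minimal ELT load step: read newline-delimited JSON and land it unmodified.
# `warehouse` is a hypothetical client; use your platform's bulk-load interface.
import json
def land_clicks(path, warehouse):
    rows = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            rows.append({
                'event_time': event['event_time'],
                'user_id': event.get('user_id'),
                'url': event.get('url'),
                'referrer': event.get('referrer'),
                'ua': event.get('ua'),
            })
    # One bulk write per file keeps the load step cheap and restartable.
    warehouse.insert_rows('raw_clicks', rows)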
Example 3 — Incremental load with high‑water mark
# Pseudocode
last_ts = state_store.get('orders_max_ts', default='1970-01-01T00:00:00Z')
new_rows = source_db.read("SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at", [last_ts])
for batch in chunk(new_rows, size=5000):
    upsert(batch)  # must be idempotent
    last_ts = batch[-1]['updated_at']
    state_store.set('orders_max_ts', last_ts)  # persist only after the upsert succeeds
Key: ordering by the watermark column and persisting state after successful upsert.
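A high-water mark only sees the current state of each row; CDC goes further by delivering every change event, including deletes. Below is a sketch of applying CDC events in commit order, reusing the hypothetical state_store and upsert helpers from above and treating deletes as soft deletes (the op/before/after event shape, cdc_stream, and the is_deleted flag are assumptions, not a specific tool's API).
# Pseudocode: apply CDC events in commit order; deletes become soft deletes
# so downstream models can see that a row disappeared.
last_ts = state_store.get('orders_cdc_ts', default='1970-01-01T00:00:00Z')
for event in cdc_stream.read(since=last_ts):
    if event['op'] in ('insert', 'update'):
        upsert([event['after']])  # same idempotent upsert as above
    elif event['op'] == 'delete':
        upsert([{**event['before'], 'is_deleted': True, 'deleted_at': event['ts']}])
    last_ts = event['ts']
    state_store.set('orders_cdc_ts', last_ts)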
Example 4 — Idempotent UPSERT (handles retries)
MERGE INTO fct_orders t
USING stage_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
t.status = s.status,
t.amount = s.amount,
t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
order_id, status, amount, updated_at
) VALUES (
s.order_id, s.status, s.amount, s.updated_at
);
If the same batch is reprocessed, the MERGE repeats safely without duplication.
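Retries can also re-deliver a batch that mixes older and newer versions of the same key. Deduplicating to the latest version per key before the MERGE preserves last-updated-wins semantics; a minimal sketch, assuming each staged row is a dict with order_id and updated_at (stage_rows and upsert are the same hypothetical helpers as in the earlier examples):
# Keep only the newest version of each key so that replaying a mixed batch
# can never overwrite newer data with an older row.
def latest_per_key(rows, key='order_id', version='updated_at'):
    best = {}
    for row in rows:
        kept = best.get(row[key])
        if kept is None or row[version] > kept[version]:
            best[row[key]] = row
    return list(best.values())
deduped = latest_per_key(stage_rows)  # the batch about to be merged
upsert(deduped)                       # idempotent MERGE as above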
Example 5 — Schema evolution: add a new nullable column
-- Add new column safely
ALTER TABLE fct_orders ADD COLUMN discount_pct DECIMAL(5,2) NULL;
-- Backfill with default for historical rows
UPDATE fct_orders SET discount_pct = 0.00 WHERE discount_pct IS NULL;
-- Downstream model references must handle NULLs until backfill completes
Prefer additive, backward-compatible changes (new nullable columns, new fields with defaults) and version the contract.
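One way to make the additive-only policy enforceable is a compatibility check that diffs a proposed schema against the registered version before deployment. A minimal sketch, assuming schemas are represented as plain dicts of field name to type and nullability (the field specs shown are illustrative):
# Reject field removals, type changes, and new required fields without defaults.
def is_backward_compatible(old_schema, new_schema):
    for field, spec in old_schema.items():
        if field not in new_schema:
            return False, f'field removed: {field}'
        if new_schema[field]['type'] != spec['type']:
            return False, f'type changed: {field}'
    for field, spec in new_schema.items():
        if field not in old_schema and not (spec.get('nullable') or 'default' in spec):
            return False, f'new required field without default: {field}'
    return True, 'ok'
ok, reason = is_backward_compatible(
    {'order_id': {'type': 'string', 'nullable': False}},
    {'order_id': {'type': 'string', 'nullable': False},
     'discount_pct': {'type': 'decimal(5,2)', 'nullable': True}},
)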
Batch vs streaming — choosing wisely
- Batch: simpler, cheaper, high throughput. Good for daily reports and large backfills.
- Streaming/micro-batch: lower latency, continuous updates. Good for near-real-time dashboards and event-driven features.
- Decision inputs: latency SLA, event volume/variability, ordering needs, cost constraints, team ops maturity.
Design tip
Start with batch to prove value. Move the few pipelines that truly need low latency to streaming. Keep shared contracts and transformations consistent across both.
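If a feed does move to micro-batches, the core loop stays close to the batch version: poll on a short interval, re-read a small lateness window, and rely on idempotent writes. Below is a sketch using the same hypothetical source, upsert, and state_store helpers as the earlier examples; the 15-minute lateness window and 60-second interval are arbitrary choices.
# Pseudocode micro-batch loop: re-read a small lateness window each cycle so
# late events are still captured; idempotent upserts make the overlap harmless.
import time
from datetime import timedelta
LATENESS = timedelta(minutes=15)
INTERVAL_SECONDS = 60
while True:
    watermark = state_store.get('clicks_watermark')  # stored as a datetime
    rows = source.read_since(watermark - LATENESS)
    if rows:
        upsert(rows)  # safe to re-apply overlapping rows
        state_store.set('clicks_watermark', max(r['event_time'] for r in rows))
    time.sleep(INTERVAL_SECONDS)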
Error handling, retries, and backpressure
- Retry with exponential backoff for transient failures; cap max attempts.
- Use dead-letter queues or quarantine tables for poison records with full context.
- Emit structured errors: source, payload snippet, exception, attempt count, correlation id.
- Apply backpressure: limit concurrency or temporarily pause ingestion to protect sinks.
# Pseudocode retry wrapper
for attempt in range(1, MAX_TRIES + 1):
    try:
        process(batch)
        break
    except TransientError as e:
        if attempt == MAX_TRIES:
            to_dead_letter(batch, error=str(e), attempts=attempt)  # retries exhausted
            break
        sleep(backoff(attempt))
    except PermanentError as e:
        to_dead_letter(batch, error=str(e), attempts=attempt)
        break
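The to_dead_letter call is where observability pays off: capture enough context to diagnose and replay the record later. A sketch of the quarantine write, with quarantine, the source name, and the record fields as illustrative choices:
# Pseudocode: write a structured error record next to the failed payload.
import json, uuid
from datetime import datetime, timezone
def to_dead_letter(batch, error, attempts=None):
    quarantine.write({
        'correlation_id': str(uuid.uuid4()),
        'source': 'orders_db',
        'failed_at': datetime.now(timezone.utc).isoformat(),
        'error': error,
        'attempts': attempts,
        'payload_sample': json.dumps(batch[:3], default=str),  # enough to debug, small enough to store
    })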
Data contracts in integration
- Define schema (names, types, nullability) and semantics (business meaning).
- Operational SLAs: delivery frequency/latency, completeness, retention.
- Quality rules: required fields, ranges, referential checks, duplication policy.
- Change policy: versioning, deprecation windows, compatibility expectations.
Contract change example
Producer adds a nullable field with default; version moves from v1.2 to v1.3. Consumers have 30 days to adopt. Breaking changes require v2.0 with coexistence period.
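Contracts are easiest to enforce when they exist as machine-readable definitions alongside the pipeline rather than as documents. Here is a sketch of what the v1.3 contract from the example might look like as data; the field names, SLA values, and structure are illustrative, not a standard format.
# Illustrative contract definition; a CI check can diff versions and a
# validator can enforce the quality rules on every load.
ORDERS_CONTRACT_V1_3 = {
    'version': '1.3',
    'schema': {
        'order_id':     {'type': 'string',        'nullable': False},
        'amount':       {'type': 'decimal(12,2)', 'nullable': False},
        'updated_at':   {'type': 'timestamp',     'nullable': False},
        'discount_pct': {'type': 'decimal(5,2)',  'nullable': True, 'default': 0.0},  # added in v1.3
    },
    'sla': {'delivery': 'daily by 06:00 UTC', 'max_lateness_hours': 2},
    'quality': [
        {'rule': 'not_null', 'field': 'order_id'},
        {'rule': 'range', 'field': 'amount', 'min': 0, 'max': 100000},
    ],
}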
Standardized ingestion frameworks
- Template connectors: file, API, relational DB, event stream.
- Metadata-driven configs: source, schedule, partitions, schema mapping, destination.
- Built-in quality checks, lineage tagging, and observability.
- Idempotent writes and automatic backfills with safety rails.
Minimal config example
source: "orders_db"
entity: "orders"
mode: "incremental"
watermark: "updated_at"
partition: "ingest_date"
destination: "lake.raw.orders"
quality:
- rule: "not_null" field: "order_id"
- rule: "range" field: "amount" min: 0 max: 100000
retries: {max: 5, backoff: "exponential"}
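A config like this only pays off when a generic runner interprets it, so onboarding a source means adding metadata rather than code. Below is a sketch of such a metadata-driven runner, assuming the config has been parsed into a dict and that connectors, apply_quality_rules, write_partitioned, and state_store are framework helpers; all names are illustrative.
# Pseudocode: one generic runner, many sources; behavior is driven by config.
def run_ingestion(config):
    connector = connectors[config['source']]  # registered per source type
    watermark = state_store.get(config['entity'] + '_watermark')
    if config['mode'] == 'incremental':
        rows = connector.read_since(config['entity'], config['watermark'], watermark)
    else:
        rows = connector.read_full(config['entity'])
    rows = apply_quality_rules(rows, config['quality'])  # failing rows go to quarantine
    write_partitioned(config['destination'], rows, partition=config['partition'])
    if rows and config['mode'] == 'incremental':
        state_store.set(config['entity'] + '_watermark',
                        max(r[config['watermark']] for r in rows))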
Drills / exercises
- Sketch an ETL vs ELT decision matrix for three sources: files, DB, and events.
- Design a CDC plan for a table with soft deletes (include delete capture).
- Write a MERGE statement to upsert by natural key and keep last-updated wins.
- Define a data contract change from v1 to v1.1 that is backward-compatible.
- Plan a 60-day historical backfill with safe batching and idempotent writes.
Common mistakes and debugging tips
- Relying on full reloads when incremental loads are possible. Tip: start with a high-water mark, then adopt CDC for full fidelity.
- Ignoring idempotency. Tip: enforce unique keys and use MERGE instead of INSERT.
- No quarantine path. Tip: implement dead-letter handling to keep the pipeline flowing.
- Unbounded schema drift. Tip: gate changes via contracts and versioning.
- Small-file explosion in lakes. Tip: compact files by partition and target size.
- Missing observability. Tip: track rows in/out, lateness, and error rates per step.
Mini project — Incremental orders integration
Build an incremental pipeline that ingests orders from a relational source into a lake/warehouse with ELT transforms.
- Scope: one table orders with updates and deletes, plus a daily dimension lookup.
- Requirements:
- Landing zone for raw data with partitioning by ingest_date.
- Incremental load using updated_at as watermark; capture deletes.
- Idempotent MERGE into analytical table.
- Contract v1.0 with two quality rules (not-null key, non-negative amount).
- Error handling: retries and a quarantine table for bad records.
- Deliverables:
- DDL for landing, stage, and final tables.
- Python or SQL scripts for load, merge, and backfill.
- Runbook: how to reprocess a day and how to backfill 30 days safely.
- Acceptance checks:
- Re-running the same batch produces no duplicates.
- Row counts before/after match expectations; deletes reflected.
- Contract validation passes; errors visible in quarantine.
Practical projects
- Build a standardized ingestion template that onboards a CSV source by configuration only.
- Create a streaming micro-batch job that materializes rolling 1-hour metrics with late data handling.
- Design a schema evolution policy and implement a migration that adds two new nullable fields and backfills them.
Subskills
- Source System Integration Patterns
- CDC And Incremental Loads
- Batch And Streaming Integration Basics
- Idempotency And Backfills
- Handling Schema Evolution
- Error Handling And Retry Patterns
- Data Contracts Concepts
- Standardized Ingestion Frameworks
Next steps
- Work through the subskills listed above, applying each to your mini project.
- Take the skill exam to confirm readiness. Everyone can take it for free; logged-in learners get saved progress.
- Extend the mini project with a dimension table and SCD2 handling.