Integration Architecture (ETL/ELT) for Data Architects
As a Data Architect, you choose how data moves from source systems to analytics platforms. ETL (transform before load) and ELT (load then transform) are core patterns you will combine with batch and streaming, CDC, data contracts, schema evolution, and standardized ingestion frameworks. This skill helps you design reliable, scalable, and observable integrations.
Why this matters in the Data Architect role
- Unlocks trustworthy analytics by defining ingestion SLAs, lineage, and quality gates.
- Controls cost and performance by choosing ETL vs ELT, batch vs streaming, and partitioning.
- Reduces risk with idempotent design, CDC, backfills, and schema evolution strategies.
- Accelerates delivery with reusable ingestion frameworks and data contracts.
Who this is for
- Aspiring and practicing Data Architects aligning data flows to business needs.
- Senior data engineers preparing to own platform-wide integration patterns.
- Analytics engineers needing robust ingestion and transformation designs.
Prerequisites
- Comfort with SQL (joins, window functions, MERGE/UPSERT).
- Basic Python or another scripting language for orchestration.
- Familiarity with data warehouses/lakes and message/batch processing.
- Understanding of data modeling (dimensions, facts, slowly changing dimensions).
Learning path
Milestone 1 — Foundation: ETL vs ELT, when to choose
- Compare compute locations, cost/performance, governance, and team skills.
- Define staging zones for both patterns.
- Outcome: articulate a decision matrix for ETL vs ELT.
Milestone 2 — Source integration patterns
- Learn file, API, database, and event-stream ingestion patterns.
- Define authentication, pagination, partitioning, and scheduling.
- Outcome: a checklist for adding a new source safely.
Milestone 3 — Incremental loading and CDC
- Implement high-water marks and change data capture.
- Design idempotent upserts and late-arriving data handling.
- Outcome: repeatable incremental loads with correctness guarantees.
Milestone 4 — Batch vs streaming
- Choose SLAs and latency budgets; implement micro-batches or streaming.
- Plan ordering, watermarking, and replay strategies.
- Outcome: blueprint for both daily batch and near-real-time feeds.
Milestone 5 — Reliability and evolution
- Design retries, backoffs, dead-letter queues, and alerting.
- Handle schema evolution with compatibility rules and versioning.
- Outcome: resilient pipelines that self-heal or fail fast with context.
Milestone 6 — Data contracts and standardized frameworks
- Define producer/consumer expectations: fields, types, SLAs, semantics, quality checks.
- Create templates and metadata-driven ingestion.
- Outcome: scalable ingestion with consistent observability.
Worked examples
Example 1 — ETL with staging and SCD Type 2
Goal: transform customer records before loading to the warehouse; track history.
-- Staging table (landed daily from source ETL job)
CREATE TABLE stage_customers (
customer_id STRING,
name STRING,
email STRING,
updated_at TIMESTAMP
);
-- Dimension table with SCD2
CREATE TABLE dim_customer (
customer_id STRING,
name STRING,
email STRING,
effective_from TIMESTAMP,
effective_to TIMESTAMP,
is_current BOOLEAN
);
-- Close out changed records, then insert new versions.
-- (A single MERGE cannot both expire a matched current row and insert its
-- replacement, so the SCD2 update is applied in two statements.)
MERGE INTO dim_customer d
USING stage_customers s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.name <> s.name OR d.email <> s.email)
THEN UPDATE SET d.effective_to = s.updated_at, d.is_current = FALSE;
-- Insert a current version for every new or just-expired customer
INSERT INTO dim_customer (
  customer_id, name, email, effective_from, effective_to, is_current
)
SELECT s.customer_id, s.name, s.email, s.updated_at,
       TIMESTAMP '9999-12-31 00:00:00', TRUE
FROM stage_customers s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;
Example 2 — ELT for clickstream
Goal: land raw click events in a landing table quickly with minimal parsing; transform inside the warehouse.
-- Raw landing
CREATE TABLE raw_clicks (
event_time TIMESTAMP,
user_id STRING,
url STRING,
referrer STRING,
ua STRING
);
-- Transform in-warehouse
CREATE TABLE fct_sessions AS
SELECT
user_id,
DATE_TRUNC('hour', event_time) AS session_hour,
COUNT(*) AS hits,
APPROX_COUNT_DISTINCT(url) AS pages
FROM raw_clicks
GROUP BY user_id, DATE_TRUNC('hour', event_time);
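The load half of ELT stays deliberately thin: parse as little as possible and land the events; the query above does the heavy lifting in the warehouse. Below is a minimal sketch of that load step, assuming newline-delimited JSON files and a hypothetical warehouse client exposing an insert_rows bulk-load method (swap in your platform's COPY or bulk API).
# Minimal ELT load step: read newline-delimited JSON and land it unmodified.
# `warehouse` is a hypothetical client; use your platform's bulk-load interface.
import json
def land_clicks(path, warehouse):
    rows = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            rows.append({
                'event_time': event['event_time'],
                'user_id': event.get('user_id'),
                'url': event.get('url'),
                'referrer': event.get('referrer'),
                'ua': event.get('ua'),
            })
    # One bulk write per file keeps the load step cheap and restartable.
    warehouse.insert_rows('raw_clicks', rows)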
Example 3 — Incremental load with high‑water mark
# Pseudocode
last_ts = state_store.get('orders_max_ts', default='1970-01-01T00:00:00Z')
new_rows = source_db.read("SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at", [last_ts])
for batch in chunk(new_rows, size=5000):
    upsert(batch)  # must be idempotent
    last_ts = batch[-1]['updated_at']
    state_store.set('orders_max_ts', last_ts)  # persist only after the upsert succeeds
Key: ordering by the watermark column and persisting state after successful upsert.
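A high-water mark only sees the current state of each row; CDC goes further by delivering every change event, including deletes. Below is a sketch of applying CDC events in commit order, reusing the hypothetical state_store and upsert helpers from above and treating deletes as soft deletes (the op/before/after event shape, cdc_stream, and the is_deleted flag are assumptions, not a specific tool's API).
# Pseudocode: apply CDC events in commit order; deletes become soft deletes
# so downstream models can see that a row disappeared.
last_ts = state_store.get('orders_cdc_ts', default='1970-01-01T00:00:00Z')
for event in cdc_stream.read(since=last_ts):
    if event['op'] in ('insert', 'update'):
        upsert([event['after']])  # same idempotent upsert as above
    elif event['op'] == 'delete':
        upsert([{**event['before'], 'is_deleted': True, 'deleted_at': event['ts']}])
    last_ts = event['ts']
    state_store.set('orders_cdc_ts', last_ts)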
Example 4 — Idempotent UPSERT (handles retries)
MERGE INTO fct_orders t
USING stage_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
t.status = s.status,
t.amount = s.amount,
t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
order_id, status, amount, updated_at
) VALUES (
s.order_id, s.status, s.amount, s.updated_at
);
If the same batch is reprocessed, the MERGE repeats safely without duplication.
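Retries can also re-deliver a batch that mixes older and newer versions of the same key. Deduplicating to the latest version per key before the MERGE preserves last-updated-wins semantics; a minimal sketch, assuming each staged row is a dict with order_id and updated_at (stage_rows and upsert are the same hypothetical helpers as in the earlier examples):
# Keep only the newest version of each key so that replaying a mixed batch
# can never overwrite newer data with an older row.
def latest_per_key(rows, key='order_id', version='updated_at'):
    best = {}
    for row in rows:
        kept = best.get(row[key])
        if kept is None or row[version] > kept[version]:
            best[row[key]] = row
    return list(best.values())
deduped = latest_per_key(stage_rows)  # the batch about to be merged
upsert(deduped)                       # idempotent MERGE as above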
Example 5 — Schema evolution: add a new nullable column
-- Add new column safely
ALTER TABLE fct_orders ADD COLUMN discount_pct DECIMAL(5,2) NULL;
-- Backfill with default for historical rows
UPDATE fct_orders SET discount_pct = 0.00 WHERE discount_pct IS NULL;
-- Downstream model references must handle NULLs until backfill completes
Prefer additive, backward-compatible changes (new nullable columns, new fields with defaults) and version the contract.
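One way to make the additive-only policy enforceable is a compatibility check that diffs a proposed schema against the registered version before deployment. A minimal sketch, assuming schemas are represented as plain dicts of field name to type and nullability (the field specs shown are illustrative):
# Reject field removals, type changes, and new required fields without defaults.
def is_backward_compatible(old_schema, new_schema):
    for field, spec in old_schema.items():
        if field not in new_schema:
            return False, f'field removed: {field}'
        if new_schema[field]['type'] != spec['type']:
            return False, f'type changed: {field}'
    for field, spec in new_schema.items():
        if field not in old_schema and not (spec.get('nullable') or 'default' in spec):
            return False, f'new required field without default: {field}'
    return True, 'ok'
ok, reason = is_backward_compatible(
    {'order_id': {'type': 'string', 'nullable': False}},
    {'order_id': {'type': 'string', 'nullable': False},
     'discount_pct': {'type': 'decimal(5,2)', 'nullable': True}},
)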
Batch vs streaming — choosing wisely
- Batch: simpler, cheaper, high throughput. Good for daily reports and large backfills.
- Streaming/micro-batch: lower latency, continuous updates. Good for near-real-time dashboards and event-driven features.
- Decision inputs: latency SLA, event volume/variability, ordering needs, cost constraints, team ops maturity.
Design tip
Start with batch to prove value. Move the few pipelines that truly need low latency to streaming. Keep shared contracts and transformations consistent across both.
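If a feed does move to micro-batches, the core loop stays close to the batch version: poll on a short interval, re-read a small lateness window, and rely on idempotent writes. Below is a sketch using the same hypothetical source, upsert, and state_store helpers as the earlier examples; the 15-minute lateness window and 60-second interval are arbitrary choices.
# Pseudocode micro-batch loop: re-read a small lateness window each cycle so
# late events are still captured; idempotent upserts make the overlap harmless.
import time
from datetime import timedelta
LATENESS = timedelta(minutes=15)
INTERVAL_SECONDS = 60
while True:
    watermark = state_store.get('clicks_watermark')  # stored as a datetime
    rows = source.read_since(watermark - LATENESS)
    if rows:
        upsert(rows)  # safe to re-apply overlapping rows
        state_store.set('clicks_watermark', max(r['event_time'] for r in rows))
    time.sleep(INTERVAL_SECONDS)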
Error handling, retries, and backpressure
- Retry with exponential backoff for transient failures; cap max attempts.
- Use dead-letter queues or quarantine tables for poison records with full context.
- Emit structured errors: source, payload snippet, exception, attempt count, correlation id.
- Apply backpressure: limit concurrency or temporarily pause ingestion to protect sinks.
# Pseudocode retry wrapper
for attempt in range(1, MAX_TRIES + 1):
    try:
        process(batch)
        break
    except TransientError as e:
        if attempt == MAX_TRIES:
            to_dead_letter(batch, error=str(e), attempts=attempt)  # retries exhausted
            break
        sleep(backoff(attempt))
    except PermanentError as e:
        to_dead_letter(batch, error=str(e), attempts=attempt)
        break
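The to_dead_letter call is where observability pays off: capture enough context to diagnose and replay the record later. A sketch of the quarantine write, with quarantine, the source name, and the record fields as illustrative choices:
# Pseudocode: write a structured error record next to the failed payload.
import json, uuid
from datetime import datetime, timezone
def to_dead_letter(batch, error, attempts=None):
    quarantine.write({
        'correlation_id': str(uuid.uuid4()),
        'source': 'orders_db',
        'failed_at': datetime.now(timezone.utc).isoformat(),
        'error': error,
        'attempts': attempts,
        'payload_sample': json.dumps(batch[:3], default=str),  # enough to debug, small enough to store
    })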
Data contracts in integration
- Define schema (names, types, nullability) and semantics (business meaning).
- Operational SLAs: delivery frequency/latency, completeness, retention.
- Quality rules: required fields, ranges, referential checks, duplication policy.
- Change policy: versioning, deprecation windows, compatibility expectations.
Contract change example
Producer adds a nullable field with default; version moves from v1.2 to v1.3. Consumers have 30 days to adopt. Breaking changes require v2.0 with coexistence period.
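Contracts are easiest to enforce when they exist as machine-readable definitions alongside the pipeline rather than as documents. Here is a sketch of what the v1.3 contract from the example might look like as data; the field names, SLA values, and structure are illustrative, not a standard format.
# Illustrative contract definition; a CI check can diff versions and a
# validator can enforce the quality rules on every load.
ORDERS_CONTRACT_V1_3 = {
    'version': '1.3',
    'schema': {
        'order_id':     {'type': 'string',        'nullable': False},
        'amount':       {'type': 'decimal(12,2)', 'nullable': False},
        'updated_at':   {'type': 'timestamp',     'nullable': False},
        'discount_pct': {'type': 'decimal(5,2)',  'nullable': True, 'default': 0.0},  # added in v1.3
    },
    'sla': {'delivery': 'daily by 06:00 UTC', 'max_lateness_hours': 2},
    'quality': [
        {'rule': 'not_null', 'field': 'order_id'},
        {'rule': 'range', 'field': 'amount', 'min': 0, 'max': 100000},
    ],
}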
Standardized ingestion frameworks
- Template connectors: file, API, relational DB, event stream.
- Metadata-driven configs: source, schedule, partitions, schema mapping, destination.
- Built-in quality checks, lineage tagging, and observability.
- Idempotent writes and automatic backfills with safety rails.
Minimal config example
source: "orders_db"
entity: "orders"
mode: "incremental"
watermark: "updated_at"
partition: "ingest_date"
destination: "lake.raw.orders"
quality:
- rule: "not_null" field: "order_id"
- rule: "range" field: "amount" min: 0 max: 100000
retries: {max: 5, backoff: "exponential"}
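A config like this only pays off when a generic runner interprets it, so onboarding a source means adding metadata rather than code. Below is a sketch of such a metadata-driven runner, assuming the config has been parsed into a dict and that connectors, apply_quality_rules, write_partitioned, and state_store are framework helpers; all names are illustrative.
# Pseudocode: one generic runner, many sources; behavior is driven by config.
def run_ingestion(config):
    connector = connectors[config['source']]  # registered per source type
    watermark = state_store.get(config['entity'] + '_watermark')
    if config['mode'] == 'incremental':
        rows = connector.read_since(config['entity'], config['watermark'], watermark)
    else:
        rows = connector.read_full(config['entity'])
    rows = apply_quality_rules(rows, config['quality'])  # failing rows go to quarantine
    write_partitioned(config['destination'], rows, partition=config['partition'])
    if rows and config['mode'] == 'incremental':
        state_store.set(config['entity'] + '_watermark',
                        max(r[config['watermark']] for r in rows))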
Drills / exercises
- Sketch an ETL vs ELT decision matrix for three sources: files, DB, and events.
- Design a CDC plan for a table with soft deletes (include delete capture).
- Write a MERGE statement to upsert by natural key and keep last-updated wins.
- Define a data contract change from v1 to v1.1 that is backward-compatible.
- Plan a 60-day historical backfill with safe batching and idempotent writes.
Common mistakes and debugging tips
- Relying on full reloads when incremental loads are possible. Tip: start with a high-water mark, then adopt CDC for full fidelity.
- Ignoring idempotency. Tip: enforce unique keys and use MERGE instead of INSERT.
- No quarantine path. Tip: implement dead-letter handling to keep the pipeline flowing.
- Unbounded schema drift. Tip: gate changes via contracts and versioning.
- Small-file explosion in lakes. Tip: compact files by partition and target size.
- Missing observability. Tip: track rows in/out, lateness, and error rates per step.
Mini project — Incremental orders integration
Build an incremental pipeline that ingests orders from a relational source into a lake/warehouse with ELT transforms.
- Scope: one table orders with updates and deletes, plus a daily dimension lookup.
- Requirements:
- Landing zone for raw data with partitioning by ingest_date.
- Incremental load using updated_at as watermark; capture deletes.
- Idempotent MERGE into analytical table.
- Contract v1.0 with two quality rules (not-null key, non-negative amount).
- Error handling: retries and a quarantine table for bad records.
- Deliverables:
- DDL for landing, stage, and final tables.
- Python or SQL scripts for load, merge, and backfill.
- Runbook: how to reprocess a day and how to backfill 30 days safely.
- Acceptance checks:
- Re-running the same batch produces no duplicates.
- Row counts before/after match expectations; deletes reflected.
- Contract validation passes; errors visible in quarantine.
Practical projects
- Build a standardized ingestion template that onboards a CSV source by configuration only.
- Create a streaming micro-batch job that materializes rolling 1-hour metrics with late data handling.
- Design a schema evolution policy and implement a migration that adds two new nullable fields and backfills them.
Subskills
- Source System Integration Patterns
- CDC And Incremental Loads
- Batch And Streaming Integration Basics
- Idempotency And Backfills
- Handling Schema Evolution
- Error Handling And Retry Patterns
- Data Contracts Concepts
- Standardized Ingestion Frameworks
Next steps
- Work through the subskills listed above, applying each to your mini project.
- Take the skill exam to confirm readiness. Everyone can take it for free; logged-in learners get saved progress.
- Extend the mini project with a dimension table and SCD2 handling.