Why this matters
The staging layer is where raw data first lands. Good staging design makes your pipelines reliable, debuggable, and fast to recover. In real ETL work, you will:
- Receive files, CDC streams, or API payloads and standardize them into consistent structures.
- Capture audit metadata (load IDs, source file names, offsets) for traceability and reprocessing.
- Validate types and required fields, quarantine bad rows, and de-duplicate records safely.
- Make loads idempotent so re-runs don’t create duplicates.
- Handle schema drift without breaking downstream models.
Mini mental picture
Think of staging as an airlock: data comes in messy, you stabilize pressure (validate, type-cast, add metadata), and then you open the inner door only when it’s safe. If something goes wrong, you can close the airlock and try again without contaminating the rest of the system.
Concept explained simply
The staging layer is a controlled, mostly raw zone where you:
- Ingest: land data exactly as received (or very lightly processed).
- Enrich with metadata: add what came in, when, and from where.
- Validate and standardize: types, encodings, timestamps.
- Prepare for merge: dedupe and sort out late/out-of-order events.
Mental model: separate Landing (immutable dump), Staging (structured, traceable prep), and Core (business-ready) layers. Landing can be files/object storage; Staging is usually tables you can query, with audit columns and validation results.
Core design decisions (quick checklist)
- Retention: ephemeral (hours–days) or persistent (weeks–months) staging?
- Immutability: append-only vs. upserted staging tables?
- Audit columns: load_id, source, row_number/offset, event_ts, ingested_at, checksum/hash, valid_flag/error_reason.
- Schema strategy: strict typed columns + catch-all JSON, or schema-on-read?
- Idempotency: how do you safely re-run a load?
- Partitioning: by date/source/domain for speed and purge.
- Security: PII tagging, masking, and encryption.
- Recovery: how to replay from files, offsets, or bookmarks.
Key staging patterns
1) Landing vs. Staging
- Landing: immutable files as received (e.g., CSV/JSON/Parquet). No transformations.
- Staging: queryable tables with metadata and light normalization.
When to keep both
Keep both when you need easy replays, audits, or regulatory traceability. If storage is cheap, persistent landing + staging provides strong safety nets.
2) Ephemeral vs. Persistent staging
- Ephemeral: drop after successful load. Use for large, frequent loads to save cost when replays are still possible from landing.
- Persistent: retain for weeks/months. Use for complex debugging, late-arriving data, or regulated domains.
3) Schema drift handling
- Use typed columns for known fields plus a flexible column (JSON/VARIANT) for unknowns.
- Track schema versions per load_id. Store a column list snapshot to help downstream.
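A minimal sketch of the typed-plus-flexible pattern, assuming a Snowflake-style VARIANT column; the table, the payload column, and the coupon_code field are illustrative, not part of the worked examples below:
CREATE TABLE stg_products_raw (
  load_id VARCHAR,
  schema_version VARCHAR,   -- which source column list this load used
  ingested_at TIMESTAMP,
  product_id VARCHAR,       -- typed column for a known field
  price DECIMAL(12,2),
  payload VARIANT           -- full record as received; absorbs new/unknown fields
);
-- A field that appeared after the table was designed can still be read from payload.
SELECT product_id,
       payload:coupon_code::VARCHAR AS coupon_code   -- NULL for loads made before the field existed
FROM stg_products_raw;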
4) Idempotent loads
- Derive a deterministic record_hash (e.g., a stable concatenation of key fields plus a hash) to detect duplicates.
- Use natural keys + event_ts, or source offsets, to pick the latest version per key (a re-run-safe insert is sketched below).
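A minimal sketch of a re-run-safe load, assuming the stg_orders_raw table from Example 1 below and a hypothetical stg_orders_incoming set holding the freshly parsed file:
-- Rows whose hash already exists for that file are skipped, so re-running the load adds nothing.
INSERT INTO stg_orders_raw
SELECT src.*
FROM stg_orders_incoming src        -- same column layout as stg_orders_raw
WHERE NOT EXISTS (
  SELECT 1
  FROM stg_orders_raw tgt
  WHERE tgt.source_file = src.source_file
    AND tgt.record_hash = src.record_hash
);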
5) File layout and partitioning
- Partition by load_date or event_date, and optionally by source_system or domain.
- Name files with monotonic IDs/timestamps to preserve processing order (see the layout sketch below).
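A minimal sketch of a partition-friendly layout, assuming a Snowflake-style warehouse where clustering keys play the role of physical partitions; all names here are illustrative:
CREATE TABLE stg_events_raw (
  load_date DATE,            -- partition-style key, usually derived from the landing path
  source_system VARCHAR,
  load_id VARCHAR,
  ingested_at TIMESTAMP,
  payload VARIANT
)
CLUSTER BY (load_date, source_system);
-- Landing file naming convention (not SQL) that preserves processing order:
--   landing/orders/load_date=2024-05-01/source=webshop/000123_20240501T020000.csv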
6) Auditability & lineage
- Always include: load_id, source_name, source_file_or_topic, row_number_or_offset, ingested_at, event_ts, record_hash, is_valid, error_reason.
- Maintain a loads control table with status, counts, and durations (a sketch follows below).
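A minimal sketch of such a control table; the etl_loads name, its columns, and the bind variables are assumptions to adapt to your orchestrator:
CREATE TABLE etl_loads (
  load_id VARCHAR,
  source_name VARCHAR,
  status VARCHAR,            -- 'running', 'succeeded', 'failed'
  rows_read INT,
  rows_valid INT,
  rows_quarantined INT,
  started_at TIMESTAMP,
  finished_at TIMESTAMP,
  error_message VARCHAR
);
-- Close out a load once the staging insert has committed.
UPDATE etl_loads
SET status           = 'succeeded',
    rows_read        = :rows_read,       -- counts computed by the load job
    rows_valid       = :rows_valid,
    rows_quarantined = :rows_quarantined,
    finished_at      = CURRENT_TIMESTAMP
WHERE load_id = :load_id;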
7) Security & PII
- Mask or tokenize sensitive fields in staging when possible; at minimum, restrict access and tag columns.
- Encrypt at rest and in transit; restrict who can read landing raw files.
8) Retention & purge
- Define purge policy by compliance needs, e.g., 30–90 days for staging, longer for landing if audit is required.
- Implement automated purges tied to partition keys.
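A minimal sketch of an automated purge, here keyed on ingested_at for the Example 1 table (the 60-day window is illustrative):
-- Remove staging rows older than the retention window; landing files keep the longer audit trail.
DELETE FROM stg_orders_raw
WHERE ingested_at < DATEADD(day, -60, CURRENT_DATE);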
9) Late and duplicate records
- Use ROW_NUMBER() or QUALIFY over a business key, ordered by event_ts then ingested_at (see the QUALIFY sketch below).
- Keep late arrivals in staging and re-merge into core with CDC-friendly logic.
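Where QUALIFY is available (e.g., Snowflake), the dedupe collapses into a single filter; this sketch assumes the stg_orders_raw table defined in Example 1 below:
SELECT *
FROM stg_orders_raw
WHERE is_valid = TRUE
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY order_id
          ORDER BY event_ts DESC, ingested_at DESC
        ) = 1;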
Worked examples
Example 1: Daily CSV orders to warehouse
- Create a raw staging table with audit fields.
- Load files from landing, compute a stable hash, and mark row validity.
- Produce a deduped staging view for downstream merges.
-- staging raw (append-only)
CREATE TABLE stg_orders_raw (
  load_id VARCHAR,
  source_file VARCHAR,
  source_row_number INT,
  ingested_at TIMESTAMP,
  event_ts TIMESTAMP,
  record_hash VARCHAR,
  is_valid BOOLEAN,
  error_reason VARCHAR,
  -- business fields
  order_id VARCHAR,
  customer_id VARCHAR,
  amount DECIMAL(12,2)
);
-- example insert (pseudo)
INSERT INTO stg_orders_raw
SELECT
  :load_id,
  :source_file,
  src.row_number,
  CURRENT_TIMESTAMP,
  TRY_TO_TIMESTAMP(src.order_time) AS event_ts,
  SHA2(CONCAT(COALESCE(src.order_id, ''), '|',
              COALESCE(src.customer_id, ''), '|',
              COALESCE(src.amount, '')), 256) AS record_hash,   -- COALESCE keeps the hash non-NULL when a field is missing
  TRY_TO_DECIMAL(src.amount, 12, 2) IS NOT NULL AS is_valid,
  CASE WHEN TRY_TO_DECIMAL(src.amount, 12, 2) IS NULL THEN 'bad_amount' END AS error_reason,
  src.order_id,
  src.customer_id,
  TRY_TO_DECIMAL(src.amount, 12, 2) AS amount                   -- precision/scale match the DECIMAL(12,2) column
FROM landing_csv src;
-- dedupe view (latest per order_id by event_ts then ingested_at)
CREATE OR REPLACE VIEW stg_orders_dedup AS
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY event_ts DESC, ingested_at DESC
         ) AS rn
  FROM stg_orders_raw
  WHERE is_valid = TRUE
)
SELECT * FROM ranked WHERE rn = 1;
Why this design works
- Append-only raw preserves evidence.
- Hash enables idempotent re-runs.
- Validation stays close to the source with clear error_reason.
- View isolates dedupe logic for core merges.
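As a follow-up sketch, a core merge can consume the dedup view directly; the core_orders table and its columns are assumptions:
MERGE INTO core_orders tgt
USING stg_orders_dedup src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND tgt.record_hash <> src.record_hash THEN
  UPDATE SET customer_id = src.customer_id,
             amount      = src.amount,
             event_ts    = src.event_ts,
             record_hash = src.record_hash
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, event_ts, record_hash)
  VALUES (src.order_id, src.customer_id, src.amount, src.event_ts, src.record_hash);
Comparing record_hash in the MATCHED branch skips no-op updates, which keeps re-runs cheap.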
Example 2: Kafka CDC to staging
- Use an append-only CDC staging table keyed by topic, partition, offset.
- Keep the operation type (c = create/insert, u = update, d = delete), event time, and payload JSON for drift.
- Derive a latest image per key when needed.
CREATE TABLE stg_cdc_raw (
  load_id VARCHAR,
  topic VARCHAR,
  partition INT,
  offset BIGINT,
  key VARCHAR,              -- business key of the changed record
  op VARCHAR,               -- 'c' = create, 'u' = update, 'd' = delete
  event_ts TIMESTAMP,
  ingested_at TIMESTAMP,
  record_hash VARCHAR,
  payload VARIANT           -- full message body; absorbs schema drift
);
-- Message-level uniqueness (informational only in many cloud warehouses; also enforce it during the load)
ALTER TABLE stg_cdc_raw ADD PRIMARY KEY (topic, partition, offset);
-- Latest image per key (example for upserts)
CREATE OR REPLACE VIEW stg_cdc_latest AS
WITH ordered AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY topic, key
           ORDER BY event_ts DESC, ingested_at DESC, offset DESC
         ) AS rn
  FROM stg_cdc_raw
)
SELECT * FROM ordered WHERE rn = 1 AND op <> 'd';
Notes
- The (topic, partition, offset) key identifies each message exactly once; because many warehouses treat primary keys as metadata only, deduplicate on these columns during the load itself so retries can't double-process (see the merge sketch below).
- Keep deletes (op='d') but exclude from latest view when assembling current state.
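One sketch of enforcing that uniqueness at load time is a merge keyed on the message coordinates; stg_cdc_incoming is a hypothetical table holding the freshly fetched batch:
MERGE INTO stg_cdc_raw tgt
USING stg_cdc_incoming src
  ON  tgt.topic     = src.topic
  AND tgt.partition = src.partition
  AND tgt.offset    = src.offset
WHEN NOT MATCHED THEN
  INSERT (load_id, topic, partition, offset, key, op,
          event_ts, ingested_at, record_hash, payload)
  VALUES (src.load_id, src.topic, src.partition, src.offset, src.key, src.op,
          src.event_ts, src.ingested_at, src.record_hash, src.payload);
-- Already-seen offsets match and are skipped, so replays never double-ingest.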
Example 3: REST API pagination with schema drift
- Maintain a bookmarks table for cursors (updated_at or next_page_token).
- Store typed columns + raw JSON payload.
- Log per-request metrics into a loads/control table.
CREATE TABLE stg_api_users_raw (
  load_id VARCHAR,
  page_token VARCHAR,
  ingested_at TIMESTAMP,
  event_ts TIMESTAMP,
  is_valid BOOLEAN,
  error_reason VARCHAR,
  user_id VARCHAR,
  email VARCHAR,
  full_name VARCHAR,
  payload VARIANT
);
-- After each page fetch, insert rows and update bookmark separately
-- Keeping payload allows you to survive new/unknown fields without breaking.
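A minimal sketch of the bookmarks table and the cursor update that runs after each page lands safely; the etl_bookmarks name and the 'users_api' source are assumptions:
CREATE TABLE etl_bookmarks (
  source_name VARCHAR,        -- e.g., 'users_api'
  cursor_value VARCHAR,       -- updated_at watermark or next_page_token
  updated_at TIMESTAMP
);
-- Advance the cursor only after the page's rows are committed to staging.
UPDATE etl_bookmarks
SET cursor_value = :next_page_token,
    updated_at   = CURRENT_TIMESTAMP
WHERE source_name = 'users_api';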
Resilience tactics
- On failure, restart from last bookmark (stored in a control table).
- Use TRY_ functions and is_valid flags to quarantine bad rows.
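Restarting then reduces to reading the last saved cursor before the first request, sketched against the etl_bookmarks table above:
SELECT cursor_value
FROM etl_bookmarks
WHERE source_name = 'users_api';   -- resume pagination from this point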
Exercises
Do these to solidify your staging design skills. After completing, check your work against the solution and the checklist.
Exercise 1: Design a CSV staging table with validation and dedupe
Scenario: You receive daily orders.csv with columns: order_id, customer_id, amount, order_time. Build a staging design that:
- Creates a raw staging table with audit fields.
- Loads rows, type-casts amount and order_time, flags invalid rows.
- Computes a deterministic record_hash.
- Exposes a deduped view selecting the latest record per order_id by event_ts then ingested_at.
Expected: CREATE TABLE, INSERT-SELECT, and a CREATE VIEW for dedupe.
Self-check checklist
- Did you include load_id, source_file, source_row_number, ingested_at, event_ts?
- Do invalid rows retain error_reason?
- Is dedupe order clear and deterministic?
- Can you safely re-run the same file without duplicates?
Exercise 2: CDC offsets and idempotency
Scenario: You ingest CDC from a Kafka topic with (partition, offset). Design a staging table and a view that returns the latest record per business key, ensuring re-runs don’t double-ingest messages.
Expected: Table with (topic, partition, offset) uniqueness; view using ROW_NUMBER over (topic, key) ordered by event_ts/offset.
Self-check checklist
- Is (topic, partition, offset) unique?
- Do you retain deletes but exclude them from the latest state?
- Is ordering stable using event_ts, ingested_at, offset?
Common mistakes and how to self-check
- No audit columns: You can’t trace or replay. Self-check: Can you answer “which file created this row?”
- Non-idempotent loads: Re-runs double data. Self-check: Simulate two identical runs and confirm counts stay stable (a duplicate-check query is sketched after this list).
- Dropping raw too early: Lose forensic ability. Self-check: Can you reconstruct if a downstream merge failed?
- Over-validating in staging: Breaking loads. Self-check: Use flags and quarantine instead of rejecting entire batches.
- Ignoring late events: Wrong “latest” row. Self-check: Sort by event_ts then ingested_at.
- No purge policy: Storage bloat. Self-check: Is there a documented retention per table/partition?
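A quick idempotency self-check, sketched against the Example 1 table: run the same file twice, then confirm no hash repeats within a source file.
SELECT source_file, record_hash, COUNT(*) AS copies
FROM stg_orders_raw
GROUP BY source_file, record_hash
HAVING COUNT(*) > 1;   -- any result rows mean the load is not idempotent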
Practical projects
- Retail orders pipeline: Land CSV daily, stage with audit/validation, dedupe per order_id, and merge into a core orders table. Add a 60-day purge job for staging.
- CDC product catalog: Ingest Kafka topic to staging with (topic, partition, offset) PK, maintain latest view, and implement soft-delete handling in core.
- API user directory: Pull paginated users, store typed + payload JSON, manage a bookmarks table, and build a drift report showing new fields discovered over time.
Who this is for
- ETL/ELT developers building reliable data pipelines.
- Data engineers responsible for ingestion and data quality.
- Analytics engineers who need stable upstream data.
Prerequisites
- Comfort with SQL (DDL, DML, window functions).
- Basic understanding of files/object storage and data types.
- Familiarity with batch or streaming ingestion concepts.
Learning path
- Start with landing patterns (file formats, partitioning, naming).
- Design staging tables with audit fields and validation flags.
- Implement idempotent load patterns (hashing, keys, offsets).
- Handle schema drift with typed + JSON hybrid designs.
- Build dedupe and late-event logic.
- Set retention and purge policies.
Mini challenge
You ingest IoT sensor readings with fields: device_id, reading_ts, temperature, humidity. Some devices resend old readings. Draft a staging design (columns, keys, dedupe order, and how to mark invalid rows). Keep it idempotent and ready for late data.
Next steps
- Implement the patterns in a sandbox with small sample files and a CDC simulator.
- Add a loads control table tracking counts and failures.
- Practice re-processing a failed load end-to-end.