Why this matters
The staging layer is where raw data first lands. Good staging design makes your pipelines reliable, debuggable, and fast to recover. In real ETL work, you will:
- Receive files, CDC streams, or API payloads and standardize them into consistent structures.
- Capture audit metadata (load IDs, source file names, offsets) for traceability and reprocessing.
- Validate types and required fields, quarantine bad rows, and de-duplicate records safely.
- Make loads idempotent so re-runs don’t create duplicates.
- Handle schema drift without breaking downstream models.
Mini mental picture
Think of staging as an airlock: data comes in messy, you stabilize pressure (validate, type-cast, add metadata), and then you open the inner door only when it’s safe. If something goes wrong, you can close the airlock and try again without contaminating the rest of the system.
Concept explained simply
The staging layer is a controlled, mostly raw zone where you:
- Ingest: land data exactly as received (or very lightly processed).
- Enrich with metadata: add what came in, when, and from where.
- Validate and standardize: types, encodings, timestamps.
- Prepare for merge: dedupe and sort out late/out-of-order events.
Mental model: separate Landing (immutable dump), Staging (structured, traceable prep), and Core (business-ready) layers. Landing can be files/object storage; Staging is usually tables you can query, with audit columns and validation results.
Core design decisions (quick checklist)
- Retention: ephemeral (hours–days) or persistent (weeks–months) staging?
- Immutability: append-only vs. upserted staging tables?
- Audit columns: load_id, source, row_number/offset, event_ts, ingested_at, checksum/hash, valid_flag/error_reason.
- Schema strategy: strict typed columns + catch-all JSON, or schema-on-read?
- Idempotency: how do you safely re-run a load?
- Partitioning: by date/source/domain for speed and purge.
- Security: PII tagging, masking, and encryption.
- Recovery: how to replay from files, offsets, or bookmarks.
Key staging patterns
1) Landing vs. Staging
- Landing: immutable files as received (e.g., CSV/JSON/Parquet). No transformations.
- Staging: queryable tables with metadata and light normalization.
When to keep both
Keep both when you need easy replays, audits, or regulatory traceability. If storage is cheap, persistent landing + staging provides strong safety nets.
2) Ephemeral vs. Persistent staging
- Ephemeral: drop after successful load. Use for large, frequent loads to save cost when replays are still possible from landing.
- Persistent: retain for weeks/months. Use for complex debugging, late-arriving data, or regulated domains.
3) Schema drift handling
- Use typed columns for known fields plus a flexible column (JSON/VARIANT) for unknowns.
- Track schema versions per load_id. Store a column list snapshot to help downstream.
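A minimal sketch of the typed-plus-flexible pattern, assuming a Snowflake-style VARIANT column; the table, the payload column, and the coupon_code field are illustrative, not part of the worked examples below:
CREATE TABLE stg_products_raw (
  load_id VARCHAR,
  schema_version VARCHAR,   -- which source column list this load used
  ingested_at TIMESTAMP,
  product_id VARCHAR,       -- typed column for a known field
  price DECIMAL(12,2),
  payload VARIANT           -- full record as received; absorbs new/unknown fields
);
-- A field that appeared after the table was designed can still be read from payload.
SELECT product_id,
       payload:coupon_code::VARCHAR AS coupon_code   -- NULL for loads made before the field existed
FROM stg_products_raw;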
4) Idempotent loads
- Derive a deterministic record_hash (e.g., a stable concatenation of key fields plus a hash) to detect duplicates.
- Use natural keys + event_ts, or source offsets, to pick the latest version per key (a re-run-safe insert is sketched below).
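A minimal sketch of a re-run-safe load, assuming the stg_orders_raw table from Example 1 below and a hypothetical stg_orders_incoming set holding the freshly parsed file:
-- Rows whose hash already exists for that file are skipped, so re-running the load adds nothing.
INSERT INTO stg_orders_raw
SELECT src.*
FROM stg_orders_incoming src        -- same column layout as stg_orders_raw
WHERE NOT EXISTS (
  SELECT 1
  FROM stg_orders_raw tgt
  WHERE tgt.source_file = src.source_file
    AND tgt.record_hash = src.record_hash
);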
5) File layout and partitioning
- Partition by load_date or event_date, and optionally by source_system or domain.
- Name files with monotonic IDs/timestamps to preserve processing order (see the layout sketch below).
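A minimal sketch of a partition-friendly layout, assuming a Snowflake-style warehouse where clustering keys play the role of physical partitions; all names here are illustrative:
CREATE TABLE stg_events_raw (
  load_date DATE,            -- partition-style key, usually derived from the landing path
  source_system VARCHAR,
  load_id VARCHAR,
  ingested_at TIMESTAMP,
  payload VARIANT
)
CLUSTER BY (load_date, source_system);
-- Landing file naming convention (not SQL) that preserves processing order:
--   landing/orders/load_date=2024-05-01/source=webshop/000123_20240501T020000.csv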
6) Auditability & lineage
- Always include: load_id, source_name, source_file_or_topic, row_number_or_offset, ingested_at, event_ts, record_hash, is_valid, error_reason.
- Maintain a loads control table with status, counts, and durations (a sketch follows below).
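A minimal sketch of such a control table; the etl_loads name, its columns, and the bind variables are assumptions to adapt to your orchestrator:
CREATE TABLE etl_loads (
  load_id VARCHAR,
  source_name VARCHAR,
  status VARCHAR,            -- 'running', 'succeeded', 'failed'
  rows_read INT,
  rows_valid INT,
  rows_quarantined INT,
  started_at TIMESTAMP,
  finished_at TIMESTAMP,
  error_message VARCHAR
);
-- Close out a load once the staging insert has committed.
UPDATE etl_loads
SET status           = 'succeeded',
    rows_read        = :rows_read,       -- counts computed by the load job
    rows_valid       = :rows_valid,
    rows_quarantined = :rows_quarantined,
    finished_at      = CURRENT_TIMESTAMP
WHERE load_id = :load_id;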
7) Security & PII
- Mask or tokenize sensitive fields in staging when possible; at minimum, restrict access and tag columns.
- Encrypt at rest and in transit; restrict who can read landing raw files.
8) Retention & purge
- Define purge policy by compliance needs, e.g., 30–90 days for staging, longer for landing if audit is required.
- Implement automated purges tied to partition keys.
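A minimal sketch of an automated purge, here keyed on ingested_at for the Example 1 table (the 60-day window is illustrative):
-- Remove staging rows older than the retention window; landing files keep the longer audit trail.
DELETE FROM stg_orders_raw
WHERE ingested_at < DATEADD(day, -60, CURRENT_DATE);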
9) Late and duplicate records
- Use ROW_NUMBER() or QUALIFY over a business key, ordered by event_ts then ingested_at (see the QUALIFY sketch below).
- Keep late arrivals in staging and re-merge into core with CDC-friendly logic.
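Where QUALIFY is available (e.g., Snowflake), the dedupe collapses into a single filter; this sketch assumes the stg_orders_raw table defined in Example 1 below:
SELECT *
FROM stg_orders_raw
WHERE is_valid = TRUE
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY order_id
          ORDER BY event_ts DESC, ingested_at DESC
        ) = 1;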
Worked examples
Example 1: Daily CSV orders to warehouse
- Create a raw staging table with audit fields.
- Load files from landing, compute a stable hash, and mark row validity.
- Produce a deduped staging view for downstream merges.
-- staging raw (append-only)
CREATE TABLE stg_orders_raw (
  load_id VARCHAR,
  source_file VARCHAR,
  source_row_number INT,
  ingested_at TIMESTAMP,
  event_ts TIMESTAMP,
  record_hash VARCHAR,
  is_valid BOOLEAN,
  error_reason VARCHAR,
  -- business fields
  order_id VARCHAR,
  customer_id VARCHAR,
  amount DECIMAL(12,2)
);
-- example insert (pseudo)
INSERT INTO stg_orders_raw
SELECT
  :load_id,
  :source_file,
  src.row_number,
  CURRENT_TIMESTAMP,
  TRY_TO_TIMESTAMP(src.order_time) AS event_ts,
  SHA2(CONCAT(COALESCE(src.order_id, ''), '|',
              COALESCE(src.customer_id, ''), '|',
              COALESCE(src.amount, '')), 256) AS record_hash,   -- COALESCE keeps the hash non-NULL when a field is missing
  TRY_TO_DECIMAL(src.amount, 12, 2) IS NOT NULL AS is_valid,
  CASE WHEN TRY_TO_DECIMAL(src.amount, 12, 2) IS NULL THEN 'bad_amount' END AS error_reason,
  src.order_id,
  src.customer_id,
  TRY_TO_DECIMAL(src.amount, 12, 2) AS amount                   -- precision/scale match the DECIMAL(12,2) column
FROM landing_csv src;
-- dedupe view (latest per order_id by event_ts then ingested_at)
CREATE OR REPLACE VIEW stg_orders_dedup AS
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY event_ts DESC, ingested_at DESC
         ) AS rn
  FROM stg_orders_raw
  WHERE is_valid = TRUE
)
SELECT * FROM ranked WHERE rn = 1;
Why this design works
- Append-only raw preserves evidence.
- Hash enables idempotent re-runs.
- Validation stays close to the source with clear error_reason.
- View isolates dedupe logic for core merges.
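As a follow-up sketch, a core merge can consume the dedup view directly; the core_orders table and its columns are assumptions:
MERGE INTO core_orders tgt
USING stg_orders_dedup src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND tgt.record_hash <> src.record_hash THEN
  UPDATE SET customer_id = src.customer_id,
             amount      = src.amount,
             event_ts    = src.event_ts,
             record_hash = src.record_hash
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, event_ts, record_hash)
  VALUES (src.order_id, src.customer_id, src.amount, src.event_ts, src.record_hash);
Comparing record_hash in the MATCHED branch skips no-op updates, which keeps re-runs cheap.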
Example 2: Kafka CDC to staging
- Use an append-only CDC staging table keyed by topic, partition, offset.
- Keep the operation type (c = create/insert, u = update, d = delete), event time, and payload JSON for drift.
- Derive a latest image per key when needed.
CREATE TABLE stg_cdc_raw (
  load_id VARCHAR,
  topic VARCHAR,
  partition INT,
  offset BIGINT,
  key VARCHAR,              -- business key of the changed record
  op VARCHAR,               -- 'c' = create, 'u' = update, 'd' = delete
  event_ts TIMESTAMP,
  ingested_at TIMESTAMP,
  record_hash VARCHAR,
  payload VARIANT           -- full message body; absorbs schema drift
);
-- Message-level uniqueness (informational only in many cloud warehouses; also enforce it during the load)
ALTER TABLE stg_cdc_raw ADD PRIMARY KEY (topic, partition, offset);
-- Latest image per key (example for upserts)
CREATE OR REPLACE VIEW stg_cdc_latest AS
WITH ordered AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY topic, key
           ORDER BY event_ts DESC, ingested_at DESC, offset DESC
         ) AS rn
  FROM stg_cdc_raw
)
SELECT * FROM ordered WHERE rn = 1 AND op <> 'd';
Notes
- The (topic, partition, offset) key identifies each message exactly once; because many warehouses treat primary keys as metadata only, deduplicate on these columns during the load itself so retries can't double-process (see the merge sketch below).
- Keep deletes (op='d') but exclude from latest view when assembling current state.
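One sketch of enforcing that uniqueness at load time is a merge keyed on the message coordinates; stg_cdc_incoming is a hypothetical table holding the freshly fetched batch:
MERGE INTO stg_cdc_raw tgt
USING stg_cdc_incoming src
  ON  tgt.topic     = src.topic
  AND tgt.partition = src.partition
  AND tgt.offset    = src.offset
WHEN NOT MATCHED THEN
  INSERT (load_id, topic, partition, offset, key, op,
          event_ts, ingested_at, record_hash, payload)
  VALUES (src.load_id, src.topic, src.partition, src.offset, src.key, src.op,
          src.event_ts, src.ingested_at, src.record_hash, src.payload);
-- Already-seen offsets match and are skipped, so replays never double-ingest.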
Example 3: REST API pagination with schema drift
- Maintain a bookmarks table for cursors (updated_at or next_page_token).
- Store typed columns + raw JSON payload.
- Log per-request metrics into a loads/control table.
CREATE TABLE stg_api_users_raw (
  load_id VARCHAR,
  page_token VARCHAR,
  ingested_at TIMESTAMP,
  event_ts TIMESTAMP,
  is_valid BOOLEAN,
  error_reason VARCHAR,
  user_id VARCHAR,
  email VARCHAR,
  full_name VARCHAR,
  payload VARIANT
);
-- After each page fetch, insert rows and update bookmark separately
-- Keeping payload allows you to survive new/unknown fields without breaking.
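A minimal sketch of the bookmarks table and the cursor update that runs after each page lands safely; the etl_bookmarks name and the 'users_api' source are assumptions:
CREATE TABLE etl_bookmarks (
  source_name VARCHAR,        -- e.g., 'users_api'
  cursor_value VARCHAR,       -- updated_at watermark or next_page_token
  updated_at TIMESTAMP
);
-- Advance the cursor only after the page's rows are committed to staging.
UPDATE etl_bookmarks
SET cursor_value = :next_page_token,
    updated_at   = CURRENT_TIMESTAMP
WHERE source_name = 'users_api';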
Resilience tactics
- On failure, restart from last bookmark (stored in a control table).
- Use TRY_ functions and is_valid flags to quarantine bad rows.
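Restarting then reduces to reading the last saved cursor before the first request, sketched against the etl_bookmarks table above:
SELECT cursor_value
FROM etl_bookmarks
WHERE source_name = 'users_api';   -- resume pagination from this point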
Exercises
Do these to solidify your staging design skills. After completing, check your work against the solution and the checklist.
Exercise 1: Design a CSV staging table with validation and dedupe
Scenario: You receive daily orders.csv with columns: order_id, customer_id, amount, order_time. Build a staging design that:
- Creates a raw staging table with audit fields.
- Loads rows, type-casts amount and order_time, flags invalid rows.
- Computes a deterministic record_hash.
- Exposes a deduped view selecting the latest record per order_id by event_ts then ingested_at.
Expected: CREATE TABLE, INSERT-SELECT, and a CREATE VIEW for dedupe.
Self-check checklist
- Did you include load_id, source_file, source_row_number, ingested_at, event_ts?
- Do invalid rows retain error_reason?
- Is dedupe order clear and deterministic?
- Can you safely re-run the same file without duplicates?
Exercise 2: CDC offsets and idempotency
Scenario: You ingest CDC from a Kafka topic with (partition, offset). Design a staging table and a view that returns the latest record per business key, ensuring re-runs don’t double-ingest messages.
Expected: Table with (topic, partition, offset) uniqueness; view using ROW_NUMBER over (topic, key) ordered by event_ts/offset.
Self-check checklist
- Is (topic, partition, offset) unique?
- Do you retain deletes but exclude them from the latest state?
- Is ordering stable using event_ts, ingested_at, offset?
Common mistakes and how to self-check
- No audit columns: You can’t trace or replay. Self-check: Can you answer “which file created this row?”
- Non-idempotent loads: Re-runs double data. Self-check: Simulate two identical runs and confirm counts stay stable (a duplicate-check query is sketched after this list).
- Dropping raw too early: Lose forensic ability. Self-check: Can you reconstruct if a downstream merge failed?
- Over-validating in staging: Breaking loads. Self-check: Use flags and quarantine instead of rejecting entire batches.
- Ignoring late events: Wrong “latest” row. Self-check: Sort by event_ts then ingested_at.
- No purge policy: Storage bloat. Self-check: Is there a documented retention per table/partition?
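A quick idempotency self-check, sketched against the Example 1 table: run the same file twice, then confirm no hash repeats within a source file.
SELECT source_file, record_hash, COUNT(*) AS copies
FROM stg_orders_raw
GROUP BY source_file, record_hash
HAVING COUNT(*) > 1;   -- any result rows mean the load is not idempotent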
Practical projects
- Retail orders pipeline: Land CSV daily, stage with audit/validation, dedupe per order_id, and merge into a core orders table. Add a 60-day purge job for staging.
- CDC product catalog: Ingest Kafka topic to staging with (topic, partition, offset) PK, maintain latest view, and implement soft-delete handling in core.
- API user directory: Pull paginated users, store typed + payload JSON, manage a bookmarks table, and build a drift report showing new fields discovered over time.
Who this is for
- ETL/ELT developers building reliable data pipelines.
- Data engineers responsible for ingestion and data quality.
- Analytics engineers who need stable upstream data.
Prerequisites
- Comfort with SQL (DDL, DML, window functions).
- Basic understanding of files/object storage and data types.
- Familiarity with batch or streaming ingestion concepts.
Learning path
- Start with landing patterns (file formats, partitioning, naming).
- Design staging tables with audit fields and validation flags.
- Implement idempotent load patterns (hashing, keys, offsets).
- Handle schema drift with typed + JSON hybrid designs.
- Build dedupe and late-event logic.
- Set retention and purge policies.
Mini challenge
You ingest IoT sensor readings with fields: device_id, reading_ts, temperature, humidity. Some devices resend old readings. Draft a staging design (columns, keys, dedupe order, and how to mark invalid rows). Keep it idempotent and ready for late data.
Next steps
- Implement the patterns in a sandbox with small sample files and a CDC simulator.
- Add a loads control table tracking counts and failures.
- Practice re-processing a failed load end-to-end.