Why this matters
The Raw → Staging → Curated layering pattern keeps your data pipeline reliable, auditable, and easy to change. It separates concerns: ingest everything safely (Raw), normalize and clean (Staging), then deliver business-ready models (Curated).
- Real tasks you will do: design layer boundaries, write transforms, set data quality checks, manage incremental loads, and document lineage.
- Common synonyms: Raw/Staging/Curated is often called Bronze/Silver/Gold.
- Works for both ETL and ELT. In modern analytics, ELT is typical: Extract and Load to Raw, then Transform inside the warehouse.
Concept explained simply
Think of a clean kitchen workflow:
- Raw (Bronze): groceries delivered exactly as they arrive. No changes.
- Staging (Silver): wash, chop, and portion. Clean but not a final dish.
- Curated (Gold): plated meals ready to serve to the business.
- Raw: append-only, schema-on-read; keep source fidelity.
- Staging: standardized types, deduplicated keys, light business-agnostic cleaning.
- Curated: business logic applied; dimensions and facts, or wide analysis-ready tables.
ELT vs ETL in this pattern
- ETL: transform before loading; still keep a Raw archive if feasible for audit.
- ELT: load first, transform later; Raw is your immutable source-of-truth snapshot.
- Both benefit from clear layer contracts and data quality checks.
What each layer owns
- Raw layer
- Content: source dumps, CDC logs, JSON blobs, CSVs, or parquet from ingestion tools.
- Rules: no business logic; preserve source column names, types, and nulls; immutable or only audited corrections.
- Typical checks: file arrival, row count, schema drift alerting.
- Staging layer
- Content: typed, cleaned tables mirroring source entities (customers, orders).
- Rules: cast types, parse JSON, normalize enums, deduplicate by primary key, apply soft deletes, standardize timestamps and time zones.
- Typical checks: primary key uniqueness, not-null for keys, basic referential checks if possible.
- Curated layer
- Content: business models, such as dimensions and facts (star schema) or data marts by domain.
- Rules: conform business definitions, surrogate keys, SCD handling, derived metrics, grain clearly defined.
- Typical checks: metric reasonableness, dimensional conformance, slowly changing dimension integrity.
Naming and storage conventions (practical)
- Schemas: raw_*, stg_*, cur_*/mart_*; e.g., raw_payments, stg_payments, mart_finance.
- Tables: use source and entity; e.g., stg_payments__transactions.
- Partition/cluster: by load_date or event_date for Raw; by business date keys for Curated.
- Retention: Raw long-term for audit if storage allows; Staging and Curated retain as required by downstream usage and compliance.
Worked examples
Example 1: E-commerce orders (ELT)
- Raw: raw_shopify.orders as JSON with nested line_items and customer info.
- Staging:
- stg_shopify__orders: one row per order; typed columns; parse nested totals.
- stg_shopify__order_lines: explode line_items; one row per line.
- Curated:
- dim_customer: latest customer attributes (SCD2 if required).
- fct_orders: grain = one row per order; revenue, discount, tax, date keys.
Sample staging SQL snippet
-- stg_shopify__orders
select
cast(o:id as string) as order_id,
cast(o:created_at as timestamp) as created_at_utc,
cast(o:customer:id as string) as customer_id,
cast(o:total_price as numeric(18,2)) as total_price,
cast(o:currency as string) as currency,
cast(o:cancelled_at as timestamp) as cancelled_at_utc,
current_date as load_date
from raw_shopify.orders as r,
lateral flatten(input => r.record) o;
Example 2: CRM CDC (incremental merge)
- Raw: raw_crm.accounts_cdc with op_type (I/U/D), op_ts, payload.
- Staging: stg_crm__accounts with latest row per primary key, apply soft delete flag.
- Curated: dim_account SCD2 to track changes in industry or segment.
Curated SCD2 merge sketch
-- Pseudocode for dim_account SCD2
merge into cur_model.dim_account t
using stg_crm__accounts s
on t.nk_account_id = s.account_id and t.is_current = true
when matched and hash(t.attributes) != hash(s.attributes) then update set
t.is_current = false, t.valid_to = s.updated_at
when not matched then insert (nk_account_id, surrogate_key, attributes..., is_current, valid_from)
values (s.account_id, generate_surrogate(), s.attributes..., true, s.updated_at);
Example 3: Event logs
- Raw: raw_events.web with semi-structured columns (event_name, ts, user_id, properties).
- Staging: stg_events__web normalized to one row per event; extract session_id and device_type.
- Curated: fct_events with grain event; dim_user_session for session analytics.
Designing your layers (step-by-step)
- List source entities and keys (business keys, timestamps, CDC markers).
- Define Raw contracts: landing format, partitioning, and schema drift handling.
- Define Staging rules: type casting, dedup, null handling, timezone standardization, soft deletes.
- Define Curated models: grains, dimensions/facts, metrics, SCD strategy, and naming.
- Plan incremental logic: watermarks or change tracking; choose MERGE vs INSERT-OVERWRITE safely.
- Add data quality checks per layer and document expectations.
Governance and privacy
- PII handling: mask or tokenize in Staging; only expose needed fields in Curated.
- Access control: Raw restricted; broader read access to Curated.
- Auditability: keep load_date, _ingested_at, and source filenames where useful.
Data quality checks by layer
- [ ] Raw: file/row counts within expected range
- [ ] Raw: schema drift alert if new/removed columns
- [ ] Staging: primary key uniqueness
- [ ] Staging: not-null constraints on keys and required fields
- [ ] Curated: referential integrity between facts and dims
- [ ] Curated: metric sanity (e.g., revenue >= 0, quantities positive)
Exercises
Complete these. Then open the Quick Test at the end. Everyone can take the test; only logged-in users will see saved progress.
Exercise 1 (ex1): Layer mapping for a payments API
You ingest payments_api transactions as JSON. Design Raw, Staging, and Curated models.
- Raw: raw_payments.transactions (json payload), partitioned by load_date.
- Staging: stg_payments__transactions; identify primary key and types; deduplicate.
- Curated: fct_payments; dim_payment_method; ensure grain is one row per payment.
Checklist
- [ ] Named schemas/tables follow conventions
- [ ] Primary key and dedup strategy defined
- [ ] Incremental watermark chosen
- [ ] Curated grain and dimensions clear
Exercise 2 (ex2): From raw orders to curated fact
Write staging and curated transformations for orders with nested line items.
- Staging: stg_orders__orders and stg_orders__order_lines
- Curated: fct_orders at order grain with revenue and discount
Checklist
- [ ] Correct grain and joins between orders and lines
- [ ] Currency and numeric types standardized
- [ ] Incremental MERGE or insert-overwrite is safe and idempotent
Common mistakes and how to self-check
- Mixing business logic into Staging
- Self-check: does a rule depend on a business definition? If yes, move it to Curated.
- Dropping columns in Raw
- Self-check: Raw must preserve source fidelity. Only add metadata; never lose data.
- Unclear grain in facts
- Self-check: can you state the fact table grain in one sentence? If not, clarify.
- No dedup strategy
- Self-check: show the unique key and logic for resolving duplicates.
- PII exposure in Curated
- Self-check: verify masking/tokenization and access controls.
Practical projects
- Project 1: Build a mini ecommerce mart
- Inputs: orders and customers in Raw as JSON
- Outputs: stg_orders__orders, stg_orders__order_lines, dim_customer (SCD2), fct_orders
- Acceptance: all keys unique, null handling documented, test queries match expected totals
- Project 2: CDC pipeline for CRM accounts
- Inputs: raw_crm.accounts_cdc with op_type
- Outputs: stg_crm__accounts, dim_account with history
- Acceptance: updates create new SCD2 rows; current flag accurate
- Project 3: Event analytics foundation
- Inputs: raw_events.web
- Outputs: stg_events__web, dim_user_session, fct_events
- Acceptance: sessionization consistent; event counts reconcile with Raw
Learning path
Who this is for
Analytics Engineers, BI Developers, and Data Engineers who need reliable, auditable pipelines.
Prerequisites
- Comfortable with SQL (joins, window functions)
- Basic data modeling concepts (dimensions, facts)
- Understanding of incremental loads and primary keys
Next steps
- Add automated tests for each layer (keys, not nulls, accepted values)
- Document model contracts: inputs, outputs, grain, freshness SLAs
- Introduce performance tuning: clustering, partitions, pruning
Mini challenge
Your sales team changes the definition of "active customer" from last 30 days to 90 days. Where should this change live and why? Update your Curated layer design to reflect the new definition while keeping Staging untouched.
Hint
Definitions that reflect business meaning belong in Curated. Staging remains business-agnostic.
Quick test
Everyone can take the test; only logged-in users will have their progress saved.