
Source To Target Mapping

Learn Source To Target Mapping for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Source-to-Target Mapping (STM) is the blueprint that explains how raw data from source systems becomes analytics-ready data in your warehouse, lakehouse, or marts. Teams rely on it to build pipelines consistently, review changes, audit lineage, and troubleshoot issues. Without an STM, you risk mismatched fields, broken transformations, and unreliable dashboards.

  • Real tasks you will do: define field mappings, document transformations, align business rules with data modeling, guide ETL/ELT code, and support QA/data governance.
  • Outcomes: fewer defects, faster onboarding for teammates, predictable releases, and clear lineage for audits.

Who this is for

  • Data engineers building or maintaining pipelines.
  • Analytics engineers and BI developers consuming curated data.
  • Data stewards and QA validating data quality and lineage.

Prerequisites

  • Basic SQL (SELECT, JOIN, CAST) and familiarity with data types.
  • Understanding of dimensional modeling (facts/dimensions) or data vault basics.
  • Comfort with profiling data (nulls, uniqueness, ranges).

Concept explained simply

An STM is a structured document that answers: From which source field does each target field come, and what transformation happens along the way? It also records data types, business rules, and data quality checks.

Mental model

Think of STM as a GPS route for data. The source is your starting point. The target is your destination. Transformations are the turns, speed limits, and detours. Quality checks are your safety checks. Versioning is the trip log.

A minimal STM row example:
Target Table: dim_customer
Target Column: customer_key (BIGINT, PK)
Source: crm.customers.id
Transform: generate surrogate key via hash(id) or sequence
DQ: must be unique, not null
Notes: used to join facts
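Such rows are easiest to review and automate when they are machine-readable. A minimal sketch in Python (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class STMRow:
    """One row of a source-to-target map (illustrative field names)."""
    target_table: str
    target_column: str
    data_type: str
    source: str          # e.g. "crm.customers.id"
    transform: str       # human-readable rule or SQL expression
    dq_rules: list = field(default_factory=list)
    notes: str = ""

row = STMRow(
    target_table="dim_customer",
    target_column="customer_key",
    data_type="BIGINT",
    source="crm.customers.id",
    transform="surrogate key via hash(id) or sequence",
    dq_rules=["unique", "not null"],
    notes="used to join facts",
)
```

A structure like this can later feed code generation or automated checks, instead of living only in a spreadsheet.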

Core components of a Source-to-Target Map

  • Identifiers: source system, schema, table, column; target schema, table, column.
  • Data types and constraints: type, length/precision, nullability, primary/surrogate keys.
  • Transformations: expressions, standardization (trimming, case), business rules.
  • Joins and lookups: keys, reference tables, SCD handling (Type 1/2).
  • Defaults and fallbacks: constants, null-handling, error routing.
  • Data quality checks: uniqueness, referential integrity, domain lists, ranges.
  • Lineage and versions: who changed what, when, and why.
  • Load details: frequency, mode (full/incremental/CDC), late-arriving handling.
Reusable STM checklist
  • Every target column mapped or intentionally left unmapped (with reason)
  • Data types compatible and sized safely
  • Clear transformation logic with examples
  • Join keys and lookup logic documented
  • DQ rules with pass/fail actions
  • Load mode and schedule noted
  • Owner/reviewer and version/date recorded
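The first checklist item (every target column mapped or intentionally unmapped, with a reason) is easy to automate once the STM is machine-readable. A minimal sketch, assuming the STM is a list of dicts with illustrative key names:

```python
def unmapped_without_reason(target_columns, stm_rows):
    """Return target columns that have neither a source mapping nor a documented reason."""
    mapped = {r["target_column"] for r in stm_rows if r.get("source")}
    excused = {r["target_column"] for r in stm_rows if r.get("unmapped_reason")}
    return sorted(set(target_columns) - mapped - excused)

stm = [
    {"target_column": "customer_key", "source": "crm.customers.id"},
    {"target_column": "load_audit_id", "unmapped_reason": "set by the loader"},
]
print(unmapped_without_reason(["customer_key", "load_audit_id", "email"], stm))
# -> ['email']
```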

Step-by-step: Build an STM

  1. Collect context: business purpose, target model (star, vault, or 3NF), SLAs.
  2. Profile sources: data types, null rates, distinct counts, min/max, example values.
  3. Define targets: table/field list with data types and constraints.
  4. Map fields: for each target column, pick a source and write transformations.
  5. Add quality rules: what to validate and how to handle failures.
  6. Plan loads: full vs incremental, CDC keys, watermark columns, time-zone handling.
  7. Document test cases: row-level examples and expected outputs.
  8. Review and version: peer review and change log.
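Step 2 (profile sources) usually starts with a few simple statistics per column. A sketch over an in-memory sample, assuming rows are dicts:

```python
def profile(rows, column):
    """Null rate, distinct count, and min/max for one column of a row list."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

sample = [{"country": "us"}, {"country": "DE"}, {"country": None}, {"country": "us"}]
print(profile(sample, "country"))
```

In practice you would run the equivalent SQL (COUNT, COUNT DISTINCT, MIN/MAX) against the source system, but the quantities you record in the STM are the same.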
Example transformation snippets
-- Standardize country
UPPER(TRIM(src.country)) AS country_name

-- Parse dates safely
TRY_CAST(src.signup_date AS DATE) AS signup_date

-- Currency conversion
ROUND(src.amount * fx.rate, 2) AS amount_usd

-- Surrogate key (hash example)
TO_HEX(SHA256(CONCAT(COALESCE(src.id,''), '|', COALESCE(src.system,'')))) AS customer_key
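The last snippet's hash-based surrogate key (SHA-256 over a null-safe, delimiter-joined concatenation of natural keys) looks like this outside SQL; a sketch with Python's hashlib:

```python
import hashlib

def surrogate_key(*parts, sep="|"):
    """Hex SHA-256 over null-safe, delimiter-joined key parts (mirrors the SQL snippet)."""
    joined = sep.join("" if p is None else str(p) for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

print(surrogate_key(42, "crm"))    # stable 64-char hex string
print(surrogate_key(None, "crm"))  # NULLs collapse to '' so the key is still defined
```

The delimiter matters: without it, ("ab", "c") and ("a", "bc") would hash to the same key.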

Worked examples

Example 1: CRM to dim_customer

  • Target: dim_customer(customer_key, natural_key, full_name, email, country, signup_date)
  • Sources: crm.customers(id, first_name, last_name, email, country, created_at)
Mapping rows
customer_key  <- HASH(id)
natural_key   <- id
full_name     <- TRIM(first_name) || ' ' || TRIM(last_name)
email         <- LOWER(TRIM(email))
country       <- UPPER(TRIM(country))  -- if null, default 'UNKNOWN'
signup_date   <- CAST(created_at AS DATE)
DQ: email contains '@'; natural_key unique; signup_date not in future
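The mapping rows above translate directly into row-level code. A sketch applying Example 1's rules to one record (the date handling is a naive stand-in for CAST(created_at AS DATE)):

```python
def map_customer(src):
    """Apply Example 1's transformations to one crm.customers record."""
    country = (src.get("country") or "UNKNOWN").strip().upper()
    return {
        "natural_key": src["id"],
        "full_name": f"{src['first_name'].strip()} {src['last_name'].strip()}",
        "email": src["email"].strip().lower(),
        "country": country,
        "signup_date": src["created_at"][:10],  # naive CAST(created_at AS DATE)
    }

row = map_customer({
    "id": 7, "first_name": " Ada ", "last_name": "Lovelace",
    "email": "ADA@Example.com ", "country": None,
    "created_at": "2025-03-01T10:00:00",
})
print(row["email"], row["country"])  # ada@example.com UNKNOWN
```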

Example 2: Orders fact with currency

  • Target: fact_order(order_id, customer_key, order_ts_utc, amount_usd)
  • Sources: oms.orders(id, customer_id, order_ts_local, amount, currency), ref.fx_rates(currency, rate_to_usd, valid_date)
Mapping rows
order_id        <- id
customer_key    <- HASH(customer_id)
order_ts_utc    <- CONVERT_TZ(order_ts_local, src_tz, 'UTC')
amount_usd      <- ROUND(amount * fx.rate_to_usd, 2)  -- join on currency & date window
DQ: amount >= 0; currency in list; customer_key exists in dim_customer
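The CONVERT_TZ rule above normalizes order timestamps to UTC. The same rule in Python with the standard-library zoneinfo (the source zone here is an illustrative assumption; in a real STM it would be documented per source):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(ts_local: str, src_tz: str) -> str:
    """Interpret a naive local timestamp in src_tz and express it in UTC."""
    local = datetime.fromisoformat(ts_local).replace(tzinfo=ZoneInfo(src_tz))
    return local.astimezone(ZoneInfo("UTC")).isoformat()

print(to_utc("2025-06-01T12:00:00", "America/New_York"))
# -> 2025-06-01T16:00:00+00:00 (EDT is UTC-4 in June)
```

Record the source zone in the STM itself; "order_ts_local" with no zone is exactly the ambiguity the mapping is meant to remove.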

Example 3: JSON events to parquet table

  • Target: events_flat(event_id, user_id, event_type, event_ts, device_os)
  • Source: raw.events(payload VARIANT/JSON)
Mapping rows
event_id   <- payload:id
user_id    <- payload:user.id
event_type <- UPPER(payload:type)
event_ts   <- TO_TIMESTAMP(payload:ts)
device_os  <- COALESCE(payload:device.os, 'UNKNOWN')
DQ: event_id unique; event_ts not null
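The path expressions above (payload:user.id and so on) correspond to nested dictionary access once the payload is parsed. A flattening sketch whose default mirrors the COALESCE rule:

```python
import json

def flatten_event(payload_json: str) -> dict:
    """Flatten one raw event payload into the events_flat columns."""
    p = json.loads(payload_json)
    return {
        "event_id": p["id"],
        "user_id": p["user"]["id"],
        "event_type": p["type"].upper(),
        "event_ts": p["ts"],
        "device_os": p.get("device", {}).get("os") or "UNKNOWN",
    }

raw = '{"id": "e1", "user": {"id": 9}, "type": "click", "ts": "2025-01-08T00:00:00Z"}'
print(flatten_event(raw))
```

Note the asymmetry: required paths (id, user.id) fail loudly if missing, while optional ones (device.os) fall back to 'UNKNOWN'; the STM should say which is which for every field.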

Common mistakes and self-check

  • Mistake: Unmapped target columns. Fix: mark as derived/constant or document N/A explicitly.
  • Mistake: Implicit type casts. Fix: write explicit CAST/TRY_CAST and note expected failures.
  • Mistake: Missing time-zone normalization. Fix: standardize to UTC and record the rule.
  • Mistake: Weak DQ rules. Fix: add at least uniqueness, null checks, domain checks, and RI checks.
  • Mistake: No examples. Fix: include at least one before/after example per complex rule.
Self-check mini audit
  • Do two readers produce the same SQL from your STM?
  • Can QA test each rule without guessing?
  • Can you trace any dashboard field back to a source field?

Exercises

Do these to solidify your understanding. The Quick Test at the end checks core concepts.

Exercise ex1 — Customer Dimension STM

Design an STM for a customer dimension with fields: customer_key (PK), natural_key, email_normalized, lifecycle_segment, signup_date. Sources: crm.customers(id, email, created_at, status). Rules: email_normalized is lower-cased and trimmed; lifecycle_segment is 'ACTIVE' if status in ('ACTIVE','PAID'), else 'INACTIVE'; signup_date is date(created_at). Add DQ rules and load mode.

  • I wrote one mapping row per target column.
  • I included explicit CAST/TRY_CAST.
  • I added at least three DQ checks.

Practical projects

Project 1 — Sales mart STM
  1. Model dim_customer, dim_product, fact_sales.
  2. Profile CRM and OMS sources; note quirks.
  3. Create detailed STM with transformations and DQ.
  4. Have a peer review it; iterate once.
Project 2 — CDC pipeline STM
  1. Define target for incremental loads (watermark last_updated).
  2. Document CDC keys, deleted-flag logic, and late-arriving handling.
  3. Specify idempotency and deduplication rules.
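Project 2's watermark, idempotency, and deduplication rules can be sketched as a filter plus last-write-wins dedup (column names are illustrative):

```python
def incremental_batch(rows, watermark):
    """Keep rows newer than the watermark; dedupe by key, keeping the latest update."""
    fresh = [r for r in rows if r["last_updated"] > watermark]
    latest = {}
    for r in sorted(fresh, key=lambda r: r["last_updated"]):
        latest[r["id"]] = r  # later updates overwrite earlier ones
    new_watermark = max((r["last_updated"] for r in fresh), default=watermark)
    return list(latest.values()), new_watermark

rows = [
    {"id": 1, "last_updated": "2025-01-01"},
    {"id": 1, "last_updated": "2025-01-03"},
    {"id": 2, "last_updated": "2024-12-31"},
]
batch, wm = incremental_batch(rows, "2024-12-31")
print(len(batch), wm)  # 1 2025-01-03
```

Because the batch is deduplicated and the watermark only advances, re-running the same extract produces the same target state, which is the idempotency property the STM should promise.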
Project 3 — Events flattening STM
  1. List JSON paths to extract.
  2. Standardize timestamps and device fields.
  3. Document schema evolution policy and nullable handling.

Learning path

  • Start: Build an STM for one dimension (customers).
  • Next: Add one fact table with a lookup/join and DQ rules.
  • Advance: Incorporate CDC, SCD Type 2, and currency conversions.
  • Polish: Add versioning notes and example test cases.

Next steps

  • Learn dimensional modeling patterns and SCD strategies.
  • Practice writing SQL transforms that mirror your STM.
  • Adopt a consistent STM template for your team.
  • Automate checks where possible (pre/post-load validations).

Mini challenge

Pick any target field that uses multiple sources (e.g., choosing the freshest email from CRM or Support). Write a short STM rule to resolve conflicts: source priority, freshness window, and a DQ rule to flag mismatches.
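One possible shape for that rule, sketched in Python (the source priorities and the 90-day freshness window are illustrative choices, not a prescription):

```python
from datetime import date, timedelta

PRIORITY = {"crm": 1, "support": 2}  # lower number wins ties

def resolve_email(candidates, as_of, freshness_days=90):
    """Pick the freshest email within the window; break ties by source priority."""
    window = [c for c in candidates
              if (as_of - c["updated"]) <= timedelta(days=freshness_days)]
    if not window:
        return None
    best = min(window, key=lambda c: (-c["updated"].toordinal(), PRIORITY[c["source"]]))
    mismatch = len({c["email"] for c in window}) > 1  # DQ flag: sources disagree
    return {"email": best["email"], "mismatch": mismatch}

print(resolve_email(
    [{"source": "crm", "email": "a@x.com", "updated": date(2025, 1, 1)},
     {"source": "support", "email": "b@x.com", "updated": date(2025, 1, 5)}],
    as_of=date(2025, 1, 8),
))
# -> {'email': 'b@x.com', 'mismatch': True}
```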


Practice Exercises

1 exercise to complete

Instructions

Create an STM for target table dim_customer with columns:
customer_key (PK), natural_key, email_normalized, lifecycle_segment, signup_date.

Sources: crm.customers(id, email, created_at, status)

  • email_normalized = LOWER(TRIM(email))
  • lifecycle_segment = 'ACTIVE' if status in ('ACTIVE','PAID') else 'INACTIVE'
  • signup_date = CAST(created_at AS DATE)

Include data types, nullability, DQ checks (at least 3), and load mode (full vs incremental, with key).

Expected Output
A clear mapping row for each target column, explicit transformations, three data quality rules, and a stated load mode with keys/watermarks.

Source To Target Mapping — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

