Why this skill matters for a Data Architect
Data quality and observability make your data platform trustworthy. As a Data Architect, you define how quality is measured, monitored, and remediated across ingestion, transformation, and serving layers. Strong observability unlocks reliable analytics, AI features, regulatory reporting, and faster incident recovery.
- Prevent costly bad-data incidents and regulatory risk.
- Shorten time-to-detect and time-to-resolve data issues.
- Enable teams to ship changes confidently with measurable SLAs/SLOs.
What you will be able to do
- Design a data quality framework with clear dimensions (freshness, completeness, accuracy, validity, consistency, uniqueness).
- Implement checks, anomaly detection, and cross-system reconciliation.
- Set up monitoring, alerting, SLAs/SLOs, and executive-level quality reporting.
- Create incident playbooks and a repeatable root cause analysis process.
Who this is for
- Data Architects defining platform standards and governance.
- Senior Data/Analytics Engineers building reliable pipelines.
- Platform Engineers responsible for observability and SLOs.
Prerequisites
- Comfortable with SQL and at least one data processing engine (e.g., Spark, Flink, or a warehouse).
- Basic understanding of data modeling, pipelines/orchestration, and CI/CD.
- Familiarity with metrics/monitoring concepts (dashboards, alerts, severity).
Learning path (practical roadmap)
1) Define the quality bar
- Pick quality dimensions and draft SLAs/SLOs per critical dataset.
- Choose thresholds for freshness, completeness, and accuracy; document owners and escalation policy.
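A minimal sketch of how the quality bar can be recorded as data, assuming a hypothetical dq.quality_slo table (the dq dataset, columns, and values below are illustrative, not a standard):
-- Illustrative SLO registry; checks and dashboards can join against it.
CREATE TABLE IF NOT EXISTS dq.quality_slo (
  dataset_name STRING,   -- e.g., 'mart.customer_daily'
  dimension    STRING,   -- freshness | completeness | accuracy | ...
  threshold    FLOAT64,  -- numeric target; unit depends on the dimension
  unit         STRING,   -- 'minutes', 'percent', ...
  severity     STRING,   -- high | medium | low
  owner        STRING,   -- accountable team or person
  escalation   STRING    -- on-call channel or runbook link
);
INSERT INTO dq.quality_slo VALUES
  ('mart.customer_daily', 'freshness',    60, 'minutes', 'high',   'data-platform', 'runbook: <internal path>'),
  ('mart.customer_daily', 'completeness',  1, 'percent', 'medium', 'analytics-engineering', 'runbook: <internal path>');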
2) Instrument baseline checks
- Add freshness, completeness, and domain/accuracy tests close to the data (SQL/dbt/Great Expectations-style checks).
- Store results and failures as events for trend analysis.
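One way to make results queryable for trends is to append every check run to a results table; the dq.check_results layout below is an assumption for illustration:
-- Hypothetical results table: one row per check execution.
CREATE TABLE IF NOT EXISTS dq.check_results (
  check_name   STRING,
  dataset_name STRING,
  run_ts       TIMESTAMP,
  metric_value FLOAT64,  -- e.g., freshness_minutes or a null-rate percentage
  threshold    FLOAT64,
  passed       BOOL
);
-- Record a freshness run for mart.customer_daily (60-minute threshold).
INSERT INTO dq.check_results
SELECT 'customer_freshness', 'mart.customer_daily', CURRENT_TIMESTAMP(),
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE), 60,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) <= 60
FROM mart.customer_daily;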
3) Establish reconciliation patterns
- Compare record counts, sums, and hashes between source and warehouse by partition/date.
- Automate daily drift detection jobs.
4) Add anomaly detection
- Start with simple statistical baselines (rolling mean, z-score). Tune to reduce noise.
- Graduate to seasonality-aware models if needed.
5) Monitoring and alerting
- Publish metrics to your monitoring stack. Define severity levels and on-call rotation.
- Route alerts with context (dataset, owner, runbook link, last healthy time).
6) Incident management and RCA
- Create a playbook template. Practice drills. Track MTTR/MTTD.
- Use lineage to localize the fault and prevent recurrences with action items.
7) Quality metrics and reporting
- Build a dashboard: test coverage, failure rate, freshness SLO attainment, incident metrics (a rollup query is sketched after this list).
- Share summaries with stakeholders; adjust thresholds based on impact.
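If check results are stored as events (for example in the hypothetical dq.check_results table sketched earlier), the core dashboard KPIs reduce to one rollup query; the column names are assumptions:
-- 30-day rollup per dataset: coverage, failure rate, freshness SLO attainment.
SELECT dataset_name,
       COUNT(DISTINCT check_name) AS checks_covered,
       COUNTIF(NOT passed) / COUNT(*) AS failure_rate,
       COUNTIF(check_name LIKE '%freshness%' AND passed)
         / NULLIF(COUNTIF(check_name LIKE '%freshness%'), 0) AS freshness_slo_attainment
FROM dq.check_results
WHERE run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY dataset_name;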
Milestone checklist
- SLAs/SLOs defined and approved for top 10 critical tables.
- Baseline checks deployed and logging results.
- Daily reconciliation job running with alerts.
- Anomaly rules tuned to a low false-positive rate.
- Runbooks published; on-call rotation set.
- Quality dashboard live with targets and trends.
Worked examples
1) Freshness, completeness, and accuracy checks (SQL)
-- Freshness (minutes since last update)
SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS freshness_minutes
FROM mart.customer_daily;
-- Completeness (null rate for critical column)
SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS email_null_pct
FROM mart.customer_daily
WHERE load_date = CURRENT_DATE();
-- Accuracy / validity (domain and range checks)
SELECT COUNTIF(status NOT IN ('active','inactive','pending')) AS invalid_status_count,
COUNTIF(amount < 0) AS negative_amount_count
FROM mart.orders
WHERE load_date = CURRENT_DATE();
How to act on results
- Define thresholds (e.g., freshness <= 60 min, email_null_pct <= 1%).
- On breach, include dataset name, owner, threshold, and last healthy time in the alert.
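One convention (an assumption, not a requirement of any particular tool) is to write each check so it returns rows only on breach; the scheduler then alerts whenever the result set is non-empty:
-- Emits a row only when the 60-minute freshness threshold is breached.
SELECT 'mart.customer_daily' AS dataset_name,
       'data-platform' AS owner,  -- illustrative owner
       60 AS threshold_minutes,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS freshness_minutes,
       MAX(updated_at) AS last_healthy_ts
FROM mart.customer_daily
HAVING TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) > 60;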
2) Reconciliation across systems
-- Partition-level count check between source and warehouse
WITH src AS (
  SELECT order_date, COUNT(*) AS c FROM staging_src.orders GROUP BY order_date
), tgt AS (
  SELECT order_date, COUNT(*) AS c FROM wh.orders GROUP BY order_date
)
SELECT COALESCE(src.order_date, tgt.order_date) AS order_date,
       src.c AS src_count, tgt.c AS wh_count,
       (tgt.c - src.c) AS diff
FROM src FULL OUTER JOIN tgt ON src.order_date = tgt.order_date
WHERE src.c IS NULL OR tgt.c IS NULL OR src.c != tgt.c;
-- Row-level hash comparison (stable business keys)
SELECT s.order_id
FROM (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM staging_src.orders WHERE order_date = CURRENT_DATE()
) s
LEFT JOIN (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM wh.orders WHERE order_date = CURRENT_DATE()
) w ON s.order_id = w.order_id
WHERE w.order_id IS NULL OR s.h != w.h;
Tip
Use deterministic hashes on business keys and normalized field formats to avoid false mismatches due to type/format differences.
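A sketch of one normalization convention before hashing (fixed decimal precision, trimmed and case-folded strings, explicit date formats, a sentinel for NULLs); the exact rules are a team decision:
-- Normalize fields to stable string forms before hashing.
SELECT order_id,
       MD5(CONCAT(
         FORMAT('%.2f', amount), '|',                 -- fixed decimal precision
         UPPER(TRIM(COALESCE(status, 'NULL'))), '|',  -- trimmed, case-folded, NULL-safe
         FORMAT_DATE('%Y-%m-%d', DATE(order_ts))      -- explicit date format
       )) AS h
FROM staging_src.orders
WHERE order_date = CURRENT_DATE();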
3) Simple anomaly detection with rolling baseline (SQL)
-- Detect anomalous daily record count using z-score
WITH daily AS (
SELECT DATE(load_ts) AS d, COUNT(*) AS c FROM wh.events GROUP BY 1
), stats AS (
SELECT d, c,
AVG(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS mean_c,
STDDEV_SAMP(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS sd_c
FROM daily
)
SELECT d, c, (c - mean_c) / NULLIF(sd_c, 0) AS z
FROM stats
WHERE ABS((c - mean_c) / NULLIF(sd_c, 0)) >= 3;
Tuning guidance
- Use a lookback window that covers typical cycles (e.g., 14 days).
- Start with |z| ≥ 3; adjust per business impact and seasonality.
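If day-of-week effects dominate (e.g., weekend dips), one option is to compute the baseline per weekday; this is the same z-score pattern partitioned by day of week, using roughly eight prior weeks per weekday:
-- Weekday-aware variant of the rolling z-score.
WITH daily AS (
  SELECT DATE(load_ts) AS d, EXTRACT(DAYOFWEEK FROM DATE(load_ts)) AS dow, COUNT(*) AS c
  FROM wh.events GROUP BY 1, 2
), stats AS (
  SELECT d, c,
         AVG(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 8 PRECEDING AND 1 PRECEDING) AS mean_c,
         STDDEV_SAMP(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 8 PRECEDING AND 1 PRECEDING) AS sd_c
  FROM daily
)
SELECT d, c, (c - mean_c) / NULLIF(sd_c, 0) AS z
FROM stats
WHERE ABS((c - mean_c) / NULLIF(sd_c, 0)) >= 3;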
4) Monitoring and alerting: config sketch
# Pseudo YAML for a dataset quality monitor
checks:
  - name: customer_freshness
    type: freshness
    query: SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) FROM mart.customer_daily
    threshold: 60
    severity: high
    on_breach:
      notify: ["data-oncall"]
      include: ["dataset", "owner", "threshold", "last_success_ts", "runbook"]
  - name: orders_completeness_email
    type: null_rate
    query: |
      SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) FROM mart.orders
    threshold: 1
    severity: medium
    on_breach:
      notify: ["analytics-engineering"]
      ticket: true
Alert hygiene
- Deduplicate repeated alerts; add cooldowns to avoid noise storms (a cooldown query is sketched after this list).
- Always link to a runbook and recent context (last pass, owner, change history).
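One way to express a cooldown in SQL, assuming the hypothetical dq.check_results table from earlier plus a dq.alerts_log table (check_name, sent_ts) that records every alert sent; a fresh breach is forwarded only if the check has not alerted in the last 6 hours:
-- Breaches from the latest run that fall outside the 6-hour cooldown window (illustrative).
SELECT r.check_name, r.dataset_name, r.metric_value, r.threshold
FROM dq.check_results AS r
LEFT JOIN dq.alerts_log AS a
  ON a.check_name = r.check_name
 AND a.sent_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
WHERE NOT r.passed
  AND r.run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
  AND a.check_name IS NULL;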
5) Incident playbook snippet
Incident: Data Freshness Breach (High Severity)
1. Triage (max 15 min): confirm breach, identify impacted tables and consumers.
2. Contain: pause downstream jobs if serving stale/incorrect data to critical outputs.
3. Diagnose: check last successful run, recent code/config changes, upstream source health.
4. Fix: rollback or hotfix; re-run idempotent steps from the minimal checkpoint.
5. Recover: validate with targeted checks; re-enable downstream.
6. Learn: RCA within 48h; create action items (tests, alerts, guards) with owners and due dates.
Message template:
"High-sev freshness breach in mart.customer_daily. Last healthy: 09:10 UTC, threshold: 60 min.
Potential impact: Marketing dashboards. Runbook: <internal path>. Owner: Data Platform On-call."
6) Root cause analysis cues
-- Example lineage probe (conceptual)
SELECT child_table, parent_table, last_run_ts, status
FROM lineage_runs
WHERE child_table = 'mart.customer_daily'
ORDER BY last_run_ts DESC
LIMIT 20;
RCA workflow
- Reproduce locally or in staging; confirm breach signal.
- Check change log (code, schema, data contracts, source outages) in the last 24–48 hours.
- Trace lineage one hop upstream; compare inputs against expected contracts (a schema-drift probe is sketched after this list).
- Test hypotheses with targeted queries and backfills.
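For the inputs-vs-contracts step, a schema-drift probe can compare live columns (BigQuery-style INFORMATION_SCHEMA shown here) against an expected-contract table; dq.expected_contracts is a hypothetical registry:
-- Columns missing on either side, or present with a different type.
WITH live AS (
  SELECT column_name, data_type
  FROM mart.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'customer_daily'
), expected AS (
  SELECT column_name, data_type
  FROM dq.expected_contracts
  WHERE dataset_name = 'mart.customer_daily'
)
SELECT COALESCE(l.column_name, e.column_name) AS column_name,
       l.data_type AS live_type, e.data_type AS expected_type
FROM live l FULL OUTER JOIN expected e ON l.column_name = e.column_name
WHERE l.column_name IS NULL OR e.column_name IS NULL OR l.data_type != e.data_type;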
Drills / exercises
- Create a freshness and completeness check for a table you own; choose thresholds and define an owner.
- Implement a daily reconciliation between an ingestion table and a warehouse table using counts and a hash on a business key.
- Add a rolling z-score anomaly detector for a metric of your choice; tune threshold to reduce false positives.
- Draft an incident playbook for a high-severity data quality incident (e.g., a freshness or completeness breach) relevant to your org.
- Build a small dashboard showing test pass rate, freshness SLO attainment, and MTTR over the last 30 days.
Common mistakes and how to avoid them
- Only checking downstream marts: instrument quality as early as possible near ingestion.
- Thresholds too strict or too loose: iterate using historical data and stakeholder input.
- Noisy alerts: add severity levels, cooldowns, and context; prefer actionable alerts.
- Non-idempotent re-runs: design pipelines with dedupe keys and safe retries (see the MERGE sketch after this list).
- Skipping reconciliation: counts and hashes catch silent drift and schema quirks.
- Unowned incidents: assign owners and escalation paths; include runbooks in every alert.
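For the non-idempotent re-run pitfall, a MERGE keyed on the business key is one common remedy: re-running the same partition updates rows instead of duplicating them. Table and column names mirror the earlier worked examples and are illustrative:
-- Idempotent load for today's partition via upsert on the business key.
MERGE wh.orders AS t
USING (
  SELECT order_id, amount, status, order_ts, order_date
  FROM staging_src.orders
  WHERE order_date = CURRENT_DATE()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, status = s.status, order_ts = s.order_ts, order_date = s.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, order_ts, order_date)
  VALUES (s.order_id, s.amount, s.status, s.order_ts, s.order_date);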
Debugging tips
- Check whether the signal is real: validate with a direct query and a control table.
- Validate clocks and time zones when diagnosing freshness breaches.
- Compare a small partition/date; sample row-level diffs before full backfills.
- Roll back to last known good artifacts if hotfix risk is high.
Mini project: Quality guardrails for a critical dataset
Goal: Add reliable guardrails to a high-impact table (e.g., orders_daily) and publish quality KPIs.
- Design: define SLAs/SLOs, owners, and severity levels.
- Build: implement freshness, completeness, and domain checks; add a rolling anomaly detector for daily volume.
- Reconcile: schedule a daily source-vs-warehouse reconciliation job.
- Operate: configure alerts with runbook references and a simple on-call rotation.
- Report: create a dashboard with test coverage, pass rate, freshness attainment, incident count, and MTTR.
Stretch goals
- Add seasonality-aware anomalies (weekday profiles).
- Publish a weekly stakeholder summary with trends and action items.
Subskills
- Data Quality Framework Design — Establish dimensions, SLAs/SLOs, ownership, and governance.
- Freshness Completeness Accuracy Checks — Implement baseline tests near data.
- Reconciliation Patterns — Detect drift across systems with counts and hashes.
- Anomaly Detection Strategy — Catch outliers with rolling baselines and tuned thresholds.
- Monitoring And Alerting Design — Route actionable alerts with severity and runbooks.
- Incident Management Playbooks — Standardize triage, containment, fix, and learn steps.
- Root Cause Analysis Process — Use lineage, contracts, and experiments to find causes.
- Quality Metrics And Reporting — Track coverage, failure rates, freshness, MTTR.
Next steps
- Pick one production dataset and implement at least three checks plus a reconciliation job.
- Define alerting and publish a small dashboard with your first quality KPIs.
- Schedule a monthly review to refine thresholds and update runbooks.