Why this skill matters for a Data Architect
Data quality and observability make your data platform trustworthy. As a Data Architect, you define how quality is measured, monitored, and remediated across ingestion, transformation, and serving layers. Strong observability unlocks reliable analytics, AI features, regulatory reporting, and faster incident recovery.
- Prevent costly bad-data incidents and regulatory risk.
- Shorten time-to-detect and time-to-resolve data issues.
- Enable teams to ship changes confidently with measurable SLAs/SLOs.
What you will be able to do
- Design a data quality framework with clear dimensions (freshness, completeness, accuracy, validity, consistency, uniqueness).
- Implement checks, anomaly detection, and cross-system reconciliation.
- Set up monitoring, alerting, SLAs/SLOs, and executive-level quality reporting.
- Create incident playbooks and a repeatable root cause analysis process.
Who this is for
- Data Architects defining platform standards and governance.
- Senior Data/Analytics Engineers building reliable pipelines.
- Platform Engineers responsible for observability and SLOs.
Prerequisites
- Comfortable with SQL and at least one data processing engine (e.g., Spark, Flink, or a warehouse).
- Basic understanding of data modeling, pipelines/orchestration, and CI/CD.
- Familiarity with metrics/monitoring concepts (dashboards, alerts, severity).
Learning path (practical roadmap)
1) Define the quality bar
- Pick quality dimensions and draft SLAs/SLOs per critical dataset.
- Choose thresholds for freshness, completeness, and accuracy; document owners and escalation policy.
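A minimal sketch of how the quality bar can be recorded as data, assuming a hypothetical dq.quality_slo table (the dq dataset, columns, and values below are illustrative, not a standard):
-- Illustrative SLO registry; checks and dashboards can join against it.
CREATE TABLE IF NOT EXISTS dq.quality_slo (
  dataset_name STRING,   -- e.g., 'mart.customer_daily'
  dimension    STRING,   -- freshness | completeness | accuracy | ...
  threshold    FLOAT64,  -- numeric target; unit depends on the dimension
  unit         STRING,   -- 'minutes', 'percent', ...
  severity     STRING,   -- high | medium | low
  owner        STRING,   -- accountable team or person
  escalation   STRING    -- on-call channel or runbook link
);
INSERT INTO dq.quality_slo VALUES
  ('mart.customer_daily', 'freshness',    60, 'minutes', 'high',   'data-platform', 'runbook: <internal path>'),
  ('mart.customer_daily', 'completeness',  1, 'percent', 'medium', 'analytics-engineering', 'runbook: <internal path>');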
2) Instrument baseline checks
- Add freshness, completeness, and domain/accuracy tests close to the data (SQL/dbt/Great Expectations-style checks).
- Store results and failures as events for trend analysis.
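One way to make results queryable for trends is to append every check run to a results table; the dq.check_results layout below is an assumption for illustration:
-- Hypothetical results table: one row per check execution.
CREATE TABLE IF NOT EXISTS dq.check_results (
  check_name   STRING,
  dataset_name STRING,
  run_ts       TIMESTAMP,
  metric_value FLOAT64,  -- e.g., freshness_minutes or a null-rate percentage
  threshold    FLOAT64,
  passed       BOOL
);
-- Record a freshness run for mart.customer_daily (60-minute threshold).
INSERT INTO dq.check_results
SELECT 'customer_freshness', 'mart.customer_daily', CURRENT_TIMESTAMP(),
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE), 60,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) <= 60
FROM mart.customer_daily;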
3) Establish reconciliation patterns
- Compare record counts, sums, and hashes between source and warehouse by partition/date.
- Automate daily drift detection jobs.
4) Add anomaly detection
- Start with simple statistical baselines (rolling mean, z-score). Tune to reduce noise.
- Graduate to seasonality-aware models if needed.
5) Monitoring and alerting
- Publish metrics to your monitoring stack. Define severity levels and on-call rotation.
- Route alerts with context (dataset, owner, runbook link, last healthy time).
6) Incident management and RCA
- Create a playbook template. Practice drills. Track MTTR/MTTD.
- Use lineage to localize the fault and prevent recurrences with action items.
7) Quality metrics and reporting
- Build a dashboard: test coverage, failure rate, freshness SLO attainment, incident metrics (a rollup query is sketched after this list).
- Share summaries with stakeholders; adjust thresholds based on impact.
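If check results are stored as events (for example in the hypothetical dq.check_results table sketched earlier), the core dashboard KPIs reduce to one rollup query; the column names are assumptions:
-- 30-day rollup per dataset: coverage, failure rate, freshness SLO attainment.
SELECT dataset_name,
       COUNT(DISTINCT check_name) AS checks_covered,
       COUNTIF(NOT passed) / COUNT(*) AS failure_rate,
       COUNTIF(check_name LIKE '%freshness%' AND passed)
         / NULLIF(COUNTIF(check_name LIKE '%freshness%'), 0) AS freshness_slo_attainment
FROM dq.check_results
WHERE run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY dataset_name;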
Milestone checklist
- SLAs/SLOs defined and approved for top 10 critical tables.
- Baseline checks deployed and logging results.
- Daily reconciliation job running with alerts.
- Anomaly rules tuned to a low false-positive rate.
- Runbooks published; on-call rotation set.
- Quality dashboard live with targets and trends.
Worked examples
1) Freshness, completeness, and accuracy checks (SQL)
-- Freshness (minutes since last update)
SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS freshness_minutes
FROM mart.customer_daily;
-- Completeness (null rate for critical column)
SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS email_null_pct
FROM mart.customer_daily
WHERE load_date = CURRENT_DATE();
-- Accuracy / validity (domain and range checks)
SELECT COUNTIF(status NOT IN ('active','inactive','pending')) AS invalid_status_count,
COUNTIF(amount < 0) AS negative_amount_count
FROM mart.orders
WHERE load_date = CURRENT_DATE();
How to act on results
- Define thresholds (e.g., freshness <= 60 min, email_null_pct <= 1%).
- On breach, include dataset name, owner, threshold, and last healthy time in the alert.
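One convention (an assumption, not a requirement of any particular tool) is to write each check so it returns rows only on breach; the scheduler then alerts whenever the result set is non-empty:
-- Emits a row only when the 60-minute freshness threshold is breached.
SELECT 'mart.customer_daily' AS dataset_name,
       'data-platform' AS owner,  -- illustrative owner
       60 AS threshold_minutes,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS freshness_minutes,
       MAX(updated_at) AS last_healthy_ts
FROM mart.customer_daily
HAVING TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) > 60;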
2) Reconciliation across systems
-- Partition-level count check between source and warehouse
WITH src AS (
  SELECT order_date, COUNT(*) AS c FROM staging_src.orders GROUP BY order_date
), tgt AS (
  SELECT order_date, COUNT(*) AS c FROM wh.orders GROUP BY order_date
)
SELECT COALESCE(src.order_date, tgt.order_date) AS order_date,
       src.c AS src_count, tgt.c AS wh_count,
       (tgt.c - src.c) AS diff
FROM src FULL OUTER JOIN tgt ON src.order_date = tgt.order_date
WHERE src.c IS NULL OR tgt.c IS NULL OR src.c != tgt.c;
-- Row-level hash comparison (stable business keys)
SELECT s.order_id
FROM (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM staging_src.orders WHERE order_date = CURRENT_DATE()
) s
LEFT JOIN (
  SELECT order_id,
         MD5(CONCAT(CAST(amount AS STRING), '|', status, '|', CAST(DATE(order_ts) AS STRING))) AS h
  FROM wh.orders WHERE order_date = CURRENT_DATE()
) w ON s.order_id = w.order_id
WHERE w.order_id IS NULL OR s.h != w.h;
Tip
Use deterministic hashes on business keys and normalized field formats to avoid false mismatches due to type/format differences.
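A sketch of one normalization convention before hashing (fixed decimal precision, trimmed and case-folded strings, explicit date formats, a sentinel for NULLs); the exact rules are a team decision:
-- Normalize fields to stable string forms before hashing.
SELECT order_id,
       MD5(CONCAT(
         FORMAT('%.2f', amount), '|',                 -- fixed decimal precision
         UPPER(TRIM(COALESCE(status, 'NULL'))), '|',  -- trimmed, case-folded, NULL-safe
         FORMAT_DATE('%Y-%m-%d', DATE(order_ts))      -- explicit date format
       )) AS h
FROM staging_src.orders
WHERE order_date = CURRENT_DATE();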
3) Simple anomaly detection with rolling baseline (SQL)
-- Detect anomalous daily record count using z-score
WITH daily AS (
SELECT DATE(load_ts) AS d, COUNT(*) AS c FROM wh.events GROUP BY 1
), stats AS (
SELECT d, c,
AVG(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS mean_c,
STDDEV_SAMP(c) OVER (ORDER BY d ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS sd_c
FROM daily
)
SELECT d, c, (c - mean_c) / NULLIF(sd_c, 0) AS z
FROM stats
WHERE ABS((c - mean_c) / NULLIF(sd_c, 0)) >= 3;
Tuning guidance
- Use a lookback window that covers typical cycles (e.g., 14 days).
- Start with |z| ≥ 3; adjust per business impact and seasonality.
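If day-of-week effects dominate (e.g., weekend dips), one option is to compute the baseline per weekday; this is the same z-score pattern partitioned by day of week, using roughly eight prior weeks per weekday:
-- Weekday-aware variant of the rolling z-score.
WITH daily AS (
  SELECT DATE(load_ts) AS d, EXTRACT(DAYOFWEEK FROM DATE(load_ts)) AS dow, COUNT(*) AS c
  FROM wh.events GROUP BY 1, 2
), stats AS (
  SELECT d, c,
         AVG(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 8 PRECEDING AND 1 PRECEDING) AS mean_c,
         STDDEV_SAMP(c) OVER (PARTITION BY dow ORDER BY d ROWS BETWEEN 8 PRECEDING AND 1 PRECEDING) AS sd_c
  FROM daily
)
SELECT d, c, (c - mean_c) / NULLIF(sd_c, 0) AS z
FROM stats
WHERE ABS((c - mean_c) / NULLIF(sd_c, 0)) >= 3;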
4) Monitoring and alerting: config sketch
# Pseudo YAML for a dataset quality monitor
checks:
  - name: customer_freshness
    type: freshness
    query: SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) FROM mart.customer_daily
    threshold: 60
    severity: high
    on_breach:
      notify: ["data-oncall"]
      include: ["dataset", "owner", "threshold", "last_success_ts", "runbook"]
  - name: orders_completeness_email
    type: null_rate
    query: |
      SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) FROM mart.orders
    threshold: 1
    severity: medium
    on_breach:
      notify: ["analytics-engineering"]
      ticket: true
Alert hygiene
- Deduplicate repeated alerts; add cooldowns to avoid noise storms (a cooldown query is sketched after this list).
- Always link to a runbook and recent context (last pass, owner, change history).
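One way to express a cooldown in SQL, assuming the hypothetical dq.check_results table from earlier plus a dq.alerts_log table (check_name, sent_ts) that records every alert sent; a fresh breach is forwarded only if the check has not alerted in the last 6 hours:
-- Breaches from the latest run that fall outside the 6-hour cooldown window (illustrative).
SELECT r.check_name, r.dataset_name, r.metric_value, r.threshold
FROM dq.check_results AS r
LEFT JOIN dq.alerts_log AS a
  ON a.check_name = r.check_name
 AND a.sent_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
WHERE NOT r.passed
  AND r.run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
  AND a.check_name IS NULL;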
5) Incident playbook snippet
Incident: Data Freshness Breach (High Severity)
1. Triage (max 15 min): confirm breach, identify impacted tables and consumers.
2. Contain: pause downstream jobs if serving stale/incorrect data to critical outputs.
3. Diagnose: check last successful run, recent code/config changes, upstream source health.
4. Fix: rollback or hotfix; re-run idempotent steps from the minimal checkpoint.
5. Recover: validate with targeted checks; re-enable downstream.
6. Learn: RCA within 48h; create action items (tests, alerts, guards) with owners and due dates.
Message template:
"High-sev freshness breach in mart.customer_daily. Last healthy: 09:10 UTC, threshold: 60 min.
Potential impact: Marketing dashboards. Runbook: <internal path>. Owner: Data Platform On-call."
6) Root cause analysis cues
-- Example lineage probe (conceptual)
SELECT child_table, parent_table, last_run_ts, status
FROM lineage_runs
WHERE child_table = 'mart.customer_daily'
ORDER BY last_run_ts DESC
LIMIT 20;
RCA workflow
- Reproduce locally or in staging; confirm breach signal.
- Check change log (code, schema, data contracts, source outages) in the last 24–48 hours.
- Trace lineage one hop upstream; compare inputs against expected contracts (a schema-drift probe is sketched after this list).
- Test hypotheses with targeted queries and backfills.
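For the inputs-vs-contracts step, a schema-drift probe can compare live columns (BigQuery-style INFORMATION_SCHEMA shown here) against an expected-contract table; dq.expected_contracts is a hypothetical registry:
-- Columns missing on either side, or present with a different type.
WITH live AS (
  SELECT column_name, data_type
  FROM mart.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'customer_daily'
), expected AS (
  SELECT column_name, data_type
  FROM dq.expected_contracts
  WHERE dataset_name = 'mart.customer_daily'
)
SELECT COALESCE(l.column_name, e.column_name) AS column_name,
       l.data_type AS live_type, e.data_type AS expected_type
FROM live l FULL OUTER JOIN expected e ON l.column_name = e.column_name
WHERE l.column_name IS NULL OR e.column_name IS NULL OR l.data_type != e.data_type;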
Drills / exercises
- Create a freshness and completeness check for a table you own; choose thresholds and define an owner.
- Implement a daily reconciliation between an ingestion table and a warehouse table using counts and a hash on a business key.
- Add a rolling z-score anomaly detector for a metric of your choice; tune threshold to reduce false positives.
- Draft an incident playbook for a high-severity data quality incident (e.g., a freshness or completeness breach) relevant to your org.
- Build a small dashboard showing test pass rate, freshness SLO attainment, and MTTR over the last 30 days.
Common mistakes and how to avoid them
- Only checking downstream marts: instrument quality as early as possible near ingestion.
- Thresholds too strict or too loose: iterate using historical data and stakeholder input.
- Noisy alerts: add severity levels, cooldowns, and context; prefer actionable alerts.
- Non-idempotent re-runs: design pipelines with dedupe keys and safe retries (see the MERGE sketch after this list).
- Skipping reconciliation: counts and hashes catch silent drift and schema quirks.
- Unowned incidents: assign owners and escalation paths; include runbooks in every alert.
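For the non-idempotent re-run pitfall, a MERGE keyed on the business key is one common remedy: re-running the same partition updates rows instead of duplicating them. Table and column names mirror the earlier worked examples and are illustrative:
-- Idempotent load for today's partition via upsert on the business key.
MERGE wh.orders AS t
USING (
  SELECT order_id, amount, status, order_ts, order_date
  FROM staging_src.orders
  WHERE order_date = CURRENT_DATE()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, status = s.status, order_ts = s.order_ts, order_date = s.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, order_ts, order_date)
  VALUES (s.order_id, s.amount, s.status, s.order_ts, s.order_date);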
Debugging tips
- Check whether the signal is real: validate with a direct query and a control table.
- Validate clocks and time zones when diagnosing freshness breaches.
- Compare a small partition/date; sample row-level diffs before full backfills.
- Roll back to last known good artifacts if hotfix risk is high.
Mini project: Quality guardrails for a critical dataset
Goal: Add reliable guardrails to a high-impact table (e.g., orders_daily) and publish quality KPIs.
- Design: define SLAs/SLOs, owners, and severity levels.
- Build: implement freshness, completeness, and domain checks; add a rolling anomaly detector for daily volume.
- Reconcile: schedule a daily source-vs-warehouse reconciliation job.
- Operate: configure alerts with runbook references and a simple on-call rotation.
- Report: create a dashboard with test coverage, pass rate, freshness attainment, incident count, and MTTR.
Stretch goals
- Add seasonality-aware anomalies (weekday profiles).
- Publish a weekly stakeholder summary with trends and action items.
Subskills
- Data Quality Framework Design — Establish dimensions, SLAs/SLOs, ownership, and governance.
- Freshness Completeness Accuracy Checks — Implement baseline tests near data.
- Reconciliation Patterns — Detect drift across systems with counts and hashes.
- Anomaly Detection Strategy — Catch outliers with rolling baselines and tuned thresholds.
- Monitoring And Alerting Design — Route actionable alerts with severity and runbooks.
- Incident Management Playbooks — Standardize triage, containment, fix, and learn steps.
- Root Cause Analysis Process — Use lineage, contracts, and experiments to find causes.
- Quality Metrics And Reporting — Track coverage, failure rates, freshness, MTTR.
Next steps
- Pick one production dataset and implement at least three checks plus a reconciliation job.
- Define alerting and publish a small dashboard with your first quality KPIs.
- Schedule a monthly review to refine thresholds and update runbooks.