Topic 6 of 8

Data Validation And Reconciliation

Learn Data Validation And Reconciliation for free with explanations, exercises, and a quick test (for API Engineers).

Published: January 21, 2026 | Updated: January 21, 2026

Why this matters

As an API Engineer, you connect systems that were not built together. Data validation ensures incoming and outgoing data follows contracts, so downstream services do not break. Reconciliation verifies that data in different systems remains consistent over time (e.g., your orders service vs. billing database). Real tasks include building schema validation at API boundaries, deduplicating events, quarantining bad records safely, and writing automated jobs that compare row counts and totals between systems.

  • Stop bad payloads at the edge before they corrupt databases.
  • Detect and resolve missing, duplicate, or out-of-date data across services.
  • Give stakeholders confidence in metrics and reports.

Concept explained simply

Data validation is checking each record against rules before it is accepted. Reconciliation is comparing two data sets that should match, finding and fixing differences.

Mental model

Imagine a secure warehouse:

  • Validation = the guard at the door checking ID and package contents.
  • Reconciliation = inventory audit comparing today’s shelf counts with the master manifest; mismatches are investigated and corrected.

Deeper dive: What to validate and reconcile
  • Schema rules: required fields, types, formats (email, UUID), enums.
  • Business rules: non-negative amounts, dates not in the future for historical events, referential checks (user must exist).
  • Idempotency and deduplication: ignore or merge duplicates using event IDs.
  • Reconciliation signals: row counts, control totals (sum(amount)), watermarks (max updated_at), hash checksums.
  • Quarantine strategy: isolate invalid records; reprocess after fixes.
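The reconciliation signals listed above can be computed with a small helper. The sketch below is illustrative only; the record fields (`id`, `amount`, `updated_at`) and the dict shape are invented for the example, not taken from any real schema:

```python
import hashlib

def reconciliation_signals(records):
    """Compute count, control total, watermark, and checksum
    for one side of a comparison. `records` is a list of dicts
    with hypothetical fields id, amount, updated_at."""
    count = len(records)
    control_total = round(sum(r["amount"] for r in records), 2)
    # Watermark: the newest updated_at seen (ISO strings sort correctly)
    watermark = max(r["updated_at"] for r in records) if records else None
    # Order-independent checksum: hash the sorted record IDs
    digest = hashlib.md5(
        ",".join(sorted(str(r["id"]) for r in records)).encode()
    ).hexdigest()
    return {"count": count, "sum": control_total,
            "watermark": watermark, "checksum": digest}

rows = [
    {"id": "a1", "amount": 10.5, "updated_at": "2026-01-20T10:00:00Z"},
    {"id": "b2", "amount": 4.5, "updated_at": "2026-01-20T12:00:00Z"},
]
sig = reconciliation_signals(rows)
```

Run the same helper on both systems' exports and compare the resulting dicts; any field that differs points to a specific class of drift.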

Worked examples

Example 1: Validate an incoming Order payload at the API edge

Goal: Reject malformed payloads and quarantine them safely.

{
  "order_id": "UUID (required)",
  "user_id": "UUID (required)",
  "currency": "ISO 4217, enum [USD, EUR, GBP] (required)",
  "items": [
    { "sku": "string", "qty": ">=1", "price": ">=0" }
  ],
  "total": ">=0 (must equal sum(items.qty * items.price))",
  "created_at": "RFC3339 timestamp not in future"
}
  1. Validate schema and business rules.
  2. If any rule fails: put payload in a quarantine store with reason codes.
  3. Expose metrics: validation_failures_total by reason.

Tip: Efficient validation
  • Short-circuit on critical failures (missing IDs), then accumulate non-critical warnings to return a single useful error response.
  • Use stable error codes (e.g., VAL_MISSING_FIELD, VAL_ENUM_OUT_OF_RANGE).
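The schema rules, business rule, and tips above can be combined into one validator. This is a minimal sketch, not a production implementation; the function name and the exact error-code strings are assumptions for illustration:

```python
import uuid
from datetime import datetime, timezone

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}

def validate_order(payload):
    """Return (errors, warnings); empty errors means accept."""
    errors, warnings = [], []
    # Short-circuit on critical failures: missing or malformed IDs
    for field in ("order_id", "user_id"):
        if field not in payload:
            return [f"VAL_MISSING_FIELD:{field}"], warnings
        try:
            uuid.UUID(str(payload[field]))
        except ValueError:
            return [f"VAL_BAD_UUID:{field}"], warnings
    # Accumulate non-critical issues into one response
    if payload.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("VAL_ENUM_OUT_OF_RANGE:currency")
    items = payload.get("items") or []
    for i, item in enumerate(items):
        if item.get("qty", 0) < 1:
            errors.append(f"VAL_RANGE:items[{i}].qty")
        if item.get("price", -1) < 0:
            errors.append(f"VAL_RANGE:items[{i}].price")
    # Business rule: total must equal sum of line items
    expected = round(sum(i.get("qty", 0) * i.get("price", 0) for i in items), 2)
    if round(payload.get("total", -1), 2) != expected:
        errors.append("VAL_TOTAL_MISMATCH")
    # Timestamp must not be in the future
    ts = payload.get("created_at")
    if ts is None:
        warnings.append("VAL_MISSING_FIELD:created_at")
    elif datetime.fromisoformat(ts.replace("Z", "+00:00")) > datetime.now(timezone.utc):
        errors.append("VAL_FUTURE_TIMESTAMP:created_at")
    return errors, warnings

errors, warnings = validate_order({"currency": "JPY"})
print(errors)  # ['VAL_MISSING_FIELD:order_id']
```

Note the two-tier design: critical failures return immediately, while everything else is accumulated so the caller sees all remaining problems in a single response.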

Example 2: Reconcile Orders service vs. Analytics warehouse

Scenario: Nightly job compares yesterday’s orders in two systems.

-- Counts by day
SELECT date(created_at) AS d, COUNT(*) AS c, SUM(total) AS s
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '1 day' AND created_at < CURRENT_DATE
GROUP BY 1;

SELECT date(created_at) AS d, COUNT(*) AS c, SUM(total) AS s
FROM dw_orders
WHERE created_at >= CURRENT_DATE - INTERVAL '1 day' AND created_at < CURRENT_DATE
GROUP BY 1;

-- Missing IDs
SELECT o.order_id FROM orders o
LEFT JOIN dw_orders d ON d.order_id = o.order_id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '1 day'
  AND o.created_at < CURRENT_DATE
  AND d.order_id IS NULL;

-- Control totals drift tolerance: abs(svc_sum - dw_sum) <= 0.01 * svc_sum
  1. Compare counts, control totals, and max(updated_at) watermarks.
  2. Alert if thresholds are exceeded; generate a fix list to backfill.
  3. Re-run CDC or re-export missing IDs.
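The comparison steps above might look like this inside the nightly job. The summary dict shape, issue codes, and default tolerance are assumptions for illustration (the tolerance mirrors the drift comment in the SQL above):

```python
def compare_systems(svc, dw, tolerance=0.01):
    """Compare one day's summary from the service vs the warehouse.

    svc/dw are dicts like {"count": int, "sum": float, "ids": set()};
    returns a list of issue strings (empty means the systems match)."""
    issues = []
    if svc["count"] != dw["count"]:
        issues.append(f"COUNT_MISMATCH:{svc['count']}!={dw['count']}")
    # Control-total drift tolerance: abs difference within 1% of svc sum
    if abs(svc["sum"] - dw["sum"]) > tolerance * svc["sum"]:
        issues.append("SUM_DRIFT")
    # Missing IDs become the fix list for backfill
    missing = sorted(svc["ids"] - dw["ids"])
    if missing:
        issues.append(f"MISSING_IDS:{missing}")
    return issues

svc = {"count": 3, "sum": 100.0, "ids": {"a", "b", "c"}}
dw = {"count": 2, "sum": 60.0, "ids": {"a", "b"}}
print(compare_systems(svc, dw))
```

Feeding the non-empty issue list into alerting, and the missing-ID list into a backfill queue, covers steps 2 and 3 directly.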

Example 3: Idempotent upsert with deduplication

Requirement: Process events exactly once when retries happen.

-- Pseudocode; processed_events.event_id has a UNIQUE constraint,
-- so a concurrent retry cannot record the same event twice
BEGIN;
INSERT INTO processed_events (event_id, processed_at)
  VALUES (:id, now())
  ON CONFLICT (event_id) DO NOTHING;
IF rows_affected = 0 THEN
  ROLLBACK; -- duplicate delivery: already applied
ELSE
  INSERT INTO target_table (...) VALUES (...)
    ON CONFLICT (business_key) DO UPDATE SET ...;
  COMMIT;
END IF;

Keep a table of seen event_ids with a unique constraint. Upserting by a business key applies the latest state even when retries occur; if events can arrive out of order, also compare a version or updated_at in the UPDATE clause so an older event never overwrites newer state.
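A runnable version of this pattern, using SQLite (3.24+ for `ON CONFLICT ... DO UPDATE`) as a stand-in for the real database; the table and column names are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, processed_at TEXT);
CREATE TABLE balances (account TEXT PRIMARY KEY, amount REAL);
""")

def apply_event(event):
    """Apply an event at most once; safe to call again on redelivery."""
    # The PRIMARY KEY on event_id makes this insert a dedup gate
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events VALUES (?, datetime('now'))",
        (event["event_id"],))
    if cur.rowcount == 0:
        return "skipped"  # duplicate delivery: already applied
    # Upsert by business key (account) so the latest state wins
    conn.execute(
        "INSERT INTO balances VALUES (?, ?) "
        "ON CONFLICT(account) DO UPDATE SET amount = excluded.amount",
        (event["account"], event["amount"]))
    conn.commit()
    return "applied"

e = {"event_id": "evt-1", "account": "u1", "amount": 42.0}
print(apply_event(e))  # applied
print(apply_event(e))  # skipped
```

The second call is a no-op because the dedup insert is rejected by the primary key, so a retry storm leaves the target table unchanged.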

Example 4: File-level checksums and control totals

When ingesting batch files, verify integrity before loading.

# Manifest
file_name, row_count, sum_total, md5
orders_2026-01-20.csv, 120034, 554321.45, 5d41402abc4b2a76b9719d911017c592

# Steps
1) Compute local md5 and compare with manifest.
2) Validate row_count and sum(total) before accepting.
3) If mismatch, quarantine file and notify.
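A minimal sketch of those three steps, assuming the batch file is a CSV with a `total` column and the manifest is parsed into a dict mirroring the manifest line above; function and key names are invented for the example:

```python
import csv
import hashlib
import io

def verify_file(data: bytes, manifest):
    """Check md5, row count, and control total before loading.
    Returns a list of issues; empty means accept, else quarantine."""
    issues = []
    # 1) Compute local md5 and compare with the manifest
    if hashlib.md5(data).hexdigest() != manifest["md5"]:
        issues.append("CHECKSUM_MISMATCH")
    # 2) Validate row_count and sum(total) before accepting
    rows = list(csv.DictReader(io.StringIO(data.decode())))
    if len(rows) != manifest["row_count"]:
        issues.append("ROW_COUNT_MISMATCH")
    total = round(sum(float(r["total"]) for r in rows), 2)
    if total != manifest["sum_total"]:
        issues.append("SUM_TOTAL_MISMATCH")
    # 3) Caller quarantines the file and notifies if issues is non-empty
    return issues

data = b"order_id,total\na1,10.50\nb2,4.50\n"
manifest = {"row_count": 2, "sum_total": 15.0,
            "md5": hashlib.md5(data).hexdigest()}
print(verify_file(data, manifest))  # []
```

Because the checksum covers the raw bytes while the counts cover the parsed rows, a corrupt download and a truncated export produce different, diagnosable issue codes.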

Exercises

Complete these and compare with the solutions. You can track progress on the site; the quick test is available to everyone, and only logged-in learners get saved progress.

Exercise 1: Design validation rules for a Payment API and a reconciliation query

Task: Define clear validation rules for an incoming Payment payload and write one SQL query that finds missing payments in the warehouse compared to the payments service for the last 24 hours.

  • Input fields: payment_id (UUID), user_id (UUID), method (enum: CARD, BANK, WALLET), amount (>=0), currency (ISO 4217), status (enum: PENDING, CAPTURED, FAILED), created_at (timestamp), capture_id (nullable, required only when status=CAPTURED).

Hint
  • Include conditional validation for capture_id.
  • Use LEFT JOIN to detect missing records.
  • Add a tolerance idea for control totals.

Exercise 2: Plan idempotency, dedup, and watermarks for an ingestion job

Task: For a daily customer export, propose how you will ensure idempotent loads, deduplicate retries, and resume from the last successful watermark.

Hint
  • Think about a run ledger table that stores batch_id, date_range, and status.
  • Combine a deterministic file name with a checksum.
  • Use max(updated_at) as a resume pointer.

Common mistakes and self-check

  • Mistake: Only validating schema, not business rules. Fix: Add domain checks (e.g., totals match line items).
  • Mistake: Rejecting everything on minor issues. Fix: Separate errors (reject) vs. warnings (accept with flags).
  • Mistake: No quarantine or replay path. Fix: Store invalid records and make reprocessing easy.
  • Mistake: Reconciliation using only counts. Fix: Add control totals and missing-ID lists.
  • Mistake: No idempotency. Fix: Track processed events/batches and upsert by business keys.

Self-check prompts
  • Can I list the exact reasons a record is rejected?
  • Can I prove two systems match with numbers and IDs?
  • If a job fails mid-run, can I resume without duplicates?
  • What’s the maximum staleness tolerated before alerting?

Practical projects

  • Build a validation gateway: Implement a small service that validates JSON payloads with schema and business rules, returns structured error codes, and writes invalid payloads to a quarantine store for replay.
  • Reconciliation dashboard: Nightly job writes counts, control totals, watermarks, and missing IDs to a status table. Build a simple page or CLI that summarizes pass/fail and top issues.
  • Idempotent CDC pipeline: Consume a change event stream, deduplicate using event IDs, and upsert into a target table. Include a processed_events table and a retry-safe design.

Who this is for

  • API Engineers ensuring stable integrations between services.
  • Backend developers responsible for data flowing into analytics or billing.
  • SREs and data engineers supporting reliable pipelines.

Prerequisites

  • Comfort with REST or event-driven APIs.
  • Basic SQL (JOINs, aggregates).
  • Familiarity with JSON structures and common schema validation ideas.

Learning path

  1. Start: Validation rules at API boundaries (schema + business).
  2. Then: Idempotency and deduplication patterns.
  3. Next: Reconciliation via counts, control totals, and watermarks.
  4. Finally: Alerting, quarantine, and replay workflows.

Mini challenge

Your product team adds a new enum value to payment.method: CRYPTO. Describe what breaks, where validation needs to change, and how you will reconcile data before and after the change. Keep it under 6 bullet points.

Next steps

  • Automate validation at the edge and in pipelines with clear reason codes.
  • Add nightly reconciliation with thresholds and actionable outputs.
  • Introduce idempotent upserts and processed event tracking.
  • Document replay procedures and on-call runbooks.

Quick Test

Take the quick test to check your understanding. Everyone can take it; only logged-in users have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

Create a concise validation spec for the Payment payload and write one SQL query that finds payments present in the service but missing in the warehouse for the last 24 hours. Include at least one control total you would compare.

  • Fields: payment_id (UUID), user_id (UUID), method (CARD, BANK, WALLET), amount (>=0), currency (ISO 4217), status (PENDING, CAPTURED, FAILED), created_at (ts), capture_id (nullable; required when status=CAPTURED).
Expected Output
A list of validation rules (required/conditional, ranges, enums), a LEFT JOIN based SQL selecting missing IDs for the last 24 hours, and a note on control totals (e.g., SUM(amount)).

Data Validation And Reconciliation — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
