Source System Discovery

Learn Source System Discovery for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Source System Discovery is the first step of any ingestion. It prevents surprises like incompatible formats, missing permissions, unstable fields, and unrealistic SLAs. In real data engineering work, you will:

  • Inventory all upstream systems and owners.
  • Confirm access, formats, SLAs, and data contracts.
  • Choose the right ingestion approach (batch, CDC, stream, API).
  • Estimate cost, volumes, and operational risk before building.

Concept explained simply

Source System Discovery is a structured way to answer: What data exists, where is it, how does it change, how do we access it safely, and what guarantees are expected?

Mental model: RAPIDS

  • R — Reachability: How do we connect? Network path, auth, roles, IP allowlists.
  • A — Anatomy: Schema, semantics, data types, PII/PHI, keys.
  • P — Patterns: Batch vs stream vs CDC, update cadence, late/retro loads.
  • I — Importance: Business criticality, SLAs, freshness, recovery objectives.
  • D — Data quality: Nulls, duplicates, inconsistencies, validation rules.
  • S — Scale: Volume, velocity, retention, partitioning, rate limits.

Discovery checklist (use as you interview and probe)

Check off each item as you gather facts:

  • Owner/contact (team and on-call) and support hours
  • Purpose and key business entities (what the system models)
  • Access path: network, accounts, auth type (OAuth, key, IAM, Kerberos), secrets rotation
  • Data location and interface: DB (type/version), API (endpoints), file drops (path, bucket), stream topics
  • Schema: entities, primary keys, foreign keys, data types, nullable fields
  • Change pattern: inserts/updates/deletes, CDC availability, event time vs processing time
  • Cadence and freshness: how often data changes and when it becomes available
  • Volumes: rows/day, GB/day, peak rates; retention policies; historical backfill source
  • Constraints: rate limits, quotas, connection limits, maintenance windows
  • Quality profile: known anomalies, late arriving events, duplicates, timezone handling
  • Security and compliance: PII/PHI presence, required masking/field-level encryption
  • Contract/SLAs: stability expectations, deprecation notices, schema change process
  • Error handling expectations: retries, dead-letter, replays, backfills
  • Cost considerations: API billing, egress fees, storage/compute estimates
  • Success criteria: what stakeholders will accept as "working" (latency, completeness)
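
To keep findings consistent across sources, it can help to capture the checklist answers in a small machine-readable record. The sketch below is one possible layout in Python; the field names and example values are illustrative assumptions, not a standard, so adapt them to your own template.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SourceDiscoveryRecord:
    """Hypothetical, minimal discovery record mirroring the checklist above."""
    source_name: str
    owner: str                              # team and on-call contact
    interface: str                          # "postgres", "rest_api", "s3_csv", "kafka", ...
    auth: str                               # "oauth2", "iam_role", "read_only_user", ...
    change_pattern: str                     # "append_only", "upserts", "cdc", ...
    cadence: str                            # "continuous", "daily 02:00 UTC", ...
    est_volume_per_day: str                 # "~2M rows", "5 GB", ...
    sla: str = ""                           # freshness / recovery expectations
    rate_limits: str = ""
    pii_fields: list = field(default_factory=list)
    known_quality_issues: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

record = SourceDiscoveryRecord(
    source_name="orders_db",
    owner="checkout team (on-call: #checkout-oncall)",
    interface="postgres",
    auth="read_only_user, monthly password rotation",
    change_pattern="cdc",
    cadence="continuous",
    est_volume_per_day="~2M rows",
    sla="dashboard freshness <= 5 minutes",
    pii_fields=["customer_email"],
    known_quality_issues=["late payment settlements"],
)
print(json.dumps(asdict(record), indent=2))
```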

Worked examples

Example 1: Orders in OLTP PostgreSQL with CDC

Context: Ecommerce app with PostgreSQL (v13). Tables: orders, order_items, payments. Business needs near-real-time reporting.

  • Reachability: VPC peering; read-only user; password rotated monthly.
  • Anatomy: orders.id (PK), created_at, updated_at; payments has nullable settled_at.
  • Patterns: Updates on orders and payments; deletes rare; CDC possible via logical decoding.
  • Importance: High; dashboard SLA 5 minutes.
  • Data quality: Some orders created without payments yet; late settlement updates common.
  • Scale: ~2M rows/day; spikes during sales.

Ingestion choice: CDC (e.g., Debezium) to a Kafka topic, then land in the lakehouse. Preserve update semantics (upserts) and handle late arrivals.

Notes: Confirm WAL settings (logical decoding requires wal_level = logical), choose key columns, and define a dedup and idempotency strategy in consumers; a consumer-side sketch follows.
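
To make the dedup and idempotency note concrete, here is a minimal consumer-side sketch. It assumes Debezium-style change events that carry the row image and the source LSN, plus a placeholder upsert_order function standing in for the lakehouse MERGE; it is an illustration, not a complete pipeline.

```python
# Hypothetical Debezium-style change events: row image in "after", operation in "op",
# and the source log position (lsn) for ordering. The duplicate simulates at-least-once delivery.
events = [
    {"op": "c", "after": {"id": 1, "status": "created"}, "source": {"lsn": 100}},
    {"op": "u", "after": {"id": 1, "status": "paid"},    "source": {"lsn": 140}},
    {"op": "u", "after": {"id": 1, "status": "paid"},    "source": {"lsn": 140}},  # redelivered
    {"op": "c", "after": {"id": 2, "status": "created"}, "source": {"lsn": 150}},
]

def latest_per_key(batch):
    """Collapse a micro-batch to the newest change per primary key, so the load is idempotent."""
    latest = {}
    for event in batch:
        key = event["after"]["id"]
        if key not in latest or event["source"]["lsn"] > latest[key]["source"]["lsn"]:
            latest[key] = event
    return list(latest.values())

def upsert_order(row):
    # Placeholder for the lakehouse write, e.g. a MERGE keyed on orders.id.
    print("UPSERT", row)

for event in latest_per_key(events):
    upsert_order(event["after"])
```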

Example 2: Marketing SaaS REST API (rate limited)

Context: Need campaign metrics from a SaaS API. Endpoints support created_since and updated_since cursors. Rate limit: 240 requests/hour.

  • Reachability: Public API over HTTPS; OAuth2 client credentials.
  • Anatomy: campaigns, ad_groups, metrics (nested JSON arrays).
  • Patterns: Mostly append-only metrics; occasional late corrections (updated_at changes).
  • Importance: Medium; daily reporting acceptable.
  • Data quality: Sampling on small campaigns; metric revisions up to 7 days.
  • Scale: 10K campaigns; ~1 GB/day raw JSON.

Ingestion choice: Daily batch pull using updated_since with checkpointing. Implement exponential backoff and respect rate limits. Normalize the nested JSON into tables.

Notes: Add a 7-day sliding re-sync to capture late corrections.
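
Here is a minimal sketch of that pull loop. It assumes a hypothetical https://api.example.com/v1/metrics endpoint that accepts updated_since and page parameters and returns JSON with results and has_more fields; the checkpoint file name, backoff constants, and omitted OAuth2 handling are illustrative choices, not the vendor's real API.

```python
import json
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path

import requests  # any HTTP client would do

BASE_URL = "https://api.example.com/v1/metrics"  # hypothetical endpoint
CHECKPOINT = Path("metrics_checkpoint.json")
RESYNC_WINDOW = timedelta(days=7)                # sliding re-sync for late corrections

def load_checkpoint() -> datetime:
    if CHECKPOINT.exists():
        return datetime.fromisoformat(json.loads(CHECKPOINT.read_text())["updated_since"])
    return datetime.now(timezone.utc) - timedelta(days=30)  # initial backfill horizon

def fetch_page(updated_since: datetime, page: int, max_retries: int = 5) -> dict:
    """GET one page, backing off exponentially on 429/5xx to stay under the rate limit."""
    for attempt in range(max_retries):
        resp = requests.get(
            BASE_URL,
            params={"updated_since": updated_since.isoformat(), "page": page},
            timeout=30,  # OAuth2 token handling omitted for brevity
        )
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (429, 500, 502, 503):
            time.sleep(min(15 * 2 ** attempt, 900))  # 15s, 30s, 60s, ... capped at 15 min
            continue
        resp.raise_for_status()
    raise RuntimeError("retry budget exhausted")

def run() -> None:
    run_started = datetime.now(timezone.utc)
    since = load_checkpoint() - RESYNC_WINDOW    # re-pull the window so late revisions are captured
    page, rows = 1, []
    while True:
        payload = fetch_page(since, page)
        rows.extend(payload["results"])
        if not payload.get("has_more"):
            break
        page += 1
    # ... normalize the nested JSON in `rows` and load it here ...
    CHECKPOINT.write_text(json.dumps({"updated_since": run_started.isoformat()}))
```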

Example 3: Vendor CSVs in object storage

Context: Logistics vendor drops daily CSV files into a bucket path like /vendorX/dt=YYYY-MM-DD/shipments.csv

  • Reachability: Cross-account bucket access via IAM role; event notifications available.
  • Anatomy: CSV with header; shipment_id, customer_id, status, updated_ts.
  • Patterns: Late files possible up to 48 hours; occasional schema drift (new columns).
  • Importance: Medium-high for operations; freshness 2 hours acceptable.
  • Data quality: Duplicates on retries; timezone mix-ups historically.
  • Scale: ~5 GB/day; weekend low volume.

Ingestion choice: Event-driven copy on object create, with partition discovery. Use schema-on-read or evolve the schema as new columns appear. Deduplicate on (shipment_id, updated_ts). Normalize timestamps to UTC.
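
As a sketch of the dedup and UTC normalization steps, the snippet below uses pandas and the column names above; the vendor's local timezone is a placeholder assumption you would confirm during discovery.

```python
import pandas as pd

def clean_shipments(csv_path: str, source_tz: str = "America/Chicago") -> pd.DataFrame:
    """Load one vendor drop, normalize updated_ts to UTC, and drop duplicates from retried uploads."""
    df = pd.read_csv(csv_path)

    # source_tz is an assumption; the historical timezone mix-ups make this a discovery question.
    df["updated_ts"] = (
        pd.to_datetime(df["updated_ts"])
        .dt.tz_localize(source_tz, ambiguous="NaT", nonexistent="shift_forward")
        .dt.tz_convert("UTC")
    )

    # Retried uploads resend the same rows: keep one row per (shipment_id, updated_ts).
    return df.sort_values("updated_ts").drop_duplicates(
        subset=["shipment_id", "updated_ts"], keep="last"
    )

# Example:
# shipments = clean_shipments("vendorX/dt=2026-01-08/shipments.csv")
```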

Step-by-step: run a discovery interview

1) Prepare: Skim docs, list unknowns, draft your checklist.
2) Interview: Meet the system owner; confirm business goals and constraints first.
3) Verify: Request sample dumps or test credentials; run small probes.
4) Model: Sketch entities, keys, and flows; choose ingestion method.
5) Decide: Document trade-offs, SLAs, and success criteria; get sign-off.

Interview question bank

  • What events or records matter most for the business outcome?
  • When is data considered "ready" and how late can it arrive?
  • How are schema changes communicated? Is there a deprecation period?
  • What credentials and network paths are required? Any rotation policies?
  • What are the peak loads and maintenance windows?
  • If ingestion fails, what is the acceptable recovery time and data loss?

Exercises (hands-on)

Do these to build muscle memory, then review your answers against the checklist that follows both exercises.

Exercise 1: Map discovery fields for an HR API

Scenario: WorkLyfe HR provides employee profiles via REST. Endpoints: /employees, /departments. Supports updated_since. OAuth2. Rate limit 120 req/hour. PII: emails, phone. Weekly maintenance Sundays 02:00–04:00 UTC. Occasional backdated corrections within 14 days.

  • List the top 8 discovery items you would document (use the checklist above).
  • Propose an ingestion method and freshness plan.
  • Note two data quality risks and mitigations.

Exercise 2: Identify risks in a sensor stream

Scenario: Factory sensors publish JSON to an MQTT broker. Topics per line: factory/lineX/sensor#. Events at 10–50/sec. Some sensors disconnect intermittently. Time is device local time; some clocks drift by minutes.

  • Name at least five risks or constraints you must discover.
  • For each, suggest a mitigation or validation check.

Checklist for both exercises:

  • Included owners and access?
  • Captured change pattern and cadence?
  • Quantified scale and rate limits?
  • Addressed PII/compliance or security constraints?
  • Defined success criteria and backfill/retry approach?

Common mistakes and self-check

  • Skipping access proof. Self-check: Did you obtain and test credentials in a sandbox?
  • Underestimating late data. Self-check: Do you have a reprocessing window and idempotent loads?
  • Ignoring rate limits. Self-check: Are concurrency and backoff configured and tested?
  • Assuming stable schema. Self-check: Do you track schema changes and handle column adds/removes?
  • Missing owner alignment. Self-check: Are SLAs and change notifications agreed and documented?

Practical projects

  • Create a one-page Source Discovery document (RAPIDS sections) for a public API (e.g., a mock) and include a proposed ingestion diagram.
  • Build a tiny probe: call an API with backoff, collect response size stats, and produce a discovery summary JSON (counts, fields, null %); a starting sketch follows this list.
  • Simulate schema drift: process daily CSVs where columns evolve; write rules to auto-adapt and log changes to a registry file.
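
For the second project above, a starting point might look like this sketch: it pulls a sample from a hypothetical JSON endpoint and summarizes record counts, observed fields, and null percentages for the discovery document. The URL and limit parameter are assumptions; backoff is omitted here (see the Example 2 sketch).

```python
import json
from collections import Counter

import requests  # any HTTP client would do

def probe(url: str, sample_size: int = 500) -> dict:
    """Fetch a sample of records and summarize field presence and null rates."""
    records = requests.get(url, params={"limit": sample_size}, timeout=30).json()
    present, nulls = Counter(), Counter()
    for record in records:
        for key, value in record.items():
            present[key] += 1
            if value in (None, "", []):
                nulls[key] += 1
    total = len(records)
    return {
        "record_count": total,
        "fields": {
            key: {
                "present_pct": round(100 * present[key] / total, 1),
                "null_pct": round(100 * nulls[key] / present[key], 1),
            }
            for key in sorted(present)
        },
    }

if __name__ == "__main__":
    # Hypothetical endpoint returning a JSON array of records.
    print(json.dumps(probe("https://api.example.com/v1/items"), indent=2))
```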

Mini challenge

You discover a payments system exposing a Kafka topic with at-least-once delivery, occasional reorders, and GDPR deletions via tombstone messages. In 5 bullet points, write the minimum ingestion requirements and checks you must implement.

Who this is for

  • Data Engineers starting ingestion projects.
  • Analytics Engineers validating upstream feasibility.
  • Platform Engineers defining data contracts with producers.

Prerequisites

  • Basic understanding of databases, files, and APIs.
  • Familiarity with batch vs stream processing concepts.
  • Comfort reading JSON/CSV and ER diagrams.

Learning path

  • 1) Learn data formats and contracts (JSON, CSV, Avro, schema evolution).
  • 2) Study access and security (authN/Z, secrets handling, networking basics).
  • 3) Explore ingestion patterns (batch, CDC, streaming) and when to use each.
  • 4) Practice discovery interviews and probes; build a repeatable template.
  • 5) Implement observability for ingestion (freshness, completeness, quality checks).

Next steps

  • Complete the exercises below and write your own discovery template.
  • Take the Quick Test to confirm understanding. Anyone can take it; only logged-in users will have their progress saved.
  • Pick one Practical Project and execute it this week.

Quick Test

Short quiz to check your grasp of Source System Discovery. Everyone can take it for free; login is only needed if you want your progress saved.

Practice Exercises

2 exercises to complete

Instructions

Using the scenario in the lesson (WorkLyfe HR), produce a concise discovery summary that covers:

  • At least 8 key discovery items (owners, access, interface, schema highlights, change pattern, cadence, scale, quality, SLAs).
  • Your ingestion method and freshness plan.
  • Two data quality risks and mitigations.

Expected Output

A structured bullet list or short document capturing owners, OAuth2 access, endpoints, updated_since usage, daily batch with 14-day re-sync, rate limit handling, PII handling, success criteria, and quality mitigations.

Source System Discovery — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
