Source System Discovery

Learn Source System Discovery for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Source System Discovery is the first step of any ingestion. It prevents surprises like incompatible formats, missing permissions, unstable fields, and unrealistic SLAs. In real data engineering work, you will:

  • Inventory all upstream systems and owners.
  • Confirm access, formats, SLAs, and data contracts.
  • Choose the right ingestion approach (batch, CDC, stream, API).
  • Estimate cost, volumes, and operational risk before building.

Concept explained simply

Source System Discovery is a structured way to answer: What data exists, where is it, how does it change, how do we access it safely, and what guarantees are expected?

Mental model: RAPIDS

  • R — Reachability: How do we connect? Network path, auth, roles, IP allowlists.
  • A — Anatomy: Schema, semantics, data types, PII/PHI, keys.
  • P — Patterns: Batch vs stream vs CDC, update cadence, late/retro loads.
  • I — Importance: Business criticality, SLAs, freshness, recovery objectives.
  • D — Data quality: Nulls, duplicates, inconsistencies, validation rules.
  • S — Scale: Volume, velocity, retention, partitioning, rate limits.

Discovery checklist (use as you interview and probe)

Check off each item as you gather facts:

  • Owner/contact (team and on-call) and support hours
  • Purpose and key business entities (what the system models)
  • Access path: network, accounts, auth type (OAuth, key, IAM, Kerberos), secrets rotation
  • Data location and interface: DB (type/version), API (endpoints), file drops (path, bucket), stream topics
  • Schema: entities, primary keys, foreign keys, data types, nullable fields
  • Change pattern: inserts/updates/deletes, CDC availability, event time vs processing time
  • Cadence and freshness: how often data changes and when it becomes available
  • Volumes: rows/day, GB/day, peak rates; retention policies; historical backfill source
  • Constraints: rate limits, quotas, connection limits, maintenance windows
  • Quality profile: known anomalies, late arriving events, duplicates, timezone handling
  • Security and compliance: PII/PHI presence, required masking/field-level encryption
  • Contract/SLAs: stability expectations, deprecation notices, schema change process
  • Error handling expectations: retries, dead-letter, replays, backfills
  • Cost considerations: API billing, egress fees, storage/compute estimates
  • Success criteria: what stakeholders will accept as "working" (latency, completeness)
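
To keep findings consistent across sources, it can help to capture the checklist answers in a small machine-readable record. The sketch below is one possible layout in Python; the field names and example values are illustrative assumptions, not a standard, so adapt them to your own template.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SourceDiscoveryRecord:
    """Hypothetical, minimal discovery record mirroring the checklist above."""
    source_name: str
    owner: str                              # team and on-call contact
    interface: str                          # "postgres", "rest_api", "s3_csv", "kafka", ...
    auth: str                               # "oauth2", "iam_role", "read_only_user", ...
    change_pattern: str                     # "append_only", "upserts", "cdc", ...
    cadence: str                            # "continuous", "daily 02:00 UTC", ...
    est_volume_per_day: str                 # "~2M rows", "5 GB", ...
    sla: str = ""                           # freshness / recovery expectations
    rate_limits: str = ""
    pii_fields: list = field(default_factory=list)
    known_quality_issues: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

record = SourceDiscoveryRecord(
    source_name="orders_db",
    owner="checkout team (on-call: #checkout-oncall)",
    interface="postgres",
    auth="read_only_user, monthly password rotation",
    change_pattern="cdc",
    cadence="continuous",
    est_volume_per_day="~2M rows",
    sla="dashboard freshness <= 5 minutes",
    pii_fields=["customer_email"],
    known_quality_issues=["late payment settlements"],
)
print(json.dumps(asdict(record), indent=2))
```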

Worked examples

Example 1: Orders in OLTP PostgreSQL with CDC

Context: Ecommerce app with PostgreSQL (v13). Tables: orders, order_items, payments. Business needs near-real-time reporting.

  • Reachability: VPC peering; read-only user; password rotated monthly.
  • Anatomy: orders.id (PK), created_at, updated_at; payments has nullable settled_at.
  • Patterns: Updates on orders and payments; deletes rare; CDC possible via logical decoding.
  • Importance: High; dashboard SLA 5 minutes.
  • Data quality: Some orders created without payments yet; late settlement updates common.
  • Scale: ~2M rows/day; spikes during sales.

Ingestion choice: CDC (e.g., Debezium) to a Kafka topic, then land in the lakehouse. Preserve update semantics (upserts) and handle late arrivals.

Notes: Confirm WAL settings (logical decoding requires wal_level = logical), choose key columns, and define a dedup and idempotency strategy in consumers; a consumer-side sketch follows.
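
To make the dedup and idempotency note concrete, here is a minimal consumer-side sketch. It assumes Debezium-style change events that carry the row image and the source LSN, plus a placeholder upsert_order function standing in for the lakehouse MERGE; it is an illustration, not a complete pipeline.

```python
# Hypothetical Debezium-style change events: row image in "after", operation in "op",
# and the source log position (lsn) for ordering. The duplicate simulates at-least-once delivery.
events = [
    {"op": "c", "after": {"id": 1, "status": "created"}, "source": {"lsn": 100}},
    {"op": "u", "after": {"id": 1, "status": "paid"},    "source": {"lsn": 140}},
    {"op": "u", "after": {"id": 1, "status": "paid"},    "source": {"lsn": 140}},  # redelivered
    {"op": "c", "after": {"id": 2, "status": "created"}, "source": {"lsn": 150}},
]

def latest_per_key(batch):
    """Collapse a micro-batch to the newest change per primary key, so the load is idempotent."""
    latest = {}
    for event in batch:
        key = event["after"]["id"]
        if key not in latest or event["source"]["lsn"] > latest[key]["source"]["lsn"]:
            latest[key] = event
    return list(latest.values())

def upsert_order(row):
    # Placeholder for the lakehouse write, e.g. a MERGE keyed on orders.id.
    print("UPSERT", row)

for event in latest_per_key(events):
    upsert_order(event["after"])
```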

Example 2: Marketing SaaS REST API (rate limited)

Context: Need campaign metrics from a SaaS API. Endpoints support created_since and updated_since cursors. Rate limit: 240 requests/hour.

  • Reachability: Public API over HTTPS; OAuth2 client credentials.
  • Anatomy: campaigns, ad_groups, metrics (nested JSON arrays).
  • Patterns: Mostly append-only metrics; occasional late corrections (updated_at changes).
  • Importance: Medium; daily reporting acceptable.
  • Data quality: Sampling on small campaigns; metric revisions up to 7 days.
  • Scale: 10K campaigns; ~1 GB/day raw JSON.

Ingestion choice: Daily batch pull using updated_since with checkpointing. Implement exponential backoff and respect rate limits. Normalize the nested JSON into tables.

Notes: Add a 7-day sliding re-sync to capture late corrections.
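
Here is a minimal sketch of that pull loop. It assumes a hypothetical https://api.example.com/v1/metrics endpoint that accepts updated_since and page parameters and returns JSON with results and has_more fields; the checkpoint file name, backoff constants, and omitted OAuth2 handling are illustrative choices, not the vendor's real API.

```python
import json
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path

import requests  # any HTTP client would do

BASE_URL = "https://api.example.com/v1/metrics"  # hypothetical endpoint
CHECKPOINT = Path("metrics_checkpoint.json")
RESYNC_WINDOW = timedelta(days=7)                # sliding re-sync for late corrections

def load_checkpoint() -> datetime:
    if CHECKPOINT.exists():
        return datetime.fromisoformat(json.loads(CHECKPOINT.read_text())["updated_since"])
    return datetime.now(timezone.utc) - timedelta(days=30)  # initial backfill horizon

def fetch_page(updated_since: datetime, page: int, max_retries: int = 5) -> dict:
    """GET one page, backing off exponentially on 429/5xx to stay under the rate limit."""
    for attempt in range(max_retries):
        resp = requests.get(
            BASE_URL,
            params={"updated_since": updated_since.isoformat(), "page": page},
            timeout=30,  # OAuth2 token handling omitted for brevity
        )
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (429, 500, 502, 503):
            time.sleep(min(15 * 2 ** attempt, 900))  # 15s, 30s, 60s, ... capped at 15 min
            continue
        resp.raise_for_status()
    raise RuntimeError("retry budget exhausted")

def run() -> None:
    run_started = datetime.now(timezone.utc)
    since = load_checkpoint() - RESYNC_WINDOW    # re-pull the window so late revisions are captured
    page, rows = 1, []
    while True:
        payload = fetch_page(since, page)
        rows.extend(payload["results"])
        if not payload.get("has_more"):
            break
        page += 1
    # ... normalize the nested JSON in `rows` and load it here ...
    CHECKPOINT.write_text(json.dumps({"updated_since": run_started.isoformat()}))
```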

Example 3: Vendor CSVs in object storage

Context: Logistics vendor drops daily CSV files into a bucket path like /vendorX/dt=YYYY-MM-DD/shipments.csv

  • Reachability: Cross-account bucket access via IAM role; event notifications available.
  • Anatomy: CSV with header; shipment_id, customer_id, status, updated_ts.
  • Patterns: Late files possible up to 48 hours; occasional schema drift (new columns).
  • Importance: Medium-high for operations; freshness 2 hours acceptable.
  • Data quality: Duplicates on retries; timezone mix-ups historically.
  • Scale: ~5 GB/day; weekend low volume.

Ingestion choice: Event-driven copy on object create, with partition discovery. Use schema-on-read or evolve the schema as new columns appear. Deduplicate on (shipment_id, updated_ts). Normalize timestamps to UTC.
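
As a sketch of the dedup and UTC normalization steps, the snippet below uses pandas and the column names above; the vendor's local timezone is a placeholder assumption you would confirm during discovery.

```python
import pandas as pd

def clean_shipments(csv_path: str, source_tz: str = "America/Chicago") -> pd.DataFrame:
    """Load one vendor drop, normalize updated_ts to UTC, and drop duplicates from retried uploads."""
    df = pd.read_csv(csv_path)

    # source_tz is an assumption; the historical timezone mix-ups make this a discovery question.
    df["updated_ts"] = (
        pd.to_datetime(df["updated_ts"])
        .dt.tz_localize(source_tz, ambiguous="NaT", nonexistent="shift_forward")
        .dt.tz_convert("UTC")
    )

    # Retried uploads resend the same rows: keep one row per (shipment_id, updated_ts).
    return df.sort_values("updated_ts").drop_duplicates(
        subset=["shipment_id", "updated_ts"], keep="last"
    )

# Example:
# shipments = clean_shipments("vendorX/dt=2026-01-08/shipments.csv")
```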

Step-by-step: run a discovery interview

1) Prepare: Skim docs, list unknowns, draft your checklist.
2) Interview: Meet the system owner; confirm business goals and constraints first.
3) Verify: Request sample dumps or test credentials; run small probes.
4) Model: Sketch entities, keys, and flows; choose ingestion method.
5) Decide: Document trade-offs, SLAs, and success criteria; get sign-off.

Interview question bank

  • What events or records matter most for the business outcome?
  • When is data considered "ready" and how late can it arrive?
  • How are schema changes communicated? Is there a deprecation period?
  • What credentials and network paths are required? Any rotation policies?
  • What are the peak loads and maintenance windows?
  • If ingestion fails, what is the acceptable recovery time and data loss?

Exercises (hands-on)

Do these to build muscle memory, then review your answers against the checklist that follows both exercises.

Exercise 1: Map discovery fields for an HR API

Scenario: WorkLyfe HR provides employee profiles via REST. Endpoints: /employees, /departments. Supports updated_since. OAuth2. Rate limit 120 req/hour. PII: emails, phone. Weekly maintenance Sundays 02:00–04:00 UTC. Occasional backdated corrections within 14 days.

  • List the top 8 discovery items you would document (use the checklist above).
  • Propose an ingestion method and freshness plan.
  • Note two data quality risks and mitigations.

Exercise 2: Identify risks in a sensor stream

Scenario: Factory sensors publish JSON to an MQTT broker. Topics per line: factory/lineX/sensor#. Events at 10–50/sec. Some sensors disconnect intermittently. Time is device local time; some clocks drift by minutes.

  • Name at least five risks or constraints you must discover.
  • For each, suggest a mitigation or validation check.

Checklist for both exercises:

  • Included owners and access?
  • Captured change pattern and cadence?
  • Quantified scale and rate limits?
  • Addressed PII/compliance or security constraints?
  • Defined success criteria and backfill/retry approach?

Common mistakes and self-check

  • Skipping access proof. Self-check: Did you obtain and test credentials in a sandbox?
  • Underestimating late data. Self-check: Do you have a reprocessing window and idempotent loads?
  • Ignoring rate limits. Self-check: Are concurrency and backoff configured and tested?
  • Assuming stable schema. Self-check: Do you track schema changes and handle column adds/removes?
  • Missing owner alignment. Self-check: Are SLAs and change notifications agreed and documented?

Practical projects

  • Create a one-page Source Discovery document (RAPIDS sections) for a public API (e.g., a mock) and include a proposed ingestion diagram.
  • Build a tiny probe: call an API with backoff, collect response size stats, and produce a discovery summary JSON (counts, fields, null %); a starting sketch follows this list.
  • Simulate schema drift: process daily CSVs where columns evolve; write rules to auto-adapt and log changes to a registry file.
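
For the second project above, a starting point might look like this sketch: it pulls a sample from a hypothetical JSON endpoint and summarizes record counts, observed fields, and null percentages for the discovery document. The URL and limit parameter are assumptions; backoff is omitted here (see the Example 2 sketch).

```python
import json
from collections import Counter

import requests  # any HTTP client would do

def probe(url: str, sample_size: int = 500) -> dict:
    """Fetch a sample of records and summarize field presence and null rates."""
    records = requests.get(url, params={"limit": sample_size}, timeout=30).json()
    present, nulls = Counter(), Counter()
    for record in records:
        for key, value in record.items():
            present[key] += 1
            if value in (None, "", []):
                nulls[key] += 1
    total = len(records)
    return {
        "record_count": total,
        "fields": {
            key: {
                "present_pct": round(100 * present[key] / total, 1),
                "null_pct": round(100 * nulls[key] / present[key], 1),
            }
            for key in sorted(present)
        },
    }

if __name__ == "__main__":
    # Hypothetical endpoint returning a JSON array of records.
    print(json.dumps(probe("https://api.example.com/v1/items"), indent=2))
```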

Mini challenge

You discover a payments system exposing a Kafka topic with at-least-once delivery, occasional reorders, and GDPR deletions via tombstone messages. In 5 bullet points, write the minimum ingestion requirements and checks you must implement.

Who this is for

  • Data Engineers starting ingestion projects.
  • Analytics Engineers validating upstream feasibility.
  • Platform Engineers defining data contracts with producers.

Prerequisites

  • Basic understanding of databases, files, and APIs.
  • Familiarity with batch vs stream processing concepts.
  • Comfort reading JSON/CSV and ER diagrams.

Learning path

  • 1) Learn data formats and contracts (JSON, CSV, Avro, schema evolution).
  • 2) Study access and security (authN/Z, secrets handling, networking basics).
  • 3) Explore ingestion patterns (batch, CDC, streaming) and when to use each.
  • 4) Practice discovery interviews and probes; build a repeatable template.
  • 5) Implement observability for ingestion (freshness, completeness, quality checks).

Next steps

  • Complete the exercises below and write your own discovery template.
  • Take the Quick Test to confirm understanding. Anyone can take it; only logged-in users will have their progress saved.
  • Pick one Practical Project and execute it this week.

Quick Test

Short quiz to check your grasp of Source System Discovery. Everyone can take it for free; login is only needed if you want your progress saved.

Practice Exercises

2 exercises to complete

Instructions

Using the scenario in the lesson (WorkLyfe HR), produce a concise discovery summary that covers:

  • At least 8 key discovery items (owners, access, interface, schema highlights, change pattern, cadence, scale, quality, SLAs).
  • Your ingestion method and freshness plan.
  • Two data quality risks and mitigations.

Expected Output

A structured bullet list or short document capturing owners, OAuth2 access, endpoints, updated_since usage, daily batch with 14-day re-sync, rate limit handling, PII handling, success criteria, and quality mitigations.

Source System Discovery — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
