Why this matters
As a Data Architect, you decide how data leaves source systems and enters your platform. The right integration pattern reduces latency, cost, and risks like data loss or inconsistent metrics. Real tasks you will face:
- Choosing between batch, micro-batch, and streaming for a new product’s data feed.
- Designing change data capture (CDC) to avoid overloading a production database.
- Defining contracts for files delivered by a vendor via SFTP.
- Handling late-arriving data, replays, and backfills safely and idempotently.
- Ensuring reliable error handling and monitoring with clear SLAs.
Concept explained simply
Source system integration patterns are repeatable ways to extract data from operational systems into analytics or downstream services. You pick a pattern based on latency needs, volume, data volatility, coupling tolerance, and reliability.
Mental model
Think of integration as water flowing through pipes:
- Valve = trigger (schedule, event, or change log)
- Pipe size = throughput
- Filters = validation/transformation
- Meters = monitoring/observability
- Reservoirs = staging areas or queues
Your job: choose valves, pipe sizes, and filters that deliver clean, timely water without flooding the city.
Common patterns at a glance
- File-based batch (SFTP/CSV/Parquet): simple, cost-effective for daily/hourly loads.
- Database extracts (JDBC/ODBC): direct read; watch for source load and security.
- Change Data Capture (CDC): incremental via logs (e.g., WAL/binlog); near-real-time with low source impact.
- Event-driven (queues/streams/webhooks): low latency, decoupled, ideal for operational integrations.
- API pull/push: flexible but rate-limited; best for SaaS and smaller volumes.
- Replication/mirroring: near-real-time copy of tables; good for read-heavy analytics.
- Virtualization/federation: query in place; fast to start, not ideal for heavy analytics.
How to choose a pattern
- Latency: real-time (events/CDC), near-real-time (micro-batch/CDC), batch (files/extracts).
- Volume: very large volumes favor files, streams, or CDC; APIs may throttle.
- Coupling: events and files decouple; direct DB reads couple to schema and load.
- Change frequency: frequent updates favor CDC/events; static data suits batch.
- Data model stability: unstable schemas benefit from schema registry and contracts.
- Operational maturity: choose patterns your team can monitor and support.
- Cost and security: minimize egress costs; ensure encryption and least privilege.
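If it helps to make the trade-offs concrete, the criteria above can be folded into a rough first-pass helper. A minimal sketch; the thresholds and returned pattern names are illustrative assumptions, not fixed rules:

```python
def suggest_pattern(latency_seconds: float, daily_rows: int,
                    api_only: bool, team_runs_streaming: bool) -> str:
    """Very rough heuristic mapping requirements to a candidate pattern.
    Thresholds are illustrative assumptions, not standards."""
    if api_only:
        return "API pull (micro-batch within rate limits)"
    if latency_seconds <= 60:
        # Sub-minute needs point to events or log-based CDC, if the team can operate them.
        return "event-driven streaming or CDC" if team_runs_streaming else "CDC with micro-batch apply"
    if latency_seconds <= 15 * 60:
        return "CDC or micro-batch extracts"
    # Relaxed latency: favor files or bulk extracts, which minimize source load.
    return "file-based batch" if daily_rows >= 10_000_000 else "batch database extract"

print(suggest_pattern(latency_seconds=90, daily_rows=5_000_000,
                      api_only=False, team_runs_streaming=True))
```

Treat the output as a starting point for discussion; operational maturity, cost, and security still decide the final call.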
Worked examples
Example 1: Nightly ERP to Data Lake (Batch Files)
Context: ERP exports 50M rows/day. Latency requirement: next morning.
Pattern: Vendor drops partitioned Parquet via SFTP. Ingest to landing, validate schema, move to bronze, then transform to silver/gold.
Why: Large volume with a relaxed latency requirement. Files minimize source load and are cost-effective.
Key controls: file naming convention, manifest files, checksum validation, idempotent loads using file IDs and watermarks.
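A minimal sketch of the idempotent-load control, assuming the vendor's file names are unique and using an in-memory ledger; in practice the ledger of processed file IDs and checksums would live in a metadata table:

```python
import hashlib
from pathlib import Path

processed: dict[str, str] = {}  # file_id -> checksum ledger; a real pipeline would persist this

def checksum(path: Path) -> str:
    """MD5 of the file contents, used to detect re-delivered or altered files."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def ingest(path: Path) -> str:
    file_id = path.name  # assumes the vendor's naming convention makes names unique
    digest = checksum(path)
    if processed.get(file_id) == digest:
        return "skipped (already loaded)"                # idempotent re-run: no duplicate load
    if file_id in processed:
        return "quarantined (same name, new checksum)"   # contract violation: same file, different content
    # load_to_landing(path) would copy the file into the landing/bronze zone here (hypothetical helper)
    processed[file_id] = digest
    return "loaded"
```

Re-running the same file becomes a no-op, which is exactly the property you want after a retry or a vendor re-delivery.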
Example 2: CRM incremental updates (CDC)
Context: Sales records update all day. Business needs near-real-time dashboards.
Pattern: Log-based CDC from CRM database to a stream, then to warehouse.
Why: Frequent updates; CDC avoids full table scans and reduces source load.
Key controls: ordering by LSN/SCN, dedupe by primary key + op type, SCD2 handling for history, replay from offsets.
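The ordering and dedupe controls can be sketched as a small pre-apply step: keep only the latest change per primary key, then apply in LSN order. Field names (customer_id, lsn, op) are illustrative assumptions:

```python
from typing import Iterable

def latest_change_per_key(events: Iterable[dict]) -> list[dict]:
    """Collapse a CDC micro-batch to one event per primary key, keeping the highest LSN.
    Assumes each event carries 'customer_id', 'lsn', and 'op' ('I'/'U'/'D')."""
    latest: dict[str, dict] = {}
    for e in events:
        key = e["customer_id"]
        if key not in latest or e["lsn"] > latest[key]["lsn"]:
            latest[key] = e
    # Apply in LSN order so the target never moves backwards in time.
    return sorted(latest.values(), key=lambda e: e["lsn"])

batch = [
    {"customer_id": "c1", "lsn": 101, "op": "U", "name": "Ana"},
    {"customer_id": "c1", "lsn": 99,  "op": "I", "name": "An"},   # older event, dropped by dedupe
    {"customer_id": "c2", "lsn": 105, "op": "D"},
]
for change in latest_change_per_key(batch):
    print(change["op"], change["customer_id"], change["lsn"])
```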
Example 3: E-commerce events (Event-driven)
Context: Checkout, add-to-cart events required within seconds for recommendations.
Pattern: Producers publish to a message broker; consumers stream to analytics and services.
Why: Low latency, decoupling, scalable fan-out.
Key controls: schema registry, partition keys, exactly-once or idempotent sink writes, dead-letter queues.
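On the consumer side, the idempotent-sink and dead-letter controls might look like the sketch below; the broker client and warehouse table are replaced with in-memory stand-ins, and the event_id field is an assumed part of the contract:

```python
import json

sink: dict[str, dict] = {}   # stands in for the analytics table, keyed by event_id
dead_letter: list[str] = []  # stands in for a dead-letter topic

def handle(raw: str) -> None:
    """Process one message: parse, validate, and upsert by event_id (idempotent sink write)."""
    try:
        event = json.loads(raw)
        event_id = event["event_id"]   # required by the (assumed) event contract
        _ = event["type"]              # minimal contract check: required field must exist
    except (json.JSONDecodeError, KeyError):
        dead_letter.append(raw)        # bad payloads go to the DLQ instead of blocking the stream
        return
    sink[event_id] = event             # replaying the same event overwrites, not duplicates

for msg in [
    '{"event_id": "e1", "type": "add_to_cart", "sku": "A12"}',
    '{"event_id": "e1", "type": "add_to_cart", "sku": "A12"}',  # duplicate delivery, harmless
    'not-json',                                                  # malformed, lands in DLQ
]:
    handle(msg)

print(len(sink), "events in sink,", len(dead_letter), "in DLQ")  # 1 event, 1 DLQ entry
```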
Example 4: SaaS marketing tool (API Pull)
Context: Moderate volume, hourly refresh acceptable.
Pattern: Paginated API pulls with incremental parameters (updated_after).
Why: The SaaS tool exposes only an API; micro-batch pulls stay within rate limits.
Key controls: backoff/retry, stateful checkpointing per entity, 429 handling, quota-aware scheduler.
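A minimal sketch of the pull loop with exponential backoff and a watermark checkpoint; the fetch_page stub, parameter names (updated_after, page), and status-code handling are assumptions about a generic SaaS API, not any specific vendor:

```python
import time

def fetch_page(updated_after: str, page: int) -> tuple[int, list[dict]]:
    """Stand-in for an HTTP GET to the vendor API; returns (status_code, records).
    A real implementation would use an HTTP client and the vendor's documented parameters."""
    data = {1: [{"id": "m1", "updated_at": "2024-05-02T10:00:00Z"}], 2: []}
    return 200, data.get(page, [])

def pull_increment(checkpoint: str, max_retries: int = 5) -> str:
    """Pull all pages newer than the checkpoint, backing off on 429/5xx responses.
    Returns the new checkpoint (max updated_at seen), which the caller should persist."""
    page, newest = 1, checkpoint
    while True:
        for attempt in range(max_retries):
            status, records = fetch_page(updated_after=checkpoint, page=page)
            if status == 200:
                break
            time.sleep(min(2 ** attempt, 60))  # exponential backoff with an assumed 60s cap
        else:
            raise RuntimeError(f"giving up on page {page} after {max_retries} attempts")
        if not records:
            return newest
        for r in records:
            newest = max(newest, r["updated_at"])  # advance the watermark only after a successful page
            # write_to_staging(r) would land the record here (hypothetical helper)
        page += 1

print(pull_increment("2024-05-01T00:00:00Z"))
```

Persisting the returned checkpoint per entity is what lets the next scheduled run resume incrementally instead of re-pulling history.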
Core design elements
- Contracts: define schema, encoding, partitioning, nullability, and delivery guarantees.
- Idempotency: ensure re-running loads does not create duplicates (use natural/business keys, dedup windows).
- Watermarking: track high-water marks (timestamps, incremental IDs) to load only new data (see the sketch after this list).
- Schema evolution: adopt additive-first changes, use schema registry and compatibility checks.
- Backfills and reprocess: plan for historical loads and replay with isolation (separate staging).
- Error handling: quarantine bad records, alert, and retry with exponential backoff.
- Observability: metrics for latency, throughput, success rate, lag, and consumer offsets.
- Security: encryption in transit/at rest, key rotation, least privilege, PII masking.
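To make the watermarking and idempotency items concrete, here is a minimal sketch of high-water-mark tracking for an incremental load; persisting the watermark in a JSON file is an illustrative stand-in for a metadata table:

```python
import json
from pathlib import Path

STATE = Path("watermark_customers.json")  # illustrative location; usually a metadata table

def read_watermark(default: str = "1970-01-01T00:00:00Z") -> str:
    return json.loads(STATE.read_text())["high_water_mark"] if STATE.exists() else default

def write_watermark(value: str) -> None:
    STATE.write_text(json.dumps({"high_water_mark": value}))

def incremental_load(rows: list[dict]) -> list[dict]:
    """Keep only rows newer than the stored watermark, then advance it.
    Assumes each row carries an 'updated_at' ISO timestamp."""
    wm = read_watermark()
    new_rows = [r for r in rows if r["updated_at"] > wm]
    if new_rows:
        write_watermark(max(r["updated_at"] for r in new_rows))
    return new_rows

source = [
    {"id": 1, "updated_at": "2024-06-01T09:00:00Z"},
    {"id": 2, "updated_at": "2024-06-01T10:30:00Z"},
]
print(len(incremental_load(source)), "new rows")   # first run: 2
print(len(incremental_load(source)), "new rows")   # second run: 0, nothing re-loaded
```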
Exercises
Do these now. Answers are available, but try first. These mirror the graded exercises below.
Exercise 1 — Choose a pattern and controls
Scenario: A payments system emits 5–10k transactions/minute. Stakeholders need fraud features updated within 1–2 minutes. The payments DB is sensitive to load; the vendor can publish events or allow read replicas. Choose an integration pattern and list 4 concrete design controls.
Hints
- Think decoupling and latency.
- Consider idempotency and ordering.
Show solution
Pattern: Event-driven streaming from vendor to a broker, with consumers pushing to the analytics store and feature store.
- Controls: partition by account or transaction ID for ordering; schema registry with backward compatibility; idempotent writes keyed by transaction_id; dead-letter topic with alerting; consumer offset monitoring and lag alerts.
Exercise 2 — Design a CDC pipeline
Scenario: A customer table receives frequent updates and occasional deletes. Daily batch is too stale; near-real-time is needed. Outline a CDC approach, including handling deletes and late updates. Include how you will reprocess a 2-day backlog if the pipeline pauses.
Hints
- Think log-based CDC and SCD handling.
- Plan replay from a durable offset.
Show solution
Approach: Log-based CDC reading from DB replication logs. Emit insert/update/delete events with a monotonically increasing LSN. Use a merge pattern into the warehouse table (a sketch follows this list):
- Inserts/updates: upsert by customer_id with updated_at watermark, maintain SCD2 (close old row, open new version).
- Deletes: soft-delete flag or close SCD2 record; optionally move keys to a tombstone table.
- Late updates: compare updated_at and ignore older events.
- Reprocess: resume from stored offset/LSN two days back; reapply events idempotently; dedupe by (customer_id, LSN).
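A minimal sketch of that merge logic applied in memory; in a warehouse this would be a MERGE/upsert statement, and the column names, soft-delete choice, and event shape are illustrative assumptions (events are assumed already ordered and deduped by LSN upstream):

```python
current: dict[str, dict] = {}  # customer_id -> open SCD2 row; a real table also holds closed versions
history: list[dict] = []       # closed SCD2 versions

def apply_change(event: dict) -> None:
    """Apply one CDC event. Assumes fields: customer_id, op ('I'/'U'/'D'), updated_at, lsn, attributes."""
    key, existing = event["customer_id"], current.get(event["customer_id"])
    if existing and event["updated_at"] <= existing["updated_at"]:
        return  # late or duplicate event: the target is already newer, ignore it
    if existing:
        existing["valid_to"] = event["updated_at"]   # close the old version (SCD2)
        history.append(existing)
    if event["op"] == "D":
        current.pop(key, None)                       # or keep a soft-deleted row, per the design choice
        return
    current[key] = {**event.get("attributes", {}),
                    "customer_id": key,
                    "updated_at": event["updated_at"],
                    "valid_to": None}                 # open the new version

apply_change({"customer_id": "c1", "op": "I", "updated_at": "2024-06-01T08:00:00Z", "lsn": 1,
              "attributes": {"tier": "basic"}})
apply_change({"customer_id": "c1", "op": "U", "updated_at": "2024-06-01T09:00:00Z", "lsn": 2,
              "attributes": {"tier": "gold"}})
apply_change({"customer_id": "c1", "op": "U", "updated_at": "2024-06-01T08:30:00Z", "lsn": 3,
              "attributes": {"tier": "silver"}})      # late update, ignored
print(current["c1"]["tier"], len(history))            # gold 1
```

The late-update guard is what keeps a replayed two-day backlog from rolling the table backwards in time.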
Self-checklist
- I can justify pattern choice using latency, volume, and coupling.
- I defined idempotency keys and dedup strategy.
- I know how to capture and store watermarks or offsets.
- I have a plan for schema evolution and compatibility tests.
- I documented backfill, replay, and failure recovery steps.
- I set SLAs and corresponding monitors/alerts.
Common mistakes and how to self-check
- No idempotency: duplicates appear after retries. Self-check: simulate retries and verify stable row counts (see the sketch after this list).
- Overloading source DB with full scans. Self-check: measure query impact or use read replicas/CDC.
- Ignoring deletes: downstream analytics show ghost records. Self-check: confirm tombstone handling.
- Weak contracts: vendors change schema without notice. Self-check: enforce schema checks and contract tests before ingestion.
- Unbounded DLQs: errors pile up. Self-check: set DLQ SLOs and operational playbooks.
- No backfill plan: gaps remain after outages. Self-check: run a time-window reprocessing drill monthly.
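The retry self-check above can be automated as a tiny test: load the same batch twice and assert the row count does not change. A minimal sketch against an in-memory upsert, purely illustrative:

```python
def upsert(table: dict, rows: list[dict], key: str = "order_id") -> None:
    """Idempotent load: keyed writes, so re-delivery overwrites rather than appends."""
    for r in rows:
        table[r[key]] = r

def test_retry_is_idempotent() -> None:
    table: dict = {}
    batch = [{"order_id": "o1", "amount": 10}, {"order_id": "o2", "amount": 25}]
    upsert(table, batch)
    count_after_first = len(table)
    upsert(table, batch)          # simulated retry of the same batch
    assert len(table) == count_after_first, "retry changed the row count: load is not idempotent"

test_retry_is_idempotent()
print("idempotency check passed")
```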
Who this is for
- Data Architects and Senior Data Engineers designing ingestion strategies.
- Analytics Engineers who must understand upstream data freshness and reliability.
- Platform Engineers operating data pipelines.
Prerequisites
- Basic ETL/ELT concepts (staging/bronze/silver/gold).
- SQL joins, merges (upsert), and window functions.
- Familiarity with streaming concepts (topics, partitions, offsets).
Learning path
- Review core patterns: batch, CDC, events, APIs, replication.
- Practice choosing patterns using scenarios (see exercises).
- Implement idempotent merges and watermarking in a sandbox.
- Add monitoring: lag, throughput, failures, and DLQs.
- Run a controlled backfill and replay drill.
Practical projects
- Build a mini CDC pipeline: simulate a table with inserts/updates/deletes, emit change events, and upsert into an analytics table with SCD2.
- File ingestion with contracts: generate partitioned Parquet files, validate schema on read, and implement idempotent loading by file manifest.
- Event-driven demo: publish JSON events to a local broker, consume and write to a warehouse table with exactly-once semantics via dedup keys.
Next steps
- Document a pattern decision record (PDR) for one of your systems: include latency, volume, coupling, and operational plan.
- Set up alerts for lag, error rate, and schema drift in a current pipeline.
- Schedule a quarterly replay/backfill exercise.
Mini challenge
Pick one real source in your environment. In 30 minutes, choose a pattern, list 5 risks, and write the idempotency/dedup plan. Share with a peer for review.