Connectors And CDC Basics

Learn Connectors And CDC Basics for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Data Platform Engineers must reliably move data from diverse sources into streaming pipelines. Connectors and Change Data Capture (CDC) are how you keep analytics dashboards fresh, power event-driven microservices, build audit trails, and migrate systems with near-zero downtime. If you choose the wrong connector type or CDC strategy, you risk missed events, duplicates, or heavy load on production databases.

  • Real tasks you will do: capture database changes into a streaming bus, mirror SaaS data into topics, load topics into storage/warehouses, and operate connectors at scale.
  • Typical outcomes: correct ordering per key, idempotent sinks, backpressure handling, and recoverable pipelines using offsets and dead-letter queues.

Concept explained simply

Connectors are adapters that read from or write to systems (databases, file/object stores, APIs, message queues). CDC is a technique to stream row-level changes from databases as they happen.

Common connector categories:

  • Source connectors: pull or receive data from systems into your streaming platform.
  • Sink connectors: push data from the streaming platform to destinations (warehouse, object store, OLTP, search, etc.).

CDC strategies:

  • Log-based CDC: reads the database transaction log (e.g., binlog/WAL/redo). Lowest impact and best fidelity (inserts/updates/deletes with ordering). A sketch of the change events it emits follows this list.
  • Query-based CDC: periodically queries for new/changed rows (e.g., updated_at > last_watermark). Easier to set up, but it can miss rapid intermediate updates, typically cannot capture hard deletes, and adds query load to the source.
  • Trigger-based CDC: database triggers write each change to a change table. Precise but intrusive; it adds write overhead and complexity to the source database.
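
To make this concrete, here is a minimal, tool-agnostic sketch (Python) of the kind of change event a log-based CDC connector emits. The envelope fields (op, before, after, ts_ms, txid) are illustrative assumptions, not any specific tool's format.

from dataclasses import dataclass
from typing import Optional
import time

@dataclass
class ChangeEvent:
    # Illustrative change-event envelope; real connectors define their own schema.
    table: str              # source table, e.g. "public.orders"
    op: str                 # "c" = create, "u" = update, "d" = delete
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)
    ts_ms: int              # source commit timestamp in milliseconds
    txid: Optional[int] = None  # transaction id, useful for ordering and dedup

# An update to an order, as captured from the transaction log
event = ChangeEvent(
    table="public.orders",
    op="u",
    before={"order_id": 42, "status": "pending"},
    after={"order_id": 42, "status": "paid"},
    ts_ms=int(time.time() * 1000),
    txid=981234,
)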

Mental model

Think of a conveyor belt:

  • The source connector grabs items from a source and places them on the belt (the streaming platform).
  • CDC ensures we put every item on the belt in the right order without skipping or duplicating.
  • The sink connector takes items off the belt and stores them safely at the destination.

Offsets are bookmarks that remember where we are on the belt. If the system restarts, we continue from the last bookmark. Snapshots are the initial bulk load to populate the destination before switching to the live belt of changes.
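
A minimal sketch of the bookmark idea, assuming a simple file-backed offset store and a read_changes_since() callable that stands in for whatever the source exposes (real connectors usually keep offsets in the streaming platform itself):

import json
import os

OFFSET_FILE = "offsets.json"  # illustrative location for the bookmark

def load_offset(default=0):
    # Resume from the last committed bookmark, or start from the beginning.
    if os.path.exists(OFFSET_FILE):
        with open(OFFSET_FILE) as f:
            return json.load(f)["position"]
    return default

def commit_offset(position):
    # Persist the bookmark only after the batch is safely written downstream.
    with open(OFFSET_FILE, "w") as f:
        json.dump({"position": position}, f)

def run_once(read_changes_since, write_batch):
    position = load_offset()
    batch, new_position = read_changes_since(position)  # hypothetical source read
    if batch:
        write_batch(batch)           # deliver to the streaming platform or sink
        commit_offset(new_position)  # commit after delivery: at-least-once semantics

Committing the offset only after the batch is written is what gives at-least-once delivery: a crash between the write and the commit replays the batch instead of losing it.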

Core building blocks: connectors and CDC

  • Initial snapshot: one-time (or resumable) copy of source state to bootstrap the destination.
  • Incremental changes: continuous stream of inserts/updates/deletes after snapshot completes.
  • Offsets and checkpoints: stored positions to resume reads after failures or restarts.
  • Delivery semantics: at-least-once (most common), exactly-once (harder; often per-sink idempotency is used), at-most-once (rarely acceptable).
  • Ordering: per-key ordering matters; partition by a stable key to keep related changes ordered.
  • Schema evolution: track column additions, type changes, and nullability; manage backward/forward compatibility.
  • Backpressure and retries: pace reading/writing; apply exponential backoff; use dead-letter queues for poison messages.
  • Idempotency and deduplication: use primary keys + versioning or change LSN/transaction IDs to deduplicate at the sink (see the sketch after this list).
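
A minimal sketch of sink-side deduplication, assuming each change carries a primary key and a monotonically increasing version (an LSN or transaction id serves the same purpose). In a real warehouse this becomes a MERGE keyed on the primary key plus the version column.

def apply_change(sink: dict, key, version, row):
    # Keep only the newest version per key so replays and duplicates become no-ops.
    current = sink.get(key)
    if current is not None and version <= current["version"]:
        return  # older or duplicate change: ignore
    sink[key] = {"version": version, "row": row}  # row=None records a delete (tombstone)

sink_table = {}
apply_change(sink_table, key=42, version=7, row={"order_id": 42, "status": "paid"})
apply_change(sink_table, key=42, version=7, row={"order_id": 42, "status": "paid"})  # replay, ignored
apply_change(sink_table, key=42, version=9, row=None)                                # later delete wins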

Worked examples

Example 1: MySQL orders to the streaming platform via log-based CDC

  1. Run an initial snapshot of tables: customers, orders, order_items.
  2. Switch to binlog streaming (log-based CDC) to capture all changes.
  3. Partition topics by the business key (e.g., order_id) so events for the same order stay ordered.
  4. Include metadata: source table, operation (create/update/delete), source timestamp, transaction id.
  5. At the sink (warehouse), merge using order_id, apply upserts, and handle deletes via soft delete flag or tombstones.

Why this works

Log-based CDC is low-impact on MySQL and preserves change order within a transaction, reducing missed updates and enabling accurate downstream merges.
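
A minimal sketch of step 3: deriving the partition from a stable hash of the business key, so every change for the same order lands on the same partition and stays ordered. The partition count and hash choice are illustrative; streaming clients apply the same idea when you set the message key.

import hashlib

NUM_PARTITIONS = 12  # illustrative

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the business key: all changes for one order_id map to one partition,
    # so consumers see them in the order they were produced.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

assert partition_for("order-42") == partition_for("order-42")  # same key, same partition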

Example 2: PostgreSQL product catalog to object storage

  1. Snapshot products to partitioned files by date (e.g., dt=YYYY-MM-DD).
  2. Stream WAL changes as micro-batches. Write compacted, time-partitioned files (e.g., hourly), with a stable file naming scheme.
  3. Use a dedupe key such as (product_id, change_version or WAL LSN). Keep the last write per key per window.
  4. Downstream analytics read partitioned data; late updates land in the next partition and are merged by version.

Why this works

You get near-real-time updates with storage-efficient files, while versioning enables idempotent processing and consistent analytics.
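
A minimal sketch of steps 2 and 3: compacting one micro-batch so only the last write per product_id (by WAL LSN) survives before the hourly file is written. The record shape here is an assumption for illustration.

def compact_window(changes):
    # changes: iterable of dicts like {"product_id": ..., "lsn": ..., "row": {...}}
    latest = {}
    for change in changes:
        key = change["product_id"]
        if key not in latest or change["lsn"] > latest[key]["lsn"]:
            latest[key] = change
    return list(latest.values())  # one record per product per window, ready to write out

batch = [
    {"product_id": 1, "lsn": 100, "row": {"price": 9.99}},
    {"product_id": 1, "lsn": 105, "row": {"price": 8.99}},   # later change wins
    {"product_id": 2, "lsn": 101, "row": {"price": 19.00}},
]
compacted = compact_window(batch)  # product 1 keeps only the lsn=105 version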

Example 3: SaaS CRM API to streaming platform (polling)

  1. Start with a full export endpoint as snapshot.
  2. Incremental polling: call the changes endpoint using updated_since watermark.
  3. Respect rate limits; implement exponential backoff and jitter.
  4. Deduplicate using external_id + last_modified; ignore out-of-order duplicates when last_modified is older.
Why this works

APIs rarely provide logs; polling with a robust watermark model approximates CDC while controlling load and avoiding duplicates.
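
A minimal sketch of this polling loop, assuming a hypothetical fetch_changes(updated_since=...) client, callables for loading and saving the watermark, and exponential backoff with jitter for rate limits:

import random
import time

def poll_forever(fetch_changes, emit, load_watermark, save_watermark,
                 base_delay=1.0, max_delay=60.0):
    seen = {}  # external_id -> last_modified; bound this cache in a real connector
    delay = base_delay
    while True:
        watermark = load_watermark()  # e.g. an ISO-8601 timestamp string
        try:
            records = fetch_changes(updated_since=watermark)  # hypothetical API client
        except Exception:
            # Rate limited or transient failure: exponential backoff with jitter.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
            continue
        delay = base_delay  # reset after a successful call
        new_watermark = watermark
        for rec in records:
            if seen.get(rec["external_id"], "") >= rec["last_modified"]:
                continue  # out-of-order or duplicate record: ignore
            seen[rec["external_id"]] = rec["last_modified"]
            emit(rec)
            new_watermark = max(new_watermark, rec["last_modified"])
        save_watermark(new_watermark)  # persist only what was actually processed
        time.sleep(base_delay)

Saving a watermark derived from the last processed record, rather than wall-clock time minus a buffer, is what prevents both gaps and unbounded re-reads.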

Design checklist for reliable connectors and CDC

  • Define the truth: Which tables/entities and which operations (insert/update/delete) are required?
  • Choose CDC type: log-based if available; otherwise polling with a strong watermark.
  • Plan snapshot: consistent snapshot point; exclude giant historical partitions if not needed.
  • Offsets: where stored, retention, and recovery plan validated in staging.
  • Partitioning: choose key for ordering; avoid hot partitions.
  • Schema evolution: allow additive changes; prepare backfill strategy for breaking changes.
  • Throughput and backpressure: throttle reads/writes; size batches; set retries and DLQ rules (a retry/DLQ sketch follows this checklist).
  • Idempotent sinks: upsert/merge by primary key; dedupe by version or change id.
  • Observability: metrics for lag, error rate, throughput, and end-to-end latency; alerts on lag growth.
  • Security: least-privilege credentials; encrypt in transit and at rest; avoid copying sensitive columns if not needed.
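
A minimal sketch of the retry-then-dead-letter pattern from the throughput item above, assuming a generic deliver() callable and a plain list standing in for a real dead-letter topic:

import time

def deliver_with_retries(deliver, message, dead_letters,
                         max_attempts=5, backoff_ms=200):
    # Retry transient failures with exponential backoff; after max_attempts,
    # park the message in the dead-letter queue instead of blocking the pipeline.
    for attempt in range(1, max_attempts + 1):
        try:
            deliver(message)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                dead_letters.append({"message": message, "error": str(exc)})
                return False
            time.sleep((backoff_ms / 1000.0) * (2 ** (attempt - 1)))

def always_fail(message):
    raise RuntimeError("sink unavailable")  # simulates a sink that is down

dlq = []
deliver_with_retries(always_fail, {"order_id": 42}, dlq)  # lands in dlq after 5 attempts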

Hands-on exercises

Do these to solidify the ideas. Then check the solutions.

Exercise 1: Pick a CDC strategy per table

You have an OLTP database with tables:

  • accounts (business-critical, high write rate)
  • invoices (moderate writes, occasional backfills)
  • audit_log (append-only)

Constraints: You may enable DB logs; you must minimize load; you need near-real-time for accounts, hourly for invoices, and daily for audit_log.

Deliverable: Choose CDC type for each table, snapshot plan, and partition key for the topic. Write a short justification.

Tip

High-write, low-latency tables favor log-based CDC. Append-only tables are simpler; polling with a high watermark can suffice if latency is relaxed.

Exercise 2: Draft a connector config skeleton

Create a minimal, tool-agnostic source connector config skeleton for PostgreSQL orders streaming with log-based CDC. Include snapshot strategy, offset storage, and partitioning key.

{
  "name": "pg-orders-source",
  "type": "source",
  "source": {
    "system": "postgresql",
    "host": "",
    "database": "sales",
    "tables": ["public.orders", "public.order_items"],
    "cdc": {
      "mode": "log",
      "snapshot": "initial_then_stream",
      "consistent_snapshot": true
    }
  },
  "stream": {
    "topic_prefix": "sales_",
    "partition_key": "order_id",
    "include_metadata": ["op", "ts_ms", "txid", "lsn"]
  },
  "offsets": {
    "storage": "internal_store",
    "retention_days": 14
  },
  "retries": {"max_attempts": 10, "backoff_ms": 2000},
  "security": {"ssl": true}
}

Hint

Include fields that explain how you snapshot, how you resume (offsets), and how you maintain ordering (partition key).

Common mistakes and self-check

  • Mistake: Polling with an imprecise watermark (e.g., now-5m) leading to gaps or duplicates. Self-check: Is your watermark a persisted value tied to last processed record?
  • Mistake: Ignoring deletes. Self-check: Do you emit tombstones or soft-delete flags and handle them downstream?
  • Mistake: No plan for schema evolution. Self-check: Can your sink accept added columns without breaking?
  • Mistake: Hot partitions. Self-check: Is your partition key skewed (e.g., country=US for 80% of events)? Consider hashing or composite keys.
  • Mistake: No backpressure control. Self-check: What happens when the sink slows down? Are retries and rate limits configured?
  • Mistake: Missing observability. Self-check: Do you track end-to-end lag and have alerts? (A lag sketch follows this list.)
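
To make the lag self-check concrete, here is a minimal sketch that computes end-to-end lag from the source timestamp carried in each event (ts_ms is an assumed metadata field, as in the earlier examples) and alerts when it exceeds a threshold:

import time

LAG_ALERT_MS = 5 * 60 * 1000  # illustrative threshold: 5 minutes

def observe_lag(event, alert):
    # End-to-end lag = time from source commit to this point in the pipeline.
    lag_ms = int(time.time() * 1000) - event["ts_ms"]
    if lag_ms > LAG_ALERT_MS:
        alert(f"end-to-end lag {lag_ms} ms exceeds threshold")
    return lag_ms

lag = observe_lag({"ts_ms": int(time.time() * 1000) - 1500}, alert=print)  # ~1.5 s, no alert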

Mini challenge

Design a pipeline to stream changes from a legacy ERP (PostgreSQL) to a warehouse. Requirements: snapshot 2 TB of historical data once, then near-real-time updates; strict ordering per customer_id; handle deletes; warehouse supports MERGE. Outline CDC choice, partitioning, snapshot plan, idempotency at sink, and how you will monitor lag.

Considerations
  • Prefer log-based CDC; snapshot each table in chunks by primary key range (see the sketch after this list).
  • Partition by customer_id; use tombstones for deletes.
  • Use MERGE on (customer_id, updated_at) or version/LSN to dedupe.
  • Expose metrics: snapshot progress, change lag, DLQ count.
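
A minimal sketch of the chunked snapshot idea from the first consideration, assuming an integer primary key and hypothetical fetch_rows / write_chunk / commit_checkpoint callables; checkpointing each chunk lets a 2 TB snapshot resume instead of restarting:

def snapshot_in_chunks(fetch_rows, write_chunk, commit_checkpoint,
                       min_id, max_id, chunk_size=100_000):
    # Walk the primary key range in fixed-size chunks; checkpoint after each chunk
    # so an interrupted snapshot resumes from the last completed range.
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        rows = fetch_rows(lo, hi)      # e.g. SELECT ... WHERE id BETWEEN lo AND hi
        write_chunk(rows)              # load into the warehouse staging area
        commit_checkpoint(hi)          # resume point if the snapshot is interrupted
        lo = hi + 1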

Who this is for

  • Data Platform Engineers and Data Engineers new to streaming ingestion.
  • Backend engineers integrating operational databases with event streams.

Prerequisites

  • Basic SQL and understanding of OLTP databases.
  • Familiarity with a streaming platform concept (topics/partitions/offsets).
  • Comfort with JSON or structured message formats.

Learning path

  1. Connectors and CDC basics (this lesson).
  2. Serialization and schema evolution in streams.
  3. Delivery semantics and exactly-once patterns.
  4. Sink design: upserts, compaction, partitioned storage.
  5. Operations: monitoring, scaling, and incident response.

Practical projects

  • Build a CDC pipeline from a sample PostgreSQL DB to topics, then to a local warehouse or files with upsert logic.
  • Implement a polling connector for a mock REST API with a robust watermark and dedupe logic.
  • Create monitoring dashboards for connector lag, error rates, and throughput; simulate backpressure.

Next steps

  • Complete the exercises above and verify with the provided solutions.
  • Take the Quick Test to validate your understanding.
  • Apply patterns to a small internal dataset before touching production systems.

Practice Exercises

2 exercises to complete

Instructions

Given three tables: accounts (high write, near-real-time), invoices (moderate writes, hourly), audit_log (append-only, daily). Choose CDC strategy (log-based, polling, or triggers), snapshot plan, and partition key for each. Justify briefly.

Expected Output
A short plan listing per-table CDC type, snapshot approach, and topic partitioning key with rationale.

Connectors And CDC Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
