
Streaming Platform Basics

Learn Streaming Platform Basics for Data Platform Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 11, 2026 | Updated: January 11, 2026

Why this skill matters for a Data Platform Engineer

Modern data platforms need real-time pipelines for product analytics, fraud detection, ML features, microservice integration, audit trails, and data mesh domains. Streaming Platform Basics give you the foundation to design reliable topics and partitions, enforce schemas, move data with connectors and CDC, process streams in near real time, and operate safely under load.

  • Unlocks low-latency data movement across services and warehouses.
  • Reduces data breakage via schema governance and compatibility.
  • Supports scale with partitions, consumer groups, and backpressure strategies.
  • Improves reliability with delivery semantics, monitoring, and governance.

What you'll be able to do

  • Design Kafka topics and partitions for scale and ordering.
  • Register and evolve schemas safely with a Schema Registry.
  • Build ingestion pipelines using Kafka Connect and CDC.
  • Implement basic stream processing (filters, joins, windows).
  • Choose and verify delivery semantics (at-least-once, exactly-once).
  • Monitor lag and throughput, and handle backpressure gracefully.
  • Apply multi-tenant governance for safe, secure shared clusters.

Prerequisites

  • Comfort with Linux shell and Docker basics.
  • Familiarity with at least one programming language (Python or Java recommended).
  • Basic understanding of data modeling and messaging concepts.

Learning path

  1. Core Kafka concepts
    • Topics, partitions, offsets, consumer groups, retention, compaction.
    • Hands-on: create a topic, produce/consume, change retention.
  2. Schemas and compatibility
    • Avro/JSON/Protobuf, subject naming, backward/forward/full compatibility.
    • Hands-on: register v1, evolve to v2 with a default field.
  3. Connectors and CDC
    • Source vs sink connectors, tasks, error handling, dead letter queues.
    • Hands-on: simulate a CDC pipeline from a database to a topic.
  4. Stream processing
    • Stateless/stateful ops, windows, joins, aggregations.
    • Hands-on: rolling counts per key with tumbling windows.
  5. Delivery semantics
    • At-most-once, at-least-once, exactly-once; idempotence and transactions.
    • Hands-on: configure a producer for idempotence and test retries.
  6. Operations
    • Monitor lag, throughput, consumer liveness; backpressure strategies.
    • Hands-on: observe lag growth and fix with scaling or batching.
  7. Governance
    • Multi-tenant isolation, quotas, ACLs, naming standards, retention policies.
    • Hands-on: design a naming and ACL plan for two teams.

Worked examples

1) Create a topic for ordering and scale

Goal: route all events of the same order_id to the same partition for per-order ordering, and scale reads/writes.

# Create a topic with 12 partitions, compaction off, 7-day retention
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic orders.events \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=$((7*24*60*60*1000)) \
  --config cleanup.policy=delete

# Produce keyed messages (kafka-python shown as an example client)
import json
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers="localhost:9092")
event = {"order_id": 123, "status": "PLACED", "ts": 1699999999}
# key = order_id ensures partition stickiness, so per-order ordering holds
producer.send("orders.events", key=b"123", value=json.dumps(event).encode("utf-8"))
producer.flush()

Notes: Increase partitions to scale throughput and consumer parallelism. Use the message key to guarantee per-key ordering.

2) Register and evolve an Avro schema

Goal: add a new optional field without breaking old consumers.

// v1
record OrderEventV1 {
  string order_id;
  string status;
  long ts;
}

// v2 adds customer_tier with a default to keep backward compatibility
record OrderEventV2 {
  string order_id;
  string status;
  long ts;
  string customer_tier = "standard";
}

# Set subject compatibility to BACKWARD or FULL as required

With a default value, consumers reading with the new schema can still decode old messages (backward compatibility), and consumers still on v1 simply ignore the field they do not know about.
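
If you run Confluent Schema Registry, the evolution above can also be scripted against its REST API. The sketch below is a minimal Python example using the requests library; the registry address (localhost:8081) and the default TopicNameStrategy subject (orders.events-value) are assumptions.

import json
import requests

REGISTRY = "http://localhost:8081"   # assumed Schema Registry address
SUBJECT = "orders.events-value"      # default TopicNameStrategy subject
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

order_event_v2 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "status", "type": "string"},
        {"name": "ts", "type": "long"},
        # New optional field with a default keeps the change backward compatible
        {"name": "customer_tier", "type": "string", "default": "standard"},
    ],
}

# Pin the compatibility mode for this subject before evolving it
requests.put(f"{REGISTRY}/config/{SUBJECT}", headers=HEADERS,
             json={"compatibility": "BACKWARD"}).raise_for_status()

# Register v2; the registry rejects it if it violates the compatibility mode
resp = requests.post(f"{REGISTRY}/subjects/{SUBJECT}/versions", headers=HEADERS,
                     json={"schema": json.dumps(order_event_v2)})
resp.raise_for_status()
print(resp.json())  # e.g. {"id": 2}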

3) Configure a CDC source connector (conceptual)

Goal: stream database changes from orders_db to Kafka.

{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "2",
    "database.hostname": "db",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "orders_db",
    "table.include.list": "public.orders,public.order_items",
    "tombstones.on.delete": "false",
    "snapshot.mode": "initial",
    "topic.prefix": "cdc",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "_dlq.cdc.orders"
  }
}

Key points: size tasks for throughput, capture bad records instead of letting them block the pipeline (note that Kafka Connect's built-in dead letter queue settings apply to sink connectors, so source-side failures lean on errors.tolerance and logging), and align topic naming with your governance.
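
If you deploy the connector through the Kafka Connect REST API, the JSON above (saved, say, as orders-cdc.json) can be submitted with a short script. This is a minimal sketch in Python; the Connect worker address (localhost:8083) and the file name are assumptions.

import json
import requests

CONNECT_URL = "http://localhost:8083"   # assumed Kafka Connect worker address

with open("orders-cdc.json") as f:      # the connector JSON shown above
    connector = json.load(f)

# Create the connector; Connect spreads its tasks across the worker cluster
resp = requests.post(f"{CONNECT_URL}/connectors",
                     headers={"Content-Type": "application/json"},
                     json=connector)
resp.raise_for_status()

# Check connector and task state after startup
status = requests.get(f"{CONNECT_URL}/connectors/orders-cdc/status").json()
print(status["connector"]["state"], [t["state"] for t in status["tasks"]])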

4) Windowed aggregation (conceptual Kafka Streams/ksqlDB)

Goal: count events per status in 1-minute windows, per order_id.

// Pseudocode API
stream = source("orders.events")
counts = stream
  .groupBy(("order_id", "status"))   // composite key: one count per order and status
  .window(tumbling = 1m)
  .aggregate(count)
counts.to("orders.status.counts")

Use per-key windows to keep state bounded; monitor state store size and commit intervals.
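
For intuition, the same tumbling-window logic can be sketched with a plain Python consumer (kafka-python). This is not a stream-processing framework (no fault-tolerant state, no window closing); it only illustrates how events fall into window buckets. The broker address and group id are assumptions.

import json
from collections import Counter
from kafka import KafkaConsumer

WINDOW_MS = 60_000  # 1-minute tumbling windows

consumer = KafkaConsumer(
    "orders.events",
    bootstrap_servers="localhost:9092",   # assumed broker address
    group_id="orders-agg-sketch",         # assumed group id
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

counts = Counter()  # (window_start, order_id, status) -> running count
for msg in consumer:
    event = msg.value
    window_start = (msg.timestamp // WINDOW_MS) * WINDOW_MS
    key = (window_start, event["order_id"], event["status"])
    counts[key] += 1
    # A real stream processor persists this state and emits results when the
    # window closes; here we just print the running count for each bucket.
    print(window_start, event["order_id"], event["status"], counts[key])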

5) Monitor lag and throughput

# Check consumer group lag per partition
kafka-consumer-groups --describe --group orders-app --bootstrap-server localhost:9092
# Observe CURRENT-OFFSET, LOG-END-OFFSET, LAG per partition

# Throughput calculation example
messages_per_sec = 20000
avg_msg_bytes = 800
approx_mb_per_sec = (messages_per_sec * avg_msg_bytes) / (1024*1024)  # ≈ 15.3 MB/s

Track: consumer lag, request rate, bytes in/out, failed fetches, rebalance count, GC pauses, and end-to-end latency.
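
Lag can also be computed programmatically by comparing a group's committed offsets with the log-end offsets. A minimal sketch with kafka-python; the broker address is an assumption and the group name matches the CLI example above.

from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"  # assumed broker address
GROUP = "orders-app"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
committed = admin.list_consumer_group_offsets(GROUP)  # {TopicPartition: OffsetAndMetadata}

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
end_offsets = consumer.end_offsets(list(committed))   # log-end offset per partition

for tp in sorted(committed, key=lambda p: (p.topic, p.partition)):
    lag = end_offsets[tp] - committed[tp].offset
    print(f"{tp.topic}-{tp.partition}: lag={lag}")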

6) Handle backpressure safely

Symptoms: rising lag, timeouts, OOM in stream apps. Strategies:

  • Increase consumer parallelism (more instances) if partitions allow.
  • Batch and compress messages; tune linger/batch sizes.
  • Reduce source rate or apply sampling/dropping for non-critical traffic.
  • Tune max.poll.interval.ms, max.poll.records, and processing time per batch (see the sketch after this list).
  • Scale state stores/disks for streaming frameworks.
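
As a starting point for the polling-related tuning above, the knobs map to consumer options like these (kafka-python parameter names; the values are illustrative, not recommendations):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.events",
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="orders-app",
    max_poll_records=200,            # smaller batches shorten each poll loop
    max_poll_interval_ms=300_000,    # must exceed worst-case processing time per batch
    fetch_max_bytes=16 * 1024 * 1024,
    enable_auto_commit=False,        # commit after processing for at-least-once
)

for msg in consumer:
    # process(msg) ...
    consumer.commit()  # in practice, commit per batch rather than per record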

Delivery semantics refresher
  • At-most-once: fastest, can lose messages (commit before processing).
  • At-least-once: duplicates possible, no loss with retries (process before commit).
  • Exactly-once: transactional writes + idempotent producer; higher overhead.
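
A minimal sketch of an idempotent producer, using the confluent-kafka client (librdkafka configuration names); the broker address is an assumption.

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "enable.idempotence": True,             # implies acks=all and safe retries
    "retries": 5,
})

def on_delivery(err, msg):
    # With idempotence enabled, the broker deduplicates retried sends,
    # so transient errors do not produce duplicate records.
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("orders.events", key="123",
                 value=b'{"order_id":123,"status":"PLACED"}',
                 on_delivery=on_delivery)
producer.flush()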

Drills and exercises

  • Create a topic with 8 partitions and 3-day retention. Produce 1000 keyed messages and consume to verify ordering per key.
  • Register a schema, then evolve it by adding an optional field. Validate that old consumers can still read.
  • Simulate a CDC connector config for one table and include a DLQ. Explain each key property in comments.
  • Write a small script to measure simple producer throughput with two batch sizes and compare (a starting-point sketch follows this list).
  • Induce consumer lag (slow processing) and then reduce it by scaling instances or tuning max.poll.records.
  • Draft a tenant naming standard for topics and subjects: include team, domain, and privacy level.
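
For the throughput drill, here is a starting-point sketch with kafka-python; the broker address, topic, message count, and batch sizes are arbitrary assumptions.

import time
from kafka import KafkaProducer

def measure(batch_size, n=50_000):
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        batch_size=batch_size,               # bytes per in-memory batch
        linger_ms=10,                        # wait briefly so batches can fill
    )
    payload = b"x" * 800
    start = time.time()
    for i in range(n):
        producer.send("orders.events", key=str(i % 1000).encode(), value=payload)
    producer.flush()
    elapsed = time.time() - start
    print(f"batch_size={batch_size}: {n / elapsed:.0f} msgs/s")

measure(16_384)   # kafka-python default batch size
measure(131_072)  # larger batches usually raise throughput at some latency cost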

Common mistakes and debugging tips

  • No message key for ordered workflows: leads to cross-partition reordering. Fix: choose a stable key (e.g., order_id).
  • Unlimited retention: increases storage costs and compaction pressure. Set retention by data value and compliance needs.
  • Ignoring schema compatibility: breaking changes cause consumer failures. Use defaults or a migration strategy.
  • Too few partitions: limits throughput and parallelism. Project peak rates and size partitions accordingly.
  • Commit before processing with at-least-once: risks data loss. Process then commit, or use transactions.
  • Missing DLQ: bad records block pipelines. Route malformed messages to a DLQ with context.

Quick debugging checklist
  • High lag: check consumer health, rebalance frequency, and partition skew.
  • Time-outs: inspect broker/network I/O, request sizes, and compression.
  • Skewed partitions: examine key distribution; consider a custom partitioner or composite keys.
  • State store OOM: reduce window size, increase commit frequency, or spread state across more partitions.
  • Schema errors: validate subjects, compatibility mode, and serializer configs on both producer and consumer.

Mini project: Real-time Orders Pipeline

Build a mini platform to ingest, validate, and aggregate order events in near real time.

  1. Design topics: orders.events (12 partitions), orders.agg (6 partitions). Choose keys and retention for each.
  2. Define schemas: v1 with minimal fields; evolve to v2 by adding an optional field with a default.
  3. CDC simulation: produce create/update events representing DB changes to orders.events with keys = order_id.
  4. Stream app: aggregate counts of orders per status per minute, output to orders.agg.
  5. Reliability: configure producer idempotence and consumer commit order for at-least-once or exactly-once if supported.
  6. Operations: add a DLQ for parse errors (see the sketch below), measure lag, and tune batch sizes to stabilize throughput.
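
A minimal sketch of the parse-error DLQ from step 6, using kafka-python; the topic names, group id, and broker address are assumptions.

import json
from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP = "localhost:9092"  # assumed broker address

consumer = KafkaConsumer("orders.events", bootstrap_servers=BOOTSTRAP,
                         group_id="orders-stream-app", enable_auto_commit=False)
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)

for msg in consumer:
    try:
        event = json.loads(msg.value)
        # ... aggregate the valid event here ...
    except (ValueError, KeyError) as exc:
        # Route the bad record to the DLQ with enough context to debug it later
        producer.send("_dlq.orders.events", value=json.dumps({
            "error": str(exc),
            "source_topic": msg.topic,
            "partition": msg.partition,
            "offset": msg.offset,
            "raw_value": msg.value.decode("utf-8", errors="replace"),
        }).encode("utf-8"))
    consumer.commit()  # commit only after the record was processed or dead-lettered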

Success criteria
  • No consumer lag growth under your test load for at least 5 minutes.
  • Schema evolution does not break the stream app.
  • Per-order ordering holds across multiple events.
  • DLQ contains only invalid messages with clear error context.

Who this is for

  • Data Platform Engineers establishing real-time data foundations.
  • Data Engineers and Developers integrating microservices via events.
  • Analytics engineers preparing for streaming-first architectures.

Next steps

  • Finish each subskill below and complete the exam to check understanding.
  • Apply the mini project pattern to a domain in your company (e.g., payments, shipments, tracking).
  • Document your topic standards, schema rules, and DLQ policy as part of platform governance.

Streaming Platform Basics — Skill Exam

This exam checks practical understanding of streaming fundamentals. You can take it for free, and results appear instantly. Only logged-in users have their progress saved; everyone else can still complete the exam and see a score. Tips: read each scenario carefully, choose the best answer(s), and use the explanations to learn.

11 questions | 70% to pass
