Why this matters
Streaming ingestion is how modern systems move events in real time: clicks, payments, sensor signals, logs, and app telemetry. As a Data Engineer you will:
- Collect events with low latency into a durable stream (e.g., a message broker).
- Process events continuously for alerts, fraud checks, and near real-time dashboards.
- Land data reliably to storage and warehouses without duplicates or gaps.
- Handle spikes, out-of-order events, and schema changes without breaking pipelines.
Concept explained simply
Streaming ingestion means data flows in continuously and you handle it as it arrives. Instead of waiting for a nightly file, you read tiny chunks of data right now, push them through a pipeline, and store results quickly.
Mental model
Picture a conveyor belt:
- Producers put events on the belt (source apps, services).
- A broker keeps the belt moving and organizes events in partitions.
- Consumers pick items off the belt, transform them, and send them to sinks (databases, object storage, warehouses).
- Offsets mark where each consumer is on the belt. If a consumer restarts, it continues from its last offset.
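Here is a minimal sketch of the conveyor-belt model in plain Python. The list stands in for a partition, and the consumer remembers a committed offset so a restart resumes where it left off; the names (process, consume) are illustrative, not a specific broker API.

```python
# The "belt": a partition is an append-only log; the consumer tracks its offset.
partition = [{"event_id": i, "payload": f"click-{i}"} for i in range(10)]
committed_offset = 0  # a broker or external store would persist this checkpoint

def process(event):
    print("processed", event["event_id"])

def consume(from_offset):
    """Read events starting at the last committed offset; commit as we go."""
    global committed_offset
    for offset in range(from_offset, len(partition)):
        process(partition[offset])
        committed_offset = offset + 1  # commit only after successful processing

consume(committed_offset)                                   # first run: events 0..9
partition.append({"event_id": 10, "payload": "click-10"})   # new event arrives
consume(committed_offset)                                   # "restart" resumes at offset 10
```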
Core concepts and terms
Events, topics, partitions
- Event: A single record (e.g., {event_id, user_id, ts, payload}).
- Topic/Stream: Named channel for events.
- Partition: A shard of a topic to scale throughput and parallelism. Ordering is guaranteed only within a partition.
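A short sketch of key-based partitioning: hashing the partition key picks the partition, so all events for one user land on the same partition and stay ordered relative to each other. The hash function and partition count here are illustrative choices.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Use a stable hash (not Python's randomized hash()) so every producer
    # maps the same key to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for user_id in ["user-1", "user-2", "user-3", "user-1"]:
    print(user_id, "-> partition", partition_for(user_id))
# "user-1" always maps to the same partition, preserving per-user ordering.
```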
Delivery semantics
- At-most-once: No duplicates, but you can lose events.
- At-least-once: No loss, but duplicates can occur (most common in practice).
- Exactly-once: Effects occur once; usually requires careful design (idempotent writes, transactions, or framework support).
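To see why at-least-once plus idempotent handling gives exactly-once effects, here is a tiny sketch: the same event is delivered twice after a retry, and the sink applies each effect only once by tracking event_ids. The in-memory set stands in for durable state.

```python
processed_ids = set()
balance = 0

def apply_payment(event):
    """Idempotent handler: duplicates are detected by event_id and skipped."""
    global balance
    if event["event_id"] in processed_ids:
        return  # duplicate delivery; effect already applied
    balance += event["amount"]
    processed_ids.add(event["event_id"])

deliveries = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},  # redelivered after a retry
]
for event in deliveries:
    apply_payment(event)

print(balance)  # 15, not 25: exactly-once effect on top of at-least-once delivery
```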
Time in streaming
- Event time: When the event actually happened (from the producer).
- Processing time: When your system sees the event.
- Watermark: An estimate of how far event time has progressed; events older than the watermark are treated as late (see the sketch below).
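A minimal watermark sketch: track the maximum event time seen, subtract an allowed lateness, and treat anything older as late. The timestamps and 2-minute lateness are illustrative.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=2)
max_event_time = datetime.min

def update_watermark(event_time: datetime) -> datetime:
    """Advance the watermark as newer event times are observed."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    return max_event_time - ALLOWED_LATENESS  # current watermark

events = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 3, 0),
    datetime(2024, 1, 1, 12, 0, 30),  # arrives out of order
]
for ts in events:
    watermark = update_watermark(ts)
    status = "late" if ts < watermark else "on time"
    print(ts.time(), "watermark:", watermark.time(), "->", status)
```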
Back-pressure and lag
- Consumer lag: How far behind the consumer is from the latest events.
- Back-pressure: Signals that downstream is slower than upstream; systems slow down or buffer to stabilize.
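Consumer lag is just arithmetic over offsets: the newest offset per partition minus the consumer's committed offset. Real brokers expose these numbers through admin APIs; the dictionaries below are illustrative stand-ins.

```python
latest_offsets = {"clicks-0": 15_200, "clicks-1": 14_950, "clicks-2": 15_100}
committed_offsets = {"clicks-0": 15_180, "clicks-1": 12_400, "clicks-2": 15_095}

def lag_per_partition(latest: dict, committed: dict) -> dict:
    return {p: latest[p] - committed.get(p, 0) for p in latest}

lag = lag_per_partition(latest_offsets, committed_offsets)
print(lag)                          # {'clicks-0': 20, 'clicks-1': 2550, 'clicks-2': 5}
print("total lag:", sum(lag.values()))
# A sustained rise in total lag is a back-pressure signal: downstream is slower
# than upstream, so scale consumers or throttle producers before buffers overflow.
```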
Schemas and compatibility
- Define fields and types (e.g., Avro/Protobuf/JSON with schema) to avoid breaking changes.
- Plan for evolution: add new optional fields first; avoid removing or renaming fields without a compatibility strategy.
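A sketch of schema validation at ingest: check required fields and types, pass good records on, and route bad ones to a dead-letter queue with a reason. The schema and record shapes are illustrative; in practice a registry with Avro or Protobuf would do this work.

```python
SCHEMA = {"event_id": str, "user_id": str, "ts": int, "payload": dict}

def validate(record: dict):
    """Return (is_valid, reason)."""
    for field, expected_type in SCHEMA.items():
        if field not in record:
            return False, f"missing field: {field}"
        if not isinstance(record[field], expected_type):
            return False, f"wrong type for {field}"
    return True, None

valid_sink, dead_letter_queue = [], []

for record in [
    {"event_id": "e1", "user_id": "u1", "ts": 1700000000, "payload": {}},
    {"event_id": "e2", "user_id": "u2", "ts": "not-a-number", "payload": {}},
]:
    ok, reason = validate(record)
    if ok:
        valid_sink.append(record)
    else:
        dead_letter_queue.append({"record": record, "reason": reason})

print(len(valid_sink), "valid;", dead_letter_queue)
```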
Worked examples
Example 1: Clickstream to object storage with micro-batches
- Source: Web app emits events to a broker topic "clicks".
- Ingestion: A connector or consumer reads events every 60 seconds.
- Processing: Minimal transformation (add server_timestamp, validate schema).
- Sink: Write Parquet files to object storage partitioned by event_date/hour.
- Reliability: At-least-once. Use file naming with a unique run_id to avoid overwriting; run a dedup step downstream.
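One way to sketch the landing step of Example 1, assuming pyarrow is installed and a local directory stands in for object storage (a real pipeline would write to an s3:// or gs:// path). A unique run_id in the file name keeps retried runs from overwriting each other; duplicates are then removed downstream.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

def land_micro_batch(events: list[dict], base_dir: str = "clicks") -> Path:
    now = datetime.now(timezone.utc)
    for event in events:
        event["server_timestamp"] = now.isoformat()  # minimal transformation
    # Partition the output path by event_date/hour; make the file name unique per run.
    out_dir = Path(base_dir) / f"event_date={now:%Y-%m-%d}" / f"hour={now:%H}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"run_{uuid.uuid4().hex}.parquet"
    pq.write_table(pa.Table.from_pylist(events), out_path)
    return out_path

print(land_micro_batch([{"event_id": "e1", "user_id": "u1", "page": "/home"}]))
```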
Example 2: IoT sensors to alerts and time-series DB
- Source: Sensors publish temperature every second.
- Processing: Compute 1-minute event-time windows with average temperature.
- Alerts: If average exceeds threshold, emit alert events to an "alerts" topic.
- Sink: Write raw events to a time-series DB and aggregated windows to a warehouse.
- Late data: Allow 2 minutes of lateness with watermarks; update aggregates when late events arrive.
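A simplified sketch of the 1-minute event-time windows from Example 2: readings are grouped by the minute their event time falls into, and a late arrival updates the aggregate for its original window. A real pipeline would also check the watermark and drop events past the allowed lateness; the timestamps and readings are illustrative.

```python
from collections import defaultdict
from datetime import datetime

windows = defaultdict(list)  # window start -> temperature readings

def add_reading(event_time: datetime, temperature: float):
    window_start = event_time.replace(second=0, microsecond=0)  # 1-minute window
    windows[window_start].append(temperature)

def averages():
    return {start: sum(vals) / len(vals) for start, vals in sorted(windows.items())}

add_reading(datetime(2024, 1, 1, 12, 0, 10), 21.0)
add_reading(datetime(2024, 1, 1, 12, 1, 5), 22.0)
add_reading(datetime(2024, 1, 1, 12, 0, 50), 25.0)  # late, still inside allowed lateness
print(averages())  # the 12:00 window average updates to 23.0 on the late arrival
```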
Example 3: Payments with idempotent upserts
- Source: Payments service emits events with unique event_id.
- Processing: Validate schema and deduplicate by event_id within a 10-minute window.
- Sink: Upsert into a transactions table using event_id as the primary key.
- Outcome: Exactly-once effect in the sink, even if the consumer retries.
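A minimal sketch of the idempotent upsert from Example 3, using SQLite from the standard library as a stand-in for the transactions table; the exact upsert syntax (MERGE, INSERT ... ON CONFLICT) varies by database. Because event_id is the primary key, a redelivered event updates the existing row instead of creating a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        event_id   TEXT PRIMARY KEY,
        user_id    TEXT,
        amount_usd REAL,
        event_time TEXT
    )
""")

def upsert(event: dict):
    conn.execute(
        """
        INSERT INTO transactions (event_id, user_id, amount_usd, event_time)
        VALUES (:event_id, :user_id, :amount_usd, :event_time)
        ON CONFLICT(event_id) DO UPDATE SET
            user_id = excluded.user_id,
            amount_usd = excluded.amount_usd,
            event_time = excluded.event_time
        """,
        event,
    )

payment = {"event_id": "p-1", "user_id": "u-9", "amount_usd": 42.5,
           "event_time": "2024-01-01T12:00:00Z"}
upsert(payment)
upsert(payment)  # consumer retry redelivers the same event
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])  # 1 row, not 2
```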
Design principles you will use
- Prefer event-time semantics for analytics; set watermarks and allowed lateness.
- Design for at-least-once delivery; make sinks idempotent to achieve exactly-once effects.
- Choose a partition key that balances load and keeps related events together (e.g., user_id).
- Plan capacity with headroom (throughput spikes happen).
- Observe everything: consumer lag, end-to-end latency, error rates, dead-letter queue volume.
Hands-on exercises
Note: The Quick Test is available to everyone; only logged-in users get saved progress.
Exercise 1 — Capacity and partitioning plan (ex1)
You ingest 120,000 events per minute. Each event is 2 KB. Consumer max read rate per partition is 2 MB/s. Target p95 end-to-end latency: under 30 seconds. Propose:
- Number of partitions.
- Estimated steady-state throughput.
- Headroom rationale for spikes.
- Simple retention and DLQ strategy.
Write your plan in 3–5 bullet points.
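If you want a starting point for the arithmetic, here is a small calculation sketch with the exercise's numbers plugged in; the headroom factor and the rest of the plan (retention, DLQ) are still yours to justify.

```python
import math

events_per_minute = 120_000
event_size_kb = 2
per_partition_mb_s = 2.0

throughput_mb_s = events_per_minute / 60 * event_size_kb / 1024  # KB/s -> MB/s
print(f"steady-state throughput: {throughput_mb_s:.2f} MB/s")

def partitions_needed(throughput_mb_s: float, per_partition_mb_s: float,
                      headroom: float) -> int:
    """headroom=0.5 means 50% spare capacity on top of steady state."""
    return math.ceil(throughput_mb_s * (1 + headroom) / per_partition_mb_s)

print("with 50% headroom:", partitions_needed(throughput_mb_s, per_partition_mb_s, 0.5))
```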
Exercise 2 — Idempotent sink with dedup (ex2)
Design logic that ensures exactly-once effect in the sink for an at-least-once stream. Requirements:
- Unique key: event_id; include event_time.
- Handle late events up to 5 minutes.
- Store only the latest event per event_id if duplicates arrive.
Provide SQL or pseudocode that performs dedup and upsert.
Self-check checklist for exercises
- [ ] You computed throughput in MB/s for Exercise 1.
- [ ] Your partition count leaves 25–100% headroom.
- [ ] Your plan explains how lag is drained under spikes.
- [ ] Your dedup logic uses a stable key (event_id) and a deterministic rule.
- [ ] Your upsert does not create duplicates on retries.
Common mistakes and self-check
- Mistake: Using processing time for analytics windows. Fix: Use event time with watermarks and allowed lateness.
- Mistake: No schema enforcement. Fix: Validate records and route bad ones to a dead-letter queue with reason.
- Mistake: Too few partitions or skewed keys, creating hot partitions. Fix: Size partition counts for peak throughput and choose keys that spread load evenly.
- Mistake: Assuming exactly-once without idempotent writes. Fix: Upserts by key; deduplicate before landing.
- Mistake: Ignoring consumer lag. Fix: Alert on lag thresholds; scale consumers or increase partitions.
Quick self-audit before you ship
- [ ] Does each pipeline step define delivery semantics?
- [ ] Are keys chosen to balance load and preserve needed locality?
- [ ] Are retries safe due to idempotency or transactions?
- [ ] Are metrics and alerts defined for lag, errors, and latency?
- [ ] Are late/out-of-order events handled intentionally?
Practical projects
- Real-time clickstream to warehouse: Ingest to a broker, micro-batch to object storage in Parquet, build a deduped view in your warehouse.
- IoT monitoring: Stream sensor events, compute rolling averages and anomaly flags, alert when thresholds breach, and store both raw and aggregates.
- Payments ledger: Consume payment events, validate schema, deduplicate by event_id, upsert into a transactions table, and expose a near real-time balance view.
Mini challenge
You see rising consumer lag after a marketing campaign starts. Propose a quick, safe plan to reduce lag within 15 minutes without losing data.
One possible approach
- Scale consumers horizontally (more instances in the same consumer group).
- Temporarily increase partitions if your platform supports safe repartitioning, or rekey future events to better-distributed keys.
- Reduce sink batch sizes to speed commits; ensure idempotent upserts.
- Pause noncritical enrichments; route raw events to storage and backfill enrichments later.
- Check for a hot key; if found, shard the key (e.g., user_id plus a small hash suffix).
Who this is for
- Aspiring and junior Data Engineers needing real-time ingestion fundamentals.
- Analysts/Scientists integrating streaming data into models and dashboards.
- Backend engineers interfacing services with data platforms.
Prerequisites
- Basic understanding of batch data ingestion and file formats (CSV/JSON/Parquet).
- Comfort with SQL and at least one programming language.
- Familiarity with cloud storage and basic networking concepts.
Learning path
- Start here: streaming vs batch, delivery semantics, partitions, offsets.
- Next: schema management and validation strategies.
- Then: event-time processing, windows, watermarks, late data.
- Reliability: idempotent writes, dedup, DLQ design.
- Operations: monitoring, lag reduction, back-pressure handling, scaling.
- Apply: build one end-to-end project from source to warehouse.
Key metrics to watch
- Producer throughput (events/s, MB/s) and error rate.
- Consumer lag, processing latency, end-to-end latency.
- DLQ volume and top failure reasons.
- Partition skew (hot partitions) and rebalance frequency.
Next steps
- Implement one practical project end-to-end with clear delivery semantics.
- Add observability: track lag, latency, and error budgets with alerts.
- Iterate on schema evolution with a compatibility policy and validation at ingest.
Quick Test
Take the test to check your understanding. Available to everyone; only logged-in users get saved progress.