
Streaming Ingestion Basics

Learn Streaming Ingestion Basics for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Streaming ingestion is how modern systems move events in real time: clicks, payments, sensor signals, logs, and app telemetry. As a Data Engineer you will:

  • Collect events with low latency into a durable stream (e.g., a message broker).
  • Process events continuously for alerts, fraud checks, and near real-time dashboards.
  • Land data reliably to storage and warehouses without duplicates or gaps.
  • Handle spikes, out-of-order events, and schema changes without breaking pipelines.

Concept explained simply

Streaming ingestion means data flows in continuously and you handle it as it arrives. Instead of waiting for a nightly file, you read tiny chunks of data right now, push them through a pipeline, and store results quickly.

Mental model

Picture a conveyor belt:

  • Producers put events on the belt (source apps, services).
  • A broker keeps the belt moving and organizes events in partitions.
  • Consumers pick items off the belt, transform them, and send them to sinks (databases, object storage, warehouses).
  • Offsets mark where each consumer is on the belt. If a consumer restarts, it continues from its last offset.
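
To make the model concrete, here is a toy sketch in Python. `ToyTopic` and `ToyConsumer` are illustrative names, not a real client library; a production broker such as Kafka also handles durability, replication, and consumer groups:

```python
# Toy illustration of topics, partitions, and offsets (not a real broker).
from collections import defaultdict

class ToyTopic:
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, event: dict) -> None:
        # Hash the key so all events for the same key land in one partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)

class ToyConsumer:
    def __init__(self, topic: ToyTopic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition: int, max_records: int = 10) -> list:
        start = self.offsets[partition]
        batch = self.topic.partitions[partition][start:start + max_records]
        self.offsets[partition] += len(batch)  # "commit" by advancing the offset
        return batch

topic = ToyTopic(num_partitions=3)
topic.produce("user-42", {"event": "click", "page": "/home"})
topic.produce("user-42", {"event": "click", "page": "/pricing"})

consumer = ToyConsumer(topic)
for p in range(3):
    print(p, consumer.poll(p))
```

If the consumer restarts, it resumes from its stored offsets instead of re-reading the whole belt.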

Core concepts and terms

Events, topics, partitions
  • Event: A single record (e.g., {event_id, user_id, ts, payload}).
  • Topic/Stream: Named channel for events.
  • Partition: A shard of a topic to scale throughput and parallelism. Ordering is guaranteed only within a partition.
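
A sketch of key-based partition assignment, using CRC32 as a stand-in for a broker's partitioner (Kafka's default, for example, uses murmur2). Events sharing a key land in one partition, which is why their relative order is preserved:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [("user-1", "login"), ("user-2", "click"), ("user-1", "purchase")]
for key, action in events:
    print(key, action, "-> partition", partition_for(key, num_partitions=4))
# All "user-1" events share a partition, so their relative order is preserved;
# order between "user-1" and "user-2" events is not guaranteed.
```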
Delivery semantics
  • At-most-once: No duplicates, but you can lose events.
  • At-least-once: No loss, but duplicates can occur (most common in practice).
  • Exactly-once: Effects occur once; usually requires careful design (idempotent writes, transactions, or framework support).
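
A minimal sketch of turning at-least-once delivery into an exactly-once effect with an idempotent write; the in-memory dict stands in for a keyed sink table:

```python
processed = {}  # event_id -> payload; stands in for a keyed sink table

def handle(event: dict) -> None:
    # At-least-once delivery may replay this event; keying the write by
    # event_id makes it idempotent, so the effect happens exactly once.
    processed[event["event_id"]] = event["payload"]

delivery = [
    {"event_id": "e1", "payload": 10},
    {"event_id": "e1", "payload": 10},  # duplicate from a retry
    {"event_id": "e2", "payload": 7},
]
for e in delivery:
    handle(e)

print(len(processed))  # 2 -- the duplicate had no extra effect
```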
Time in streaming
  • Event time: When the event actually happened (from the producer).
  • Processing time: When your system sees the event.
  • Watermark: Your best guess of how far you are in event time, allowing for late events.
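
A minimal watermark sketch, assuming a fixed allowed lateness: track the maximum event time seen, subtract the lateness, and treat anything older than that as late:

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=2)
max_event_time = datetime.min

def is_late(event_time: datetime) -> bool:
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    return event_time < watermark

t0 = datetime(2026, 1, 8, 12, 0, 0)
print(is_late(t0))                          # False: first event sets the clock
print(is_late(t0 + timedelta(minutes=5)))   # False: advances the watermark to 12:03
print(is_late(t0 + timedelta(minutes=1)))   # True: 12:01 is behind the watermark
```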
Back-pressure and lag
  • Consumer lag: How far behind the consumer is from the latest events.
  • Back-pressure: Signals that downstream is slower than upstream; systems slow down or buffer to stabilize.
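
Lag is just the distance between the newest offset in each partition and the consumer's committed offset. A sketch with made-up offset numbers:

```python
def consumer_lag(latest_offsets: dict, committed_offsets: dict) -> dict:
    # Lag per partition = newest available offset - last committed offset.
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest = {0: 1_500, 1: 1_480, 2: 9_900}      # hypothetical broker positions
committed = {0: 1_498, 1: 1_480, 2: 4_200}   # hypothetical consumer positions
print(consumer_lag(latest, committed))       # partition 2 is far behind (hot key?)
```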
Schemas and compatibility
  • Define fields and types (e.g., Avro/Protobuf/JSON with schema) to avoid breaking changes.
  • Plan for evolution: add new optional fields first; avoid removing or renaming fields without a compatibility strategy.
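
A sketch of schema validation at ingest with dead-lettering; the hand-rolled type check is illustrative only, and real pipelines typically use Avro or Protobuf with a schema registry:

```python
REQUIRED = {"event_id": str, "user_id": str, "ts": str}

dead_letters = []  # stands in for a dead-letter queue/topic

def validate(record: dict) -> bool:
    for field, ftype in REQUIRED.items():
        if field not in record or not isinstance(record[field], ftype):
            # Route bad records aside with a reason, never silently drop them.
            dead_letters.append({"record": record,
                                 "reason": f"bad or missing field: {field}"})
            return False
    return True

good = [r for r in [
    {"event_id": "e1", "user_id": "u1", "ts": "2026-01-08T12:00:00Z"},
    {"event_id": "e2", "user_id": 42},  # wrong type and missing ts
] if validate(r)]
print(len(good), len(dead_letters))  # 1 1
```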

Worked examples

Example 1: Clickstream to object storage with micro-batches
  • Source: Web app emits events to a broker topic "clicks".
  • Ingestion: A connector or consumer reads events every 60 seconds.
  • Processing: Minimal transformation (add server_timestamp, validate schema).
  • Sink: Write Parquet files to object storage partitioned by event_date/hour.
  • Reliability: At-least-once. Use file naming with a unique run_id to avoid overwriting; run a dedup step downstream.
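
A sketch of the micro-batch write, assuming pyarrow is installed; `read_batch` is a hypothetical stand-in for polling the broker, and the local directory stands in for object storage:

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

def read_batch():
    # Stand-in for polling the "clicks" topic every 60 seconds.
    return [{"event_id": "e1", "user_id": "u1", "page": "/home"},
            {"event_id": "e2", "user_id": "u2", "page": "/pricing"}]

now = datetime.now(timezone.utc)
rows = read_batch()
for row in rows:
    row["server_timestamp"] = now.isoformat()  # the minimal transformation step

# Unique run_id in the file name: a retried run writes a new file instead of
# overwriting, and a downstream dedup step collapses any duplicates.
run_id = uuid.uuid4().hex
out_dir = Path(f"clicks/event_date={now:%Y-%m-%d}/hour={now:%H}")
out_dir.mkdir(parents=True, exist_ok=True)
pq.write_table(pa.Table.from_pylist(rows),
               str(out_dir / f"part-{run_id}.parquet"))
```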
Example 2: IoT sensors to alerts and time-series DB
  • Source: Sensors publish temperature every second.
  • Processing: Compute 1-minute event-time windows with average temperature.
  • Alerts: If average exceeds threshold, emit alert events to an "alerts" topic.
  • Sink: Write raw events to a time-series DB and aggregated windows to a warehouse.
  • Late data: Allow 2 minutes of lateness with watermarks; update aggregates on late arrivals.
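
A sketch of the windowing logic: 1-minute event-time windows keyed by window start, where a late reading (within the allowed lateness) simply updates the window it belongs to:

```python
from collections import defaultdict

WINDOW_SECONDS = 60

# window start (epoch seconds) -> [sum, count]; a real engine would also
# track watermarks and evict windows past the allowed lateness.
windows = defaultdict(lambda: [0.0, 0])

def add_reading(epoch_seconds: int, temperature: float) -> None:
    window_start = epoch_seconds - (epoch_seconds % WINDOW_SECONDS)
    acc = windows[window_start]
    acc[0] += temperature
    acc[1] += 1

def averages() -> dict:
    return {w: s / c for w, (s, c) in windows.items()}

add_reading(1_000, 20.0)
add_reading(1_030, 22.0)   # next window starts at 1020
add_reading(1_065, 21.0)   # same window as 1030
add_reading(1_010, 24.0)   # late arrival: updates the first window
print(averages())          # {960: 22.0, 1020: 21.5}
```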
Example 3: Payments with idempotent upserts
  • Source: Payments service emits events with unique event_id.
  • Processing: Validate schema and deduplicate by event_id within a 10-minute window.
  • Sink: Upsert into a transactions table using event_id as the primary key.
  • Outcome: Exactly-once effect in the sink, even if the consumer retries.
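
A sketch of the idempotent upsert, using SQLite as a stand-in for the transactions table; most warehouses offer an equivalent MERGE or ON CONFLICT form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        event_id   TEXT PRIMARY KEY,
        user_id    TEXT,
        amount     REAL,
        event_time TEXT
    )
""")

def upsert(event: dict) -> None:
    # Retries replay the same event_id; the keyed upsert keeps the effect
    # exactly-once even though delivery is at-least-once.
    conn.execute(
        """INSERT INTO transactions (event_id, user_id, amount, event_time)
           VALUES (:event_id, :user_id, :amount, :event_time)
           ON CONFLICT(event_id) DO UPDATE SET
               user_id = excluded.user_id,
               amount = excluded.amount,
               event_time = excluded.event_time""",
        event,
    )

payment = {"event_id": "e1", "user_id": "u1", "amount": 9.99,
           "event_time": "2026-01-08T12:00:00Z"}
upsert(payment)
upsert(payment)  # duplicate delivery: no second row, same final state
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])  # 1
```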

Design principles you will use

  • Prefer event-time semantics for analytics; set watermarks and allowed lateness.
  • Design for at-least-once delivery; make sinks idempotent to achieve exactly-once effects.
  • Choose a partition key that balances load and keeps related events together (e.g., user_id).
  • Plan capacity with headroom (throughput spikes happen).
  • Observe everything: consumer lag, end-to-end latency, error rates, dead-letter queue volume.

Hands-on exercises


Exercise 1 — Capacity and partitioning plan

You ingest 120,000 events per minute. Each event is 2 KB. Consumer max read rate per partition is 2 MB/s. Target p95 end-to-end latency: under 30 seconds. Propose:

  • Number of partitions.
  • Estimated steady-state throughput.
  • Headroom rationale for spikes.
  • Simple retention and DLQ strategy.

Write your plan in 3–5 bullet points. Expected output: a short plan that states throughput in MB/s, a partition count with headroom, how lag will be drained under spikes, and a retention and DLQ approach. A small helper for the arithmetic follows below.
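
A back-of-envelope helper, using only the numbers from the prompt; the final partition count and headroom are still yours to choose and justify:

```python
import math

events_per_min = 120_000
event_kb = 2
per_partition_mb_s = 2.0

mb_per_s = events_per_min * event_kb / 1024 / 60   # steady-state throughput
min_partitions = math.ceil(mb_per_s / per_partition_mb_s)

print(f"throughput ~= {mb_per_s:.2f} MB/s")        # ~3.91 MB/s
print(f"minimum partitions = {min_partitions}")    # 2, before any headroom
```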

Exercise 2 — Idempotent sink with dedup

Design logic that ensures exactly-once effect in the sink for an at-least-once stream. Requirements:

  • Unique key: event_id; include event_time.
  • Handle late events up to 5 minutes.
  • Store only the latest event per event_id if duplicates arrive.

Provide SQL or pseudocode that performs dedup and upsert.

Self-check checklist for exercises
  • [ ] You computed throughput in MB/s for Exercise 1.
  • [ ] Your partition count leaves 25–100% headroom.
  • [ ] Your plan explains how lag is drained under spikes.
  • [ ] Your dedup logic uses a stable key (event_id) and a deterministic rule.
  • [ ] Your upsert does not create duplicates on retries.

Common mistakes and self-check

  • Mistake: Using processing time for analytics windows. Fix: Use event time with watermarks and allowed lateness.
  • Mistake: No schema enforcement. Fix: Validate records and route bad ones to a dead-letter queue with reason.
  • Mistake: Too few partitions, or skewed keys that create hot partitions. Fix: Size partitions for peak throughput and choose well-distributed keys.
  • Mistake: Assuming exactly-once without idempotent writes. Fix: Upserts by key; deduplicate before landing.
  • Mistake: Ignoring consumer lag. Fix: Alert on lag thresholds; scale consumers or increase partitions.
Quick self-audit before you ship
  • [ ] Does each pipeline step define delivery semantics?
  • [ ] Are keys chosen to balance load and preserve needed locality?
  • [ ] Are retries safe due to idempotency or transactions?
  • [ ] Are metrics and alerts defined for lag, errors, and latency?
  • [ ] Are late/out-of-order events handled intentionally?

Practical projects

  • Real-time clickstream to warehouse: Ingest to a broker, micro-batch to object storage in Parquet, build a deduped view in your warehouse.
  • IoT monitoring: Stream sensor events, compute rolling averages and anomaly flags, alert when thresholds are breached, and store both raw data and aggregates.
  • Payments ledger: Consume payment events, validate schema, deduplicate by event_id, upsert into a transactions table, and expose a near real-time balance view.

Mini challenge

You see rising consumer lag after a marketing campaign starts. Propose a quick, safe plan to reduce lag within 15 minutes without losing data.

One possible approach
  • Scale consumers horizontally (more instances in the same consumer group).
  • Temporarily increase partitions if your platform supports safe repartitioning, or rekey future events to better-distributed keys.
  • Reduce sink batch sizes to speed commits; ensure idempotent upserts.
  • Pause noncritical enrichments; route raw events to storage and backfill enrichments later.
  • Check for a hot key; if found, shard the key (e.g., user_id plus a small hash suffix).
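
A sketch of the hot-key sharding idea from the last bullet; NUM_SHARDS is a tuning choice, and downstream consumers must re-aggregate across the shards:

```python
import zlib

NUM_SHARDS = 8  # small, fixed suffix space; tune to the observed hot-key volume

def sharded_key(user_id: str, event_id: str) -> str:
    # Spread one hot user across NUM_SHARDS partitions while keeping the
    # shard assignment deterministic per event.
    suffix = zlib.crc32(event_id.encode("utf-8")) % NUM_SHARDS
    return f"{user_id}#{suffix}"

print(sharded_key("user-42", "e1"))
print(sharded_key("user-42", "e2"))  # likely a different shard of the same user
```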

Who this is for

  • Aspiring and junior Data Engineers needing real-time ingestion fundamentals.
  • Analysts/Scientists integrating streaming data into models and dashboards.
  • Backend engineers interfacing services with data platforms.

Prerequisites

  • Basic understanding of batch data ingestion and file formats (CSV/JSON/Parquet).
  • Comfort with SQL and at least one programming language.
  • Familiarity with cloud storage and basic networking concepts.

Learning path

  • Start here: streaming vs batch, delivery semantics, partitions, offsets.
  • Next: schema management and validation strategies.
  • Then: event-time processing, windows, watermarks, late data.
  • Reliability: idempotent writes, dedup, DLQ design.
  • Operations: monitoring, lag reduction, back-pressure handling, scaling.
  • Apply: build one end-to-end project from source to warehouse.

Key metrics to watch

  • Producer throughput (events/s, MB/s) and error rate.
  • Consumer lag, processing latency, end-to-end latency.
  • DLQ volume and top failure reasons.
  • Partition skew (hot partitions) and rebalance frequency.

Next steps

  • Implement one practical project end-to-end with clear delivery semantics.
  • Add observability: track lag, latency, and error budgets with alerts.
  • Iterate on schema evolution with a compatibility policy and validation at ingest.

Quick Test

Take the test to check your understanding: 10 questions, 70% or higher to pass. Available to everyone; only logged-in users get saved progress.
