Exactly-Once and At-Least-Once Concepts

Learn exactly-once and at-least-once processing concepts for free, with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

As a Data Engineer, you will build real-time pipelines where retries, crashes, and duplicates are normal. Choosing the right processing semantics prevents downstream errors like double-charging customers, wrong KPIs, or inconsistent machine learning features.

  • Prevent duplicate writes when jobs restart.
  • Design consumer commit order to avoid data loss.
  • Use idempotent sinks and deduplication to get correct results.
  • Communicate guarantees to analysts and application teams.

Concept explained simply

Events can be delivered and processed multiple times due to retries or failures. Semantics are the contract your system offers about how many times each event affects the result.

Mental model

Imagine sorting mail. Delivery semantics describe how often each letter arrives at your table. Processing semantics describe how often you stamp it as "handled" in your log. Exactly-once means every letter leads to exactly one stamp, even if the postman tries twice. At-least-once means you may stamp a letter more than once unless your stamping process is idempotent.

Delivery vs processing semantics

  • At-most-once: You may miss events, but never process an event more than once.
  • At-least-once: You will not miss events (barring catastrophic loss), but you may process duplicates.
  • Exactly-once processing: Each event affects results once, even if delivered multiple times. Usually achieved via idempotence + transactions + state checkpoints.

Key ingredients

  • Idempotent writes: Writing the same event twice yields the same final state.
  • Deduplication: Use event IDs, windows, or keys to filter duplicates.
  • Transactions or two-phase commit: Commit the read position and the writes atomically.
  • Offsets/positions: Commit only after outputs are safe.
  • Checkpoints: Recover state consistently after failures.

Worked examples

Example 1 — Counting clicks per minute (analytics)

Goal: Exactly-once results for per-minute counts.

  1. Consume events from a log (may deliver duplicates).
  2. Aggregate in a windowed state store keyed by user/minute.
  3. On checkpoint, snapshot state and commit downstream writes together.
  4. Sink is idempotent: use upserts keyed by (minute, user) so retries overwrite the same row.

Outcome: Even if events replay, the final count per (minute, user) is correct.
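
A minimal sketch of this flow in Python, using SQLite to stand in for the sink; the table and column names are illustrative, not from any specific framework. The upsert in flush() is what makes a retried write safe.

    import sqlite3
    from collections import defaultdict
    from datetime import datetime, timezone

    conn = sqlite3.connect("clicks.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS clicks_per_minute (
        minute TEXT, user TEXT, cnt INTEGER, PRIMARY KEY (minute, user))""")

    def flush(counts):
        # Upsert keyed by (minute, user): retrying this flush after a crash
        # overwrites the same rows instead of inserting duplicates.
        with conn:  # one transaction: all rows become visible together
            for (minute, user), cnt in counts.items():
                conn.execute(
                    """INSERT INTO clicks_per_minute (minute, user, cnt)
                       VALUES (?, ?, ?)
                       ON CONFLICT (minute, user) DO UPDATE SET cnt = excluded.cnt""",
                    (minute, user, cnt))

    # Aggregate in windowed state keyed by (minute, user).
    counts = defaultdict(int)
    for user, ts in [("alice", 1736300000), ("alice", 1736300005), ("bob", 1736300070)]:
        minute = datetime.fromtimestamp(ts, timezone.utc).strftime("%Y-%m-%dT%H:%M")
        counts[(minute, user)] += 1

    flush(counts)  # safe to call again: same keys, same final rows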

Example 2 — Payments (critical side effects)

Goal: Prevent double-charging.

  1. Each payment has a unique transaction_id.
  2. Processor validates and writes to a ledger with a unique constraint on transaction_id.
  3. Use a transactional sink or an idempotent API with transaction_id as the idempotency key.
  4. Commit input position only after the sink confirms success.

Outcome: Retries may re-send, but the sink deduplicates on transaction_id.
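
A hedged sketch of the ledger side, again with SQLite standing in for the real store (table and column names are assumptions). The unique constraint turns a retried insert into a detectable no-op.

    import sqlite3

    conn = sqlite3.connect("ledger.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS ledger (
        transaction_id TEXT PRIMARY KEY,  -- the unique constraint is the dedup key
        amount_cents INTEGER NOT NULL)""")

    def record_payment(transaction_id, amount_cents):
        """Return True if this call recorded the charge, False if it was a duplicate."""
        try:
            with conn:
                conn.execute(
                    "INSERT INTO ledger (transaction_id, amount_cents) VALUES (?, ?)",
                    (transaction_id, amount_cents))
            return True
        except sqlite3.IntegrityError:
            return False  # already in the ledger: a retry or duplicate delivery

    record_payment("txn-42", 1999)  # records the charge
    record_payment("txn-42", 1999)  # duplicate delivery: no second charge
    # Only after record_payment succeeds do we commit the input position (step 4).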

Example 3 — CDC upsert to warehouse

Goal: Maintain a dimension table.

  1. Read CDC stream with possible duplicates.
  2. Write to warehouse using MERGE/UPSERT keyed by primary key.
  3. Checkpoint source position after successful MERGE.

Outcome: Multiple deliveries of the same change result in the same final row.
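
A sketch of the loop in Python, with SQLite upserts standing in for the warehouse MERGE; the event shape and table names are assumptions. The checkpoint is stored in the same transaction as the writes, so replaying a batch converges to the same rows.

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoints (source TEXT PRIMARY KEY, pos INTEGER)")

    def apply_batch(changes, end_pos):
        with conn:  # atomic: MERGE-style upserts and the checkpoint move together
            for c in changes:
                if c["op"] == "delete":
                    conn.execute("DELETE FROM dim_customer WHERE id = ?", (c["id"],))
                else:  # insert or update: upsert keyed by the primary key
                    conn.execute(
                        """INSERT INTO dim_customer (id, name) VALUES (?, ?)
                           ON CONFLICT (id) DO UPDATE SET name = excluded.name""",
                        (c["id"], c["name"]))
            # Step 3: advance the source position only alongside successful writes.
            conn.execute(
                "INSERT INTO checkpoints (source, pos) VALUES ('cdc', ?) "
                "ON CONFLICT (source) DO UPDATE SET pos = excluded.pos", (end_pos,))

    apply_batch([{"op": "upsert", "id": 1, "name": "Ada"},
                 {"op": "upsert", "id": 1, "name": "Ada"}], end_pos=100)  # same final row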

Design patterns to achieve guarantees

1) Idempotent sink

  1. Choose a unique key per output record (e.g., event_id, business key, or composite key).
  2. Write via upsert/merge. Ensure duplicates overwrite the same key.
  3. Make side-effect APIs idempotent with an idempotency key, as sketched below.
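
For step 3, a hedged sketch of calling a side-effect API with an idempotency key. The endpoint, header name, and payload here are hypothetical; many payment APIs accept an Idempotency-Key header, but check what yours supports.

    import requests

    def charge(transaction_id, amount_cents):
        # Hypothetical endpoint: the server is assumed to store the key and
        # replay the original response for a repeated key instead of re-charging.
        resp = requests.post(
            "https://api.example.com/charges",            # illustrative URL
            json={"amount_cents": amount_cents},
            headers={"Idempotency-Key": transaction_id},  # same key on every retry
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()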

2) Deduplication

  1. Carry a stable event_id from the producer.
  2. Keep a short-term dedup store (e.g., keys with a TTL matching late-arrival expectations); see the sketch after this list.
  3. For aggregates, use commutative operations (sum/count) when possible.
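
A minimal in-process dedup store with a TTL, as a sketch; real pipelines usually keep this in an external key-value store, and all names here are illustrative.

    import time

    class DedupStore:
        """Remembers event_ids for ttl_seconds; repeats within the TTL are dropped."""

        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.seen = {}  # event_id -> first-seen timestamp

        def is_duplicate(self, event_id, now=None):
            now = time.time() if now is None else now
            # Evict expired entries so the store stays bounded (O(n) here; fine for a sketch).
            self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
            if event_id in self.seen:
                return True
            self.seen[event_id] = now
            return False

    dedup = DedupStore(ttl_seconds=2 * 3600)  # TTL matched to a 2-hour lateness bound
    events = [{"event_id": "e1"}, {"event_id": "e1"}, {"event_id": "e2"}]
    unique = [e for e in events if not dedup.is_duplicate(e["event_id"])]  # keeps e1, e2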

3) Transactions / two-phase commit

  1. Process batch of events.
  2. Write outputs and offsets atomically (or in a saga-like pattern), as sketched below.
  3. On failure, either everything is visible or nothing is.
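
A sketch of the common "store offsets next to the outputs" variant, with SQLite and illustrative table names. Because the rows and the new offset commit in one transaction, a crash leaves either both visible or neither.

    import sqlite3

    conn = sqlite3.connect("sink.db")
    conn.execute("CREATE TABLE IF NOT EXISTS outputs (k TEXT PRIMARY KEY, v TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")

    def process_batch(part, batch):
        """batch: list of (offset, key, value) tuples read from one partition."""
        if not batch:
            return
        with conn:  # single transaction: outputs and offset advance together
            for offset, k, v in batch:
                conn.execute(
                    "INSERT INTO outputs (k, v) VALUES (?, ?) "
                    "ON CONFLICT (k) DO UPDATE SET v = excluded.v", (k, v))
            conn.execute(
                "INSERT INTO offsets (part, next_offset) VALUES (?, ?) "
                "ON CONFLICT (part) DO UPDATE SET next_offset = excluded.next_offset",
                (part, batch[-1][0] + 1))

    # On restart, resume from the stored next_offset rather than the broker's
    # position, so outputs and progress can never disagree.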

4) Offset commit ordering

  1. Process event.
  2. Write to sink and verify success.
  3. Commit the offset/acknowledgment only after sink success (see the loop sketched below).
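
The ordering as a sketch; queue and sink are hypothetical objects whose poll/ack/write methods stand in for your real consumer and sink clients.

    def transform(msg):
        return msg  # placeholder for the real business logic

    def run(queue, sink):
        while True:
            msg = queue.poll()       # 1. receive (may be a redelivery)
            if msg is None:
                continue
            result = transform(msg)  # 2. process
            sink.write(result)       # 3. write; if this raises, we never ack,
                                     #    so the message is redelivered (at-least-once)
            queue.ack(msg)           # 4. acknowledge only after the sink confirmed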

5) Watermarks and windows

  1. Define lateness tolerance for reordering.
  2. Emit results on watermark advance.
  3. Use upserts if late updates arrive, as in the sketch below.
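
A toy watermark tracker as a sketch, with all names illustrative: the watermark trails the largest event time seen by the allowed lateness, and a window is emitted (as an upsert) once the watermark passes its end.

    ALLOWED_LATENESS = 120  # seconds of reordering we tolerate

    class Windower:
        def __init__(self, size_s=60):
            self.size = size_s
            self.windows = {}      # window start -> event count
            self.max_event_ts = 0

        def add(self, event_ts):
            start = event_ts - event_ts % self.size
            self.windows[start] = self.windows.get(start, 0) + 1
            self.max_event_ts = max(self.max_event_ts, event_ts)

        def ripe(self):
            watermark = self.max_event_ts - ALLOWED_LATENESS
            for start in sorted(self.windows):
                if start + self.size <= watermark:
                    # Emit via upsert keyed by the window, so a late event or a
                    # replay can overwrite the row with a corrected count.
                    yield start, self.windows[start]

    w = Windower()
    for ts in (0, 30, 65, 250):  # the event at t=250 advances the watermark to 130
        w.add(ts)
    print(list(w.ripe()))        # [(0, 2), (60, 1)]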

Common mistakes and self-check

  • Committing offsets before sink write completes. Self-check: Can a crash after commit but before sink write lose data? Yes ➜ Fix commit order.
  • No idempotency key. Self-check: If the job retries, will the sink create a second row or a second charge? If yes ➜ Add a key and upsert.
  • Assuming the source is exactly-once. Self-check: Does the source guarantee no duplicates after failover? Usually no ➜ Deduplicate in processing or sink.
  • Writing window results as inserts only. Self-check: Can late events update old windows? If yes ➜ Use upserts keyed by window.
  • Infinite dedup store. Self-check: How long do duplicates realistically appear? Use TTL aligned with lateness.

Exercises (do them here, then open solutions)

Exercise 1 — Commit ordering

You read from a queue, write to a database, and commit offsets. Define the exact ordering to achieve at-least-once delivery without data loss, and explain how to make the final result exactly-once.

  • List the steps in order.
  • Name the idempotency mechanism at the sink.
  • Describe what happens on a crash after the sink write but before the commit.

See the guided solution in the Exercises section below.

Exercise 2 — Designing dedup

A clickstream has event_id and timestamp. You compute per-user daily active users (DAU). Duplicates and late events (up to 2 hours) are expected. Propose a dedup and sink strategy.

  • State the dedup store key and TTL.
  • Explain how late events update counts.
  • Define the sink write mode.

See the guided solution in the Exercises section below.

Checklist before you check solutions:

  • Did you specify an idempotency key?
  • Did you order the commit after a successful sink write?
  • Did you set a realistic TTL for dedup?
  • Did you handle late data via upserts?

Mini challenge

Design semantics for a rides stream that updates driver earnings per hour. Duplicates and reordering occur, and events may arrive up to 30 minutes late. Write a short plan:

  • Keys for idempotency (record and aggregate).
  • Windowing and watermark policy.
  • Sink mode and how replay affects results.
  • Commit/ack order.

Hint

Use upserts keyed by (driver_id, hour) and keep a dedup store keyed by event_id with ~30–60 minutes TTL.

Who this is for

  • Data Engineers building real-time pipelines.
  • Analytics Engineers maintaining streaming ETL.
  • Platform Engineers exposing streaming infrastructure.

Prerequisites

  • Basic understanding of message queues and consumers.
  • Familiarity with batch vs streaming.
  • Comfort with SQL upsert/merge or key-based writes.

Learning path

  1. Event delivery vs processing semantics.
  2. Idempotency and deduplication patterns.
  3. Transactions, checkpoints, and commit ordering.
  4. Windowing, lateness, and watermarks.
  5. End-to-end testing with failure injection.

Practical projects

  • Build a small stream app that counts events per 5 minutes with upserted results and a dedup cache (TTL 1 hour). Kill/restart mid-run and validate results are stable.
  • Implement a payment-like demo: accept events with transaction_id, write to a store with unique constraint, and verify no duplicates after retries.
  • Create a CDC-to-warehouse pipeline that uses MERGE and checkpoints. Reprocess a partition and confirm outputs don’t multiply.

Next steps

  • Take the Quick Test below to check your understanding. Available to everyone; progress is saved if you sign in.
  • Apply these patterns in a small sandbox project with controlled failures (stop the app, drop connections) to build intuition.
  • Move on to the next subskill in Streaming Systems Basics after you pass the test.

Practice Exercises

2 exercises to complete

Instructions

You consume from a partitioned queue and write to a relational DB. Describe the exact step order to guarantee at-least-once delivery without loss, and what extra element gives you exactly-once results. Use 4–6 steps.

  • Include the sink idempotency mechanism.
  • Explain the behavior if the app crashes after the sink write but before committing offsets.
Expected Output

A clear sequence in which the sink write succeeds before the offset commit, plus an idempotent upsert or unique key to avoid double effects on retries.

Exactly-Once and At-Least-Once Concepts — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

