Why this matters
As a Data Platform Engineer, you will choose and implement delivery semantics for streaming pipelines. This impacts money movement, counters, search indexes, inventory, recommendations, and alerts. You will:
- Decide where at-least-once is fine and where exactly-once is mandatory.
- Configure Kafka producers/consumers and streaming jobs to avoid duplicates and data loss.
- Design idempotent writes and deduplication at sinks like Kafka, object storage, databases, and Elasticsearch.
- Operate pipelines safely during restarts, redeploys, and failures.
Concept explained simply
Delivery semantics describe how many times an event's effect happens at the destination:
- At-most-once: deliver 0 or 1 time. Fast, but can lose messages.
- At-least-once: deliver 1 or more times. No loss, duplicates possible.
- Exactly-once (effectively-once): each event's effect happens once. No loss, no duplicates at the sink.
Important: "Exactly-once" in streaming is about the effect, not that the event was only delivered once over the network. Retries may happen; idempotency and transactions ensure the final state is as if each event applied once.
Mental model
Imagine dropping numbered coins into a box:
- At-least-once: you may drop coin #42 twice; the box holds two coins unless the box rejects duplicates.
- Exactly-once: the box accepts coin #42 only once, even if you try twice.
- The "box" is your sink (Kafka topic, DB table, object store). The key to exactly-once is making the sink idempotent or transactional.
Implementation patterns
Kafka exactly-once overview (producer and processors)
- Idempotent producer: enable.idempotence=true, acks=all, retries>0, and max.in.flight.requests.per.connection<=5 so ordering is preserved even when retries happen.
- Transactions: the producer sets transactional.id; beginTransaction() → send() → sendOffsetsToTransaction() → commitTransaction(). This commits produced records and consumed offsets atomically, so there are no lost or duplicated effects between a Kafka source and a Kafka sink (see the sketch after this list).
- Processors (e.g., Flink) use checkpointing + transactional sinks to align state and committed outputs.
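For illustration, here is a minimal Java sketch of a transactional consume-transform-produce loop, assuming hypothetical topics payments-in and payments-out and a local broker; the configuration values are placeholders, not production settings.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.*;

public class TransactionalCopy {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");        // idempotent producer
        pp.put(ProducerConfig.ACKS_CONFIG, "all");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "copy-job-1");    // stable id per producer instance
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "copy-job");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // only read committed transactions
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");       // offsets go through the transaction
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            producer.initTransactions();
            consumer.subscribe(List.of("payments-in"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("payments-out", r.key(), r.value()));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // Commit consumed offsets and produced records atomically.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}
```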
Flink exactly-once with Kafka
- Enable checkpointing with exactly-once mode.
- Use the Kafka source; offsets are committed back to Kafka as part of each successful checkpoint.
- Use Kafka sink with two-phase commit; Flink commits the transaction only after checkpoint success.
- End-to-end exactly-once holds for Kafka→Flink→Kafka. For external sinks, the sink must provide idempotency or two-phase commit.
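A sketch of a Kafka→Flink→Kafka job with these settings, assuming the KafkaSource/KafkaSink connector API; the topic names (payments, ledger-entries), checkpoint interval, and placeholder transformation are illustrative.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class ExactlyOnceKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Align state snapshots and output commits on each checkpoint.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("payments")
                .setGroupId("ledger-job")
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("ledger-entries")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Two-phase commit: the Kafka transaction commits only after the checkpoint succeeds.
                // In practice, also raise the producer transaction.timeout.ms above the checkpoint interval.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("ledger-job")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "payments-source")
           .map(String::toUpperCase) // placeholder transformation
           .sinkTo(sink);

        env.execute("exactly-once kafka to kafka");
    }
}
```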
Idempotent sinks and dedup
- Databases: upserts, primary key constraints, INSERT ... ON CONFLICT DO NOTHING/UPDATE semantics.
- Object storage: write files with deterministic names (e.g., derived from event_id or date partition) and commit manifests exactly once.
- Search: upserts by document id instead of increments.
- Stateful dedup: keep a set of seen event_ids with TTL; beware memory growth and late arrivals.
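As a sketch of the stateful-dedup idea in Flink, assuming a hypothetical Event POJO keyed by its event_id and a 24-hour TTL; tune the TTL to your duplicate window and late-arrival tolerance.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical event POJO; the only requirement is a stable, unique event id.
class Event {
    public String eventId;
    public String payload;
}

// Keyed dedup: emits each event_id at most once while its state entry is alive.
// State expires after the TTL, so duplicates arriving later than the TTL can slip through.
public class DedupByEventId extends RichFlatMapFunction<Event, Event> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen-event-id", Boolean.class);
        // Bound the dedup state: entries expire 24 hours after they are written.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(Event e, Collector<Event> out) throws Exception {
        if (seen.value() == null) {   // first time this key (event_id) is observed
            seen.update(true);
            out.collect(e);
        }
        // else: duplicate within the TTL window, drop it
    }
}

// Usage: events.keyBy(ev -> ev.eventId).flatMap(new DedupByEventId())
```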
Choosing semantics by use case
- Billing, inventory, payouts: exactly-once or idempotent writes required.
- Search index, cache rebuilds, downstream read models: at-least-once with idempotent upsert is typically fine.
- Metrics counters and aggregates: at-least-once plus dedup, or design for idempotent aggregation (e.g., recompute per window, not running increments).
- Alerts/notifications: at-least-once with suppression (dedup by alert key and time window).
Worked examples
Example 1: Payments to ledger DB
Goal: each payment event should affect the ledger exactly once.
- Source: Kafka topic payments.
- Process: Flink job with checkpointing.
- Sink: Ledger DB table with primary key payment_id.
- Approach: use at-least-once transport plus an idempotent sink (UPSERT by payment_id). If an intermediate Kafka sink is involved, use Kafka transactions; the final DB write remains idempotent (see the sketch below).
- Result: even if a payment is retried, the DB row for payment_id is updated once.
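A minimal sketch of the idempotent ledger write, assuming a PostgreSQL-style ON CONFLICT upsert and a hypothetical ledger table keyed by payment_id; connection details and column names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LedgerUpsert {
    // Idempotent write: replaying the same payment_id leaves exactly one row.
    private static final String UPSERT_SQL =
            "INSERT INTO ledger (payment_id, account_id, amount_cents, updated_at) " +
            "VALUES (?, ?, ?, now()) " +
            "ON CONFLICT (payment_id) DO UPDATE SET " +
            "  account_id = EXCLUDED.account_id, " +
            "  amount_cents = EXCLUDED.amount_cents, " +
            "  updated_at = now()";

    public static void upsert(Connection conn, String paymentId, String accountId, long amountCents)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, paymentId);
            ps.setString(2, accountId);
            ps.setLong(3, amountCents);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/ledger", "ledger", "secret")) {
            // Applying the same event twice has the same effect as applying it once.
            upsert(conn, "pay-42", "acct-7", 1999);
            upsert(conn, "pay-42", "acct-7", 1999);
        }
    }
}
```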
Example 2: Page view counts to Redis
Naive increments (INCR) are not idempotent. To get effectively-once:
- Generate stable event_id per page view.
- Dedup before increment: keep a short-TTL set of seen event_ids per user/session or per page. If the id has not been seen, INCR; otherwise skip (see the sketch after this list).
- Alternative: aggregate counts in Flink per window and write final window totals idempotently.
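A sketch of the dedup-before-increment pattern, assuming the Jedis Redis client and hypothetical key names; note that the marker write and the increment are two separate commands, so a crash between them can still drop a count.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class PageViewCounter {
    // Increments the page counter only the first time an event_id is seen.
    // SET NX with a TTL acts as a short-lived dedup marker: duplicates within
    // the TTL are skipped, duplicates arriving after the TTL would count again.
    public static void countOnce(Jedis redis, String eventId, String pageId) {
        String marker = "seen:" + eventId;
        String result = redis.set(marker, "1", SetParams.setParams().nx().ex(3600));
        if ("OK".equals(result)) {          // marker did not exist yet
            redis.incr("views:" + pageId);  // safe to count this event
        }
        // null result means the marker already existed: duplicate, skip INCR
    }

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            countOnce(redis, "evt-123", "home");
            countOnce(redis, "evt-123", "home"); // duplicate delivery, counter unchanged
        }
    }
}
```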
Example 3: CDC to Elasticsearch
Replicate DB changes to search:
- Process CDC events with Kafka→Flink→Elasticsearch.
- Use document id = primary key. Upsert documents (replace) rather than doing arithmetic updates.
- Duplicates overwrite the same doc → effectively-once effect in the index.
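A sketch of the upsert-by-id write using Elasticsearch's document index endpoint (PUT /{index}/_doc/{id}) over plain HTTP; the index name, document id, and fields are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProductIndexer {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Index (create or fully replace) the document whose _id is the table's
    // primary key. Replaying the same CDC event rewrites the same document,
    // so the effect in the index is the same as applying it once.
    public static void upsertProduct(String productId, String jsonDoc) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/products/_doc/" + productId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonDoc))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 300) {
            throw new IllegalStateException("Index request failed: " + response.body());
        }
    }

    public static void main(String[] args) throws Exception {
        String doc = "{\"name\":\"Blue kettle\",\"price_cents\":2599}";
        upsertProduct("prod-17", doc); // first delivery creates the document
        upsertProduct("prod-17", doc); // duplicate delivery replaces it with the same content
    }
}
```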
Configuration cheat-sheet
- Kafka producer (idempotent): enable.idempotence=true, acks=all, retries>0, max.in.flight.requests.per.connection<=5
- Kafka producer (transactions): transactional.id set; beginTransaction → send → sendOffsetsToTransaction → commitTransaction
- Flink: enable checkpointing (exactly-once), use KafkaSource with committed offsets on checkpoints, KafkaSink with two-phase commit
- DB sink: use primary keys, UPSERT/merge, or INSERT ON CONFLICT DO NOTHING/UPDATE
Observability tips
- Track duplicate rate: the ratio of events arriving at the sink with an already-seen event_id.
- Missing rate proxy: compare counts between source and sink per time window.
- Monitor transactional aborts, commit latency, checkpoint duration, and consumer group lag.
Common mistakes and self-check
- Assuming exactly-once without an idempotent or transactional sink. Self-check: can your sink accept the same event twice without double effect?
- Forgetting sendOffsetsToTransaction in Kafka processors. Self-check: after crash, could messages be reprocessed without corresponding output?
- Using non-deterministic operations (e.g., random values) inside stateful processing that will replay differently. Self-check: are results deterministic on replay?
- No stable event_id. Self-check: does each event carry an immutable key that uniquely identifies it?
- Unbounded dedup state. Self-check: do you enforce TTLs/windows, and handle late events explicitly?
Exercises
Do these before the quick test; they mirror the hands-on tasks later in this lesson. Guidance follows each exercise.
Exercise 1: Pick semantics per pipeline
For three pipelines, choose at-least-once or exactly-once and justify:
- Orders topic → Flink → Payments topic → Ledger DB
- Clickstream topic → Flink → Redis counters for dashboard
- Products CDC topic → Flink → Search index
Guidance:
- Think about monetary impact, idempotency of the sink, and user-facing tolerance to duplicates.
Exercise 2: Design dedup strategy
You process events with fields: event_id, user_id, ts, action. Design dedup so each event updates analytics once when writing to a DB sink.
- Describe the key, state/TTL, and DB write pattern.
Guidance:
- Prefer a stable event_id. Consider windowed state vs long TTL. DB upsert vs insert-ignore.
Self-check checklist:
- I identified which pipelines need exactly-once effects.
- I specified an event_id or composite key.
- I chose idempotent write semantics at the sink.
- I considered state TTL and late events.
Mini challenge
Pick one of your existing or hypothetical pipelines. In 5 sentences, define: the required semantics, the idempotent key, the sink write mode, how retries are safe, and how you will measure duplicate rate.
Who this is for
- Aspiring or current Data Platform Engineers building Kafka/Flink/Spark Streaming pipelines.
- Data Engineers who need reliable real-time outputs.
Prerequisites
- Basic understanding of Kafka topics, partitions, and consumer groups.
- Familiarity with at-least-once delivery and checkpoints in a stream processor.
- Ability to design primary keys and upserts in a database.
Learning path
- Grasp semantics and idempotency (this lesson).
- Practice Kafka idempotent producer and transactions in a sandbox.
- Build a small Flink job with exactly-once Kafka sink and verify on restart.
- Add an idempotent DB sink and measure duplicates and missing rates.
Practical projects
- Kafka→Flink→Kafka pipeline with EOS (exactly-once semantics): simulate failures and verify there are no duplicates in the sink topic.
- Clickstream dedup: implement event_id dedup with TTL in Flink and write upserts to a DB.
- CDC to search: upsert documents by primary key and validate effect-only-once with duplicate replays.
Next steps
- Run the Quick Test below to check your understanding. Everyone can take it; sign in to save your progress.
- Apply one project idea this week; document your decisions and metrics.