Why this matters
As a Data Platform Engineer, you often design topics, choose partition counts, and set replication and retention so teams can reliably stream data at scale. Getting topics and partitions right impacts throughput, ordering guarantees, cost, and the ability for multiple consumers to process data in parallel.
- Real task: Size a topic for clickstream traffic with peak bursts.
- Real task: Preserve per-customer ordering while scaling consumers.
- Real task: Tune replication and ISR to withstand broker failures without data loss.
Who this is for
- Aspiring and current Data Platform Engineers
- Data Engineers building real-time pipelines
- Backend Engineers integrating services with Kafka
Prerequisites
- Basic understanding of distributed systems (brokers, replication)
- Familiarity with producers/consumers concepts
- Comfort with throughput and storage units (KB/MB/GB)
Concept explained simply
Topic: a named data stream (like a file category) where producers write events and consumers read them.
Partition: a topic is split into partitions. Each partition is an append-only log with strictly increasing offsets. Ordering is guaranteed within a partition, not across partitions.
Key: used by the partitioner to route messages. With the default partitioner, all messages with the same key go to the same partition (important for per-key ordering).
Replication factor (RF): the number of copies of each partition across brokers. One replica is the leader; the others are followers. By default, producers write to and consumers read from the leader.
Consumer group: multiple consumers sharing a group read a topic's partitions in parallel. Each partition is consumed by at most one consumer in the same group at a time.
Retention: how long Kafka keeps data. Kafka stores data for a configured time/size, independent of whether consumers read it.
Compaction (optional): keeps the latest record per key and discards older ones, useful for changelog-style topics.
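These pieces map directly onto topic configuration. Below is a minimal sketch using the confluent-kafka Python client (an assumption; the broker address, topic name, and numbers are placeholders, not recommendations):

```python
# Minimal sketch: create a topic with explicit partitions, RF, ISR, and retention.
# Broker address, topic name, and values are illustrative placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

clickstream = NewTopic(
    "clickstream-events",            # topic: a named stream
    num_partitions=12,               # partitions: parallelism; ordering holds within each one
    replication_factor=3,            # RF: copies of each partition across brokers
    config={
        "min.insync.replicas": "2",    # durability: acknowledged writes exist on >= 2 replicas
        "retention.ms": "604800000",   # retention: keep data for 7 days regardless of consumption
        # "cleanup.policy": "compact", # compaction: keep only the latest record per key instead
    },
)

# create_topics is asynchronous and returns one future per topic
for topic, future in admin.create_topics([clickstream]).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```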
Mental model
Picture a topic as a highway made of multiple lanes (partitions). Each lane has a strict order of cars (messages). Cars with the same plate (key) always use the same lane, so their order is preserved. More lanes mean higher throughput and more cars traveling in parallel, but order is only preserved within a single lane. Replication is like running parallel highways in different locations: one is open to traffic (the leader); the others mirror it and take over if needed.
Quick self-check: Do I need more partitions?
- Need more parallel consumers? Add partitions.
- Need higher write/read throughput? Add partitions (up to broker and network limits).
- Need strict global order across all messages? Keep one partition (and accept limited throughput).
Worked examples
Example 1: Clickstream topic
Requirement: Peak 300k events/sec, ~1 KB each. Want parallel processing with 10–20 consumer instances and room to grow. Target per-partition throughput around 5 MB/s.
- Peak write: 300k KB/s ≈ 300 MB/s
- Partitions for throughput: 300 / 5 = 60 → round up to 64 partitions
- RF = 3 for resilience; min.insync.replicas = 2 (acknowledged writes must reach at least 2 in-sync replicas)
- Key = user_id to preserve per-user order
Outcome: 64 partitions allow parallel consumers and handle bursts while preserving per-user order.
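On the producer side, the key choice and durability settings from this example look roughly like the sketch below (confluent-kafka Python client assumed; the topic name and sample event are illustrative):

```python
# Hedged sketch for Example 1: key by user_id so each user's events land in one
# partition (per-user ordering), and request acks from all in-sync replicas to
# pair with RF=3 / min.insync.replicas=2. Values are illustrative placeholders.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                               # wait for the in-sync replicas, not just the leader
})

event = {"user_id": "user-42", "page": "/checkout", "ts": 1700000000}

producer.produce(
    "clickstream-events",
    key=event["user_id"].encode("utf-8"),        # same key -> same partition -> per-user order
    value=json.dumps(event).encode("utf-8"),
)
producer.flush()
```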
Example 2: IoT devices
Requirement: 50k devices, ~1 msg/sec each, 500 B per message. Need per-device ordering. Team runs 25 consumers.
- Throughput: 50k * 0.5 KB/s = 25 MB/s
- Parallelism: at least 25 partitions for 25 consumers
- At a 3 MB/s per-partition throughput target: 25 / 3 ≈ 9, so parallelism, not throughput, is the binding constraint
- Choose 36 partitions to cover 25 consumers plus growth
- Key = device_id to keep ordering per device
Outcome: 36 partitions balance parallelism, ordering, and headroom.
Example 3: Payments with ordering constraints
Requirement: Strict per-account ordering (credits/debits). Traffic: 15 MB/s peak. Only 4 consumers allowed by downstream service limits.
- At least 4 partitions (one per consumer)
- Throughput per partition: 15 / 4 = 3.75 MB/s (acceptable)
- Key by account_id to preserve ordering
- Avoid increasing partitions later unless you can handle potential key-to-partition remapping impacts
Outcome: 4 partitions with account_id key ensures per-account ordering under consumer constraints.
Sizing partitions: quick method
- Compute peak write rate in MB/s.
- Pick a safe per-partition throughput target (commonly 2–10 MB/s; depends on hardware).
- Partitions for throughput = peak MB/s / per-partition target.
- Partitions for parallelism = max concurrent consumers in a group.
- Choose the larger of the two, then round up (often to a convenient number like 24, 36, 48, 64).
- Validate with a small load test; adjust.
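This method is easy to script. A minimal sketch (the per-partition target and the 1000 KB ≈ 1 MB rounding follow the worked examples above; adjust both for your hardware and round up for headroom):

```python
# Minimal sketch of the quick sizing method. Uses the same 1000 KB ~= 1 MB
# rounding as the worked examples; round the final suggestion up to a
# convenient number (24, 36, 48, 64, ...) if you want headroom.
import math

def partitions_needed(peak_eps: int, msg_kb: float,
                      per_partition_mbps: float, max_consumers: int):
    peak_mbps = peak_eps * msg_kb / 1000                        # peak write rate in MB/s
    for_throughput = math.ceil(peak_mbps / per_partition_mbps)  # throughput driver
    return for_throughput, max_consumers, max(for_throughput, max_consumers)

# Worked examples from this section:
print(partitions_needed(300_000, 1.0, 5, 20))   # clickstream: (60, 20, 60) -> round to 64
print(partitions_needed(50_000, 0.5, 3, 25))    # IoT:         (9, 25, 25)  -> round to 36
```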
Tip: Replication and capacity
With RF=3, each partition’s data is stored three times, so size storage and inter-broker network for this write amplification. Set min.insync.replicas to at least 2, and have producers send with acks=all, so a single broker failure cannot lose acknowledged writes.
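A back-of-envelope sketch of that storage math (the write rate and retention below are illustrative assumptions; storage sizing should use the sustained average rate rather than the peak):

```python
# Back-of-envelope storage for RF=3: every byte written is kept on three
# brokers until retention expires. Rate and retention here are illustrative.
def retained_storage_gb(avg_write_mbps: float, retention_days: float, rf: int = 3) -> float:
    seconds = retention_days * 24 * 3600
    return avg_write_mbps * seconds * rf / 1024   # MB -> GB

# e.g. the IoT example: 25 MB/s sustained, 7-day retention, RF=3
print(f"{retained_storage_gb(25, 7):,.0f} GB of disk across the cluster")
```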
Ordering and keys
- Ordering is guaranteed only within a partition.
- Using a key routes same-key messages to the same partition (default partitioner), preserving per-key order.
- Changing the number of partitions can change key-to-partition mapping for future messages. Be cautious when you rely on stable per-key ordering across time.
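To see the remapping effect concretely, here is an illustration-only sketch. It uses Python's built-in CRC32 as a stand-in hash (the Java client's default partitioner actually uses murmur2, so the concrete partition numbers differ), but the point, that future messages for the same key can land on a different partition once the count changes, holds either way:

```python
# Illustration only: why adding partitions changes where a key maps for future
# messages. CRC32 is a stand-in hash; Kafka's Java default partitioner uses
# murmur2, so actual partition numbers differ, but the remapping effect is the same.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

for key in ["user-1", "user-2", "user-3"]:
    before = partition_for(key, 12)
    after = partition_for(key, 16)
    moved = "moved" if before != after else "stayed"
    print(f"{key}: partition {before} -> {after} ({moved})")
```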
How to choose a key
- Need per-customer ordering? Key = customer_id.
- Need even load and no ordering constraints? Send with a null key (the producer then spreads messages across partitions) or a randomized key, but expect no cross-message order.
- Hot keys (very popular IDs) can overload a single partition; consider composite keys or sharding schemes.
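For the hot-key case, one common mitigation is to append a small shard suffix to the key. A hedged sketch (the shard count and separator are arbitrary choices for illustration), with the trade-off that per-entity ordering then only holds within each shard:

```python
# Hedged sketch of sharding a hot key: spreading one very popular customer_id
# across a few sub-keys trades strict per-customer ordering for load spread.
# Ordering is then only guaranteed within each shard, not across the customer.
import random

def sharded_key(customer_id: str, shards: int = 4) -> str:
    # Pick a shard at random (or derive it from a secondary attribute such as session_id)
    return f"{customer_id}#{random.randrange(shards)}"

print(sharded_key("big-customer-001"))   # e.g. "big-customer-001#2"
```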
Common mistakes and self-check
- Mistake: Too few partitions → consumers bottlenecked, long catch-up times.
  Self-check: Are consumers regularly at 100% CPU, or is lag growing at peaks?
- Mistake: Too many partitions → high metadata overhead, slow rebalances.
  Self-check: Are broker/controller metrics and rebalance times spiking after increases?
- Mistake: No key when per-entity order is required.
  Self-check: Do you observe out-of-order processing for the same entity?
- Mistake: RF=1 in production.
  Self-check: Can a single broker failure lose data? If yes, raise RF and set min.insync.replicas accordingly.
- Mistake: Expanding partitions without considering key remapping.
  Self-check: Do consumers rely on historical per-key order across partition changes?
Practical projects
- Design a topic plan for three workloads: clickstream, IoT sensors, and payments. Document keys, partition counts, RF, retention, and expected consumer groups.
- Create a partition sizing spreadsheet that takes peak EPS and message size to propose partition counts.
- Develop a small producer + consumer demo that verifies per-key ordering and shows how increasing partitions changes key-to-partition mapping.
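For the third project, a minimal consumer sketch (again assuming the confluent-kafka Python client; broker address, topic, and group id are placeholders) that flags a key arriving from a different partition than before, which is what you would expect to see after increasing the partition count:

```python
# Minimal sketch for the per-key ordering demo: remember which partition and
# offset each key was last seen at, and report keys that switch partitions.
# Broker address, topic, and group id are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "ordering-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clickstream-events"])

last_seen = {}  # key -> (partition, offset)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        key = msg.key().decode("utf-8") if msg.key() else None
        prev = last_seen.get(key)
        if prev and prev[0] != msg.partition():
            print(f"key {key} moved from partition {prev[0]} to {msg.partition()}")
        last_seen[key] = (msg.partition(), msg.offset())
finally:
    consumer.close()
```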
Exercises
Work through the exercises below, then take the Quick Test to check your understanding.
Exercise 1: Clickstream sizing
Traffic: average 120k events/sec, peak 300k events/sec for 10 minutes. Each event is ~1 KB. Target per-partition throughput 5 MB/s. You expect up to 12 consumers in one group and need room to grow. Propose:
- Partition count
- Replication factor and min.insync.replicas
- Key choice and reason
Expected output: A short configuration proposal with concrete numbers.
Solution
Peak write ≈ 300 MB/s. Partitions for throughput = 300 / 5 = 60 → round up to 64 for headroom. RF=3; min.insync.replicas=2. Key by user_id to preserve per-user ordering. 64 partitions support 12 consumers comfortably and allow growth.
Exercise 2: IoT ordering and parallelism
You manage 50k devices sending ~1 msg/sec at 500 B. You need per-device ordering. Team runs 25 consumers now; plan for growth. Decide:
- Key
- Partition count
- Any notes about future expansion and ordering
Expected output: A brief plan with rationale.
Solution
Key=device_id to preserve order. Throughput ≈ 25 MB/s; parallelism requires ≥25 partitions. Choose ~36 partitions to allow growth and smoother assignment. Note: Increasing partitions later can change key-to-partition mapping for future messages; communicate to downstreams if they rely on historical per-key ordering.
Exercise completion checklist
- I calculated peak MB/s and mapped it to partitions.
- I justified replication factor and ISR for durability.
- I chose keys based on ordering needs and hot-key risk.
- I considered consumer parallelism and future growth.
Mini challenge
Design a topic for order events: 40 MB/s peak, strict per-order-id ordering, 20 consumers today, potential to double next quarter. List your partition count, key, RF, and a note on what happens if you expand partitions later.
Hint
Balance parallelism (≥ consumers) with per-partition throughput. Key by order_id. Consider impact of partition expansion on key mapping.
Learning path
- Start: Kafka basics (brokers, producers/consumers)
- This subskill: Topics, partitions, keys, ordering, and sizing
- Next: Delivery semantics, consumer groups tuning, and rebalancing
- Later: Retention, compaction, and schema evolution
Next steps
- Complete the Quick Test below to check your understanding.
- Apply the sizing method to one of your real workloads.
- Document topic conventions (naming, defaults for RF/ISR, retention) for your team.