Why this matters
Backend systems often juggle thousands of requests, slow I/O calls, and CPU-heavy tasks. Knowing when to use concurrency (making progress on multiple tasks) and when to use parallelism (running tasks at the same time on multiple CPU cores) helps you build faster, more reliable services.
- Speed up API endpoints by overlapping I/O (database, cache, HTTP calls).
- Protect shared state (caches, counters, queues) from race conditions.
- Make batch jobs finish sooner by using available CPU cores safely.
- Stay within rate limits and resource budgets using worker pools and backpressure.
Concept explained simply
- Concurrency: Structuring a program so multiple tasks can make progress without blocking each other. Often about waiting less (great for I/O-bound work).
- Parallelism: Running multiple tasks at the same time on different cores. Often about doing more compute per unit time (great for CPU-bound work).
Mental model
Imagine a restaurant:
- Concurrency: One chef manages several dishes by switching between them while some are baking or boiling (no time wasted waiting).
- Parallelism: Several chefs cook different dishes at the same time on separate stations.
Quick checks to choose an approach
- If the bottleneck is waiting (network, disk, DB): use concurrency (async/await, event loops, non-blocking I/O).
- If the bottleneck is CPU: use parallelism (multiple processes/threads, task parallel libraries).
- Mix both if you have I/O and CPU parts: e.g., fetch data concurrently, then process in parallel.
Worked examples
Example 1: Concurrent I/O fan-out
Goal: Call three internal services and respond as soon as all return or any fails.
// Pseudocode (language-agnostic)
function handleRequest(userId):
    t1 = async fetchProfile(userId)
    t2 = async fetchOrders(userId)
    t3 = async fetchRecommendations(userId)
    results = await all(t1, t2, t3) with timeout 300ms
    if any failed: return partial or fallback
    return aggregate(results)
Why it works: I/O overlaps, reducing total latency to roughly the slowest call (plus overhead) instead of sum of latencies.
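A minimal runnable sketch of this fan-out in Python's asyncio; the fetch_* coroutines and their latencies are placeholders, not real services, and the 300 ms budget covers the whole group.

# Python sketch (asyncio); the three fetch_* coroutines are hypothetical stand-ins.
import asyncio

async def fetch_profile(user_id):
    await asyncio.sleep(0.05)           # simulated service latency
    return {"profile": user_id}

async def fetch_orders(user_id):
    await asyncio.sleep(0.08)
    return {"orders": []}

async def fetch_recommendations(user_id):
    await asyncio.sleep(0.12)
    return {"recommendations": []}

async def handle_request(user_id):
    try:
        # All three calls run concurrently; the timeout applies to the whole group.
        profile, orders, recs = await asyncio.wait_for(
            asyncio.gather(
                fetch_profile(user_id),
                fetch_orders(user_id),
                fetch_recommendations(user_id),
            ),
            timeout=0.3,
        )
        return {**profile, **orders, **recs}
    except Exception:
        return {"user_id": user_id, "fallback": True}   # partial/fallback response

print(asyncio.run(handle_request(42)))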
Example 2: Parallel CPU map-reduce
Goal: Recompute search indexes.
docs = loadDocuments()
chunks = split(docs, by=number_of_cores())
partialIndexes = parallel_map(chunks, buildPartialIndex)
index = reduce(mergeIndex, partialIndexes)
Why it works: Each core builds a partial index in parallel; merging the partial indexes at the end is far cheaper than building the whole index on a single core.
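A runnable sketch of the same map-reduce in Python using a process pool; the "index" here is just a word count standing in for real index building.

# Python sketch (multiprocessing); build_partial_index is a toy word-count index.
from collections import Counter
from functools import reduce
from multiprocessing import Pool, cpu_count

def build_partial_index(docs):
    index = Counter()
    for doc in docs:
        index.update(doc.split())        # count term frequencies for this chunk
    return index

def merge_index(a, b):
    a.update(b)                          # fold one partial index into another
    return a

if __name__ == "__main__":
    docs = ["backend systems", "concurrency and parallelism", "parallel backend"] * 1000
    cores = cpu_count()
    chunks = [docs[i::cores] for i in range(cores)]   # one chunk per core
    with Pool(cores) as pool:
        partial_indexes = pool.map(build_partial_index, chunks)
    index = reduce(merge_index, partial_indexes, Counter())
    print(index.most_common(3))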
Example 3: Avoiding races on shared counters
Goal: Count processed items across workers safely.
counter = 0
workers = startN(8)
// BAD: the unsynchronized increment inside each task is a race
for each item in items:
    enqueue(work, () => { process(item); counter = counter + 1 })
waitAll(workers)
print(counter) // may be less than the number of items
// GOOD: use an atomic counter or channel to aggregate
atomicCounter = AtomicInt(0)
for each item in items:
    enqueue(work, () => { process(item); atomicCounter.increment() })
waitAll(workers)
print(atomicCounter.get())
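Python has no built-in AtomicInt, so a Lock-protected counter (or per-worker counts summed at the end) plays that role; a minimal sketch:

# Python sketch (threading); a Lock stands in for the atomic counter.
import threading
from concurrent.futures import ThreadPoolExecutor

items = range(10_000)
counter = 0
lock = threading.Lock()

def process(item):
    global counter
    with lock:                 # without the lock, counter += 1 can lose updates
        counter += 1

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process, items))

print(counter)                 # 10000 -- every increment is accounted for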
Example 4: Bounded concurrency (semaphore)
Goal: Call an external API with at most 20 in-flight requests to stay under rate limits.
sem = Semaphore(20)
for id in ids:
    acquire(sem)
    async {
        try { callExternalAPI(id) }
        finally { release(sem) }
    }
await all tasks
Why it works: You cap pressure on the external system and your resources, improving reliability.
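The same idea in Python asyncio, with call_external_api mocked by a sleep:

# Python sketch (asyncio); call_external_api is a stand-in for a real client call.
import asyncio

async def call_external_api(item_id):
    await asyncio.sleep(0.1)            # simulated network latency
    return item_id

async def limited_call(sem, item_id):
    async with sem:                     # acquire before the call, release afterwards
        return await call_external_api(item_id)

async def main():
    sem = asyncio.Semaphore(20)         # at most 20 requests in flight
    results = await asyncio.gather(*(limited_call(sem, i) for i in range(200)))
    print(len(results), "calls completed")

asyncio.run(main())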
Common primitives (tool-agnostic)
- Futures/Promises/Tasks: Represent work that completes later.
- Async/Await: Structured way to pause and resume tasks without blocking threads.
- Thread/Process Pools: Reuse workers to avoid creation overhead.
- Semaphores/Rate limiters: Control concurrency level and throughput.
- Channels/Queues: Safe communication between producers and consumers.
- Locks/Mutexes/Atomics: Protect shared state; prefer minimizing shared mutable state.
Choosing the right primitive
- I/O-heavy: async I/O + bounded concurrency.
- CPU-heavy: process pool or thread pool + chunking.
- Mixed load: pipeline (stage 1: fetch concurrently, stage 2: parse/process in parallel).
Exercises you can do now
Pick your preferred language. Pseudocode is provided; translate and run locally. Aim to keep it simple and observable (timings, counts).
Exercise 1 — Build a bounded worker pool
Implement a worker pool that processes 100 mock jobs with at most 10 concurrent workers. Each job sleeps 50–80 ms and returns its job id.
- Create a queue of 100 jobs
- Start a fixed pool of 10 workers
- Collect results without losing any job
- Print total duration and verify concurrency actually helped
Starter pseudocode
jobs = range(1..100)
results = []
poolSize = 10
start poolSize workers reading from jobsQueue, writing to resultsQueue
for j in jobs: enqueue(jobsQueue, j)
close(jobsQueue)
for i in 1..100: results.append(dequeue(resultsQueue))
print("Processed", len(results))
Exercise 2 — Fix the race condition
You have 8 workers incrementing a shared counter for each processed item. Replace the unsafe counter with a safe approach: either an atomic counter or summation via a results channel/queue.
- Reproduce the race (unsafe increment)
- Replace with atomic increment OR per-worker counts then sum
- Verify the final count equals the number of items
Starter pseudocode
counter = 0
parallel_for item in items with 8 workers:
    process(item)
    counter = counter + 1 // unsafe
print(counter)
Common mistakes and self-checks
- Confusing concurrency with parallelism: Overusing threads for I/O when async I/O would be simpler. Self-check: Is CPU usage low but latency high? Prefer async I/O.
- Unbounded fan-out: Spawning thousands of tasks and exhausting memory or hitting rate limits. Self-check: Do you cap concurrency with a semaphore or pool?
- Shared mutable state: Data races and heisenbugs. Self-check: Can two tasks modify the same state? If yes, guard it or eliminate sharing.
- Blocking in async contexts: Using blocking calls inside event loops (illustrated in the sketch after this list). Self-check: Any blocking disk/network calls without await?
- Ignoring timeouts and cancellation: Leaking work when callers give up. Self-check: Does every external call have a timeout and propagate cancellation?
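To make the "blocking in an async context" mistake concrete, a small Python sketch comparing a blocking sleep with an awaited one; the numbers are only there to make the difference visible.

# Python sketch: blocking vs non-blocking sleep inside an event loop.
import asyncio, time

async def blocking_handler():
    time.sleep(0.1)           # BAD: blocks the whole event loop for 100 ms
    return "done"

async def async_handler():
    await asyncio.sleep(0.1)  # GOOD: yields so other tasks can run while waiting
    return "done"

async def main():
    for handler, label in ((blocking_handler, "blocking"), (async_handler, "async")):
        start = time.perf_counter()
        await asyncio.gather(*(handler() for _ in range(10)))
        print(label, "took", round(time.perf_counter() - start, 2), "s")

asyncio.run(main())
# blocking: about 1.0 s (handlers run one after another); async: about 0.1 s (waits overlap)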
Minimal reliability checklist
- Set a sensible concurrency limit
- Add timeouts to external calls
- Use safe synchronization (atomic/locks/channels)
- Measure: log duration, queue depth, errors
Learning path
- Master I/O concurrency: async/await, futures, bounded concurrency.
- Learn CPU parallelism: pools, chunking, map-reduce patterns.
- Synchronize safely: locks, atomics, channels, immutability.
- Add robustness: timeouts, cancellations, retries, rate limiting.
- Measure and tune: profiling, identifying bottlenecks, load testing.
Who this is for
- Backend engineers building APIs, workers, and data pipelines.
- Newcomers moving from single-threaded scripts to production services.
- Engineers preparing for performance and reliability responsibilities.
Prerequisites
- Comfort with one programming language (functions, collections, error handling).
- Basic understanding of HTTP, databases, and logging.
- Ability to run code locally and read stack traces.
Practical projects
- Concurrent aggregator: Hit 3–5 mock services concurrently, aggregate responses, add timeouts and fallbacks.
- Bounded crawler: Crawl a small site with at most N concurrent requests, collect stats, and respect delays.
- Parallel image processor: Resize and compress images using a process/thread pool; compare serial vs parallel timings.
- Queue worker: Read tasks from a queue, process with a fixed worker pool, implement retries and dead-letter handling.
Mini challenge
You have a batch of 10,000 items. Each item requires an external API call (~100 ms) and then a CPU step (~2 ms). Design a two-stage pipeline that:
- Stage 1: Concurrent I/O with a limit of 50 in-flight calls.
- Stage 2: Parallel CPU processing using number_of_cores() workers.
- Includes timeouts (200 ms), retries (max 1), and cancellation if overall job exceeds 5 minutes.
Hint
Use a semaphore for Stage 1, a bounded queue between stages, and a pool for Stage 2. Propagate cancellation tokens to both stages.
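One possible shape of that pipeline in Python asyncio, with the API call and CPU step mocked and a per-item handoff to a process pool standing in for the explicit bounded queue between stages:

# Python skeleton (asyncio + process pool); mock_api_call and cpu_step are mocks.
import asyncio, os
from concurrent.futures import ProcessPoolExecutor

async def mock_api_call(item):
    await asyncio.sleep(0.1)                      # ~100 ms external call
    return item

async def fetch(sem, item):
    async with sem:                               # stage 1: at most 50 in flight
        for attempt in range(2):                  # at most 1 retry
            try:
                return await asyncio.wait_for(mock_api_call(item), timeout=0.2)
            except asyncio.TimeoutError:
                if attempt == 1:
                    raise

def cpu_step(payload):
    return payload * 2                            # stands in for ~2 ms of CPU work

async def pipeline(items):
    sem = asyncio.Semaphore(50)
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(os.cpu_count()) as pool:
        async def one(item):
            payload = await fetch(sem, item)                             # stage 1: I/O
            return await loop.run_in_executor(pool, cpu_step, payload)   # stage 2: CPU
        return await asyncio.gather(*(one(i) for i in items))

async def main():
    # Cancel the whole job if it runs longer than 5 minutes.
    results = await asyncio.wait_for(pipeline(range(10_000)), timeout=300)
    print(len(results), "items processed")

if __name__ == "__main__":
    asyncio.run(main())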