Why this matters
Backend systems often juggle thousands of requests, slow I/O calls, and CPU-heavy tasks. Knowing when to use concurrency (making progress on multiple tasks) and when to use parallelism (running tasks at the same time on multiple CPU cores) helps you build faster, more reliable services.
- Speed up API endpoints by overlapping I/O (database, cache, HTTP calls).
- Protect shared state (caches, counters, queues) from race conditions.
- Make batch jobs finish sooner by using available CPU cores safely.
- Stay within rate limits and resource budgets using worker pools and backpressure.
Concept explained simply
- Concurrency: Structuring a program so multiple tasks can make progress without blocking each other. Often about waiting less (great for I/O-bound work).
- Parallelism: Running multiple tasks at the same time on different cores. Often about doing more compute per unit time (great for CPU-bound work).
Mental model
Imagine a restaurant:
- Concurrency: One chef manages several dishes by switching between them while some are baking or boiling (no time wasted waiting).
- Parallelism: Several chefs cook different dishes at the same time on separate stations.
Quick checks to choose an approach
- If the bottleneck is waiting (network, disk, DB): use concurrency (async/await, event loops, non-blocking I/O).
- If the bottleneck is CPU: use parallelism (multiple processes/threads, task parallel libraries).
- Mix both if you have I/O and CPU parts: e.g., fetch data concurrently, then process in parallel.
Worked examples
Example 1: Concurrent I/O fan-out
Goal: Call three internal services and respond as soon as all return or any fails.
// Pseudocode (language-agnostic)
function handleRequest(userId):
    t1 = async fetchProfile(userId)
    t2 = async fetchOrders(userId)
    t3 = async fetchRecommendations(userId)
    results = await all(t1, t2, t3) with timeout 300ms
    if any failed: return partial or fallback
    return aggregate(results)
Why it works: I/O overlaps, reducing total latency to roughly the slowest call (plus overhead) instead of sum of latencies.
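A minimal runnable sketch of this fan-out in Python's asyncio; the fetch_* coroutines and their latencies are placeholders, not real services, and the 300 ms budget covers the whole group.

# Python sketch (asyncio); the three fetch_* coroutines are hypothetical stand-ins.
import asyncio

async def fetch_profile(user_id):
    await asyncio.sleep(0.05)           # simulated service latency
    return {"profile": user_id}

async def fetch_orders(user_id):
    await asyncio.sleep(0.08)
    return {"orders": []}

async def fetch_recommendations(user_id):
    await asyncio.sleep(0.12)
    return {"recommendations": []}

async def handle_request(user_id):
    try:
        # All three calls run concurrently; the timeout applies to the whole group.
        profile, orders, recs = await asyncio.wait_for(
            asyncio.gather(
                fetch_profile(user_id),
                fetch_orders(user_id),
                fetch_recommendations(user_id),
            ),
            timeout=0.3,
        )
        return {**profile, **orders, **recs}
    except Exception:
        return {"user_id": user_id, "fallback": True}   # partial/fallback response

print(asyncio.run(handle_request(42)))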
Example 2: Parallel CPU map-reduce
Goal: Recompute search indexes.
docs = loadDocuments()
chunks = split(docs, by=number_of_cores())
partialIndexes = parallel_map(chunks, buildPartialIndex)
index = reduce(mergeIndex, partialIndexes)
Why it works: Each core builds a partial index in parallel; merging the partial indexes at the end is far cheaper than building the whole index on a single core.
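A runnable sketch of the same map-reduce in Python using a process pool; the "index" here is just a word count standing in for real index building.

# Python sketch (multiprocessing); build_partial_index is a toy word-count index.
from collections import Counter
from functools import reduce
from multiprocessing import Pool, cpu_count

def build_partial_index(docs):
    index = Counter()
    for doc in docs:
        index.update(doc.split())        # count term frequencies for this chunk
    return index

def merge_index(a, b):
    a.update(b)                          # fold one partial index into another
    return a

if __name__ == "__main__":
    docs = ["backend systems", "concurrency and parallelism", "parallel backend"] * 1000
    cores = cpu_count()
    chunks = [docs[i::cores] for i in range(cores)]   # one chunk per core
    with Pool(cores) as pool:
        partial_indexes = pool.map(build_partial_index, chunks)
    index = reduce(merge_index, partial_indexes, Counter())
    print(index.most_common(3))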
Example 3: Avoiding races on shared counters
Goal: Count processed items across workers safely.
counter = 0
workers = startN(8)
// BAD: the unsynchronized increment inside each task is a race
for each item in items:
    enqueue(work, () => { process(item); counter = counter + 1 })
waitAll(workers)
print(counter) // may be less than the number of items
// GOOD: use an atomic counter or channel to aggregate
atomicCounter = AtomicInt(0)
for each item in items:
    enqueue(work, () => { process(item); atomicCounter.increment() })
waitAll(workers)
print(atomicCounter.get())
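Python has no built-in AtomicInt, so a Lock-protected counter (or per-worker counts summed at the end) plays that role; a minimal sketch:

# Python sketch (threading); a Lock stands in for the atomic counter.
import threading
from concurrent.futures import ThreadPoolExecutor

items = range(10_000)
counter = 0
lock = threading.Lock()

def process(item):
    global counter
    with lock:                 # without the lock, counter += 1 can lose updates
        counter += 1

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process, items))

print(counter)                 # 10000 -- every increment is accounted for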
Example 4: Bounded concurrency (semaphore)
Goal: Call an external API with at most 20 in-flight requests to stay under rate limits.
sem = Semaphore(20)
for id in ids:
    acquire(sem)
    async {
        try { callExternalAPI(id) }
        finally { release(sem) }
    }
await all tasks
Why it works: You cap pressure on the external system and your resources, improving reliability.
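The same idea in Python asyncio, with call_external_api mocked by a sleep:

# Python sketch (asyncio); call_external_api is a stand-in for a real client call.
import asyncio

async def call_external_api(item_id):
    await asyncio.sleep(0.1)            # simulated network latency
    return item_id

async def limited_call(sem, item_id):
    async with sem:                     # acquire before the call, release afterwards
        return await call_external_api(item_id)

async def main():
    sem = asyncio.Semaphore(20)         # at most 20 requests in flight
    results = await asyncio.gather(*(limited_call(sem, i) for i in range(200)))
    print(len(results), "calls completed")

asyncio.run(main())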
Common primitives (tool-agnostic)
- Futures/Promises/Tasks: Represent work that completes later.
- Async/Await: Structured way to pause and resume tasks without blocking threads.
- Thread/Process Pools: Reuse workers to avoid creation overhead.
- Semaphores/Rate limiters: Control concurrency level and throughput.
- Channels/Queues: Safe communication between producers and consumers.
- Locks/Mutexes/Atomics: Protect shared state; prefer minimizing shared mutable state.
Choosing the right primitive
- I/O-heavy: async I/O + bounded concurrency.
- CPU-heavy: process pool or thread pool + chunking.
- Mixed load: pipeline (stage 1: fetch concurrently, stage 2: parse/process in parallel).
Exercises you can do now
Pick your preferred language. Pseudocode is provided; translate and run locally. Aim to keep it simple and observable (timings, counts).
Exercise 1 — Build a bounded worker pool
Implement a worker pool that processes 100 mock jobs with at most 10 concurrent workers. Each job sleeps 50–80 ms and returns its job id.
- Create a queue of 100 jobs
- Start a fixed pool of 10 workers
- Collect results without losing any job
- Print total duration and verify concurrency actually helped
Starter pseudocode
jobs = range(1..100)
results = []
poolSize = 10
start poolSize workers reading from jobsQueue, writing to resultsQueue
for j in jobs: enqueue(jobsQueue, j)
close(jobsQueue)
for i in 1..100: results.append(dequeue(resultsQueue))
print("Processed", len(results))
Exercise 2 — Fix the race condition
You have 8 workers incrementing a shared counter for each processed item. Replace the unsafe counter with a safe approach: either an atomic counter or summation via a results channel/queue.
- Reproduce the race (unsafe increment)
- Replace with atomic increment OR per-worker counts then sum
- Verify the final count equals the number of items
Starter pseudocode
counter = 0
parallel_for item in items with 8 workers:
    process(item)
    counter = counter + 1 // unsafe
print(counter)
Common mistakes and self-checks
- Confusing concurrency with parallelism: Overusing threads for I/O when async I/O would be simpler. Self-check: Is CPU usage low but latency high? Prefer async I/O.
- Unbounded fan-out: Spawning thousands of tasks and exhausting memory or hitting rate limits. Self-check: Do you cap concurrency with a semaphore or pool?
- Shared mutable state: Data races and heisenbugs. Self-check: Can two tasks modify the same state? If yes, guard it or eliminate sharing.
- Blocking in async contexts: Using blocking calls inside event loops (illustrated in the sketch after this list). Self-check: Any blocking disk/network calls without await?
- Ignoring timeouts and cancellation: Leaking work when callers give up. Self-check: Does every external call have a timeout and propagate cancellation?
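To make the "blocking in an async context" mistake concrete, a small Python sketch comparing a blocking sleep with an awaited one; the numbers are only there to make the difference visible.

# Python sketch: blocking vs non-blocking sleep inside an event loop.
import asyncio, time

async def blocking_handler():
    time.sleep(0.1)           # BAD: blocks the whole event loop for 100 ms
    return "done"

async def async_handler():
    await asyncio.sleep(0.1)  # GOOD: yields so other tasks can run while waiting
    return "done"

async def main():
    for handler, label in ((blocking_handler, "blocking"), (async_handler, "async")):
        start = time.perf_counter()
        await asyncio.gather(*(handler() for _ in range(10)))
        print(label, "took", round(time.perf_counter() - start, 2), "s")

asyncio.run(main())
# blocking: about 1.0 s (handlers run one after another); async: about 0.1 s (waits overlap)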
Minimal reliability checklist
- Set a sensible concurrency limit
- Add timeouts to external calls
- Use safe synchronization (atomic/locks/channels)
- Measure: log duration, queue depth, errors
Learning path
- Master I/O concurrency: async/await, futures, bounded concurrency.
- Learn CPU parallelism: pools, chunking, map-reduce patterns.
- Synchronize safely: locks, atomics, channels, immutability.
- Add robustness: timeouts, cancellations, retries, rate limiting.
- Measure and tune: profiling, identifying bottlenecks, load testing.
Who this is for
- Backend engineers building APIs, workers, and data pipelines.
- Newcomers moving from single-threaded scripts to production services.
- Engineers preparing for performance and reliability responsibilities.
Prerequisites
- Comfort with one programming language (functions, collections, error handling).
- Basic understanding of HTTP, databases, and logging.
- Ability to run code locally and read stack traces.
Practical projects
- Concurrent aggregator: Hit 3–5 mock services concurrently, aggregate responses, add timeouts and fallbacks.
- Bounded crawler: Crawl a small site with at most N concurrent requests, collect stats, and respect delays.
- Parallel image processor: Resize and compress images using a process/thread pool; compare serial vs parallel timings.
- Queue worker: Read tasks from a queue, process with a fixed worker pool, implement retries and dead-letter handling.
Mini challenge
You have a batch of 10,000 items. Each item requires an external API call (~100 ms) and then a CPU step (~2 ms). Design a two-stage pipeline that:
- Stage 1: Concurrent I/O with a limit of 50 in-flight calls.
- Stage 2: Parallel CPU processing using number_of_cores() workers.
- Includes timeouts (200 ms), retries (max 1), and cancellation if overall job exceeds 5 minutes.
Hint
Use a semaphore for Stage 1, a bounded queue between stages, and a pool for Stage 2. Propagate cancellation tokens to both stages.
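One possible shape of that pipeline in Python asyncio, with the API call and CPU step mocked and a per-item handoff to a process pool standing in for the explicit bounded queue between stages:

# Python skeleton (asyncio + process pool); mock_api_call and cpu_step are mocks.
import asyncio, os
from concurrent.futures import ProcessPoolExecutor

async def mock_api_call(item):
    await asyncio.sleep(0.1)                      # ~100 ms external call
    return item

async def fetch(sem, item):
    async with sem:                               # stage 1: at most 50 in flight
        for attempt in range(2):                  # at most 1 retry
            try:
                return await asyncio.wait_for(mock_api_call(item), timeout=0.2)
            except asyncio.TimeoutError:
                if attempt == 1:
                    raise

def cpu_step(payload):
    return payload * 2                            # stands in for ~2 ms of CPU work

async def pipeline(items):
    sem = asyncio.Semaphore(50)
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(os.cpu_count()) as pool:
        async def one(item):
            payload = await fetch(sem, item)                             # stage 1: I/O
            return await loop.run_in_executor(pool, cpu_step, payload)   # stage 2: CPU
        return await asyncio.gather(*(one(i) for i in items))

async def main():
    # Cancel the whole job if it runs longer than 5 minutes.
    results = await asyncio.wait_for(pipeline(range(10_000)), timeout=300)
    print(len(results), "items processed")

if __name__ == "__main__":
    asyncio.run(main())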