Why this matters
As a Platform Engineer, you route traffic reliably and safely. DNS maps names to services, and load balancers spread traffic, keep apps available during failures, and enable zero-downtime deploys. You will use these skills to:
- Expose internal and external services with stable hostnames.
- Design failover and disaster recovery using DNS and health checks.
- Roll out blue/green or canary releases without breaking users.
- Scale horizontally and keep response times predictable.
Who this is for
- Platform and DevOps engineers starting with networking fundamentals.
- Backend engineers who need to ship services behind stable endpoints.
- SREs improving availability and release safety.
Prerequisites
- Comfort with the command line and editing config files.
- Basic TCP/IP understanding (IP, ports, HTTP).
- Ability to run simple tests using curl, dig, or nslookup.
Concept explained simply
DNS (Domain Name System)
DNS is the phonebook of the internet. It translates human-readable names (like api.example.com) into IP addresses that computers use. It’s distributed and cached so lookups are fast and scalable.
- Common records: A (IPv4), AAAA (IPv6), CNAME (alias to another name), TXT (metadata), MX (mail), NS (nameservers), SRV (service location).
- TTL: Time-to-Live controls caching. Lower TTLs let you change answers faster but increase DNS query load.
- Resolvers: Your device asks a recursive resolver, which finds authoritative nameservers and returns the final answer, caching along the way.
- Gotcha: You cannot put a CNAME at the zone apex (example.com) because the apex must also have SOA and NS records. Use ALIAS/ANAME features if your DNS provider supports them.
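Resolver caching with TTLs can be sketched as a tiny cache keyed by name, where each answer carries an expiry time. A minimal Python sketch (the name and IP are illustrative):

```python
import time

class TtlCache:
    """Minimal DNS-style cache: each answer expires after its TTL."""
    def __init__(self):
        self._store = {}  # name -> (answer, expires_at)

    def put(self, name, answer, ttl):
        self._store[name] = (answer, time.monotonic() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None  # miss: a real resolver would now query authoritative servers
        answer, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[name]  # expired: treated exactly like a miss
            return None
        return answer

cache = TtlCache()
cache.put("api.example.com", "203.0.113.12", ttl=300)
print(cache.get("api.example.com"))  # 203.0.113.12 (served from cache)
```

This is also why lowering a TTL only helps after the old, longer TTL has expired everywhere: caches keep serving the old answer until its original expiry.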
Load Balancing
Load balancers distribute traffic across multiple backends to improve reliability and performance.
- L4 vs L7: Layer 4 (TCP/UDP) routes by IP/port only. Layer 7 (HTTP/HTTPS) can route by path, host, headers, cookies, etc.
- Algorithms: round-robin, weighted round-robin, least connections, IP hash (simple stickiness).
- Health checks: Automatically remove unhealthy backends (e.g., HTTP 200 on /healthz).
- Session affinity: Keep a user on the same backend if needed (cookies, IP hash).
- Global vs local: DNS-based distribution across regions (global), load balancer-based distribution within a region (local).
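Two of the algorithms above are easy to sketch. A minimal Python illustration of round-robin and least-connections selection (backend addresses are made up):

```python
import itertools

backends = ["10.0.1.10:8080", "10.0.1.11:8080", "10.0.1.12:8080"]

# Round-robin: hand out backends in a fixed rotation, ignoring load.
rotation = itertools.cycle(backends)
def pick_round_robin():
    return next(rotation)

# Least connections: pick the backend with the fewest in-flight
# requests (tracked here in a plain dict; the caller decrements
# the count when its request completes).
active = dict.fromkeys(backends, 0)
def pick_least_conn():
    choice = min(active, key=active.get)
    active[choice] += 1
    return choice

print([pick_round_robin() for _ in range(4)])  # wraps back to the first backend
active["10.0.1.10:8080"] = 5                   # simulate one busy backend
print(pick_least_conn())                       # avoids the busy backend
```

Round-robin is predictable and stateless; least connections adapts to uneven request durations at the cost of tracking per-backend state.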
Mental model
Imagine DNS as road signs that point users toward a city (your region or load balancer). Inside the city, a traffic officer (the load balancer) directs cars to different parking lots (your instances) based on rules and current congestion. The signs change slowly (TTL), while the officer reacts quickly (health checks, algorithms).
Core concepts to know
- DNS caching and TTL trade-offs.
- Authoritative vs recursive resolvers; SOA/NS records.
- Record selection: A/AAAA vs CNAME; ALIAS/ANAME at apex.
- Anycast and DNS-based global routing, and why resolver caching limits their precision.
- L4 vs L7 load balancing; when to choose each.
- Health checks, draining, surge protection, timeouts.
- Strategies: blue/green, canary, weighted routing, failover.
Worked examples
Example 1: Map a service and plan TTLs
- Goal: Expose api.example.com to an L7 load balancer at 203.0.113.12.
- Records:
```
; zone: example.com
; TTL 300 (5 minutes) allows faster changes during rollout
api  300  IN  A     203.0.113.12
api  300  IN  AAAA  2001:db8::12
```
- Trade-off: During a migration, use TTL 300 so you can switch targets quickly. Once stable, raise it to 3600 to reduce DNS query load.
- Note: TTL is not a record type; it is a field that precedes the class and type on each record line.
Example 2: Weighted DNS between two regions
- Goal: Send 80% traffic to us-east, 20% to eu-west. Providers implement weights differently; conceptually it looks like:
```
api.example.com. 300 IN A 198.51.100.10   ; us-east, weight=80
api.example.com. 300 IN A 203.0.113.20    ; eu-west, weight=20
```
- Limitations: DNS caches mean users might not immediately see new weights. Use DNS weights for coarse global distribution, not real-time traffic shifting.
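Conceptually, a weighted DNS answer behaves like a weighted random choice across resolvers. A Python sketch of the 80/20 split (remember that each resolver then caches its answer for the full TTL, so real traffic shifts more slowly than the raw weights suggest):

```python
import random

# Weighted answers from the example: us-east weight 80, eu-west weight 20.
answers = {"198.51.100.10": 80, "203.0.113.20": 20}

def weighted_pick(weighted, rng):
    ips = list(weighted)
    return rng.choices(ips, weights=[weighted[ip] for ip in ips], k=1)[0]

# Across many independent resolvers the split approaches 80/20, but each
# resolver caches whichever answer it got, so changing the weights moves
# real traffic only as those caches expire.
rng = random.Random(42)  # seeded for a repeatable demo
picks = [weighted_pick(answers, rng) for _ in range(10_000)]
share = picks.count("198.51.100.10") / len(picks)
print(f"us-east share: {share:.1%}")  # close to 80%
```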
Example 3: L7 load balancer (NGINX) with health checks
- Upstreams and routes:
```nginx
http {
    upstream app_backend {
        least_conn;
        server 10.0.1.10:8080 max_fails=2 fail_timeout=10s;
        server 10.0.1.11:8080 max_fails=2 fail_timeout=10s;
    }

    server {
        listen 80;

        location /healthz {
            return 200 'ok';
        }

        location / {
            proxy_set_header Host $host;
            proxy_pass http://app_backend;
        }
    }
}
```
- Effect: NGINX prefers backends with fewer active connections and temporarily stops sending traffic to backends that fail repeatedly.
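The max_fails/fail_timeout pair acts as a passive circuit breaker: after max_fails failures, the backend is skipped for fail_timeout seconds. A rough Python sketch of that bookkeeping (this mirrors the idea, not NGINX's actual implementation):

```python
import time

MAX_FAILS, FAIL_TIMEOUT = 2, 10.0  # mirrors max_fails=2 fail_timeout=10s

class Backend:
    def __init__(self, addr):
        self.addr = addr
        self.fails = 0
        self.down_until = 0.0

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.fails += 1
        if self.fails >= MAX_FAILS:
            # Too many failures: take it out of rotation for the window.
            self.down_until = now + FAIL_TIMEOUT
            self.fails = 0

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.down_until

b = Backend("10.0.1.10:8080")
b.record_failure(now=0.0)
print(b.available(now=1.0))   # True: one failure is tolerated
b.record_failure(now=2.0)
print(b.available(now=5.0))   # False: marked down until the window expires
print(b.available(now=13.0))  # True: window elapsed, backend is retried
```

With only passive checks, a few real requests must fail before a backend is ejected; an active prober hitting /healthz catches failures without user impact.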
Example 4: Blue/Green switch with weights
- Goal: Move from v1 to v2 gradually via L7 weights.
```nginx
upstream app_backend {
    # weighted round-robin via server weight
    server 10.0.1.10:8080 weight=9;  # v1 gets ~90% of requests
    server 10.0.1.20:8080 weight=1;  # v2 gets ~10% of requests
}
```
- Increase v2's weight step by step, monitor errors and latency, then drain v1.
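The step-by-step weight shift can be automated as a ramp loop that rolls back when the canary's error rate exceeds a budget. A Python sketch (the ramp steps, error budget, and metrics source are assumptions, not part of the NGINX config above):

```python
RAMP = [1, 5, 25, 50, 100]   # percent of traffic to v2 at each step
ERROR_BUDGET = 0.01          # abort if v2's error rate exceeds 1%

def run_canary(ramp, error_rate_for):
    """Advance v2's weight step by step; roll back on a budget breach.

    error_rate_for(percent) -> observed v2 error rate at that weight;
    in real life this would come from your metrics system after a
    bake period at each step.
    """
    for percent in ramp:
        if error_rate_for(percent) > ERROR_BUDGET:
            return 0, "rolled back"   # drop v2's weight, keep v1 serving
    return 100, "promoted"            # v2 takes all traffic; drain v1

# Healthy rollout: the error rate stays low at every step.
print(run_canary(RAMP, lambda p: 0.001))
# Bad build: errors spike once real traffic arrives.
print(run_canary(RAMP, lambda p: 0.05 if p >= 25 else 0.001))
```

The small early steps (1%, 5%) limit the blast radius: most users never see a bad build because it is rolled back before its weight grows.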
Hands-on practice exercises
Complete the exercises below. When ready, take the Quick Test at the bottom. Note: Tests are available to everyone; only logged-in users get saved progress.
Exercise 1: Plan DNS for a service and simulate a change
See details in the Exercises section below (Exercise 1).
Exercise 2: Configure NGINX load balancing with health checks
See details in the Exercises section below (Exercise 2).
Checklist: I can
- Explain recursive vs authoritative DNS in one sentence.
- Choose between A/AAAA vs CNAME vs ALIAS/ANAME at apex.
- Set TTLs appropriately for migrations and steady state.
- Pick L4 vs L7 based on requirements (protocol vs content routing).
- Enable health checks and connection draining on a load balancer.
- Run a blue/green or canary rollout with weights or routing rules.
- Diagnose DNS issues with dig/nslookup and LB issues with curl/logs.
Common mistakes and how to self-check
- Using CNAME at zone apex: Not allowed. Self-check: Verify SOA/NS exist at apex; use ALIAS/ANAME if needed.
- TTL too high during migrations: Changes propagate slowly. Self-check: Lower the TTL at least one old-TTL period before the change, so high-TTL answers have expired from caches by cutover time.
- No health checks: Users hit dead backends. Self-check: Intentionally stop one backend and see if traffic shifts automatically.
- Sticky sessions without reason: Reduces balancing quality. Self-check: Remove stickiness unless the app truly needs it.
- Ignoring IPv6: Some clients resolve AAAA first. Self-check: Add AAAA records and verify with dig AAAA and curl -6.
- Per-request "balancing" via DNS only: DNS is coarse. Self-check: Use L7/L4 LB for fast failover and granular control.
Practical projects
- High-availability web app: Put two app instances behind NGINX with health checks and demonstrate failover.
- Global read distribution: Use DNS weights to send a portion of traffic to a read-only replica in another region; measure cache effects.
- Blue/green release: Automate weight shifts and connection draining; record error rates and rollback steps.
Learning path
- DNS basics: understand records, TTL, caching.
- Hands-on: dig/nslookup to see resolution paths.
- Load balancer fundamentals: algorithms, L4 vs L7.
- Deploy a lab NGINX/HAProxy and configure health checks.
- Practice blue/green and canary with controlled weights.
- Add observability: logs and simple health endpoints.
Next steps
- Finish the Exercises below and verify the Checklist items.
- Take the Quick Test to confirm understanding.
- Apply the Practical projects at work or in a lab.
Exercises
Exercise 1 — Plan DNS and simulate a change
Goal: Design records for api.example.com and simulate a migration with safe TTLs.
Show task
- Create a zone file snippet for example.com that includes:
- SOA/NS placeholders, and records for api.example.com with A and AAAA.
- TTL 300 for api during migration.
- Simulate a change: switch api.example.com from 198.51.100.10 to 203.0.113.12 and note propagation expectations.
- Use dig to show what you expect to see before and after the TTL window.
Exercise 2 — NGINX load balancer with health checks
Goal: Balance across two backends with least connections, add a health endpoint, and test failover.
Show task
- Write an NGINX config with an upstream of two servers, least_conn, and basic fail parameters.
- Add a /healthz that returns 200.
- Use curl in a loop to observe distribution, then stop one backend and verify automatic failover.
Mini challenge
You must move 30% of traffic to a new region for a read-only feature test within 1 hour, minimizing risk. What’s your plan using DNS and the load balancer?
Sample approach
- Lower the DNS TTL to 300 in advance, and wait out at least one old-TTL period so cached answers expire before the shift.
- Add a weighted DNS record for the new region at 0.3 effective weight.
- Keep regional L7 health checks strict; monitor errors and latency.
- If errors rise, drop weight to 0 or remove record; if stable, consider small increments.