Why this matters
As a Platform Engineer, you own uptime and reliability. When services cannot talk to each other, users feel it immediately. You will often be the first responder for issues like failing health checks, timeouts to databases, or TLS errors at load balancers. Solid troubleshooting shortens outages, reduces guesswork, and builds team confidence.
- Real tasks you will do: isolate whether an outage stems from DNS, routing, a firewall, or the app; confirm port reachability; trace latency spikes; validate TLS certs; verify load balancer target health.
- Impact: faster incident resolution, clearer on-call runbooks, and fewer recurring issues.
Concept explained simply
Networking troubleshooting is answering a few core questions in order:
- Is the host reachable? (Layer 3)
- Is the service reachable on the port/protocol? (Layer 4)
- Is the name resolving correctly? (DNS, Layer 7)
- If HTTPS, is TLS valid and negotiated correctly? (Layer 6/7)
- Along the path, where does it break or slow down? (Routing/Policies)
Mental model
Think “outside-in” by layers:
- Network path: interface up → route exists → security policy allows → remote listens.
- Names: clients resolve the same IPs you expect.
- Transport: TCP handshake completes; no resets/timeouts.
- Application: the right service responds with the right protocol.
Move one layer at a time, confirm or rule out each layer, and write down findings.
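The "one layer at a time, write down findings" habit can be sketched as a small runner that executes checks in order, records each result, and stops at the first failure. This is a minimal illustration under assumptions, not a complete diagnostic tool: the names run_checks, check_dns, and check_tcp are hypothetical, and only name resolution and the TCP handshake are covered.

```python
import socket


def check_dns(host):
    """Names: resolve the host and report the IPs this machine sees."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})
    return "resolved to " + ", ".join(ips)


def check_tcp(host, port, timeout=3.0):
    """Transport: complete a TCP handshake, then close the socket."""
    with socket.create_connection((host, port), timeout=timeout):
        return f"TCP handshake to port {port} completed"


def run_checks(checks):
    """Run (layer, callable) pairs in order and stop at the first failure.

    Returns a list of (layer, ok, detail) findings for your incident notes.
    """
    findings = []
    for layer, check in checks:
        try:
            findings.append((layer, True, check()))
        except OSError as exc:  # DNS failures, refusals, and timeouts are all OSError
            findings.append((layer, False, str(exc)))
            break  # skip deeper layers until this one is explained
    return findings
```

A call such as run_checks([("names", lambda: check_dns("api.internal")), ("transport", lambda: check_tcp("api.internal", 443))]) would return the findings up to and including the first layer that failed, which is where to focus next.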
Quick toolbox
- Host reachability: ping, traceroute/tracert, mtr
- DNS: dig/nslookup (A/AAAA/CNAME/SRV), /etc/resolv.conf, hosts overrides
- Ports: nc (netcat) -vz, telnet (legacy), curl -v for HTTP(S)
- TLS: openssl s_client -connect host:443 -servername host, curl --resolve, certificate dates and CN/SAN
- Linux network state: ip addr, ip route, ip rule, ss -ltnp, nftables/iptables/ufw rules
- Captures: tcpdump (requires privileges), analyze with Wireshark
- Cloud controls: security groups/network ACLs, load balancer target health (verify that the configured policies allow the traffic you expect)
Tip: Prefer commands that show both success and error details (e.g., curl -v, openssl s_client) to reduce guesswork.
Worked examples
Example 1: App can ping DB host but cannot connect to port 5432
Symptoms: Application logs show timeouts to 10.2.3.4:5432. ping works.
- Check port reachability: run nc -vz 10.2.3.4 5432. A timeout usually points to a firewall or security group dropping packets; "connection refused" usually means the host is reachable but nothing is listening on that port.
- Check on the DB host (if you can): run ss -ltnp | grep 5432 to confirm the listener is bound to the correct interface (0.0.0.0 or a specific IP).
- Validate path policies: ensure security groups/NACLs/firewalls allow TCP 5432 from source to destination.
- If policies are open, run traceroute 10.2.3.4 to confirm the path and look for drops.
Likely fix: allow inbound 5432 on DB host/security group and bind the DB to the correct interface.
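When nc is not installed on the app host, the same port test can be approximated with Python's standard library. The sketch below is an assumption-laden illustration: probe_tcp and HINTS are hypothetical names, and the hints are rules of thumb, not proof of the cause.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP handshake and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"       # handshake completed, something is listening
    except ConnectionRefusedError:
        return "refused"        # the connection was actively rejected
    except socket.timeout:
        return "timeout"        # no answer arrived within the time limit
    except OSError as exc:
        return f"error: {exc}"  # e.g. no route to host, name lookup failure


# Rules of thumb only; confirm with the listener and policy checks above.
HINTS = {
    "open": "port reachable; look at the application or its credentials next",
    "refused": "host reachable; check the listener and its bind address",
    "timeout": "packets likely dropped; check firewalls and security groups",
}
```

Running probe_tcp("10.2.3.4", 5432) from the application host (the address from the example above) returns one of the result strings, and HINTS.get(result) suggests where to look next.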
Example 2: DNS resolves to unexpected IP
Symptoms: curl -v https://api.internal hits a public IP instead of a private IP.
- Check DNS: dig api.internal. Compare the answer with the intended IP block.
- Check search domains and resolvers: look at /etc/resolv.conf.
- If containerized, check if the pod/sidecar has a different resolver config.
- Temporarily override: curl --resolve api.internal:443:10.0.1.7 https://api.internal to validate the service while DNS is being fixed.
Likely fix: correct the DNS record or the resolver configuration used by the environment.
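Comparing resolved addresses against the intended IP block can also be scripted. The sketch below goes through the system resolver (getaddrinfo), so it should reflect what an application on that host sees, including hosts-file overrides that dig does not consult. The function names and the 10.0.0.0/16 block are assumptions made for illustration; substitute your own range.

```python
import ipaddress
import socket


def resolve_ips(host, port=443):
    """Return the sorted, de-duplicated IPs the system resolver gives for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def unexpected_ips(ips, expected_cidr):
    """Return the IPs that fall outside the expected network."""
    network = ipaddress.ip_network(expected_cidr)
    # An address of a different IP version than the network counts as outside it.
    return [ip for ip in ips if ipaddress.ip_address(ip) not in network]
```

For instance, unexpected_ips(resolve_ips("api.internal"), "10.0.0.0/16") returns an empty list when every answer is inside the assumed private block, and lists the stray addresses otherwise.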
Example 3: TLS error after certificate rotation
Symptoms: curl -v https://payments.company.com returns SSL certificate problem: unable to get local issuer certificate.
- Inspect chain: openssl s_client -connect payments.company.com:443 -servername payments.company.com -showcerts.
- Look for a missing intermediate CA. The server must present the full chain.
- Check CN/SAN: ensure payments.company.com matches a SAN entry.
- Verify expiry: notBefore/notAfter fields.
Likely fix: install the correct certificate chain (leaf + intermediates) on the load balancer/ingress.
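The same checks can be approximated from Python, which validates against the system trust store much as curl does. This is a sketch under assumptions: tls_report, san_names, and days_left are hypothetical helper names, and the exact verification message depends on the local OpenSSL build. Connection errors and non-certificate TLS errors are deliberately left to propagate.

```python
import socket
import ssl
import time


def tls_report(host, port=443, timeout=5.0):
    """Attempt a verified TLS handshake with SNI set to host.

    Returns ("verify_failed", message) if certificate verification fails,
    otherwise ("ok", cert) where cert is the dict from getpeercert().
    """
    context = ssl.create_default_context()  # system CAs, hostname check enabled
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return "ok", tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        return "verify_failed", exc.verify_message


def san_names(cert):
    """List the DNS names in the certificate's subjectAltName."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def days_left(cert, now=None):
    """Days until notAfter; negative if the certificate has expired."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - now) / 86400
```

After the chain is fixed, tls_report("payments.company.com") should return "ok" together with the certificate, and san_names and days_left can then confirm the hostname coverage and the remaining validity.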
Exercises (practice)
Do these now. They mirror the exercises below and build muscle memory.
- Exercise 1: Diagnose a blocked DB port using connection tests and listeners.
- Exercise 2: Diagnose a TLS handshake failure using curl -v and openssl s_client.
Checklist before you start
- Can you test ICMP (ping) vs TCP (nc/curl) separately?
- Do you know which IP and port are expected?
- Do you know which resolvers your environment uses?
- Do you know where firewall/security group rules are defined?
Common mistakes
- Assuming ping success means the service port is open. It only proves network reachability, not application reachability.
- Testing with HTTP to a non-HTTP port. Use the right tool for the protocol (nc/ss for TCP, curl for HTTP).
- Ignoring DNS search domains, causing names to resolve differently across hosts or pods.
- Forgetting SNI in TLS: many servers present different certs based on the Server Name. Use -servername with openssl.
- Overlooking outbound (egress) rules: egress rules can block connections just as inbound rules can.
- Not checking the listener bind address; service may listen only on localhost (127.0.0.1) instead of 0.0.0.0.
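The bind-address mistake is easy to check mechanically. The sketch below assumes you have captured the text output of ss -ltn (or ss -ltnp) from the server and that the local address follows the LISTEN state and the two queue columns; the function names are hypothetical.

```python
import ipaddress


def listeners_on_port(ss_output, port):
    """Return the local bind addresses that are listening on port."""
    addrs = []
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue  # header line or a socket in another state
        idx = fields.index("LISTEN")
        if len(fields) <= idx + 3:
            continue  # line is shorter than the assumed layout
        # Local Address:Port follows the State, Recv-Q, and Send-Q columns.
        addr, _, found_port = fields[idx + 3].rpartition(":")
        if found_port == str(port):
            addrs.append(addr)
    return addrs


def is_loopback(addr):
    """True for 127.0.0.0/8 and ::1; False for wildcards and other addresses."""
    host = addr.strip("[]").split("%", 1)[0]  # drop IPv6 brackets and %interface
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # "*" or anything that is not a literal IP


def loopback_only(ss_output, port):
    """True if something listens on port, but only on loopback addresses."""
    addrs = listeners_on_port(ss_output, port)
    return bool(addrs) and all(is_loopback(a) for a in addrs)
```

If loopback_only(output, 5432) is true, remote clients cannot reach the service regardless of firewall rules, and the fix is in the service's listen configuration.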
How to self-check your work
- After each change, re-run the exact failing command (curl/nc) and save the output to compare.
- Confirm at least two independent signals agree (e.g., nc success and load balancer target marked healthy).
- For DNS, compare dig answers from two different resolvers to rule out caching.
- For TLS, verify both with openssl s_client and a browser/curl to ensure chain and hostname validation are correct.
Practical projects
- Build a small two-service app (web → db) and intentionally block the DB port. Write a runbook to detect and fix it fast.
- Set up a local reverse proxy (like nginx) with HTTPS. Rotate its certificate with an incomplete chain and practice diagnosing the error.
- Create a lab with two subnets and a missing host route. Practice finding where the path breaks with traceroute and policy checks.
Learning path
- Start here: layer-by-layer troubleshooting and the core commands.
- Next: deepen with packet captures (tcpdump) and reading SYN/SYN-ACK/ACK flows.
- Then: practice cloud load balancers, health checks, and security policies in a sandbox environment.
- Finally: automate checks into runbooks (copy-paste ready commands with expected outputs).
Who this is for
- Platform and DevOps engineers who support services in production.
- Backend engineers who want to debug connectivity without waiting on infra teams.
- New on-call responders who need a reliable first-steps playbook.
Prerequisites
- Basic Linux shell usage.
- Familiarity with IPs, ports, and DNS.
- Access to test environments where you can run network commands. Some commands may require elevated privileges.
Mini challenge
Your app at https://store.internal returns intermittent 502s. You notice DNS sometimes returns two IPs. Outline the exact commands you will run to confirm which backend is failing and how you will temporarily route around it. Hint: use dig, curl -v repeatedly, and --resolve to pin the IP.
Next steps
- Take the Quick Test below to check understanding.
- Repeat the exercises with your own services and write a short runbook.
- Move on to deeper topics: packet captures, load balancer behaviors, and zero-downtime cert rotations.