Networking Troubleshooting Basics

Learn Networking Troubleshooting Basics for free with explanations, exercises, and a quick test (for Platform Engineer).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

As a Platform Engineer, you own uptime and reliability. When services cannot talk to each other, users feel it immediately. You will often be the first responder for issues like failing health checks, timeouts to databases, or TLS errors at load balancers. Solid troubleshooting shortens outages, reduces guesswork, and builds team confidence.

  • Real tasks you will do: isolate whether an outage is caused by DNS, routing, a firewall, or the app; confirm port reachability; trace latency spikes; validate TLS certs; verify load balancer target health.
  • Impact: faster incident resolution, clearer on-call runbooks, and fewer recurring issues.

Concept explained simply

Networking troubleshooting is answering a few core questions in order:

  • Is the host reachable? (Layer 3)
  • Is the service reachable on the port/protocol? (Layer 4)
  • Is the name resolving correctly? (DNS, Layer 7)
  • If HTTPS, is TLS valid and negotiated correctly? (Layer 6/7)
  • Along the path, where does it break or slow down? (Routing/Policies)

Mental model

Think “outside-in” by layers:

  • Network path: interface up → route exists → security policy allows → remote listens.
  • Names: clients resolve the same IPs you expect.
  • Transport: TCP handshake completes; no resets/timeouts.
  • Application: the right service responds with the right protocol.

Move one layer at a time, confirm or rule out each layer, and write down findings.
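
As a rough illustration of moving one layer at a time, the Python sketch below checks name resolution first and only then attempts a TCP handshake, recording a finding for each layer and stopping at the first failure. The function names (check_dns, check_tcp, diagnose) and the 3-second timeout are assumptions made for this example. It covers only the names and transport layers, not TLS or the application.

```python
import socket


def check_dns(host):
    """Resolve host with the system resolver; return (ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"resolution failed: {exc}"
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)


def check_tcp(host, port, timeout=3.0):
    """Attempt a TCP handshake; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "TCP handshake completed"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


def diagnose(host, port):
    """Run the checks in order and stop at the first failing layer."""
    checks = (
        ("names", lambda: check_dns(host)),
        ("transport", lambda: check_tcp(host, port)),
    )
    findings = []
    for layer, check in checks:
        ok, detail = check()
        findings.append((layer, ok, detail))
        if not ok:
            break  # later layers cannot be judged until this one works
    return findings
```

Each entry in the returned list is a (layer, ok, detail) tuple, so the result of diagnose(host, port) can be copied into incident notes as the written findings.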

Quick toolbox

  • Host reachability: ping, traceroute/tracert, mtr
  • DNS: dig/nslookup (A/AAAA/CNAME/SRV), /etc/resolv.conf, hosts overrides
  • Ports: nc (netcat) -vz, telnet (legacy), curl -v for HTTP(S)
  • TLS: openssl s_client -connect host:443 -servername host, curl --resolve, certificate dates and CN/SAN
  • Linux network state: ip addr, ip route, ip rule, ss -ltnp, nftables/iptables/ufw rules
  • Captures: tcpdump (requires privileges), analyze with Wireshark
  • Cloud controls: security groups/network ACLs, load balancer target health (review the policies that apply to the traffic)

Tip: Prefer commands that show both success and error details (e.g., curl -v, openssl s_client) to reduce guesswork.

Worked examples

Example 1: App can ping DB host but cannot connect to port 5432

Symptoms: Application logs show timeouts to 10.2.3.4:5432. ping works.

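  • Check port reachability: nc -vz 10.2.3.4 5432. A timeout usually means a firewall or security group is silently dropping the traffic; "connection refused" usually means the host is reachable but nothing is listening on that address and port (or a firewall is actively rejecting).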
  • Check on DB host (if you can): ss -ltnp | grep 5432 to confirm listener is bound to the correct interface (0.0.0.0 or specific IP).
  • Validate path policies: ensure security groups/NACLs/firewalls allow TCP 5432 from source to destination.
  • If policies are open, run traceroute 10.2.3.4 to confirm the path and look for drops.

Likely fix: allow inbound 5432 on the DB host/security group and/or bind the DB to the correct interface, depending on which check failed.
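
The port test in this example can also be scripted. The sketch below is a minimal Python equivalent of a TCP connect test that separates an open port, a refused connection, and a timeout, since each points to a different cause. The function name probe_tcp, the HINTS table, and the 3-second timeout are assumptions for illustration; the hints are rules of thumb, not guarantees.

```python
import socket

# Rule-of-thumb interpretations for each outcome (assumed wording).
HINTS = {
    "open": "TCP handshake completed; look at the application or its auth next",
    "refused": "host reachable, but nothing listens on that address and port "
               "(or a firewall actively rejects)",
    "timeout": "no reply; a firewall, security group, NACL, or routing issue "
               "is likely dropping the traffic",
}


def probe_tcp(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout', or 'error: ...' for one connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        # For example: network unreachable, no route to host.
        return f"error: {exc}"
```

Run from the app VM, probe_tcp("10.2.3.4", 5432) gives the outcome, and HINTS.get(outcome) suggests where to look next.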

Example 2: DNS resolves to unexpected IP

Symptoms: curl -v https://api.internal hits a public IP instead of a private IP.

  • Check DNS: dig api.internal. Compare the answer with the intended IP block.
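  • Check search domains and resolvers: look at /etc/resolv.conf. dig does not apply the search list by default and never reads /etc/hosts, so also run getent hosts api.internal to see what applications on the host actually resolve.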
  • If containerized, check if the pod/sidecar has a different resolver config.
  • Temporarily override: curl --resolve api.internal:443:10.0.1.7 https://api.internal to validate the service while DNS is being fixed.

Likely fix: correct the DNS record or the resolver configuration used by the environment.
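
Comparing the answer with the intended IP block can be automated. The sketch below resolves a name through the system resolver (the same path most applications use) and lists any addresses outside an expected CIDR. The function names and the 10.0.0.0/8 block used in the usage note are assumptions; substitute the range your environment actually uses.

```python
import ipaddress
import socket


def resolve(name):
    """Return the sorted unique addresses the system resolver gives for name."""
    infos = socket.getaddrinfo(name, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def outside_expected(addresses, expected_cidr):
    """Return the addresses that do not fall inside expected_cidr."""
    network = ipaddress.ip_network(expected_cidr)
    return [a for a in addresses if ipaddress.ip_address(a) not in network]
```

For the symptom above, outside_expected(resolve("api.internal"), "10.0.0.0/8") would return the unexpected public address, and an empty list once DNS is corrected.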

Example 3: TLS error after certificate rotation

Symptoms: curl -v https://payments.company.com returns SSL certificate problem: unable to get local issuer certificate.

  • Inspect chain: openssl s_client -connect payments.company.com:443 -servername payments.company.com -showcerts.
  • Look for a missing intermediate CA. The server must present the full chain (leaf plus intermediates).
  • Check CN/SAN: ensure payments.company.com matches a SAN entry.
  • Verify validity dates: check the notBefore/notAfter fields, for example by piping the certificate to openssl x509 -noout -dates.

Likely fix: install the correct certificate chain (leaf + intermediates) on the load balancer/ingress.
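
The SAN and expiry checks can be sketched in Python with the standard ssl module. The helper names below are assumptions for illustration. fetch_cert performs a handshake with SNI and default verification, so an incomplete chain raises ssl.SSLCertVerificationError; it needs network access. The other two helpers work on the dictionary returned by getpeercert(), and the wildcard handling is deliberately simplified to a single leftmost label.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert(host, port=443, timeout=5.0):
    """Handshake with SNI and default verification; return the peer cert dict."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_until_expiry(cert, now=None):
    """Days (as a float) between now and the certificate's notAfter date."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def san_matches(cert, hostname):
    """True if hostname equals a DNS SAN entry or matches a one-label wildcard."""
    hostname = hostname.lower()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        value = value.lower()
        if value == hostname:
            return True
        # "*.example.com" covers "a.example.com" but not "a.b.example.com".
        if value.startswith("*.") and hostname.partition(".")[2] == value[2:]:
            return True
    return False
```

A negative result from days_until_expiry means the certificate has already expired; a False from san_matches corresponds to the CN/SAN mismatch described above.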

Exercises (practice)

Do these now. They mirror the exercises below and build muscle memory.

  • Exercise 1: Diagnose a blocked DB port using connection tests and listeners.
  • Exercise 2: Diagnose a TLS handshake failure using curl -v and openssl s_client.

Checklist before you start

  • Can you test ICMP (ping) vs TCP (nc/curl) separately?
  • Do you know which IP and port are expected?
  • Do you know which resolvers your environment uses?
  • Do you know where firewall/security group rules are defined?

Common mistakes

  • Assuming ping success means the service port is open. It only proves that ICMP reaches the host, not that the TCP port is open or that the application is responding.
  • Testing with HTTP to a non-HTTP port. Use the right tool for the protocol (nc/ss for TCP, curl for HTTP).
  • Ignoring DNS search domains, causing names to resolve differently across hosts or pods.
  • Forgetting SNI in TLS: many servers present different certs based on the Server Name. Use -servername with openssl.
  • Overlooking outbound (egress) rules: egress rules on the source can block a connection just as inbound rules on the destination can.
  • Not checking the listener bind address; service may listen only on localhost (127.0.0.1) instead of 0.0.0.0.
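
The bind-address mistake is easy to spot programmatically. The sketch below parses text in the usual ss -ltn column layout and reports whether a port is listening on loopback only. The sample text is hand-written for illustration (an assumption, not captured output), and the function names are hypothetical.

```python
import ipaddress

# Hand-written sample in the usual `ss -ltn` column layout (an assumption,
# not captured output). The fourth column is "Local Address:Port".
SAMPLE_SS_OUTPUT = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      244        127.0.0.1:5432      0.0.0.0:*
LISTEN 0      128          0.0.0.0:22        0.0.0.0:*
"""


def bind_addresses(ss_output, port):
    """Return the local addresses that have a LISTEN socket on `port`."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skips the header and non-listening rows
        address, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            # Drop IPv6 brackets and any "%interface" suffix.
            found.append(address.strip("[]").split("%")[0])
    return found


def is_loopback(address):
    """True for 127.0.0.0/8 and ::1; wildcard binds such as * are not loopback."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False  # "*" or anything that is not a literal IP


def loopback_only(addresses):
    """True when something is listening, but only on loopback addresses."""
    return bool(addresses) and all(is_loopback(a) for a in addresses)
```

With the sample above, port 5432 is flagged as loopback only, which matches a service that remote clients cannot reach even though it is running.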

How to self-check your work

  • After each change, re-run the exact failing command (curl/nc) and save the output to compare.
  • Confirm at least two independent signals agree (e.g., nc success and load balancer target marked healthy).
  • For DNS, compare dig answers from two different resolvers (for example, dig @<resolver-ip> api.internal) to rule out caching.
  • For TLS, verify both with openssl s_client and a browser/curl to ensure chain and hostname validation are correct.
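
Re-running the failing command and keeping its output can be scripted so that before and after are easy to compare. The sketch below is one possible approach, with assumed helper names and an assumed log file naming scheme. It merges stderr into stdout because tools such as curl -v write their diagnostic detail to stderr.

```python
import difflib
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_save(cmd, out_dir):
    """Run cmd (a list of arguments), save its output to a timestamped file."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = directory / f"{stamp}.log"
    path.write_text(
        f"$ {' '.join(cmd)}\nexit code: {result.returncode}\n{result.stdout}"
    )
    return path


def compare_runs(before, after):
    """Return a unified diff between two saved outputs (empty if identical)."""
    old = Path(before).read_text().splitlines()
    new = Path(after).read_text().splitlines()
    return "\n".join(difflib.unified_diff(old, new, "before", "after", lineterm=""))
```

A typical flow is to call run_and_save(["nc", "-vz", "10.2.3.4", "5432"], "findings") before and after a change, then print compare_runs on the two returned paths.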

Practical projects

  • Build a small two-service app (web → db) and intentionally block the DB port. Write a runbook to detect and fix it fast.
  • Set up a local reverse proxy (like nginx) with HTTPS. Rotate its certificate with an incomplete chain and practice diagnosing the error.
  • Create a lab with two subnets and a missing route between them. Practice finding where the path breaks with traceroute and policy checks.

Learning path

  • Start here: layer-by-layer troubleshooting and the core commands.
  • Next: deepen with packet captures (tcpdump) and reading SYN/SYN-ACK/ACK flows.
  • Then: practice cloud load balancers, health checks, and security policies in a sandbox environment.
  • Finally: automate checks into runbooks (copy-paste ready commands with expected outputs).

Who this is for

  • Platform and DevOps engineers who support services in production.
  • Backend engineers who want to debug connectivity without waiting on infra teams.
  • New on-call responders who need a reliable first-steps playbook.

Prerequisites

  • Basic Linux shell usage.
  • Familiarity with IPs, ports, and DNS.
  • Access to test environments where you can run network commands. Some commands may require elevated privileges.

Mini challenge

Your app at https://store.internal returns intermittent 502s. You notice DNS sometimes returns two IPs. Outline the exact commands you will run to confirm which backend is failing and how you will temporarily route around it. Hint: use dig, curl -v repeatedly, and --resolve to pin the IP.

Next steps

  • Take the Quick Test below to check your understanding.
  • Repeat the exercises with your own services and write a short runbook.
  • Move on to deeper topics: packet captures, load balancer behaviors, and zero-downtime cert rotations.

Practice Exercises

Instructions

You can ping 10.2.3.4 from your app VM, but the app times out connecting to PostgreSQL on 10.2.3.4:5432.

  1. Run a TCP connect test with nc -vz 10.2.3.4 5432. Note the result.
  2. On the database host (if you have access), verify listener with ss -ltnp | grep 5432.
  3. Check the DB config for the bind address (0.0.0.0 vs 127.0.0.1); for PostgreSQL this is the listen_addresses setting in postgresql.conf.
  4. Verify that inbound rules allow TCP 5432 from the app VM's subnet or security group.

What is the most likely cause, and what change will make the connection succeed?

Expected Output

nc fails. A "connection refused" typically points to postgres listening only on 127.0.0.1:5432 (ss confirms this); a timeout typically points to an inbound policy dropping TCP 5432. Fix: bind to 0.0.0.0 (or the correct interface) and allow inbound 5432 from the app network.

Networking Troubleshooting Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.