How to learn Networking Troubleshooting Basics (Cloud and Networking Basics) as a Platform Engineer for free

Why this matters

As a Platform Engineer, you own uptime and reliability. When services cannot talk to each other, users feel it immediately. You will often be the first responder for issues like failing health checks, timeouts to databases, or TLS errors at load balancers. Solid troubleshooting shortens outages, reduces guesswork, and builds team confidence.

Real tasks you will do: isolate whether an outage is caused by DNS, routing, a firewall, or the app; confirm port reachability; trace latency spikes; validate TLS certs; verify load balancer target health.
Impact: faster incident resolution, clearer on-call runbooks, and fewer recurring issues.

Concept explained simply

Networking troubleshooting is answering a few core questions in order:

Is the host reachable? (Layer 3)
Is the service reachable on the port/protocol? (Layer 4)
Is the name resolving correctly? (DNS, Layer 7)
If HTTPS, is TLS valid and negotiated correctly? (Layer 6/7)
Along the path, where does it break or slow down? (Routing/Policies)

Mental model

Think “outside-in” by layers:

Network path: interface up → route exists → security policy allows → remote listens.
Names: clients resolve the same IPs you expect.
Transport: TCP handshake completes; no resets/timeouts.
Application: the right service responds with the right protocol.

Move one layer at a time, confirm or rule out each layer, and write down findings.
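
The "one layer at a time, write down findings" habit can be sketched as a tiny Python harness. This is only an illustration under assumptions: run_layers and the idea that each check returns a (passed, detail) pair are hypothetical names and conventions, not part of any standard tool. You would plug in real checks for each layer.

```python
from typing import Callable, List, Optional, Tuple

# A check returns (passed, detail). This shape is an assumption for the sketch.
Check = Callable[[], Tuple[bool, str]]


def run_layers(layers: List[Tuple[str, Check]]) -> Tuple[Optional[str], List[str]]:
    """Run checks in order and stop at the first layer that fails.

    Returns (name_of_failing_layer_or_None, findings_log).
    """
    findings: List[str] = []
    for name, check in layers:
        try:
            passed, detail = check()
        except Exception as exc:  # a crashing check is treated as a failure
            passed, detail = False, f"check raised {exc!r}"
        findings.append(f"{name}: {'OK' if passed else 'FAIL'} ({detail})")
        if not passed:
            # Deeper layers depend on this one, so stop here.
            return name, findings
    return None, findings
```

Stopping at the first failure keeps the findings log short and points at the lowest broken layer, which is usually where to look first.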

Quick toolbox

Host reachability: ping, traceroute/tracert, mtr
DNS: dig/nslookup (A/AAAA/CNAME/SRV), /etc/resolv.conf, hosts overrides
Ports: nc (netcat) -vz, telnet (legacy), curl -v for HTTP(S)
TLS: openssl s_client -connect host:443 -servername host, curl --resolve, certificate dates and CN/SAN
Linux network state: ip addr, ip route, ip rule, ss -ltnp, nftables/iptables/ufw rules
Captures: tcpdump (requires privileges), analyze with Wireshark
Cloud controls: security groups/network ACLs, load balancer target health (verify that the policies allow the traffic you expect)

Tip: Prefer commands that show both success and error details (e.g., curl -v, openssl s_client) to reduce guesswork.

Worked examples

Example 1: App can ping DB host but cannot connect to port 5432

Symptoms: Application logs show timeouts to 10.2.3.4:5432. ping works.

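Check port reachability: nc -vz 10.2.3.4 5432. A timeout usually means packets are being dropped (firewall/security group/NACL); "connection refused" usually means the host is reachable but nothing is listening on that port.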
Check on DB host (if you can): ss -ltnp | grep 5432 to confirm listener is bound to the correct interface (0.0.0.0 or specific IP).
Validate path policies: ensure security groups/NACLs/firewalls allow TCP 5432 from source to destination.
If policies are open, run traceroute 10.2.3.4 to confirm the path and look for drops. Default traceroute probes are UDP or ICMP, so a firewall may treat them differently from TCP 5432.

Likely fix: allow inbound 5432 on DB host/security group and bind the DB to the correct interface.
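
If nc is not available, the same port test can be approximated with Python's standard library. The sketch below is a minimal example, not a replacement for nc: check_tcp is a hypothetical helper name, and the 3-second default timeout is an arbitrary assumption. It separates "refused" (the host answered, nothing is listening or a rule rejects the connection) from "timeout" (packets are probably being dropped on the path).

```python
import errno
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # handshake completed: something is listening
    except ConnectionRefusedError:
        return "refused"  # remote side actively rejected the connection
    except socket.timeout:
        return "timeout"  # no answer within the timeout: likely dropped
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...)
        return f"error: {errno.errorcode.get(exc.errno, exc)}"
```

For the scenario above you might call check_tcp("10.2.3.4", 5432) from the application host and compare the result with what ss shows on the DB host.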

Example 2: DNS resolves to unexpected IP

Symptoms: curl -v https://api.internal hits a public IP instead of a private IP.

Check DNS: dig api.internal. Compare the answer with the intended IP block.
Check search domains and resolvers: look at /etc/resolv.conf.
If containerized, check if the pod/sidecar has a different resolver config.
Temporarily override: curl --resolve api.internal:443:10.0.1.7 https://api.internal to validate the service while DNS is being fixed.

Likely fix: correct the DNS record or the resolver configuration used by the environment.
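
The "compare the answer with the intended IP block" step can also be scripted. The following is a rough sketch with hypothetical helper names (resolve, outside_expected), and the 10.0.0.0/8 range in the usage note is an assumption based on the private IP in this example. Note that resolve uses the system resolver, so it honors /etc/hosts and search domains the way the application does, whereas dig queries DNS directly.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve(name: str, port: int = 443) -> List[str]:
    """Return the unique IPs the local system resolver gives for name."""
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def outside_expected(addresses: Iterable[str], expected_cidr: str) -> List[str]:
    """Return the addresses that are NOT inside expected_cidr."""
    network = ipaddress.ip_network(expected_cidr)
    return [a for a in addresses if ipaddress.ip_address(a) not in network]
```

For example, outside_expected(resolve("api.internal"), "10.0.0.0/8") returns an empty list when every answer is private, and lists the stray public IPs otherwise.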

Example 3: TLS error after certificate rotation

Symptoms: curl -v https://payments.company.com returns SSL certificate problem: unable to get local issuer certificate.

Inspect chain: openssl s_client -connect payments.company.com:443 -servername payments.company.com -showcerts.
Look for missing intermediate CA. The server must present full chain.
Check CN/SAN: ensure payments.company.com matches a SAN entry.
Verify expiry: notBefore/notAfter fields.

Likely fix: install the correct certificate chain (leaf + intermediates) on the load balancer/ingress.
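
The expiry and SAN checks can be sketched with Python's ssl module. Treat this as an illustration under assumptions: fetch_cert, days_until_expiry, and san_covers are hypothetical helper names, and san_covers is a deliberately simplified matcher (exact names plus a single left-most wildcard label), not a full implementation of hostname verification. In the missing-intermediate scenario above, fetch_cert would be expected to raise ssl.SSLCertVerificationError, which is itself a useful signal, although Python's trust store may differ from the one curl uses.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect with SNI and default verification; return the leaf cert as a dict.

    Raises ssl.SSLCertVerificationError if the chain or hostname is invalid.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_until_expiry(cert: dict, now: Optional[float] = None) -> float:
    """Days until the certificate's notAfter (negative if already expired)."""
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return (not_after - (time.time() if now is None else now)) / 86400


def san_covers(cert: dict, hostname: str) -> bool:
    """Simplified SAN check: exact match, or '*.' covering one left-most label."""
    host = hostname.lower()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        value = value.lower()
        if value == host:
            return True
        # '*.example.com' covers 'api.example.com' but not 'a.b.example.com'
        if value.startswith("*.") and "." in host and host.split(".", 1)[1] == value[2:]:
            return True
    return False
```

The two pure helpers accept the dictionary shape returned by getpeercert(), so they can be exercised offline against a saved certificate dictionary.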

Exercises (practice)

Do these now to build muscle memory.

Exercise 1: Diagnose a blocked DB port using connection tests and listeners.
Exercise 2: Diagnose a TLS handshake failure using curl -v and openssl s_client.

Checklist before you start

Can you test ICMP (ping) vs TCP (nc/curl) separately?
Do you know which IP and port are expected?
Do you know which resolvers your environment uses?
Do you know where the firewall/security group rules for this path are defined?

Common mistakes

Assuming ping success means the service port is open. Ping only proves that the host answers ICMP, not that the TCP port is reachable or that the application is healthy.
Testing with HTTP to a non-HTTP port. Use the right tool for the protocol (nc/ss for TCP, curl for HTTP).
Ignoring DNS search domains, causing names to resolve differently across hosts or pods.
Forgetting SNI in TLS: many servers present different certs based on the Server Name. Use -servername with openssl.
Overlooking outbound (egress) rules; they can block connections just as inbound rules can.
Not checking the listener bind address; service may listen only on localhost (127.0.0.1) instead of 0.0.0.0.
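
The bind-address mistake is easy to classify in code once you have the listener's address. This is a small sketch with hypothetical helper names (bind_scope, listener_scope). It expects a plain IP string, so output from ss that uses forms like * or [::] would need normalizing first.

```python
import ipaddress
import socket


def bind_scope(address: str) -> str:
    """Classify a listener's bind address.

    "loopback" -> reachable only from the same host (127.0.0.1, ::1)
    "wildcard" -> listening on all interfaces (0.0.0.0 or ::)
    "specific" -> bound to one particular interface address
    """
    ip = ipaddress.ip_address(address)
    if ip.is_loopback:
        return "loopback"
    if ip.is_unspecified:
        return "wildcard"
    return "specific"


def listener_scope(sock: socket.socket) -> str:
    """Classify an already-bound socket by its local address."""
    return bind_scope(sock.getsockname()[0])
```

A result of "loopback" for a service that remote clients must reach explains why connections from other hosts are refused even though the process is running.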

How to self-check your work

After each change, re-run the exact failing command (curl/nc) and save the output to compare.
Confirm at least two independent signals agree (e.g., nc success and load balancer target marked healthy).
For DNS, compare dig answers from two different resolvers to rule out caching.
For TLS, verify both with openssl s_client and a browser/curl to ensure chain and hostname validation are correct.

Practical projects

Build a small two-service app (web → db) and intentionally block the DB port. Write a runbook to detect and fix it fast.
Set up a local reverse proxy (like nginx) with HTTPS. Rotate its certificate with an incomplete chain and practice diagnosing the error.
Create a lab with two subnets and a missing host route. Practice finding where the path breaks with traceroute and policy checks.

Learning path

Start here: layer-by-layer troubleshooting and the core commands.
Next: deepen with packet captures (tcpdump) and reading SYN/SYN-ACK/ACK flows.
Then: practice cloud load balancers, health checks, and security policies in a sandbox environment.
Finally: automate checks into runbooks (copy-paste ready commands with expected outputs).

Who this is for

Platform and DevOps engineers who support services in production.
Backend engineers who want to debug connectivity without waiting on infra teams.
New on-call responders who need a reliable first-steps playbook.

Prerequisites

Basic Linux shell usage.
Familiarity with IPs, ports, and DNS.
Access to test environments where you can run network commands. Some commands may require elevated privileges.

Mini challenge

Your app at https://store.internal returns intermittent 502s. You notice DNS sometimes returns two IPs. Outline the exact commands you will run to confirm which backend is failing and how you will temporarily route around it. Hint: use dig, curl -v repeatedly, and --resolve to pin the IP.

Next steps

Repeat the exercises with your own services and write a short runbook.
Move on to deeper topics: packet captures, load balancer behaviors, and zero-downtime cert rotations.
