
Observability And Monitoring

Learn Observability and Monitoring for API Engineers for free: roadmap, worked examples, subskills, and a skill exam.

Published: January 21, 2026 | Updated: January 21, 2026

Why Observability matters for API Engineers

APIs succeed when they are fast, reliable, and easy to troubleshoot. Observability and monitoring give you the signals to spot problems early, understand root causes, and improve user experience without guesswork. As an API Engineer, this skill lets you:

  • Trace a request across services to pinpoint latency or errors.
  • Detect downstream dependency failures before they impact customers.
  • Track SLOs and alert on meaningful breaches instead of noisy symptoms.
  • Create dashboards your team can use during incidents.
  • Produce audit logs for security and compliance.
Typical tasks you’ll unlock
  • Add correlation IDs and structured logs across services.
  • Expose and alert on RPS, error rate, and latency (p50/p95/p99).
  • Instrument distributed tracing and propagate context.
  • Build endpoint health dashboards for on-call triage.
  • Define SLOs and implement multi-window burn-rate alerts.
  • Monitor external APIs, databases, and queues as dependencies.

Who this is for

  • API Engineers and Backend Developers building or operating HTTP/gRPC services.
  • Platform/DevOps engineers enabling service teams with shared observability tooling.
  • Tech leads responsible for SLOs and incident response.

Prerequisites

  • Comfortable building APIs (e.g., Node.js, Python, Go, or Java).
  • Basic understanding of HTTP, status codes, and service timeouts/retries.
  • Familiar with environments (local, staging, production) and CI/CD.

Learning path

Follow these milestones. Each step includes a small deliverable you can finish in a day or less.

Milestone 1 — Structured logging and correlation IDs
  • Switch to JSON logs with level, timestamp, service, and correlation_id.
  • Generate or accept X-Request-Id at the edge and propagate it.
  • Deliverable: one endpoint with consistent structured logs in local and staging.
Milestone 2 — Golden signals metrics
  • Expose RPS, error rate, and latency histograms per endpoint.
  • Add labels sparingly: method, route template, status_class.
  • Deliverable: metrics endpoint and a simple dashboard panel for each signal.
Milestone 3 — Distributed tracing basics
  • Emit server spans and child spans for DB / external calls.
  • Propagate trace context over HTTP and messaging.
  • Deliverable: single request shows a full trace across at least two services.
Milestone 4 — Downstream dependency monitoring
  • Collect latency, error, and timeout metrics for each dependency.
  • Add circuit-breaker state metric (if applicable).
  • Deliverable: dashboard section and alerts for a flaky dependency.
Milestone 5 — SLOs and alerting
  • Define SLOs (availability and latency). Pick budgets and windows.
  • Implement multi-window burn-rate alerts to reduce noise.
  • Deliverable: actionable alerts with clear runbooks.
Milestone 6 — Audit logs and incident triage
  • Design audit log schema (who, what, when, where).
  • Create a quick triage checklist and add dashboard links.
  • Deliverable: incident simulation validated end-to-end.

Worked examples

1) Structured JSON logs with correlation IDs (Node.js/Express)

// middleware/correlation.js
const { randomUUID } = require('crypto');

module.exports = function correlation(req, res, next) {
  const inbound = req.headers['x-request-id'];
  const id = inbound && String(inbound).trim() !== '' ? inbound : randomUUID();
  req.correlationId = id;
  res.setHeader('X-Request-Id', id);
  next();
};

// logger.js
function log(level, msg, fields = {}) {
  const record = {
    ts: new Date().toISOString(),
    level,
    service: 'orders-api',
    correlation_id: fields.correlation_id,
    msg,
    ...fields,
  };
  console.log(JSON.stringify(record));
}

// usage in a route
app.get('/orders/:id', correlation, async (req, res) => {
  log('info', 'fetching order', { correlation_id: req.correlationId, route: '/orders/:id' });
  // ...
});

Tip: Keep high-cardinality details (e.g., user_id) in logs, not in metric labels.
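
For reference, a single record emitted by the logger above would look like the following (values are illustrative):

{
  "ts": "2026-01-21T10:15:23.511Z",
  "level": "info",
  "service": "orders-api",
  "correlation_id": "9f3a2c1e-7b4d-4c8a-9a31-0e5f6d2b8c41",
  "msg": "fetching order",
  "route": "/orders/:id"
}

Any log aggregator can then filter or group by correlation_id to reconstruct the story of one request.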

2) Golden signals metrics (Python / Prometheus client)

from prometheus_client import Counter, Histogram

REQUESTS = Counter('http_requests_total', 'Total requests', ['route', 'method', 'status_class'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['route', 'method'],
    buckets=[0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2,5])

# wrapper
import time

def record_metrics(route, method, status):
    REQUESTS.labels(route, method, f"{status//100}xx").inc()

def time_request(route, method):
    start = time.time()
    def observe(status):
        LATENCY.labels(route, method).observe(time.time() - start)
        record_metrics(route, method, status)
    return observe

# usage in handler
obs = time_request('/orders/:id', 'GET')
status = 200
# do work...
obs(status)

Expose metrics on a dedicated endpoint (e.g., /metrics) and scrape it from your monitoring system.
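
If your service is Node.js rather than Python, a minimal sketch of that endpoint using Express and prom-client (the same library used in the dependency example below) could look like this; the route and port are assumptions, not requirements:

// Minimal /metrics endpoint sketch (Node.js + Express + prom-client)
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // process-level metrics: CPU, memory, event loop lag

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  // register.metrics() returns all registered metrics in Prometheus exposition format
  res.end(await client.register.metrics());
});

app.listen(9091); // port is illustrative; use whatever your scraper can reach

Keeping the metrics route separate from business endpoints makes it easier to restrict scrape access at the network layer.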

3) Distributed tracing with OpenTelemetry (Node.js)

// pseudo-setup
const { context, trace, SpanStatusCode } = require('@opentelemetry/api');

app.get('/orders/:id', async (req, res) => {
  const tracer = trace.getTracer('orders-api');
  const span = tracer.startSpan('GET /orders/:id', {
    attributes: {
      'http.method': 'GET',
      'http.route': '/orders/:id',
    }
  });

  try {
    await context.with(trace.setSpan(context.active(), span), async () => {
      const dbSpan = tracer.startSpan('db.lookup');
      // query DB...
      dbSpan.setAttribute('db.system', 'postgres');
      dbSpan.end();

      const extSpan = tracer.startSpan('call.inventory');
      // http call to inventory service...
      extSpan.setAttribute('peer.service', 'inventory-api');
      extSpan.end();
    });
    span.setAttribute('http.status_code', 200);
    res.status(200).json({ ok: true });
  } catch (e) {
    span.recordException(e);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'internal' });
  } finally {
    span.end();
  }
});

Propagate trace context over outbound HTTP by forwarding standard trace headers.
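
As a sketch of what that looks like in Node.js with the OpenTelemetry API, you can let the API write the traceparent/tracestate headers for you. This assumes an OpenTelemetry SDK with the default W3C propagator has been registered at startup; otherwise inject() is a no-op. The target URL is illustrative:

// Inject the active trace context into outbound request headers
const { context, propagation } = require('@opentelemetry/api');
const fetch = require('node-fetch');

async function callInventory(orderId) {
  const headers = {};                             // carrier object for trace headers
  propagation.inject(context.active(), headers);  // adds traceparent (and tracestate if present)
  return fetch(`http://inventory-api/items/${orderId}`, { headers });
}

If you use OpenTelemetry's HTTP auto-instrumentation, this injection typically happens for you; the manual call is mainly useful for custom clients or message producers.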

4) Monitoring downstream dependencies

// wrap outbound HTTP (Node.js)
const fetch = require('node-fetch');
const { Histogram, Counter } = require('prom-client');

const OUT_LAT = new Histogram({
  name: 'outbound_request_duration_seconds',
  help: 'Latency to downstream services',
  labelNames: ['target', 'route', 'method', 'status_class']
});

const OUT_ERR = new Counter({
  name: 'outbound_errors_total',
  help: 'Errors to downstream services',
  labelNames: ['target', 'route', 'method', 'reason']
});

async function call(target, route, method, url) {
  // status_class is only known after the response, so attach it when the timer is stopped
  const end = OUT_LAT.startTimer({ target, route, method });
  try {
    const r = await fetch(url, { method });
    end({ status_class: `${(r.status / 100) | 0}xx` });
    if (r.status >= 500) OUT_ERR.labels(target, route, method, '5xx').inc();
    return r;
  } catch (e) {
    end({ status_class: 'error' });
    OUT_ERR.labels(target, route, method, 'timeout_or_network').inc();
    throw e;
  }
}

Dashboard these per target: latency percentiles, error reasons, and request volume.
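
Using the metric names from the example above, the matching dashboard queries might look like this (adjust names and labels to whatever your service actually exports):

# p95 latency per downstream target
histogram_quantile(0.95,
  sum(rate(outbound_request_duration_seconds_bucket[5m])) by (le, target))

# error rate per target, broken down by reason
sum(rate(outbound_errors_total[5m])) by (target, reason)

# outbound request volume per target
sum(rate(outbound_request_duration_seconds_count[5m])) by (target)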

5) SLO-based alerting (PromQL examples)

# Example: 99.9% availability over 30 days (error budget 0.1%)
# Error ratio (5xx over all)
error_ratio = sum(rate(http_requests_total{status_class="5xx"}[5m]))
              /
              sum(rate(http_requests_total[5m]))

# Multi-window burn-rate alert (fast + slow)
# Fast (short window): catches big, acute issues
error_ratio_fast = sum(rate(http_requests_total{status_class="5xx"}[5m])) / sum(rate(http_requests_total[5m]))
# Slow (longer window): avoids flapping
error_ratio_slow = sum(rate(http_requests_total{status_class="5xx"}[1h])) / sum(rate(http_requests_total[1h]))

# Alert: if both exceed burn thresholds
# Pseudocode rule:
ALERT SLOAvailabilityBurn
IF error_ratio_fast > 14.4 * 0.001 AND error_ratio_slow > 6 * 0.001
FOR 10m
ANNOTATIONS: runbook="Check recent deploy, top endpoints, dependency errors"

Use burn-rate multipliers tuned to your error budget and paging sensitivity.
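
For reference, the pseudocode rule above written in the current Prometheus rule-file format might look roughly like the sketch below; the thresholds, group name, and severity label are assumptions you should tune to your own error budget:

groups:
  - name: slo-availability
    rules:
      - alert: SLOAvailabilityBurn
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_class="5xx"}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (6 * 0.001)
        for: 10m
        labels:
          severity: page
        annotations:
          runbook: "Check recent deploy, top endpoints, dependency errors"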

6) Endpoint health dashboard blueprint

  • RPS: sum of requests per endpoint template and method.
  • Error ratio: 5xx/total and 4xx/total.
  • Latency: p50/p95/p99 per endpoint template and method.
  • Dependency health: latency and error reasons per dependency.
  • Saturation: CPU, memory, queue length (if available).
# PromQL sketches (adjust labels to your metrics)
sum(rate(http_requests_total[1m])) by (route, method)

sum(rate(http_requests_total{status_class="5xx"}[5m]))
  / sum(rate(http_requests_total[5m]))

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route, method))

Drills and exercises

  • [ ] Convert one service to JSON structured logs with correlation_id on all entries.
  • [ ] Expose /metrics with RPS, error rate, and latency histograms per endpoint.
  • [ ] Add distributed tracing on ingress and one outbound call; verify spans.
  • [ ] Create a dashboard showing RPS, error ratio, p95 latency for top 3 endpoints.
  • [ ] Define one availability SLO and implement a two-window burn-rate alert.
  • [ ] Log audit events for login, role change, and data export; verify fields.
  • [ ] Simulate a dependency slowdown and confirm alerts fire with useful context.

Common mistakes and debugging tips

  • High-cardinality metric labels: Avoid putting IDs (user_id, request_id) into metric labels. Keep them in logs and traces.
  • Unpropagated correlation: Generate request IDs at the edge and propagate through all services and async jobs.
  • Missing route templates: Use normalized route names (/orders/:id) instead of raw paths to prevent label explosion.
  • Alert fatigue: Prefer SLO-based and multi-window alerts; avoid paging on single 500 counts.
  • Sampling pitfalls: For tracing, use tail-based or conditional sampling to keep slow/error traces, not only a fixed low rate.
  • Dashboard sprawl: Start with the golden signals; add panels only when they answer an on-call question.
  • Opaque audit logs: Make audit events human-readable JSON with who, what, when, where, and why.
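
As a concrete example of that last point, an audit event emitted through the log() helper from example 1 might look like this; the event and field names are illustrative, not a required schema:

// Audit event through log() from example 1 (the logger's ts field covers "when")
log('info', 'audit: role changed', {
  correlation_id: req.correlationId,
  actor_id: 'u_admin_42',             // who (illustrative ID)
  action: 'role.change',              // what
  target_user_id: 'u_1337',           // on whom (illustrative ID)
  old_role: 'viewer',
  new_role: 'editor',
  source_ip: req.ip,                  // where
  reason: 'access request approved',  // why
});
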
Quick triage flow
  1. Check alerts and compare fast vs slow burn rates.
  2. Open the endpoint health dashboard: RPS spike? p95 jump? Which endpoints?
  3. Drill into dependency panels: timeouts? 5xx from a specific target?
  4. Open a recent trace for a slow/error request; inspect child spans.
  5. Correlate with deploys, config changes, or feature flags.
  6. Capture key findings in the incident notes and add a follow-up action item.

Mini project: Observability-in-a-Box

Build a small REST API with logs, metrics, traces, and SLO alerts end-to-end.

  1. Logging: Implement JSON structured logs with correlation_id and consistent fields.
  2. Metrics: Add RPS, error ratio, and latency histograms per endpoint.
  3. Tracing: Emit server spans; add child spans for DB and one external HTTP call.
  4. Dependency: Wrap outbound HTTP and record latency and error reasons.
  5. Dashboards: Create panels for RPS, error ratio, p50/p95/p99 latency, and dependency health.
  6. SLO + Alerts: Define a 99.9% availability SLO and set multi-window burn-rate alerts.
  7. Audit: Log login and role-change events; verify fields are complete.
  8. Test: Introduce a synthetic 500 burst and a dependency slowdown; verify alerts and trace visibility.
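
One way to produce that synthetic 500 burst is a small fault-injection middleware gated behind an environment variable; the flag name and failure rate here are assumptions for the exercise, not production guidance:

// Fault-injection middleware (Express): fail a fraction of requests while CHAOS_500_RATE is set
function chaos(req, res, next) {
  const rate = parseFloat(process.env.CHAOS_500_RATE || '0'); // e.g. 0.3 = roughly 30% of requests
  if (rate > 0 && Math.random() < rate) {
    log('error', 'synthetic failure injected', { correlation_id: req.correlationId });
    return res.status(500).json({ error: 'synthetic_failure' });
  }
  next();
}

// app.use(chaos); // enable only in the test environment

A similar wrapper that adds an artificial delay before calling the downstream service covers the dependency-slowdown half of the test.
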
Deliverables checklist
  • [ ] Screenshots of dashboards showing normal and degraded states.
  • [ ] Alert rules text and a short runbook for responders.
  • [ ] Sample JSON logs with correlation_id flowing across services.
  • [ ] One sample trace screenshot with spans for DB and outbound HTTP.

Practical projects

  • API Health Starter: Wrap an existing service with logging middleware, metrics endpoint, and a 3-panel dashboard.
  • Dependency Sentinel: Add per-target latency/error metrics and a simple circuit-breaker metric; simulate a flaky downstream.
  • SLO Pilot: Define availability and latency SLOs for one endpoint and implement burn-rate alerts with a short runbook.

Subskills

  • Structured Logs With Correlation Ids — Produce consistent JSON logs and propagate X-Request-Id end-to-end.
  • Metrics RPS Errors Latency — Expose and interpret the golden signals for APIs.
  • Distributed Tracing Basics — Instrument spans and propagate context across services.
  • Monitoring Downstream Dependencies — Track latency, errors, timeouts, and circuit-breaker states per dependency.
  • Alerting On SLO Breaches — Use error budgets and multi-window burn-rate alerts for actionable pages.
  • Dashboards For Endpoint Health — Build focused panels for RPS, error rate, and latency percentiles.
  • Audit Logs — Capture who/what/when/where/why for sensitive actions.
  • Incident Triage — Apply a repeatable flow to isolate, diagnose, and mitigate issues.

Next steps

  • Complete the drills and the mini project to solidify fundamentals.
  • Review the Subskills section below and study any gaps.
  • Take the Skill Exam at the end of this page. The exam is available to everyone; only logged-in users get saved progress.

Observability And Monitoring — Skill Exam

This exam checks practical understanding of API observability: logs, metrics, tracing, dependencies, dashboards, SLOs, audit logs, and incident triage. You can take it for free. Only logged-in users get saved progress and can resume later. Your score and review are shown at the end. Tips: choose the best answer(s) based on reliability-first practices; some questions are multiple-select.

12 questions · 70% to pass
