
Observability And Monitoring

Learn Observability and Monitoring for API Engineers for free: roadmap, worked examples, subskills, and a skill exam.

Published: January 21, 2026 | Updated: January 21, 2026

Why Observability matters for API Engineers

APIs succeed when they are fast, reliable, and easy to troubleshoot. Observability and monitoring give you the signals to spot problems early, understand root causes, and improve user experience without guesswork. As an API Engineer, this skill lets you:

  • Trace a request across services to pinpoint latency or errors.
  • Detect downstream dependency failures before they impact customers.
  • Track SLOs and alert on meaningful breaches instead of noisy symptoms.
  • Create dashboards your team can use during incidents.
  • Produce audit logs for security and compliance.
Typical tasks you’ll unlock
  • Add correlation IDs and structured logs across services.
  • Expose and alert on RPS, error rate, and latency (p50/p95/p99).
  • Instrument distributed tracing and propagate context.
  • Build endpoint health dashboards for on-call triage.
  • Define SLOs and implement multi-window burn-rate alerts.
  • Monitor external APIs, databases, and queues as dependencies.

Who this is for

  • API Engineers and Backend Developers building or operating HTTP/gRPC services.
  • Platform/DevOps engineers enabling service teams with shared observability tooling.
  • Tech leads responsible for SLOs and incident response.

Prerequisites

  • Comfortable building APIs (e.g., Node.js, Python, Go, or Java).
  • Basic understanding of HTTP, status codes, and service timeouts/retries.
  • Familiar with environments (local, staging, production) and CI/CD.

Learning path

Follow these milestones. Each step includes a small deliverable you can finish in a day or less.

Milestone 1 — Structured logging and correlation IDs
  • Switch to JSON logs with level, timestamp, service, and correlation_id.
  • Generate or accept X-Request-Id at the edge and propagate it.
  • Deliverable: one endpoint with consistent structured logs in local and staging.
Milestone 2 — Golden signals metrics
  • Expose RPS, error rate, and latency histograms per endpoint.
  • Add labels sparingly: method, route template, status_class.
  • Deliverable: metrics endpoint and a simple dashboard panel for each signal.
Milestone 3 — Distributed tracing basics
  • Emit server spans and child spans for DB / external calls.
  • Propagate trace context over HTTP and messaging.
  • Deliverable: single request shows a full trace across at least two services.
Milestone 4 — Downstream dependency monitoring
  • Collect latency, error, and timeout metrics for each dependency.
  • Add circuit-breaker state metric (if applicable).
  • Deliverable: dashboard section and alerts for a flaky dependency.
Milestone 5 — SLOs and alerting
  • Define SLOs (availability and latency). Pick budgets and windows.
  • Implement multi-window burn-rate alerts to reduce noise.
  • Deliverable: actionable alerts with clear runbooks.
Milestone 6 — Audit logs and incident triage
  • Design audit log schema (who, what, when, where).
  • Create a quick triage checklist and add dashboard links.
  • Deliverable: incident simulation validated end-to-end.

Worked examples

1) Structured JSON logs with correlation IDs (Node.js/Express)

// middleware/correlation.js
const { randomUUID } = require('crypto');

module.exports = function correlation(req, res, next) {
  const inbound = req.headers['x-request-id'];
  const id = inbound && String(inbound).trim() !== '' ? inbound : randomUUID();
  req.correlationId = id;
  res.setHeader('X-Request-Id', id);
  next();
};

// logger.js
function log(level, msg, fields = {}) {
  const record = {
    ts: new Date().toISOString(),
    level,
    service: 'orders-api',
    correlation_id: fields.correlation_id,
    msg,
    ...fields,
  };
  console.log(JSON.stringify(record));
}

// usage in a route
app.get('/orders/:id', correlation, async (req, res) => {
  log('info', 'fetching order', { correlation_id: req.correlationId, route: '/orders/:id' });
  // ...
});

Tip: Keep high-cardinality details (e.g., user_id) in logs, not in metric labels.
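
For reference, a single record emitted by the logger above would look like the following (values are illustrative):

{
  "ts": "2026-01-21T10:15:23.511Z",
  "level": "info",
  "service": "orders-api",
  "correlation_id": "9f3a2c1e-7b4d-4c8a-9a31-0e5f6d2b8c41",
  "msg": "fetching order",
  "route": "/orders/:id"
}

Any log aggregator can then filter or group by correlation_id to reconstruct the story of one request.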

2) Golden signals metrics (Python / Prometheus client)

from prometheus_client import Counter, Histogram

REQUESTS = Counter('http_requests_total', 'Total requests', ['route', 'method', 'status_class'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['route', 'method'],
    buckets=[0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2,5])

# wrapper
import time

def record_metrics(route, method, status):
    REQUESTS.labels(route, method, f"{status//100}xx").inc()

def time_request(route, method):
    start = time.time()
    def observe(status):
        LATENCY.labels(route, method).observe(time.time() - start)
        record_metrics(route, method, status)
    return observe

# usage in handler
obs = time_request('/orders/:id', 'GET')
status = 200
# do work...
obs(status)

Expose metrics on a dedicated endpoint (e.g., /metrics) and scrape it from your monitoring system.
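
If your service is Node.js rather than Python, a minimal sketch of that endpoint using Express and prom-client (the same library used in the dependency example below) could look like this; the route and port are assumptions, not requirements:

// Minimal /metrics endpoint sketch (Node.js + Express + prom-client)
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // process-level metrics: CPU, memory, event loop lag

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  // register.metrics() returns all registered metrics in Prometheus exposition format
  res.end(await client.register.metrics());
});

app.listen(9091); // port is illustrative; use whatever your scraper can reach

Keeping the metrics route separate from business endpoints makes it easier to restrict scrape access at the network layer.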

3) Distributed tracing with OpenTelemetry (Node.js)

// pseudo-setup
const { context, trace, SpanStatusCode } = require('@opentelemetry/api');

app.get('/orders/:id', async (req, res) => {
  const tracer = trace.getTracer('orders-api');
  const span = tracer.startSpan('GET /orders/:id', {
    attributes: {
      'http.method': 'GET',
      'http.route': '/orders/:id',
    }
  });

  try {
    await context.with(trace.setSpan(context.active(), span), async () => {
      const dbSpan = tracer.startSpan('db.lookup');
      // query DB...
      dbSpan.setAttribute('db.system', 'postgres');
      dbSpan.end();

      const extSpan = tracer.startSpan('call.inventory');
      // http call to inventory service...
      extSpan.setAttribute('peer.service', 'inventory-api');
      extSpan.end();
    });
    span.setAttribute('http.status_code', 200);
    res.status(200).json({ ok: true });
  } catch (e) {
    span.recordException(e);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'internal' });
  } finally {
    span.end();
  }
});

Propagate trace context over outbound HTTP by forwarding standard trace headers.
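
As a sketch of what that looks like in Node.js with the OpenTelemetry API, you can let the API write the traceparent/tracestate headers for you. This assumes an OpenTelemetry SDK with the default W3C propagator has been registered at startup; otherwise inject() is a no-op. The target URL is illustrative:

// Inject the active trace context into outbound request headers
const { context, propagation } = require('@opentelemetry/api');
const fetch = require('node-fetch');

async function callInventory(orderId) {
  const headers = {};                             // carrier object for trace headers
  propagation.inject(context.active(), headers);  // adds traceparent (and tracestate if present)
  return fetch(`http://inventory-api/items/${orderId}`, { headers });
}

If you use OpenTelemetry's HTTP auto-instrumentation, this injection typically happens for you; the manual call is mainly useful for custom clients or message producers.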

4) Monitoring downstream dependencies

// wrap outbound HTTP (Node.js)
const fetch = require('node-fetch');
const { Histogram, Counter } = require('prom-client');

const OUT_LAT = new Histogram({
  name: 'outbound_request_duration_seconds',
  help: 'Latency to downstream services',
  labelNames: ['target', 'route', 'method', 'status_class']
});

const OUT_ERR = new Counter({
  name: 'outbound_errors_total',
  help: 'Errors to downstream services',
  labelNames: ['target', 'route', 'method', 'reason']
});

async function call(target, route, method, url) {
  // status_class is only known after the response, so attach it when the timer is stopped
  const end = OUT_LAT.startTimer({ target, route, method });
  try {
    const r = await fetch(url, { method });
    end({ status_class: `${(r.status / 100) | 0}xx` });
    if (r.status >= 500) OUT_ERR.labels(target, route, method, '5xx').inc();
    return r;
  } catch (e) {
    end({ status_class: 'error' });
    OUT_ERR.labels(target, route, method, 'timeout_or_network').inc();
    throw e;
  }
}

Dashboard these per target: latency percentiles, error reasons, and request volume.
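
Using the metric names from the example above, the matching dashboard queries might look like this (adjust names and labels to whatever your service actually exports):

# p95 latency per downstream target
histogram_quantile(0.95,
  sum(rate(outbound_request_duration_seconds_bucket[5m])) by (le, target))

# error rate per target, broken down by reason
sum(rate(outbound_errors_total[5m])) by (target, reason)

# outbound request volume per target
sum(rate(outbound_request_duration_seconds_count[5m])) by (target)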

5) SLO-based alerting (PromQL examples)

# Example: 99.9% availability over 30 days (error budget 0.1%)
# Error ratio (5xx over all)
error_ratio = sum(rate(http_requests_total{status_class="5xx"}[5m]))
              /
              sum(rate(http_requests_total[5m]))

# Multi-window burn-rate alert (fast + slow)
# Fast (short window): catches big, acute issues
error_ratio_fast = sum(rate(http_requests_total{status_class="5xx"}[5m])) / sum(rate(http_requests_total[5m]))
# Slow (longer window): avoids flapping
error_ratio_slow = sum(rate(http_requests_total{status_class="5xx"}[1h])) / sum(rate(http_requests_total[1h]))

# Alert: if both exceed burn thresholds
# Pseudocode rule:
ALERT SLOAvailabilityBurn
IF error_ratio_fast > 14.4 * 0.001 AND error_ratio_slow > 6 * 0.001
FOR 10m
ANNOTATIONS: runbook="Check recent deploy, top endpoints, dependency errors"

Use burn-rate multipliers tuned to your error budget and paging sensitivity.
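
For reference, the pseudocode rule above written in the current Prometheus rule-file format might look roughly like the sketch below; the thresholds, group name, and severity label are assumptions you should tune to your own error budget:

groups:
  - name: slo-availability
    rules:
      - alert: SLOAvailabilityBurn
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_class="5xx"}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (6 * 0.001)
        for: 10m
        labels:
          severity: page
        annotations:
          runbook: "Check recent deploy, top endpoints, dependency errors"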

6) Endpoint health dashboard blueprint

  • RPS: sum of requests per endpoint template and method.
  • Error ratio: 5xx/total and 4xx/total.
  • Latency: p50/p95/p99 per endpoint template and method.
  • Dependency health: latency and error reasons per dependency.
  • Saturation: CPU, memory, queue length (if available).
# PromQL sketches (adjust labels to your metrics)
sum(rate(http_requests_total[1m])) by (route, method)

sum(rate(http_requests_total{status_class="5xx"}[5m]))
  / sum(rate(http_requests_total[5m]))

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route, method))

Drills and exercises

  • [ ] Convert one service to JSON structured logs with correlation_id on all entries.
  • [ ] Expose /metrics with RPS, error rate, and latency histograms per endpoint.
  • [ ] Add distributed tracing on ingress and one outbound call; verify spans.
  • [ ] Create a dashboard showing RPS, error ratio, p95 latency for top 3 endpoints.
  • [ ] Define one availability SLO and implement a two-window burn-rate alert.
  • [ ] Log audit events for login, role change, and data export; verify fields.
  • [ ] Simulate a dependency slowdown and confirm alerts fire with useful context.

Common mistakes and debugging tips

  • High-cardinality metric labels: Avoid putting IDs (user_id, request_id) into metric labels. Keep them in logs and traces.
  • Unpropagated correlation: Generate request IDs at the edge and propagate through all services and async jobs.
  • Missing route templates: Use normalized route names (/orders/:id) instead of raw paths to prevent label explosion.
  • Alert fatigue: Prefer SLO-based and multi-window alerts; avoid paging on single 500 counts.
  • Sampling pitfalls: For tracing, use tail-based or conditional sampling to keep slow/error traces, not only a fixed low rate.
  • Dashboard sprawl: Start with the golden signals; add panels only when they answer an on-call question.
  • Opaque audit logs: Make audit events human-readable JSON with who, what, when, where, and why.
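
As a concrete example of that last point, an audit event emitted through the log() helper from example 1 might look like this; the event and field names are illustrative, not a required schema:

// Audit event through log() from example 1 (the logger's ts field covers "when")
log('info', 'audit: role changed', {
  correlation_id: req.correlationId,
  actor_id: 'u_admin_42',             // who (illustrative ID)
  action: 'role.change',              // what
  target_user_id: 'u_1337',           // on whom (illustrative ID)
  old_role: 'viewer',
  new_role: 'editor',
  source_ip: req.ip,                  // where
  reason: 'access request approved',  // why
});
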
Quick triage flow
  1. Check alerts and compare fast vs slow burn rates.
  2. Open the endpoint health dashboard: RPS spike? p95 jump? Which endpoints?
  3. Drill into dependency panels: timeouts? 5xx from a specific target?
  4. Open a recent trace for a slow/error request; inspect child spans.
  5. Correlate with deploys, config changes, or feature flags.
  6. Capture key findings in the incident notes and add a follow-up action item.

Mini project: Observability-in-a-Box

Build a small REST API with logs, metrics, traces, and SLO alerts end-to-end.

  1. Logging: Implement JSON structured logs with correlation_id and consistent fields.
  2. Metrics: Add RPS, error ratio, and latency histograms per endpoint.
  3. Tracing: Emit server spans; add child spans for DB and one external HTTP call.
  4. Dependency: Wrap outbound HTTP and record latency and error reasons.
  5. Dashboards: Create panels for RPS, error ratio, p50/p95/p99 latency, and dependency health.
  6. SLO + Alerts: Define a 99.9% availability SLO and set multi-window burn-rate alerts.
  7. Audit: Log login and role-change events; verify fields are complete.
  8. Test: Introduce a synthetic 500 burst and a dependency slowdown; verify alerts and trace visibility.
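
One way to produce that synthetic 500 burst is a small fault-injection middleware gated behind an environment variable; the flag name and failure rate here are assumptions for the exercise, not production guidance:

// Fault-injection middleware (Express): fail a fraction of requests while CHAOS_500_RATE is set
function chaos(req, res, next) {
  const rate = parseFloat(process.env.CHAOS_500_RATE || '0'); // e.g. 0.3 = roughly 30% of requests
  if (rate > 0 && Math.random() < rate) {
    log('error', 'synthetic failure injected', { correlation_id: req.correlationId });
    return res.status(500).json({ error: 'synthetic_failure' });
  }
  next();
}

// app.use(chaos); // enable only in the test environment

A similar wrapper that adds an artificial delay before calling the downstream service covers the dependency-slowdown half of the test.
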
Deliverables checklist
  • [ ] Screenshots of dashboards showing normal and degraded states.
  • [ ] Alert rules text and a short runbook for responders.
  • [ ] Sample JSON logs with correlation_id flowing across services.
  • [ ] One sample trace screenshot with spans for DB and outbound HTTP.

Practical projects

  • API Health Starter: Wrap an existing service with logging middleware, metrics endpoint, and a 3-panel dashboard.
  • Dependency Sentinel: Add per-target latency/error metrics and a simple circuit-breaker metric; simulate a flaky downstream.
  • SLO Pilot: Define availability and latency SLOs for one endpoint and implement burn-rate alerts with a short runbook.

Subskills

  • Structured Logs With Correlation Ids — Produce consistent JSON logs and propagate X-Request-Id end-to-end.
  • Metrics RPS Errors Latency — Expose and interpret the golden signals for APIs.
  • Distributed Tracing Basics — Instrument spans and propagate context across services.
  • Monitoring Downstream Dependencies — Track latency, errors, timeouts, and circuit-breaker states per dependency.
  • Alerting On SLO Breaches — Use error budgets and multi-window burn-rate alerts for actionable pages.
  • Dashboards For Endpoint Health — Build focused panels for RPS, error rate, and latency percentiles.
  • Audit Logs — Capture who/what/when/where/why for sensitive actions.
  • Incident Triage — Apply a repeatable flow to isolate, diagnose, and mitigate issues.

Next steps

  • Complete the drills and the mini project to solidify fundamentals.
  • Review the Subskills section below and study any gaps.
  • Take the Skill Exam at the end of this page. The exam is available to everyone; only logged-in users get saved progress.

Observability And Monitoring — Skill Exam

This exam checks practical understanding of API observability: logs, metrics, tracing, dependencies, dashboards, SLOs, audit logs, and incident triage. You can take it for free. Only logged-in users get saved progress and can resume later. Your score and review are shown at the end. Tips: choose the best answer(s) based on reliability-first practices; some questions are multiple-select.

12 questions · 70% to pass
