Centralized Log Management

Learn Centralized Log Management for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Who this is for

  • Platform Engineers and SREs setting up organization-wide logging.
  • Backend engineers who need reliable, searchable logs for debugging.
  • Team leads who want consistent, compliant logging across services.

Prerequisites

  • Basic Linux and container experience (e.g., Docker or Kubernetes).
  • Familiarity with JSON and key-value data.
  • High-level understanding of your stack (services, environments, CI/CD).

Why this matters

Real tasks you’ll face as a Platform Engineer:

  • Collect logs from dozens of services and make them searchable within minutes.
  • Standardize fields (service, environment, trace_id) so teams can correlate logs with metrics and traces.
  • Control cost with retention tiers (hot vs. cold storage) and reduce noisy or high-cardinality data.
  • Ensure compliance by redacting PII and enforcing access controls and retention policies.
  • Keep ingestion stable during traffic spikes without losing logs.

Concept explained simply

Centralized log management gathers logs from many places into a single, queryable system. You ship logs from apps and infrastructure, parse and enrich them, store them in hot/cold tiers, and let engineers search and alert on them.

Mental model

  • Think "postal system":
    • Ingest: collectors pick up letters (logs).
    • Parse & Enrich: read the address, add postal codes (metadata).
    • Route: send to the right sorting centers (hot index, archive).
    • Store & Index: shelves for fast lookup vs. warehouses for archives.
    • Search & Alert: clerks who find letters fast and ring the bell on spikes.

Core components you’ll design

  • Ingestion: agents/collectors (e.g., node-level daemons or sidecars) ship logs reliably with backpressure and retries.
  • Parsing: handle JSON, plain text, and multiline (stack traces). Extract timestamp, level, message.
  • Enrichment: add consistent fields such as service, env, cluster, pod, region, trace_id, and request_id (see the sketch after this list).
  • Routing: send logs to different destinations or indexes based on team, env, or severity.
  • Storage & Indexing: balance hot (fast, costly) vs. cold (cheap, slower). Plan retention.
  • Governance & Security: PII redaction, access control, audit, immutability, and legal holds.
  • Query & Alerting: enable fielded search, saved queries, and burst detection.

Key design choices (quick checklist)

  • Structured logging (JSON) with standard fields.
  • Index naming: env-service-YYYY.MM (or team-based) to aid lifecycle policies.
  • Retention tiers: e.g., 7 days hot, 30 days warm, 180 days cold (object storage).
  • PII handling: redact emails, phones, tokens before storage.
  • Sampling or aggregation for very high-volume debug logs.
  • Backpressure: bounded queues, retries with exponential backoff, and a disk buffer (see the sketch after this checklist).
  • Multiline handling for stack traces.
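
For the backpressure item, a minimal retry sketch with exponential backoff and jitter; the `send` callable and the delay values are assumptions, and a real agent would also spill undeliverable batches to a disk buffer.

import random
import time

def ship_with_retries(send, batch, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Try to deliver a batch of log events, backing off exponentially on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return True
        except Exception:
            if attempt == max_attempts:
                return False  # hand the batch to a disk buffer / dead-letter path
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids thundering herds
    return False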

Worked examples

Example 1: Routing by environment and severity

Goal: Keep prod errors searchable for 30 days, archive everything else cheaply.

  • Ingest all logs from prod and staging.
  • If env=prod AND level in [error, fatal], route to hot index prod-errors with 30d retention.
  • All other logs: 7d hot then auto-move to cold storage for 180d.
  • Benefit: fast P1 triage while controlling cost.

What to check
  • Query time for prod-errors is low even during spikes.
  • Archived logs are still retrievable for audits.
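
A minimal Python sketch of the routing rule from Example 1; the event shape and the non-error index name are assumptions, while the retention values mirror the example.

from datetime import datetime, timezone

HOT_LEVELS = {"error", "fatal"}

def route(event: dict) -> dict:
    """Pick a destination and retention for one parsed log event."""
    day = datetime.now(timezone.utc).strftime("%Y.%m.%d")
    if event.get("env") == "prod" and event.get("level") in HOT_LEVELS:
        return {"destination": f"prod-errors-{day}", "retention_days": 30}
    # Everything else: short hot window, then a lifecycle policy moves it to cold storage.
    return {
        "destination": f"logs-{event.get('env', 'unknown')}-{day}",
        "hot_days": 7,
        "cold_days": 180,
    }

print(route({"env": "prod", "level": "error", "message": "DB timeout"}))
print(route({"env": "staging", "level": "info", "message": "healthcheck ok"}))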

Example 2: Fixing timestamps and timezones

Problem: Some services log local time without timezone; queries by time window miss events.

  • Parse timestamp from message; if no tz, assume node timezone and convert to UTC.
  • Add original_timestamp for traceability.
  • Result: Consistent time-based searches across regions.

What to check
  • Spot-check 5 logs per service: parsed timestamp equals actual event time.
  • Dashboards align with metrics timestamps.
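
A sketch of the timestamp normalization from Example 2 (Python 3.9+); the node timezone is an assumption and would normally be read from the host.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NODE_TZ = ZoneInfo("Europe/Berlin")  # assumed node timezone for tz-less logs

def normalize_timestamp(raw: str) -> dict:
    """Parse an ISO-8601 timestamp; if it carries no timezone, assume the node's and convert to UTC."""
    ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=NODE_TZ)
    return {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "original_timestamp": raw,  # kept for traceability
    }

print(normalize_timestamp("2025-05-01T12:34:56"))   # no tz -> assume node tz, convert to UTC
print(normalize_timestamp("2025-05-01T12:35:01Z"))  # already UTC, unchanged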

Example 3: Reducing high-cardinality fields

Problem: user_id in index keys causes explosion of unique terms and high cost.

  • Store user_id as a non-indexed field or hashed form (for lookup, not search).
  • Keep only service, env, endpoint, status_code as indexed terms.
  • Result: 30–60% lower index storage and faster queries.

What to check
  • Cardinality reports show major reduction.
  • Investigations still possible by exact user_id filter (non-indexed or keyword field) where needed.
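
A sketch of the field split from Example 3; the event shape, field names, and the choice of SHA-256 are assumptions. To filter by an exact user later, hash the query value the same way.

import hashlib

INDEXED_FIELDS = {"service", "env", "endpoint", "status_code"}

def prepare_for_indexing(event: dict) -> dict:
    """Keep only low-cardinality terms indexed; hash user_id for exact-match lookups."""
    indexed = {k: v for k, v in event.items() if k in INDEXED_FIELDS}
    stored = {k: v for k, v in event.items() if k not in INDEXED_FIELDS}
    if "user_id" in stored:
        stored["user_id_hash"] = hashlib.sha256(str(stored.pop("user_id")).encode()).hexdigest()
    return {"indexed": indexed, "stored": stored}

print(prepare_for_indexing({
    "service": "orders", "env": "prod", "endpoint": "/checkout",
    "status_code": 500, "user_id": "u-123456", "message": "DB timeout",
}))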

Example 4: Multiline stack traces

Problem: Java stack traces split into multiple log lines, breaking search.

  • Enable multiline rule: start new event when line matches timestamp pattern; otherwise append.
  • Store full stack trace in message field; keep level, logger, thread extracted.
  • Result: Search and alerts reference whole exception correctly.

What to check
  • Count of exceptions equals alert count (no fragmentation).
  • Single event contains all stack frames.
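
A sketch of the multiline rule from Example 4: a new event starts when a line matches a timestamp pattern, otherwise the line is appended to the current event. The timestamp regex is an assumption about the log format.

import re

# A new event starts on a line that begins with an ISO-like timestamp (assumed format).
NEW_EVENT_RE = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def merge_multiline(lines):
    """Group raw lines into events: continuation lines (e.g., stack frames) are appended."""
    events, current = [], []
    for line in lines:
        if NEW_EVENT_RE.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

raw = [
    "2025-05-01 12:40:00 ERROR OrderService - payment failed",
    "java.lang.NullPointerException: null",
    "    at com.shop.orders.PaymentClient.charge(PaymentClient.java:42)",
    "2025-05-01 12:40:05 INFO OrderService - retry scheduled",
]
print(len(merge_multiline(raw)))  # 2 events; the first contains the full stack trace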

Exercises

Do this hands-on task. When done, compare your work against the requirements in the Practice Exercises section below.

  1. Exercise 1: Build a pipeline that:
    • Parses JSON logs from containers.
    • Adds env, service, and cluster labels.
    • Redacts emails in messages.
    • Routes errors to a hot index and the rest to cold storage.
  • Success checklist:
    • All logs have timestamp, level, service, env, cluster.
    • No plain-text emails remain in stored messages.
    • Error logs are searchable in a fast index.
    • Non-error logs are visible in archive queries.

Common mistakes and self-checks

  • Unstructured logs: Free-text only. Self-check: Can you filter by service AND status_code? If not, add structured logging.
  • Missing correlation IDs: No request_id/trace_id. Self-check: Can you follow one request across services? If not, add correlation fields at the source.
  • Timezone drift: Logs use local time without tz. Self-check: Query across regions for the same incident; do times align? If not, normalize to UTC.
  • High-cardinality explosion: Indexing user_id/session_id as analyzed fields. Self-check: Index size grows faster than log volume; reduce or hash these fields (see the sketch after this list).
  • PII leakage: Emails or tokens stored. Self-check: Run a PII scan query (email regex); if hits > 0, add or strengthen redaction.
  • Dropped logs under load: No backpressure/disk buffering. Self-check: Simulate spike and confirm no gaps in timeline.
  • Multiline not configured: Stack traces split. Self-check: Exception search returns partial lines; enable multiline rules.
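
To make the high-cardinality self-check concrete, here is a small sketch that counts distinct values per field over a sample of parsed events; the sample field names are assumptions.

from collections import defaultdict

def cardinality_report(events):
    """Count distinct values per field over a sample of parsed events."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            seen[field].add(str(value))
    return {field: len(values) for field, values in seen.items()}

sample = [
    {"service": "orders", "status_code": 500, "user_id": "u-1"},
    {"service": "orders", "status_code": 200, "user_id": "u-2"},
    {"service": "webapi", "status_code": 200, "user_id": "u-3"},
]
print(cardinality_report(sample))
# {'service': 2, 'status_code': 2, 'user_id': 3} -> user_id grows with traffic; don't index it as a term.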

Practical projects

  • Project A: Create a two-tier log architecture: 14d hot searchable index + 90d object storage. Prove you can restore any day’s logs into a temporary index.
  • Project B: Organization-wide logging schema: define the required fields (timestamp, service, env, level, trace_id, request_id) and add a CI lint step that blocks PRs that break it.
  • Project C: Error budget alert: detect 5-minute spikes in error logs per service and page the owning team with the top 3 error signatures.

Learning path

  1. Standardize structured logging in services (JSON + required fields).
  2. Deploy collectors with buffering, retries, and multiline handling.
  3. Implement parsing and enrichment pipelines (service, env, trace_id).
  4. Add routing and lifecycle policies (hot/warm/cold + retention).
  5. Set up PII redaction and RBAC for log access.
  6. Create shared queries, dashboards, and alerts for critical services.
  7. Load test ingestion; tune queues, compression, and sampling.

Next steps

  • Finish the exercise below and run the Quick Test.
  • Extend your pipeline to include metric extraction (count of errors per endpoint) for alerting (see the sketch after this list).
  • Document your logging schema and share it with service owners.
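
For the metric-extraction step, a minimal sketch counting errors per endpoint from parsed events; the field names are assumptions, and a real pipeline would emit these counts to your metrics backend.

from collections import Counter

def error_counts_by_endpoint(events):
    """Count error-or-worse events per endpoint from parsed log events."""
    return Counter(
        e.get("endpoint", "unknown")
        for e in events
        if e.get("level") in {"error", "fatal"}
    )

sample = [
    {"level": "error", "endpoint": "/checkout"},
    {"level": "info", "endpoint": "/pricing"},
    {"level": "error", "endpoint": "/checkout"},
]
print(error_counts_by_endpoint(sample))  # Counter({'/checkout': 2})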

Mini challenge

Design a plan for a sudden 5x production traffic spike during a launch:

  • What backpressure settings will you change?
  • Which logs will you sample or drop first, and why?
  • How will you verify no data loss after the spike?

Hints
  • Increase disk buffer size and enable compression.
  • Lower debug-level sampling; keep error/fatal at 100%.
  • Compare produced vs. stored counts per time bucket.
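
For the last hint, a sketch that compares produced vs. stored counts per 5-minute bucket; where the two timestamp streams come from (producer-side counters vs. a storage query) is an assumption.

from collections import Counter
from datetime import datetime, timezone

def bucket(ts_iso: str, minutes: int = 5) -> str:
    """Truncate an ISO timestamp to a fixed-size UTC time bucket."""
    ts = datetime.fromisoformat(ts_iso.replace("Z", "+00:00")).astimezone(timezone.utc)
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0).isoformat()

def loss_report(produced_ts, stored_ts, minutes=5):
    """Compare per-bucket counts from producers against counts queried from storage."""
    produced = Counter(bucket(t, minutes) for t in produced_ts)
    stored = Counter(bucket(t, minutes) for t in stored_ts)
    return {b: produced[b] - stored.get(b, 0) for b in produced if produced[b] != stored.get(b, 0)}

# An empty dict means no gaps; positive values show how many events a bucket lost.
print(loss_report(["2025-05-01T12:00:10Z", "2025-05-01T12:01:00Z"], ["2025-05-01T12:00:10Z"]))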

Practice Exercises

1 exercise to complete

Instructions

Goal: Configure a vendor-neutral collector pipeline that:

  • Accepts JSON logs from containers.
  • Parses/normalizes timestamp and level.
  • Enriches with env, service, and cluster.
  • Redacts emails in message.
  • Routes level >= error to a hot searchable index; others to cold object storage.

Use a collector or agent you know (e.g., OpenTelemetry Collector, Fluent Bit, or Vector). Provide a minimal working config and a short note explaining your choices.

Sample input log lines
{"timestamp":"2025-05-01T12:34:56","level":"info","msg":"User john@example.com viewed /pricing","service":"webapi"}
{"timestamp":"2025-05-01T12:35:01Z","level":"error","msg":"DB timeout for user jane.doe@corp.io","service":"orders"}

Requirements:

  • All stored logs must include: timestamp (UTC), level, service, env, cluster, trace_id (when provided), and a sanitized message field.
  • Hot index naming example: prod-errors-YYYY.MM.DD
  • Cold storage: object store path with date-based partitions.

Expected Output
{ "timestamp":"2025-05-01T12:35:01Z", "level":"error", "service":"orders", "env":"prod", "cluster":"prod-cluster-a", "message":"DB timeout for user [REDACTED_EMAIL]", "trace_id":"abc123..." }

Centralized Log Management — Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.
