Who this is for
- Platform Engineers and SREs setting up organization-wide logging.
- Backend engineers who need reliable, searchable logs for debugging.
- Team leads who want consistent, compliant logging across services.
Prerequisites
- Basic Linux and container experience (e.g., Docker or Kubernetes).
- Familiarity with JSON and key-value data.
- High-level understanding of your stack (services, environments, CI/CD).
Why this matters
Real tasks you’ll face as a Platform Engineer:
- Collect logs from dozens of services and make them searchable within minutes.
- Standardize fields (service, environment, trace_id) so teams can correlate logs with metrics and traces.
- Control cost with retention tiers (hot vs. cold storage) and reduce noisy or high-cardinality data.
- Ensure compliance by redacting PII and enforcing access controls and retention policies.
- Keep ingestion stable during traffic spikes without losing logs.
Concept explained simply
Centralized log management gathers logs from many places into a single, queryable system. You ship logs from apps and infrastructure, parse and enrich them, store them in hot/cold tiers, and let engineers search and alert on them.
Mental model
- Think "postal system":
  - Ingest: collectors pick up letters (logs).
  - Parse & Enrich: read the address, add postal codes (metadata).
  - Route: send to the right sorting centers (hot index, archive).
  - Store & Index: shelves for fast lookup vs. warehouses for archives.
  - Search & Alert: clerks who find letters fast and ring the bell on spikes.
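To make the postal-system analogy concrete, here is a minimal Python sketch of the same stages acting on a single raw line. The field names (service, env, trace_id), the plain-text fallback, and the routing rule are illustrative assumptions, not any particular vendor's API.

```python
import json
from datetime import datetime, timezone

def parse(raw_line: str) -> dict:
    """Parse a raw log line: prefer JSON, fall back to plain text."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        event = {"message": raw_line, "level": "info"}
    event.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    return event

def enrich(event: dict, service: str, env: str) -> dict:
    """Add the consistent metadata fields every team expects."""
    event.update({"service": service, "env": env})
    return event

def route(event: dict) -> str:
    """Pick a destination: hot index for prod errors, cold archive otherwise."""
    if event.get("env") == "prod" and event.get("level") in ("error", "fatal"):
        return "hot-prod-errors"
    return "cold-archive"

# Ingest: in a real deployment a collector tails files or container stdout.
raw = '{"level": "error", "message": "payment failed", "trace_id": "abc123"}'
event = enrich(parse(raw), service="checkout", env="prod")
print(route(event), event)
```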
Core components you’ll design
- Ingestion: node-level agents or sidecar collectors ship logs reliably with backpressure and retries.
- Parsing: handle JSON, plain text, and multiline (stack traces). Extract timestamp, level, message.
- Enrichment: add consistent fields: service, env, cluster, pod, region, trace_id, request_id.
- Routing: send logs to different destinations or indexes based on team, env, or severity.
- Storage & Indexing: balance hot (fast, costly) vs. cold (cheap, slower). Plan retention.
- Governance & Security: PII redaction, access control, audit, immutability, and legal holds.
- Query & Alerting: enable fielded search, saved queries, and burst detection.
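The ingestion bullet above mentions backpressure and retries; the sketch below shows one common shape for that logic: a bounded in-memory buffer that blocks producers when full, plus a flush with exponential backoff. The send_batch function and its failure mode are hypothetical stand-ins for your backend client, and the queue size and retry cap are assumptions to tune.

```python
import queue
import random
import time

buffer: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)  # bounded: producers block when full

def send_batch(batch: list[dict]) -> None:
    """Hypothetical backend call; raises on transient failure."""
    if random.random() < 0.3:
        raise ConnectionError("backend unavailable")

def flush_with_retries(batch: list[dict], max_attempts: int = 5) -> bool:
    """Retry with exponential backoff; caller spills to disk if this returns False."""
    for attempt in range(max_attempts):
        try:
            send_batch(batch)
            return True
        except ConnectionError:
            time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    return False

# Producer side: enqueue parsed events; a full queue applies backpressure upstream.
for i in range(100):
    buffer.put({"message": f"event {i}"})

# Consumer side: drain in batches and flush.
batch = [buffer.get() for _ in range(buffer.qsize())]
if not flush_with_retries(batch):
    print(f"flush failed; spill {len(batch)} events to the disk buffer")
```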
Key design choices (quick checklist)
- Structured logging (JSON) with standard fields.
- Index naming: env-service-YYYY.MM (or team-based) to aid lifecycle policies.
- Retention tiers: e.g., 7 days hot, 30 days warm, 180 days cold (object storage).
- PII handling: redact emails, phones, tokens before storage.
- Sampling or aggregation for very high-volume debug logs.
- Backpressure: bounded queues, retries with exponential backoff, disk buffer.
- Multiline handling for stack traces.
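As one example from the checklist, here is a hedged sketch of PII redaction applied before storage. The regexes are deliberately simple and purely illustrative; real deployments usually maintain a vetted, tested pattern library.

```python
import re

# Illustrative patterns only; production rules are usually broader and tested against samples.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),
    (re.compile(r"\b(?:sk|tok|key)_[A-Za-z0-9_]{16,}\b"), "<token>"),
    (re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"), "<phone>"),
]

def redact(message: str) -> str:
    """Replace PII-looking substrings before the event is indexed or archived."""
    for pattern, placeholder in REDACTIONS:
        message = pattern.sub(placeholder, message)
    return message

print(redact("user jane.doe@example.com paid with token sk_live_ABCDEF1234567890"))
# -> "user <email> paid with token <token>"
```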
Worked examples
Example 1: Routing by environment and severity
Goal: Keep prod errors searchable for 30 days, archive everything else cheaply.
- Ingest all logs from prod and staging.
- If env=prod AND level in [error, fatal], route to hot index prod-errors with 30d retention.
- All other logs: 7d hot then auto-move to cold storage for 180d.
- Benefit: fast P1 triage while controlling cost.
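A rough sketch of this routing rule as a small function that picks an index name and retention tier; the index names and tier labels are assumptions for illustration, not a specific backend's configuration.

```python
def choose_destination(event: dict) -> tuple[str, str]:
    """Map an event to (index, retention tier) per the rule above."""
    if event.get("env") == "prod" and event.get("level") in ("error", "fatal"):
        return "prod-errors", "hot-30d"
    return f"{event.get('env', 'unknown')}-general", "hot-7d-then-cold-180d"

print(choose_destination({"env": "prod", "level": "error"}))    # ('prod-errors', 'hot-30d')
print(choose_destination({"env": "staging", "level": "info"}))  # ('staging-general', 'hot-7d-then-cold-180d')
```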
What to check
- Query time for prod-errors is low even during spikes.
- Archived logs are still retrievable for audits.
Example 2: Fixing timestamps and timezones
Problem: Some services log local time without timezone; queries by time window miss events.
- Parse timestamp from message; if no tz, assume node timezone and convert to UTC.
- Add original_timestamp for traceability.
- Result: Consistent time-based searches across regions.
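A sketch of the timestamp fix, assuming ISO-like timestamps, that any value without an offset should be treated as the node's local time, and that the collector runs on that node.

```python
from datetime import datetime, timezone

def normalize_timestamp(raw_ts: str) -> dict:
    """Convert a possibly naive local timestamp to UTC, keeping the original for traceability."""
    parsed = datetime.fromisoformat(raw_ts)
    if parsed.tzinfo is None:
        # No offset in the log line: assume the node's local timezone (an assumption to verify).
        parsed = parsed.astimezone()
    return {
        "timestamp": parsed.astimezone(timezone.utc).isoformat(),
        "original_timestamp": raw_ts,
    }

print(normalize_timestamp("2024-05-01 14:03:07"))        # naive local time -> UTC
print(normalize_timestamp("2024-05-01T14:03:07+02:00"))  # offset preserved, converted to UTC
```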
What to check
- Spot-check 5 logs per service: parsed timestamp equals actual event time.
- Dashboards align with metrics timestamps.
Example 3: Reducing high-cardinality fields
Problem: user_id in index keys causes explosion of unique terms and high cost.
- Store user_id as a non-indexed field or hashed form (for lookup, not search).
- Keep only service, env, endpoint, status_code as indexed terms.
- Result: 30–60% lower index storage and faster queries.
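One way to express the split between indexed terms and stored-only payload, sketched in Python; the field names and the choice of SHA-256 hashing are illustrative assumptions.

```python
import hashlib

INDEXED_FIELDS = {"service", "env", "endpoint", "status_code"}

def split_for_indexing(event: dict) -> tuple[dict, dict]:
    """Return (indexed terms, stored-but-not-indexed payload) for one event."""
    indexed = {k: v for k, v in event.items() if k in INDEXED_FIELDS}
    stored = {k: v for k, v in event.items() if k not in INDEXED_FIELDS}
    if "user_id" in stored:
        # Hashing preserves exact-match lookups without exploding unique index terms.
        stored["user_id_hash"] = hashlib.sha256(str(stored.pop("user_id")).encode()).hexdigest()
    return indexed, stored

event = {"service": "checkout", "env": "prod", "endpoint": "/pay",
         "status_code": 500, "user_id": 918273, "message": "card declined"}
print(split_for_indexing(event))
```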
What to check
- Cardinality reports show major reduction.
- Investigations still possible by exact user_id filter (non-indexed or keyword field) where needed.
Example 4: Multiline stack traces
Problem: Java stack traces split into multiple log lines, breaking search.
- Enable multiline rule: start new event when line matches timestamp pattern; otherwise append.
- Store full stack trace in message field; keep level, logger, thread extracted.
- Result: Search and alerts reference whole exception correctly.
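A sketch of the multiline rule: a new event starts when a line begins with a timestamp, and every other line is appended to the current event. The timestamp pattern is an assumption; match it to your services' actual log format.

```python
import re

# Assumed line prefix, e.g. "2024-05-01 14:03:07,123 ERROR ..."; adjust to your format.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def group_multiline(lines: list[str]) -> list[str]:
    """Merge continuation lines (e.g. stack frames) into the event that started them."""
    events: list[str] = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events

raw = [
    "2024-05-01 14:03:07,123 ERROR OrderService - payment failed",
    "java.lang.IllegalStateException: gateway timeout",
    "    at com.shop.pay.Gateway.charge(Gateway.java:42)",
    "2024-05-01 14:03:08,001 INFO  OrderService - retry scheduled",
]
print(len(group_multiline(raw)))  # 2 events, not 4 fragments
```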
What to check
- Count of exceptions equals alert count (no fragmentation).
- Single event contains all stack frames.
Exercises
Do this hands-on task. When done, compare with the solution in the Exercises section below.
- Exercise 1: Build a pipeline that:
- Parses JSON logs from containers.
- Adds env, service, and cluster labels.
- Redacts emails in messages.
- Routes errors to a hot index and the rest to cold storage.
- Success checklist:
- All logs have timestamp, level, service, env, cluster.
- No plain-text emails remain in stored messages.
- Error logs are searchable in a fast index.
- Non-error logs are visible in archive queries.
Common mistakes and self-checks
- Unstructured logs: Free-text only. Self-check: Can you filter by service AND status_code? If not, add structured logging.
- Missing correlation IDs: No request_id/trace_id. Self-check: Can you follow one request across services? If not, add correlation fields at the source.
- Timezone drift: Logs use local time without tz. Self-check: Query across regions for the same incident; do times align? If not, normalize to UTC.
- High-cardinality explosion: Indexing user_id/session_id as analyzed fields. Self-check: Does index size grow faster than log volume? If so, reduce or hash these fields.
- PII leakage: Emails or tokens stored in plain text. Self-check: Run a PII scan query (email regex, as sketched after this list); if it returns hits, add or strengthen redaction.
- Dropped logs under load: No backpressure or disk buffering. Self-check: Simulate a spike and confirm there are no gaps in the timeline.
- Multiline not configured: Stack traces split across events. Self-check: Does an exception search return partial lines? If so, enable multiline rules.
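For the PII-leakage self-check, a minimal sketch of a scan over stored messages, assuming you can export or iterate a sample of them; the email regex is illustrative only.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_for_emails(messages: list[str]) -> int:
    """Count stored messages that still contain a plain-text email address."""
    return sum(1 for m in messages if EMAIL.search(m))

sample = ["user logged in", "reset sent to bob@example.com", "checkout ok"]
hits = scan_for_emails(sample)
print(f"{hits} of {len(sample)} sampled messages contain emails")  # expect 0 after redaction
```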
Practical projects
- Project A: Create a two-tier log architecture: 14d hot searchable index + 90d object storage. Prove you can restore any day’s logs into a temporary index.
- Project B: Organization-wide logging schema: define required fields (timestamp, service, env, level, trace_id, request_id) and add a CI lint that blocks PRs that break the schema.
- Project C: Error budget alert: detect 5-minute spikes in error logs per service and page the owning team with the top 3 error signatures (one way to frame the detection step is sketched below).
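For Project C, a minimal sketch of the detection step: count errors per service in one 5-minute bucket, compare against a threshold, and summarize the top 3 signatures. The threshold, bucket size, and the error_signature field are assumptions to tune for your environment.

```python
from collections import Counter

def top_error_signatures(events: list[dict], service: str, threshold: int = 50):
    """Return the top 3 error signatures for a service if its 5-minute error count exceeds the threshold."""
    errors = [e for e in events if e.get("service") == service and e.get("level") == "error"]
    if len(errors) <= threshold:
        return None  # no spike in this bucket
    signatures = Counter(e.get("error_signature", "unknown") for e in errors)
    return signatures.most_common(3)

# "bucket" stands in for the error logs collected during one 5-minute window.
bucket = [{"service": "checkout", "level": "error", "error_signature": "GatewayTimeout"}] * 60
print(top_error_signatures(bucket, "checkout"))  # [('GatewayTimeout', 60)]
```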
Learning path
- Standardize structured logging in services (JSON + required fields).
- Deploy collectors with buffering, retries, and multiline handling.
- Implement parsing and enrichment pipelines (service, env, trace_id).
- Add routing and lifecycle policies (hot/warm/cold + retention).
- Set up PII redaction and RBAC for log access.
- Create shared queries, dashboards, and alerts for critical services.
- Load test ingestion; tune queues, compression, and sampling.
Next steps
- Finish the exercise below and run the Quick Test.
- Extend your pipeline to include metric extraction (count of errors per endpoint) for alerting.
- Document your logging schema and share it with service owners.
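For the metric-extraction step above, a sketch that derives an errors-per-endpoint counter from log events; in practice the counter would be emitted to your metrics system rather than printed, and the field names are assumptions.

```python
from collections import Counter

def errors_per_endpoint(events: list[dict]) -> Counter:
    """Derive a metric from logs: error count keyed by endpoint."""
    return Counter(e["endpoint"] for e in events
                   if e.get("level") == "error" and "endpoint" in e)

events = [
    {"level": "error", "endpoint": "/pay"},
    {"level": "info", "endpoint": "/pay"},
    {"level": "error", "endpoint": "/cart"},
    {"level": "error", "endpoint": "/pay"},
]
print(errors_per_endpoint(events))  # Counter({'/pay': 2, '/cart': 1})
```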
Mini challenge
Design a plan for a sudden 5x production traffic spike during a launch:
- What backpressure settings will you change?
- Which logs will you sample or drop first, and why?
- How will you verify no data loss after the spike?
Hints
- Increase disk buffer size and enable compression.
- Lower debug-level sampling; keep error/fatal at 100%.
- Compare produced vs. stored counts per time bucket.
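For the last hint, a sketch of the produced-vs-stored comparison, assuming you can export per-bucket counts from both the producers and the storage backend; the 1% tolerance is an assumption to adjust.

```python
def find_gaps(produced: dict, stored: dict, tolerance: float = 0.01) -> list[str]:
    """Return time buckets where stored counts fall short of produced counts beyond the tolerance."""
    gaps = []
    for bucket, produced_count in produced.items():
        stored_count = stored.get(bucket, 0)
        if produced_count and (produced_count - stored_count) / produced_count > tolerance:
            gaps.append(bucket)
    return sorted(gaps)

produced = {"14:00": 12000, "14:01": 58000, "14:02": 61000}
stored   = {"14:00": 12000, "14:01": 57950, "14:02": 52000}
print(find_gaps(produced, stored))  # ['14:02'] -> investigate this minute
```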