Who this is for
- Platform Engineers and SREs setting up organization-wide logging.
- Backend engineers who need reliable, searchable logs for debugging.
- Team leads who want consistent, compliant logging across services.
Prerequisites
- Basic Linux and container experience (e.g., Docker or Kubernetes).
- Familiarity with JSON and key-value data.
- High-level understanding of your stack (services, environments, CI/CD).
Why this matters
Real tasks you’ll face as a Platform Engineer:
- Collect logs from dozens of services and make them searchable within minutes.
- Standardize fields (service, environment, trace_id) so teams can correlate logs with metrics and traces.
- Control cost with retention tiers (hot vs. cold storage) and reduce noisy or high-cardinality data.
- Ensure compliance by redacting PII and enforcing access controls and retention policies.
- Keep ingestion stable during traffic spikes without losing logs.
Concept explained simply
Centralized log management gathers logs from many places into a single, queryable system. You ship logs from apps and infrastructure, parse and enrich them, store them in hot/cold tiers, and let engineers search and alert on them.
Mental model
- Think "postal system":
  - Ingest: collectors pick up letters (logs).
  - Parse & Enrich: read the address, add postal codes (metadata).
  - Route: send to the right sorting centers (hot index, archive).
  - Store & Index: shelves for fast lookup vs. warehouses for archives.
  - Search & Alert: clerks who find letters fast and ring the bell on spikes.
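To make the postal-system analogy concrete, here is a minimal Python sketch of the same stages acting on a single raw line. The field names (service, env, trace_id), the plain-text fallback, and the routing rule are illustrative assumptions, not any particular vendor's API.

```python
import json
from datetime import datetime, timezone

def parse(raw_line: str) -> dict:
    """Parse a raw log line: prefer JSON, fall back to plain text."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        event = {"message": raw_line, "level": "info"}
    event.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    return event

def enrich(event: dict, service: str, env: str) -> dict:
    """Add the consistent metadata fields every team expects."""
    event.update({"service": service, "env": env})
    return event

def route(event: dict) -> str:
    """Pick a destination: hot index for prod errors, cold archive otherwise."""
    if event.get("env") == "prod" and event.get("level") in ("error", "fatal"):
        return "hot-prod-errors"
    return "cold-archive"

# Ingest: in a real deployment a collector tails files or container stdout.
raw = '{"level": "error", "message": "payment failed", "trace_id": "abc123"}'
event = enrich(parse(raw), service="checkout", env="prod")
print(route(event), event)
```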
Core components you’ll design
- Ingestion: node-level agents or sidecar collectors ship logs reliably with backpressure and retries.
- Parsing: handle JSON, plain text, and multiline (stack traces). Extract timestamp, level, message.
- Enrichment: add consistent fields: service, env, cluster, pod, region, trace_id, request_id.
- Routing: send logs to different destinations or indexes based on team, env, or severity.
- Storage & Indexing: balance hot (fast, costly) vs. cold (cheap, slower). Plan retention.
- Governance & Security: PII redaction, access control, audit, immutability, and legal holds.
- Query & Alerting: enable fielded search, saved queries, and burst detection.
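The ingestion bullet above mentions backpressure and retries; the sketch below shows one common shape for that logic: a bounded in-memory buffer that blocks producers when full, plus a flush with exponential backoff. The send_batch function and its failure mode are hypothetical stand-ins for your backend client, and the queue size and retry cap are assumptions to tune.

```python
import queue
import random
import time

buffer: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)  # bounded: producers block when full

def send_batch(batch: list[dict]) -> None:
    """Hypothetical backend call; raises on transient failure."""
    if random.random() < 0.3:
        raise ConnectionError("backend unavailable")

def flush_with_retries(batch: list[dict], max_attempts: int = 5) -> bool:
    """Retry with exponential backoff; caller spills to disk if this returns False."""
    for attempt in range(max_attempts):
        try:
            send_batch(batch)
            return True
        except ConnectionError:
            time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    return False

# Producer side: enqueue parsed events; a full queue applies backpressure upstream.
for i in range(100):
    buffer.put({"message": f"event {i}"})

# Consumer side: drain in batches and flush.
batch = [buffer.get() for _ in range(buffer.qsize())]
if not flush_with_retries(batch):
    print(f"flush failed; spill {len(batch)} events to the disk buffer")
```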
Key design choices (quick checklist)
- Structured logging (JSON) with standard fields.
- Index naming: env-service-YYYY.MM (or team-based) to aid lifecycle policies.
- Retention tiers: e.g., 7 days hot, 30 days warm, 180 days cold (object storage).
- PII handling: redact emails, phones, tokens before storage.
- Sampling or aggregation for very high-volume debug logs.
- Backpressure: bounded queues, retries with exponential backoff, disk buffer.
- Multiline handling for stack traces.
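As one example from the checklist, here is a hedged sketch of PII redaction applied before storage. The regexes are deliberately simple and purely illustrative; real deployments usually maintain a vetted, tested pattern library.

```python
import re

# Illustrative patterns only; production rules are usually broader and tested against samples.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),
    (re.compile(r"\b(?:sk|tok|key)_[A-Za-z0-9_]{16,}\b"), "<token>"),
    (re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"), "<phone>"),
]

def redact(message: str) -> str:
    """Replace PII-looking substrings before the event is indexed or archived."""
    for pattern, placeholder in REDACTIONS:
        message = pattern.sub(placeholder, message)
    return message

print(redact("user jane.doe@example.com paid with token sk_live_ABCDEF1234567890"))
# -> "user <email> paid with token <token>"
```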
Worked examples
Example 1: Routing by environment and severity
Goal: Keep prod errors searchable for 30 days, archive everything else cheaply.
- Ingest all logs from prod and staging.
- If env=prod AND level in [error, fatal], route to hot index prod-errors with 30d retention.
- All other logs: 7d hot then auto-move to cold storage for 180d.
- Benefit: fast P1 triage while controlling cost.
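A rough sketch of this routing rule as a small function that picks an index name and retention tier; the index names and tier labels are assumptions for illustration, not a specific backend's configuration.

```python
def choose_destination(event: dict) -> tuple[str, str]:
    """Map an event to (index, retention tier) per the rule above."""
    if event.get("env") == "prod" and event.get("level") in ("error", "fatal"):
        return "prod-errors", "hot-30d"
    return f"{event.get('env', 'unknown')}-general", "hot-7d-then-cold-180d"

print(choose_destination({"env": "prod", "level": "error"}))    # ('prod-errors', 'hot-30d')
print(choose_destination({"env": "staging", "level": "info"}))  # ('staging-general', 'hot-7d-then-cold-180d')
```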
What to check
- Query time for prod-errors is low even during spikes.
- Archived logs are still retrievable for audits.
Example 2: Fixing timestamps and timezones
Problem: Some services log local time without timezone; queries by time window miss events.
- Parse timestamp from message; if no tz, assume node timezone and convert to UTC.
- Add original_timestamp for traceability.
- Result: Consistent time-based searches across regions.
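A sketch of the timestamp fix, assuming ISO-like timestamps, that any value without an offset should be treated as the node's local time, and that the collector runs on that node.

```python
from datetime import datetime, timezone

def normalize_timestamp(raw_ts: str) -> dict:
    """Convert a possibly naive local timestamp to UTC, keeping the original for traceability."""
    parsed = datetime.fromisoformat(raw_ts)
    if parsed.tzinfo is None:
        # No offset in the log line: assume the node's local timezone (an assumption to verify).
        parsed = parsed.astimezone()
    return {
        "timestamp": parsed.astimezone(timezone.utc).isoformat(),
        "original_timestamp": raw_ts,
    }

print(normalize_timestamp("2024-05-01 14:03:07"))        # naive local time -> UTC
print(normalize_timestamp("2024-05-01T14:03:07+02:00"))  # offset preserved, converted to UTC
```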
What to check
- Spot-check 5 logs per service: parsed timestamp equals actual event time.
- Dashboards align with metrics timestamps.
Example 3: Reducing high-cardinality fields
Problem: user_id in index keys causes explosion of unique terms and high cost.
- Store user_id as a non-indexed field or hashed form (for lookup, not search).
- Keep only service, env, endpoint, status_code as indexed terms.
- Result: 30–60% lower index storage and faster queries.
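One way to express the split between indexed terms and stored-only payload, sketched in Python; the field names and the choice of SHA-256 hashing are illustrative assumptions.

```python
import hashlib

INDEXED_FIELDS = {"service", "env", "endpoint", "status_code"}

def split_for_indexing(event: dict) -> tuple[dict, dict]:
    """Return (indexed terms, stored-but-not-indexed payload) for one event."""
    indexed = {k: v for k, v in event.items() if k in INDEXED_FIELDS}
    stored = {k: v for k, v in event.items() if k not in INDEXED_FIELDS}
    if "user_id" in stored:
        # Hashing preserves exact-match lookups without exploding unique index terms.
        stored["user_id_hash"] = hashlib.sha256(str(stored.pop("user_id")).encode()).hexdigest()
    return indexed, stored

event = {"service": "checkout", "env": "prod", "endpoint": "/pay",
         "status_code": 500, "user_id": 918273, "message": "card declined"}
print(split_for_indexing(event))
```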
What to check
- Cardinality reports show major reduction.
- Investigations still possible by exact user_id filter (non-indexed or keyword field) where needed.
Example 4: Multiline stack traces
Problem: Java stack traces split into multiple log lines, breaking search.
- Enable multiline rule: start new event when line matches timestamp pattern; otherwise append.
- Store full stack trace in message field; keep level, logger, thread extracted.
- Result: Search and alerts reference whole exception correctly.
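A sketch of the multiline rule: a new event starts when a line begins with a timestamp, and every other line is appended to the current event. The timestamp pattern is an assumption; match it to your services' actual log format.

```python
import re

# Assumed line prefix, e.g. "2024-05-01 14:03:07,123 ERROR ..."; adjust to your format.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def group_multiline(lines: list[str]) -> list[str]:
    """Merge continuation lines (e.g. stack frames) into the event that started them."""
    events: list[str] = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events

raw = [
    "2024-05-01 14:03:07,123 ERROR OrderService - payment failed",
    "java.lang.IllegalStateException: gateway timeout",
    "    at com.shop.pay.Gateway.charge(Gateway.java:42)",
    "2024-05-01 14:03:08,001 INFO  OrderService - retry scheduled",
]
print(len(group_multiline(raw)))  # 2 events, not 4 fragments
```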
What to check
- Count of exceptions equals alert count (no fragmentation).
- Single event contains all stack frames.
Exercises
Do this hands-on task. When done, compare with the solution in the Exercises section below.
- Exercise 1: Build a pipeline that:
- Parses JSON logs from containers.
- Adds env, service, and cluster labels.
- Redacts emails in messages.
- Routes errors to a hot index and the rest to cold storage.
- Success checklist:
- All logs have timestamp, level, service, env, cluster.
- No plain-text emails remain in stored messages.
- Error logs are searchable in a fast index.
- Non-error logs are visible in archive queries.
Common mistakes and self-checks
- Unstructured logs: Free-text only. Self-check: Can you filter by service AND status_code? If not, add structured logging.
- Missing correlation IDs: No request_id/trace_id. Self-check: Can you follow one request across services? If not, add correlation fields at the source.
- Timezone drift: Logs use local time without tz. Self-check: Query across regions for the same incident; do times align? If not, normalize to UTC.
- High-cardinality explosion: Indexing user_id/session_id as analyzed fields. Self-check: Does index size grow faster than log volume? If so, reduce or hash these fields.
- PII leakage: Emails or tokens stored in plain text. Self-check: Run a PII scan query (email regex, as sketched after this list); if it returns hits, add or strengthen redaction.
- Dropped logs under load: No backpressure or disk buffering. Self-check: Simulate a spike and confirm there are no gaps in the timeline.
- Multiline not configured: Stack traces split across events. Self-check: Does an exception search return partial lines? If so, enable multiline rules.
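For the PII-leakage self-check, a minimal sketch of a scan over stored messages, assuming you can export or iterate a sample of them; the email regex is illustrative only.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_for_emails(messages: list[str]) -> int:
    """Count stored messages that still contain a plain-text email address."""
    return sum(1 for m in messages if EMAIL.search(m))

sample = ["user logged in", "reset sent to bob@example.com", "checkout ok"]
hits = scan_for_emails(sample)
print(f"{hits} of {len(sample)} sampled messages contain emails")  # expect 0 after redaction
```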
Practical projects
- Project A: Create a two-tier log architecture: 14d hot searchable index + 90d object storage. Prove you can restore any day’s logs into a temporary index.
- Project B: Organization-wide logging schema: define required fields (timestamp, service, env, level, trace_id, request_id) and add a CI lint that blocks PRs that break the schema.
- Project C: Error budget alert: detect 5-minute spikes in error logs per service and page the owning team with the top 3 error signatures (one way to frame the detection step is sketched below).
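For Project C, a minimal sketch of the detection step: count errors per service in one 5-minute bucket, compare against a threshold, and summarize the top 3 signatures. The threshold, bucket size, and the error_signature field are assumptions to tune for your environment.

```python
from collections import Counter

def top_error_signatures(events: list[dict], service: str, threshold: int = 50):
    """Return the top 3 error signatures for a service if its 5-minute error count exceeds the threshold."""
    errors = [e for e in events if e.get("service") == service and e.get("level") == "error"]
    if len(errors) <= threshold:
        return None  # no spike in this bucket
    signatures = Counter(e.get("error_signature", "unknown") for e in errors)
    return signatures.most_common(3)

# "bucket" stands in for the error logs collected during one 5-minute window.
bucket = [{"service": "checkout", "level": "error", "error_signature": "GatewayTimeout"}] * 60
print(top_error_signatures(bucket, "checkout"))  # [('GatewayTimeout', 60)]
```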
Learning path
- Standardize structured logging in services (JSON + required fields).
- Deploy collectors with buffering, retries, and multiline handling.
- Implement parsing and enrichment pipelines (service, env, trace_id).
- Add routing and lifecycle policies (hot/warm/cold + retention).
- Set up PII redaction and RBAC for log access.
- Create shared queries, dashboards, and alerts for critical services.
- Load test ingestion; tune queues, compression, and sampling.
Next steps
- Finish the exercise below and run the Quick Test.
- Extend your pipeline to include metric extraction (count of errors per endpoint) for alerting.
- Document your logging schema and share it with service owners.
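For the metric-extraction step above, a sketch that derives an errors-per-endpoint counter from log events; in practice the counter would be emitted to your metrics system rather than printed, and the field names are assumptions.

```python
from collections import Counter

def errors_per_endpoint(events: list[dict]) -> Counter:
    """Derive a metric from logs: error count keyed by endpoint."""
    return Counter(e["endpoint"] for e in events
                   if e.get("level") == "error" and "endpoint" in e)

events = [
    {"level": "error", "endpoint": "/pay"},
    {"level": "info", "endpoint": "/pay"},
    {"level": "error", "endpoint": "/cart"},
    {"level": "error", "endpoint": "/pay"},
]
print(errors_per_endpoint(events))  # Counter({'/pay': 2, '/cart': 1})
```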
Mini challenge
Design a plan for a sudden 5x production traffic spike during a launch:
- What backpressure settings will you change?
- Which logs will you sample or drop first, and why?
- How will you verify no data loss after the spike?
Hints
- Increase disk buffer size and enable compression.
- Lower debug-level sampling; keep error/fatal at 100%.
- Compare produced vs. stored counts per time bucket.
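For the last hint, a sketch of the produced-vs-stored comparison, assuming you can export per-bucket counts from both the producers and the storage backend; the 1% tolerance is an assumption to adjust.

```python
def find_gaps(produced: dict, stored: dict, tolerance: float = 0.01) -> list[str]:
    """Return time buckets where stored counts fall short of produced counts beyond the tolerance."""
    gaps = []
    for bucket, produced_count in produced.items():
        stored_count = stored.get(bucket, 0)
        if produced_count and (produced_count - stored_count) / produced_count > tolerance:
            gaps.append(bucket)
    return sorted(gaps)

produced = {"14:00": 12000, "14:01": 58000, "14:02": 61000}
stored   = {"14:00": 12000, "14:01": 57950, "14:02": 52000}
print(find_gaps(produced, stored))  # ['14:02'] -> investigate this minute
```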