Who this is for
This lesson is for Platform Engineers and Security-minded Backend Engineers who need reliable audit trails and a repeatable access review process across services, clouds, and internal tools.
Prerequisites
- Basic understanding of identity and access management (users, roles, groups, service accounts)
- Familiarity with application logging and centralized log collection
- Comfort with JSON and structured logs
Why this matters
Real platform tasks where this is critical:
- Investigating production incidents: knowing exactly who changed what and when
- Proving compliance: SOC 2/ISO 27001 often require audit trails and periodic access recertification
- Containing breaches: rapid detection of privilege escalations, break-glass use, and mass data exports
- Least-privilege at scale: removing unused access safely and consistently
Concept explained simply
Audit logging: a trustworthy CCTV for your systems. It records who did what, to which resource, when, where, and whether it succeeded.
Access reviews: regular spring-cleaning of permissions so only the right people and services keep the access they truly need.
Mental model
- Events are receipts: each immutable, timestamped, and signed record is a receipt for a sensitive action.
- Reviews are recurring checkups: schedule them, notify owners, remove unused access, capture evidence.
Core design: audit events and reviews
Event taxonomy (recommended fields)
- timestamp (UTC, ISO 8601)
- actor: {type: user|service, id, display, org_id, auth_method, mfa: true|false}
- action: verb in past tense (granted_role, rotated_secret, exported_data)
- resource: {type, id, name, tenant_id}
- subject: {type, id} when the action targets another identity, such as the user receiving a role
- outcome: success|failure and error_code if any
- request: {id, ip, user_agent, location, trace_id/correlation_id}
- context: reason, ticket_id/change_id, previous_value/new_value for config changes
- integrity: {sequence, hash, prev_hash} for tamper-evident chaining
- retention_hint: normal|extended
Keep payloads minimal. For sensitive values, store references or one-way hashes instead of raw data.
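As a rough illustration of the taxonomy above, here is a minimal Python sketch of a producer helper. The names build_event and fingerprint are hypothetical, and the SHA-256 fingerprint is just one way to store a one-way hash; low-entropy values may need a keyed hash instead. The integrity block is left out on purpose because it is attached when the event joins a chain.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def fingerprint(value: str) -> str:
    """One-way SHA-256 digest so the raw sensitive value never enters the log."""
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_event(actor: dict, action: str, resource: dict, outcome: str,
                context: Optional[dict] = None, trace_id: Optional[str] = None) -> dict:
    """Assemble an audit event with the recommended top-level fields."""
    if outcome not in ("success", "failure"):
        raise ValueError("outcome must be 'success' or 'failure'")
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "actor": actor,
        "action": action,
        "resource": resource,
        "outcome": outcome,
        "request": {"id": f"req-{uuid.uuid4().hex[:8]}", "trace_id": trace_id},
        "context": context or {},
        "retention_hint": "normal",
    }


event = build_event(
    actor={"type": "service", "id": "svc-deployer", "auth_method": "workload_identity", "mfa": False},
    action="rotated_secret",
    resource={"type": "secret", "id": "db-password", "tenant_id": "marketing"},
    outcome="success",
    # Store a fingerprint of the new secret, never the secret itself.
    context={"reason": "scheduled rotation", "new_value": fingerprint("s3cr3t-value")},
    trace_id="tr-1ab2",
)
print(json.dumps(event, indent=2))
```

Keeping event construction in one small helper makes it easier to enforce required fields and naming across services.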
Coverage checklist
- Identity lifecycle: user/service create, update, disable, delete
- Privilege changes: role/grant/revoke, group membership, policy edits
- Auth events: login success/failure, MFA status, token mint/refresh
- Secrets: create/rotate/revoke/read (for reads, log at least the access attempt, never the secret value)
- Production changes: deploys, config changes, data export/import
- High-risk flows: break-glass, just-in-time elevation, impersonation
Storage and integrity
- Append-only: ship to centralized store; consider write-once (object lock) for compliance
- Tamper-evident: hash chain per stream or per tenant; verify regularly
- Retention policy: e.g., 400–730 days online, then archive; document exceptions
- Access to logs: separation of duties; reader vs admin distinct; monitor access to the logs themselves
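The retention bullet above can be turned into a small, testable rule. The sketch below is only an illustration: the mapping of retention_hint to 400 and 730 online days is an assumption layered on the example range, and storage_tier is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from retention_hint to days kept online; tune to your policy.
ONLINE_DAYS = {"normal": 400, "extended": 730}


def storage_tier(event: dict, now: datetime) -> str:
    """Return 'online' or 'archive' based on the event's age and retention_hint."""
    ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    limit = timedelta(days=ONLINE_DAYS[event.get("retention_hint", "normal")])
    return "online" if now - ts <= limit else "archive"


now = datetime(2025, 11, 3, tzinfo=timezone.utc)
old_event = {"timestamp": "2024-06-01T00:00:00Z", "retention_hint": "normal"}
print(storage_tier(old_event, now))                                    # archive
print(storage_tier({**old_event, "retention_hint": "extended"}, now))  # online
```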
Access reviews (recertification)
- Scope: apps, roles, groups, privileged systems, production data access
- Cadence: monthly for high-risk, quarterly for others; auto-schedule
- Ownership: each resource has a reviewer (system owner); fallback to security/platform
- Evidence: decisions with reason (keep/remove), date, reviewer identity, linked ticket/change
- Signals: last-used data to suggest revocations; flag SoD (segregation of duties) violations
- Revocation path: fast and reversible (grace window), with alerts
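Two of the points above, cadence and SoD flags, are easy to prototype. The following Python sketch assumes a 30-day cadence for high-risk resources and 90 days otherwise; the conflicting role pairs and function names are hypothetical examples, not a recommended policy.

```python
from datetime import date, timedelta

# Assumed cadence in days: roughly monthly for high-risk, quarterly otherwise.
CADENCE_DAYS = {"high": 30, "standard": 90}

# Hypothetical pairs of roles that one identity should not hold together.
SOD_CONFLICTS = [("payments-approver", "payments-requester"), ("deployer", "audit-log-admin")]


def next_review_due(last_review: date, risk: str) -> date:
    """Compute the next review date from the last one and the resource risk level."""
    return last_review + timedelta(days=CADENCE_DAYS[risk])


def sod_violations(grants: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """Return (identity, role_a, role_b) for every conflicting pair held together."""
    found = []
    for identity, roles in sorted(grants.items()):
        for a, b in SOD_CONFLICTS:
            if a in roles and b in roles:
                found.append((identity, a, b))
    return found


grants = {
    "alice@corp": {"deployer", "audit-log-admin"},
    "bob@corp": {"deployer"},
}
print(next_review_due(date(2025, 11, 3), "high"))  # 2025-12-03
print(sod_violations(grants))  # [('alice@corp', 'deployer', 'audit-log-admin')]
```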
Worked examples
Example 1 — Designing an event for a role grant
Scenario: User alice@corp grants admin role to bob@corp in service "app-cms".
{
  "timestamp": "2025-11-03T10:12:34Z",
  "actor": {"type": "user", "id": "alice@corp", "display": "Alice", "auth_method": "sso", "mfa": true},
  "action": "granted_role",
  "resource": {"type": "role", "id": "admin", "name": "Administrator", "tenant_id": "marketing"},
  "subject": {"type": "user", "id": "bob@corp"},
  "outcome": "success",
  "request": {"id": "req-9d2f", "ip": "203.0.113.5", "trace_id": "tr-1ab2"},
  "context": {"reason": "oncall coverage", "ticket_id": "CHG-1456"},
  "integrity": {"sequence": 1042, "hash": "h-xyz", "prev_hash": "h-xyw"}
}
Note the subject field to distinguish who received the role from the actor who granted it.
Example 2 — Tamper-evident chain
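Create a per-tenant sequence and compute hash = HMAC(key, prev_hash + event_body), where key is a secret held by the logging pipeline and event_body is a canonical serialization of the event without its integrity fields. Store sequence, hash, and prev_hash in each event. A verifier that holds the same key replays the chain and alerts on sequence gaps or hash mismatches.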
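A minimal sketch of such a chain, using Python's standard hmac and hashlib modules, might look like the following. The function names, the all-zero genesis value, and the JSON canonicalization are assumptions made for illustration. The sketch also folds the sequence number into the MAC input, a small addition to the formula above, so that renumbering is detectable as well.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed prev_hash for the first event in a stream


def canonical(event: dict) -> bytes:
    """Serialize the event deterministically, leaving out the integrity block."""
    body = {k: v for k, v in event.items() if k != "integrity"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _mac(key: bytes, sequence: int, prev_hash: str, event: dict) -> str:
    message = f"{sequence}:{prev_hash}:".encode("utf-8") + canonical(event)
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def seal(event: dict, key: bytes, sequence: int, prev_hash: str) -> dict:
    """Return a copy of the event with sequence, hash, and prev_hash attached."""
    digest = _mac(key, sequence, prev_hash, event)
    return {**event, "integrity": {"sequence": sequence, "hash": digest, "prev_hash": prev_hash}}


def verify(chain: list[dict], key: bytes) -> list[str]:
    """Replay the chain and report sequence gaps, broken links, and hash mismatches."""
    problems = []
    prev_hash, prev_seq = GENESIS, None
    for event in chain:
        integ = event["integrity"]
        seq = integ["sequence"]
        if prev_seq is not None and seq != prev_seq + 1:
            problems.append(f"gap before sequence {seq}")
        if integ["prev_hash"] != prev_hash:
            problems.append(f"broken link at sequence {seq}")
        expected = _mac(key, seq, integ["prev_hash"], event)
        if not hmac.compare_digest(expected, integ["hash"]):
            problems.append(f"hash mismatch at sequence {seq}")
        prev_hash, prev_seq = integ["hash"], seq
    return problems


key = b"demo-key-load-from-a-secret-manager"  # placeholder; never hard-code real keys
chain, prev = [], GENESIS
for seq, action in enumerate(["granted_role", "rotated_secret", "exported_data"], start=1):
    sealed = seal({"action": action, "tenant_id": "marketing"}, key, seq, prev)
    chain.append(sealed)
    prev = sealed["integrity"]["hash"]

print(verify(chain, key))  # []
chain[1]["action"] = "nothing_happened"  # simulate tampering with a stored event
print(verify(chain, key))  # ['hash mismatch at sequence 2']
```

Because the verifier needs the key, keep it away from anyone who can write to the log store; otherwise a tampered chain could simply be re-sealed.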
Example 3 — Access review with usage signals
For group "prod-db-readers":
- Last used: query audit logs for SELECT events by each member in the past 90 days
- Reviewer sees suggested removals for members with zero usage
- Reviewer approves removals; system executes revocations and logs decisions with evidence
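The usage-signal step above can be sketched as a simple join between group membership and recent audit events. This is an illustrative Python sketch; suggest_removals is a hypothetical name, and it assumes usage events carry the timestamp, actor.id, and outcome fields from the event taxonomy.

```python
from datetime import datetime, timedelta, timezone


def suggest_removals(members: list[str], usage_events: list[dict],
                     now: datetime, window_days: int = 90) -> list[str]:
    """Return members with no successful usage event inside the review window."""
    cutoff = now - timedelta(days=window_days)
    active = set()
    for e in usage_events:
        ts = datetime.strptime(e["timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if ts >= cutoff and e["outcome"] == "success":
            active.add(e["actor"]["id"])
    return sorted(m for m in members if m not in active)


members = ["alice@corp", "bob@corp", "svc-report"]
usage_events = [
    {"timestamp": "2025-10-20T08:00:00Z", "actor": {"id": "alice@corp"}, "outcome": "success"},
    {"timestamp": "2025-03-01T08:00:00Z", "actor": {"id": "bob@corp"}, "outcome": "success"},
]
now = datetime(2025, 11, 3, tzinfo=timezone.utc)
print(suggest_removals(members, usage_events, now))  # ['bob@corp', 'svc-report']
```

The output is only a suggestion list; the reviewer still makes the keep/remove decision, and each decision should be logged with its reason.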
Step-by-step implementation
- Define event taxonomy and risk levels; agree on naming and required fields.
- Instrument producers: build a small library to emit structured events with correlation and integrity fields.
- Centralize: ship to a log platform or SIEM; index key fields (actor.id, action, resource.id, tenant_id).
- Secure storage: enable append-only or object lock; restrict write and admin paths; log access to logs.
- Dashboards & alerts: create panels for high-risk actions and authentication anomalies.
- Access review workflow: define owners, schedule, decision UI (or CSV), automated revocations, and evidence archive.
- Runbook & drills: simulate a privilege escalation and verify you can reconstruct the timeline from logs.
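For the drill in the last step, it helps to have a small query helper ready before an incident. The Python sketch below filters events by actor or trace ID and orders them by time; the helper names and the HIGH_RISK_ACTIONS set are assumptions for illustration, and a real deployment would run the equivalent query in the log platform or SIEM.

```python
from typing import Optional

HIGH_RISK_ACTIONS = {"granted_role", "exported_data", "granted_break_glass"}  # assumed list


def timeline(events: list[dict], actor_id: Optional[str] = None,
             trace_id: Optional[str] = None) -> list[dict]:
    """Return matching events in time order (same-format UTC ISO 8601 strings sort chronologically)."""
    def matches(e: dict) -> bool:
        if actor_id and e["actor"]["id"] != actor_id:
            return False
        if trace_id and e.get("request", {}).get("trace_id") != trace_id:
            return False
        return True
    return sorted((e for e in events if matches(e)), key=lambda e: e["timestamp"])


def high_risk(events: list[dict]) -> list[dict]:
    """Keep only events whose action is on the high-risk list."""
    return [e for e in events if e["action"] in HIGH_RISK_ACTIONS]


events = [
    {"timestamp": "2025-11-03T10:15:02Z", "actor": {"id": "bob@corp"}, "action": "exported_data",
     "request": {"trace_id": "tr-2cd3"}},
    {"timestamp": "2025-11-03T10:12:34Z", "actor": {"id": "alice@corp"}, "action": "granted_role",
     "request": {"trace_id": "tr-1ab2"}},
    {"timestamp": "2025-11-03T10:13:10Z", "actor": {"id": "bob@corp"}, "action": "logged_in",
     "request": {"trace_id": "tr-2cd3"}},
]
for e in timeline(events, actor_id="bob@corp"):
    print(e["timestamp"], e["action"])
# 2025-11-03T10:13:10Z logged_in
# 2025-11-03T10:15:02Z exported_data
print([e["action"] for e in high_risk(events)])  # ['exported_data', 'granted_role']
```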
Exercises
Complete these hands-on tasks.
Exercise 1 — Design an audit event schema
Create a minimal JSON schema and one example for a privileged config change. Include timestamp, actor, action, resource, outcome, request.id, and integrity fields.
Exercise 2 — Plan a quarterly access review
Define scope, reviewers, usage signals, decisions, revocation steps, evidence storage, and metrics. Produce a 7-step checklist.
Self-check checklist
- Event includes correlation ID and UTC timestamp
- Clear separation between actor and subject
- Tamper-evident integrity fields present
- Access review has owners, cadence, usage signals, and evidence plan
- Defined fast revocation with audit trails
Common mistakes and how to self-check
- Too much or too little logging: log sensitive actions with context; avoid dumping entire payloads with PII.
- No correlation IDs: add request.id or trace_id to connect multi-service workflows.
- Mutable logs: without append-only or hash chains, evidence can be challenged. Make tampering detectable.
- Inconsistent timestamps: always use UTC in ISO 8601 format and keep clocks synchronized (for example, with NTP).
- No tenant or org scoping: include tenant_id to separate customers/environments.
- Skipping service accounts: review machine identities as rigorously as humans.
- Reviews without revocation: ensure a clear, fast, reversible removal path.
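Several of these mistakes can be caught mechanically before an event is shipped. The Python sketch below is one possible lint pass; lint_event and its specific rules are assumptions that mirror the list above, not an exhaustive validator.

```python
import re

# UTC ISO 8601 with a trailing Z, optionally with fractional seconds.
UTC_ISO = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def lint_event(event: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the event passed."""
    problems = []
    if not UTC_ISO.match(str(event.get("timestamp", ""))):
        problems.append("timestamp is not UTC ISO 8601 (expected a trailing Z)")
    request = event.get("request", {})
    if not (request.get("id") or request.get("trace_id")):
        problems.append("no correlation ID (request.id or trace_id)")
    if not event.get("resource", {}).get("tenant_id"):
        problems.append("resource.tenant_id is missing")
    if not {"sequence", "hash", "prev_hash"} <= set(event.get("integrity", {})):
        problems.append("integrity fields are incomplete")
    if event.get("action", "").startswith(("granted_", "revoked_")) and "subject" not in event:
        problems.append("privilege change has no subject")
    return problems


bad_event = {
    "timestamp": "2025-11-03 10:12:34+01:00",  # local time, not UTC
    "action": "granted_role",
    "resource": {"type": "role", "id": "admin"},
    "request": {},
}
for problem in lint_event(bad_event):
    print("-", problem)
```

Running a check like this in the producer library or in CI keeps schema drift from reaching the central store.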
Practical projects
- Build an audit pipeline: producer library + centralized index + dashboard for high-risk actions
- Implement a tamper-evident verifier that checks hash chains nightly and logs results
- Access review run: export membership of one critical group, join with 90-day usage, run review, and document outcomes
- Break-glass flow: create a time-bound elevation with auto-expiry and mandatory reason, fully logged
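As a starting point for the break-glass project, here is a minimal Python sketch of a time-bound elevation with a mandatory reason and an audit record. The in-memory audit_log list, the function names, the granted_break_glass action name, and the 60-minute default are all assumptions for illustration; a real flow would write to the audit pipeline and enforce expiry in the authorization layer.

```python
from datetime import datetime, timedelta, timezone

audit_log: list[dict] = []  # stand-in for the real audit pipeline


def _iso(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def grant_break_glass(actor_id: str, role: str, reason: str,
                      now: datetime, ttl_minutes: int = 60) -> dict:
    """Create a time-bound elevation; refuse it when no reason is given."""
    if not reason.strip():
        raise ValueError("break-glass access requires a reason")
    grant = {"actor_id": actor_id, "role": role, "reason": reason,
             "granted_at": now, "expires_at": now + timedelta(minutes=ttl_minutes)}
    audit_log.append({
        "timestamp": _iso(now),
        "actor": {"type": "user", "id": actor_id},
        "action": "granted_break_glass",
        "resource": {"type": "role", "id": role},
        "outcome": "success",
        "context": {"reason": reason, "expires_at": _iso(grant["expires_at"])},
    })
    return grant


def is_active(grant: dict, now: datetime) -> bool:
    """Auto-expiry: the grant is honored only inside its time window."""
    return grant["granted_at"] <= now < grant["expires_at"]


t0 = datetime(2025, 11, 3, 10, 0, tzinfo=timezone.utc)
grant = grant_break_glass("alice@corp", "prod-admin", "database failover during incident", t0, ttl_minutes=30)
print(is_active(grant, t0 + timedelta(minutes=10)))  # True
print(is_active(grant, t0 + timedelta(minutes=31)))  # False
print(audit_log[0]["context"]["expires_at"])         # 2025-11-03T10:30:00Z
```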
Learning path
- Start: Event taxonomy and logging standards
- Next: Centralized collection and integrity
- Then: Dashboards and alerts for high-risk actions
- Finally: Access review workflow and automation
Next steps
- Finish the exercises
- Take the quick test to confirm understanding
- Pick one practical project and implement it this week
Mini challenge
Within 48 hours, instrument one high-risk action in any service with the full event schema, ship it to your central logs, and create a simple saved search that alerts on failures.