Why this matters
Audit logs are your system's official record of who did what, when, and how. As an API Engineer, you'll use them to:
- Investigate incidents (security breaches, data changes).
- Meet compliance requirements (e.g., immutability, retention, exportability).
- Resolve customer disputes (what action was taken and by whom).
- Detect risky patterns (e.g., mass deletions, repeated failed logins).
- Enable safe operations across microservices with clear, correlated events.
Concept explained simply
Audit logs are append-only, structured events describing significant actions taken by users, systems, or services. They differ from normal app logs: they favor clarity, consistency, and immutability over verbosity.
Audit vs application logs
- Application logs: for debugging system behavior; can be verbose; often mutable in practice.
- Audit logs: for accountability; minimal, structured, append-only; carefully redacted.
Mental model
Think of an audit log as a bank ledger: a chronological, append-only list of entries that together tell a trustworthy story.
time-ordered timeline:
[00:01] user:U1 action:LOGIN result:SUCCESS
[00:03] user:U1 action:KEY.CREATE resource:key:K9 result:SUCCESS
[00:05] svc:rotator action:TOKEN.ROTATE resource:svc:S1 result:SUCCESS
Each event: Who (actor) + What (action) + Which (target) + When (timestamp) + How (context) + Result
What to log: core fields
Design a small, stable schema and use it everywhere.
{
"event_id": "uuid",
"ts": "RFC3339 timestamp in UTC",
"actor": { "type": "user|service|system", "id": "string", "name": "optional", "ip": "optional", "auth": "optional" },
"action": "OBJECT.VERB e.g., USER.LOGIN, POLICY.UPDATE",
"target": { "type": "resource type", "id": "resource id" },
"request_id": "correlation id",
"tenant_id": "if multi-tenant",
"result": { "status": "SUCCESS|FAILURE", "reason": "optional" },
"metadata": { "version": 1, "fields": { /* safe, redacted details */ } },
"integrity": { "hash": "optional chain hash", "prev": "optional prev hash" }
}
- Keep fields consistent; avoid nesting that varies per event.
- Redact sensitive values (secrets, full tokens, passwords, full card numbers).
- Use an explicit event naming convention (OBJECT.VERB, as in USER.LOGIN) and carry a schema version in every event.
- Make event writes idempotent (use a UUID as event_id).
- Clock: store timestamps in UTC; sync servers with NTP; order events by the composite key (ts, event_id) so that ties on ts break deterministically.
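The schema above can be assembled by one small helper so every service emits the same shape. This is a minimal sketch using only the Python standard library; the name build_audit_event and its parameters are illustrative assumptions, and the optional integrity block is left out.

```python
import uuid
from datetime import datetime, timezone


def build_audit_event(actor, action, target, request_id, status,
                      tenant_id=None, reason=None, fields=None):
    """Assemble one audit event with the core fields (hypothetical helper)."""
    result = {"status": status}
    if reason is not None:
        result["reason"] = reason

    # RFC3339 timestamp in UTC, with "Z" instead of "+00:00".
    ts = (datetime.now(timezone.utc)
          .isoformat(timespec="milliseconds")
          .replace("+00:00", "Z"))

    event = {
        "event_id": str(uuid.uuid4()),  # unique id, used for idempotent writes
        "ts": ts,
        "actor": actor,
        "action": action,
        "target": target,
        "request_id": request_id,
        "result": result,
        # Callers are expected to pass only safe, already-redacted details.
        "metadata": {"version": 1, "fields": fields or {}},
    }
    if tenant_id is not None:
        event["tenant_id"] = tenant_id
    return event


event = build_audit_event(
    actor={"type": "user", "id": "u_123"},
    action="KEY.UPDATE_PERMISSIONS",
    target={"type": "api_key", "id": "k_9"},
    request_id="r_abc",
    status="SUCCESS",
    tenant_id="t_42",
)
```

Keeping construction in one place makes it harder for individual endpoints to drift from the shared schema.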
Redaction rules examples
// never log:
password, raw tokens, full card numbers, private keys
// allow masked forms:
"token_last4": "ABCD"
"email_hash": "sha256:..."
"ip": "203.0.113.10" (OK if needed for security)
// for before/after changes: log field names, not full values
"fields_changed": ["role", "status"]
Worked examples
Example 1: Update API key permissions
{
"event_id": "1e3d...",
"ts": "2026-01-21T10:15:00Z",
"actor": {"type": "user", "id": "u_123", "ip": "198.51.100.5"},
"action": "KEY.UPDATE_PERMISSIONS",
"target": {"type": "api_key", "id": "k_9"},
"request_id": "r_abc",
"tenant_id": "t_42",
"result": {"status": "SUCCESS"},
"metadata": {"version": 1, "fields": {"added": ["read:metrics"], "removed": ["write:billing"]}}
}
Example 2: Authentication flow
// failed login
{ "event_id": "a1", "ts": "2026-01-21T11:00:00Z", "actor": {"type":"user","id":"u_44","ip":"203.0.113.20"}, "action":"USER.LOGIN", "target": {"type":"user","id":"u_44"}, "request_id":"r1", "result": {"status":"FAILURE","reason":"INVALID_CREDENTIALS"}, "metadata": {"version":1} }
// successful login
{ "event_id": "a2", "ts": "2026-01-21T11:01:10Z", "actor": {"type":"user","id":"u_44","ip":"203.0.113.20"}, "action":"USER.LOGIN", "target": {"type":"user","id":"u_44"}, "request_id":"r2", "result": {"status":"SUCCESS"}, "metadata": {"version":1} }
// token refresh
{ "event_id": "a3", "ts": "2026-01-21T11:31:10Z", "actor": {"type":"user","id":"u_44"}, "action":"TOKEN.REFRESH", "target": {"type":"token","id":"tok_*masked"}, "request_id":"r3", "result": {"status":"SUCCESS"}, "metadata": {"version":1, "fields": {"expires_at":"2026-01-22T11:31:10Z"}} }
Example 3: Export endpoint design
GET /audit-events?from=2026-01-21T00:00:00Z&to=2026-01-22T00:00:00Z&actor_id=u_123&action=USER.LOGIN&result=FAILURE&tenant_id=t_42&limit=200&cursor=eyJ0cyI6IjIwMjYt...
Response 200
{
"items": [ { ...event... } ],
"next_cursor": "eyJ0cyI6IjIwMjYt..."
}
Rules:
- Read-only; time-ordered descending or cursor-ordered
- Filters: time range (required), actor_id, action, target.type/id, result.status, tenant_id, request_id
- Pagination: cursor by (ts, event_id)
- Rate limits; export size caps; redaction enforced server-side
- Authorization: only org admins / proper scopes
Implementation patterns
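One pattern worth sketching first is the cursor rule from the export design: page by (ts, event_id) rather than by offset. The code below is an in-memory illustration, not a storage-backed implementation. The names encode_cursor, decode_cursor, and page are hypothetical, and the sketch assumes every ts is a UTC RFC3339 string with the same precision, so string order matches time order.

```python
import base64
import json


def encode_cursor(ts, event_id):
    """Pack the last-seen (ts, event_id) into an opaque, URL-safe token."""
    raw = json.dumps({"ts": ts, "event_id": event_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def decode_cursor(cursor):
    data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return data["ts"], data["event_id"]


def page(events, limit, cursor=None):
    """Return (items, next_cursor), newest first, ordered by (ts, event_id)."""
    ordered = sorted(events, key=lambda e: (e["ts"], e["event_id"]), reverse=True)
    if cursor is not None:
        last_seen = decode_cursor(cursor)
        # Keep only events strictly older than the last one already returned.
        ordered = [e for e in ordered if (e["ts"], e["event_id"]) < last_seen]
    items = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = (
        encode_cursor(items[-1]["ts"], items[-1]["event_id"]) if has_more else None
    )
    return items, next_cursor
```

In a database, the same comparison would typically become a keyset predicate on (ts, event_id), which stays stable when new events arrive between page requests.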
- Capture at the edge using middleware so every request gets a request_id and actor context.
- Write audit events to a durable, append-only sink (e.g., a queue or write-ahead log) before applying changes, or write them in the same transaction as the change when you need strong consistency.
- Make writes idempotent using event_id; dedupe on conflict.
- Guarantee ordering per tenant or resource using (ts + event_id) keys.
- Use a schema version in each event; evolve via additive fields.
- Clock hygiene: UTC, NTP synced, include server_id in metadata for traceability.
- Tamper-evidence: optional hash chain linking prev event for the same stream (tenant or resource).
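The idempotent-write and ordering patterns above can be sketched with SQLite standing in for whatever durable store you use. The table name audit_events and the helpers open_sink, append_event, and read_stream are assumptions made for illustration. The sketch shows deduplication on event_id only; it does not by itself prevent updates or deletes, which would need database permissions or triggers.

```python
import json
import sqlite3


def open_sink(path=":memory:"):
    """Open the store and create the events table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_events ("
        " event_id TEXT PRIMARY KEY,"  # uniqueness makes retries harmless
        " tenant_id TEXT,"
        " ts TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def append_event(conn, event):
    """Insert the event once; a retry with the same event_id is a no-op."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO audit_events (event_id, tenant_id, ts, body)"
            " VALUES (?, ?, ?, ?)",
            (event["event_id"], event.get("tenant_id"), event["ts"],
             json.dumps(event, sort_keys=True)),
        )
    return cur.rowcount == 1  # True if stored, False if it was a duplicate


def read_stream(conn, tenant_id):
    """Read one tenant's events in (ts, event_id) order."""
    rows = conn.execute(
        "SELECT body FROM audit_events WHERE tenant_id = ?"
        " ORDER BY ts, event_id",
        (tenant_id,),
    )
    return [json.loads(row[0]) for row in rows]
```

A producer that times out and retries can call append_event again with the same event and rely on the primary key to drop the duplicate.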
Minimal hash chaining example
integrity.hash = SHA256(prev_hash + canonical_json(event_without_integrity))
// Store integrity.prev per stream to verify sequence.
Monitoring and alerts
- Spike alerts: sudden increase in DELETE, POLICY.UPDATE, ROLE.GRANT.
- Brute force: multiple USER.LOGIN failures from one IP or against one account.
- Service account anomalies: unusual actions outside maintenance window.
- Silent periods: no audit events from critical services.
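As an illustration of the brute-force bullet above, here is a small sliding-window check over events in the schema used throughout this lesson. The function names, the threshold of 5, and the 10-minute window are assumptions chosen for the example, not recommended values.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def parse_ts(ts):
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def brute_force_ips(events, threshold=5, window=timedelta(minutes=10)):
    """Return IPs with more than `threshold` failed USER.LOGIN events in any window."""
    recent = defaultdict(deque)  # ip -> timestamps of recent failures
    flagged = set()
    for ev in sorted(events, key=lambda e: parse_ts(e["ts"])):
        if ev["action"] != "USER.LOGIN" or ev["result"]["status"] != "FAILURE":
            continue
        ip = ev["actor"].get("ip")
        if ip is None:
            continue
        now = parse_ts(ev["ts"])
        failures = recent[ip]
        failures.append(now)
        # Drop failures that have fallen out of the window.
        while failures and now - failures[0] > window:
            failures.popleft()
        if len(failures) > threshold:
            flagged.add(ip)
    return flagged
```

Grouping by the target account id instead of the IP gives the "against one account" variant of the same rule.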
Example alert rule
if count(action="ROLE.GRANT" AND tenant_id="t_42") > 5 within 10m then alert("Potential privilege escalation")
Privacy, compliance, retention
- Immutability: append-only writes; no updates; use soft redaction if required by law.
- Retention: define per-tenant policies (e.g., 10 years) and legal holds.
- Exportability: admins can export by time range with server-side filtering.
- Data minimization: store only what's necessary; prefer hashes and masks.
- Access control: strict scopes to read audit logs; logs themselves can be sensitive.
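The "prefer hashes and masks" rule can be applied in one place before details reach metadata.fields. This is a sketch under stated assumptions: the key names in NEVER_LOG and the helpers mask_last4, hash_email, and redact are hypothetical, and a real system would match its own field names.

```python
import hashlib

# Keys that must never reach the audit log (illustrative list).
NEVER_LOG = {"password", "private_key", "card_number"}


def mask_last4(value):
    """Keep only the last four characters of a secret-like value."""
    return value[-4:]


def hash_email(email):
    """Hash a normalized email so events can be correlated without storing it."""
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return "sha256:" + digest


def redact(fields):
    """Return a copy of `fields` that is safe to store in metadata.fields."""
    safe = {}
    for key, value in fields.items():
        if key in NEVER_LOG:
            continue  # drop entirely
        if key == "token":
            safe["token_last4"] = mask_last4(value)
        elif key == "email":
            safe["email_hash"] = hash_email(value)
        else:
            safe[key] = value
    return safe


print(redact({"password": "hunter2", "token": "tok_live_1234ABCD", "role": "admin"}))
# prints {'token_last4': 'ABCD', 'role': 'admin'}
```

An unsalted hash of a guessable value such as an email address can be reversed by trying candidates, so a keyed hash (for example HMAC with a server-side secret) may be the safer choice where that matters.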
Common mistakes and self-check
- Missing read events entirely. Self-check: sample READs for sensitive resources.
- Logging secrets. Self-check: scan for patterns like "Bearer ", "BEGIN PRIVATE KEY".
- Inconsistent event names. Self-check: enforce the OBJECT.VERB pattern (e.g., USER.LOGIN) with a linter in CI.
- No tenant_id in multi-tenant systems. Self-check: query events missing tenant_id.
- Offset pagination for exports. Self-check: switch to cursor by (ts, event_id).
- No correlation ids. Self-check: ensure every event has request_id.
- No time sync. Self-check: verify NTP status and drift dashboards.
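Several of these self-checks can run automatically against sampled events, for example in CI or as a scheduled job. The sketch below is one possible shape: check_event, the dotted uppercase name pattern, and the two secret patterns are assumptions based on the examples in this lesson, not a complete scanner.

```python
import json
import re

# Dotted uppercase names such as USER.LOGIN or KEY.UPDATE_PERMISSIONS.
ACTION_NAME = re.compile(r"^[A-Z][A-Z0-9_]*\.[A-Z][A-Z0-9_]*$")

# Patterns that suggest a secret leaked into an event body.
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+\S+"),
    re.compile(r"BEGIN (?:[A-Z]+ )?PRIVATE KEY"),
]


def check_event(event, multi_tenant=True):
    """Return a list of problems found in one audit event (empty if clean)."""
    problems = []
    if not ACTION_NAME.match(event.get("action", "")):
        problems.append("action name does not follow the naming convention")
    if not event.get("request_id"):
        problems.append("missing request_id")
    if multi_tenant and not event.get("tenant_id"):
        problems.append("missing tenant_id")
    serialized = json.dumps(event)
    if any(pattern.search(serialized) for pattern in SECRET_PATTERNS):
        problems.append("possible secret in event body")
    return problems
```

Failing the build when check_event returns any problem for a fixture event is one way to keep these mistakes from reaching production.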
Exercises
Tackle these hands-on tasks, then check your work against the checklist that follows.
- Exercise 1: Define a minimal audit schema and produce 3 events (login fail, policy update success, system token rotation).
- Exercise 2: Design an export API with secure filters, cursor pagination, and sample query/SQL.
- [ ] Your schema has actor, action, target, ts, result, request_id.
- [ ] No secrets or full tokens appear in metadata.
- [ ] Cursor pagination returns a stable next_cursor.
- [ ] Filters include time range and tenant_id.
Practical projects
- Build a middleware-based audit hook that stamps request_id, actor, and writes an event for each mutating endpoint.
- Create an export admin page that streams newline-delimited JSON for a time range with a progress indicator.
- Add tamper-evidence by implementing a per-tenant hash chain and a verification tool that flags gaps or mismatches.
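For the tamper-evidence project, a starting point might look like the sketch below. It follows the chaining formula from this lesson, SHA256(prev_hash + canonical_json(event_without_integrity)). The all-zero GENESIS value for the first event and the names canonical_json, chain_hash, seal, and verify are assumptions for illustration; sorted-key compact JSON is one simple canonical form among several.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev hash for the first event in a stream


def canonical_json(event):
    """Serialize the event without its integrity block, in a stable form."""
    body = {k: v for k, v in event.items() if k != "integrity"}
    return json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def chain_hash(prev_hash, event):
    data = (prev_hash + canonical_json(event)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def seal(event, prev_hash=GENESIS):
    """Return a copy of the event with integrity.prev and integrity.hash set."""
    sealed = dict(event)
    sealed["integrity"] = {"prev": prev_hash, "hash": chain_hash(prev_hash, event)}
    return sealed


def verify(stream):
    """Return indexes of events whose link or hash does not check out."""
    bad = []
    expected_prev = GENESIS
    for i, event in enumerate(stream):
        integrity = event.get("integrity", {})
        link_ok = integrity.get("prev") == expected_prev
        hash_ok = integrity.get("hash") == chain_hash(expected_prev, event)
        if not (link_ok and hash_ok):
            bad.append(i)
        # Continue from the stored hash so one bad event is reported once.
        expected_prev = integrity.get("hash", expected_prev)
    return bad
```

An edited event shows up as a hash mismatch at its own index, while a deleted event shows up as a broken link at the index that follows the gap.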
Who this is for
API Engineers, backend developers, and platform engineers who need reliable, compliant audit trails for security and operations.
Prerequisites
- Comfort with HTTP APIs and JSON.
- Basic data modeling (relational or document stores).
- Understanding of authentication and authorization concepts.
Learning path
- Design a stable event schema and naming convention.
- Implement capture points (middleware and domain services).
- Choose storage and indexing; enable cursor exports.
- Add monitoring and alerts for sensitive actions.
- Introduce retention, redaction, and tamper-evidence as needed.
Next steps
- Instrument all mutating endpoints first; then add sampled READs for sensitive data.
- Roll out export and verification tooling to admins.
- Run game-day drills: simulate an incident and investigate using only audit logs.
Mini challenge
Within one hour, add audit logging for ROLE.GRANT and ROLE.REVOKE. Include actor, target user, previous roles, new roles, request_id, and tenant_id. Prove it works by exporting events for the last hour filtered by action.
Hint
Start from your existing schema. Add a small helper that diffs role sets and records fields_changed without dumping full permission lists.
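One possible shape for that helper is sketched here. The name diff_roles and the keys of the returned dictionary are assumptions; adapt them to whatever your metadata.fields already uses.

```python
def diff_roles(previous, new):
    """Describe a role change as added/removed sets instead of full lists."""
    before, after = set(previous), set(new)
    added = sorted(after - before)
    removed = sorted(before - after)
    return {
        "added": added,
        "removed": removed,
        # Only name the field when something actually changed.
        "fields_changed": ["roles"] if added or removed else [],
    }


print(diff_roles(["viewer", "editor"], ["editor", "admin"]))
# prints {'added': ['admin'], 'removed': ['viewer'], 'fields_changed': ['roles']}
```

The result can go straight into metadata.fields of a ROLE.GRANT or ROLE.REVOKE event.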