Who this is for
Data Engineers, Analytics Engineers, Platform Engineers, and anyone responsible for data access, pipelines, and compliance in data platforms.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY)
- Familiarity with data platforms (data warehouse, data lake, or orchestration)
- Basic understanding of IAM/permissions concepts
Learning path
- Start: This lesson (concepts + examples)
- Practice: Do the exercises and mini challenge
- Validate: Take the Quick Test at the bottom of this page
- Apply: Build a small audit log pipeline in your sandbox environment
Why this matters
Audit logging is your system’s security camera. It helps you:
- Investigate incidents (e.g., who queried a sensitive table and when)
- Prove compliance (retain access records for required periods)
- Detect misuse (suspicious queries, mass downloads, off-hours access)
- Debug pipelines (trace job runs end-to-end with correlation IDs)
Concept explained simply
An audit log is an append-only record of important events in your data platform. Each event should capture the 5W1H:
- Who: user or service account
- What: action taken (SELECT, INSERT, GRANT_ROLE, DELETE_OBJECT, RUN_JOB)
- When: timestamp in UTC
- Where: system and location (warehouse name, database, dataset, object path, source IP)
- Why: optional reason/ticket/change request ID
- How: auth method, tool/client, request/correlation ID
Minimum recommended fields:
- event_time_utc (ISO-8601)
- actor (user/service)
- action (verb)
- resource (type + identifier)
- result (SUCCESS/FAILURE + error code)
- request_id (to correlate steps)
- source_ip and user_agent (where applicable)
- impact metrics (row_count, bytes_read)
- environment tags (prod/stage) and project/team
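The field list above can be sketched as a small event builder. This is a minimal illustration, not a standard schema: the function name, defaults, and nesting (matching the sample events later in this lesson) are assumptions you should adapt to your platform.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_time_utc", "actor", "action", "resource", "result", "request_id"}

def make_audit_event(actor, action, resource, result, request_id,
                     source_ip=None, user_agent=None, impact=None, env=None):
    """Build a minimal audit event covering the recommended fields.

    Optional fields are included only when provided, so events stay compact.
    """
    event = {
        "event_time_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "actor": actor,          # e.g. {"type": "user", "id": "alice"}
        "action": action,        # verb: SELECT, PUT_OBJECT, RUN_TASK, ...
        "resource": resource,    # e.g. {"type": "table", "id": "prod.analytics.orders"}
        "result": result,        # e.g. {"status": "SUCCESS"}
        "request_id": request_id,
    }
    if source_ip or user_agent:
        event["client"] = {k: v for k, v in
                           [("source_ip", source_ip), ("user_agent", user_agent)] if v}
    if impact:
        event["impact"] = impact   # e.g. {"rows": 120, "bytes_read": 5242880}
    if env:
        event["env"] = env         # e.g. {"stage": "prod"}
    return event
```

Keeping required fields mandatory as function parameters makes it impossible to emit an event that is missing the 5W1H basics.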
Mental model
Think of audit logging as a flight recorder (black box) for your data platform. It continuously records key signals, is hard to tamper with, and is easy to replay for investigation.
Core ingredients of a good audit logging setup
- Policy: what to log, retention periods, who can read logs, and how to respond to alerts.
- Producers: your warehouse, lake, orchestration tool, and IAM system emitting events.
- Transport: reliable delivery (e.g., streaming or batch) with retries.
- Storage: append-only, versioned, ideally immutable or write-once.
- Index/Query: a way to search events quickly (SQL table or log index).
- Integrity: time sync (NTP), hashing/signing, and access controls.
- Privacy: avoid logging secrets/PII; mask or tokenize where needed.
- Observability: dashboards and alerts for risky patterns.
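The integrity ingredient (hashing/signing) can be illustrated with a hash chain: each record's digest covers its payload plus the previous record's digest, so editing or deleting any record breaks every later link. This is a sketch of the idea using SHA-256, not a full signing scheme; production setups usually combine it with write-once storage and key-based signatures.

```python
import hashlib
import json

def chain_events(events):
    """Append a hash chain to a list of event dicts: each record's hash
    covers its own payload plus the previous record's hash."""
    prev = "0" * 64  # genesis value for the first record
    chained = []
    for ev in events:
        payload = json.dumps(ev, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        chained.append({**ev, "prev_hash": prev, "hash": digest})
        prev = digest
    return chained

def verify_chain(chained):
    """Recompute every link; return False on the first mismatch."""
    prev = "0" * 64
    for rec in chained:
        ev = {k: v for k, v in rec.items() if k not in ("prev_hash", "hash")}
        payload = json.dumps(ev, sort_keys=True)
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256((prev + payload).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

Any in-place tampering is detectable because the attacker would have to recompute every subsequent hash, which write-once storage prevents.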
Worked examples
Example 1: Data warehouse query logs
- Enable query and access logging.
- Land logs into a central schema (e.g., sec.audit_events).
- Normalize key fields; ensure UTC timestamps.
Sample event (warehouse query)
{
"event_time_utc": "2026-01-08T11:22:33Z",
"actor": {"type": "user", "id": "alice"},
"action": "SELECT",
"resource": {"type": "table", "id": "prod.analytics.pii_customers"},
"result": {"status": "SUCCESS"},
"request_id": "3f7b-98c1",
"client": {"tool": "sql_workbench", "source_ip": "203.0.113.42"},
"impact": {"rows": 120, "bytes_read": 5242880},
"env": {"stage": "prod"}
}
Query: Who accessed PII in the last 7 days?
SELECT actor_id, COUNT(*) AS hits
FROM sec.audit_events
WHERE action = 'SELECT'
AND resource_type = 'table'
AND resource_id LIKE '%pii%'
AND event_time_utc >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY actor_id
ORDER BY hits DESC;
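The query above can be tried end-to-end with an in-memory SQLite stand-in for sec.audit_events. The flattened column names match the query; the sample rows are invented for illustration, and the 7-day date filter is omitted here because the demo timestamps are fixed.

```python
import sqlite3

# In-memory stand-in for sec.audit_events with flattened columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_events (
        event_time_utc TEXT, actor_id TEXT, action TEXT,
        resource_type TEXT, resource_id TEXT, result_status TEXT
    )""")
conn.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-01-08T11:22:33Z", "alice", "SELECT", "table",
         "prod.analytics.pii_customers", "SUCCESS"),
        ("2026-01-07T09:00:00Z", "alice", "SELECT", "table",
         "prod.analytics.pii_customers", "SUCCESS"),
        ("2026-01-06T15:30:00Z", "bob", "SELECT", "table",
         "prod.analytics.orders", "SUCCESS"),
    ])

# Same shape as the PII-access query above, minus the date filter.
rows = conn.execute("""
    SELECT actor_id, COUNT(*) AS hits
    FROM audit_events
    WHERE action = 'SELECT'
      AND resource_type = 'table'
      AND resource_id LIKE '%pii%'
    GROUP BY actor_id
    ORDER BY hits DESC
""").fetchall()
```

Here `rows` contains only alice, because bob's query touched a non-PII table.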
Example 2: Data lake object access logs
Log reads/writes/deletes to lake objects. Track object path, method, status, bytes, and client.
Sample event (lake access)
{
"event_time_utc": "2026-01-08T10:05:00Z",
"actor": {"type": "service", "id": "etl_loader"},
"action": "PUT_OBJECT",
"resource": {"type": "object", "id": "prod/raw/sales/2026/01/08/part-0001.parquet"},
"result": {"status": "SUCCESS", "http": 200},
"request_id": "7c2d-44aa",
"client": {"source_ip": "198.51.100.10"},
"impact": {"bytes_written": 7340032},
"env": {"stage": "prod"}
}
Query: Detect unexpected deletes
SELECT event_time_utc, actor_id, resource_id
FROM sec.audit_events
WHERE action = 'DELETE_OBJECT'
AND env_stage = 'prod'
AND event_time_utc >= CURRENT_DATE - INTERVAL '1' DAY
ORDER BY event_time_utc DESC;
Example 3: Orchestrated pipeline audit trail
Emit one event per task with the same request_id to trace the run end-to-end.
Sample events (pipeline)
{
"event_time_utc": "2026-01-08T01:00:05Z",
"actor": {"type": "service", "id": "scheduler"},
"action": "RUN_TASK",
"resource": {"type": "task", "id": "daily_sales.extract"},
"result": {"status": "SUCCESS", "duration_ms": 53000},
"request_id": "run-2026-01-08-01:00",
"env": {"stage": "prod"}
}
{
"event_time_utc": "2026-01-08T01:02:10Z",
"actor": {"type": "service", "id": "scheduler"},
"action": "RUN_TASK",
"resource": {"type": "task", "id": "daily_sales.load"},
"result": {"status": "FAILURE", "error_code": "LOAD-403"},
"request_id": "run-2026-01-08-01:00",
"env": {"stage": "prod"}
}
Query: Find failing tasks by request_id
SELECT request_id, resource_id AS task, result_status, result_error_code
FROM sec.audit_events
WHERE action = 'RUN_TASK' AND result_status = 'FAILURE'
ORDER BY event_time_utc DESC;
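The same correlation logic can be sketched in Python: group RUN_TASK events by request_id and report which runs contain failures. The flattened field names (`resource_id`, `result_status`) mirror the query above and are assumptions about your normalized schema.

```python
from collections import defaultdict

def failed_runs(events):
    """Group RUN_TASK events by request_id and report runs that contain
    at least one failure, mapped to the failing task ids."""
    runs = defaultdict(list)
    for ev in events:
        if ev["action"] == "RUN_TASK":
            runs[ev["request_id"]].append(ev)
    report = {}
    for run_id, tasks in runs.items():
        failures = [t["resource_id"] for t in tasks
                    if t["result_status"] == "FAILURE"]
        if failures:
            report[run_id] = failures
    return report
```

Because every task in a run shares one request_id, a single lookup yields the whole run's history, including which step broke.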
Design checklist
- Use UTC timestamps and synchronized clocks.
- Make logs append-only; restrict delete permissions.
- Enable versioning/immutability or write-once storage if available.
- Define a standard schema for actors, actions, resources, and results.
- Include request/correlation IDs across systems.
- Mask secrets; avoid logging credentials or raw PII.
- Set retention (e.g., 90 days hot, multi-year archive if required).
- Index key fields for fast queries (time, actor, action, resource).
- Document who can access logs and for what purpose.
- Automate alerts for risk patterns (mass export, repeated failures, role grants).
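The "mask secrets" item in the checklist can be sketched as a regex redaction pass over log payloads. The patterns here are illustrative assumptions; tune them to your platform's actual token and credential formats before relying on them.

```python
import re

# Illustrative patterns only -- adapt to your real credential formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|token|api_key)\s*[=:]\s*\S+"),
    re.compile(r"Bearer\s+[A-Za-z0-9._-]+"),
]

def redact(text, placeholder="[REDACTED]"):
    """Replace likely credentials in a log payload with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Run redaction at the producer or transport stage, before events reach storage, so secrets never land in the append-only log at all.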
Exercises
Do these in a scratch environment or on paper if tools are unavailable.
- ex1 — Design a minimal audit event schema:
  - Create the field list for a warehouse SELECT log with required types and an example JSON event.
- ex2 — Write queries to detect risky access patterns:
  - Given a generic sec.audit_events table, find bursts of failed logins and out-of-hours PII access.
Common mistakes and self-check
- Mistake: Logging everything without a schema. Fix: Define a stable schema and required fields.
- Mistake: No correlation IDs. Fix: Add request_id to connect multi-step processes.
- Mistake: Logs are deletable by admins. Fix: Use write-once or versioned storage; separate admin roles.
- Mistake: Storing secrets in logs. Fix: Mask tokens/passwords; scrub query text if needed.
- Mistake: No retention policy. Fix: Set hot and archive tiers and review regularly.
- Mistake: No monitoring. Fix: Add dashboards and alerts for key patterns.
Self-check: Can you answer within minutes: who accessed a sensitive table last week, which tasks failed in a pipeline run, and who changed permissions yesterday?
Practical projects
- Build a small audit log table and load sample events from warehouse, lake, and pipeline.
- Create 3 saved queries: PII access, role changes, and mass export detection.
- Add an alert that triggers when more than 5 failed logins occur from the same IP in 10 minutes.
- Implement masking rules that redact secrets in log payloads.
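The failed-login alert project above (more than 5 failures from the same IP within 10 minutes) can be sketched as a sliding-window check. The event shape (ISO timestamp, IP pairs) and the thresholds are assumptions matching the project description.

```python
from datetime import datetime, timedelta

def login_burst_alerts(events, threshold=5, window_minutes=10):
    """Flag source IPs with more than `threshold` failed logins inside
    any sliding window of `window_minutes`.

    `events` is an iterable of (iso_timestamp, source_ip) tuples for
    failed login attempts.
    """
    window = timedelta(minutes=window_minutes)
    by_ip = {}
    for ts, ip in events:
        by_ip.setdefault(ip, []).append(datetime.fromisoformat(ts))
    alerts = set()
    for ip, times in by_ip.items():
        times.sort()
        left = 0
        for right in range(len(times)):
            # shrink the window until it spans at most window_minutes
            while times[right] - times[left] > window:
                left += 1
            if right - left + 1 > threshold:
                alerts.add(ip)
                break
    return alerts
```

In production the same logic would run as a streaming rule or a scheduled query over the audit table, but the window arithmetic is identical.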
Practice & test
Use the Quick Test at the bottom of this page to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.
Next steps
- Extend your schema to include data lineage event types.
- Add integrity checks (hash/sign) for log files in object storage.
- Document your incident response playbook that references audit queries.
Mini challenge
Make your audit logs harder to tamper with: propose a design using versioned storage, object locking (WORM if available), and a separate security account that owns the log bucket/table. Write 3-4 sentences describing how expiration-only lifecycle rules (objects age out on schedule but cannot be deleted manually) and restricted policies prevent deletion while still allowing lawful retention expiry.