
Audit Logging Basics

Learn Audit Logging Basics for free, with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

Data Engineers, Analytics Engineers, Platform Engineers, and anyone responsible for data access, pipelines, and compliance in data platforms.

Prerequisites

  • Basic SQL (SELECT, WHERE, GROUP BY)
  • Familiarity with data platforms (data warehouse, data lake, or orchestration)
  • Basic understanding of IAM/permissions concepts

Learning path

  • Start: This lesson (concepts + examples)
  • Practice: Do the exercises and mini challenge
  • Validate: Take the Quick Test at the bottom of this page
  • Apply: Build a small audit log pipeline in your sandbox environment

Why this matters

Audit logging is your system’s security camera. It helps you:

  • Investigate incidents (e.g., who queried a sensitive table and when)
  • Prove compliance (retain access records for required periods)
  • Detect misuse (suspicious queries, mass downloads, off-hours access)
  • Debug pipelines (trace job runs end-to-end with correlation IDs)

Concept explained simply

An audit log is an append-only record of important events in your data platform. Each event should capture the 5W1H:

  • Who: user or service account
  • What: action taken (SELECT, INSERT, GRANT_ROLE, DELETE_OBJECT, RUN_TASK)
  • When: timestamp in UTC
  • Where: system and location (warehouse name, database, dataset, object path, source IP)
  • Why: optional reason/ticket/change request ID
  • How: auth method, tool/client, request/correlation ID

Minimum recommended fields (a sample table definition follows this list):

  • event_time_utc (ISO-8601)
  • actor (user/service)
  • action (verb)
  • resource (type + identifier)
  • result (SUCCESS/FAILURE + error code)
  • request_id (to correlate steps)
  • source_ip and user_agent (where applicable)
  • impact metrics (row_count, bytes_read)
  • environment tags (prod/stage) and project/team
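
As a concrete sketch, these fields could map to a warehouse table like the one below. The names and types are illustrative choices, not a standard; adapt them to your platform. The flattened column names match the queries used later in this lesson.
Sample schema (audit events table)
CREATE TABLE sec.audit_events (
  event_time_utc    TIMESTAMP    NOT NULL,  -- always UTC
  actor_type        VARCHAR(16)  NOT NULL,  -- 'user' or 'service'
  actor_id          VARCHAR(128) NOT NULL,
  action            VARCHAR(64)  NOT NULL,  -- verb: 'SELECT', 'PUT_OBJECT', ...
  resource_type     VARCHAR(32)  NOT NULL,  -- 'table', 'object', 'task', ...
  resource_id       VARCHAR(512) NOT NULL,
  result_status     VARCHAR(16)  NOT NULL,  -- 'SUCCESS' or 'FAILURE'
  result_error_code VARCHAR(64),            -- populated on failure
  request_id        VARCHAR(128),           -- correlates multi-step processes
  source_ip         VARCHAR(45),            -- IPv4 or IPv6 in text form
  user_agent        VARCHAR(256),
  row_count         BIGINT,                 -- impact metrics
  bytes_read        BIGINT,
  env_stage         VARCHAR(16),            -- 'prod', 'stage', ...
  project           VARCHAR(64)
);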

Mental model

Think of audit logging as a flight recorder (black box) for your data platform. It continuously records key signals, is hard to tamper with, and is easy to replay for investigation.

Core ingredients of a good audit logging setup

  • Policy: what to log, retention periods, who can read logs, and how to respond to alerts.
  • Producers: your warehouse, lake, orchestration tool, and IAM system emitting events.
  • Transport: reliable delivery (e.g., streaming or batch) with retries.
  • Storage: append-only, versioned, ideally immutable or write-once (a grants sketch follows this list).
  • Index/Query: a way to search events quickly (SQL table or log index).
  • Integrity: time sync (NTP), hashing/signing, and access controls.
  • Privacy: avoid logging secrets/PII; mask or tokenize where needed.
  • Observability: dashboards and alerts for risky patterns.
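
As a sketch of the storage and integrity points above, plain SQL grants can make the events table append-only for producers and read-only for investigators. The role names here are illustrative:
Sample grants (append-only access)
-- Nobody updates or deletes through this path
REVOKE ALL ON sec.audit_events FROM PUBLIC;
-- Producers may only append events
GRANT INSERT ON sec.audit_events TO log_writer;
-- Investigators may only read
GRANT SELECT ON sec.audit_events TO security_analyst;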

Worked examples

Example 1: Data warehouse query logs

  1. Enable query and access logging.
  2. Land logs into a central schema (e.g., sec.audit_events).
  3. Normalize key fields; ensure UTC timestamps.
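
Step 3 could look roughly like the statement below. It assumes a hypothetical raw landing table raw.query_logs with local-time timestamps; AT TIME ZONE is PostgreSQL-style syntax, so adjust for your dialect:
INSERT INTO sec.audit_events (event_time_utc, actor_type, actor_id, action,
                              resource_type, resource_id, result_status)
SELECT query_start_time AT TIME ZONE 'UTC',  -- normalize timestamps to UTC
       'user',
       user_name,
       statement_type,                       -- e.g. 'SELECT'
       'table',
       object_name,
       CASE WHEN error_code IS NULL THEN 'SUCCESS' ELSE 'FAILURE' END
FROM raw.query_logs;
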
Sample event (warehouse query)
{
  "event_time_utc": "2026-01-08T11:22:33Z",
  "actor": {"type": "user", "id": "alice"},
  "action": "SELECT",
  "resource": {"type": "table", "id": "prod.analytics.pii_customers"},
  "result": {"status": "SUCCESS"},
  "request_id": "3f7b-98c1",
  "client": {"tool": "sql_workbench", "source_ip": "203.0.113.42"},
  "impact": {"rows": 120, "bytes_read": 5242880},
  "env": {"stage": "prod"}
}
Query: Who accessed PII in the last 7 days?
SELECT actor_id, COUNT(*) AS hits
FROM sec.audit_events
WHERE action = 'SELECT'
  AND resource_type = 'table'
  AND resource_id LIKE '%pii%'
  AND event_time_utc >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY actor_id
ORDER BY hits DESC;

Example 2: Data lake object access logs

Log reads/writes/deletes to lake objects. Track object path, method, status, bytes, and client.

Sample event (lake access)
{
  "event_time_utc": "2026-01-08T10:05:00Z",
  "actor": {"type": "service", "id": "etl_loader"},
  "action": "PUT_OBJECT",
  "resource": {"type": "object", "id": "prod/raw/sales/2026/01/08/part-0001.parquet"},
  "result": {"status": "SUCCESS", "http": 200},
  "request_id": "7c2d-44aa",
  "client": {"source_ip": "198.51.100.10"},
  "impact": {"bytes_written": 7340032},
  "env": {"stage": "prod"}
}
Query: Detect unexpected deletes
SELECT event_time_utc, actor_id, resource_id
FROM sec.audit_events
WHERE action = 'DELETE_OBJECT'
  AND env_stage = 'prod'
  AND event_time_utc >= CURRENT_DATE - INTERVAL '1' DAY
ORDER BY event_time_utc DESC;
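
A related misuse check, sketched under the assumption that reads are logged as GET_OBJECT with bytes_read populated, flags actors pulling unusually large volumes. The 10 GiB threshold is illustrative:
Query: Spot mass downloads (last 24 hours)
SELECT actor_id, SUM(bytes_read) AS total_bytes
FROM sec.audit_events
WHERE action = 'GET_OBJECT'
  AND env_stage = 'prod'
  AND event_time_utc >= CURRENT_DATE - INTERVAL '1' DAY
GROUP BY actor_id
HAVING SUM(bytes_read) > 10737418240  -- 10 GiB, tune to your baseline
ORDER BY total_bytes DESC;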

Example 3: Orchestrated pipeline audit trail

Emit one event per task with the same request_id to trace the run end-to-end.

Sample events (pipeline)
{
  "event_time_utc": "2026-01-08T01:00:05Z",
  "actor": {"type": "service", "id": "scheduler"},
  "action": "RUN_TASK",
  "resource": {"type": "task", "id": "daily_sales.extract"},
  "result": {"status": "SUCCESS", "duration_ms": 53000},
  "request_id": "run-2026-01-08-01:00",
  "env": {"stage": "prod"}
}
{
  "event_time_utc": "2026-01-08T01:02:10Z",
  "actor": {"type": "service", "id": "scheduler"},
  "action": "RUN_TASK",
  "resource": {"type": "task", "id": "daily_sales.load"},
  "result": {"status": "FAILURE", "error_code": "LOAD-403"},
  "request_id": "run-2026-01-08-01:00",
  "env": {"stage": "prod"}
}
Query: Find failing tasks by request_id
SELECT request_id, resource_id AS task, result_status, result_error_code
FROM sec.audit_events
WHERE action = 'RUN_TASK' AND result_status = 'FAILURE'
ORDER BY event_time_utc DESC;
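
Because every task in a run shares the same request_id, you can also replay one run end-to-end in order:
Query: Trace a single run
SELECT event_time_utc, resource_id AS task, result_status, result_error_code
FROM sec.audit_events
WHERE action = 'RUN_TASK'
  AND request_id = 'run-2026-01-08-01:00'
ORDER BY event_time_utc;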

Design checklist

  • Use UTC timestamps and synchronized clocks.
  • Make logs append-only; restrict delete permissions.
  • Enable versioning/immutability or write-once storage if available.
  • Define a standard schema for actors, actions, resources, and results.
  • Include request/correlation IDs across systems.
  • Mask secrets; avoid logging credentials or raw PII (a redaction sketch follows this list).
  • Set retention (e.g., 90 days hot, multi-year archive if required).
  • Index key fields for fast queries (time, actor, action, resource).
  • Document who can access logs and for what purpose.
  • Automate alerts for risk patterns (mass export, repeated failures, role grants).
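
For the masking item, here is a minimal redaction sketch. It uses PostgreSQL-flavored REGEXP_REPLACE, and both the raw.query_logs table and the naive pattern are illustrative; real scrubbing needs patterns tuned to your platform:
SELECT REGEXP_REPLACE(
         query_text,
         '(password|token|secret)\s*=\s*\S+',  -- naive secret pattern
         '\1=[REDACTED]',
         'gi'                                  -- global, case-insensitive
       ) AS safe_query_text
FROM raw.query_logs;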

Exercises

Do these in a scratch environment or on paper if tools are unavailable.

  1. ex1 — Design a minimal audit event schema:
    • Create the field list for a warehouse SELECT log with required types and an example JSON event.
  2. ex2 — Write queries to detect risky access patterns:
    • Given a generic sec.audit_events table, find bursts of failed logins and out-of-hours PII access.

Common mistakes and self-check

  • Mistake: Logging everything without a schema. Fix: Define a stable schema and required fields.
  • Mistake: No correlation IDs. Fix: Add request_id to connect multi-step processes.
  • Mistake: Logs are deletable by admins. Fix: Use write-once or versioned storage; separate admin roles.
  • Mistake: Storing secrets in logs. Fix: Mask tokens/passwords; scrub query text if needed.
  • Mistake: No retention policy. Fix: Set hot and archive tiers and review regularly.
  • Mistake: No monitoring. Fix: Add dashboards and alerts for key patterns.

Self-check: Can you answer within minutes: who accessed a sensitive table last week, which tasks failed in a pipeline run, and who changed permissions yesterday?
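
The first two questions are covered by the worked examples above. The third could look like the sketch below, assuming role changes are logged with actions such as GRANT_ROLE and REVOKE_ROLE (the latter is an assumed action name):
Query: Who changed permissions yesterday?
SELECT event_time_utc, actor_id, action, resource_id
FROM sec.audit_events
WHERE action IN ('GRANT_ROLE', 'REVOKE_ROLE')
  AND event_time_utc >= CURRENT_DATE - INTERVAL '1' DAY
ORDER BY event_time_utc DESC;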

Practical projects

  • Build a small audit log table and load sample events from warehouse, lake, and pipeline.
  • Create 3 saved queries: PII access, role changes, and mass export detection.
  • Add an alert that triggers when more than 5 failed logins occur from the same IP in 10 minutes (a query sketch follows this list).
  • Implement masking rules that redact secrets in log payloads.
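
A minimal sketch of the failed-login alert from the list above, assuming login attempts are recorded with an action named LOGIN (not defined in this lesson's schema, so adapt the name) and run on a short schedule:
Query: Failed-login bursts by IP
SELECT source_ip, COUNT(*) AS failures
FROM sec.audit_events
WHERE action = 'LOGIN'
  AND result_status = 'FAILURE'
  AND event_time_utc >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
GROUP BY source_ip
HAVING COUNT(*) > 5;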

Practice & test

Use the Quick Test at the bottom of this page to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.

Next steps

  • Extend your schema to include data lineage event types.
  • Add integrity checks (hash/sign) for log files in object storage (a digest sketch follows this list).
  • Document your incident response playbook that references audit queries.
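
As a starting point for the integrity item, one simple, hedged approach is to compute a daily digest over the events table and store it somewhere the logging account cannot write. MD5 and STRING_AGG are PostgreSQL-flavored, so swap in your dialect's equivalents and prefer a stronger hash where available:
Query: Daily tamper-evidence digest
SELECT CAST(event_time_utc AS DATE) AS day,
       MD5(STRING_AGG(COALESCE(request_id, '') || '|' || actor_id || '|' || action,
                      ',' ORDER BY event_time_utc)) AS digest
FROM sec.audit_events
GROUP BY CAST(event_time_utc AS DATE);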

Mini challenge

Make your audit logs harder to tamper with: propose a design using versioned storage, object locking (WORM if available), and a separate security account that owns the log bucket/table. Write 3-4 sentences describing how expiry-only lifecycle rules and restricted policies prevent manual deletion while still allowing lawful retention expiry.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

You are enabling query audit logs for a production warehouse. Define a minimal, consistent JSON event schema for SELECT statements. Include field names, data types, example values, and one full example event. Ensure the schema supports correlation across systems.

Expected Output
A clear field list covering actor, action, resource, timestamp (UTC), result, request_id, client info, impact metrics, and environment tags, plus a valid example JSON event.

Audit Logging Basics — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
