Who this is for
Data Engineers, Analytics Engineers, Platform Engineers, and anyone responsible for data access, pipelines, and compliance in data platforms.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY)
- Familiarity with data platforms (data warehouse, data lake, or orchestration)
- Basic understanding of IAM/permissions concepts
Learning path
- Start: This lesson (concepts + examples)
- Practice: Do the exercises and mini challenge
- Validate: Take the Quick Test at the bottom of this page
- Apply: Build a small audit log pipeline in your sandbox environment
Why this matters
Audit logging is your system’s security camera. It helps you:
- Investigate incidents (e.g., who queried a sensitive table and when)
- Prove compliance (retain access records for required periods)
- Detect misuse (suspicious queries, mass downloads, off-hours access)
- Debug pipelines (trace job runs end-to-end with correlation IDs)
Concept explained simply
An audit log is an append-only record of important events in your data platform. Each event should capture the 5W1H:
- Who: user or service account
- What: action taken (SELECT, INSERT, GRANT_ROLE, DELETE_OBJECT, RUN_JOB)
- When: timestamp in UTC
- Where: system and location (warehouse name, database, dataset, object path, source IP)
- Why: optional reason/ticket/change request ID
- How: auth method, tool/client, request/correlation ID
Minimum recommended fields:
- event_time_utc (ISO-8601)
- actor (user/service)
- action (verb)
- resource (type + identifier)
- result (SUCCESS/FAILURE + error code)
- request_id (to correlate steps)
- source_ip and user_agent (where applicable)
- impact metrics (row_count, bytes_read)
- environment tags (prod/stage) and project/team
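The field list above can be sketched as a small event builder. This is a minimal illustration, not a standard schema: the function name, defaults, and nesting (matching the sample events later in this lesson) are assumptions you should adapt to your platform.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_time_utc", "actor", "action", "resource", "result", "request_id"}

def make_audit_event(actor, action, resource, result, request_id,
                     source_ip=None, user_agent=None, impact=None, env=None):
    """Build a minimal audit event covering the recommended fields.

    Optional fields are included only when provided, so events stay compact.
    """
    event = {
        "event_time_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "actor": actor,          # e.g. {"type": "user", "id": "alice"}
        "action": action,        # verb: SELECT, PUT_OBJECT, RUN_TASK, ...
        "resource": resource,    # e.g. {"type": "table", "id": "prod.analytics.orders"}
        "result": result,        # e.g. {"status": "SUCCESS"}
        "request_id": request_id,
    }
    if source_ip or user_agent:
        event["client"] = {k: v for k, v in
                           [("source_ip", source_ip), ("user_agent", user_agent)] if v}
    if impact:
        event["impact"] = impact   # e.g. {"rows": 120, "bytes_read": 5242880}
    if env:
        event["env"] = env         # e.g. {"stage": "prod"}
    return event
```

Keeping required fields mandatory as function parameters makes it impossible to emit an event that is missing the 5W1H basics.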
Mental model
Think of audit logging as a flight recorder (black box) for your data platform. It continuously records key signals, is hard to tamper with, and is easy to replay for investigation.
Core ingredients of a good audit logging setup
- Policy: what to log, retention periods, who can read logs, and how to respond to alerts.
- Producers: your warehouse, lake, orchestration tool, and IAM system emitting events.
- Transport: reliable delivery (e.g., streaming or batch) with retries.
- Storage: append-only, versioned, ideally immutable or write-once.
- Index/Query: a way to search events quickly (SQL table or log index).
- Integrity: time sync (NTP), hashing/signing, and access controls.
- Privacy: avoid logging secrets/PII; mask or tokenize where needed.
- Observability: dashboards and alerts for risky patterns.
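The integrity ingredient (hashing/signing) can be illustrated with a hash chain: each record's digest covers its payload plus the previous record's digest, so editing or deleting any record breaks every later link. This is a sketch of the idea using SHA-256, not a full signing scheme; production setups usually combine it with write-once storage and key-based signatures.

```python
import hashlib
import json

def chain_events(events):
    """Append a hash chain to a list of event dicts: each record's hash
    covers its own payload plus the previous record's hash."""
    prev = "0" * 64  # genesis value for the first record
    chained = []
    for ev in events:
        payload = json.dumps(ev, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        chained.append({**ev, "prev_hash": prev, "hash": digest})
        prev = digest
    return chained

def verify_chain(chained):
    """Recompute every link; return False on the first mismatch."""
    prev = "0" * 64
    for rec in chained:
        ev = {k: v for k, v in rec.items() if k not in ("prev_hash", "hash")}
        payload = json.dumps(ev, sort_keys=True)
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256((prev + payload).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

Any in-place tampering is detectable because the attacker would have to recompute every subsequent hash, which write-once storage prevents.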
Worked examples
Example 1: Data warehouse query logs
- Enable query and access logging.
- Land logs into a central schema (e.g., sec.audit_events).
- Normalize key fields; ensure UTC timestamps.
Sample event (warehouse query)
{
"event_time_utc": "2026-01-08T11:22:33Z",
"actor": {"type": "user", "id": "alice"},
"action": "SELECT",
"resource": {"type": "table", "id": "prod.analytics.pii_customers"},
"result": {"status": "SUCCESS"},
"request_id": "3f7b-98c1",
"client": {"tool": "sql_workbench", "source_ip": "203.0.113.42"},
"impact": {"rows": 120, "bytes_read": 5242880},
"env": {"stage": "prod"}
}
Query: Who accessed PII in the last 7 days?
SELECT actor_id, COUNT(*) AS hits
FROM sec.audit_events
WHERE action = 'SELECT'
AND resource_type = 'table'
AND resource_id LIKE '%pii%'
AND event_time_utc >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY actor_id
ORDER BY hits DESC;
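The query above can be tried end-to-end with an in-memory SQLite stand-in for sec.audit_events. The flattened column names match the query; the sample rows are invented for illustration, and the 7-day date filter is omitted here because the demo timestamps are fixed.

```python
import sqlite3

# In-memory stand-in for sec.audit_events with flattened columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_events (
        event_time_utc TEXT, actor_id TEXT, action TEXT,
        resource_type TEXT, resource_id TEXT, result_status TEXT
    )""")
conn.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-01-08T11:22:33Z", "alice", "SELECT", "table",
         "prod.analytics.pii_customers", "SUCCESS"),
        ("2026-01-07T09:00:00Z", "alice", "SELECT", "table",
         "prod.analytics.pii_customers", "SUCCESS"),
        ("2026-01-06T15:30:00Z", "bob", "SELECT", "table",
         "prod.analytics.orders", "SUCCESS"),
    ])

# Same shape as the PII-access query above, minus the date filter.
rows = conn.execute("""
    SELECT actor_id, COUNT(*) AS hits
    FROM audit_events
    WHERE action = 'SELECT'
      AND resource_type = 'table'
      AND resource_id LIKE '%pii%'
    GROUP BY actor_id
    ORDER BY hits DESC
""").fetchall()
```

Here `rows` contains only alice, because bob's query touched a non-PII table.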
Example 2: Data lake object access logs
Log reads/writes/deletes to lake objects. Track object path, method, status, bytes, and client.
Sample event (lake access)
{
"event_time_utc": "2026-01-08T10:05:00Z",
"actor": {"type": "service", "id": "etl_loader"},
"action": "PUT_OBJECT",
"resource": {"type": "object", "id": "prod/raw/sales/2026/01/08/part-0001.parquet"},
"result": {"status": "SUCCESS", "http": 200},
"request_id": "7c2d-44aa",
"client": {"source_ip": "198.51.100.10"},
"impact": {"bytes_written": 7340032},
"env": {"stage": "prod"}
}
Query: Detect unexpected deletes
SELECT event_time_utc, actor_id, resource_id
FROM sec.audit_events
WHERE action = 'DELETE_OBJECT'
AND env_stage = 'prod'
AND event_time_utc >= CURRENT_DATE - INTERVAL '1' DAY
ORDER BY event_time_utc DESC;
Example 3: Orchestrated pipeline audit trail
Emit one event per task with the same request_id to trace the run end-to-end.
Sample events (pipeline)
{
"event_time_utc": "2026-01-08T01:00:05Z",
"actor": {"type": "service", "id": "scheduler"},
"action": "RUN_TASK",
"resource": {"type": "task", "id": "daily_sales.extract"},
"result": {"status": "SUCCESS", "duration_ms": 53000},
"request_id": "run-2026-01-08-01:00",
"env": {"stage": "prod"}
}
{
"event_time_utc": "2026-01-08T01:02:10Z",
"actor": {"type": "service", "id": "scheduler"},
"action": "RUN_TASK",
"resource": {"type": "task", "id": "daily_sales.load"},
"result": {"status": "FAILURE", "error_code": "LOAD-403"},
"request_id": "run-2026-01-08-01:00",
"env": {"stage": "prod"}
}
Query: Find failing tasks by request_id
SELECT request_id, resource_id AS task, result_status, result_error_code
FROM sec.audit_events
WHERE action = 'RUN_TASK' AND result_status = 'FAILURE'
ORDER BY event_time_utc DESC;
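The same correlation logic can be sketched in Python: group RUN_TASK events by request_id and report which runs contain failures. The flattened field names (`resource_id`, `result_status`) mirror the query above and are assumptions about your normalized schema.

```python
from collections import defaultdict

def failed_runs(events):
    """Group RUN_TASK events by request_id and report runs that contain
    at least one failure, mapped to the failing task ids."""
    runs = defaultdict(list)
    for ev in events:
        if ev["action"] == "RUN_TASK":
            runs[ev["request_id"]].append(ev)
    report = {}
    for run_id, tasks in runs.items():
        failures = [t["resource_id"] for t in tasks
                    if t["result_status"] == "FAILURE"]
        if failures:
            report[run_id] = failures
    return report
```

Because every task in a run shares one request_id, a single lookup yields the whole run's history, including which step broke.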
Design checklist
- Use UTC timestamps and synchronized clocks.
- Make logs append-only; restrict delete permissions.
- Enable versioning/immutability or write-once storage if available.
- Define a standard schema for actors, actions, resources, and results.
- Include request/correlation IDs across systems.
- Mask secrets; avoid logging credentials or raw PII.
- Set retention (e.g., 90 days hot, multi-year archive if required).
- Index key fields for fast queries (time, actor, action, resource).
- Document who can access logs and for what purpose.
- Automate alerts for risk patterns (mass export, repeated failures, role grants).
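The "mask secrets" item in the checklist can be sketched as a regex redaction pass over log payloads. The patterns here are illustrative assumptions; tune them to your platform's actual token and credential formats before relying on them.

```python
import re

# Illustrative patterns only -- adapt to your real credential formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|token|api_key)\s*[=:]\s*\S+"),
    re.compile(r"Bearer\s+[A-Za-z0-9._-]+"),
]

def redact(text, placeholder="[REDACTED]"):
    """Replace likely credentials in a log payload with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Run redaction at the producer or transport stage, before events reach storage, so secrets never land in the append-only log at all.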
Exercises
Do these in a scratch environment or on paper if tools are unavailable.
- ex1 — Design a minimal audit event schema:
  - Create the field list for a warehouse SELECT log with required types and an example JSON event.
- ex2 — Write queries to detect risky access patterns:
  - Given a generic sec.audit_events table, find bursts of failed logins and out-of-hours PII access.
Common mistakes and self-check
- Mistake: Logging everything without a schema. Fix: Define a stable schema and required fields.
- Mistake: No correlation IDs. Fix: Add request_id to connect multi-step processes.
- Mistake: Logs are deletable by admins. Fix: Use write-once or versioned storage; separate admin roles.
- Mistake: Storing secrets in logs. Fix: Mask tokens/passwords; scrub query text if needed.
- Mistake: No retention policy. Fix: Set hot and archive tiers and review regularly.
- Mistake: No monitoring. Fix: Add dashboards and alerts for key patterns.
Self-check: Can you answer within minutes: who accessed a sensitive table last week, which tasks failed in a pipeline run, and who changed permissions yesterday?
Practical projects
- Build a small audit log table and load sample events from warehouse, lake, and pipeline.
- Create 3 saved queries: PII access, role changes, and mass export detection.
- Add an alert that triggers when more than 5 failed logins occur from the same IP in 10 minutes.
- Implement masking rules that redact secrets in log payloads.
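The failed-login alert project above (more than 5 failures from the same IP within 10 minutes) can be sketched as a sliding-window check. The event shape (ISO timestamp, IP pairs) and the thresholds are assumptions matching the project description.

```python
from datetime import datetime, timedelta

def login_burst_alerts(events, threshold=5, window_minutes=10):
    """Flag source IPs with more than `threshold` failed logins inside
    any sliding window of `window_minutes`.

    `events` is an iterable of (iso_timestamp, source_ip) tuples for
    failed login attempts.
    """
    window = timedelta(minutes=window_minutes)
    by_ip = {}
    for ts, ip in events:
        by_ip.setdefault(ip, []).append(datetime.fromisoformat(ts))
    alerts = set()
    for ip, times in by_ip.items():
        times.sort()
        left = 0
        for right in range(len(times)):
            # shrink the window until it spans at most window_minutes
            while times[right] - times[left] > window:
                left += 1
            if right - left + 1 > threshold:
                alerts.add(ip)
                break
    return alerts
```

In production the same logic would run as a streaming rule or a scheduled query over the audit table, but the window arithmetic is identical.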
Practice & test
Use the Quick Test at the bottom of this page to check your understanding. The test is available to everyone; only logged-in users will have their progress saved.
Next steps
- Extend your schema to include data lineage event types.
- Add integrity checks (hash/sign) for log files in object storage.
- Document your incident response playbook that references audit queries.
Mini challenge
Make your audit logs harder to tamper with: propose a design using versioned storage, object locking (WORM if available), and a separate security account that owns the log bucket/table. Write 3-4 sentences describing how expiration-only lifecycle rules (objects age out on schedule but cannot be deleted manually) and restricted policies prevent deletion while still allowing lawful retention expiry.