Drift Detection Basics for Infrastructure as Code: a free lesson for Data Platform Engineers

Why this matters

In a data platform, small manual tweaks can quietly break pipelines, change costs, or weaken security. Drift detection catches differences between your Infrastructure as Code (IaC) and what actually runs in the cloud. As a Data Platform Engineer, you rely on this to prevent surprises like disabled S3 versioning, altered IAM policies, or resized warehouses that spike costs.

Reliability: Stop breaking changes before they hit production.
Security: Catch out-of-band policy edits or open network rules.
Cost control: Detect unplanned scaling or service upgrades.
Auditability: Keep the platform aligned with approved, reviewed code.

Concept explained simply

Drift = difference between your declared configuration (source of truth in code) and the real infrastructure state running in your cloud or platform.

Mental model

Think of IaC as a checklist for your platform. Each run compares the checklist to the real world:

No differences: you are aligned.
Differences found: this is drift. You must either fix the real world to match code or update the code intentionally.

Drift is not always bad—sometimes it is a necessary emergency change. But it must be surfaced, discussed, and codified or reverted.
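The checklist comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not how any IaC tool is implemented: the function name find_drift and the two flat dictionaries are hypothetical stand-ins for declared configuration and observed state.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (actual_value, declared_value)} for every mismatch.

    A setting missing on either side is reported with None for that side.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (have, want)
    return drift


# Hypothetical example: code declares a medium warehouse, reality is large.
declared = {"warehouse_size": "M", "retention_days": 7}
actual = {"warehouse_size": "L", "retention_days": 7}
print(find_drift(declared, actual))  # {'warehouse_size': ('L', 'M')}
```

An empty result means code and reality are aligned; any entry is drift that has to be reverted or codified.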

Core workflow for drift detection

Scan: Run your tool's plan/drift check regularly (e.g., Terraform plan with refresh, CloudFormation drift detection, Pulumi preview).
Classify: For each change, decide: revert to code, or accept change and update code.
Validate: Test in a safe environment before production.
Remediate: Apply changes from code. Avoid direct console edits unless break-glass is required.
Prevent: Add guardrails (reviews, policies) and automate scheduled drift checks.

Signals to watch

Plan output shows additions, modifications, or deletions.
Special exit codes (e.g., terraform plan -detailed-exitcode exits with 2 when the plan succeeded and contains changes).
CI job summaries that flag drift.
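For a CI summary, one option is to read the machine-readable plan instead of the human-readable text. The sketch below assumes the plan was saved and exported with terraform show -json, whose output contains a resource_changes list where each entry has a change.actions array. The function name summarize_plan and the sample document are hypothetical; the sample is hand-written for illustration and is not real tool output.

```python
import json
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count resources per planned action, skipping entries that change nothing."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        counts["+".join(actions)] += 1
    return counts


# Hand-written sample in the shape of `terraform show -json <planfile>`.
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket_versioning.dl", "change": {"actions": ["update"]}},
    {"address": "aws_iam_policy.reader", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.dl", "change": {"actions": ["no-op"]}}
  ]
}
""")
print(dict(summarize_plan(sample)))  # {'update': 2}
```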

Worked examples

Example 1: S3 versioning drift

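Scenario: Your Terraform code sets versioning status to Enabled on a data lake bucket. Someone suspended it in the console (once enabled, S3 versioning cannot be fully disabled, only suspended).

# Plan excerpt (Terraform)
~ resource "aws_s3_bucket_versioning" "dl" {
    ~ versioning_configuration {
        ~ status = "Suspended" -> "Enabled"
      }
  }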

Action: Classify as critical (data recoverability). Revert by applying code to re-enable versioning. Document why it happened and add a policy against manual changes.

Example 2: IAM policy loosened

Scenario: An inline policy was edited to allow s3:* on *. Your IaC specifies least-privilege read-only.

# Plan excerpt shows policy diff (truncated, simplified)
# "-" is what exists now (the drift); "+" is what the code will restore
  effect    = "Allow"
~ actions   = [
    - "s3:*",
    + "s3:GetObject",
  ]

Action: Security-sensitive drift. Revert immediately via code. Investigate root cause and enforce approvals.

Example 3: Warehouse size bumped

Scenario: A Snowflake (or similar) warehouse is resized manually from M to L, increasing costs.

Action: Review whether this was a temporary need. If temporary, revert to M via code. If permanent, update code and justify cost.

Example 4 (bonus): Kafka topic retention changed

Scenario: Topic retention increased from 7d to 30d in the UI. IaC still declares 7d.

Action: Check storage costs and downstream usage. Decide to revert or codify 30d, then apply via code.

How to perform drift detection (step-by-step)

Prepare environment: ensure access to the same state and account as your deployments.
Run a read-only check: use your tool's plan/preview with refresh to compare code to reality.
Interpret results: identify adds, changes, deletes. Note risk (security, data loss, cost).
Decide: revert or codify. Involve stakeholders for ambiguous cases.
Apply safely: promote via dev → stage → prod with approvals.
Automate: schedule periodic checks and alert on drift.
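The interpret and decide steps can be made repeatable with a small lookup table. The sketch below is one possible shape, assuming risk is keyed by Terraform resource type. The mapping RISK_BY_TYPE, the risk labels, and the suggested actions are hypothetical and would need to reflect your own platform and runbook.

```python
# Hypothetical mapping from resource type to the main risk its drift carries.
RISK_BY_TYPE = {
    "aws_iam_policy": "security",
    "aws_security_group": "security",
    "aws_s3_bucket_versioning": "data-loss",
    "snowflake_warehouse": "cost",
}


def classify(resource_type: str) -> tuple[str, str]:
    """Return (risk, suggested_action) for a drifted resource type."""
    risk = RISK_BY_TYPE.get(resource_type, "unclassified")
    if risk in {"security", "data-loss"}:
        action = "revert via code"
    else:
        # Cost and unknown drift needs a human decision first.
        action = "review: revert or codify"
    return risk, action


for rtype in ("aws_iam_policy", "snowflake_warehouse", "kafka_topic"):
    print(rtype, classify(rtype))
```

Keeping the table in version control means changes to the decision flow are themselves reviewed.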

Terraform specifics (common)

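Use: terraform plan -refresh-only to see how real infrastructure has diverged from the recorded state without proposing resource changes; terraform apply -refresh-only then accepts those differences into state.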
Use: terraform plan -detailed-exitcode for CI signals (0 no changes, 2 changes, 1 error).
Never apply console fixes silently; update code first.
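The exit-code convention above can be wrapped for use in a scheduler or CI step. This is a minimal sketch, assuming terraform is on the PATH and the working directory is already initialised. The function names and the status labels IN_SYNC, DRIFT, and ERROR are hypothetical choices, not part of Terraform.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
STATUS_BY_EXIT_CODE = {0: "IN_SYNC", 2: "DRIFT"}


def interpret_exit_code(code: int) -> str:
    """Map a plan exit code to a status label; anything unexpected is an error."""
    return STATUS_BY_EXIT_CODE.get(code, "ERROR")


def check_drift(workdir: str) -> str:
    """Run a plan (never an apply) in workdir and report the status."""
    try:
        result = subprocess.run(
            ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The terraform binary or the working directory is missing.
        return "ERROR"
    return interpret_exit_code(result.returncode)


print(interpret_exit_code(2))  # DRIFT
```

Calling check_drift("path/to/stack") only plans, so it does not change infrastructure.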

Exercises

Work through these now, then compare your answers against the worked examples above.

Exercise 1: Read a plan and classify drift

Given this plan excerpt, list the drifted settings and decide whether to revert or codify:

~ aws_s3_bucket_versioning.dl
  status: "Suspended" -> "Enabled"
~ aws_iam_policy.reader
  actions: ["s3:*"] -> ["s3:GetObject"]
Plan: 0 to add, 2 to change, 0 to destroy.

What changed?
Risk level?
Revert now or codify?

Exercise 2: Exit-code logic for automation

Write a small shell snippet that runs a plan with detailed exit codes and prints one of: NO_DRIFT, DRIFT_DETECTED, or ERROR. Do not apply changes.

Hints:

Use a plan with -detailed-exitcode.
Exit code 2 indicates changes detected.
Ensure non-zero exit codes propagate in CI.

Checklist: Drift detection readiness

We can run a read-only plan against each environment.
We understand the plan output and exit codes.
There is a documented decision flow: revert vs codify.
CI runs scheduled drift checks and alerts on drift.
Sensitive resources (IAM, networking, encryption) get priority review.
Break-glass changes are tracked and back-ported to code within a set SLA.

Common mistakes and how to self-check

Mistake: Applying fixes in the console. Self-check: Can you point to a PR that explains the change? If not, it is risky.
Mistake: Ignoring exit codes. Self-check: Does exit code 2 raise an alert in CI, and has the team agreed whether detected drift blocks deploys or only notifies?
Mistake: Treating all drift as urgent. Self-check: Classify by security, data loss, and cost impact.
Mistake: Forgetting state refresh. Self-check: Run a refresh-only or equivalent to ensure state matches reality before deciding.
Mistake: Letting destroy operations go unnoticed. Self-check: Scan plan output for destroy and replace operations and require approvals for them.
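The destroy self-check can be automated as an approval gate. The sketch below assumes a plan exported with terraform show -json, where each entry in resource_changes carries a change.actions array and a replacement appears as a delete paired with a create. The function name destroy_addresses and the sample plan are hypothetical, and the sample is hand-written rather than real output.

```python
def destroy_addresses(plan: dict) -> list[str]:
    """List addresses of resources the plan would delete or replace."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]


# Hand-written sample in the shape of `terraform show -json <planfile>`.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.tmp", "change": {"actions": ["delete"]}},
        {"address": "aws_iam_policy.reader", "change": {"actions": ["update"]}},
        {"address": "aws_instance.etl", "change": {"actions": ["delete", "create"]}},
    ]
}

to_destroy = destroy_addresses(sample_plan)
if to_destroy:
    print("Approval required for:", ", ".join(to_destroy))
```

A CI job could fail or request review whenever the returned list is non-empty.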

Practical projects

Project 1: Scheduled drift checker. Set up a daily plan/preview that posts a summary and exit code. Include severity tags.
Project 2: Drift runbook. Write a one-page decision tree for revert vs codify, including examples and required reviewers.
Project 3: Guardrails. Add policy or codeowners rules to block changes to sensitive resources without approval.

Who this is for

Data Platform Engineers managing cloud resources via IaC.
Engineers owning data lakes, warehouses, streaming, or orchestration platforms.
SREs and Platform Engineers supporting data teams.

Prerequisites

Basic IaC knowledge (Terraform, CloudFormation, or Pulumi).
Access to a test environment and state backend.
Familiarity with your platform resources (S3/object storage, IAM, warehouses, clusters).

Learning path

Drift Detection Basics (this lesson)
Automating Drift Checks in CI
Policy-as-Code for Preventing Unapproved Changes
Incident Playbooks for High-Risk Drift
Cost-Aware Remediation Strategies

Mini challenge

Pick one production-critical resource (e.g., data lake bucket). Identify two drifts that would be high risk and describe exactly how your pipeline would detect them, who is paged, and how you would revert or codify within 24 hours.

Next steps

Automate a scheduled read-only plan in your CI.
Define severity labels and SLAs for drift types.
Share the runbook with your team and run a tabletop exercise.

Quick test

Use the quick test to confirm you can interpret plan outputs and choose the right remediation.
