Topic 5 of 8

Drift Detection And Remediation

Learn Drift Detection And Remediation for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Mental model

Think of IaC as a contract. The code is the contract text, the cloud is the real building. Drift means the building changed without updating the contract. You either restore the building to match the contract or revise the contract to match the approved change.

Core workflow for drift handling

  1. Detect: Run read-only checks (plan/refresh/drift scan). Do not change anything yet.
  2. Triage: Classify drift as safe, risky, or unknown. Identify who made the change and why.
  3. Decide source of truth: Should code win, or should we adopt the live change? Document the decision.
  4. Remediate:
    • If code wins: apply the plan to revert live to code.
    • If live wins: update code and state (commit and re-apply or import).
  5. Prevent recurrence: Add guardrails, monitoring, and runbooks.
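The triage and decide steps above can be sketched as a small decision helper. This is an illustrative sketch only: the risk categories, function names, and rules are assumptions for the example, not part of any specific tool.

```python
# Sketch of steps 2-3: classify detected drift, then pick which side wins.
# The category names and rules below are illustrative assumptions.

RISKY_TYPES = {"security_group_rule", "iam_policy", "bucket_versioning"}

def triage(resource_type, change_approved):
    """Classify drift as 'risky', 'safe', or 'unknown'."""
    if resource_type in RISKY_TYPES:
        return "risky"
    return "safe" if change_approved else "unknown"

def decide(category, live_change_is_intentional):
    """Return the source of truth: 'code' (revert live) or 'live' (adopt into code)."""
    if category == "risky":
        return "code"   # risky manual changes get reverted
    if live_change_is_intentional:
        return "live"   # adopt the approved change into code
    return "code"       # default: code wins

print(triage("security_group_rule", change_approved=False))  # risky
print(decide("safe", live_change_is_intentional=True))       # live
```

In practice the classification rules would come from your own policy list; the point is that the decision is made and recorded before any apply runs.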

What "read-only" means in practice
  • Terraform: terraform plan -refresh-only (shows how live state differs from recorded state without proposing changes), or a regular terraform plan, which refreshes state before diffing.
  • CloudFormation: Stack drift detection.
  • Kubernetes: Compare manifests to live using kubectl diff or GitOps status screens.
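Conceptually, every read-only check above does the same thing: compare desired state to live state and report mismatches without writing anything. A minimal sketch, with made-up resource fields:

```python
# A read-only drift check is just a comparison of desired vs live state;
# nothing is mutated. The field names here are illustrative.

desired = {"replicas": 3, "tags": {"Environment": "prod"}}
live    = {"replicas": 6, "tags": {"Environment": "prod"}}

def drift_report(desired, live):
    """Return {field: (desired_value, live_value)} for every mismatched field."""
    return {k: (desired[k], live.get(k))
            for k in desired if desired[k] != live.get(k)}

print(drift_report(desired, live))  # {'replicas': (3, 6)}
```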

Worked examples

Example 1: Terraform security group rule added manually
  1. Detect: Run terraform plan -refresh-only. You see a change showing an unexpected ingress rule (e.g., port 22 from 0.0.0.0/0).
  2. Triage: This is risky (opens SSH globally). Investigate. Likely a manual console hotfix.
  3. Decide source of truth: Code should win. We will remove the manual rule.
  4. Remediate: Run terraform apply with the generated plan to revert the rule.
  5. Prevent: Add a policy-as-code rule or review to prevent wide-open SSH, and enable alerts for SG changes.
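The prevention step above can be sketched as a tiny policy-as-code check. The rule shape (a dict with `from_port`, `to_port`, `cidr_blocks`) is an assumption for illustration; real policy engines like OPA or Sentinel express this differently.

```python
# Sketch of a policy check for step 5: flag security group rules that
# open SSH (port 22) to the whole internet. Rule shape is an assumption.

def violates_ssh_policy(rule):
    """True if the rule's port range covers 22 and allows 0.0.0.0/0."""
    return (rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
            and "0.0.0.0/0" in rule.get("cidr_blocks", []))

bad  = {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}
good = {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]}
print(violates_ssh_policy(bad), violates_ssh_policy(good))  # True False
```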
Example 2: CloudFormation stack shows drift on an S3 bucket
  1. Detect: Run a stack drift detection. It reports that BucketVersioning is disabled live but enabled in the template.
  2. Triage: Disabling versioning risks data recovery; classify as high-risk.
  3. Decide: Code wins; we want versioning on.
  4. Remediate: Update the stack (no template change needed) to re-apply desired settings. Confirm drift clears.
  5. Prevent: Enable a control to block bucket policy/versioning changes outside the pipeline.
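The triage step in this example hinges on knowing which drifted properties are high-risk. A sketch of that lookup, with an assumed (not official) list of property names:

```python
# Sketch of step 2's triage for CloudFormation drift: map a drifted
# property to a severity. The property list is an illustrative assumption.

HIGH_RISK_PROPERTIES = {
    "VersioningConfiguration",        # data recovery at stake
    "BucketPolicy",                   # access control at stake
    "PublicAccessBlockConfiguration", # exposure at stake
}

def classify_drift(drifted_property):
    """Return 'high-risk' for known dangerous properties, else 'review'."""
    return "high-risk" if drifted_property in HIGH_RISK_PROPERTIES else "review"

print(classify_drift("VersioningConfiguration"))  # high-risk
print(classify_drift("Tags"))                     # review
```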
Example 3: Kubernetes deployment replicas changed manually
  1. Detect: Someone scaled a Deployment from 3 to 6 replicas via kubectl scale. Run kubectl diff -f ./manifests or check your GitOps dashboard; it shows live=6 vs desired=3.
  2. Triage: Change might be an emergency scale-up. Check incident notes.
  3. Decide: If demand is still high, update the manifest to 6 (adopt live). If not, revert to 3 (code wins).
  4. Remediate: Either commit replicas: 6 and re-sync, or apply the original manifest to return to 3.
  5. Prevent: Use GitOps to auto-reconcile and document a temporary override process.
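Step 3's adopt-or-revert choice for the replica drift can be written down as a tiny helper, which doubles as runbook documentation. The signature and action strings are illustrative assumptions:

```python
# Sketch of step 3 for the replica drift: decide whether code or live wins.
# Return value and wording are illustrative, not from any real tool.

def remediation(desired, live, demand_still_high):
    """Return (replica_count_to_keep, action_description)."""
    if live != desired and demand_still_high:
        return live, "adopt: commit the new replica count, then re-sync"
    return desired, "revert: re-apply the original manifest"

replicas, action = remediation(3, 6, demand_still_high=True)
print(replicas, "->", action)  # keeps 6 and adopts it into code
```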

Who this is for

  • Platform Engineers maintaining cloud or Kubernetes platforms
  • DevOps/Infra Engineers migrating to IaC-managed environments
  • SREs responsible for reliability and change control

Prerequisites

  • Basic understanding of an IaC tool (e.g., Terraform, CloudFormation, or Pulumi)
  • Familiarity with code review and Git workflows
  • Access to a sandbox environment (cloud or local K8s) for safe practice

Learning path

  1. Identify drift signals: Learn where drift shows up in your tool (plan outputs, drift reports, GitOps status).
  2. Practice read-only detection: Run refresh/plan/diff safely and interpret diffs.
  3. Decide code vs live: Build a simple decision tree with risk criteria.
  4. Apply remediation: Revert to code or adopt changes into code and state.
  5. Prevent: Add guardrails, monitoring, and a short runbook.

Hands-on exercises

These are available to everyone. Progress is saved for logged-in users.

Exercise 1: Terraform drift — detect and remediate

Goal: Detect a manual tag change and revert it using code.

  1. Create a simple Terraform config that defines a storage bucket with tags (e.g., Environment=prod).
  2. Simulate drift: assume someone changed the tag live to Environment=dev (e.g., via the console).
  3. Run a read-only check: terraform plan -refresh-only and review the diff.
  4. Decide: Code wins. Plan and apply to restore the original tag.
  5. Re-run plan to confirm no changes remain.
Expected output (sample, from the step 4 plan; resource type is generic)
# terraform plan
~ resource "storage_bucket" "app" {
    ~ tags = {
        ~ Environment = "dev" -> "prod"
      }
  }

Plan: 0 to add, 1 to change, 0 to destroy.
Hints
  • Use a separate workspace or backend to avoid touching real prod.
  • If a resource was created live but not in code, consider terraform import before deciding.
Show solution
  1. Run terraform init.
  2. Run terraform plan -refresh-only and confirm tag drift.
  3. Run terraform apply to restore tags to Environment=prod.
  4. Run terraform plan again: it should show No changes.
  5. Add a policy or pre-commit check to block non-approved tag changes.

Exercise 2: Kubernetes drift — replicas mismatch

Goal: Detect a manual scale and choose whether to adopt or revert.

  1. Create a Deployment manifest with replicas: 3.
  2. Simulate live drift: assume someone ran kubectl scale deployment app --replicas=5.
  3. Run kubectl diff -f ./manifests (or compare GitOps desired vs live) to see the mismatch.
  4. Decide: If load is normal, revert to 3; otherwise, update the manifest to 5 and commit.
  5. Apply the chosen remediation and verify no diff remains.
Expected output (sample)
diff -u ...
-  replicas: 3
+  replicas: 5
Hints
  • Always update the manifest if you decide to keep the new replica count.
  • Document temporary overrides with an expiry time.
Show solution
  1. Run kubectl diff to confirm drift.
  2. Option A (revert): kubectl apply -f ./manifests to go back to 3.
  3. Option B (adopt): Change manifest to 5, commit, and apply/reconcile.
  4. Verify: kubectl get deploy app -o jsonpath='{.spec.replicas}' matches desired.

Self-check checklist

  • I can run a read-only drift check without changing resources.
  • I can explain why a particular drift is risky or safe.
  • I can choose when code should win vs when to adopt live changes.
  • I can remediate and verify that no drift remains.
  • I have a simple runbook for future drift events.

Common mistakes

  • Applying immediately after detecting drift: Always capture evidence first. Use read-only checks before any apply.
  • Adopting live changes without updating code: This guarantees the same drift returns later.
  • Ignoring state imports: Live-only resources should be imported or removed to keep state clean.
  • Over-automation of remediation: Auto-fixing everything can break urgent hotfixes. Start with alerts, then targeted auto-remediation.
  • No ownership: Not documenting decisions causes team confusion. Record who approved code vs live.
How to self-check
  • Can you reproduce the drift report after remediation? If yes, something is still off.
  • Is there a Git commit that reflects the decision? If not, adoption wasn’t completed.
  • Do policies or alerts exist to prevent the same drift? If not, add them.

Practical projects

  • Daily drift report: Schedule a read-only plan/drift check and send a summary to your team channel.
  • Targeted auto-remediation: Automatically revert high-risk drifts (e.g., public ingress) while only alerting on low-risk ones.
  • Drift runbook: One-page guide with triage categories, decision tree, and escalation contacts.
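For the daily drift report project above, the core piece is turning raw findings into a short team-channel message. A minimal sketch, where the findings format is an assumption (in practice it would be parsed from plan or drift-detection output):

```python
# Sketch for the "daily drift report" project: summarize drift findings
# into one line per resource. The input format is an illustrative assumption.

def summarize(findings):
    """Format a list of drift findings as a short channel message."""
    if not findings:
        return "Drift report: no drift detected."
    lines = ["Drift report: {} resource(s) drifted".format(len(findings))]
    for f in findings:
        lines.append("- {} [{}]: {} -> {}".format(
            f["resource"], f["risk"], f["desired"], f["live"]))
    return "\n".join(lines)

findings = [
    {"resource": "aws_security_group.app", "risk": "high",
     "desired": "no 0.0.0.0/0:22", "live": "ingress 0.0.0.0/0:22"},
]
print(summarize(findings))
```

Scheduling this (cron, CI pipeline) and posting to a chat webhook are left to your platform's conventions.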

Next steps

  • Introduce policy-as-code to block risky manual changes.
  • Adopt GitOps for Kubernetes so drift auto-reconciles to desired state.
  • Review access controls to reduce console-based edits in sensitive environments.

Mini challenge

Scenario: A database instance size was changed live for a traffic spike. Your code still declares the smaller size.

  • Decide: Do you adopt the larger size or revert?
  • Write 3 bullet points for your decision (risk, cost, performance).
  • Describe the remediation steps and the commit message you would use.
Possible answer

Adopt temporarily: Update code to the larger instance now, schedule a cost review, and open a follow-up task to rightsize in a week. Commit message: "Adopt live DB size to handle spike; add review date."


Drift Detection And Remediation — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

