Why this matters
Mental model
Think of IaC as a contract. The code is the contract text, the cloud is the real building. Drift means the building changed without updating the contract. You either restore the building to match the contract or revise the contract to match the approved change.
Core workflow for drift handling
- Detect: Run read-only checks (plan/refresh/drift scan). Do not change anything yet.
- Triage: Classify drift as safe, risky, or unknown. Identify who made the change and why.
- Decide source of truth: Should code win, or should we adopt the live change? Document the decision.
- Remediate:
  - If code wins: apply the plan to revert live to code.
  - If live wins: update code and state (commit and re-apply or import).
- Prevent recurrence: Add guardrails, monitoring, and runbooks.
What "read-only" means in practice
- Terraform: `terraform plan -refresh-only`, or `terraform plan` with a remote/state refresh.
- CloudFormation: Stack drift detection.
- Kubernetes: Compare manifests to live using `kubectl diff` or GitOps status screens (commands for all three are sketched below).
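A minimal command sketch of these read-only checks; the stack name `app-stack` and the `./manifests` path are placeholders, and the AWS CLI is assumed for CloudFormation:

```bash
# Terraform: a refresh-only plan reports drift without proposing or applying changes.
terraform plan -refresh-only

# CloudFormation: kick off stack drift detection (results are read in a second call).
aws cloudformation detect-stack-drift --stack-name app-stack

# Kubernetes: compare manifests on disk to the live cluster; nothing is applied.
kubectl diff -f ./manifests
```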
Worked examples
Example 1: Terraform security group rule added manually
- Detect: Run `terraform plan -refresh-only`. You see a change showing an unexpected ingress rule (e.g., port 22 from 0.0.0.0/0).
- Triage: This is risky (opens SSH globally). Investigate. Likely a manual console hotfix.
- Decide source of truth: Code should win. We will remove the manual rule.
- Remediate: Run `terraform apply` with the generated plan to revert the rule (a command sketch follows).
- Prevent: Add a policy-as-code rule or review to prevent wide-open SSH, and enable alerts for security group changes.
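A minimal sketch of the revert, assuming the security group is already defined in your configuration; the plan file name is a placeholder:

```bash
# Capture evidence first: a read-only view of what changed outside Terraform.
terraform plan -refresh-only

# Code wins: save the remediation plan so the reviewed plan is exactly what gets applied.
terraform plan -out=revert-sg.tfplan

# Apply the saved plan to remove the manually added ingress rule.
terraform apply revert-sg.tfplan

# Verify: a follow-up plan should report no changes.
terraform plan
```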
Example 2: CloudFormation stack shows drift on an S3 bucket
- Detect: Run stack drift detection (commands sketched below). It reports that `BucketVersioning` is disabled live but enabled in the template.
- Triage: Disabling versioning risks data recovery; classify as high-risk.
- Decide: Code wins; we want versioning on.
- Remediate: Update the stack (no template change needed) to re-apply desired settings. Confirm drift clears.
- Prevent: Enable a control to block bucket policy/versioning changes outside the pipeline.
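A sketch of the detect-and-verify steps with the AWS CLI; the stack name `app-stack` is a placeholder:

```bash
# Start drift detection (asynchronous) and capture its ID.
DETECTION_ID=$(aws cloudformation detect-stack-drift \
  --stack-name app-stack \
  --query StackDriftDetectionId --output text)

# Check whether detection has finished and what the overall stack status is.
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DETECTION_ID"

# List per-resource drift details, e.g. the bucket whose versioning was disabled.
aws cloudformation describe-stack-resource-drifts \
  --stack-name app-stack \
  --stack-resource-drift-status-filters MODIFIED DELETED
```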
Example 3: Kubernetes deployment replicas changed manually
- Detect: Someone scaled a Deployment from 3 to 6 replicas via `kubectl scale`. Run `kubectl diff -f ./manifests` (sketched below) or check your GitOps dashboard; it shows live=6 vs desired=3.
- Triage: The change might be an emergency scale-up. Check incident notes.
- Decide: If demand is still high, update the manifest to 6 (adopt live). If not, revert to 3 (code wins).
- Remediate: Either commit replicas: 6 and re-sync, or apply the original manifest to return to 3.
- Prevent: Use GitOps to auto-reconcile and document a temporary override process.
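A quick command-line check of the mismatch, assuming the Deployment is named `app` and the manifests live in `./manifests`:

```bash
# Read-only comparison of the manifests on disk against the live cluster.
kubectl diff -f ./manifests

# Inspect the live replica count directly.
kubectl get deploy app -o jsonpath='{.spec.replicas}'

# If code wins, re-apply the manifests to return to 3 replicas;
# if live wins, edit the manifest to 6, commit, and let GitOps reconcile.
kubectl apply -f ./manifests
```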
Who this is for
- Platform Engineers maintaining cloud or Kubernetes platforms
- DevOps/Infra Engineers migrating to IaC-managed environments
- SREs responsible for reliability and change control
Prerequisites
- Basic understanding of an IaC tool (e.g., Terraform, CloudFormation, or Pulumi)
- Familiarity with code review and Git workflows
- Access to a sandbox environment (cloud or local K8s) for safe practice
Learning path
- Identify drift signals: Learn where drift shows up in your tool (plan outputs, drift reports, GitOps status).
- Practice read-only detection: Run refresh/plan/diff safely and interpret diffs.
- Decide code vs live: Build a simple decision tree with risk criteria.
- Apply remediation: Revert to code or adopt changes into code and state.
- Prevent: Add guardrails, monitoring, and a short runbook.
Hands-on exercises
Exercise 1: Terraform drift — detect and remediate
Goal: Detect a manual tag change and revert it using code.
- Create a simple Terraform config that defines a storage bucket with tags (e.g., `Environment=prod`).
- Simulate drift by imagining someone changed the tag live to `Environment=dev` (a sandbox sketch for creating real drift follows this list).
- Run a read-only check: `terraform plan -refresh-only` and review the diff.
- Decide: Code wins. Plan and apply to restore the original tag.
- Re-run plan to confirm no changes remain.
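If you would rather create the drift for real than imagine it, here is a minimal sandbox sketch. It assumes the AWS provider and AWS CLI; the bucket name `drift-demo-app-example` and region are placeholders, and other clouds need a different resource type:

```bash
mkdir -p drift-demo && cd drift-demo

# A tiny config: one bucket whose Environment tag is managed by code.
cat > main.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "app" {
  bucket = "drift-demo-app-example"   # placeholder; bucket names must be globally unique
  tags = {
    Environment = "prod"
  }
}
EOF

terraform init
terraform apply -auto-approve

# Simulate out-of-band drift: change the tag outside Terraform.
aws s3api put-bucket-tagging \
  --bucket drift-demo-app-example \
  --tagging 'TagSet=[{Key=Environment,Value=dev}]'

# Read-only detection: a refresh-only plan never changes infrastructure.
terraform plan -refresh-only
```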
Expected output (sample)
# terraform plan   (remediation plan: code wins)
  ~ resource "storage_bucket" "app" {
      ~ tags = {
          - Environment = "dev"
          + Environment = "prod"
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Hints
- Use a separate workspace or backend to avoid touching real prod.
- If a resource was created live but not in code, consider `terraform import` before deciding (see the example after these hints).
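A sketch of that adoption path, assuming a bucket named `drift-demo-app-example` that exists live and a matching `aws_s3_bucket "app"` block already added to the code (both names are placeholders):

```bash
# Bring the live bucket under Terraform management; this writes state only
# and does not modify the bucket itself.
terraform import aws_s3_bucket.app drift-demo-app-example

# Then confirm code and live agree before any apply.
terraform plan
```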
Show solution
- Run `terraform init`.
- Run `terraform plan -refresh-only` and confirm the tag drift.
- Run `terraform apply` to restore the tag to `Environment=prod`.
- Run `terraform plan` again: it should show `No changes.`
- Add a policy or pre-commit check to block non-approved tag changes.
Exercise 2: Kubernetes drift — replicas mismatch
Goal: Detect a manual scale and choose whether to adopt or revert.
- Create a Deployment manifest with `replicas: 3`.
- Simulate live drift by imagining someone ran `kubectl scale deployment app --replicas=5` (a sandbox sketch follows this list).
- Run `kubectl diff -f ./manifests` (or compare GitOps desired vs live) to see the mismatch.
- Decide: If load is normal, revert to 3; otherwise, update the manifest to 5 and commit.
- Apply the chosen remediation and verify no diff remains.
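To create the drift for real in a sandbox cluster, a minimal sketch; it assumes a local cluster (kind, minikube, etc.) and uses a placeholder image:

```bash
mkdir -p manifests

# Desired state: 3 replicas, managed by the manifest.
cat > manifests/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx:1.27   # placeholder image
EOF

kubectl apply -f ./manifests

# Simulate out-of-band drift.
kubectl scale deployment app --replicas=5

# Read-only detection: shows live vs desired without changing anything.
kubectl diff -f ./manifests
```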
Expected output (sample)
diff -u ...
-   replicas: 5
+   replicas: 3

Hints
- Always update the manifest if you decide to keep the new replica count.
- Document temporary overrides with an expiry time.
Show solution
- Run `kubectl diff` to confirm drift.
- Option A (revert): `kubectl apply -f ./manifests` to go back to 3.
- Option B (adopt): Change the manifest to 5, commit, and apply/reconcile.
- Verify: `kubectl get deploy app -o jsonpath='{.spec.replicas}'` matches desired.
Self-check checklist
- I can run a read-only drift check without changing resources.
- I can explain why a particular drift is risky or safe.
- I can choose when code should win vs when to adopt live changes.
- I can remediate and verify that no drift remains.
- I have a simple runbook for future drift events.
Common mistakes
- Applying immediately after detecting drift: Always capture evidence first. Use read-only checks before any apply.
- Adopting live changes without updating code: This guarantees the same drift returns later.
- Ignoring state imports: Live-only resources should be imported or removed to keep state clean.
- Over-automation of remediation: Auto-fixing everything can break urgent hotfixes. Start with alerts, then targeted auto-remediation.
- No ownership: Not documenting decisions causes team confusion. Record who approved code vs live.
How to self-check
- Can you reproduce the drift report after remediation? If yes, something is still off.
- Is there a Git commit that reflects the decision? If not, adoption wasn’t completed.
- Do policies or alerts exist to prevent the same drift? If not, add them.
Practical projects
- Daily drift report: Schedule a read-only plan/drift check and send a summary to your team channel (a script sketch follows this list).
- Targeted auto-remediation: Automatically revert high-risk drifts (e.g., public ingress) while only alerting on low-risk ones.
- Drift runbook: One-page guide with triage categories, decision tree, and escalation contacts.
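A minimal sketch of the daily report as a cron-able script; the working directory and `SLACK_WEBHOOK_URL` are placeholders, and `-detailed-exitcode` makes `terraform plan` exit with 2 when drift (pending changes) is present:

```bash
#!/usr/bin/env bash
# daily-drift-report.sh — read-only drift check; never applies anything.
set -u

cd /srv/infra/terraform   # placeholder: path to your Terraform repo checkout

terraform plan -detailed-exitcode -no-color > plan.txt
status=$?

if [ "$status" -eq 2 ]; then
  message="Drift detected by terraform plan; review today's plan output."
elif [ "$status" -eq 0 ]; then
  message="No drift detected today."
else
  message="Drift check failed to run; investigate the pipeline."
fi

# Post the one-line summary to a team channel (placeholder webhook URL).
curl -sS -X POST -H 'Content-Type: application/json' \
  -d "{\"text\": \"${message}\"}" \
  "${SLACK_WEBHOOK_URL}"
```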
Next steps
- Introduce policy-as-code to block risky manual changes.
- Adopt GitOps for Kubernetes so drift auto-reconciles to desired state.
- Review access controls to reduce console-based edits in sensitive environments.
Mini challenge
Scenario: A database instance size was changed live for a traffic spike. Your code still declares the smaller size.
- Decide: Do you adopt the larger size or revert?
- Write 3 bullet points for your decision (risk, cost, performance).
- Describe the remediation steps and the commit message you would use.
Possible answer
Adopt temporarily: Update code to the larger instance now, schedule a cost review, and open a follow-up task to rightsize in a week. Commit message: "Adopt live DB size to handle spike; add review date."