Why this matters
Mental model
Think of IaC as a contract. The code is the contract text, the cloud is the real building. Drift means the building changed without updating the contract. You either restore the building to match the contract or revise the contract to match the approved change.
Core workflow for drift handling
- Detect: Run read-only checks (plan/refresh/drift scan). Do not change anything yet.
- Triage: Classify drift as safe, risky, or unknown. Identify who made the change and why.
- Decide source of truth: Should code win, or should we adopt the live change? Document the decision.
- Remediate:
  - If code wins: apply the plan to revert live to code.
  - If live wins: update code and state (commit and re-apply or import).
- Prevent recurrence: Add guardrails, monitoring, and runbooks.
What "read-only" means in practice
- Terraform: `terraform plan -refresh-only`, or `terraform plan` with a remote/state refresh.
- CloudFormation: Stack drift detection.
- Kubernetes: Compare manifests to live using `kubectl diff` or GitOps status screens (commands for all three are sketched below).
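A minimal command sketch of these read-only checks; the stack name `app-stack` and the `./manifests` path are placeholders, and the AWS CLI is assumed for CloudFormation:

```bash
# Terraform: a refresh-only plan reports drift without proposing or applying changes.
terraform plan -refresh-only

# CloudFormation: kick off stack drift detection (results are read in a second call).
aws cloudformation detect-stack-drift --stack-name app-stack

# Kubernetes: compare manifests on disk to the live cluster; nothing is applied.
kubectl diff -f ./manifests
```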
Worked examples
Example 1: Terraform security group rule added manually
- Detect: Run `terraform plan -refresh-only`. You see a change showing an unexpected ingress rule (e.g., port 22 from 0.0.0.0/0).
- Triage: This is risky (opens SSH globally). Investigate. Likely a manual console hotfix.
- Decide source of truth: Code should win. We will remove the manual rule.
- Remediate: Run `terraform apply` with the generated plan to revert the rule (a command sketch follows).
- Prevent: Add a policy-as-code rule or review to prevent wide-open SSH, and enable alerts for security group changes.
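A minimal sketch of the revert, assuming the security group is already defined in your configuration; the plan file name is a placeholder:

```bash
# Capture evidence first: a read-only view of what changed outside Terraform.
terraform plan -refresh-only

# Code wins: save the remediation plan so the reviewed plan is exactly what gets applied.
terraform plan -out=revert-sg.tfplan

# Apply the saved plan to remove the manually added ingress rule.
terraform apply revert-sg.tfplan

# Verify: a follow-up plan should report no changes.
terraform plan
```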
Example 2: CloudFormation stack shows drift on an S3 bucket
- Detect: Run stack drift detection (commands sketched below). It reports that `BucketVersioning` is disabled live but enabled in the template.
- Triage: Disabling versioning risks data recovery; classify as high-risk.
- Decide: Code wins; we want versioning on.
- Remediate: Update the stack (no template change needed) to re-apply desired settings. Confirm drift clears.
- Prevent: Enable a control to block bucket policy/versioning changes outside the pipeline.
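A sketch of the detect-and-verify steps with the AWS CLI; the stack name `app-stack` is a placeholder:

```bash
# Start drift detection (asynchronous) and capture its ID.
DETECTION_ID=$(aws cloudformation detect-stack-drift \
  --stack-name app-stack \
  --query StackDriftDetectionId --output text)

# Check whether detection has finished and what the overall stack status is.
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DETECTION_ID"

# List per-resource drift details, e.g. the bucket whose versioning was disabled.
aws cloudformation describe-stack-resource-drifts \
  --stack-name app-stack \
  --stack-resource-drift-status-filters MODIFIED DELETED
```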
Example 3: Kubernetes deployment replicas changed manually
- Detect: Someone scaled a Deployment from 3 to 6 replicas via `kubectl scale`. Run `kubectl diff -f ./manifests` (sketched below) or check your GitOps dashboard; it shows live=6 vs desired=3.
- Triage: The change might be an emergency scale-up. Check incident notes.
- Decide: If demand is still high, update the manifest to 6 (adopt live). If not, revert to 3 (code wins).
- Remediate: Either commit replicas: 6 and re-sync, or apply the original manifest to return to 3.
- Prevent: Use GitOps to auto-reconcile and document a temporary override process.
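A quick command-line check of the mismatch, assuming the Deployment is named `app` and the manifests live in `./manifests`:

```bash
# Read-only comparison of the manifests on disk against the live cluster.
kubectl diff -f ./manifests

# Inspect the live replica count directly.
kubectl get deploy app -o jsonpath='{.spec.replicas}'

# If code wins, re-apply the manifests to return to 3 replicas;
# if live wins, edit the manifest to 6, commit, and let GitOps reconcile.
kubectl apply -f ./manifests
```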
Who this is for
- Platform Engineers maintaining cloud or Kubernetes platforms
- DevOps/Infra Engineers migrating to IaC-managed environments
- SREs responsible for reliability and change control
Prerequisites
- Basic understanding of an IaC tool (e.g., Terraform, CloudFormation, or Pulumi)
- Familiarity with code review and Git workflows
- Access to a sandbox environment (cloud or local K8s) for safe practice
Learning path
- Identify drift signals: Learn where drift shows up in your tool (plan outputs, drift reports, GitOps status).
- Practice read-only detection: Run refresh/plan/diff safely and interpret diffs.
- Decide code vs live: Build a simple decision tree with risk criteria.
- Apply remediation: Revert to code or adopt changes into code and state.
- Prevent: Add guardrails, monitoring, and a short runbook.
Hands-on exercises
Exercise 1: Terraform drift — detect and remediate
Goal: Detect a manual tag change and revert it using code.
- Create a simple Terraform config that defines a storage bucket with tags (e.g., `Environment=prod`).
- Simulate drift by imagining someone changed the tag live to `Environment=dev` (a sandbox sketch for creating real drift follows this list).
- Run a read-only check: `terraform plan -refresh-only` and review the diff.
- Decide: Code wins. Plan and apply to restore the original tag.
- Re-run plan to confirm no changes remain.
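If you would rather create the drift for real than imagine it, here is a minimal sandbox sketch. It assumes the AWS provider and AWS CLI; the bucket name `drift-demo-app-example` and region are placeholders, and other clouds need a different resource type:

```bash
mkdir -p drift-demo && cd drift-demo

# A tiny config: one bucket whose Environment tag is managed by code.
cat > main.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "app" {
  bucket = "drift-demo-app-example"   # placeholder; bucket names must be globally unique
  tags = {
    Environment = "prod"
  }
}
EOF

terraform init
terraform apply -auto-approve

# Simulate out-of-band drift: change the tag outside Terraform.
aws s3api put-bucket-tagging \
  --bucket drift-demo-app-example \
  --tagging 'TagSet=[{Key=Environment,Value=dev}]'

# Read-only detection: a refresh-only plan never changes infrastructure.
terraform plan -refresh-only
```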
Expected output (sample)
# terraform plan   (remediation plan: code wins)
  ~ resource "storage_bucket" "app" {
      ~ tags = {
          - Environment = "dev"
          + Environment = "prod"
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Hints
- Use a separate workspace or backend to avoid touching real prod.
- If a resource was created live but not in code, consider `terraform import` before deciding (see the example after these hints).
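A sketch of that adoption path, assuming a bucket named `drift-demo-app-example` that exists live and a matching `aws_s3_bucket "app"` block already added to the code (both names are placeholders):

```bash
# Bring the live bucket under Terraform management; this writes state only
# and does not modify the bucket itself.
terraform import aws_s3_bucket.app drift-demo-app-example

# Then confirm code and live agree before any apply.
terraform plan
```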
Show solution
- Run `terraform init`.
- Run `terraform plan -refresh-only` and confirm the tag drift.
- Run `terraform apply` to restore the tag to `Environment=prod`.
- Run `terraform plan` again: it should show `No changes.`
- Add a policy or pre-commit check to block non-approved tag changes.
Exercise 2: Kubernetes drift — replicas mismatch
Goal: Detect a manual scale and choose whether to adopt or revert.
- Create a Deployment manifest with `replicas: 3`.
- Simulate live drift by imagining someone ran `kubectl scale deployment app --replicas=5` (a sandbox sketch follows this list).
- Run `kubectl diff -f ./manifests` (or compare GitOps desired vs live) to see the mismatch.
- Decide: If load is normal, revert to 3; otherwise, update the manifest to 5 and commit.
- Apply the chosen remediation and verify no diff remains.
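To create the drift for real in a sandbox cluster, a minimal sketch; it assumes a local cluster (kind, minikube, etc.) and uses a placeholder image:

```bash
mkdir -p manifests

# Desired state: 3 replicas, managed by the manifest.
cat > manifests/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx:1.27   # placeholder image
EOF

kubectl apply -f ./manifests

# Simulate out-of-band drift.
kubectl scale deployment app --replicas=5

# Read-only detection: shows live vs desired without changing anything.
kubectl diff -f ./manifests
```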
Expected output (sample)
diff -u ...
-   replicas: 5
+   replicas: 3

Hints
- Always update the manifest if you decide to keep the new replica count.
- Document temporary overrides with an expiry time.
Show solution
- Run `kubectl diff` to confirm drift.
- Option A (revert): `kubectl apply -f ./manifests` to go back to 3.
- Option B (adopt): Change the manifest to 5, commit, and apply/reconcile.
- Verify: `kubectl get deploy app -o jsonpath='{.spec.replicas}'` matches desired.
Self-check checklist
- I can run a read-only drift check without changing resources.
- I can explain why a particular drift is risky or safe.
- I can choose when code should win vs when to adopt live changes.
- I can remediate and verify that no drift remains.
- I have a simple runbook for future drift events.
Common mistakes
- Applying immediately after detecting drift: Always capture evidence first. Use read-only checks before any apply.
- Adopting live changes without updating code: This guarantees the same drift returns later.
- Ignoring state imports: Live-only resources should be imported or removed to keep state clean.
- Over-automation of remediation: Auto-fixing everything can break urgent hotfixes. Start with alerts, then targeted auto-remediation.
- No ownership: Not documenting decisions causes team confusion. Record who approved code vs live.
How to self-check
- Can you reproduce the drift report after remediation? If yes, something is still off.
- Is there a Git commit that reflects the decision? If not, adoption wasn’t completed.
- Do policies or alerts exist to prevent the same drift? If not, add them.
Practical projects
- Daily drift report: Schedule a read-only plan/drift check and send a summary to your team channel (a script sketch follows this list).
- Targeted auto-remediation: Automatically revert high-risk drifts (e.g., public ingress) while only alerting on low-risk ones.
- Drift runbook: One-page guide with triage categories, decision tree, and escalation contacts.
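A minimal sketch of the daily report as a cron-able script; the working directory and `SLACK_WEBHOOK_URL` are placeholders, and `-detailed-exitcode` makes `terraform plan` exit with 2 when drift (pending changes) is present:

```bash
#!/usr/bin/env bash
# daily-drift-report.sh — read-only drift check; never applies anything.
set -u

cd /srv/infra/terraform   # placeholder: path to your Terraform repo checkout

terraform plan -detailed-exitcode -no-color > plan.txt
status=$?

if [ "$status" -eq 2 ]; then
  message="Drift detected by terraform plan; review today's plan output."
elif [ "$status" -eq 0 ]; then
  message="No drift detected today."
else
  message="Drift check failed to run; investigate the pipeline."
fi

# Post the one-line summary to a team channel (placeholder webhook URL).
curl -sS -X POST -H 'Content-Type: application/json' \
  -d "{\"text\": \"${message}\"}" \
  "${SLACK_WEBHOOK_URL}"
```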
Next steps
- Introduce policy-as-code to block risky manual changes.
- Adopt GitOps for Kubernetes so drift auto-reconciles to desired state.
- Review access controls to reduce console-based edits in sensitive environments.
Mini challenge
Scenario: A database instance size was changed live for a traffic spike. Your code still declares the smaller size.
- Decide: Do you adopt the larger size or revert?
- Write 3 bullet points for your decision (risk, cost, performance).
- Describe the remediation steps and the commit message you would use.
Possible answer
Adopt temporarily: Update code to the larger instance now, schedule a cost review, and open a follow-up task to rightsize in a week. Commit message: "Adopt live DB size to handle spike; add review date."