Why this matters
Infrastructure changes can improve reliability, security, and cost—but they can also cause outages if rushed. As a Platform Engineer using Infrastructure as Code, you need a repeatable, auditable process that keeps production safe while enabling teams to move fast.
- Real tasks you will face: rolling out network changes with zero downtime, upgrading Terraform modules safely, adjusting autoscaling without impacting traffic, and responding quickly when a change goes wrong.
- A solid change workflow reduces risk, speeds up reviews, and creates a clear paper trail for compliance.
What can go wrong without change management?
- Opening security groups too broadly, exposing services.
- Destroying or recreating stateful resources (databases, buckets) due to a tiny misconfiguration.
- Applying in the wrong environment or region.
- Cost spikes from unintended resource scaling.
Concept explained simply
Change management for infra is a lightweight set of steps to make changes safely. You propose a change (pull request), preview the impact (plan), get peer review and automation checks, apply in a controlled window, and verify results. If anything goes wrong, you roll back with a pre-agreed plan.
Mental model
Think of it as a three-gate flow: Propose → Prove → Promote.
- Propose: state the intent in a PR, with context, risk, and rollback.
- Prove: automation validates the change: format, validate, plan, security and cost checks, and a pre-prod test.
- Promote: controlled apply with monitoring and clear ownership.
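One way to picture the three gates is as an ordered list of checks where a failure stops everything after it. The sketch below is illustrative only: the `Gate` type, `run_gates`, and the dictionary keys such as `plan_reviewed` are hypothetical names, not part of any real pipeline tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    name: str                      # "Propose", "Prove", or "Promote"
    check: Callable[[dict], bool]  # returns True when the gate passes


def run_gates(change: dict, gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (all_passed, names_of_gates_that_passed).
    """
    passed: List[str] = []
    for gate in gates:
        if not gate.check(change):
            return False, passed
        passed.append(gate.name)
    return True, passed


# Hypothetical checks; a real pipeline would call CI jobs here instead.
GATES = [
    Gate("Propose", lambda c: bool(c.get("intent")) and bool(c.get("rollback"))),
    Gate("Prove", lambda c: bool(c.get("plan_reviewed")) and bool(c.get("policy_passed"))),
    Gate("Promote", lambda c: bool(c.get("approved"))),
]

change = {
    "intent": "Restrict SSH to the VPN CIDR",
    "rollback": "Revert the PR",
    "plan_reviewed": True,
    "policy_passed": False,  # the Prove gate fails on this
    "approved": True,
}
ok, passed = run_gates(change, GATES)
print(ok, passed)  # False ['Propose']
```

The point of the ordering is that an approval alone cannot promote a change whose automated checks have not passed.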
GitOps vs Ticket-Driven
- GitOps: PRs are the source of truth; merges trigger applies.
- Ticket-driven: a ticket references a PR, and the apply happens via a pipeline or in a change window.
Many teams combine both.
Core workflow (IaC change lifecycle)
- Open a PR (Propose)
  - Describe the intent, risk, blast radius, plan summary, and rollback steps.
  - Tag owners and environment.
- Pre-merge checks (Prove)
  - IaC hygiene: fmt, validate, lint.
  - Plan preview: terraform plan or change set.
  - Policy-as-code: security and compliance rules.
  - Cost estimate (rough) and drift detection.
  - Test in sandbox or staging.
- Approval
  - At least one peer approval for low-risk; two for high-risk or production.
  - Optional CAB for critical changes.
- Apply (Promote)
  - Use a controlled window if risk is high.
  - Apply via CI or controlled workflow with audit logs.
  - Announce start/end in team channel.
- Post-change verification
  - Health checks, logs, dashboards, SLOs.
  - Confirm cost expectations.
  - Close with a brief change summary.
- Rollback (if needed)
  - Reapply previous version or revert PR and re-run pipeline.
  - For stateful resources, use snapshots/point-in-time restore.
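The plan-preview check in the Prove step can be partly automated. Terraform can render a saved plan as JSON (`terraform show -json <planfile>`), where each entry in `resource_changes` has an `address`, a `type`, and a `change.actions` list. The sketch below flags destroys and replacements; the `STATEFUL_TYPES` set and the sample plan are assumptions made up for illustration.

```python
from typing import Dict, List

# Assumption: the resource types your team treats as stateful.
STATEFUL_TYPES = {"aws_db_instance", "aws_s3_bucket", "aws_ebs_volume"}


def destructive_changes(plan: Dict) -> List[Dict]:
    """Return changes whose actions include "delete" (a destroy or a replace)."""
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            findings.append({
                "address": rc["address"],
                # A replace is reported as both "delete" and "create".
                "kind": "replace" if "create" in actions else "destroy",
                "stateful": rc.get("type") in STATEFUL_TYPES,
            })
    return findings


# Invented sample shaped like the JSON plan output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.app", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.web_ssh", "type": "aws_security_group",
         "change": {"actions": ["update"]}},
        {"address": "aws_cloudwatch_log_group.old", "type": "aws_cloudwatch_log_group",
         "change": {"actions": ["delete"]}},
    ]
}

for finding in destructive_changes(sample_plan):
    print(finding)
```

A CI job could fail the PR whenever a finding has `stateful` set to true, forcing a human to confirm the replacement is intended.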
Risk rating rubric (quick)
- Low: tagging, adding outputs, docs, no production impact.
- Medium: ASG size, instance types, moderate IAM changes.
- High: database parameter changes, VPC routing, resource replacement.
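The rubric and the approval rule from the workflow can be encoded so that CI labels each PR consistently. This is a rough sketch: the mapping from resource types to risk levels is an assumption you would tune, and treating medium-risk non-production changes as needing one approval is also an assumption, since the rule above only covers low-risk, high-risk, and production.

```python
from typing import Iterable

# Assumed mappings from the rubric to Terraform resource types and attributes.
HIGH_RISK_TYPES = {"aws_db_parameter_group", "aws_route", "aws_route_table"}
MEDIUM_RISK_TYPES = {"aws_autoscaling_group", "aws_launch_template", "aws_iam_policy"}
LOW_RISK_ATTRS = {"tags", "tags_all"}


def rate_change(resource_type: str, actions: Iterable[str], changed_attrs: Iterable[str]) -> str:
    """Rate a single resource change as "low", "medium", or "high"."""
    changed = set(changed_attrs)
    if "delete" in actions:                 # destroy or replacement
        return "high"
    if changed and changed <= LOW_RISK_ATTRS:
        return "low"                        # tag-only edits
    if resource_type in HIGH_RISK_TYPES:
        return "high"
    # Known medium-risk types and anything unrecognized both get a human look.
    return "medium"


def required_approvals(risk: str, environment: str) -> int:
    """One approval by default; two for high-risk changes or production."""
    if risk not in {"low", "medium", "high"}:
        raise ValueError(f"unknown risk level: {risk}")
    return 2 if risk == "high" or environment == "prod" else 1


print(rate_change("aws_autoscaling_group", ["update"], ["max_size"]))  # medium
print(required_approvals("medium", "prod"))                             # 2
```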
Worked examples
Example 1: Tighten a security group rule
Intent: Restrict inbound SSH from 0.0.0.0/0 to a VPN CIDR.
- Propose: PR explains current risk, target CIDR, affected ASGs.
- Prove: Plan shows update to ingress cidr_blocks only; no resource recreation.
- Promote: Apply during low-traffic window; verify SSH from VPN works; confirm no external access.
- Rollback: Revert PR to previous CIDR if VPN issues occur.
Plan snippet
~ aws_security_group.web_ssh
    ingress.cidr_blocks: ["0.0.0.0/0"] -> ["10.10.0.0/16"]
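A policy-as-code check can keep this kind of rule from regressing. The sketch below scans the JSON form of a plan for security groups whose planned state still exposes port 22 to the world. It assumes inline `ingress` blocks on `aws_security_group` (rules defined as separate resources would need their own check), and the helper `plan_with` and its sample data are invented for the example.

```python
from typing import Dict, List

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_ssh_rules(plan: Dict) -> List[str]:
    """Return addresses of security groups whose planned ingress opens SSH to everyone."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        after = rc.get("change", {}).get("after") or {}
        for rule in after.get("ingress") or []:
            # Protocol "-1" means all traffic, which includes port 22.
            covers_ssh = rule.get("protocol") == "-1" or (
                (rule.get("from_port") or 0) <= 22 <= (rule.get("to_port") or 0)
            )
            cidrs = set(rule.get("cidr_blocks") or []) | set(rule.get("ipv6_cidr_blocks") or [])
            if covers_ssh and cidrs & OPEN_CIDRS:
                offenders.append(rc["address"])
                break
    return offenders


def plan_with(cidrs: List[str]) -> Dict:
    """Hypothetical helper: build a one-resource plan with the given SSH CIDRs."""
    return {"resource_changes": [{
        "address": "aws_security_group.web_ssh",
        "type": "aws_security_group",
        "change": {"actions": ["update"], "after": {"ingress": [
            {"from_port": 22, "to_port": 22, "protocol": "tcp", "cidr_blocks": cidrs},
        ]}},
    }]}


print(open_ssh_rules(plan_with(["0.0.0.0/0"])))     # ['aws_security_group.web_ssh']
print(open_ssh_rules(plan_with(["10.10.0.0/16"])))  # []
```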
Example 2: Increase DB instance size
Intent: Upgrade RDS from db.m5.large to db.m5.xlarge.
- Propose: Reason (CPU saturation), maintenance window, expected downtime (a brief failover with Multi-AZ), snapshot ID, rollback to previous class.
- Prove: Staging test with the same engine/version; performance test; plan shows only instance_class change.
- Promote: Apply in maintenance window; monitor replica lag and error rates.
- Rollback: Apply previous class or restore from snapshot if needed.
Plan snippet
~ aws_db_instance.app
    instance_class: "db.m5.large" -> "db.m5.xlarge"
    apply_immediately: true
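The claim "plan shows only instance_class change" can be verified mechanically rather than by eye. In the JSON form of a plan, each resource change carries `before` and `after` attribute maps; the sketch below diffs them and checks the result against an allow-list. The sample `db_change` values other than the instance classes are invented for the example.

```python
from typing import Dict, Iterable, Set


def changed_attributes(change: Dict) -> Set[str]:
    """Top-level attributes whose value differs between before and after."""
    before = change.get("before") or {}
    after = change.get("after") or {}
    # Values only known after apply are absent from "after", so they show up
    # as changed here, which errs on the side of caution.
    return {key for key in before.keys() | after.keys() if before.get(key) != after.get(key)}


def is_expected_update(change: Dict, allowed: Iterable[str]) -> bool:
    """True when the change is an in-place update touching only allowed attributes."""
    return change.get("actions") == ["update"] and changed_attributes(change) <= set(allowed)


# Invented sample shaped like one "change" object from the JSON plan.
db_change = {
    "actions": ["update"],
    "before": {"instance_class": "db.m5.large", "allocated_storage": 100, "apply_immediately": True},
    "after": {"instance_class": "db.m5.xlarge", "allocated_storage": 100, "apply_immediately": True},
}

print(changed_attributes(db_change))                      # {'instance_class'}
print(is_expected_update(db_change, {"instance_class"}))  # True
```

If the same check returns false in CI, the plan contains something the PR description did not promise, and the reviewer should ask why.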
Example 3: Upgrade a Terraform VPC module (breaking)
Intent: Move from module v2 to v3 with route table changes.
- Propose: Changelog summary, deprecations, expected replacements.
- Prove: Run in a sandbox; compare plans; split change into two PRs: a) add new resources alongside old (no traffic yet), b) switch traffic and remove old.
- Promote: Canary switch for one subnet, then full rollout.
- Rollback: Switch traffic back to old resources; keep old infra for 24 hours before destroy.
Safer sequence
- Introduce v3 under new names.
- Peer review plan for replacements.
- Cut traffic over with small scope first.
- Remove old only after checks pass.
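The cutover part of this sequence (small scope first, then the rest, with a way back) can be sketched as a loop over batches. Everything here is hypothetical: `switch`, `healthy`, and `switch_back` stand in for whatever your platform uses to move traffic and check health, and the subnet names are placeholders.

```python
from typing import Callable, Dict, List, Sequence


def phased_cutover(
    batches: Sequence[Sequence[str]],
    switch: Callable[[str], None],
    healthy: Callable[[str], bool],
    switch_back: Callable[[str], None],
) -> List[str]:
    """Switch targets batch by batch; on the first unhealthy batch, revert everything.

    Returns the targets left on the new infrastructure (empty after a rollback).
    """
    done: List[str] = []
    for batch in batches:
        for target in batch:
            switch(target)
            done.append(target)
        if not all(healthy(target) for target in batch):
            # Undo in reverse order so the most recent switch is reverted first.
            for target in reversed(done):
                switch_back(target)
            return []
    return done


# Demo with in-memory stand-ins for the real traffic switch.
state: Dict[str, str] = {}
batches = [["subnet-a"], ["subnet-b", "subnet-c"]]  # canary first, then the rest

result = phased_cutover(
    batches,
    switch=lambda t: state.__setitem__(t, "new"),
    healthy=lambda t: t != "subnet-c",  # pretend subnet-c fails its check
    switch_back=lambda t: state.__setitem__(t, "old"),
)
print(result)  # []
print(state)   # every subnet is back on "old"
```

Keeping the old resources in place until the loop finishes cleanly is what makes `switch_back` possible at all.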
Checklists you can use
Pre-change checklist
- Describe intent, risk, and blast radius in PR.
- Plan generated and reviewed; no unexpected destroys.
- Security and policy checks pass.
- Cost delta understood.
- Backups/snapshots ready for stateful resources.
- Monitoring and health checks identified.
- Rollback steps documented and tested in staging.
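The first and last items of this checklist can be enforced on the PR description itself. The sketch below assumes a hypothetical PR template with `Intent:`, `Risk:`, `Blast radius:`, and `Rollback:` lines; the field names are an illustration, not a standard.

```python
import re
from typing import List

# Assumed field labels from a hypothetical PR template.
REQUIRED_FIELDS = ("Intent", "Risk", "Blast radius", "Rollback")


def missing_fields(pr_body: str) -> List[str]:
    """Return required fields that are absent or left empty in the PR description."""
    missing = []
    for name in REQUIRED_FIELDS:
        # A field counts as filled when "Name: <some text>" appears on one line.
        pattern = rf"^{re.escape(name)}:[ \t]*\S"
        if not re.search(pattern, pr_body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


pr_body = """Intent: Restrict inbound SSH to the VPN CIDR
Risk: low
Blast radius:
Rollback: revert this PR and re-run the pipeline
"""
print(missing_fields(pr_body))  # ['Blast radius']
```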
Apply checklist
- Confirm environment and region.
- Lock state (remote backend with locking).
- Apply with approved commit SHA.
- Watch logs, metrics, and alerts during and after apply.
- Record start/end time and outcome.
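The first and third items of this checklist guard against applying the wrong thing in the wrong place, and a pipeline can check them before it runs. A minimal sketch, assuming the approved values are recorded somewhere the pipeline can read; `ApplyContext`, `preflight_errors`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ApplyContext:
    environment: str
    region: str
    commit_sha: str


def preflight_errors(approved: ApplyContext, actual: ApplyContext) -> List[str]:
    """Compare what was approved with what is about to be applied."""
    errors = []
    for name in ("environment", "region", "commit_sha"):
        want, got = getattr(approved, name), getattr(actual, name)
        if want != got:
            errors.append(f"{name}: approved {want!r}, got {got!r}")
    return errors


approved = ApplyContext("prod", "eu-west-1", "abc1234")
actual = ApplyContext("prod", "us-east-1", "abc1234")
print(preflight_errors(approved, actual))
# ["region: approved 'eu-west-1', got 'us-east-1'"]
```

The apply step would refuse to start unless the returned list is empty.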
Post-change checklist
- Verify customer flows and SLOs.
- Update docs/runbooks if behavior changed.
- Tag resources correctly for ownership and cost.
- Leave a short change summary in PR.
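For the SLO item, a simple before-and-after comparison is often enough to decide whether to keep the change or roll back. The sketch below compares error rates over two equal-length windows; the `max_increase` threshold and the sample counts are assumptions, and real values would come from your monitoring system.

```python
from typing import Tuple

Window = Tuple[int, int]  # (errors, requests) observed in one time window


def error_rate(window: Window) -> float:
    errors, requests = window
    return errors / requests if requests else 0.0


def change_looks_healthy(before: Window, after: Window, max_increase: float = 0.001) -> bool:
    """True when the post-change error rate rose by no more than max_increase."""
    if after[1] == 0:
        # No traffic after the change means nothing was verified.
        return False
    return error_rate(after) - error_rate(before) <= max_increase


print(change_looks_healthy((5, 10_000), (12, 10_000)))  # True
print(change_looks_healthy((5, 10_000), (30, 10_000)))  # False
```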
Exercises
Do these now.
- Exercise 1: Draft a change plan
Scenario: Increase an ASG's instance type from t3.medium to t3.large in production.
- Write a PR description including intent, risk, plan summary, checks, and rollback.
- List monitoring signals to watch during rollout.
- Exercise 2: Read a plan and spot risks
Review the plan below and propose the safest approach.
-/+ aws_launch_template.api (new resource required)
      name: "api-lt" -> "api-lt" (forces new resource)
    ~ image_id: "ami-111" -> "ami-222"
    ~ user_data: (sensitive value)
~ aws_autoscaling_group.api
      launch_template.version: "$Latest" -> "$Latest"
- Is this safe to apply immediately? What pre-checks and rollout method would you use?
- Self-check: Did you include a rollback method that can be executed quickly?
- Self-check: Did you ensure no unintended destroys?
Common mistakes and how to self-check
- Ignoring the plan diff: Always scan for destroys or replacements on stateful resources. If found, pause and re-architect.
- Applying with $Latest: Pin immutable versions for launch templates and images. Self-check: Is the exact version pinned?
- No rollback: If you cannot describe rollback in two steps, you are not ready. Self-check: Can you revert by reverting a PR?
- Skipping pre-prod: High-risk changes must be tested in staging with realistic data patterns.
- Not monitoring: Define success metrics before applying. Self-check: Which dashboard will you watch?
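The "Is the exact version pinned?" self-check can be automated against the JSON form of a plan. The sketch below looks for Auto Scaling groups whose planned launch template version is a floating alias. It assumes the `launch_template` block appears as a list of objects in the planned state; the sample plan and the `lt-123` ID are invented.

```python
from typing import Dict, List

FLOATING_VERSIONS = {"$Latest", "$Default"}


def floating_launch_template_versions(plan: Dict) -> List[str]:
    """Return addresses of Auto Scaling groups that track a floating template version."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_autoscaling_group":
            continue
        after = rc.get("change", {}).get("after") or {}
        for template in after.get("launch_template") or []:
            if template.get("version") in FLOATING_VERSIONS:
                offenders.append(rc["address"])
                break
    return offenders


def asg_plan(version: str) -> Dict:
    """Hypothetical helper: a one-resource plan using the given template version."""
    return {"resource_changes": [{
        "address": "aws_autoscaling_group.api",
        "type": "aws_autoscaling_group",
        "change": {"actions": ["update"],
                   "after": {"launch_template": [{"id": "lt-123", "version": version}]}},
    }]}


print(floating_launch_template_versions(asg_plan("$Latest")))  # ['aws_autoscaling_group.api']
print(floating_launch_template_versions(asg_plan("7")))        # []
```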
Anti-patterns
- Manual console tweaks (causes drift).
- Applying from laptops with local state.
- Big-bang migrations without a canary.
Practical projects
- Set up a minimal GitOps pipeline: on PR → fmt/validate/plan; on merge → apply to staging; manual approval → apply to prod.
- Add a PR template that forces risk level, blast radius, and rollback to be filled.
- Create a sandbox to rehearse rollbacks for a stateful resource (e.g., restore an RDS snapshot).
- Implement tagging and cost estimation in CI for Terraform plans.
Mini challenge
Your team wants to replace an ALB with a new one using a newer module. Design a two-PR strategy that keeps traffic flowing. Include: traffic switch technique, verification steps, and rollback.
Learning path
- Master IaC basics: modules, state, backends.
- Add CI checks: fmt, validate, plan, policy, cost.
- Introduce environments: sandbox → staging → prod.
- Practice safe rollouts: canary, blue/green, phased changes.
- Document runbooks and PR templates.
Who this is for
- Platform and DevOps engineers owning infrastructure.
- Backend engineers contributing to infra modules.
- SREs responsible for reliability and compliance.
Prerequisites
- Comfort with Git and pull requests.
- Basic Terraform or CloudFormation experience.
- Familiarity with CI pipelines.
- Access to a non-production environment.
Next steps
- Create or refine your PR template for infra changes.
- Enable plan and policy checks in CI.
- Choose one service and document a rollback runbook.