Change Management For Infra

Why this matters

Infrastructure changes can improve reliability, security, and cost, but rushed changes can also cause outages. As a Platform Engineer using Infrastructure as Code, you need a repeatable, auditable process that keeps production safe while enabling teams to move fast.

Real tasks you will face: rolling out network changes with zero downtime, upgrading Terraform modules safely, adjusting autoscaling without impacting traffic, and responding quickly when a change goes wrong.
A solid change workflow reduces risk, speeds up reviews, and creates a clear paper trail for compliance.

What can go wrong without change management?

Opening security groups too broadly, exposing services.
Destroying or recreating stateful resources (databases, buckets) due to a tiny misconfiguration.
Applying in the wrong environment or region.
Cost spikes from unintended resource scaling.

Concept explained simply

Change management for infra is a lightweight set of steps to make changes safely. You propose a change (pull request), preview the impact (plan), get peer review and automation checks, apply in a controlled window, and verify results. If anything goes wrong, you roll back with a pre-agreed plan.

Mental model

Think of it as a three-gate flow: Propose → Prove → Promote.

Propose: a human states the intent in a PR, with context, risk, and rollback.
Prove: automation validates the change through format, validate, plan, security and cost checks, and a pre-prod test.
Promote: controlled apply with monitoring and clear ownership.

GitOps vs Ticket-Driven

GitOps: PRs are the source of truth; merges trigger applies.
Ticket-driven: a ticket references a PR, and the apply happens via a pipeline or change window.
Many teams combine both.

Core workflow (IaC change lifecycle)

Open a PR (Propose)
- Describe the intent, risk, blast radius, plan summary, and rollback steps.
- Tag owners and environment.
Pre-merge checks (Prove)
- IaC hygiene: fmt, validate, lint.
- Plan preview: terraform plan or change set.
- Policy-as-code: security and compliance rules.
- Cost estimate (rough) and drift detection.
- Test in sandbox or staging.
Approval
- At least one peer approval for low-risk; two for high-risk or production.
- Optional CAB for critical changes.
Apply (Promote)
- Use a controlled window if risk is high.
- Apply via CI or controlled workflow with audit logs.
- Announce start/end in team channel.
Post-change verification
- Health checks, logs, dashboards, SLOs.
- Confirm cost expectations.
- Close with a brief change summary.
Rollback (if needed)
- Reapply previous version or revert PR and re-run pipeline.
- For stateful resources, use snapshots/point-in-time restore.
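
The pre-merge checks in the Prove step can be sketched as a small runner that executes each command in order and stops at the first failure. This is a minimal sketch, not a full pipeline: it assumes `terraform init` has already run in the working directory, the plan file name `tfplan` is arbitrary, and `run_prove_steps` and its `runner` parameter are hypothetical names introduced here so the logic can be exercised without Terraform installed.

```python
import subprocess

# Hygiene and plan commands for the Prove gate (assumes `terraform init` ran).
PROVE_STEPS = [
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "validate"],
    ["terraform", "plan", "-input=false", "-out=tfplan"],
]


def run_prove_steps(steps=PROVE_STEPS, runner=None):
    """Run each step in order and stop at the first failure.

    `runner` takes a command list and returns an exit code. It defaults
    to running the command with subprocess. Returns (ok, failed_command).
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    for cmd in steps:
        if runner(cmd) != 0:
            return False, cmd
    return True, None
```

Policy, cost, and drift checks would slot in as further steps; they are left out here because the tooling for them varies by team.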

Risk rating rubric (quick)

Low: tagging, adding outputs, docs, no production impact.
Medium: ASG size, instance types, moderate IAM changes.
High: database parameter changes, VPC routing, resource replacement.
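
As a rough illustration, the rubric and the approval rule from the core workflow (one peer approval for low risk, two for high risk or production) could be encoded as below. The category keys, `rate_change`, and `required_approvals` are hypothetical names. Two behaviors are assumptions rather than rules from this page: unknown categories are treated as High, and medium-risk changes outside production need one approval.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical category keys mapped to the rubric's risk levels.
RUBRIC = {
    "tagging": Risk.LOW,
    "outputs": Risk.LOW,
    "docs": Risk.LOW,
    "asg_size": Risk.MEDIUM,
    "instance_type": Risk.MEDIUM,
    "iam_moderate": Risk.MEDIUM,
    "db_parameters": Risk.HIGH,
    "vpc_routing": Risk.HIGH,
    "resource_replacement": Risk.HIGH,
}


def rate_change(categories):
    """Return the highest risk among the categories a change touches.

    Unknown categories are treated as HIGH (assumption: fail safe).
    """
    risks = [RUBRIC.get(c, Risk.HIGH) for c in categories]
    return max(risks, key=lambda r: r.value, default=Risk.LOW)


def required_approvals(risk, is_production):
    """Two approvals for high-risk or production changes, otherwise one."""
    return 2 if risk is Risk.HIGH or is_production else 1
```

A change is rated by its riskiest part, so a PR that mixes a tag update with a routing change is still High.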

Worked examples

Example 1: Tighten a security group rule

Intent: Restrict inbound SSH from 0.0.0.0/0 to a VPN CIDR.

Propose: PR explains current risk, target CIDR, affected ASGs.
Prove: Plan shows update to ingress cidr_blocks only; no resource recreation.
Promote: Apply during low-traffic window; verify SSH from VPN works; confirm no external access.
Rollback: Revert PR to previous CIDR if VPN issues occur.

Plan snippet

~ aws_security_group.web_ssh
  ingress.cidr_blocks: ["0.0.0.0/0"] -> ["10.10.0.0/16"]
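
A check like the following could confirm in CI that the plan really closes SSH to the world. It is a sketch that assumes the input is one entry of `resource_changes` from `terraform show -json` for an `aws_security_group` with inline `ingress` blocks; `open_ssh_rules` is a hypothetical helper name.

```python
import ipaddress


def open_ssh_rules(resource_change):
    """Return ingress rules that would still allow port 22 from anywhere.

    Looks only at the planned `after` state, so a resource that is being
    deleted (after is null) yields no findings.
    """
    after = resource_change["change"].get("after") or {}
    flagged = []
    for rule in after.get("ingress") or []:
        # Protocol "-1" means all traffic, which includes SSH.
        covers_ssh = (
            rule.get("protocol") == "-1"
            or rule["from_port"] <= 22 <= rule["to_port"]
        )
        # A prefix length of 0 (0.0.0.0/0 or ::/0 style) matches every address.
        world_open = any(
            ipaddress.ip_network(cidr).prefixlen == 0
            for cidr in rule.get("cidr_blocks") or []
        )
        if covers_ssh and world_open:
            flagged.append(rule)
    return flagged
```

An empty result for the planned state is the signal that the restriction took effect in the configuration; it does not replace the manual check that external access is actually blocked.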

Example 2: Increase DB instance size

Intent: Upgrade RDS from db.m5.large to db.m5.xlarge.

Propose: Reason (CPU saturation), maintenance window, expected downtime (typically a brief failover with Multi-AZ), snapshot ID, rollback to previous class.
Prove: Staging test with the same engine/version; performance test; plan shows only instance_class change.
Promote: Apply in maintenance window; monitor replica lag, error rates.
Rollback: Apply previous class or restore from snapshot if needed.

Plan snippet

~ aws_db_instance.app
  instance_class: "db.m5.large" -> "db.m5.xlarge"
  apply_immediately: true
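
The claim that the plan shows only an `instance_class` change can be checked mechanically. The sketch below compares the `before` and `after` objects of one `resource_changes` entry from `terraform show -json`; `changed_attributes` and `unexpected_changes` are hypothetical helpers. It compares top-level attributes only, and it assumes values that are only known after apply are absent from `after`, so they are reported as changed (the conservative outcome).

```python
def changed_attributes(resource_change):
    """List top-level attributes whose value differs between before and after."""
    change = resource_change["change"]
    before = change.get("before") or {}
    after = change.get("after") or {}
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


def unexpected_changes(resource_change, expected):
    """Return changed attributes that the PR description did not announce."""
    return [a for a in changed_attributes(resource_change) if a not in expected]
```

For this example, `unexpected_changes(entry, {"instance_class"})` should come back empty; anything else in the list is a reason to pause the review.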

Example 3: Upgrade a Terraform VPC module (breaking)

Intent: Move from module v2 to v3 with route table changes.

Propose: Changelog summary, deprecations, expected replacements.
Prove: Run in a sandbox; compare plans; split change into two PRs: a) add new resources alongside old (no traffic yet), b) switch traffic and remove old.
Promote: Canary switch for one subnet, then full rollout.
Rollback: Switch traffic back to old resources; keep old infra for 24 hours before destroy.

Safer sequence

Introduce v3 under new names.
Peer review plan for replacements.
Cut traffic over with small scope first.
Remove old only after checks pass.
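
The canary-then-full cutover can be sketched as a small control loop. Everything here is illustrative: `phased_cutover` is a hypothetical function, and `switch` and `healthy` stand in for whatever mechanism actually moves traffic and checks health in your environment.

```python
def phased_cutover(subnets, switch, healthy, canary_count=1):
    """Switch a small canary first, then the rest.

    `switch(subnet, target)` moves one subnet to "new" or "old";
    `healthy(subnet)` reports whether it looks good after the switch.
    If any switched subnet is unhealthy after a phase, everything
    switched so far is moved back and the function returns False.
    """
    switched = []
    phases = [subnets[:canary_count], subnets[canary_count:]]
    for phase in phases:
        for subnet in phase:
            switch(subnet, "new")
            switched.append(subnet)
        if not all(healthy(s) for s in switched):
            # Roll back in reverse order of the cutover.
            for s in reversed(switched):
                switch(s, "old")
            return False
    return True
```

Because the old resources stay in place until checks pass, rollback here is only a traffic switch, which is what makes the two-PR split safer than an in-place replacement.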

Checklists you can use

Pre-change checklist

Describe intent, risk, and blast radius in PR.
Plan generated and reviewed; no unexpected destroys.
Security and policy checks pass.
Cost delta understood.
Backups/snapshots ready for stateful resources.
Monitoring and health checks identified.
Rollback steps documented and tested in staging.
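
The "no unexpected destroys" item lends itself to automation. The sketch below scans the JSON form of a plan (as produced by `terraform show -json`) for deletes or replacements of stateful resources. The `STATEFUL_TYPES` set is an illustrative, incomplete list and `destructive_changes` is a hypothetical helper.

```python
# Illustrative only: extend with the stateful resource types you run.
STATEFUL_TYPES = {"aws_db_instance", "aws_s3_bucket", "aws_ebs_volume"}


def destructive_changes(plan, stateful_types=STATEFUL_TYPES):
    """Return (address, actions) for stateful resources the plan would
    delete or replace.

    A replacement appears as both "delete" and "create" in `actions`,
    so checking for "delete" covers both cases.
    """
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions and rc["type"] in stateful_types:
            hits.append((rc["address"], actions))
    return hits
```

A non-empty result does not always mean the change is wrong, but it should block the merge until a reviewer confirms the destroy is intended and backed up.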

Apply checklist

Confirm environment and region.
Lock state (remote backend with locking).
Apply with approved commit SHA.
Watch logs, metrics, and alerts during and after apply.
Record start/end time and outcome.
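
The first and third items (confirm environment and region, apply the approved commit SHA) can be enforced by a small preflight comparison before the apply runs. This is a sketch with hypothetical names (`ApplyTarget`, `preflight_errors`); where the approved values come from (PR metadata, a ticket, a pipeline variable) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplyTarget:
    environment: str
    region: str
    commit_sha: str


def preflight_errors(requested, approved):
    """Compare what is about to be applied with what was approved.

    Returns a list of human-readable mismatches; an empty list means
    the apply matches the approval.
    """
    errors = []
    for field in ("environment", "region", "commit_sha"):
        want = getattr(approved, field)
        got = getattr(requested, field)
        if got != want:
            errors.append(f"{field}: approved {want!r}, got {got!r}")
    return errors
```

Failing the pipeline on any mismatch turns "applied in the wrong environment or region" from a judgment call into a hard stop.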

Post-change checklist

Verify customer flows and SLOs.
Update docs/runbooks if behavior changed.
Tag resources correctly for ownership and cost.
Leave a short change summary in PR.

Exercises

Do these now.

Exercise 1: Draft a change plan
Scenario: Increase an ASG's instance type from t3.medium to t3.large in production.
- Write a PR description including intent, risk, plan summary, checks, and rollback.
- List monitoring signals to watch during rollout.

Exercise 2: Read a plan and spot risks

Review the plan below and propose the safest approach.

-/+ aws_launch_template.api (new resource required)
      name:          "api-lt" -> "api-lt" (forces new resource)
    ~ image_id:      "ami-111" -> "ami-222"
    ~ user_data:     (sensitive value)

~ aws_autoscaling_group.api
    launch_template.version: "$Latest" -> "$Latest"

Is this safe to apply immediately? What pre-checks and rollout method would you use?

Self-check: Did you include a rollback method that can be executed quickly?
Self-check: Did you ensure no unintended destroys?

Common mistakes and how to self-check

Ignoring the plan diff: Always scan for destroys or replacements on stateful resources. If found, pause and re-architect.
Applying with $Latest: Pin immutable versions for launch templates and images. Self-check: Is the exact version pinned?
No rollback: If you cannot describe rollback in two steps, you are not ready. Self-check: Can you revert by reverting a PR?
Skipping pre-prod: High-risk changes must be tested in staging with realistic data patterns.
Not monitoring: Define success metrics before applying. Self-check: Which dashboard will you watch?
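
The "Is the exact version pinned?" self-check can be run against the plan itself. The sketch below assumes the plan JSON from `terraform show -json` and that an `aws_autoscaling_group` exposes its `launch_template` block as a list of objects with a `version` field; `unpinned_launch_templates` is a hypothetical helper. A missing version is flagged too, which is conservative: it also catches versions that are only known after apply.

```python
# Version aliases that float instead of naming one immutable version.
FLOATING_VERSIONS = {"$Latest", "$Default"}


def unpinned_launch_templates(plan):
    """Return addresses of ASGs whose launch template version is not pinned."""
    hits = []
    for rc in plan.get("resource_changes", []):
        if rc["type"] != "aws_autoscaling_group":
            continue
        after = rc["change"].get("after") or {}
        for lt in after.get("launch_template") or []:
            version = lt.get("version")
            if version is None or version in FLOATING_VERSIONS:
                hits.append(rc["address"])
                break
    return hits
```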

Anti-patterns

Manual console tweaks (causes drift).
Applying from laptops with local state.
Big-bang migrations without a canary.

Practical projects

Set up a minimal GitOps pipeline: on PR → fmt/validate/plan; on merge → apply to staging; manual approval → apply to prod.
Add a PR template that requires risk level, blast radius, and rollback to be filled in.
Create a sandbox to rehearse rollbacks for a stateful resource (e.g., restore an RDS snapshot).
Implement tagging and cost estimation in CI for Terraform plans.
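
For the PR template project, a CI step could reject descriptions that leave required fields empty. The sketch below assumes a hypothetical template in which each field is a single line of the form `Field name: value`; the field names and `missing_sections` are illustrative and would need to match your real template.

```python
import re

# Hypothetical field names; align these with your actual PR template.
REQUIRED_SECTIONS = ("Risk level", "Blast radius", "Rollback")


def missing_sections(pr_body, required=REQUIRED_SECTIONS):
    """Return required fields that are absent or left empty in the PR body."""
    missing = []
    for name in required:
        # The field must start a line and have non-blank text after the colon.
        pattern = rf"^{re.escape(name)}:[ \t]*\S.*$"
        if not re.search(pattern, pr_body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing
```

A check like this only proves the fields are filled in; whether the rollback steps are realistic still depends on the reviewer.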

Mini challenge

Your team wants to replace an ALB with a new one using a newer module. Design a two-PR strategy that keeps traffic flowing. Include: traffic switch technique, verification steps, and rollback.

Learning path

Master IaC basics: modules, state, backends.
Add CI checks: fmt, validate, plan, policy, cost.
Introduce environments: sandbox → staging → prod.
Practice safe rollouts: canary, blue/green, phased changes.
Document runbooks and PR templates.

Who this is for

Platform and DevOps engineers owning infrastructure.
Backend engineers contributing to infra modules.
SREs responsible for reliability and compliance.

Prerequisites

Comfort with Git and pull requests.
Basic Terraform or CloudFormation experience.
Familiarity with CI pipelines.
Access to a non-production environment.

Next steps

Create or refine your PR template for infra changes.
Enable plan and policy checks in CI.
Choose one service and document a rollback runbook.
