Why this matters
Data platforms are living systems. You routinely add warehouses, rotate keys, change retention, and expand clusters. Each change can impact pipelines, SLAs, costs, and compliance. Good change management lets you ship confidently, recover quickly, and prove who changed what and why.
- Real tasks you will face: rotate credentials without downtime; increase Kafka retention safely; update warehouse sizes; add VPC rules for new data sources; upgrade IaC modules across environments.
- Outcomes: fewer incidents, predictable deploys, clear audit trail, easier collaboration.
Concept explained simply
Change management for infrastructure is a repeatable way to propose, review, test, deploy, and verify changes made with Infrastructure as Code. Think of it as guardrails around git-based changes so production stays stable.
Mental model
Use the runway model: each change is a plane that must pass gates before takeoff.
- Gate 1: Clarity — what will change and why (ticket + PR).
- Gate 2: Safety — automated checks (lint, validate, plan, policy).
- Gate 3: Review — human review with context and risk.
- Gate 4: Staging — prove it in a prod-like environment.
- Gate 5: Launch — controlled window with rollback ready.
- Gate 6: Verify — measure impact and close the loop.
A minimal change record template
- Change ID:
- Summary:
- Scope (resources/modules):
- Risk (Low/Med/High) + why:
- Impact (users, SLAs, cost):
- Plan artifact (plan file output hash or summary):
- Test evidence (staging run, screenshots, logs):
- Rollback plan (exact steps):
- Owner + reviewers:
- Window + comms plan:
- Post-checks/metrics:
Core principles
- Everything through version control: no ad-hoc console edits.
- Plan before apply: require a human to read the plan.
- Small, reversible changes: prefer incremental over big-bang.
- Promote, don’t copy: apply the same change through dev → staging → prod.
- Policy as code: enforce guardrails automatically (naming, tags, regions, encryption).
- Observability: tag resources with change_id and verify after deploy (a tagging sketch follows this list).
- Roll-forward mindset: rollback is allowed, but roll-forward fixes are often safer. Have both ready.
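To make the tagging principle concrete, here is a minimal sketch of provider-level default tags, assuming the AWS provider and an illustrative change ID:

```hcl
# Minimal sketch: stamp every resource in this configuration with the change
# record, assuming the AWS provider's default_tags feature (values illustrative).
provider "aws" {
  region = "us-east-1" # illustrative region

  default_tags {
    tags = {
      change_id  = "CHG-123"       # links each resource back to its change record
      owner      = "data-platform" # team accountable for the change
      managed_by = "terraform"
    }
  }
}
```

If your provider has no default-tags feature, per-resource tags achieve the same audit trail; the point is that the change_id travels with the resource.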
A safe change workflow (end-to-end)
- Open a change ticket and branch from main. Fill the template (scope, risk, rollback).
- Make IaC changes. Keep diffs small and isolated.
- CI runs: fmt, init, validate, plan, policy checks, and a security scan. Publish the plan artifact.
- Peer review: verify the plan, risk, blast radius, and cost. Approve only when nothing in the plan is surprising.
- Apply to dev environment. Run smoke tests.
- Promote to staging. Run integration tests and load checks if relevant (a per-environment state layout is sketched after this list).
- Schedule a production window if needed. Announce. Apply using the exact, reviewed plan output. Capture logs.
- Post-apply verification: metrics, dashboards, data freshness, error rates. Tag the release; update the change record.
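The promotion steps above assume one codebase with isolated state per environment. A minimal sketch, assuming an S3 remote backend and Terraform workspaces named dev, staging, and prod (all identifiers illustrative):

```hcl
# Minimal sketch: one codebase, one state file per environment, assuming an S3
# remote backend and Terraform workspaces (bucket and table names are hypothetical).
terraform {
  required_version = ">= 1.5.0"

  backend "s3" {
    bucket               = "example-platform-tfstate"  # hypothetical state bucket
    key                  = "platform/terraform.tfstate"
    region               = "us-east-1"
    dynamodb_table       = "example-terraform-locks"   # state locking
    workspace_key_prefix = "env"                       # state lives under env/dev, env/staging, env/prod
  }
}

locals {
  environment = terraform.workspace # "dev", "staging", or "prod"
}
```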
Deployment checklist (use before production)
- Plan shows no unintended resource replacement (a lifecycle guard that enforces this is sketched after this checklist).
- Secrets handled via a secrets manager or injected variables, never hard-coded.
- Backups/snapshots available if data is in scope.
- Dependencies mapped (VPC rules, IAM, topic producers/consumers).
- Monitoring alerts in place for impacted components.
- Rollback plan tested in lower env or rehearsed.
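A lifecycle guard can back the replacement check with a hard stop at plan time; a minimal sketch, using an illustrative storage bucket:

```hcl
# Minimal sketch: protect a stateful resource from accidental destruction or
# replacement (bucket name is hypothetical).
resource "aws_s3_bucket" "raw_events" {
  bucket = "example-raw-events" # hypothetical bucket holding stateful data

  lifecycle {
    prevent_destroy = true # plan/apply fails if any change would destroy or replace this bucket
  }
}
```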
Worked examples
Example 1: Increase warehouse size in a data warehouse
Goal: Temporarily increase compute to handle a backfill, then scale down.
- Change: update the Terraform variable warehouse_size from M to L (sketched after this example).
- Risk: cost spike and possible changes in query queuing. Low to Medium.
- Plan review: Ensure only warehouse size updates; no resource replacement.
- Staging test: Apply to staging, run a heavy query and watch costs/latency.
- Production: Apply during off-peak. Announce in team channel.
- Verification: Track query latency and credits used; set a reminder to scale down after backfill completes.
- Rollback: Apply previous commit to revert size.
Notes
- Prefer time-bound changes with a scheduled follow-up PR to scale back down.
- Tag change_id=CHG-123 on the resource for audit.
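A sketch of what the diff should look like, assuming the community Snowflake provider's snowflake_warehouse resource (names and sizes are illustrative):

```hcl
# Minimal sketch of the intended diff, assuming the Snowflake Terraform provider
# (resource arguments and size values are illustrative).
variable "warehouse_size" {
  type        = string
  default     = "LARGE" # was "MEDIUM"; revert by restoring the previous default
  description = "Compute size for the analytics warehouse"
}

resource "snowflake_warehouse" "analytics" {
  name           = "ANALYTICS_WH"
  warehouse_size = var.warehouse_size
  auto_suspend   = 60 # seconds; limits cost if the backfill stalls
  comment        = "change_id=CHG-123 temporary resize for backfill"
}
```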
Example 2: Kafka topic retention change
Goal: Increase retention from 2 days to 7 days for an analytics topic.
- Impact analysis: Storage cost up; consumers may reprocess more data on restarts.
- IaC diff: only the topic config should change; avoid re-creating the topic (sketched after this example).
- Tests: In dev, produce/consume a small load; verify retention config via CLI or metrics.
- Production window: Off-peak; communicate to downstream consumers.
- Verification: Check broker metrics and topic size growth after 24 hours.
- Rollback: Re-apply previous retention value. If storage alarm triggers, roll back immediately.
Risk mitigations
- Set alerts on partition size and disk usage.
- Consider incremental change (2 → 4 → 7 days) if capacity is tight.
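A sketch of the intended diff, assuming a community Kafka Terraform provider that manages topic settings through a config map (topic name, partitions, and values are illustrative):

```hcl
# Minimal sketch: only the config map changes; partitions and replication factor
# stay untouched so the topic is not re-created (all values illustrative).
resource "kafka_topic" "clickstream_analytics" {
  name               = "clickstream.analytics"
  partitions         = 12
  replication_factor = 3

  config = {
    "retention.ms"   = "604800000" # 7 days; previously "172800000" (2 days)
    "cleanup.policy" = "delete"
  }
}
```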
Example 3: IAM policy tightening for Airflow
Goal: Restrict Airflow’s role to least privilege while keeping jobs running.
- Plan: replace broad wildcards with resource-scoped permissions (sketched after this example).
- Staged rollout: Dev first; run DAGs that touch each permission area.
- Shadow mode: Add monitoring to catch AccessDenied errors.
- Production: Apply during a low-run window. Keep break-glass role available.
- Verification: No DAG failures; error rate unchanged; audit logs clean.
- Rollback: Re-apply previous policy version if AccessDenied impacts SLAs.
Tips
- Use feature flags in DAGs to reduce blast radius (e.g., optional step toggles) while policy changes settle.
- Document which DAGs exercise each permission for future reviews.
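A sketch of the tightened policy, assuming AWS IAM and an illustrative S3 landing-zone bucket (role, bucket, and policy names are hypothetical):

```hcl
# Minimal sketch: scope the Airflow role to one bucket instead of a wildcard.
data "aws_iam_policy_document" "airflow_s3" {
  statement {
    sid       = "LandingZoneReadWrite"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::example-landing-zone",   # bucket-level actions (ListBucket)
      "arn:aws:s3:::example-landing-zone/*", # object-level actions (Get/PutObject)
    ]
  }
}

resource "aws_iam_role_policy" "airflow_s3" {
  name   = "airflow-s3-least-privilege"
  role   = "airflow-worker" # hypothetical existing Airflow execution role
  policy = data.aws_iam_policy_document.airflow_s3.json
}
```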
Practical projects
- Build a change pipeline: CI that runs fmt, validate, plan, cost estimate, and policy checks. Acceptance: PR comment shows plan summary and policy status.
- Environment promotion: implement dev → staging → prod with the same module version and per-workspace variables (sketched after this list). Acceptance: a single PR promotes through all environments, gated by approvals.
- Drift detection job: Nightly job runs plan in read-only mode and posts drift summary. Acceptance: alert when unexpected changes appear.
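For the promotion project, the key idea is that every environment instantiates the same pinned module and only variable values differ; a minimal sketch with a hypothetical registry module:

```hcl
# Minimal sketch: identical module version everywhere; per-environment tfvars
# supply the values (module source, version, and variables are hypothetical).
variable "warehouse_size" { type = string }     # from dev/staging/prod tfvars
variable "kafka_retention_ms" { type = number } # from dev/staging/prod tfvars

module "data_platform" {
  source  = "app.terraform.io/example-org/data-platform/aws" # hypothetical registry module
  version = "2.4.1"                                          # promoted by PR, never edited per env

  environment        = terraform.workspace # dev, staging, or prod
  warehouse_size     = var.warehouse_size
  kafka_retention_ms = var.kafka_retention_ms
}
```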
Exercises you can do now
Exercise 1: Write a safe change plan
Scenario: Your team needs to enable server-side encryption on object storage buckets for staging and production managed by Terraform. Draft a one-page change plan.
- List affected resources and modules.
- Assess risk and impact (cost, performance, compliance).
- Show the exact commands you will run (plan/apply) and which workspace(s).
- Provide a rollback method.
- Define verification steps and metrics.
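For orientation, this is the kind of resource your change plan will add, assuming AWS S3 with SSE-KMS (identifiers are hypothetical); the risk assessment, commands, rollback, and verification are still yours to write.

```hcl
# Minimal sketch: enable default server-side encryption on an existing bucket
# (bucket name and KMS alias are hypothetical).
resource "aws_s3_bucket_server_side_encryption_configuration" "staging_raw" {
  bucket = "example-staging-raw" # hypothetical; reference your managed aws_s3_bucket instead

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = "alias/example-data-platform" # hypothetical KMS key alias
    }
  }
}
```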
Exercise 2: Create a PR review template
Create a pull request template and reviewer checklist for infra changes.
- Sections: context, scope, plan summary, risk, rollback, test evidence, screenshots/logs, change_id.
- Reviewer checks: unintended resource replacement, secrets, policy violations, cost impact, module versions, environment promotion steps.
Self-check checklist
- Is the plan small, clear, and scoped?
- Does the rollback specify exact commands or commits to revert?
- Are verification metrics actionable and time-bound?
- Would a new teammate understand and run your steps?
Common mistakes and how to self-check
- Skipping plan review: Always compare planned vs expected changes; watch for replacement of stateful resources.
- Big-bang merges: Break large changes into smaller PRs with independent verification.
- Manual console edits: They cause drift and surprises; revert to IaC immediately.
- No rollback: Write explicit, tested rollback steps.
- Copying code across envs: Promote the same module version via variables/workspaces.
- Silent deploys: Announce changes that affect users, costs, or SLAs.
Quick self-audit before apply
- Plan shows only intended diffs.
- Backups or snapshots exist for stateful resources.
- Tags include change_id and owner.
- Monitoring/alerts are ready.
- Staging results attached to the PR.
Who this is for
- Data Platform Engineers and Infra Engineers owning clusters, storage, and orchestration.
- Analytics Engineers adding or modifying platform resources with IaC.
- SREs standardizing safe deploys for data systems.
Prerequisites
- Basic git and pull request workflow.
- Intro-level Infrastructure as Code (e.g., modules, state, plan/apply concepts).
- Familiarity with your platform components (compute, storage, network, IAM).
Learning path
- Standardize PR templates and CI checks for IaC.
- Add environment promotion and approvals tied to saved plan artifacts.
- Introduce policy as code and cost checks.
- Automate drift detection and post-deploy verification.
- Run a game day: rehearse rollback for a safe, staged change.
Next steps
- Apply the exercises to a small, low-risk change this week.
- Add a change_id tag convention and start tracking deploy metrics.
- Schedule a review of high-risk changes in your backlog and slice them smaller.
Mini challenge
A module upgrade introduces a new variable whose default would cause a production resource to be replaced. Draft a plan to deploy it safely without the replacement. Include the variable override, staged rollout steps, and a rollback path.
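One piece of a good answer is the override itself; a partial sketch with a hypothetical module and variable name (the staged rollout and rollback path are left to you):

```hcl
# Partial sketch: pin the upgraded module and explicitly override the new
# variable so the plan keeps current behavior (module and variable are hypothetical).
module "object_store" {
  source  = "app.terraform.io/example-org/object-store/aws" # hypothetical module under upgrade
  version = "3.0.0"

  # New variable introduced by 3.0.0 (name is hypothetical). Its default would
  # force the production bucket to be replaced; override it to match the value
  # already recorded in state so the plan shows an update, not a replacement.
  force_new_bucket_naming = false
}
```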
Quick test
Ready to check your understanding? Take the quick test below.