Why this matters
As a Platform Engineer, you are the last line of defense when things go wrong. Backups and Disaster Recovery (DR) protect customer data, keep services running, and meet compliance obligations. You will design backup strategies, automate schedules, test restores, and plan for region-wide outages.
- Real task: Define RPO/RTO with stakeholders for critical services.
- Real task: Implement daily database backups with encryption and retention policies.
- Real task: Run restore drills to prove backups actually work.
- Real task: Plan failover to a secondary region with documented runbooks.
Concept explained simply
Backup is making safe copies of data so you can restore it later. Disaster Recovery is how you bring systems back when a big failure happens (like region outage, ransomware, or accidental deletion).
- RPO (Recovery Point Objective): How much data you can afford to lose, measured as a window of time (e.g., the last 15 minutes of writes).
- RTO (Recovery Time Objective): How long you can be down (e.g., 60 minutes).
- Backup types: Full (everything), Incremental (changes since the last backup of any type), Differential (changes since the last full backup).
- 3-2-1 rule: Keep 3 copies, on 2 different media, 1 offsite (preferably offline/immutable).
- Consistency: Crash-consistent backups capture disk state as if the power was cut; application-consistent backups quiesce I/O and flush buffers first, so databases restore cleanly.
- Immutability: Write-once backups to resist ransomware and accidental deletion.
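To make the incremental vs differential trade-off concrete, here is a small illustrative sketch (the daily change sizes are invented numbers): incrementals store less but need a longer restore chain, while differentials store more but restore from just two artifacts.

```python
# Illustrative sketch: storage and restore cost of incremental vs
# differential backups. Daily change sizes are made-up numbers.

full_backup_gb = 100
daily_changes_gb = [2, 3, 1, 4]  # changes on days 1-4 after the full backup

# Incremental: each backup holds only the changes since the previous backup.
incremental_sizes = daily_changes_gb
# Restore needs the full backup plus EVERY incremental, in order.
incremental_restore_chain = 1 + len(incremental_sizes)

# Differential: each backup holds all changes since the last FULL backup.
differential_sizes = [sum(daily_changes_gb[: i + 1])
                      for i in range(len(daily_changes_gb))]
# Restore needs only the full backup plus the LATEST differential.
differential_restore_chain = 2

print("incremental: stores", sum(incremental_sizes),
      "GB, restore chain of", incremental_restore_chain, "artifacts")
print("differential: stores", sum(differential_sizes),
      "GB, restore chain of", differential_restore_chain, "artifacts")
```

The longer the gap between full backups, the more this trade-off matters: incremental chains grow fragile (one corrupt link breaks the restore), while differentials grow large.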
Mental model: Layers of safety
Imagine data safety as layers:
- Local snapshot: Fast, for small mistakes.
- Remote backup: Survives local failures.
- Immutable/offline copy: Survives ransomware and malicious deletes.
- Disaster Recovery: People, docs, and automation to rebuild systems fast.
Core building blocks
- Define service tiers: Which systems are Tier-0 (must restore first) vs lower tiers.
- Choose backup cadences: e.g., nightly full + 15-min incrementals for databases.
- Retention policy: e.g., 7 daily, 4 weekly, 12 monthly; align with compliance.
- Encryption: Encrypt in transit and at rest; safeguard keys separately.
- Catalog and verification: Index backups and verify with checksums and test restores.
- Runbooks: Step-by-step recovery documents with commands and validation steps.
- DR site strategy: Cold (cheap, slow), Warm (pre-provisioned partial), Hot (active/active, fast, expensive).
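A retention policy like "7 daily, 4 weekly, 12 monthly" is easiest to reason about as a selection function over backup dates. Here is a minimal sketch, assuming daily backups and simplified selection rules (weeklies land on Sundays, monthlies on the 1st); real tools let you tune these anchors.

```python
# Minimal sketch: which daily backups does a "7 daily, 4 weekly,
# 12 monthly" retention policy keep? Selection rules are simplified.
from datetime import date, timedelta

def backups_to_keep(today, daily=7, weekly=4, monthly=12):
    keep = set()
    # Daily: the last `daily` days.
    for i in range(daily):
        keep.add(today - timedelta(days=i))
    # Weekly: the most recent `weekly` Sundays.
    sunday = today - timedelta(days=(today.weekday() + 1) % 7)
    for i in range(weekly):
        keep.add(sunday - timedelta(weeks=i))
    # Monthly: the first day of the last `monthly` months.
    year, month = today.year, today.month
    for _ in range(monthly):
        keep.add(date(year, month, 1))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keep

kept = backups_to_keep(date(2024, 6, 15))
print(len(kept), "backups retained")
```

Everything outside the returned set is a pruning candidate; in practice you would also exclude anything still inside its immutability window.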
Worked examples
Example 1: Postgres backup plan with 15-min RPO
- RPO/RTO: RPO 15 min; RTO 60 min.
- Backups: Nightly full base backup; WAL archiving every 15 min to remote object storage.
- Retention: 7 daily, 4 weekly, 3 monthly, 1 yearly. Immutability: 14 days write-protected.
- Encryption: Encrypt backups; store encryption keys in a separate secure vault.
- Restore test: Quarterly restore to a staging instance; replay WAL to a target timestamp; run checksums and smoke tests.
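A worthwhile sanity check on the plan above: the WAL archive interval alone does not equal the RPO, because upload time eats into the budget. The sketch below uses an assumed 2-minute upload time to show how a 15-minute archive interval can quietly miss a 15-minute RPO.

```python
# Sketch: does a 15-min WAL archive interval meet a 15-min RPO?
# Worst-case data loss = archive interval + time to copy the archive
# offsite. The upload time here is an assumption, not a measurement.

rpo_target_min = 15
wal_archive_interval_min = 15
archive_upload_time_min = 2  # assumed; measure yours

worst_case_loss_min = wal_archive_interval_min + archive_upload_time_min
meets_rpo = worst_case_loss_min <= rpo_target_min
print(f"worst-case loss: {worst_case_loss_min} min, meets RPO: {meets_rpo}")
```

If the check fails, shorten the archive interval or use streaming replication so the budget holds even in the worst case.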
Example 2: Web app DR with warm standby
- Tiers: Auth, DB, API are Tier-0; batch jobs are Tier-2.
- Warm standby: Pre-provision infra in another region; replicate images and configs.
- Data: Database PITR (point-in-time recovery); object storage cross-region replication with versioning.
- Failover steps: Update DNS (low TTL), scale up warm standby, restore DB, validate health checks, then increase traffic.
- Drill: Twice per year, measure RTO; record issues and fixes.
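The failover steps above can be turned into an explicit RTO budget, which makes drills measurable: each step gets a planned duration, and the drill compares actuals against the plan. All durations below are assumptions for illustration.

```python
# Sketch: budgeting a 60-minute RTO across failover steps.
# Step durations are assumed values for illustration.

rto_target_min = 60
failover_steps = {
    "declare incident, assemble responders": 10,
    "scale up warm standby": 10,
    "restore DB to latest point in time": 20,
    "validate health checks": 5,
    "DNS switch (bounded by record TTL)": 5,
    "ramp traffic and monitor": 5,
}

total = sum(failover_steps.values())
print(f"planned RTO: {total} min (target {rto_target_min} min)")
for step, minutes in failover_steps.items():
    print(f"  {minutes:>3} min  {step}")
```

Leaving slack between the planned total and the target (here 5 minutes) absorbs the surprises that every real drill uncovers.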
Example 3: Kubernetes etcd/cluster state
- Backup: Regular etcd snapshots; store off-cluster and offsite with immutability.
- App data: Back up persistent volumes via CSI snapshots or backup tooling; ensure application-consistent hooks.
- Restore: Recreate control plane from etcd snapshot; restore PVCs; reapply manifests from Git; verify service endpoints.
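Before restoring from any snapshot, verify it against the checksum recorded at backup time; restoring a corrupt etcd snapshot is worse than a slow restore. A minimal verification sketch (the file name and "catalog" are hypothetical stand-ins):

```python
# Sketch: verify a backup artifact (e.g., an etcd snapshot file)
# against a recorded SHA-256 checksum before attempting a restore.
import hashlib
import pathlib

def sha256_of(path):
    """Stream the file in 1 MiB chunks so large snapshots fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a snapshot file and the checksum cataloged at backup time.
snapshot = pathlib.Path("etcd-snapshot.db")
snapshot.write_bytes(b"snapshot contents")
cataloged = hashlib.sha256(b"snapshot contents").hexdigest()

assert sha256_of(snapshot) == cataloged, "checksum mismatch: do not restore"
print("snapshot checksum verified")
```

The same pattern applies to database base backups and volume snapshots: catalog the checksum when the backup is taken, verify it before every restore.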
Process and checklists
Backup lifecycle
- Classify data and criticality tier.
- Set RPO/RTO with stakeholders.
- Design cadence, retention, and storage locations.
- Implement encryption, immutability, and cataloging.
- Automate monitoring and alerts for backup failures.
- Test restores and document runbooks.
- Review quarterly; update for new systems.
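The "monitor and alert" step above boils down to one check: is any service's last successful backup older than its cadence allows? A sketch of that check, with made-up service names and timestamps:

```python
# Sketch: a monitoring check that flags overdue backups.
# Service names, timestamps, and thresholds are illustrative.
from datetime import datetime, timedelta

def overdue_backups(last_success, max_age, now):
    """Return services whose last successful backup is older than max_age."""
    return [svc for svc, ts in last_success.items() if now - ts > max_age[svc]]

now = datetime(2024, 6, 15, 12, 0)
last_success = {
    "postgres-main": datetime(2024, 6, 15, 11, 50),  # 10 min ago
    "object-store": datetime(2024, 6, 13, 3, 0),     # ~2 days ago
}
max_age = {
    "postgres-main": timedelta(minutes=30),  # 15-min cadence + slack
    "object-store": timedelta(hours=26),     # daily cadence + slack
}

print("overdue:", overdue_backups(last_success, max_age, now))
```

Alerting on "last success too old" rather than "job reported failure" also catches the silent failure mode where the backup job stops running entirely.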
Quick self-check checklist
- I can state RPO/RTO for each Tier-0 service.
- At least one backup copy is offsite and immutable/offline.
- Backups are encrypted and keys are stored separately.
- I have performed a restore drill in the last 90 days.
- Runbooks include validation steps and rollback plans.
- DNS TTL and traffic switch methods are documented.
Who this is for
- Platform and SRE engineers who own reliability and data safety.
- Backend engineers contributing to on-call and operational readiness.
- Team leads defining availability targets and compliance posture.
Prerequisites
- Basic knowledge of Linux, containers, and networking.
- Familiarity with your database engine (e.g., Postgres, MySQL) concepts like WAL/binlogs.
- Comfort with infrastructure-as-code and CI/CD for reproducible environments.
Learning path
- Learn core terms: RPO, RTO, 3-2-1, immutability, PITR.
- Design a tiered backup plan with retention and encryption.
- Automate backups and add monitoring for failures.
- Write restore runbooks with clear validation steps.
- Conduct a DR drill: measure RTO/RPO, capture gaps, iterate.
Exercises
Exercise 1: Define a realistic backup policy
Scenario: An e-commerce platform with a Postgres database and an object store for product images. Peak hours: 08:00–22:00. Write a concise policy.
- State RPO and RTO for the database and object store.
- Choose backup types and cadences (full/incremental; replication for objects).
- Define retention, immutability, and encryption.
- Outline a quarterly restore drill.
Exercise 2: Draft a DR runbook snippet
Create a one-page runbook for a regional outage affecting your API and database.
- Prereqs: Access, credentials, locations of backups, and contacts.
- Steps: Restore order, commands or actions, DNS/traffic switch, validation checks.
- Success criteria: What to verify before calling recovery complete.
Submission checklist
- RPO/RTO stated with numbers.
- Backup cadence and retention defined.
- Immutability and encryption addressed.
- Runbook includes validation and rollback.
Common mistakes and how to self-check
- Mistake: Assuming snapshots equal backups. Self-check: Do you have offsite, immutable copies and a tested restore?
- Mistake: Undefined or unrealistic RPO/RTO. Self-check: Confirm stakeholder-approved numbers and drill results within targets.
- Mistake: Backups unencrypted or keys stored with backups. Self-check: Verify encryption and separate key management.
- Mistake: Never testing restores. Self-check: Schedule quarterly restores and record time to recover and data loss.
- Mistake: Ignoring dependencies. Self-check: Recovery order lists DNS, secrets, configs, DB, caches, app.
- Mistake: Long DNS TTL. Self-check: Ensure low TTL for failover domains.
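The dependency mistake above has a mechanical fix: write the dependencies down and derive the recovery order with a topological sort, so nothing comes up before what it needs. The dependency graph below is a made-up example.

```python
# Sketch: derive a recovery order from service dependencies with a
# topological sort. The graph maps each service to what it depends on.
from graphlib import TopologicalSorter  # Python 3.9+

depends_on = {
    "dns": [],
    "secrets": [],
    "configs": ["secrets"],
    "db": ["configs"],
    "caches": ["db"],
    "app": ["dns", "db", "caches"],
}

recovery_order = list(TopologicalSorter(depends_on).static_order())
print("restore in this order:", recovery_order)
```

Keeping this graph in the runbook (and regenerating the order when it changes) beats rediscovering the order at 3 a.m. during an incident.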
Practical projects
- Build a backup catalog: Inventory all services, RPO/RTO, cadence, retention, and last restore test date.
- Implement object storage versioning and lifecycle rules: Keep recent versions, transition old ones to colder storage, and enforce immutability for 14 days.
- Create a DR drill playbook: Roles, timeline, communication plan, and success metrics; run a partial failover test.
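For the backup catalog project, even a tiny structured record per service pays off, because the catalog can then answer questions like "which restore tests are overdue?" automatically. A minimal sketch with illustrative fields and values:

```python
# Sketch of a minimal backup catalog entry. Fields and values are
# illustrative; adapt them to your services and compliance needs.
from dataclasses import dataclass
from datetime import date

@dataclass
class CatalogEntry:
    service: str
    tier: int
    rpo_minutes: int
    rto_minutes: int
    cadence: str
    retention: str
    last_restore_test: date

catalog = [
    CatalogEntry("postgres-main", 0, 15, 60, "nightly full + 15-min WAL",
                 "7d/4w/12m", date(2024, 5, 2)),
    CatalogEntry("object-store", 1, 60, 240, "continuous replication",
                 "versioning, 14-day immutability", date(2024, 2, 20)),
]

# Flag entries whose last restore test is older than 90 days.
today = date(2024, 6, 15)
stale = [e.service for e in catalog
         if (today - e.last_restore_test).days > 90]
print("restore test overdue:", stale)
```

The same records feed the quarterly review: sort by tier, check RPO/RTO against drill results, and chase every stale restore test.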
Next steps
- Automate alerts for failed backups and restore test reminders.
- Expand runbooks with screenshots or exact commands for your stack.
- Introduce game days to practice partial failures safely.
Mini challenge
In 10 minutes, write a one-paragraph "Executive DR Brief" for a Tier-0 service. Include RPO, RTO, backup cadence, immutability window, and date of last successful restore test. Keep it crisp and verifiable.