Who this is for
- Data Platform Engineers defining storage classes, TTLs, and archival rules.
- Data Engineers maintaining lakes/warehouses and streaming topics.
- Analysts/ML engineers who need predictable data availability windows.
Prerequisites
- Basic understanding of data storage (object store, block, file) and cost tiers.
- Familiarity with batch and streaming pipelines (e.g., schedulers, topics, checkpoints).
- Awareness of PII concepts and why compliance matters.
Why this matters
- Controls costs by moving data from hot to cold/archival tiers automatically.
- Reduces risk by enforcing legal and business retention (e.g., GDPR, finance audits).
- Improves performance by keeping hot datasets small and query-friendly.
- Prevents data sprawl and "mystery" storage bills.
Concept explained simply
Data lifecycle answers: what data we keep, where we keep it, how long we keep it, and what happens after. Retention policies are the specific timers and rules that enforce this plan automatically.
Mental model: Fridge → Freezer → Attic
- Fridge (hot): data you use daily; fast, expensive, short TTL.
- Freezer (warm/cool): data you need sometimes; slower, cheaper, longer TTL.
- Attic (archive): rarely accessed, cheapest, very long TTL or legal hold.
Lifecycle policies move items through these places and eventually discard or anonymize them.
Core levers you control
- Time-based TTL: expire after N days/months/years (see the sketch after this list).
- Event-based TTL: retain until a case closes, a contract ends, or a user requests deletion.
- Tiering: move between hot → warm → cold → archive.
- Transformation: anonymize/pseudonymize/aggregate to keep value without risk.
- Deletion vs. Legal hold: remove data vs. lock it immutably for audits.
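To make the first two levers concrete, here is a minimal sketch of a retention-rule check in Python; the rule shape, field names, and defaults are assumptions for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class RetentionRule:
    """Illustrative rule combining a time-based TTL, an event gate, and a legal hold."""
    ttl_days: Optional[int] = None              # time-based: expire N days after creation
    event_closed_at: Optional[datetime] = None  # event-based: eligible only after the case/contract closes
    legal_hold: bool = False                    # legal hold overrides everything

def is_expired(created_at: datetime, rule: RetentionRule, now: Optional[datetime] = None) -> bool:
    """Return True when a record is eligible for deletion or anonymization under the rule."""
    now = now or datetime.now(timezone.utc)
    if rule.legal_hold:
        return False  # locked for audit; never auto-delete
    if rule.event_closed_at is not None and rule.event_closed_at > now:
        return False  # the triggering event has not happened yet
    if rule.ttl_days is not None:
        return now - created_at > timedelta(days=rule.ttl_days)
    return False

# A record created 400 days ago with a 365-day TTL and no legal hold is expired.
print(is_expired(datetime.now(timezone.utc) - timedelta(days=400), RetentionRule(ttl_days=365)))
```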
Where retention lives
- Object storage (lakes): lifecycle rules by prefix/tags for transition and expiration (see the sketch after this list).
- Warehouses: partition expiration, table TTLs, time-travel/version retention windows.
- Streaming: topic retention by time/size; compaction for latest state.
- Backups/snapshots: separate retention from production data.
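For the object-storage case above, here is a hedged sketch of a lifecycle rule set with boto3; the bucket name, prefixes, day counts, and storage classes are illustrative assumptions, not a prescribed layout.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; tune the transition and expiration days to your own windows.
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-events-tiering",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                # Hot -> warm at 60 days, warm -> archive at ~13 months.
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 395, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expire raw objects entirely after 5 years.
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```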
Design steps (practical)
- Inventory datasets: purpose, sensitivity (PII/PHI/financial), owners, consumers (a policy-as-data sketch follows these steps).
- Capture requirements: legal minimums/maximums, audit needs, analytics SLAs.
- Classify access patterns: hot (daily), warm (weekly), cold (rare).
- Define retention windows per class: e.g., hot 30 days, warm 12 months, archive 7 years.
- Select storage tiers per phase, plus required encryption/immutability.
- Set automation: lifecycle policies, topic retention, table TTLs, compaction.
- Plan transformations: anonymize, aggregate, or delete columns before long-term storage.
- Add monitoring and evidence: metrics, alerts, sample audits, deletion logs.
- Document exceptions and break-glass procedures.
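One way to keep the outcome of these steps reviewable is to express the inventory and per-class windows as "policy as data". The fields, dataset names, and values below are hypothetical; the point is that the plan lives in version control next to the automation that enforces it.

```python
from dataclasses import dataclass

@dataclass
class DatasetPolicy:
    name: str
    owner: str
    sensitivity: str              # e.g. "pii", "financial", "none"
    hot_days: int                 # fast tier
    warm_days: int                # cheaper tier
    archive_days: int             # archive tier (0 = no archive)
    transformation: str = "none"  # e.g. "anonymize", "aggregate", "delete"
    legal_hold: bool = False

# Illustrative inventory entries.
INVENTORY = [
    DatasetPolicy("product_events", "analytics-team", "pii", 30, 365, 365 * 5, "anonymize"),
    DatasetPolicy("orders", "finance-team", "financial", 90, 0, 365 * 7, legal_hold=True),
]

for p in INVENTORY:
    total = p.hot_days + p.warm_days + p.archive_days
    print(f"{p.name}: hot {p.hot_days}d, warm {p.warm_days}d, archive {p.archive_days}d "
          f"(total {total}d), transform={p.transformation}, legal_hold={p.legal_hold}")
```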
Worked examples
Example 1: Product analytics events in a data lake
- Requirements: analysts query last 60 days frequently; business wants 24 months for seasonality; PII must not exceed 13 months (example policy).
- Policy:
- Hot: store events partitioned by event_date for 60 days in fast storage.
- Warm: transition objects at 60 days to cool storage; keep 13 months.
- PII handling: after 13 months, run a job to drop or hash direct identifiers; keep anonymized aggregates for 24 months (see the sketch after this example).
- Archive: store daily aggregates (no PII) for 5 years in archival tier.
- Automation: lifecycle rules by prefix, scheduled anonymization job, metric that counts objects older than allowed with alert.
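A minimal sketch of the anonymization job in this example, assuming the events land as Parquet with hypothetical column names (user_email, phone_number) and that pandas can reach the partition path; in practice you would target only partitions older than the 13-month window.

```python
import hashlib
import pandas as pd

SALT = "rotate-me"  # hypothetical salt; keep and rotate real salts in a secrets manager

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted hash so joins still work but the raw value is gone."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def anonymize_partition(path: str) -> None:
    """Drop or hash direct identifiers in one event partition that has aged past the PII window."""
    df = pd.read_parquet(path)
    df["user_email"] = df["user_email"].astype(str).map(pseudonymize)  # hashed, still joinable
    df = df.drop(columns=["phone_number"], errors="ignore")            # not needed at all: drop it
    df.to_parquet(path, index=False)

# Example call over one day's partition (path layout is an assumption).
anonymize_partition("s3://analytics-lake/raw/events/event_date=2024-01-15/part-000.parquet")
```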
Example 2: Financial transactions with audit needs
- Requirements: immutable records for 7 years; quick access to last 90 days; monthly regulator audits.
- Policy:
- Hot: last 90 days in fast tier; index in warehouse for operations.
- Archive: enable object lock (WORM) for 7 years; no deletes; versioning on (see the sketch after this example).
- Warehouse: time-travel retention 7 days to limit storage; canonical records are in the lake.
- Automation: lifecycle moves >90-day objects to archive tier; audit export job runs monthly.
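A hedged sketch of the WORM piece of this example using S3 Object Lock via boto3; the bucket name is hypothetical, and note that object lock generally has to be switched on when the bucket is created (region settings omitted for brevity).

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket-creation time; this also turns on versioning.
s3.create_bucket(Bucket="finance-archive", ObjectLockEnabledForBucket=True)

# Default retention: every new object version is immutable for 7 years.
# COMPLIANCE mode means no one (including admins) can delete early.
s3.put_object_lock_configuration(
    Bucket="finance-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```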
Example 3: Kafka topics for clickstream
- Requirements: stream consumers process near real-time; replay up to 3 days for recovery; keep detailed history in the lake.
- Policy:
- Topic retention: 3 days by time or capped by size; cleanup.policy=delete (see the topic sketch after this example).
- Offload: sink to object storage in hourly batches; the lake keeps the warm copy for 12 months (PII rules apply).
- Aggregates: daily roll-ups kept for 3 years in archive.
- Automation: consumer lag alerts, failed offload retries, partitioned object keys by dt/hour.
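The topic-retention half of this example, sketched with confluent-kafka's AdminClient; the broker address, topic name, partition count, and size cap are assumptions, and the same settings can be applied with your broker's topic tooling instead.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

clickstream = NewTopic(
    "clickstream-events",  # hypothetical topic name
    num_partitions=12,
    replication_factor=3,
    config={
        "cleanup.policy": "delete",                    # plain time/size-based deletion
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),  # 3 days for replay/recovery
        "retention.bytes": str(50 * 1024**3),          # ~50 GiB per partition as a size cap
    },
)

# create_topics is asynchronous; wait on the returned future to surface errors.
for topic, future in admin.create_topics([clickstream]).items():
    future.result()
    print(f"created {topic}")
```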
Example 4: CDC (change data capture) stream
- Requirements: maintain latest state plus 30 days of change history; low storage overhead.
- Policy:
- Topic A (changes): cleanup.policy=compact,delete with retention.ms set to 30 days (in milliseconds); this keeps the latest value per key plus recent change history.
- Snapshot: nightly compacted table in lake; monthly snapshots for 12 months.
- Automation: compaction monitoring; snapshot job integrity checks.
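A minimal sketch of the nightly snapshot step: collapsing change events to the latest value per key, which mirrors what cleanup.policy=compact does broker-side. The column names and the pandas approach are assumptions for illustration.

```python
import pandas as pd

def compact_changes(changes: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent change per primary key (latest-state snapshot)."""
    # Assumed columns: 'pk' (primary key), 'op' (insert/update/delete), 'changed_at' (event time).
    latest = (
        changes.sort_values("changed_at")
               .drop_duplicates(subset=["pk"], keep="last")
    )
    # Rows whose final operation was a delete are removed so the snapshot reflects current state.
    return latest[latest["op"] != "delete"].reset_index(drop=True)

# Tiny example with hypothetical data: key 1 ends as an update, key 3 ends as a delete.
changes = pd.DataFrame({
    "pk": [1, 1, 2, 3, 3],
    "op": ["insert", "update", "insert", "insert", "delete"],
    "changed_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-03"]),
    "value": ["a", "b", "c", "d", None],
})
print(compact_changes(changes))
```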
Common mistakes and how to self-check
- Mistake: one-size-fits-all TTLs. Fix: classify datasets; apply different windows for hot/warm/archive.
- Mistake: forgetting backups and logs. Fix: include them in your retention inventory.
- Mistake: keeping PII longer than necessary. Fix: anonymize or drop sensitive columns before long-term storage.
- Mistake: relying on manual deletion. Fix: use automated lifecycle rules and job schedules.
- Mistake: not proving compliance. Fix: store deletion logs, lifecycle configs, and periodic audit evidence.
Quick self-check
- Can you show where data older than your hot window lives?
- Can you prove PII does not exceed its retention? (One way to check is sketched after this list.)
- If a user requests deletion, can you remove or anonymize across lake, warehouse, and backups?
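One way to back these answers with evidence, sketched for an object-store lake: count objects under a PII-bearing prefix that are older than their allowed window. The bucket, prefix, and 13-month figure are assumptions; the same idea applies to warehouse partitions and backups.

```python
from datetime import datetime, timedelta, timezone
import boto3

MAX_PII_AGE_DAYS = 395  # illustrative 13-month PII window

def count_overdue_objects(bucket: str, prefix: str, max_age_days: int) -> int:
    """Count objects under a prefix whose age exceeds the allowed retention (audit evidence)."""
    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    overdue = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                overdue += 1
    return overdue

# Hypothetical bucket/prefix; alert or fail a scheduled check if anything is out of policy.
overdue = count_overdue_objects("analytics-lake", "raw/events/", MAX_PII_AGE_DAYS)
print(f"{overdue} objects exceed the {MAX_PII_AGE_DAYS}-day PII window")
```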
Practical projects
- Project 1: Build lifecycle for a mock data lake
- Create raw/, refined/, and aggregate/ prefixes.
- Set rules: raw: hot 30 days → cool 365 days → archive 5 years; refined: hot 90 days → cool 12 months.
- Add a job that drops email/phone after 13 months, keeping aggregates.
- Project 2: Stream retention and offload
- Configure a topic with 3-day retention and hourly offload to object storage.
- Implement idempotent offload with checkpoint files (see the sketch after these projects).
- Project 3: Warehouse TTL and partitioning
- Create partitioned tables by event_date; set table TTL of 180 days for detailed events and unlimited for aggregates.
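For Project 2, a minimal sketch of an idempotent hourly offload driven by a checkpoint file; the object-key layout is hypothetical and the actual upload is stubbed out, but the pattern (deterministic keys plus a record of completed hours) is what makes retries safe.

```python
import json
from pathlib import Path

CHECKPOINT = Path("offload_checkpoint.json")  # records which hours are already offloaded

def load_done() -> set:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def mark_done(hour_key: str, done: set) -> None:
    done.add(hour_key)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def offload_hour(hour_key: str, records: list) -> None:
    """Write one hour's batch to a deterministic key so retries overwrite instead of duplicating."""
    done = load_done()
    if hour_key in done:
        return  # already offloaded; safe to call again after a crash or retry
    target = f"s3://clickstream-lake/raw/dt={hour_key[:10]}/hour={hour_key[11:13]}/batch.json"  # hypothetical layout
    # upload_records(target, records)  # stub: replace with your object-store writer
    print(f"offloaded {len(records)} records to {target}")
    mark_done(hour_key, done)

# Calling twice for the same hour writes only once.
offload_hour("2024-01-15T09", [{"event": "click"}] * 3)
offload_hour("2024-01-15T09", [{"event": "click"}] * 3)
```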
Exercises
Do these now. They mirror the graded section below. You can take the test even without logging in; only logged-in users will see saved progress.
- Checklist before you start:
- Have you listed sensitive fields?
- Do you know hot/warm/archive windows?
- Do you know who needs the data and how often?
Exercise 1: Draft a lifecycle and retention plan for analytics events
See the Exercises panel below for instructions and a sample solution. Outline hot/warm/cold tiers, TTLs, anonymization, and monitoring.
Exercise 2: Configure streaming retention and lake offload
Set topic retention, compaction (if needed), and describe an offload schedule with object key patterns.
Learning path
- Before: Storage tiers and cost fundamentals; batch vs. streaming basics.
- Now: Data lifecycle and retention policies (this lesson).
- Next: Governance and data quality SLAs; access control and encryption at rest; disaster recovery and backups.
Next steps
- Write a one-page retention policy for one real dataset you own.
- Implement one automated lifecycle rule this week.
- Take the quick test to confirm understanding. Progress is saved for logged-in users.
Mini challenge
Scenario: You manage user behavioral events (contains user_id, country, device info) and purchase orders (financial). Analysts need 90 days of raw events and 24 months of trends. Finance needs 7-year immutable order records. Draft a 6-line policy:
- 1–2: hot/warm/archive windows for events; anonymization approach.
- 3–4: streaming retention and offload frequency.
- 5: order data immutability and tier.
- 6: monitoring/audit evidence you will capture.
Tip
Keep PII short-lived, keep aggregates long-lived, keep finance immutable.