Who this is for
- Data Platform Engineers defining storage classes, TTLs, and archival rules.
- Data Engineers maintaining lakes/warehouses and streaming topics.
- Analysts/ML engineers who need predictable data availability windows.
Prerequisites
- Basic understanding of data storage (object store, block, file) and cost tiers.
- Familiarity with batch and streaming pipelines (e.g., schedulers, topics, checkpoints).
- Awareness of PII concepts and why compliance matters.
Why this matters
- Controls costs by moving data from hot to cold/archival tiers automatically.
- Reduces risk by enforcing legal and business retention (e.g., GDPR, finance audits).
- Improves performance by keeping hot datasets small and query-friendly.
- Prevents data sprawl and "mystery" storage bills.
Concept explained simply
Data lifecycle answers: what data we keep, where we keep it, how long we keep it, and what happens after. Retention policies are the specific timers and rules that enforce this plan automatically.
Mental model: Fridge → Freezer → Attic
- Fridge (hot): data you use daily; fast, expensive, short TTL.
- Freezer (warm/cool): data you need sometimes; slower, cheaper, longer TTL.
- Attic (archive): rarely accessed, cheapest, very long TTL or legal hold.
Lifecycle policies move items through these places and eventually discard or anonymize them.
Core levers you control
- Time-based TTL: expire after N days/months/years (see the sketch after this list).
- Event-based TTL: retain until a case closes, a contract ends, or a user requests deletion.
- Tiering: move between hot → warm → cold → archive.
- Transformation: anonymize/pseudonymize/aggregate to keep value without risk.
- Deletion vs. Legal hold: remove data vs. lock it immutably for audits.
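To make the first two levers concrete, here is a minimal sketch of a retention-rule check in Python; the rule shape, field names, and defaults are assumptions for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class RetentionRule:
    """Illustrative rule combining a time-based TTL, an event gate, and a legal hold."""
    ttl_days: Optional[int] = None              # time-based: expire N days after creation
    event_closed_at: Optional[datetime] = None  # event-based: eligible only after the case/contract closes
    legal_hold: bool = False                    # legal hold overrides everything

def is_expired(created_at: datetime, rule: RetentionRule, now: Optional[datetime] = None) -> bool:
    """Return True when a record is eligible for deletion or anonymization under the rule."""
    now = now or datetime.now(timezone.utc)
    if rule.legal_hold:
        return False  # locked for audit; never auto-delete
    if rule.event_closed_at is not None and rule.event_closed_at > now:
        return False  # the triggering event has not happened yet
    if rule.ttl_days is not None:
        return now - created_at > timedelta(days=rule.ttl_days)
    return False

# A record created 400 days ago with a 365-day TTL and no legal hold is expired.
print(is_expired(datetime.now(timezone.utc) - timedelta(days=400), RetentionRule(ttl_days=365)))
```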
Where retention lives
- Object storage (lakes): lifecycle rules by prefix/tags for transition and expiration (see the sketch after this list).
- Warehouses: partition expiration, table TTLs, time-travel/version retention windows.
- Streaming: topic retention by time/size; compaction for latest state.
- Backups/snapshots: separate retention from production data.
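For the object-storage case above, here is a hedged sketch of a lifecycle rule set with boto3; the bucket name, prefixes, day counts, and storage classes are illustrative assumptions, not a prescribed layout.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; tune the transition and expiration days to your own windows.
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-events-tiering",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                # Hot -> warm at 60 days, warm -> archive at ~13 months.
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 395, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expire raw objects entirely after 5 years.
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```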
Design steps (practical)
- Inventory datasets: purpose, sensitivity (PII/PHI/financial), owners, consumers (a policy-as-data sketch follows these steps).
- Capture requirements: legal minimums/maximums, audit needs, analytics SLAs.
- Classify access patterns: hot (daily), warm (weekly), cold (rare).
- Define retention windows per class: e.g., hot 30 days, warm 12 months, archive 7 years.
- Select storage tiers per phase, plus required encryption/immutability.
- Set automation: lifecycle policies, topic retention, table TTLs, compaction.
- Plan transformations: anonymize, aggregate, or delete columns before long-term storage.
- Add monitoring and evidence: metrics, alerts, sample audits, deletion logs.
- Document exceptions and break-glass procedures.
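One way to keep the outcome of these steps reviewable is to express the inventory and per-class windows as "policy as data". The fields, dataset names, and values below are hypothetical; the point is that the plan lives in version control next to the automation that enforces it.

```python
from dataclasses import dataclass

@dataclass
class DatasetPolicy:
    name: str
    owner: str
    sensitivity: str              # e.g. "pii", "financial", "none"
    hot_days: int                 # fast tier
    warm_days: int                # cheaper tier
    archive_days: int             # archive tier (0 = no archive)
    transformation: str = "none"  # e.g. "anonymize", "aggregate", "delete"
    legal_hold: bool = False

# Illustrative inventory entries.
INVENTORY = [
    DatasetPolicy("product_events", "analytics-team", "pii", 30, 365, 365 * 5, "anonymize"),
    DatasetPolicy("orders", "finance-team", "financial", 90, 0, 365 * 7, legal_hold=True),
]

for p in INVENTORY:
    total = p.hot_days + p.warm_days + p.archive_days
    print(f"{p.name}: hot {p.hot_days}d, warm {p.warm_days}d, archive {p.archive_days}d "
          f"(total {total}d), transform={p.transformation}, legal_hold={p.legal_hold}")
```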
Worked examples
Example 1: Product analytics events in a data lake
- Requirements: analysts query last 60 days frequently; business wants 24 months for seasonality; PII must not exceed 13 months (example policy).
- Policy:
- Hot: store events partitioned by event_date for 60 days in fast storage.
- Warm: transition objects at 60 days to cool storage; keep 13 months.
- PII handling: after 13 months, run a job to drop or hash direct identifiers; keep anonymized aggregates for 24 months (see the sketch after this example).
- Archive: store daily aggregates (no PII) for 5 years in archival tier.
- Automation: lifecycle rules by prefix, scheduled anonymization job, metric that counts objects older than allowed with alert.
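A minimal sketch of the anonymization job in this example, assuming the events land as Parquet with hypothetical column names (user_email, phone_number) and that pandas can reach the partition path; in practice you would target only partitions older than the 13-month window.

```python
import hashlib
import pandas as pd

SALT = "rotate-me"  # hypothetical salt; keep and rotate real salts in a secrets manager

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted hash so joins still work but the raw value is gone."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def anonymize_partition(path: str) -> None:
    """Drop or hash direct identifiers in one event partition that has aged past the PII window."""
    df = pd.read_parquet(path)
    df["user_email"] = df["user_email"].astype(str).map(pseudonymize)  # hashed, still joinable
    df = df.drop(columns=["phone_number"], errors="ignore")            # not needed at all: drop it
    df.to_parquet(path, index=False)

# Example call over one day's partition (path layout is an assumption).
anonymize_partition("s3://analytics-lake/raw/events/event_date=2024-01-15/part-000.parquet")
```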
Example 2: Financial transactions with audit needs
- Requirements: immutable records for 7 years; quick access to last 90 days; monthly regulator audits.
- Policy:
- Hot: last 90 days in fast tier; index in warehouse for operations.
- Archive: enable object lock (WORM) for 7 years; no deletes; versioning on (see the sketch after this example).
- Warehouse: time-travel retention 7 days to limit storage; canonical records are in the lake.
- Automation: lifecycle moves >90-day objects to archive tier; audit export job runs monthly.
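A hedged sketch of the WORM piece of this example using S3 Object Lock via boto3; the bucket name is hypothetical, and note that object lock generally has to be switched on when the bucket is created (region settings omitted for brevity).

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket-creation time; this also turns on versioning.
s3.create_bucket(Bucket="finance-archive", ObjectLockEnabledForBucket=True)

# Default retention: every new object version is immutable for 7 years.
# COMPLIANCE mode means no one (including admins) can delete early.
s3.put_object_lock_configuration(
    Bucket="finance-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```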
Example 3: Kafka topics for clickstream
- Requirements: stream consumers process near real-time; replay up to 3 days for recovery; keep detailed history in the lake.
- Policy:
- Topic retention: 3 days by time or capped by size; cleanup.policy=delete (see the topic sketch after this example).
- Offload: sink to object storage in hourly batches; the lake keeps the warm copy for 12 months (PII rules apply).
- Aggregates: daily roll-ups kept for 3 years in archive.
- Automation: consumer lag alerts, failed offload retries, partitioned object keys by dt/hour.
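The topic-retention half of this example, sketched with confluent-kafka's AdminClient; the broker address, topic name, partition count, and size cap are assumptions, and the same settings can be applied with your broker's topic tooling instead.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

clickstream = NewTopic(
    "clickstream-events",  # hypothetical topic name
    num_partitions=12,
    replication_factor=3,
    config={
        "cleanup.policy": "delete",                    # plain time/size-based deletion
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),  # 3 days for replay/recovery
        "retention.bytes": str(50 * 1024**3),          # ~50 GiB per partition as a size cap
    },
)

# create_topics is asynchronous; wait on the returned future to surface errors.
for topic, future in admin.create_topics([clickstream]).items():
    future.result()
    print(f"created {topic}")
```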
Example 4: CDC (change data capture) stream
- Requirements: maintain latest state plus 30 days of change history; low storage overhead.
- Policy:
- Topic A (changes): cleanup.policy=compact,delete with retention.ms set to 30 days (in milliseconds); this keeps the latest value per key plus recent change history.
- Snapshot: nightly compacted table in lake; monthly snapshots for 12 months.
- Automation: compaction monitoring; snapshot job integrity checks.
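A minimal sketch of the nightly snapshot step: collapsing change events to the latest value per key, which mirrors what cleanup.policy=compact does broker-side. The column names and the pandas approach are assumptions for illustration.

```python
import pandas as pd

def compact_changes(changes: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent change per primary key (latest-state snapshot)."""
    # Assumed columns: 'pk' (primary key), 'op' (insert/update/delete), 'changed_at' (event time).
    latest = (
        changes.sort_values("changed_at")
               .drop_duplicates(subset=["pk"], keep="last")
    )
    # Rows whose final operation was a delete are removed so the snapshot reflects current state.
    return latest[latest["op"] != "delete"].reset_index(drop=True)

# Tiny example with hypothetical data: key 1 ends as an update, key 3 ends as a delete.
changes = pd.DataFrame({
    "pk": [1, 1, 2, 3, 3],
    "op": ["insert", "update", "insert", "insert", "delete"],
    "changed_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-03"]),
    "value": ["a", "b", "c", "d", None],
})
print(compact_changes(changes))
```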
Common mistakes and how to self-check
- Mistake: one-size-fits-all TTLs. Fix: classify datasets; apply different windows for hot/warm/archive.
- Mistake: forgetting backups and logs. Fix: include them in your retention inventory.
- Mistake: keeping PII longer than necessary. Fix: anonymize or drop sensitive columns before long-term storage.
- Mistake: relying on manual deletion. Fix: use automated lifecycle rules and job schedules.
- Mistake: not proving compliance. Fix: store deletion logs, lifecycle configs, and periodic audit evidence.
Quick self-check
- Can you show where data older than your hot window lives?
- Can you prove PII does not exceed its retention? (One way to check is sketched after this list.)
- If a user requests deletion, can you remove or anonymize across lake, warehouse, and backups?
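One way to back these answers with evidence, sketched for an object-store lake: count objects under a PII-bearing prefix that are older than their allowed window. The bucket, prefix, and 13-month figure are assumptions; the same idea applies to warehouse partitions and backups.

```python
from datetime import datetime, timedelta, timezone
import boto3

MAX_PII_AGE_DAYS = 395  # illustrative 13-month PII window

def count_overdue_objects(bucket: str, prefix: str, max_age_days: int) -> int:
    """Count objects under a prefix whose age exceeds the allowed retention (audit evidence)."""
    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    overdue = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                overdue += 1
    return overdue

# Hypothetical bucket/prefix; alert or fail a scheduled check if anything is out of policy.
overdue = count_overdue_objects("analytics-lake", "raw/events/", MAX_PII_AGE_DAYS)
print(f"{overdue} objects exceed the {MAX_PII_AGE_DAYS}-day PII window")
```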
Practical projects
- Project 1: Build lifecycle for a mock data lake
- Create raw/, refined/, and aggregate/ prefixes.
- Set rules: raw: hot 30 days → cool 365 days → archive 5 years; refined: hot 90 days → cool 12 months.
- Add a job that drops email/phone after 13 months, keeping aggregates.
- Project 2: Stream retention and offload
- Configure a topic with 3-day retention and hourly offload to object storage.
- Implement idempotent offload with checkpoint files (see the sketch after these projects).
- Project 3: Warehouse TTL and partitioning
- Create partitioned tables by event_date; set table TTL of 180 days for detailed events and unlimited for aggregates.
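For Project 2, a minimal sketch of an idempotent hourly offload driven by a checkpoint file; the object-key layout is hypothetical and the actual upload is stubbed out, but the pattern (deterministic keys plus a record of completed hours) is what makes retries safe.

```python
import json
from pathlib import Path

CHECKPOINT = Path("offload_checkpoint.json")  # records which hours are already offloaded

def load_done() -> set:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def mark_done(hour_key: str, done: set) -> None:
    done.add(hour_key)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def offload_hour(hour_key: str, records: list) -> None:
    """Write one hour's batch to a deterministic key so retries overwrite instead of duplicating."""
    done = load_done()
    if hour_key in done:
        return  # already offloaded; safe to call again after a crash or retry
    target = f"s3://clickstream-lake/raw/dt={hour_key[:10]}/hour={hour_key[11:13]}/batch.json"  # hypothetical layout
    # upload_records(target, records)  # stub: replace with your object-store writer
    print(f"offloaded {len(records)} records to {target}")
    mark_done(hour_key, done)

# Calling twice for the same hour writes only once.
offload_hour("2024-01-15T09", [{"event": "click"}] * 3)
offload_hour("2024-01-15T09", [{"event": "click"}] * 3)
```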
Exercises
Do these now. They mirror the graded section below. You can take the test even without logging in; only logged-in users will see saved progress.
- Checklist before you start:
- Have you listed sensitive fields?
- Do you know hot/warm/archive windows?
- Do you know who needs the data and how often?
Exercise 1: Draft a lifecycle and retention plan for analytics events
See the Exercises panel below for instructions and a sample solution. Outline hot/warm/cold tiers, TTLs, anonymization, and monitoring.
Exercise 2: Configure streaming retention and lake offload
Set topic retention, compaction (if needed), and describe an offload schedule with object key patterns.
Learning path
- Before: Storage tiers and cost fundamentals; batch vs. streaming basics.
- Now: Data lifecycle and retention policies (this lesson).
- Next: Governance and data quality SLAs; access control and encryption at rest; disaster recovery and backups.
Next steps
- Write a one-page retention policy for one real dataset you own.
- Implement one automated lifecycle rule this week.
- Take the quick test to confirm understanding. Progress is saved for logged-in users.
Mini challenge
Scenario: You manage user behavioral events (contains user_id, country, device info) and purchase orders (financial). Analysts need 90 days of raw events and 24 months of trends. Finance needs 7-year immutable order records. Draft a 6-line policy:
- 1–2: hot/warm/archive windows for events; anonymization approach.
- 3–4: streaming retention and offload frequency.
- 5: order data immutability and tier.
- 6: monitoring/audit evidence you will capture.
Tip
Keep PII short-lived, keep aggregates long-lived, keep finance immutable.