Why this matters
MLOps Engineers keep models, datasets, and training outputs safe, reproducible, and cost-effective. Solid artifact storage and retention policies let teams:
- Reproduce any production model version on demand.
- Control storage costs with lifecycle rules instead of manual cleanup.
- Meet audit, security, and regulatory requirements.
- Prevent accidental deletions and data leaks.
- Speed up delivery by making artifacts easy to find and reuse.
Typical real tasks you will handle
- Design an S3/Blob storage layout and lifecycle rules for ML artifacts.
- Define how many model versions to keep per stage (Staging vs Production).
- Encrypt, version, and tag artifacts for lineage and auditing.
- Automate pruning of old experiments while retaining critical releases.
- Detect and remove orphaned artifacts left by failed or abandoned runs.
Concept explained simply
Artifacts are the files produced and used by ML workflows: trained models, datasets, feature store snapshots or exports, metrics, plots, environment files (conda.yaml/requirements.txt), training code bundles, and inference packages.
Artifact storage is where these files live (usually object storage). Retention is the set of rules for how long to keep them and when to archive or delete them.
Mental model
Think of artifact storage as a library with:
- Sections (buckets/containers)
- Shelves (prefixes/folders)
- Labels (tags/metadata)
- Borrowing rules (access control)
- Weeding policy (retention/lifecycle)
Your job: make sure every book (artifact) can be found, trusted, and preserved for the right amount of time—no more, no less.
Core concepts
What to store (and how to package)
- Model binaries: .pt/.h5/.onnx/.pkl
- Training code snapshot: a commit SHA or bundle
- Environment/Dependencies: conda.yaml, requirements.txt, Dockerfile
- Data references: dataset version IDs, hashes, or manifest files (avoid duplicating large raw data unless required)
- Metrics & reports: JSON/CSV/HTML plots
Include a manifest file (e.g., artifact_manifest.json) capturing checksums, sizes, and pointers to related assets.
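As a minimal sketch of such a manifest, the snippet below walks a package directory and records each file's relative path, size, and SHA-256 checksum. The function names, the field names (path, size_bytes, sha256, related), and the shape of the related pointers are illustrative assumptions, not a standard schema.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path, related: dict[str, str]) -> dict:
    """Describe every file under package_dir.

    `related` is a hypothetical mapping of pointers to related assets,
    for example a commit SHA or a dataset version ID.
    """
    files = []
    for path in sorted(p for p in package_dir.rglob("*") if p.is_file()):
        if path.name == "artifact_manifest.json":
            continue  # do not list a previously written manifest inside itself
        files.append({
            "path": path.relative_to(package_dir).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"files": files, "related": related}
```

Writing the result with json.dump next to the packaged files gives CI a single document to validate before upload.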
Storage backends and traits
- Object storage (usually the best default): versioning, lifecycle rules, replication, low cost.
- Artifact repositories (ML-specific): integrate with MLflow/registry; still often backed by object storage.
- Container registries: for images, not general files.
Key features to look for: versioning, server-side encryption, lifecycle management, immutability/WORM, access logs, tags/metadata, replication.
Retention strategies
- Time-based: delete/archive after N days.
- Version-based: keep last K versions per model/stage.
- Stage-aware: Production kept longer than Staging/Dev.
- Event-based: retain artifacts attached to promoted releases; aggressively prune failed runs.
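The four strategies can be combined as independent "reasons to keep" an artifact; anything with no reason left becomes a deletion candidate. The sketch below is one possible composition, using the thresholds from Example 1 further down. The Artifact fields and the rule that a failed run only gets its short grace period are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    stage: str            # "dev", "staging", or "production"
    age_days: int
    rank: int             # 1 = newest version of this model within its stage
    promoted: bool = False
    failed: bool = False


# Stage-aware thresholds (hypothetical constants mirroring Example 1).
TTL_DAYS = {"dev": 14, "staging": 60, "production": 365}
KEEP_LAST = {"dev": 5, "staging": 10}
FAILED_TTL_DAYS = 3


def keep_reasons(a: Artifact) -> list[str]:
    """Return every rule that says 'keep'; an empty list means 'safe to prune'."""
    reasons = []
    if a.promoted:
        reasons.append("event: promoted release")
    if a.failed:
        # Failed runs are pruned aggressively: only the short grace period applies.
        if a.age_days <= FAILED_TTL_DAYS:
            reasons.append("event: failed run within grace period")
        return reasons
    if a.age_days <= TTL_DAYS[a.stage]:
        reasons.append("time: within TTL")
    if a.rank <= KEEP_LAST.get(a.stage, 0):
        reasons.append("version: among last K")
    return reasons
```

Returning reasons rather than a bare boolean makes dry-run reports easier to audit, because each kept artifact says why it survived.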
Compliance and safety
- Encryption in transit and at rest; customer-managed keys where required.
- Immutability/WORM for audit-critical releases.
- Least-privilege access: separate read/write roles for CI, serving, and analysts.
- Audit logs: who accessed what and when.
- PII handling: store only what is needed; tokenize or anonymize when possible.
Cost and performance
- Storage classes: hot (frequent access), cool/nearline, cold/archive.
- Lifecycle transitions: hot → cool after 30–60 days, then archive.
- Compression and deduplication: tar.gz, zstd; content-addressing with checksums.
- Caching: keep frequently used models in a small hot cache for CI and serving.
- Replication: multi-region for resilience when RTO/RPO require it.
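To illustrate the content-addressing idea from the list above, here is a small in-memory sketch: the storage key is derived from the SHA-256 of the bytes, so uploading identical content twice stores it once. The DedupStore class and the ml/blobs/sha256/ key layout are hypothetical; a real setup would write to object storage instead of a dict.

```python
import hashlib


def content_key(data: bytes, prefix: str = "ml/blobs") -> str:
    """Derive the object key from the content hash (layout is an assumption)."""
    digest = hashlib.sha256(data).hexdigest()
    # The two-character fan-out directory keeps any single prefix from growing huge.
    return f"{prefix}/sha256/{digest[:2]}/{digest}"


class DedupStore:
    """In-memory stand-in for a bucket, used only to show the dedup behaviour."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> tuple[str, bool]:
        """Store data if unseen; return (key, was_newly_written)."""
        key = content_key(data)
        is_new = key not in self.objects
        if is_new:
            self.objects[key] = data
        return key, is_new
```

Two runs that produce byte-identical model files then share one stored object, and the registry only needs to record the key.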
Worked examples
Example 1: Stage-aware retention policy
Goal: support heavy experimentation on a small budget.
- Dev/Experiment runs: keep 14 days; keep latest 5 versions per model; delete failed runs after 3 days.
- Staging: keep 60 days; keep latest 10 versions per model.
- Production: retain for at least 365 days; keep all promoted releases; after 365 days, archive to cold storage instead of deleting; enable immutability.
{
  "dev": {"ttl_days": 14, "keep_last_versions": 5, "failed_ttl_days": 3},
  "staging": {"ttl_days": 60, "keep_last_versions": 10},
  "production": {"ttl_days": 365, "archive": true, "immutable": true}
}
Example 2: Object storage lifecycle rules (pseudo)
{
  "rules": [
    {"filter": {"prefix": "ml/dev/"}, "transition": {"days": 14, "storage_class": "COLD"}, "expire": {"days": 30}},
    {"filter": {"prefix": "ml/staging/"}, "transition": {"days": 60, "storage_class": "COLD"}, "expire": {"days": 120}},
    {"filter": {"prefix": "ml/prod/"}, "transition": {"days": 365, "storage_class": "ARCHIVE"}, "lock": {"mode": "WORM", "retain_days": 365}}
  ]
}
Note: Use tags like stage=dev|staging|prod to target rules more flexibly than folder names.
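To check how the pseudo rules in Example 2 play out before translating them to a provider's format, a tiny simulator can help. The sketch below copies the rule structure from the example and reports where an object of a given age would end up. It is an approximation: the lifecycle_state function is hypothetical, it uses first-match prefix semantics, and real providers apply rules asynchronously and with their own minimum-age constraints.

```python
RULES = [
    {"filter": {"prefix": "ml/dev/"},
     "transition": {"days": 14, "storage_class": "COLD"},
     "expire": {"days": 30}},
    {"filter": {"prefix": "ml/staging/"},
     "transition": {"days": 60, "storage_class": "COLD"},
     "expire": {"days": 120}},
    {"filter": {"prefix": "ml/prod/"},
     "transition": {"days": 365, "storage_class": "ARCHIVE"},
     "lock": {"mode": "WORM", "retain_days": 365}},
]


def lifecycle_state(key: str, age_days: int, rules: list = RULES) -> str:
    """Return "HOT", a storage class name, or "EXPIRED" for an object of this age."""
    for rule in rules:
        if not key.startswith(rule["filter"]["prefix"]):
            continue
        expire = rule.get("expire")
        if expire is not None and age_days >= expire["days"]:
            return "EXPIRED"
        transition = rule["transition"]
        if age_days >= transition["days"]:
            return transition["storage_class"]
        return "HOT"
    return "HOT"  # no rule matches: the object stays where it was written
```

Running a sample of real keys and ages through such a function is a cheap way to spot prefixes that no rule covers.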
Example 3: Registry cleanup logic
For each model name:
- Keep all versions with stage=Production.
- For stage=Staging, keep last 10; delete older if also older than 60 days.
- For stage=None/Archived, keep a version only if it is among the last 5 and was accessed in the past 30 days; otherwise delete it.
Always verify that underlying artifact files aren’t referenced by another model/version before deletion.
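The rules of Example 3 can be sketched as a dry-run planner that returns what would be deleted without touching anything. The ModelVersion record and its fields are assumptions standing in for whatever your registry exposes, and the None/Archived rule is read as "keep only if among the last 5 and accessed within 30 days". Artifact files are released only when no surviving version still points at them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    stage: str               # "Production", "Staging", "Archived", or "None"
    age_days: int
    days_since_access: int
    artifact_uri: str


def plan_cleanup(versions: list[ModelVersion]) -> tuple[list[ModelVersion], list[str]]:
    """Return (versions to delete, artifact URIs that are safe to remove)."""
    groups: dict[tuple[str, str], list[ModelVersion]] = {}
    for v in versions:
        groups.setdefault((v.name, v.stage), []).append(v)

    doomed: list[ModelVersion] = []
    for (_, stage), group in groups.items():
        if stage == "Production":
            continue  # keep every Production version
        newest_first = sorted(group, key=lambda v: v.version, reverse=True)
        for rank, v in enumerate(newest_first, start=1):
            if stage == "Staging":
                delete = rank > 10 and v.age_days > 60
            else:  # "None" or "Archived"
                delete = rank > 5 or v.days_since_access > 30
            if delete:
                doomed.append(v)

    # A file may be shared: only release URIs that no surviving version references.
    doomed_set = set(doomed)
    still_referenced = {v.artifact_uri for v in versions if v not in doomed_set}
    removable = sorted({v.artifact_uri for v in doomed} - still_referenced)
    return doomed, removable
```

Keeping the registry decision and the file decision separate means a shared artifact survives even when one of the versions pointing to it is pruned.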
Example 4: Orphan detection
- List all artifact files under ml/.
- Build a set of referenced artifact IDs from registry metadata.
- Diff to find unreferenced (orphans).
- Quarantine orphans for 7 days (rename or move), then delete them only if they are still unreferenced.
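The steps above reduce to a set difference plus a holding period. The following sketch plans one sweep: which files to quarantine, which quarantined files to restore because a reference has appeared, and which to delete. The function name, the quarantined_since bookkeeping (original key mapped to quarantine date), and the returned dict shape are illustrative assumptions.

```python
from datetime import date


def plan_orphan_sweep(
    stored: set[str],
    referenced: set[str],
    quarantined_since: dict[str, date],
    today: date,
    hold_days: int = 7,
) -> dict[str, list[str]]:
    """Plan one sweep without performing any storage operation."""
    orphans = stored - referenced
    # New orphans enter quarantine; ones already quarantined are not re-added.
    to_quarantine = sorted(orphans - set(quarantined_since))
    # A quarantined file that is referenced again goes back to its place.
    to_restore = sorted(k for k in quarantined_since if k in referenced)
    # Delete only after the full holding period and only if still unreferenced.
    to_delete = sorted(
        k for k, since in quarantined_since.items()
        if k not in referenced and (today - since).days >= hold_days
    )
    return {"quarantine": to_quarantine, "restore": to_restore, "delete": to_delete}
```

Because the function only returns a plan, it can be run as a dry run and reviewed before any move or delete is executed.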
Step-by-step: Set up storage and retention
- Define naming: org/project/model/stage/version/run_id/
- Enable object versioning and default encryption (KMS-managed keys).
- Decide tags: stage, model, owner, data_sensitivity, ttl_class.
- Create lifecycle rules for dev/staging/prod, including transitions and expiry.
- Set IAM roles: ci-writer (write dev/staging), prod-writer (limited), inference-reader (read prod), auditor (read + logs).
- Implement manifests with checksums (SHA256) and sizes for each artifact package.
- Schedule pruning jobs and orphan sweeps (e.g., daily).
- Enable access logs and periodic cost review.
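The naming and tagging steps are easy to enforce in code at upload time. As a minimal sketch, the helpers below build a prefix in the org/project/model/stage/version/run_id/ layout and report which of the agreed tags are missing. The allowed stage names and the function names are assumptions; adjust them to your own conventions.

```python
REQUIRED_TAGS = ("stage", "model", "owner", "data_sensitivity", "ttl_class")
STAGES = ("dev", "staging", "prod")  # assumed stage names


def artifact_prefix(org: str, project: str, model: str,
                    stage: str, version: int, run_id: str) -> str:
    """Build the storage prefix, rejecting values that would break the layout."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    parts = [org, project, model, stage, str(version), run_id]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/".join(parts) + "/"


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tags that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]
```

For example, artifact_prefix("acme", "churn", "xgb", "dev", 3, "run42") yields "acme/churn/xgb/dev/3/run42/" (all names here are made up), and an upload job can refuse to proceed while missing_tags returns a non-empty list.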
Checklist to confirm setup
- Buckets/containers have versioning and encryption enabled
- Lifecycle rules exist for dev, staging, and prod with tested transitions
- IAM roles follow least privilege and are documented
- Artifacts include manifest with checksums
- Orphan detection routine is scheduled
- Immutability/WORM applied for production releases
Exercises you can do now
Do these exercises, then check your work against the hints below.
- Exercise 1: Design a stage-aware retention policy YAML for a team with heavy experiments, modest staging, and strict production retention. Include TTL, version limits, and archive rules.
- Exercise 2: Estimate storage savings when moving 2 TB of staging artifacts from hot to cold storage after 60 days (assume 60% cost reduction), and deleting 40% of old dev runs after 30 days (the dev storage volume is not given, so state the figure you assume).
Hints
- Use tags (stage, model, owner) for flexible lifecycle targeting.
- Production artifacts often need immutability and longer retention.
- Calculate savings per tier separately, then sum.
Common mistakes and how to self-check
- No versioning: Without versioning, rollbacks are fragile. Self-check: verify older object versions exist for a sample artifact.
- Over-retaining experiments: Costs creep up. Self-check: chart storage by prefix and age; ensure lifecycle rules hit most dev artifacts.
- Deleting referenced files: Breaks reproducibility. Self-check: compare registry references to files before any deletion.
- Weak tagging: Hard to audit or target rules. Self-check: pick 10 artifacts; confirm tags cover stage, model, owner.
- Skipping immutability for releases: Risks tampering. Self-check: attempt to modify a production artifact; it should fail when locked.
Practical projects
- Build a simulated ML project bucket with dev/staging/prod prefixes, tags, manifests, and lifecycle JSON; run a dry-run pruning script.
- Create an artifact packaging template: manifest + checksum + environment files; test validation on CI.
- Implement an orphan quarantine workflow: move, tag as quarantine=true, delete after 7 days if still unreferenced.
Who this is for and prerequisites
Who this is for
- MLOps Engineers designing storage and registry workflows.
- Data/ML Engineers maintaining pipelines and releases.
- Team leads needing governance and cost control.
Prerequisites
- Basic object storage concepts (buckets, prefixes, lifecycle).
- Familiarity with ML experimentation artifacts (models, metrics, datasets).
- Comfort with YAML/JSON configuration and IAM basics.
Learning path
- Before: Model versioning basics, experiment tracking.
- Now: Artifact storage and retention (this lesson).
- Next: Promotion workflows, model serving packaging, and automated cleanup jobs integrated with your registry.
Mini challenge
In 10 minutes, draft three lifecycle rules: one for dev (short TTL), one for staging (medium TTL + transition), and one for prod (immutability + archive). Add tags you would rely on. Keep it under 20 lines of JSON or YAML.
Next steps
- Convert your chosen examples into your environment’s configuration format.
- Pilot on a non-critical project; review costs after 2 weeks.
- Run the Quick Test below to validate your understanding.