Who this is for
Data Platform Engineers, Data Engineers, and Analytics Engineers who design or maintain data lakes on S3, ADLS, or GCS and want reliable, query-friendly paths and file names.
Prerequisites
- Basic familiarity with object storage (S3, ADLS Gen2, or GCS).
- Understanding of datasets, partitions, and batch vs. streaming ingestion.
- Basic knowledge of Parquet/CSV and columnar storage benefits.
Why this matters
Good layout and naming unlock fast queries, easier governance, cheaper storage, and simpler pipelines. In real projects you will:
- Define data lake zones (raw/bronze, clean/silver, curated/gold) and design paths for each.
- Choose partition keys that make queries cheaper and ingestion stable.
- Name files so reprocessing is idempotent and debuggable.
- Separate PII/security-sensitive data and apply lifecycle policies confidently.
Concept explained simply
Object storage is a giant key-value store. Your folder structure is virtual: slashes in object keys give you "folders". A clear, predictable path pattern turns that giant bucket into an organized data warehouse.
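To make the "virtual folders" idea concrete, here is a minimal sketch using boto3 (the AWS SDK for Python). Listing with Delimiter="/" asks S3 to group keys by their shared prefix, which is all a "folder" really is. The bucket name is the example used later in this lesson; the ADLS and GCS SDKs expose the same prefix/delimiter idea.

```python
import boto3

s3 = boto3.client("s3")

# "Folders" are just shared key prefixes; Delimiter="/" groups keys by prefix.
resp = s3.list_objects_v2(
    Bucket="acme-prod-datalake",  # example bucket from this lesson
    Prefix="clean/",
    Delimiter="/",
)
for common in resp.get("CommonPrefixes", []):
    print(common["Prefix"])  # e.g. clean/web/, clean/finance/
```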
Mental model
- Think of the bucket as a building, top-level prefixes as floors (zones), dataset prefixes as rooms, and partitions as drawers inside each room.
- Each file name is a label with enough info to trace its origin and time.
Core principles
- Bucket/container naming: lowercase, DNS-compliant, short but descriptive; use hyphens; avoid underscores and uppercase. Example: acme-prod-datalake.
- Zone layout: separate raw/clean/curated (or bronze/silver/gold). Example: raw/, clean/, curated/.
- Dataset naming: lowercase, with hyphens or underscores used consistently; prefer plural nouns (e.g., orders, customers).
- Hive-style partitions: year=YYYY/month=MM/day=DD (and optionally hour=HH) so query engines can prune efficiently.
- Event time vs. load time: partition by the time you query most often (usually event time). Keep ingestion_date in metadata, or add a separate prefix if needed.
- Granularity: avoid over-partitioning (too many tiny files/partitions). Daily is a safe default; add hourly only if needed.
- File naming: include dataset, time window, and a unique suffix (e.g., UUID or task attempt); avoid vague names like final.csv. See the sketch after this list.
- Use columnar formats plus compression for analytics (e.g., .parquet with Snappy), and name files accordingly.
- Security: isolate PII in separate prefixes or buckets; keep ACLs simple (principle of least privilege).
- Consistency: publish a naming standard and lint new paths against it.
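As a small illustration of the file-naming principle, here is a hedged sketch; data_file_name is a hypothetical helper, not a standard API.

```python
import uuid

def data_file_name(dataset: str, event_stamp: str, part: int, ext: str = "parquet") -> str:
    """Hypothetical helper: dataset + event window + part counter + unique suffix."""
    return f"{dataset}__eventdate={event_stamp}__part-{part:04d}-{uuid.uuid4().hex[:4]}.{ext}"

# data_file_name("orders", "20260110", 3)
# -> "orders__eventdate=20260110__part-0003-<4-hex-chars>.parquet"
```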
Design templates you can reuse
General data lake template
{bucket}/{zone}/{domain}/{dataset}/version=v{major}/format={parquet|csv}/
year=YYYY/month=MM/day=DD[/hour=HH]/
{dataset}__eventdate=YYYYMMDD[HH]__part-{nnnn}-{uuid}.{ext}
Notes:
- Include version=v1 when schema changes need side-by-side storage.
- The optional format=parquet level helps during mixed-format transitions.
- Use domain (sales, marketing, finance) to group datasets. A sketch that renders this template follows these notes.
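A minimal sketch that renders the full template, assuming daily (and optionally hourly) partitions; general_path and its defaults are illustrative, not a standard API.

```python
import uuid
from datetime import date

def general_path(zone: str, domain: str, dataset: str, event_date: date,
                 part: int, version: int = 1, fmt: str = "parquet",
                 hour: int | None = None) -> str:
    """Hypothetical helper that renders the general template above."""
    segments = [
        zone, domain, dataset,
        f"version=v{version}", f"format={fmt}",
        f"year={event_date:%Y}", f"month={event_date:%m}", f"day={event_date:%d}",
    ]
    stamp = f"{event_date:%Y%m%d}"
    if hour is not None:  # add the optional hour=HH level only when justified
        segments.append(f"hour={hour:02d}")
        stamp += f"{hour:02d}"
    file_name = f"{dataset}__eventdate={stamp}__part-{part:04d}-{uuid.uuid4().hex[:4]}.{fmt}"
    return "/".join(segments + [file_name])

# general_path("curated", "sales", "orders", date(2026, 1, 5), 3, version=2)
# -> "curated/sales/orders/version=v2/format=parquet/year=2026/month=01/day=05/
#     orders__eventdate=20260105__part-0003-<uuid>.parquet"
```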
Minimal, analytics-friendly template
{bucket}/{zone}/{dataset}/year=YYYY/month=MM/day=DD/
part-{nnnn}-{uuid}.parquet
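If you write with PyArrow, partition_cols produces exactly this Hive-style layout. A minimal sketch, assuming the year/month/day columns already exist in the table; note that by default PyArrow generates its own file names inside each partition, so you may still need to control or rename them to match the pattern above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2],
    "amount": [9.99, 24.50],
    "year": ["2026", "2026"],
    "month": ["01", "01"],
    "day": ["10", "10"],
})

# Produces Hive-style key=value directories under the root path:
# orders/year=2026/month=01/day=10/<generated-name>.parquet
pq.write_to_dataset(table, root_path="orders", partition_cols=["year", "month", "day"])
```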
Raw landing with load-time separation
{bucket}/raw/{source}/{dataset}/load_date=YYYY-MM-DD/
{original_filename}
Keep raw immutable; clean later into event-time partitions.
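A minimal sketch of the landing convention: the load date comes from ingestion time (not from the file's content), and the original file name is preserved for traceability. landing_key is a hypothetical helper.

```python
from datetime import datetime, timezone

def landing_key(source: str, dataset: str, original_filename: str) -> str:
    """Hypothetical helper: raw landing key with load-time separation."""
    load_date = datetime.now(timezone.utc).date().isoformat()  # YYYY-MM-DD
    return f"raw/{source}/{dataset}/load_date={load_date}/{original_filename}"

# landing_key("sap", "invoices", "INV_20260109.csv")
# -> "raw/sap/invoices/load_date=2026-01-09/INV_20260109.csv"  (date = ingestion day)
```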
Worked examples
- Clickstream events (hourly queries):
  s3://acme-prod-datalake/clean/web/clickstream/year=2026/month=01/day=10/hour=14/clickstream__eventdate=2026011014__part-0001-6e0b.parquet
  Why: hourly partitioning makes time-window queries faster, and the file name includes the dataset and time window for quick debugging.
- Finance daily batches (no hourly need):
  abfss://datalake@acmeprod.dfs.core.windows.net/curated/finance/invoices/year=2026/month=01/day=09/invoices__eventdate=20260109__part-0000-a9d1.parquet
  Why: daily granularity is sufficient and avoids tiny partitions.
- Schema-breaking change (versioned):
  gs://acme-prod-datalake/curated/sales/orders/version=v2/year=2026/month=01/day=05/orders__eventdate=20260105__part-0003-f2c7.parquet
  Why: keep v1 and v2 side-by-side while consumers migrate.
Step-by-step design process
- Clarify zones and environments (e.g., dev, stg, prod; raw, clean, curated).
- Choose dataset names and domains.
- Decide partitioning based on query patterns (start with daily).
- Define file name schema (dataset, date window, part counter, UUID).
- Plan PII separation and retention policies.
- Write and share the standard, then apply it consistently.
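For the last step (and the automated checks suggested under next steps), here is a hedged sketch of a path linter: one regex that accepts keys matching the minimal template. The pattern, zone names, and allowed extensions are assumptions you would adapt to your own standard.

```python
import re

# Accepts keys like:
# clean/orders/year=2026/month=01/day=10/orders__eventdate=20260110__part-0001-6e0b.parquet
KEY_PATTERN = re.compile(
    r"(raw|clean|curated)/"                # zone
    r"[a-z][a-z0-9_-]*/"                   # dataset
    r"year=\d{4}/month=\d{2}/day=\d{2}/"   # daily Hive-style partitions
    r"(?:hour=\d{2}/)?"                    # optional hourly level
    r"[a-z][a-z0-9_-]*__eventdate=\d{8}(?:\d{2})?"
    r"__part-\d{4}-[0-9a-f]+\.(?:parquet|csv)"
)

def is_valid_key(key: str) -> bool:
    return KEY_PATTERN.fullmatch(key) is not None

assert is_valid_key(
    "clean/orders/year=2026/month=01/day=10/"
    "orders__eventdate=20260110__part-0001-6e0b.parquet"
)
```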
Exercises
Do these now. The quick test is below. Everyone can take it; if you are logged in, your progress will be saved automatically.
Exercise 1: Design a layout for an orders dataset
Goal: Create bucket, zones, dataset path, partitioning, and file name pattern for sales/orders in production. Queries are by order date (daily). Format: Parquet.
- Environment: prod
- Zones: raw, clean, curated
- Partition by: event (order) date, daily
Deliverables:
- Bucket name
- Path template for curated zone
- One example file path for 2026-01-10
Exercise 2: Refactor a messy path
Given the path:
s3://CompanyData/Raw/Orders/2024/1/5/file.csv
Refactor it to be DNS-compliant, zone-aware, and Hive-partitioned by event date (daily), with improved file naming. Assume prod, dataset: orders, domain: sales, format: Parquet.
Exercise checklist
- Bucket/container name is lowercase and DNS-compliant.
- Zones separated clearly (raw/clean/curated).
- Hive-style partitions: year=YYYY/month=MM/day=DD.
- File name includes dataset, event window, and unique suffix.
- No over-partitioning; daily is used unless hourly is justified.
- PII or sensitive data isolated if applicable.
Common mistakes and self-check
- Over-partitioning: hour-level partitions for low-volume data create many tiny files. Self-check: each partition should hold a few reasonably large files, not dozens of tiny ones (see the counting sketch after this list).
- Bucket names with uppercase/underscores: violates DNS rules. Self-check: all lowercase, digits, and hyphens only.
- Mixing event and load time: confuses consumers. Self-check: partitions reflect the time dimension used in queries.
- Vague file names: e.g., final.csv. Self-check: can you find the source and time window from the name?
- PII mixed with non-PII: complicates access control. Self-check: sensitive prefixes/buckets are isolated.
- Lack of versioning for breaking changes: forces all consumers to update at once. Self-check: use version=vN when needed.
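To apply the over-partitioning self-check, here is a minimal S3 sketch using boto3 that counts objects per day-level partition; the function name and the example bucket/prefix are illustrative.

```python
from collections import Counter

import boto3

def files_per_partition(bucket: str, prefix: str) -> Counter:
    """Count objects per partition directory under a dataset prefix."""
    s3 = boto3.client("s3")
    counts: Counter = Counter()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            partition = obj["Key"].rsplit("/", 1)[0]  # drop the file name
            counts[partition] += 1
    return counts

# Dozens of tiny files per partition is a signal to coarsen partitioning or compact:
# files_per_partition("acme-prod-datalake", "clean/web/clickstream/")
```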
Practical projects
- Redesign a small data lake: pick three datasets (orders, customers, products) and apply the template across raw/clean/curated.
- Partition tuning: run the same query on daily vs. hourly partitions (on sample data) and compare object counts and query times.
- Versioned migration: simulate a schema change (v1 to v2) and keep both versions accessible; document a consumer migration plan.
Mini challenge
Design a path template for IoT telemetry where consumers filter mostly by device_id and by day. What is your partitioning strategy? Explain your trade-offs and provide a final path template and example file name.
Learning path
- Start: Object Storage basics (buckets, permissions).
- Then: Layout and naming (this lesson).
- Next: File formats and small-file management.
- Later: Table formats (Iceberg/Delta/Hudi) and governance catalogs.
Next steps
- Apply the template to one real dataset this week.
- Publish a short team standard (one page) and get feedback.
- Automate checks in ingestion (reject paths that violate the standard).
Quick Test and progress
Take the quick test below to check your understanding. Everyone can take it for free. If you are logged in, your progress is saved automatically.