Object Storage Layout And Naming

Learn Object Storage Layout And Naming for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

Data Platform Engineers, Data Engineers, and Analytics Engineers who design or maintain data lakes on S3, ADLS, or GCS and want reliable, query-friendly paths and file names.

Prerequisites

  • Basic familiarity with object storage (S3, ADLS Gen2, or GCS).
  • Understanding of datasets, partitions, and batch vs. streaming ingestion.
  • Basic knowledge of Parquet/CSV and columnar storage benefits.

Why this matters

Good layout and naming unlock fast queries, easier governance, cheaper storage, and simpler pipelines. In real projects you will:

  • Define data lake zones (raw/bronze, clean/silver, curated/gold) and design paths for each.
  • Choose partition keys that make queries cheaper and ingestion stable.
  • Name files so reprocessing is idempotent and debuggable.
  • Separate PII/security-sensitive data and apply lifecycle policies confidently.

Concept explained simply

Object storage is a giant key-value store. Your folder structure is virtual: slashes in object keys give you "folders". A clear, predictable path pattern turns that giant bucket into an organized data warehouse.
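
You can see the "virtual folder" behavior directly with a list call. Below is a minimal sketch using boto3; the bucket and prefix are illustrative. The Delimiter parameter is what groups keys into folder-like prefixes.

import boto3

s3 = boto3.client("s3")

# List the next level of virtual "folders" under a prefix.
resp = s3.list_objects_v2(
    Bucket="acme-prod-datalake",
    Prefix="clean/web/clickstream/year=2026/",
    Delimiter="/",  # group keys at the next slash
)

# CommonPrefixes are the virtual subfolders: month=01/, month=02/, ...
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])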

Mental model

  • Think of the bucket as a building, top-level prefixes as floors (zones), dataset prefixes as rooms, and partitions as drawers inside each room.
  • Each file name is a label with enough info to trace its origin and time.

Core principles

  • Bucket/container naming: lowercase, DNS-compliant, short but descriptive; use hyphens; avoid underscores and uppercase. Example: acme-prod-datalake.
  • Zone layout: separate raw/clean/curated (or bronze/silver/gold). Example: raw/, clean/, curated/.
  • Dataset naming: use lowercase; pick hyphens or underscores and stick with one; prefer plural nouns (e.g., orders, customers).
  • Hive-style partitions: year=YYYY/month=MM/day=DD (and optionally hour=HH) for query engines to prune efficiently.
  • Event time vs. load time: partition by the time you query most often (usually event time). Keep ingestion_date in metadata or add a separate prefix if needed.
  • Granularity: avoid over-partitioning (too many tiny files/partitions). Daily is a safe default; add hourly only if needed.
  • File naming: include dataset, time window, and a unique suffix (e.g., UUID or task attempt). Avoid vague names like final.csv.
  • Use columnar + compression for analytics (e.g., .parquet with snappy). Name files accordingly.
  • Security: isolate PII in separate prefixes or buckets; keep ACLs simple (principle of least privilege).
  • Consistency: publish a naming standard and lint new paths against it.

Design templates you can reuse

General data lake template
{bucket}/{zone}/{domain}/{dataset}/version=v{major}/format={parquet|csv}/
  year=YYYY/month=MM/day=DD[/hour=HH]/
    {dataset}__eventdate=YYYYMMDD[THH]__part-{nnnn}-{uuid}.{ext}

Notes:

  • Include version=v1 when schema changes need side-by-side storage.
  • Optional format=parquet level helps mixed-format transitions.
  • Use domain (sales, marketing, finance) to group datasets.
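
Pipelines stay consistent when paths come from one function instead of hand-built strings. Here is a minimal sketch of a builder for the general template above; the bucket is omitted and all names are illustrative.

from datetime import date
from uuid import uuid4

def build_key(zone: str, domain: str, dataset: str, event_date: date,
              part: int, version: int = 1, ext: str = "parquet") -> str:
    """Build an object key following the general data lake template."""
    prefix = (
        f"{zone}/{domain}/{dataset}/version=v{version}/format={ext}/"
        f"year={event_date:%Y}/month={event_date:%m}/day={event_date:%d}/"
    )
    filename = (
        f"{dataset}__eventdate={event_date:%Y%m%d}"
        f"__part-{part:04d}-{uuid4().hex[:4]}.{ext}"
    )
    return prefix + filename

print(build_key("curated", "sales", "orders", date(2026, 1, 5), part=3, version=2))
# e.g. curated/sales/orders/version=v2/format=parquet/year=2026/month=01/day=05/
#      orders__eventdate=20260105__part-0003-f2c7.parquet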

Minimal, analytics-friendly template
{bucket}/{zone}/{dataset}/year=YYYY/month=MM/day=DD/
  part-{nnnn}-{uuid}.parquet

Raw landing with load-time separation
{bucket}/raw/{source}/{dataset}/load_date=YYYY-MM-DD/
  {original_filename}

Keep raw immutable; clean later into event-time partitions.
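
The raw-to-clean step can be a single partitioned write. Below is a sketch using DuckDB from Python; the paths and the order_ts column are hypothetical, and COPY ... PARTITION_BY writes hive-style year=/month=/day= directories.

import duckdb

con = duckdb.connect()
# Read one raw load_date batch and rewrite it into event-time partitions.
# Note: DuckDB writes integer partition values unpadded (month=1); cast or
# lpad in the SELECT if your standard requires month=01.
con.execute("""
    COPY (
        SELECT *,
               year(order_ts)  AS year,
               month(order_ts) AS month,
               day(order_ts)   AS day
        FROM read_csv_auto('raw/erp/orders/load_date=2026-01-10/*.csv')
    )
    TO 'clean/orders' (FORMAT parquet, PARTITION_BY (year, month, day))
""")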

Worked examples

  1. Clickstream events (hourly queries):

    s3://acme-prod-datalake/clean/web/clickstream/year=2026/month=01/day=10/hour=14/
      clickstream__eventdate=20260110T14__part-0001-6e0b.parquet

    Why: Hourly partitioning makes time-window queries faster, and the file name carries the dataset and window for quick debugging (see the query sketch after these examples).

  2. Finance daily batches (no hourly need):

    abfss://datalake@acmeprod.dfs.core.windows.net/curated/finance/invoices/year=2026/month=01/day=09/
      invoices__eventdate=20260109__part-0000-a9d1.parquet

    Why: Daily is sufficient and avoids tiny partitions.

  3. Schema-breaking change (versioned):

    gs://acme-prod-datalake/curated/sales/orders/version=v2/year=2026/month=01/day=05/
      orders__eventdate=20260105__part-0003-f2c7.parquet

    Why: Keep v1 and v2 side-by-side while consumers migrate.
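
Query engines use the key=value segments to skip data entirely. A sketch of a pruned query with DuckDB (the bucket is from the examples above; reading from S3 also assumes the httpfs extension and credentials are configured):

import duckdb

con = duckdb.connect()
# hive_partitioning exposes year/month/day/hour as columns, so the WHERE
# clause prunes whole prefixes instead of scanning every object. Values
# are compared as strings to match the zero-padded path segments.
rows = con.execute("""
    SELECT count(*) AS events
    FROM read_parquet(
        's3://acme-prod-datalake/clean/web/clickstream/year=*/month=*/day=*/hour=*/*.parquet',
        hive_partitioning = true
    )
    WHERE year = '2026' AND month = '01' AND day = '10' AND hour = '14'
""").fetchall()
print(rows)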

Step-by-step design process

  1. Clarify zones and environments (e.g., dev, stg, prod; raw, clean, curated).
  2. Choose dataset names and domains.
  3. Decide partitioning based on query patterns (start with daily).
  4. Define file name schema (dataset, date window, part counter, UUID).
  5. Plan PII separation and retention policies.
  6. Write and share the standard, then apply it consistently; a linter sketch follows this list.
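
Here is a minimal sketch of such a linter, written against the minimal template from earlier; the exact regex is an assumption, so adapt it to whatever standard you publish.

import re

# zone/dataset/year=YYYY/month=MM/day=DD/part-nnnn-uuid.parquet
KEY_PATTERN = re.compile(
    r"^(raw|clean|curated)/"                # zone
    r"[a-z][a-z0-9_]*/"                     # dataset: lowercase
    r"year=\d{4}/month=\d{2}/day=\d{2}/"    # hive-style event date
    r"part-\d{4}-[0-9a-f]{4,32}\.parquet$"  # part counter + unique suffix
)

def lint_key(key: str) -> bool:
    """Return True if the object key follows the naming standard."""
    return KEY_PATTERN.match(key) is not None

assert lint_key("clean/orders/year=2026/month=01/day=10/part-0001-6e0b.parquet")
assert not lint_key("Raw/Orders/2024/1/5/file.csv")  # messy legacy key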

Exercises

Do these now; the quick test at the end checks the same skills.

Exercise 1: Design a layout for an orders dataset

Goal: Create bucket, zones, dataset path, partitioning, and file name pattern for sales/orders in production. Queries are by order date (daily). Format: Parquet.

  • Environment: prod
  • Zones: raw, clean, curated
  • Partition by: event (order) date, daily

Deliverables:

  • Bucket name
  • Path template for curated zone
  • One example file path for 2026-01-10

Expected output: a DNS-compliant bucket name, a curated path template with hive-style partitions, and an example Parquet file path for 2026-01-10 with an informative file name.

Exercise 2: Refactor a messy path

Given the path:

s3://CompanyData/Raw/Orders/2024/1/5/file.csv

Refactor it to be DNS-compliant, zone-aware, and hive-partitioned by event date (daily), with a more informative file name. Assume prod; dataset: orders; domain: sales; format: parquet.

Exercise checklist

  • Bucket/container name is lowercase and DNS-compliant.
  • Zones separated clearly (raw/clean/curated).
  • Hive-style partitions: year=YYYY/month=MM/day=DD.
  • File name includes dataset, event window, and unique suffix.
  • No over-partitioning; daily is used unless hourly is justified.
  • PII or sensitive data isolated if applicable.

Common mistakes and self-check

  • Over-partitioning: hour-level partitions for low-volume data create many tiny files. Self-check: each partition should hold a few large files, not dozens of tiny ones (a script sketch follows this list).
  • Bucket names with uppercase/underscores: violates DNS rules. Self-check: all lowercase, digits, and hyphens only.
  • Mixing event and load time: confuses consumers. Self-check: partitions reflect the time dimension used in queries.
  • Vague file names: e.g., final.csv. Self-check: can you find the source and time window from the name?
  • PII mixed with non-PII: complicates access control. Self-check: sensitive prefixes/buckets are isolated.
  • Lack of versioning for breaking changes: forces all consumers to update at once. Self-check: use version=vN when needed.
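
The over-partitioning self-check is easy to script. A sketch using boto3 that reports file count and average size per leaf partition; the bucket and prefix are illustrative.

from collections import defaultdict

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# partition prefix -> [file_count, total_bytes]
stats = defaultdict(lambda: [0, 0])
for page in paginator.paginate(Bucket="acme-prod-datalake",
                               Prefix="clean/web/clickstream/"):
    for obj in page.get("Contents", []):
        partition = obj["Key"].rsplit("/", 1)[0]
        stats[partition][0] += 1
        stats[partition][1] += obj["Size"]

for partition, (count, size) in sorted(stats.items()):
    # Dozens of sub-MB files per partition signal over-partitioning.
    print(f"{partition}: {count} files, avg {size / count / 1e6:.1f} MB")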

Practical projects

  • Redesign a small data lake: pick three datasets (orders, customers, products) and apply the template across raw/clean/curated.
  • Partition tuning: run the same query on daily vs. hourly partitions (on sample data) and compare object counts and query times.
  • Versioned migration: simulate a schema change (v1 to v2) and keep both versions accessible; document a consumer migration plan.

Mini challenge

Design a path template for IoT telemetry where consumers filter mostly by device_id and by day. What is your partitioning strategy? Explain your trade-offs and provide a final path template and example file name.

Learning path

  • Start: Object Storage basics (buckets, permissions).
  • Then: Layout and naming (this lesson).
  • Next: File formats and small-file management.
  • Later: Table formats (Iceberg/Delta/Hudi) and governance catalogs.

Next steps

  • Apply the template to one real dataset this week.
  • Publish a short team standard (one page) and get feedback.
  • Automate checks in ingestion (reject paths that violate the standard).

Quick Test and progress

Take the quick test below to check your understanding: six questions, 70% or higher to pass. Everyone can take it for free; if you are logged in, your progress is saved automatically.
