Why this matters
Batch file ingestion moves files (CSV, JSON, Parquet, etc.) from sources like S3, GCS, Azure Blob, or SFTP into your data lake or warehouse on a schedule. Data Engineers do this to onboard partners, load application exports, backfill history, and power analytics reliably.
- Real tasks: set up daily partner file drops, validate schema, partition data by date, and load into staging/warehouse with deduplication.
- Impact: correctness, recoverability, and predictable costs vs. always-on streaming.
Concept explained simply
Batch file ingestion is a repeatable recipe: discover files, validate and transform them, store them in a structured location, then load into query-ready tables. You run it on a schedule (for example, every day at 01:00).
Mental model
Imagine a conveyor belt where a new sealed box arrives each day. You: (1) check the label (file name and date), (2) open and verify contents (schema and row counts), (3) place items into the right shelf (partitioned storage), and (4) update the catalog (warehouse merge). If a box is damaged, you put it in a quarantine area and continue.
Core patterns and formats
- Sources: S3/GCS/Blob storage, SFTP, shared folders, app exports.
- Formats: CSV (simple, fragile), JSON Lines (semi-structured), Parquet/Avro (columnar/binary, strong types), compressed with gzip or snappy.
- Partitioning: typically by event_date or ingestion_date (YYYY/MM/DD). Compact small files; target roughly 100–512 MB per file for efficient reads (see the sizing sketch after this list).
- Workflow: land (raw) → validate → normalize → load (staging) → merge into warehouse (dim/fact).
- Idempotency: safe to re-run without duplicating rows. Use unique keys + upsert/merge, or load into a temp area then atomically replace.
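To make the file-sizing guidance concrete, here is a minimal sketch (in Python) that picks an output file count from the total input size; the 256 MB target is an assumed value inside the 100–512 MB range above.

# Minimal sketch: choose how many output files to write so each lands near a
# target size. The 256 MB target is an assumption within the ~100-512 MB range.
import math

def plan_output_files(total_input_bytes: int, target_bytes: int = 256 * 1024 * 1024) -> int:
    # Always write at least one file; rounding up keeps each file at or below the target.
    return max(1, math.ceil(total_input_bytes / target_bytes))

# Example: ~3.2 GB of input -> 13 files of roughly 250 MB each.
print(plan_output_files(int(3.2 * 1024**3)))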
Worked examples
Example 1 — Daily CSV from object storage to warehouse
- Discovery: list files matching vendorA/sales/2024-05-*.csv.
- Validation: check header columns, count rows, sample for encoding issues.
- Landing: copy to data-lake/raw/vendorA/sales/ingestion_date=2024-05-15/.
- Normalization: convert to Parquet; standardize column names and types.
- Load: stage in warehouse.sales_stg, then MERGE into warehouse.sales.
MERGE INTO warehouse.sales t
USING warehouse.sales_stg s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
VALUES (...);
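A sketch of Example 1's non-SQL steps (discovery, validation, landing, normalization), assuming the source lives in S3 and that boto3 and pandas with pyarrow are installed; bucket names, column names, and paths are illustrative assumptions, not a prescribed layout.

# Sketch of Example 1: discover the day's CSVs, validate, land the raw files,
# and convert to Parquet for the staging load. Names and paths are assumptions.
import boto3
import pandas as pd

s3 = boto3.client("s3")
SRC_BUCKET, LAKE_BUCKET = "vendor-drops", "data-lake"
run_date = "2024-05-15"
expected_cols = ["order_id", "order_ts", "amount"]

# Discovery: list the day's files (single page shown; paginate in real use).
resp = s3.list_objects_v2(Bucket=SRC_BUCKET, Prefix=f"vendorA/sales/{run_date}")
for obj in resp.get("Contents", []):
    key = obj["Key"]
    filename = key.rsplit("/", 1)[-1]
    local_csv = f"/tmp/{filename}"
    s3.download_file(SRC_BUCKET, key, local_csv)

    # Validation: header columns and a non-empty row count.
    df = pd.read_csv(local_csv)
    if list(df.columns) != expected_cols or df.empty:
        s3.upload_file(local_csv, LAKE_BUCKET, f"quarantine/vendorA/sales/{filename}")
        continue

    # Landing: keep the raw CSV as delivered, under an ingestion_date partition.
    s3.upload_file(local_csv, LAKE_BUCKET,
                   f"raw/vendorA/sales/ingestion_date={run_date}/{filename}")

    # Normalization: standardize types and write Parquet for the staging load.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True)
    local_parquet = local_csv.replace(".csv", ".parquet")
    df.to_parquet(local_parquet, index=False)
    s3.upload_file(local_parquet, LAKE_BUCKET,
                   f"curated/vendorA/sales/ingestion_date={run_date}/{filename.replace('.csv', '.parquet')}")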
Example 2 — SFTP JSON Lines, zipped
- Pull: download *.jsonl.gz from SFTP incoming/.
- Integrity: verify file size and gzip integrity; on failure, move the file to quarantine.
- Schema: validate required fields (id, ts, amount).
- Landing: write to raw/partnerB/payments/ingestion_date=YYYY-MM-DD/.
- Transform: decompress → parse JSON Lines → write Parquet.
- Load: stage and upsert using id + ts to resolve late corrections.
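A sketch of Example 2's integrity and schema checks, assuming the *.jsonl.gz file has already been pulled from SFTP into a local incoming/ directory and that pandas with pyarrow is available; paths and field names are illustrative.

# Sketch of Example 2: check gzip integrity, validate required fields per line,
# quarantine on failure, and write Parquet. Paths and fields are illustrative.
import gzip
import json
import shutil
from pathlib import Path
import pandas as pd

src = Path("incoming/payments_2024-05-15.jsonl.gz")
required = {"id", "ts", "amount"}
good, bad = [], []

try:
    # Integrity: a truncated or corrupt gzip stream raises while we read it.
    with gzip.open(src, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            (good if required <= record.keys() else bad).append(record)
except (OSError, EOFError, json.JSONDecodeError):
    # Quarantine the whole file and keep the rest of the run moving.
    Path("quarantine").mkdir(exist_ok=True)
    shutil.move(str(src), f"quarantine/{src.name}")
else:
    out_dir = Path("raw/partnerB/payments/ingestion_date=2024-05-15")
    out_dir.mkdir(parents=True, exist_ok=True)
    if good:
        pd.DataFrame(good).to_parquet(out_dir / "part-0001.parquet", index=False)
    if bad:
        # Dead-letter the rows missing required fields, keeping the raw payload.
        pd.DataFrame(bad).to_json(out_dir / "dead_letter.jsonl", orient="records", lines=True)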
Example 3 — Idempotent backfill with watermarks
- Watermark: run for event_date between 2024-01-01 and 2024-01-31.
- File pattern: daily/partitioned input files per event_date.
- Process per day: validate → write to curated/event_date=YYYY-MM-DD/.
- Merge: use business_key (customer_id, event_ts) for idempotency.
- Re-run: same inputs produce same final state (no duplicates).
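A sketch of the watermark loop for Example 3; process_day is a placeholder for the per-day validate/write/merge steps.

# Sketch of Example 3: walk event_date across the backfill window so every day
# is processed (and re-runnable) independently. process_day is a placeholder.
from datetime import date, timedelta

def process_day(event_date: date) -> None:
    # Placeholder for: validate -> write curated/event_date=YYYY-MM-DD/ -> MERGE
    # keyed on (customer_id, event_ts), so re-running a day converges to the
    # same final state instead of appending duplicates.
    print(f"processing event_date={event_date.isoformat()}")

start, end = date(2024, 1, 1), date(2024, 1, 31)
day = start
while day <= end:
    process_day(day)
    day += timedelta(days=1)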
Implementation steps (template)
- Define contract: file naming, columns, types, delimiter, compression, delivery time.
- Choose partitions: event_date if available; else ingestion_date.
- Plan idempotency: unique key + merge, or replace partitions atomically.
- Add validations: schema, row count thresholds, null checks, domain checks.
- Handle errors: quarantine invalid files; continue for the rest.
- Orchestrate: run daily with retries and alerting (a minimal DAG sketch follows this list).
- Catalog and document: what lands where, retention, and SLA.
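If the orchestrator is Apache Airflow (an assumption; any scheduler with retries and alerting works), the template above might be wired up like this minimal daily DAG; the two callables are placeholders.

# Minimal sketch of daily orchestration, assuming Apache Airflow 2.x.
# The two callables are placeholders for the template steps above.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def land_and_validate(): ...
def load_and_merge(): ...

with DAG(
    dag_id="vendorA_sales_daily",
    schedule="@daily",            # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 5, 1),
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    land = PythonOperator(task_id="land_and_validate", python_callable=land_and_validate)
    load = PythonOperator(task_id="load_and_merge", python_callable=load_and_merge)
    land >> load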
Data validation and schema handling
- Header and types: verify exact column names and parse types.
- Row-level checks: primary key not null, amounts non-negative, timestamps parseable.
- Schema drift: on new columns, extend schema compatibly; on missing columns, backfill defaults or block and escalate (see the sketch after this list).
- Bad rows: separate to a dead-letter file with reason; keep ingestion moving.
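A sketch of the schema-drift decision described above; the expected contract and the optional column with a default are illustrative assumptions.

# Sketch of the schema-drift decision: extend on new columns, backfill known
# optional columns, block when required columns are missing. Names are assumptions.
EXPECTED = {"order_id", "order_ts", "amount"}          # required by the contract
OPTIONAL_DEFAULTS = {"channel": "unknown"}             # backfilled when absent

def check_schema(incoming_columns: set) -> dict:
    missing = EXPECTED - incoming_columns
    extra = incoming_columns - EXPECTED - set(OPTIONAL_DEFAULTS)
    if missing:
        return {"action": "block_and_escalate", "missing": sorted(missing)}
    if extra:
        return {"action": "extend_schema", "new_columns": sorted(extra)}
    return {"action": "load"}

print(check_schema({"order_id", "order_ts", "amount", "coupon_code"}))
# -> {'action': 'extend_schema', 'new_columns': ['coupon_code']}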
Quick validator checklist
- Columns match expectation
- Delimiter and quoting consistent
- Encoding UTF-8
- Row count within expected range
- Primary key unique
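The checklist condensed into one function, a sketch assuming pandas; the expected columns, primary key, and row-count bounds are illustrative.

# Sketch of the validator checklist for one CSV, assuming pandas. Expected
# columns, the primary key, and the row-count bounds are illustrative.
import pandas as pd

EXPECTED_COLS = ["order_id", "order_ts", "amount"]

def validate_csv(path: str) -> list:
    problems = []
    try:
        # Encoding and parsing: non-UTF-8 bytes or ragged rows fail fast here.
        df = pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    except pd.errors.ParserError as exc:
        return [f"inconsistent delimiter/quoting: {exc}"]
    if list(df.columns) != EXPECTED_COLS:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if not 1 <= len(df) <= 5_000_000:              # assumed acceptable range
        problems.append(f"row count out of range: {len(df)}")
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        problems.append("primary key order_id is not unique")
    return problems                                # empty list means the file passes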
File partitioning and naming
Use stable, sortable names and directories so listing is predictable and idempotent.
raw/vendor/source=vendorA/dataset=sales/event_date=YYYY-MM-DD/part-0001.parquet
curated/vendorA/sales/event_date=YYYY-MM-DD/part-0001.snappy.parquet
- Include event_date, source, dataset in path.
- Atomicity: write to a temp path, then move/rename into the final partition.
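A sketch of the write-to-temp-then-rename pattern on a filesystem; on object stores, where renames are not atomic, the equivalent is writing under an underscore-prefixed temporary key and copying or committing it last. Paths are illustrative.

# Sketch of atomic publication on a filesystem: write an underscore-prefixed
# temp file (readers conventionally ignore those), then rename it into place.
import os
import tempfile

final_dir = "curated/source=vendorA/dataset=sales/event_date=2024-05-15"
os.makedirs(final_dir, exist_ok=True)

with tempfile.NamedTemporaryFile("wb", dir=final_dir, prefix="_tmp_", delete=False) as tmp:
    tmp.write(b"...parquet bytes...")              # stand-in for the real writer
    tmp_path = tmp.name

# os.replace is atomic on the same filesystem: readers see either the old state
# or the complete new file, never a half-written one.
os.replace(tmp_path, os.path.join(final_dir, "part-0001.parquet"))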
Idempotency and deduplication
- Prefer upsert with a stable business key.
- De-duplicate in staging using window functions (keep latest by updated_at).
- Reruns should not create duplicates; protect with unique constraints when available.
-- Example dedup in staging
CREATE TABLE sales_stg_dedup AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM sales_stg s
)
WHERE rn = 1;
Monitoring and retries
- Track counts: files discovered, files ingested, rows loaded, rows rejected.
- Alert on: missing expected file by SLA, schema mismatches, spike in rejects.
- Retry policy: exponential backoff for transient storage/SFTP errors.
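A sketch of a retry wrapper with exponential backoff and jitter for transient storage/SFTP errors; the attempt count, base delay, and exception types are assumptions.

# Sketch of retries with exponential backoff plus jitter for transient errors.
# Attempt count, base delay, and the retried exception types are assumptions.
import random
import time

def with_retries(fn, attempts=5, base_delay=1.0, retry_on=(ConnectionError, TimeoutError)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise                                  # exhausted: surface and alert
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            time.sleep(delay)                          # ~1s, 2s, 4s, 8s (+ jitter)

# Usage (sftp_download is a placeholder):
# with_retries(lambda: sftp_download("incoming/payments_2024-05-15.jsonl.gz"))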
Security and compliance basics
- Credentials: store securely; rotate regularly.
- PII: encrypt at rest; mask or tokenize sensitive fields early.
- Access: least privilege to buckets/containers and tables.
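One common way to tokenize a sensitive field early is a keyed hash; this is a minimal sketch, where the environment variable name is an assumption and the key must come from a secret store rather than source control.

# Minimal sketch of tokenizing a PII field with a keyed hash so downstream joins
# still work without exposing the raw value. PII_TOKEN_KEY is an assumed env var;
# in practice the key comes from a secret manager, never from source control.
import hashlib
import hmac
import os

SECRET = os.environ.get("PII_TOKEN_KEY", "change-me").encode()

def tokenize(value: str) -> str:
    return hmac.new(SECRET, value.strip().lower().encode(), hashlib.sha256).hexdigest()

print(tokenize("jane.doe@example.com"))   # stable token, irreversible without the key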
Exercises
Do these to lock in the skill. Each exercise states its expected deliverable.
Exercise 1 — Design a daily batch file ingestion (planning)
- Choose a dataset (e.g., orders) and define the file contract (columns, types, delimiter, compression, delivery time).
- Propose a directory and naming scheme with partitions.
- Define validations and idempotent load approach (keys + merge).
- Write a short runbook: steps, retries, alerts, and quarantine rules.
- Deliverable: a one-page design with examples of paths and a MERGE skeleton.
Exercise 2 — Validate, stage, and deduplicate CSVs (hands-on)
- Assume two CSV files with overlapping order_id rows. Validate schema and row counts.
- Load into a staging table, then create a deduplicated staging view/table keeping the latest updated_at per order_id.
- Show a final MERGE statement into the target table.
- Deliverable: sample SQL and final row counts before/after dedup.
Checklist before you run in production
- Clear file contract agreed
- Partitioning chosen and documented
- Idempotent load path defined
- Validation and quarantine in place
- Monitoring and alerts configured
Common mistakes (and how to self-check)
- Only filtering duplicates by file name. Self-check: re-run pipeline on the same day; do you get extra rows?
- Huge numbers of tiny files. Self-check: average file size; aim for hundreds of MB, not KB.
- No quarantine. Self-check: where do bad rows/files go?
- Coupling to ingestion order. Self-check: can files arrive late without breaking logic?
- Ignoring time zones. Self-check: are partitions based on UTC consistently?
Practical projects
- Partner dropzone: simulate an SFTP drop of daily transactions; land, validate, convert to Parquet, and upsert into a warehouse table.
- Historical backfill: ingest a month of archived CSVs using a watermark loop with idempotent merges.
- Schema evolution: accept a new optional column mid-week; keep pipeline running and reflect it downstream.
Mini challenge
You receive two files for the same day: sales_2024-05-15_part1.csv and sales_2024-05-15_part2.csv. Part 1 includes late fixes that overlap with rows in Part 2. Describe:
- Partitioning path you will use
- How you will deduplicate with updated_at logic
- How a re-run will avoid new duplicates
Hint
Load both parts into staging, dedup with ROW_NUMBER() over order_id ordered by updated_at, then MERGE into target. Keep event_date partition fixed.
Who this is for
- Junior to mid Data Engineers who need reliable scheduled file loads
- Analytics Engineers owning pipelines for partner data
- Ops-minded engineers improving robustness and cost predictability
Prerequisites
- Comfort with SQL (SELECT, INSERT, MERGE/UPSERT)
- Basic understanding of object storage and folders/paths
- Familiarity with CSV/JSON and compression (gzip)
Learning path
- This lesson: batch file ingestion fundamentals
- Next: schema evolution and data contracts in ingestion
- Then: orchestration, SLAs, and alerting
- Finally: cost optimization and performance tuning
Next steps
- Complete the exercises above and compare with the provided solutions.
- Build one practical project and share your design doc for peer review.
- Take the quick test to confirm readiness.
Quick Test
Take the test to check your understanding.