Why this matters
Batch file ingestion moves files (CSV, JSON, Parquet, etc.) from sources like S3, GCS, Azure Blob, or SFTP into your data lake or warehouse on a schedule. Data Engineers do this to onboard partners, load application exports, backfill history, and power analytics reliably.
- Real tasks: set up daily partner file drops, validate schema, partition data by date, and load into staging/warehouse with deduplication.
- Impact: correctness, recoverability, and predictable costs vs. always-on streaming.
Concept explained simply
Batch file ingestion is a repeatable recipe: discover files, validate and transform them, store them in a structured location, then load into query-ready tables. You run it on a schedule (for example, every day at 01:00).
Mental model
Imagine a conveyor belt where a new sealed box arrives each day. You: (1) check the label (file name and date), (2) open and verify contents (schema and row counts), (3) place items into the right shelf (partitioned storage), and (4) update the catalog (warehouse merge). If a box is damaged, you put it in a quarantine area and continue.
Core patterns and formats
- Sources: S3/GCS/Blob storage, SFTP, shared folders, app exports.
- Formats: CSV (simple, fragile), JSON Lines (semi-structured), Parquet/Avro (columnar/binary, strong types), compressed with gzip or snappy.
- Partitioning: typically by event_date or ingestion_date (YYYY/MM/DD). Compact small files; target roughly 100–512 MB per file for efficient reads (see the sizing sketch after this list).
- Workflow: land (raw) → validate → normalize → load (staging) → merge into warehouse (dim/fact).
- Idempotency: safe to re-run without duplicating rows. Use unique keys + upsert/merge, or load into a temp area then atomically replace.
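To make the file-sizing guidance concrete, here is a minimal sketch (in Python) that picks an output file count from the total input size; the 256 MB target is an assumed value inside the 100–512 MB range above.

# Minimal sketch: choose how many output files to write so each lands near a
# target size. The 256 MB target is an assumption within the ~100-512 MB range.
import math

def plan_output_files(total_input_bytes: int, target_bytes: int = 256 * 1024 * 1024) -> int:
    # Always write at least one file; rounding up keeps each file at or below the target.
    return max(1, math.ceil(total_input_bytes / target_bytes))

# Example: ~3.2 GB of input -> 13 files of roughly 250 MB each.
print(plan_output_files(int(3.2 * 1024**3)))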
Worked examples
Example 1 — Daily CSV from object storage to warehouse
- Discovery: list files matching vendorA/sales/2024-05-*.csv.
- Validation: check header columns, count rows, sample for encoding issues.
- Landing: copy to data-lake/raw/vendorA/sales/ingestion_date=2024-05-15/.
- Normalization: convert to Parquet; standardize column names and types.
- Load: stage in warehouse.sales_stg, then MERGE into warehouse.sales.
MERGE INTO warehouse.sales t
USING warehouse.sales_stg s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
VALUES (...);
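A sketch of Example 1's non-SQL steps (discovery, validation, landing, normalization), assuming the source lives in S3 and that boto3 and pandas with pyarrow are installed; bucket names, column names, and paths are illustrative assumptions, not a prescribed layout.

# Sketch of Example 1: discover the day's CSVs, validate, land the raw files,
# and convert to Parquet for the staging load. Names and paths are assumptions.
import boto3
import pandas as pd

s3 = boto3.client("s3")
SRC_BUCKET, LAKE_BUCKET = "vendor-drops", "data-lake"
run_date = "2024-05-15"
expected_cols = ["order_id", "order_ts", "amount"]

# Discovery: list the day's files (single page shown; paginate in real use).
resp = s3.list_objects_v2(Bucket=SRC_BUCKET, Prefix=f"vendorA/sales/{run_date}")
for obj in resp.get("Contents", []):
    key = obj["Key"]
    filename = key.rsplit("/", 1)[-1]
    local_csv = f"/tmp/{filename}"
    s3.download_file(SRC_BUCKET, key, local_csv)

    # Validation: header columns and a non-empty row count.
    df = pd.read_csv(local_csv)
    if list(df.columns) != expected_cols or df.empty:
        s3.upload_file(local_csv, LAKE_BUCKET, f"quarantine/vendorA/sales/{filename}")
        continue

    # Landing: keep the raw CSV as delivered, under an ingestion_date partition.
    s3.upload_file(local_csv, LAKE_BUCKET,
                   f"raw/vendorA/sales/ingestion_date={run_date}/{filename}")

    # Normalization: standardize types and write Parquet for the staging load.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True)
    local_parquet = local_csv.replace(".csv", ".parquet")
    df.to_parquet(local_parquet, index=False)
    s3.upload_file(local_parquet, LAKE_BUCKET,
                   f"curated/vendorA/sales/ingestion_date={run_date}/{filename.replace('.csv', '.parquet')}")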
Example 2 — SFTP JSON Lines, zipped
- Pull: download *.jsonl.gz from SFTP incoming/.
- Integrity: verify file size and gzip integrity; on failure, move the file to quarantine.
- Schema: validate required fields (id, ts, amount).
- Landing: write to raw/partnerB/payments/ingestion_date=YYYY-MM-DD/.
- Transform: decompress → parse JSON Lines → write Parquet.
- Load: stage and upsert using id + ts to resolve late corrections.
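A sketch of Example 2's integrity and schema checks, assuming the *.jsonl.gz file has already been pulled from SFTP into a local incoming/ directory and that pandas with pyarrow is available; paths and field names are illustrative.

# Sketch of Example 2: check gzip integrity, validate required fields per line,
# quarantine on failure, and write Parquet. Paths and fields are illustrative.
import gzip
import json
import shutil
from pathlib import Path
import pandas as pd

src = Path("incoming/payments_2024-05-15.jsonl.gz")
required = {"id", "ts", "amount"}
good, bad = [], []

try:
    # Integrity: a truncated or corrupt gzip stream raises while we read it.
    with gzip.open(src, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            (good if required <= record.keys() else bad).append(record)
except (OSError, EOFError, json.JSONDecodeError):
    # Quarantine the whole file and keep the rest of the run moving.
    Path("quarantine").mkdir(exist_ok=True)
    shutil.move(str(src), f"quarantine/{src.name}")
else:
    out_dir = Path("raw/partnerB/payments/ingestion_date=2024-05-15")
    out_dir.mkdir(parents=True, exist_ok=True)
    if good:
        pd.DataFrame(good).to_parquet(out_dir / "part-0001.parquet", index=False)
    if bad:
        # Dead-letter the rows missing required fields, keeping the raw payload.
        pd.DataFrame(bad).to_json(out_dir / "dead_letter.jsonl", orient="records", lines=True)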
Example 3 — Idempotent backfill with watermarks
- Watermark: run for event_date between 2024-01-01 and 2024-01-31.
- File pattern: daily/partitioned input files per event_date.
- Process per day: validate → write to curated/event_date=YYYY-MM-DD/.
- Merge: use business_key (customer_id, event_ts) for idempotency.
- Re-run: same inputs produce same final state (no duplicates).
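A sketch of the watermark loop for Example 3; process_day is a placeholder for the per-day validate/write/merge steps.

# Sketch of Example 3: walk event_date across the backfill window so every day
# is processed (and re-runnable) independently. process_day is a placeholder.
from datetime import date, timedelta

def process_day(event_date: date) -> None:
    # Placeholder for: validate -> write curated/event_date=YYYY-MM-DD/ -> MERGE
    # keyed on (customer_id, event_ts), so re-running a day converges to the
    # same final state instead of appending duplicates.
    print(f"processing event_date={event_date.isoformat()}")

start, end = date(2024, 1, 1), date(2024, 1, 31)
day = start
while day <= end:
    process_day(day)
    day += timedelta(days=1)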
Implementation steps (template)
- Define contract: file naming, columns, types, delimiter, compression, delivery time.
- Choose partitions: event_date if available; else ingestion_date.
- Plan idempotency: unique key + merge, or replace partitions atomically.
- Add validations: schema, row count thresholds, null checks, domain checks.
- Handle errors: quarantine invalid files; continue for the rest.
- Orchestrate: run daily with retries and alerting (a minimal DAG sketch follows this list).
- Catalog and document: what lands where, retention, and SLA.
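If the orchestrator is Apache Airflow (an assumption; any scheduler with retries and alerting works), the template above might be wired up like this minimal daily DAG; the two callables are placeholders.

# Minimal sketch of daily orchestration, assuming Apache Airflow 2.x.
# The two callables are placeholders for the template steps above.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def land_and_validate(): ...
def load_and_merge(): ...

with DAG(
    dag_id="vendorA_sales_daily",
    schedule="@daily",            # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 5, 1),
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    land = PythonOperator(task_id="land_and_validate", python_callable=land_and_validate)
    load = PythonOperator(task_id="load_and_merge", python_callable=load_and_merge)
    land >> load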
Data validation and schema handling
- Header and types: verify exact column names and parse types.
- Row-level checks: primary key not null, amounts non-negative, timestamps parseable.
- Schema drift: on new columns, extend schema compatibly; on missing columns, backfill defaults or block and escalate (see the sketch after this list).
- Bad rows: separate to a dead-letter file with reason; keep ingestion moving.
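A sketch of the schema-drift decision described above; the expected contract and the optional column with a default are illustrative assumptions.

# Sketch of the schema-drift decision: extend on new columns, backfill known
# optional columns, block when required columns are missing. Names are assumptions.
EXPECTED = {"order_id", "order_ts", "amount"}          # required by the contract
OPTIONAL_DEFAULTS = {"channel": "unknown"}             # backfilled when absent

def check_schema(incoming_columns: set) -> dict:
    missing = EXPECTED - incoming_columns
    extra = incoming_columns - EXPECTED - set(OPTIONAL_DEFAULTS)
    if missing:
        return {"action": "block_and_escalate", "missing": sorted(missing)}
    if extra:
        return {"action": "extend_schema", "new_columns": sorted(extra)}
    return {"action": "load"}

print(check_schema({"order_id", "order_ts", "amount", "coupon_code"}))
# -> {'action': 'extend_schema', 'new_columns': ['coupon_code']}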
Quick validator checklist
- Columns match expectation
- Delimiter and quoting consistent
- Encoding UTF-8
- Row count within expected range
- Primary key unique
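The checklist condensed into one function, a sketch assuming pandas; the expected columns, primary key, and row-count bounds are illustrative.

# Sketch of the validator checklist for one CSV, assuming pandas. Expected
# columns, the primary key, and the row-count bounds are illustrative.
import pandas as pd

EXPECTED_COLS = ["order_id", "order_ts", "amount"]

def validate_csv(path: str) -> list:
    problems = []
    try:
        # Encoding and parsing: non-UTF-8 bytes or ragged rows fail fast here.
        df = pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    except pd.errors.ParserError as exc:
        return [f"inconsistent delimiter/quoting: {exc}"]
    if list(df.columns) != EXPECTED_COLS:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if not 1 <= len(df) <= 5_000_000:              # assumed acceptable range
        problems.append(f"row count out of range: {len(df)}")
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        problems.append("primary key order_id is not unique")
    return problems                                # empty list means the file passes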
File partitioning and naming
Use stable, sortable names and directories so listing is predictable and idempotent.
raw/vendor/source=vendorA/dataset=sales/event_date=YYYY-MM-DD/part-0001.parquet
curated/vendorA/sales/event_date=YYYY-MM-DD/part-0001.snappy.parquet
- Include event_date, source, dataset in path.
- Atomicity: write to a temp path, then move/rename into the final partition.
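A sketch of the write-to-temp-then-rename pattern on a filesystem; on object stores, where renames are not atomic, the equivalent is writing under an underscore-prefixed temporary key and copying or committing it last. Paths are illustrative.

# Sketch of atomic publication on a filesystem: write an underscore-prefixed
# temp file (readers conventionally ignore those), then rename it into place.
import os
import tempfile

final_dir = "curated/source=vendorA/dataset=sales/event_date=2024-05-15"
os.makedirs(final_dir, exist_ok=True)

with tempfile.NamedTemporaryFile("wb", dir=final_dir, prefix="_tmp_", delete=False) as tmp:
    tmp.write(b"...parquet bytes...")              # stand-in for the real writer
    tmp_path = tmp.name

# os.replace is atomic on the same filesystem: readers see either the old state
# or the complete new file, never a half-written one.
os.replace(tmp_path, os.path.join(final_dir, "part-0001.parquet"))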
Idempotency and deduplication
- Prefer upsert with a stable business key.
- De-duplicate in staging using window functions (keep latest by updated_at).
- Reruns should not create duplicates; protect with unique constraints when available.
-- Example dedup in staging
CREATE TABLE sales_stg_dedup AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM sales_stg s
)
WHERE rn = 1;
Monitoring and retries
- Track counts: files discovered, files ingested, rows loaded, rows rejected.
- Alert on: missing expected file by SLA, schema mismatches, spike in rejects.
- Retry policy: exponential backoff for transient storage/SFTP errors.
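A sketch of a retry wrapper with exponential backoff and jitter for transient storage/SFTP errors; the attempt count, base delay, and exception types are assumptions.

# Sketch of retries with exponential backoff plus jitter for transient errors.
# Attempt count, base delay, and the retried exception types are assumptions.
import random
import time

def with_retries(fn, attempts=5, base_delay=1.0, retry_on=(ConnectionError, TimeoutError)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise                                  # exhausted: surface and alert
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            time.sleep(delay)                          # ~1s, 2s, 4s, 8s (+ jitter)

# Usage (sftp_download is a placeholder):
# with_retries(lambda: sftp_download("incoming/payments_2024-05-15.jsonl.gz"))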
Security and compliance basics
- Credentials: store securely; rotate regularly.
- PII: encrypt at rest; mask or tokenize sensitive fields early.
- Access: least privilege to buckets/containers and tables.
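One common way to tokenize a sensitive field early is a keyed hash; this is a minimal sketch, where the environment variable name is an assumption and the key must come from a secret store rather than source control.

# Minimal sketch of tokenizing a PII field with a keyed hash so downstream joins
# still work without exposing the raw value. PII_TOKEN_KEY is an assumed env var;
# in practice the key comes from a secret manager, never from source control.
import hashlib
import hmac
import os

SECRET = os.environ.get("PII_TOKEN_KEY", "change-me").encode()

def tokenize(value: str) -> str:
    return hmac.new(SECRET, value.strip().lower().encode(), hashlib.sha256).hexdigest()

print(tokenize("jane.doe@example.com"))   # stable token, irreversible without the key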
Exercises
Do these to lock in the skill. Each exercise states its expected deliverable.
Exercise 1 — Design a daily batch file ingestion (planning)
- Choose a dataset (e.g., orders) and define the file contract (columns, types, delimiter, compression, delivery time).
- Propose a directory and naming scheme with partitions.
- Define validations and idempotent load approach (keys + merge).
- Write a short runbook: steps, retries, alerts, and quarantine rules.
- Deliverable: a one-page design with examples of paths and a MERGE skeleton.
Exercise 2 — Validate, stage, and deduplicate CSVs (hands-on)
- Assume two CSV files with overlapping order_id rows. Validate schema and row counts.
- Load into a staging table, then create a deduplicated staging view/table keeping the latest updated_at per order_id.
- Show a final MERGE statement into the target table.
- Deliverable: sample SQL and final row counts before/after dedup.
Checklist before you run in production
- Clear file contract agreed
- Partitioning chosen and documented
- Idempotent load path defined
- Validation and quarantine in place
- Monitoring and alerts configured
Common mistakes (and how to self-check)
- Only filtering duplicates by file name. Self-check: re-run pipeline on the same day; do you get extra rows?
- Huge numbers of tiny files. Self-check: average file size; aim for hundreds of MB, not KB.
- No quarantine. Self-check: where do bad rows/files go?
- Coupling to ingestion order. Self-check: can files arrive late without breaking logic?
- Ignoring time zones. Self-check: are partitions based on UTC consistently?
Practical projects
- Partner dropzone: simulate an SFTP drop of daily transactions; land, validate, convert to Parquet, and upsert into a warehouse table.
- Historical backfill: ingest a month of archived CSVs using a watermark loop with idempotent merges.
- Schema evolution: accept a new optional column mid-week; keep pipeline running and reflect it downstream.
Mini challenge
You receive two files for the same day: sales_2024-05-15_part1.csv and sales_2024-05-15_part2.csv. Part 1 includes late fixes that overlap with rows in Part 2. Describe:
- Partitioning path you will use
- How you will deduplicate with updated_at logic
- How a re-run will avoid new duplicates
Hint
Load both parts into staging, dedup with ROW_NUMBER() over order_id ordered by updated_at, then MERGE into target. Keep event_date partition fixed.
Who this is for
- Junior to mid Data Engineers who need reliable scheduled file loads
- Analytics Engineers owning pipelines for partner data
- Ops-minded engineers improving robustness and cost predictability
Prerequisites
- Comfort with SQL (SELECT, INSERT, MERGE/UPSERT)
- Basic understanding of object storage and folders/paths
- Familiarity with CSV/JSON and compression (gzip)
Learning path
- This lesson: batch file ingestion fundamentals
- Next: schema evolution and data contracts in ingestion
- Then: orchestration, SLAs, and alerting
- Finally: cost optimization and performance tuning
Next steps
- Complete the exercises above and compare with the provided solutions.
- Build one practical project and share your design doc for peer review.
- Take the quick test to confirm readiness.
Quick Test
Take the test to check your understanding.