What Data Engineers do
Data Engineers design, build, and maintain the pipelines and platforms that move, transform, and serve data for analytics, machine learning, and operational use. Your work delivers reliable, timely, and secure data to decision-makers and downstream systems.
- Day-to-day: design schemas, build ingestion jobs, write transformations (ETL/ELT), set up orchestration, monitor data quality, manage storage/compute costs, and document datasets.
- Typical deliverables: batch/stream pipelines, data warehouse models (e.g., star schemas), data quality checks, orchestration DAGs, reproducible infrastructure, and runbooks.
- Collaboration: partner with analysts, scientists, backend engineers, security, and product teams.
A sample day
- 09:00 – Standup: review overnight jobs and priorities.
- 10:00 – Build: add a new ingestion source and schema evolution handling.
- 12:00 – Code review: validate a teammate’s ELT into the warehouse.
- 14:00 – Reliability: add data quality tests and alerting to a flaky pipeline.
- 16:00 – Docs: update the dataset README and a troubleshooting runbook.
Where you can work
- Industries: tech, finance/fintech, healthcare, retail/e-commerce, gaming, media, manufacturing, logistics, government.
- Teams: centralized data platforms, analytics engineering, ML platform, product engineering, data governance, or embedded in a domain squad.
- Company sizes: startups (build fast, broad scope), scale-ups (turning ad hoc pipelines into a shared platform), enterprises (governance, reliability, complex domains).
Hiring expectations by level
- Junior: can write basic ingestion and transformations, follow patterns, ship small features with guidance, add tests, and document steps.
- Mid-level: designs end-to-end pipelines, chooses formats/partitioning, adds observability, handles backfills, and mentors juniors.
- Senior: leads designs across teams, sets standards (quality, cost, security), plans capacity, formalizes SLAs, and drives platform improvements.
Salary ranges (rough)
- Junior: ~$70k–$110k
- Mid-level: ~$100k–$150k
- Senior: ~$140k–$200k+
- Staff/Lead: ~$180k–$250k+
Figures vary widely by country, company, and market; treat them as directional, not precise.
Skill map
These are the core skills you will build as a Data Engineer:
- Data Ingestion: Batch and incremental loading from APIs, files, and databases; schema evolution; file formats.
- ETL/ELT Development: Transformations, modeling, idempotency, backfills, performance tuning.
- Orchestration and Scheduling: DAG design, dependencies, retries, SLAs, observability.
- Data Warehousing: Star/snowflake schemas, partitioning, clustering, ACID tables, cost/perf tradeoffs.
- Streaming Systems Basics: Topics/partitions, consumer groups, windowing, watermarking, delivery semantics.
- Data Quality and Reliability: Tests, expectations, lineage, monitoring, alerting, SLAs/SLOs.
- Security and Governance: Least privilege, PII handling, encryption, audits, cataloging, access policies.
- Infrastructure and DevOps Basics: IaC, containers, CI/CD, environment parity, cost controls.
- Documentation: Dataset READMEs, runbooks, ADRs, onboarding guides.
Practical projects for your portfolio
1. Analytics Lakehouse Pipeline: Land raw files in object storage, normalize them to a columnar format, and load curated tables into a warehouse. Outcome: a reproducible DAG, a partition strategy, and a data dictionary.
2. CDC from OLTP to Warehouse: Capture incremental changes from a transactional database and maintain a Type 2 dimension. Outcome: idempotent upserts, late-arriving data handling, and tests.
3. Real-time Metrics Stream: Ingest a clickstream into a streaming system and compute 1/5/60-minute windows. Outcome: a written explanation of the at-least-once vs. exactly-once tradeoff, plus consumer-lag dashboards.
4. Data Quality Guardrails: Add contracts, expectations, and lineage to an existing pipeline. Outcome: alerts, a runbook, and defined SLOs.
5. Cost-Aware Warehouse: Optimize a large table using partitioning/clustering and storage tiering. Outcome: a before/after comparison of query cost and latency (see the sketch after this list).
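The core measurement in project 5 is easy to prototype locally before touching a warehouse. Below is a minimal sketch using pandas and DuckDB (both assumed installed, with pyarrow for Parquet writes); the toy dataset, paths, and `event_date` partition column are all illustrative. The same before/after comparison translates directly to warehouse partitioning or clustering.

```python
import time
import duckdb
import pandas as pd

# Build a toy dataset: 1M rows spread across two date partitions.
df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-02"] * 500_000,
    "amount": range(1_000_000),
})
df.to_parquet("events_flat.parquet")                         # one big file
df.to_parquet("events_part", partition_cols=["event_date"])  # hive-style directories

def timed(label: str, sql: str) -> None:
    start = time.perf_counter()
    duckdb.sql(sql).fetchall()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Full scan: the whole file is read, then filtered.
timed("flat  ", "SELECT sum(amount) FROM 'events_flat.parquet' "
                "WHERE event_date = '2024-01-01'")
# Pruned scan: only the matching partition directory is read.
timed("pruned", "SELECT sum(amount) FROM read_parquet('events_part/*/*.parquet', "
                "hive_partitioning=true) WHERE event_date = '2024-01-01'")
```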
Learning path
- Foundations first: Learn file formats (CSV, JSON, Parquet), batch vs stream, idempotency.
  - Mini task: Convert a 1GB CSV to Parquet and measure the difference in file size and query time.
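A minimal sketch of this task using pandas with pyarrow (assumed installed); `events.csv` and the `user_id` column are placeholders for your own data.

```python
import os
import time
import pandas as pd

# Convert once: Parquet is columnar and compressed, so it is usually much smaller.
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet")

print(f"CSV:     {os.path.getsize('events.csv') / 1e6:.0f} MB")
print(f"Parquet: {os.path.getsize('events.parquet') / 1e6:.0f} MB")

# Time the same single-column aggregate against both formats. Parquet usually
# wins because only the needed column is read from disk.
start = time.perf_counter()
pd.read_csv("events.csv", usecols=["user_id"])["user_id"].nunique()
print(f"CSV scan:     {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
pd.read_parquet("events.parquet", columns=["user_id"])["user_id"].nunique()
print(f"Parquet scan: {time.perf_counter() - start:.2f}s")
```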
- Build ingestion: Pull data from one API and one database. Handle pagination, rate limits, and incremental loads.
  - Mini task: Create a job that retries with exponential backoff and writes audit logs.
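A minimal sketch of the retry logic; `requests` is assumed, the endpoint is hypothetical, and the stdlib `logging` module stands in for a real audit trail.

```python
import logging
import time
import requests

# Audit log: one line per attempt, success or failure.
logging.basicConfig(filename="ingest_audit.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("ingest")

def fetch_with_backoff(url: str, max_attempts: int = 5) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            log.info("fetched %s on attempt %d", url, attempt)
            return resp.json()
        except requests.RequestException as exc:
            log.warning("attempt %d for %s failed: %s", attempt, url, exc)
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # back off 2, 4, 8, 16 seconds

# data = fetch_with_backoff("https://api.example.com/items?page=1")  # hypothetical endpoint
```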
- Model and transform: Design a small star schema and implement ELT in a warehouse.
  - Mini task: Write a dimension with Type 2 history and a fact table joined on surrogate keys.
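The warehouse version of this task is a MERGE statement, but the expire-and-append mechanics are easy to see in pandas first. A sketch under assumed column names (`customer_id`, `email`, `valid_from`, `valid_to`, `is_current`); handling brand-new customers and assigning surrogate keys are omitted for brevity.

```python
import pandas as pd

def scd2_apply(dim: pd.DataFrame, incoming: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    # Find current rows whose tracked attribute changed in this batch.
    current = dim[dim["is_current"]]
    merged = current.merge(incoming, on="customer_id", suffixes=("", "_new"))
    changed = merged.loc[merged["email"] != merged["email_new"], "customer_id"]

    # Expire the old version of each changed customer.
    mask = dim["customer_id"].isin(changed) & dim["is_current"]
    dim.loc[mask, "valid_to"] = now
    dim.loc[mask, "is_current"] = False

    # Append the new version with an open-ended validity range.
    new_rows = incoming[incoming["customer_id"].isin(changed)].copy()
    new_rows["valid_from"] = now
    new_rows["valid_to"] = pd.NaT
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)
```

In a real dimension each appended row would also get a fresh surrogate key, so facts can join to whichever version was current at load time.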
- Orchestrate and observe: Build a DAG with dependencies, retries, and alerts.
  - Mini task: Add data quality checks that fail the DAG on critical rule violations.
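A minimal sketch using Airflow's TaskFlow API (Airflow 2.4+ assumed for the `schedule` parameter); the table and the rule are illustrative. Any unhandled exception fails the task, which blocks downstream tasks and fires whatever alerting the DAG is configured with.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def load_orders() -> int:
        # ...ingestion happens here; return a row count for the check
        return 1250

    @task
    def quality_check(row_count: int) -> None:
        # Critical rule: an empty load means something upstream is broken.
        if row_count <= 0:
            raise ValueError("orders load produced zero rows")

    quality_check(load_orders())

orders_pipeline()
```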
- Stream and scale: Add a streaming component for real-time metrics and document SLAs/SLOs.
  - Mini task: Implement a 5-minute tumbling window and handle out-of-order events with watermarks.
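A minimal sketch with PySpark Structured Streaming (Spark with the Kafka connector package and a local broker assumed); the topic name and event schema are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream").getOrCreate()
schema = StructType().add("user_id", StringType()).add("event_time", TimestampType())

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clicks")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Accept events up to 10 minutes late; after that, each 5-minute window is final.
counts = (
    clicks.withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"))
    .count()
)

counts.writeStream.outputMode("append").format("console").start().awaitTermination()
```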
Interview preparation checklist
Technical checklist
- Explain ETL vs ELT tradeoffs and when to use each.
- Choose file formats and partitioning for different query patterns.
- Design an idempotent pipeline and backfill plan.
- Describe streaming semantics (at-most/at-least/exactly-once).
- Demonstrate data quality strategy: tests, thresholds, and alerts.
- Outline least-privilege access and PII protection.
- Show CI/CD for pipelines and IaC basics.
Behavioral checklist
- Share a time you fixed a flaky pipeline and prevented recurrence.
- Describe a tradeoff you made (cost vs latency) and why.
- Explain how you collaborated with stakeholders to define SLAs.
- Walk through a post-incident review and the resulting action items.
Common mistakes and how to avoid them
- No idempotency: Re-runs double-count. Fix: use upserts/merge and deterministic keys (see the sketch after this list).
- Overcomplicated DAGs: Hard to maintain. Fix: favor small, composable tasks and clear ownership.
- Ignoring data quality: Silent data drift. Fix: add expectations, thresholds, and lineage.
- Poor cost visibility: Surprise bills. Fix: tagging, budgets, partition pruning, clustering.
- Weak documentation: Tribal knowledge. Fix: README templates, runbooks, auto-generated docs where possible.
- Security as an afterthought: Risk exposure. Fix: least privilege, encryption, masking, reviews.
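To make the idempotency fix concrete, here is a minimal sketch using the stdlib sqlite3 module (SQLite 3.24+ assumed for upsert syntax); the table and the deterministic key `order_id` are illustrative. The same pattern is a MERGE in most warehouses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

batch = [("ord-1", 10.0), ("ord-2", 25.5)]

def load(rows):
    # The deterministic key turns duplicates into updates instead of new rows.
    conn.executemany(
        """INSERT INTO orders (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        rows,
    )
    conn.commit()

load(batch)
load(batch)  # a re-run (e.g., a backfill) is a no-op, not a double count
print(conn.execute("SELECT count(*), sum(amount) FROM orders").fetchone())  # (2, 35.5)
```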
Next steps
Pick a skill to start in the Skills section below. Build one project, document it well, and iterate with feedback. When ready, try the exam to check your readiness.