What does an ETL Developer do?
An ETL Developer designs, builds, and maintains data pipelines that Extract data from sources, Transform it into usable formats, and Load it into target systems (data warehouses, lakes, or marts). You ensure reliable, timely, and high-quality data for analytics, reporting, and downstream applications.
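To make that loop concrete, here is a minimal sketch in Python, assuming a local orders.csv file and a SQLite table standing in for the warehouse (the file name, column names, and target table are illustrative assumptions, not a prescribed setup):

```python
# Minimal extract-transform-load sketch: CSV source into a SQLite "warehouse".
# The file name, column names, and target table are illustrative assumptions.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Read raw rows from the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Cast types and standardize values before loading."""
    return [
        (r["order_id"], r["customer_id"].strip().upper(), float(r["amount"]))
        for r in rows
    ]

def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Append the transformed rows to the target table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id TEXT, customer_id TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

Real pipelines layer incremental loads, validation, scheduling, and monitoring on top of this skeleton; the rest of this page breaks those pieces down.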
Typical responsibilities
- Gather data requirements and map source fields to target models
- Design ETL/ELT workflows, data transformations, and error handling
- Implement pipelines using SQL, Python, or ETL tools
- Schedule and orchestrate jobs; monitor runs and fix failures
- Enforce data quality checks and reconcile data counts
- Optimize performance: partitioning, indexing, parallelism
- Document lineage, transformations, and operational runbooks
- Collaborate with analysts, DBAs, and platform engineers
Typical deliverables
- Source-to-Target Mapping (STM) document
- Reusable transformation components/patterns
- Scheduled pipeline DAGs with alerting
- Data quality rules and validation reports
- Operational dashboards and logs
- Runbooks, handover notes, and versioned code
Day-to-day workflow
- Morning: check last night’s runs, review alerts, fix failures
- Midday: refine requirements, update mappings, implement features
- Afternoon: code reviews, performance tuning, add tests and docs
- End of day: deploy updates, backfill if needed, verify metrics
Mini task: Define a simple pipeline
Pick a dataset you know well (e.g., sales orders) and write down the following; one way to capture it in code is sketched after the list.
- Source tables and key columns
- Transformations needed (e.g., currency standardization)
- Load destination (e.g., fact_sales, dim_customer)
- Success metric (e.g., row counts match, zero nulls in keys)
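A pipeline definition like this can live next to the code as a plain Python structure, so the mapping and success criteria stay versioned with the implementation; every name below is a hypothetical placeholder for whatever dataset you chose:

```python
# A pipeline definition captured as data: sources, transformations, targets,
# and success metrics in one place. All names are hypothetical placeholders.
PIPELINE_SPEC = {
    "name": "daily_sales_load",
    "sources": {
        "orders": ["order_id", "customer_id", "order_ts", "amount", "currency"],
        "customers": ["customer_id", "name", "country"],
    },
    "transformations": [
        "standardize currency to USD",
        "trim and upper-case customer_id",
        "derive order_date from order_ts",
    ],
    "targets": ["fact_sales", "dim_customer"],
    "success_metrics": [
        "row counts match the source extract",
        "zero nulls in order_id and customer_id",
    ],
}
```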
Hiring expectations by level
Junior ETL Developer
- Implements defined mappings and simple transformations
- Understands basic SQL, joins, and incremental loads
- Can follow runbooks to triage failures
- Needs support for design and performance tuning
Mid-level ETL Developer
- Owns pipelines end-to-end for a domain
- Designs robust incremental patterns and error handling
- Implements data quality checks and observability
- Optimizes jobs and mentors juniors
Senior ETL Developer
- Architects data flows across teams and environments
- Defines standards, reusable frameworks, and governance
- Leads capacity planning, SLAs, and cost-performance trade-offs
- Partners with stakeholders to translate business needs into data models
Salary expectations
- Junior: ~$60k–$90k
- Mid-level: ~$90k–$130k
- Senior: ~$130k–$180k+
These figures vary by country and company; treat them as rough ranges.
Where you can work
- Industries: finance, e-commerce, healthcare, SaaS, logistics, gaming, media
- Teams: data platform/engineering, analytics engineering, BI, ML data pipelines
- Tech environments: on-prem (RDBMS, ETL suites), cloud (AWS/GCP/Azure, modern orchestration)
Prerequisites
- Comfortable with SQL (joins, window functions, DDL/DML)
- Basic scripting (Python preferred) for utilities and transforms
- Understanding of databases, files, and batch vs. streaming
- Familiarity with Git and basic CI practices
Who this is for
- Detail-oriented builders who enjoy turning messy data into reliable datasets
- People who like repeatable processes, automation, and measurable quality
- Collaborators who can translate business questions into data pipelines
Learning path
- Model the destination first. Define dimensions/facts or target tables before writing code.
Mini task: Sketch a star schema for an orders domain.
- Map sources to targets. Create a Source-to-Target Mapping with datatypes, rules, and edge cases.
Mini task: Identify how to deduplicate customers.
- Build a minimal pipeline. Implement an incremental load with a watermark and validation checks (a sketch follows this list).
Mini task: Add row-count and null-check validations.
- Productionize. Schedule the job, add retries and alerts, and document runbooks.
Mini task: Write a one-page runbook: what to check when a job fails.
- Optimize and scale. Partition, compress, and parallelize. Profile the slowest step and fix it.
Mini task: List two partition keys and justify them.
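Here is a minimal sketch of the watermark pattern from the "Build a minimal pipeline" step, assuming a SQLite connection with hypothetical src_orders and stg_orders tables and an ISO-formatted updated_at column:

```python
# Watermark-based incremental load: pull only rows changed since the last run.
# Table and column names (src_orders, stg_orders, updated_at) are illustrative.
import sqlite3

def incremental_load(con: sqlite3.Connection) -> int:
    # 1. Read the current watermark (highest timestamp already loaded).
    (watermark,) = con.execute(
        "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM stg_orders"
    ).fetchone()

    # 2. Extract only rows newer than the watermark.
    rows = con.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM src_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

    # 3. Validate before loading: no null keys in the batch.
    assert all(r[0] is not None for r in rows), "null order_id in extract"

    # 4. Load and commit; the next run derives a fresh watermark from the target.
    con.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return len(rows)
```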
Skill map
- ETL Requirements And Source Analysis: turn business needs and raw sources into a clear mapping
- ETL Design Patterns: reusable solutions for incremental loads, CDC, SCD, and error handling
- Data Transformations: SQL/Python transformations, joins, aggregations, type casting, and validation
- Data Quality And Validation: rules, expectations, reconciliation, and data contracts
- Scheduling And Orchestration: job dependencies, retries, backfills, and monitoring
- Performance And Scalability: partitioning, indexing, parallelism, file formats, and cost control
- ETL Tooling And Implementation: hands-on with an ETL/ELT tool and CI basics
- Documentation And Handover: lineage, runbooks, and stakeholder-friendly notes
Practical projects
1) Sales warehouse mini-pipeline
Build a daily incremental load from orders.csv into fact_orders and dim_customer.
- Outcome: scheduled job, watermark-based loads, row-count checks
- Include: source-to-target mapping, SQL scripts, runbook
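For the row-count check this project calls for, a minimal sketch comparing the source extract against fact_orders for one load date (the file, table, and column names are assumptions):

```python
# Row-count reconciliation for a single load date: source extract vs. target.
# orders.csv, fact_orders, and order_date are illustrative assumptions.
import csv
import sqlite3

def reconcile_counts(csv_path: str, db_path: str, load_date: str) -> None:
    with open(csv_path, newline="") as f:
        source_count = sum(
            1 for row in csv.DictReader(f) if row["order_date"] == load_date
        )

    con = sqlite3.connect(db_path)
    (target_count,) = con.execute(
        "SELECT COUNT(*) FROM fact_orders WHERE order_date = ?", (load_date,)
    ).fetchone()
    con.close()

    if source_count != target_count:
        raise ValueError(
            f"count mismatch for {load_date}: source={source_count} target={target_count}"
        )
```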
2) Customer SCD Type 2
Create a slowly changing dimension that tracks changes to customer addresses.
- Outcome: SCD2 table with current_flag, valid_from/to, deduplicated changes
- Include: performance test on a 10M-row sample
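One way to express the SCD2 logic for this project, sketched against SQLite; dim_customer and its columns (current_flag, valid_from, valid_to) are assumed to match the outcome described above:

```python
# SCD Type 2 upsert for customer addresses: close the current row when the
# address changes, then insert a new current version. Names are illustrative.
import sqlite3
from datetime import date

def scd2_apply(con: sqlite3.Connection, customer_id: str, new_address: str) -> None:
    today = date.today().isoformat()
    current = con.execute(
        "SELECT address FROM dim_customer "
        "WHERE customer_id = ? AND current_flag = 1",
        (customer_id,),
    ).fetchone()

    if current is not None and current[0] == new_address:
        return  # no change: nothing to do, so reruns stay idempotent

    if current is not None:
        # Close out the existing version.
        con.execute(
            "UPDATE dim_customer SET current_flag = 0, valid_to = ? "
            "WHERE customer_id = ? AND current_flag = 1",
            (today, customer_id),
        )

    # Insert the new current version with an open-ended validity window.
    con.execute(
        "INSERT INTO dim_customer "
        "(customer_id, address, valid_from, valid_to, current_flag) "
        "VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_address, today),
    )
    con.commit()
```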
3) CDC from application DB
Ingest changes from a relational source using a change table or timestamp.
- Outcome: upserts/merges into target with idempotency
- Include: replay test to prove no double-counting
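A sketch of the idempotent merge step, using SQLite's INSERT ... ON CONFLICT as a stand-in for a warehouse MERGE; tgt_orders and its columns are assumptions, and order_id must be declared as the primary key for the conflict clause to fire:

```python
# Idempotent apply of CDC changes: an upsert keyed on the business key, so
# replaying the same change batch cannot double-count. Names are illustrative.
import sqlite3

UPSERT_SQL = """
INSERT INTO tgt_orders (order_id, customer_id, amount, updated_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(order_id) DO UPDATE SET
    customer_id = excluded.customer_id,
    amount      = excluded.amount,
    updated_at  = excluded.updated_at
WHERE excluded.updated_at >= tgt_orders.updated_at
"""

def apply_changes(con: sqlite3.Connection, changes: list[tuple]) -> None:
    # Each change is (order_id, customer_id, amount, updated_at); older
    # changes are ignored by the WHERE clause, which makes replays safe.
    con.executemany(UPSERT_SQL, changes)
    con.commit()
```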
4) Data quality dashboard
Implement rules (null checks, referential integrity, freshness) and report failures.
- Outcome: alerting on broken rules; per-table validation summary
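A hand-rolled sketch of the three rule types named above (null checks, referential integrity, freshness), reusing the hypothetical fact_orders and dim_customer tables; in practice a dedicated data quality framework would replace this:

```python
# Three basic data quality rules run as SQL, returning a per-rule pass/fail
# summary. Table and column names are illustrative assumptions.
import sqlite3

RULES = {
    "no_null_keys": (
        "SELECT COUNT(*) FROM fact_orders "
        "WHERE order_id IS NULL OR customer_id IS NULL"
    ),
    "referential_integrity": (
        "SELECT COUNT(*) FROM fact_orders f "
        "LEFT JOIN dim_customer d ON f.customer_id = d.customer_id "
        "WHERE d.customer_id IS NULL"
    ),
    "freshness": (
        "SELECT COUNT(*) FROM fact_orders "
        "WHERE order_date >= date('now', '-1 day')"
    ),
}

def run_rules(con: sqlite3.Connection) -> dict[str, bool]:
    results = {}
    for name, sql in RULES.items():
        (count,) = con.execute(sql).fetchone()
        # The first two rules fail when any violating rows exist; freshness
        # fails when no recent rows arrived at all.
        results[name] = (count > 0) if name == "freshness" else (count == 0)
    return results
```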
5) Orchestrated DAG with backfill
Design a DAG with dependencies and a safe backfill process for a missed week.
- Outcome: reproducible backfill without duplicating data
- Include: state management and guardrails
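One possible shape for this project in Apache Airflow 2.x, chosen here only as an example orchestrator (the project does not prescribe a tool); the dag_id, task names, and callables are illustrative:

```python
# One possible shape in Apache Airflow 2.x; any orchestrator with logical
# dates and catchup supports the same backfill pattern. Names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(ds: str, **_) -> None:
    # "ds" is this run's logical date; a backfill of a missed week simply
    # re-runs the task once per missed day, each with its own date.
    print(f"loading partition for {ds}")

def validate_partition(ds: str, **_) -> None:
    print(f"validating row counts for {ds}")

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,  # missed runs are created automatically, enabling backfill
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    load = PythonOperator(task_id="load", python_callable=load_partition)
    validate = PythonOperator(task_id="validate", python_callable=validate_partition)
    load >> validate  # validate only runs after the load succeeds
```

The guardrail that prevents duplicate data during a backfill is the per-date loading logic itself (delete-and-insert or merge by date), not the scheduler.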
Interview preparation checklist
- Explain incremental vs full loads; when to use each
- Walk through SCD Types 1–2; trade-offs and implementation details
- Design a pipeline with idempotent retries and late-arriving data
- Discuss partitioning, clustering, and file formats (Parquet/ORC/CSV)
- Show how you validate data (row counts, uniqueness, constraints)
- Describe monitoring: alerts, logs, metrics, and backfills
- Whiteboard a DAG: task dependencies, schedules, SLAs
- Provide a sample runbook and documentation snippet
Mini mock question
Your daily job failed partway through the load for 2026-01-10, then retried and ran to completion. How do you prevent duplicates? Describe your idempotency pattern and validation checks.
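One common answer, sketched below: make the load idempotent per logical date with a delete-and-insert keyed on the load date (a merge on business keys, as in the CDC project above, is the usual alternative). Table and column names are assumptions:

```python
# Delete-and-insert keyed on the load date: re-running the 2026-01-10 load
# first removes whatever the failed attempt wrote, so a retry cannot create
# duplicates. Table and column names are illustrative.
import sqlite3

def load_day(con: sqlite3.Connection, load_date: str) -> None:
    # Both statements commit together; a crash mid-way leaves the prior state.
    con.execute("DELETE FROM fact_orders WHERE order_date = ?", (load_date,))
    con.execute(
        "INSERT INTO fact_orders (order_id, customer_id, amount, order_date) "
        "SELECT order_id, customer_id, amount, order_date "
        "FROM stg_orders WHERE order_date = ?",
        (load_date,),
    )
    con.commit()

    # Validation: the day's target count matches staging, and keys are not null.
    (stg,) = con.execute(
        "SELECT COUNT(*) FROM stg_orders WHERE order_date = ?", (load_date,)
    ).fetchone()
    (tgt,) = con.execute(
        "SELECT COUNT(*) FROM fact_orders WHERE order_date = ?", (load_date,)
    ).fetchone()
    (null_keys,) = con.execute(
        "SELECT COUNT(*) FROM fact_orders "
        "WHERE order_date = ? AND order_id IS NULL",
        (load_date,),
    ).fetchone()
    assert stg == tgt and null_keys == 0, f"validation failed for {load_date}"
```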
Common mistakes and how to avoid them
- Skipping requirements mapping: always produce an STM before coding
- No idempotency: design loads to be safe on retries
- Weak validation: add row-count, duplicates, and null-key checks
- Hard-coded logic: parameterize dates, schemas, and paths (see the sketch after this list)
- No backfill plan: design for safe historical reloads
- Ignoring cost: choose efficient formats, prune data early, and profile queries
- Poor documentation: capture lineage, assumptions, and recovery steps
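A small illustration of the parameterization point above; the flag names and defaults are hypothetical:

```python
# Parameterized run: the load date, target schema, and input path come from
# the command line instead of being hard-coded. Flag names are illustrative.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="daily orders load")
    parser.add_argument("--load-date", required=True, help="logical date, YYYY-MM-DD")
    parser.add_argument("--schema", default="analytics", help="target schema")
    parser.add_argument("--input-path", default="data/orders.csv", help="source file")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"loading {args.input_path} into {args.schema} for {args.load_date}")
```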
Quick rescue checklist when a pipeline breaks
- Stop downstream dependents
- Identify the last consistent partition or watermark
- Fix root cause and replay from the checkpoint
- Reconcile counts and critical metrics
- Document the incident and prevention steps
Next steps
Pick a skill to start in the Skills section below, then build one practical project. When you are ready, take the exam to check your readiness. The exam is open to everyone; saving your progress requires logging in.