
Data Engineering & Platforms

Published: January 8, 2026 | Updated: January 8, 2026

What is Data Engineering & Platforms?

Data Engineering & Platforms is about building the reliable pipelines, storage, and processing layers that turn raw data into trustworthy, usable datasets. Data engineers design models and tables, move and transform data at scale, ensure quality, and make data easy and safe for analysts, scientists, and products to use.

Typical problems solved
  • Ingesting data from APIs, databases, files, and streaming sources
  • Designing data models and warehouses so data is easy to query
  • Transforming raw data into clean, standardized datasets
  • Scheduling and orchestrating pipelines that run daily or in real time
  • Ensuring data quality, lineage, observability, and cost control
  • Providing secure, governed access to datasets across teams

Who this is for

  • Builders who enjoy creating systems that run reliably
  • People who like SQL, scripting, and automating repetitive tasks
  • Detail-oriented problem-solvers who care about correctness and speed
  • Those who want impact across many teams by enabling analytics and ML

Prerequisites

  • Comfort with basic SQL (SELECT, WHERE, GROUP BY, JOIN)
  • Basic Python or similar scripting language
  • Familiarity with CSV/JSON and working in a terminal
  • Optional but helpful: Git basics and any cloud exposure

Learning path (at a glance)

  1. Strengthen SQL and data modeling fundamentals
  2. Automate transformations with Python and a workflow tool
  3. Learn warehouse/lake concepts and partitioning
  4. Add testing, data quality checks, and documentation
  5. Deploy to the cloud or a local equivalent; add monitoring
Mini task: Normalize a messy dataset

Take a CSV with duplicate users and inconsistent timestamps. Write SQL to de-duplicate by a primary key, standardize time zones to UTC, and produce a clean dimension table.
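Here is one minimal sketch of this task, using Python's built-in sqlite3 module so it runs anywhere. The file name (users_raw.csv), its columns (user_id, email, signed_up_at), and the keep-the-earliest-record rule are illustrative assumptions, and the window-function dedup needs a SQLite build of 3.25 or newer.

```python
import csv
import sqlite3
from datetime import datetime, timezone

def to_utc_iso(raw: str) -> str:
    """Parse a timestamp that may or may not carry a UTC offset and normalize to UTC."""
    dt = datetime.fromisoformat(raw.strip())
    if dt.tzinfo is None:                      # assumption: naive values are already UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_users (user_id TEXT, email TEXT, signed_up_at TEXT)")

# Load the raw CSV, normalizing timestamps on the way in.
with open("users_raw.csv", newline="") as f:
    rows = [(r["user_id"], r["email"], to_utc_iso(r["signed_up_at"]))
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO stg_users VALUES (?, ?, ?)", rows)

# De-duplicate by the primary key, keeping the earliest sign-up per user.
conn.execute("""
    CREATE TABLE dim_users AS
    SELECT user_id, email, signed_up_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY signed_up_at) AS rn
        FROM stg_users
    )
    WHERE rn = 1
""")
print(conn.execute("SELECT COUNT(*) FROM dim_users").fetchone()[0], "unique users")
```

In a real warehouse you would express the same ROW_NUMBER() dedup directly in its SQL dialect and use its timestamp functions for the UTC conversion.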

Careers inside this direction

  • Data Engineer – Builds and maintains data pipelines, storage, and processing systems that make reliable, timely data available for analytics and products. Best for: people who enjoy building systems, are comfortable with SQL/Python, and value reliability.

Where you can work

  • Industries: tech, fintech, e-commerce, healthcare, gaming, media, logistics, travel, government, energy
  • Company types: startups, scale-ups, enterprises, consultancies, data platform vendors, nonprofits
  • Team setups: central data/platform teams, embedded product teams, data platform groups in large orgs

Salary ranges by stage

  • Junior Data Engineer: ~$60k–95k
  • Mid-level Data Engineer: ~$95k–140k
  • Senior/Staff Data Engineer: ~$140k–200k+

Varies by country/company; treat as rough ranges.

Growth map

  • Junior → Mid: solid SQL, reliable ETL/ELT jobs, basic data modeling, version control, simple tests
  • Mid → Senior: end-to-end ownership, cost/perf tuning, workflow orchestration, governance, observability
  • Senior → Staff/Lead: platform design, data contracts, multi-team architecture, SLAs/SLOs, mentoring
Signals you are ready for the next level
  • You prevent issues with tests and design, not just fix them after alerts
  • You make trade-offs explicit: cost vs freshness vs complexity
  • Others rely on your patterns and documentation to move faster

Tools & stack overview

  • Languages: SQL, Python (sometimes Scala/Java for Spark-heavy stacks)
  • Storage: PostgreSQL/MySQL, object storage (S3/Blob/GCS), data warehouses (Snowflake, BigQuery, Redshift)
  • Processing: dbt for transformations, Spark for big data, streaming with Kafka/Kinesis/PubSub
  • Orchestration: Airflow, Prefect, Dagster
  • Containers/Infra: Docker, Kubernetes (optional at start)
  • Quality/Observability: tests in SQL/Python, Great Expectations/Soda-like checks, basic lineage
  • Collaboration: Git, pull requests, code reviews, documentation
Choosing a beginner-friendly stack
  • Warehouse: start with a free-tier or local Postgres
  • Transformations: dbt Core + SQL models
  • Orchestration: Prefect or Airflow locally
  • Storage: local files for practice; understand object storage concepts

Beginner roadmap (4–8 weeks)

Week 1: SQL core

  • Practice SELECT, filtering, aggregation, window functions
  • Write joins (inner/left) and handle NULLs explicitly
  • Mini task: build a customer metrics query (orders, revenue, first_seen)
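If you want a concrete target for the mini task, here is a minimal sketch using Python's built-in sqlite3; the customers and orders tables and their columns are illustrative assumptions, and the LEFT JOIN keeps customers who have never ordered so the NULL handling stays explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         ordered_at TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-05', 20.0),
        (11, 1, '2024-02-01', 35.5),
        (12, 2, '2024-01-20', 15.0);
""")

# LEFT JOIN keeps customers with no orders; COALESCE makes the NULL revenue explicit.
# first_seen stays NULL for customers without orders, which is worth deciding on purpose.
metrics_sql = """
    SELECT c.customer_id,
           c.name,
           COUNT(o.order_id)          AS orders,
           COALESCE(SUM(o.amount), 0) AS revenue,
           MIN(o.ordered_at)          AS first_seen
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY revenue DESC
"""
for row in conn.execute(metrics_sql):
    print(row)
```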

Week 2: Python for data

  • Read/write CSV/JSON, work with datetime and time zones
  • Call a REST API and save responses incrementally
  • Mini task: fetch daily data and append only new records
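A minimal sketch of the incremental-append idea using only the standard library; the endpoint URL, the records/id fields, and the JSONL storage file are illustrative assumptions to swap for whichever public API you practice on.

```python
import json
import urllib.request
from pathlib import Path

API_URL = "https://example.com/api/daily-records"   # placeholder endpoint
STORE = Path("records.jsonl")                        # one JSON object per line

def load_seen_ids() -> set:
    """Collect the IDs already stored so reruns do not duplicate records."""
    if not STORE.exists():
        return set()
    return {json.loads(line)["id"] for line in STORE.read_text().splitlines() if line}

def fetch_records() -> list:
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        payload = json.load(resp)
    return payload["records"]

def append_new(records: list, seen: set) -> int:
    new = [r for r in records if r["id"] not in seen]
    with STORE.open("a") as f:
        for r in new:
            f.write(json.dumps(r) + "\n")
    return len(new)

if __name__ == "__main__":
    added = append_new(fetch_records(), load_seen_ids())
    print(f"appended {added} new records")
```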

Week 3: Data modeling

  • Star schema basics: facts, dimensions, slowly changing dimensions (simple)
  • Design staging, core, marts layers
  • Mini task: sketch a simple warehouse diagram and build 2–3 tables
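As a concrete starting point, here is a sketch of a tiny star schema (one fact, two dimensions) created with Python's sqlite3; the table and column names are illustrative assumptions, and in a real warehouse you would write equivalent DDL or dbt models in its own dialect.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS dim_customers (
        customer_key INTEGER PRIMARY KEY,
        customer_id  TEXT UNIQUE,
        country      TEXT,
        first_seen   TEXT
    );
    CREATE TABLE IF NOT EXISTS dim_products (
        product_key  INTEGER PRIMARY KEY,
        product_id   TEXT UNIQUE,
        category     TEXT,
        unit_price   REAL
    );
    -- The fact table stores one row per order line and points at the dimensions.
    CREATE TABLE IF NOT EXISTS fct_orders (
        order_id     TEXT,
        order_date   TEXT,
        customer_key INTEGER REFERENCES dim_customers(customer_key),
        product_key  INTEGER REFERENCES dim_products(product_key),
        quantity     INTEGER,
        revenue      REAL
    );
""")
conn.commit()
```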

Week 4: Orchestration & ELT

  • Schedule a daily job that extracts data, loads to warehouse, runs transforms
  • Add retries, logging, and parameterized runs
  • Mini task: one-click run that rebuilds a daily sales mart
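Before reaching for an orchestrator, it helps to see the moving parts in plain Python. The sketch below shows a parameterized daily run with retries and logging; the extract/load/transform bodies are placeholders, and Airflow, Prefect, or Dagster would add scheduling, dependency tracking, and a UI on top of this pattern.

```python
import argparse
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("daily_sales_mart")

def with_retries(fn, attempts=3, delay_seconds=10):
    """Run fn, retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            log.exception("attempt %d/%d failed", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)

# Placeholder steps: replace the bodies with real extract/load/transform logic.
def extract(run_date):
    log.info("extracting raw data for %s", run_date)

def load(run_date):
    log.info("loading staging tables for %s", run_date)

def transform(run_date):
    log.info("rebuilding daily sales mart for %s", run_date)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-date", required=True, help="partition to process, e.g. 2024-01-31")
    args = parser.parse_args()

    for step in (extract, load, transform):
        with_retries(lambda step=step: step(args.run_date))
    log.info("pipeline finished for %s", args.run_date)
```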

Week 5: Quality, docs, and monitoring

  • Add tests (not null, unique, foreign keys) and freshness checks
  • Document tables and columns with clear owners and purposes
  • Mini task: add anomaly alert when daily volume drops by 30%+
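A minimal sketch of the anomaly alert: compare today's row count to the trailing average and flag a 30%+ drop. The daily_counts list is an illustrative assumption; in practice it would come from a warehouse query, and the alert would go to Slack or email rather than stdout.

```python
from statistics import mean

def volume_alert(daily_counts: list[int], drop_threshold: float = 0.30) -> bool:
    """Return True if the latest day is drop_threshold or more below the prior average."""
    if len(daily_counts) < 2:
        return False
    *history, today = daily_counts
    baseline = mean(history)
    return baseline > 0 and (baseline - today) / baseline >= drop_threshold

daily_counts = [10_200, 9_800, 10_500, 9_900, 6_400]   # last value is today
if volume_alert(daily_counts):
    print("ALERT: daily row count dropped 30%+ vs the trailing average")
```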

Week 6: Cloud concepts or local equivalents

  • Understand object storage, compute, networking basics, and IAM concepts
  • Practice deploying a simple pipeline to run on a schedule
  • Mini task: cost-aware design (partitioning, compression, pruning)
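Partition pruning is easiest to internalize locally. The sketch below writes events into one folder per day and answers a single-day question by touching only that folder; the paths and sample rows are illustrative assumptions, and warehouses apply the same idea through date partitions and clustering.

```python
import csv
from pathlib import Path

BASE = Path("warehouse/events")   # layout: warehouse/events/event_date=YYYY-MM-DD/part-000.csv

def write_partition(event_date: str, rows: list[dict]) -> None:
    """Write one day's rows into its own partition directory."""
    part_dir = BASE / f"event_date={event_date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with (part_dir / "part-000.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "event"])
        writer.writeheader()
        writer.writerows(rows)

def read_day(event_date: str) -> list[dict]:
    # Pruning: only the matching partition directory is scanned.
    part_dir = BASE / f"event_date={event_date}"
    rows = []
    for path in part_dir.glob("*.csv"):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

write_partition("2024-01-01", [{"user_id": "u1", "event": "login"}])
write_partition("2024-01-02", [{"user_id": "u2", "event": "purchase"}])
print(len(read_day("2024-01-02")), "rows scanned for 2024-01-02")
```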
Stretch Weeks 7–8: Streaming and optimization
  • Build a small streaming ingest and transform job
  • Tune a heavy query with partitions, clustering, and indexes
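For the query-tuning stretch goal, a local before/after comparison makes the idea tangible. The sketch below uses sqlite3's EXPLAIN QUERY PLAN to show a full table scan turning into an index search once an index on order_date exists; the table is an illustrative assumption, and in a warehouse the analogous levers are partitioning and clustering keys rather than indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id INTEGER, order_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO fct_orders VALUES (?, ?, ?)",
    [(i, f"2024-01-{(i % 28) + 1:02d}", i * 1.0) for i in range(10_000)],
)

query = "SELECT SUM(revenue) FROM fct_orders WHERE order_date = '2024-01-15'"
print("before:", conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Adding an index changes the plan from a full scan to an index search.
conn.execute("CREATE INDEX idx_orders_date ON fct_orders (order_date)")
print("after: ", conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```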

Common mistakes

  • Skipping tests and documentation, causing breakages and rework
  • Over-engineering early; keep designs simple and evolve with needs
  • Ignoring costs and data freshness trade-offs
  • Pipelines that only work on your machine; automate and parameterize them
  • Unclear ownership; define data contracts and consumers

Mini project ideas

  • CSV to warehouse: Load daily CSVs into a staging table, clean them, and publish a mart
  • API ingestion: Incrementally fetch a public API and build a simple fact and two dimensions
  • Log parser: Parse app logs into events and sessions, with a daily sessionization job
  • Quality checks: Add not-null/unique tests and a freshness check with a simple alert
  • Cost-aware partitioning: Partition a large fact table by date and compare scan sizes

Practical projects

Project 1: End-to-end sales analytics platform

  • Extract: API + CSV ingest to staging
  • Transform: build star schema (orders, order_items, customers, products)
  • Orchestrate: daily schedule with retries and logging
  • Quality: tests for keys, freshness, and basic anomaly checks
  • Deliver: a daily revenue dashboard dataset (by day, product, channel)

Project 2: Streaming events to feature store

  • Ingest clickstream events
  • Aggregate rolling 7-day counts per user in near real time (sketched after this list)
  • Publish a features table and document data contracts
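A minimal in-memory sketch of the rolling 7-day count logic referenced above; a real version would consume events from Kafka/Kinesis/PubSub and publish results to a features table, and the sample stream below is an illustrative assumption.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)
events_by_user: dict[str, deque] = defaultdict(deque)

def update(user_id: str, event_time: datetime) -> int:
    """Add one event and return the user's event count over the trailing 7 days."""
    window = events_by_user[user_id]
    window.append(event_time)
    cutoff = event_time - WINDOW
    while window and window[0] < cutoff:       # evict events outside the window
        window.popleft()
    return len(window)

stream = [
    ("u1", datetime(2024, 1, 1, 9, 0)),
    ("u1", datetime(2024, 1, 5, 12, 0)),
    ("u1", datetime(2024, 1, 9, 8, 0)),        # the Jan 1 event falls out of the window here
]
for user_id, ts in stream:
    print(user_id, ts.date(), "7d_count =", update(user_id, ts))
```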

Project 3: Data reliability toolkit

  • Add schema and null checks to 5 critical tables (a sketch follows this list)
  • Set up freshness SLAs and a weekly lineage review
  • Create runbooks for common incidents and on-call handoff notes
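A minimal sketch of the schema and null checks, written against a SQLite table; tools such as Great Expectations, Soda, or dbt tests cover the same ground with richer reporting, and the expected columns, not-null rules, and table name are illustrative assumptions.

```python
import sqlite3

EXPECTED_COLUMNS = {"order_id", "order_date", "customer_key", "revenue"}
NOT_NULL_COLUMNS = ["order_id", "order_date"]

def check_table(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return a list of human-readable failures for the given table."""
    failures = []
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    missing = EXPECTED_COLUMNS - actual
    if missing:
        failures.append(f"{table}: missing columns {sorted(missing)}")
    for col in NOT_NULL_COLUMNS:
        if col in actual:
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            ).fetchone()[0]
            if nulls:
                failures.append(f"{table}.{col}: {nulls} NULL values")
    return failures

conn = sqlite3.connect("warehouse.db")   # assumes the star-schema tables from Week 3 live here
for problem in check_table(conn, "fct_orders"):
    print("FAIL:", problem)
```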

Next steps

  • Take the quick fit test on this page to gauge your match
  • Choose the Data Engineer path and commit to the beginner roadmap above (4–8 weeks)
  • Start a practical project and iterate with tests, docs, and monitoring
  • When ready, explore the professions section below to focus your journey

