
Compute Engines: Spark, Trino, and Flink Basics

Learn the basics of Spark, Trino, and Flink for free, with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As a Data Platform Engineer, you will pick and run the right compute engine for each job. Day-to-day tasks include:

  • Transforming large datasets (daily batch jobs) for analytics in an efficient, reliable way.
  • Serving fast interactive SQL to analysts across multiple data sources.
  • Building and operating real-time pipelines that process events with low latency.

Spark, Trino, and Flink are three cornerstone engines you will encounter. Knowing where each shines saves cost, improves reliability, and keeps stakeholders happy.

Who this is for

  • Aspiring and junior data engineers/platform engineers.
  • Analytics engineers who want to understand compute choices.
  • Backend engineers moving into data processing.

Prerequisites

  • Comfortable with SQL basics (SELECT, JOIN, GROUP BY, WHERE).
  • Familiarity with files in a data lake (e.g., Parquet/CSV) and object storage concepts.
  • Basic understanding of distributed systems ideas (clusters, nodes, parallelism).

Concept explained simply

Think of your data platform like a kitchen:

  • Spark: a versatile chef who can cook big meals (batch) and also handle a steady stream of orders (structured streaming). Great at heavy lifting.
  • Trino: a fast food counter for questions. You ask a SQL question and it quickly fetches ingredients from many pantries (federated sources) to serve an answer.
  • Flink: a conveyor-belt kitchen built for continuous orders (streaming). It tracks the state of each order carefully and knows when late orders still count.
Mental model: compute + storage separation

All three engines compute over data; they usually don’t store it. They read from storage (data lake, databases, message queues) and write results back. Your job is to pick the right engine for the task, format the data for efficient reads/writes (e.g., Parquet partitions), and tune the work (joins, shuffles) so clusters do less work for the same output.
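
As a minimal PySpark sketch of that separation (the bucket, paths, and column names are illustrative; because the layout is plain Parquet in object storage, Trino or Flink could read the same files):

# Minimal sketch: the engine computes, object storage holds the data.
# All paths and column names here are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-roundtrip").getOrCreate()

# Read only the partitions and columns the job needs; the engine keeps nothing afterwards.
events = (spark.read.parquet("s3://lake/events/")
               .filter(F.col("event_date") == "2026-01-10")
               .select("user_id", "event_type"))

# Compute, then write the result back to the lake for any engine to read later.
daily_counts = events.groupBy("event_type").count()
daily_counts.write.mode("overwrite").parquet("s3://lake/reports/event_type_counts/")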

Architectures at a glance

  • Spark: driver (plans the job) + executors (do the work). Modes: standalone, YARN, Kubernetes. Batch and structured streaming.
  • Trino: coordinator (parses/optimizes SQL) + workers (execute tasks). Stateless per query; great for interactive/federated SQL.
  • Flink: JobManager (plans, checkpoints) + TaskManagers (slots run tasks). Stream-first with exactly-once processing via checkpoints/savepoints.
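
To make the Spark driver/executor split concrete, here is a hedged sketch of how a driver is configured to request executors. The property names are real Spark settings, but the sizes are only illustrative and depend on your cluster manager and workload:

# Illustrative executor sizing only; tune for your cluster and workload.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-batch")
         .config("spark.executor.instances", "10")   # how many executors do the work
         .config("spark.executor.cores", "4")        # parallel tasks per executor
         .config("spark.executor.memory", "8g")      # memory per executor
         .getOrCreate())                             # the driver plans jobs; executors run tasks
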
Fault tolerance basics
  • Spark batch: recomputation using lineage; shuffle files; retries. Spark streaming: micro-batch/continuous + checkpoints.
  • Trino: queries are ephemeral; failures typically require re-running the query.
  • Flink: checkpoints (periodic state snapshots), savepoints (manual snapshots for upgrades), watermarks for event-time control.
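
For example, here is a minimal Spark Structured Streaming sketch of checkpoint-based recovery. The built-in rate source and the paths are placeholders; the point is the checkpointLocation option, which lets a restarted query resume where it left off:

# Placeholder source and paths; the key piece is checkpointLocation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-checkpoint-demo").getOrCreate()

stream = spark.readStream.format("rate").load()    # built-in test source: one row per second

query = (stream.writeStream
         .format("parquet")
         .option("path", "s3://lake/stream/demo/")              # where results land
         .option("checkpointLocation", "s3://lake/_chk/demo/")  # progress + state for recovery
         .start())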

When to use which

  • Use Spark when you need heavy batch transforms, large joins, complex ETL/ELT, or machine learning feature prep.
  • Use Trino when you need fast, interactive SQL across data lake and external systems (federation) with low operational overhead.
  • Use Flink when you need low-latency, stateful streaming with event-time semantics (windows, exactly-once sinks).

Storage and formats fundamentals

  • Prefer columnar formats (Parquet/ORC) for analytics to reduce I/O and enable predicate/column pruning.
  • Partition data by frequently filtered columns (e.g., date=YYYY-MM-DD). Keep partition sizes healthy (hundreds of MBs per file) to avoid small-file problems.
  • Enable compression (e.g., Snappy) for Parquet/ORC. It’s CPU-cheap and I/O-friendly.
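
A hedged PySpark sketch that puts these three points together, converting raw CSV into compressed, date-partitioned Parquet with reasonably sized files (paths and column names are illustrative):

# Convert raw CSV to analytics-friendly Parquet; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

raw = spark.read.option("header", "true").csv("s3://lake/raw/orders_csv/")

(raw.repartition("order_date")                    # co-locate rows per date to avoid many tiny files
    .write.mode("overwrite")
    .partitionBy("order_date")                    # enables partition pruning on reads
    .option("compression", "snappy")              # cheap CPU, much less I/O
    .parquet("s3://lake/curated/orders/"))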

Performance basics

  • Pushdown and pruning: let the engine skip irrelevant rows/columns/partitions.
  • Joins: broadcast small tables to avoid shuffles; otherwise ensure good partitioning keys.
  • Parallelism: enough tasks to keep all cores busy, but not so many that scheduling overhead dominates. Examples: Spark’s spark.sql.shuffle.partitions; Flink’s operator parallelism; Trino’s worker count and memory per query (see the sketch after the checklist below).
  • Data locality: co-locate compute near storage or use high-throughput networks.
Quick tuning checklist
  • Is input Parquet/ORC with needed columns only?
  • Are filters selective and applied early?
  • Are small dimension tables broadcasted (Spark/Trino)?
  • Are stream watermarks set (Flink) for correct event-time behavior?
  • Are partitions sized to avoid tiny files?
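
Here is a short PySpark sketch of two of the levers above: a broadcast join hint and shuffle parallelism. The table paths and the partition count are illustrative, and Trino and Flink expose equivalent knobs of their own:

# Illustrative tuning only; assumes an active SparkSession named `spark`.
from pyspark.sql import functions as F

spark.conf.set("spark.sql.shuffle.partitions", "400")      # parallelism for shuffles in joins/aggregations

facts = spark.read.parquet("s3://lake/curated/orders/")    # large fact table
dims  = spark.read.parquet("s3://lake/curated/products/")  # small dimension table

# Broadcast the small side so the large table is never shuffled for this join.
enriched = facts.join(F.broadcast(dims), "product_id")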

Worked examples

Example 1 — Spark batch transform (daily sales)

# PySpark (assumes an active SparkSession named `spark`)
from pyspark.sql import functions as F

# Read from the table root and filter on the partition column so Spark prunes partitions
# and the `date` column stays available for grouping and for partitionBy on write.
orders = spark.read.parquet("s3://lake/sales/orders/").filter("date = '2026-01-10'")
items  = spark.read.parquet("s3://lake/sales/items/")

filtered = orders.filter("country = 'US' AND total_amount > 0")
joined   = filtered.join(items, "order_id")
result   = joined.groupBy("date", "category").agg(F.sum("line_amount").alias("revenue"))

# Output path is illustrative.
result.write.mode("overwrite").partitionBy("date").parquet("s3://lake/sales/daily_revenue/")
  • Why Spark: heavy join and aggregation on large data; scheduled daily.
  • Tuning ideas: ensure items is smaller and broadcastable; filter early; write partitioned by date.

Example 2 — Trino interactive SQL (federated lookup)

-- Suppose hive.sales_orders is Parquet in your data lake
-- and mysql.crm_accounts is in an operational DB
SELECT o.order_id, a.account_tier, o.total_amount
FROM hive.sales_orders o
JOIN mysql.crm_accounts a ON o.account_id = a.id
WHERE o.order_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-10'
  AND a.region = 'NA';
  • Why Trino: quick, ad-hoc question across lake + MySQL without building a pipeline.
  • Tuning ideas: ensure partition pruning on order_date; consider pushing heavy filters to data sources; set session properties to broadcast small dimension tables if needed.

Example 3 — Flink streaming aggregation (30-minute pageview windows)

// Pseudocode (Flink DataStream API-like, Scala syntax).
// fromKafka and toParquet stand in for a real KafkaSource and a Parquet FileSink.
val events = env
  .fromKafka("pageviews")
  .assignTimestampsAndWatermarks(
    WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5)))  // tolerate 5 min of out-of-order events; real code also sets a timestamp assigner

events
  .keyBy(_.userId)                                        // partition the stream per user
  .window(TumblingEventTimeWindows.of(Time.minutes(30)))  // 30-minute event-time windows
  .reduce((a, b) => a.merge(b))                           // merge assumes an aggregatable event type
  .addSink(toParquet("s3://lake/stream/pageviews_30m/"))
  • Why Flink: event-time correctness with late arrivals; continuous low-latency outputs.
  • Tuning ideas: set appropriate watermark lateness; checkpointing for exactly-once; choose sink with batching to avoid tiny files.
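
As a minimal sketch of the checkpointing point, assuming PyFlink is installed (the interval and pause are only example values; the pipeline itself stays as in the pseudocode above):

# Enable exactly-once checkpoints; 60s interval is only an example value.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)        # snapshot state every 60 seconds
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)   # breathing room between snapshots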

Hands-on exercises

These mirror the exercises below. Try them before opening solutions.

  1. ex1 – Choose the engine
    Scenario: Data scientists want a quick analysis joining a 5 TB Parquet fact table in S3 with a small 50k-row customer table in MySQL. They need results in minutes, one-off.
    • Pick the engine and justify.
    • Outline 3 steps to run it efficiently.
  2. ex2 – Join tuning
    Scenario: In Spark, you join a 2 TB fact table with a 200 MB dimension table. The job is slow due to shuffle.
    • Pick a join strategy and two config/data changes to speed it up.
    • Explain why each change helps.
Exercise checklist
  • Engine choice aligns with latency and data source needs.
  • Partition pruning and filter pushdown considered.
  • Broadcast join used for small tables when appropriate.
  • Output layout avoids tiny files.

Common mistakes and self-check

  • Using CSV for analytics: switch to Parquet/ORC to cut I/O dramatically.
  • Ignoring partitioning: add date or other high-selectivity partition columns.
  • Forgetting broadcast joins: small tables should be broadcast to avoid shuffles (Spark/Trino).
  • No watermarks in streams: Flink windows miscount late events; set watermarks/lateness.
  • Tiny files: combine outputs, use compaction or larger batch sizes.
Self-check prompts
  • Can you explain why Trino is good for federated ad-hoc queries?
  • Can you sketch Spark driver/executors and when a shuffle happens?
  • Can you describe event-time vs processing-time and why watermarks matter in Flink?

Practical projects

  • Daily batch with Spark: build a job that reads Parquet orders, joins with a small dimension, writes a date-partitioned fact table with metrics.
  • Interactive SQL with Trino: connect a Hive/Glue catalog and a MySQL catalog; run 3 useful cross-source queries; document runtime and costs.
  • Streaming with Flink: consume events from a topic, compute 10-minute tumbling windows per key with event-time watermarks, write to Parquet with rolling files.

Mini challenge

You must deliver both a backfill of the last 90 days of metrics and then keep them fresh every 5 minutes. Which engines do you pick and why? Write a short plan that includes data formats, partitioning, and how you’ll validate correctness.

Hint

Consider Spark for backfill and Flink for continuous updates; store results in Parquet partitioned by date/hour. Validate via sampling and aggregate checks.
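
One way to do the aggregate checks from the hint, sketched in PySpark (the table paths, the date column, and the revenue metric are placeholders; the idea is to compare the batch backfill and the streaming output over the same window):

# Compare batch vs streaming outputs for one overlapping day.
# Placeholder paths and columns; assumes an active SparkSession named `spark`.
from pyspark.sql import functions as F

day = "2026-01-10"
batch  = spark.read.parquet("s3://lake/metrics_backfill/").filter(F.col("date") == day)
stream = spark.read.parquet("s3://lake/metrics_streaming/").filter(F.col("date") == day)

# Row counts and a key metric should match, or differ only within an agreed tolerance.
print(batch.count(), stream.count())
batch.agg(F.sum("revenue")).show()
stream.agg(F.sum("revenue")).show()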

Learning path and next steps

  • Master Parquet and partitioning (filters, statistics, compression).
  • Learn Spark joins and shuffle tuning; try structured streaming basics.
  • Learn Trino catalogs, connectors, and session properties for joins/memory.
  • Learn Flink event-time, watermarks, checkpoints, and savepoints.

Next steps

  • Do the hands-on exercises above and compare your answers with the solutions.
  • Build one practical project end-to-end this week.
  • Then take the Quick Test to confirm your understanding.

Quick Test

The test is available to everyone. Only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions (ex1 – Choose the engine)

Analysts need to join a 5 TB Parquet fact table in S3 with a 50k-row customer table in MySQL. They want an answer quickly (interactive), and it is a one-off exploration. Choose the engine and outline 3 concrete steps for efficient execution (consider pruning, join strategy, and output).

Expected Output
Engine: Trino. Steps: (1) Ensure date filter to prune S3 partitions and select only needed columns; (2) Broadcast the small MySQL table or ensure join pushes filters to MySQL; (3) Limit result set or write to a temporary Parquet table if needed.

Compute Engines: Spark, Trino, and Flink Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

