Why this matters
As a Data Engineer, you ship code to many places: Airflow workers, Spark clusters, containers, and servers. Each place needs the exact same dependencies to run reliably. Good dependency management prevents brittle pipelines, avoids “works on my machine,” and makes rollbacks and audits straightforward.
- Keep ETL jobs reproducible across dev, CI, and prod.
- Pin and lock versions to prevent surprise breaks from upstream releases.
- Deploy containers with deterministic builds for Spark/Batch/Streaming tasks.
- Meet security expectations by verifying hashes and using known-good versions.
Concept explained simply
Dependencies are the external pieces your code needs: libraries, system packages, connectors. Managing them means choosing versions, recording them, installing them the same way everywhere, and updating them safely.
Mental model
- Recipe: You write a short list of ingredients you want (top-level dependencies).
- Lock: A precise shopping list the store can fulfill exactly (resolved, pinned, hashed versions).
- Kitchen: An isolated place to cook (virtualenv/conda or a container).
- Meal: The built artifact or image that anyone can run, anywhere.
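To make the analogy concrete in Python terms (the file and image names below are common conventions, not requirements):
# Recipe  -> requirements.in   (short list of top-level intent)
# Lock    -> requirements.txt  (compiled: exact versions plus hashes)
# Kitchen -> .venv/ or a container used for builds and runs
# Meal    -> the built wheel or image you deploy everywhere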
Core practices (what to actually do)
- Use isolation: virtual environments (Python) or containers for every project.
- Prefer pins and lock files: avoid unbounded ranges like ">=" or "latest" in production builds.
- Separate intent from resolution: keep a short "requirements.in" (or pyproject dependencies) and a locked "requirements.txt" or lock file.
- Reproducible installs: use hashes and deterministic install flags (see the sketch after this list).
- Handle system deps explicitly: pin OS packages and native libs in Dockerfiles.
- Cache wisely: cache wheels and artifacts; avoid caching the internet (keep lock files authoritative).
- Upgrade on purpose: schedule dependency reviews; update in small chunks; test and rollback.
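A minimal sketch of the reproducible-install practice, assuming a pip-tools-style requirements.txt that already contains hashes:
# Install only what the lock file says, and fail on anything unhashed
#   --require-hashes  rejects any package whose hash is not recorded in the lock
#   --no-deps         stops pip from resolving extra transitive packages on its own
python -m pip install --require-hashes --no-deps -r requirements.txt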
Glossary (quick reference)
- Top-level dependency: The library you request (e.g., pandas).
- Transitive dependency: A dependency of your dependency (e.g., numpy pulled by pandas).
- Pinning: Choosing an exact version (e.g., pandas==2.2.2).
- Lock file: Machine-generated file that records exact resolved versions and hashes.
- Constraints file: A file that restricts versions without adding new packages.
- Semantic Versioning (SemVer): MAJOR.MINOR.PATCH. Breaking changes usually in MAJOR.
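To make the version-specifier terms concrete, here are illustrative PEP 440 examples (package names are arbitrary):
pandas==2.2.2        # pin: exactly this version
pandas~=2.2.1        # compatible release: >=2.2.1,<2.3 (patch updates only)
pandas>=2.1,<3       # range: any 2.x from 2.1 up (too loose for production builds)
# A constraints file restricts versions without adding packages:
pip install -r requirements.in -c constraints.txt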
Worked examples
Example 1 — Python ETL with pip-tools (deterministic installs)
- Create an isolated env.
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
- Write top-level intent in requirements.in.
# requirements.in
pandas==2.2.*
requests>=2.31,<2.33
pyarrow==14.*
- Compile a locked file with hashes.
python -m pip install --upgrade pip pip-tools
pip-compile requirements.in \
  --generate-hashes \
  -o requirements.txt
- Install exactly what’s locked.
pip-sync requirements.txt
- Sanity check imports.
python -c "import pandas, requests, pyarrow; print('ok')"
This approach prevents accidental upgrades and ensures every environment matches.
Example 2 — Airflow + provider with a constraints file (no online lookups required)
- Create a local constraints file for your chosen Airflow version (example versions only).
# constraints-airflow-2.8.4.txt
apache-airflow==2.8.4
apache-airflow-providers-amazon==8.16.0
click==8.1.7
pendulum==3.0.0
# ... include other resolved pins as needed
- In your Dockerfile, install using the constraints file.
# Dockerfile (snippet)
FROM python:3.10-slim
WORKDIR /app
COPY constraints-airflow-2.8.4.txt /app/
RUN python -m pip install --upgrade pip \
  && pip install "apache-airflow==2.8.4" \
       "apache-airflow-providers-amazon==8.16.0" \
       -c constraints-airflow-2.8.4.txt \
  && airflow version
Using a constraints file prevents version drift and keeps provider versions compatible.
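A typical way to build and spot-check the image above (the tag name is illustrative):
docker build -t airflow-runner:2.8.4 .
docker run --rm airflow-runner:2.8.4 airflow version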
Example 3 — Docker with pinned OS packages (Spark job runner)
- Create a minimal, pinned base.
FROM python:3.10-slim
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
       openjdk-17-jre-headless=17.* \
       libkrb5-3=1.20.* \
  && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
- Pin Python deps with a lock file, then install.
- Copy your Spark job and run with the exact JVM + Python versions you expect (see the Dockerfile continuation sketch below).
Pinned OS packages avoid subtle runtime issues on different hosts.
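A possible continuation of the Dockerfile above for the last two steps; it assumes pyspark is pinned in requirements.txt and that the job file is named job.py:
WORKDIR /app
COPY requirements.txt .
# Install exactly the locked Python deps (the pinned pyspark wheel provides spark-submit)
RUN python -m pip install --upgrade pip \
  && pip install --require-hashes --no-deps -r requirements.txt
COPY job.py .
ENTRYPOINT ["spark-submit", "job.py"]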
Example 4 — Scala Spark with shaded fat JAR (avoid transitive clashes)
- In build tool (e.g., sbt/Gradle), fix versions for Spark and connectors.
- Shade or relocate conflicting transitive deps (e.g., guava) to prevent runtime classpath conflicts (see the sbt sketch below).
- Publish one fat JAR; the cluster sees one deterministic artifact.
Shading isolates your dependency universe from the cluster’s.
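A minimal sbt-assembly sketch of this approach; the versions and the "shaded." prefix are illustrative, and the sbt-assembly plugin itself must be added in project/plugins.sbt:
// build.sbt (sketch)
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is "provided": the cluster supplies it, so it is never bundled or shaded
  "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided",
  // Example of a transitive troublemaker we do want to bundle and relocate
  "com.google.guava" % "guava" % "33.0.0-jre"
)

// Relocate guava classes so our copy cannot clash with the cluster's classpath
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)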
Exercises
Complete the tasks below. Compare your work with the solution under each exercise, and use the checklist to track progress.
Exercise 1 — Pin and lock Python ETL dependencies with pip-tools
- Create a virtual environment.
- Create requirements.in listing only top-level packages (pandas, requests, pyarrow).
- Use pip-tools to compile requirements.txt with hashes.
- Install with pip-sync.
- Run a one-line Python import test.
Solution
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip pip-tools
printf "pandas==2.2.*\nrequests>=2.31,<2.33\npyarrow==14.*\n" > requirements.in
pip-compile requirements.in --generate-hashes -o requirements.txt
pip-sync requirements.txt
python -c "import pandas, requests, pyarrow; print('ok')"Expected: a requirements.txt with fully pinned versions and sha256 hashes; the import test prints "ok".
- [ ] venv created
- [ ] requirements.in added
- [ ] requirements.txt compiled with hashes
- [ ] pip-sync completed
- [ ] import test passed
Exercise 2 — Install Airflow + provider using a local constraints file in Docker
- Create constraints-airflow-2.8.4.txt with pinned versions (example values are acceptable).
- Write a Dockerfile that installs Airflow and a provider using -c constraints file.
- Build the image; ensure airflow version prints during build.
Solution
# constraints-airflow-2.8.4.txt (example)
apache-airflow==2.8.4
apache-airflow-providers-amazon==8.16.0
click==8.1.7
pendulum==3.0.0
# Dockerfile (example)
FROM python:3.10-slim
WORKDIR /app
COPY constraints-airflow-2.8.4.txt /app/
RUN python -m pip install --upgrade pip \
  && pip install "apache-airflow==2.8.4" \
       "apache-airflow-providers-amazon==8.16.0" \
       -c constraints-airflow-2.8.4.txt \
  && airflow version
Expected: docker build succeeds and prints an Airflow version consistent with the constraints file.
- [ ] constraints file created
- [ ] Dockerfile written
- [ ] Image builds successfully
Tip: The specific versions here are examples. Use versions that match your environment policy.
Common mistakes and how to self-check
- Using "latest" or wide ranges (e.g., >=) in production. Self-check: Does your lock file pin exact versions? Do images build deterministically?
- Mixing global Python with project envs. Self-check: Does your prompt show the project venv? Does "which python" point inside .venv?
- Ignoring transitive changes. Self-check: Regenerate the lock file in CI and diff it before merging (see the sketch after this list).
- Unpinned OS packages in Docker. Self-check: Are apt packages version-pinned (e.g., 1.20.*)?
- Lock files not used in CI/CD. Self-check: CI installs with the lock/constraints, not from the intent file.
- No rollback plan. Self-check: Can you checkout prior lock file, rebuild, and redeploy quickly?
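One way to implement the "regenerate and diff" self-check as a CI step, assuming pip-tools and a committed requirements.txt:
# Recompile the lock from the intent file, then fail the job if it changed
python -m pip install --upgrade pip pip-tools
pip-compile requirements.in --generate-hashes -o requirements.txt
git diff --exit-code requirements.txt   # non-zero exit means the lock has drifted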
Quick self-audit checklist
- [ ] Project has an intent file (requirements.in/pyproject) and a lock file.
- [ ] CI uses lock/constraints for installs.
- [ ] Dockerfiles pin OS packages and clean apt caches.
- [ ] Virtualenv or container is mandatory for local dev.
- [ ] Regular dependency review cadence exists.
Practical projects
- Create a reproducible Airflow DAG environment: constraints file + Docker image + small DAG import test.
- Build a Spark runner image: pinned JRE, pinned Python deps, sample PySpark job.
- Set up a dependency update workflow: weekly lockfile refresh PR + smoke tests.
- Package a shared utility library and serve it from an internal artifact registry; reference it from two services.
Who this is for
- Data Engineers building pipelines and platform components.
- Analytics Engineers maintaining dbt or Python workflows.
- Platform/Infra folks supporting Airflow, Spark, or batch systems.
Prerequisites
- Basic command line usage.
- Beginner Python or JVM build familiarity.
- Docker basics (build, run) helpful but not required.
Learning path
- Start: Dependency fundamentals (this lesson).
- Next: Build reproducible containers for data jobs.
- Then: CI/CD for data pipelines (install from lock, cache artifacts, scan vulnerabilities).
- Later: Environment promotion and rollbacks (dev → staging → prod with the same artifact).
Mini challenge
Take any existing pipeline, create a lock/constraints file, rebuild the artifact/image, and run a smoke test. Then bump one non-breaking dependency (e.g., patch/minor), rebuild, and confirm the smoke test still passes. Record the steps you used for rollback.
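A rollback sketch for the last step, assuming the lock file is tracked in git and the artifact is rebuilt from it (the image tag is illustrative):
git log --oneline -- requirements.txt           # find the last known-good lock
git checkout <good-commit> -- requirements.txt  # restore it
docker build -t my-etl:rollback .               # rebuild the artifact from the restored lock
# redeploy the rebuilt image through your normal pipeline, then rerun the smoke test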
Next steps
- Adopt lock files everywhere code is executed (local, CI, prod).
- Introduce a regular update cadence with automated PRs and smoke tests.
- Standardize base images with pinned OS and language runtimes.