Why this matters
As a Machine Learning Engineer, your code must be reproducible, testable, and easy for teammates to use. Good packaging and project structure let you:
- Ship reusable training/inference code as an installable package.
- Run experiments consistently across machines and CI.
- Avoid import path chaos in notebooks and scripts.
- Make model training and serving entry points discoverable.
Real tasks you'll face:
- Turn a research notebook into a pip-installable module for the team.
- Expose a CLI (e.g., `train-model`) to run experiments with config files.
- Separate package code from data, models, and notebooks to keep repos clean.
- Version your package and model artifacts reliably.
Concept explained simply
Packaging is how you bundle your Python code so others can install and use it. Project structure is how you arrange files and folders so the project is predictable and maintainable.
Mental model
- Your package is the product: everything inside `src/your_pkg/` is what users import.
- Your repo is the workshop: docs, tests, data samples, notebooks, and scripts live here but are not necessarily part of the installable package.
- Entry points are doors: CLIs or functions that teammates can reliably call.
Core components of a solid ML package
- Use a src layout

```text
project-root/
├── pyproject.toml
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── dataio.py
│       ├── features/
│       │   ├── __init__.py
│       │   └── build.py
│       └── models/
│           ├── __init__.py
│           ├── train.py
│           └── predict.py
├── tests/
│   └── test_train.py
├── notebooks/
│   └── 01_exploration.ipynb
├── scripts/
│   └── run_training.py
└── README.md
```

Why: prevents accidental imports from the repo root and keeps imports consistent across dev and CI.
- Declare metadata and dependencies in `pyproject.toml`
  Define name, version, dependencies, and console scripts. This is the single source of truth for packaging.
- Prefer absolute imports
  Use `from my_package.models import train` instead of complex relative imports.
- Separate code from data
  Never package large datasets or credentials. Keep data in external storage or in a `data/` directory excluded from packaging.
- Provide entry points
  Expose training and inference commands via `console_scripts` so teammates can run them without hunting for files.
- Versioning
  Use semantic versioning (e.g., 0.1.0). Mirror the version in `pyproject.toml` and `src/my_package/__init__.py`.
- Tests
  Place tests in `tests/` and keep them runnable after `pip install -e .`.
What goes into pyproject.toml?
At minimum:
- `[build-system]` with a backend such as `setuptools` or `hatchling`.
- `[project]` with name, version, dependencies, and optional scripts.
- `[tool.setuptools]` (if using setuptools) to configure package discovery from `src`.
Worked examples
Example 1: Minimal ML package with setuptools
```toml
# pyproject.toml
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "churn_model"
version = "0.1.0"
description = "Churn prediction package"
authors = [{name = "Your Team"}]
requires-python = ">=3.9"
dependencies = [
    "pandas>=1.5",
    "scikit-learn>=1.2",
    "numpy>=1.23",
]

[project.scripts]
train-model = "churn_model.models.train:main"
predict = "churn_model.models.predict:main"

[tool.setuptools.packages.find]
where = ["src"]
```

```python
# src/churn_model/__init__.py
__version__ = "0.1.0"
```

```python
# src/churn_model/models/train.py
def main():
    print("Training started...")
    # load data, train, save model
    print("Training finished.")
```
Install locally (editable) and run:
```shell
pip install -e .
train-model
```
Example 2: src layout prevents import confusion
With `src/`, importing from a notebook uses the installed package, not local files:

```python
# In a notebook after `pip install -e .`
from churn_model.features.build import make_features
```

This avoids accidental relative imports like `from ..features` or relying on the notebook's working directory.
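When in doubt about which file an import actually resolves to (a common debugging step when local files shadow the installed package), you can inspect the module spec. The stdlib `json` module stands in here for `churn_model` so the snippet runs anywhere:

```python
# Sanity check: where will Python load this module from?
# Substitute "churn_model" for "json" once your package is installed.
import importlib.util

spec = importlib.util.find_spec("json")
print(spec.origin)  # the file the import system will actually load
```

If the printed path points at your repo root instead of `site-packages` (or your `src/` editable install), an unintended local file is shadowing the package.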
Example 3: Keep data out of the package
Do not include large data or models in the package. Instead, reference paths via config:
```python
# src/churn_model/dataio.py
import os
from pathlib import Path


def get_data_path(env_var: str = "DATA_DIR") -> Path:
    """Return the data directory, creating it if needed."""
    root = os.environ.get(env_var, "data")
    p = Path(root)
    p.mkdir(parents=True, exist_ok=True)
    return p
```
The package itself installs quickly; data is downloaded or loaded from `data/` at runtime, after installation.
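A quick usage sketch for `get_data_path` (the function is repeated here so the snippet runs standalone): setting the environment variable redirects the data root without any code changes.

```python
import os
import tempfile
from pathlib import Path


def get_data_path(env_var: str = "DATA_DIR") -> Path:
    """Return the data directory, creating it if needed."""
    root = os.environ.get(env_var, "data")
    p = Path(root)
    p.mkdir(parents=True, exist_ok=True)
    return p


# Point DATA_DIR at a temporary directory; get_data_path follows it.
with tempfile.TemporaryDirectory() as tmp:
    os.environ["DATA_DIR"] = tmp
    assert get_data_path() == Path(tmp)
os.environ.pop("DATA_DIR", None)
print("DATA_DIR override works")
```

The same pattern works in CI: export `DATA_DIR` in the job config and the package never needs to know where the data lives.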
Step-by-step: turn a notebook into a package
- Create structure: add `src/your_pkg/` with `__init__.py`; move reusable code into modules.
- Add pyproject.toml: define name, version, dependencies, and `console_scripts`.
- Refactor imports: use absolute imports (`from your_pkg.models import train`).
- Create entry points: small `main()` functions for train/predict.
- Install editable: `pip install -e .`, then import in notebooks.
- Add tests: write basic tests in `tests/` to verify imports and core functions.
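For the last step, even a minimal test pins down the CLI's observable behavior. This sketch uses a stand-in `main()` so it runs on its own; in your repo you would import the real one (`from your_pkg.models.train import main`) instead:

```python
# tests/test_train.py — hypothetical smoke test for the training entry point.
import contextlib
import io


def main():  # stand-in for your_pkg.models.train.main
    print("Training started...")
    print("Training finished.")


def test_main_prints_status():
    # Capture stdout and assert on the status lines the CLI promises.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        main()
    assert "Training finished." in buf.getvalue()


test_main_prints_status()  # pytest would discover and run this automatically
```

Keeping entry points as plain functions, rather than code under `if __name__ == "__main__":` only, is what makes them this easy to test.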
Exercises
Do these in a fresh folder.
- Exercise 1 — Create pyproject.toml
  Make a package named `churn_model` (version `0.1.0`) using setuptools. Include dependencies: pandas, scikit-learn, numpy. Add scripts: `train-model` and `predict`.
- Exercise 2 — Add entry points and fix imports
  Create `src/churn_model/models/train.py` with a `main()` that prints "Training finished." Install in editable mode and run `train-model`. Use absolute imports if needed.
- Exercise 3 — Add version and expose it
  Put `__version__ = "0.1.0"` in `src/churn_model/__init__.py`. From Python, verify that `import churn_model; print(churn_model.__version__)` prints `0.1.0`.
Exercise checklist
- pyproject.toml exists with correct build-system and project fields.
- Package installs with `pip install -e .` without errors.
- Entry points run: `train-model` prints the expected lines.
- Absolute imports succeed from a notebook or REPL.
- `__version__` is accessible.
Common mistakes and self-check
- No src layout: imports work locally but fail in CI. Fix: move code to `src/` and enable package discovery.
- Relative import chains: hard to read and brittle. Fix: use absolute imports from the top-level package.
- Data included in package: bloats the install. Fix: load data at runtime from external paths.
- Unpinned critical deps: reproducibility breaks later. Fix: specify minimum versions; pin in lockfiles or CI as needed.
- Missing entry points: teammates guess how to run things. Fix: provide `console_scripts` with clear names.
Self-check questions:
- Can a teammate install your package in a clean environment and run training with a single command?
- Do imports work from any working directory?
- Is your package free of large data and secrets?
Practical projects
- Turn a churn prediction notebook into a package with `train-model` and `predict` commands.
- Create a feature engineering subpackage (`features/`) and add tests validating transformations.
- Add a simple config system (YAML or JSON) and pass the path via a CLI flag.
Who this is for
- ML/AI practitioners moving from notebooks to production-grade code.
- Engineers who collaborate across teams and need reproducible pipelines.
Prerequisites
- Comfortable with Python functions, modules, and virtual environments.
- Basic terminal usage and pip.
- Familiarity with ML libraries such as scikit-learn.
Learning path
- Organize code into `src/` and refactor to absolute imports.
- Create `pyproject.toml` and install locally.
- Add entry points and basic tests.
- Refine structure (features, data IO, models) and document usage in README.
Mini challenge
In under 30 minutes, create a package `iris_pipeline` with:
- A `fit-model` CLI that trains a simple classifier on Iris and saves a model file.
- A `predict-iris` CLI that loads the model and predicts from a CSV.
- Both commands should print concise status messages and exit cleanly.
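One possible starting point for the `fit-model` command — a sketch only; the model choice, the file name, and the use of `joblib` (which ships alongside scikit-learn installs) are all assumptions, not requirements of the challenge:

```python
# Hypothetical src/iris_pipeline/fit.py — the fit-model entry point.
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def main(model_path: str = "iris_model.joblib"):
    # Train a simple, fast baseline on the full Iris dataset.
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    print(f"Model saved to {model_path}")
    return clf


if __name__ == "__main__":
    main()
```

Wiring `fit-model = "iris_pipeline.fit:main"` into `[project.scripts]` then gives you the CLI, mirroring the churn_model example above.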
Next steps
- Add tests that run in CI.
- Introduce type hints and a linter configuration via `pyproject.toml`.
- Document how to run training and prediction with examples.