Why this matters
As a Machine Learning Engineer, your code must be reproducible, testable, and easy for teammates to use. Good packaging and project structure let you:
- Ship reusable training/inference code as an installable package.
- Run experiments consistently across machines and CI.
- Avoid import path chaos in notebooks and scripts.
- Make model training and serving entry points discoverable.
Real tasks you'll face:
- Turn a research notebook into a pip-installable module for the team.
- Expose a CLI (e.g., `train-model`) to run experiments with config files.
- Separate package code from data, models, and notebooks to keep repos clean.
- Version your package and model artifacts reliably.
Concept explained simply
Packaging is how you bundle your Python code so others can install and use it. Project structure is how you arrange files and folders so the project is predictable and maintainable.
Mental model
- Your package is the product: everything inside `src/your_pkg/` is what users import.
- Your repo is the workshop: docs, tests, data samples, notebooks, and scripts live here but are not necessarily part of the installable package.
- Entry points are doors: CLIs or functions that teammates can reliably call.
Core components of a solid ML package
- Use a src layout

```text
project-root/
├── pyproject.toml
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── dataio.py
│       ├── features/
│       │   ├── __init__.py
│       │   └── build.py
│       └── models/
│           ├── __init__.py
│           ├── train.py
│           └── predict.py
├── tests/
│   └── test_train.py
├── notebooks/
│   └── 01_exploration.ipynb
├── scripts/
│   └── run_training.py
└── README.md
```

Why: prevents accidental imports from the repo root and keeps imports consistent across dev and CI.
- Declare metadata and dependencies in `pyproject.toml`
  Define name, version, dependencies, and console scripts. This is the single source of truth for packaging.
- Prefer absolute imports
  Use `from my_package.models import train` instead of complex relative imports.
- Separate code from data
  Never package large datasets or credentials. Keep data in external storage or in a `data/` directory excluded from packaging.
- Provide entry points
  Expose training and inference commands via `console_scripts` so teammates can run them without hunting for files.
- Versioning
  Use semantic versioning (e.g., 0.1.0). Mirror the version in `pyproject.toml` and `src/my_package/__init__.py`.
- Tests
  Place tests in `tests/` and keep them runnable after `pip install -e .`.
What goes into pyproject.toml?
At minimum:
- `[build-system]` with a backend such as `setuptools` or `hatchling`.
- `[project]` with name, version, dependencies, and optional scripts.
- `[tool.setuptools]` (if using setuptools) to configure package discovery from `src`.
Worked examples
Example 1: Minimal ML package with setuptools
```toml
# pyproject.toml
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "churn_model"
version = "0.1.0"
description = "Churn prediction package"
authors = [{name = "Your Team"}]
requires-python = ">=3.9"
dependencies = [
    "pandas>=1.5",
    "scikit-learn>=1.2",
    "numpy>=1.23",
]

[project.scripts]
train-model = "churn_model.models.train:main"
predict = "churn_model.models.predict:main"

[tool.setuptools.packages.find]
where = ["src"]
```

```python
# src/churn_model/__init__.py
__version__ = "0.1.0"
```

```python
# src/churn_model/models/train.py
def main():
    print("Training started...")
    # load data, train, save model
    print("Training finished.")
```
Install locally (editable) and run:
```shell
pip install -e .
train-model
```
Example 2: src layout prevents import confusion
With `src/`, importing from a notebook uses the installed package, not local files:

```python
# In a notebook after `pip install -e .`
from churn_model.features.build import make_features
```

This avoids accidental relative imports like `from ..features` or relying on the notebook's working directory.
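When in doubt about which file an import actually resolves to (a common debugging step when local files shadow the installed package), you can inspect the module spec. The stdlib `json` module stands in here for `churn_model` so the snippet runs anywhere:

```python
# Sanity check: where will Python load this module from?
# Substitute "churn_model" for "json" once your package is installed.
import importlib.util

spec = importlib.util.find_spec("json")
print(spec.origin)  # the file the import system will actually load
```

If the printed path points at your repo root instead of `site-packages` (or your `src/` editable install), an unintended local file is shadowing the package.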
Example 3: Keep data out of the package
Do not include large data or models in the package. Instead, reference paths via config:
```python
# src/churn_model/dataio.py
import os
from pathlib import Path


def get_data_path(env_var: str = "DATA_DIR") -> Path:
    """Return the data directory, creating it if needed."""
    root = os.environ.get(env_var, "data")
    p = Path(root)
    p.mkdir(parents=True, exist_ok=True)
    return p
```
The package itself installs quickly; data is downloaded or loaded from `data/` at runtime, after installation.
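A quick usage sketch for `get_data_path` (the function is repeated here so the snippet runs standalone): setting the environment variable redirects the data root without any code changes.

```python
import os
import tempfile
from pathlib import Path


def get_data_path(env_var: str = "DATA_DIR") -> Path:
    """Return the data directory, creating it if needed."""
    root = os.environ.get(env_var, "data")
    p = Path(root)
    p.mkdir(parents=True, exist_ok=True)
    return p


# Point DATA_DIR at a temporary directory; get_data_path follows it.
with tempfile.TemporaryDirectory() as tmp:
    os.environ["DATA_DIR"] = tmp
    assert get_data_path() == Path(tmp)
os.environ.pop("DATA_DIR", None)
print("DATA_DIR override works")
```

The same pattern works in CI: export `DATA_DIR` in the job config and the package never needs to know where the data lives.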
Step-by-step: turn a notebook into a package
- Create structure: add `src/your_pkg/` with `__init__.py`; move reusable code into modules.
- Add pyproject.toml: define name, version, dependencies, and `console_scripts`.
- Refactor imports: use absolute imports (`from your_pkg.models import train`).
- Create entry points: small `main()` functions for train/predict.
- Install editable: `pip install -e .`, then import in notebooks.
- Add tests: write basic tests in `tests/` to verify imports and core functions.
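For the last step, even a minimal test pins down the CLI's observable behavior. This sketch uses a stand-in `main()` so it runs on its own; in your repo you would import the real one (`from your_pkg.models.train import main`) instead:

```python
# tests/test_train.py — hypothetical smoke test for the training entry point.
import contextlib
import io


def main():  # stand-in for your_pkg.models.train.main
    print("Training started...")
    print("Training finished.")


def test_main_prints_status():
    # Capture stdout and assert on the status lines the CLI promises.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        main()
    assert "Training finished." in buf.getvalue()


test_main_prints_status()  # pytest would discover and run this automatically
```

Keeping entry points as plain functions, rather than code under `if __name__ == "__main__":` only, is what makes them this easy to test.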
Exercises
Do these in a fresh folder.
- Exercise 1 — Create pyproject.toml
  Make a package named `churn_model` (version `0.1.0`) using setuptools. Include dependencies: pandas, scikit-learn, numpy. Add scripts: `train-model` and `predict`.
- Exercise 2 — Add entry points and fix imports
  Create `src/churn_model/models/train.py` with a `main()` that prints "Training finished." Install in editable mode and run `train-model`. Use absolute imports if needed.
- Exercise 3 — Add version and expose it
  Put `__version__ = "0.1.0"` in `src/churn_model/__init__.py`. From Python, verify that `import churn_model; print(churn_model.__version__)` prints `0.1.0`.
Exercise checklist
- pyproject.toml exists with correct build-system and project fields.
- Package installs with `pip install -e .` without errors.
- Entry points run: `train-model` prints the expected lines.
- Absolute imports succeed from a notebook or REPL.
- `__version__` is accessible.
Common mistakes and self-check
- No src layout: imports work locally but fail in CI. Fix: move code to `src/` and enable package discovery.
- Relative import chains: hard to read and brittle. Fix: use absolute imports from the top-level package.
- Data included in package: bloats the install. Fix: load data at runtime from external paths.
- Unpinned critical deps: reproducibility breaks later. Fix: specify minimum versions; pin in lockfiles or CI as needed.
- Missing entry points: teammates guess how to run things. Fix: provide `console_scripts` with clear names.
Self-check questions:
- Can a teammate install your package in a clean environment and run training with a single command?
- Do imports work from any working directory?
- Is your package free of large data and secrets?
Practical projects
- Turn a churn prediction notebook into a package with `train-model` and `predict` commands.
- Create a feature engineering subpackage (`features/`) and add tests validating transformations.
- Add a simple config system (YAML or JSON) and pass the path via a CLI flag.
Who this is for
- ML/AI practitioners moving from notebooks to production-grade code.
- Engineers who collaborate across teams and need reproducible pipelines.
Prerequisites
- Comfortable with Python functions, modules, and virtual environments.
- Basic terminal usage and pip.
- Familiarity with ML libraries such as scikit-learn.
Learning path
- Organize code into `src/` and refactor to absolute imports.
- Create `pyproject.toml` and install locally.
- Add entry points and basic tests.
- Refine structure (features, data IO, models) and document usage in README.
Mini challenge
In under 30 minutes, create a package `iris_pipeline` with:
- A `fit-model` CLI that trains a simple classifier on Iris and saves a model file.
- A `predict-iris` CLI that loads the model and predicts from a CSV.
- Both commands should print concise status messages and exit cleanly.
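One possible starting point for the `fit-model` command — a sketch only; the model choice, the file name, and the use of `joblib` (which ships alongside scikit-learn installs) are all assumptions, not requirements of the challenge:

```python
# Hypothetical src/iris_pipeline/fit.py — the fit-model entry point.
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def main(model_path: str = "iris_model.joblib"):
    # Train a simple, fast baseline on the full Iris dataset.
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    print(f"Model saved to {model_path}")
    return clf


if __name__ == "__main__":
    main()
```

Wiring `fit-model = "iris_pipeline.fit:main"` into `[project.scripts]` then gives you the CLI, mirroring the churn_model example above.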
Next steps
- Add tests that run in CI.
- Introduce type hints and a linter configuration via `pyproject.toml`.
- Document how to run training and prediction with examples.