Why this matters
As a Data Scientist, you often repeat similar steps: loading, cleaning, enriching, aggregating, and visualizing. Writing these as small, parameterized functions makes your analyses faster to build, easier to debug, and safer to reuse across projects.
- Real task: Standardize data cleaning for weekly sales reports without copying code blocks.
- Real task: Reuse the same feature engineering steps for both training and inference.
- Real task: Share a robust aggregation routine with teammates that works on different datasets.
Concept explained simply
Reusable analysis code is a set of small, single-purpose functions that accept data and parameters, return new data without hidden side effects, and can be composed into pipelines. In pandas/NumPy, this typically means functions that take a DataFrame/ndarray and return a transformed one.
Mental model
Think LEGO bricks: each function is a brick with a clear shape (inputs/outputs). You can snap bricks together (compose) to build larger workflows. If a brick breaks, you replace it without rebuilding the whole set.
Principles to keep in mind
- DRY: Don’t Repeat Yourself—extract repeated steps into functions.
- Small, focused functions: one verb per function (clean, filter, enrich, aggregate).
- Pure-ish functions: avoid mutating inputs; return a new object instead.
- Parameterize: no hard-coded column names or thresholds (both illustrated in the sketch after this list).
- Compose: use DataFrame.pipe to chain transformations.
- Document: short docstrings with parameters, returns, and examples.
- Type hints: help readers and tools catch errors early.
- Vectorize: prefer pandas/NumPy operations over Python loops.
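To make the pure-ish and parameterize principles concrete, here is a minimal before/after sketch; the column name "price" and the clip threshold are illustrative placeholders, not taken from the worked examples below.

import pandas as pd

# Anti-pattern: mutates the caller's DataFrame and hard-codes the column name.
def clip_prices_inplace(df):
    df["price"] = df["price"].clip(lower=0)

# Pure-ish and parameterized: returns a new object; the caller's data is untouched.
def clip_column(df: pd.DataFrame, *, col: str, lower: float = 0) -> pd.DataFrame:
    """Return a copy of df with col clipped at lower."""
    out = df.copy()
    out[col] = out[col].clip(lower=lower)
    return out

prices = pd.DataFrame({"price": [10, -3, 5]})
clipped = clip_column(prices, col="price")  # prices itself is unchanged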
Worked examples
Example 1: Refactor repeated cleaning into a reusable function
Goal: parse dates, trim strings, coerce numeric columns, clip negatives, drop duplicates.
import pandas as pd
from typing import List, Optional

def clean_frame(df: pd.DataFrame, *, date_col: Optional[str] = None, numeric_cols: Optional[List[str]] = None) -> pd.DataFrame:
    """
    Return a cleaned copy of df.
    - Optionally parse date_col.
    - Trim whitespace in string columns.
    - Coerce numeric_cols to numeric and clip at 0.
    - Drop duplicates and reset index.
    """
    numeric_cols = numeric_cols or []
    out = df.copy()
    # Parse date if present
    if date_col and date_col in out.columns:
        out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    # Trim strings
    for c in out.select_dtypes(include="object").columns:
        out[c] = out[c].str.strip()
    # Coerce and clip numerics
    for c in numeric_cols:
        if c in out.columns:
            out[c] = pd.to_numeric(out[c], errors="coerce").clip(lower=0)
    out = out.drop_duplicates().reset_index(drop=True)
    return out

# Usage
raw = pd.DataFrame({
    "date": ["2025-12-01", "2025-12-01", "2025-12-02"],
    "product": [" A ", " A ", "B"],
    "revenue": ["10", "10", "-3"],
})
clean = clean_frame(raw, date_col="date", numeric_cols=["revenue"])
print(clean)
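Running this prints two rows: the duplicate 2025-12-01 record is dropped, the product strings are trimmed to "A" and "B", and the negative revenue is clipped to 0. Note the bare * in the signature: it makes date_col and numeric_cols keyword-only, so call sites stay self-documenting.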
Example 2: Parameterized aggregation with flat columns
Goal: reusable groupby that accepts grouping columns and metric mapping.
import pandas as pd
from typing import Callable, Dict, List, Union

AggSpec = Dict[str, Union[str, List[str], Callable]]

def summarise_by(df: pd.DataFrame, *, by: List[str], metrics: AggSpec) -> pd.DataFrame:
    """
    Group by 'by' and apply 'metrics'. Returns a flat-column DataFrame.
    Example metrics: {"revenue": ["sum", "mean"], "qty": "sum"}
    """
    grouped = df.groupby(by, dropna=False).agg(metrics)
    # Flatten MultiIndex columns such as ("revenue", "sum") into "revenue_sum"
    grouped.columns = [
        "_".join(t) if isinstance(t, tuple) else str(t)
        for t in grouped.columns.to_list()
    ]
    return grouped.reset_index()

# Usage
sales = pd.DataFrame({
    "region": ["E", "E", "W"],
    "product": ["A", "A", "B"],
    "revenue": [10, 15, 7],
    "qty": [1, 2, 1],
})
print(summarise_by(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}))
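If you would rather name the output columns up front than flatten them afterwards, pandas' named aggregation is an equivalent alternative (same data as above):

import pandas as pd

sales = pd.DataFrame({
    "region": ["E", "E", "W"],
    "product": ["A", "A", "B"],
    "revenue": [10, 15, 7],
    "qty": [1, 2, 1],
})

# Named aggregation: each keyword becomes a flat output column.
flat = (
    sales.groupby(["region", "product"], dropna=False)
    .agg(
        revenue_sum=("revenue", "sum"),
        revenue_mean=("revenue", "mean"),
        qty_sum=("qty", "sum"),
    )
    .reset_index()
)
print(flat)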
Example 3: Compose a pipeline with DataFrame.pipe
Goal: chain transformations without creating many temp variables.
import pandas as pd
import numpy as np
from typing import Optional

# Reuse clean_frame from Example 1

def add_margin(df: pd.DataFrame, *, revenue_col: str, cost_col: str, out_col: str = "margin") -> pd.DataFrame:
    """Return a copy of df with out_col = revenue_col - cost_col."""
    out = df.copy()
    out[out_col] = out[revenue_col] - out[cost_col]
    return out

def add_margin_pct(df: pd.DataFrame, *, margin_col: str = "margin", revenue_col: str = "revenue", out_col: str = "margin_pct") -> pd.DataFrame:
    """Return a copy of df with out_col = margin_col / revenue_col, NaN where revenue is 0."""
    out = df.copy()
    denom = out[revenue_col]
    out[out_col] = np.where(denom.eq(0), np.nan, out[margin_col] / denom)
    return out

def filter_recent(df: pd.DataFrame, *, date_col: str, days: int = 90, now: Optional[str] = None) -> pd.DataFrame:
    """Return the rows of df whose date_col falls within the last 'days' days of 'now' (default: today)."""
    now_ts = pd.Timestamp(now) if now else pd.Timestamp.today()
    cutoff = now_ts - pd.Timedelta(days=days)
    mask = pd.to_datetime(df[date_col]) >= cutoff
    return df.loc[mask].copy()

# Example usage
raw = pd.DataFrame({
    "date": ["2025-12-20", "2025-11-01", "2025-12-28"],
    "revenue": [100, 50, 0],
    "cost": [60, 30, 5],
})
result = (
    raw
    .pipe(clean_frame, date_col="date", numeric_cols=["revenue", "cost"])  # from Example 1
    .pipe(add_margin, revenue_col="revenue", cost_col="cost")
    .pipe(add_margin_pct, margin_col="margin", revenue_col="revenue")
    .pipe(filter_recent, date_col="date", days=40, now="2026-01-01")
)
print(result)
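df.pipe(f, **kwargs) is shorthand for f(df, **kwargs), so each step receives the previous step's output and the chain reads top to bottom. With the inputs above, the 2025-11-01 row falls outside the 40-day window ending 2026-01-01, leaving two rows; margin_pct is NaN on the zero-revenue row rather than a division error.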
Who this is for
- Data Scientists and Analysts who repeat similar pandas workflows.
- Anyone preparing notebooks for handoff to teammates or production.
Prerequisites
- Basic Python syntax, functions, and modules.
- Comfort with pandas DataFrame operations and NumPy arrays.
Learning path
- Write one small function for a repeated step (e.g., trimming strings).
- Parameterize it (columns, thresholds) and add a docstring.
- Compose two functions with DataFrame.pipe.
- Generalize to a mini-pipeline (clean → enrich → aggregate).
- Bundle helpers into a module file for reuse across notebooks (a minimal sketch follows this list).
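A minimal sketch of that last bundling step, assuming you save the helpers from the worked examples in a file named utils_analysis.py next to your notebooks:

import pandas as pd
from utils_analysis import clean_frame, summarise_by  # assumed module with the helpers above

raw = pd.DataFrame({"date": ["2025-12-01"], "product": [" A "], "revenue": ["10"]})
summary = (
    raw.pipe(clean_frame, date_col="date", numeric_cols=["revenue"])
       .pipe(summarise_by, by=["product"], metrics={"revenue": "sum"})
)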
Exercises
Do these in your local environment, then check your work against the self-check checklist below.
Exercise 1 — Clean and standardize
Write a function load_and_clean that reads CSV text and returns a cleaned DataFrame using the same rules as clean_frame in Example 1.
Data to use
from io import StringIO
import pandas as pd
csv = StringIO("""
date,product,revenue,cost
2025-12-30, A ,100, 60
2025-12-30, A ,100, 60
2025-12-31,B,-5, 3
""")
# Use pd.read_csv(csv)
- Parse date column.
- Trim spaces in strings.
- Coerce revenue and cost to numeric and clip at 0.
- Drop duplicates and reset index.
Exercise 2 — Reusable aggregation
Write a function sales_summary(df, by, metrics) that wraps groupby, applies metrics, flattens columns, and returns a DataFrame.
Data to use
import pandas as pd
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["A", "A", "A", "B"],
    "revenue": [10, 20, 7, 5],
    "qty": [1, 2, 1, 1],
})
- Call: sales_summary(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}).
- Ensure column names are flat, like revenue_sum, revenue_mean, qty_sum.
Self-check checklist
- Your functions don’t mutate inputs; they return new DataFrames.
- No hard-coded column names where a parameter would be better.
- Each function has a one-line docstring and type hints.
- You can chain them with .pipe in a single expression.
Common mistakes and how to self-check
- Hard-coded specifics: Replace literals (e.g., column names) with parameters; set sensible defaults.
- In-place mutation: Using inplace=True can hide side effects; prefer copying and returning.
- Mixing I/O and transforms: Keep file reading separate from transformations so you can reuse transforms on in-memory data.
- Overlong functions: Split into smaller steps; one verb per function is a good rule.
- Non-vectorized loops: Replace for loops over rows with pandas/NumPy operations for speed and clarity (see the sketch after this list).
- Unclear outputs: Ensure your function returns a predictable schema and document it.
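To illustrate the vectorization point, a small before/after sketch with placeholder column names:

import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 50.0, 0.0], "cost": [60.0, 30.0, 5.0]})

# Slow and verbose: a Python loop over rows.
margins = []
for _, row in df.iterrows():
    margins.append(row["revenue"] - row["cost"])
df["margin_loop"] = margins

# Vectorized: one expression over whole columns; faster and clearer.
df["margin"] = df["revenue"] - df["cost"]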
Quick self-audit
- Rename functions to verbs (e.g., filter_recent, not recent_filter_stuff).
- Remove any print/log statements from transformation functions; return data instead.
- Add a small example in the docstring showing input → output columns.
Practical projects
- Reusable cleaning module: Build clean_frame, standardize_columns, coerce_numeric, and filter_date_range in one Python file. Use them across three datasets.
- Feature engineering toolkit: Implement add_ratio, winsorize, zscore, and bucketize with parameters. Demonstrate on a training set and a scoring set.
- Aggregation library: Create summarise_by, top_n_by_group, and pivot_compact. Include docstrings and type hints.
Mini challenge
Build a 4-step pipeline: clean → filter_recent → add_margin → summarise_by. Make each function parameterized and composable with .pipe. Write a one-paragraph docstring for each function with a minimal example.
Hints
- Start from Example 1 and 3; unify parameter names so composition is straightforward.
- Keep transformations pure: return a new DataFrame each step.
- Flatten aggregated columns for predictable downstream use.
Next steps
- Extract your helpers into a single utils_analysis.py and import it across notebooks.
- Add lightweight tests for each function (even 2–3 assertions per function helps); see the sketch below.
- Measure runtime on a sample dataset and replace any slow row-wise operations with vectorized alternatives.
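A lightweight test sketch for clean_frame, assuming the helpers live in utils_analysis.py; plain asserts are enough to start, and a runner such as pytest will pick up the test_ function unchanged:

import pandas as pd
from utils_analysis import clean_frame  # assumed location of the Example 1 helper

def test_clean_frame_is_pure_and_clips():
    raw = pd.DataFrame({"product": [" A ", " A "], "revenue": ["10", "-3"]})
    before = raw.copy()
    out = clean_frame(raw, numeric_cols=["revenue"])
    assert raw.equals(before)           # input was not mutated
    assert (out["revenue"] >= 0).all()  # negatives clipped to 0
    assert len(out) == 2                # no accidental row loss

test_clean_frame_is_pure_and_clips()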