
Writing Reusable Analysis Code

Learn to write reusable analysis code for free, with explanations, exercises, and a quick test (aimed at Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, you often repeat similar steps: loading, cleaning, enriching, aggregating, and visualizing. Writing these as small, parameterized functions makes your analyses faster to build, easier to debug, and safer to reuse across projects.

  • Real task: Standardize data cleaning for weekly sales reports without copying code blocks.
  • Real task: Reuse the same feature engineering steps for both training and inference.
  • Real task: Share a robust aggregation routine with teammates that works on different datasets.

Concept explained simply

Reusable analysis code is a set of small, single-purpose functions that accept data and parameters, return new data without hidden side effects, and can be composed into pipelines. In pandas/NumPy, this typically means functions that take a DataFrame/ndarray and return a transformed one.
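
As a toy sketch of that shape (upper_case_col and col are illustrative names, not from any library):

import pandas as pd

def upper_case_col(df: pd.DataFrame, *, col: str) -> pd.DataFrame:
    """Return a copy of df with `col` upper-cased."""
    out = df.copy()  # never mutate the caller's frame
    out[col] = out[col].str.upper()
    return out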

Mental model

Think LEGO bricks: each function is a brick with a clear shape (inputs/outputs). You can snap bricks together (compose) to build larger workflows. If a brick breaks, you replace it without rebuilding the whole set.

Principles to keep in mind
  • DRY: Don’t Repeat Yourself—extract repeated steps into functions.
  • Small, focused functions: one verb per function (clean, filter, enrich, aggregate).
  • Pure-ish functions: avoid mutating inputs; return a new object instead.
  • Parameterize: no hard-coded column names or thresholds.
  • Compose: use DataFrame.pipe to chain transformations.
  • Document: short docstrings with parameters, returns, and examples.
  • Type hints: help readers and tools catch errors early.
  • Vectorize: prefer pandas/NumPy operations over Python loops.
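
To make the pure-ish and parameterize points concrete, here is a minimal sketch contrasting a mutating helper with a pure, parameterized one (both function names are illustrative):

import pandas as pd

def drop_negatives_inplace(df):
    # Anti-pattern: hidden side effect, the caller's frame is changed
    df["revenue"] = df["revenue"].clip(lower=0)

def clip_negative(df: pd.DataFrame, *, col: str) -> pd.DataFrame:
    """Return a copy of df with negatives in `col` clipped to 0."""
    out = df.copy()  # pure-ish: work on a copy
    out[col] = out[col].clip(lower=0)
    return out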

Worked examples

Example 1: Refactor repeated cleaning into a reusable function

Goal: parse dates, trim strings, coerce numeric columns, clip negatives, drop duplicates.

import pandas as pd
from typing import List, Optional

def clean_frame(df: pd.DataFrame, *, date_col: Optional[str] = None, numeric_cols: Optional[List[str]] = None) -> pd.DataFrame:
    """
    Return a cleaned copy of df.
    - Optionally parse date_col.
    - Trim whitespace in string columns.
    - Coerce numeric_cols to numeric and clip at 0.
    - Drop duplicates and reset index.
    """
    numeric_cols = numeric_cols or []
    out = df.copy()

    # Parse date if present
    if date_col and date_col in out.columns:
        out[date_col] = pd.to_datetime(out[date_col], errors="coerce")

    # Trim strings
    for c in out.select_dtypes(include="object").columns:
        out[c] = out[c].str.strip()

    # Coerce and clip numerics
    for c in numeric_cols:
        if c in out.columns:
            out[c] = pd.to_numeric(out[c], errors="coerce").clip(lower=0)

    out = out.drop_duplicates().reset_index(drop=True)
    return out

# Usage
raw = pd.DataFrame({
    "date": ["2025-12-01", "2025-12-01", "2025-12-02"],
    "product": [" A ", " A ", "B"],
    "revenue": ["10", "10", "-3"],
})
clean = clean_frame(raw, date_col="date", numeric_cols=["revenue"]) 
print(clean)
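
Expected result: the duplicate row is dropped (two rows remain), product is trimmed to "A" and "B", and the negative revenue is clipped, leaving revenue values of 10 and 0.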

Example 2: Parameterized aggregation with flat columns

Goal: reusable groupby that accepts grouping columns and metric mapping.

import pandas as pd
from typing import List, Dict, Union, Callable

AggSpec = Dict[str, Union[str, List[str], Callable]]

def summarise_by(df: pd.DataFrame, *, by: List[str], metrics: AggSpec) -> pd.DataFrame:
    """
    Group by 'by' and apply 'metrics'. Returns a flat-column DataFrame.
    Example metrics: {"revenue": ["sum", "mean"], "qty": "sum"}
    """
    grouped = df.groupby(by, dropna=False).agg(metrics)
    # Flatten MultiIndex columns
    grouped.columns = [
        ("_".join(t) if isinstance(t, tuple) else str(t))
        for t in grouped.columns.to_list()
    ]
    return grouped.reset_index()

# Usage
sales = pd.DataFrame({
    "region": ["E", "E", "W"],
    "product": ["A", "A", "B"],
    "revenue": [10, 15, 7],
    "qty": [1, 2, 1],
})
print(summarise_by(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}))
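
As an alternative, pandas' named aggregation produces flat column names directly, with no manual flattening step; a minimal sketch equivalent to the call above:

flat = sales.groupby(["region", "product"], dropna=False).agg(
    revenue_sum=("revenue", "sum"),
    revenue_mean=("revenue", "mean"),
    qty_sum=("qty", "sum"),
).reset_index()
print(flat)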

Example 3: Compose a pipeline with DataFrame.pipe

Goal: chain transformations without creating many temp variables.

import pandas as pd
import numpy as np
from typing import Optional

# Reuse clean_frame from Example 1

def add_margin(df: pd.DataFrame, *, revenue_col: str, cost_col: str, out_col: str = "margin") -> pd.DataFrame:
    out = df.copy()
    out[out_col] = out[revenue_col] - out[cost_col]
    return out

def add_margin_pct(df: pd.DataFrame, *, margin_col: str = "margin", revenue_col: str = "revenue", out_col: str = "margin_pct") -> pd.DataFrame:
    out = df.copy()
    denom = out[revenue_col]
    out[out_col] = np.where(denom.eq(0), np.nan, out[margin_col] / denom)
    return out

def filter_recent(df: pd.DataFrame, *, date_col: str, days: int = 90, now: Optional[str] = None) -> pd.DataFrame:
    now_ts = pd.Timestamp(now) if now else pd.Timestamp.today()
    cutoff = now_ts - pd.Timedelta(days=days)
    mask = pd.to_datetime(df[date_col]) >= cutoff
    return df.loc[mask].copy()

# Example usage
raw = pd.DataFrame({
    "date": ["2025-12-20", "2025-11-01", "2025-12-28"],
    "revenue": [100, 50, 0],
    "cost": [60, 30, 5],
})
result = (
    raw
    .pipe(clean_frame, date_col="date", numeric_cols=["revenue", "cost"])  # from Example 1
    .pipe(add_margin, revenue_col="revenue", cost_col="cost")
    .pipe(add_margin_pct, margin_col="margin", revenue_col="revenue")
    .pipe(filter_recent, date_col="date", days=40, now="2026-01-01")
)
print(result)
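
Note that df.pipe(func, **kwargs) is equivalent to func(df, **kwargs), so each helper receives the intermediate DataFrame as its first argument. In this run, now="2026-01-01" with days=40 puts the cutoff at 2025-11-22, so the 2025-11-01 row is dropped and two rows remain.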

Who this is for

  • Data Scientists and Analysts who repeat similar pandas workflows.
  • Anyone preparing notebooks for handoff to teammates or production.

Prerequisites

  • Basic Python syntax, functions, and modules.
  • Comfort with pandas DataFrame operations and NumPy arrays.

Learning path

  1. Write one small function for a repeated step (e.g., trimming strings).
  2. Parameterize it (columns, thresholds) and add a docstring.
  3. Compose two functions with DataFrame.pipe.
  4. Generalize to a mini-pipeline (clean → enrich → aggregate).
  5. Bundle helpers into a module file for reuse across notebooks (sketched below).
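
Step 5 might look like the following sketch: stub signatures only, with bodies as in the worked examples, placed in the module file suggested under Next steps.

# utils_analysis.py: one home for reusable transforms
import pandas as pd
from typing import List, Optional

def clean_frame(df: pd.DataFrame, *, date_col: Optional[str] = None,
                numeric_cols: Optional[List[str]] = None) -> pd.DataFrame:
    ...  # body as in Example 1

def add_margin(df: pd.DataFrame, *, revenue_col: str, cost_col: str,
               out_col: str = "margin") -> pd.DataFrame:
    ...  # body as in Example 3

# In a notebook:
# from utils_analysis import clean_frame, add_margin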

Exercises

Do these in your local environment; the Quick Test below checks your understanding.

Exercise 1 — Clean and standardize

Write a function load_and_clean that reads CSV text and returns a cleaned DataFrame using the same rules as clean_frame in Example 1.

Data to use
from io import StringIO
import pandas as pd

csv = StringIO("""
date,product,revenue,cost
2025-12-30, A ,100, 60
2025-12-30, A ,100, 60
2025-12-31,B,-5, 3
""")
# Use pd.read_csv(csv)
  • Parse date column.
  • Trim spaces in strings.
  • Coerce revenue and cost to numeric and clip at 0.
  • Drop duplicates and reset index.

Exercise 2 — Reusable aggregation

Write a function sales_summary(df, by, metrics) that wraps groupby, applies metrics, flattens columns, and returns a DataFrame.

Data to use
import pandas as pd
sales = pd.DataFrame({
  "region": ["N", "N", "S", "S"],
  "product": ["A", "A", "A", "B"],
  "revenue": [10, 20, 7, 5],
  "qty": [1, 2, 1, 1],
})
  • Call: sales_summary(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}).
  • Ensure column names are flat, like revenue_sum, revenue_mean, qty_sum.

Self-check checklist

  • Your functions don’t mutate inputs; they return new DataFrames (see the sketch after this checklist).
  • No hard-coded column names where a parameter would be better.
  • Each function has a one-line docstring and type hints.
  • You can chain them with .pipe in a single expression.
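
One quick way to check the first item is to compare the input before and after the call; a minimal sketch using pandas' testing utilities (assert_not_mutated is a hypothetical helper name):

import pandas as pd

def assert_not_mutated(func, df: pd.DataFrame, **kwargs) -> None:
    """Fail loudly if func modifies its input DataFrame."""
    before = df.copy(deep=True)
    func(df, **kwargs)
    pd.testing.assert_frame_equal(df, before)  # raises if df changed

# Example: assert_not_mutated(clean_frame, raw, numeric_cols=["revenue"])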

Common mistakes and how to self-check

  • Hard-coded specifics: Replace literals (e.g., column names) with parameters; set sensible defaults.
  • In-place mutation: Using inplace=True can hide side effects—prefer copying and returning.
  • Mixing I/O and transforms: Keep file reading separate from transformations so you can reuse transforms on in-memory data.
  • Overlong functions: Split into smaller steps; one verb per function is a good rule.
  • Non-vectorized loops: Replace for loops over rows with pandas/NumPy operations for speed and clarity (see the sketch below).
  • Unclear outputs: Ensure your function returns a predictable schema and document it.

Quick self-audit
  • Rename function to a verb (e.g., filter_recent, not recent_filter_stuff).
  • Remove any print/log statements from transformation functions; return data instead.
  • Add a small example in the docstring showing input → output columns.
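
To illustrate the non-vectorized-loops point above, a small before/after sketch (margin_loop and margin are illustrative column names):

import pandas as pd

df = pd.DataFrame({"revenue": [100, 50, 0], "cost": [60, 30, 5]})

# Row-wise loop: slow and verbose (what to avoid)
margins = []
for _, row in df.iterrows():
    margins.append(row["revenue"] - row["cost"])
df["margin_loop"] = margins

# Vectorized equivalent: one expression over whole columns
df["margin"] = df["revenue"] - df["cost"]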

Practical projects

  • Reusable cleaning module: Build clean_frame, standardize_columns, coerce_numeric, and filter_date_range in one Python file. Use across 3 datasets.
  • Feature engineering toolkit: Implement add_ratio, winsorize, zscore, and bucketize with parameters. Demonstrate on a training and a scoring set.
  • Aggregation library: Create summarise_by, top_n_by_group, and pivot_compact. Include docstrings and type hints.

Mini challenge

Build a 4-step pipeline: clean → filter_recent → add_margin → summarise_by. Make each function parameterized and composable with .pipe. Write a one-paragraph docstring for each function with a minimal example.

Hints
  • Start from Example 1 and 3; unify parameter names so composition is straightforward.
  • Keep transformations pure: return a new DataFrame each step.
  • Flatten aggregated columns for predictable downstream use.

Next steps

  • Extract your helpers into a single utils_analysis.py and import it across notebooks.
  • Add lightweight tests for each function (even 2–3 assertions per function helps); see the sketch below.
  • Measure runtime on a sample dataset and replace any slow row-wise operations with vectorized alternatives.
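
For the testing item, even a tiny assertion-style check adds safety; a minimal sketch, assuming the helpers live in utils_analysis.py:

# test_utils_analysis.py
import pandas as pd
from utils_analysis import clean_frame

def test_clean_frame_dedupes_and_clips():
    raw = pd.DataFrame({"revenue": ["10", "10", "-3"]})
    out = clean_frame(raw, numeric_cols=["revenue"])
    assert len(out) == 2                 # duplicate row dropped
    assert (out["revenue"] >= 0).all()   # negatives clipped to 0
    assert len(raw) == 3                 # input left untouched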

Practice Exercises

2 exercises to complete

Instructions

Implement load_and_clean(csv_text: str, date_col: str, numeric_cols: list[str]) -> pd.DataFrame that:

  • Loads CSV text with pd.read_csv(StringIO(csv_text)).
  • Parses date_col to datetime.
  • Trims whitespace for string columns.
  • Coerces numeric_cols to numeric and clips negatives to 0.
  • Drops duplicates and resets index.
Starter data
from io import StringIO
csv_text = """
date,product,revenue,cost
2025-12-30, A ,100, 60
2025-12-30, A ,100, 60
2025-12-31,B,-5, 3
"""
Expected Output
A pandas DataFrame with 2 rows (duplicate removed). 'date' parsed to datetime, 'product' trimmed ('A','B'), 'revenue'=[100,0], 'cost'=[60,3].

Writing Reusable Analysis Code — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

