Why this matters
As a Data Scientist, you often repeat similar steps: loading, cleaning, enriching, aggregating, and visualizing. Writing these as small, parameterized functions makes your analyses faster to build, easier to debug, and safer to reuse across projects.
- Real task: Standardize data cleaning for weekly sales reports without copying code blocks.
- Real task: Reuse the same feature engineering steps for both training and inference.
- Real task: Share a robust aggregation routine with teammates that works on different datasets.
Concept explained simply
Reusable analysis code is a set of small, single-purpose functions that accept data and parameters, return new data without hidden side effects, and can be composed into pipelines. In pandas/NumPy, this typically means functions that take a DataFrame/ndarray and return a transformed one.
Mental model
Think LEGO bricks: each function is a brick with a clear shape (inputs/outputs). You can snap bricks together (compose) to build larger workflows. If a brick breaks, you replace it without rebuilding the whole set.
Principles to keep in mind
- DRY: Don’t Repeat Yourself—extract repeated steps into functions.
- Small, focused functions: one verb per function (clean, filter, enrich, aggregate).
- Pure-ish functions: avoid mutating inputs; return a new object instead.
- Parameterize: no hard-coded column names or thresholds (both illustrated in the sketch after this list).
- Compose: use DataFrame.pipe to chain transformations.
- Document: short docstrings with parameters, returns, and examples.
- Type hints: help readers and tools catch errors early.
- Vectorize: prefer pandas/NumPy operations over Python loops.
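To make the pure-ish and parameterize principles concrete, here is a minimal before/after sketch; the column name "price" and the clip threshold are illustrative placeholders, not taken from the worked examples below.

import pandas as pd

# Anti-pattern: mutates the caller's DataFrame and hard-codes the column name.
def clip_prices_inplace(df):
    df["price"] = df["price"].clip(lower=0)

# Pure-ish and parameterized: returns a new object; the caller's data is untouched.
def clip_column(df: pd.DataFrame, *, col: str, lower: float = 0) -> pd.DataFrame:
    """Return a copy of df with col clipped at lower."""
    out = df.copy()
    out[col] = out[col].clip(lower=lower)
    return out

prices = pd.DataFrame({"price": [10, -3, 5]})
clipped = clip_column(prices, col="price")  # prices itself is unchanged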
Worked examples
Example 1: Refactor repeated cleaning into a reusable function
Goal: parse dates, trim strings, coerce numeric columns, clip negatives, drop duplicates.
import pandas as pd
from typing import List, Optional

def clean_frame(df: pd.DataFrame, *, date_col: Optional[str] = None, numeric_cols: Optional[List[str]] = None) -> pd.DataFrame:
    """
    Return a cleaned copy of df.
    - Optionally parse date_col.
    - Trim whitespace in string columns.
    - Coerce numeric_cols to numeric and clip at 0.
    - Drop duplicates and reset index.
    """
    numeric_cols = numeric_cols or []
    out = df.copy()
    # Parse date if present
    if date_col and date_col in out.columns:
        out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    # Trim strings
    for c in out.select_dtypes(include="object").columns:
        out[c] = out[c].str.strip()
    # Coerce and clip numerics
    for c in numeric_cols:
        if c in out.columns:
            out[c] = pd.to_numeric(out[c], errors="coerce").clip(lower=0)
    out = out.drop_duplicates().reset_index(drop=True)
    return out

# Usage
raw = pd.DataFrame({
    "date": ["2025-12-01", "2025-12-01", "2025-12-02"],
    "product": [" A ", " A ", "B"],
    "revenue": ["10", "10", "-3"],
})
clean = clean_frame(raw, date_col="date", numeric_cols=["revenue"])
print(clean)
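Running this prints two rows: the duplicate 2025-12-01 record is dropped, the product strings are trimmed to "A" and "B", and the negative revenue is clipped to 0. Note the bare * in the signature: it makes date_col and numeric_cols keyword-only, so call sites stay self-documenting.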
Example 2: Parameterized aggregation with flat columns
Goal: reusable groupby that accepts grouping columns and metric mapping.
import pandas as pd
from typing import Callable, Dict, List, Union

AggSpec = Dict[str, Union[str, List[str], Callable]]

def summarise_by(df: pd.DataFrame, *, by: List[str], metrics: AggSpec) -> pd.DataFrame:
    """
    Group by 'by' and apply 'metrics'. Returns a flat-column DataFrame.
    Example metrics: {"revenue": ["sum", "mean"], "qty": "sum"}
    """
    grouped = df.groupby(by, dropna=False).agg(metrics)
    # Flatten MultiIndex columns such as ("revenue", "sum") into "revenue_sum"
    grouped.columns = [
        "_".join(t) if isinstance(t, tuple) else str(t)
        for t in grouped.columns.to_list()
    ]
    return grouped.reset_index()

# Usage
sales = pd.DataFrame({
    "region": ["E", "E", "W"],
    "product": ["A", "A", "B"],
    "revenue": [10, 15, 7],
    "qty": [1, 2, 1],
})
print(summarise_by(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}))
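If you would rather name the output columns up front than flatten them afterwards, pandas' named aggregation is an equivalent alternative (same data as above):

import pandas as pd

sales = pd.DataFrame({
    "region": ["E", "E", "W"],
    "product": ["A", "A", "B"],
    "revenue": [10, 15, 7],
    "qty": [1, 2, 1],
})

# Named aggregation: each keyword becomes a flat output column.
flat = (
    sales.groupby(["region", "product"], dropna=False)
    .agg(
        revenue_sum=("revenue", "sum"),
        revenue_mean=("revenue", "mean"),
        qty_sum=("qty", "sum"),
    )
    .reset_index()
)
print(flat)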
Example 3: Compose a pipeline with DataFrame.pipe
Goal: chain transformations without creating many temp variables.
import pandas as pd
import numpy as np
from typing import Optional

# Reuse clean_frame from Example 1

def add_margin(df: pd.DataFrame, *, revenue_col: str, cost_col: str, out_col: str = "margin") -> pd.DataFrame:
    """Return a copy of df with out_col = revenue_col - cost_col."""
    out = df.copy()
    out[out_col] = out[revenue_col] - out[cost_col]
    return out

def add_margin_pct(df: pd.DataFrame, *, margin_col: str = "margin", revenue_col: str = "revenue", out_col: str = "margin_pct") -> pd.DataFrame:
    """Return a copy of df with out_col = margin_col / revenue_col, NaN where revenue is 0."""
    out = df.copy()
    denom = out[revenue_col]
    out[out_col] = np.where(denom.eq(0), np.nan, out[margin_col] / denom)
    return out

def filter_recent(df: pd.DataFrame, *, date_col: str, days: int = 90, now: Optional[str] = None) -> pd.DataFrame:
    """Return the rows of df whose date_col falls within the last 'days' days of 'now' (default: today)."""
    now_ts = pd.Timestamp(now) if now else pd.Timestamp.today()
    cutoff = now_ts - pd.Timedelta(days=days)
    mask = pd.to_datetime(df[date_col]) >= cutoff
    return df.loc[mask].copy()

# Example usage
raw = pd.DataFrame({
    "date": ["2025-12-20", "2025-11-01", "2025-12-28"],
    "revenue": [100, 50, 0],
    "cost": [60, 30, 5],
})
result = (
    raw
    .pipe(clean_frame, date_col="date", numeric_cols=["revenue", "cost"])  # from Example 1
    .pipe(add_margin, revenue_col="revenue", cost_col="cost")
    .pipe(add_margin_pct, margin_col="margin", revenue_col="revenue")
    .pipe(filter_recent, date_col="date", days=40, now="2026-01-01")
)
print(result)
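df.pipe(f, **kwargs) is shorthand for f(df, **kwargs), so each step receives the previous step's output and the chain reads top to bottom. With the inputs above, the 2025-11-01 row falls outside the 40-day window ending 2026-01-01, leaving two rows; margin_pct is NaN on the zero-revenue row rather than a division error.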
Who this is for
- Data Scientists and Analysts who repeat similar pandas workflows.
- Anyone preparing notebooks for handoff to teammates or production.
Prerequisites
- Basic Python syntax, functions, and modules.
- Comfort with pandas DataFrame operations and NumPy arrays.
Learning path
- Write one small function for a repeated step (e.g., trimming strings).
- Parameterize it (columns, thresholds) and add a docstring.
- Compose two functions with DataFrame.pipe.
- Generalize to a mini-pipeline (clean → enrich → aggregate).
- Bundle helpers into a module file for reuse across notebooks (a minimal sketch follows this list).
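A minimal sketch of that last bundling step, assuming you save the helpers from the worked examples in a file named utils_analysis.py next to your notebooks:

import pandas as pd
from utils_analysis import clean_frame, summarise_by  # assumed module with the helpers above

raw = pd.DataFrame({"date": ["2025-12-01"], "product": [" A "], "revenue": ["10"]})
summary = (
    raw.pipe(clean_frame, date_col="date", numeric_cols=["revenue"])
       .pipe(summarise_by, by=["product"], metrics={"revenue": "sum"})
)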
Exercises
Do these in your local environment, then check your work against the self-check checklist below.
Exercise 1 — Clean and standardize
Write a function load_and_clean that reads CSV text and returns a cleaned DataFrame using the same rules as clean_frame in Example 1.
Data to use
from io import StringIO
import pandas as pd
csv = StringIO("""
date,product,revenue,cost
2025-12-30, A ,100, 60
2025-12-30, A ,100, 60
2025-12-31,B,-5, 3
""")
# Use pd.read_csv(csv)
- Parse date column.
- Trim spaces in strings.
- Coerce revenue and cost to numeric and clip at 0.
- Drop duplicates and reset index.
Exercise 2 — Reusable aggregation
Write a function sales_summary(df, by, metrics) that wraps groupby, applies metrics, flattens columns, and returns a DataFrame.
Data to use
import pandas as pd
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["A", "A", "A", "B"],
    "revenue": [10, 20, 7, 5],
    "qty": [1, 2, 1, 1],
})
- Call: sales_summary(sales, by=["region", "product"], metrics={"revenue": ["sum", "mean"], "qty": "sum"}).
- Ensure column names are flat, like revenue_sum, revenue_mean, qty_sum.
Self-check checklist
- Your functions don’t mutate inputs; they return new DataFrames.
- No hard-coded column names where a parameter would be better.
- Each function has a one-line docstring and type hints.
- You can chain them with .pipe in a single expression.
Common mistakes and how to self-check
- Hard-coded specifics: Replace literals (e.g., column names) with parameters; set sensible defaults.
- In-place mutation: Using inplace=True can hide side effects; prefer copying and returning.
- Mixing I/O and transforms: Keep file reading separate from transformations so you can reuse transforms on in-memory data.
- Overlong functions: Split into smaller steps; one verb per function is a good rule.
- Non-vectorized loops: Replace for loops over rows with pandas/NumPy operations for speed and clarity (see the sketch after this list).
- Unclear outputs: Ensure your function returns a predictable schema and document it.
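To illustrate the vectorization point, a small before/after sketch with placeholder column names:

import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 50.0, 0.0], "cost": [60.0, 30.0, 5.0]})

# Slow and verbose: a Python loop over rows.
margins = []
for _, row in df.iterrows():
    margins.append(row["revenue"] - row["cost"])
df["margin_loop"] = margins

# Vectorized: one expression over whole columns; faster and clearer.
df["margin"] = df["revenue"] - df["cost"]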
Quick self-audit
- Rename functions to verbs (e.g., filter_recent, not recent_filter_stuff).
- Remove any print/log statements from transformation functions; return data instead.
- Add a small example in the docstring showing input → output columns.
Practical projects
- Reusable cleaning module: Build clean_frame, standardize_columns, coerce_numeric, and filter_date_range in one Python file. Use them across three datasets.
- Feature engineering toolkit: Implement add_ratio, winsorize, zscore, and bucketize with parameters. Demonstrate on a training set and a scoring set.
- Aggregation library: Create summarise_by, top_n_by_group, and pivot_compact. Include docstrings and type hints.
Mini challenge
Build a 4-step pipeline: clean → filter_recent → add_margin → summarise_by. Make each function parameterized and composable with .pipe. Write a one-paragraph docstring for each function with a minimal example.
Hints
- Start from Example 1 and 3; unify parameter names so composition is straightforward.
- Keep transformations pure: return a new DataFrame each step.
- Flatten aggregated columns for predictable downstream use.
Next steps
- Extract your helpers into a single utils_analysis.py and import it across notebooks.
- Add lightweight tests for each function (even 2–3 assertions per function helps); see the sketch below.
- Measure runtime on a sample dataset and replace any slow row-wise operations with vectorized alternatives.
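A lightweight test sketch for clean_frame, assuming the helpers live in utils_analysis.py; plain asserts are enough to start, and a runner such as pytest will pick up the test_ function unchanged:

import pandas as pd
from utils_analysis import clean_frame  # assumed location of the Example 1 helper

def test_clean_frame_is_pure_and_clips():
    raw = pd.DataFrame({"product": [" A ", " A "], "revenue": ["10", "-3"]})
    before = raw.copy()
    out = clean_frame(raw, numeric_cols=["revenue"])
    assert raw.equals(before)           # input was not mutated
    assert (out["revenue"] >= 0).all()  # negatives clipped to 0
    assert len(out) == 2                # no accidental row loss

test_clean_frame_is_pure_and_clips()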