Who this is for
Beginner to intermediate data analysts who use pandas and want clean, readable ways to create new columns and apply conditional logic without loops.
Prerequisites
- Basic Python syntax
- pandas DataFrame basics (creating DataFrames, selecting columns)
- Comfort with boolean logic (and: &, or: |, not: ~) and parentheses
Learning path
- Review selecting columns and boolean filtering.
- Learn DataFrame.assign to create multiple columns in a clean pipeline.
- Learn where (Series/DataFrame.where) and numpy.where for conditional values.
- Combine assign + where to build safe, readable transformations.
- Practice on real-like datasets (sales, customers, sensors).
Why this matters
As a Data Analyst, you constantly create derived metrics and flags:
- Sales metrics: revenue, discount, net revenue, margin, profit per order.
- Customer insights: segment users by spend, recency, or behavior.
- Data cleaning: replace out-of-range values with NaN or defaults.
Using assign and where lets you do this clearly, safely, and fast—no slow loops, fewer bugs, and code that teammates can understand.
Concept explained simply
Two building blocks:
- assign: Adds new columns (or overwrites) and returns a new DataFrame. You can chain multiple new columns in one readable step.
- where: Keeps values where a condition is True; replaces values where False (with NaN by default or a provided alternative).
numpy.where is a vectorized if-else: np.where(condition, value_if_true, value_if_false).
Mental model
- assign: Think of a conveyor belt. Each assign step adds a labeled box (column) based on what’s already on the belt. With lambda, you reference the current DataFrame state inside the chain.
- where: Think of a stencil over your data. Where the stencil is cut out (True), original values pass through. Where it’s covered (False), values get replaced.
Syntax snapshots
import pandas as pd
import numpy as np
# assign with existing columns
out = df.assign(new_col = df['a'] * df['b'])
# assign with lambda (recommended for chaining)
out = (
df
.assign(revenue=lambda d: d['qty'] * d['price'])
.assign(high_value=lambda d: np.where(d['revenue'] >= 100, True, False))
)
# Series.where: keep values where condition is True, else replace
safe_ratio = (df['num'] / df['den']).where(df['den'] != 0, 0.0)
# DataFrame.where: apply across many columns (same rule per cell)
clean = df.where(df >= 0) # negative values become NaN
Worked examples
Example 1: Sales revenue, discount, and net
import pandas as pd
import numpy as np
sales = pd.DataFrame({
'item': ['A','B','C','D'],
'qty': [3, 10, 0, 7],
'price': [5.0, 8.0, 12.0, 3.0],
'discount_rate': [0.00, 0.10, 0.00, 0.20]
})
result = (
sales
.assign(gross=lambda d: d['qty'] * d['price'])
.assign(discount_value=lambda d: d['gross'] * d['discount_rate'])
.assign(net=lambda d: d['gross'] - d['discount_value'])
)
print(result)
What you should see
item qty price discount_rate gross discount_value net
0 A 3 5.0 0.00 15.0 0.0 15.0
1 B 10 8.0 0.10 80.0 8.0 72.0
2 C 0 12.0 0.00 0.0 0.0 0.0
3 D 7 3.0 0.20 21.0 4.2 16.8
Example 2: Flag high-value orders with np.where
flagged = (
sales
.assign(gross=lambda d: d['qty'] * d['price'])
.assign(high_value=lambda d: np.where(d['gross'] >= 50, 'High', 'Regular'))
)
print(flagged[['item','gross','high_value']])
Expected output
item gross high_value
0 A 15.0 Regular
1 B 80.0 High
2 C 0.0 Regular
3 D 21.0 Regular
Example 3: Safe margin rate with where
profit_df = pd.DataFrame({
'qty': [3, 10, 0, 7],
'price': [5.0, 8.0, 12.0, 3.0],
'profit': [5.0, 12.0, -2.0, 4.0]
})
with_margin = (
profit_df
.assign(revenue=lambda d: d['qty'] * d['price'])
.assign(margin_rate=lambda d: (d['profit'] / d['revenue']).where(d['revenue'] > 0, 0.0))
)
print(with_margin)
Expected output
qty price profit revenue margin_rate
0 3 5.0 5.0 15.0 0.333333
1 10 8.0 12.0 80.0 0.150000
2 0 12.0 -2.0 0.0 0.000000
3 7 3.0 4.0 21.0 0.190476
Exercises you can try now
These mirror the exercises below so you can check your answers. Everyone can attempt them. Progress is saved only for logged-in learners.
- EX1 — Order totals with conditional discount
Build columns: gross, discount_value (apply coupon_pct if coupon is True, else 0), net, and is_big (net >= 100). - EX2 — Clean sensor temps and bucketize
Replace out-of-range temps with NaN; create temp_bucket: Low (<18), OK (18–25), High (>25). - EX3 — Safe margin
Create profit and margin; margin is profit/sales where sales > 0 else NaN.
Self-check checklist
- Used assign with lambda for readable chaining.
- Used where to avoid division by zero or to set defaults.
- Used parentheses around boolean expressions with & and |.
- Outputs match expected columns and shapes.
Common mistakes and how to self-check
- Forgetting parentheses in boolean logic. Use (cond1) & (cond2) not cond1 & cond2 without brackets.
- Mixing where meaning: remember where keeps values when condition is True; replaces when False.
- Not importing numpy for np.where.
- Division by zero: always guard with where before dividing, or compute then where.
- Chained assignment warning: prefer df = df.assign(...) rather than df['x'] ... inside many steps.
- Type surprises: np.where with mixed types can upcast. Be explicit if you need ints/floats.
Practical projects
- Sales analysis mini-report: compute gross, net, discount flags, high-value segments, and safe margins for a month of orders.
- Customer segmentation: assign spend tiers (e.g., Bronze/Silver/Gold) using np.where or np.select, plus recency flags.
- Data quality pass: use where to null-out invalid ranges (negative ages, impossible readings) and produce a clean dataset.
Next steps
- Combine your new columns with groupby to aggregate by segment or date.
- Use vectorized string/date methods to enrich columns (parsing dates, extracting domains).
- Package transformations in functions and apply with pipe for reusable pipelines.
Mini challenge
Create columns for a small product dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'sku': ['A','B','C','D','E'],
'price': [20, 35, 12, 50, 40],
'cost': [12, 20, 14, 45, 38],
'units': [5, 3, 10, 0, 1]
})
- Build revenue, profit, margin_rate (profit/revenue where revenue>0 else 0).
- Classify health: 'Healthy' if margin_rate >= 0.25 and revenue >= 100, 'Low margin' if margin_rate < 0.10 and revenue > 0, else 'OK'.
Show one possible solution
sol = (
df
.assign(revenue=lambda d: d['price'] * d['units'])
.assign(profit=lambda d: (d['price'] - d['cost']) * d['units'])
.assign(margin_rate=lambda d: (d['profit'] / d['revenue']).where(d['revenue'] > 0, 0.0))
.assign(health=lambda d: np.where((d['margin_rate'] >= 0.25) & (d['revenue'] >= 100),
'Healthy',
np.where((d['margin_rate'] < 0.10) & (d['revenue'] > 0),
'Low margin',
'OK')))
)
print(sol)