Why this matters
Pandas, NumPy, and Matplotlib are the core trio for everyday data analysis. As a Data Analyst, you will regularly:
- Do fast numeric transforms (z-scores, percent changes, thresholds) with NumPy on pandas columns.
- Convert between pandas and NumPy when libraries expect arrays.
- Plot clean visuals quickly using pandas with Matplotlib under the hood, and customize with Matplotlib APIs.
Who this is for
- Beginners who know basic pandas and want to compute faster with NumPy.
- Analysts who can make basic plots but want to customize them cleanly.
- Anyone preparing for analyst interviews involving vectorized operations and plotting.
Prerequisites
- Python basics (variables, functions, importing modules).
- Pandas basics (Series, DataFrame, indexing, selecting columns).
- Very light Matplotlib familiarity (axes, labels) is helpful but not required.
Concept explained simply
Pandas stores tabular data and labels it with an index. NumPy powers fast numeric operations. Matplotlib draws the charts. They interoperate like this:
- Use NumPy functions (like np.log, np.where, np.mean) directly on pandas Series/DataFrames. Pandas passes the data to NumPy efficiently.
- Convert to NumPy when you need raw arrays using to_numpy(). Convert back to pandas with pd.Series(...) or pd.DataFrame(...) to regain labels.
- Use df.plot(...) for quick charts. For fine control, get an Axes from Matplotlib and pass it to pandas: df.plot(ax=ax).
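The round trip between pandas and NumPy can be sketched as follows. This is a minimal illustration that assumes a small hypothetical Series named prices, which is not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
prices = pd.Series([10.0, 20.0, 40.0], index=['a', 'b', 'c'], name='price')

# A NumPy ufunc applied to a Series returns a Series with the same index
log_prices = np.log(prices)

# Drop the labels when another library needs a raw array
arr = prices.to_numpy()

# Do array math, then wrap the result to regain the labels
doubled = pd.Series(arr * 2, index=prices.index, name='price_x2')

print(type(log_prices).__name__, list(log_prices.index))
print(type(arr).__name__, arr.shape)
print(doubled)
```

Reusing prices.index when rebuilding the Series is what keeps each value attached to its original label.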
Mental model
- Pandas = labeled containers + convenient API.
- NumPy = speed engine for number crunching.
- Matplotlib = canvas and brushes for drawing.
Move data between them when needed. Keep labels in pandas for alignment and readability; use NumPy for heavy math; draw with Matplotlib using axes.
Quick setup snippet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame
df = pd.DataFrame({
    'day': pd.date_range('2023-01-01', periods=10, freq='D'),
    'sales': [12, 15, 13, 20, 18, 17, 30, 28, 22, 25],
    'visits': [120, 130, 125, 150, 160, 158, 200, 195, 180, 190]
}).set_index('day')
print(df.head())
Worked examples
Example 1 — Vectorized transforms with NumPy
Compute z-scores for the sales column and flag unusually high values.
sales = df['sales']
mu = sales.mean() # pandas Series method
sigma = sales.std(ddof=0) # population std for demo
z = (sales - mu) / sigma # vectorized math works directly
# Use NumPy to label outliers (> 1.0 std)
df['is_high'] = np.where(z > 1.0, 1, 0)
print(z.round(2))
print(df['is_high'].value_counts())
Key point: You didn’t have to convert to NumPy. Pandas interoperates with NumPy ufuncs and operators.
Example 2 — Converting to/from NumPy
When another library needs arrays, convert with to_numpy().
X = df[['sales', 'visits']].to_numpy(dtype=float) # shape (n, 2)
print(X.shape, X.dtype)
# Later, convert results back to pandas while keeping index
y = (X[:, 0] / X[:, 1]) # conversion rate as raw ndarray
s_conversion = pd.Series(y, index=df.index, name='conv_rate')
df = pd.concat([df, s_conversion], axis=1)
print(df.head())
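Tip: Prefer to_numpy() over .values. to_numpy() always returns a NumPy ndarray and lets you set dtype and na_value explicitly, while Series.values can return a pandas extension array for some dtypes.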
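The difference shows up with pandas' nullable dtypes. The sketch below assumes a hypothetical Series s with the Int64 dtype, which the lesson's dataset does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical column using pandas' nullable integer dtype
s = pd.Series([1, 2, None], dtype='Int64')

# .values hands back a pandas extension array here, not an ndarray
vals = s.values

# to_numpy() returns an ndarray; dtype and na_value are stated explicitly
arr = s.to_numpy(dtype=float, na_value=np.nan)

print(type(vals).__name__)
print(type(arr).__name__, arr.dtype)
```

Checking the type and dtype right after conversion, as printed here, catches this kind of surprise before the array reaches another library.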
Example 3 — Plotting with pandas + Matplotlib axes
Create a line chart with two y-axes: pandas handles data; Matplotlib handles layout.
fig, ax1 = plt.subplots(figsize=(7, 4))
ax2 = ax1.twinx() # second y-axis
# Plot on specific axes for full control
_df1 = df[['sales']]
_df2 = df[['visits']]
_df1.plot(ax=ax1, color='tab:blue', marker='o', legend=False)
_df2.plot(ax=ax2, color='tab:orange', linestyle='--', legend=False)
ax1.set_title('Sales and Visits over Time')
ax1.set_xlabel('Day')
ax1.set_ylabel('Sales', color='tab:blue')
ax2.set_ylabel('Visits', color='tab:orange')
ax1.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Pattern: fig, ax = plt.subplots() → pass ax to df.plot() → customize using Matplotlib.
Common interop operations
- Elementwise math: np.log(df['sales']), np.sqrt(df['visits']).
- Conditional choice: np.where(cond, a, b) returns a vector; assign to a new column.
- Row-wise functions: use vectorized NumPy patterns first; avoid apply(..., axis=1) if not necessary.
- Aggregation ignoring NaN: np.nanmean(df['sales'].to_numpy()) or pandas df['sales'].mean().
- Broadcasting: shapes must be compatible, not necessarily identical. A 1D array assigned as a new column needs one element per row.
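The operations listed above can be tried together on a tiny frame. This is a sketch only: df_demo and its values are hypothetical, and the 130-visit cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value
df_demo = pd.DataFrame({
    'sales': [12.0, np.nan, 20.0, 18.0],
    'visits': [120.0, 130.0, 0.0, 160.0],
})

# Elementwise math: the ufunc returns a labeled Series
root_visits = np.sqrt(df_demo['visits'])

# Conditional choice: np.where returns an ndarray with one value per row
df_demo['busy'] = np.where(df_demo['visits'] >= 130, 'yes', 'no')

# Row-wise logic without apply(axis=1): operate on whole columns at once
df_demo['total'] = df_demo['sales'] + df_demo['visits']

# NaN-aware aggregation: both calls skip the missing value
mean_np = np.nanmean(df_demo['sales'].to_numpy())
mean_pd = df_demo['sales'].mean()

# Broadcasting: a (4, 2) array minus a (2,) array of column means
X = df_demo[['sales', 'visits']].to_numpy()
centered = X - np.nanmean(X, axis=0)

print(df_demo)
print(mean_np, mean_pd, centered.shape)
```

The broadcasting line works because the trailing dimension of X (2 columns) matches the length of the means array, even though the two shapes are not identical.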
Learning path
- Warm-up: Run the setup snippet and print shapes/dtypes.
- Vectorize: Replace any loops with vectorized NumPy functions (np.log, np.where, np.clip).
- Convert safely: Practice to_numpy() and reconstruct labeled Series/DataFrames.
- Plot basics: Use df.plot() then move to Matplotlib axes for fine control.
- Polish: Add titles, labels, grids, legends, colors, twin axes.
Common mistakes and self-check
- Using .values instead of to_numpy(): May give unexpected dtypes. Self-check: confirm arr.dtype after conversion.
- Forgetting index alignment: Pandas aligns by index; NumPy ignores labels. Self-check: after converting to NumPy, verify shapes and ordering before assigning back.
- Silent NaN propagation: np.log(0) → -inf, operations with NaN propagate NaN. Self-check: run np.isfinite() on results before plotting.
- Plotting on default axes then customizing another axes: Your settings won’t apply. Self-check: always pass ax=... into df.plot() when customizing.
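Two of these pitfalls, label alignment and non-finite values, can be reproduced in a few lines. The Series s below is hypothetical and chosen only to trigger both cases.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.0, 4.0], index=['a', 'b', 'c'])

# Pandas aligns on labels: a reordered Series still pairs 'a' with 'a'
reordered = s[['c', 'a', 'b']]
by_label = s + reordered

# NumPy pairs by position: the same data now adds mismatched rows
by_position = s.to_numpy() + reordered.to_numpy()

# np.log(0) gives -inf; keep only finite values before plotting
with np.errstate(divide='ignore'):  # silence the divide-by-zero warning
    logged = np.log(s)
clean = logged[np.isfinite(logged)]

print(by_label)
print(by_position)
print(clean)
```

Comparing by_label with by_position makes the ordering check concrete: once labels are gone, row order is the only thing tying values together.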
Self-checklist
- I can compute a new column using only NumPy vectorized calls.
- I can convert between pandas and NumPy without losing row order.
- I can pass an existing Matplotlib Axes to df.plot().
- I can handle NaN/inf before plotting.
Exercises
Do these exercises in your Python environment.
Exercise 1 — Vectorize, convert, and plot
- Create a DataFrame with two numeric columns and a date index (10–20 rows).
- Compute a z-score on one column using NumPy; create a binary flag with np.where.
- Convert the selected numeric columns to a NumPy array with to_numpy(dtype=float).
- Back in pandas, create a ratio column from the NumPy result; plot a line chart and customize with Matplotlib axes.
Expected: z-scores printed, flag counts, and a customized chart rendered.
Practical projects
- Marketing KPI dashboard (static): Build a DataFrame with daily spend, clicks, conversions. Use NumPy for CTR/CVR, clip outliers, and plot multi-axis trends with Matplotlib.
- Quality control thresholds: Given measurements, compute rolling z-scores with NumPy, mark out-of-control points, and highlight them on a plot.
- Anomaly tags: Use np.where and np.select to assign labels (normal, warn, alert) based on multiple conditions; visualize counts per day.
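For the anomaly-tag idea, np.select takes a list of conditions and a matching list of labels. The sketch below uses hypothetical z-scores, and the 1.0 and 2.0 cut-offs are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scores; the cut-offs below are illustrative assumptions
z = pd.Series([0.2, -1.4, 2.5, 1.1, -0.3], name='z')
abs_z = z.abs()

# Conditions are checked in order; the first one that matches wins
conditions = [abs_z >= 2.0, abs_z >= 1.0]
choices = ['alert', 'warn']

# np.select returns an ndarray, so wrap it to keep the original index
tags = pd.Series(
    np.select(conditions, choices, default='normal'),
    index=z.index,
    name='tag',
)

print(tags.value_counts())
```

Listing the strictest condition first matters, because a value of 2.5 satisfies both conditions and should be tagged only once.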
Mini challenge
Given columns revenue and visits, create:
- rev_per_visit = revenue / visits (handle division by zero via np.where or np.divide with where).
- tag = 'high', 'med', 'low' using np.select based on quantiles.
- A dual-axis plot: rev_per_visit on left, visits on right, styled distinctly.
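As a hint for the division step only, here is one way np.divide with where can be used. The arrays are hypothetical, and filling the skipped slots with NaN is an assumption; another fill value may suit your data better.

```python
import numpy as np

# Hypothetical values; the second day has zero visits
revenue = np.array([100.0, 50.0, 80.0])
visits = np.array([20.0, 0.0, 40.0])

# Divide only where visits > 0; skipped slots keep the value from `out`
rev_per_visit = np.divide(
    revenue,
    visits,
    out=np.full_like(revenue, np.nan),
    where=visits > 0,
)

print(rev_per_visit)
```

Passing out matters here: without it, the positions excluded by where are left uninitialized rather than set to a known value.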
Next steps
- Practice replacing loops with NumPy ufuncs and np.where/np.select.
- Adopt a plotting pattern: fig, ax = plt.subplots() → df.plot(ax=ax) → customize.
- Try a small project above and keep code snippets for reuse.