Why this matters
As a Data Visualization Engineer, your charts are only as good as the data behind them. Real-world data arrives messy: wrong types, totals mixed in with detail rows, missing values, and wide tables that don’t fit your viz. Wrangling and transformation let you clean, reshape, and compute the exact metrics your visuals rely on (e.g., stacked shares, rolling averages, YoY deltas).
- Prepare “long” (tidy) data for libraries like D3, Plotly, or BI tools.
- Compute chart-ready fields: percentages, bins, ranks, moving averages.
- Join dimensions (e.g., product → category) to enrich context.
Who this is for
Anyone building charts or dashboards in Python or JavaScript who needs consistent, clean, and well-shaped data feeding visual components.
Prerequisites
- Basic Python (lists, dicts) or JS (arrays, objects).
- Comfort with reading arrays/tables and simple arithmetic.
- Optional: pandas basics (Python) or array methods (JS: map/filter/reduce).
Concept explained simply
Data wrangling is turning whatever data you receive into the exact shape your visualization needs. You remove junk, convert types, reshape columns/rows, combine tables, and compute derived values.
Mental model: a tidy data pipeline
- Input: raw data (wide, messy, mixed types)
- Clean: remove totals, fix types, handle missing values
- Reshape: wide ↔ long, pivot, unpivot
- Enrich: joins (lookups), derived columns
- Aggregate: group and summarize
- Output: chart-ready dataset
Tip: Long vs wide
Long (tidy) format is best for most charts: each row is one observation with variables like category, date, and value. Wide tables (months as columns) are great for humans, not for code-driven visuals.
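The difference is easiest to see side by side; a minimal pandas sketch that unpivots one wide row into tidy long rows:

```python
import pandas as pd

# One wide row: months as columns (human-friendly, awkward for charting code).
wide = pd.DataFrame([{"product": "A", "Jan": 10, "Feb": 12}])

# melt unpivots to long: one row per product-month-value observation.
long = wide.melt(id_vars=["product"], var_name="month", value_name="value")
print(long.to_dict(orient="records"))
```

Each long row now carries exactly one observation, which is the shape most charting libraries iterate over.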
Core operations you will use
- Select/rename columns to a consistent schema
- Filter rows (e.g., remove subtotals, nulls, outliers)
- Type conversion: strings → numbers/dates/booleans
- Reshape: pivot/unpivot (wide ↔ long)
- Group and aggregate: sum, mean, count, rolling windows
- Joins: left, inner to add categories/metadata
- Derived columns: percentages, bins, ranks, deltas
- Sort and limit for top-N charts
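Several of these operations often chain together in one short pipeline. A sketch with hypothetical region data, assuming pandas:

```python
import pandas as pd

# Hypothetical raw rows, including a subtotal row and string-typed numbers.
raw = [
    {"region": "Total", "sales": "100"},  # aggregate row to drop
    {"region": "North", "sales": "60"},
    {"region": "South", "sales": "25"},
    {"region": "West", "sales": "15"},
]
df = pd.DataFrame(raw)
top2 = (
    df[df["region"] != "Total"]                           # filter out aggregate rows
      .assign(sales=lambda d: pd.to_numeric(d["sales"]))  # type conversion
      .sort_values("sales", ascending=False)              # sort...
      .head(2)                                            # ...and limit for a top-N chart
)
print(top2.to_dict(orient="records"))
```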
Worked examples
Example 1: Clean and unpivot monthly sales
Goal: remove a total row, convert text numbers, and unpivot months so each row is product-month-value.
Raw (wide)
[
  {"product":"A","Jan":"10","Feb":"12","Mar":null},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]
Python (pandas)
import pandas as pd
data = [
    {"product":"A","Jan":"10","Feb":"12","Mar":None},
    {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
    {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]
df = pd.DataFrame(data)
df = df[df["product"] != "Total"].replace("NA", pd.NA)
for m in ["Jan","Feb","Mar"]:
    df[m] = pd.to_numeric(df[m], errors="coerce")
long_df = df.melt(id_vars=["product"], var_name="month", value_name="value").dropna(subset=["value"])
print(long_df.to_dict(orient="records"))
JavaScript (arrays)
const data = [
  {product:"A",Jan:"10",Feb:"12",Mar:null},
  {product:"B",Jan:"NA",Feb:"7",Mar:"9"},
  {product:"Total",Jan:"10",Feb:"19",Mar:"9"}
];
const months = ["Jan","Feb","Mar"];
const cleaned = data.filter(r => r.product !== "Total").map(r => {
  const o = { product: r.product };
  months.forEach(m => {
    const v = r[m];
    o[m] = (v === null || v === "NA" || v === "") ? null : Number(v);
  });
  return o;
});
const long = [];
cleaned.forEach(r => months.forEach(m => {
  const v = r[m];
  if (v != null && !Number.isNaN(v)) long.push({product:r.product, month:m, value:v});
}));
console.log(long);
Example 2: Join categories and compute stacked shares
Goal: join sales to product categories, then compute per-month category share for a stacked 100% bar chart.
sales_long = [
  {"product":"A","month":"Jan","value":10},
  {"product":"A","month":"Feb","value":12},
  {"product":"B","month":"Feb","value":7},
  {"product":"B","month":"Mar","value":9},
  {"product":"C","month":"Feb","value":5}
]
products = [
  {"product":"A","category":"Gadgets"},
  {"product":"B","category":"Gadgets"},
  {"product":"C","category":"Accessories"}
]
Python (pandas)
import pandas as pd
sales = [
    {"product":"A","month":"Jan","value":10},
    {"product":"A","month":"Feb","value":12},
    {"product":"B","month":"Feb","value":7},
    {"product":"B","month":"Mar","value":9},
    {"product":"C","month":"Feb","value":5}
]
products = [
    {"product":"A","category":"Gadgets"},
    {"product":"B","category":"Gadgets"},
    {"product":"C","category":"Accessories"}
]
df = pd.DataFrame(sales).merge(pd.DataFrame(products), on="product", how="left")
cat_month = df.groupby(["month","category"], as_index=False)["value"].sum().rename(columns={"value":"category_total"})
month_total = df.groupby("month", as_index=False)["value"].sum().rename(columns={"value":"month_total"})
res = cat_month.merge(month_total, on="month")
res["share"] = (res["category_total"] / res["month_total"]).round(2)
print(res.sort_values(["month","category"]).to_dict(orient="records"))
JavaScript (arrays)
const sales = [
  {product:"A",month:"Jan",value:10},
  {product:"A",month:"Feb",value:12},
  {product:"B",month:"Feb",value:7},
  {product:"B",month:"Mar",value:9},
  {product:"C",month:"Feb",value:5}
];
const products = [
  {product:"A",category:"Gadgets"},
  {product:"B",category:"Gadgets"},
  {product:"C",category:"Accessories"}
];
const prodMap = Object.fromEntries(products.map(p => [p.product, p.category]));
const withCat = sales.map(s => ({...s, category: prodMap[s.product] || null}));
const monthTotals = withCat.reduce((acc, r) => {
  acc[r.month] = (acc[r.month] || 0) + r.value; return acc;
}, {});
const catMonthTotals = {};
withCat.forEach(r => {
  const key = r.month+"|"+r.category;
  catMonthTotals[key] = (catMonthTotals[key] || 0) + r.value;
});
const res = Object.entries(catMonthTotals).map(([k, ct]) => {
  const [month, category] = k.split("|");
  const mt = monthTotals[month];
  return {month, category, category_total: ct, month_total: mt, share: Math.round((ct/mt)*100)/100};
});
console.log(res.sort((a,b)=> a.month.localeCompare(b.month) || a.category.localeCompare(b.category)));
Example 3: Prepare rolling average for a smooth line chart
Goal: compute a 3-period rolling mean to reduce noise in a time series.
ts = [
  {"date":"2024-01-01","value":10},
  {"date":"2024-01-02","value":16},
  {"date":"2024-01-03","value":9},
  {"date":"2024-01-04","value":13},
  {"date":"2024-01-05","value":12}
]
Python (pandas)
import pandas as pd
s = pd.DataFrame(ts)
s["date"] = pd.to_datetime(s["date"])
s = s.sort_values("date")
s["roll3"] = s["value"].rolling(window=3, min_periods=1).mean().round(2)
print(s.to_dict(orient="records"))
JavaScript (arrays)
const s = ts.slice().sort((a,b)=> new Date(a.date)-new Date(b.date));
const out = s.map((row, i, arr) => {
  const start = Math.max(0, i-2);
  const win = arr.slice(start, i+1);
  const avg = Math.round((win.reduce((sum,r)=>sum+r.value,0)/win.length)*100)/100;
  return {...row, roll3: avg};
});
console.log(out);
Exercises
Try these in Python or JavaScript, then check your work against the expected output.
Exercise 1 — Clean, reshape, and summarize
Input:
[
  {"product":"A","Jan":"10","Feb":"12","Mar":null},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]
Tasks:
- Remove the “Total” row.
- Convert month columns to numbers; treat "NA"/null as missing.
- Unpivot to long (product, month, value), drop missing.
- Compute total per product and sort descending.
Hints
- Filter rows where product != "Total".
- Pandas: melt; JS: build long array with nested loops.
- Replace "NA" with null/NaN then to_numeric/Number.
Expected output:
[
  {"product":"A","total":22},
  {"product":"B","total":16}
]
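If you want to check your approach afterward, here is one possible pandas solution (not the only way to do it):

```python
import pandas as pd

rows = [
    {"product": "A", "Jan": "10", "Feb": "12", "Mar": None},
    {"product": "B", "Jan": "NA", "Feb": "7", "Mar": "9"},
    {"product": "Total", "Jan": "10", "Feb": "19", "Mar": "9"},
]
df = pd.DataFrame(rows)
df = df[df["product"] != "Total"]  # drop the aggregate row
long_df = df.melt(id_vars=["product"], var_name="month", value_name="value")
long_df["value"] = pd.to_numeric(long_df["value"], errors="coerce")  # "NA"/null -> NaN
long_df = long_df.dropna(subset=["value"])
totals = (long_df.groupby("product", as_index=False)["value"].sum()
                 .rename(columns={"value": "total"})
                 .sort_values("total", ascending=False))
totals["total"] = totals["total"].astype(int)  # clean ints for output
print(totals.to_dict(orient="records"))
```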
Exercise 2 — Join and compute category share
Inputs:
sales_long = [
  {"product":"A","month":"Jan","value":10},
  {"product":"A","month":"Feb","value":12},
  {"product":"B","month":"Feb","value":7},
  {"product":"B","month":"Mar","value":9},
  {"product":"C","month":"Feb","value":5}
]
products = [
  {"product":"A","category":"Gadgets"},
  {"product":"B","category":"Gadgets"},
  {"product":"C","category":"Accessories"}
]
Tasks:
- Left join sales to products on product.
- Compute category_total per (month, category).
- Compute month_total per month and category share = category_total / month_total (round to 2 decimals).
Hints
- Compute month totals once; reuse for share.
- Watch division by zero; guard if a month has no rows.
Expected output:
[
  {"month":"Jan","category":"Gadgets","category_total":10,"month_total":10,"share":1},
  {"month":"Feb","category":"Accessories","category_total":5,"month_total":24,"share":0.21},
  {"month":"Feb","category":"Gadgets","category_total":19,"month_total":24,"share":0.79},
  {"month":"Mar","category":"Gadgets","category_total":9,"month_total":9,"share":1}
]
Self-check checklist
- No aggregate rows (like “Total”) remain in detail-level data.
- Numeric columns are actual numbers, not strings.
- Long format used where charts expect series/category/value rows.
- Aggregations and shares recompute correctly if inputs change.
Common mistakes and how to self-check
- Mixing totals with detail rows: filter them out before aggregation.
- Forgetting type conversion: strings won’t sum or sort numerically.
- Silent missing values: standardize "NA", "", null to actual missing and decide drop/impute.
- Wrong denominator for percentages: ensure you divide by the correct group total (e.g., per month for stacked bars).
- Incorrect join: use left join when you must keep all fact rows; verify key uniqueness.
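One way to catch a bad join early, assuming pandas: merge's validate parameter raises a MergeError if the lookup key is not unique, instead of silently duplicating fact rows.

```python
import pandas as pd

sales = pd.DataFrame([{"product": "A", "value": 10}])
products = pd.DataFrame([{"product": "A", "category": "Gadgets"}])

# validate="m:1" asserts many sales rows map to at most one lookup row;
# a duplicated key on the products side would raise pd.errors.MergeError.
joined = sales.merge(products, on="product", how="left", validate="m:1")
print(joined.to_dict(orient="records"))
```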
Quick self-test
Pick one chart you made recently. Identify its required grain (e.g., month x category). Can you produce exactly one row per grain with the values the chart uses? If not, adjust your pipeline.
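A quick programmatic version of that check, sketched with pandas and a month x category grain (substitute your own key columns):

```python
import pandas as pd

# Chart-ready rows; the chart's grain is one row per (month, category).
df = pd.DataFrame([
    {"month": "Jan", "category": "Gadgets", "value": 10},
    {"month": "Feb", "category": "Gadgets", "value": 19},
    {"month": "Feb", "category": "Accessories", "value": 5},
])

# Exactly one row per grain: no (month, category) pair may repeat.
dupes = df.duplicated(subset=["month", "category"]).sum()
assert dupes == 0, f"{dupes} rows violate the month x category grain"
print("grain check passed")
```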
Practical projects
- Sales dashboard: ingest CSVs, clean, compute stacked category shares, top-5 products, rolling trend.
- Web analytics: session data → daily active users, bounce rate, 7-day rolling averages.
- Survey results: unpivot multiple-choice columns, compute response distributions and NPS by segment.
Learning path
- Load data safely: validate columns and types.
- Clean: remove totals, standardize missing, convert types.
- Reshape: wide ↔ long as charts require.
- Enrich: joins for categories, geos, segments.
- Aggregate: groupby and window functions for metrics.
- Validate: spot-check sums, counts, and sample rows.
- Export: supply consistent schema to visualization layer.
Next steps
- Finish the exercises above.
- Then take the Quick Test at the end of this page to check understanding.
- Apply these steps to your next visualization—save a reusable transformation script.
Mini challenge
Given daily signups by source in wide format (columns Mon–Sun), remove totals, unpivot to long, join a lookup table mapping sources to channels, and compute channel share per day for a stacked percent bar. Bonus: add a 3-day rolling mean.
Starter data
signups_wide = [
  {"source":"Email","Mon":"12","Tue":"10","Wed":"8"},
  {"source":"Ads","Mon":"20","Tue":"18","Wed":"22"},
  {"source":"Total","Mon":"32","Tue":"28","Wed":"30"}
]
lookup = [
  {"source":"Email","channel":"Owned"},
  {"source":"Ads","channel":"Paid"}
]