luvv to helpDiscover the Best Free Online Tools
Topic 4 of 8

Data Wrangling And Transformation

Learn Data Wrangling And Transformation for free with explanations, exercises, and a quick test (for Data Visualization Engineer).

Published: December 28, 2025 | Updated: December 28, 2025

Why this matters

As a Data Visualization Engineer, your charts are only as good as the data behind them. Real-world data comes messy: wrong types, totals mixed with detail rows, missing values, and wide tables that don’t fit your viz. Wrangling and transformation let you clean, reshape, and compute the exact metrics your visuals rely on (e.g., stacked shares, rolling averages, YoY deltas).

  • Prepare “long” (tidy) data for libraries like D3, Plotly, or BI tools.
  • Compute chart-ready fields: percentages, bins, ranks, moving averages.
  • Join dimensions (e.g., product → category) to enrich context.

Who this is for

Anyone building charts or dashboards in Python or JavaScript who needs consistent, clean, and well-shaped data feeding visual components.

Prerequisites

  • Basic Python (lists, dicts) or JS (arrays, objects).
  • Comfort with reading arrays/tables and simple arithmetic.
  • Optional: pandas basics (Python) or array methods (JS: map/filter/reduce).

Concept explained simply

Data wrangling is turning whatever data you receive into the exact shape your visualization needs. You remove junk, convert types, reshape columns/rows, combine tables, and compute derived values.

Mental model: a tidy data pipeline

  1. Input: raw data (wide, messy, mixed types)
  2. Clean: remove totals, fix types, handle missing values
  3. Reshape: wide ↔ long, pivot, unpivot
  4. Enrich: joins (lookups), derived columns
  5. Aggregate: group and summarize
  6. Output: chart-ready dataset
Tip: Long vs wide

Long (tidy) format is best for most charts: each row is one observation with variables like category, date, and value. Wide tables (months as columns) are great for humans, not for code-driven visuals.

Core operations you will use

  • Select/rename columns to a consistent schema
  • Filter rows (e.g., remove subtotals, nulls, outliers)
  • Type conversion: strings → numbers/dates/booleans
  • Reshape: pivot/unpivot (wide ↔ long)
  • Group and aggregate: sum, mean, count, rolling windows
  • Joins: left, inner to add categories/metadata
  • Derived columns: percentages, bins, ranks, deltas
  • Sort and limit for top-N charts

Worked examples

Example 1: Clean and unpivot monthly sales

Goal: remove a total row, convert text numbers, and unpivot months so each row is product-month-value.

Raw (wide)
[
  {"product":"A","Jan":"10","Feb":"12","Mar":null},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]
Python (pandas)
import pandas as pd

data = [
  {"product":"A","Jan":"10","Feb":"12","Mar":None},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]

df = pd.DataFrame(data)
df = df[df["product"] != "Total"].replace("NA", pd.NA)
for m in ["Jan","Feb","Mar"]:
    df[m] = pd.to_numeric(df[m], errors="coerce")
long_df = df.melt(id_vars=["product"], var_name="month", value_name="value").dropna(subset=["value"])
print(long_df.to_dict(orient="records"))
JavaScript (arrays)
const data = [
  {product:"A",Jan:"10",Feb:"12",Mar:null},
  {product:"B",Jan:"NA",Feb:"7",Mar:"9"},
  {product:"Total",Jan:"10",Feb:"19",Mar:"9"}
];
const months = ["Jan","Feb","Mar"];
const cleaned = data.filter(r => r.product !== "Total").map(r => {
  const o = { product: r.product };
  months.forEach(m => {
    const v = r[m];
    o[m] = (v === null || v === "NA" || v === "") ? null : Number(v);
  });
  return o;
});
const long = [];
cleaned.forEach(r => months.forEach(m => {
  const v = r[m];
  if (v != null && !Number.isNaN(v)) long.push({product:r.product, month:m, value:v});
}));
console.log(long);

Example 2: Join categories and compute stacked shares

Goal: join sales to product categories, then compute per-month category share for a stacked 100% bar chart.

sales_long = [
  {"product":"A","month":"Jan","value":10},
  {"product":"A","month":"Feb","value":12},
  {"product":"B","month":"Feb","value":7},
  {"product":"B","month":"Mar","value":9},
  {"product":"C","month":"Feb","value":5}
]
products = [
  {"product":"A","category":"Gadgets"},
  {"product":"B","category":"Gadgets"},
  {"product":"C","category":"Accessories"}
]
Python (pandas)
import pandas as pd
sales = [
  {"product":"A","month":"Jan","value":10},
  {"product":"A","month":"Feb","value":12},
  {"product":"B","month":"Feb","value":7},
  {"product":"B","month":"Mar","value":9},
  {"product":"C","month":"Feb","value":5}
]
products = [
  {"product":"A","category":"Gadgets"},
  {"product":"B","category":"Gadgets"},
  {"product":"C","category":"Accessories"}
]

df = pd.DataFrame(sales).merge(pd.DataFrame(products), on="product", how="left")
cat_month = df.groupby(["month","category"], as_index=False)["value"].sum().rename(columns={"value":"category_total"})
month_total = df.groupby("month", as_index=False)["value"].sum().rename(columns={"value":"month_total"})
res = cat_month.merge(month_total, on="month")
res["share"] = (res["category_total"] / res["month_total"]).round(2)
print(res.sort_values(["month","category"]).to_dict(orient="records"))
JavaScript (arrays)
const sales = [
  {product:"A",month:"Jan",value:10},
  {product:"A",month:"Feb",value:12},
  {product:"B",month:"Feb",value:7},
  {product:"B",month:"Mar",value:9},
  {product:"C",month:"Feb",value:5}
];
const products = [
  {product:"A",category:"Gadgets"},
  {product:"B",category:"Gadgets"},
  {product:"C",category:"Accessories"}
];
const prodMap = Object.fromEntries(products.map(p => [p.product, p.category]));
const withCat = sales.map(s => ({...s, category: prodMap[s.product] || null}));
const monthTotals = withCat.reduce((acc, r) => {
  acc[r.month] = (acc[r.month] || 0) + r.value; return acc;
}, {});
const catMonthTotals = {};
withCat.forEach(r => {
  const key = r.month+"|"+r.category;
  catMonthTotals[key] = (catMonthTotals[key] || 0) + r.value;
});
const res = Object.entries(catMonthTotals).map(([k, ct]) => {
  const [month, category] = k.split("|");
  const mt = monthTotals[month];
  return {month, category, category_total: ct, month_total: mt, share: Math.round((ct/mt)*100)/100};
});
console.log(res.sort((a,b)=> a.month.localeCompare(b.month) || a.category.localeCompare(b.category)));

Example 3: Prepare rolling average for a smooth line chart

Goal: compute a 3-period rolling mean to reduce noise in a time series.

ts = [
  {"date":"2024-01-01","value":10},
  {"date":"2024-01-02","value":16},
  {"date":"2024-01-03","value":9},
  {"date":"2024-01-04","value":13},
  {"date":"2024-01-05","value":12}
]
Python (pandas)
import pandas as pd

s = pd.DataFrame(ts)
s["date"] = pd.to_datetime(s["date"])
s = s.sort_values("date")
s["roll3"] = s["value"].rolling(window=3, min_periods=1).mean().round(2)
print(s.to_dict(orient="records"))
JavaScript (arrays)
const s = ts.slice().sort((a,b)=> new Date(a.date)-new Date(b.date));
const out = s.map((row, i, arr) => {
  const start = Math.max(0, i-2);
  const win = arr.slice(start, i+1);
  const avg = Math.round((win.reduce((sum,r)=>sum+r.value,0)/win.length)*100)/100;
  return {...row, roll3: avg};
});
console.log(out);

Exercises

Try these in Python or JavaScript. Everyone can take the Quick Test; log in to save your progress.

Exercise 1 — Clean, reshape, and summarize

Input:

[
  {"product":"A","Jan":"10","Feb":"12","Mar":null},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]

Tasks:

  • Remove the “Total” row.
  • Convert month columns to numbers; treat "NA"/null as missing.
  • Unpivot to long (product, month, value), drop missing.
  • Compute total per product and sort descending.
Hints
  • Filter rows where product != "Total".
  • Pandas: melt; JS: build long array with nested loops.
  • Replace "NA" with null/NaN then to_numeric/Number.

Expected output:

[
  {"product":"A","total":22},
  {"product":"B","total":16}
]

Exercise 2 — Join and compute category share

Inputs:

sales_long = [
  {"product":"A","month":"Jan","value":10},
  {"product":"A","month":"Feb","value":12},
  {"product":"B","month":"Feb","value":7},
  {"product":"B","month":"Mar","value":9},
  {"product":"C","month":"Feb","value":5}
]
products = [
  {"product":"A","category":"Gadgets"},
  {"product":"B","category":"Gadgets"},
  {"product":"C","category":"Accessories"}
]

Tasks:

  • Left join sales to products on product.
  • Compute category_total per (month, category).
  • Compute month_total per month and category share = category_total / month_total (round to 2 decimals).
Hints
  • Compute month totals once; reuse for share.
  • Watch division by zero; guard if a month has no rows.

Expected output:

[
  {"month":"Jan","category":"Gadgets","category_total":10,"month_total":10,"share":1},
  {"month":"Feb","category":"Accessories","category_total":5,"month_total":24,"share":0.21},
  {"month":"Feb","category":"Gadgets","category_total":19,"month_total":24,"share":0.79},
  {"month":"Mar","category":"Gadgets","category_total":9,"month_total":9,"share":1}
]

Self-check checklist

  • No aggregate rows (like “Total”) remain in detail-level data.
  • Numeric columns are actual numbers, not strings.
  • Long format used where charts expect series/category/value rows.
  • Aggregations and shares recompute correctly if inputs change.

Common mistakes and how to self-check

  • Mixing totals with detail rows: filter them out before aggregation.
  • Forgetting type conversion: strings won’t sum or sort numerically.
  • Silent missing values: standardize "NA", "", null to actual missing and decide drop/impute.
  • Wrong denominator for percentages: ensure you divide by the correct group total (e.g., per month for stacked bars).
  • Incorrect join: use left join when you must keep all fact rows; verify key uniqueness.
Quick self-test

Pick one chart you made recently. Identify its required grain (e.g., month x category). Can you produce exactly one row per grain with the values the chart uses? If not, adjust your pipeline.

Practical projects

  • Sales dashboard: ingest CSVs, clean, compute stacked category shares, top-5 products, rolling trend.
  • Web analytics: session data → daily active users, bounce rate, 7-day rolling averages.
  • Survey results: unpivot multiple-choice columns, compute response distributions and NPS by segment.

Learning path

  1. Load data safely: validate columns and types.
  2. Clean: remove totals, standardize missing, convert types.
  3. Reshape: wide ↔ long as charts require.
  4. Enrich: joins for categories, geos, segments.
  5. Aggregate: groupby and window functions for metrics.
  6. Validate: spot-check sums, counts, and sample rows.
  7. Export: supply consistent schema to visualization layer.

Next steps

  • Finish the exercises above.
  • Then take the Quick Test at the end of this page to check understanding.
  • Apply these steps to your next visualization—save a reusable transformation script.

Mini challenge

Given daily signups by source in wide format (columns Mon–Sun), remove totals, unpivot to long, join a lookup table mapping sources to channels, and compute channel share per day for a stacked percent bar. Bonus: add a 3-day rolling mean.

Starter data
signups_wide = [
  {"source":"Email","Mon":"12","Tue":"10","Wed":"8"},
  {"source":"Ads","Mon":"20","Tue":"18","Wed":"22"},
  {"source":"Total","Mon":"32","Tue":"28","Wed":"30"}
]
lookup = [
  {"source":"Email","channel":"Owned"},
  {"source":"Ads","channel":"Paid"}
]

Practice Exercises

2 exercises to complete

Instructions

Remove total rows, convert month columns to numbers, unpivot to long, drop missing, and compute totals per product sorted descending.

Input:

[
  {"product":"A","Jan":"10","Feb":"12","Mar":null},
  {"product":"B","Jan":"NA","Feb":"7","Mar":"9"},
  {"product":"Total","Jan":"10","Feb":"19","Mar":"9"}
]
Expected Output
[{"product":"A","total":22},{"product":"B","total":16}]

Data Wrangling And Transformation — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Data Wrangling And Transformation?

AI Assistant

Ask questions about this tool