Why this matters
As a Data Analyst, you often need to show how two numeric variables move together: ad spend vs conversions, price vs demand, support wait time vs CSAT, sessions vs revenue, or feature usage vs retention. Scatter plots are the go-to chart for revealing patterns, the strength and direction of relationships, outliers, and diminishing returns. Done well, a simple scatter plot can drive decisions like budget allocation, pricing changes, or where to investigate data quality issues.
Who this is for
- Beginner to intermediate analysts who need to visualize relationships between two numeric variables.
- Anyone preparing analyses for stakeholders and dashboards.
- Students practicing data storytelling and model sanity checks.
Prerequisites
- Basic understanding of numeric variables and axes.
- Comfort with a plotting tool (Excel/Google Sheets, Tableau/Power BI, Python matplotlib/Seaborn, or R ggplot2).
- Know what correlation means at a high level (direction and strength).
Concept explained simply
A scatter plot places each observation as a dot by its x-value and y-value. Patterns of dots tell you about the relationship between the variables.
- Direction: positive (up-right), negative (down-right), or none (cloud).
- Shape: linear, curved (e.g., diminishing returns), clustered groups, or fan-shaped spread (heteroscedasticity).
- Strength: tighter band = stronger relationship; quantify with correlation (Pearson for linear, Spearman for monotonic).
- Context: encode a third variable via color/shape; a fourth via size. But keep it readable.
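To make the Pearson vs Spearman distinction concrete, here is a minimal sketch using only the standard library. The helper names (`pearson`, `ranks`, `spearman`) and the sample data are illustrative assumptions, not part of any particular tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: strength of a *linear* relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to ranks (monotonic strength)."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y = x**3 always increases with x, but not in a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

Because the ranks of x and y match exactly, Spearman reports a perfect monotonic relationship, while Pearson comes out below 1 because the points do not fall on a straight line.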
Mental model: dots, patterns, and questions
Think of each dot as a story. Where dots line up, a rule might exist. Where dots stray, an exception or new factor may be at play. Ask:
- Is there a clear trend? If yes, how strong?
- Are there groups that behave differently?
- Any outliers that deserve investigation?
- Does variance change with x (fan shape)?
- Would a curved line fit better than a straight one?
Design and techniques that work
- Overplotting fix: add transparency (20–50% alpha), jitter small integers, or use small markers. For very dense data, try binning (hex/rect) or sample points.
- Trendline: add a linear fit to show direction and average slope; strength shows in how tightly the points hug the line. If the pattern is curved, use a polynomial or LOESS trendline for exploration.
- Scale: scatter plots do not need axes starting at zero. Use ranges that include all the data and any reference values you compare against. Consider log scales for variables spanning orders of magnitude.
- Encodings: use color for categorical groups; add shape as a redundant cue for colorblind readers; use size only when it adds clear meaning (e.g., revenue). Keep legends concise.
- Labels: label axes with units; add a short subtitle that states the question; annotate key outliers or thresholds.
- Ethics: correlation is not causation. Use language like “associated with” unless you have causal evidence.
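Two of the overplotting fixes above, jitter and rectangular binning, can be sketched without any plotting library. The function names and the survey-style data are hypothetical; in matplotlib, transparency is the `alpha` argument of `scatter`, and `hexbin` offers hexagonal binning.

```python
import random
from collections import Counter

def jitter(values, width=0.2, seed=0):
    """Add small uniform noise so stacked integer values separate visually."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

def bin_counts(xs, ys, x_width, y_width):
    """Count points per rectangular cell; the counts can be drawn as a heatmap."""
    cells = Counter()
    for x, y in zip(xs, ys):
        cells[(int(x // x_width), int(y // y_width))] += 1
    return cells

# Hypothetical data: many identical (rating, visits) pairs overlap exactly.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
visits = [2, 2, 2, 4, 4, 6, 6, 6, 6, 9]

jittered = jitter(ratings)  # plot these x-values instead of the raw integers
counts = bin_counts(ratings, visits, x_width=2, y_width=5)

for cell, n in sorted(counts.items()):
    print(f"cell {cell}: {n} points")
```

Keep the jitter width small relative to the gap between values, so the noise separates the dots without suggesting precision the data does not have.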
Worked examples
Example 1 — Marketing spend vs sign-ups
Data: spend (k$) on x, sign-ups on y for 20 campaigns. Pattern: steep increase at low spend, flattening after ~60k. A linear trendline shows positive slope but residuals curve, suggesting diminishing returns. Action: propose a cap per campaign and shift extra budget to underfunded, efficient ranges.
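The "residuals curve" signal can be checked numerically. The sketch below fits a straight line to made-up saturating data (not the 20 campaigns described above) and prints the sign of each residual in order of spend; `linear_fit` is an illustrative helper.

```python
from math import exp

def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical saturating response: fast growth at low spend, flattening later.
spend = list(range(10, 101, 10))                      # k$
signups = [400 * (1 - exp(-s / 30)) for s in spend]

slope, intercept = linear_fit(spend, signups)
residuals = [y - (slope * x + intercept) for x, y in zip(spend, signups)]

# A run of "-", then "+", then "-" means the line overshoots at both ends
# and undershoots in the middle: the signature of a concave curve.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(f"slope = {slope:.2f} sign-ups per k$, residual signs: {signs}")
```

Residuals from a well-specified linear fit should change sign without a clear pattern; long same-sign runs at the ends and in the middle suggest trying a curved fit or a transform.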
Example 2 — Price vs units sold
Pattern: down-right slope (negative). Two clusters appear (Region A and B). Aggregating both hides a stronger negative trend within each region (Simpson's paradox risk). Action: facet by region or color points; analyze separately; avoid one-size-fits-all pricing.
Example 3 — Wait time vs CSAT
Pattern: gentle negative slope; strong outliers for very low CSAT at moderate wait times. Investigation shows days with an IVR outage. Action: annotate outage dates; present both the general trend and the outlier explanation.
How to build a great scatter plot (step-by-step)
- Question first: what relationship are you testing? Write a one-line subtitle capturing it.
- Select variables: both should be numeric. If one is categorical, consider a jittered dot plot or a box plot instead.
- Plot points: start with small circles, moderate transparency.
- Add context: encode group by color/shape; avoid using both unless necessary.
- Add trendline: begin with linear; if curvature is visible, try LOESS for exploration.
- Check the residual pattern: look for curves (model mismatch), fanning (heteroscedasticity), or clusters (hidden groups).
- Tune scales: consider log scale for skewed, multiplicative data.
- Annotate: mark key outliers and add a brief takeaway above or below the chart.
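The steps above can be sketched in matplotlib roughly as follows. This assumes `numpy` and `matplotlib` are installed; the wait-time data, channel names, colors, and labels are all hypothetical placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
wait_min = rng.uniform(1, 30, 80)
csat = np.clip(4.4 - 0.04 * wait_min + rng.normal(0, 0.2, 80), 1, 5)
channel = rng.choice(["Phone", "Chat"], 80)

fig, ax = plt.subplots(figsize=(6, 4))

# Plot points: small markers, moderate transparency, color by group.
for name, color in [("Phone", "tab:blue"), ("Chat", "tab:orange")]:
    mask = channel == name
    ax.scatter(wait_min[mask], csat[mask], s=18, alpha=0.4, color=color, label=name)

# Trendline: start with a straight-line least-squares fit.
slope, intercept = np.polyfit(wait_min, csat, 1)
xs = np.linspace(wait_min.min(), wait_min.max(), 50)
ax.plot(xs, slope * xs + intercept, color="black", linewidth=1, label="Linear fit")

# Annotate: mark one point worth investigating (here, the lowest CSAT).
worst = int(np.argmin(csat))
ax.annotate("Lowest CSAT", xy=(wait_min[worst], csat[worst]),
            xytext=(5, 5), textcoords="offset points")

# Question first: the title states the relationship being tested.
ax.set_title("Is longer wait time associated with lower CSAT?")
ax.set_xlabel("Wait time (minutes)")
ax.set_ylabel("CSAT (1-5)")
ax.legend(frameon=False)
fig.tight_layout()
# fig.savefig("wait_vs_csat.png", dpi=150)  # uncomment to write the image
```

For skewed, multiplicative data, `ax.set_xscale("log")` switches the x-axis to a log scale; the straight-line fit would then need to be computed on the log-transformed values.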
Quick checklist before sharing:
- Axes labeled with units, readable ranges.
- Legend is clear and minimal.
- Overplotting handled (transparency/jitter/binning).
- Trendline appropriate and not misleading.
- Outliers investigated or explained.
- Takeaway sentence is honest: “associated with”, not causal claims.
Common mistakes and how to self-check
- Using line charts instead of scatter for unordered pairs. Self-check: Is there a meaningful sequence on x? If not, use scatter.
- Declaring causation from correlation. Self-check: Could a third factor explain both variables?
- Forcing axes to start at zero. Self-check: Does zero add meaning? If not, choose a tight but honest range.
- Ignoring overplotting. Self-check: Zoom in; if points stack, add transparency or jitter.
- Hiding group differences. Self-check: Color/facet by plausible groups (region, device, segment) and compare trends.
- One-size trendline. Self-check: Are residuals curved or fan-shaped? Consider non-linear fit or transform.
Exercises
Do these exercises in your preferred tool (Sheets/Excel, Python, R, or a BI tool) and sanity-check your work against the provided solutions.
Exercise 1 — Interpret a relationship
You receive a scatter plot description: Each point is a weekly campaign. X = marketing spend (k$). Y = sign-ups. The linear trendline slope is positive; correlation r ≈ 0.78. Three points near 55k spend have much lower sign-ups than neighbors.
- Describe the relationship (direction, strength, form).
- Identify likely outliers and give two possible reasons.
- Write a single actionable recommendation.
Hints
- Think diminishing returns and campaign quality.
- Consider tracking issues or external events.
Expected output (what good looks like)
Clear positive relationship, moderately strong; likely diminishing returns; note low-performing ~55k points as outliers; recommend capping spend and investigating those campaigns.
Solution
Direction positive, strength moderately strong (r ~0.78), likely slight curvature (flattening). Outliers around 55k could be mis-targeted audiences or tracking/landing page issues. Recommendation: cap per-campaign spend near the elbow and shift excess to campaigns in the efficient region; investigate outliers before increasing budgets.
Exercise 2 — Build and tune a scatter plot
Use the small dataset below.
| Spend_k | Signups | Channel |
|---|---|---|
| 10 | 95 | Search |
| 20 | 166 | Social |
| 30 | 210 | Email |
| 40 | 260 | Search |
| 50 | 295 | Affiliate |
| 55 | 160 | Social |
| 60 | 315 | Email |
| 70 | 325 | Search |
| 80 | 330 | Affiliate |
- Plot Signups vs Spend_k (x=Spend_k, y=Signups).
- Color points by Channel; add slight transparency.
- Add a linear trendline; report approximate correlation.
- Annotate any outlier; give a 1–2 sentence takeaway.
Hints
- Compute correlation across all points; expect a high positive value.
- The 55k/160 point should stand out.
- Use a short subtitle: “Signups rise with spend, with an outlier at 55k.”
Expected output
A readable scatter plot with color by Channel, transparency ~30–40%. Linear trendline with r ≈ 0.83 overall (about 0.95 if the outlier is excluded). The 55k/160 point annotated as an outlier. Takeaway: strong positive association with possible diminishing returns at high spend; investigate the outlier campaign.
Solution
- Plot points with x=Spend_k, y=Signups; size small, alpha ~0.3.
- Color by Channel (e.g., Search, Social, Email, Affiliate); keep a clear legend.
- Add linear trendline; correlation is roughly 0.83 across all points (the outlier pulls it down from about 0.95).
- Annotate (55,160) as underperforming. Takeaway: overall, signups increase with spend, but one campaign underperformed; check creative, audience, or tracking before scaling.
Self-check checklist:
- Axes clearly labeled with units (k$ for spend).
- Overplotting considered (transparency used).
- Trendline present and appropriate.
- Outlier identified and annotated.
- Concise, honest takeaway included.
Mini challenge
Pick any dataset with two numeric variables you use at work or study (e.g., sessions vs revenue). Create two versions:
- Version A: single-color, linear trendline.
- Version B: color by a meaningful segment and use a LOESS curve.
Write one sentence on which version better supports a decision and why.
Practical projects
- Diminishing returns report: Build a scatter plot of spend vs outcome across 3 months. Add a LOESS curve, annotate the elbow point, and recommend a budget cap.
- Segment contrast: Create a faceted scatter plot by region or device. Compare slopes and add a short text panel with per-segment insights.
- Outlier diary: For a product metric pair (usage vs retention), list top 5 outliers and, for each, a likely cause and next step to validate.
Learning path
- Now: master scatter plots (this lesson, exercises, test).
- Next: trendlines and simple regression diagnostics (residual checks).
- Then: distributions (histograms, KDE) to understand variable shapes.
- Later: segmentation and faceting for multi-group comparisons.
- Finally: dashboard integration with consistent styling and annotations.
Next steps
- Turn your best scatter into a reusable template (style, fonts, colors).
- Create a short annotation library (outlier, elbow, cluster callouts).
- Share one plot with a teammate and ask, “What decision would this support?” Iterate.
Quick Test
Take the short test below to check your understanding.