Why this matters
Correlation exploration helps you quickly discover which variables move together before you build models or make decisions. As a Data Analyst, you will:
- Prioritize features for modeling by spotting strong relationships.
- Diagnose unexpected business trends (e.g., conversion lift when price drops).
- Catch data quality issues (e.g., suspiciously perfect correlations).
- Communicate findings with simple visuals (scatter plots, heatmaps) that stakeholders understand.
Concept explained simply
Correlation measures how two variables move together. Values range from -1 to +1:
- +1: move perfectly together (as one increases, the other always increases linearly).
- 0: no linear relationship.
- -1: move perfectly in opposite directions.
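For reference, the scale above is that of the Pearson coefficient, which for n paired observations is usually written as below. The symbols are notation introduced here for illustration: x_i and y_i are the paired values, and \bar{x}, \bar{y} are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit on the same side of their means, and the denominator rescales the result into the range -1 to +1.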
Mental model
Imagine placing points on a flat table (a scatter plot). If the points form a narrow uphill line, correlation is strongly positive. A downhill line is strongly negative. A cloud with no slope means near zero correlation. If the pattern is curved (U-shaped), linear correlation may be near zero even if a relationship exists.
Common correlation types
- Pearson r: strength of a linear relationship between continuous variables; sensitive to outliers.
- Spearman rho: monotonic relationship, computed on ranks (less sensitive to outliers; handles non-linear monotonic shapes).
- Kendall tau: rank-based, robust; good with many ties and small samples.
- Point-biserial: one binary and one continuous variable.
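- Phi coefficient (two binary variables) and Cramér’s V (two categorical variables): measures of association for categorical data. Phi equals Pearson r computed on 0/1 codes; Cramér’s V ranges from 0 to 1 and has no sign.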
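The measures listed above are closely related, which a minimal standard-library sketch can show: Spearman is Pearson applied to ranks, Kendall tau-b counts concordant and discordant pairs, and point-biserial is Pearson with the binary variable coded 0/1. The helper names and the small data sets are hypothetical, chosen only for illustration. In practice, pandas `DataFrame.corr` and the `scipy.stats` functions cover the same ground.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson r of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b: (concordant - discordant) pairs, adjusted for ties."""
    score = untied_x = untied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            sx = (x[i] > x[j]) - (x[i] < x[j])  # sign of the x difference
            sy = (y[i] > y[j]) - (y[i] < y[j])  # sign of the y difference
            score += sx * sy
            untied_x += sx != 0
            untied_y += sy != 0
    return score / sqrt(untied_x * untied_y)

# Hypothetical data: a monotonic but non-linear (cubic) relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(f"pearson  = {pearson(x, y):.3f}")
print(f"spearman = {spearman(x, y):.3f}")
print(f"kendall  = {kendall_tau_b(x, y):.3f}")

# Point-biserial: Pearson with the binary variable coded 0/1 (hypothetical data).
is_member = [0, 0, 0, 1, 1, 1]
basket_value = [20, 25, 22, 30, 35, 33]
print(f"point-biserial = {pearson(is_member, basket_value):.3f}")
```

On the cubic data the two rank-based coefficients reach 1 while Pearson stays below it, which is the practical difference between "monotonic" and "linear".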
Correlation is not causation. Use domain knowledge, experiments, or causal designs to infer causes.
When to use which correlation
- Use Pearson when variables are continuous, roughly linear, and outliers are minimal.
- Use Spearman when the relationship is monotonic but not linear, or when outliers are present.
- Use Kendall when sample size is small or ranks have many ties.
- Use point-biserial for one binary + one continuous variable.
Quick decision checklist
- Relationship looks straight in scatter plot → Pearson.
- Looks curved but consistently increasing/decreasing → Spearman.
- Many ties/small n → Kendall.
- Binary + continuous → Point-biserial.
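The checklist above can be written as a small lookup, sketched here as a hypothetical helper. The order in which the rules are tested is an assumption, since the checklist does not rank them against each other.

```python
def suggest_method(binary_and_continuous=False, many_ties_or_small_n=False,
                   looks_linear=False, looks_monotonic=False):
    """Map what you see in the data to a coefficient (hypothetical helper).

    The priority of the checks is an assumption made for this sketch.
    """
    if binary_and_continuous:
        return "point-biserial"
    if many_ties_or_small_n:
        return "kendall"
    if looks_linear:
        return "pearson"
    if looks_monotonic:
        return "spearman"
    # Nothing matched, e.g. a U-shaped pattern: no single coefficient fits.
    return "plot first"

print(suggest_method(looks_linear=True))
print(suggest_method(looks_monotonic=True))
print(suggest_method(many_ties_or_small_n=True, looks_monotonic=True))
```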
Practical workflow
- Define the question: Which variables should move together if our hypothesis is true?
- Prepare data: correct types, handle missing values (drop or impute), filter obvious errors.
- Visualize pairs: scatter plots or rank-scatter (scatter on ranks) to spot nonlinearity, outliers, and clusters.
- Compute correlation(s): start with Pearson; add Spearman/Kendall if needed.
- Validate: check sensitivity by removing extreme outliers or analyzing subgroups.
- Guard against multiple comparisons: if you scan many pairs, treat p-values cautiously; focus on effect sizes and stability.
- Communicate: use a correlation matrix heatmap with concise annotations and caveats.
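The prepare, compute, and validate steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are hypothetical, missing values are marked with `None` and handled by dropping the whole pair, and the validation step is a simple leave-one-out recomputation rather than a formal outlier test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_missing(x, y):
    """Keep only the pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

def leave_one_out(x, y):
    """Pearson r recomputed with each observation removed in turn."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]

# Hypothetical data with two incomplete rows and one extreme visits value.
visits = [120, 135, None, 150, 160, 175, 400]
orders = [12, 14, 15, None, 17, 18, 19]

x, y = drop_missing(visits, orders)
r_all = pearson(x, y)
loo = leave_one_out(x, y)

print(f"r on {len(x)} complete pairs: {r_all:.2f}")
print(f"leave-one-out range: {min(loo):.2f} to {max(loo):.2f}")
```

A wide leave-one-out range signals that one observation is driving the coefficient, so the result should be reported with that caveat or rechecked after the point is investigated.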
Notes for time series
- Trend and seasonality can create spurious correlations.
- Consider differencing, detrending, or comparing aligned lags after making series stationary.
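A small sketch of why this matters, using two constructed series (not real data) that share an upward trend but whose period-to-period changes are unrelated. First differencing is used here as one simple way to remove the trend.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def first_difference(series):
    """Change from one period to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Constructed series: both trend upward over ten periods.
a = [0, 2, 2, 4, 4, 6, 6, 8, 8, 10]
b = [0, 2, 5, 7, 8, 10, 13, 15, 16, 18]

r_levels = pearson(a, b)
r_changes = pearson(first_difference(a), first_difference(b))

print(f"levels:  r = {r_levels:.2f}")   # dominated by the shared trend
print(f"changes: r = {r_changes:.2f}")  # what remains once the trend is removed
```

The levels correlate strongly only because both series rise over time; the differenced series show no linear relationship.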
Worked examples
Example 1 — Pearson vs Spearman under an outlier
Data (Ad spend → Sales units):
Spend: 100, 120, 130, 140, 160, 170, 180, 190, 400
Sales: 230, 245, 260, 275, 290, 300, 310, 330, 320
- Pearson r ≈ 0.67 (the single outlier at Spend=400 drags r down; without that row, r ≈ 0.99).
- Spearman rho ≈ 0.98 (rank order is still mostly preserved).
Takeaway: Use Spearman if a single extreme point distorts linearity.
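The figures in this example can be recomputed with a short standard-library sketch. The ranking shortcut below assumes there are no tied values, which holds for this data set; the function names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks_no_ties(values):
    """1-based ranks; only valid when all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rho as Pearson r on ranks (no-ties case)."""
    return pearson(ranks_no_ties(x), ranks_no_ties(y))

spend = [100, 120, 130, 140, 160, 170, 180, 190, 400]
sales = [230, 245, 260, 275, 290, 300, 310, 330, 320]

print("with outlier:    "
      f"pearson = {pearson(spend, sales):.2f}, spearman = {spearman(spend, sales):.2f}")

# Drop the last row (Spend=400) and recompute both coefficients.
print("without outlier: "
      f"pearson = {pearson(spend[:-1], sales[:-1]):.2f}, "
      f"spearman = {spearman(spend[:-1], sales[:-1]):.2f}")
```

Comparing the two printed lines shows how much each coefficient depends on the one extreme row.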
Example 2 — Zero Pearson but real relationship
Let x = -4,-3,-2,-1,0,1,2,3,4 and y = x^2. The relationship is U-shaped.
- Pearson r = 0 (the downhill and uphill halves cancel exactly).
- Spearman rho = 0 (the relationship is not monotonic: y falls, then rises).
Takeaway: A non-monotonic curve can hide under zero correlation. Visualize before concluding “no relationship.”
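One way to surface the hidden relationship numerically is to split the data at the turning point and correlate each half separately. The sketch below does that for this example; splitting at x = 0 is an assumption that works here because the turning point is known.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

r_all = pearson(x, y)

# Split at the turning point x = 0 (included in both halves).
left = [(a, b) for a, b in zip(x, y) if a <= 0]
right = [(a, b) for a, b in zip(x, y) if a >= 0]
r_left = pearson([a for a, _ in left], [b for _, b in left])
r_right = pearson([a for a, _ in right], [b for _, b in right])

print(f"full range: r = {r_all:.2f}")
print(f"x <= 0:     r = {r_left:.2f}")
print(f"x >= 0:     r = {r_right:.2f}")
```

Each half shows a strong relationship with opposite sign, and the two cancel when the full range is pooled.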
Example 3 — Interpreting a correlation matrix
Variables: price, discount_pct, visits, conversions, email_clicks (8 observations).
price: 10, 12, 9, 8, 15, 14, 11, 13
discount_pct: 10, 5, 20, 25, 0, 5, 10, 0
visits: 200, 180, 250, 300, 150, 160, 220, 170
conversions: 20, 18, 27, 34, 13, 15, 22, 16
email_clicks: 40, 35, 60, 70, 20, 25, 45, 30
- visits vs conversions: strong + (≈ 0.99).
- discount vs visits: strong + (≈ 0.97).
- price vs conversions: strong − (≈ −0.93).
- price vs discount: strong − (≈ −0.92).
- clicks vs visits: strong + (≈ 0.99).
Takeaway: Use signs and magnitudes to form testable hypotheses (e.g., higher discounts → more visits → more conversions).
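The full matrix for this data set can be computed with a minimal standard-library sketch that lists every pair, sorted by absolute correlation. The layout of the `data` dictionary is an illustrative choice; with pandas, `DataFrame.corr()` gives the same matrix directly.

```python
from itertools import combinations
from math import sqrt

data = {
    "price":        [10, 12, 9, 8, 15, 14, 11, 13],
    "discount_pct": [10, 5, 20, 25, 0, 5, 10, 0],
    "visits":       [200, 180, 250, 300, 150, 160, 220, 170],
    "conversions":  [20, 18, 27, 34, 13, 15, 22, 16],
    "email_clicks": [40, 35, 60, 70, 20, 25, 45, 30],
}

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# One entry per unordered pair of variables.
pairs = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}

# Strongest relationships first, regardless of sign.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a:>12} vs {b:<12} r = {r:+.2f}")
```

With only 8 observations, every coefficient here is an unstable estimate, so the ranking is a source of hypotheses rather than a conclusion.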
Hands-on exercises
These exercises can be done in any tool (spreadsheet, Python, R). Approximate results are fine if you compute by hand.
- Exercise 1: Compute Pearson and Spearman for the Ad spend vs Sales dataset. Then remove the Spend=400 row and recompute. Compare how each metric changes.
- Exercise 2: Build a Pearson correlation matrix for the 5-variable retail dataset. Identify the top 3 strongest pairs by absolute value. Suggest one practical action based on your findings.
Self-check checklist
- I visualized the points before trusting a single metric.
- I tried both Pearson and Spearman when nonlinearity/outliers were suspected.
- I confirmed that units or scaling did not change correlation results.
- I noted that correlation ≠ causation and proposed a way to validate causality (experiment or temporal analysis).
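The units-and-scaling item can be checked directly: Pearson r is unchanged by any positive linear rescaling (changing units or adding a constant) and only flips sign when a variable is multiplied by a negative number. A minimal sketch with illustrative values:

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative values only.
price_usd = [10, 12, 9, 8, 15]
conversions = [20, 18, 27, 34, 13]

price_cents_plus_fee = [100 * p + 5 for p in price_usd]  # new unit plus a constant
negated_price = [-p for p in price_usd]                  # reversed direction

r_original = pearson(price_usd, conversions)
r_rescaled = pearson(price_cents_plus_fee, conversions)
r_negated = pearson(negated_price, conversions)

print(f"original: {r_original:.4f}")
print(f"rescaled: {r_rescaled:.4f}")  # same value
print(f"negated:  {r_negated:.4f}")   # same magnitude, opposite sign
```

If a correlation does change after a unit conversion, suspect a non-linear transformation (such as a log) or a data-handling error rather than the rescaling itself.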
Common mistakes and how to self-check
- Assuming causation from high correlation. Self-check: Can you name a plausible confounder? If yes, the correlation alone does not establish causation.
- Ignoring outliers. Self-check: Recompute correlation after removing top/bottom 1% or obvious data errors.
- Forgetting nonlinearity. Self-check: Inspect scatter plots and residuals; try Spearman.
- Multiple comparisons fishing. Self-check: Focus on effect sizes, stability across samples, and pre-registered hypotheses when possible.
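- Misapplying Pearson to binary or categorical variables. Self-check: For one binary + one continuous variable, report the point-biserial coefficient (numerically the same as Pearson r with the binary coded 0/1). For categories with more than two levels, use an association measure such as Cramér’s V instead of Pearson on arbitrary numeric codes.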
- Time series spuriousness. Self-check: Detrend or difference before correlating; check autocorrelation.
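To see how quickly the multiple-comparisons problem grows, it helps to count the pairs being scanned. The sketch below also shows the Bonferroni adjustment, which is one common and conservative option and is named here as an assumption, not something the checklist above prescribes.

```python
def n_pairs(k):
    """Number of distinct variable pairs among k variables."""
    return k * (k - 1) // 2

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests

for k in (5, 20, 50):
    m = n_pairs(k)
    print(f"{k:>2} variables -> {m:>4} pairs, "
          f"per-pair threshold at alpha=0.05: {bonferroni_threshold(0.05, m):.5f}")
```

Even a modest table of 20 variables yields 190 pairs, so a handful of "significant" correlations at the usual 0.05 level is expected by chance alone.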
Practical projects
- Marketing funnel study: Correlate channel metrics (clicks, visits, cost) with conversions. Produce a heatmap and a one-page insight summary with caveats.
- Price-discount analysis: Correlate price, discount_pct, stock, and sales. Flag relationships that change by product category.
- Support operations: Correlate first-response-time, backlog size, and CSAT. Propose an experiment to validate a suspected causal link.
Who this is for
- Beginner to intermediate Data Analysts needing reliable EDA habits.
- Professionals switching from reporting to analytical decision support.
Prerequisites
- Basic descriptive statistics (mean, median, variance).
- Comfort with spreadsheets or a scripting language (Python/R) for simple calculations.
- Ability to read scatter plots and simple matrices.
Learning path
- Data cleaning and validation.
- Univariate exploration (distributions, outliers).
- Correlation exploration (this lesson).
- Feature screening and simple predictive baselines.
- Hypothesis testing and experiment design.
Next steps
- Replicate these analyses on your real data and document assumptions.
- Try partial correlations to control obvious confounders.
- If working with time series, practice detrending and lag analysis before correlating.
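As a starting point for the partial-correlation step, here is a minimal sketch using the standard first-order formula, which removes the linear effect of one control variable z from the correlation between x and y. It reuses three columns of the retail data from Example 3; treating visits as the control is an assumption made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

discount_pct = [10, 5, 20, 25, 0, 5, 10, 0]
visits = [200, 180, 250, 300, 150, 160, 220, 170]
conversions = [20, 18, 27, 34, 13, 15, 22, 16]

raw = pearson(discount_pct, conversions)
controlled = partial_corr(discount_pct, conversions, visits)

print(f"discount vs conversions:                 r = {raw:.2f}")
print(f"same pair, controlling for visits: partial r = {controlled:.2f}")
```

A large drop after controlling for visits is consistent with discounts acting mainly through traffic, but with 8 observations it is a hypothesis to test, not evidence of a causal path.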
Mini challenge
You find r = 0.62 between discount_pct and conversions. After segmenting by device type, the correlation is 0.15 for mobile and 0.65 for desktop. What would you recommend?
Possible approach
Report the segment-specific correlations, prioritize desktop for discount tests, and investigate why mobile response is weak (UX, page speed, or attribution). Consider controlled experiments per segment.