
Monitoring And Profiling

Learn Monitoring And Profiling for free with explanations, exercises, and a quick test (for Data Visualization Engineers).

Published: December 28, 2025 | Updated: December 28, 2025

Why this matters

Monitoring and profiling turn guesswork into data. As a Data Visualization Engineer, you’re responsible for dashboards and charts that load fast, feel smooth, and scale as data grows. Real tasks include:

  • Diagnosing a slow dashboard and proving whether the bottleneck is the query, API, or chart rendering.
  • Keeping interaction latency under 100–200 ms for filters, tooltips, and zoom.
  • Preventing regressions when datasets or users increase.
  • Setting and tracking performance SLOs like time-to-first-chart, p95 query latency, FPS during interactions, and error rates.
Typical performance questions you’ll answer
  • Which chart is slowest and why?
  • Are we compute-bound (CPU), memory-bound, or network-bound?
  • What is the p95 load time for our top dashboard?
  • Did last week’s release increase rows scanned or decrease cache hit rate?

Concept explained simply

Monitoring is continuous measurement of live systems. Profiling is a focused investigation to locate hotspots. Together they form a feedback loop to keep charts fast and reliable.

Key metrics to track:

  • Latency: time-to-first-byte (TTFB), time-to-first-chart (TTFC), total load time, p95 interaction latency (p95 = 95th percentile; computing it is sketched after this list).
  • Throughput: requests per second, queries per minute, concurrency.
  • Resource usage: CPU, memory, GPU, rows scanned, bytes transferred.
  • Quality signals: error rates, timeouts, data freshness, cache hit rate.
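
Percentiles come up in almost every metric above. As a minimal sketch (not a production implementation), here is one way to compute them from raw samples in JavaScript; the latencies array is a hypothetical buffer of interaction timings in ms:

// Nearest-rank percentile over raw samples.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [80, 90, 95, 105, 110, 130, 450]; // hypothetical samples (ms)
console.log('p50:', percentile(latencies, 50)); // 105
console.log('p95:', percentile(latencies, 95)); // 450, the tail an average hides
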
Mental model: The 5-step loop
  1. Instrument: Add timers, counters, and labels to code and queries (a browser sketch follows this list).
  2. Collect: Emit metrics and logs (client and server).
  3. Visualize: Build a small performance dashboard.
  4. Alert: Define thresholds and notify on breaches.
  5. Improve: Optimize, then repeat with the same measurements.
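
A minimal sketch of step 1 in the browser, using the standard Performance API; renderChart and the /api/data endpoint are hypothetical placeholders:

// Step 1 (instrument): mark fetch start and first chart draw, then measure TTFC.
async function loadDashboard() {
  performance.mark('viz:fetch-start');
  const res = await fetch('/api/data'); // hypothetical endpoint
  const data = await res.json();

  renderChart(data); // hypothetical render function
  performance.mark('viz:first-chart');

  // Step 2 (collect): the measured duration is what you emit to your backend.
  performance.measure('viz_ttfc_ms', 'viz:fetch-start', 'viz:first-chart');
  const [m] = performance.getEntriesByName('viz_ttfc_ms');
  console.log('TTFC:', m.duration.toFixed(0), 'ms');
}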

What to monitor (by layer)

  • Client (browser):
    • TTFC (initial chart draw), p95 interaction latency, frames per second (FPS) during zoom/brush, memory usage after navigation.
  • APIs: p95 latency, error rate, payload size, decompression time.
  • SQL/warehouse: p95 query time, rows scanned, bytes read, cache hit rate, queue wait time.
  • Infra: CPU, memory, I/O, network bandwidth, autoscaling events.
  • Data quality: freshness, missing values, schema changes.
  • Product signals: most-used dashboards, drop-off after 3 seconds, retries per session.
Good starter thresholds
  • TTFC: < 2.0s p95 on broadband; < 4.0s p95 on low-end devices.
  • Interaction latency: < 100–200 ms p95 for tooltips/filter clicks.
  • API latency: < 500 ms p95; errors < 0.5%.
  • Query rows scanned: baseline per dashboard; alert on +50% week-over-week.

Profiling methods you’ll use

  • Browser DevTools: Performance timeline (CPU flamegraph), Memory heap snapshots, Network waterfall (TTFB, content download).
  • SQL EXPLAIN/EXPLAIN ANALYZE: Identify scans vs. seeks, sort/aggregate costs, missing indexes, partition pruning.
  • Tracing and sampling: Request traces across frontend, API, and warehouse; sample to reduce overhead.
  • Synthetic vs. real-user monitoring (RUM): Synthetic is controlled; RUM shows real devices and networks (a RUM sketch follows this list).
  • A/B or before/after tests: Verify improvements with the same measurement approach.
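
For RUM specifically, the browser's Long Tasks API can report main-thread stalls from real sessions. A minimal sketch with sampling; reportMetric is a hypothetical stand-in for your metrics sink:

// Observe main-thread tasks longer than 50 ms in real user sessions.
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Sample 10% of entries to keep reporting overhead low.
    if (Math.random() < 0.1) {
      reportMetric('longtask_ms', entry.duration); // hypothetical sink
    }
  }
});
obs.observe({ entryTypes: ['longtask'] });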

Worked examples

Example 1 — Slow dashboard on first load
  1. Measure: Network shows API returns in 300 ms, but TTFC is 4.8 s; DevTools reveals heavy layout and paint on one table chart.
  2. Hypothesis: Rendering 20k DOM nodes (table rows) blocks main thread.
  3. Fix: Use pagination or virtualization (render ~50 rows; sketched below), move heavy formatting to the server, and send pre-aggregated data.
  4. Result: TTFC drops to 1.6 s p95; interaction latency falls from 320 ms to 80 ms.
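
The virtualization fix in sketch form: keep only the visible rows in the DOM and reposition them on scroll. The container markup, rowHeight, and allRows are illustrative assumptions:

// Assumes <div id="viewport" style="overflow:auto"><div id="window"></div></div>
// plus a spacer sized to rows.length * rowHeight so the scrollbar stays honest.
const rowHeight = 24; // assumed fixed row height in px
const viewport = document.getElementById('viewport');
const win = document.getElementById('window');

function renderVisibleRows(rows) {
  const first = Math.floor(viewport.scrollTop / rowHeight);
  const count = Math.ceil(viewport.clientHeight / rowHeight) + 1;
  // Only ~50 DOM rows exist at a time, however long `rows` is.
  win.style.transform = `translateY(${first * rowHeight}px)`;
  win.innerHTML = rows
    .slice(first, first + count)
    .map((r) => `<div class="row">${r}</div>`)
    .join('');
}

viewport.addEventListener('scroll', () => renderVisibleRows(allRows)); // allRows: hypothetical dataset
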
Example 2 — Janky scatter plot
  1. Measure: 200k points; FPS drops to 12 during pan/zoom; GPU mostly idle; main thread busy.
  2. Fix: Aggregate to hexbin/contours, enable data sampling for dense regions, and switch rendering to canvas or WebGL (the canvas version is sketched below).
  3. Result: 55–60 FPS; p95 zoom latency < 120 ms.
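
The canvas switch in sketch form: a single bitmap draw loop replaces 200k DOM nodes. The canvas id and the points array are illustrative:

// Draw every point into one canvas bitmap instead of one DOM node each.
const canvas = document.getElementById('scatter'); // hypothetical <canvas>
const ctx = canvas.getContext('2d');

function drawPoints(points) { // points: array of {x, y} in canvas pixels
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = 'steelblue';
  for (const p of points) {
    ctx.fillRect(p.x, p.y, 2, 2); // fillRect skips per-point path construction
  }
}
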
Example 3 — Memory creep after filter changes
  1. Measure: Heap snapshots show detached DOM nodes accumulating; event listeners not removed on chart rerender.
  2. Fix: Dispose charts on unmount, reuse canvases, and clear timers/listeners (sketched below).
  3. Result: Memory plateaus under 200 MB after 10 filter cycles; no OOM crashes.
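
Those fixes in sketch form: one AbortController detaches every listener at teardown, and timers are cleared explicitly. redraw, refreshData, and chart.dispose are hypothetical:

// Tie all listeners to one AbortController so teardown is a single call.
const controller = new AbortController();
window.addEventListener('resize', redraw, { signal: controller.signal });
const pollId = setInterval(refreshData, 30_000);

function unmountChart() {
  controller.abort();    // removes every listener registered with this signal
  clearInterval(pollId); // timers otherwise keep closures (and their data) alive
  chart.dispose?.();     // hypothetical library hook; releases canvases/contexts
}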

How to instrument quickly

  1. Add timers around fetch and initial render (TTFB, TTFC, total load).
  2. Log payload size and rows count with the same request ID.
  3. Record p50/p95 interaction latency for key actions (filter apply, tooltip open).
  4. Set initial SLOs and a weekly review: if breached for 3 days, create an issue.
Copy-paste metric names you can adopt
  • viz_ttfc_ms, viz_interaction_latency_ms, viz_payload_kb
  • api_p95_ms, api_error_rate
  • sql_rows_scanned, sql_bytes_read, sql_cache_hit_rate
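
A sketch tying steps 1–3 together with the metric names above; logMetric, renderChart, and the /api/data endpoint are hypothetical:

// Time fetch and render under one request ID so client and server logs join up.
async function loadChart() {
  const requestId = crypto.randomUUID();
  const t0 = performance.now();

  const res = await fetch('/api/data', { headers: { 'X-Request-Id': requestId } });
  const text = await res.text();
  logMetric('viz_payload_kb', text.length / 1024, { requestId }); // hypothetical sink

  renderChart(JSON.parse(text)); // hypothetical renderer
  logMetric('viz_ttfc_ms', performance.now() - t0, { requestId });
}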

Exercises

These match the graded exercises below. Complete them, then take the quick test. The test is available to everyone; only logged-in users will see saved progress in their account.

Exercise 1 — Profile a heavy SVG chart

Goal: Identify the main-thread bottleneck and propose two fixes.

Starter HTML (save as profile-svg.html and open locally)
<!doctype html>
<html>
<body>
  <h3>Heavy SVG test</h3>
  <button id="draw">Draw 80k circles</button>
  <svg id="s" width="900" height="500" style="border:1px solid #ccc"></svg>
  <script>
    const s = document.getElementById('s');
    document.getElementById('draw').onclick = () => {
      console.time('render');
      // Intentionally heavy: 80k individual SVG nodes appended one by one,
      // each of which the browser must style, lay out, and paint.
      for (let i = 0; i < 80000; i++) {
        const c = document.createElementNS('http://www.w3.org/2000/svg', 'circle');
        c.setAttribute('cx', Math.random() * 900);
        c.setAttribute('cy', Math.random() * 500);
        c.setAttribute('r', 2);
        c.setAttribute('fill', 'steelblue');
        s.appendChild(c);
      }
      console.timeEnd('render'); // covers scripting only; layout/paint happen after
    };
  </script>
</body>
</html>

Steps:

  • Open Performance in DevTools, record while pressing Draw.
  • Identify top CPU tasks (layout/paint, scripting, GC).
  • Write two fixes that would reduce time by 80%+.

Exercise 2 — Define SLOs and alerts

Goal: Create a minimal performance contract for a top dashboard.

  • Pick three metrics (e.g., viz_ttfc_ms p95, api_p95_ms, sql_rows_scanned).
  • Set thresholds and an alert rule (breach for N minutes/hours); one way to code the rule is sketched after this list.
  • Describe how you’ll visualize and review them weekly.
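
One way to express the alert rule in code, as a sketch; the thresholds and the notify sink are assumptions from the exercise, not a real alerting API:

// Fire only after the SLO is breached for N consecutive evaluation windows.
const SLO = { metric: 'viz_ttfc_ms', p95ThresholdMs: 2000, breachWindows: 3 };
let consecutiveBreaches = 0;

function evaluateWindow(p95Ms) { // p95 of the latest window
  consecutiveBreaches = p95Ms > SLO.p95ThresholdMs ? consecutiveBreaches + 1 : 0;
  if (consecutiveBreaches >= SLO.breachWindows) {
    notify(`${SLO.metric} p95 ${p95Ms} ms > ${SLO.p95ThresholdMs} ms`); // hypothetical notifier
  }
}
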
Quality checklist to self-evaluate
  • Metrics map to user experience (TTFC, interaction latency).
  • Includes both client and data/query layers.
  • Thresholds are realistic and testable.
  • Includes a plan to visualize and to act on breaches.

Common mistakes and self-check

  • Relying on averages. Use p95/p99 to catch tail latencies.
  • Profiling only in dev. Test on realistic data and devices.
  • No baseline. Always capture pre-change metrics.
  • Optimizing the wrong layer. Confirm where time is spent first.
  • Unbounded metric labels. Avoid high-cardinality tags (e.g., per-user IDs).
  • Ignoring payload size and rows scanned. Data volume often dominates.
  • Skipping memory checks. Watch heap over multiple interactions.
Self-check prompts
  • Can you point to one most-expensive function or query step?
  • Do you know your current p95 TTFC for the top 3 dashboards?
  • What alert would wake you up, and why?

Practical projects

  • Build a performance dashboard showing TTFC p95, API p95, and rows scanned for your top 5 charts. Add a weekly trend.
  • Profile a chart interaction (zoom/brush) and ship one change that reduces p95 latency by 30%+. Document before/after.
  • Set an SLO (e.g., TTFC p95 < 2s) and implement a simple alert on breach for 15 minutes.

Learning path

  1. Start with client metrics (TTFC, interaction p95) using the Performance API and DevTools.
  2. Add API timing and payload size logging with request IDs.
  3. Use EXPLAIN/ANALYZE to understand query plans and rows scanned.
  4. Create a small performance dashboard; set SLOs and alerts.
  5. Automate regression checks in your release process.

Who this is for

Data Visualization Engineers, dashboard developers, analytics engineers, and BI practitioners who own the speed and smoothness of data experiences.

Prerequisites

  • Basic web performance concepts (network waterfall, main thread).
  • Comfort with SQL and reading query plans.
  • Ability to run and measure code locally with DevTools.

Next steps

  • Adopt the 5-step loop: instrument, collect, visualize, alert, improve.
  • Add a performance section to pull requests with before/after metrics.
  • Schedule a 30-minute weekly review of top dashboards’ p95 metrics.

Mini challenge

Pick one high-traffic dashboard. Set TTFC p95 < 2s and interaction p95 < 150 ms goals. Capture today’s baseline, propose two changes, and estimate impact. Re-measure after one change and log the before/after numbers.

Quick Test

Take the quick test below to reinforce key ideas. The test is available to everyone; only logged-in users will see saved progress.

Practice Exercises

2 exercises to complete

Instructions

Use the starter HTML from Exercise 1 (profile-svg.html) in your browser. Record a performance profile while pressing Draw. Identify the dominant bottleneck and list two concrete fixes.

  • Open DevTools Performance, click Record, then press Draw.
  • Stop recording and inspect the flame chart.
  • Answer: what consumes most time, and what two fixes will you ship?
Expected Output
Main thread is dominated by layout/paint and DOM operations from adding tens of thousands of SVG nodes. Two fixes: reduce DOM nodes (aggregation/sampling/virtualization) and switch to canvas/WebGL or server-side aggregation to shrink payload.

Monitoring And Profiling — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

