Who this is for
You build or maintain video/streaming ML systems and want to make models understand motion, order, and timing. Ideal for Computer Vision Engineers moving from per-frame models to sequence-aware systems.
Prerequisites
- Comfortable with 2D CNNs and per-frame inference.
- Basic linear algebra and probability (means, variances, Gaussians).
- Familiar with batching, tensors, and FPS concepts.
Why this matters
Real tasks need time-awareness:
- Action recognition in short clips (e.g., waving vs. standing).
- Tracking objects across frames to keep stable IDs.
- Event detection in streams (fall detection, anomaly spikes).
- Reducing flicker in detections for a smooth UX.
- Meeting latency budgets in live video (edge cameras, sports, robotics).
Concept explained simply
Video is not just many images; it’s a sequence where the present depends on the recent past. Temporal models combine visual evidence across time to make steadier, more accurate predictions.
Mental model
Think of a model as a memory-equipped observer:
- It sees a new frame.
- It updates a small memory of what just happened (motion, state).
- It predicts using both the new frame and that memory.
Three key choices shape your design:
- Causality: use only past (low latency) vs. past+future (higher accuracy, higher latency).
- Receptive field: how many frames influence a prediction.
- State handling: explicit memory (RNN/Kalman) vs. implicit (temporal convs/3D CNN).
Core building blocks
- Frame differencing: subtract consecutive frames to highlight motion; a tiny, fast signal (see the sketch after this list).
- Optical flow (concept): estimates pixel motion vectors for richer motion cues.
- Temporal smoothing: EMA or majority vote to reduce prediction flicker.
- Recurrent models (LSTM/GRU): carry state across frames; naturally causal.
- Temporal convolutions (1D over time): fixed window, parallel-friendly; great for short actions.
- 3D CNNs: learn space+time jointly; strong offline accuracy, heavier compute.
- Transformers for video: attention across frames; powerful but can be memory-hungry.
- Tracking filters (e.g., Kalman): predict position/velocity and reconcile with detections.
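To make the cheapest building block concrete, here is a minimal frame-differencing sketch in Python with NumPy; the function name, grayscale-uint8 assumption, and threshold value are illustrative, not a prescribed API.

```python
import numpy as np

def motion_score(prev_frame: np.ndarray, frame: np.ndarray, thresh: int = 25) -> float:
    """Fraction of pixels whose absolute change exceeds `thresh`.

    Assumes both frames are grayscale uint8 arrays of the same shape.
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float((diff > thresh).mean())

# A score near 0 suggests a static scene; larger values indicate motion.
# Feeding this score into temporal smoothing gives a cheap motion/no-motion signal.
```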
Causal vs. bidirectional
Causal models use only past frames → lower latency, suitable for streaming. Bidirectional models also look at future frames → typically higher accuracy, but not suitable for strict real-time without accepting a buffering delay.
Windows, stride, and latency
Window size = how many frames are considered per prediction. Stride = how often you run the temporal module. Centered windows introduce lookahead delay; causal windows do not, but may be less accurate.
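To see how the window choice translates into delay, a quick back-of-the-envelope calculation (the window size and FPS below are illustrative):

```python
FPS = 30
WINDOW = 9  # frames in the temporal window (illustrative)

frame_period_ms = 1000 / FPS                                   # ~33.3 ms per frame at 30 FPS
centered_lookahead = (WINDOW - 1) // 2                         # future frames a centered window waits for
centered_lookahead_ms = centered_lookahead * frame_period_ms   # ~133 ms of added delay here
causal_lookahead_ms = 0                                        # a causal window adds compute time only

print(f"centered: +{centered_lookahead_ms:.0f} ms lookahead, causal: +{causal_lookahead_ms} ms")
```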
Streaming constraints
- Fixed latency budget per frame (e.g., under 33 ms at 30 FPS).
- Bounded memory for buffers and state.
- Graceful degradation if frames drop or arrive late.
Worked examples
Example 1: Stabilizing detections over time (EMA + majority vote)
- Run your per-frame detector. For each class probability p_t, keep an EMA: s_t = alpha * p_t + (1 - alpha) * s_{t-1}, 0 < alpha ≤ 1.
- For discrete labels, keep a short queue (e.g., last 5 labels) and output the majority label.
- Tune alpha and queue length to trade responsiveness vs. stability.
Result: far less flicker with minimal compute.
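A minimal sketch of both stabilizers, assuming a per-frame model that emits a class probability and a discrete label each frame; alpha and the vote window length are illustrative defaults:

```python
from collections import Counter, deque

class EmaSmoother:
    """Exponential moving average s_t = alpha * p_t + (1 - alpha) * s_{t-1}, 0 < alpha <= 1."""
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.state = None

    def update(self, p: float) -> float:
        self.state = p if self.state is None else self.alpha * p + (1 - self.alpha) * self.state
        return self.state

class MajorityVoter:
    """Majority vote over the last `window` discrete labels."""
    def __init__(self, window: int = 5):
        self.labels = deque(maxlen=window)

    def update(self, label: str) -> str:
        self.labels.append(label)
        return Counter(self.labels).most_common(1)[0][0]

# Construct once, then call update() every frame:
# smoother, voter = EmaSmoother(alpha=0.3), MajorityVoter(window=5)
# smoothed_p, stable_label = smoother.update(p_t), voter.update(label_t)
```

Smaller alpha and longer vote windows give more stability but slower response, which is the responsiveness-vs-stability trade-off mentioned above.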
Example 2: Simple tracking with IoU + Kalman filter
- Associate detections to existing tracks using IoU matching.
- For each track, maintain a Kalman state (x, y, w, h, vx, vy). Predict, then update with matched detection.
- Create new tracks for unmatched detections; delete tracks idle for N frames.
Result: consistent IDs and smoother boxes over time.
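A minimal sketch of the per-track pieces, assuming boxes as (x1, y1, x2, y2) for IoU and a (cx, cy, w, h, vx, vy) state for the filter; the noise scales are illustrative, and the assignment step plus track creation/deletion are left out:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class BoxKalman:
    """Constant-velocity Kalman filter over (cx, cy, w, h, vx, vy)."""
    def __init__(self, cx, cy, w, h, dt: float = 1.0):
        self.x = np.array([cx, cy, w, h, 0.0, 0.0], dtype=float)  # state
        self.P = np.eye(6) * 10.0                                  # state covariance
        self.F = np.eye(6)                                         # constant-velocity motion model
        self.F[0, 4] = dt
        self.F[1, 5] = dt
        self.H = np.eye(4, 6)                                      # we observe (cx, cy, w, h) only
        self.Q = np.eye(6) * 0.01                                  # process noise (illustrative)
        self.R = np.eye(4) * 1.0                                   # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                                          # predicted (cx, cy, w, h)

    def update(self, z):
        z = np.asarray(z, dtype=float)                             # matched detection (cx, cy, w, h)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                   # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

Per frame: call predict() on every track, match predictions to detections by IoU, then update() each matched track with its detection.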
Example 3: Action recognition with 2D CNN + temporal conv
- Extract per-frame features using a 2D CNN.
- Stack features across T frames → a tensor of shape [T, F].
- Apply 1D temporal convs over T (causal or centered) → classification head.
Result: compact, parallel-friendly temporal modeling without full 3D compute.
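A minimal PyTorch sketch of the temporal head, assuming per-frame features of size F have already been extracted by a 2D CNN; layer sizes, kernel width, and class count are illustrative:

```python
import torch
import torch.nn as nn

class CausalTemporalHead(nn.Module):
    """Causal 1D convolution over per-frame features [B, T, F] -> clip logits."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256, num_classes: int = 10, kernel: int = 5):
        super().__init__()
        self.pad = kernel - 1                          # left-pad only, so no future frames are used
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=kernel)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = feats.transpose(1, 2)                      # [B, T, F] -> [B, F, T] for Conv1d
        x = nn.functional.pad(x, (self.pad, 0))        # pad the past side only (causal)
        x = torch.relu(self.conv(x))                   # [B, hidden, T]
        return self.head(x[:, :, -1])                  # classify from the most recent time step

# Usage: logits = CausalTemporalHead()(torch.randn(2, 16, 512))  # -> [2, 10]
```

For a centered (non-causal) variant, pad both sides and classify from the middle step instead; you gain context at the cost of lookahead delay.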
Exercises
Exercise 1 — Upgrade a per-frame detector to a streaming detector
Design a minimal plan to turn a per-frame detector into a streaming event detector using a sliding window and temporal smoothing. Include buffer management, aggregation, and latency notes.
Exercise 2 — Latency and memory budget
Compute the algorithmic delay and total latency for a centered temporal window. Then estimate feature-buffer memory. Show formulas and numeric results.
Checklist
- I can explain causal vs. bidirectional and when to use each.
- I can choose an appropriate window size and stride for a latency budget.
- I know at least two ways to reduce flicker (EMA, majority vote).
- I can describe basic Kalman tracking and its benefit for stability.
- I can compare 2D+temporal-conv, 3D CNN, and RNN/GRU trade-offs.
Common mistakes and self-check
- Mistake: Using centered windows in live streams without acknowledging added latency. Self-check: Did you compute lookahead delay in ms?
- Mistake: Over-smoothing, causing sluggish response. Self-check: Measure time-to-detect for rapid events after tuning.
- Mistake: Ignoring dropped frames. Self-check: Does your buffer logic handle missing timestamps gracefully?
- Mistake: State drift in RNNs. Self-check: Do you reset or re-initialize state on scene cuts?
Practical projects
- Stabilized face detection overlay: apply EMA to reduce jitter and quantify flicker reduction.
- Multiclass action spotter: 2D CNN features + 1D temporal conv with causal windows; measure latency/accuracy trade-offs.
- Lightweight tracker: IoU + Kalman; evaluate ID switches and track fragmentation on short clips.
Mini challenge
Given a 30 FPS camera and a 12-frame causal window, design a pipeline that flags a “fall” within 300 ms from onset. Include smoothing choice, thresholding, and reset rules for state. Keep memory within a small rolling buffer.
Hints
- Latency budget at 30 FPS is ~33 ms/frame.
- Causal windows add no lookahead, only compute time.
- Use two-stage thresholds to reduce false positives (enter/exit hysteresis).
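A minimal sketch of the enter/exit hysteresis hint, assuming a per-frame fall score in [0, 1]; the two threshold values are illustrative:

```python
class Hysteresis:
    """Event latch with two thresholds: trigger above `hi`, release only below `lo`."""
    def __init__(self, hi: float = 0.7, lo: float = 0.4):
        self.hi, self.lo = hi, lo
        self.active = False

    def update(self, score: float) -> bool:
        if self.active:
            self.active = score >= self.lo    # stay latched until the score clearly drops
        else:
            self.active = score > self.hi     # require a strong score to enter the event
        return self.active
```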
Quick Test
Take the quick test below to check your understanding.
Learning path
- Start with smoothing and simple differencing to feel temporal stability.
- Add a short causal temporal conv over 2D CNN features.
- Introduce a GRU baseline and compare latency vs. accuracy.
- Experiment with a small 3D CNN on short clips; note compute and memory costs.
- Integrate a Kalman filter for tracking and ID stability.
- Harden streaming: handle drops, re-sync, and state resets on scene cuts.
Next steps
- Pick one project and measure latency, stability (flicker), and accuracy.
- Document hyperparameters (window size, alpha, thresholds) and how they affect performance.
- Prepare a short demo clip showing before/after temporal modeling.