Why this matters
Computer Vision Engineers often must deliver real-time or low-latency inference on NVIDIA GPUs and edge devices. TensorRT compiles and optimizes your trained model for the target GPU so you can:
- Hit real-time FPS for video analytics and robotics.
- Reduce cloud GPU costs by increasing throughput per device.
- Meet tight latency SLAs for interactive applications.
- Deploy efficiently on edge devices (e.g., Jetson) with limited power budgets.
Concept explained simply
TensorRT is an inference optimizer and runtime for NVIDIA GPUs. You give it a trained model (usually via ONNX). TensorRT then:
- Analyzes your model graph.
- Fuses compatible layers to reduce memory movement.
- Chooses fast GPU kernels ("tactics").
- Uses lower precision (FP16 or INT8) when safe, to gain speed.
- Builds a hardware-tuned engine file you can run repeatedly.
Mental model
Imagine packing a suitcase: TensorRT repacks everything to use less space and weight (memory and compute), and labels the bag for a specific trip (your GPU). You still bring the same items (model behavior), but in a more compact, faster-to-access setup.
Workflow: from model to fast engine
- Export your model to ONNX with correct opset and shapes.
- Decide precision: FP32 (baseline), FP16 (often free speedup), or INT8 (fastest if calibrated/quantized).
- Choose static or dynamic shapes. For dynamic shapes, define min/opt/max ranges (optimization profiles).
- Build the TensorRT engine (with trtexec or the Python API; see the sketch after this list). Tune workspace memory and other flags.
- Serialize the engine (.plan file) and deploy with the TensorRT runtime.
- Validate accuracy and benchmark latency/throughput.
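For reference, the same build can be driven from Python instead of trtexec. A minimal sketch, assuming TensorRT 8.x and an ONNX file named resnet50.onnx (adjust paths, precision flags, and the workspace size to your setup):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph; stop early if any node is unsupported.
with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where the GPU supports them
# Workspace limit (TensorRT 8.4+; older releases use config.max_workspace_size).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)

# Build and serialize the engine so the runtime can load it later.
engine_bytes = builder.build_serialized_network(network, config)
with open("resnet50_fp16.plan", "wb") as f:
    f.write(engine_bytes)

trtexec wraps this same builder API; the examples below stick to trtexec for brevity.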
Key terms
- Engine: The compiled, optimized model file run by TensorRT runtime.
- Precision: FP32, FP16, INT8. Lower precision can increase speed and reduce memory.
- Optimization profile: Min/Opt/Max input shapes for dynamic shape models.
- Workspace: Temporary GPU memory TensorRT uses to try faster kernels/tactics.
- Calibration: Process to determine INT8 scaling using representative data if your model is not already quantized (Q/DQ).
Worked examples
Example 1: FP16 engine for image classification
Goal: Convert a ResNet50 ONNX model to a TensorRT FP16 engine and benchmark.
- Ensure the ONNX input name and shape are correct (e.g., input: 1x3x224x224).
- Build and save the engine with trtexec (replace input name if different):
trtexec \
--onnx=resnet50.onnx \
--fp16 \
--shapes=input:1x3x224x224 \
--workspace=4096 \
--saveEngine=resnet50_fp16.plan \
--avgRuns=100 --warmUp=10
What to expect: Lower latency than FP32 and similar accuracy for most models.
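To run the saved engine, deserialize it with the TensorRT runtime and manage the input/output buffers yourself. A minimal sketch using pycuda and the TensorRT 8.x bindings API; it assumes a single-input, single-output engine (binding 0 in, binding 1 out) built as above, with random data standing in for real preprocessing:

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("resnet50_fp16.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One pinned host buffer and one device buffer per binding (static shapes here).
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed image
np.copyto(host_bufs[0], image.ravel())

cuda.memcpy_htod(dev_bufs[0], host_bufs[0])  # host -> device
context.execute_v2(bindings)                 # synchronous inference
cuda.memcpy_dtoh(host_bufs[1], dev_bufs[1])  # device -> host
print("top-1 class index:", int(np.argmax(host_bufs[1])))

Newer TensorRT releases replace the numbered bindings API with name-based tensor calls, so check the version you deploy against.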
Example 2: INT8 engine using an existing calibration cache
Goal: Use INT8 for more speed. Assumes you already created a calibration cache file with representative data.
trtexec \
--onnx=model.onnx \
--int8 --fp16 \
--calib=calibration.cache \
--shapes=input:1x3x640x640 \
--workspace=6144 \
--saveEngine=model_int8.plan \
--avgRuns=200 --warmUp=20
Notes:
- If your ONNX has Q/DQ nodes (quantization-aware), calibration is not required.
- INT8 needs careful validation; expect tiny numeric differences but similar task-level accuracy.
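If you still need to create the cache, one option is an entropy calibrator written against the TensorRT Python API. A hedged sketch: load_calibration_batches() below is a hypothetical stand-in for your real, representative data, and the resulting calibration.cache can then be passed to trtexec via --calib (or the calibrator attached directly with config.int8_calibrator when building in Python):

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_path="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_path = cache_path
        self.batch_size = batches[0].shape[0]
        # One device buffer, reused for every calibration batch.
        self.device_buf = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_buf, np.ascontiguousarray(batch))
        return [int(self.device_buf)]

    def read_calibration_cache(self):
        # Reusing an existing cache lets repeated builds skip recalibration.
        try:
            with open(self.cache_path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_path, "wb") as f:
            f.write(cache)

def load_calibration_batches(num_batches=16, shape=(8, 3, 640, 640)):
    # Hypothetical stand-in: use real, representative preprocessed images instead.
    return [np.random.rand(*shape).astype(np.float32) for _ in range(num_batches)]

calibrator = EntropyCalibrator(load_calibration_batches())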
Example 3: Dynamic shapes and profiles for object detection
Goal: Support multiple resolutions and small batches efficiently.
trtexec \
--onnx=detector.onnx \
--fp16 \
--minShapes=input:1x3x320x320 \
--optShapes=input:1x3x640x640 \
--maxShapes=input:4x3x1280x1280 \
--workspace=8192 \
--saveEngine=detector_fp16_dynamic.plan
At runtime, set input shapes within the chosen profile to avoid re-optimization costs. Use batch=1 for latency-critical paths; batch>1 for higher throughput.
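At the API level, picking a shape inside the profile looks like the brief sketch below (TensorRT 8.x bindings API; the engine file is the one built above, and buffer setup follows the runtime sketch in Example 1):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("detector_fp16_dynamic.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Choose a concrete input shape within the profile's min/max range before running.
context.set_binding_shape(0, (1, 3, 640, 640))
assert context.all_binding_shapes_specified

# Output shapes become known only after the input shape is set.
print("output shape:", context.get_binding_shape(1))

When building with the Python API instead of trtexec, the same ranges are declared with builder.create_optimization_profile(), profile.set_shape("input", (1, 3, 320, 320), (1, 3, 640, 640), (4, 3, 1280, 1280)), and config.add_optimization_profile(profile).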
Performance levers that matter
- Precision: Start with FP16; use INT8 if accuracy holds after calibration/quantization.
- Workspace size: Larger workspace lets TensorRT pick faster tactics (uses more GPU memory).
- Optimization profiles: Choose realistic min/opt/max shapes; keep ranges tight to get better kernels.
- Layer fusion: Comes for free when possible; reduces memory traffic and kernel launches.
- Streams and batching: More streams/batch for throughput; single-stream, batch=1 for lowest latency.
- Pre/post-processing: Move to GPU if possible to avoid CPU-GPU copies (see the sketch after this list).
- Target hardware: Engines are tuned to GPU architecture; rebuild when changing GPU families.
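For the pre/post-processing lever, a common pattern is to resize and normalize on the GPU so only the raw frame crosses PCIe once. A minimal sketch using PyTorch; frame is a hypothetical HxWx3 uint8 numpy array, and any GPU image library works similarly:

import torch
import torch.nn.functional as F

def preprocess_gpu(frame, size=640):
    # Single host-to-device copy of the raw frame.
    x = torch.from_numpy(frame).to("cuda", non_blocking=True)
    # HWC uint8 -> NCHW float in [0, 1], then resize on the GPU.
    x = x.permute(2, 0, 1).float().div_(255.0).unsqueeze(0)
    x = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    return x.contiguous()  # contiguous NCHW tensor, ready to bind to the engine input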
Validate correctness and performance
- Accuracy parity: Compare outputs between the original framework and TensorRT within a small tolerance (e.g., mean abs diff < 1e-3 for FP16).
- Task metrics: For detection/segmentation, compare mAP/IoU on a validation subset.
- Latency: Measure p50/p95 latency with warm-up runs; include end-to-end time (see the measurement sketch after this list).
- Throughput: Report in FPS or QPS at steady-state.
- Stability: Check memory usage over time and catch any OOMs.
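A minimal measurement sketch covering the parity and latency checks above; run_once is a hypothetical callable wrapping one end-to-end inference (preprocessing, copies, execution, postprocessing):

import time
import numpy as np

def output_parity(reference, trt_output, tol=1e-3):
    # Mean absolute difference between framework and TensorRT outputs.
    diff = float(np.abs(np.asarray(reference) - np.asarray(trt_output)).mean())
    return diff, diff < tol

def benchmark(run_once, warmup=20, iters=200):
    for _ in range(warmup):  # warm-up so GPU clocks ramp up and allocations settle
        run_once()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    times_ms = np.array(times_ms)
    return {
        "p50_ms": float(np.percentile(times_ms, 50)),
        "p95_ms": float(np.percentile(times_ms, 95)),
        "fps": 1000.0 / float(times_ms.mean()),
    }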
Who this is for
- Computer Vision Engineers deploying models on NVIDIA GPUs or Jetson devices.
- ML Engineers optimizing inference latency and throughput.
Prerequisites
- Basic GPU concepts (CUDA cores, memory limits).
- Ability to export models to ONNX (PyTorch or TensorFlow).
- Comfort with command line and simple Python for I/O.
Learning path
Step 1 — Export and sanity check ONNX
- Export model with correct opset and input names.
- Run a few test inputs through ONNX Runtime to confirm equivalence.
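A minimal sketch of this step for a PyTorch classifier (recent torch/torchvision assumed; the opset, tensor names, and random input are illustrative defaults):

import numpy as np
import onnxruntime as ort
import torch
import torchvision

# Export a pretrained ResNet50 with an explicit input name and opset.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=17,
                  input_names=["input"], output_names=["output"])

# Sanity check: ONNX Runtime output should match PyTorch within a small tolerance.
with torch.no_grad():
    torch_out = model(dummy).numpy()
sess = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"input": dummy.numpy()})[0]
print("max abs diff:", float(np.abs(torch_out - onnx_out).max()))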
Step 2 — Build FP16 engine
- Use trtexec to produce an FP16 engine.
- Benchmark and note latency/throughput gains.
Step 3 — Add dynamic shapes and profiles
- Define realistic min/opt/max shapes.
- Rebuild and test with varied inputs.
Step 4 — Explore INT8
- Use Q/DQ models or a calibration cache built with representative data.
- Re-check accuracy and only keep INT8 if metrics meet your bar.
Exercises
- Exercise 1: Build and benchmark a TensorRT FP16 engine
Using an ONNX classification model, build a TensorRT FP16 engine with trtexec, then collect average latency and throughput.
Instructions
- Confirm the ONNX input name and set shape to 1x3x224x224 (or your model's real input).
- Run:
trtexec --onnx=model.onnx --fp16 --shapes=input:1x3x224x224 --workspace=4096 --saveEngine=model_fp16.plan --avgRuns=200 --warmUp=20
- Record the reported latency and throughput.
Expected output: Console report including average latency (ms) and throughput (FPS/QPS).
Hints
- If you see an input mismatch error, verify the exact input tensor name in your ONNX graph.
- Increase --workspace if performance is poor and you have spare GPU memory.
- Use --iterations or --avgRuns and --warmUp for stable measurements.
Solution
A typical successful run looks like:
trtexec --onnx=model.onnx --fp16 --shapes=input:1x3x224x224 --workspace=4096 --saveEngine=model_fp16.plan --avgRuns=200 --warmUp=20
[... build log ...]
Throughput: 1450 qps
Latency: min = 0.55 ms, max = 1.10 ms, mean = 0.69 ms, median = 0.68 ms
[... more stats ...]
If your numbers differ widely from FP32, confirm that the same preprocessing and input shapes are used.
Exercise checklist
- [ ] Engine file (.plan) created without errors.
- [ ] Mean latency recorded after warm-up.
- [ ] Throughput recorded and compared to baseline.
- [ ] Notes on workspace and precision choices saved.
Common mistakes and self-check
- Wrong input names or shapes: Self-check by printing model inputs from ONNX and matching them in trtexec (see the snippet after this list).
- Overly wide dynamic ranges: Huge min/max shapes can reduce kernel quality. Tighten profiles to realistic values.
- INT8 without proper data: Calibration with non-representative data hurts accuracy. Use a representative sample (coverage of lighting, classes, sizes).
- Ignoring pre/post-processing time: Measure end-to-end, not just model time.
- Comparing different precisions unfairly: Use tolerances when comparing FP32 vs FP16 vs INT8.
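A quick way to do the input-name self-check mentioned above, using the onnx Python package:

import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    # dim_param is a symbolic size (e.g., "batch"); dim_value is a fixed size.
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)

Use the printed name verbatim in --shapes, --minShapes, --optShapes, and --maxShapes.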
Practical projects
- Real-time camera classification: Build FP16 and INT8 engines, measure latency per resolution (224, 320, 480).
- Object detection microservice: Add dynamic shape support and automatically switch between batch=1 (low latency) and batch=4 (throughput).
- Edge deployment trial: On a Jetson device, compare GPU vs DLA (if available) for a quantized model and report power vs FPS.
Mini challenge
Reduce p95 latency by 20% for a detection model at 640x640. You may adjust precision (FP16/INT8), workspace, and optimization profiles. Report trade-offs (accuracy, memory, throughput).
Next steps
- Automate engine builds in CI to produce per-GPU artifacts.
- Add accuracy regression checks comparing framework vs TensorRT outputs.
- Prepare for serving at scale with a production inference server and health checks.