Acceleration With TensorRT Basics

Learn Acceleration With TensorRT Basics for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Computer Vision Engineers often must deliver real-time or low-latency inference on NVIDIA GPUs and edge devices. TensorRT compiles and optimizes your trained model for the target GPU so you can:

  • Hit real-time FPS for video analytics and robotics.
  • Reduce cloud GPU costs by increasing throughput per device.
  • Meet tight latency SLAs for interactive applications.
  • Deploy efficiently on edge devices (e.g., Jetson) with limited power budgets.

Concept explained simply

TensorRT is an inference optimizer and runtime for NVIDIA GPUs. You give it a trained model (usually via ONNX). TensorRT then:

  • Analyzes your model graph.
  • Fuses compatible layers to reduce memory movement.
  • Chooses fast GPU kernels ("tactics").
  • Uses lower precision (FP16 or INT8) when safe, to gain speed.
  • Builds a hardware-tuned engine file you can run repeatedly.

Mental model

Imagine packing a suitcase: TensorRT repacks everything to use less space and weight (memory and compute), and labels the bag for a specific trip (your GPU). You still bring the same items (model behavior), but in a more compact, faster-to-access setup.

Workflow: from model to fast engine

  1. Export your model to ONNX with correct opset and shapes.
  2. Decide precision: FP32 (baseline), FP16 (often a near-free speedup), or INT8 (fastest when calibrated/quantized).
  3. Choose static or dynamic shapes. For dynamic shapes, define min/opt/max ranges (optimization profiles).
  4. Build the TensorRT engine (with trtexec or the API; a Python API sketch follows this list). Tune workspace memory and other flags.
  5. Serialize the engine (.plan file) and deploy with the TensorRT runtime.
  6. Validate accuracy and benchmark latency/throughput.
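Step 4 can also be done from Python instead of trtexec. Below is a minimal sketch using the TensorRT Python API, assuming TensorRT 8.4 or newer (where set_memory_pool_limit replaced the older max_workspace_size) and an already-exported ONNX model; file names are placeholders.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX model and surface any parser errors.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)                                  # allow FP16 kernels
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)   # 4 GiB workspace

# Build and serialize the engine to a .plan file.
engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.plan", "wb") as f:
    f.write(engine_bytes)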

Key terms
  • Engine: The compiled, optimized model file run by TensorRT runtime.
  • Precision: FP32, FP16, INT8. Lower precision can increase speed and reduce memory.
  • Optimization profile: Min/Opt/Max input shapes for dynamic shape models.
  • Workspace: Temporary GPU memory TensorRT uses to try faster kernels/tactics.
  • Calibration: The process of determining INT8 scale factors from representative data when the model is not already quantized (Q/DQ).

Worked examples

Example 1: FP16 engine for image classification

Goal: Convert a ResNet50 ONNX model to a TensorRT FP16 engine and benchmark.

  1. Ensure the ONNX input name and shape are correct (e.g., input: 1x3x224x224).
  2. Build and save the engine with trtexec (replace input name if different):
trtexec \
  --onnx=resnet50.onnx \
  --fp16 \
  --shapes=input:1x3x224x224 \
  --workspace=4096 \
  --saveEngine=resnet50_fp16.plan \
  --avgRuns=100 --warmUp=10

What to expect: Lower latency than FP32 and similar accuracy for most models.
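To deploy the serialized engine (workflow step 5), you load it with the TensorRT runtime and copy data to and from the GPU yourself. A minimal sketch, assuming TensorRT 8.x with the pycuda package, a single input and output, and the standard 1x1000 ResNet50 classifier output (adjust shapes to your model):

import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("resnet50_fp16.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Host buffers: one preprocessed image in, class scores out.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a real image
y = np.empty((1, 1000), dtype=np.float32)

# Device buffers and synchronous execution.
d_in, d_out = cuda.mem_alloc(x.nbytes), cuda.mem_alloc(y.nbytes)
cuda.memcpy_htod(d_in, np.ascontiguousarray(x))
context.execute_v2([int(d_in), int(d_out)])
cuda.memcpy_dtoh(y, d_out)
print("top-1 class:", int(y.argmax()))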

Example 2: INT8 engine using an existing calibration cache

Goal: Use INT8 for more speed. Assumes you already created a calibration cache file with representative data.

trtexec \
  --onnx=model.onnx \
  --int8 --fp16 \
  --calib=calibration.cache \
  --shapes=input:1x3x640x640 \
  --workspace=6144 \
  --saveEngine=model_int8.plan \
  --avgRuns=200 --warmUp=20

Notes:

  • If your ONNX has Q/DQ nodes (quantization-aware), calibration is not required.
  • INT8 needs careful validation; expect tiny numeric differences but similar task-level accuracy.
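The calibration cache itself is usually produced by a small Python calibrator that feeds representative, preprocessed batches during an INT8 build. A minimal sketch, assuming TensorRT 8.x and pycuda; batches is a list of NCHW float32 arrays you prepare:

import os
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches to TensorRT and caches the resulting scales."""

    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)                  # preprocessed NCHW float32 arrays
        self.cache_file = cache_file
        self.d_input = cuda.mem_alloc(batches[0].nbytes)
        self.batch_size = batches[0].shape[0]

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches))
        except StopIteration:
            return None                               # no more data: calibration ends
        cuda.memcpy_htod(self.d_input, batch)
        return [int(self.d_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            return open(self.cache_file, "rb").read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

Attach it to a builder config with config.set_flag(trt.BuilderFlag.INT8) and config.int8_calibrator = EntropyCalibrator(batches); the cache it writes is what --calib points at in the command above.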

Example 3: Dynamic shapes and profiles for object detection

Goal: Support multiple resolutions and small batches efficiently.

trtexec \
  --onnx=detector.onnx \
  --fp16 \
  --minShapes=input:1x3x320x320 \
  --optShapes=input:1x3x640x640 \
  --maxShapes=input:4x3x1280x1280 \
  --workspace=8192 \
  --saveEngine=detector_fp16_dynamic.plan

At runtime, set the input shape on the execution context before each inference; it must fall within the chosen profile's min/max range (shapes outside the range are rejected). Use batch=1 for latency-critical paths; batch>1 for higher throughput.
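A short sketch of both steps with the Python API, assuming the builder, config, and context objects from the earlier sketches and an input tensor named input (set_input_shape is the TensorRT 8.5+ name; older releases use set_binding_shape):

# Build time: register one min/opt/max optimization profile.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 320, 320), (1, 3, 640, 640), (4, 3, 1280, 1280))
config.add_optimization_profile(profile)

# Runtime: declare the concrete shape of this request before executing.
context.set_input_shape("input", (1, 3, 640, 640))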

Performance levers that matter

  • Precision: Start with FP16; use INT8 if accuracy holds after calibration/quantization.
  • Workspace size: Larger workspace lets TensorRT pick faster tactics (uses more GPU memory).
  • Optimization profiles: Choose realistic min/opt/max shapes; keep ranges tight to get better kernels.
  • Layer fusion: Comes for free when possible; reduces memory traffic and kernel launches.
  • Streams and batching: More streams/batch for throughput; single-stream, batch=1 for lowest latency.
  • Pre/post-processing: Move to GPU if possible to avoid CPU-GPU copies.
  • Target hardware: Engines are tuned to GPU architecture; rebuild when changing GPU families.

Validate correctness and performance

  • Accuracy parity: Compare outputs between the original framework and TensorRT within a small tolerance (e.g., mean abs diff < 1e-3 for FP16); a parity/benchmark sketch follows this list.
  • Task metrics: For detection/segmentation, compare mAP/IoU on a validation subset.
  • Latency: Measure p50/p95 latency with warm-up runs; include end-to-end time.
  • Throughput: Report in FPS or QPS at steady-state.
  • Stability: Check memory usage over time and catch any OOMs.
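A minimal sketch of the parity and latency checks in Python, assuming you already have two callables, run_onnx and run_trt, that take the same preprocessed numpy input and return numpy outputs (both names are placeholders):

import time
import numpy as np

def check_parity(run_onnx, run_trt, x, tol=1e-3):
    # Compare the framework/ONNX output against the TensorRT engine output.
    ref, out = run_onnx(x), run_trt(x)
    diff = np.abs(ref - out).mean()
    print(f"mean abs diff: {diff:.2e} (tolerance {tol})")
    return diff < tol

def benchmark(run_trt, x, warmup=20, iters=200):
    # Report p50/p95 latency (ms) and steady-state throughput after warm-up.
    for _ in range(warmup):
        run_trt(x)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_trt(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    p50, p95 = np.percentile(times, [50, 95])
    print(f"p50 {p50:.2f} ms | p95 {p95:.2f} ms | {1000.0 / np.mean(times):.1f} FPS")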

Who this is for

  • Computer Vision Engineers deploying models on NVIDIA GPUs or Jetson devices.
  • ML Engineers optimizing inference latency and throughput.

Prerequisites

  • Basic GPU concepts (CUDA cores, memory limits).
  • Ability to export models to ONNX (PyTorch or TensorFlow).
  • Comfort with command line and simple Python for I/O.

Learning path

Step 1 — Export and sanity check ONNX
  • Export the model with the correct opset and input names (an export sketch follows this list).
  • Run a few test inputs through ONNX Runtime to confirm equivalence.
Step 2 — Build FP16 engine
  • Use trtexec to produce an FP16 engine.
  • Benchmark and note latency/throughput gains.
Step 3 — Add dynamic shapes and profiles
  • Define realistic min/opt/max shapes.
  • Rebuild and test with varied inputs.
Step 4 — Explore INT8
  • Use Q/DQ models or a calibration cache built with representative data.
  • Re-check accuracy and only keep INT8 if metrics meet your bar.
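For Step 1, a minimal export-and-check sketch, assuming a PyTorch model (torchvision's pretrained ResNet50 is used as a stand-in) and that the opset, tensor names, and dynamic batch axis below match what you want to deploy:

import numpy as np
import torch
import onnxruntime as ort
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)

# Export with explicit input/output names and a dynamic batch dimension.
torch.onnx.export(model, dummy, "resnet50.onnx",
                  opset_version=17,
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})

# Sanity check: ONNX Runtime output should match PyTorch within a small tolerance.
with torch.no_grad():
    ref = model(dummy).numpy()
sess = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": dummy.numpy()})[0]
print("max abs diff:", float(np.abs(ref - out).max()))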

Exercises

Work through the exercise below, then take the quick test at the end.

  1. Exercise 1: Build and benchmark a TensorRT FP16 engine

    Using an ONNX classification model, build a TensorRT FP16 engine with trtexec, then collect average latency and throughput.

    Instructions
    1. Confirm the ONNX input name and set shape to 1x3x224x224 (or your model's real input).
    2. Run:
      trtexec --onnx=model.onnx --fp16 --shapes=input:1x3x224x224 --workspace=4096 --saveEngine=model_fp16.plan --avgRuns=200 --warmUp=20
    3. Record the reported latency and throughput.

    Expected output: Console report including average latency (ms) and throughput (FPS/QPS).

    Hints
    • If you see an input mismatch error, verify the exact input tensor name in your ONNX graph.
    • Increase --workspace if performance is poor and you have spare GPU memory.
    • Use --iterations or --avgRuns and --warmUp for stable measurements.
    Solution

    A typical successful run looks like:

    trtexec --onnx=model.onnx --fp16 --shapes=input:1x3x224x224 --workspace=4096 --saveEngine=model_fp16.plan --avgRuns=200 --warmUp=20
    [... build log ...]
    Throughput: 1450 qps
    Latency: min = 0.55 ms, max = 1.10 ms, mean = 0.69 ms, median = 0.68 ms
    [... more stats ...]

    If your numbers differ widely from FP32, confirm that the same preprocessing and input shapes are used.

Exercise checklist

  • [ ] Engine file (.plan) created without errors.
  • [ ] Mean latency recorded after warm-up.
  • [ ] Throughput recorded and compared to baseline.
  • [ ] Notes on workspace and precision choices saved.

Common mistakes and self-check

  • Wrong input names or shapes: Self-check by printing the model inputs from ONNX and matching them in trtexec (see the snippet after this list).
  • Overly wide dynamic ranges: Huge min/max shapes can reduce kernel quality. Tighten profiles to realistic values.
  • INT8 without proper data: Calibration with non-representative data hurts accuracy. Use a representative sample (coverage of lighting, classes, sizes).
  • Ignoring pre/post-processing time: Measure end-to-end, not just model time.
  • Comparing different precisions unfairly: Use tolerances when comparing FP32 vs FP16 vs INT8.
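For the first self-check, printing the graph inputs takes a few lines, assuming the onnx Python package is installed:

import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)   # use this exact name in trtexec --shapes=<name>:NxCxHxW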

Practical projects

  • Real-time camera classification: Build FP16 and INT8 engines, measure latency per resolution (224, 320, 480).
  • Object detection microservice: Add dynamic shape support and autoswitch between batch=1 (low latency) and batch=4 (throughput).
  • Edge deployment trial: On a Jetson device, compare GPU vs DLA (if available) for a quantized model and report power vs FPS.

Mini challenge

Reduce p95 latency by 20% for a detection model at 640x640. You may adjust precision (FP16/INT8), workspace, and optimization profiles. Report trade-offs (accuracy, memory, throughput).

Next steps

  • Automate engine builds in CI to produce per-GPU artifacts.
  • Add accuracy regression checks comparing framework vs TensorRT outputs.
  • Prepare for serving at scale with a production inference server and health checks.

Acceleration With TensorRT Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

