Explicit quantization is slower than implicit quantization and produces invalid results #4366

itmo153277 opened this issue Feb 24, 2025
Description

Since implicit quantization is deprecated, I started migrating my model pipeline to explicit quantization.
However, I encountered some issues:

  1. Different behaviour with concat:

With implicit quantization the graph looks like this:

A(fp16:linear) -> Concat
B(fp16:linear) -> Concat
C(fp16:linear) -> Concat -> Quantize+Reformat -> Conv

Essentially, the concat is replaced with a simple copy, since all inputs are aligned.

However, when I use explicit quantization the graph becomes like this:

A(fp16:linear) -> Quantize -> Concat
B(fp16:linear) -> Quantize -> Concat
C(fp16:linear) -> Quantize -> Concat -> Reformat -> Conv

TRT swapped the order of Quantize and Concat, which results in a suboptimal graph that is ~30% slower. No matter what I tried, I could not reproduce the implicit-quantization plan with the explicitly quantized model.
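
For reference, here is a minimal sketch of the placement I am describing, built with the TensorRT Python API (shapes and the 0.1 scale are placeholders, not values from my real model). Even with a single Q/DQ pair after the Concat, the builder still pulls the Quantize above it:

```python
# Minimal sketch of the Q/DQ placement around Concat (placeholder
# shapes and scale; explicit batch is the default in TRT 10).
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

# Three FP16 branches feeding the concat (stand-ins for A, B, C).
inputs = [
    network.add_input(f"in{i}", trt.float16, (1, 16, 32, 32))
    for i in range(3)
]
concat = network.add_concatenation(inputs)
concat.axis = 1  # channel axis

# A single Q/DQ pair *after* the concat, so all inputs share one scale.
scale = network.add_constant((), np.array([0.1], dtype=np.float32))
q = network.add_quantize(concat.get_output(0), scale.get_output(0))
dq = network.add_dequantize(q.get_output(0), scale.get_output(0))
# ... the INT8 convolution consumes dq.get_output(0) ...
```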

  2. Q/DQ placement with ConvTranspose:

With implicit quantization, TRT is able to fuse ConvTranspose and the activation. According to all recommendations, Q/DQ nodes should be placed like this:

input -> Q -> DQ -> ConvTranspose -> Activation -> Q -> DQ -> output

However, when I try this placement, TRT fails to fuse ConvTranspose with the activation, and this produces invalid output. I am forced to place the nodes like this:

input -> Q -> DQ -> ConvTranspose -> Q -> DQ -> Activation -> Q -> DQ -> output
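
A minimal sketch of the recommended pattern I tried, again with the TensorRT Python API (placeholder shapes, scale, and weights; in the real model the ConvTranspose weights carry their own Q/DQ pair). With this graph, the builder does not fuse the deconvolution and the activation:

```python
# Sketch of the recommended placement: input -> Q -> DQ -> ConvTranspose
# -> Activation -> Q -> DQ. Shapes, scale, and weights are placeholders.
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

inp = network.add_input("input", trt.float16, (1, 16, 32, 32))
scale = network.add_constant((), np.array([0.1], dtype=np.float32))

# input -> Q -> DQ
q_in = network.add_quantize(inp, scale.get_output(0))
dq_in = network.add_dequantize(q_in.get_output(0), scale.get_output(0))

# ConvTranspose; deconvolution weights are laid out (C_in, C_out, kH, kW).
kernel = np.random.randn(16, 8, 2, 2).astype(np.float32)
deconv = network.add_deconvolution_nd(
    dq_in.get_output(0), 8, (2, 2), trt.Weights(kernel), trt.Weights())
deconv.stride_nd = (2, 2)

# Activation directly after ConvTranspose, then the output Q/DQ pair.
act = network.add_activation(deconv.get_output(0), trt.ActivationType.RELU)
q_out = network.add_quantize(act.get_output(0), scale.get_output(0))
dq_out = network.add_dequantize(q_out.get_output(0), scale.get_output(0))
network.mark_output(dq_out.get_output(0))
```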

  3. Explicitly quantized convolutions are slower than implicitly quantized ones

I get consistently slower profiling results with the explicitly quantized model (~5%), and it seems to mostly come down to tactic selection. Algorithm selectors are deprecated, and I cannot figure out how to use the editable timing cache for CaskConvolution nodes because no cache keys appear in the verbose logs.
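
For completeness, this is roughly how I set up the cache (a sketch; `attach_timing_cache` and the file name are mine, and I am assuming `BuilderFlag.EDITABLE_TIMING_CACHE` from the 10.x docs is the right flag for the editable-cache workflow):

```python
# How I attach the timing cache; `config` is my existing IBuilderConfig.
import tensorrt as trt

def attach_timing_cache(config: trt.IBuilderConfig,
                        path: str = "timing.cache") -> trt.ITimingCache:
    """Create (or reload) a timing cache and attach it to the config."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        blob = b""  # start with an empty cache on the first build
    cache = config.create_timing_cache(blob)
    config.set_timing_cache(cache, ignore_mismatch=False)
    # Editable-cache flag from the TRT 10 docs; even with this set,
    # the verbose log never shows cache keys for CaskConvolution nodes.
    config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)
    return cache
```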

Additional issue: since my network uses FP16 inputs, I expect TRT to use them directly without any reformats. However, without the DIRECT_IO flag, TRT always converts FP16 to FP32 first and then back to FP16. DIRECT_IO is deprecated; what should I use as an alternative?
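
For reference, this is what I already do on the I/O tensors (a sketch; the helper name is mine). Even with this, the builder still inserts the FP16 -> FP32 -> FP16 round trip unless DIRECT_IO is set:

```python
# What I currently do to request FP16 linear I/O; `network` is the
# parsed INetworkDefinition, and pin_fp16_io is just a helper of mine.
import tensorrt as trt

def pin_fp16_io(network: trt.INetworkDefinition) -> None:
    """Force FP16 dtype and linear format on all network inputs/outputs."""
    tensors = [network.get_input(i) for i in range(network.num_inputs)]
    tensors += [network.get_output(i) for i in range(network.num_outputs)]
    for t in tensors:
        t.dtype = trt.float16
        t.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
```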

Environment

TensorRT Version: 10.8.0.43

NVIDIA GPU: RTX 3060 LT

NVIDIA Driver Version: 572.47

CUDA Version: 12.8.0

CUDNN Version: 9.7.1.26

Operating System: Windows 11

Relevant Files

Data

Scripts
