Description
Since implicit quantization is deprecated, I started migrating my model pipeline to explicit quantization.
However, I encountered some issues:
Quantize and Concat ordering

With implicit quantization, the graph looks like this:
Essentially, the Concat is replaced with a plain copy, since all of its inputs are aligned.
However, when I use explicit quantization, the graph becomes this:
TRT swapped the order of Quantize and Concat, and the resulting graph is suboptimal, roughly 30% slower. No matter what I tried, I was not able to reproduce the implicit-quantization plan with the explicitly quantized model.
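For illustration, here is a minimal sketch of one commonly suggested prerequisite for a quantized Concat to become a plain copy: every Q/DQ pair feeding the Concat must use the same scale. This is not a guaranteed fix for the reordering above; it assumes per-tensor scales with symmetric (zero) zero-points, and the file names are placeholders:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_qdq.onnx"))  # placeholder file name

for node in graph.nodes:
    if node.op != "Concat":
        continue
    # Find the DequantizeLinear producer of each Concat input, if any.
    dqs = [t.inputs[0] for t in node.inputs
           if t.inputs and t.inputs[0].op == "DequantizeLinear"]
    if len(dqs) != len(node.inputs):
        continue  # not every input is quantized; leave this Concat alone
    # Pick the largest input scale so no branch loses dynamic range.
    shared = max(float(dq.inputs[1].values) for dq in dqs)
    for dq in dqs:
        scale = gs.Constant(dq.name + "_scale", np.array(shared, np.float32))
        dq.inputs[1] = scale
        # Keep the upstream QuantizeLinear consistent with its DQ partner.
        q_var = dq.inputs[0]
        if q_var.inputs and q_var.inputs[0].op == "QuantizeLinear":
            q_var.inputs[0].inputs[1] = scale

onnx.save(gs.export_onnx(graph), "model_qdq_aligned.onnx")
```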
Q/DQ placement with ConvTranspose

With implicit quantization, TRT is able to fuse ConvTranspose and its activation, and according to all the recommendations, the Q/DQ nodes should be placed like this:
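Since the original screenshot is not included here, the following sketch spells out that recommended pattern: Q/DQ on the ConvTranspose input and weights, the ConvTranspose -> activation pair left contiguous, and re-quantization only after the activation. All tensor names are illustrative, and Relu stands in for whatever activation the real model uses:

```python
from onnx import helper

qdq_convtranspose_act = [
    helper.make_node("QuantizeLinear",   ["x", "x_scale", "x_zp"], ["x_q"]),
    helper.make_node("DequantizeLinear", ["x_q", "x_scale", "x_zp"], ["x_dq"]),
    helper.make_node("QuantizeLinear",   ["w", "w_scale", "w_zp"], ["w_q"]),
    helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_dq"]),
    # No Q/DQ between ConvTranspose and Relu: inserting a pair here is
    # what breaks the ConvTranspose + activation fusion.
    helper.make_node("ConvTranspose", ["x_dq", "w_dq"], ["y"]),
    helper.make_node("Relu", ["y"], ["y_act"]),
    # Quantize again only after the activation.
    helper.make_node("QuantizeLinear",   ["y_act", "y_scale", "y_zp"], ["y_q"]),
    helper.make_node("DequantizeLinear", ["y_q", "y_scale", "y_zp"], ["y_dq"]),
]
```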
However, when I try this placement, TRT fails to fuse ConvTranspose with the activation, and the result is invalid output. I am forced to do it like this instead:
Explicitly quantized convolutions are slower than implicitly quantized ones

I consistently get slower profiling results with the explicitly quantized model (~5%), and it seems to come down mostly to tactic selection. Algorithm selectors are deprecated, and I cannot figure out how to use the editable timing cache for the CaskConvolution nodes, because there are no cache keys at all in the verbose logs.
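For reference, the baseline (non-editable) timing-cache flow is sketched below; the editable timing cache in TRT 10 builds on the same IBuilderConfig APIs. This does not answer the cache-key question, and the file name is a placeholder:

```python
import os
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(0)
config = builder.create_builder_config()

# Reuse a previously serialized cache when available, else start empty.
blob = b""
if os.path.exists("timing.cache"):
    with open("timing.cache", "rb") as f:
        blob = f.read()
cache = config.create_timing_cache(blob)
config.set_timing_cache(cache, ignore_mismatch=False)

# ... populate `network` (e.g. via the ONNX parser), then build:
# engine_bytes = builder.build_serialized_network(network, config)

# Persist the measured tactic timings for the next build.
with open("timing.cache", "wb") as f:
    f.write(cache.serialize())
```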
An additional issue: since my network uses FP16 inputs, I expect TRT to consume them directly without any reformats. However, without the DIRECT_IO flag, TRT always converts FP16 to FP32 and then back to FP16. DIRECT_IO is deprecated; what should I use as an alternative?
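A sketch of the commonly suggested replacement: declare the network's I/O tensors as FP16 so the builder does not insert FP32 reformats at the boundaries. TRT 10's strongly typed networks are a stricter alternative. This assumes `network` is populated from the real model; nothing here is specific to my graph:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# Stricter alternative: honor every declared type, no precision autotuning.
# network = builder.create_network(
#     1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

# ... parse the ONNX model into `network` here ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Pin the I/O dtypes in place of the deprecated DIRECT_IO flag.
for i in range(network.num_inputs):
    network.get_input(i).dtype = trt.float16
for i in range(network.num_outputs):
    network.get_output(i).dtype = trt.float16
```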
Environment
TensorRT Version: 10.8.0.43
NVIDIA GPU: RTX 3060 LT
NVIDIA Driver Version: 572.47
CUDA Version: 12.8.0
CUDNN Version: 9.7.1.26
Operating System: Windows 11
Relevant Files
Data
Scripts