UNet results wrong with TensorRT 10.x when running on GPU L40S #4351
Comments
I used Polygraphy to debug the layer precision of the TensorRT model, and I ran into another problem: with a second command, the results changed again. Why is there a result difference under the same tool and the same model?
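For reference, a typical Polygraphy comparison of TensorRT against ONNX-Runtime looks like the sketch below; the model path and tolerances are placeholders, not the exact commands used in this report:

```bash
# Build and run the model with both TensorRT and ONNX-Runtime, then compare
# the outputs within the given absolute/relative tolerances:
polygraphy run model.onnx --trt --onnxrt --atol 1e-3 --rtol 1e-3
```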
By debugging the model line by line against the original code, I finally found that all the weird results came from the code below.
I guess the q/k tensors get fused in the TensorRT engine, and the fused node causes the problem (I suspect there are bugs there). So I routed q/k to the model outputs through a small overhead operation (computing their mean values and returning those), so that the q/k tensors won't be fused by TensorRT. With that change the model outputs are consistent with ONNX/Torch, at the cost of a 3% performance drop. Now I am looking for another way to mark the q/k tensors so that they won't be fused in the TensorRT engine and won't introduce additional operations.
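For illustration, here is a minimal sketch of this workaround, assuming a standard scaled-dot-product attention block; all names (Attention, q_proj, ...) are hypothetical, since the actual UNet code was not included in the issue:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    # A stand-in for the UNet's real attention block.
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        out = self.out_proj(F.scaled_dot_product_attention(q, k, v))
        # Workaround: return a cheap reduction of q/k as extra graph outputs
        # so TensorRT cannot fuse the q/k tensors away. These extra mean ops
        # are the small overhead mentioned above.
        return out, q.mean(), k.mean()

# The extra tensors must also be named as outputs when exporting to ONNX:
model = Attention(dim=320).eval()
x = torch.randn(1, 64, 320)
torch.onnx.export(
    model, (x,), "attn.onnx", opset_version=17,
    input_names=["x"], output_names=["out", "q_mean", "k_mean"],
)
```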
Description
I'm trying to convert a UNet (model size 1.9 GB, FP32) exported with opset=17 to TensorRT.
With TRT 8.6 the results were correct, but performance dropped by 15%.
With TRT 10.0.0/10.5/10.8 the results were NaN.
Is there some high-level optimization (e.g., op fusion) introduced in TRT 10.x that may cause the NaN results?
How can I debug the inference procedure so that I can fix my PyTorch model and run it correctly on TRT 10.x?
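One way to localize where the NaNs first appear is a per-layer comparison with Polygraphy; the commands below are a sketch using standard Polygraphy options, not commands from this report:

```bash
# First confirm the ONNX model itself produces no NaN/Inf under ONNX-Runtime:
polygraphy run model.onnx --onnxrt --validate

# Then compare TensorRT against ONNX-Runtime with every tensor marked as an
# output. Marking all outputs also disables many fusions, so if the NaNs
# disappear here, a fusion is likely involved (consistent with the q/k
# observation above).
polygraphy run model.onnx --trt --onnxrt \
    --trt-outputs mark all --onnx-outputs mark all
```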
Environment
TensorRT Version: 8.6, 10.0, 10.5, 10.8
NVIDIA GPU: L40S
NVIDIA Driver Version: 535.161.08
CUDA Version: 12.2
CUDNN Version: 8.4
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):