
Nccl ops correction changes #3387

Open · wants to merge 7 commits into main from nccl_ops_additional

Conversation

@apbose (Collaborator) commented Feb 10, 2025

No description provided.

@apbose apbose requested a review from narendasan February 10, 2025 19:16
@github-actions github-actions bot added the component: lowering, component: conversion, component: api [Python], component: runtime, component: dynamo, and component: torch_compile labels on Feb 10, 2025
Comment on lines 68 to 76
# transpose key deleted since not desirable to lower it to permute
to_delete = {
    key
    for key in settings_aot_autograd["decompositions"]
    if "detach" in key._name or "transpose" in key._name
}

for key in to_delete:
    del settings_aot_autograd["decompositions"][key]
Collaborator:

We have a remove_detach lowering pass. Can that help here?

Collaborator Author:

That does not help here, because the graph does not explicitly contain detach ops for the pass to remove. Instead, detach is introduced through the decomposition at https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2153. This might be due to %hook_result_3 = call_function[target=torch._dynamo.variables.tensor.prim_to_local](args = (%outputs_3,), kwargs = {}), where the DTensor is converted to a local tensor and the distributed operation may need to be detached.

if aten.detach in settings_aot_autograd["decompositions"]:
    del settings_aot_autograd["decompositions"][aten.detach]
# transpose key deleted since not desirable to lower it to permute
to_delete = {
Collaborator:

Will this apply to all cases, not just NCCL?

Collaborator Author:

You mean in the non-distributed example? I am not sure about that. I added this for the llama3 example because I was seeing issues in model lowering: it was generating graph breaks at the wrong places, leading to a complex-input error. It could be applied to all cases if we decide not to lower transpose to permute.

@apbose (Collaborator Author) commented Mar 3, 2025

Regarding the discussion:

  1. detach: remove_detach does not help since the graph does not explicitly contain detach ops for the pass to remove. Instead, detach is introduced through the decomposition at https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2153. This might be due to %hook_result_3 = call_function[target=torch._dynamo.variables.tensor.prim_to_local](args = (%outputs_3,), kwargs = {}), which moves the DTensor to a local tensor and needs to detach the distributed operation. This occurs in tensor_parallel_simple_example.py.

  2. transpose: the transpose change mainly targets tensor_parallel_llama3.py. The broad modifications I made to handle the complex numbers are:
    a. Modifying the placeholder node shape and type
    b. Modifying the inputs to the reshape and slice ops that take complex inputs
    c. Replacing the complex TensorRT mul
    If I decompose transpose to permute, the output of the complex tensor mul in the gpu_0 graph is complex64 or complex128, which then goes as input to the acc_* graph and causes the complex-input error. Keeping transpose in the graph confines the handling of the complex input to the gpu_0 graph partition.

Regarding whether removing transpose from the decompositions would affect the result: I would think not, since we are not removing an op (as with detach); we are simply not lowering it to permute. But please share more insights if you think otherwise.

  3. If we want this to be model-specific rather than applied to all models, I think the next step could be to expose it either in the UI or in something like what we do for the non-distributed models, i.e. the torch_disabled_decomposition dictionary, which applies to all models. I mention the UI because we are talking about model-specific disabled decompositions. For now, since this part of the code applies only to the distributed models, it should be good to go. (A minimal sketch of the filtering idea follows below.)
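
A minimal sketch of the decomposition-filtering idea above, under stated assumptions: disabled_op_names is a hypothetical per-model exclusion set (not an existing Torch-TensorRT setting), and the decomposition table here stands in for settings_aot_autograd["decompositions"] from this PR.

import torch
from torch._decomp import core_aten_decompositions

# Hypothetical per-model exclusion set; "detach"/"transpose" mirror this PR.
disabled_op_names = {"detach", "transpose"}

settings_aot_autograd = {"decompositions": dict(core_aten_decompositions())}

# Drop any decomposition whose op name matches an excluded entry, so those
# ops are left intact instead of being decomposed (e.g. transpose -> permute).
to_delete = {
    op
    for op in settings_aot_autograd["decompositions"]
    if any(name in op._name for name in disabled_op_names)
}
for op in to_delete:
    del settings_aot_autograd["decompositions"][op]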

@github-actions github-actions bot added the component: tests and component: converters labels on Feb 28, 2025
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 04:08:09.204507+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 04:08:34.494240+00:00
@@ -23,11 +23,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -48,11 +48,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
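
For context on black's suggestion above: in Python, (8) without a trailing comma is not a tuple but simply the integer 8, so [(8)] and [8] pass the same single parameter to the test; only (8,) would create a one-element tuple. The change is therefore purely cosmetic. A quick illustration (these lines are illustrative, not code from this PR):

# (8) is just the int 8; the redundant parentheses are what black flags here.
assert (8) == 8 and isinstance((8), int)
assert (8,) != 8 and isinstance((8,), tuple)  # only the trailing comma makes a tuple

# Both spellings hand the same parameter to the test method:
# @parameterized.expand([(8)])  ->  test runs with linear_layer_dim == 8
# @parameterized.expand([8])    ->  test runs with linear_layer_dim == 8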

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:22:33.272875+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:23:00.029705+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose apbose force-pushed the nccl_ops_additional branch from 091c83f to 67b970e on February 28, 2025 19:24
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:24:59.719676+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:25:27.751500+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@@ -364,6 +364,15 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, .
    (i.contiguous() if isinstance(i, torch.Tensor) else torch.tensor(i).cuda())
    for i in inputs
]

for i, contiguous_input in enumerate(contiguous_inputs):
Collaborator:

Does the C++ API not need these changes?

Collaborator Author:

I am not clear on this aspect. Could you please let me know what would be required here? I am running the distributed Python example and did not encounter a need for this.

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(port)
os.environ["TRTLLM_PLUGINS_PATH"] = (
    find_repo_root() + "/lib/libnvinfer_plugin_tensorrt_llm.so"
Collaborator:

Assume a tmp path instead of an arbitrary location in the source tree.
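
A minimal sketch of that suggestion, assuming the test only needs some writable location rather than a path inside the source tree (the temporary-directory layout is illustrative; the .so name follows the snippet above):

import os
import tempfile

# Stage the plugin under a temporary directory instead of the repo tree.
plugin_dir = tempfile.mkdtemp(prefix="trtllm_plugins_")
os.environ["TRTLLM_PLUGINS_PATH"] = os.path.join(
    plugin_dir, "libnvinfer_plugin_tensorrt_llm.so"
)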

@@ -351,6 +351,7 @@ def generate_graph(
    enable_passes: bool,
    propagate_shapes: bool = False,
    settings: CompilationSettings = CompilationSettings(),
    fuse_distributed_ops: bool = False,
Collaborator:

Why do we need this setting in the harness? Since it only applies to distributed, you should: 1. have a test for the pass specifically, 2. have a test for converting the custom op without a pass, and 3. have a test that applies the pass and then converts the custom op.

@apbose (Collaborator Author) commented Mar 3, 2025

I had the same doubt, actually.

  1. I can look into adding a test for the pass specifically.
  2. Without the pass, it is not possible to encounter the custom op. The two calls

gathered_tensor = torch.ops._c10d_functional.all_gather_into_tensor(x, world_size, group_name)
gathered_tensor = torch.ops._c10d_functional.wait_tensor(gathered_tensor)

together form the nccl_ops pattern, and the pass is required for that.

  3. That is what is done right now. Maybe applying the pass in the harness is not the right place; where would you suggest it go? Since we want to test only the converters, couldn't the harness have a specific option such as distributed=True?

Collaborator Author:

Running `mpirun -n 1 --allow-run-as-root python test_distributed_simple_example.py` ideally covers all of it.

Collaborator Author:

Actually, just setting enable_passes=True works in the converter test. Earlier, constant folding was giving an error, but that is resolved now.
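
For reference, a minimal sketch of what the converter test looks like with the passes enabled, following the diffs in this PR. The model body, input shape, and process-group lookup are illustrative rather than the exact code in the PR, and an initialized default process group is assumed.

import torch
import torch.distributed as dist
import torch.nn as nn
from parameterized import parameterized

from conversion.harness import DispatchTestCase


class DistributedGatherModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = torch.nn.Linear(input_dim, input_dim)

    def forward(self, x):
        x = self.fc(x)
        world_size = dist.get_world_size()
        group_name = dist.group.WORLD.group_name  # illustrative group lookup
        gathered = torch.ops._c10d_functional.all_gather_into_tensor(
            x, world_size, group_name
        )
        # wait_tensor makes the functional collective's result usable
        return torch.ops._c10d_functional.wait_tensor(gathered)


class TestGatherNcclOpsConverter(DispatchTestCase):
    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        inputs = [torch.randn(1, linear_layer_dim).to("cuda")]
        self.run_test(
            DistributedGatherModel(linear_layer_dim).cuda(),
            inputs,
            use_dynamo_tracer=True,
            enable_passes=True,  # the lowering passes fuse the collective ops
        )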


OS="$(uname -s)"
ARCH="$(uname -m)"
PYTHON_VERSION="$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')"
Collaborator:

Why does it matter what Python version the system is using? Can we just use the same .so for all Python versions?

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 16:25:48.020598+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 16:26:14.671617+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose (Collaborator Author) commented Mar 3, 2025

I am not sure why lint is showing this:

Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install "black[jupyter]"``
would reformat /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py

Oh no! 💥 💔 💥
1 file would be reformatted, 595 files would be left unchanged.
Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install "black[jupyter]"``
would reformat /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py

All done! ✨ 🍰 ✨
1 file would be reformatted, 595 files would be left unchanged.

I see no error in the local pre-commit run.

@apbose apbose force-pushed the nccl_ops_additional branch from 0f6966f to ca8478a on March 3, 2025 17:06
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 17:06:44.889577+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 17:07:10.979519+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose apbose force-pushed the nccl_ops_additional branch from ca8478a to 4b79bfb on March 3, 2025 18:14
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 18:15:05.376989+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 18:15:31.109367+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            enable_passes=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

Labels: cla signed, component: api [Python], component: conversion, component: converters, component: dynamo, component: lowering, component: runtime, component: tests, component: torch_compile

4 participants