
Nccl ops correction changes #3387

Open · wants to merge 7 commits into main from nccl_ops_additional

Conversation

@apbose (Collaborator) commented Feb 10, 2025

No description provided.

@apbose apbose requested a review from narendasan February 10, 2025 19:16
@github-actions github-actions bot added the component: lowering, component: conversion, component: api [Python], component: runtime, component: dynamo, and component: torch_compile labels on Feb 10, 2025
Comment on lines 68 to 76
# transpose key deleted since not desirable to lower it to permute
to_delete = {
    key
    for key in settings_aot_autograd["decompositions"]
    if "detach" in key._name or "transpose" in key._name
}

for key in to_delete:
    del settings_aot_autograd["decompositions"][key]
Collaborator:

We have a remove_detach lowering pass. Can that help here?

Collaborator Author:

That does not help here, because the graph does not explicitly contain detach ops for the pass to remove. Instead, detach is introduced through the decomposition at https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2153. This might be due to %hook_result_3 = call_function[target=torch._dynamo.variables.tensor.prim_to_local](args = (%outputs_3,), kwargs = {}), where the DTensor is converted to a local tensor and the distributed operation may need to be detached.

if aten.detach in settings_aot_autograd["decompositions"]:
    del settings_aot_autograd["decompositions"][aten.detach]
# transpose key deleted since not desirable to lower it to permute
to_delete = {
Collaborator:

Will this apply to all cases, not just NCCL?

Collaborator Author:

You mean in the non-distributed example? I am not sure about that. I added this for the llama3 example because I was seeing issues in model lowering: it was generating graph breaks at the wrong places, leading to a complex-input error. It could be applied to all cases if we decide not to lower transpose to permute.

@apbose (Collaborator Author) commented Mar 3, 2025

Regarding the discussion:

  1. detach: remove_detach does not help since the graph does not explicitly contain detach ops for the pass to remove. Instead, detach is introduced through the decomposition at https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2153. This might be due to %hook_result_3 = call_function[target=torch._dynamo.variables.tensor.prim_to_local](args = (%outputs_3,), kwargs = {}), which moves the DTensor to a local tensor and needs to detach the distributed operation. This occurs in tensor_parallel_simple_example.py.

  2. transpose: the transpose change mainly targets tensor_parallel_llama3.py. The broad modifications I made to handle the complex numbers are:
    a. Modifying the placeholder node shape and type
    b. Modifying the inputs to the reshape and slice ops that take complex inputs
    c. Replacing the complex TensorRT mul
    If I decompose transpose to permute, the output of the complex tensor mul in the gpu_0 graph is complex64 or complex128, which then goes as input to the acc_* graph and causes the complex-input error. Keeping transpose in the graph confines the handling of the complex input to the gpu_0 graph partition.

Regarding whether removing transpose from the decompositions would affect the result: I would think not, since we are not removing an op (as with detach); we are simply not lowering it to permute. But please share more insights if you think otherwise.

  3. If we want this to be model-specific rather than applied to all models, I think the next step could be to expose it either in the UI or in something like what we do for the non-distributed models, i.e. the torch_disabled_decomposition dictionary, which applies to all models. I mention the UI because we are talking about model-specific disabled decompositions. For now, since this part of the code applies only to the distributed models, it should be good to go. (A minimal sketch of the filtering idea follows below.)
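
A minimal sketch of the decomposition-filtering idea above, under stated assumptions: disabled_op_names is a hypothetical per-model exclusion set (not an existing Torch-TensorRT setting), and the decomposition table here stands in for settings_aot_autograd["decompositions"] from this PR.

import torch
from torch._decomp import core_aten_decompositions

# Hypothetical per-model exclusion set; "detach"/"transpose" mirror this PR.
disabled_op_names = {"detach", "transpose"}

settings_aot_autograd = {"decompositions": dict(core_aten_decompositions())}

# Drop any decomposition whose op name matches an excluded entry, so those
# ops are left intact instead of being decomposed (e.g. transpose -> permute).
to_delete = {
    op
    for op in settings_aot_autograd["decompositions"]
    if any(name in op._name for name in disabled_op_names)
}
for op in to_delete:
    del settings_aot_autograd["decompositions"][op]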

@github-actions github-actions bot added the component: tests and component: converters labels on Feb 28, 2025
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 04:08:09.204507+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 04:08:34.494240+00:00
@@ -23,11 +23,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -48,11 +48,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
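
For context on black's suggestion above: in Python, (8) without a trailing comma is not a tuple but simply the integer 8, so [(8)] and [8] pass the same single parameter to the test; only (8,) would create a one-element tuple. The change is therefore purely cosmetic. A quick illustration (these lines are illustrative, not code from this PR):

# (8) is just the int 8; the redundant parentheses are what black flags here.
assert (8) == 8 and isinstance((8), int)
assert (8,) != 8 and isinstance((8,), tuple)  # only the trailing comma makes a tuple

# Both spellings hand the same parameter to the test method:
# @parameterized.expand([(8)])  ->  test runs with linear_layer_dim == 8
# @parameterized.expand([8])    ->  test runs with linear_layer_dim == 8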

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:22:33.272875+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:23:00.029705+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose apbose force-pushed the nccl_ops_additional branch from 091c83f to 67b970e on February 28, 2025 19:24
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:24:59.719676+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-02-28 19:25:27.751500+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@@ -364,6 +364,15 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, .
    (i.contiguous() if isinstance(i, torch.Tensor) else torch.tensor(i).cuda())
    for i in inputs
]

for i, contiguous_input in enumerate(contiguous_inputs):
Collaborator:

Does the C++ API not need these changes?

Collaborator Author:

I am not clear on this aspect. Could you please let me know what would be required here? I am running the distributed Python example and did not encounter a need for this.

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(port)
os.environ["TRTLLM_PLUGINS_PATH"] = (
    find_repo_root() + "/lib/libnvinfer_plugin_tensorrt_llm.so"
Collaborator:

Assume a tmp path instead of an arbitrary location in the source tree.
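
A minimal sketch of that suggestion, assuming the test only needs some writable location rather than a path inside the source tree (the temporary-directory layout is illustrative; the .so name follows the snippet above):

import os
import tempfile

# Stage the plugin under a temporary directory instead of the repo tree.
plugin_dir = tempfile.mkdtemp(prefix="trtllm_plugins_")
os.environ["TRTLLM_PLUGINS_PATH"] = os.path.join(
    plugin_dir, "libnvinfer_plugin_tensorrt_llm.so"
)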

@@ -351,6 +351,7 @@ def generate_graph(
    enable_passes: bool,
    propagate_shapes: bool = False,
    settings: CompilationSettings = CompilationSettings(),
    fuse_distributed_ops: bool = False,
Collaborator:

Why do we need this setting in the harness? Since it only applies to distributed, you should: 1. have a test for the pass specifically, 2. have a test for converting the custom op without a pass, and 3. have a test that applies the pass and then converts the custom op.

@apbose (Collaborator Author) commented Mar 3, 2025

I had the same doubt, actually.

  1. I can look into adding a test for the pass specifically.
  2. Without the pass, it is not possible to encounter the custom op. The two calls

gathered_tensor = torch.ops._c10d_functional.all_gather_into_tensor(x, world_size, group_name)
gathered_tensor = torch.ops._c10d_functional.wait_tensor(gathered_tensor)

together form the nccl_ops pattern, and the pass is required for that.

  3. That is what is done right now. Maybe applying the pass in the harness is not the right place; where would you suggest it go? Since we want to test only the converters, couldn't the harness have a specific option such as distributed=True?

Collaborator Author:

Running `mpirun -n 1 --allow-run-as-root python test_distributed_simple_example.py` ideally covers all of it.

Collaborator Author:

Actually, just setting enable_passes=True works in the converter test. Earlier, constant folding was giving an error, but that is resolved now.
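
For reference, a minimal sketch of what the converter test looks like with the passes enabled, following the diffs in this PR. The model body, input shape, and process-group lookup are illustrative rather than the exact code in the PR, and an initialized default process group is assumed.

import torch
import torch.distributed as dist
import torch.nn as nn
from parameterized import parameterized

from conversion.harness import DispatchTestCase


class DistributedGatherModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = torch.nn.Linear(input_dim, input_dim)

    def forward(self, x):
        x = self.fc(x)
        world_size = dist.get_world_size()
        group_name = dist.group.WORLD.group_name  # illustrative group lookup
        gathered = torch.ops._c10d_functional.all_gather_into_tensor(
            x, world_size, group_name
        )
        # wait_tensor makes the functional collective's result usable
        return torch.ops._c10d_functional.wait_tensor(gathered)


class TestGatherNcclOpsConverter(DispatchTestCase):
    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        inputs = [torch.randn(1, linear_layer_dim).to("cuda")]
        self.run_test(
            DistributedGatherModel(linear_layer_dim).cuda(),
            inputs,
            use_dynamo_tracer=True,
            enable_passes=True,  # the lowering passes fuse the collective ops
        )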


OS="$(uname -s)"
ARCH="$(uname -m)"
PYTHON_VERSION="$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')"
Collaborator:

Why does it matter what Python version the system is using? Can we just use the same .so for all Python versions?

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 16:25:48.020598+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 16:26:14.671617+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose (Collaborator Author) commented Mar 3, 2025

I am not sure why lint is showing this:

Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install "black[jupyter]"``
would reformat /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py

Oh no! 💥 💔 💥
1 file would be reformatted, 595 files would be left unchanged.
Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install "black[jupyter]"``
would reformat /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py

All done! ✨ 🍰 ✨
1 file would be reformatted, 595 files would be left unchanged.

I see no error in the local pre-commit run.

@apbose apbose force-pushed the nccl_ops_additional branch from 0f6966f to ca8478a on March 3, 2025 17:06
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 17:06:44.889577+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 17:07:10.979519+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            fuse_distributed_ops=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

@apbose apbose force-pushed the nccl_ops_additional branch from ca8478a to 4b79bfb on March 3, 2025 18:14
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 18:15:05.376989+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-03-03 18:15:31.109367+00:00
@@ -15,11 +15,11 @@

from conversion.harness import DispatchTestCase


class TestGatherNcclOpsConverter(DispatchTestCase):
-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops(self, linear_layer_dim):
        class DistributedGatherModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()
                self.fc = torch.nn.Linear(input_dim, input_dim)
@@ -40,11 +40,11 @@
            inputs,
            use_dynamo_tracer=True,
            enable_passes=True,
        )

-    @parameterized.expand([(8)])
+    @parameterized.expand([8])
    def test_nccl_ops_scatter(self, linear_layer_dim):

        class DistributedReduceScatterModel(nn.Module):
            def __init__(self, input_dim):
                super().__init__()

Labels: cla signed, component: api [Python], component: conversion, component: converters, component: dynamo, component: lowering, component: runtime, component: tests, component: torch_compile

4 participants