
add cutlass support for blackwell fp8 gemm #13798

Open — wants to merge 10 commits into main

Conversation

@kushanam (Contributor) commented Feb 25, 2025

This PR adds CUTLASS Blackwell GEMM support for FP8.
A couple of notes:
1. `sm100_fp8_config_default` is the only supported config for now, with static tile and cluster shapes. Subsequent PRs will add optimized configs for other shapes, as well as 2x-SM GEMM.
2. Added default constructors for the c2x and c3x kernels; some build environments treat the missing-constructor warning as an error.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, which covers a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Feb 25, 2025
@tlrmchlsmth (Collaborator) left a comment:

Thanks for the contribution @kushanam! Looks good overall, left a few comments/questions

Comment on lines 76 to 83:

```diff
-  using StrideC = typename Gemm::StrideC;
+  using StrideC = typename Gemm::GemmKernel::StrideC;

   StrideA a_stride{lda, cute::Int<1>{}, 0};
   StrideB b_stride{ldb, cute::Int<1>{}, 0};
-  StrideC c_stride{ldc, cute::Int<1>{}, cute::Int<0>{}};
+  // StrideC c_stride{ldc, cute::Int<1>{}, cute::Int<0>{}};
+  StrideC c_stride =
+      cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(ldc, 1, 0));
```
Collaborator:

Could you explain this change?
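For context on the semantics: a packed stride of the form `{ld, 1, 0}` addresses a row-major matrix with leading dimension `ld`, where the trailing `0` broadcasts a single matrix across the batch dimension. A plain-C++ sketch of the addressing, with invented helper names (this is not the actual CUTLASS `make_cute_packed_stride` implementation):

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers (invented names; not the CUTLASS API) showing what a
// packed row-major stride {ld, 1, 0} means: element (m, k) of batch b lives
// at offset m*ld + k*1 + b*0, so the batch stride of 0 reuses one matrix
// for every batch index.
using Stride3 = std::array<std::int64_t, 3>;

inline Stride3 make_packed_stride(std::int64_t ld) {
  return {ld, 1, 0};  // {row stride, column stride, batch stride}
}

inline std::int64_t linear_offset(const Stride3& s, std::int64_t m,
                                  std::int64_t k, std::int64_t b) {
  return m * s[0] + k * s[1] + b * s[2];
}
```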

Comment on lines 105 to 103:

```cpp
using ElementD = ElementD_;
using LayoutD = cutlass::layout::ColumnMajor;
static constexpr int AlignmentD =
    128 / cutlass::sizeof_bits<ElementD_>::value;
```
Collaborator:

We've been setting AlignmentD to 4 to reduce the alignment requirement of these kernels. Can this be 4 instead of 8? Also, do you know what the performance considerations are?

@kushanam (Contributor, author) commented Feb 25, 2025:

Generally, 128-bit alignment (i.e., 8 elements for 16-bit data types) is required for best TMA performance. 4 might work, but performance will suffer. This is the same between Hopper and Blackwell.
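The arithmetic behind that rule can be sanity-checked in isolation. A plain-C++ stub (an assumption standing in for the real `cutlass::sizeof_bits` trait, which reports an element's width in bits):

```cpp
// Minimal stand-in for cutlass::sizeof_bits<T>::value (assumption: the real
// trait reports the element width in bits).
template <int Bits>
struct sizeof_bits_stub {
  static constexpr int value = Bits;
};

// Elements per 128-bit (16-byte) access, matching the expression
// "128 / cutlass::sizeof_bits<ElementD_>::value" from the diff above.
template <typename SizeofBits>
constexpr int alignment_elems() {
  return 128 / SizeofBits::value;
}

static_assert(alignment_elems<sizeof_bits_stub<16>>() == 8,
              "16-bit output (half/bf16): 8-element alignment");
static_assert(alignment_elems<sizeof_bits_stub<8>>() == 16,
              "8-bit output (fp8): 16-element alignment");
```

So for a 16-bit output type, 8 elements fill one 128-bit access, which is why lowering the alignment to 4 halves the achievable access width.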

Collaborator:

At first glance, this looks very similar to csrc/cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp. Why can't it be the same code?

Contributor (author):

Blackwell doesn't like EpilogueDescriptor. On the other hand, the only use of EpilogueDescriptor in scaled_mm_epilogues_c3x seems to be the tile shapes, so alternatively we could get rid of it altogether and keep everything in the same file.

Collaborator:

Yep that's the only use.

> we could get rid of it altogether and keep everything in the same file

I think that's the right move. Could you make that change in this PR?

Contributor (author):

For sure!

Contributor (author):

done.

CMakeLists.txt (Outdated) — Comment on lines 302 to 310:

```diff
 cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a;10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
 if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
   set(SRCS
     "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
     "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
     "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
     "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
-    "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
+    "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu"
+    "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu")
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we'll have to guard against compilation of scaled_mm_sm100_fp8.cu when CUDA < 12.8
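One way to add that guard (a sketch only; the variable names and arch list here are assumptions, following the existing SCALED_MM_3X pattern in this CMakeLists.txt):

```cmake
# Sketch: compile the sm100 source only when the toolkit can target
# Blackwell, i.e. CUDA >= 12.8.
cuda_archs_loose_intersection(SCALED_MM_SM100_ARCHS "10.0a;10.1a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_SM100_ARCHS)
  list(APPEND SRCS "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu")
endif()
```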


mergify bot commented Feb 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kushanam.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 27, 2025
@tlrmchlsmth (Collaborator) left a comment:

Thanks, looks great to me now! Could you merge in the changes from latest main?

Signed-off-by: Tyler Michael Smith <[email protected]>
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) March 2, 2025 20:12
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 2, 2025
Comment on lines 75 to 94:

```cpp
void cutlass_scaled_mm_sm100(torch::Tensor& c, torch::Tensor const& a,
                             torch::Tensor const& b,
                             torch::Tensor const& a_scales,
                             torch::Tensor const& b_scales,
                             std::optional<torch::Tensor> const& bias) {
  TORCH_CHECK(a_scales.dtype() == torch::kFloat32);
  TORCH_CHECK(b_scales.dtype() == torch::kFloat32);

  int M = a.size(0), N = b.size(1), K = a.size(1);
  TORCH_CHECK(
      (a_scales.numel() == 1 || a_scales.numel() == a.size(0)) &&
          (b_scales.numel() == 1 || b_scales.numel() == b.size(1)),
      "Currently, block scaled fp8 gemm is not implemented for Blackwell");

  // Standard per-tensor/per-token/per-channel scaling
  TORCH_CHECK(a_scales.is_contiguous() && b_scales.is_contiguous());
  TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn,
              "Currently, only fp8 gemm is implemented for Blackwell");
  vllm::cutlass_scaled_mm_sm100_fp8(c, a, b, a_scales, b_scales, bias);
}
```
Collaborator:

We might need to ifdef this out when CUDA < 12.8

Comment on lines 127 to 134:

```diff
 #if defined ENABLE_SCALED_MM_C3X && ENABLE_SCALED_MM_C3X
-  if (version_num >= 90) {
+  if (version_num >= 90 && version_num < 100) {
     cutlass_scaled_mm_sm90(c, a, b, a_scales, b_scales, bias);
     return;
+  } else if (version_num >= 100) {
+    cutlass_scaled_mm_sm100(c, a, b, a_scales, b_scales, bias);
+    return;
   }
```
Collaborator:

This is causing linker errors in the CI. Need to guard against calling cutlass_scaled_mm_sm100 when CUDA < 12.8

Something like this:

```cpp
#if defined ENABLE_SCALED_MM_C3X && ENABLE_SCALED_MM_C3X

#if defined CUDA_VERSION && CUDA_VERSION >= 12800
  // CUDA >= 12.8: the sm100 kernels are compiled in, so dispatch to them
  // on Blackwell (SM 10.x).
  if (version_num >= 90 && version_num < 100) {
    cutlass_scaled_mm_sm90(c, a, b, a_scales, b_scales, bias);
    return;
  } else if (version_num >= 100) {
    cutlass_scaled_mm_sm100(c, a, b, a_scales, b_scales, bias);
    return;
  }
#else
  // CUDA < 12.8: cutlass_scaled_mm_sm100 is not built, so only the
  // sm90 path is available.
  if (version_num >= 90) {
    cutlass_scaled_mm_sm90(c, a, b, a_scales, b_scales, bias);
    return;
  }
#endif
```

@tlrmchlsmth (Collaborator) left a comment:

There are some linker errors when CUDA < 12.8 that need to be addressed. (I left some inline comments)

This made me realize we need to add guards/fallbacks/warnings when running on a Blackwell GPU but using a kernel compiled with CUDA < 12.8, as currently we will try to run the non-forward-compatible kernels for sm90a. That doesn't need to happen in this PR but @kushanam do you have any thoughts there?
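The runtime guard being suggested could take a shape like the following plain-C++ sketch (illustrative only; the function and flag names are invented, not vLLM's actual dispatch code):

```cpp
#include <stdexcept>
#include <string>

// Sketch of a dispatch-time guard (invented names): refuse to run the
// non-forward-compatible sm90a kernels on a Blackwell (SM 10.x) device
// when the binary was built without the sm100 kernels (CUDA < 12.8).
inline std::string select_scaled_mm_kernel(int version_num,
                                           bool built_with_sm100) {
  if (version_num >= 100) {
    if (!built_with_sm100) {
      throw std::runtime_error(
          "Blackwell GPU detected but no sm100 kernels were compiled; "
          "rebuild with CUDA >= 12.8");
    }
    return "cutlass_scaled_mm_sm100";
  }
  if (version_num >= 90) {
    return "cutlass_scaled_mm_sm90";
  }
  return "cutlass_scaled_mm_pre_sm90";
}
```

An explicit error like this would replace the current behavior of silently attempting to launch sm90a kernels on Blackwell.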

Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
auto-merge was automatically disabled March 4, 2025 02:51

Head branch was pushed to by a user without write access

pathorn pushed a commit to deepinfra/vllm that referenced this pull request Mar 4, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) March 4, 2025 03:13
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)