
Coord refactor #186

Open · wants to merge 48 commits into base: sycl-develop

Conversation

@t4c1 (Collaborator) commented Jan 17, 2025

Refactor the coordinates for PVC copies so they are consistent with how copies are called for all CUDA GPUs.

@t4c1 t4c1 marked this pull request as ready for review January 29, 2025 09:38
@joeatodd (Collaborator) left a comment

Nice work @t4c1 - a few small things I spotted.

auto
get_pvc_tensor(GShape const& g_shape) const {
static_assert(rank(GShape{}) == 3, "mismatch rank");
return make_counting_tensor(make_layout(g_shape, make_stride(E<0>(), E<1>(), E<2>())));
Collaborator

get_tma_tensor uses g_stride_ for the 2nd arg to make_layout here. Is there any loss of generality with this simpler approach?

Collaborator

Check whether this stride works correctly for column-major.

Collaborator Author

I am not sure, but I think that is used to encode col/row-major information. For PVC, that is instead encoded in the copy atom itself.
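For context, a minimal sketch of the two constructions being compared; this is illustrative only and assumes the g_stride_ member mentioned above, not code from this PR:

    // TMA-style: the runtime stride goes into the layout, so row/column-major
    // ordering is carried by g_stride_ itself.
    auto tma_coords = make_counting_tensor(make_layout(g_shape, g_stride_));
    // PVC-style (this PR): pure basis strides; the tensor only produces
    // coordinates, and row/column-major handling is left to the copy atom.
    auto pvc_coords = make_counting_tensor(make_layout(g_shape, make_stride(E<0>(), E<1>(), E<2>())));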

Comment on lines 170 to 171
constexpr int dtype_size = sizeof(dtype);
constexpr int bits_in_byte = 8;
Collaborator

Cutlass provides cutlass::sizeof_bits<dtype> for this

Collaborator Author

done

Comment on lines 320 to 323
static_assert(is_rmem<TS>::value);
static_assert(size(SLayout{}) * dtype_size * bits_in_byte == size<1>(typename Traits_ST_t::SrcLayout{}),
"Src tensor size does not match copy atom size");
static_assert(size(DLayout{}) * dtype_size * bits_in_byte == size<1>(typename Traits_ST_t::DstLayout{}),
Collaborator

As above, use cutlass::sizeof_bits<dtype> I think.

Collaborator Author

done
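For illustration, a rough sketch of the asserts with that suggestion applied; the second message string is an assumption here because the quoted snippet is truncated:

    static_assert(size(SLayout{}) * cutlass::sizeof_bits<dtype>::value
                      == size<1>(typename Traits_ST_t::SrcLayout{}),
                  "Src tensor size does not match copy atom size");
    static_assert(size(DLayout{}) * cutlass::sizeof_bits<dtype>::value
                      == size<1>(typename Traits_ST_t::DstLayout{}),
                  "Dst tensor size does not match copy atom size");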

@@ -137,12 +137,31 @@ struct CollectiveMma<
using traits_load_B = Copy_Traits<GmemTiledCopyB, StrideB>;
using atom_load_B = Copy_Atom<traits_load_B, ElementB>;
Collaborator

I think the changes in this file need to be copied over to xe_mma_mixed_input.hpp. I am getting a local failure of ninja test_unit_gemm_device.

Collaborator Author

done

Comment on lines 151 to 152
using TensorMKL = decltype(make_tensor(make_gmem_ptr(static_cast<ElementA const*>(nullptr)), make_shape(0,0,0), StrideA{})); //(m, k)
using TensorNKL = decltype(make_tensor(make_gmem_ptr(static_cast<ElementB const*>(nullptr)), make_shape(0,0,0), StrideB{})); //(n, k)
Collaborator

Unused

Collaborator Author

They are used in the universal GEMM (GemmUniversal).


// Instantiate the MMA object and get thread slice
TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_slice(thread_idx);
// To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
Collaborator

Can we have a TODO(Codeplay): here to fix this later?

Collaborator Author

done

Comment on lines 256 to 261
Tensor tArA = thr_copy_A2.retile_D(tCrA);
Tensor tBrB = thr_copy_B2.retile_D(tCrB);

// Retile global tile for copies
Tensor tAgA = thr_copy_A2.retile_S(tCgA);
Tensor tBgB = thr_copy_B2.retile_S(tCgB);
Collaborator

retile_D and retile_S do the same thing by the way. Not sure if that affects what's going on here - but I don't think I've seen both used anywhere before.

Collaborator Author

I am not sure they always do the same thing. In any case, one is intended for the source and one for the destination, and that is how I use them here.

Comment on lines 325 to 328
Tensor g_cta_D_mnl = local_tile(mD_mnl, CtaTileMNK{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this CTA is responsible for // (BLK_M,BLK_N)
Tensor g_cta_D = g_cta_D_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N)
Collaborator

I am wondering whether it should be possible to avoid this and have something like

 Tensor g_cta_D_mnl  = local_tile(mD_mnl, CtaTileMNK{}, make_coord(m_coord,n_coord,l_coord), Step<_1,_1, X>{}); 

Collaborator Author

It is possible if I manually construct the tile for the MN dimensions. I can only use the last argument (projection) if the previous two have the same number of modes.
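A hedged sketch of what manually constructing the MN tile could look like (illustrative only, not necessarily what the PR does; take<0,2> trims CtaTileMNK to its M and N modes so the tiler rank matches the coordinate):

    // Rank-2 tiler over (M, N) only, so a full (m, n, l) coordinate can be used
    // in a single local_tile call to pick out this CTA's tile.
    auto cta_tiler_mn = take<0,2>(CtaTileMNK{});
    Tensor g_cta_D = local_tile(mD_mnl, cta_tiler_mn, make_coord(m_coord, n_coord, l_coord)); // (BLK_M,BLK_N)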

Comment on lines 331 to 334
Tensor gD_mnl = local_tile(g_cta_D, SubgroupTileShape{}, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l)

// Slice to get the tile this warp is responsible for
Tensor gD = gD_mnl(_,_,m_sg,n_sg); // (BLK_M,BLK_N)
Collaborator

Same here

Collaborator Author

done


// Instantiate the MMA object and get thread slice
TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_slice(thread_idx);
// To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
@mehdi-goli (Collaborator) Jan 29, 2025

Suggested change
- // To make all threads in a warp have the same global tensors pass in the index of thread 0 in each warp
+ // To make all work items in a subgroup have the same global tensors pass in the index of work item 0 in each subgroup

Collaborator Author

done

Comment on lines 252 to 253
Tensor tCrA = make_tensor<ElementA>(tCgA(_,_,_,0).shape());
Tensor tCrB = make_tensor<ElementB>(tCgB(_,_,_,0).shape(), make_stride(_1{}, shape<0>(tCgB) * shape<2>(tCgB), shape<0>(tCgB)));
Collaborator

This line too does not seem to match what you are aiming to do.

Collaborator Author

What do you mean by that?

Comment on lines 249 to 257
Tensor mA_mk = mA_mkl(_,_,l_coord); // (m,k)
Tensor mB_nk = mB_nkl(_,_,l_coord); // (n,k)

auto gA_mk = local_tile(mA_mk, blk_shape, make_coord(_,_,_), Step<_1, X, _1>{});
auto gB_nk = local_tile(mB_nk, blk_shape, make_coord(_,_,_), Step< X, _1, _1>{});

// Slice with m_coord and n_coord
Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k)
Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k)
@mehdi-goli (Collaborator) Jan 29, 2025

Same here; I think it should be possible to say:

Tensor gA = local_tile(mA_mkl, blk_shape, make_coord(m_coord,_,l_coord), Step<_1,  X, _1>{});                                          
Tensor gB = local_tile(mB_nkl, blk_shape, make_coord(n_coord,_,l_coord), Step< X, _1, _1>{});

Collaborator Author

done something similar

@@ -243,22 +235,19 @@ class GemmUniversal<
// Get the appropriate blocks for this sub_group -- potential for sub_group locality
int thread_idx = int(ThreadIdxX());
auto blk_shape = TileShape{};
#ifdef CUTLASS_SYCL_SWITCH_WG
Collaborator

If we are not using CUTLASS_SYCL_SWITCH_WG anymore, could you remove the definition in the CMakeLists?

Collaborator Author

done

@FMarno (Collaborator) left a comment

LGTM

using XE_Copy_O = decltype(make_xe_2d_copy(Copy_Atom<Copy_Traits<CopyOpO, StrideO>, ElementO>{}.with(
make_tensor(make_gmem_ptr(static_cast<ElementO const*>(nullptr)), make_layout(make_shape(0, 0, 0), StrideO{}))),
Layout<Shape<_1, Int<SubgroupSize>>>{}));
using XE_Copy_O = decltype(make_tiled_copy(Copy_Atom<Trait_O, ElementO>{}
Collaborator

Please don't do reformatting and code changes in the same PR. It makes review unnecessarily hard.

Collaborator Author

This is not reformatting; it is a functional change: make_xe_2d_copy -> make_tiled_copy.

TiledMMA<MMAAtom, Layout<Shape<_8,_4,_1>>>,
TiledMMA<MMAAtom,
Layout<Shape<_8,_4,_1>, Stride<_4,_1,_0>>,
Tile<Layout<Shape<_8, _8, _4>, Stride<_1, _32, _8>>,
Collaborator

might be easier to read with make_ordered_layout

Collaborator Author

With make_ordered_layout we would also need a decltype. I am a bit on the fence, but I think I prefer it to be explicit. Maybe we can add a helper in the future that does not need decltype, but not in this PR.
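For comparison, a rough sketch of the two spellings; the Step<_1,_0,_2> order argument is an assumption intended to make mode 1 the fastest-varying, and make_ordered_layout would give the size-1 mode a compact stride rather than the explicit _0:

    // Explicit form, as in this PR:
    using SgLayout        = Layout<Shape<_8,_4,_1>, Stride<_4,_1,_0>>;
    // make_ordered_layout form; it is a function call, so decltype is needed in a type alias:
    using SgLayoutOrdered = decltype(make_ordered_layout(Shape<_8,_4,_1>{}, Step<_1,_0,_2>{}));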


7 participants