CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
Support for new CuTe building blocks specifically for the Blackwell SM100 architecture.
CUTLASS library and profiler integration of block-scaled data types for kernel emission, profiling, and verification.
Support for preferred and fallback cluster shapes, set dynamically by parsing profiler command-line arguments.
Support for dynamic data types: profiler command-line arguments are parsed to set the data types dynamically in TCGen05 MMA instruction descriptors.
Profiler support for mixed-input GEMM kernels on Hopper.
A new CUTLASS profiler flag, `use-cuda-graphs`, to reduce overhead when benchmarking launch-bound kernels.
A new 3.x version of grouped GEMM in the CUTLASS library, with kernel generation for Hopper and Blackwell. Grouped GEMM is now also enabled in the CUTLASS profiler (see `./cutlass_profiler --operation=GroupedGemm --help` for details).
A set of examples demonstrating how to use the 3.x API to target the Blackwell SM100 architecture.
New functionality documentation for the 3.x API that comprehensively covers all supported kernel types, data types, kernel features, minimum CUDA Toolkit requirements, and more for the architectures supported by 3.x.
Updates to the compatibility section covering supported compilers, operating systems, CUDA Toolkits, hardware architectures, and target architectures.
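The block-scaled data types listed above pair a narrow element type with a scale factor shared by each fixed-size block of elements. The following host-side sketch shows dequantization. It assumes the OCP Microscaling (MX) convention for MXFP4: FP4 E2M1 elements, blocks of 32 elements, and one E8M0 power-of-two scale per block. The function names are hypothetical and are not part of the CUTLASS API. To keep the example simple, codes are stored one per byte rather than packed two per byte.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: decode one FP4 E2M1 code held in the low 4 bits,
// laid out as [sign | 2-bit exponent | 1-bit mantissa]. E2M1 uses an exponent
// bias of 1, has a subnormal at exponent 0, and has no Inf/NaN encodings.
inline float decode_e2m1(std::uint8_t code) {
  static constexpr std::array<float, 8> kMagnitude = {0.0f, 0.5f, 1.0f, 1.5f,
                                                     2.0f, 3.0f, 4.0f, 6.0f};
  const float magnitude = kMagnitude[code & 0x7];
  return (code & 0x8) ? -magnitude : magnitude;
}

// Hypothetical helper: decode one E8M0 scale, a biased power of two 2^(bits - 127).
// The encoding 0xFF is reserved for NaN.
inline float decode_e8m0(std::uint8_t bits) {
  if (bits == 0xFF) return std::numeric_limits<float>::quiet_NaN();
  return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// Hypothetical helper: dequantize unpacked FP4 codes, where each consecutive
// block of block_size elements shares one E8M0 scale.
inline std::vector<float> dequantize_mxfp4(const std::vector<std::uint8_t>& codes,
                                           const std::vector<std::uint8_t>& scales,
                                           std::size_t block_size = 32) {
  // Every block needs a scale; the last block may be partial.
  assert(scales.size() * block_size >= codes.size());
  std::vector<float> out(codes.size());
  for (std::size_t i = 0; i < codes.size(); ++i) {
    out[i] = decode_e2m1(codes[i]) * decode_e8m0(scales[i / block_size]);
  }
  return out;
}
```

For example, 64 codes of `0x2` (the value 1.0) with scales `{127, 129}` dequantize to 1.0 for the first block of 32 and to 4.0 for the second.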
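In the mixed-input GEMM kernels mentioned above, operands A and B use different element types, for example 8-bit integer weights combined with floating-point activations. The following host reference sketch captures only the math. It assumes row-major storage, with both operands converted to float and accumulated in float. `reference_mixed_gemm` is a hypothetical helper, not a CUTLASS function, and it does not model how Hopper kernels actually perform the conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host reference for D = alpha * (A x B) + beta * C, where
// A (M x K) and B (K x N) may have different element types (e.g. std::int8_t
// and float). Each operand is converted to float before the multiply, and
// products accumulate in float. All matrices are row-major.
template <typename ElementA, typename ElementB>
std::vector<float> reference_mixed_gemm(int M, int N, int K,
                                        const std::vector<ElementA>& A,
                                        const std::vector<ElementB>& B,
                                        const std::vector<float>& C,
                                        float alpha, float beta) {
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  assert(C.size() == static_cast<std::size_t>(M) * N);
  std::vector<float> D(static_cast<std::size_t>(M) * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += static_cast<float>(A[m * K + k]) * static_cast<float>(B[k * N + n]);
      }
      D[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
  return D;
}
```

A reference like this is useful mainly for checking a device kernel's output on small problems, since it fixes the conversion and accumulation types explicitly.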
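The grouped GEMM support listed above runs a set of independent GEMMs whose problem sizes may differ from one group member to the next. A batched GEMM, by contrast, shares one shape across all problems. The following host-side sketch shows only the expected results. `GemmProblem` and `reference_grouped_gemm` are hypothetical names, and the sketch does not show how CUTLASS schedules tiles from different problems across CTAs.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical description of one GEMM in the group:
// D = A x B with A (m x k) and B (k x n), row-major float.
struct GemmProblem {
  int m, n, k;
  std::vector<float> A, B;
};

// Hypothetical host reference for a grouped GEMM. Each problem is an independent
// GEMM with its own shape. A device kernel would cover every problem in one
// launch; this sketch simply loops over them.
inline std::vector<std::vector<float>> reference_grouped_gemm(
    const std::vector<GemmProblem>& problems) {
  std::vector<std::vector<float>> outputs;
  outputs.reserve(problems.size());
  for (const GemmProblem& p : problems) {
    assert(p.A.size() == static_cast<std::size_t>(p.m) * p.k);
    assert(p.B.size() == static_cast<std::size_t>(p.k) * p.n);
    std::vector<float> D(static_cast<std::size_t>(p.m) * p.n, 0.0f);
    for (int i = 0; i < p.m; ++i)
      for (int j = 0; j < p.n; ++j)
        for (int l = 0; l < p.k; ++l)
          D[i * p.n + j] += p.A[i * p.k + l] * p.B[l * p.n + j];
    outputs.push_back(std::move(D));
  }
  return outputs;
}
```

Passing a 1x2 by 2x1 problem and a 2x1 by 1x1 problem in the same call returns two outputs of different sizes. Handling that mix of shapes is what distinguishes a grouped GEMM from a batched one.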
Highlights of the new Blackwell SM100 building blocks in CuTe and in the CUTLASS epilogue:
Blackwell's new tensor memory is exposed as `tmem` across CuTe as a first-class data locale.
`tmem->rmem`, `rmem->tmem`, and `smem->tmem` data movement instructions are exposed as copy atoms in CuTe.
A `make_tmem_copy()` utility method eases the creation of tiled copies for `tmem` copy atoms.
A Blackwell epilogue supports loading accumulators from `tmem` and the full set of EVT fusions.
Note: CUTLASS 3.x builds are currently known to be broken on Windows for all CUDA Toolkits.
The CUTLASS team is working on a fix.
This discussion was created from the release CUTLASS 3.8.0.