
Extremely slow BLAS muladd when α,β are Int #253

Open
sztal opened this issue Feb 7, 2025 · 5 comments

Comments

@sztal

sztal commented Feb 7, 2025

Hi, perhaps I do not understand something about how ArrayLayouts are supposed to work and be used, but it seems that BLAS matrix multiplication (both default_blasmul! and tiled_blasmul!) is very slow in comparison to 5-argument LinearAlgebra.mul!.

Below is a reproducible example.

using LinearAlgebra
using BenchmarkTools  # for @btime
import ArrayLayouts: default_blasmul!, tiled_blasmul!, tile_size

A, B, C = (rand(1000, 1000) for _ in 1:3)

# mul! is fast
mul!(copy(C), A, B, 1, 1) ≈ A*B + C  # true
@btime mul!(copy(C), A, B, 1, 1);
# 6.448 ms (4 allocations: 7.63 MiB)

# default_blasmul! is super slow!!
default_blasmul!(1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime default_blasmul!(1, A, B, 1, copy(C));
# 852.139 ms (3 allocations: 7.63 MiB)

# tiled_blasmul! is also very slow!
ts = tile_size(eltype(A), eltype(B), eltype(C))
tiled_blasmul!(ts, 1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime tiled_blasmul!(ts, 1, A, B, 1, copy(C));
# 402.555 ms (12 allocations: 7.66 MiB)

And this, of course, affects operations that use the MulAdd wrapper. What am I doing wrong? Or is it really a bug?

Version info

Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 9 5900HX with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
  JULIA_PROJECT = @.
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
  JULIA_REVISE = manual

PKG status

  [dce04be8] ArgCheck v2.4.0
  [7d9fca2a] Arpack v0.5.4
  [4c555306] ArrayLayouts v1.11.0
  [86223c79] Graphs v1.12.0
⌃ [0b1a1467] KrylovKit v0.9.3
⌃ [5078a376] LazyArrays v2.4.0
  [2ab3a3ac] LogExpFunctions v0.3.29
  [5a0628fe] Supposition v0.3.5
  [37e2e46d] LinearAlgebra v1.11.0
  [2f01184e] SparseArrays v1.11.0
  [f489334b] StyledStrings v1.11.0
  [8dfed614] Test v1.11.0
@dlfivefifty
Member

Note that mul! uses native BLAS, whereas default_blasmul! and tiled_blasmul! are Julia code that was copied from LinearAlgebra.

This gives an indication of what's going on:

julia> @time mul!(C, A, B, 1, 1); # calls native BLAS
  0.021381 seconds (1 allocation: 32 bytes)

julia> @time muladd!(1.0, A, B, 1.0, C); # also calls native BLAS
  0.021018 seconds

julia> @time ArrayLayouts.tiled_blasmul!(ts, 1, A, B, 1, C); # slower Julia code
  0.307682 seconds (9 allocations: 30.609 KiB)

julia> @time muladd!(1, A, B, 1, C); # should call native BLAS but calls tiled_blasmul! cause of types 🤦‍♂️
  0.315244 seconds (9 allocations: 30.609 KiB)
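
Given the timings above, a caller-side workaround (until the dispatch is fixed) is to convert the scalars to floats yourself, which selects the same native-BLAS method as the fast muladd!(1.0, A, B, 1.0, C) call shown earlier:

julia> @time muladd!(float(1), A, B, float(1), C); # float(1) === 1.0, so this takes the BLAS path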

@dlfivefifty dlfivefifty changed the title Extremely slow BLAS muladd Extremely slow BLAS muladd when α,β are Int Feb 8, 2025
@sztal
Author

sztal commented Feb 8, 2025

I see, thanks for the explanation! So, let me check whether I understand: this happens because I used α and β that are neither of the eltype of the matrices nor Float32/Float64, which are the types BLAS supports?

Could you specify the conditions under which muladd!, Mul, and MulAdd dispatch correctly to BLAS?

@jishnub
Member

jishnub commented Feb 8, 2025

I think we should promote the scalars to float in such cases. This is what LinearAlgebra does, and this will ensure that we hit the BLAS paths more frequently.
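
A minimal sketch of what that promotion could look like (the helper name is hypothetical; the actual fix in ArrayLayouts may differ):

# Hypothetical helper: promote α and β to the common eltype of the arrays
# before dispatching, mirroring what LinearAlgebra.mul! does internally.
function _promote_αβ(α, β, A, B, C)
    T = promote_type(eltype(A), eltype(B), eltype(C))
    return convert(T, α), convert(T, β)
end

# For Float64 matrices, _promote_αβ(1, 1, A, B, C) yields (1.0, 1.0),
# so the Float64 (native BLAS) method of muladd! would be selected.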

@sztal
Author

sztal commented Feb 8, 2025

I think this is a good idea: the current behavior is very unexpected and confusing, especially as it is inconsistent with what LinearAlgebra does.

@dlfivefifty
Member

It should be an easy fix
