Improve memory pressure detection #524

Jromano1997 · 2025-01-23T13:38:06Z

Hello,

I'm fairly new to Metal and GPU coding in general, so I hope this issue isn't only due to my lack of knowledge.

On my MacBook Air M1:
Metal.versioninfo()

macOS 15.2.0, Darwin 24.2.0

Toolchain:
Julia: 1.11.2
LLVM: 16.0.6

Julia packages:
Metal.jl: 1.5.1
GPUArrays: 11.2.1
GPUCompiler: 1.1.0
KernelAbstractions: 0.9.31
ObjectiveC: 3.2.0
LLVM: 9.1.3
LLVMDowngrader_jll: 0.6.0+0

1 device:
Apple M1 (1.501 GiB allocated)

If I run:

using Metal
N=2^13
X=rand(Float32,N,N)
mtl_X=MtlArray(X);

u=sum(exp.(mtl_X));

I see used memory increase in Activity Monitor, and that memory is not freed until I close Julia.

(Image: Activity monitor resulting from running the line `u=sum(exp.(mtl_X));` multiple times)

If I then run:

n=30
for _ in 1:n Metal.@sync u=sum(exp.(mtl_X)) end

the program fills the RAM and my laptop freezes.

Note that if I run instead:

mtl_A=similar(mtl_X)

n=300
for _ in 1:n
    Metal.@sync begin
        mtl_A .= exp.(mtl_X)
        u = sum(mtl_A)
    end
end

the code runs just fine without any freezing. I thus suspect that u=sum(exp.(mtl_X)) allocates a temporary MtlMatrix for exp.(mtl_X), which the garbage collector is unable to free. Is this standard behaviour?
Shouldn't the garbage collector be able to free the memory that it allocates?
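
If the temporary for exp.(mtl_X) is indeed the culprit, one way to sidestep it without a preallocated buffer is to fuse the map into the reduction. The following is a minimal sketch, assuming the generic `mapreduce` that GPUArrays provides behaves for `MtlArray` as it does for other GPU array types; `fused_sum_exp` is a hypothetical helper name, and the block runs on a plain CPU array so it can be checked without a GPU.

```julia
# Hypothetical helper: apply `exp` element by element inside the reduction,
# so that no full-size array holding `exp.(x)` has to be materialized first.
fused_sum_exp(x::AbstractArray) = mapreduce(exp, +, x)

# CPU check against the allocating version from the issue.
x = rand(Float32, 256, 256)
@assert fused_sum_exp(x) ≈ sum(exp.(x))

# On a Mac with Metal.jl, the same call would take the GPU array instead:
#   using Metal
#   mtl_X = MtlArray(rand(Float32, 2^13, 2^13))
#   u = fused_sum_exp(mtl_X)
```

The reduction may still allocate a small buffer for partial results on the GPU, but that should be far smaller than the N×N temporary.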


maleadt · 2025-02-20T09:50:14Z

> (Image: Activity monitor resulting from running the line `u=sum(exp.(mtl_X));` multiple times)

Running GC.gc(true) a single time after that frees up all that memory. Since garbage collection is delayed, that is kind of expected behavior.

> If then I run:
> n=30
> for _ in 1:n Metal.@sync u=sum(exp.(mtl_X)) end
> the program fills the RAM and my laptop freezes.

I can confirm this makes the host device less responsive, but AFAICT the Julia GC still works properly. Calling GC.gc(true) afterwards, or simply interrupting the loop, makes memory usage drop here. So if anything, this looks like macOS really doesn't like running close to the memory limit. Maybe it never returns an OOM error code (which we rely on to forcibly call the GC when running out of GPU memory), instead opting to page out memory, which is really slow.
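
Until memory pressure is detected automatically, a user-side workaround is to force a collection every few iterations of an allocating loop. This is a minimal sketch, not something Metal.jl provides: `run_with_periodic_gc` and its `every` keyword are hypothetical names, and the right interval depends on how much each iteration allocates.

```julia
# Hypothetical helper: call `f` `n` times and force a full collection every
# `every` iterations, so unreachable temporaries are finalized before they
# accumulate. Returns the result of the last call.
function run_with_periodic_gc(f, n::Integer; every::Integer = 10)
    result = nothing
    for i in 1:n
        result = f()
        i % every == 0 && GC.gc(true)
    end
    return result
end

# CPU stand-in for the loop from the issue.
x = rand(Float32, 512, 512)
u = run_with_periodic_gc(30; every = 5) do
    sum(exp.(x))
end

# With Metal.jl, the body would be the original statement:
#   run_with_periodic_gc(30; every = 5) do
#       Metal.@sync sum(exp.(mtl_X))
#   end
```

Full collections are not free, so a larger `every` trades peak memory for less time spent in the GC.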

There are several possible solutions here. We could port CUDA.jl's early GC invocation heuristics based on GPU memory usage, JuliaGPU/CUDA.jl#2304. Or we could try to switch entirely to Julia's memory allocator, so that memory pressure from MtlArray is sensed by the GC and causes it to run earlier (assuming Julia itself does a better job here).
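
To make the first option concrete, here is a rough sketch of what a usage-based trigger could look like. It is a hypothetical threshold policy, not CUDA.jl's actual heuristic from the linked PR: `maybe_collect`, `threshold`, and the byte-count arguments are all illustrative. In Metal.jl the counts would have to come from the device, for example the `MTLDevice` properties `currentAllocatedSize` and `recommendedMaxWorkingSetSize`, assuming those are the right quantities to compare.

```julia
# Hypothetical policy: before allocating, compare current usage to a budget
# and run the GC early once usage crosses a fraction of that budget.
# Returns true if a collection was triggered.
function maybe_collect(used_bytes::Integer, limit_bytes::Integer;
                       threshold::Real = 0.8)
    pressure = used_bytes / limit_bytes
    if pressure >= threshold
        GC.gc(false)  # incremental collection; a real policy might escalate to GC.gc(true)
        return true
    end
    return false
end

# Example with made-up numbers: 7 GiB used out of an 8 GiB budget.
GiB = 2^30
@assert maybe_collect(7GiB, 8GiB)       # 0.875 >= 0.8, collects
@assert !maybe_collect(2GiB, 8GiB)      # 0.25 < 0.8, does nothing
```

A real implementation would also need to bound how much time is spent collecting, which is the harder part of such a heuristic.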

maleadt changed the title ~~Garbage collection doesn't trigger on MtlArrays~~ Improve memory pressure detection Feb 20, 2025

maleadt added the arrays (Things about the array abstraction) and help wanted (Extra attention is needed) labels Feb 20, 2025
