metal : GPU "idle-throttling" analysis #10119
Conversation
Found this thread from PyTorch, maybe related to this: pytorch/pytorch#124056
Not a lot of people are sifting through these requests. Maybe consider adding it to the Discussions or even somewhere in the README? Sounds like a (potentially) pretty significant thing if we can figure this out.
Just to confirm that it also still seems to be present on the M4. Machine: MacBook Pro, M4 Pro, 20-core GPU, 48 GB RAM
There is an option in macOS settings on the Mac mini, under 'Energy', to change to 'High Power' mode. Does it help in any way?
Have you considered reporting to http://feedbackassistant.apple.com?
I just submitted a ticket. Thanks.
It turns out that this "throttling" is caused by memory getting unwired after a certain amount of time, as explained by @awni here. Disabling the wired memory collector with the following command fixes the issue:

`sudo sysctl iogpu.disable_wired_collector=1`

This unwiring might also be preventable programmatically using Metal's residency sets: https://developer.apple.com/documentation/metal/mtlresidencyset

Edit: it works now
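For anyone exploring the residency set route, here is a minimal, untested sketch of how a Metal allocation could be kept resident so that the wired memory collector does not unwire it. It assumes macOS 15+ (where `MTLResidencySet` is available) and pre-existing `device`, `queue` and `buf` objects; it is not how llama.cpp currently handles this, just an illustration of the API:

```objc
#import <Metal/Metal.h>

// sketch: keep a Metal allocation resident using MTLResidencySet (macOS 15+)
// assumes an existing id<MTLDevice> device, id<MTLCommandQueue> queue, id<MTLBuffer> buf
static id<MTLResidencySet> make_residency_set(id<MTLDevice>       device,
                                              id<MTLCommandQueue> queue,
                                              id<MTLBuffer>       buf) {
    MTLResidencySetDescriptor * desc = [[MTLResidencySetDescriptor alloc] init];
    desc.label           = @"ggml-metal residency set";
    desc.initialCapacity = 1;

    NSError * error = nil;
    id<MTLResidencySet> rset = [device newResidencySetWithDescriptor:desc error:&error];
    if (rset == nil) {
        NSLog(@"failed to create residency set: %@", error);
        return nil;
    }

    [rset addAllocation:buf]; // add the buffer to the set
    [rset commit];            // apply the pending addition
    [rset requestResidency];  // ask the OS to make (and keep) the allocation resident

    // attach the set to the command queue so residency is maintained
    // for all command buffers submitted to it
    [queue addResidencySet:rset];

    return rset;
}
```

If this approach holds up, it would avoid asking users to change the `iogpu.disable_wired_collector` kernel setting.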
Update: the issue is caused by the GPU wired memory collector: #10119 (comment)
Looking for solutions that do not involve kernel re-configuration: #11427
Apple Silicon GPUs seem to have some sort of (power-saving?) mechanism which affects performance in a significant way. Here is some analysis and a demonstration. I'm hoping that someone in the community knows a way to work around this.
The `llama-idle` tool in this PR performs the following computation: sleep for `t` milliseconds, then wake up and decode a single token, measuring how long the decode takes (a simplified sketch is included below). The sleep emulates an idle period during which the GPU does nothing in an application, after which it wakes up to compute some tokens - in this case, for simplicity, a single token. In the ideal case, without any throttling mechanism, the decode time should be constant regardless of the value of `t`.

Here are the results on M2 Ultra from this test for 3 different models, increasing the time `t` from `0` up to `2200 ms`. It can be seen that after `~1s` of being idle, the next `llama_decode` takes extra time to compute.

My best guess is that this is something done by the OS in order to reduce power consumption. But the problem is that it is quite aggressive and it ruins the performance of local applications that run inference every few seconds.
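To make the setup concrete, here is a simplified sketch of the measurement loop described above (not the actual `llama-idle` source; `decode_one_token()` is a hypothetical placeholder for a single-token `llama_decode()` call):

```objc
#import <Foundation/Foundation.h>

// hypothetical placeholder for running a single-token llama_decode() on the GPU
extern void decode_one_token(void);

// sleep for increasing amounts of time and measure how long the next decode takes
static void run_idle_test(void) {
    for (int t_ms = 0; t_ms <= 2200; t_ms += 100) {
        // emulate an idle application: the GPU does nothing for t_ms milliseconds
        [NSThread sleepForTimeInterval:t_ms/1000.0];

        // wake up and decode a single token, timing the call
        const NSTimeInterval t0 = [NSDate timeIntervalSinceReferenceDate];
        decode_one_token();
        const NSTimeInterval t1 = [NSDate timeIntervalSinceReferenceDate];

        NSLog(@"idle = %4d ms, decode = %.2f ms", t_ms, (t1 - t0)*1000.0);
    }
}
```

Without any throttling, the reported decode time should stay flat as `t_ms` grows; the jump after roughly one second of idling is what this issue is about.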
Hopefully there is a way to disable this somehow - any insights would be highly appreciated. For now, the only solution I can come up with is to introduce an optional "heartbeat" callback that would call a dummy Metal kernel every `0.9s`, just to keep the GPU under slight pressure. This is obviously not a great solution, but it would be much better than the current situation.
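A rough sketch of that heartbeat idea, assuming an existing `id<MTLCommandQueue>`. Here the "dummy work" is just an empty command buffer committed by a GCD timer; a real implementation might dispatch a tiny compute kernel instead:

```objc
#import <Metal/Metal.h>

// submit trivial GPU work every ~0.9 s so the device never goes fully idle
static dispatch_source_t start_gpu_heartbeat(id<MTLCommandQueue> queue) {
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
        dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0));

    dispatch_source_set_timer(timer,
        dispatch_time(DISPATCH_TIME_NOW, 0),
        (uint64_t)(0.9*NSEC_PER_SEC),    // fire every 0.9 s
        (uint64_t)(0.05*NSEC_PER_SEC));  // allow a small amount of leeway

    dispatch_source_set_event_handler(timer, ^{
        // (nearly) empty work, just to keep the GPU under slight pressure
        id<MTLCommandBuffer> cb = [queue commandBuffer];
        [cb commit];
    });

    dispatch_resume(timer);
    return timer; // call dispatch_source_cancel() to stop the heartbeat
}
```

The callback would of course need to be optional and ideally pause while a real workload is running, since it wastes a small amount of power.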