metal : GPU "idle-throttling" analysis #10119
Conversation
Found this thread from PyTorch, maybe related to this: pytorch/pytorch#124056
Not a lot of people are sifting through these requests. Maybe consider adding it to the Discussions or even somewhere in the README? Sounds like a (potentially) pretty significant thing if we can figure this out.
Just to confirm that it also still seems to be present on the M4. Machine: MacBook Pro, M4 Pro, 20-core GPU, 48 GB RAM
There is an option in macOS settings on the Mac mini, under 'Energy', to change to 'High Power' mode. Does it help in any way?
Have you considered reporting to http://feedbackassistant.apple.com?
I just submitted a ticket. Thanks.
It turns out that this "throttling" is caused by memory getting unwired after a certain amount of time, as explained by @awni here. Disabling the wired memory collector with the following command fixes the issue:

`sudo sysctl iogpu.disable_wired_collector=1`

This unwiring might also be preventable programmatically using Metal's residency sets: https://developer.apple.com/documentation/metal/mtlresidencyset

Edit: it works now
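For anyone exploring the residency set route, here is a minimal, untested sketch of how a Metal allocation could be kept resident so that the wired memory collector does not unwire it. It assumes macOS 15+ (where `MTLResidencySet` is available) and pre-existing `device`, `queue` and `buf` objects; it is not how llama.cpp currently handles this, just an illustration of the API:

```objc
#import <Metal/Metal.h>

// sketch: keep a Metal allocation resident using MTLResidencySet (macOS 15+)
// assumes an existing id<MTLDevice> device, id<MTLCommandQueue> queue, id<MTLBuffer> buf
static id<MTLResidencySet> make_residency_set(id<MTLDevice>       device,
                                              id<MTLCommandQueue> queue,
                                              id<MTLBuffer>       buf) {
    MTLResidencySetDescriptor * desc = [[MTLResidencySetDescriptor alloc] init];
    desc.label           = @"ggml-metal residency set";
    desc.initialCapacity = 1;

    NSError * error = nil;
    id<MTLResidencySet> rset = [device newResidencySetWithDescriptor:desc error:&error];
    if (rset == nil) {
        NSLog(@"failed to create residency set: %@", error);
        return nil;
    }

    [rset addAllocation:buf]; // add the buffer to the set
    [rset commit];            // apply the pending addition
    [rset requestResidency];  // ask the OS to make (and keep) the allocation resident

    // attach the set to the command queue so residency is maintained
    // for all command buffers submitted to it
    [queue addResidencySet:rset];

    return rset;
}
```

If this approach holds up, it would avoid asking users to change the `iogpu.disable_wired_collector` kernel setting.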
Update: the issue is caused by the GPU wired memory collector: #10119 (comment)
Looking for solutions that do not involve kernel re-configuration: #11427
Apple Silicon GPUs seem to have some sort of (power-saving?) mechanism which affects performance in a significant way. Here is some analysis and a demonstration. I'm hoping that someone in the community knows a way to work around this.
The `llama-idle` tool in this PR performs the following computation: sleep for `t` milliseconds, then wake up and decode a single token, measuring how long the decode takes (a simplified sketch is included below). The sleep emulates an idle period during which the GPU does nothing in an application, after which it wakes up to compute some tokens - in this case, for simplicity, a single token. In the ideal case, without any throttling mechanism, the decode time should be constant regardless of the value of `t`.

Here are the results on M2 Ultra from this test for 3 different models, increasing the time `t` from `0` up to `2200 ms`. It can be seen that after `~1s` of being idle, the next `llama_decode` takes extra time to compute.

My best guess is that this is something done by the OS in order to reduce power consumption. But the problem is that it is quite aggressive and it ruins the performance of local applications that run inference every few seconds.
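To make the setup concrete, here is a simplified sketch of the measurement loop described above (not the actual `llama-idle` source; `decode_one_token()` is a hypothetical placeholder for a single-token `llama_decode()` call):

```objc
#import <Foundation/Foundation.h>

// hypothetical placeholder for running a single-token llama_decode() on the GPU
extern void decode_one_token(void);

// sleep for increasing amounts of time and measure how long the next decode takes
static void run_idle_test(void) {
    for (int t_ms = 0; t_ms <= 2200; t_ms += 100) {
        // emulate an idle application: the GPU does nothing for t_ms milliseconds
        [NSThread sleepForTimeInterval:t_ms/1000.0];

        // wake up and decode a single token, timing the call
        const NSTimeInterval t0 = [NSDate timeIntervalSinceReferenceDate];
        decode_one_token();
        const NSTimeInterval t1 = [NSDate timeIntervalSinceReferenceDate];

        NSLog(@"idle = %4d ms, decode = %.2f ms", t_ms, (t1 - t0)*1000.0);
    }
}
```

Without any throttling, the reported decode time should stay flat as `t_ms` grows; the jump after roughly one second of idling is what this issue is about.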
Hopefully there is a way to disable this somehow - any insights would be highly appreciated. For now, the only solution I can come up with is to introduce an optional "heartbeat" callback that would call a dummy Metal kernel every `0.9s`, just to keep the GPU under slight pressure. This is obviously not a great solution, but it would be much better than the current situation.
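A rough sketch of that heartbeat idea, assuming an existing `id<MTLCommandQueue>`. Here the "dummy work" is just an empty command buffer committed by a GCD timer; a real implementation might dispatch a tiny compute kernel instead:

```objc
#import <Metal/Metal.h>

// submit trivial GPU work every ~0.9 s so the device never goes fully idle
static dispatch_source_t start_gpu_heartbeat(id<MTLCommandQueue> queue) {
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
        dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0));

    dispatch_source_set_timer(timer,
        dispatch_time(DISPATCH_TIME_NOW, 0),
        (uint64_t)(0.9*NSEC_PER_SEC),    // fire every 0.9 s
        (uint64_t)(0.05*NSEC_PER_SEC));  // allow a small amount of leeway

    dispatch_source_set_event_handler(timer, ^{
        // (nearly) empty work, just to keep the GPU under slight pressure
        id<MTLCommandBuffer> cb = [queue commandBuffer];
        [cb commit];
    });

    dispatch_resume(timer);
    return timer; // call dispatch_source_cancel() to stop the heartbeat
}
```

The callback would of course need to be optional and ideally pause while a real workload is running, since it wastes a small amount of power.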