
CUDA: use 1 thread if model is fully offloaded #2915

Merged
merged 1 commit into ggml-org:master on Sep 21, 2023

Conversation

JohannesGaessler
Collaborator

Currently, when using the maximum possible number of GPU layers with CUDA, there is no benefit from more than one thread; in fact, using more than one thread is detrimental due to the increased overhead. This PR changes the logic for the default number of threads so that (unless the user manually overrides it) only a single thread is used if all layers are offloaded.

I also changed the logic for llama-bench to match main: -1 is interpreted as the number of logical cores.
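A minimal sketch of the default selection described above, assuming hypothetical names (resolve_n_threads, user_set_threads); the n_layer + 3 full-offload threshold comes from the diff quoted further down, and the final implementation ended up inside llama_eval_internal instead (see below):

#include <thread>

// Sketch only, not the code merged in this PR.
static int resolve_n_threads(int n_threads_arg, bool user_set_threads,
                             int n_gpu_layers, int n_layer) {
    if (n_threads_arg < 0) {
        // llama-bench after this PR: -1 means "number of logical cores"
        return (int) std::thread::hardware_concurrency();
    }
#ifdef GGML_USE_CUBLAS
    // only adjust the default; an explicit -t value from the user is respected
    if (!user_set_threads && n_gpu_layers >= n_layer + 3) {
        return 1; // fully offloaded: extra CPU threads only add overhead
    }
#endif // GGML_USE_CUBLAS
    return n_threads_arg;
}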

@KerfuffleV2
Collaborator

Maybe update the help for main also?

-t N, --threads N     number of threads to use during computation (default: 8)

It might be good to mention the ability to set it to -1, -2 (in both places) and what that does.
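A possible wording (a sketch only, not text agreed on in this thread), covering the -1 behavior described in this PR and the new full-offload default:

-t N, --threads N     number of threads to use during computation (default: 8)
                      -1 = use the number of logical cores; if all layers are
                      offloaded to the GPU, the default drops to 1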

@Ph0rk0z commented Sep 1, 2023

Using 1 thread instead of 15 does make it 0.2-0.5 tokens per second faster for me.

@ggerganov
Member

Better to implement this entirely in llama_eval_internal, similar to what has been done for BLAS:

https://github.com/ggerganov/llama.cpp/blob/8b56b4f2c396eae1f4417e5a859557fed989e0ee/llama.cpp#L2899-L2904

Comment on lines 731 to 735
#ifdef GGML_USE_CUBLAS
if (params.n_gpu_layers >= llama_model_n_layer(model) + 3) {
params.n_threads = 1;
}
#endif // GGML_USE_CUBLAS
Member

This is an implementation detail that the user should not need to know about, and in the future we will fix this anyway.

@JohannesGaessler
Collaborator Author

I haven't forgotten about this; I've only prioritized other things because I don't think this is high priority.

@JohannesGaessler
Collaborator Author

@ggerganov thank you for the hint about llama_eval_internal; the implementation was much simpler that way.

@ggerganov merged commit 8185710 into ggml-org:master on Sep 21, 2023
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023