CUDA: use 1 thread if model is fully offloaded #2915

JohannesGaessler · 2023-08-30T20:05:39Z

Currently, when the maximum possible number of GPU layers is offloaded with CUDA, there is no benefit from using more than 1 thread. In fact, using more than 1 thread is detrimental due to increased overhead. This PR changes the logic for the default number of threads so that, unless the user manually overrides it, only a single thread is used if all layers are offloaded.

I also changed the logic for llama-bench to be the same as main: -1 is interpreted as the number of logical cores.
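To make the intended default concrete, here is a minimal sketch of the selection rule described above. It is not the code from this PR: `ThreadRequest`, `logical_core_count` and `default_n_threads` are hypothetical names, and the sketch deliberately does not model the `-1` sentinel, it only distinguishes "set by the user" from "not set".

```cpp
#include <cassert>
#include <thread>

// Hypothetical type for illustration only; not part of llama.cpp.
struct ThreadRequest {
    bool set_by_user; // true if the user passed an explicit thread count
    int  value;       // the explicit count, ignored when set_by_user is false
};

// Number of logical cores, falling back to 1 because
// std::thread::hardware_concurrency() may return 0 when it cannot tell.
int logical_core_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n > 0 ? static_cast<int>(n) : 1;
}

// Selection rule sketched from the PR description:
//  - an explicit user value always wins,
//  - otherwise use 1 thread when all layers are offloaded,
//  - otherwise fall back to the supplied CPU default.
int default_n_threads(ThreadRequest req, bool fully_offloaded, int cpu_default) {
    assert(cpu_default >= 1);
    if (req.set_by_user) {
        return req.value;
    }
    return fully_offloaded ? 1 : cpu_default;
}
```

A caller would typically pass `logical_core_count()` as `cpu_default`; it is kept as a parameter here so the rule can be checked with fixed numbers.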

KerfuffleV2 · 2023-09-01T01:35:04Z

Maybe update the help for main also?

-t N, --threads N     number of threads to use during computation (default: 8)

It might be good to mention the ability to set it to -1, -2 (in both places) and what that does.

Ph0rk0z · 2023-09-01T11:02:32Z

For me, using 1 thread instead of 15 does make it 0.2-0.5 tokens faster.

ggerganov

Better to implement this entirely in llama_eval_internal, similar to what has been done for BLAS:

https://github.com/ggerganov/llama.cpp/blob/8b56b4f2c396eae1f4417e5a859557fed989e0ee/llama.cpp#L2899-L2904

ggerganov · 2023-09-02T12:41:38Z

common/common.cpp

+#ifdef GGML_USE_CUBLAS
+        if (params.n_gpu_layers >= llama_model_n_layer(model) + 3) {
+            params.n_threads = 1;
+        }
+#endif // GGML_USE_CUBLAS


This is an implementation detail that the user should not need to know about, and we will fix this in the future anyway.
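As a rough illustration of the review feedback, the same check can be applied to a local thread count at evaluation time instead of overwriting `params.n_threads`. This is only a sketch under assumptions: `ModelInfo`, `is_fully_offloaded` and `eval_n_threads` are hypothetical names, the real `llama_eval_internal` looks different, and the `+ 3` threshold is simply copied from the diff quoted above.

```cpp
#include <cassert>

// Hypothetical stand-in for the model state; the real llama.cpp
// structures are different and much larger.
struct ModelInfo {
    int n_layer;      // number of layers in the model
    int n_gpu_layers; // number of layers requested for GPU offload
};

// Threshold taken from the diff in this thread: n_gpu_layers >= n_layer + 3.
bool is_fully_offloaded(const ModelInfo & m) {
    return m.n_gpu_layers >= m.n_layer + 3;
}

// Returns the thread count to use for a single evaluation call.
// n_threads is taken by value, so the user's setting is never modified;
// the override stays an internal detail of the evaluation.
int eval_n_threads(const ModelInfo & m, int n_threads, bool cuda_enabled) {
    assert(n_threads >= 1);
    if (cuda_enabled && is_fully_offloaded(m)) {
        return 1;
    }
    return n_threads;
}
```

With this shape, argument parsing and the help text do not need to know about the special case at all, which is the point of the review comment.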

JohannesGaessler · 2023-09-16T14:02:22Z

I haven't forgotten about this; I've only prioritized other things because I don't think this is that high priority.

JohannesGaessler · 2023-09-18T16:16:48Z

@ggerganov thank you for the hint about llama_eval_internal; the implementation was way simpler this way.

ggerganov requested changes Sep 2, 2023


Green-Sky assigned JohannesGaessler Sep 16, 2023

CUDA: use only 1 thread if fully offloaded

de8035a

JohannesGaessler force-pushed the cuda-n-threads-fix branch from ff65f9a to de8035a Compare September 18, 2023 16:15

slaren approved these changes Sep 18, 2023


ggerganov approved these changes Sep 21, 2023


ggerganov merged commit 8185710 into ggml-org:master Sep 21, 2023

JohannesGaessler mentioned this pull request Sep 21, 2023

ggml: create thread pool lazily #2674

Closed

pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023

CUDA: use only 1 thread if fully offloaded (ggml-org#2915)

e0ab20a
