CUDA: use only 1 thread if fully offloaded
JohannesGaessler committed Sep 18, 2023
1 parent 8781013 commit de8035a
Showing 1 changed file with 9 additions and 0 deletions.
llama.cpp: 9 additions & 0 deletions
@@ -3777,6 +3777,15 @@ static bool llama_eval_internal(
         n_threads = std::min(4, n_threads);
     }
 
+    // If all tensors can be run on the GPU then using more than 1 thread is detrimental.
+    const bool full_offload_supported = model.arch == LLM_ARCH_LLAMA ||
+        model.arch == LLM_ARCH_BAICHUAN ||
+        model.arch == LLM_ARCH_FALCON;
+    const bool fully_offloaded = model.n_gpu_layers >= (int) hparams.n_layer + 3;
+    if (ggml_cpu_has_cublas() && full_offload_supported && fully_offloaded) {
+        n_threads = 1;
+    }
+
     struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
     struct ggml_tensor * embeddings = gf->nodes[gf->n_nodes - 2];
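Not part of the commit: a minimal, self-contained C++ sketch of the heuristic added above, using hypothetical values for the layer count, the requested GPU layers, and the requested thread count. The `+ 3` comparison mirrors the check in the diff, which presumably requires the non-repeating tensors to be offloaded in addition to the repeating layers before the model counts as fully offloaded; the cuBLAS and architecture checks are stubbed out as booleans.

// Hypothetical standalone sketch, not from the commit: shows how the heuristic
// collapses the thread count to 1 once everything is assumed to be on the GPU.
#include <cstdio>

int main() {
    const int n_layer      = 32;   // hypothetical repeating layer count
    const int n_gpu_layers = 35;   // hypothetical user setting (>= n_layer + 3)
    int       n_threads    = 8;    // hypothetical requested CPU thread count

    const bool has_cublas             = true;  // stands in for ggml_cpu_has_cublas()
    const bool full_offload_supported = true;  // stands in for the LLAMA/BAICHUAN/FALCON arch check

    // Same condition as the commit: full offload only counts if the extra
    // non-repeating tensors fit on the GPU as well, hence the + 3.
    const bool fully_offloaded = n_gpu_layers >= n_layer + 3;

    if (has_cublas && full_offload_supported && fully_offloaded) {
        n_threads = 1;  // extra CPU threads would only add synchronization overhead
    }

    printf("n_threads = %d\n", n_threads);  // prints: n_threads = 1
    return 0;
}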
