ggml : improve API to allow allocating compute graphs on the heap #299
I am currently working on memory improvements for training and testing with bigger models than before. To solve the stack overflows this causes, I allocate heap memory for the graphs by using the data buffer of a new tensor, and then use only the `*_expand` build functions instead of the regular build functions:

```c
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
```

Example for allocating a new graph:

```c
struct ggml_tensor * gfbuf = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32,
        sizeof(struct ggml_cgraph) / ggml_type_size(GGML_TYPE_I32) +
        (sizeof(struct ggml_cgraph) % ggml_type_size(GGML_TYPE_I32) ? 1 : 0));
memset(gfbuf->data, 0, ggml_nbytes(gfbuf));
struct ggml_cgraph * gf = (struct ggml_cgraph *) gfbuf->data;
```

This seems to be enough to solve the stack overflows in training. In the llama.cpp inference code there are only a few locations where a cgraph is allocated on the stack. A function that directly allocates a ggml object with enough bytes from the context would be nice! Maybe something like this:

```c
GGML_API void * ggml_alloc(struct ggml_context * ctx, size_t nbytes);
...
struct ggml_cgraph * gf = ggml_alloc(ctx0, sizeof(struct ggml_cgraph));
```

I prefer allocating the memory from the context over a plain `malloc`, so that the machinery for freeing the context and all its related memory can be reused.
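A minimal sketch of what such a helper could look like, assuming it is implemented as a thin wrapper around a 1-D `GGML_TYPE_I8` tensor; the name `ggml_alloc` is only the suggestion above, not an existing ggml function:

```c
#include <string.h> // memset
#include "ggml.h"

// hypothetical helper: hand out zeroed bytes that live inside the context,
// so they are freed together with everything else when the context is freed
static void * ggml_alloc(struct ggml_context * ctx, size_t nbytes) {
    struct ggml_tensor * buf = ggml_new_tensor_1d(ctx, GGML_TYPE_I8, nbytes);
    memset(buf->data, 0, ggml_nbytes(buf));
    return buf->data;
}

// usage: the compute graph lives in context memory instead of on the stack
struct ggml_cgraph * gf = (struct ggml_cgraph *) ggml_alloc(ctx0, sizeof(struct ggml_cgraph));
```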
I think it is a good idea to allocate it from the context memory, but it should be automatic. This could be done by passing a context to
To make this work, the graph must only be built in one step, but I think that
I don't think there is such a case. I agree we should aim to eliminate
Yes, agreed. The only problem might be that after ggml-org/llama.cpp#1999 there will no longer be an "eval"
I don't think that's necessary. We can pass the same context:

```c
ggml_tensor * output = ggml_mul(ctx, ...);
ggml_cgraph * gf = ggml_build_forward(ctx, output); // gf will be allocated in ctx

// execute the graph on the CPU
ggml_cgraph_context plan = ggml_graph_compute_plan(gf);
plan.work_data = malloc(plan.work_size);
ggml_graph_compute(plan);

// execute the same graph with CUDA
ggml_cuda_graph_context cuda_plan = ggml_cuda_compute_plan(gf);
ggml_cuda_graph_compute(cuda_plan);
```
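One nice property of this split, if I read the sketch correctly, is that the caller owns the scratch buffer for the whole computation. A minimal CPU-only continuation of the sketch above, with cleanup; the `ggml_cgraph_context` and `ggml_graph_compute_plan` names are taken from the proposal, not an existing API:

```c
// plan first, so the caller learns how much scratch memory the graph needs
ggml_cgraph_context plan = ggml_graph_compute_plan(gf);

// the caller allocates (and later frees) the work buffer explicitly
plan.work_data = malloc(plan.work_size);

ggml_graph_compute(plan);

free(plan.work_data);
```

This also means the same work buffer could be reused across repeated evaluations of the same graph instead of being reallocated every time.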
Ah, correct - I was a bit confused. Ignore my last comment.
@ggerganov should we remove
I guess that it was used for debugging early on, but currently it seems to be unnecessary. If we want to keep
I think it is only useful if we want to be able to enumerate all the objects allocated in a
I recently needed the
Currently, `ggml` forces the user to allocate the compute graphs on the stack. The `ggml` API should be extended to support using heap-allocated graphs.
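For illustration, here is roughly what the two allocation patterns look like side by side; the `ggml_alloc` helper is the hypothetical one sketched earlier in the thread, not an existing ggml function:

```c
// current: the whole cgraph is a stack variable of the caller
struct ggml_cgraph gf_stack = ggml_build_forward(output);

// desired: only a pointer lives on the stack, the graph itself sits in context (heap) memory
struct ggml_cgraph * gf_heap = (struct ggml_cgraph *) ggml_alloc(ctx0, sizeof(struct ggml_cgraph));
ggml_build_forward_expand(gf_heap, output);
```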