Add embedding mode with arg flag. Currently working #282
Conversation
It's true that I am, but we never reach that line. I forgot to uncomment it for the PR, but before that, the program fails on the `ggml_graph_compute(ctx0, &gf);` line. We never reach that other part.
On Sun, Mar 19, 2023 at 8:09 PM, taher commented on this pull request, in main.cpp (#282 (comment)):

@@ -936,12 +943,27 @@ int main(int argc, char ** argv) {
     printf(ANSI_COLOR_YELLOW);
 }
+ if (params.embedding){
+     printf("got right before second call.\n");
+     const int64_t t_start_us = ggml_time_us(); //HERE
+     if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
+         fprintf(stderr, "Failed to predict\n");
+         return 1;
+     }
+     //ggml_free(model.ctx);
+
+     if (params.use_color) {
+         printf(ANSI_COLOR_RESET);
+     }
+     return 0;

Looks like you're returning without freeing:

ggml_free(model.ctx);
main.cpp
Outdated
if (params.embedding){
    printf("got right before second call.\n");
    const int64_t t_start_us = ggml_time_us(); //HERE
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
You're not calling `llama_eval` for the remaining tokens.
The sentence embedding depends only on the input, so we shouldn't have to generate extra tokens later. Though I may be wrong. But changing that wouldn't solve the failed-execution error.
Are you trying to capture input text's embeddings or predicted text embeddings?
Given the PR, it is doing the input text's embeddings.
The vector of tokens that you are sending for evaluation is empty. Please ensure that it is set correctly right before evaluating it, and also add a check to verify that it is not empty.
embd = embd_inp;
if (embd.size() > 0) {
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
        fprintf(stderr, "Failed to predict\n");
        return 1;
    }
}
Let me know if the above fixes the runtime crash
I'm sorry, I had seen the email and read the message, but I missed the snippet. I will try this soon. Thank you!
That solved it! My version is working, but I see merge conflicts; I am testing the merge now.
Merge successful. Now we have embeddings! This should be ready to merge into main.
main.cpp
Outdated
while (remaining_tokens > 0) {
    // predict
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();

-       if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token)) {
+       if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, false)) {
Pass `params.embedding` directly and remove the whole if block at line 946.
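For illustration, a minimal sketch of that suggestion, reusing the llama_eval signature from the diff above and changing nothing else:

// Forward params.embedding instead of a hard-coded literal, so the same
// llama_eval call covers both generation and embedding mode.
if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, params.embedding)) {
    fprintf(stderr, "Failed to predict\n");
    return 1;
}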
But if we want the embedding, we only want to print once, and not generate the next token.

As for whether we are capturing the input text's or the predicted text's embeddings: input text.
Do the input tokens need to go through many layers to obtain their embeddings? My guess is that the input embeddings should be obtainable earlier, so that it is not necessary to loop through multiple layers and also obtain the logits. Perhaps we should return as soon as the input tokens are projected into the embedding space, as this step would give us the vector representation we need.
The embeddings should be the output of the last attention layer, corresponding to the last token in the input.
Are the input embeddings obtained at the starting layer static rather than contextual? In order for the embeddings to include context for each input token in the sequence, is it necessary for the input to go through all the layers?
Sorry, I can't tell if those are assertions or questions. My understanding is that we want the whole input to go through all the attention layers, and then we take the representation for the last token; that is what this PR is supposed to be doing. Right now I think it is working, though I would say we should test it more before merging (it doesn't break, and it gives an output of the correct shape, but I would test more).
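For context, a hypothetical sketch (not code from this PR) of how the last token's representation could be copied out after the graph compute; the names inpL, n_embd, N, and embedding_out are illustrative assumptions:

// After ggml_graph_compute(), take the hidden state of the last input token
// (row N-1 of the N x n_embd activations) as the sentence embedding.
std::vector<float> embedding_out(n_embd);
memcpy(embedding_out.data(),
       (float *) ggml_get_data(inpL) + n_embd * (N - 1),
       sizeof(float) * n_embd);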
Okay, I think this is working well. I tested it with multiple inputs and the embeddings make a lot of sense.
Do let me know if you think other changes are needed before merging, and thank you for your help in getting this to work!
main.cpp
Outdated
embd = embd_inp;
if (embd.size() > 0) {
    const int64_t t_start_us = ggml_time_us();
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
Pass in `params.embedding` instead, which presumably contains the same information about whether or not to use the embeddings as input.
main.cpp
Outdated
if (params.embedding){
    embd = embd_inp;
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();
It seems that we are storing the start timestamp here, but is it being used to measure anything in embedding mode? If not, let's remove it. Alternatively, do we want to measure the embedding compute time?
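If the intent is the latter, a rough sketch of timing the embedding pass, assuming an accumulator such as t_predict_us (the name is an assumption) that gets reported at exit:

const int64_t t_start_us = ggml_time_us();
if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
    fprintf(stderr, "Failed to predict\n");
    return 1;
}
// accumulate the time spent computing the embedding
t_predict_us += ggml_time_us() - t_start_us;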
Okay, I merged with master again and moved everything to llama.cpp. I addressed both changes: everywhere it makes sense, the boolean now matches the console arguments (the memory-size test at the beginning needs to be hardcoded to false), and I removed the spurious time check.
How are you testing the correctness of the embeddings?
Hi, this will be a great addition!

Let's change the interface slightly, like this:

- Add a parameter to `llama_context_params` called `embeddings`, which is `false` by default. It will be used to specify whether we want the `llama_eval()` call to keep the computed embeddings or not.
- Add an embedding buffer (`std::vector<float>`) to the `llama_context`, similar to the `logits` buffer, and resize it upon model load if the `embeddings` parameter is `true`. If it is `false`, leave it empty.
- Keep `llama_eval_internal()` as original and only change it to copy the embeddings into the new context buffer if that buffer is not empty (i.e. the `embeddings` parameter was set to `true` during init). You will need to store the `ggml` tensor at the end in a separate variable so you can access it after the `ggml_graph_compute()` call, similar to the `logits`. No need for an alternative `ggml_graph_compute()` call as you have proposed.
- Add a C API call to get the new embeddings buffer, same as `llama_get_logits()`.
- You can now print the embeddings in user code, instead of in the `llama` library.
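A rough sketch of what this interface could look like; the field and function names follow the list above, but the exact declarations are assumptions rather than the final API:

struct llama_context_params {
    // ... existing fields ...
    bool embeddings; // false by default; if true, llama_eval() keeps the computed embeddings
};

struct llama_context {
    // ... existing fields ...
    std::vector<float> logits;
    std::vector<float> embeddings; // resized to n_embd on model load when params.embeddings is true
};

// C API accessor for the embeddings buffer, analogous to llama_get_logits().
float * llama_get_embeddings(struct llama_context * ctx);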
I tried multiple inputs and judged by the cosine similarities between semantically similar vs. dissimilar sentences. I am more than open to other tests if anyone can think of them. Example pair-wise correlations:
This also made me think: it may be desirable to normalize the embeddings (to norm 1; I don't mean the normalization layer itself). I was doing it in post-processing. Do you think it would be good to add this? I'm not sure how to do it in ggml, but we would just need to divide by the embedding's norm.
I'm curious whether normalizing reduces the dimensionality of the embedding space. If the consumer's goal is only to compute cosine similarity from the embeddings, then I think it makes sense to normalize. However, I suggest leaving the embeddings as raw vectors and letting the consumer decide whether or not to normalize them. It's possible that others may have a different opinion on this.
I agree, normalizing would be lossy. I would assume that, e.g., the GPT-3 API gives you normalized embeddings, but it's easy enough to normalize them on the consumer end, so I wouldn't sweat it. I will look into the changes ggerganov suggested; there is also a new merge conflict to solve.
It would be helpful to add an example of generating input embeddings, normalizing them, and computing cosine similarity in the example folder in this PR.
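As a starting point, an illustrative consumer-side sketch (not code from this PR) that L2-normalizes an embedding and computes the cosine similarity between two embeddings:

#include <cmath>
#include <vector>

// Scale a vector to unit (L2) norm; leaves an all-zero vector unchanged.
static void normalize(std::vector<float> & v) {
    double norm = 0.0;
    for (float x : v) norm += (double) x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0) {
        for (float & x : v) x = (float) (x / norm);
    }
}

// Cosine similarity between two embeddings of the same dimension.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return (float) (dot / (std::sqrt(na * nb) + 1e-9));
}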
Hi ggerganov,

"Keep llama_eval_internal() as original and only change to copy the embeddings into the new context buffer if it is not empty (i.e. embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. No need for an alternative ggml_graph_compute() call as you have proposed."

Do I still keep the behavior where we check and, if embeddings == true, we only run until we have the embeddings and then return? If I do, I can't leave llama_eval_internal identical. Also, there is that test run we do at the beginning (something about memory cost) that requires me to pass false for the embeddings, even if we will keep them later. The model load / config is already as specified.
llama.cpp
Outdated
ggml_build_forward_expand(&gf, inpL);
ggml_graph_compute (ctx0, &gf);
Do we really need to update and compute the graph in embedding mode?
@StrikingLoo Just pushed a change. Give it a try and let me know if it works.
This looks perfect to me. Both generation and embeddings work without errors. Just to make sure: would we be merging like this, or should I add the display_embeddings code under "// TODO: print / use the embeddings"?

Edit:
This will become a separate example program called
@ggerganov is this the correct command?

It seems it's not using the prompt in
Hi everyone,
I took a stab at adding embedding mode, where we print the sentence embedding for the input instead of generating more tokens.
If I only add the compute and print inside llama_eval itself, that works. But for some reason, after adding the boolean flag (even if I create a second function that has the same preamble but in the end only prints the embeddings), it stops working.
Could anyone take a look and tell me where I failed or what I am missing, so that we can add this capability?
Thank you!
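For reference, a hypothetical sketch of how such a flag could be wired into the argument parsing; the flag spelling (--embedding) and the surrounding loop are assumptions for illustration:

// Illustrative only: set params.embedding when the (assumed) --embedding flag is passed.
for (int i = 1; i < argc; i++) {
    const std::string arg = argv[i];
    if (arg == "--embedding") {
        params.embedding = true; // print the input's sentence embedding instead of generating tokens
    }
    // ... other flags handled as before ...
}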