
Add embedding mode with arg flag. Currently working #282

Merged (11 commits) on Mar 24, 2023

Conversation

@StrikingLoo (Contributor)

Hi everyone,
I took a stab at adding an embedding mode, where we print the sentence embedding for the input instead of generating more tokens.
If I only add the compute-and-print inside llama_eval itself, that works. But for some reason, after adding the boolean flag (even if I create a second function with the same preamble that only prints the embeddings at the end), it stops working.
Could anyone take a look and tell me where I went wrong or what I am missing, so that we can add this capability?

Thank you!
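For context, the mechanism being described is just a boolean on the options struct plus a command-line flag. A minimal sketch, assuming a gpt_params-style struct and the usual string-comparison argument loop; apart from embedding / --embedding, the names here are illustrative stand-ins rather than the PR's actual diff:

#include <string>

// Illustrative sketch only: a boolean in the parameter struct and a branch in
// the argument-parsing loop that turns it on.
struct gpt_params {
    int  n_threads = 4;
    bool embedding = false; // if true, print the input's embedding and stop instead of generating
    // ... other existing fields ...
};

static bool parse_embedding_flag(const std::string & arg, gpt_params & params) {
    if (arg == "--embedding") {
        params.embedding = true;
        return true;
    }
    return false; // flag not handled by this sketch
}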

@anzz1 marked this pull request as draft March 19, 2023 07:00
@StrikingLoo marked this pull request as ready for review March 19, 2023 23:51
@StrikingLoo (Contributor Author) commented Mar 20, 2023 via email

@gjmulder added the enhancement (New feature or request) label Mar 20, 2023
main.cpp Outdated
if (params.embedding) {
    printf("got right before second call.\n");
    const int64_t t_start_us = ggml_time_us(); //HERE
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
@nullhook commented Mar 20, 2023

You're not calling llama_eval for remaining tokens

@StrikingLoo (Contributor Author)

The sentence embedding depends on the input; we shouldn't have to generate extra tokens afterwards. Though I may be wrong. In any case, changing that wouldn't solve the failed-execution error.


Are you trying to capture the input text's embeddings or the predicted text's embeddings?


Given the PR, it is computing the input text's embeddings.


The vector of tokens that you are sending for evaluation is empty. Please ensure that it is set correctly right before evaluating it, and also add a check to verify that it is not empty.

embd = embd_inp;

if (embd.size() > 0) {
   if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
      fprintf(stderr, "Failed to predict\n");
      return 1;
   }
}


Let me know if the above fixes the runtime crash

@StrikingLoo (Contributor Author)

I'm sorry, I had seen the email and read the message, but not the snippet. I will try this soon. Thank you!

@StrikingLoo (Contributor Author)

That solved it! My version is working, but I see merge conflicts; I am testing the merge now.

@StrikingLoo (Contributor Author)

Merge successful. Now we have embeddings! This should be ready to merge into main.

main.cpp Outdated
while (remaining_tokens > 0) {
    // predict
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();

        if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token)) {
        if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, false)) {


Pass params.embedding directly and remove the whole if block in line 946

@StrikingLoo (Contributor Author)

But if we want the embedding, we only want to print it once and not generate the next token.

@StrikingLoo (Contributor Author) commented Mar 21, 2023 via email

@nullhook commented Mar 22, 2023

Do the input tokens need to go through many layers to obtain their embeddings? My guess is that the input embeddings should be obtainable earlier, so that it is not necessary to loop through all the layers and also compute the logits.

Perhaps we should return as soon as the input tokens are projected into the embedding space, since that step would give us the vector representation we need.

@StrikingLoo (Contributor Author)

The embeddings should be the output of the last attention layer, corresponding to the last token in the input.
Say I input "I am a dog": in the end the transformer maps each token to an embedding (each attending to each). I want the embedding that the last attention layer assigns to the last token. Not the logits themselves, but certainly not just the word embeddings from the start, before any pass through the attention layers.
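As a rough illustration of that idea (an assumption-level sketch, not code from this PR): if the final hidden-state tensor after ggml_graph_compute() holds n_tokens rows of n_embd floats, the last row is the vector being discussed.

#include <cstring>
#include <vector>
#include "ggml.h"

// Hypothetical helper: copy the final layer's hidden state for the last input
// token into its own buffer. `hidden` is assumed to be laid out as n_tokens
// contiguous rows of n_embd floats after ggml_graph_compute().
static std::vector<float> last_token_embedding(const struct ggml_tensor * hidden,
                                               int n_embd, int n_tokens) {
    std::vector<float> out(n_embd);
    const float * data = (const float *) hidden->data;
    std::memcpy(out.data(), data + (size_t) n_embd * (n_tokens - 1),
                sizeof(float) * n_embd);
    return out;
}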

@nullhook commented Mar 22, 2023

So the input embeddings obtained at the starting layer are static and not contextual? And in order for the embeddings to include context for each input token in the sequence, the input has to go through all the layers?

@StrikingLoo (Contributor Author)

Sorry, I can't tell whether those are assertions or questions. My understanding is that we want the whole input to go through all the attention layers, and then we take the representation of the last token. That is what this PR is supposed to be doing. Right now I think it is working, though I would say we can test it more before merging (right now I see it doesn't break and it gives an output of the correct shape, but I would test further).

@StrikingLoo (Contributor Author)

Okay, I think this is working well. I tested it with multiple inputs and the embeddings make a lot of sense.
Just to clarify, what this is doing is:

  • Take the input (a sentence, etc.).
  • Feed it through all the attention layers. In the end we have N embeddings of size n_embd.
  • Print the embedding corresponding to the last token in the input.
  • Stop execution.

Right now it runs without errors both in embedding mode and in normal mode, and the embeddings look coherent. More exhaustive testing could be done by someone with access to better compute, but I tried 8 inputs, some semantically related and some not, and the cosine similarities between embeddings made sense. I made some sentences share the exact same last word so that, if we were merely outputting word embeddings with no forward pass, similarity would be maximal; it was not, so this looks good.

Do let me know if you think other changes are needed before merging, and thank you for your help in getting this to work!

main.cpp Outdated
embd = embd_inp;
if (embd.size() > 0) {
    const int64_t t_start_us = ggml_time_us();
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {


Pass in params.embedding instead, which presumably carries the same information about whether or not we want the embeddings.

main.cpp Outdated
if (params.embedding) {
    embd = embd_inp;
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();


It seems that we are storing the start timestamp here, but is it being used to measure anything in embedding mode? If not, let's remove it. Alternatively, do we want to measure the embedding compute time?

@StrikingLoo (Contributor Author)

Okay, I merged with master again and moved everything to llama.cpp. I addressed both changes: everywhere it makes sense, the boolean now matches the console arguments (the test at the beginning for memory size needs to be hardcoded to false), and I removed the spurious time check.

@StrikingLoo changed the title from "[ Do not merge ] Add embedding mode with arg flag. Currently not working" to "Add embedding mode with arg flag. Currently working" Mar 22, 2023
@nullhook

How are you testing the correctness of the embeddings?

@ggerganov (Member) left a comment


Hi, this will be a great addition!

Let's change the interface slightly, like this:

  • Add a parameter to llama_context_params called embeddings, which is false by default. It will be used to specify whether we want the llama_eval() call to keep the computed embeddings or not.
  • Add an embedding buffer (std::vector<float>) to the llama_context, similar to the logits buffer, and resize it upon model load if the embeddings parameter is true. If it is false, leave it empty.
  • Keep llama_eval_internal() as in the original and only change it to copy the embeddings into the new context buffer if that buffer is not empty (i.e. the embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. There is no need for an alternative ggml_graph_compute() call as you have proposed.
  • Add a C API call to get the new embeddings buffer, same as llama_get_logits().
  • You can now print the embeddings in user code, instead of in the llama library.
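For reference, a rough sketch of what the user-side flow could look like under this proposal. The embedding field in llama_context_params and llama_get_embeddings() are assumptions modeled on the description above and on the existing llama_get_logits() convention, not necessarily the final API:

#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();
    cparams.embedding = true; // assumed flag: keep the computed embeddings after llama_eval()

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model-q4_0.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    // tokenize the prompt (true = prepend the BOS token)
    std::vector<llama_token> tokens(512);
    const int n_tokens = llama_tokenize(ctx, "I love cats", tokens.data(), (int) tokens.size(), true);
    tokens.resize(n_tokens);

    // a single eval over the whole prompt; no further generation is needed
    if (llama_eval(ctx, tokens.data(), (int) tokens.size(), 0, 4) != 0) {
        fprintf(stderr, "eval failed\n");
        return 1;
    }

    // print the embeddings from user code, analogous to llama_get_logits()
    const float * emb = llama_get_embeddings(ctx); // assumed accessor
    const int n_embd  = llama_n_embd(ctx);
    for (int i = 0; i < n_embd; i++) {
        printf("%f ", emb[i]);
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}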

@StrikingLoo (Contributor Author)

I tried multiple inputs and judged by the cosine similarities between semantically similar and unrelated sentences. I am more than open to other tests if anyone can think of them.

Example pair-wise correlations

Napoleonic_France -- cats_are_cute: -0.027480708515286424
Napoleonic_France -- I_love_dogs: -0.31697662922849257
Napoleonic_France -- I_love_cats: -0.301660192174504
Napoleonic_France -- Victorian_England: 0.8798725429541867
cats_are_cute -- Napoleonic_France: -0.027480708515286424
cats_are_cute -- I_love_dogs: 0.576915180966497
cats_are_cute -- I_love_cats: 0.6115158942552829
cats_are_cute -- Victorian_England: -0.01785591030506376
I_love_dogs -- Napoleonic_France: -0.31697662922849257
I_love_dogs -- cats_are_cute: 0.576915180966497
I_love_dogs -- I_love_cats: 0.9332809579375094
I_love_dogs -- Victorian_England: -0.3006025355059021
I_love_cats -- Napoleonic_France: -0.301660192174504
I_love_cats -- cats_are_cute: 0.6115158942552829
I_love_cats -- I_love_dogs: 0.9332809579375094
I_love_cats -- Victorian_England: -0.2838594074158372
Victorian_England -- Napoleonic_France: 0.8798725429541867
Victorian_England -- cats_are_cute: -0.01785591030506376
Victorian_England -- I_love_dogs: -0.3006025355059021
Victorian_England -- I_love_cats: -0.2838594074158372

This also made me think: it may be desirable to normalize the embeddings (to norm 1; I don't mean the normalization layer itself). I was doing it in post-processing. Do you think it would be good to add this? I'm not sure how to do it in ggml, but we would just need to divide by the embedding's norm.

@nullhook

I'm curious whether normalizing reduces the dimensionality of the embedding space. If the consumer's only goal is to compute cosine similarity from the embeddings, then I think it makes sense to normalize. However, I suggest leaving the embeddings as raw vectors and letting the consumer decide whether or not to normalize them. Others may have a different opinion on this.

@StrikingLoo (Contributor Author)

I agree, normalizing would be lossy. I would assume that e.g. the GPT-3 API gives you normalized embeddings, but it's easy enough to normalize them on the consumer's end, so I wouldn't sweat it.

I will look into the changes ggerganov suggested; there is also a new merge conflict to solve.

@nullhook commented Mar 22, 2023

It would be helpful to add an example of generating input embeddings, normalizing them, and computing cosine similarity in the example folder in this PR.
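Something along these lines, as a sketch rather than a finished example; it only assumes two raw embedding vectors of equal length, e.g. read out of the embeddings buffer discussed above:

#include <cmath>
#include <vector>

// L2-normalize an embedding in place (divide every component by the Euclidean norm).
static void normalize(std::vector<float> & v) {
    double norm = 0.0;
    for (float x : v) norm += (double) x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0) {
        for (float & x : v) x = (float) (x / norm);
    }
}

// Cosine similarity between two raw (not necessarily normalized) embeddings.
static double cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    const size_t n = a.size() < b.size() ? a.size() : b.size();
    for (size_t i = 0; i < n; i++) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0; // avoid dividing by zero for empty or zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}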

@StrikingLoo (Contributor Author)

Hi ggerganov,
Thank you for the instructions. I did steps 1, 2 and 4, but I am not sure what we want the final program to look like in step 3.

"Keep llama_eval_internal() as original and only change to copy the embeddings into the new context buffer if it is not empty (i.e. embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. No need for an alternative ggml_graph_compute() call as you have proposed"

Do I still keep the behavior where we check and, if embeddings == true, we only run until we have the embeddings and then return? If I do, I can't leave llama_eval_internal identical. Plus there is that test run we do at the beginning (something about memory cost) that requires me to pass false for the embeddings, even if we will keep them later.

The model load / config is already as specified.

llama.cpp Outdated
Comment on lines 825 to 826
ggml_build_forward_expand(&gf, inpL);
ggml_graph_compute (ctx0, &gf);


Do we really need to update and compute the graph in embedding mode?

@ggerganov (Member)

@StrikingLoo Just pushed a change. Give it a try and let me know if it works

@StrikingLoo (Contributor Author) commented Mar 24, 2023

This looks perfect to me. Both generation and embeddings work without errors. Just to make sure: would we be merging like this, or should I add the display_embeddings code under "// TODO: print / use the embeddings"?

Edit:
Adding this here instead of posting multiple comments. Eventually I would like a version where either, if we pass the embedding argument, nothing else is printed to stdout except the embeddings (so that the command can be easily piped), or, alternatively, if we pass --embedding --output-path 'path.emb' or some such, the embeddings are stored in a file. But I'm okay with the program not having that by default and letting users add it themselves; I just think many people will prefer to use the program as a black box without checking the code. Maybe the API mode you mentioned before is already planning to add something like this.

@ggerganov (Member)

This will become a separate example program called embedding. For now we will merge it like this because I want to start refactoring, and later we will move it into a separate program.

@loretoparisi

@ggerganov is this the correct command?

./embedding -m models/7B/ggml-model-q4_0.bin -p "ciao" -n 512 

It seems it's not using the prompt passed with -p. In fact, I do not see in the logs the line from embedding.cpp that should print it:

main: seed = 1679853514
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
2.177604 -1.095253
...
04 0.534440 0.732717 0.781988 -1.836264 -0.860989 -0.564879 0.084990 0.838598 1.210304 -0.441369 -1.963783 2.096257 

llama_print_timings:        load time =  1124.05 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =   464.57 ms /     3 tokens (  154.86 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  2215.86 ms

AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024