
Add embedding mode with arg flag. Currently working #282

Merged (11 commits) on Mar 24, 2023

Conversation

@StrikingLoo (Contributor)

Hi everyone,
I took a stab at adding an embedding mode, where we print the sentence embedding for the input instead of generating more tokens.
If I only add the compute-and-print inside llama_eval itself, that works. But for some reason, after adding the boolean flag (even if I create a second function with the same preamble that only prints the embeddings at the end), it stops working.
Could anyone take a look and tell me where I went wrong or what I am missing, so that we can add this capability?

Thank you!
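For context, the mechanism being described is just a boolean on the options struct plus a command-line flag. A minimal sketch, assuming a gpt_params-style struct and the usual string-comparison argument loop; apart from embedding / --embedding, the names here are illustrative stand-ins rather than the PR's actual diff:

#include <string>

// Illustrative sketch only: a boolean in the parameter struct and a branch in
// the argument-parsing loop that turns it on.
struct gpt_params {
    int  n_threads = 4;
    bool embedding = false; // if true, print the input's embedding and stop instead of generating
    // ... other existing fields ...
};

static bool parse_embedding_flag(const std::string & arg, gpt_params & params) {
    if (arg == "--embedding") {
        params.embedding = true;
        return true;
    }
    return false; // flag not handled by this sketch
}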

@anzz1 marked this pull request as draft March 19, 2023 07:00
@StrikingLoo marked this pull request as ready for review March 19, 2023 23:51
@StrikingLoo (Contributor Author) commented Mar 20, 2023 via email

@gjmulder added the enhancement (New feature or request) label Mar 20, 2023
main.cpp Outdated
if (params.embedding) {
    printf("got right before second call.\n");
    const int64_t t_start_us = ggml_time_us(); //HERE
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
@nullhook commented Mar 20, 2023

You're not calling llama_eval for remaining tokens

@StrikingLoo (Contributor Author)

The sentence embedding depends on the input; we shouldn't have to generate extra tokens afterwards. Though I may be wrong. In any case, changing that wouldn't solve the failed-execution error.


Are you trying to capture the input text's embeddings or the predicted text's embeddings?


Given the PR, it is computing the input text's embeddings.


The vector of tokens that you are sending for evaluation is empty. Please ensure that it is set correctly right before evaluating it, and also add a check to verify that it is not empty.

embd = embd_inp;

if (embd.size() > 0) {
   if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
      fprintf(stderr, "Failed to predict\n");
      return 1;
   }
}


Let me know if the above fixes the runtime crash

@StrikingLoo (Contributor Author)

I'm sorry, I had seen the email and read the message, but not the snippet. I will try this soon. Thank you!

@StrikingLoo (Contributor Author)

That solved it! My version is working, but I see merge conflicts; I am testing the merge now.

@StrikingLoo (Contributor Author)

Merge successful. Now we have embeddings! This should be ready to merge into main.

main.cpp Outdated
while (remaining_tokens > 0) {
    // predict
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();

        if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token)) {
        if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, false)) {


Pass params.embedding directly and remove the whole if block in line 946

@StrikingLoo (Contributor Author)

But if we want the embedding, we only want to print it once and not generate the next token.

@StrikingLoo (Contributor Author) commented Mar 21, 2023 via email

@nullhook commented Mar 22, 2023

Do the input tokens need to go through many layers to obtain their embeddings? My guess is that the input embeddings should be obtainable earlier, so that it is not necessary to loop through all the layers and also compute the logits.

Perhaps we should return as soon as the input tokens are projected into the embedding space, since that step would give us the vector representation we need.

@StrikingLoo (Contributor Author)

The embeddings should be the output of the last attention layer, corresponding to the last token in the input.
Say I input "I am a dog": in the end the transformer maps each token to an embedding (each attending to each). I want the embedding that the last attention layer assigns to the last token. Not the logits themselves, but certainly not just the word embeddings from the start, before any pass through the attention layers.
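As a rough illustration of that idea (an assumption-level sketch, not code from this PR): if the final hidden-state tensor after ggml_graph_compute() holds n_tokens rows of n_embd floats, the last row is the vector being discussed.

#include <cstring>
#include <vector>
#include "ggml.h"

// Hypothetical helper: copy the final layer's hidden state for the last input
// token into its own buffer. `hidden` is assumed to be laid out as n_tokens
// contiguous rows of n_embd floats after ggml_graph_compute().
static std::vector<float> last_token_embedding(const struct ggml_tensor * hidden,
                                               int n_embd, int n_tokens) {
    std::vector<float> out(n_embd);
    const float * data = (const float *) hidden->data;
    std::memcpy(out.data(), data + (size_t) n_embd * (n_tokens - 1),
                sizeof(float) * n_embd);
    return out;
}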

@nullhook commented Mar 22, 2023

So the input embeddings obtained at the starting layer are static and not contextual? And in order for the embeddings to include context for each input token in the sequence, the input has to go through all the layers?

@StrikingLoo (Contributor Author)

Sorry, I can't tell whether those are assertions or questions. My understanding is that we want the whole input to go through all the attention layers, and then we take the representation of the last token. That is what this PR is supposed to be doing. Right now I think it is working, though I would say we can test it more before merging (right now I see it doesn't break and it gives an output of the correct shape, but I would test further).

@StrikingLoo (Contributor Author)

Okay, I think this is working well. I tested it with multiple inputs and the embeddings make a lot of sense.
Just to clarify, what this is doing is:

  • Take the input (a sentence, etc.).
  • Feed it through all the attention layers. In the end we have N embeddings of size n_embd.
  • Print the embedding corresponding to the last token in the input.
  • Stop execution.

Right now it runs without errors both in embedding mode and in normal mode, and the embeddings look coherent. More exhaustive testing could be done by someone with access to better compute, but I tried 8 inputs, some semantically related and some not, and the cosine similarities between embeddings made sense. I made some sentences share the exact same last word so that, if we were merely outputting word embeddings with no forward pass, similarity would be maximal; it was not, so this looks good.

Do let me know if you think other changes are needed before merging, and thank you for your help in getting this to work!

main.cpp Outdated
embd = embd_inp;
if (embd.size() > 0) {
    const int64_t t_start_us = ggml_time_us();
    if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {


Pass in params.embedding instead, which presumably carries the same information about whether or not we want the embeddings.

main.cpp Outdated
if (params.embedding) {
    embd = embd_inp;
    if (embd.size() > 0) {
        const int64_t t_start_us = ggml_time_us();


It seems that we are storing the start timestamp here, but is it being used to measure anything in embedding mode? If not, let's remove it. Alternatively, do we want to measure the embedding compute time?

@StrikingLoo (Contributor Author)

Okay, I merged with master again and moved everything to llama.cpp. I addressed both changes: everywhere it makes sense, the boolean now matches the console arguments (the test at the beginning for memory size needs to be hardcoded to false), and I removed the spurious time check.

@StrikingLoo changed the title from "[ Do not merge ] Add embedding mode with arg flag. Currently not working" to "Add embedding mode with arg flag. Currently working" Mar 22, 2023
@nullhook

How are you testing the correctness of the embeddings?

@ggerganov (Member) left a comment


Hi, this will be a great addition!

Let's change the interface slightly, like this:

  • Add a parameter to llama_context_params called embeddings, which is false by default. It will be used to specify whether we want the llama_eval() call to keep the computed embeddings or not.
  • Add an embedding buffer (std::vector<float>) to the llama_context, similar to the logits buffer, and resize it upon model load if the embeddings parameter is true. If it is false, leave it empty.
  • Keep llama_eval_internal() as in the original and only change it to copy the embeddings into the new context buffer if that buffer is not empty (i.e. the embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. There is no need for an alternative ggml_graph_compute() call as you have proposed.
  • Add a C API call to get the new embeddings buffer, same as llama_get_logits().
  • You can now print the embeddings in user code, instead of in the llama library.
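For reference, a rough sketch of what the user-side flow could look like under this proposal. The embedding field in llama_context_params and llama_get_embeddings() are assumptions modeled on the description above and on the existing llama_get_logits() convention, not necessarily the final API:

#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();
    cparams.embedding = true; // assumed flag: keep the computed embeddings after llama_eval()

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model-q4_0.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    // tokenize the prompt (true = prepend the BOS token)
    std::vector<llama_token> tokens(512);
    const int n_tokens = llama_tokenize(ctx, "I love cats", tokens.data(), (int) tokens.size(), true);
    tokens.resize(n_tokens);

    // a single eval over the whole prompt; no further generation is needed
    if (llama_eval(ctx, tokens.data(), (int) tokens.size(), 0, 4) != 0) {
        fprintf(stderr, "eval failed\n");
        return 1;
    }

    // print the embeddings from user code, analogous to llama_get_logits()
    const float * emb = llama_get_embeddings(ctx); // assumed accessor
    const int n_embd  = llama_n_embd(ctx);
    for (int i = 0; i < n_embd; i++) {
        printf("%f ", emb[i]);
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}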

@StrikingLoo (Contributor Author)

I tried multiple inputs and judged by the cosine similarities between semantically similar and unrelated sentences. I am more than open to other tests if anyone can think of them.

Example pair-wise correlations

Napoleonic_France -- cats_are_cute: -0.027480708515286424
Napoleonic_France -- I_love_dogs: -0.31697662922849257
Napoleonic_France -- I_love_cats: -0.301660192174504
Napoleonic_France -- Victorian_England: 0.8798725429541867
cats_are_cute -- Napoleonic_France: -0.027480708515286424
cats_are_cute -- I_love_dogs: 0.576915180966497
cats_are_cute -- I_love_cats: 0.6115158942552829
cats_are_cute -- Victorian_England: -0.01785591030506376
I_love_dogs -- Napoleonic_France: -0.31697662922849257
I_love_dogs -- cats_are_cute: 0.576915180966497
I_love_dogs -- I_love_cats: 0.9332809579375094
I_love_dogs -- Victorian_England: -0.3006025355059021
I_love_cats -- Napoleonic_France: -0.301660192174504
I_love_cats -- cats_are_cute: 0.6115158942552829
I_love_cats -- I_love_dogs: 0.9332809579375094
I_love_cats -- Victorian_England: -0.2838594074158372
Victorian_England -- Napoleonic_France: 0.8798725429541867
Victorian_England -- cats_are_cute: -0.01785591030506376
Victorian_England -- I_love_dogs: -0.3006025355059021
Victorian_England -- I_love_cats: -0.2838594074158372

This also made me think: it may be desirable to normalize the embeddings (to norm 1; I don't mean the normalization layer itself). I was doing it in post-processing. Do you think it would be good to add this? I'm not sure how to do it in ggml, but we would just need to divide by the embedding's norm.

@nullhook

I'm curious whether normalizing reduces the dimensionality of the embedding space. If the consumer's only goal is to compute cosine similarity from the embeddings, then I think it makes sense to normalize. However, I suggest leaving the embeddings as raw vectors and letting the consumer decide whether or not to normalize them. Others may have a different opinion on this.

@StrikingLoo (Contributor Author)

I agree, normalizing would be lossy. I would assume that e.g. the GPT-3 API gives you normalized embeddings, but it's easy enough to normalize them on the consumer's end, so I wouldn't sweat it.

I will look into the changes ggerganov suggested; there is also a new merge conflict to solve.

@nullhook commented Mar 22, 2023

It would be helpful to add an example of generating input embeddings, normalizing them, and computing cosine similarity in the example folder in this PR.
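Something along these lines, as a sketch rather than a finished example; it only assumes two raw embedding vectors of equal length, e.g. read out of the embeddings buffer discussed above:

#include <cmath>
#include <vector>

// L2-normalize an embedding in place (divide every component by the Euclidean norm).
static void normalize(std::vector<float> & v) {
    double norm = 0.0;
    for (float x : v) norm += (double) x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0) {
        for (float & x : v) x = (float) (x / norm);
    }
}

// Cosine similarity between two raw (not necessarily normalized) embeddings.
static double cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    const size_t n = a.size() < b.size() ? a.size() : b.size();
    for (size_t i = 0; i < n; i++) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0; // avoid dividing by zero for empty or zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}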

@StrikingLoo (Contributor Author)

Hi ggerganov,
Thank you for the instructions. I did steps 1, 2 and 4, but I am not sure what we want the final program to look like in step 3.

"Keep llama_eval_internal() as original and only change to copy the embeddings into the new context buffer if it is not empty (i.e. embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. No need for an alternative ggml_graph_compute() call as you have proposed"

Do I still keep the behavior where we check and, if embeddings == true, we only run until we have the embeddings and then return? If I do, I can't leave llama_eval_internal identical. Plus there is that test run we do at the beginning (something about memory cost) that requires me to pass false for the embeddings, even if we will keep them later.

The model load / config is already as specified.

llama.cpp Outdated
Comment on lines 825 to 826
ggml_build_forward_expand(&gf, inpL);
ggml_graph_compute (ctx0, &gf);


Do we really need to update and compute the graph in embedding mode?

@ggerganov (Member)

@StrikingLoo Just pushed a change. Give it a try and let me know if it works

@StrikingLoo (Contributor Author) commented Mar 24, 2023

This looks perfect to me. Both generation and embeddings work without errors. Just to make sure: would we be merging like this, or should I add the display_embeddings code under "// TODO: print / use the embeddings"?

Edit:
Adding this here instead of posting multiple comments. Eventually I would like a version where either, if we pass the embedding argument, nothing else is printed to stdout except the embeddings (so that the command can be easily piped), or, alternatively, if we pass --embedding --output-path 'path.emb' or some such, the embeddings are stored in a file. But I'm okay with the program not having that by default and letting users add it themselves; I just think many people will prefer to use the program as a black box without checking the code. Maybe the API mode you mentioned before is already planning to add something like this.

@ggerganov (Member)

This will become a separate example program called embedding. For now we will merge it like this because I want to start refactoring, and later we will move it into a separate program.

@loretoparisi

@ggerganov is this the correct command?

./embedding -m models/7B/ggml-model-q4_0.bin -p "ciao" -n 512 

It seems it's not using the prompt passed with -p. In fact, I do not see in the logs the line from embedding.cpp that should print it:

main: seed = 1679853514
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
2.177604 -1.095253
...
04 0.534440 0.732717 0.781988 -1.836264 -0.860989 -0.564879 0.084990 0.838598 1.210304 -0.441369 -1.963783 2.096257 

llama_print_timings:        load time =  1124.05 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =   464.57 ms /     3 tokens (  154.86 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  2215.86 ms

AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024