Replies: 20 comments 2 replies
-
I noticed it happening myself; it kinda occurs like this:
-
I noticed this too, but for me, after pressing enter it prints out the introduction to some random book.
-
+1 here. Sometimes I wait it out and it restarts, but if I see the CPU fall to zero, then it won't budge until I press enter.
-
Same here.
-
Is there a solution for this? I tried using `--mlock` and `-c 2048`, but it still stopped.
-
There are two possible fixes. Some models will use the EOS token to return control to you when they are done answering, so depending on which model you're using, one fix might work better than the other. If in doubt, try
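As an illustration of the two flags that come up later in this thread (a minimal sketch only, not necessarily the exact suggestion being made here; the model path is a placeholder):

```
# -n -1 removes the hard cap on the number of tokens generated per turn
./main -m ./models/13B/model-q4_0.bin -i --reverse-prompt "User:" -n -1

# or, if a model stops too eagerly on the EOS token, additionally ignore it:
./main -m ./models/13B/model-q4_0.bin -i --reverse-prompt "User:" -n -1 --ignore-eos
```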
-
@DannyDaemonic, I just tried adding `-c 2048` with Vicuna 13B and haven't gotten stopped yet. I'll try your suggestion if I hit another stop. Thank you very much.
-
When the context fills up, it has to cut your old history in half and reprocess it all (to make room to remember new stuff). This can also be a slow process, so it's possible the pause you were seeing was simply the system "thinking" and it wasn't actually waiting for you to hit enter. Or it could be a combination.
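A hedged sketch of how that re-processing cost can be reduced from the command line: `--keep` pins the first N prompt tokens so they survive the halving step and less has to be re-evaluated. The numbers and model path below are placeholders, not a recommendation:

```
# keep the first 256 prompt tokens (e.g. the system prompt) across context swaps
./main -m ./models/model-q4_0.bin -i -c 2048 --keep 256
```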
-
@DannyDaemonic, maybe I was wrong; I meant `-n 2048`? Not sure, sorry, I only started playing with this this morning 😄
-
OK, thanks all for your help. The key is, like @DannyDaemonic said: `-n -1`
-
The long pause you get after a long conversation is usually due to the context memory being trimmed and re-evaluated so it has space to continue the conversation. Context length isn't usually limited by your computer's RAM, but rather by how much context the model has been trained to handle.
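To illustrate that last point, a sketch only: set `-c` to the window the model was trained with (2048 for the original LLaMA models) rather than scaling it to however much RAM you have. The model path is a placeholder:

```
# size the context to what the model was trained on, not to available RAM
./main -m ./models/model-q4_0.bin -c 2048 -i
```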
-
@DannyDaemonic, is it possible I experienced a memory leak then? It happened when there was a lot of back-and-forth chatting going on, and then the memory swapping started happening.
-
I'm guessing the "swapping" you are experiencing is the context swapping. It can be a very slow process. It happens once the context fills up, so this will happen after a long conversation.
-
@DannyDaemonic, I see. Today I tried to replicate it again without using `--mlock`, and it won't swap, and the speed difference is almost unnoticeable. I think for now I'd better not use `--mlock` and stick with `-n -1`.
-
This is still happening. I'm using the server, and the browser console tells you the reason for stopping; when I'm getting incomplete sentences or incomplete code I asked for, the debug console says the reason for stopping was eos. When this happens, almost certainly the next couple of sentences get progressively more meaningless, including repeating my own questions, attempts at talking to itself, or an endless stream of a single character, usually `\n` or `#`. If I continue, it starts to produce more and more garbage, completely ignoring my input and replying to absolutely random pieces of the previous conversation. It's almost as if, when the context gets full and it does its context "compression" thing, it butchers the question/reply scheme and the model sees messed-up things in the context.
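For anyone trying to reproduce this, a rough sketch of checking the stop reason directly against the server's `/completion` endpoint; the response field names (`stopped_eos`, `stopped_limit`, `truncated`) are from memory and may differ between server versions:

```
# assumes the server is running on its default port 8080
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "User: write a short haiku\nAssistant:", "n_predict": 128}' \
  | python3 -m json.tool
# inspect fields such as "stopped_eos", "stopped_limit" and "truncated" in the response
```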
-
Maybe some breaking changes were introduced recently? I clearly remember that about a month or two ago I was able to have long conversations with large WizardLM models (in interactive/chat mode), but this morning, after a long break, I downloaded and compiled the latest llama.cpp and re-quantized my model, and now I can only get 1-2 responses before it freezes up and then starts generating random gibberish or talking to itself after I hit enter. I tried prompting with and without the `-n` parameter and tried different models, but I'm yet to find a combination which works.
-
I just ran a couple of tests in "single run" mode from the command line, giving it a prompt of a conversation to continue, with `-n -1` and the ban-EOS flag (can't remember it exactly, I'm on my phone right now), just to see what happens. It continued to generate for a fairly long time, way past the context limit, but at some point it REALLY tried to stop, by saying things like "the end", then generating some more and saying "for real the end", and then doing some more, to the point where it started to sound like an SMS ("ok, bye, I'm going to the shop now, BYE, BYYYYEEEE"), and after that it gave me a copious amount of emoji followed by a continuous stream of non-printable characters, i.e. it went completely off the rails. (But it was genuinely hilarious, like it wanted to let me know it had had absolutely enough, in the "blink twice if you need help" kind of way.) Any prompt, and several different models, all got to the same point eventually during repeated tests; sometimes sooner, sometimes it got quite far. Edit: Wait, now that I think about it, did an AI just imply that I'm a creep? Did I just get friendzoned by an AI?
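For reference, the "ban EOS" flag is presumably `--ignore-eos`; a hedged sketch of what such a single-run test might look like (model path and prompt are placeholders; `$'...'` is bash syntax for embedding real newlines):

```
# non-interactive single run: no token cap, EOS ignored, so it generates well past the context limit
./main -m ./models/model-q4_0.bin -n -1 --ignore-eos \
  -p $'User: Hello!\nAssistant: Hi, how can I help?\nUser: Tell me about llamas.\nAssistant:'
```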
-
With a bit more testing, I think I found that one of the recent commits introduced some changes which are likely the cause of my issues: #2304
-
Using the following parameters with build 916a9ac: `-c 512 -b 1024 -n 1024 --keep 1024 --repeat_penalty 1.1 --color -i -p "my question..."` (see https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
-
@mattiashallberg, having the same issues. How do you make it "continue" the response without stopping?
-
I noticed that in chat mode the inference often stops mid-sentence and requires the user to press enter to continue. However, this introduces a newline at the end of the context that makes LLaMA terminate what it was saying and give control back to the user with the reverse prompt `User:`. Does someone know why this happens, and whether it would be beneficial to make the Enter keypress just continue inference instead of also adding a `\n` when both `-i` and `--reverse-prompt` are passed? Obviously the `\n` append is skipped only if the termination was not due to a reverse prompt match. Note: this sometimes happens even after very few tokens (5-10).