Different tokenization leads to BLAS reprocessing #1368
Comments
With DeepSeek R1, this happens quite often, actually.
Unfortunately this is not something that can be easily fixed, because the model was trained on a specific tokenizer, which must obey its own merging rules. A more extreme example: the word "hello" is a single token, and the model also has single tokens for "h", "e", "l" and "o". However, if you shove in those 5 separate tokens instead of the single "hello" token, the model will not treat them the same way.
This is not what I am asking here. The solution is not "make the model print uniformly", nor "fix the tokenizer", but to prevent koboldcpp from re-tokenizing the internal context cache on subsequent requests within the same session!
Yes, what I am trying to say is that doing that will lead to a significant degradation of the output quality. If you just want something to experiment with for your own frontend, I can add an API where you can submit the exact array of token IDs you wish to use within the context. That approach will bypass the tokenizer entirely and give you full control over when reprocessing happens. You can use the tokenizer API separately to tokenize individual sub-chunks based on your own text splitting logic, and then feed the IDs for generation. Would that be useful for you?
Why do you think the output will be worse, if it was THE MODEL ITSELF that chose those different tokens in its previous answer!? I would rather believe that "fixing" the tokenization by re-tokenizing it "correctly" (as koboldcpp does now) has more chance of degrading quality. Again, imagine that I ask a question and then let R1 think for 4k tokens. Whatever I get would be tokenized "as the model wants it to be" in its answer. And now you are claiming not only that the output of one 4096-token run will/should/might differ from 8 runs of 512, but that the 8x512 version should have better quality than the single 4096 run! What I am saying is that:

1. Unnecessary BLAS reprocessing happens on the next request even though the text itself has not changed.
2. The outcome of a continuation depends on the "amount to generate" value (generating 4096 tokens at once vs. 8 chunks of 512 can give a different result).
Reloading koboldcpp or flushing the context cache (by submitting an empty story, for example) will convert the "output vocabulary" to the "input vocabulary", but this affects neither p.1 (you'll do BLAS anyway) nor p.2 (it merely turns the statement into "a later continuation of the previous story might have a different outcome than generating it in one session", rather than a dependence on the amount-to-generate value – which will only hold true until you implement a mechanism for saving the context cache between runs).

Also, I believe the tokenizer is not that broken: even if it has subtle variations on multi-character sequences (symbols, not parts of words), this should not generally affect quality at all (take your example of "\n\n" vs. "\n"+"\n" – it is still a line feed, not something different). The only model I know that was "bad at spacing" is Command R+: it often produces double and triple spaces, indents lists differently, uses inconsistent dashes and quotes, and so on. If you don't edit that out of its previous output, it only gets worse; but even with broken lists and extra spaces, the quality of the text wasn't noticeably dropping. (Though in the Command R+ case it is not necessarily due to tokenization…) For DeepSeek, I have not noticed any structure breakage yet.
Since different token sequences may form the same printed text, there is a chance that whatever the model outputs would not tokenize back to the same exact tokens when used as an input in the following turn.
The drawback is that some part of the yellow (already generated) text might be reprocessed again the next time, as if the user had made an edit somewhere, because the context cache misses due to the token-id discrepancy.
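In other words, if the cache match is done purely on token ids (an illustrative sketch, not the actual koboldcpp code), everything after the first differing id gets reprocessed, even when the text it renders to is identical:

```js
// Illustration only: a naive prefix match on token ids.
// Everything past the first differing id falls out of the cache and is
// reprocessed, even if it renders to exactly the same text.
function commonTokenPrefix(cachedTokens, promptTokens) {
  let n = 0;
  while (n < cachedTokens.length && n < promptTokens.length
         && cachedTokens[n] === promptTokens[n]) {
    n++;
  }
  return n; // tokens beyond n get reprocessed (BLAS)
}
```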
Here is an example with quantized DeepSeek R1 (I could not see it on a distilled model, probably because of a different tokenizer vocabulary):
<|User|>Repeat the string: "**Duration:** Around ~2-4 days."<|end▁of▁sentence|><|Assistant|><think></think>**Duration
The model (at zero temp) writes:
:** Around ~2-4 days.<|end▁of▁sentence|>
Thing is, in its output ":**" is one token (weird but okay), then " ~" (space+tilde), then "2", and so on. But the same string in the input is processed as ":**", then " " (a lone space), then "~" (just the tilde), then "2" – one token more. So, if you add nothing and just hit "Generate More", you will see 7 tokens as a new prompt (or whatever the length of the rest of the generated text is), instead of just 1 token to process.
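Side by side (showing only the pieces named above, as string fragments rather than real token ids – purely illustrative):

```js
// Hypothetical fragments for illustration, not real token ids:
const asGenerated   = [":**", " ~", "2" /* , ... */];       // " ~" is one space+tilde token
const asRetokenized = [":**", " ", "~", "2" /* , ... */];   // space and tilde split apart: one token more
```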
This works only the first time, because reprocessing "fixes" the cache, making the model believe it wrote " " + "~" rather than " ~".

I think it is possible to fix this by pretending that the user's input was tokenized exactly like the current context, as long as their text representation is the same. Here is my general idea in pseudocode (in javascript; I know you have C++ there, it is just easier for me to show JS):
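(A minimal sketch of the idea; `detokenize()` and the other names are placeholders, not actual koboldcpp internals.)

```js
// Walk the cached context tokens and the freshly tokenized prompt in parallel.
// When the ids diverge, compare the text they render to and try to re-sync.
function matchContextToPrompt(contextTokens, promptTokens, detokenize) {
  let ci = 0, pi = 0;      // indices into context / prompt token arrays
  let okCi = 0, okPi = 0;  // last position where both sides agreed as text
  while (ci < contextTokens.length && pi < promptTokens.length) {
    if (contextTokens[ci] === promptTokens[pi]) {
      ci++; pi++; okCi = ci; okPi = pi;
      continue;
    }
    // Token ids differ: render both to text and try to "fill the gap".
    let ctext = detokenize(contextTokens[ci++]);
    let ptext = detokenize(promptTokens[pi++]);
    let resolved = true;
    while (ctext !== ptext) {
      const common = Math.min(ctext.length, ptext.length);
      if (ctext.slice(0, common) !== ptext.slice(0, common)) {
        resolved = false; break;                              // texts really differ
      }
      if (ctext.length < ptext.length) {                      // extend the shorter side
        if (ci >= contextTokens.length) { resolved = false; break; }
        ctext += detokenize(contextTokens[ci++]);
      } else {
        if (pi >= promptTokens.length) { resolved = false; break; }
        ptext += detokenize(promptTokens[pi++]);
      }
    }
    if (!resolved) break;  // stop at the last point where the texts still matched
    okCi = ci; okPi = pi;  // equal text: discrepancy resolved, keep walking
  }
  return {
    validContextTokens: okCi, // context tokens valid no matter what the input tokens were
    promptTokensToSkip: okPi, // input tokens already covered by that context
  };
}
```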
Basically, when you see different tokens, try to resolve the mismatch: render those tokens to strings; their lengths may differ, so compare only the common prefix. If that prefix already differs, we cannot resolve it – the prompt is not equal to the context; otherwise, there is a chance that the next token will "fill the gap". Append the next token's text to whichever string is shorter and compare again. We will either end up with equal length and content (meaning the discrepancy is resolved), or hit differing text / the end of either array. The algorithm even handles different tokens with the same exact text (it would enter the inner loop, just compare and break back out).
Then you would have "how many tokens of context are valid no matter what the input tokens were" and "how many input tokens to strip to get the actual continuation of the user prompt with respect to the existing context".
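For example, the caller could then use those two values roughly like this (names made up, continuing the sketch above):

```js
// Hypothetical wiring: keep the matched part of the cache, only process the rest.
const res = matchContextToPrompt(cachedContextTokens, newPromptTokens, detokenize);
const keptContext = cachedContextTokens.slice(0, res.validContextTokens); // stays cached, no BLAS
const newPart     = newPromptTokens.slice(res.promptTokensToSkip);        // only this gets processed
```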
I believe that preserving the exact output tokens in the context is important, because a different tokenization may in theory affect logprobs later; but at zero temperature the model should continue printing whatever it was trying to say, regardless of whether the previous generation was aborted or not.
(Fair enough, you could say "but if you paste the same history into a new story it won't anymore, because the tokenization has already been rendered differently everywhere" – but I still think that restarting a history is not as frequent as a mere Abort, which should not change the meaning of the existing text!)
And of course, unnecessary reprocessing is bad by itself.
Full logs of the run, where you can see how the model chooses different tokens than the ones that get fed back to it later:
Even if this particular case is caused by some misbehavior of the DeepSeek tokenizer, that does not mean it is not worth fixing the context-sewing function to prevent similar reprocessing, which is inherently possible.