Different tokenization leads to BLAS reprocessing #1368

Open
aleksusklim opened this issue Feb 12, 2025 · 5 comments

@aleksusklim

Since different token sequences can render to the same printed text, there is a chance that whatever the model outputs will not tokenize back to the exact same tokens when it is used as input in the following turn.

The drawback is that part of the yellow text may be reprocessed the next time, as if the user had made an edit somewhere, because the context cache misses due to the token ID discrepancy.

Here is an example with quantized DeepSeek R1 (I could not reproduce it on a distilled model, probably because of a different tokenizer vocabulary):

<|User|>Repeat the string: "**Duration:** Around ~2-4 days."<|end▁of▁sentence|><|Assistant|><think></think>**Duration

The model (at zero temp) writes:
:** Around ~2-4 days.<|end▁of▁sentence|>

The thing is, its ":**" is one token (weird but okay), then " ~" (space + tilde), then "2" and so on.
But the same string, when fed back as input, is tokenized as ":**", then " " (a lone space), then "~" (just the tilde), then "2" – one token more.

So, if you add nothing and just hit "Generate more", you will see 7 tokens as the new prompt (or however long the rest of the generated text is), instead of just 1 token to process.
This happens only the first time, because the reprocessing "fixes" the cache, making the model believe it wrote " " + "~" rather than " ~".
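
To make the mismatch explicit, here is a tiny illustration (the pieces below are taken from the logs in this report; this is not koboldcpp code, just a demonstration that the rendered text is identical while the token split differs):

// What the model emitted (generation log):        ":**", " ~", "2", "-", "4", " days", "."   -> " ~" is one token
// What the same text re-tokenizes to (input dump): ":**", " ", "~", "2", "-", "4", " days", "." -> " " and "~" are two tokens
// The joined text is the same either way, but a plain element-wise id comparison
// diverges at the second position, so everything after it gets reprocessed.
console.log([":**", " ~", "2"].join("") === [":**", " ", "~", "2"].join("")); // true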

I think it is possible to fix this by pretending that the user's input was tokenized exactly as the current context is, as long as their text representations are the same. Here is my general idea in pseudocode (JavaScript; I know you have C++ there, it is just easier for me to show it in JS):

// token_to_string(id) stands for the detokenizer: render a single token id to its text.
function find_matching_prefix_length (context_tokens, input_tokens) {
  let i = 0; // last valid matching index in context_tokens
  let j = 0; // last valid matching index in input_tokens
  while (i < context_tokens.length && j < input_tokens.length) { // stay in bounds
    if (context_tokens[i] === input_tokens[j]) { // token is the same
      i++;
      j++;
      continue; // increment both and continue
    }
    // tokens don't match but may refer to the same text
    let a = token_to_string(context_tokens[i]); // accumulated text of the next context tokens
    let b = token_to_string(input_tokens[j]);   // accumulated text of the next input tokens
    let x = i; // copies of the indices, because we must preserve the last valid values
    let y = j;
    while (true) { // try to fit the tokens together
      const s = Math.min(a.length, b.length); // common length to compare
      if (a.substring(0, s) !== b.substring(0, s)) { // texts differ, so we cannot resolve
        return [i, j]; // the prompt really diverges here; return the last valid indices
      }
      if (a.length === b.length) { // discrepancy resolved, continue scanning forward
        i = x + 1; // set new valid indices for the outer loop
        j = y + 1;
        break;
      }
      if (a.length < b.length) { // context text is shorter, append its next token
        x++;
        if (x === context_tokens.length) return [i, j]; // hit the end of the array
        a += token_to_string(context_tokens[x]);
      } else { // input text is shorter, append its next token
        y++;
        if (y === input_tokens.length) return [i, j]; // hit the end of the array
        b += token_to_string(input_tokens[y]);
      }
    }
  }
  return [i, j]; // matched till the end
}
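
For illustration, here is a minimal, hypothetical test of the function above. The toy vocabulary and ids below are made up (only their shape mirrors the " ~" vs. " "+"~" case from the logs), and token_to_string is a stub over that toy table rather than the real detokenizer:

// Toy vocabulary, purely for demonstration; ids and strings are invented.
const vocab = { 1: ":**", 2: " ~", 3: " ", 4: "~", 5: "2", 6: "-", 7: "4" };
function token_to_string(id) { return vocab[id]; }

const context_tokens = [1, 2, 5, 6, 7];    // what the model generated: ":**", " ~", "2", "-", "4"
const input_tokens   = [1, 3, 4, 5, 6, 7]; // how the same text re-tokenizes: ":**", " ", "~", "2", "-", "4"

console.log(find_matching_prefix_length(context_tokens, input_tokens)); // [5, 6] - the whole prefix matches as text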

Basically, when you see different tokens, try to resolve the mismatch: render those tokens to strings; their lengths may differ, so we compare only the common prefix. If that is wrong, then we cannot resolve it: the prompt is not equal to the context. Otherwise, there is a chance that the next token will "fill the gap": append the next token's text to whichever string was shorter and compare again. We will either end up with equal length and content (meaning the discrepancy is resolved), or hit differing text / the end of either array. The algorithm even handles different tokens with the same exact text (it enters the inner loop, compares, and immediately breaks back).

Then you would have "how many context tokens are valid regardless of what the input tokens were" and "how many input tokens to strip to get the actual continuation of the user prompt with respect to the existing context".
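
To make the intended use of those two numbers concrete, here is a hedged sketch of the caller side (the variable names are mine, not koboldcpp's actual internals):

// Hypothetical caller: keep the first `keep` cached entries untouched and only
// run BLAS over the part of the input that is genuinely new text.
const [keep, skip] = find_matching_prefix_length(context_tokens, input_tokens);
const reused = context_tokens.slice(0, keep); // stays in the KV cache as-is (n_past = keep)
const fresh  = input_tokens.slice(skip);      // only these tokens need processing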

I believe that preserving the exact output tokens in the context is important, because a different tokenization may in theory affect logprobs later, while the model at zero temperature should continue printing whatever it was trying to say regardless of whether the previous generation was aborted or not.

(Fair enough, you could say "but if you paste the same history into a new story it won't hold anymore, because the tokenization has already been rendered differently everywhere"; still, I think a history restart is not as frequent as a mere Abort, which should not change the meaning of the existing text!)

And of course, unnecessary reprocessing is bad by itself.

Full logs of the run; there you can see how the model chooses different tokens than the ones that get fed back to it later:

***
Welcome to KoboldCpp - Version 1.83
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Initializing dynamic library: koboldcpp_default.dll
==========
Namespace(admin=True, admindir='C:\\NN\\GPT', adminpassword='', analyze='', benchmark=None, blasbatchsize=512, blasthreads=16, chatcompletionsadapter=None, config=None, contextsize=2048, debugmode=1, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, failsafe=False, flashattention=False, forceversion=0, foreground=False, gpulayers=0, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model='', model_param='D:/GGUF/DeepSeek/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=True, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=7, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=True, usecublas=None, usemlock=False, usemmap=True, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: D:\GGUF\DeepSeek\DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf

The reported GGUF Arch is: deepseek2
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from D:\GGUF\DeepSeek\DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 211.03 GiB (2.70 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = DeepSeek R1 BF16
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 128815 '<|PAD▁TOKEN|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
load_tensors: layer   2 assigned to device CPU
load_tensors: layer   3 assigned to device CPU
load_tensors: layer   4 assigned to device CPU
load_tensors: layer   5 assigned to device CPU
load_tensors: layer   6 assigned to device CPU
load_tensors: layer   7 assigned to device CPU
load_tensors: layer   8 assigned to device CPU
load_tensors: layer   9 assigned to device CPU
load_tensors: layer  10 assigned to device CPU
load_tensors: layer  11 assigned to device CPU
load_tensors: layer  12 assigned to device CPU
load_tensors: layer  13 assigned to device CPU
load_tensors: layer  14 assigned to device CPU
load_tensors: layer  15 assigned to device CPU
load_tensors: layer  16 assigned to device CPU
load_tensors: layer  17 assigned to device CPU
load_tensors: layer  18 assigned to device CPU
load_tensors: layer  19 assigned to device CPU
load_tensors: layer  20 assigned to device CPU
load_tensors: layer  21 assigned to device CPU
load_tensors: layer  22 assigned to device CPU
load_tensors: layer  23 assigned to device CPU
load_tensors: layer  24 assigned to device CPU
load_tensors: layer  25 assigned to device CPU
load_tensors: layer  26 assigned to device CPU
load_tensors: layer  27 assigned to device CPU
load_tensors: layer  28 assigned to device CPU
load_tensors: layer  29 assigned to device CPU
load_tensors: layer  30 assigned to device CPU
load_tensors: layer  31 assigned to device CPU
load_tensors: layer  32 assigned to device CPU
load_tensors: layer  33 assigned to device CPU
load_tensors: layer  34 assigned to device CPU
load_tensors: layer  35 assigned to device CPU
load_tensors: layer  36 assigned to device CPU
load_tensors: layer  37 assigned to device CPU
load_tensors: layer  38 assigned to device CPU
load_tensors: layer  39 assigned to device CPU
load_tensors: layer  40 assigned to device CPU
load_tensors: layer  41 assigned to device CPU
load_tensors: layer  42 assigned to device CPU
load_tensors: layer  43 assigned to device CPU
load_tensors: layer  44 assigned to device CPU
load_tensors: layer  45 assigned to device CPU
load_tensors: layer  46 assigned to device CPU
load_tensors: layer  47 assigned to device CPU
load_tensors: layer  48 assigned to device CPU
load_tensors: layer  49 assigned to device CPU
load_tensors: layer  50 assigned to device CPU
load_tensors: layer  51 assigned to device CPU
load_tensors: layer  52 assigned to device CPU
load_tensors: layer  53 assigned to device CPU
load_tensors: layer  54 assigned to device CPU
load_tensors: layer  55 assigned to device CPU
load_tensors: layer  56 assigned to device CPU
load_tensors: layer  57 assigned to device CPU
load_tensors: layer  58 assigned to device CPU
load_tensors: layer  59 assigned to device CPU
load_tensors: layer  60 assigned to device CPU
load_tensors: layer  61 assigned to device CPU
load_tensors: relocated tensors: 1025 of 1025
load_tensors:   CPU_Mapped model buffer size = 47485.39 MiB
load_tensors:   CPU_Mapped model buffer size = 47681.52 MiB
load_tensors:   CPU_Mapped model buffer size = 47681.52 MiB
load_tensors:   CPU_Mapped model buffer size = 47681.52 MiB
load_tensors:   CPU_Mapped model buffer size = 25569.12 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: layer 0: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 1: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 2: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 3: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 4: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 5: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 6: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 7: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 8: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 9: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 10: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 11: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 12: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 13: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 14: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 15: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 16: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 17: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 18: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 19: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 20: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 21: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 22: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 23: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 24: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 25: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 26: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 27: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 28: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 29: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 30: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 31: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 32: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 33: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 34: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 35: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 36: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 37: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 38: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 39: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 40: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 41: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 42: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 43: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 44: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 45: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 46: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 47: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 48: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 49: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 50: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 51: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 52: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 53: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 54: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 55: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 56: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 57: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 58: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 59: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 60: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init:        CPU KV buffer size =  9760.00 MiB
llama_init_from_model: KV self size  = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model:        CPU compute buffer size =   670.01 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 1
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
======
Active Modules: TextGeneration AdminControl
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001
::1 - - [12/Feb/2025 05:01:30] "GET / HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:30] "GET /manifest.json HTTP/1.1" 200 -
127.0.0.1 - - [12/Feb/2025 05:01:31] "GET /api/v1/model HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:31] "GET /api/v1/config/max_context_length HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:31] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [12/Feb/2025 05:01:31] "GET /api/extra/version HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:31] "GET /api/extra/true_max_context_length HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:31] "GET /sdapi/v1/sd-models HTTP/1.1" 200 -
::1 - - [12/Feb/2025 05:01:31] "GET /api/extra/preloadstory HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 4096, "max_length": 256, "rep_pen": 1.03, "temperature": 1, "top_p": 0.9, "top_k": 1, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 5, 0, 1, 3, 4, 2], "memory": "", "trim_stop": true, "genkey": "KCPP9185", "min_p": 0.1, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "dry_multiplier": 1.03, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": 360, "dry_sequence_breakers": ["\n", ":", "\"", "*"], "presence_penalty": 0, "logit_bias": {}, "prompt": "<\uff5cUser\uff5c>Repeat the string: \"**Duration:** Around ~2-4 days.\"<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cAssistant\uff5c><think></think>**Duration", "quiet": true, "stop_sequence": ["<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cUser\uff5c>", "<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cAssistant\uff5c>"], "use_default_badwordsids": false, "bypass_eos": false}::1 - - [12/Feb/2025 05:01:33] "GET /api/v1/model HTTP/1.1" 200 -


(Warning! Request max_context_length=4096 exceeds allocated context size of 2048. It will be reduced to fit. Consider launching with increased --contextsize to avoid errors. This message will only show once per session.)::1 - - [12/Feb/2025 05:01:33] "POST /api/extra/generate/stream HTTP/1.1" 200 -


(Note: Non-default sampler_order detected. Recommended sampler values are [6,0,1,3,4,2,5]. This message will only show once per session.)

Processing 4 dry break strings...
Found a total of 1357 restart heads, 1357 trivial, 0 non-trivial.

Using Seed: 318493

[Debug: Dump Raw Input Tokens, format: 6]
'<|begin▁of▁sentence|> (0)', '<|User|> (128803)', 'Repeat (94973)', ' the (270)', ' string (3418)', ': (28)', ' " (582)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)', '  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '." (2148)', '<|end▁of▁sentence|> (1)', '<|Assistant|> (128804)', '<think> (128798)', '</think> (128799)', '** (666)', 'Duration (46252)',


[Debug: Dump Forwarded Input Tokens, format: 6]
'<|begin▁of▁sentence|> (0)', '<|User|> (128803)', 'Repeat (94973)', ' the (270)', ' string (3418)', ': (28)', ' " (582)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)', '  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '." (2148)', '<|end▁of▁sentence|> (1)', '<|Assistant|> (128804)', '<think> (128798)', '</think> (128799)', '** (666)', 'Duration (46252)',

[Debug: n_past=0 Context Size = 0]


Processing Prompt (24 / 24 tokens)
Generating (1 / 256 tokens) [(:** 100.00%)]
Generating (2 / 256 tokens) [( Around 100.00%)]
Generating (3 / 256 tokens) [( ~ 100.00%)]
Generating (4 / 256 tokens) [(2 100.00%)]
Generating (5 / 256 tokens) [(- 100.00%)]
DRY penalties [(4 1.03)]
Generating (6 / 256 tokens) [(4 100.00%)]
DRY penalties [( days 1.80)]
Generating (7 / 256 tokens) [( days 100.00%)]
DRY penalties [(." 3.15)]
Generating (8 / 256 tokens) [(. 100.00%)]
Generating (9 / 256 tokens) [(<|end▁of▁sentence|> 100.00%)]

(EOS token triggered! ID:1)
llama_perf_context_print:        load time =   28159.49 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    28 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     8 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   37143.88 ms /    36 tokens

[05:02:11] CtxLimit:33/2048, Amt:9/256, Init:0.03s, Process:28.13s (1172.0ms/T = 0.85T/s), Generate:8.98s (998.2ms/T = 1.00T/s), Total:37.11s (0.24T/s)
Output: :** Around ~2-4 days.

Input: {"n": 1, "max_context_length": 4096, "max_length": 256, "rep_pen": 1.03, "temperature": 1, "top_p": 0.9, "top_k": 1, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 5, 0, 1, 3, 4, 2], "memory": "", "trim_stop": true, "genkey": "KCPP9464", "min_p": 0.1, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "dry_multiplier": 1.03, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": 360, "dry_sequence_breakers": ["\n", ":", "\"", "*"], "presence_penalty": 0, "logit_bias": {}, "prompt": "<\uff5cUser\uff5c>Repeat the string: \"**Duration:** Around ~2-4 days.\"<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cAssistant\uff5c><think></think>**Duration:** Around ~2-4 days.", "quiet": true, "stop_sequence": ["<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cUser\uff5c>", "<\uff5cend\u2581of\u2581sentence\uff5c><\uff5cAssistant\uff5c>"], "use_default_badwordsids": false, "bypass_eos": false}
::1 - - [12/Feb/2025 05:02:18] "POST /api/extra/generate/stream HTTP/1.1" 200 -

Processing 4 dry break strings...
Found a total of 1357 restart heads, 1357 trivial, 0 non-trivial.

Using Seed: 318538

[Debug: Dump Raw Input Tokens, format: 6]
'<|begin▁of▁sentence|> (0)', '<|User|> (128803)', 'Repeat (94973)', ' the (270)', ' string (3418)', ': (28)', ' " (582)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)', '  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '." (2148)', '<|end▁of▁sentence|> (1)', '<|Assistant|> (128804)', '<think> (128798)', '</think> (128799)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)', '  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '. (16)',


[Debug: Dump Forwarded Input Tokens, format: 6]
'  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '. (16)',

[Debug: n_past=26 Context Size = 26]
'<|begin▁of▁sentence|> (0)', '<|User|> (128803)', 'Repeat (94973)', ' the (270)', ' string (3418)', ': (28)', ' " (582)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)', '  (223)', '~ (96)', '2 (20)', '- (15)', '4 (22)', ' days (3137)', '." (2148)', '<|end▁of▁sentence|> (1)', '<|Assistant|> (128804)', '<think> (128798)', '</think> (128799)', '** (666)', 'Duration (46252)', ':** (11490)', ' Around (34659)',

Processing Prompt (7 / 7 tokens)
Generating (1 / 256 tokens) [(<|end▁of▁sentence|> 100.00%)]

(EOS token triggered! ID:1)
llama_perf_context_print:        load time =   28159.49 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     7 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    3016.06 ms /     8 tokens

[05:02:21] CtxLimit:34/2048, Amt:1/256, Init:0.03s, Process:2.98s (425.9ms/T = 2.35T/s), Generate:0.00s (3.0ms/T = 333.33T/s), Total:2.98s (0.34T/s)
Output:

Even if this particular case is caused by some kind of misbehavior of the DeepSeek tokenizer, that does not mean it is not worth fixing the context-sewing function to prevent similar reprocessing, which is inherently possible.

@aleksusklim
Author

With DeepSeek R1, this happens quite often, actually.

@LostRuins
Owner

Unfortunately this is not something that can be easily fixed, because the model was trained on a specific tokenizer, which must obey its own merging rules.

So if \n and \n\n are distinct tokens, yes, the model will be able to output them individually. But when large chunks of newlines appear in the training data, it expects them to be tokenized in the merged form, and you will see incoherence and degradation if they are not.

A more extreme example: the word "hello" is a single token, and the model also has single tokens for "h", "e", "l" and "o". However, if you shove the 5 tokens [h,e,l,l,o] in, you'd get worse results than simply using the correct single-token tokenization for it.
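
Restated in terms of this issue, a sequence only stays on-distribution if it is the tokenization the tokenizer itself would produce for its own text. A hedged sketch of that check (tokenize and token_to_string are assumed helper functions here, not actual llama.cpp/koboldcpp calls):

// A token sequence is "canonical" if re-tokenizing its own rendered text reproduces the same ids.
function is_canonical(tokens) {
  const text = tokens.map(token_to_string).join("");
  const again = tokenize(text); // assumed: full tokenizer with merging rules applied
  return again.length === tokens.length && again.every((id, k) => id === tokens[k]);
}
// ["hello"] as one merged token -> true; ["h","e","l","l","o"] -> false,
// even though both render as the exact same string "hello".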

@aleksusklim
Author

This is not what I am asking for here. The solution is neither "make the model print uniformly" nor "fix the tokenizer", but to prevent koboldcpp from re-tokenizing the internal context cache on subsequent requests within the same session!

@LostRuins
Owner

Yes, what I am trying to say is that doing that will lead to a significant degradation of the output quality.

If you just want something to experiment with for your own frontend, I can add an API where you can submit the exact array of token IDs you wish to use within the context. That approach will bypass the tokenizer entirely, and will give you full control of when reprocessing happens. You can use the tokenizer API separately to tokenize individual sub-chunks based on your own text splitting logic, and then feed the IDs for generation. Would that be useful for you?
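
If such an endpoint were added, the client side might look roughly like this; note that the endpoint name and payload shape below are purely hypothetical, sketching the proposal rather than any existing KoboldCpp API:

// Hypothetical request for the proposed "generate from exact token ids" API.
const response = await fetch("http://localhost:5001/api/extra/generate_from_tokens", { // made-up endpoint
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt_tokens: [0, 128803, 94973 /* ...the exact ids the frontend wants in context... */],
    max_length: 256,
  }),
});
console.log(await response.json());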

@aleksusklim
Author

aleksusklim commented Feb 15, 2025

Why do you think the output will be worse if it was THE MODEL ITSELF that chose the different tokens in its previous answer!?

I would rather believe that "fixing" the tokenization by re-tokenizing it "correctly" (as koboldcpp does now) has a greater chance of degrading quality.

Again, imagine that I ask a question and then let R1 think for 4k tokens. Whatever I get is tokenized "as the model wants it to be" in its answer.
Now, instead of letting it run for 4k, I set the amount to generate to just 512 tokens. The generation stops 8 times, and I have to run it again to continue each time.
Assume that once every 256 tokens the model outputs something that re-tokenizes differently. This means that for each of the 7 subsequent generations of 512 more tokens, I have to wait for a BLAS preprocess of around 256 tokens – seven of them in total!

And now you are claiming not only that the output of 4096 vs. 8x512 will/should/might be different, but that 8x512 should have better quality than a single run of 4096!
At that rate, llama.cpp might have to consider re-tokenizing things on the fly, re-running the generated tokens right as they come out.
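
A rough back-of-the-envelope calculation of that scenario, just to quantify the complaint (the once-per-256-tokens mismatch rate is the assumption stated above, not a measurement):

// Assumed scenario: 4096 tokens of thinking generated in chunks of 512,
// with a tokenization mismatch roughly once every 256 tokens.
const total = 4096, chunk = 512, mismatchSpan = 256;
const runs = total / chunk;           // 8 generation runs
const continuations = runs - 1;       // 7 runs that resume an existing context
console.log(continuations * mismatchSpan); // ~1792 tokens pushed back through BLAS, instead of almost none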

What I am saying is that:

  1. BLAS takes time, and should not interrupt a mere continuation of the generation.
  2. There should be no fundamental difference in the output depending on the "amount to generate" (unless we happen to stop right in the middle of a token sequence that could be tokenized differently, which is unlikely during normal iterations).

Reloading koboldcpp or flushing the context cache (by submitting an empty story, for example) will convert the "output vocabulary" to the "input vocabulary", but this affects neither p.1 (you'll do BLAS anyway) nor p.2 (it turns the statement into "a later continuation of the previous story might have a different outcome than generating it in one session", rather than a dependence on the amount-to-generate value – which will hold true only until you implement a mechanism for saving the context cache between runs).

Also, I believe the tokenizer is not that broken: even if it has subtle variations on multi-character symbol sequences (not parts of words), this should not generally affect the quality at all (take your example with "\n\n" vs. "\n"+"\n" – it is still a line feed, not something different).

The only model I know of that was "bad at spacing" is Command R+: it often produces double and triple spaces, indents lists differently, uses inconsistent dashes and quotes, and so on. If you don't edit this out of its previous output, it only gets worse; but even with broken lists and extra spaces, the quality of the text did not noticeably drop. (Though in the Command R+ case this is not necessarily due to tokenization…)

For DeepSeek, I have not noticed any structure breakage yet.
