Llama 3 8B instruct with fixed BPE tokenizer uploaded

Posted by noneabove1182@reddit | LocalLLaMA | 39 comments

https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF

I know it was just a week ago when I posted claiming "full support for Llama 3 in GGUF", but as I'm sure you all know, there was a BPE tokenizer bug.

This upload includes the fix. Running it with the latest llama.cpp ./main, we can see that even the Q2_K model gets the simple addition correct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 7777 + 3333?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The answer is: 11110<|eot_id|> [end of text]

These models will also load if you haven't updated to the latest llama.cpp, but they will still use the old broken tokenizer until your tool is updated. So feel free to download now in anticipation of support! I hear LM Studio should be updated by tomorrow.
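The prompt layout used in the test above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_llama3_prompt` is hypothetical, and the blank line after each header is an assumption based on Meta's published Llama 3 Instruct format (the post's text does not preserve line breaks).

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama 3 Instruct prompt that stops where the model should answer."""

    def turn(role: str, content: str) -> str:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user)
        # Open the assistant turn and leave it unfinished for the model to complete.
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    "You are a helpful, smart, kind, and efficient AI assistant. "
    "You always fulfill the user's requests to the best of your ability.",
    "What is 7777 + 3333?",
)

# The value a correctly tokenizing model should produce for the test question.
expected_answer = 7777 + 3333
print(prompt)
print(expected_answer)
```

Multi-digit arithmetic is a convenient smoke test here because it depends on how digits are grouped before tokenization.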


[-]

_Zibri_@reddit

Can you tell me what you did to fix it? I have the original model, and even with the patch it still outputs end of text :(

[-]

noneabove1182@reddit (OP)

Can you be more specific? When you say original model, are you referring to the safetensors? And which patch?

[-]

SomeOddCodeGuy@reddit

Awesome! Thanks for making these. I can't wait until this fix is merged into KoboldCpp.

[-]

Deathcrow@reddit

> I can't wait until this fix is merged into KoboldCpp.

I'm clueless, but since this affects tokenization in GGUF generation, does anything need to be merged into KoboldCpp at all? Shouldn't it just work when loading a correctly tokenized GGUF?

[-]

mikael110@reddit

The issue was technically not in the tokenizer itself, but in the pre-tokenizer, which is a pre-processing step that is a part of the inference portion of llama.cpp. The change in the conversion process is just to mark what pre-tokenizer should be used for the model, since llama.cpp now supports multiple different pre-tokenizers. So you need both a model that has been marked correctly, and a version of llama.cpp that has had the pre-tokenizer fix applied. Having just one or the other won't actually fix anything.
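The "you need both" rule can be written out as a tiny decision function. This is a sketch of the behaviour described in this comment, not llama.cpp code: the function name and the `"legacy"` label are hypothetical, and `"default"` stands for the fallback llama.cpp logs when the field is missing.

```python
from typing import Optional


def effective_pre_tokenizer(gguf_pre_field: Optional[str], runtime_has_fix: bool) -> str:
    """Return which pre-tokenization ends up being applied (illustrative labels only)."""
    if not runtime_has_fix:
        # An older build never reads the new field, so the marking is ignored.
        return "legacy"
    if not gguf_pre_field:
        # A fixed build with an unmarked file falls back to the generic rule.
        return "default"
    return gguf_pre_field


for field, fixed in [(None, False), ("llama3", False), (None, True), ("llama3", True)]:
    print(f"field={field!r:10} fixed_runtime={fixed!s:5} -> {effective_pre_tokenizer(field, fixed)}")
```

Only the last combination, a marked file on a fixed build, selects the Llama 3 pre-tokenizer.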

[-]

Calcidiol@reddit

So you're saying that an old-marked model file with a newly built llama.cpp still won't fix anything. But why wouldn't the newly fixed code be made to process the 'old' model files correctly? If doing anything other than applying the 'new' code logic is incorrect, I'm not sure what the point of having old GGUFs "stay wrong" is, unless there is some use case for another model where they actually DO work right with the 'old marking'.

Anyway, if it is JUST a marking in the metadata that differs between the 'old' and 'new' GGUF, wouldn't it be better to announce how to easily re-flag the previous GGUF models for those who have them, changing 1 byte of metadata, rather than downloading 8GB or 70GB again?

[-]

mikael110@reddit

Yes, old model files will stay broken. To quote Georgi Gerganov himself:

>Old GGUF models using BPE tokenizers, generated before this change, will fallback to the "default" pre-tokenization, which in almost all cases is wrong

As to why, that is pretty simple: there are multiple different pre-tokenizers, and which one to choose cannot be determined just by looking at the model architecture. So there isn't "A" new way to handle things, there are multiple new ways to handle things. And there is no way for llama.cpp to look at an existing model and know which one to choose. That is why a new field is required.

>Anyway if it is JUST a marking in the metadata that's different between the 'old' and 'new' GGUF wouldn't it be better than downloading 8GB or 70GB again to just change 1-byte of metadata flag to just announce how to easily re-flag the previous GGUF models for those that have them?

That is indeed an option. The metadata in question is `tokenizer.ggml.pre`, and setting it to `llama3` will fix the issue. You can override this during model load by using the argument `--override-kv tokenizer.ggml.pre=str:llama3`. It is likely possible to set it permanently using the `gguf-new-metadata.py` script, but I have never actually tried to add new metadata to a GGUF, so I'm not sure about the exact syntax.
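As a rough sketch, the load-time override can be assembled like this in Python. The model file name is a made-up placeholder and `build_override_command` is a hypothetical helper; `./main`, `-m`, `-p`, and the `--override-kv` value are the ones used with llama.cpp in this thread.

```python
import shlex


def build_override_command(model_path: str, prompt: str,
                           pre_tokenizer: str = "llama3",
                           binary: str = "./main") -> list:
    """Build an argument list that overrides tokenizer.ggml.pre at load time."""
    # The override value has the form KEY=TYPE:VALUE.
    override = f"tokenizer.ggml.pre=str:{pre_tokenizer}"
    return [binary, "-m", model_path, "-p", prompt, "--override-kv", override]


# Placeholder file name for an old, unmarked GGUF.
cmd = build_override_command("Meta-Llama-3-8B-Instruct-old.gguf", "What is 7777 + 3333?")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd)` as-is. The override only helps on a llama.cpp build that already contains the pre-tokenizer fix.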

[-]

0x9e3779b1@reddit

>And there is no way for llama.cpp to look at an existing model and know which one to choose. That is why a new field is required.

Not really. It seems trivial to implement a more accurate **model aware** fallback instead of some 'default'.

For the explanation below, I'm referring to `llama.cpp` revision `952d03dbead16e4dbdd1d3458486340673cc2465`, pinned by `ollama v0.1.33`:

```sh
$ pwd
/Users/ic/dev/ollama_upstream/llm/llama.cpp
$ git rev-parse HEAD
952d03dbead16e4dbdd1d3458486340673cc2465
$ awk '(NR>=4341 && NR<=4382 ){print NR " " $0}' llama.cpp
4341     // for now, only BPE models have pre-tokenizers
4342     if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
4343         if (tokenizer_pre.empty()) {
4344             LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
4345             LLAMA_LOG_WARN("%s: \n", __func__);
4346             LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
4347             LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
4348             LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
4349             LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
4350             LLAMA_LOG_WARN("%s: \n", __func__);
4351             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
4352         } else if (
4353                 tokenizer_pre == "default") {
4354             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
4355         } else if (
4356                 tokenizer_pre == "llama3" ||
4357                 tokenizer_pre == "llama-v3" ||
4358                 tokenizer_pre == "llama-bpe") {
4359             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
4360         } else if (
4361                 tokenizer_pre == "deepseek-llm") {
4362             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM;
4363         } else if (
4364                 tokenizer_pre == "deepseek-coder") {
4365             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER;
4366         } else if (
4367                 tokenizer_pre == "falcon") {
4368             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_FALCON;
4369         } else if (
4370                 tokenizer_pre == "mpt") {
4371             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_MPT;
4372         } else if (
4373                 tokenizer_pre == "starcoder") {
4374             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_STARCODER;
4375         } else if (
4376                 tokenizer_pre == "gpt-2") {
4377             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;
4378         } else {
4379             throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
4380         }
4381     } else {
4382         vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
```

As you can see, pre-tokenizers are largely _model-specific_. That is, the most prominent model names are already hardcoded in this logic, indirectly. So we could amend it to take our actual model name into account:

```cpp
if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
    if (tokenizer_pre.empty()) {
        tokenizer_pre = <our_model_name_from_metadata>;
    }
    if (
            tokenizer_pre == "llama3" ||
            tokenizer_pre == "llama-v3" ||
            tokenizer_pre == "llama-bpe") {
        ...
    } else {
        throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
    }
    if (tokenizer_pre.empty()) {
        LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
        ...
    }
    ...
}
```

[-]

mikael110@reddit

The problem is that GGUFs don't actually contain the model name. They contain the model architecture, which, yes, would be enough to distinguish some of those models. For others, like Llama-3 and Deepseek, it is impossible to distinguish them since they both use the same architecture. And that's coming from [Georgi Gerganov](https://github.com/ggerganov/llama.cpp/pull/6920#discussion_r1580932467) himself. That is the discussion I was paraphrasing in my comment. I kept a close eye on that PR as it developed, so I'm well aware of all the code that went into it.
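A small Python sketch makes the ambiguity concrete. Only the Llama-3/Deepseek pairing comes from this thread; the other architecture strings in the table are assumptions added for contrast, and the function name is hypothetical.

```python
from collections import defaultdict

# Pre-tokenizer name -> model architecture recorded in the GGUF.
# "llama3" and "deepseek-llm" sharing "llama" is the case described above;
# the remaining rows are illustrative assumptions.
PRE_TOKENIZER_ARCH = {
    "llama3": "llama",
    "deepseek-llm": "llama",
    "falcon": "falcon",
    "mpt": "mpt",
}


def ambiguous_architectures(mapping):
    """Return architectures that map to more than one pre-tokenizer."""
    by_arch = defaultdict(set)
    for pre_tokenizer, arch in mapping.items():
        by_arch[arch].add(pre_tokenizer)
    return {arch: names for arch, names in by_arch.items() if len(names) > 1}


ambiguous = ambiguous_architectures(PRE_TOKENIZER_ARCH)
print(ambiguous)
```

Any architecture that shows up in the result cannot be used on its own to pick a pre-tokenizer, which is why an explicit field in the file is needed.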

[-]

0x9e3779b1@reddit

OK, if the model name is not to be relied upon at all, then it's clear. Thank you for the explanation.

[-]

Acceptable_Total_937@reddit

I downloaded the new GGUF Q6_K and am using it with langchain + llama.cpp. It was working fine when I tested it with a simple prompt. When my prompt got longer (still a very reasonable size), it started responding only with 'assistant' or a random response like "in real time". Anyone else getting this?

[-]

LocoLanguageModel@reddit

Thanks for uploading the pre-tokenizer-fixed 70B models!

[-]

noneabove1182@reddit (OP)

the post-token fix models are up as well here: https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF

[-]

Calcidiol@reddit

Thanks for the information and for making the updates!

One thing I'm confused about, though, is how this all works. I would have guessed that tokenization relates to presenting the input to the model, and to taking the output of the model and translating it to text. But I thought those processes were done programmatically by some mix of the UI / inference API / inference engine. So why are newly converted and quantized GGUF models actually needed?

I didn't think the GGUF conversion process could change anything about the model's intrinsic vocabulary / token coding dictionary / token I/O interfaces. So at most, I guess there could be metadata in the GGUF that is derived from the origin model's metadata files, maybe what is in the JSON or similar files about the model architecture / vocabulary / etc.

So does announcing entirely new quantized models indicate that anyone using previously converted GGUFs will not be able to achieve correct inference, even if they update their llama.cpp-based engine to the current release? Or is it simply some metadata that is wrong in the GGUFs, which could in theory be edited inside the GGUF by rewriting small parts while leaving the actual model content alone?

Pulling new 8B, and especially 70B, models may not be entirely trivial in time or resources if it isn't functionally necessary and there's a simpler solution in code / metadata. E.g. https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF-old vs https://huggingface.co/bartowski/Meta-Llama-3-70B-Instruct-GGUF or this 8B conversion: won't they really be almost 99.9999% identical?

[-]

noneabove1182@reddit (OP)

There is a way to use the old GGUF files with the new tokenizer fix, by passing --override-kv tokenizer.ggml.pre=str:llama3 at generation time.

I haven't gone through the technical details enough to give a confident answer, but my guess would be something about metadata or the way that the conversion encodes the tokenizer itself.

The reason for announcing it as brand new is that you may be able to use the old files with a workaround, but it's better to use the new, fixed ones.

[-]

mikael110@reddit

One thing that has become somewhat lost in this discussion (for understandable reasons) is that the issue isn't actually in the tokenizer itself, but in the pre-tokenizer. Most models don't pass text directly to the tokenizer; they pre-process the text in some way and pass the pre-processed text to the tokenizer. It is that step that was essentially broken in old llama.cpp builds, because they used a hard-coded pre-processing step that was generally close to what most models did but not exactly right. The problem became quite noticeable for Llama 3 because it uses a rather complex pre-processing step.

The new PR adds support for a number of different pre-tokenizers. Since you cannot determine the correct pre-tokenizer just by looking at the model architecture or the tokenizer, a new field had to be introduced to tell llama.cpp what pre-tokenization to perform. That is why changes were made to the conversion script: it now figures out which pre-tokenizer is correct and marks the file during conversion. This is why you need both a new file and an updated version of llama.cpp.
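To make the pre-tokenizer step concrete, here is a toy Python comparison of two split rules. Both patterns are simplified, ASCII-only stand-ins written for this example, not the regexes llama.cpp or the models actually use. The assumption being illustrated is that a generic rule keeps a run of digits together while a Llama-3-style rule splits digits into groups of at most three.

```python
import re

# Generic-style rule (assumed for illustration): a run of digits stays in one piece.
GENERIC_LIKE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

# Llama-3-style rule (simplified): digits are split into chunks of 1 to 3.
LLAMA3_LIKE = re.compile(r"[^\r\nA-Za-z\d]?[A-Za-z]+|\d{1,3}| ?[^\sA-Za-z\d]+|\s+")


def pre_tokenize(pattern, text):
    """Split text into the pieces that would be handed to the BPE tokenizer."""
    return pattern.findall(text)


text = "What is 7777 + 3333?"
generic_pieces = pre_tokenize(GENERIC_LIKE, text)
llama3_pieces = pre_tokenize(LLAMA3_LIKE, text)
print(generic_pieces)
print(llama3_pieces)
```

The BPE merges then run inside each piece separately. If the pieces differ from the ones the model saw during training, the resulting token IDs differ too, which would explain why number-heavy prompts such as the addition test were affected.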

[-]

noneabove1182@reddit (OP)

Thank you so much for this write up, this explains a lot and why the re-conversion was necessary! Will point future questions here because this is the most succinct write up i've seen on the subject, thanks again :D

[-]

Tall-Entrepreneur686@reddit

This new Llama 3 model is much slower with grammar than Llama 2. If I used grammar with Llama 2, it would barely change the t/s. Now adding grammar slows down t/s by 5 to 10 times. E.g.:

"temperature": 0, "top_p": 0.9, "max_length": 100, "grammar":" root ::= fullanswer \n fullanswer ::= \"Herika: \" answer \nanswer ::= sentence | \"<|im_end|>\" | sentence \"\\n\"\nsentence ::= [a-zA-Z0-9.,?!' ]*\n"

[-]

noneabove1182@reddit (OP)

I wonder if that's expected because of the token pre-processor.. would be unfortunate :S

[-]

mikael110@reddit

No problem, I've seen a lot of confusion around it, so I just wanted to clarify it a bit. And thank you for the work you do requanting the model. You're the only person I've seen so far that has actually bothered keeping up with all of the changes.

[-]

aseichter2007@reddit

I've already been impressed. Support dropped, let's see how she goes!

>|||! Richard "*Ricky*" LaFleur and Drunk James "Jim" Lahey , rps| Ricky wants Lahey to get more kitties. They have a physical altercation. Lahey uses a cat as a weapon.

(The scene takes place in the dimly lit, smoky bar. Ricky and Lahey are standing face-to-face, their faces inches apart. The air is thick with tension.)

Ricky: (slurring his words) You're a monster, Lahey. You hate kitties.

Lahey: (sneering) Shut up, Ricky. I don't hate kitties. I just don't care about 'em.

Ricky: (outraged) You don't care?! You're supposed to be a good guy, Jim! Good guys like kitties!

Lahey: (rolling his eyes) Oh, spare me the drama, Ricky. You're just drunk and looking for trouble.

Ricky: (enraged) THAT'S IT, LAHEY! (he swings at Lahey, but misses)

Lahey: (laughing) Oh, Ricky's got a swing! (he pulls out a cat from behind the bar and holds it up like a weapon) You want to play rough? Let's play rough!

(Ricky's eyes widen in horror as Lahey brandishes the cat)

Ricky: (backing away) NOOO! DON'T HURT THE KITTY!

Lahey: (menacingly) Oh, this little guy's not going to hurt anyone... unless you try anything funny. (he takes a swipe at Ricky with the cat)

(Ricky tries to dodge, but the cat claws his arm. Ricky yelps in pain and rage)

Ricky: (furious) YOU MONSTER! (he lunges at Lahey, determined to take down the cat-wielding Lahey)

[-]

nsfw_throwitaway69@reddit

Thanks for work on the quants! Any plans to re-quant the 70b as well?

[-]

noneabove1182@reddit (OP)

Yup :) Will just take a bit longer to make, but should be up tomorrow or so

[-]

DNskfKrH8Ekl@reddit

Super keen to see how this improves crewai local performance. There is still no valid 70B GGUF on Hugging Face, and the official one does not pass the test: What is 3333 + 777?

[-]

Healthy-Nebula-3603@reddit

where is new version for 70b?

[-]

noneabove1182@reddit (OP)

Slowly getting built :) up to IQ3_XXS

[-]

jayFurious@reddit

godspeed

[-]

noneabove1182@reddit (OP)

Had problems again so rented a runpod instance, hoping i'll be able to upload within an hour :) it'll be on a new repo

[-]

Healthy-Nebula-3603@reddit

thanks, I need Q4_K_M :D

[-]

Some_Endian_FP17@reddit

You are the new TheBloke. A total legend, thank you for the GGUFs. Now for a noob who hasn't tried imatrix quants, what would be the equivalent of a Q4KM or Q5KM for CPU inference?

[-]

noneabove1182@reddit (OP)

<3

You can actually just use Q4_K_M or Q5_K_M; all the quants on my page use imatrix.

Don't use an i-quant (which is unrelated to imatrix) if you use CPU. It's supported but slow. You can check here for info about support and notable slowness: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix