Mistral Large 2407 Speculative Decoding issues on llama.cpp

Posted by Judtoff@reddit | LocalLLaMA

Has anyone been able to get speculative decoding working with Mistral Large 2407 on the llama.cpp server? I'm using Mistral-7B-Instruct-v0.3-Q6_K.gguf as the draft model. It looks like token 10 in the draft model is different from token 10 in Mistral Large. I tried naively editing the GGUF by replacing [control_8] with [IMG], but this did not work. I'm not sure how else I can force the token in the draft model to match the target model.

Here is the command I ran:

    ./llama.cpp/build/bin/llama-server -m ~/llama.cpp/models/Mistral-Large-Instruct-2407.Q3_K_M.gguf-00001-of-00007.gguf -ngl 89 --split-mode row --flash-attn -c 1024 --port 8080 --host 192.168.50.126 -md ~/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6_K.gguf -ngld 99 --draft-max 16 --draft-min 1 --draft-p-min 0.9 --temp 0.0

And the error:

    common_speculative_are_compatible: draft model vocab must match target model to use speculation but token 10 content differs - target '[IMG]', draft '[control_8]'
    srv load_model: the draft model '/home/jud/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6_K.gguf' is not compatible with the target model '/home/jud/llama.cpp/models/Mistral-Large-Instruct-2407.Q3_K_M.gguf-00001-of-00007.gguf'
    main: exiting due to model loading error
    double free or corruption (!prev)
    Aborted (core dumped)

For reference, this is on a 3x P40 setup, and I am not running out of VRAM (yet).
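The error above reports only the first token whose text differs. A minimal sketch of listing every mismatch between two vocabularies might look like the following. It is not llama.cpp's own check: the function name `vocab_mismatches` is hypothetical, and the token lists are placeholders in which only the token 10 pair is taken from the error message. With real models, the lists would have to be read from each GGUF's `tokenizer.ggml.tokens` metadata first, which this sketch does not attempt.

```python
from itertools import islice


def vocab_mismatches(target_tokens, draft_tokens, start_id=0, limit=10):
    """Return up to `limit` (token_id, target_text, draft_text) tuples
    for ids at or after `start_id` where the two vocabularies disagree."""
    diffs = (
        (i, t, d)
        for i, (t, d) in enumerate(zip(target_tokens, draft_tokens))
        if i >= start_id and t != d
    )
    return list(islice(diffs, limit))


# Placeholder vocabularies (hypothetical); only the token 10 texts
# come from the error message in the post.
target = [f"tok{i}" for i in range(12)]
draft = list(target)
target[10] = "[IMG]"
draft[10] = "[control_8]"

print(vocab_mismatches(target, draft))
# [(10, '[IMG]', '[control_8]')]
```

If token 10 turns out to be the only difference, that would be consistent with the draft model being "close enough", but it says nothing about whether the loader will accept it.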

tengo_harambe@reddit

Heads up, in case you haven't figured this out yet: koboldcpp supports speculative decoding and lets you bypass the vocab mismatch error in debug mode.

Judtoff@reddit (OP)

In fact, that is what drove me from llama.cpp to koboldcpp. In benchmarks I see an incredible speedup, but in practice, with long context, the speedup more or less vanishes. It's probably great for coding, but for role play at around 20,000 tokens of context it doesn't seem to provide a benefit (or possibly I'm doing something wrong).

abc-nix@reddit

It is possible, but you need to hack the llama.cpp code before building so that the server ignores the vocab mismatch error/warning. I have done it with Mistral Large 2411, but for my use (non-English) there is barely any speedup, and sometimes it is a bit slower.

First, comment out the `return 1;` at [this line](https://github.com/ggerganov/llama.cpp/blob/a4dd490069a66ae56b42127048f06757fc4de4f7/examples/speculative/speculative.cpp#L137) of llama.cpp/examples/speculative/speculative.cpp so that it doesn't exit when it detects a mismatched vocab.

Second, comment out the `return false;` at [this line](https://github.com/ggerganov/llama.cpp/blob/a4dd490069a66ae56b42127048f06757fc4de4f7/examples/server/server.cpp#L1721) of llama.cpp/examples/server/server.cpp so that the server doesn't exit when it detects a "non-compatible draft model".

Third, build normally with cmake and test that llama-server now works with the draft model (it will still print the warning messages, but it should ignore them and work "properly").

This is a hack, so I would only use this build when I know the draft model's vocab almost matches the main model's. I know that Mistral 7B v0.3 Instruct matches closely enough to work, but Mistral-Small doesn't (I suspect Mistral Nemo doesn't either, but I haven't tested this).

As I mentioned before, I don't have enough VRAM, so Mistral Large with FA is slow even with the draft model. I hope that with your amount of VRAM you can get better speeds than I do, and that you can share them here or in another post once you have it working. It would be interesting to know how well it works for you with llama.cpp.
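One way to reason about "barely any speedup" is a rough back-of-the-envelope model, sketched below. It assumes each drafted token is accepted independently with the same probability, which is a simplification rather than anything measured in this thread. Under that assumption, the expected number of tokens produced per target-model pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. The function name is hypothetical; the draft length of 16 mirrors `--draft-max 16` from the original command.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens per target-model pass, assuming every drafted token
    is accepted independently with probability `accept_rate` (a simplification)."""
    if accept_rate >= 1.0:
        # Every draft accepted, plus the one token the target always contributes.
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


# Illustrative acceptance rates only; these are not measurements.
for rate in (0.3, 0.6, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 16), 2))
```

The model ignores the time spent running the draft model itself, so with a low acceptance rate the real result can be a net slowdown, which would match the "sometimes it is a bit slower" observation.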

Judtoff@reddit (OP)

That's really helpful, thank you. When I tried with Qwen and Llama 3 in the past I didn't see any speedup, so I had abandoned speculative decoding. I thought I would give it one last shot with Mistral Large, since users on some other backends saw speedups. I'm a bit apprehensive about rebuilding llama.cpp. I was really hoping there was a flag that could be set to ignore the vocabulary mismatch, or a tool that could edit the GGUF. Thanks for the detailed answer though, I really appreciate it.

TheTerrasque@reddit

It only really works well when the output is fairly deterministic, which usually means setting temp to 0. It works well for coding, where you don't want creativity.
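To illustrate why temp 0 helps, here is a toy sketch of one round of greedy speculative decoding. It is not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, each returning its single most likely next token, and a real backend would verify all drafted tokens in one batched pass rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, draft_len):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # The draft model proposes `draft_len` tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(draft_len):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # The target checks each position and keeps drafts until the first disagreement.
    accepted = []
    ctx = list(context)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the rejected draft
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    accepted.append(target_next(ctx))  # bonus token when every draft was accepted
    return accepted


# Toy "models" (hypothetical): the target emits len(ctx) % 5,
# and the draft agrees except when the context length is 3.
target_next = lambda ctx: len(ctx) % 5
draft_next = lambda ctx: 9 if len(ctx) == 3 else len(ctx) % 5

print(speculative_step(target_next, draft_next, [0, 1], 4))
# [2, 3]
```

The output is always what the target would have produced on its own; the draft only changes how many tokens arrive per round. The more predictable the text, the longer the agreeing prefix tends to be.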