Mistral Large 2407 Speculative Decoding issues on llama.cpp
Posted by Judtoff@reddit | LocalLLaMA | 5 comments
Has anyone been able to get speculative decoding working with Mistral Large 2407 on the llama.cpp server? I'm using Mistral-7B-Instruct-v0.3-Q6\_K.gguf as the draft model. It looks like token 10 in the draft model is different from token 10 in Mistral Large. I tried naively editing the gguf by replacing \[control\_8\] with \[IMG\], but this did not work. I'm not sure how else I can force the token in the draft model to match the target model.
Here is the command I ran:
./llama.cpp/build/bin/llama-server -m \~/llama.cpp/models/Mistral-Large-Instruct-2407.Q3\_K\_M.gguf-00001-of-00007.gguf -ngl 89 --split-mode row --flash-attn -c 1024 --port 8080 --host 192.168.50.126 \-md \~/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6\_K.gguf -ngld 99 --draft-max 16 --draft-min 1 --draft-p-min 0.9 --temp 0.0
and the error:
common\_speculative\_are\_compatible: draft model vocab must match target model to use speculation but token 10 content differs - target '\[IMG\]', draft '\[control\_8\]'
srv load\_model: the draft model '/home/jud/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6\_K.gguf' is not compatible with the target model '/home/jud/llama.cpp/models/Mistral-Large-Instruct-2407.Q3\_K\_M.gguf-00001-of-00007.gguf'
main: exiting due to model loading error
double free or corruption (!prev)
Aborted (core dumped)
For reference, this is on a 3x P40 setup, and I am not running out of VRAM (yet).
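The error only reports the first token that differs, so it may help to list every mismatched ID before trying to patch the draft model. Below is a minimal sketch of that kind of comparison. It assumes the token strings of both models have already been extracted into plain Python lists (reading them from the GGUF files is not shown), and `vocab_mismatches` and `start_id` are hypothetical names for illustration, not llama.cpp's actual implementation. The toy vocabularies use placeholder tokens except for ID 10, which is taken from the error message.

```python
def vocab_mismatches(target_tokens, draft_tokens, start_id=0):
    """Return (token_id, target_text, draft_text) for every ID in the
    shared range where the two vocabularies disagree."""
    shared = min(len(target_tokens), len(draft_tokens))
    return [
        (i, target_tokens[i], draft_tokens[i])
        for i in range(start_id, shared)
        if target_tokens[i] != draft_tokens[i]
    ]


# Toy vocabularies: placeholder tokens, with only ID 10 differing
# as in the reported error (the real vocabularies are much larger).
shared_prefix = [f"tok_{i}" for i in range(10)]
target = shared_prefix + ["[IMG]", "tok_11"]
draft = shared_prefix + ["[control_8]", "tok_11"]

for token_id, target_text, draft_text in vocab_mismatches(target, draft):
    print(f"token {token_id}: target {target_text!r}, draft {draft_text!r}")
```

If ID 10 turns out to be the only difference, the mismatch is limited to one control token. If many IDs differ, renaming a single token in the draft gguf is unlikely to be enough.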