Mistral Large 2407 Speculative Decoding issues on llama.cpp

Posted by Judtoff@reddit | LocalLLaMA

Has anyone been able to get speculative decoding working with Mistral Large 2407 on the llama.cpp server? I'm using Mistral-7B-Instruct-v0.3-Q6\_K.gguf as the draft model. It looks like token 10 in the draft model's vocabulary differs from token 10 in Mistral Large. I tried naively editing the GGUF by replacing \[control\_8\] with \[IMG\], but this did not work. I'm not sure how else I can force the token in the draft model to match the target model.

Here is the command I ran:

./llama.cpp/build/bin/llama-server -m \~/llama.cpp/models/Mistral-Large-Instruct-2407.Q3\_K\_M.gguf-00001-of-00007.gguf -ngl 89 --split-mode row --flash-attn -c 1024 --port 8080 --host 192.168.50.126 -md \~/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6\_K.gguf -ngld 99 --draft-max 16 --draft-min 1 --draft-p-min 0.9 --temp 0.0

And the error:

common\_speculative\_are\_compatible: draft model vocab must match target model to use speculation but token 10 content differs - target '\[IMG\]', draft '\[control\_8\]'

srv load\_model: the draft model '/home/jud/llama.cpp/models/Mistral-7B-Instruct-v0.3-Q6\_K.gguf' is not compatible with the target model '/home/jud/llama.cpp/models/Mistral-Large-Instruct-2407.Q3\_K\_M.gguf-00001-of-00007.gguf'

main: exiting due to model loading error

double free or corruption (!prev)

Aborted (core dumped)

For reference, this is on a 3x P40 setup, and I am not running out of VRAM (yet).
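To see whether token 10 is the only difference or just the first one reported, it may help to dump both vocabularies and compare them by token id. Below is a minimal sketch of that idea, not a fix. It assumes the `gguf` Python package (gguf-py from the llama.cpp repo) is installed and that both files store their token list under the `tokenizer.ggml.tokens` key. The function names `vocab_mismatches` and `load_tokens` are made up for illustration, and for a split model the first shard is assumed to hold the tokenizer metadata.

```python
from typing import List, Tuple


def vocab_mismatches(
    target: List[str], draft: List[str], limit: int = 10
) -> List[Tuple[int, str, str]]:
    """Return up to `limit` (token_id, target_text, draft_text) entries
    where the two vocabularies disagree at the same token id."""
    mismatches = []
    for token_id, (t, d) in enumerate(zip(target, draft)):
        if t != d:
            mismatches.append((token_id, t, d))
            if len(mismatches) >= limit:
                break
    return mismatches


def load_tokens(path: str) -> List[str]:
    """Read the token strings from a GGUF file.

    Assumption: the `gguf` package is installed (pip install gguf) and the
    file keeps its vocabulary under the 'tokenizer.ggml.tokens' key.
    """
    from gguf import GGUFReader  # imported lazily so the comparison works without it

    reader = GGUFReader(path)
    field = reader.fields["tokenizer.ggml.tokens"]
    # field.data holds indices into field.parts, one per token string
    return [
        bytes(field.parts[i]).decode("utf-8", errors="replace") for i in field.data
    ]


if __name__ == "__main__":
    import sys

    # Usage: python compare_vocab.py <target.gguf> <draft.gguf>
    if len(sys.argv) == 3:
        target_tokens = load_tokens(sys.argv[1])
        draft_tokens = load_tokens(sys.argv[2])
        print(f"vocab sizes: target={len(target_tokens)} draft={len(draft_tokens)}")
        for token_id, t, d in vocab_mismatches(target_tokens, draft_tokens):
            print(f"token {token_id}: target {t!r}, draft {d!r}")
```

If the report shows many differing ids rather than one, hand-editing a single token string is unlikely to be enough, and a draft model built on the same tokenizer as the target would probably be the more practical route.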