Gemma 4 31b draft model benchmarks
Posted by tecneeq@reddit | LocalLLaMA | View on Reddit | 12 comments
https://docs.google.com/spreadsheets/d/1NzZC4JShGluwH2fdjlMbZ2ke99AcTctUnM7rG12_cYE/edit?usp=sharing
The benchmarks were run in an LXC container on Proxmox, on a Bosgame M5 Strix Halo 128GB board. Software was llama.cpp on ROCm 7.2.
The best compromise between speed and precision, I think, is unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL with unsloth/gemma-4-E2B-it-GGUF:UD-Q3_K_XL as the draft model.
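For anyone wanting to reproduce a setup like this, a sketch of how the target/draft pairing is wired up in llama.cpp. The model filenames are placeholders for the quants named above, and the flag names are from recent llama.cpp builds (confirm against `llama-server --help` for your version):

```shell
# Sketch only: filenames are placeholders, flags from recent llama.cpp builds.
# -m   = target model, -md = draft model
# -ngl / -ngld = GPU layers to offload for target / draft
# --draft-max / --draft-min = how many tokens to draft per verification step
llama-server \
  -m gemma-4-31B-it-UD-Q8_K_XL.gguf \
  -md gemma-4-E2B-it-UD-Q3_K_XL.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4
```

The server log reports the draft acceptance statistics per request, which is what the benchmark numbers in the spreadsheet are derived from.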
Rattling33@reddit
Thanks for sharing as another m5 owner.
q-admin007@reddit
We are dozens. DOZENS!
andrewlewin@reddit
Other M5 users!! I must be M5 owner number 25 or 37!!
klotar99@reddit
If you have the VRAM, 26B A4B is a better speculative drafter for me since the active params are similar (UD Q2 gives 80-95% acceptance).
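Acceptance rates in that range translate fairly directly into expected speedup. A back-of-envelope sketch using the standard speculative-decoding expectation (not llama.cpp internals; the draft-cost ratio of 0.15 is a made-up guess for a small-active-param MoE drafting for a 31B dense model):

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when each of k drafted
    tokens is accepted independently with probability alpha (the target
    model always contributes one token itself): 1 + alpha + ... + alpha^k."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def rough_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Throughput multiplier vs plain decoding; draft_cost is the per-token
    cost of the draft model relative to one target-model forward pass."""
    return expected_accepted(alpha, k) / (k * draft_cost + 1)

for alpha in (0.80, 0.95):
    print(alpha,
          round(expected_accepted(alpha, k=4), 2),
          round(rough_speedup(alpha, k=4, draft_cost=0.15), 2))
```

With 4 drafted tokens per step, moving acceptance from 80% to 95% lifts the expected tokens per step from about 3.4 to about 4.5, which is why the acceptance rate dominates the choice of drafter quant.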
DeepOrangeSky@reddit
I forgot to even consider using the MoE as the draft model, but good call. Which exact quants were you using (both for the 31b and the 26b)? I think I will give it a try on my mac. Did you experiment with lots of quant combos, or just one so far?
q-admin007@reddit
I added corresponding benchmarks.
klotar99@reddit
I don’t remember how many quants I tried, but at least 4 for several models. My best outcome for accuracy and speed on single prompts was 31B (UD) Q4 (at Q3 and below I start getting wrong answers) -> 26B (UD) Q2 (the drafter doesn't need high accuracy since it's so close to the same model, just speed). I did a similar test with the GPT OSS models and the MoE drafter actually improved the answer for the main model (not entirely sure how).
DeepOrangeSky@reddit
Wow, that's actually pretty weird about the OSS thing, lol. Everyone always says that kind of thing isn't supposed to happen (although normally they mean quality can't drop at all, rather than quality increasing, but still, definitely interesting if it did). I wonder if the target model was somehow malfunctioning, with flipped bits in the main weights (which can go unnoticed, since flipped bits don't necessarily overtly destroy a model unless they land in certain parts of it; in the main weights the model can still seem to work more or less normally), and the drafter somehow "repaired" it a little, or if it was a genuine improvement without any malfunction involved.
q-admin007@reddit
I have added benchmarks with 26B-A4B as the draft model. Indeed, it gives the highest acceptance and the highest speedup, at the cost of higher VRAM usage.
PrzemChuck@reddit
Does the temperature affect acceptance? Or were all tests run with greedy decoding?
q-admin007@reddit
These settings are an artefact of a copy-and-paste operation. My take is that it doesn't make a difference, unless someone runs a benchmark that shows otherwise.
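Whether temperature matters depends on the verification scheme: with greedy decoding a drafted token is accepted iff it matches the target model's argmax, while with standard speculative sampling the expected per-token acceptance is the overlap of the two distributions, sum over x of min(p(x), q(x)). A small self-contained sketch of the latter (the logits are made up purely for illustration):

```python
import math

def softmax(logits, temperature):
    """Convert logits to probabilities at the given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def acceptance_prob(target_logits, draft_logits, temperature):
    """Per-token acceptance probability of standard speculative sampling:
    sum_x min(p(x), q(x)), the overlap between target and draft distributions."""
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

target = [2.0, 1.0, 0.5]  # made-up target-model logits
draft  = [1.8, 1.2, 0.4]  # made-up draft-model logits, same argmax
for t in (0.1, 0.7, 1.5):
    print(t, round(acceptance_prob(target, draft, t), 3))
```

At low temperature both distributions collapse onto the same argmax and acceptance approaches 1, which is consistent with "it doesn't make much difference" when the two models usually agree on the top token; at intermediate temperatures the overlap can dip.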