Gemma 4 31b draft model benchmarks
Posted by tecneeq@reddit | LocalLLaMA | View on Reddit | 12 comments
https://docs.google.com/spreadsheets/d/1NzZC4JShGluwH2fdjlMbZ2ke99AcTctUnM7rG12_cYE/edit?usp=sharing
The benchmarks were run in an LXC container on Proxmox, on a Bosgame M5 Strix Halo 128GB board. Software was llama.cpp on ROCm 7.2.
The best compromise between speed and precision, I think, is unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL with unsloth/gemma-4-E2B-it-GGUF:UD-Q3_K_XL as the draft model.
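For anyone wanting to reproduce a setup like this, a sketch of how the target/draft pairing is wired up in llama.cpp. The model filenames are placeholders for the quants named above, and the flag names are from recent llama.cpp builds (confirm against `llama-server --help` for your version):

```shell
# Sketch only: filenames are placeholders, flags from recent llama.cpp builds.
# -m   = target model, -md = draft model
# -ngl / -ngld = GPU layers to offload for target / draft
# --draft-max / --draft-min = how many tokens to draft per verification step
llama-server \
  -m gemma-4-31B-it-UD-Q8_K_XL.gguf \
  -md gemma-4-E2B-it-UD-Q3_K_XL.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4
```

The server log reports the draft acceptance statistics per request, which is what the benchmark numbers in the spreadsheet are derived from.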
Rattling33@reddit
Thanks for sharing as another m5 owner.
q-admin007@reddit
We are dozens. DOZENS!
andrewlewin@reddit
Other M5 users!! I must be M5 owner number 25 or 37!!
klotar99@reddit
If you have the VRAM, 26B A4B is a better speculative drafter for me since the active params are similar (UD Q2 gives 80-95% acceptance).
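Acceptance rates in that range translate fairly directly into expected speedup. A back-of-envelope sketch using the standard speculative-decoding expectation (not llama.cpp internals; the draft-cost ratio of 0.15 is a made-up guess for a small-active-param MoE drafting for a 31B dense model):

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when each of k drafted
    tokens is accepted independently with probability alpha (the target
    model always contributes one token itself): 1 + alpha + ... + alpha^k."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def rough_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Throughput multiplier vs plain decoding; draft_cost is the per-token
    cost of the draft model relative to one target-model forward pass."""
    return expected_accepted(alpha, k) / (k * draft_cost + 1)

for alpha in (0.80, 0.95):
    print(alpha,
          round(expected_accepted(alpha, k=4), 2),
          round(rough_speedup(alpha, k=4, draft_cost=0.15), 2))
```

With 4 drafted tokens per step, moving acceptance from 80% to 95% lifts the expected tokens per step from about 3.4 to about 4.5, which is why the acceptance rate dominates the choice of drafter quant.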
DeepOrangeSky@reddit
I forgot to even consider using the MoE as the draft model, but good call. Which exact quants were you using (both for the 31b and the 26b)? I think I will give it a try on my mac. Did you experiment with lots of quant combos, or just one so far?
q-admin007@reddit
I added corresponding benchmarks.
klotar99@reddit
I don’t remember how many quants I tried, but at least 4 for several models. My best outcome for accuracy and speed on single prompts was 31B (UD) Q4 (at Q3 and below I start getting wrong answers) -> 26B (UD) Q2 (the drafter doesn't need high accuracy since it's so close to the same model, just speed). I did a similar test with the GPT OSS models and the MoE drafter actually improved the answer for the main model (not entirely sure how).
DeepOrangeSky@reddit
Wow, that's actually pretty weird about the OSS thing, lol. Everyone always says that kind of thing isn't supposed to happen (although normally they mean quality can't drop at all, rather than quality increasing, but still, definitely interesting if it did). I wonder if the target model was somehow malfunctioning, with flipped bits in the main weights (which can go unnoticed, since flipped bits don't necessarily overtly destroy a model unless they land in certain parts of it; in the main weights the model can still seem to work more or less normally), and the drafter somehow "repaired" it a little, or if it was a genuine improvement without any malfunction involved.
q-admin007@reddit
I have added benchmarks with 26B-A4B as the draft model. Indeed, it gives the highest acceptance and the highest speedup, at the cost of higher VRAM usage.
PrzemChuck@reddit
Does the temperature affect acceptance? Or were all tests run with greedy decoding?
q-admin007@reddit
These settings are an artefact of a copy-and-paste operation. My take is that it doesn't make a difference, unless someone runs a benchmark that shows otherwise.
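Whether temperature matters depends on the verification scheme: with greedy decoding a drafted token is accepted iff it matches the target model's argmax, while with standard speculative sampling the expected per-token acceptance is the overlap of the two distributions, sum over x of min(p(x), q(x)). A small self-contained sketch of the latter (the logits are made up purely for illustration):

```python
import math

def softmax(logits, temperature):
    """Convert logits to probabilities at the given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def acceptance_prob(target_logits, draft_logits, temperature):
    """Per-token acceptance probability of standard speculative sampling:
    sum_x min(p(x), q(x)), the overlap between target and draft distributions."""
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

target = [2.0, 1.0, 0.5]  # made-up target-model logits
draft  = [1.8, 1.2, 0.4]  # made-up draft-model logits, same argmax
for t in (0.1, 0.7, 1.5):
    print(t, round(acceptance_prob(target, draft, t), 3))
```

At low temperature both distributions collapse onto the same argmax and acceptance approaches 1, which is consistent with "it doesn't make much difference" when the two models usually agree on the top token; at intermediate temperatures the overlap can dip.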