Gave Maverick another shot (much better!)

Posted by Conscious_Cut_6144@reddit | LocalLLaMA | 56 comments

For some reason Maverick was hit particularly hard on my multiple choice cyber security benchmark by the llama.cpp inference bug. Went from one of the worst models to one of the best.

1st - GPT-4.5 - 95.01% - $3.87
**2nd - Llama-4-Maverick-UD-Q4-GGUF-latest-Llama.cpp 94.06%**
3rd - Claude-3.7 - 92.87% - $0.30
3rd - Claude-3.5-October - 92.87%
**5th - Meta-Llama3.1-405b-FP8 - 92.64%**
6th - GPT-4o - 92.40%
6th - Mistral-Large-123b-2411-FP16 92.40%
8th - Deepseek-v3-api - 91.92% - $0.03
9th - GPT-4o-mini - 91.75%
10th - DeepSeek-v2.5-1210-BF16 - 90.50%
11th - Meta-LLama3.3-70b-FP8 - 90.26%
12th - Qwen-2.5-72b-FP8 - 90.09%
13th - Meta-Llama3.1-70b-FP8 - 89.15%
14th - Llama-4-scout-Lambda-Last-Week - 88.6%
14th - Phi-4-GGUF-Fixed-Q4 - 88.6%
16th - Hunyuan-Large-389b-FP8 - 88.60%
17th - Qwen-2.5-14b-awq - 85.75%
18th - Qwen2.5-7B-FP16 - 83.73%
19th - IBM-Granite-3.1-8b-FP16 - 82.19%
20th - Meta-Llama3.1-8b-FP16 - 81.37%
**\*\*\* - Llama-4-Maverick-UD-Q4-GGUF-Old-Llama.cpp 77.44%**
**\*\*\* - Llama-4-Maverick-FP8-Lambda-Last-Week- 77.2%**
21st - IBM-Granite-3.0-8b-FP16 - 73.82%

Not sure how much faith I put in the bouncing balls test, but it does still struggle with that one. So guessing this is still not going to be a go-to for coding. Still this at least gives me a lot more hope for the L4 reasoner.

danielhanchen@reddit

Oh hi! Oh yes, I found a few bugs for Llama 4 (QK Norm eps was wrong for Maverick & Scout, helped communicate config.json issues for RoPE to the llama.cpp team, etc.). There are also other random issues in vLLM, tokenizer changes, etc.

I remade all the quants for Scout at https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF

Maverick should be fine (as evidenced by your benchmarks), so I won't be re-making them (unless demand is enough!). The only change for Maverick was the QK Norm eps (was 1e-6, should be 1e-5).
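
For readers wondering why a 1e-6 vs 1e-5 eps matters, here is a minimal sketch in plain Python, assuming the QK norm is an RMS-style normalization. It is only an illustration, not llama.cpp's or Meta's code; the `rms_norm` helper and the sample vectors are made up. Because eps is added to the mean square before the square root, a 10x change shifts the output most when the activations are small.

```python
import math

def rms_norm(x, eps):
    """Scale x by 1 / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def relative_change(x, eps_a, eps_b):
    """Relative difference in the first normalized element between two eps values."""
    a = rms_norm(x, eps_a)[0]
    b = rms_norm(x, eps_b)[0]
    return abs(a - b) / abs(a)

small = [0.002, -0.001, 0.003, -0.002]  # made-up low-magnitude activations
large = [2.0, -1.0, 3.0, -2.0]          # made-up high-magnitude activations

small_change = relative_change(small, 1e-6, 1e-5)
large_change = relative_change(large, 1e-6, 1e-5)
print(f"small activations: {small_change:.1%} change")
print(f"large activations: {large_change:.6%} change")
```

In this toy case the large-magnitude vector is almost unaffected, while the small-magnitude one changes noticeably.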

az226@reddit

Beast!

Admirable-Star7088@reddit

Thank you! Are your quants using imatrix? If not, is there a reason for this? As far as I know, imatrix improves the quality of quants with no drawbacks. Or am I wrong?

No_Afternoon_4260@reddit

Imatrix uses a dataset while making the quant. That makes it possible to produce quants that should be better than a regular quant of the same size when tested on the dataset used (or on similar topics). Hope that's clear and helps you understand why some people use it and some don't.
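
As a rough illustration of that idea, here is a toy sketch in Python: the quantization scale is chosen to minimize rounding error weighted by per-weight importance, where the importance values stand in for statistics gathered from a calibration dataset. This is not llama.cpp's imatrix code; `quantize`, `weighted_error`, `best_scale`, and all the numbers are hypothetical.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest multiple of scale, clamped to +/- levels steps."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]

def weighted_error(weights, quantized, importance):
    """Sum of squared rounding errors, each weighted by its importance."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(weights, quantized, importance))

def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the scale (0.5x to 1.5x of max-abs) with the lowest weighted error."""
    base = max(abs(w) for w in weights) / levels  # naive max-abs scale
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )

weights = [0.11, -0.52, 0.33, 0.90, -0.07, 0.26]
importance = [5.0, 0.1, 4.0, 0.1, 6.0, 3.0]  # made-up calibration statistics
uniform = [1.0] * len(weights)               # "regular" quant: every weight counts equally

plain = best_scale(weights, uniform)
tuned = best_scale(weights, importance)
err_plain = weighted_error(weights, quantize(weights, plain), importance)
err_tuned = weighted_error(weights, quantize(weights, tuned), importance)
print(f"importance-weighted error: plain={err_plain:.6f} tuned={err_tuned:.6f}")
```

The tuned scale can never do worse than the plain one on the importance-weighted error, since both are picked from the same candidates. That only says it fits the calibration statistics better, which is the caveat in the comment above.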

yoracale@reddit

That's correct! All future GGUFs, including the Llama 4 and Gemma 3 ones, will be imatrix. We use our own calibration dataset.

No_Afternoon_4260@reddit

Cool and who are you?

yoracale@reddit

Hey man, not sure why you're getting downvotes (since not many people know) ahaha, but I'm Mike, Daniel's brother, so I'm part of Unsloth. ☺️🙏

No_Afternoon_4260@reddit

Hey idk np 🤷. Thanks nice meeting you 🙏

dampflokfreund@reddit

Any reason why you switched to your own dataset? Did you make comparisons to bart's?

yoracale@reddit

All GGUFs we upload from now on will be imatrix

Admirable-Star7088@reddit

Thanks!

Devonance@reddit

So this was an issue in llama.cpp. Do you know if this is automatically fixed in ollama (since it runs llama.cpp, as I understand it), or do we have to wait for an update from them?

segmond@reddit

There is demand, count me in.

UltrMgns@reddit

Thank you! I'm unable to pull it with ollama though (after updating it too), any recommendations on an easy way to deploy the Q4\_K\_XL's? <3

FullstackSensei@reddit

Legend!!!

brahh85@reddit

Since you are using llama.cpp and you have your own secret benchmark, can you try Maverick with a higher number of experts? For example 3 (`--override-kv llama.expert_used_count=int:3`) or more. Maybe it can beat gpt4.5. Even if it doesn't improve, it will show whether adding more active experts produces any return.

Conscious_Cut_6144@reddit (OP)

Maybe I'm thinking about this wrong, but shouldn't changing that significantly change the inference speed? Set to 1, 3, 10, and default, I'm getting 43 T/s on all of them.

brahh85@reddit

I checked the model [card](https://ollama.com/aravhawk/llama4:maverick-q4_K_M/blobs/b0fe8943b3e4). Try this instead: `--override-kv llama4.expert_used_count=int:3`

Also, another thing I noticed looking at an output of [maverick](https://github.com/ggml-org/llama.cpp/issues/12878):

`llama_model_loader: - kv 22: llama4.expert_count u32 = 16`
`llama_model_loader: - kv 23: llama4.expert_used_count u32 = 1`

Could it be that llama.cpp is loading just one expert by default? It is also the default value in the config.json of [unsloth](https://huggingface.co/unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF/blob/main/config.json).

As for results, I was expecting [something like this](https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/). From that thread you can also get the idea of using the model with 1 expert as a draft model, and then running the model with more experts. For a similar speed, it might be possible to get better results this way than by always activating 2 experts per token.
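
To make the `expert_used_count` idea concrete, here is a minimal sketch of top-k expert routing in a mixture-of-experts layer, in plain Python. It rests on assumptions and is not llama.cpp's or Llama 4's routing code: the real gating function and any shared experts are not modeled, and `route` and the router scores are hypothetical.

```python
import math

def route(router_logits, expert_used_count):
    """Pick the top-k experts by router score and softmax their scores into mixing weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:expert_used_count]
    top = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - top) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [0.1, 2.0, -1.0, 1.5, 0.3, -0.4, 0.9, 0.0]  # made-up scores for 8 experts

one_expert = route(logits, 1)
three_experts = route(logits, 3)
print("k=1:", one_expert)
print("k=3:", three_experts)
```

In this picture every selected expert's feed-forward block has to run for the token, so a higher count would normally be expected to cost some throughput.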

Conscious_Cut_6144@reddit (OP)

That’s cool, didn’t realize it was a thing. Ya I’ll give it a shot when I get home from work.

Admirable-Star7088@reddit

I think Llama 4 Scout (Q4\_K\_M) is pretty good. With this fix, it will hopefully go from good to awesome. Something has definitely been off. Scout sometimes performs much better than even 70b models, and other times really badly. Also, Scout sometimes uses the opposite word, for example it used "good" instead of "bad" in a sentence, which made no sense in the context. Will play around with these fixes as soon as my GUI updates (LM Studio and/or Koboldcpp).

yoracale@reddit

Did you try the full fp8 model and see if it still happens? Also is this using our GGUFs?

Admirable-Star7088@reddit

With "just" 80GB total RAM (RAM + VRAM) I can sadly not fit and run fp8 to compare. I've been using Bartowski's quants.

yoracale@reddit

Feel free to use our quants if you want as we updated them with our fixes + other fixes and improved our calibration dataset!

Admirable-Star7088@reddit

Koboldcpp got updated with the latest llama.cpp shortly after my post here, so I tried your quant with it. I haven't had time to test it very much yet, but my first impressions are much better. Also, the model's use of wrong/opposite words seems to be fixed now compared with the old quants.

Another note: silly me discovered that I can actually run higher quants beyond Q4 if I enable `mmap`, so I tried your Q5\_K\_M quant as well, and it felt quite different/better than Q4\_K\_M. Hard to say for sure so far how much is random noise, but in my experience with Mixtral 8x7b back when it was released, it was *very* sensitive to quantization. Since Llama 4 Scout is also a MoE, I imagine the same phenomenon applies here.

yoracale@reddit

Thanks for trying! And oh, interesting. We've heard a lot about kobold.cpp, will try it out!

Conscious_Cut_6144@reddit (OP)

The framework is just copied from MMLU or something like that, but no, I can't share the actual questions.
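
For anyone unfamiliar with that style of benchmark, here is a minimal sketch of MMLU-style multiple-choice scoring in Python. It is not OP's framework; `extract_choice`, `accuracy`, and the sample replies are hypothetical, and the letter extraction is deliberately naive.

```python
import re

def extract_choice(reply):
    """Return the first standalone letter A-D in a model reply, or None.

    Naive on purpose: the English word "a" would also be read as choice A.
    """
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else None

def accuracy(replies, answer_key):
    """Percentage of replies whose extracted letter matches the answer key."""
    correct = sum(extract_choice(r) == a for r, a in zip(replies, answer_key))
    return 100.0 * correct / len(answer_key)

replies = ["B", "The answer is C.", "a", "I think D"]  # made-up model outputs
answer_key = ["B", "C", "A", "A"]                      # made-up ground truth

score = accuracy(replies, answer_key)
print(f"{score:.2f}%")
```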

wehtammai@reddit

Are you able to share these benchmark frameworks for cyber?

ezjakes@reddit

Going to need that benchmark a bit harder

dampflokfreund@reddit

Every time. Every damn time. People, wait at least a week before you judge a model with a new architecture. Lots of fixes get implemented.

emprahsFury@reddit

yeah that's cool. It would be cooler if Meta would commit to vllm and llama.cpp, which they expect to run their models, the day before they drop weights. They're leaning pretty hard on non-Meta employees to make Meta successful.

Conscious_Cut_6144@reddit (OP)

They did do it for vllm, but not awq or gptq quant libraries, so you need like 1TB of vram to run maverick still.
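
As a back-of-the-envelope check on that figure, weights-only memory is roughly parameter count times bits per weight. The sketch below assumes 400B total parameters for Maverick and about 4.5 bits per weight for a Q4-style quant. Neither number comes from this thread, and the estimate ignores KV cache and runtime overhead.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

MAVERICK_PARAMS = 400e9  # assumed total parameter count, not from the thread

estimates = {
    name: weight_memory_gb(MAVERICK_PARAMS, bits)
    for name, bits in [("BF16", 16), ("FP8", 8), ("Q4-style (assumed 4.5 bpw)", 4.5)]
}
for name, gb in estimates.items():
    print(f"{name}: about {gb:.0f} GB of weights")
```

Under those assumptions the unquantized BF16 weights alone come to about 800 GB before any cache or overhead, which is in the same ballpark as the "like 1TB" estimate.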

davewolfs@reddit

It is still terrible at coding.

Hoodfu@reddit

I know it's what's available to run here, but these various tests that use a version of the model with 75% of it chopped off (q4) aren't indicative of anything other than that specific version, and certainly not of the model in general. I now have the ability to run the q4 of deepseek v3 at 400 gigs and it's pretty good, but it was rather noticeably behind an almost 10x smaller coding model.

Conscious_Cut_6144@reddit (OP)

You would be surprised, I typically see less than a 1% difference in score going from BF16 to Q4\_K\_M. And on this test I can’t measure a difference between unsloth's UD-Q4 and the full model.

Hoodfu@reddit

I've seen this kind of response a lot. In every model I've used, there's been a blatant and obvious difference in the quality of responses going from fp16 to q8. Vision models that give full concept recognition at fp16 don't get any of it at q8 and just give broad details. In flux, there's the T5 encoder and the flux transformer itself. If you render images with the T5 fp32 gguf, there's an obvious difference going to the generally used fp16, and an even more in-your-face difference if you use the fp8 of the T5: hands are now messed up, facial details are just wrong. The 4 bit quants of those are just glaring at that point. In every use case I've ever had, there's a massive difference chopping off the vast majority of the model. If you're only seeing a 1% difference between bf16 and q4, then your test isn't a good test.

Conscious_Cut_6144@reddit (OP)

The reason you see the response a lot is because it's true. That being said, vision is way different. Have a look at unsloth's dynamic 4-bit quants: he keeps the vision part at 16 bit, but the LLM part is dropped down to 4 bit: [https://unsloth.ai/blog/dynamic-4bit](https://unsloth.ai/blog/dynamic-4bit)

JockY@reddit

For the open weights models, were your tests conducted with base or instruction tuned variants?

Conscious_Cut_6144@reddit (OP)

Ya they were all instruct.

JockY@reddit

Nice, thanks. Your cybersecurity questions: are they deeply technical in nature, like “identify the UAF bug in this x86_64 disassembly”, or more high level, like CISSP stuff?

Conscious_Cut_6144@reddit (OP)

They are the type of questions you would see on a CISSP exam.

segmond@reddit

BTW, your post is a bit confusing. On one hand it makes it sound like a "llama.cpp inference bug", which means folks should pull the latest llama.cpp and rebuild. On the other hand, the way you label the rankings makes it sound like it's the gguf file that has issues. As Unsloth Daniel mentioned, it seems it's the same UD quant for Maverick from 5 days ago that's still on there. So I suppose you just rebuilt llama.cpp. Please confirm.

Conscious_Cut_6144@reddit (OP)

I tried to show it was the same gguf getting both good and bad results. But ya need to pull llama.cpp and rebuild.

celsowm@reddit

https://preview.redd.it/a656x9msvmue1.jpeg?width=1600&format=pjpg&auto=webp&s=3737cf22dc4b4e01e3facc3953901ec6ff7aa541

I am gonna try again on openrouter then, let's see

Expensive-Apricot-25@reddit

WOW, it is quite far ahead of claude, and in my experience, claude is currently the best (i dont have access to gpt4.5)

Distinct-Target7503@reddit

Do you accept model requests? I would like to see how minimax scores on that benchmark.

Conscious_Cut_6144@reddit (OP)

I did briefly look into minimax, but doesn’t seem like anyone merged support for it. I see an open issue for vllm and an abandoned issue for llama.cpp.

yoracale@reddit

Hi, minimax was requested previously. We will likely do a new one when it gets released.

AuthorCritical2895@reddit

Just a question: which of these can you run on a MacBook M2 Max with 96 GB of memory?

Conscious_Cut_6144@reddit (OP)

Llama 4 Scout UD-Q4 would run on your machine and be the fastest option, but its coding abilities may or may not cut it. Qwen2.5 Coder and QwQ are the go-tos.
