Gave Maverick another shot (much better!)

Posted by Conscious_Cut_6144 | LocalLLaMA

For some reason Maverick was hit particularly hard on my multiple choice cyber security benchmark by the llama.cpp inference bug. Went from one of the worst models to one of the best.

- 1st - GPT-4.5 - 95.01% - $3.87
- **2nd - Llama-4-Maverick-UD-Q4-GGUF-latest-Llama.cpp - 94.06%**
- 3rd - Claude-3.7 - 92.87% - $0.30
- 3rd - Claude-3.5-October - 92.87%
- **5th - Meta-Llama3.1-405b-FP8 - 92.64%**
- 6th - GPT-4o - 92.40%
- 6th - Mistral-Large-123b-2411-FP16 - 92.40%
- 8th - Deepseek-v3-api - 91.92% - $0.03
- 9th - GPT-4o-mini - 91.75%
- 10th - DeepSeek-v2.5-1210-BF16 - 90.50%
- 11th - Meta-LLama3.3-70b-FP8 - 90.26%
- 12th - Qwen-2.5-72b-FP8 - 90.09%
- 13th - Meta-Llama3.1-70b-FP8 - 89.15%
- 14th - Llama-4-scout-Lambda-Last-Week - 88.6%
- 14th - Phi-4-GGUF-Fixed-Q4 - 88.6%
- 16th - Hunyuan-Large-389b-FP8 - 88.60%
- 17th - Qwen-2.5-14b-awq - 85.75%
- 18th - Qwen2.5-7B-FP16 - 83.73%
- 19th - IBM-Granite-3.1-8b-FP16 - 82.19%
- 20th - Meta-Llama3.1-8b-FP16 - 81.37%
- **\*\*\* - Llama-4-Maverick-UD-Q4-GGUF-Old-Llama.cpp - 77.44%**
- **\*\*\* - Llama-4-Maverick-FP8-Lambda-Last-Week - 77.2%**
- 21st - IBM-Granite-3.0-8b-FP16 - 73.82%

Not sure how much faith I put in the bouncing balls test, but it does still struggle with that one. So guessing this is still not going to be a go-to for coding. Still this at least gives me a lot more hope for the L4 reasoner.