Llama 4 - Scout: best quantization resource and comparison to Llama 3.3

Posted by silenceimpaired@reddit | LocalLLaMA | 14 comments

The two primary sources I’ve seen for Scout GGUFs (for us GPU poor) seem to be Unsloth and Bartowski, both of which seem to do something non-traditional compared to quants of dense models like Llama 3.3 70B. So which one is best, or am I missing one? At first blush Bartowski seems to perform better, but then again my first attempt with Unsloth was a smaller quant, so I’m curious what others think. As for Llama 3.3 vs Scout, they seem comparable, with Llama 3.3 maybe having better output quality and Scout definitely being far faster at the same quality.

14 Comments

silenceimpaired@reddit (OP)

The more I use it the more frustrated I am. It’s better than Llama 3.3 in some areas… but way worse in others.

crantob@reddit

This may be due to dogma inputs. Dogma that counters reality and is input as 'valid' makes them go insane.

deathcom65@reddit

How are you guys running experts on GPU and non-experts on CPU? Like, how do you divide it, or is it automatic?

x0wl@reddit

Experts on CPU, everything else on GPU

silenceimpaired@reddit (OP)

x0wl commented in the thread below: `llama-server -ngl 999 -ot \d+.ffn_.*_exps.=CPU --flash-attn -ctk q8_0 -ctv q8_0 --ctx-size 49152 -t 24 -m ./GGUF/meta-llama_Llama-4-Scout-17B-16E-Instruct-IQ3_XXS.gguf` The `-ot` with the regex does the pinning (you may need to experiment with regex escapes though lol)
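
To see what that pattern selects, here is a minimal Python sketch. It assumes the part before `=` is applied as an unanchored regex search over tensor names, and the tensor names listed are illustrative guesses at typical GGUF MoE naming, not read from the actual Scout file. The "GPU" default stands in for whatever `-ngl 999` would offload.

```python
import re

# The -ot argument from the command above: "<regex>=<target buffer>".
override = r"\d+.ffn_.*_exps.=CPU"
pattern, _, target = override.rpartition("=")

# Hypothetical tensor names, assumed for illustration only.
tensor_names = [
    "token_embd.weight",
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",    # router (assumed name)
    "blk.0.ffn_gate_shexp.weight",  # shared expert (assumed name)
    "blk.0.ffn_gate_exps.weight",   # routed experts
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]

def placement(name: str) -> str:
    """Return the override target if the regex matches, else the assumed default."""
    return target if re.search(pattern, name) else "GPU"

for name in tensor_names:
    print(f"{name:30s} -> {placement(name)}")
```

Under those assumptions only the `*_exps` tensors (the bulky routed expert weights) land on CPU, while attention, router, and shared tensors stay on the GPU.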

frivolousfidget@reddit

Does IQ1_M even work? Would love to see a benchmark comparison of sizes like IQ1_M vs a Qwen and a Gemma of similar size. Same for UD-Q2_K_XL (Unsloth). I imagine results won't be good compared to Gemma 27B at similar GB sizes, but it will be faster…

silenceimpaired@reddit (OP)

A comparison link was provided below. I’ll add it to the post.

frivolousfidget@reddit

Yeah, that's PPL and stuff. I'm talking 24 GB Gemma 3 vs 24 GB Scout, and 42 GB Gemma 3 vs 42 GB Scout, on MMLU and other benchmarks.

x0wl@reddit

I feel like a large, sparse model will survive quantization better than an overtrained 27B dense model

x0wl@reddit

Bartowski vs Unsloth small quant comparison: [https://huggingface.co/blog/bartowski/llama4-scout-off](https://huggingface.co/blog/bartowski/llama4-scout-off) On my machine (96GB RAM + 16GB VRAM) I use the Bartowski IQ3_XXS, and I get ~8-10 T/s if I pin experts to CPU.

silenceimpaired@reddit (OP)

How do you pin experts? What are you running? Llama.cpp?

x0wl@reddit

`llama-server -ngl 999 -ot \d+.ffn_.*_exps.=CPU --flash-attn -ctk q8_0 -ctv q8_0 --ctx-size 49152 --port 12688 -t 24 -m ./GGUF/meta-llama_Llama-4-Scout-17B-16E-Instruct-IQ3_XXS.gguf` The `-ot` does the pinning (you may need to experiment with regex escapes though lol)
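
On the regex escapes: a rough way to see the problem is to tokenize the argument the way a POSIX shell would. The sketch below uses Python's `shlex` as a stand-in for the shell (an approximation: it does not model globbing, and Windows `cmd` behaves differently), so treat it as an illustration rather than a guarantee of what any particular shell does.

```python
import shlex

# Unquoted: a POSIX shell treats "\d" as an escaped "d" and drops the backslash.
unquoted = shlex.split(r"-ot \d+.ffn_.*_exps.=CPU")

# Single-quoted: the pattern reaches the program verbatim.
quoted = shlex.split(r"-ot '\d+.ffn_.*_exps.=CPU'")

print("unquoted ->", unquoted[1])
print("quoted   ->", quoted[1])
```

If the backslash gets eaten, the pattern starts with a literal `d+` instead of "one or more digits", so single-quoting the `-ot` value is one thing to try on Linux or macOS shells.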

silenceimpaired@reddit (OP)

Oh that’s awesome thanks for sharing.

Bobcotelli@reddit

Which quant for a machine with 64 GB RAM, an AMD Ryzen 9 5900, and a 7900 XTX GPU? Thanks.