Benchmarked 18 models that I can run on my RTX 5080 16GB using Nick Lothian's SQL benchmark
Posted by grumd@reddit | LocalLLaMA | 82 comments
2 days ago there was a very cool post by u/nickl:
https://reddit.com/r/LocalLLaMA/comments/1s7r9wu/comment/odc9xj8/
Highly recommend checking it out!
I've run this benchmark on a bunch of local models that can fit into my RTX 5080, some of them partially offloaded to RAM (I have 96GB, but most will fit if you have 64).
Results:
24: unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟩🟩🟩🟩🟩
23: bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
23: unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
22: unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q6_K_XL
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟩🟩🟩🟥🟩 🟥🟩🟩🟩🟩
22: mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q3_K_M
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟥🟩🟥🟩 🟥🟩🟩🟩🟩
21: unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_S
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟩🟨🟥 🟥🟨🟩🟩🟩
20: unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL
🟩🟩🟩🟩🟨 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟩🟩🟩🟥🟨 🟥🟩🟩🟩🟩
20: mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q6_K
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟥🟩🟩🟥🟩 🟥🟥🟩🟩🟩
19: unsloth/GLM-4.7-Flash-GGUF:UD-Q6_K_XL
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟩🟩🟩🟥🟨 🟥🟨🟩🟥🟩
18: unsloth/GLM-4.5-Air-GGUF:Q5_K_M
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟥🟩🟩 🟥🟩🟩🟥🟩 🟨🟨🟥🟩🟨
18: bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:Q6_K_L
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟩🟩🟩🟥🟩 🟨🟨🟥🟨🟨
16: unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
🟩🟩🟩🟩🟨 🟩🟩🟩🟩🟩 🟩🟩🟨🟩🟩 🟥🟨🟩🟥🟨 🟥🟨🟩🟨🟩
16: byteshape/Devstral-Small-2-24B-Instruct-2512-GGUF:IQ3_S
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟨🟩🟩 🟩🟩🟨🟥🟨 🟨🟨🟥🟨🟩
16: mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-i1-GGUF:Q6_K
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟩🟩🟨🟥🟩 🟥🟩🟥🟥🟨 🟥🟩🟥🟩🟨
14: mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT-i1-GGUF:Q6_K
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟥🟩🟩 🟩🟨🟥🟥🟨 🟨🟨🟥🟨🟨
14: unsloth/GLM-4.6V-GGUF:Q3_K_S
🟩🟩🟩🟩🟩 🟩🟩🟩🟩🟩 🟥🟩🟨🟨🟩 🟥🟩🟩🟨🟨 🟨🟨🟨🟨🟨
5: bartowski/Tesslate_OmniCoder-9B-GGUF:Q6_K_L
🟨🟨🟨🟨🟨 🟨🟨🟨🟩🟩 🟩🟨🟨🟩🟨 🟨🟨🟩🟨🟨 🟨🟨🟨🟨🟨
5: unsloth/Qwen3.5-9B-GGUF:UD-Q6_K_XL
🟨🟨🟨🟨🟨 🟨🟨🟨🟩🟩 🟨🟩🟨🟨🟩 🟨🟩🟨🟨🟨 🟨🟨🟨🟨🟨
The biggest surprise, to be honest, is Qwen3.5-9B-Claude-4.6-HighIQ-THINKING, going from 5 green tests with vanilla Qwen3.5-9B to 16. Most of Qwen3.5-9B's errors boiled down to being unable to call the tools with correct formatting. For how small it is, it's a very reliable finetune.
Qwen3.5-122B-A10B is still king for 16GB GPUs because I can offload the experts to RAM. Speed isn't perfect, but the quality is great and I can fit a sizable context into VRAM. Q4_K_XL uses around 68GB of RAM, IQ3_XXS around 33GB, so the smaller quant can be used with 64GB of system RAM.
Note though: these benchmarks mostly test a pretty isolated SQL call. It's a nice quick benchmark to compare two models, even with tool calling, but it's not representative of understanding a larger codebase's context, where larger models will pull ahead.
Ranmark@reddit
New Qwen3.6. Don't look at the time, I have 10-year-old hardware. Had to increase the timeout, of course.
grumd@reddit (OP)
I did the same exact quant and got 3 errors.
Sometimes you just get lucky. Even though this test uses temperature 0.1, there's still randomness to it. For example, 122B Q4_K_XL only has 1 error consistently, but even small changes to my setup will break it and give 2-3 errors. Using a BETTER quant can also give me more errors. I tried 122B Q6 (offloaded to an NVMe SSD) and got 2 or 3 errors (don't remember).
I think getting one error is just luck
Ranmark@reddit
Hey, you should try this one: https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-IQ4_XS-GGUF
It's so good, I'm getting better results than with the new dense 3.6 model. And it's more stable than any other distill / non-distill. Idk what this black magic is.
Ranmark@reddit
Thanks, I'm feeling better now :) I'm now rerunning 27B (IQ3_XS) and it feels MUCH more consistent (23/25 all the time). Looks like there's still a way to go for me (122B is just too much for my hardware). Hope that Alibaba releases a 3.6 27B soon.
Ranmark@reddit
It's strange, but I couldn't repeat the same result. It's regularly failing some queries like q2, q10, q21.
Ranmark@reddit
iq4_nl
idumlupinar@reddit
What would you recommend for the following setup?
128 GB DDR4 RAM
GTX 1050 Ti GPU
AMD 5800X3D CPU
EVGA 850 G2 PSU
Shall I pull the trigger and go for a used RTX 3090, as I believe it's the only available top-tier GPU upgrade for my PSU?
Or can I still do something with my current config?
I am on Windows and I already tried Qwen3-Coder-Next. Seems to be working but very slow (expected)
grumd@reddit (OP)
The 1050 Ti is a dead end unfortunately. 4GB of VRAM at ~100 GB/s is worse than a CPU with DDR5. A used 3090 will be suuper good, or if a 7900 XTX is less expensive, that's also a good option.
Greant_82@reddit
Could you try Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1?
Ranmark@reddit
Thanks for the results. What do you think about apex models? Is it worth using?
grumd@reddit (OP)
Apex? What's that?
Ranmark@reddit
If I got this right, it's a MoE quantization method (Adaptive Precision for EXpert Models) where different parts get different compression based on their importance. The goal is to get the most quality per weight.
These are models from mudler, like mudler/Qwen3.5-122B-A10B-APEX-GGUF.
Atretador@reddit
Qwen3.5-14B-A3B-Claude-Opus-Reasoning-Distilled-4.6-MXFP4_MOE
grumd@reddit (OP)
So a bit worse than Qwopus3.5-9B. Thanks for testing
Atretador@reddit
I was hoping for something better, Qwen3.5 jumps from 9B to 27B, something in between could be nice
grumd@reddit (OP)
35B-A3B is the one in-between (closer to 27B though)
Atretador@reddit
Qwen3.5-14B-A3B-Claude-Opus-Reasoning-Distilled-4.6-MXFP4_MOE.gguf
Big_Trip6677@reddit
What about samuelcardillo/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
https://huggingface.co/samuelcardillo/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
Shot-Buffalo-2603@reddit
How does Qwen3.5-122B-A10B fit on a 16GB GPU? 122B Q4 should be like a ~60GB model? Could you elaborate on this? I'm new and have been trying to understand what models I can run effectively.
jopereira@reddit
You should really read the OP, everything is explained
Shot-Buffalo-2603@reddit
I did read it and I understand the basics of MoE, but I don't understand why "offloading the experts to RAM" would be a good thing. I would think the best-case scenario would be the model sitting in system RAM and the experts being loaded onto GPU VRAM, 10B params at a time, as they're executed, so it would fit and be fast. That doesn't seem to be what is described?
I don't have enough experience to know if I'm completely misunderstanding or if that's what's actually being described.
I also don't understand if that type of swapping happens automatically or if it's something you have to manually configure to work that way.
grumd@reddit (OP)
You just have a misunderstanding of how MoE works, read up on it. It has 10B active parameters and some shared experts that are used for every token, and those need to be in VRAM to be fast. But most of the 122B parameters are "experts", and only a few of them are activated for each token. So the experts can sit in RAM and only a few of them get fetched from RAM at a time, which is fine even at RAM speeds.
Blaze6181@reddit
Do you get decent decode speeds? I'm looking at like 30-35 tok/s but it could just be my ddr4 ram.
grumd@reddit (OP)
1000-1500t/s for prefill, 15-20t/s for generation
Blaze6181@reddit
Sorry I hate to bother you again, but do you mind posting your llama.cpp args? I want to see what I'm doing so wrong.
grumd@reddit (OP)
Could be your DDR4, yeah, but try increasing -ub a lot. When I use the default -ub 512, my pp speed is under 300, but with -ub 2048-4096 it goes up to 1500 t/s.
--no-mmap also gives a solid improvement for pp.
I'll edit this comment with my full command in a minute
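For illustration, those two flags slot into llama-server roughly like this (the model path and value here are just placeholders, not my exact command):
# placeholder model path; -ub raises the prompt-processing batch size,
# --no-mmap loads the weights into RAM instead of streaming them from disk
llama-server -m ./some-model.gguf -ub 2048 --no-mmap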
the__storm@reddit
By "offloading the experts to ram" they mean what you described - grabbing weights from system memory as needed for each token. (It's not the whole 10B; a relatively small fraction of the model (attention, router, sometimes shared experts) will always be activated for each token and is kept in VRAM.)
Shot-Buffalo-2603@reddit
Thanks, i will have to give this a try
Nick-QuickStock@reddit
Why is no one talking about Google Turbo Quant??
grumd@reddit (OP)
There's nothing to talk about
Nick-QuickStock@reddit
Oh? It doesn't allow you to use larger LLMs than you could before? Yeah, nothing to talk about. 🤔
grumd@reddit (OP)
A Turboquant-like vector rotation for the KV cache was already merged into llama.cpp, and you can use q8_0 and q4_0 KV cache quants with better precision than before. It's still not lossless of course, but I've started using q8_0 by default. It won't give you access to larger models, but you can squeeze in more context now.
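If you want to try it, it's just a pair of flags on llama-server, something like this (model path is a placeholder; quantized V cache may also need flash attention enabled, depending on the build):
# quantize both the K and V caches to q8_0
llama-server -m ./some-model.gguf -ctk q8_0 -ctv q8_0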
I just don't understand the purpose of your comment, "why is no one talking". First of all, a ton of people are already talking about it, too much even. Second, if you want discussions about it, start actual discussions with substance.
Direct_Technician812@reddit
please add https://huggingface.co/unsloth/gemma-4-31B-it-GGUF
grumd@reddit (OP)
Nah, it's too big for my 16GB GPU. Maybe I'll try it later with some layer offloading
Big_Trip6677@reddit
What about Jackrong/Qwopus3.5-27B-v3
https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
grumd@reddit (OP)
It's already in the post, 23/25
noctrex@reddit
Time to update it already, gemma 4 dropped. :)
moahmo88@reddit
Could you test "HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive", please? What's the best model for an Nvidia 5070 Ti with 16GB VRAM and 32GB RAM?
grumd@reddit (OP)
This is just an uncensored model, so its results shouldn't differ much from the original 35B-A3B. Not gonna download it, you can easily run the benchmark yourself if you have it.
For 16+32 I'd say either use the Q6_K_XL of 35B-A3B or one of the Q3 quants of 27B, depending on how much context you need
moahmo88@reddit
Thank you for your kind reply!
mapsbymax@reddit
The MoE shift for 16GB cards is honestly wild. A year ago on 16GB you were stuck with 7-13B dense models and that was it. Now a 122B-parameter model tops the chart on the same hardware, just because the active parameter count stays small.
The bottleneck basically moved from VRAM to system RAM bandwidth. DDR5 makes this viable; on DDR4 I'd expect those offloaded expert speeds to be rough.
Also really interesting that the distillation only clearly helps at 9B. The Claude Opus distill of 27B scoring the same as vanilla 27B suggests the base model already knows the patterns; distillation mostly helps when the model is small enough that it genuinely lacks the capability. That's a useful heuristic for picking finetunes: the smaller the base, the more a good distill can move the needle.
twack3r@reddit
Why? Why run a social media account to have Claude write your comments?
grumd@reddit (OP)
Important to mention - the 27B distill is "Claude-4.6-Opus-Reasoning-Distilled", and the same-name distill exists for 9B, where it also doesn't improve anything and still fails all tests.
The 9B distills called "HighIQ" and "Qwopus" are the only ones that really improved the tool calling of this model.
I might have to go look for a 27B HighIQ model, but a 27B Qwopus v3 doesn't exist yet
AvidCyclist250@reddit
How exactly is this done? I'm trying to do this in LM Studio but can't find a way to force it. There's just GPU layers, CPU layers, and "Experts to load" (8). 64GB RAM and 16GB VRAM
grumd@reddit (OP)
I'm using llama.cpp, compiling it myself too. There's a ton of flexible options.
In LM Studio there's a checkbox "Force Model Expert Weights onto CPU" or smth similar. Keep all layers on the GPU and only move the experts onto CPU.
AvidCyclist250@reddit
I'm on Linux and spent about 8 hours setting up full llama.cpp and Chat Box, CUDA, etc. Managed to set up everything there that I had in LM Studio. Your launch option idea was too good to pass up. I couldn't emulate it precisely in LM Studio, so I had to do this. Thanks for sharing your idea.
grumd@reddit (OP)
mmap enabled keeps the experts on disk and doesn't load them into RAM for me - terrible inference speed, so you can go without it (--no-mmap).
To fit the model: either use --n-cpu-moe to decide how many layers of experts to offload to the CPU, or use --override-tensor with a regexp to fully control the offloading - for example, offloading all experts can be done with -ot ".*_exps\.weight=CPU". But you'd have to figure out the precise context length you want to use, which takes time (example commands below).
Do not use q4 for the KV cache, the quality of responses will be VERY low. Either use the default ctv/ctk, or use q8_0 and make sure llama.cpp is newer than 0.9.11 or b8626 - they added KV cache quant vector rotation, improving quality - q8_0 is almost lossless vs f16 now.
Qwen 3.5 is quite sensitive to KV cache quantization, you'll destroy quality and the RAM saving is not that big.
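To make the two offloading options concrete, something like either of these should work (the model path and layer count are just examples, tune them to your VRAM):
# option 1: keep the expert weights of the first N layers on the CPU
llama-server -m ./some-moe-model.gguf -ngl 99 --n-cpu-moe 30
# option 2: full manual control, push every expert tensor to CPU with a regexp
llama-server -m ./some-moe-model.gguf -ngl 99 -ot ".*_exps\.weight=CPU"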
Tormeister@reddit
Interesting how you got Qwen3.5 27B IQ4_XS to 23/25, I will try it later.
I tried 27B Q5_K_S = 20/25, and also 27B Q6_K = 22/25
grumd@reddit (OP)
Yes, F16 KV most likely, and that particular Bartowski quant is pretty good. I think I also used --seed 3407. Also, when you get 22/25 for example, are any of the tests timing out at 120s? Due to how slow the model is on my machine (e.g. 10-15 t/s), I increased the timeout to let it finish generating and actually see if it gets the results correct. Without increasing the timeout I'd probably fail more tests
moahmo88@reddit
Amazing work! Thanks!
rm-rf-rm@reddit
Great work! Is this single pass, double pass, or best of 3?
grumd@reddit (OP)
Single pass for all of them, too long to do multiple 🥲
PracticlySpeaking@reddit
This also clearly illustrates the 'meh' difference between Qwen3.5-35B and 122B
sine120@reddit
If you have a lot of VRAM and not a lot of RAM, 27B is awesome. If you have a lot of RAM and not a lot of VRAM, 122B is awesome. If you have a "meh" amount of either, 35B is awesome.
coyo-teh@reddit
If you have 128GB of RAM and 96GB of VRAM, what would you recommend?
sine120@reddit
I'd just run the 122B completely in VRAM, or if you don't care about speed, minimax.
grumd@reddit (OP)
If you have such expensive hardware, I'd say you should do your own research and benchmarks, but maybe try a low quant of Qwen3.5-397B-A17B
grumd@reddit (OP)
35B is Q6 while 122B is Q3-4 so that's not entirely fair to say
PracticlySpeaking@reddit
Flip that around... For 'easy' coding, there is only a small difference.
grumd@reddit (OP)
You've edited your comments though, your first comment said "35B vs 122B" at first.
What I'm saying is: 27B and 122B are very close, but 35B is noticeably worse than both of them, so I wouldn't call it a "meh" difference. This benchmark is too easy to show the actual difference between them.
Norwood_Reaper_@reddit
Did you by chance test the jackrong 27b qwopus?
grumd@reddit (OP)
It's top 5 in my post
Norwood_Reaper_@reddit
Sorry I missed it! Yeah I'm on v2 now, I really like it a lot. I've got it hooked into Claude code from the leak yesterday and it is fairly capable if you prompt it well.
Yassfive1@reddit
In practice, what would it allow you to do locally vs using the cloud? Sorry to be pessimistic, but when you compare local model performance vs the fraction of the cost of Gemini, I'm like, what's the point?
grumd@reddit (OP)
With local models you can choose what you use, you can use abliterated uncensored models if you want, you can tune their parameters depending on what you need. They are also completely private and Google doesn't read your messages (can be important for security if you're using it for important work).
tmvr@reddit
Tried it as well yesterday with a bunch of models and had the same issue with Qwen3.5 9B at Q8 - a bunch of errors due to tool calling issues.
grumd@reddit (OP)
Try the Qwopus model! It's a finetuned Qwen3.5-9B but it had no issues with tools, it's the best 9B model in this benchmark that I tried
Big_Mix_4044@reddit
Can confirm, same two fails for the Qwen3.5-27B Q4_K_M by Bartowski. BUT, with q8 KV and the new quant rotation in llama.cpp. So it's worth something at least. It's interesting to bench models on something aside from PPL and KLD.
grumd@reddit (OP)
Yep, and I love this test because it's quick, relatively real-world, and uses tools
hyrulia@reddit
Can you test Qwopus please?
grumd@reddit (OP)
Pretty impressive, 17/25, better than Qwen3.5-9B-Claude-4.6-HighIQ-THINKING
hyrulia@reddit
Thank you!
grumd@reddit (OP)
Yep and thank you for the pointer, I'll keep this Qwopus 9B in my cache just in case I wanna play with it later with real tasks :)
sine120@reddit
Any chance you recorded tg and pp speeds while going through these? Curious how they performed and how long the tests took.
grumd@reddit (OP)
Yeah I should have recorded the time it took for each, but it heavily depends on how much context you need. These tests don't need more than like 15k of context, so I could allocate that and offload fewer layers to CPU, drastically improving speed. But that doesn't mean I will actually daily drive the models like that, I need 80k-120k minimum for real tasks, preferably 150k. So real speeds will be different. So I only used these tests as a quick quality comparison.
For pp/tg:
122B Q4 in real usage is like 1500/15-19.
122B Q3 can do 1500-2000/25-30.
27B Q4 has to be partially offloaded and will do around 1000/15 depending on how many layers you offload.
27B Q3 I think can be fully on GPU and is around 30-40 tg maybe (I don't usually use it so I don't remember).
35B is very fast at 2000+/70.
9B is around 100 tg.
I don't use 27B Q3 because quality-wise it's worse than 35B Q6, and the latter is 60+tg anyway. That I confirmed by running a full suite of Aider benchmarks, not just this SQL quick test.
sine120@reddit
I've only got 64GB of RAM, 16GB of VRAM. As much as I want to run the 27B and 122B models, I can't fit enough context to get the 27B to be more useful beyond chatting. 122B works in IQ3, so I might have to give that more attention, but I've been hoping the 35B with CPU offloading would be better than Qwen3-Coder-Next. What was your command for the Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS run?
grumd@reddit (OP)
My day-to-day IQ3_XXS command:
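Roughly this shape - the model path is a placeholder, the fit setting mentioned below isn't shown, and the exact flag spellings are approximate, so treat it as a sketch rather than something to copy-paste:
# sketch: 122B MoE with all experts pushed to system RAM, 131k context
llama-server \
  -m ./Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -ot ".*_exps\.weight=CPU" \
  --no-mmap \
  --no-mmproj \
  -ub 2048 \
  -c 131072 \
  --cache-ram 0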
no-mmap is important to actually force the model into RAM
no-mmproj removes vision capabilities but saves VRAM
ub 2048 gives really good prefill speed - lower ub means lower prefill
131k context seems enough for most tasks for me
cache-ram 0 is important because an extra 8GB of RAM just for the cache isn't fun when you barely have any free RAM
fit is at 256 because I'm using my integrated GPU for desktop rendering - my monitor is connected to the motherboard - funnily enough it doesn't affect my performance in games at all
You can also try Q3_K_S for 122B - not as much free RAM left, but it can also be doable. It has much better quant quality than IQ3_XXS for many tensors - you can check the details on Hugging Face when you click on a quant and scroll down to the weight quant information
the__storm@reddit
Is this on DDR5?
grumd@reddit (OP)
Yeah
GroundbreakingMall54@reddit
The fact that a 122B MoE fits on 16GB and still tops the chart is insane. I remember when running anything over 13B felt like a flex. Also interesting that the Claude Opus distill of Qwen 27B scores the same as vanilla - I would have expected the reasoning distillation to help more with SQL logic
grumd@reddit (OP)
Tbh both the 27B and 35B distills are no better (or even worse) than vanilla. The specific 9B distills I have there are the only ones where you can see a clear improvement
AdamDhahabi@reddit
Great work!