16 GB VRAM users, what model do we like best now?
Posted by lemon07r@reddit | LocalLLaMA | View on Reddit | 98 comments
I'm finding Qwen 3.5 27b at IQ3 quants to be quite nice. I can usually fit around 32k (this is usually enough context for me since I don't use my local models for anything like coding) without issues and get around 40+ t/s on my RTX 4080 using ik_llama.cpp compiled for CUDA. I'm wondering if we could maybe get away with iq4 quants for the gemma 26b moe using turboquant for the kv cache?
Being on 16gb kind of feels like edging, because the quality drop-off between iq4 and q4 feels pretty noticeable to me... but you also give up a ton of speed as soon as you need to start offloading layers.
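Rough math on why ~32k fits alongside the weights (the layer count, KV heads and head dim below are placeholder guesses for a ~27b dense model, not the real config):

```python
# Back-of-envelope KV-cache sizing. n_layers, n_kv_heads and head_dim
# are placeholder guesses for a ~27b dense model, not the real config.
def kv_cache_bytes(ctx, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2.0):
    # K and V each hold n_kv_heads * head_dim elements per layer per token
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token

fp16 = kv_cache_bytes(32768)                       # fp16 cache
q8 = kv_cache_bytes(32768, bytes_per_elem=1.0625)  # q8_0 is ~8.5 bits/elem

print(f"32k context, fp16 KV: {fp16 / 2**30:.2f} GiB")
print(f"32k context, q8_0 KV: {q8 / 2**30:.2f} GiB")
```

With assumptions like these, an fp16 cache at 32k already eats around 8 GiB, which is why quantizing the KV cache (q8_0, or something like turboquant) matters so much on 16GB cards.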
Equivalent_Bit_461@reddit
Is it even worth using quants this low? I just took the MoE pill and run everything at quant 6, the most important bits in VRAM, the rest in RAM; also, thanks to turboquant, I can easily stay over 100k context. Sure, quant 6 might not be lightning fast, but at least it's not severely degraded.
Top-Rub-4670@reddit
In my tests IQ3/Q3 has been fine for both Qwen 3.5 27b and Gemma 4 31b. Asking specific questions about some deep knowledge is definitely worse than Q4+, but the reasoning seems to mostly be there? At least it hasn't failed any of my go-to test tasks.
I found that Q3 was "fine" for role playing in Gemma 4 26b, but it doesn't follow directions as well as Q4+ and it tends to get confused in long contexts. It also frequently forgets its personality and starts talking in a neutral voice. Q2 is the same, but worse, plus it starts making lots of typos. I haven't noticed any significant difference between Q4/Q5/Q6/Q8. So there seems to be a threshold at Q4 for 26b, and possibly for other similarly sized MoE models?
But Q3 for Qwen 3.5 9b and Gemma 4 E4B is like a lobotomy, they fail all the "complex" tasks I've tried.
Note that I have tried all the small quants out there for the models I've talked about. The static ones, the imatrix ones, the unsloth ones. It doesn't make any real difference, the Q3/Q4 cliff is real!
InternationalNebula7@reddit
What speeds with what config and hardware?
ea_man@reddit
If you don't waste VRAM you should be able to fit Qwen_Qwen3.5-27B-IQ4_XS.gguf 15.2 GB with some 80k spare context at Q_4.
- https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGU
Either use integrated graphics for the DE or kill X11; otherwise, if you tune it properly, you should be able to run LXQt with some 40k context.
BTW: Qwen_Qwen3.5-27B-IQ3_XXS.gguf 11.3 GB runs the same way on a 12GB GPU.
Top-Rub-4670@reddit
That's insane, how do you fit 15.2GB in 16GB VRAM? Where is the KV cache going? The context? Hell, your OS' desktop renderer?
lly0571@reddit
Gemma4-26B-A4B-IQ4_XS for speed and Qwen3.5-27B-Q3_K_XL for quality. Both of them can handle ~32k context with 16GB.
Top-Rub-4670@reddit
I haven't noticed any difference between Q3_K_XL and Q3_K_M and the benchmarks seem to agree. Has your experience been different?
Morphon@reddit
Qwen 3.5 over here!
35B-A3B at Q6K and 128k context (expert weights pushed to CPU). 35t/s. Very usable speeds, low precision loss because of the big quant.
122B-A10B at IQ3_S and 128k context (again, expert weights on CPU). 15t/s. Still usable speeds, but not at the "just ask the AI and get an answer right away" level. Less precision, but MUCH better domain knowledge.
These two have replaced almost everything else I've used.
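A crude ceiling for the CPU-offload case (bandwidth and bits-per-weight are assumed numbers; this pretends every active weight streams from RAM each token):

```python
# Crude decode-speed ceiling when expert weights live in system RAM:
# each generated token must stream roughly the active parameters once.
# Bandwidth and bits-per-weight below are assumptions for illustration.
def moe_cpu_tps(active_params, bits_per_weight, ram_bw_gbs):
    bytes_per_token = active_params * bits_per_weight / 8
    return ram_bw_gbs * 1e9 / bytes_per_token

# 35B-A3B at ~Q6 (~6.5 bpw) over dual-channel DDR4 (~50 GB/s)
print(f"ceiling: {moe_cpu_tps(3e9, 6.5, 50):.1f} t/s")
```

Real numbers land above this ceiling because attention and any shared weights stay on the GPU, so not all active parameters actually stream from RAM.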
toalv@reddit
How do you push expert weights to CPU? I'm using Ollama, does it do this automatically or do I need to use llama.cpp or similar?
mlhher@reddit
You should stop using Ollama and use llama.cpp. You don't need any config for llama.cpp, just use "-fit on". You can use it for all models; llama.cpp smartly fits whatever gives the best speed.
Ollama should be avoided for many many reasons.
Jayfree138@reddit
Does llama.cpp swap models in and out of VRAM as needed, or do you have to do it manually? With an Ollama backend, if I call a different model than the one that's loaded and I don't have enough VRAM to fit both, it'll drop the previous model out of VRAM automatically to make space for the one I'm using rather than overflow to system RAM.
This enables me to string together multiple models in a sequence with minimal VRAM usage, which is critical on a consumer GPU with limited memory.
If llama.cpp can do that with minimal setup I'll seriously think about switching.
fligglymcgee@reddit
Yes, llama.cpp now has a release called llama-server that handles this pretty well. Llama-swap is a bit more flexible, but either are good choices and both will hot swap models on demand for you.
lolwutdo@reddit
I thought -fit was on by default?
toalv@reddit
Could you expand on those reasons?
Wild_Requirement8902@reddit
Try out LM Studio: nice UI, and if you have several computers running it you can link them together to load and unload models on any connected PC through their link feature.
ThankGodImBipolar@reddit
LMStudio is an ollama frontend...
4xi0m4@reddit
The main issues are: custom model format that makes quantization harder, closed-source and slower to adopt new llama.cpp features (like the new MoE CPU offload), and limited flexibility for fine-tuning. Also Ollama quantization tooling lags behind what llama.cpp can do directly. For 16GB VRAM users squeezing the most out, direct llama.cpp with correct quantization flags usually wins.
lemon07r@reddit (OP)
Slower speed, usually more issues, and ironically more complicated to use.
DragonfruitIll660@reddit
I think Ollama can use n-cpu-moe to offload experts to regular RAM. If I remember right there's a slider for it (I haven't really used Ollama, I generally just use llama.cpp, but I remember hearing about it).
Morphon@reddit
I'm using LMStudio. Not sure what the actual flags would be if running this on the cmdline.
This allows me to use my 64GB of system RAM to circumvent the speed tax on these bigger models. KV Cache and some layers sit on the GPU. Inference experts sit in RAM and are partially run on the CPU.
It's been a huge game changer for me.
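For anyone wanting the command-line version of this, the plain llama.cpp equivalent is roughly the following sketch (model path and numbers are placeholders; flag names are from recent llama.cpp builds, so check `llama-server --help` for yours):

```shell
# Sketch only: keep attention layers + KV cache on the GPU, push expert
# tensors to system RAM. Path and sizes are placeholders.
./llama-server \
    -m ./Qwen3.5-35B-A3B-Q6_K.gguf \
    -c 131072 \
    -ngl 99 \
    --n-cpu-moe 99
```

`-ngl 99` offloads everything to the GPU first; `--n-cpu-moe` then moves the expert tensors of that many layers back to system RAM, which is the cmdline counterpart of LM Studio's "force experts to CPU" slider.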
toalv@reddit
I'm on 64GB as well, appreciate the tips.
n8mo@reddit
Agreed.
35B-A3B is by farrrr my favourite model atm. Fast enough on my 5070ti and smart enough for most things I use it for.
OneStoneTwoMangoes@reddit
What quant of Qwen 35 runs well on 5070Ti Laptop?
LoSboccacc@reddit
What's the prompt processing speed of that?
Monad_Maya@reddit
Why IQ3_S on 122B, system specs?
Morphon@reddit
RTX 5070 (12gb VRAM), AMD 5900XT (64GB DDR4)
Monad_Maya@reddit
64GB, got it.
I was trying to run gpt-oss 120B last year and found the memory capacity insufficient. Had to move to 128GB to get breathing room.
It's a worthwhile upgrade (prices notwithstanding).
I'm on 7900XT 20GB + 5900X + 128GB DDR4.
I'd suggest a quant of Qwen 3.5 27B, it's slightly better than Qwen 35B although way slower too. Your experience might vary from mine.
Di_Vante@reddit
How did you get 122b properly configured? Did you set specific params, or are you using stock? I'm only getting trash from it :(
AvidCyclist250@reddit
I use this, with models downloaded via lm studio.
cd llama.cpp
./build/bin/llama-server \
    -m "/home/-----YOURNAME----/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf" \
    --jinja \
    -b 2048 -ub 2048 \
    --temp 0.6 \
    -ctk q8_0 -ctv q8_0 \
    -fitc 68304 --fit on -fitt 256 \
    --cache-ram 0 --parallel 1 \
    -t 6 --reasoning-budget 1024
Di_Vante@reddit
Awesome, ill try this out. Tyvm!
AvidCyclist250@reddit
Can lower q8 to q4 for more context. Might want to test your use cases before doing that. I haven't noticed any big drawbacks, but others have said it made a difference for them.
Popular_Tomorrow_204@reddit
I'm a complete beginner, so I might not understand correctly.
Are you using it for coding or other tasks? If yes, would you recommend it for coding?
Morphon@reddit
I don't use it for "vibe coding". But I will ask it questions about syntax and standard library functions, and occasionally for some code review tasks (how can I make this function more efficient/idiomatic, etc...). If it has good training data for the language (like JavaScript, Python, etc...), it does quite well for these tasks. Rarer languages (Smalltalk) - not so great. It will hallucinate methods like you wouldn't believe! :-)
iamapizza@reddit
Could you share your llama server arguments, might help for comparison.
Morphon@reddit
Just using the standard LMStudio defaults. It's an Unsloth Dynamic.
Slide up the context to 128k (or 64k on my machine with the 5070).
GPU Offload to maximum. Unified KV Cache, Offload KV Cache to GPU memory. Number of layers to force experts to CPU: maximum.
12GB RTX 5070 with an AMD 5900XT (only DDR4) with 64gb cranks out over 20t/s.
Ell2509@reddit
Just FYI there is no edge. Everyone wants the next size up.
embeeweezer@reddit
I'm in the qwen3.5 35b MoE ballpark as well. Would like to get the 27B up to speed though. Anyone got a Speculative Decoding config running?
PhlarnogularMaqulezi@reddit
Speed and context length wise, Qwen3.5-9b Q8 has been outstanding for its size.
bnolsen@reddit
I've been using Q6 on my 3060 12GB. It's more of a general household LLM. I also have a 128GB Strix Halo.
-Ellary-@reddit
5060ti 16gb / 32gb ram.
gemma-4-26B-A4B-it-IQ4_XS - 90k of context (Q8) - all layers - 90tps.
gemma-4-31B-it-IQ4_XS - 16k of context (Q8) - 52 layers - 10tps.
gemma-4-31B-it-IQ3_XXS - 45k of context (Q8) - all layers - 25tps.
Qwen3.5-27B-IQ4_XS - 20k of context - all layers - 25tps.
Qwen3.5-27B-heretic-v3.i1-IQ3_XXS - 77k of context - all layers - 25tps.
Skyfall-31B-v4.2-IQ3_XXS - 32k of context (Q8) - all layers - 25tps.
IQ3_XXS is surprisingly good; it is around Q2_K size but performs noticeably better.
I'd say there is just no point in running a 9b model at Q8, just run IQ3_XXS 27b, the size is about the same.
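The size math roughly backs this up (bits-per-weight figures are approximate community averages, not exact per-model numbers):

```python
# Approximate on-disk size from bits-per-weight. The bpw values are the
# commonly cited averages for these llama.cpp quant types, not exact.
def gguf_gib(n_params, bpw):
    return n_params * bpw / 8 / 2**30

q8_9b = gguf_gib(9e9, 8.5)       # Q8_0 ~ 8.5 bpw
iq3_27b = gguf_gib(27e9, 3.06)   # IQ3_XXS ~ 3.06 bpw

print(f"9B at Q8_0:     {q8_9b:.1f} GiB")
print(f"27B at IQ3_XXS: {iq3_27b:.1f} GiB")
```

A 27b IQ3_XXS file comes out about the same size as a 9b Q8_0 file, which is exactly the trade being described.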
InternationalNebula7@reddit
This is a very helpful post.
5080 16gb; no vision
gemma-4-31B-it-Q3_K_S - 18994 context (Q8) - all layers - tg 45tps, pp 1577tps
gemma-4-31B-it-Q3_K_M.gguf - 18k context (Q8) - 55 layers - tg 17.5 tps, pp 1100tps
gemma-4-26B-A4B-it-UD-IQ4_NL.gguf - 18k context (Q8) - all layers - tg 136, pp 5567tps
Dabalam@reddit
I understand people getting 60 t/s won't be fretting about their speed, but people using Q3 dense models at 20 t/s could be getting 2 to 3 times the speed with similar quant MoE or the same speed at Q4. I'm surprised the speed difference isn't so important to most.
grumd@reddit
qwen 3.5 122b if you have enough RAM (64+)
sine120@reddit
I imagine at 64GB RAM you're probably looking at less TG than the 27B, with about the same quant and context size, no?
grumd@reddit
27B is a worse option at 16GB VRAM, but a better option at 24GB VRAM and higher. Nobody's gonna see my reply now that it's downvoted to the bottom of the post, but that's true and there's a few reasons for that.
You just can't run 27B on 16GB with enough context (100k+) while keeping a good quant (better than IQ3_XS). Once you start offloading layers to CPU to get more context at better quants, tg drops to 10 or lower. Dense models also suffer from quantization more than large MoE models - because with MoE models you can quantize experts more aggressively but then keep higher precision for more important layers. With dense models you don't have a lot of leeway.
I've run 27B at Q3_K_S and IQ4_XS (for the latter you have to offload some layers to CPU) on the Aider benchmark; Q3 scored ~50%, Q4 scored ~60%. 35B-A3B:Q6_K_XL scored ~55%.
122B at IQ3_XXS scored 67% while being faster than IQ4_XS.
After a lot of testing, I've ended up daily driving 122B Q4_K_XL (96GB RAM here) with 160k context. It's way better than any quant I could run with 27B, and it does real world coding tasks spectacularly.
With 122B my speed is around 15-20 tg, 1200-1500 pp depending on context depth.
sine120@reddit
I'll have to play with the 122B a little more. At 64GB I'm in Q3 territory again, probably IQ3_XXS, but I always kind of assumed 120B+ models wouldn't be worth it. I'll have to try it. For me the PP speed is what kills me since I was using opencode, but if I switch to Pi I can probably get better mileage.
No idea why you're downvoted, us lower VRAM folks should definitely be considering MoE's.
grumd@reddit
In my inbox there was a very rude comment saying "reading comprehension duuuh OP said 16GB VRAM", maybe some people don't know about CPU offloading and just downvote?
Anyway.
I'd recommend IQ3_S as your best option at 64GB. It's higher quality than IQ3_XXS but the size is almost the same. The next noticeable step up is IQ4_XS, but it's hard to fit in 64GB RAM; nothing between IQ3_S and IQ4_XS is worth it. Another option is this dude: https://huggingface.co/Goldkoron/Qwen3.5-122B-A10B He's made quants with better KLD than unsloth's quants at the same size. You can try K_G_3.50 to see if it fits. It's supposed to be higher quality than UD-Q3_K_XL.
I'm benchmarking his 5.00 quant against similarly sized UD-Q4_K_XL from unsloth, but it will take around a week of benchmarks to get the results.
sine120@reddit
Let me know how it benches, I'd be curious. Optimizing everything tickles my brain in just the right way.
Witty_Mycologist_995@reddit
Gemma 26b all the way
random_boy8654@reddit
Gemma 26b vs qwen 3.5 35B ?
Witty_Mycologist_995@reddit
Gemma.
throwaway957263@reddit
What quant did you use? I tried https://ollama.com/VladimirGav/gemma4-26b-16GB-VRAM
But it leaves only about 1GB of VRAM for the KV cache, which caps you at 8192 context
Witty_Mycologist_995@reddit
I just ran the unsloth version. On llama cpp
sine120@reddit
Like you said, 27B at IQ3_XXS does well. I have 64GB of system RAM, so I tend to run MoE's in harnesses with a small amount of system prompt if possible. Qwen3-Coder is good, 3.5-35B-A3B is good, and Gemma4-26B is good. If I don't need as much intelligence/ coding ability, 3.5-9B is also pretty good, and I want to play with Qwopus to see how it handles.
I wish there were something up-to-date in the 12-20B range, as that would probably give 16GB folks enough context to be more useful and use higher quants.
grumd@reddit
You should try 122B at IQ3_XXS, at a low quant it outperforms 27B. 27B gets ahead of 122B at higher quants
xeeff@reddit
please let me know how Qwopus (9B/35B A3B/27B) works out for you, and what your use cases are. i'll be waiting :)
sine120@reddit
Is there a 35B Qwopus? I only see 4/9/27B.
xeeff@reddit
you're right, MoE isn't here yet.
ansibleloop@reddit
My issue is I want the entire model in my GPU for speed, but with my monitors I only have like 15GB of VRAM free, with 12GB-ish for the model and 3GB for context
I need to offload some of that and try Gemma 4
sine120@reddit
For low context/ quick chats, you can fit pretty good intelligence in 16GB, but for longer context work you'll pretty much need to give up on that and accept it's going to be a background task.
Spicy_mch4ggis@reddit
Qwopus is pretty decent, I quite like it
MerePotato@reddit
Unsloths Q6_K quant of Gemma 4 26BA4B with MoE offloading (--n-cpu-MoE) is your best bet imo, just make sure you're on the latest build of llama.cpp.
AlterTableUsernames@reddit
Great answers here. Anyone have a recommendation for 8GB VRAM + 32GB DDR4?
AlwaysLateToThaParty@reddit
Qwen3.5 9B Heretic Q6_K or Q8_0, depending how much else i have in VRAM. My work computer is locked down. Can't even plug a phone into it to charge it. But at least it has an RTX 5000 in it. So that's what I use if I need to use inference at work. Not as good as my home system, but it works a treat.
Danmoreng@reddit
Why ik_llama over llama.cpp?
lemon07r@reddit (OP)
Supposedly it has optimizations that make it faster; I think upstream ends up getting some of them too, but a lot more slowly
Danmoreng@reddit
Well, I tested that a few months ago and found no performance benefits, that's why I stick to llama.cpp. The only benefit apparently might be different quants (the IQ ones) which llama.cpp won't get because of personal differences: https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3946355613
If you want to try out llama.cpp, I got some scripts to build from source and settings I found optimal for the Qwen 3.5 family here: https://github.com/Danmoreng/local-qwen3-coder-env
The 27B model in Q4 is too large for 16GB though. I prefer MoE variants, since they have decent performance if split between GPU and CPU. For example I get around ~70 t/s with the 35B model on my RTX 5080 mobile.
lemon07r@reddit (OP)
Yeah, there seems to be sort of an ebb and flow of llama.cpp catching up, ik having stuff added, etc. I think the gap has gotten pretty small now, but since ik works too I haven't had a reason not to use it. It does compile a little slower though
Uriziel01@reddit
Qwen 3.5 Coder Next, hands down. I've been testing Gemma 4 for the last couple of days, so maybe I'll switch for general assistant use, but for coding Qwen is still the best for my 16GB VRAM use cases.
tuliosarmento@reddit
Is there a 3.5 coder next?
TastyStatistician@reddit
Gemma 4 26b is currently the best for 16gb VRAM.
Qwen 3.5 is also great but thinks way too much. 4b or 9b with thinking off are great if you need large context room.
RandomTrollface@reddit
Gemma 4 31b and qwen 3.5 27b both iq3_xxs. They seem smarter to me than the smaller models at higher quant.
Fyksss@reddit
gemma 4 31B Q3_K_S and IQ3_XXS
Long_comment_san@reddit
Damn, can't wait to get a reasonably priced GPU with 32GB VRAM. The R9700 is quite close, as is the B70, but nah, I do play games as well. No idea why AMD doesn't just push something around $800 with 24GB of slower VRAM.
ea_man@reddit
Aye, I want a 9070xt Super with double VRAM.
Maleficent_Celery_55@reddit
i wish amd did something like 7900xtx this generation. i hope they do it next time.
ea_man@reddit
Let's hope that prices keep going down.
Personally I don't want the power of an *090xtx; I would be happy if the *070 was 16-24GB and the *070XT was 24-32GB, because this gen, paying $650 for a 9070XT 16GB ain't sweet, and $600 for the 9070 wasn't sound.
ThankGodImBipolar@reddit
No AMD representation here yet; I've been able to run Gemma 4 27B Q8 at 15-20 tok/s on my 7800XT. I've also tried a Q4_K_M quant (Heretic, if that makes any difference), and that runs at ≈25tok/s. I haven't rebuilt llama.cpp since Gemma 4 came out, so it's possible it may run faster on the current branch. I'm planning on doing some more messing around tonight and may update if I can find some improvements.
In addition to that, I've also been using Qwen 3.5 Coder Next (64GB of system RAM) at IQ4_XS, and that runs at ≈28tok/s. Not sure whether this or Gemma 4 27B is better for coding; will have to experiment some more.
I'd appreciate if anyone has any insight into whether these speeds seem appropriate for my hardware, if I'm using stupid quants, etc.. I'm going to keep following along with this thread.
Correct-Boss-9206@reddit
I have been running Gemma4 Q4_K_M and it runs pretty fast for my use case. 28 tok/s on my 5070ti quality feels solid at that quant.
LostDrengr@reddit
Currently using Gemma4-26B-IQ3 plenty of room for context and its hitting 124t/s on 5080.
WhataburgerFreak@reddit
I'm with you as well, at q3_k_m with that same card, at 135k context with q8 on the cache
lolwutdo@reddit
Qwen 3.5 397b
kiwibonga@reddit
I upgraded to 32GB from 16GB because it wasn't comfortable enough with Devstral Small 2 24B, similar constraints to Qwen 27B which I use now.
With turboquant though we will be able to have full quality and full context in 32 GB which is really cool.
Not really answering your question, but highly recommend going 32GB. A 5060Ti 16GB is only $500-700.
H3g3m0n@reddit
The IQ3_XXS of Gemma-31b should allow for around 60k context. Someone posted benchmarks on twitter showing it's almost as good as the Q4. You could even get more context with something like turboquant/rotorquant if you're willing to figure out which random fork is decent.
Unfortunately, as of now CUDA 13.2 has a bug that causes it to output gibberish in llama.cpp. I tried downgrading to 13.1, which solved the gibberish issue but ran into another bug that caused it to crash when loading the vision mmproj. Might try 13.0 or a 12.x and see if either solves both bugs.
Currently I'm just sticking with the Qwen MoE, which gets the full context and decent speeds with n-cpu-moe offloading. It seems better than the Gemma 4 MoE.
lacerating_aura@reddit
Qwen3.5 122B IQ4_XS with context almost maxed out (beyond 200k anyway) at bf16. It's a dedicated machine for hosting the model and some very light services: 16GB VRAM, 64GB RAM. Or Q8_K_XL of Gemma 26B; dense models just aren't worth it on a 3070-class GPU.
Erwylh_@reddit
Gemma4-E4B-f16 with long ctx. But I needed vision and audio processing capabilities as well, so it was a surprise that I got the perfect model for my use case.
Herocem@reddit
Gemma 4 26B-A4B for me at Q4, 128k. I get 60 t/s when context is empty, going all the way down to 40 t/s when it gets full. I'm running it on a 5070 Ti, 32GB DDR4 3600, Ryzen 7 5800X3D.
I use it for my personal assistant project on n8n.
lemon07r@reddit (OP)
Hmm, I want to try this, but at the same time that's only marginally faster than dense 27b at iq3... and I get the feeling a dense 27b model would still be smarter and more capable.
the__storm@reddit
Was it maybe offloading to system memory? It's a lot faster if you can squeeze it into VRAM, which is just barely possible with the 26B at Q4 (IQ4, if you want any space for context).
But yeah the 27B dense is going to be significantly smarter.
popcornkiller1088@reddit
gemma-4-26B-A4B-it-UD-IQ3_S.gguf is awesome for 16GB VRAM (RTX 4080 Super).
Experimenting with llama.cpp RPC servers to bypass VRAM limits, using the RTX 4080 Super + an RTX 3060 Ti (8GB) via Ethernet.
While I can hit 90 t/s at 32k context on the main card, bridging the second PC let me bump the context up to 130k. Speed dropped to 20 t/s, but having that massive window is a total game changer.
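For reference, the RPC split looks roughly like this (assuming both llama.cpp builds were compiled with RPC support, e.g. `-DGGML_RPC=ON`; IPs, ports and paths are placeholders):

```shell
# Sketch only: placeholders for IPs, ports and model paths.
# On the secondary box (3060 Ti): expose its GPU as an RPC backend.
./rpc-server -H 0.0.0.0 -p 50052

# On the main box (4080 Super): point llama-server at the remote backend.
./llama-server -m ./gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
    --rpc 192.168.1.50:50052 \
    -c 131072 -ngl 99
```

The remote GPU shows up as extra device memory, so layers can spill over Ethernet instead of into system RAM; latency over the network is what drags t/s down.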
send-moobs-pls@reddit
I'm in the 8GB poorhouse, but I just can't find anything that compares to the qwen 3.5 models right now. I'll say maybe their weakness is creativity or role play, because the qwen vibe is pretty "codex" feeling; Gemma might be better if you specifically want creativity or personality like that. But for general thinking, tasks, tools etc. I'm basically still in shock at how the qwen 9B makes everything else I can run look like a joke
lemon07r@reddit (OP)
I really need to look into the Gemma models, but I'm not entirely convinced they will be better than the qwen 3.5 models. EQ bench actually shows qwen 3.5 27b model to be the better writer than any of the gemma models.
Shamp0oo@reddit
You can run the IQ4_XS quant of Qwen 3.5 27B with 16 GB of VRAM and up to 40k context (q8). See my comment and follow up comment for instructions. I recently switched to the unsloth IQ4_XS quant which is slightly bigger and therefore only allows for around 32k context but it felt more robust with tool calls in Open WebUI to me.
Guilty_Rooster_6708@reddit
Gemma 26B and Qwen3.5 35B. MoE all the way
the__storm@reddit
I've been using Gemma 4 26B at IQ4_XS; gets about 65K context at fp16. I agree that the IQ4 is more compressed than I'd like, but I find that Gemma is still quite good at non-coding tasks.
I have 64GB of system memory, but it's dual-channel DDR4 so I'm loath to offload anything with lots of active parameters to it. If there were an updated Coder-Next (80B-A3B) that would be a nice option.
DragonfruitIll660@reddit
Using UD Q3 of Gemma 4 31B today, likely going to be my new main model (it feels like a higher weight model all of a sudden). Otherwise I generally use GLM 4.5 Air Q4KM with n-cpu-moe maxxed out and you still get 8-12 TPS based on context.
anzzax@reddit
Try this one https://huggingface.co/Intel/Qwen3.5-35B-A3B-gguf-q2ks-mixed-AutoRound, I run it with '--n-cpu-moe 8'. It's very fast with still acceptable quality, but if you want the smartest option, find a quant of qwen3.5-27b that you can fit into 16GB
Sadman782@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1scw979/gemma_4_for_16_gb_vram/
You can use Gemma 4 26B MoE IQ4_XS