16 GB VRAM users, what model do we like best now?
Posted by lemon07r@reddit | LocalLLaMA | View on Reddit | 98 comments
I'm finding Qwen 3.5 27b at IQ3 quants to be quite nice. I can usually fit around 32k (this is usually enough context for me since I don't use my local models for anything like coding) without issues and get around 40+ t/s on my RTX 4080 using ik_llama.cpp compiled for CUDA. I'm wondering if we could maybe get away with iq4 quants for the gemma 26b moe using turboquant for the kv cache?
Being on 16gb kind of feels like edging, because the quality drop-off between iq4 and q4 feels pretty noticeable to me... but you also give up a ton of speed as soon as you need to start offloading layers.
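Rough math on why ~32k fits alongside the weights (the layer count, KV heads and head dim below are placeholder guesses for a ~27b dense model, not the real config):

```python
# Back-of-envelope KV-cache sizing. n_layers, n_kv_heads and head_dim
# are placeholder guesses for a ~27b dense model, not the real config.
def kv_cache_bytes(ctx, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2.0):
    # K and V each hold n_kv_heads * head_dim elements per layer per token
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token

fp16 = kv_cache_bytes(32768)                       # fp16 cache
q8 = kv_cache_bytes(32768, bytes_per_elem=1.0625)  # q8_0 is ~8.5 bits/elem

print(f"32k context, fp16 KV: {fp16 / 2**30:.2f} GiB")
print(f"32k context, q8_0 KV: {q8 / 2**30:.2f} GiB")
```

With assumptions like these, an fp16 cache at 32k already eats around 8 GiB, which is why quantizing the KV cache (q8_0, or something like turboquant) matters so much on 16GB cards.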
Equivalent_Bit_461@reddit
Is it even worth using quants this low? I just took the MoE pill and run everything at quant 6, the most important bits in VRAM, the rest in RAM; also, thanks to turboquant, I can easily stay over 100k context. Sure, quant 6 might not be lightning fast, but at least it's not severely degraded.
Top-Rub-4670@reddit
In my tests IQ3/Q3 has been fine for both Qwen 3.5 27b and Gemma 4 31b. Asking specific questions about some deep knowledge is definitely worse than Q4+, but the reasoning seems to mostly be there? At least it hasn't failed any of my go-to test tasks.
I found that Q3 was "fine" for role playing in Gemma 4 26b, but it doesn't follow directions as well as Q4+ and it tends to get confused in long contexts. It also frequently forgets its personality and starts talking in a neutral voice. Q2 is the same, but worse, plus it starts making lots of typos. I haven't noticed any significant difference between Q4/Q5/Q6/Q8. So there seems to be a threshold at Q4 for 26b, and possibly for other similarly sized MoE models?
But Q3 for Qwen 3.5 9b and Gemma 4 E4B is like a lobotomy, they fail all the "complex" tasks I've tried.
Note that I have tried all the small quants out there for the models I've talked about. The static ones, the imatrix ones, the unsloth ones. It doesn't make any real difference, the Q3/Q4 cliff is real!
InternationalNebula7@reddit
What speeds with what config and hardware?
ea_man@reddit
If you don't waste VRAM you should be able to fit Qwen_Qwen3.5-27B-IQ4_XS.gguf 15.2 GB with some 80k spare context at Q_4.
- https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGU
Either use integrated graphics for the DE or kill X11; otherwise, if you tune it properly, you should be able to run LXQt with some 40k context.
BTW: Qwen_Qwen3.5-27B-IQ3_XXS.gguf 11.3 GB runs the same way on a 12GB GPU.
Top-Rub-4670@reddit
That's insane, how do you fit 15.2GB in 16GB VRAM? Where is the KV cache going? The context? Hell, your OS' desktop renderer?
lly0571@reddit
Gemma4-26B-A4B-IQ4_XS for speed and Qwen3.5-27B-Q3_K_XL for quality. Both of them can handle ~32k context with 16GB.
Top-Rub-4670@reddit
I haven't noticed any difference between Q3_K_XL and Q3_K_M and the benchmarks seem to agree. Has your experience been different?
Morphon@reddit
Qwen 3.5 over here!
35B-A3B at Q6K and 128k context (expert weights pushed to CPU). 35t/s. Very usable speeds, low precision loss because of the big quant.
122B-A10B at IQ3_S and 128k context (again, expert weights on CPU). 15t/s. Still usable speeds, but not at the "just ask the AI and get an answer right away" level. Less precision, but MUCH better domain knowledge.
These two have replaced almost everything else I've used.
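A crude ceiling for the CPU-offload case (bandwidth and bits-per-weight are assumed numbers; this pretends every active weight streams from RAM each token):

```python
# Crude decode-speed ceiling when expert weights live in system RAM:
# each generated token must stream roughly the active parameters once.
# Bandwidth and bits-per-weight below are assumptions for illustration.
def moe_cpu_tps(active_params, bits_per_weight, ram_bw_gbs):
    bytes_per_token = active_params * bits_per_weight / 8
    return ram_bw_gbs * 1e9 / bytes_per_token

# 35B-A3B at ~Q6 (~6.5 bpw) over dual-channel DDR4 (~50 GB/s)
print(f"ceiling: {moe_cpu_tps(3e9, 6.5, 50):.1f} t/s")
```

Real numbers land above this ceiling because attention and any shared weights stay on the GPU, so not all active parameters actually stream from RAM.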
toalv@reddit
How do you push expert weights to CPU? I'm using Ollama, does it do this automatically or do I need to use llama.cpp or similar?
mlhher@reddit
You should stop using Ollama and use llama.cpp. You don't need any config for llama.cpp, just use "-fit on". You can use it for all models; llama.cpp smartly fits whatever gives the best speed.
Ollama should be avoided for many many reasons.
Jayfree138@reddit
Does llama.cpp swap models in and out of VRAM as needed, or do you have to do it manually? With an Ollama backend, if I call a different model than the one that's loaded and I don't have enough VRAM to fit both, it'll drop the previous model out of VRAM automatically to make space for the one I'm using rather than overflow to system RAM.
This enables me to string together multiple models in a sequence with minimal VRAM usage, which is critical on a consumer GPU with limited memory.
If llama.cpp can do that with minimal setup I'll seriously think about switching.
fligglymcgee@reddit
Yes, llama.cpp now has a release called llama-server that handles this pretty well. Llama-swap is a bit more flexible, but either are good choices and both will hot swap models on demand for you.
lolwutdo@reddit
I thought -fit was on by default?
toalv@reddit
Could you expand on those reasons?
Wild_Requirement8902@reddit
Try out LM Studio: nice UI, and if you have several computers running it you can link them together to load and unload models on any connected PC through their link feature.
ThankGodImBipolar@reddit
LMStudio is an ollama frontend...
4xi0m4@reddit
The main issues are: custom model format that makes quantization harder, closed-source and slower to adopt new llama.cpp features (like the new MoE CPU offload), and limited flexibility for fine-tuning. Also Ollama quantization tooling lags behind what llama.cpp can do directly. For 16GB VRAM users squeezing the most out, direct llama.cpp with correct quantization flags usually wins.
lemon07r@reddit (OP)
Slower speed, usually more issues, and ironically more complicated to use.
DragonfruitIll660@reddit
I think Ollama can use n-cpu-moe to offload experts to regular RAM. If I remember right there's a slider for it (I haven't really used Ollama, I generally just use llama.cpp, but I remember hearing about it).
Morphon@reddit
I'm using LMStudio. Not sure what the actual flags would be if running this on the cmdline.
This allows me to use my 64GB of system RAM to circumvent the speed tax on these bigger models. KV Cache and some layers sit on the GPU. Inference experts sit in RAM and are partially run on the CPU.
It's been a huge game changer for me.
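For anyone wanting the command-line version of this, the plain llama.cpp equivalent is roughly the following sketch (model path and numbers are placeholders; flag names are from recent llama.cpp builds, so check `llama-server --help` for yours):

```shell
# Sketch only: keep attention layers + KV cache on the GPU, push expert
# tensors to system RAM. Path and sizes are placeholders.
./llama-server \
    -m ./Qwen3.5-35B-A3B-Q6_K.gguf \
    -c 131072 \
    -ngl 99 \
    --n-cpu-moe 99
```

`-ngl 99` offloads everything to the GPU first; `--n-cpu-moe` then moves the expert tensors of that many layers back to system RAM, which is the cmdline counterpart of LM Studio's "force experts to CPU" slider.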
toalv@reddit
I'm on 64GB as well, appreciate the tips.
n8mo@reddit
Agreed.
35B-A3B is by farrrr my favourite model atm. Fast enough on my 5070ti and smart enough for most things I use it for.
OneStoneTwoMangoes@reddit
What quant of Qwen 35 runs well on 5070Ti Laptop?
LoSboccacc@reddit
What's the prompt processing speed of that?
Monad_Maya@reddit
Why IQ3_S on 122B, system specs?
Morphon@reddit
RTX 5070 (12gb VRAM), AMD 5900XT (64GB DDR4)
Monad_Maya@reddit
64GB, got it.
I was trying to run gpt-oss 120B last year and found the memory capacity insufficient. Had to move to 128GB to get breathing room.
It's a worthwhile upgrade (prices notwithstanding).
I'm on 7900XT 20GB + 5900X + 128GB DDR4.
I'd suggest a quant of Qwen 3.5 27B, it's slightly better than Qwen 35B although way slower too. Your experience might vary from mine.
Di_Vante@reddit
How did you get 122b properly configured? Did you set specific params, or are you using stock? I'm only getting trash from it :(
AvidCyclist250@reddit
I use this, with models downloaded via lm studio.
cd llama.cpp
./build/bin/llama-server \
    -m "/home/-----YOURNAME----/.lmstudio/models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-IQ3_XXS.gguf" \
    --jinja \
    -b 2048 -ub 2048 \
    --temp 0.6 \
    -ctk q8_0 -ctv q8_0 \
    -fitc 68304 --fit on -fitt 256 \
    --cache-ram 0 --parallel 1 \
    -t 6 --reasoning-budget 1024
Di_Vante@reddit
Awesome, ill try this out. Tyvm!
AvidCyclist250@reddit
Can lower q8 to q4 for more context. Might want to test your use cases before doing that. I haven't noticed any big drawbacks, but others have said it made a difference for them.
Popular_Tomorrow_204@reddit
I'm a complete beginner, so I might not understand correctly.
Are you using it for coding or other tasks? If yes, would you recommend it for coding?
Morphon@reddit
I don't use it for "vibe coding". But I will ask it questions about syntax and standard library functions, and occasionally for some code review tasks (how can I make this function more efficient/idiomatic, etc...). If it has good training data for the language (like JavaScript, Python, etc...), it does quite well for these tasks. Rarer languages (Smalltalk) - not so great. It will hallucinate methods like you wouldn't believe! :-)
iamapizza@reddit
Could you share your llama server arguments, might help for comparison.
Morphon@reddit
Just using the standard LMStudio defaults. It's an Unsloth Dynamic.
Slide up the context to 128k (or 64k on my machine with the 5070).
GPU Offload to maximum. Unified KV Cache, Offload KV Cache to GPU memory. Number of layers to force experts to CPU: maximum.
12GB RTX 5070 with an AMD 5900XT (only DDR4) with 64gb cranks out over 20t/s.
Ell2509@reddit
Just FYI there is no edge. Everyone wants the next size up.
embeeweezer@reddit
I'm in the qwen3.5 35b MoE ballpark as well. Would like to get the 27B up to speed though. Anyone got a Speculative Decoding config running?
PhlarnogularMaqulezi@reddit
Speed and context length wise, Qwen3.5-9b Q8 has been outstanding for its size.
bnolsen@reddit
I've been using Q6 on my 3060 12GB. It's more of a general household LLM. I also have a 128GB Strix Halo.
-Ellary-@reddit
5060ti 16gb / 32gb ram.
gemma-4-26B-A4B-it-IQ4_XS - 90k of context (Q8) - all layers - 90tps.
gemma-4-31B-it-IQ4_XS - 16k of context (Q8) - 52 layers - 10tps.
gemma-4-31B-it-IQ3_XXS - 45k of context (Q8) - all layers - 25tps.
Qwen3.5-27B-IQ4_XS - 20k of context - all layers - 25tps.
Qwen3.5-27B-heretic-v3.i1-IQ3_XXS - 77k of context - all layers - 25tps.
Skyfall-31B-v4.2-IQ3_XXS - 32k of context (Q8) - all layers - 25tps.
IQ3_XXS is surprisingly good; it is around Q2_K size but performs noticeably better.
I'd say there is just no point in running a 9b model at Q8, just run IQ3_XXS 27b, the size is about the same.
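The size math roughly backs this up (bits-per-weight figures are approximate community averages, not exact per-model numbers):

```python
# Approximate on-disk size from bits-per-weight. The bpw values are the
# commonly cited averages for these llama.cpp quant types, not exact.
def gguf_gib(n_params, bpw):
    return n_params * bpw / 8 / 2**30

q8_9b = gguf_gib(9e9, 8.5)       # Q8_0 ~ 8.5 bpw
iq3_27b = gguf_gib(27e9, 3.06)   # IQ3_XXS ~ 3.06 bpw

print(f"9B at Q8_0:     {q8_9b:.1f} GiB")
print(f"27B at IQ3_XXS: {iq3_27b:.1f} GiB")
```

A 27b IQ3_XXS file comes out about the same size as a 9b Q8_0 file, which is exactly the trade being described.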
InternationalNebula7@reddit
This is a very helpful post.
5080 16gb; no vision
gemma-4-31B-it-Q3_K_S - 18994 context (Q8) - all layers - tg 45tps, pp 1577tps
gemma-4-31B-it-Q3_K_M.gguf - 18k context (Q8) - 55 layers - tg 17.5 tps, pp 1100tps
gemma-4-26B-A4B-it-UD-IQ4_NL.gguf - 18k context (Q8) - all layers - tg 136, pp 5567tps
Dabalam@reddit
I understand people getting 60 t/s won't be fretting about their speed, but people using Q3 dense models at 20 t/s could be getting 2 to 3 times the speed with similar quant MoE or the same speed at Q4. I'm surprised the speed difference isn't so important to most.
grumd@reddit
qwen 3.5 122b if you have enough RAM (64+)
sine120@reddit
I imagine at 64GB RAM you're probably looking at less TG than the 27B, with about the same quant and context size, no?
grumd@reddit
27B is a worse option at 16GB VRAM, but a better option at 24GB VRAM and higher. Nobody's gonna see my reply now that it's downvoted to the bottom of the post, but that's true and there's a few reasons for that.
You just can't run 27B on 16GB with enough context (100k+) while keeping a good quant (better than IQ3_XS). Once you start offloading layers to CPU to get more context at better quants, tg drops to 10 or lower. Dense models also suffer from quantization more than large MoE models - because with MoE models you can quantize experts more aggressively but then keep higher precision for more important layers. With dense models you don't have a lot of leeway.
I've run 27B at Q3_K_S and IQ4_XS (for the latter you have to offload some layers to CPU) on the Aider benchmark; Q3 scored ~50%, Q4 scored ~60%. 35B-A3B:Q6_K_XL scored ~55%.
122B at IQ3_XXS scored 67% while being faster than IQ4_XS.
After a lot of testing, I've ended up daily driving 122B Q4_K_XL (96GB RAM here) with 160k context. It's way better than any quant I could run with 27B, and it does real world coding tasks spectacularly.
With 122B my speed is around 15-20 tg, 1200-1500 pp depending on context depth.
sine120@reddit
I'll have to play with the 122B a little more. At 64GB I'm in Q3 territory again, probably IQ3_XXS, but I always kind of assumed 120B+ models wouldn't be worth it. I'll have to try it. For me the PP speed is what kills me since I was using opencode, but if I switch to Pi I can probably get better mileage.
No idea why you're downvoted, us lower VRAM folks should definitely be considering MoE's.
grumd@reddit
In my inbox there was a very rude comment saying "reading comprehension duuuh OP said 16GB VRAM", maybe some people don't know about CPU offloading and just downvote?
Anyway.
I'd recommend IQ3_S as your best option at 64GB. It's higher quality than IQ3_XXS but the size is almost the same. The next noticeable step up is IQ4_XS, but it's hard to fit in 64GB RAM; nothing between IQ3_S and IQ4_XS is worth it. Another option is this dude: https://huggingface.co/Goldkoron/Qwen3.5-122B-A10B He's made quants with better KLD than unsloth's quants at the same size. You can try K_G_3.50 to see if it fits. It's supposed to be higher quality than UD-Q3_K_XL.
I'm benchmarking his 5.00 quant against similarly sized UD-Q4_K_XL from unsloth, but it will take around a week of benchmarks to get the results.
sine120@reddit
Let me know how it benches, I'd be curious. Optimizing everything tickles my brain in just the right way.
Witty_Mycologist_995@reddit
Gemma 26b all the way
random_boy8654@reddit
Gemma 26b vs qwen 3.5 35B ?
Witty_Mycologist_995@reddit
Gemma.
throwaway957263@reddit
What quant did you use? I tried https://ollama.com/VladimirGav/gemma4-26b-16GB-VRAM
But it leaves only about 1GB of VRAM for the KV cache, which caps you at 8192 context
Witty_Mycologist_995@reddit
I just ran the unsloth version. On llama cpp
sine120@reddit
Like you said, 27B at IQ3_XXS does well. I have 64GB of system RAM, so I tend to run MoE's in harnesses with a small amount of system prompt if possible. Qwen3-Coder is good, 3.5-35B-A3B is good, and Gemma4-26B is good. If I don't need as much intelligence/ coding ability, 3.5-9B is also pretty good, and I want to play with Qwopus to see how it handles.
I wish there were something up-to-date in the 12-20B range, as that would probably give 16GB folks enough context to be more useful and use higher quants.
grumd@reddit
You should try 122B at IQ3_XXS, at a low quant it outperforms 27B. 27B gets ahead of 122B at higher quants
xeeff@reddit
please let me know how Qwopus (9B/35B A3B/27B) works out for you, and what your use cases are. i'll be waiting :)
sine120@reddit
Is there a 35B Qwopus? I only see 4/9/27B.
xeeff@reddit
you're right, MoE isn't here yet.
ansibleloop@reddit
My issue is I want the entire model in my GPU for speed, but with my monitors I only have like 15GB of VRAM free, with 12GB-ish for the model and 3GB for context
I need to offload some of that and try Gemma 4
sine120@reddit
For low context/ quick chats, you can fit pretty good intelligence in 16GB, but for longer context work you'll pretty much need to give up on that and accept it's going to be a background task.
Spicy_mch4ggis@reddit
Qwopus is pretty decent, I quite like it
MerePotato@reddit
Unsloths Q6_K quant of Gemma 4 26BA4B with MoE offloading (--n-cpu-MoE) is your best bet imo, just make sure you're on the latest build of llama.cpp.
AlterTableUsernames@reddit
Great answers here. Anyone have a recommendation for 8GB VRAM + 32GB DDR4?
AlwaysLateToThaParty@reddit
Qwen3.5 9B Heretic Q6_K or Q8_0, depending how much else i have in VRAM. My work computer is locked down. Can't even plug a phone into it to charge it. But at least it has an RTX 5000 in it. So that's what I use if I need to use inference at work. Not as good as my home system, but it works a treat.
Danmoreng@reddit
Why ik_llama over llama.cpp?
lemon07r@reddit (OP)
Supposedly it has optimizations that make it faster; I think upstream ends up getting some of them too, but a lot more slowly
Danmoreng@reddit
Well, I tested that a few months ago and found no performance benefits, that's why I stick to llama.cpp. The only benefit apparently might be different quants (the IQ ones) which llama.cpp won't get because of personal differences: https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3946355613
If you want to try out llama.cpp, I got some scripts to build from source and settings I found optimal for the Qwen 3.5 family here: https://github.com/Danmoreng/local-qwen3-coder-env
The 27B model in Q4 is too large for 16GB though. I prefer MoE variants, since they have decent performance if split between GPU and CPU. For example I get around ~70 t/s with the 35B model on my RTX 5080 mobile.
lemon07r@reddit (OP)
Yeah, there seems to be sort of an ebb and flow of llama.cpp catching up, ik having stuff added, etc. I think the gap has gotten pretty small now, but since ik works too I haven't had a reason not to use it. It does compile a little slower though
Uriziel01@reddit
Qwen 3.5 Coder Next, hands down. I've been testing Gemma 4 for the last couple of days, so maybe I'll switch for general assistant use, but for coding Qwen is still the best for my 16GB VRAM use cases.
tuliosarmento@reddit
Is there a 3.5 coder next?
TastyStatistician@reddit
Gemma 4 26b is currently the best for 16gb VRAM.
Qwen 3.5 is also great but thinks way too much. 4b or 9b with thinking off are great if you need large context room.
RandomTrollface@reddit
Gemma 4 31b and qwen 3.5 27b both iq3_xxs. They seem smarter to me than the smaller models at higher quant.
Fyksss@reddit
gemma 4 31B Q3_K_S and IQ3_XXS
Long_comment_san@reddit
Damn, can't wait to get a reasonably priced GPU with 32GB VRAM. The R9700 is quite close, as is the B70, but nah, I do play games as well. No idea why AMD doesn't just push something around $800 with 24GB of slower VRAM.
ea_man@reddit
Aye, I want a 9070xt Super with double VRAM.
Maleficent_Celery_55@reddit
i wish amd did something like 7900xtx this generation. i hope they do it next time.
ea_man@reddit
Let's hope that prices keep going down.
Personally I don't want the power of an *090xtx; I would be happy if the *070 was 16-24GB and the *070XT was 24-32GB, because this gen, paying $650 for a 9070XT 16GB ain't sweet, and $600 for the 9070 wasn't sound.
ThankGodImBipolar@reddit
No AMD representation here yet; I've been able to run Gemma 4 27B Q8 at 15-20 tok/s on my 7800XT. I've also tried a Q4_K_M quant (Heretic, if that makes any difference), and that runs at ≈25tok/s. I haven't rebuilt llama.cpp since Gemma 4 came out, so it's possible it may run faster on the current branch. I'm planning on doing some more messing around tonight and may update if I can find some improvements.
In addition to that, I've also been using Qwen 3.5 Coder Next (64GB of system RAM) at IQ4_XS, and that runs at ≈28tok/s. Not sure whether this or Gemma 4 27B is better for coding; will have to experiment some more.
I'd appreciate if anyone has any insight into whether these speeds seem appropriate for my hardware, if I'm using stupid quants, etc.. I'm going to keep following along with this thread.
Correct-Boss-9206@reddit
I have been running Gemma4 Q4_K_M and it runs pretty fast for my use case. 28 tok/s on my 5070ti quality feels solid at that quant.
LostDrengr@reddit
Currently using Gemma4-26B-IQ3 plenty of room for context and its hitting 124t/s on 5080.
WhataburgerFreak@reddit
I'm with you as well, at q3_k_m with that same card, at 135k context with q8 on the cache
lolwutdo@reddit
Qwen 3.5 397b
kiwibonga@reddit
I upgraded to 32GB from 16GB because it wasn't comfortable enough with Devstral Small 2 24B, similar constraints to Qwen 27B which I use now.
With turboquant though we will be able to have full quality and full context in 32 GB which is really cool.
Not really answering your question, but highly recommend going 32GB. A 5060Ti 16GB is only $500-700.
H3g3m0n@reddit
The IQ3_XXS of Gemma-31b should allow for around 60k context. Someone posted benchmarks on twitter showing it's almost as good as the Q4. You could even get more context with something like turboquant/rotorquant if you're willing to figure out which random fork is decent.
Unfortunately, as of now CUDA 13.2 has a bug that causes it to output gibberish in llama.cpp. I tried downgrading to 13.1, which solved the gibberish issue but ran into another bug that caused it to crash when loading the vision mmproj. Might try 13.0 or a 12.x and see if either solves both bugs.
Currently I'm just sticking with the Qwen MoE, which gets the full context and decent speeds with n-cpu-moe offloading. It seems better than the Gemma 4 MoE.
lacerating_aura@reddit
Qwen3.5 122B IQ4_XS with context almost maxed out (beyond 200k anyway) at bf16. It's a dedicated machine for hosting the model and some very light services: 16GB VRAM, 64GB RAM. Or Q8_K_XL of Gemma 26B; dense models just aren't worth it on a 3070-class GPU.
Erwylh_@reddit
Gemma4-E4B-f16 with long ctx. But I needed vision and audio processing capabilities as well, so it was a surprise that I got the perfect model for my use case.
Herocem@reddit
Gemma 4 26B-A4B for me at Q4, 128k. I get 60 t/s when context is empty, going all the way down to 40 t/s when it gets full. I'm running it on a 5070 Ti, 32GB DDR4 3600, Ryzen 7 5800X3D.
I use it for my personal assistant project on n8n.
lemon07r@reddit (OP)
Hmm, I want to try this, but at the same time that's only marginally faster than dense 27b at iq3... and I get the feeling a dense 27b model would still be smarter and more capable.
the__storm@reddit
Was it maybe offloading to system memory? It's a lot faster if you can squeeze it into VRAM, which is just barely possible with the 26B at Q4 (IQ4, if you want any space for context).
But yeah the 27B dense is going to be significantly smarter.
popcornkiller1088@reddit
gemma-4-26B-A4B-it-UD-IQ3_S.gguf is awesome for 16GB VRAM (RTX 4080 Super).
Experimenting with llama.cpp RPC servers to bypass VRAM limits, using the RTX 4080 Super + an RTX 3060 Ti (8GB) via Ethernet.
While I can hit 90 t/s at 32k context on the main card, bridging the second PC let me bump the context up to 130k. Speed dropped to 20 t/s, but having that massive window is a total game changer.
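For reference, the RPC split looks roughly like this (assuming both llama.cpp builds were compiled with RPC support, e.g. `-DGGML_RPC=ON`; IPs, ports and paths are placeholders):

```shell
# Sketch only: placeholders for IPs, ports and model paths.
# On the secondary box (3060 Ti): expose its GPU as an RPC backend.
./rpc-server -H 0.0.0.0 -p 50052

# On the main box (4080 Super): point llama-server at the remote backend.
./llama-server -m ./gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
    --rpc 192.168.1.50:50052 \
    -c 131072 -ngl 99
```

The remote GPU shows up as extra device memory, so layers can spill over Ethernet instead of into system RAM; latency over the network is what drags t/s down.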
send-moobs-pls@reddit
I'm in the 8GB poorhouse, but I just can't find anything that compares to the qwen 3.5 models right now. I'll say maybe their weakness is creativity or role play, because the qwen vibe is pretty "codex" feeling; Gemma might be better if you specifically want creativity or personality like that. But for general thinking, tasks, tools etc. I'm basically still in shock at how the qwen 9B makes everything else I can run look like a joke
lemon07r@reddit (OP)
I really need to look into the Gemma models, but I'm not entirely convinced they will be better than the qwen 3.5 models. EQ bench actually shows qwen 3.5 27b model to be the better writer than any of the gemma models.
Shamp0oo@reddit
You can run the IQ4_XS quant of Qwen 3.5 27B with 16 GB of VRAM and up to 40k context (q8). See my comment and follow up comment for instructions. I recently switched to the unsloth IQ4_XS quant which is slightly bigger and therefore only allows for around 32k context but it felt more robust with tool calls in Open WebUI to me.
Guilty_Rooster_6708@reddit
Gemma 26B and Qwen3.5 35B. MoE all the way
the__storm@reddit
I've been using Gemma 4 26B at IQ4_XS; gets about 65K context at fp16. I agree that the IQ4 is more compressed than I'd like, but I find that Gemma is still quite good at non-coding tasks.
I have 64GB of system memory, but it's dual-channel DDR4 so I'm loath to offload anything with lots of active parameters to it. If there were an updated Coder-Next (80B-A3B) that would be a nice option.
DragonfruitIll660@reddit
Using UD Q3 of Gemma 4 31B today, likely going to be my new main model (it feels like a higher weight model all of a sudden). Otherwise I generally use GLM 4.5 Air Q4KM with n-cpu-moe maxxed out and you still get 8-12 TPS based on context.
anzzax@reddit
Try this one https://huggingface.co/Intel/Qwen3.5-35B-A3B-gguf-q2ks-mixed-AutoRound, I run it with '--n-cpu-moe 8'. It's very fast with still acceptable quality, but if you want the smartest option, find a quant of qwen3.5-27b that you can fit into 16GB
Sadman782@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1scw979/gemma_4_for_16_gb_vram/
You can use Gemma 4 26B MoE IQ4_XS