16gb vram users: what have you been using? Qwen3.6 27b? Gemma 31b at Q3? How has it been?
Posted by Adventurous-Gold6413@reddit | LocalLLaMA | View on Reddit | 11 comments
Do you guys use q3 to fit it in vram? Or have you had bad results?
I had luck fitting Qwen3.5 27B in my 16GB of VRAM with turboquant at 80ctx using the IQ4_XS quant.
But now the hidden size of Qwen3.6 is larger (so IQ4_XS is 15.4GB rather than 14.7GB) :( which makes me upset. I had to use the Q3_K_XL version for Qwen3.6 27B, and while it worked amazingly for openclaw chat, about 10% of the time it couldn't make the correct tool calls or wouldn't format cron jobs properly, causing errors.
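For reference, file sizes in that range are roughly what straight bits-per-weight arithmetic predicts. The bpw figures below are approximate community averages for these quant types, not exact numbers from any GGUF:

```shell
# Rough quant size estimate: params * bits-per-weight / 8 bytes.
# bpw values here are approximate averages, not exact file sizes.
awk 'BEGIN {
  params = 27e9
  bpw["IQ4_XS"]  = 4.25
  bpw["Q3_K_XL"] = 3.5
  for (q in bpw) printf "%s: ~%.1f GB\n", q, params * bpw[q] / 8 / 1e9
}'
```

The real file usually comes out somewhat bigger than this estimate (embedding/output tensors are often kept at higher precision), and you still need headroom for KV cache on top, which is why a ~15GB quant doesn't comfortably fit in 16GB.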
I am considering trying Gemma 4 31B at Q3. Is it even worth it?
(Gemma 26ba4b has been good for chatting but sucked for other use cases like Reddit summaries, etc.)
Gesha24@reddit
IMO with this low amount of VRAM your only workable choice is to run a MoE model with offload to CPU and regular RAM. Thankfully, the Qwen3.6-35B model is quite solid and should run at usable speeds even if you split it between RAM and VRAM.
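A minimal llama.cpp launch for that kind of split might look like the sketch below. The model filename is illustrative, and `--n-cpu-moe N` (keep the MoE expert tensors of the first N layers in system RAM) is the relevant llama.cpp flag, so the right N depends on how much VRAM you actually have free:

```shell
# Sketch, not a tested config: put all layers on the GPU, but keep the
# expert weights of the first 20 layers in system RAM. Lower --n-cpu-moe
# until you run out of VRAM, then back off. Model path is illustrative.
llama-server \
  -m ./Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  -ngl 99 \
  --n-cpu-moe 20 \
  -c 32768
```

Since only the small set of active experts is needed per token, the expert weights parked in RAM hurt throughput far less than offloading dense layers would.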
DanielusGamer26@reddit
"low amount of VRAM" *cries in the corner with 16GB of VRAM*
Gesha24@reddit
I am hitting context limits with 32...
DanielusGamer26@reddit
Ouch
pop0ng@reddit
I'm using Qwen3.6-35B-A3B-UD-Q2_K_XL for opencode / openclaude. It's decent and fast.
quanhua92@reddit
Does the quality of the output decrease when using Q2 instead of Q4?
pop0ng@reddit
I haven't tried Q4, sorry
eugene20@reddit
Can you define fast? I was using qwen3.6-35b-a3b Q4_K_M and it's usually 60 tok/sec here, but it often spends a very long time thinking. I just tried qwen/qwen3.6-27b and got only 45 tok/sec. I'm using the recommended model settings; I expected a drop from Qwen Coder Next, but not quite this much.
lacerating_aura@reddit
3.6 A3B at Q8_K_XL with cpu-moe, full 262k bf16 context, and f32 mmproj uses something like 13GB. That's for usual tasks; it's not the fastest, but I still get an average of 500 tk/s pp and 20 tk/s tg at ~60k context. For shits and giggles I also run 3.5 A10B at a Q3 equivalent, but that drops to below 100 tk/s pp and below 10 tk/s tg at ~100k context. A3B maintains speed but is just dumber.
INT_21h@reddit
Main workhorse is 3.6-35B-A3B Q4_K_L with some CPU offload (--n-cpu-moe 15). I also keep 27B UD-Q4_K_XL around as a "big gun", but I'm offloading with it too, so it's pitifully slow. I've had enough bad luck with Q3s of older Qwen models that I now stick with Q4 and either eat the speed cost or fall back to the MoE.
Zealousideal_Fill285@reddit
Unfortunately the 31B won't fit in 16GB. I'm not sure if this is your goal, but you could also try Qwen 3.6 35B at Q4 quants and see if it's good for your needs. It doesn't fit fully, but thanks to its MoE architecture it's fast, since it activates only a few experts at once instead of all of them. Also, the Qwen 3.6 family is better at general creative tasks than 3.5.