Qwen 3.6 27B (IQ3XXS) vs 35B A3B (IQ4XS)?
Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 22 comments
I was just wondering what people feel is better. I need 262K context, so these are the biggest quants of each that I can fit on my 3090 with the KV cache at Q8. Both are the Unsloth quants.
Main use case is openclaw and openwebui. Currently have 27B loaded but I'll have to get home to try out IQ4XS 35B.
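For reference, the setup described above maps to roughly the following llama.cpp invocation (the model filename is a placeholder; LM Studio exposes the same knobs as GUI settings, and exact flag spellings can vary between builds):

llama-server -m Qwen3.6-27B-IQ3_XXS.gguf \
  -c 262144 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

-c sets the 262K context, -ngl 99 keeps every layer on the 3090, and the two --cache-type flags are the Q8 KV cache; depending on the build, flash attention may also need to be enabled before the V cache can be quantized.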
Joozio@reddit
Ran both on a 16GB M4 Mac Mini for an always-on agent loop. The 35B A3B at IQ4XS gave noticeably better tps than the 27B dense at the same memory footprint; MoE wins for me when the active params stay small. https://thoughts.jock.pl/p/almost-fried-ai-agent-mac-mini-mistakes-2026 It's a different story on a 3090 with 24GB and Q8 KV though, where the dense model fits more cleanly.
Prize_Negotiation66@reddit
Definitely the first. And Q8 cache isn't worth it; better a Q4 model with turboquant 2, 3, or 4 bit.
My_Unbiased_Opinion@reddit (OP)
I'm using LM Studio. Would setting the KV cache to Q4 and using a higher quant for the model be the better setup? I know there have been some recent KV cache optimizations, but in the past it was recommended not to run the KV cache at Q4 for tool calls.
superdariom@reddit
I've used both turbo3 and now q4_0 with qwen 3.6 with a lot of tool calling and technical work and have yet to see it get anything wrong. This is at 256k context
andy2na@reddit
Isn't q8 cache better than any turboquant cache?
Pablo_the_brave@reddit
You're missing that the model quant also limits the context. Any Q3 is worse than any Q4, and the KV cache setting doesn't change that. I did some work on running Qwen 3.6-27B maxed out at IQ4XS on 16GB VRAM, and IQ4XS was always better than the biggest Q3, even with turbo3.
My_Unbiased_Opinion@reddit (OP)
Do you think 35B A3B at iq4XS is better than 27B dense at iq3xxs? Both models at max context and KVcache at Q8 completely fill my VRAM.
hurdurdur7@reddit
Apples and oranges, I'm afraid. If I were you I would offload the expert layers of the 35B to CPU and run a higher quant than either of the ones specified here.
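For anyone who wants to try that route: in plain llama.cpp the usual trick is a tensor override that pushes the MoE expert weights to CPU while the attention and shared layers stay on the GPU. A rough sketch, with the filename and the Q5_K_M choice as placeholders (the exact regex depends on the tensor names in the GGUF):

llama-server -m Qwen3.6-35B-A3B-Q5_K_M.gguf \
  -c 262144 \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU" \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

The expert FFN tensors are most of the weights in an A3B-style MoE, so pushing them to system RAM frees a lot of VRAM for a higher quant and long context, at the cost of some tokens/s.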
My_Unbiased_Opinion@reddit (OP)
I can do a partial offload of the 35B and get like 40 t/s. But then the question is whether the 35B, even at Q8, can beat the 27B at IQ3XXS.
mateszhun@reddit
For my use cases, whenever I tried anything below Q4 it was useless.
hurdurdur7@reddit
I think you have to test that with your actual use case; don't trust the benchmaxxed results blindly.
ai_guy_nerd@reddit
The 35B A3B usually has a better "feel" for complex logic, but for OpenClaw and WebUI, the 27B is often a sweet spot for speed and context handling. If that 262K context is a hard requirement, check how the A3B handles long-range retrieval since those higher quants can sometimes drift.
The IQ4XS on the 35B will definitely be more stable for reasoning tasks. If the speed hit isn't too painful on the 3090, the 35B is typically the move.
Check if there are any newer GGUFs for the 27B that might fit your VRAM better, as some of the newer quants are surprisingly close in quality to the larger models.
LaurentPayot@reddit
Interesting benchmarks at https://kaitchup.substack.com/p/summary-of-qwen36-gguf-evals-updating
rootdood@reddit
I've been getting incredible agentic performance and quality from Unsloth's 35B A3B Q2_K_XL at full context with 40 layers offloaded to GPU, 20 to CPU, and Q8_0 KV on an RTX 5080 16GB.
I felt really hamstrung with the Q4_K_L because it would top out at 30 tps and dwindle down to sub-10 as the context grew. With the Q2 I'm getting over 60 tps, and it's night and day. We're actually interacting, and Xtreme coding.
I can get 140 tps with the IQ2_XXS but it would just always be going in circles and needed constant correction. "Ok, so is this how it's done in Unity?" "Oh, you're absolutely right, I'll go break a bunch of other stuff you didn't even ask about brrrrrr"
Definitely, if you want really good, fast local AI, you're spending $5000. I'm thinking 2x 48GB 4080/4090 or just swapping my 5080 for a 5090. But I know that 32GB is just going to be the next ceiling I wish I could break through when things get tight.
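In llama.cpp terms that split is just the GPU-layer count; a one-line sketch with the filename as a placeholder (LM Studio and other frontends expose the same slider):

llama-server -m Qwen3.6-35B-A3B-Q2_K_XL.gguf -ngl 40 -c 262144 -ctk q8_0 -ctv q8_0

-ngl 40 keeps 40 layers on the 5080 and leaves the remaining 20 on the CPU.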
FullstackSensei@reddit
262k on such a low quant with quantized kv will be pretty much shite. At least to the extent you care about the output.
hurdurdur7@reddit
I wouldn't want to rely on anything under Q4_K_M for precision tasks. If I'm doing coding I won't go below Q6_K. Wrong answers or broken tool calls waste time.
ea_man@reddit
27B if you want depth, A3B if you want speed (which you may want to try at lower quant too).
SosirisTseng@reddit
I currently use unsloth Qwen 3.6-27B (Q4_K_M) + Q8 KV cache on a 4090 with
--no-mmproj-offload --fit on --fit-target 400 --fit-ctx 131072. llama.cpp can fit a context size of 208,640. I believe you can fit more context with IQ4_XS.
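For reference, those flags slot into a llama-server invocation roughly like this (the model path is a placeholder; the rest is taken from the settings quoted above):

llama-server -m Qwen3.6-27B-Q4_K_M.gguf \
  --no-mmproj-offload \
  --fit on --fit-target 400 --fit-ctx 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0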
spvn@reddit
You don't need to squeeze the entire 35B A3B into VRAM; you can use a larger quantisation and offload some of it to system RAM. You can look into using ik_llama as well.
For the 27B I wouldn't go below Q4. I tried Qwen3.5 Q3 27B once and thought it was really stupid for coding. I can't do the math, but you can use q8_0 for the K cache and turbo3 for the V cache, and that should save you a ton of space in terms of context size. I was using TheTom's turboquant fork. Maybe you can try squeezing 262K context with at least a Q4 quant.
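The turbo cache types are specific to that fork, but the same mixed-precision idea exists in stock llama.cpp, with q4_0 standing in here for the fork's turbo V-cache types (filename is a placeholder):

llama-server -m Qwen3.6-27B-IQ4_XS.gguf \
  -c 262144 \
  --cache-type-k q8_0 \
  --cache-type-v q4_0

The K cache is generally more sensitive to quantization than the V cache, which is why the split is asymmetric; on stock llama.cpp, quantizing the V cache usually also requires flash attention to be enabled.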
bighead96@reddit
I tried both and preferred the 35B A3B plus it was faster!