Choice for agentic LLM or help optimize Qwen3.5-35B-A3B for 24GB VRAM
Posted by marivesel@reddit | LocalLLaMA | View on Reddit | 18 comments
RTX3090 24GB VRAM, WSL install of Ollama latest and Hermes Agent latest.
First I tried Gemma4:31B - so slow!
Then Gemma4:26B MoE - fast, but it kept making the same mistakes for days.
Then I found Qwen3.5-35B-A3B Q4_K_M here on Reddit and OH BOY, IT'S GORGEOUS! It does exactly what I want. But... it's rather slow! Then I noticed the file itself is 23GB, and with the 32K context I gave it, it overfills my VRAM by more than 1.5GB (and my RAM is DDR4 ECC, slow).
Question is - can I somehow optimize to fit the whole model in my VRAM with 16K/32K context, or should I try a lower-quality model? If so, which would you suggest?
I like the speed and quality of MoE models. I'm not writing super complex stuff, just some automations and help around my business with regular tasks.
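A quick back-of-envelope check makes the overflow concrete. The layer/head numbers below are placeholders, not the real Qwen3.5-35B-A3B architecture; llama.cpp prints the actual values from the GGUF metadata at load time:

```python
# Rough VRAM fit check: model file size + KV cache + overhead vs. 24 GiB.
# Architecture numbers are placeholders; substitute the real ones from the
# GGUF metadata that llama.cpp prints at startup.

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store ctx * n_kv_heads * head_dim values per layer.
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**3

model_gib = 23.0      # the Q4_K_M file size from the post
overhead_gib = 1.0    # compute buffers, CUDA context, etc. (rough guess)

for ctx in (16_384, 32_768):
    # hypothetical: 48 layers, 4 KV heads, head_dim 128, fp16 cache
    kv = kv_cache_gib(ctx, 48, 4, 128, 2)
    total = model_gib + kv + overhead_gib
    print(f"ctx={ctx}: KV={kv:.2f} GiB, total={total:.2f} GiB, "
          f"{'fits' if total <= 24 else 'spills'} in 24 GiB")
```

Even with made-up layer counts, the shape of the problem is visible: a 23GB file leaves almost nothing for KV cache on a 24GB card, so either the weights or the cache must shrink.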
marivesel@reddit (OP)
Thanks for all the suggestions. I've tried Qwen3.5:27b and it is really smart at agentic tasks, so I will probably stick with it (and in the future install a pure Linux distro on my server so it runs smoothly). It may be slower than the MoE, but for now I don't mind.
Another question: my RTX3090 runs alongside an AMD EPYC 64-core CPU and 8x32GB (256GB total) of DDR4 ECC RAM. Could I somehow use that RAM to help, without slowing down agentic tasks too much?
Currently I'm fitting Qwen3.5:27b with 16K context and a q4 cache, but a bit more context would feel better.
cmndr_spanky mentioned using the llama.cpp server instead of Ollama and hand-picking which layers get GPU priority - how could I make use of that?
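For anyone landing here: the simplest form of that is llama.cpp's `--n-gpu-layers` flag, which controls how many layers are offloaded to the GPU, with the rest staying in system RAM. A hedged sketch; the model path, layer count, and context size are placeholders for your setup:

```shell
# Partial offload sketch: put 40 layers on the GPU, leave the rest in RAM.
# Model path and numbers are placeholders; tune the layer count until
# llama-server stops reporting out-of-memory at load time.
llama-server -m /models/Qwen3.5-27B-Q4_K_M.gguf \
    --n-gpu-layers 40 \
    --ctx-size 16384 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0
```

The trade-off is that any layer left on the CPU runs at DDR4 bandwidth, so fewer offloaded layers means slower decoding.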
Objective-Stranger99@reddit
Try the UD IQ4 XS variant. The file is 17 GB, but it matches Q4_K_M in perplexity and most benchmarks. You can easily reach a context length of 192K.
DinoZavr@reddit
Qwen3.5-35B-A3B has only 3B active parameters and you can definitely run it in a Q8_0 quant (I have a 16GB GPU and run it in Q6_K for more context).
The catch is that the dense Qwen3.5-27B model is noticeably superior to the MoE in question, as you get all 27B parameters working, not just 3B. The largest 27B quant that fits my 16GB VRAM entirely is iQ2_M, but at that low a quant the output quality is not stellar, so I run an iQ4 quant at half the speed and get excellent results.
With 24GB you can definitely run Q4 (it would fit your VRAM with space left for 8K..16K of context; I don't know exactly how much, as I don't have 24GB),
so my suggestion is also to try the Qwen3.5-27B model with 16K context. It would probably run fine and fast in Q4_K_S quantization (or better).
SimilarWarthog8393@reddit
If you're willing to use pure Linux instead of WSL that would already likely improve performance. Also try using vanilla llama.cpp or ik_llama.cpp which you can manually tune to maximize performance further.
Tagedieb@reddit
I am running the 27B version of Qwen3.5 on my 3090. I set the context size to the limit by trial and error, so that nothing offloads.
Sevealin_@reddit
What kind of context window are you able to achieve? With q8 kv cache, I was only able to squeeze in 8192 on my single 3090.
Tagedieb@reddit
Currently running: llama-server -m /models/Qwen3.5-27B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99 --ctx-size 262144 --parallel 1 --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --no-warmup
I am experimenting with 4-bit k/v. Maybe I will go back to 8-bit with a smaller context window to compare. The GPU is in a headless machine, so not a lot of memory is reserved for graphics.
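The 4-bit vs. 8-bit trade-off is mostly arithmetic on ggml's cache block sizes (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes). A rough comparison; the layer/head defaults below are placeholders, so read the real values from the GGUF metadata that llama.cpp prints at load time:

```python
# Per-element KV-cache cost for llama.cpp cache types, from ggml block
# layouts: q8_0 = 34 bytes per 32 elems, q4_0 = 18 bytes per 32 elems.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_gib(ctx, cache_type, n_layers=48, n_kv_heads=4, head_dim=128):
    # Placeholder architecture; K and V each hold
    # ctx * n_kv_heads * head_dim elements per layer.
    elems = 2 * n_layers * ctx * n_kv_heads * head_dim
    return elems * BYTES_PER_ELEM[cache_type] / 1024**3

for ct in ("f16", "q8_0", "q4_0"):
    print(f"{ct}: {kv_gib(262_144, ct):.2f} GiB at 256K context")
```

Whatever the exact architecture numbers, q4_0 costs a bit over a quarter of f16 per token, which is why the 262144-token context above becomes feasible at all.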
RipperFox@reddit
There goes your VRAM.. btw: You can ASK your Qwen3.5 about these terms :)
Sevealin_@reddit
Ah yeah, I wasn't doing anything crazy but I didn't notice a huge difference between q4 and q8 kv.
Acrobatic_Bee_6660@reddit
Qwen3.5-27B dense is what you want. It's better than the 35B-A3B MoE for agentic work despite fewer total parameters: the full 27B active params give you stronger reasoning than the 3B active in the MoE variant. Multiple sources confirm this, and I've been running it daily on a 7900 XTX.
Q4_K_M is ~17 GB, which leaves plenty of room for 32K context fully on GPU. Speed will be significantly better than the 35B MoE because you're decoding through 27B dense instead of routing through a 35B sparse architecture with overhead.
Re: Gemma 4, I've tested both variants extensively with TurboQuant KV cache compression.
For your use case (automations, business tasks, Hermes Agent), Qwen3.5-27B dense is the sweet spot on a 3090. Fast, fits with room for context, and the quality is genuinely a step above everything else in this VRAM class.
cmndr_spanky@reddit
I’ve experimented with this deeply, and you’re straight up wrong, sorry.
If you switch to the llama.cpp server instead of Ollama and hand-pick which layers get GPU priority, the 35B MoE model on his hardware will be much faster than the 27B dense model, and depending on system RAM it will fit 65K context easily. And it will be more or less as good for simple agentic use cases.
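The usual way to do that hand-picking in llama.cpp is the `--override-tensor` (`-ot`) flag: keep attention and shared weights on the GPU, and route the large per-expert FFN tensors to system RAM. A sketch only; the model path, context size, and the tensor-name regex are assumptions you should verify against the tensor names your model actually loads with:

```shell
# MoE-friendly split (sketch): everything on the GPU except the big
# per-expert FFN weights, which --override-tensor sends to CPU/RAM.
# The regex targets the "*_exps" expert tensors common in llama.cpp's
# MoE naming; confirm against your model's load log before relying on it.
llama-server -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 65536 \
    --flash-attn on \
    --override-tensor "\.ffn_.*_exps\.=CPU"
```

This works well for MoE models specifically because only a few experts are active per token, so the RAM-resident tensors are touched far less often than the GPU-resident ones.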
Acrobatic_Bee_6660@reddit
You’re right on speed; I misspoke. The 35B-A3B MoE with 3B active params will decode much faster than a 27B dense model.
My point was about output quality for agentic tasks, not raw throughput. In my experience, the 27B dense model gives more reliable reasoning and needs fewer retries on multi-step tasks. For simpler automations, the 35B-A3B MoE is absolutely a valid choice.
So I’d frame it as: MoE wins on speed, dense 27B can still win on reliability.
marivesel@reddit (OP)
I'm downloading the 27B just to check how it works, but can you show me more, or point me to where to read, about the llama.cpp settings? I have plenty of RAM (256GB, it's an EPYC server), but it gets slow once something spills into RAM.
DrAlexander@reddit
You say the 27b will be faster than the 35b MoE. Maybe I'm doing something wrong, but on a 3090 I get 35 tk/s with the 27b and 130 tk/s with the 35b a3b. Could you share your config for the 27b?
cmndr_spanky@reddit
He’s straight up wrong, that’s why. See my reply to him.
audioen@reddit
You'll still have the 27B as an option. It is much better despite fewer total parameters, and it will almost certainly infer way faster on GPU hardware with real memory bandwidth figures.
cmndr_spanky@reddit
The 27b will be even slower as a fully dense model (even though it fits in VRAM)
hejwoqpdlxn@reddit
Try dropping to 16K first; that alone might sort it out without changing anything else. If you still want more VRAM headroom, Q3_K_M of the same model brings it down to around 17-18GB. On a 35B MoE the quality difference between Q4 and Q3 is pretty small, especially for automation and business tasks rather than complex reasoning. Staying on the same MoE architecture makes sense given you already found it runs how you want; it just needs to fit better. I built a small tool called willitrun that checks this stuff before you download anything; it shows what fits at each quantization level for your exact GPU.
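The size math behind that suggestion is easy to sanity-check by hand. The bits-per-weight figures below are approximate averages (K-quants mix block types within a file), so treat the results as ballpark estimates rather than exact file sizes:

```python
# Back-of-envelope GGUF file size at different quant levels.
# Effective bits-per-weight values are approximate averages per quant
# family, not exact, so expect a few percent of error vs. real files.
BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85, "Q3_K_M": 3.91}

def file_gib(total_params_billions, quant):
    return total_params_billions * 1e9 * BPW[quant] / 8 / 1024**3

for q in ("Q4_K_M", "Q3_K_M"):
    print(f"35B at {q}: ~{file_gib(35, q):.1f} GiB")
```

For a 35B model this lands in the high-teens of GiB at Q3_K_M, consistent with the 17-18GB figure above, and shows why one quant step down can be the difference between spilling and fitting on a 24GB card.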