Struggling with Qwen3.6 27B / 35B locally (3090): slow responses, breaking code. Looking for a better setup + auto model switching
Posted by Clean_Initial_9618@reddit | LocalLLaMA | 25 comments
Hey everyone,
I’ve been experimenting with running Qwen models locally on my setup:
GPU: RTX 3090 (24GB VRAM)
RAM: 64GB
CPU: Ryzen 5700X
OS: Windows 11
What I’m currently running
Qwen 3.6 35B (UD Q4_K_M)
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
Qwen 3.6 27B (UD Q4_K_XL)
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
My use case
- Hermes agent (on Raspberry Pi 5) → Reddit scraping, job scraping, basic automation
- Local coding (OpenCode / QwenCode) → small scripts, debugging, patching
- Occasional infra setup via prompts
Issues I’m facing
- 35B is too slow
  - Even simple tasks take way too long to respond. Feels unusable for anything iterative.
- 27B is faster but unreliable
  - Code often breaks
  - Sometimes takes 20–30 minutes even for simple tasks
What I’m looking for
- Better model + quant recommendations
  - Something that actually works well on a 3090
  - Good balance between speed and coding reliability
- Ways to improve throughput (t/s)
  - Are my flags bad?
  - Context size too high?
  - Anything obvious I’m missing?
- Auto model loading / routing. Right now I have to:
  - Kill the server
  - Paste a new command
  - Reload the model
- Is there a way to:
  - Auto-switch models based on the request?
  - Or keep multiple models warm and route between them?
What’s your stack?
Thanks in advance for any suggestions or help. Really appreciate it.
ConsciousEar877@reddit
Can you post your actual processing times? Then we'll know whether it's just normal or really slow; "slow" is relative. I wiped my PC clean and installed Ubuntu; right off the bat you get rid of bloatware and other memory hogs. Then I would gather your machine's technical specs, feed them to an AI model, and choose the right Qwen 3.6 variant (Qwen 3.6 is the best local LLM at the time of writing; it comes in different quants, and I'd prefer the IQ quants). Then keep optimizing by chatting with an AI model. Good luck.
Clean_Initial_9618@reddit (OP)
It starts at around 100 t/s but falls to 20 t/s over time when I use Qwen3.6 35B with the Hermes agent.
Jester14@reddit
Just use -fit and stop guessing.
Clean_Initial_9618@reddit (OP)
What flags should I remove and just use -fit ?
TheTerrasque@reddit
Fit is on by default if you don't set -ngl and don't set context. It will fit as many layers as it can, going down to a (default) 4k context if needed; then, once all the layers are tucked in, it'll spend the rest of the VRAM on context.
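In practice that means the launch line can be much shorter. A minimal sketch, reusing the OP's 35B path and leaving layer count and context to the automatic fit described above (everything beyond these flags is optional):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" --jinja --port 8081 --host 0.0.0.0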
L0ren_B@reddit
https://github.com/noonghunna/club-3090 is all you need. Good luck.
PreparationTrue9138@reddit
Hi, I tried it on a dual RTX 3090 setup and it felt really slow. My own llama.cpp setup is a lot faster for 27B, at least in terms of prompt processing speed: I get 1000–1600 t/s prompt processing for 27B, and club-3090 was unusable.
TheTerrasque@reddit
These are my settings:
I let fit figure out the context number, but if you want to set it statically, probably around 100k; it depends on how much VRAM Windows takes. This is on Linux and a P40, but it should be fairly similar.
Two options:
GrungeWerX@reddit
Drop your KV cache to q8, drop both contexts to 100K, max out your CPU's thread count, and use full GPU offload. Instant speed increase.
Clean_Initial_9618@reddit (OP)
How do I make sure I have full GPU offload and max CPU threads?
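For what it's worth, those knobs map to flags that are already in the commands at the top of the post. A trimmed sketch for the 35B (the 102400 context and the thread count of 8, matching the 5700X's physical cores, are example values to tune):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 102400 -ctk q8_0 -ctv q8_0 -t 8 -fa on --jinja --port 8081 --host 0.0.0.0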
pepedombo@reddit
On Windows, when I terminate llama.cpp it outputs usage stats, so paste those to GPT, or just paste the full log.
Same with the auto-loader: ChatGPT will write a simple auto-loader once you find the proper llama-server syntax 😄
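A minimal sketch of that kind of auto-loader as a Windows batch file (the swap-model.bat name and the fixed port are hypothetical; it just kills the running server and relaunches it with whatever GGUF path you pass in):
:: swap-model.bat  (usage: swap-model.bat "C:\path\to\model.gguf")
:: Kill any running llama-server, then start a fresh one on the same port
:: with the model given as the first argument.
taskkill /IM llama-server.exe /F
start "" llama-server.exe -m "%~1" --jinja --port 8081 --host 0.0.0.0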
GrungeWerX@reddit
I don't use llama-server, so just ask claude or gemini.
Fedor_Doc@reddit
Warning: they will spit out a bunch of deprecated options, or options that are already on by default.
I always ask LLMs to read the latest documentation for llama.cpp and for the model itself, use search, and then explain each setting.
Otherwise it's pointless.
ego100trique@reddit
For 27B you might want to try a Q4 quant of the model.
lerg96@reddit
The config you're using needs more than 24GB of VRAM for the 35B. Windows by default doesn't throw an OOM error; it loads the excess into your RAM, but in a way that's inefficient for the MoE.
Try --n-cpu-moe 34 and check the VRAM usage in Task Manager; you don't want to exceed your dedicated memory, so that it doesn't fall back to the "shared memory".
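For example, a sketch based on the OP's 35B command (the 34 comes from the suggestion above and the smaller context is just a starting point; both need tuning against what Task Manager reports):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 --n-cpu-moe 34 -c 32768 -fa on --jinja --port 8081 --host 0.0.0.0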
Fedor_Doc@reddit
Reduce context and get rid of --reasoning-budget and --reasoning-format.
The reasoning budget just stops token generation abruptly. It harmed model performance a lot in my testing, even with the message.
Test after each setting change on a simple prompt / workflow. Less is more.
You can find the official temp / top-k settings in the model card on Hugging Face.
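Those sampler settings go in as plain flags. A sketch using the 27B path from the post; the values here are placeholders, so take the real ones from the model card:
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" --temp 0.7 --top-k 20 --top-p 0.8 --jinja --port 8081 --host 0.0.0.0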
grumd@reddit
Are you just copy-pasting commands from somewhere? "--reasoning-format deepseek" with Qwen models is crazy.
Clean_Initial_9618@reddit (OP)
I used Claude Code to run and tune it until it found the best settings, but apparently they aren't the best.
Steus_au@reddit
I couldn't make them work; they always leave me disappointed, so I'm back on 3.5-122b, hoping 3.6 will be released soon.
a-babaka@reddit
Works stably at ~40 t/s.
cicoles@reddit
The 3090 does not have FP8 or FP4 support, so running those models on it gains nothing, sadly. It was one of the main reasons I sold my 3090s.
jacek2023@reddit
Start with a smaller context; I have problems with big contexts on Qwen on 3x3090.
NNN_Throwaway2@reddit
KV cache quantization is a sure way to degrade output quality.
You are probably trying to load too much context.
In general, fewer flags are better. Test, and add flags one at a time if you need them. That makes it easier to troubleshoot without needing to run to Reddit.
ImportancePitiful795@reddit
You are using -ctk f16 -ctv f16. Of course the 3090 will choke; it doesn't have the VRAM to hold the model and all that KV cache.
Try -ctk q8_0 -ctv q8_0
Thomasedv@reddit
35B is slow? It's a MoE; it should be going at around 130 t/s at the start and drop to 90 t/s as you get close to max context. I'm also using a 3090.
I use the 35B with Q to fit the visual model, and no quantization of the KV cache, since that hit speed a lot. Try reducing your parameters, like all the batch sizes, unless you've measured and tested them.
Llama-switch to auto-change models; I've never used it myself.
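Assuming that refers to the llama-swap project (an assumption; the commenter doesn't name it exactly), it sits in front of llama-server as a single OpenAI-compatible endpoint and starts or stops the right backend based on the "model" field of each request. A rough sketch of its YAML config wrapping the OP's two commands; the field names are from memory, so check the project's README:
models:
  "qwen3.6-35b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" --jinja --port ${PORT}
  "qwen3.6-27b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" --jinja --port ${PORT}
Clients keep pointing at llama-swap's one port; changing the model name in the request is what triggers the swap.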