Struggling with Qwen3.6 27B / 35B locally (3090): slow responses, breaking code. Looking for a better setup + auto model switching
Posted by Clean_Initial_9618@reddit | LocalLLaMA | 25 comments
Hey everyone,
I’ve been experimenting with running Qwen models locally on my setup:
GPU: RTX 3090 (24GB VRAM)
RAM: 64GB
CPU: Ryzen 5700X
OS: Windows 11
What I’m currently running
Qwen 3.6 35B (UD Q4_K_M)
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
Qwen 3.6 27B (UD Q4_K_XL)
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
My use case
- Hermes agent (on Raspberry Pi 5) → Reddit scraping, job scraping, basic automation
- Local coding (OpenCode / QwenCode) → small scripts, debugging, patching
- Occasional infra setup via prompts
Issues I’m facing
- 35B is too slow
  - Even simple tasks take way too long to respond. Feels unusable for anything iterative.
- 27B is faster but unreliable
  - Code often breaks
  - Sometimes takes 20–30 minutes even for simple tasks
What I’m looking for
- Better model + quant recommendations
  - Something that actually works well on a 3090
  - Good balance between speed and coding reliability
- Ways to improve throughput (t/s)
  - Are my flags bad?
  - Context size too high?
  - Anything obvious I’m missing?
- Auto model loading / routing. Right now I have to:
  - Kill the server
  - Paste a new command
  - Reload the model
- Is there a way to:
  - Auto-switch models based on the request?
  - Or keep multiple models warm and route between them?
What’s your stack?
Thanks in advance for any suggestions or help. Really appreciate it.
ConsciousEar877@reddit
Can you post your actual processing times? Then we'll know whether it's just normal or really slow; "slow" is relative. I wiped my PC clean and installed Ubuntu; right off the bat you get rid of bloatware and other memory hogs. Then I would gather your machine's technical specs, feed them to an AI model, and choose the right Qwen 3.6 variant (Qwen 3.6 is the best local LLM at the time of writing; it comes in different quants, and I'd prefer the IQ quants). Then keep optimizing by chatting with an AI model. Good luck.
Clean_Initial_9618@reddit (OP)
It starts at around 100 t/s but falls to 20 t/s over time when I use Qwen3.6 35B with the Hermes agent.
Jester14@reddit
Just use -fit and stop guessing.
Clean_Initial_9618@reddit (OP)
What flags should I remove and just use -fit ?
TheTerrasque@reddit
Fit is on by default if you don't set -ngl and don't set context. It will fit as many layers as it can, going down to a (default) 4k context if needed; then, once all the layers are tucked in, it'll spend the rest of the VRAM on context.
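In practice that means the launch line can be much shorter. A minimal sketch, reusing the OP's 35B path and leaving layer count and context to the automatic fit described above (everything beyond these flags is optional):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" --jinja --port 8081 --host 0.0.0.0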
L0ren_B@reddit
https://github.com/noonghunna/club-3090 is all you need. Good luck.
PreparationTrue9138@reddit
Hi, I tried it on a dual RTX 3090 setup and it felt really slow. My own llama.cpp setup is a lot faster for 27B, at least in terms of prompt processing speed: I get 1000–1600 t/s prompt processing for 27B, and club-3090 was unusable.
TheTerrasque@reddit
These are my settings:
I let fit figure out the context number, but if you want to set it statically, probably around 100k; it depends on how much VRAM Windows takes. This is on Linux and a P40, but it should be fairly similar.
Two options:
GrungeWerX@reddit
Drop your KV cache to q8, drop both contexts to 100K, max out your CPU's thread count, and use full GPU offload. Instant speed increase.
Clean_Initial_9618@reddit (OP)
How do I make sure I have full GPU offload and max CPU threads?
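For what it's worth, those knobs map to flags that are already in the commands at the top of the post. A trimmed sketch for the 35B (the 102400 context and the thread count of 8, matching the 5700X's physical cores, are example values to tune):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 102400 -ctk q8_0 -ctv q8_0 -t 8 -fa on --jinja --port 8081 --host 0.0.0.0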
pepedombo@reddit
On Windows, when I terminate llama.cpp it outputs usage stats, so paste those to GPT, or just paste the full log.
Same with the auto-loader: ChatGPT will write a simple auto-loader once you find the proper llama-server syntax 😄
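A minimal sketch of that kind of auto-loader as a Windows batch file (the swap-model.bat name and the fixed port are hypothetical; it just kills the running server and relaunches it with whatever GGUF path you pass in):
:: swap-model.bat  (usage: swap-model.bat "C:\path\to\model.gguf")
:: Kill any running llama-server, then start a fresh one on the same port
:: with the model given as the first argument.
taskkill /IM llama-server.exe /F
start "" llama-server.exe -m "%~1" --jinja --port 8081 --host 0.0.0.0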
GrungeWerX@reddit
I don't use llama-server, so just ask claude or gemini.
Fedor_Doc@reddit
Warning: they will spit out a bunch of deprecated options, or options that are already on by default.
I always ask LLMs to read the latest documentation for llama.cpp and for the model itself, use search, and then explain each setting.
Otherwise it's pointless.
ego100trique@reddit
For 27B you might want to try a Q4 quant of the model.
lerg96@reddit
The config you're using needs more than 24GB of VRAM for the 35B. Windows by default doesn't throw an OOM error; it loads the excess into your RAM, but in a way that's inefficient for the MoE.
Try --n-cpu-moe 34 and check the VRAM usage in Task Manager; you don't want to exceed your dedicated memory, so that it doesn't fall back to the "shared memory".
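For example, a sketch based on the OP's 35B command (the 34 comes from the suggestion above and the smaller context is just a starting point; both need tuning against what Task Manager reports):
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 --n-cpu-moe 34 -c 32768 -fa on --jinja --port 8081 --host 0.0.0.0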
Fedor_Doc@reddit
Reduce context and get rid of --reasoning-budget and --reasoning-format.
The reasoning budget just stops token generation abruptly. It harmed model performance a lot in my testing, even with the message.
Test after each setting change on a simple prompt / workflow. Less is more.
You can find the official temp / top-k settings in the model card on Hugging Face.
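Those sampler settings go in as plain flags. A sketch using the 27B path from the post; the values here are placeholders, so take the real ones from the model card:
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" --temp 0.7 --top-k 20 --top-p 0.8 --jinja --port 8081 --host 0.0.0.0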
grumd@reddit
Are you just copy-pasting commands from somewhere? "--reasoning-format deepseek" with Qwen models is crazy.
Clean_Initial_9618@reddit (OP)
I used Claude Code to run and tune it until it found the best settings, but apparently they aren't the best.
Steus_au@reddit
I couldn't make them work; they always leave me disappointed, so I'm back on 3.5-122b, hoping 3.6 will be released soon.
a-babaka@reddit
Works stably at ~40 t/s.
cicoles@reddit
The 3090 does not have FP8 or FP4 support, so running those models on it gains nothing, sadly. It was one of the main reasons I sold my 3090s.
jacek2023@reddit
Start with a smaller context; I have problems with big contexts on Qwen on 3x3090.
NNN_Throwaway2@reddit
KV cache quantization is a sure way to degrade output quality.
You are probably trying to load too much context.
In general, fewer flags are better. Test, and add flags one at a time if you need them. That makes it easier to troubleshoot without needing to run to Reddit.
ImportancePitiful795@reddit
You are using -ctk f16 -ctv f16. Of course the 3090 will choke; it doesn't have the VRAM to hold the model and all that KV cache.
Try -ctk q8_0 -ctv q8_0
Thomasedv@reddit
35B is slow? It's a MoE; it should be going at around 130 t/s at the start and drop to 90 t/s as you get close to max context. I'm also using a 3090.
I use the 35B with Q to fit the visual model, and no quantization of the KV cache, since that hit speed a lot. Try reducing your parameters, like all the batch sizes, unless you've measured and tested them.
Llama-switch to auto-change models; I've never used it myself.
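Assuming that refers to the llama-swap project (an assumption; the commenter doesn't name it exactly), it sits in front of llama-server as a single OpenAI-compatible endpoint and starts or stops the right backend based on the "model" field of each request. A rough sketch of its YAML config wrapping the OP's two commands; the field names are from memory, so check the project's README:
models:
  "qwen3.6-35b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" --jinja --port ${PORT}
  "qwen3.6-27b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" --jinja --port ${PORT}
Clients keep pointing at llama-swap's one port; changing the model name in the request is what triggers the swap.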