RTX 3090 + 27B model performance issues (llama.cpp) what am I doing wrong
Posted by Clean_Initial_9618@reddit | LocalLLaMA | 30 comments
Hey folks — looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows).
Current setup:
- GPU: RTX 3090 (24GB VRAM)
- RAM: 64GB
- Using llama.cpp with a Qwen3.6 27B Q6 model (GGUF)
- Running through OpenCode
Issue:
Responses are really slow, and sometimes it just starts producing errors or low-quality output. Feels like something’s not tuned right or I’m pushing the hardware too far.
Current command:
llama-server.exe -m "C:models/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q6_K.gguf" -ngl 99 -c 65536 -np 1 -fa 1 -ctk q8_0 -ctv q8_0 -b 1024 -ub 256 -t 16 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --reasoning on --host 0.0.0.0 --port 8080 --metrics --slots --props
What I’m trying to figure out:
- Are any of these flags hurting performance?
- Is Q6 just too heavy for a 3090? Would Q4/Q5 be a better balance?
- Better batching / threading / context settings I should try?
- Anything obvious I’m missing with llama.cpp tuning?
Also curious about:
I’m trying to get into more agentic coding workflows locally (multi-step reasoning, tool use, etc.).
- Any good setups, frameworks, or patterns that work well with llama.cpp?
- How are you guys structuring prompts / tools / memory for coding agents?
- Any lightweight harnesses or repos worth checking out?
Would really appreciate any tips, configs, or examples from people running similar hardware. Thanks in advance for all your advice and help.
bigh-aus@reddit
This is what I use on a 3090, if it helps. Works pretty well, but I don't use it with OpenCode.
Important_Quote_1180@reddit
27B Local Inference on Single RTX 3090
qwen3.6-27B-AutoRound (INT4), vLLM 0.19.2rc1.dev21, 24GB VRAM. 71–83 tok/s after warmup.
• Turboquant 3-bit NC KV Cache: Compresses KV state to 3-bit non-uniform quantization. Enables 125K context window within 24GB VRAM without OOM.
• MTP n=3 Speculative Decoding: Three auxiliary heads draft tokens per forward pass, verified atomically against main head. ~3× throughput multiplier vs. non-speculative baselines.
• Cudagraph PIECEWISE Mode: Captures only attention-op boundaries instead of full-graph replay. Eliminates degenerate repetition loops caused by stale MTP state in FULL_AND_PIECEWISE mode on multi-GPU hosts.
• Chunked Prefill + Prefix Caching: max-num-batched-tokens=4121 with max-num-seqs=1. First post-restart request incurs ~29s cudagraph compilation; subsequent requests stabilize at 12–14s for 1024-token generation.
Clean_Initial_9618@reddit (OP)
Sorry, can you explain just a little more please?
Important_Quote_1180@reddit
Here are the tricks that make this actually work instead of choking at 12 tokens a second:
Turboquant 3-Bit KV Cache. Nobody talks about this. Instead of stashing every attention computation in full 16-bit precision — the safe, default choice — we compress the KV cache to 3-bit. It's like storing your recipes on cocktail napkins instead of index cards. You think you'll lose something. You don't. On a 3090 with 24GB, this is the difference between "fits" and "OOM-killed by your own ambition."
MTP — Multi-Token Prediction. vLLM speculates the next three tokens using auxiliary heads, then verifies them in one pass against the main model. It's like hiring three sous-chefs to prep while you plate. When it works — and this is the crucial part — you get roughly triple the throughput. We're seeing 71 to 83 tokens per second. Not "usable." Fast.
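If the draft-and-verify part sounds like magic, here's a toy Python sketch of the core loop. main_model and draft_next are made-up stand-ins, not vLLM's API; the point is just to show why drafting k=3 tokens can yield up to four tokens per main-model pass:

import random
random.seed(0)

VOCAB = list("abcdefgh")

def main_model(ctx):
    # Stand-in for the full model: a deterministic "next token" given context.
    return VOCAB[hash(tuple(ctx)) % len(VOCAB)]

def draft_next(ctx):
    # Stand-in for a cheap MTP head: usually agrees with the main model.
    return main_model(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx, k=3):
    # 1) Draft k tokens cheaply with the auxiliary head.
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    # 2) Verify the drafts against the main model (in the real thing this is
    #    one batched forward pass). Keep the longest agreeing prefix, then
    #    append one guaranteed-correct main-model token.
    accepted = []
    for tok in draft:
        if main_model(ctx + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(main_model(ctx + accepted))
    return accepted

ctx = []
for _ in range(5):
    out = speculative_step(ctx)
    print(f"this pass produced {len(out)} token(s): {out}")
    ctx += out

Each pass costs roughly one main-model forward but can emit 1 to 4 tokens, which is where the ~3x multiplier comes from when the drafts mostly agree.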
Ashthot@reddit
Do you use llama.cpp with the spiritual buuns fork?
Clean_Initial_9618@reddit (OP)
I am currently using tom's fork of llama.cpp. Would vLLM be better?
Colie286@reddit
I am running Q4 with 100k context on my 3090. It consumes about 23GB VRAM. Q6 could be too large.
Ashthot@reddit
What are your flags for llama.cpp? Do you also use the spiritual buuns fork with dflash? I've got a 3090 and would be happy to see your config. Thanks
Clean_Initial_9618@reddit (OP)
How's Q4 performing for coding?
Colie286@reddit
Pretty good, in my opinion. I can also get up to 180k context with the KV cache at q8.
Great_Guidance_8448@reddit
180k context fits into 24 gig VRAM?
Clean_Initial_9618@reddit (OP)
What agentic harness are you using for coding? And how do you prompt the local model to make code changes without too many errors?
Colie286@reddit
Basically, my setup looks like this:
Everything runs in Docker: Qwen 3.6 27B UD Q4, Open WebUI, and other tools such as Whisper ASR, Docling, and ComfyUI. For coding tasks or anything that requires a larger context window, I have to disable some of the other services because otherwise there isn’t enough VRAM available.
I primarily use OpenClaude in VS Code (not as a plugin, but through the terminal) and the results have been pretty good.
pulse77@reddit
If you would like to run Qwen3.6-27B with --cache-type-k q8_0 and --cache-type-v q8_0 completely in 24GB VRAM, the Q6 weights plus a 64K cache simply won't both fit. So you need to lower the context size to 32K, or use a 5-bit quantization to get 64K context...
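For anyone who wants to redo that arithmetic, a rough sketch in Python. The architecture numbers (layers, KV heads, head dim) are illustrative guesses, not the model's real config; read the real values from the GGUF metadata. The weight sizes are the ones quoted elsewhere in this thread:

GiB = 1024**3

def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=4, head_dim=128,
                 bytes_per_elem=1.0):  # q8_0 is roughly 1 byte per element
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / GiB

# Weight sizes (GiB) as quoted in this thread: ~21 for Q6_K, ~2.5 less for Q5.
for quant, weights in [("Q6_K", 21.0), ("Q5", 18.5)]:
    for ctx in (32_768, 65_536):
        total = weights + kv_cache_gib(ctx)
        verdict = "fits" if total <= 23.0 else "too tight"  # leave ~1 GiB for CUDA overhead
        print(f"{quant} + {ctx // 1024}K q8_0 cache: ~{total:.1f} GiB -> {verdict}")

Under these assumptions Q6 only fits at 32K, while Q5 leaves room for a 64K q8_0 cache, which is the trade-off described above.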
GoodTip7897@reddit
I have 130k context at q8_0 with Qwen 3.6 27B at UD-Q5_K_XL on a single 7900xtx (stable over full context prefill and has 37 t/sec tg128).
So you can definitely push a bit above 96K
Clean_Initial_9618@reddit (OP)
Would UD-Q5_K_XL be decent at coding? I thought Q6 was better, so I was trying to work with that.
pulse77@reddit
It is decent at coding!
Another decent coding option is Qwen3.6 35B A3B at UD-Q8_K_XL with 256K context. It will run at about the same speed as Qwen3.6 27B at UD-Q5_K_XL, as some work will be offloaded to the CPU/RAM.
Pablo_the_brave@reddit
First of all, for such a low ctx even q4_0 will be OK. I would go for a lighter imatrix model and stay with q8_0 at 128k context.
Anbeeld@reddit
Do you insist on llama.cpp? Folks get massively better performance in vLLM because it supports MTP.
tmvr@reddit
You run out of VRAM; that is why it's slow. The weights alone are 21GB. Try one of the lower quants: switching to Q5_K_XL will free up an additional 2.5GB, and just with that you will fit and can probably increase context a bit.
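A back-of-envelope way to check this yourself: file size is roughly parameter count times bits per weight, divided by 8. The bpw figures below are approximate averages for each quant type, not exact:

PARAMS = 27e9  # Qwen3.6-27B

# Approximate average bits-per-weight for common GGUF quant types.
for quant, bpw in [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    gib = PARAMS * bpw / 8 / 1024**3
    print(f"{quant}: ~{gib:.1f} GiB of weights")

That puts Q6_K around 21 GiB and a Q5 around 18 GiB, consistent with the ~2.5GB saving, while Q8_0 at ~27 GiB plainly cannot fit on a 24GB card.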
qwen_next_gguf_when@reddit
Just use the Q4 bro.
Ell2509@reddit
Qwen3.6 28B uses hybrid attention; part of it is net gated attention, which currently has an issue in llama.cpp: it causes the whole conversation to be reprocessed on each message, so latency grows quickly with length.
pepedombo@reddit
I code on already-structured data, and it really takes time to find the proper model, especially to find the differences between quants. (5070+5060+4060)
I started with the 35B at Q8 and quickly changed to the 27B; the 35B is chaotic and burns through context quickly :)
Here's my actual testing ground:
// 35b
llama-server.exe -m "H:\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q8_0.gguf" -ngl 999 -sm layer -ts 1,1,1 -c 200000 -np 2 --no-mmap -ctk f16 -ctv f16 -fa on -b 1024 -ub 256 -t 8 -tb 8
// 27B q8
llama-server.exe -m "H:\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q8_0.gguf" -ngl 999 -sm layer -ts 1,1,1 -c 100000 -np 1 --no-mmap -ctk f16 -ctv f16 -fa on -b 1024 -ub 512 -t 8 -tb 8
// 27b q4 dual 80k each
llama-server.exe -m "H:\.lmstudio\models\lmstudio-community\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q4_K_M.gguf" -ngl 999 -sm layer -ts 1,0,1 -c 160000 -np 2 --no-mmap -ctk f16 -ctv f16 -fa on -b 1024 -ub 512 -t 8 -tb 8
// 27b q5 xl
llama-server.exe -m "H:\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q5_K_M.gguf" -ngl 999 -sm layer -ts 1,0,1 -c 100000 -np 1 --no-mmap -ctk f16 -ctv f16 -fa on -b 1024 -ub 512 -t 8 -tb 8
27B Q8 is solid and seems to follow my intentions without much hassle, a good reference. In my case the price is huge, because it gives ~12tps average, or 2x10tps in parallel. Annoying. 4060 bottleneck.
27B Q5 runs at ~27 t/s on a 5070 + 5060 setup and uses almost 31 GB VRAM at 100k context, so Q6 is likely not practical on 24 GB. Too early to judge, but in prompt comparisons with Q8, Q5 feels like it can lose some detail.
As you can see, you'll need to play around to find what suits you best. With the 27B I'm OK with ~20-25tps; time will tell whether that 25tps is worth the 10tps difference vs Q8, which might not lose that one detail. With more limited VRAM you will probably play with q8 variants. More variables = more testing = more time.
I switched from OpenCode to a Qwen Code fork because it handles tools better, so it doesn't annoy me every time it needs to parse the codebase. I used to play with code-wiki etc., but it needs special care, so for now I'm going brute force, looking for a compromise between quality and speed.
A short note about ctx: I'm trying to finish any task within a 50-100k context window; the bigger it gets, the worse. As my project is already structured, every area is encapsulated, so I prefer a modular approach.
jacek2023@reddit
By default, CUDA on Windows is "smart" and offloads to system RAM by itself. Try a lower quant.
wwabbbitt@reddit
In your logs, look for a line like this:
offloaded 65/65 layers to GPU
Most likely you are getting much less than 65/65, which means some layers are being kept on the CPU, and that slows down your token generation. This will definitely be the case if you are using Q6. Most of us are using Q4 with 24GB VRAM.
Pablo_the_brave@reddit
No, it's just that NVIDIA will use RAM offloading :) This is NVIDIA CUDA driver magic: it swaps VRAM to RAM...
Puzzleheaded-Drama-8@reddit
-np 1 reduces speed of 27B on my 7900xtx from 36tk/s to 25tk/s (I don't even do parallel requests). Could be just a Vulkan bug, but you could try without it.
goofyahead@reddit
Hi there!
Doing a lot of testing on my own hardware with Qwen3.6 and Gemma4.
A few things:
The model seems too big to me to leave room for your KV cache (your context) and still run on your GPU only; you have to budget for both things, or it will start using your system RAM, which is much slower.
That model is ~22 GB of just weights, so you have very, very little left for context, which you set at 64k with Q8 cache (Q8 is usually the preferred cache quant, since quantizing the cache further seems to hurt a lot).
https://unsloth.ai/docs/models/qwen3.6
Also, I would recommend playing with the UD (Unsloth Dynamic) quants; they do the reduction at different layers to try to keep the model closer to the original. (Not sure which model you are using specifically.)
Just as a reference: to be able to keep everything in my 5090, I'm testing a Q5 :D
I could go on about the parameters, but this is part of the learning and the fun. I would recommend you stick with llama.cpp, read about these different quants, test them, ask your AI what each of these fields means, etc.
Just a heads up: always read the model card. They recommend certain settings because that's what they used in the training/RL phase. In my case, I learned the hard way that the recommended top_k and temp params for Qwen3.6 are not what I expected, and I got into crazy loops.
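If your client keeps overriding them, you can pin the samplers per request. A minimal sketch against llama-server's OpenAI-compatible endpoint, using the values from the OP's command line (check your model card for the real recommendations); the model name is just a placeholder, since a single-model llama-server doesn't route on it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="local",  # placeholder; llama-server serves whatever model it loaded
    messages=[{"role": "user", "content": "Write a function that reverses a list."}],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # llama-server accepts top_k as an extra sampler field
)
print(resp.choices[0].message.content)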
Have fun!
RogerRamjet999@reddit
You're definitely at the edge of what 24GB can do. You should at least test a Q5 model; that would most likely give you enough space, and if it doesn't work, it's a pretty good indication that you have some other issue.
jonahbenton@reddit
Sounds like the model is being served from RAM. Run nvidia-smi to confirm that the OS can see the card, then, after starting llama.cpp and loading the model, check that the memory on the card is actually being used. llama.cpp should also log that it can see the card.
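If you'd rather script that check than eyeball nvidia-smi, here's a minimal sketch via NVML's Python bindings (pip install nvidia-ml-py); run it once before and once after loading the model:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
# If the model is really on the GPU, `used` should jump by roughly the
# model size (~21 GiB for the Q6_K weights) once llama-server has loaded it.
print(f"VRAM used: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()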