Qwen 27b MTP Config, Llama.cpp Single 3090
Posted by GotHereLateNameTaken@reddit | LocalLLaMA | 31 comments
What setup are you using for qwen 27b on a single 3090?
Here's what I've started using today. It has to compact often but I'm worried about giving up more accuracy and reliability with a lower quant:
llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 -ngl -1 -t 8 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 2 --fit off --mmproj /Models/q3.6/mmproj-Qwen3.6-27B-f16.gguf --no-mmproj-offload
I'm getting around 65tk/s.
I've also seen these recommendations: https://github.com/noonghunna/club-3090/blob/master/docs/SINGLE_CARD.md
They seem to be using the q4 quant. How are you weighing the tradeoffs?
imp_12189@reddit
Does anyone know how to run a Gemma model as well? I can't find anything about it with llama.cpp.
PixelSage-001@reddit
Running a 27B model on a single 3090 with MTP enabled is basically the holy grail of local inference right now. The memory bandwidth on the 3090 handles the extra speculative decoding overhead beautifully. What context size are you able to comfortably push before you start getting OOM errors during prompt processing?
j_tb@reddit
A few weeks ago I might have agreed with you. But ds4 on a new M5 Max with 128GB is blowing my mind right now. https://github.com/antirez/ds4
Maximum_Parking_5174@reddit
That's comparing a very expensive laptop with an old GPU that can be bought for 1/10 of the price. Still, pretty cool running DeepSeek on a laptop.
j_tb@reddit
Ok sure. But which one is “the holy grail of local inference”?
LightBrightLeftRight@reddit
It's been ridiculous for me. I have Qwen 3.6 27b on a 3090 at home and ds4 on my laptop, and the laptop does so much better.
txgsync@reddit
Yep, hand-tuned kernels are really interesting. Ds4 is breaking some interesting ground. Never thought I'd get this performance on such a large model on my M4 Max.
Now if only I could figure out continuous batching for subagents…
GotHereLateNameTaken@reddit (OP)
The context is in the settings I posted: 65536.
cibernox@reddit
That’s the main issue for me. I need more context. 128k minimum.
Rasekov@reddit
Depending on your workload you can go down to Q4 for the model and/or Q8/Q4 for KV. With both of those (the 27B-UD-Q4_K_XL model, Q8 K and Q4 V) you get around 192K context on a 3090, even while using it for video output.
Using the model at Q3 gives you room for 256K, but at Q3 you might be better off just using the 35B MoE and offloading to RAM.
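As a rough sketch of what that looks like on the command line (model path assumed, flags as in the OP's command; the exact context that fits depends on what else is on the card):
llama-server -m /Models/q3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf -c 196608 -ngl -1 -ctk q8_0 -ctv q4_0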
ResponsibleTruck4717@reddit
You can try q4_0 for KV; I'm pleased with the results.
WizardlyBump17@reddit
yo, if I don't use MTP on the MTP gguf, will I get 1:1 results with the normal gguf?
wllmsaccnt@reddit
Yes, though keep in mind that the MTP models are slightly larger and use more VRAM (which may impact your context size and/or KV cache quant choices if you were already near your VRAM limit).
WizardlyBump17@reddit
yeah, I saw that. For me I only had to move some more of the MoE to the CPU, and for some reason Qwen3.6 27b didn't need more layers offloaded to the CPU.
Last_Mastod0n@reddit
I use the unsloth Q6 model on my 4090. It's incredible.
filip-z@reddit
Can you post the settings you use? I'm also on a single 4090 and trying to dial in the best settings.
Last_Mastod0n@reddit
These are my settings in LM Studio. I'm not sure if they're perfect, but they work for me.
Btw, if you have integrated graphics or another GPU, be sure to plug all the display cables into the mobo or the other GPU, then turn the 4090 off and back on in Device Manager. That will free up 100% of the VRAM for the model (it forces all the display VRAM onto the integrated graphics).
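To sanity-check that the card is actually empty before loading the model, nvidia-smi works on Windows too:
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
The 4090 should report close to 0 MiB used once no displays are attached to it.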
legatinho@reddit
How much context can you fit with it?
Last_Mastod0n@reddit
So I don't actually code with it; I'm running custom vision analysis pipelines, and I've found that 10k context is plenty for me. I could test it and see how much will fit.
Maximum_Parking_5174@reddit
Interesting numbers.
I just tested Qwen3.6 27B int4 with dflash on vLLM.
I get this:
Single req: 117 t/s (TG) - 1540 t/s (PP)
Eight parallel requests: 396 t/s (TG) - 2400 t/s (PP)
This is on 4x RTX 3090 at 260W and 262K max tokens.
audioen@reddit
Also see if you should pass -fa 1 to enable flash attention. It might go a little faster if you do.
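With the OP's command that just means appending the flag (newer llama.cpp builds take -fa on/off/auto rather than a 0/1 value; spec and mmproj flags omitted here for brevity):
llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 -ngl -1 -t 8 -ctk q8_0 -ctv q8_0 -fa 1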
GotHereLateNameTaken@reddit (OP)
Yep, that sped things up too, thanks!
Ke5han@reddit
So what tokens/s do you get with this config? I was using q4 xl, and with mtp I'm getting about 45-50. Last night I switched to the 35b a3b q4 nl mtp version; hermes calls get 100t/s. I'm on Windows, so Linux may be even faster.
urarthur@reddit
The 35b MoE is noticeably lower quality; I would keep the 27b. Also, my 35b was running at 140 t/s (TG) with q3 xl.
urarthur@reddit
50-60 with q4 xl on Windows. I think with longer prompts the speed increases with mtp=6; for shorter chats, 2 drafters.
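In the OP's flags that presumably maps to --spec-draft-n-max, e.g. for long-prompt runs (other flags as in the original command):
llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 --spec-type draft-mtp --spec-draft-n-max 6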
urarthur@reddit
I am experimenting with 200k context with kv=4.
DiscipleofDeceit666@reddit
You're using it wrong. You need to manage context and tasks programmatically, and define a way to run unit tests that feed back into the model, because you can't trust what it does.
I personally call the qwen code CLI through a Python script that reads from a task and test config file.
GotHereLateNameTaken@reddit (OP)
By this, do you mean you have some preset configurations for various tasks and you select from them? If so, what are your configs? If not, could you explain a bit more specifically what you mean?
DiscipleofDeceit666@reddit
Nope, your presets are fine. It's how you're calling qwen. If you interact with it directly and let it muck about, it's just going to do too much to be useful.
So you need something else that interacts with qwen for you: one that ensures you get a clean slate for each task, and that reads the tasks you created and only passes in what's relevant to make the code change. I wrote a tasks.toml, and the Python script picks it up and passes those tasks one at a time to the qwen code CLI. Each task starts with near-zero context, so you aren't dealing with compaction.
Then you have your unit tests so that it can fix its mistakes on each code change.
The important part is that you need a way to manage context, because a local LLM can't be trusted to manage it on its own.
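A minimal sketch of that kind of driver, assuming a hypothetical tasks.toml where each task has a description, a list of relevant files, and a test path (the schema, the qwen -p invocation, and the single repair pass are illustrative, not my exact script):

# Sketch: read tasks.toml, hand each task to the qwen code CLI with a
# fresh context, then run that task's unit tests and feed failures back.
# The toml schema, "qwen -p", and the pytest call are assumptions.
import subprocess
import tomllib  # stdlib TOML parser, Python 3.11+

with open("tasks.toml", "rb") as f:
    config = tomllib.load(f)

for task in config["tasks"]:
    # Fresh CLI invocation per task: no carried-over context, no compaction.
    prompt = f"{task['description']}\n\nOnly touch: {', '.join(task['files'])}"
    subprocess.run(["qwen", "-p", prompt], check=True)

    # Run the task's tests; on failure, pass the output back for one repair pass.
    result = subprocess.run(["pytest", task["tests"], "-x", "--tb=short"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        subprocess.run(["qwen", "-p", "These tests failed, fix the code:\n"
                        + result.stdout[-4000:]], check=True)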
sagiroth@reddit
This, and for now I don't look elsewhere: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md
Basically Q5_K_S with a Q4_K_M drafter (think of it like a slipstream for the model running behind it). I get circa 180k headless context, but I compact earlier anyway, at around 100k.
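For comparison, stock llama-server expresses a main-model-plus-drafter pairing with its draft-model options, roughly like this (filenames are placeholders, and dflash's own flags will differ):
llama-server -m Qwen3.6-27B-Q5_K_S.gguf -md Qwen3.6-27B-Q4_K_M.gguf --draft-max 8 -c 180000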
Potential-Leg-639@reddit
I use 3.6 in the Q4_K_XL quant, the version that Unsloth did all their tests with as well. Their "preferred" version works perfectly fine for me too.