What speed is everyone getting on Qwen3.6 27b?
Posted by Ambitious_Fold_2874@reddit | LocalLLaMA | View on Reddit | 187 comments
I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0.
This is on 3 GPUs (1x 2060 Super 8GB, 2x 5060 Ti 16GB), via llama.cpp.
Unsure if this is slow or to be expected?
*/llama-server --port 8080 --model */llama.cpp/Qwen3.6-27B-Q8_0/Qwen3.6-27B-Q8_0.gguf -mm */Qwen3.6-27B-Q8_0/mmproj-BF16.gguf -np 1 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' --cache-type-k q8_0 --cache-type-v q8_0 -c 128000 --fit-target 1536
(--fit-target 1536 was to allow some space for the vision capability to work)
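If you want numbers that are directly comparable across setups, llama.cpp's bundled llama-bench reports pp/tg separately; a minimal sketch (the model path is a placeholder and assumes a recent build):

```
# Measure prompt processing (pp) and token generation (tg) with all layers on GPU
./llama-bench -m Qwen3.6-27B-Q8_0/Qwen3.6-27B-Q8_0.gguf -ngl 99 -fa 1 -p 2048 -n 256
```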
InevitableArea1@reddit
7900xtx (24gb vram) -100k context - LM Studio - Q5_K_XL (unsloth) - 19 tok/s
Amazing considering it's the smallest model that can actually do the simulation analysis I want/need. The Qwen3.5 35B MoE is great, but the 27B dense is another level entirely.
Most-Trainer-8876@reddit
q5 with 100K context on 24GB Vram, can you please share your settings?
InevitableArea1@reddit
Here, nothing too fancy; the key things are LM Studio's Unified KV Cache and flash attention. Here are my settings, also using AMD's ROCm Runtime (v2.13.0), and Qwen's recommended sampling parameters: Temp=0.7-1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
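For non-LM-Studio users, those sampler settings map onto llama.cpp flags roughly like this; a sketch of equivalent settings, not their exact LM Studio config (model path is a placeholder):

```
./llama-server -m Qwen3.6-27B-Q5_K_XL.gguf -ngl 99 -fa on \
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0
```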
matrik@reddit
1 x RTX 3090, Q5_XL: 78 t/s
Most-Trainer-8876@reddit
Can't believe it, share your settings?
MalabaristaEnFuego@reddit
28 tok/s on an RTX A5000, and it's an incredible local model for 27B.
Most-Trainer-8876@reddit
please share your settings, I am unable to push context to 100K, even with q8_0 kv cache.
Prestigious-Use5483@reddit
35 t/s with RTX 3090 | Q5_K_XL | 32K Context (F16)
Most-Trainer-8876@reddit
please share your settings
ridablellama@reddit
4090 - 40 tok/s with my short quick chats in LM Studio. Highly unoptimized, but: q8_0 KV cache, Unsloth Q4_K_M.
Icy_Butterscotch6661@reddit
I was getting 45 on a 4090 at small context lengths (with llama.cpp binaries downloaded from their GitHub).
Most-Trainer-8876@reddit
The 4090 has 24GB VRAM, right? How much context are you able to push?
SuitableElephant6346@reddit
3060, like 3 token a sec on q4 😅😭
throwaway9977558866@reddit
Can you share your config?
chankeypathak@reddit
Can someone explain it to me in layman's terms? I have a GTX 1650, Ryzen 5 3600, 32GB RAM.
Should I ditch the idea of hosting an LLM?
youcloudsofdoom@reddit
dual 3090 here. I'm getting 30 t/s with around 1200 p/p at 192k context on Q6_K.
-ngl 99
-b 4096
-ub 1024
-t 4
-tb 16
-fa on
KV caches are Q8 (-ctk q8_0 -ctv q8_0)
Unsloth's recommended temp etc. are all there.
Anyone doing any better, any suggestions? Feels like I'm leaving power on the table somewhere....
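(For anyone wanting to reproduce, those flags assembled into a single llama-server invocation would look roughly like this; the model path is a placeholder and the sampling values are the Qwen-recommended ones quoted elsewhere in the thread:)

```
./llama-server -m Qwen3.6-27B-Q6_K.gguf -c 192000 \
  -ngl 99 -b 4096 -ub 1024 -t 4 -tb 16 -fa on \
  -ctk q8_0 -ctv q8_0 \
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0
```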
logic_prevails@reddit
Sick setup, Q6 is the sweet spot
youcloudsofdoom@reddit
Yeah, I'm not mad at it; even at about 50% context fill I'm getting 1100 p/p and 25 t/s, so I shouldn't complain really. I've been spoiled by my 100 t/s Qwen 3.6 35B experience....
x10der_by@reddit
Q4 about 3 t/s on rtx 4070s 12Gb, 32Gb DDR4
RevolutionaryGold325@reddit
Isn't it the same model as Qwen3.5? Just a bit more training and better fine tuning. Should give you the same exact speed.
ea_man@reddit
Hidden layers are ~25% bigger than 3.5, so a bit more trained on tool use I guess. Bigger file.
RevolutionaryGold325@reddit
Not true. I went and double checked. The configs are pretty much identical:
https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/config.json
https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/config.json
ea_man@reddit
Model Overview
Qwen3.6-27B.
RevolutionaryGold325@reddit
We must be looking at different configs because that is not what the model configs define.
RevolutionaryGold325@reddit
Line 18 in the new configs:
https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/config.json#L18
Line 16 in the old config:
https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/config.json#L16
AlwaysTiredButItsOk@reddit
Wrong. 3b active parameters = faster. Probably closer to 3.5 4b or 2b speeds
LocoMod@reddit
TIL dense models are just MoE’s in disguise. /s
rosco1502@reddit
I believe it is slightly larger
spaceman_@reddit
Single and dual AMD Radeon Pro R9700 numbers with llama.cpp, with both ROCm and Vulkan, for IQ4_NL, Q6_K and Q8_0.
Single cards are obviously swapping in the case of the Q8_0 benchmark.
I have not yet tried the new tensor parallelism, because I previously got horrible numbers on both backends. Not sure if this has since been fixed.
unsloth_Qwen3.6-27B-GGUF_IQ4_NL
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
unsloth_Qwen3.6-27B-GGUF_Q6_K
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
unsloth_Qwen3.6-27B-GGUF_Q8_0
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
ilintar@reddit
Q5_K_M with 2x5070, on -sm tensor --spec-default: 52 t/s.
FoxiPanda@reddit
RTX 5090 running UD_Q5_K_XL - ~45tok/s at token 1000, more like 35 at token 100000.
I have not optimized my launcher script at all yet though. YMMV.
RedParaglider@reddit
Man 45 t/s is super useable too.
FoxiPanda@reddit
Strix Halos only have ~250GB/s of mem bandwidth. That's ~7x less than a 5090. It's gonna be 7x slower lol
No_Mango7658@reddit
The 5090 is about 2 TB/s, for anyone who doesn't know.
FoxiPanda@reddit
1792GB/s but who's counting ;)
No_Mango7658@reddit
You just be running stock 😂🤩. The 5090 is amazing for these size models.
FoxiPanda@reddit
I am very much in the state of not catching my 5090 on fire :D
No_Mango7658@reddit
Probably smart
CMPUTX486@reddit
How fast is it on the AMD Max (Strix Halo)?
RedParaglider@reddit
11 t/s on a q4
My_Unbiased_Opinion@reddit
I probably would do either unquantized 3.6 35B MoE or a quantized 3.5 122B MoE on Strix Halo, IMHO.
annodomini@reddit
Yeah, with these really strong dense models coming out, I'm feeling like I need to pick up a desktop chassis with the discrete GPU. It's neat what I can run on my laptop, but I could really use more memory bandwidth.
bigh-aus@reddit
For my uses, I’d go context.
I tried out q4_k_xl on a 3090 with 96k context, about 30tps. At 46k tokens. (Openclaw coding).
FoxiPanda@reddit
Yeah I'm leaning that way too. I'll try it and see how far I can push it above 128K
bigh-aus@reddit
I had it pull a story off the backlog and complete it, but that included one compaction.
psxndc@reddit
I really appreciate you posting your docker command. Thank you!
FoxiPanda@reddit
For sure, no promises it won't change more as I figure out more optimizations, but this one feels really good so far. I actually just started a new post that pulls mostly from this comment of mine and is basically bait for someone who's way more vLLM knowledgeable than I am to tear apart my launch parameters :D
codables@reddit
Can you share your command line params? I’m assuming this is llama.cpp?
FoxiPanda@reddit
The unoptimized version uses this currently... I'm working at the moment so no time to poke at it further, but it will change (there is so much room for improvement from this):
codables@reddit
Thank you for sharing!
FoxiPanda@reddit
Note, I updated the parent comment with an entirely new setup that absolutely smokes this un-optimized llama.cpp setup. See https://old.reddit.com/r/LocalLLaMA/comments/1sss5og/what_speed_is_everyone_getting_on_qwen36_27b/oho2if8/
Limp_Classroom_2645@reddit
Man that's slow
Possible-Pirate9097@reddit
How fast is ur PP?
PinkySwearNotABot@reddit
pretty fast when i want it to be
Far_Cat9782@reddit
/flush ftw
FoxiPanda@reddit
Dense models be dense. Again, no optimization yet. I can probably get it to like... 50-60 with TG with some work. Frankly it is totally usable though. I can't read faster than 20-25tok/s on a single request. Background stuff obviously would be happier being faster.
iportnov@reddit
That's desktop, not laptop I assume?
FoxiPanda@reddit
Correct. I am power limiting to 500W though so my GPU doesn't catch on fire :)
Certain-Cod-1404@reddit
Are you on Linux? If so, do you know the specific command you used? And is it a set-it-once-and-forget-about-it thing, or do you need to set up systemd stuff to run it on startup?
FoxiPanda@reddit
I am not on Linux on this specific system, but uh, there is definitely a way to do this on Linux.
It's something like:
sudo nvidia-smi -pl 500
_hephaestus@reddit
For whatever reason, this is just per session and you’ll likely need a systemd script to have it persist across boots unless nvidia changed something.
FoxiPanda@reddit
Indeed, but really he should just have an LLM agent set this up for him. But yeah, that's something like:
/etc/systemd/system/nvidia-powerlimit.service:

[Unit]
Description=Set NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 500

[Install]
WantedBy=multi-user.target

sudo systemctl enable nvidia-powerlimit
This will vary between different Linux flavors though, I imagine. I am going to assume he's using something like a 2007 build of Gentoo.
Certain-Cod-1404@reddit
Ok, thank you so much. And is there any meaningful drop in performance compared to not limiting wattage?
Ambitious_Fold_2874@reddit (OP)
Does power limiting help with GPU longevity too? In that case I might have to start figuring out how to set this up too :/
FoxiPanda@reddit
I mean, the 12VHPWR / 12V-2x6 power connector is CLEARLY flawed. People would not be constantly reporting melty cables/connectors if it wasn't. Even a 0.1% failure rate in this case is thousands of cards failing.
So, reducing the power a little can at least back away from that hairy edge of the hardware's capability. 500W is still an insane power envelope and that last 100W is pretty diminishing returns as far as speed goes imo.
Gargle-Loaf-Spunk@reddit
well where's the fun in that. `--yolo`
Honest-Ad8881@reddit
9070 XT 16GB, 32GB RAM
Qwen3.6-27B, ctx = 32K
IQ4-XS: 15 token/s
UD-IQ3-XL: 23.5 token/s
UD-IQ3-XXS: 35.5 token/s
into_devoid@reddit
Is this useable?
Big_Mix_4044@reddit
Same as 3.5 27B: 30 tps TG and ~1k tps PP at a 200k context window (with slight degradation as the context grows).
shokuninstudio@reddit
Q4
[ Prompt: 207.6 t/s | Generation: 22.6 t/s ]
Q8 on a MacBook Pro 48GB is producing graphical glitches all over my screen so I shut it down. In theory there is plenty of RAM but llama.cpp has been grabbing more memory than needed lately.
Frank-w618@reddit
You can try oMLX on Mac; it uses much less memory compared to llama.cpp and is also faster.
bigh-aus@reddit
Try this and report back :)
IronColumn@reddit
what kind of diffrence do you see?
IronColumn@reddit
what macbook generation?
simracerman@reddit
UD_Q4_K_XL - 12 t/s. 64k context on 5070 Ti 16GB, partial offload to iGPU using llama.cpp vulkan backend. Just finished a lengthy code review of an app I’ve been building with Opencode. I’m super impressed with the level of depth the 3.6-27B has brought.
Iory1998@reddit
I get 22-23 t/s, Q8, KV FP16, at 170K using 1 RTX 3090 and an RTX 5070 Ti.
gnnr25@reddit
MacBook Air M2 16GB
[ Prompt: 2.8 t/s | Generation 2.2 t/s]
9B when? *cries*
Adventurous_Farm3073@reddit
Dual 5090s power limited to 420W: Unsloth Q8 gets around 40 t/s, Q4 gets around 70 t/s.
CMatUk@reddit
7900 XTX 24GB, 64GB DDR5, 7950X
LM Studio - Vulkan llama.cpp
K/V Cache Quant - Q8
32K Context
Qwen3.6-27b Q4_K_M (unsloth) 40.0 tok/sec
Qwen3.6-27b Q5_K_XL (unsloth) 35.3 tok/sec
Qwen3.6-27b Q6_K (LM Studio) 15.8 tok/sec
lurkatwork@reddit
You sellin that xtx?
Mirayum@reddit
Do you mind sharing your launcher/parameter options?
_ballzdeep_@reddit
"Qwen3.6-27B-UD-Q4_K_XL":
aliases: ["qwen36d", "qwen35d", "Qwen3.6-27B-UD-Q4_K_XL.gguf", "Qwen3.5-27B-UD-Q5_K_XL.gguf"]
timeouts:
responseHeader: 0
cmd: |
${llama}
--model /models/Qwen3.6-27B-UD-Q4_K_XL.gguf
--spec-type ngram-mod --spec-ngram-size-n 16 --draft-min 4 --draft-max 32
--jinja --ctx-size ${OC_CTX} --parallel 1
--fit on --fit-target 0 -fa on -ctk q8_0 -ctv q8_0
-b 4096 -ub 1536 --cache-ram 0 --ctx-checkpoints 12
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
--reasoning-format deepseek
This gives me TG:40 tps and PP:1350tps with 42% ngram acceptance.
Sticking_to_Decaf@reddit
FP8 with speculative decoding (mtp, 2), about 85tps on 1x Pro 6000 max-q.
Loud-Decision9817@reddit
3090 Q3 getting 114 tokens per second but I have my own custom software. 256k context
Steve_Streza@reddit
30 tok/s 7900 XTX on UD-Q4_K_XL. I've put no effort in tuning yet. 90 tok/s on 35B-A3B with UD-Q3_K_XL.
Lazzollin@reddit
2-3 tps on an RTX A5000 with offloading to RAM (CPU: Ryzen 7 9800X3D). I think my settings might be quite far from the most performance I could be getting though, so I just kept working with the 35B.
patricious@reddit
in LM Studio I get 208.30 t/s on a RTX 5090, Q8_0, 262k context size, temp: 0.6, Top K 20, Repeat Penalty 1, Top P 0.95.
For some reason I can't get llama.cpp to run properly; maybe I am not choosing the right settings in the script file.
Kahvana@reddit
You're likely confusing processing speed with generation speed.
Apprehensive-Fly4076@reddit
I think you're mixing it up with Qwen3.6 35B. Are you sure it's the Qwen3.6 27B that came out today? Me and some others on a 5090 are getting around 50 t/s.
patricious@reddit
Yes, my bad, it's the Qwen3.6 35B model and the speed is around 50-60 t/s.
DramaLlamaDad@reddit
You're getting downvoted because this isn't possible unless you have something like 5.4 TB/s of memory bandwidth, and most people here know it. Check the model again, you're probably on something else.
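Back-of-envelope check of that claim, assuming roughly one full pass over the weights per generated token and ignoring KV-cache reads:

```
# A 27B dense model at Q8_0 is roughly 26-28 GB of weights, read once per token:
echo "$((208 * 26)) GB/s"   # ≈ 5.4 TB/s of bandwidth needed, vs ~1.8 TB/s on a 5090
```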
patricious@reddit
forgot to add: UD_Q4_K_XL from Unsloth
mister2d@reddit
Qwen3.6-27B-UD-Q4_K_XL.gguf
28 t/s, all layers on gpu.
ttkciar@reddit
Interesting! I expected PCIe IPC to hit tensor parallelism perf more than that. Thanks for sharing!
A couple of questions, if you don't mind: Is that system PCIe 2.0 or 3.0? And are you using any kind of bridging link between the 3060s?
mister2d@reddit
PCIe 3.0. One gpu in a 8x slot while the other in the 16x slot. No physical linking other than the bus.
Kitchen-Year-8434@reddit
Vllm, mtp 3, FP8, rtx 6k - about 120 t/s.
DeltaSqueezer@reddit
and for prompt processing?
AustinM731@reddit
Have you had any issues running the FP8 KV cache?
JermMX5@reddit
Would you mind posting your startup script with params? We’ve been using llamacpp on our rtx pro 6000 box in aws and want to start looking at vllm
LegacyRemaster@reddit
same
eribob@reddit
Dual rtx 3090, FP8 quant in vllm, tp=2, mtp 2: pp=1650t/s, tg=26t/s
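A minimal vLLM launch for that kind of setup might look like the sketch below; the model ID is an assumption, and the MTP/speculative settings are omitted because the exact flags depend on your vLLM version:

```
# Dual-GPU tensor-parallel serve of an FP8 checkpoint (model ID is a guess)
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90
```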
PassengerPigeon343@reddit
Was looking for my HW on the list, thank you for the baseline, especially including prompt processing.
AdamDhahabi@reddit
3090 + 2x 5070 Ti, all cards around 900GB/s mem bandwidth, no tp and no mtp, running Unsloth Q8 with full unquantized context at 25 t/s. Something seems off with your 26 t/s.
Pleasant-Shallot-707@reddit
I get about 40-50 tok/s on my M5 Max
Eveerjr@reddit
what do you use for inference? I'm getting like half of that on my M5 Max
Eveerjr@reddit
24tok/s on M5 Max with MLX
IronColumn@reddit
What quant? I get 10 on an M1 Max with llama.cpp and the Unsloth Q4_K_M GGUF; I'd honestly expect an M5 to be more than 2.5x faster.
Eveerjr@reddit
I'm trying now with the qwen3.6-27b-nvfp4 mlx version and I get 26tok/s. It's quite pleasing to use tbh, the prompt processing is quite fast. This model is so good it doesn't even feel like it's running local.
Mine is the base model with 32 GPU cores.
monjodav@reddit
not normal lol
Eveerjr@reddit
What would be normal? I just downloaded from lmstudio
verdooft@reddit
[ Prompt: 2.2 t/s | Generation: 1.4 t/s ], Q6_K_XL
UniversalJS@reddit
Are you running it on a Gameboy?
verdooft@reddit
A Notebook without GPU and slow RAM. :-)
l33t-Mt@reddit
13.5 with Nvidia p40.
Late_Night_AI@reddit
You're hurting your performance with that 2060.
For LLMs, when you split a model across GPUs, the work has to pass through every card involved. If one card is much weaker, has less VRAM, or is (as in this case) simply slower, it becomes the bottleneck. You might actually see a performance boost if you don't use the 2060 and offload a little bit to system RAM instead, if you need more than 32GB. Also, lower quants are faster, so if you want a speed boost you could go to a Q6 or Q4 if it doesn't hurt quality too much for your use case.
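(A sketch of what that would look like in practice; the device indices, path, and offload values are illustrative, not the OP's exact setup:)

```
# Hide the 2060 Super so only the two 5060 Ti cards are used; layers that
# don't fit in their combined 32 GB stay on the CPU via partial offload (-ngl).
CUDA_VISIBLE_DEVICES=1,2 ./llama-server \
  -m Qwen3.6-27B-Q6_K.gguf -c 65536 -fa on -ngl 40
```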
rm-rf-rm@reddit
Yeah, feeling it's slow on my end as well.
Q8, llama.cpp, Mac Studio with M3 Ultra. ~20 tps
jedisct1@reddit
~15 tok/s with omlx on an Apple M5.
logic_prevails@reddit
What quant
jedisct1@reddit
Q8
New-Implement-5979@reddit
Single 5060 Ti - Q3_K_M with a 70k context window, I get 21 tokens per second.
maschayana@reddit
M5 Max
NVFP4 MLX by mlx-community
TG = 31.12 t/s (2272 tokens), TTFT = 0.54s
PinkySwearNotABot@reddit
wow. should i not even bother with the GGUF Q6? i have m1 max 64gb
maschayana@reddit
I would go with the 35b a3b. To me so far overall much better experience
PinkySwearNotABot@reddit
but...the benchmarks :(
maschayana@reddit
At least for me it's the balance of quality and speed and the moe just excels
DramaLlamaDad@reddit
Matches my results. On a MacBook Pro M5 Pro 64GB box, basically half of everything on the Max, I get 14.8 tok/sec.
Signor_Garibaldi@reddit
Do you find it usable at these speeds?
DramaLlamaDad@reddit
Obviously, I just got 3.6 set up, but based on past experience, 15 is not great for active coding. Fine for background tasks, which is what I'll use it for. Probably my new overnight code reviewer and daily code change summarizer.
Signor_Garibaldi@reddit
I probably shouldn't ask this on this sub, but isn't it too much hassle for what you could accomplish with an API in a sensible time and at marginal cost, or is it more of a hobby/satisfaction kind of thing?
DramaLlamaDad@reddit
For an example, I just loaded it up and asked it to review 5 files with a total of around 1000 lines of changes in Cline Code. It took about 10 minutes to complete the review but did a decent job. If you've got some housework to do and can start requests, go AFK for a while, and come back later, it works great! If you're a professional trying to get stuff done, about 50 tok/sec is really the minimum you should shoot for, not 10-15 tok/sec. :/
PromptInjection_@reddit
8 tokens / s, Q5, AMD Strix Halo
edsonmedina@reddit
I get about 7.4 t/s on Strix Halo with Q8
Ell2509@reddit
You should be getting more than that.
You are likely setting up your command in such a way that data takes multiple round trips over PCIe. That tanks your speed.
Flashy_Management962@reddit
Use a bf16 KV cache for Qwen models, do not use --fit-target on dense models, and use -sm tensor.
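(Roughly, that advice translates to flags like the following; the model path is a placeholder, and a bf16 KV cache assumes flash attention and a build/card that supports it:)

```
./llama-server -m Qwen3.6-27B-Q8_0.gguf -ngl 99 \
  -fa on -ctk bf16 -ctv bf16 -sm tensor
```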
jacek2023@reddit
FerLuisxd@reddit
Does it fit 100% on that GPU, or with RAM offload? Also, it is Q2, right?
jacek2023@reddit
49/65 means it's offloaded to RAM
zkstx@reddit
Hmm, what are the flags you use? I suspect you could squeeze some more layers (perhaps even the entire model) if you are willing to live with a little less context and a lower fit-target
IronColumn@reddit
10 t/s
m1 max studio 32gb
unsloth q4 k_m
Kahvana@reddit
On 2x ASUS PRIME RTX 5060 Ti 16GB I'm getting ~300 t/s processing and ~20 t/s generating, with nothing in context, Unsloth Q5_K_M, 128k max context. Will edit the message when I get to my PC.
viperx7@reddit
llama.cpp on a 4090+3090 setup I get TG 29t/s and PP 2500t/s
I am struggling with setting up vLLM; I can't seem to figure out the optimal flags and exact model to use. If anyone has a similar setup and would like to share their config, I will be thankful.
skibare87@reddit
Around 80 tok/s with speculative decoding active and 96k context window.
chisleu@reddit
15 tokens per second with an M4 Max / 128GB with the 8-bit quant.
Makers7886@reddit
4x3090 BF16 model and cache + MTP + vLLM with "instruct general" mode is:
ProfessionalSpend589@reddit
Is it worth it to run it at BF16?
I see another user in this thread and wonder: what bad things are you seeing when quantised…
Makers7886@reddit
It eliminates day 0 quant issues, some FP8's don't run on 3090s (w8a8, like Q3.6), and int8/fp8 is too tight for 2 gpus w/max context and concurrency which means I have to use all 4 with an obscene amount of kv cache laying around. Since this is only for my projects/uses I dont need more than 12 concurrent calls and is a waste of gpus. May as well run bf16 model and cache with max context + concurrency. I do blame lower quants when I hit tool issues at large context like 200k+ and see consistency drop off but I have no data around that.
Also, my main go-to is 122B FP8 on 8x3090s, which is faster and consistently beats Q3.5 27B BF16 (in my uses), and I want to give Q3.6 27B its best shot.
robertpro01@reddit
Which mobo?
Makers7886@reddit
romed8-2t + epyc 7502
QuinsZouls@reddit
26 tps using an RX 9070 16GB and turboquant at a 130k context window, using the Vulkan backend.
RoomyRoots@reddit
Around 20t/s on a RX 7800 XT, same as 3.5. I feel that since my last llama.cpp build I got some performance degradation but I don't have time right now to fix it.
Blindax@reddit
With LM Studio, Q8, 128k context, 2k token generation, I get around 7 t/s with a 5090 and 3090 (vs 150 t/s with the 35B). At 80k context I get around 23 t/s. I have noticed issues with both 3.6 versions in LM Studio (thinking loops etc., and apparently optimization issues too).
ziphnor@reddit
n00b here with 2x RTX 5060 Ti 16GB, Intel Core Ultra 235 with 64GB DDR5 (6400).
Using ik-llama.cpp with Q5_K_XL I get 23-24 t/s (~112 pp). This is with memory OC (+6000 MT/s, which is apparently fairly standard on these cards); with standard memory speed I think it was 19-20 t/s.
Embarrassed_Adagio28@reddit
Dual Tesla V100 16GB GPUs run it at 28 tokens per second at Q5 on LM Studio.
ziphnor@reddit
Can you share a bit more info on that setup?
Tormeister@reddit
76.3 tok/s
RTX 5090
vLLM 0.19
fp8_e4m3 KV
cyankiwi/Qwen3.6-27B-AWQ-INT4
logic_prevails@reddit
30 tk/s UD_Q5_K_XL, 5070 ti and 3080
fuse1921@reddit
Getting mid 20s with 27B Uncensored Q6 on 3x 3090 at full context
Dundell@reddit
3.62 t/s on a GTX 1080 Ti + 16GB DDR4... I'll just stick with 3.6 35B MoE, which was 35 t/s. Maybe my 6x RTX 3060s can handle it better in some configuration, for speed and at least 100k context size.
p211@reddit
May I ask how you got the 3.6 35B Moe to 35t/s on your setup?
Dundell@reddit
/home/dundell-discordbot/llamv2/llama.cpp/build/bin/llama-server -m /home/dundell-discordbot/llamv2/Qwen3-5/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj /home/dundell-discordbot/llamv2/Qwen3-5/mmproj-F16_3-6-35b.gguf --ctx-size 100000 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1 --fit on --flash-attn on --no-warmup --host 0.0.0.0 --port 8188 --api-key someapikey -a Qwen3.5-Thinking --temp 1 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --cache-ram 0 --image-min-tokens 1024 --jinja
xMarkv@reddit
+1 I’m also following. Trying to get 3.6 35b on a 3060 but it falls on its face during prompt processing
usuallyalurker11@reddit
I got ~4 t/s on my $800 laptop: Lunar Lake iGPU, Intel Core 140V, 32GB LPDDR5X.
I was not surprised: with Qwen 3.5 9B at the same quant I got ~12 t/s, and this one is even heavier, so it makes sense.
MarionberryWeird4021@reddit
Are you using the internal GPU or something external connected to your computer? Just to know...
toolman10@reddit
RTX 5090, the sweet spot for me in LM Studio is:
unsloth/Qwen3.6-27B-Q6_K (24.37 GB on disk)
ctx 256k
KV Q4
Getting ~50 tk/s
Diecron@reddit
prompt eval time = 14350.90 ms / 38697 tokens ( 0.37 ms per token, 2696.49 tokens per second)
eval time = 442.77 ms / 24 tokens ( 18.45 ms per token, 54.20 tokens per second)
ea_man@reddit
Works fine here with Vulkan on a 6700xt.
I mean not fast but as fine as the old one...
AeroelasticCowboy@reddit
my R9700 with Q5KM is PP @ 980 and TG at 27
Clean_Initial_9618@reddit
What's mmproj? How's it helpful?
Diecron@reddit
It's the vision decoder - lets the model take image inputs as well
ea_man@reddit
Same as the old 3.5, yet I can only use 1/4 of the context size (10K) for the IQ3_XXS on a 12GB GPU due to the bigger size. I hope Bartowski will release a slightly smaller IQ3...
launch:
/home/eaman/llama/bin_vulkan/llama-server \
-m /home/eaman/lm/models/unsloth/Qwen3.6-27B-UD-IQ3_XXS.gguf \
--host 0.0.0.0 \
-np 1 \
--fit-target 20 \
-ctk q4_0 \
-ctv q4_0 \
-fa on \
--temp 0.3 \
--repeat-penalty 1.05 \
--top-p 0.9 \
--top-k 20 \
--min-p 0.04 \
-b 512 \
--ctx-size 10000 \
--jinja \
--reasoning-budget 1 \
--chat-template-kwargs '{"enable_thinking":false}' \
--no-mmap
FinBenton@reddit
Q6_K_XL 210k context, 53 tok/sec output on 5090.
Opteron67@reddit
Fp8 model - dual 5090 102tk/s (single request)
Responsible-Exit68@reddit
RTX5090, UD-Q5_K_XL
1) Small prompt
Generation: 59 tok/s
2) 90k prompt -
Pre-fill: 2187 tok/s
Generation: 47 tok/s
dinerburgeryum@reddit
You can, if you're feeling saucy, move to ik_llama.cpp and use split mode graph for an uplift on multiple cards. I went from 20 tps at full context on a 3090+A4000 to 30 tps. It doesn't seem mind-blowing, but a 50% uplift wasn't nothin.
Ambitious_Fold_2874@reddit (OP)
I don’t know what I’m doing wrong but setting up ik_llama.cpp was a huge PITA and for some reason didn’t help with speeds; but this was a while back and with a different model and setup, so maybe it would work better here
Corosus@reddit
Every time I try ik vs llama.cpp it acts really broken in opencode; just trying now, it was trying to do everything via git commands only and got stuck, but with llama.cpp it's flawless.
BigYoSpeck@reddit
-sm tensor does the same thing now in llama.cpp
legit_split_@reddit
Credit to the amazing Johannes Gaessler, however I think it's not quite as mature as ik_llama's implementation.
dinerburgeryum@reddit
I've never gotten it to work, but I think it's because I'm on heterogeneous cards.
iChrist@reddit
I get the exact same speeds using the same quants (26 t/s, 3090 Ti, Q4, 128k context).
AeroelasticCowboy@reddit
What's your prompt processing speed? And is that 26 t/s at a bench depth setting of 128,000?
kevin_1994@reddit
on RTX 4090 + RTX 3090 using unsloth's Q8_XL quant
using
taskset -c 0,15 /home/kevin/ai/llama.cpp/build/bin/llama-bench -m /home/kevin/ai/models/Qwen3.6-27B-UD-Q8_K_XL.gguf -ub 4096,8192 -b 4096,8192 -ngl 999 -t 16
results:
Wirhoss@reddit
Arc Pro B70, I'm getting ~18 tok/s with UD-Q4_K_XL, still early testing.
car_lower_x@reddit
No idea what’s good or bad but getting 6.15 tok/sec with 7236 tokens on a 5090
0r1g1n0@reddit
```
uv run mlx-openai-server launch \
--model-path mlx-community/Qwen3.6-27B-4bit \
--model-type multimodal \
--served-model-name qwen36-local \
--reasoning-parser qwen3_5 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice
```
M4 Max 36GB: ~22 tokens per second
Maleficent_Bridge_41@reddit
vLLM, BF16, RTX 6k: ~480 t/s at ~4 requests/s via vllm bench
StanPlayZ804@reddit
I'm getting around 4 tokens/s running at BF16 with the full 256K context window.