Can I improve performance for qwen 3.6 27b?
Posted by wgaca2@reddit | LocalLLaMA | 40 comments
Hardware
OS: Windows 11 Pro 10.0.26200, Build 26200
CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz
RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D
GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each, compute capability 8.6
NVIDIA driver: 596.21
Windows GPU driver: 32.0.15.9621
Model
Name: qwen36-q6-tools-192k-nothink:latest
Ollama model ID: 42e91752a44b
Architecture: qwen35
Parameters: 26.9B
Quantization: Q6_K
Ollama Runtime / Model Parameters
GPU offload: 65/65 layers, 100% GPU
Configured context: 196,608 tokens
num_ctx: 196,608
num_batch: 1,024
num_predict: 8,192
temperature: 0.45
top_k: 20
top_p: 0.8
repeat_penalty: 1
stop tokens: <|im_start|>, <|im_end|>
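For reference, a minimal sketch of how those parameters map onto a single request against Ollama's HTTP API (assuming the default endpoint on localhost:11434 and the model tag above); the timing fields in the response give the prompt-eval and generation speeds discussed below.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint (assumption)

payload = {
    "model": "qwen36-q6-tools-192k-nothink:latest",
    "prompt": "Explain the KV cache in one paragraph.",
    "stream": False,
    "options": {                      # mirrors the parameters listed above
        "num_ctx": 196608,
        "num_batch": 1024,
        "num_predict": 8192,
        "temperature": 0.45,
        "top_k": 20,
        "top_p": 0.8,
        "repeat_penalty": 1.0,
        "stop": ["<|im_start|>", "<|im_end|>"],
    },
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600).json()
# Ollama reports durations in nanoseconds.
pp_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
tg_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prompt eval: {pp_tps:.0f} t/s, generation: {tg_tps:.0f} t/s")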
Runner Settings Observed In Ollama Logs
FlashAttention: enabled
KV size: 196,608
Parallel: 1
NumThreads: 8
UseMmap: false
MultiUserCache: false
LoRA: none
GPU layers: 65
Observed Load With num_batch 1024
Total model memory reported by Ollama: ~38.6 GiB
All 65/65 layers offloaded to GPU
Layer / Memory Split From Load Log
CUDA0: 35 layers, weights 9.4 GiB, KV cache 7.6 GiB, compute graph 843.8 MiB
CUDA1: 30 layers, weights 10.2 GiB, KV cache 8.1 GiB, compute graph 1.5 GiB
CPU: weights 994.6 MiB, compute graph 20.0 MiB
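As a quick sanity check, the reported total is just weights plus KV cache plus compute buffers; a back-of-the-envelope calculation using only the figures from the load log reproduces the ~38.6 GiB.

# Figures copied from the load log above (MiB values converted to GiB).
weights = 9.4 + 10.2 + 994.6 / 1024           # CUDA0 + CUDA1 + CPU weights
kv_cache = 7.6 + 8.1                          # KV cache across both GPUs at num_ctx 196,608
compute = 843.8 / 1024 + 1.5 + 20.0 / 1024    # compute graph buffers
print(f"total ~= {weights + kv_cache + compute:.1f} GiB")   # ~38.6 GiB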
Currently getting 2,000-5,000 t/s on prompt evaluation and 15-20 t/s on generation. Is that the limit for this context size?
Thomasedv@reddit
I'm using llama.cpp and a single 3090, with a Q4 quant (and q8 KV cache) admittedly, but I'm getting 50-70 generation tokens per second with the current multi-token prediction (MTP) branch. Speculative decoding is probably your best bet for speedups. If you code, you might get even more on code that's already in context with n-gram speculation.
The only unclear thing is whether multi-GPU works with MTP and speculative decoding.
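For a sense of what the n-gram speculation idea looks like, here is a toy sketch (plain Python over integer token IDs, purely illustrative and not the actual llama.cpp implementation): the draft tokens come for free from text already in the context, and the main model only has to verify them.

from typing import List

def ngram_draft(tokens: List[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the last n tokens against earlier context."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Search backwards for the most recent earlier occurrence of the same n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # Propose whatever followed that occurrence last time.
            return tokens[i + n:i + n + max_draft]
    return []

# Repetitive content (boilerplate code, logs) makes such matches frequent, which is
# why n-gram speculation helps most when working on code that is already in context.
ctx = [5, 9, 2, 7, 1, 3, 5, 9, 2]          # ends with the n-gram (5, 9, 2) seen earlier
print(ngram_draft(ctx, n=3, max_draft=4))  # -> [7, 1, 3, 5]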
keepthememes@reddit
I'm using MTP on multi-GPU and it gives me up to 2x the t/s of no MTP.
wgaca2@reddit (OP)
Q4 is different
HopefulConfidence0@reddit
He is using Q4, but the point is that MTP will also work with Q6. You should try llama.cpp with the MTP branch.
see_spot_ruminate@reddit
What is this question? It has the common red flags: Windows, Ollama...
I get 40 to 50 t/s (up to 100 t/s with 2 requests) with the FP8 model on Linux and my 5060 Ti setup. As always: stop using Ollama and Windows.
wgaca2@reddit (OP)
3090 does not have fp8
see_spot_ruminate@reddit
My 5060 Ti setup does, along with enough VRAM across 4 of them to run it in vLLM.
wgaca2@reddit (OP)
What does your 5060 have to do with my question?
see_spot_ruminate@reddit
I think that people (not to insult you, but this could be something people search for later, so I am including it) go out and buy things that the community "recommends" without really understanding what they are doing.
Like, you have these 2x 3090 that... why did you get them? To do some local work, but you just got them because the 3090 is a meme at this point. You use Windows even though it is not really the OS for this kind of hobby/work, and then you get bad results when you try to use it. Again, this isn't just you, but so many people come to this subreddit and get sold on the 3090 or other hardware, yet refuse to spend a bit of time understanding why that hardware or software is suggested. I think it is like this for any hobby.
I included my 5060 Ti bragging because at the end of the day I will get more performance out of it. I am not going to convince you, but maybe someone else who reads this will stop buying old cards they don't understand, or will give Linux a try.
tl;dr: try Linux, ditch Windows, and stop buying old cards that are losing out to optimizations over time.
AdIllustrious436@reddit
Really? A 5060 Ti runs what? Q3? Q2 for agentic use? That's not 'more performance out of it' by any means.
see_spot_ruminate@reddit
My quad setup using vLLM runs the FP8 release directly from Qwen at full context and 40 t/s generation. Maybe that is not enough for you?
AdIllustrious436@reddit
My bad, I missed the 'quad'. I thought you were claiming better performance out of a single 5060 Ti than a 3090.
see_spot_ruminate@reddit
lol, people like to hate on the 5060 Ti, but I think it offers more performance these days than 1-2 3090s. I hijacked OP's question, but it's also to leave some answers around for when people search later. So, sorry to OP.
wgaca2@reddit (OP)
I don't know how you can compare them: 4x 5060 Ti will cost significantly more than 2x 3090 alone (if you don't overpay, of course), and the motherboard/CPU combo required to run 4 GPUs with lanes directly to the CPU is a significant extra cost. I am not saying one is better than the other, but they are not comparable.
see_spot_ruminate@reddit
My 5060 Tis cost around $400 apiece (last year), and connecting them is not actually that hard if you don't go over 4 cards. You can use just about any consumer motherboard. It's really not that expensive. Also, you don't need that many PCIe lanes if you are just doing inference; one of my cards is on an x1 link and it works just fine. People overthink this. Just min-max the situation.
wgaca2@reddit (OP)
$400 apiece, but not in Europe. You will be lucky to get one under £500; most 16 GB versions sell for over £470. Also, connecting GPUs through chipset PCIe slots massively bottlenecks any model splits.
see_spot_ruminate@reddit
I don't see those bottlenecks; if anything, it hits the cards' VRAM bandwidth limit first for me.
AdIllustrious436@reddit
I get it.
I guess people love the 3090 because it's the ceiling price to run these models comfortably with a fairly simple single-card setup.
see_spot_ruminate@reddit
Now that we can run a good, small-ish dense model at home with good results, we should all really rethink the "buy the 3090" recommendation. Between this and the pretty good MoE model from Qwen, people don't need to spend a lot of money. That said, I don't think we should chase any single model.
wgaca2@reddit (OP)
Running 3x 5060 Ti 16 GB at PCIe 5.0 x4 or x8 is more expensive than 2x 3090 at PCIe 4.0 x8.
see_spot_ruminate@reddit
Do you have a flash drive or something, or an extra SATA drive lying around? Don't go spending a lot of money, but put in that extra drive, install Ubuntu 24.04 LTS (not the newest 26 until around August or so, to let the bugs get worked out) and go from there. Check out r/linuxupskillchallenge. For your cards you won't even have to update all that much. Try the Vulkan binaries if you don't want to compile at first, and have fun. Ditch Windows.
see_spot_ruminate@reddit
Maybe?
rmhubbert@reddit
Ampere (so the 3090) doesn't have hardware FP8 support, but vLLM can run FP8 on Ampere through software (the Marlin kernel, I think?). Not as fast as a hardware implementation, but plenty fast in practice. Worth trying, in any case. I run Qwen3.5-122B-A10B at FP8 over 8x 3090 and it averages 80-100 t/s.
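For reference, a minimal sketch of that route with vLLM's offline Python API, assuming an FP8 checkpoint (the model path is a placeholder, not a confirmed repo) and the OP's two 3090s; on Ampere, vLLM falls back to the Marlin kernel for FP8 weights.

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/placeholder-fp8-checkpoint",  # placeholder: point this at the actual FP8 repo/path
    tensor_parallel_size=2,                   # split across the two 3090s
    max_model_len=32768,                      # keep the KV cache budget reasonable
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.45, top_p=0.8, top_k=20, max_tokens=512)
out = llm.generate(["Explain speculative decoding in two sentences."], params)
print(out[0].outputs[0].text)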
autisticit@reddit
Switch to vLLM, at least to know what performance you can attain.
wgaca2@reddit (OP)
vLLM provided almost identical numbers. Since I'm only using it locally for myself, I decided to stick with Ollama. Unless you can suggest a specific setup to increase single-user performance on vLLM?
Nepherpitu@reddit
https://github.com/noonghunna/club-3090
It needs tuning, but I'm getting 100 t/s decode (on 4 or 8 3090s) with vLLM. It should be even faster for 2x 3090 with FP8 weights.
see_spot_ruminate@reddit
The 3090 does not have FP8.
Nepherpitu@reddit
vLLM has a fallback to the Marlin kernel, which handles FP8 fine.
see_spot_ruminate@reddit
That is software working around it; the instruction set is not in the chip.
wgaca2@reddit (OP)
That's what I found out as well
Rattling33@reddit
Try the club-3090 repo above. I've gotten 80-100 t/s generation (depending on whether it's coding or not) and 1,400-2,500 t/s prompt processing for a single instance (2+ concurrency gets more in total), all the way past 128k context at those speeds.
L0ren_B@reddit
+1 for club-3090. They are an amazing community and sort stuff out fast! Qwen 27B became an amazingly fast model on my 2x 3090 thanks to them!
wgaca2@reddit (OP)
I will give it a try today and see if it gets better. I tried to configure vLLM the other day, but it gave the same numbers as Ollama. I guess using the repo settings might work better.
samoxis@reddit
If you tried vLLM and got similar numbers to Ollama, the context size is likely the bottleneck regardless of backend. At 196k you're bandwidth-bound, not compute-bound. Try an 8k context and compare again.
No-Refrigerator-1672@reddit
That's impossible. On Ampere, vLLM easily does 5x the speed of llama.cpp, and of Ollama by extension (it is based on llama.cpp). You probably set it up wrong. You can copy the launch commands from my post.
wgaca2@reddit (OP)
I will look into it
Herr_Drosselmeyer@reddit
Ballpark, that doesn't sound wrong.
samoxis@reddit
The 196k context is killing your speed: the KV cache alone is 15 GiB+ across both GPUs. Drop to 8k-16k for daily use and you'll easily see 40-50 t/s on 2x 3090. Reserve the big context for specific tasks only. Also, OLLAMA_KV_CACHE_TYPE=q8_0 helps reduce KV memory without much quality loss.
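Rough numbers for that suggestion, starting from the ~15.7 GiB of KV cache the OP's log reports at 196,608 tokens (f16 KV assumed): KV memory scales linearly with context length, and q8_0 roughly halves the bytes per cached element.

KV_AT_FULL_CTX_GIB = 7.6 + 8.1   # from the OP's load log
FULL_CTX = 196_608

for ctx in (8_192, 16_384, 196_608):
    f16 = KV_AT_FULL_CTX_GIB * ctx / FULL_CTX   # linear in context length
    q8 = f16 / 2                                # q8_0: ~1 byte/element vs 2 for f16 (approximate)
    print(f"num_ctx={ctx:>7}: f16 KV ~{f16:5.2f} GiB, q8_0 KV ~{q8:5.2f} GiB")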
Daemontatox@reddit
Your first mistake is using Ollama; use llama.cpp instead. Since you have the GPUs, use vLLM with tensor parallelism and FP8 (I don't know whether the RTX 3090 supports NVFP4 or not). Or, if you keep reusing the same prompts/messages, like data processing with a static system prompt, try SGLang with the same settings and use the FlashInfer/CUTLASS kernels.
Another option that's really great is MAX from Modular; it's faster than vLLM by a tiny bit, but sadly not as stable.
tmvr@reddit
Using llama.cpp directly will already be faster (about 2x your numbers), and switching to vLLM with tensor parallelism and MTP will speed it up even more.
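A minimal sketch of that first step (llama.cpp directly instead of Ollama) through the llama-cpp-python bindings; the GGUF path, the smaller context, and the 50/50 tensor split are assumptions to adapt, not a tuned config.

from llama_cpp import Llama

llm = Llama(
    model_path="/models/qwen36-27b-q6_k.gguf",  # placeholder path to the same Q6_K GGUF
    n_gpu_layers=-1,           # offload all layers, as Ollama already does
    n_ctx=16384,               # far below 196k; raise only when a task needs it
    n_batch=1024,
    flash_attn=True,
    tensor_split=[0.5, 0.5],   # rough split across the two 3090s
)

out = llm("Summarize the KV cache trade-off in one sentence.",
          max_tokens=256, temperature=0.45, top_k=20, top_p=0.8)
print(out["choices"][0]["text"])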