What speed is everyone getting on Qwen3.6 27b?
Posted by Ambitious_Fold_2874@reddit | LocalLLaMA | View on Reddit | 187 comments
I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0.
This is on 3 GPUs (1x 2060 Super 8GB, 2x 5060 Ti 16GB), via llama.cpp.
Unsure if this is slow or to be expected?
*/llama-server --port 8080 --model */llama.cpp/Qwen3.6-27B-Q8_0/Qwen3.6-27B-Q8_0.gguf -mm */Qwen3.6-27B-Q8_0/mmproj-BF16.gguf -np 1 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' --cache-type-k q8_0 --cache-type-v q8_0 -c 128000 --fit-target 1536
(--fit-target 1536 was to allow some space for the vision capability to work)
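If you want numbers that are directly comparable across setups, llama.cpp's bundled llama-bench reports pp/tg separately; a minimal sketch (the model path is a placeholder and assumes a recent build):

```
# Measure prompt processing (pp) and token generation (tg) with all layers on GPU
./llama-bench -m Qwen3.6-27B-Q8_0/Qwen3.6-27B-Q8_0.gguf -ngl 99 -fa 1 -p 2048 -n 256
```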
InevitableArea1@reddit
7900xtx (24gb vram) -100k context - LM Studio - Q5_K_XL (unsloth) - 19 tok/s
Amazing considering it's the smallest model that can actually do the simulation analysis I want/need. The Qwen3.5 35B MoE is great, but the 27B dense is another level entirely.
Most-Trainer-8876@reddit
q5 with 100K context on 24GB Vram, can you please share your settings?
InevitableArea1@reddit
Here, nothing too fancy; the key things are LM Studio's Unified KV Cache and flash attention. Here are my settings, also using AMD's ROCm Runtime (v2.13.0), and Qwen's recommended sampling parameters: Temp=0.7-1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
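For non-LM-Studio users, those sampler settings map onto llama.cpp flags roughly like this; a sketch of equivalent settings, not their exact LM Studio config (model path is a placeholder):

```
./llama-server -m Qwen3.6-27B-Q5_K_XL.gguf -ngl 99 -fa on \
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0
```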
matrik@reddit
1 x RTX 3090, Q5_XL: 78 t/s
Most-Trainer-8876@reddit
Can't believe it, share your settings?
MalabaristaEnFuego@reddit
28 tok/s on an RTX A5000, and it's an incredible local model for 27B.
Most-Trainer-8876@reddit
please share your settings, I am unable to push context to 100K, even with q8_0 kv cache.
Prestigious-Use5483@reddit
35 t/s with RTX 3090 | Q5_K_XL | 32K Context (F16)
Most-Trainer-8876@reddit
please share your settings
ridablellama@reddit
4090 - 40 tok/s with my short quick chats in LM Studio. Highly unoptimized, but: q8_0 KV cache, Unsloth Q4_K_M.
Icy_Butterscotch6661@reddit
I was getting 45 on a 4090 at small context lengths (with llama.cpp binaries downloaded from their GitHub).
Most-Trainer-8876@reddit
The 4090 has 24GB VRAM, right? How much context are you able to push?
SuitableElephant6346@reddit
3060, like 3 token a sec on q4 😅😭
throwaway9977558866@reddit
Can you share your config?
chankeypathak@reddit
Can someone explain it to me in layman's terms? I have a GTX 1650, Ryzen 5 3600, 32GB RAM.
Should I ditch the idea of hosting an LLM?
youcloudsofdoom@reddit
dual 3090 here. I'm getting 30 t/s with around 1200 p/p at 192k context on Q6_K.
-ngl 99
-b 4096
-ub 1024
-t 4
-tb 16
-fa on
KV caches are Q8 (-ctk q8_0 -ctv q8_0)
Unsloth's recommended temp etc. are all there.
Anyone doing any better, any suggestions? Feels like I'm leaving power on the table somewhere....
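(For anyone wanting to reproduce, those flags assembled into a single llama-server invocation would look roughly like this; the model path is a placeholder and the sampling values are the Qwen-recommended ones quoted elsewhere in the thread:)

```
./llama-server -m Qwen3.6-27B-Q6_K.gguf -c 192000 \
  -ngl 99 -b 4096 -ub 1024 -t 4 -tb 16 -fa on \
  -ctk q8_0 -ctv q8_0 \
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0
```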
logic_prevails@reddit
Sick setup, Q6 is the sweet spot
youcloudsofdoom@reddit
Yeah, I'm not mad at it; even at about 50% context fill I'm getting 1100 p/p and 25 t/s, so I shouldn't complain really. I've been spoiled by my 100 t/s Qwen 3.6 35B experience....
x10der_by@reddit
Q4 about 3 t/s on rtx 4070s 12Gb, 32Gb DDR4
RevolutionaryGold325@reddit
Isn't it the same model as Qwen3.5? Just a bit more training and better fine tuning. Should give you the same exact speed.
ea_man@reddit
Hidden layers are ~25% bigger than 3.5, so a bit more trained on tool use I guess. Bigger file.
RevolutionaryGold325@reddit
Not true. I went and double checked. The configs are pretty much identical:
https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/config.json
https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/config.json
ea_man@reddit
Model Overview
Qwen3.6-27B.
RevolutionaryGold325@reddit
We must be looking at different configs because that is not what the model configs define.
RevolutionaryGold325@reddit
Line 18 in the new configs:
https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/config.json#L18
Line 16 in the old config:
https://huggingface.co/Qwen/Qwen3.5-27B/blob/main/config.json#L16
AlwaysTiredButItsOk@reddit
Wrong. 3b active parameters = faster. Probably closer to 3.5 4b or 2b speeds
LocoMod@reddit
TIL dense models are just MoE’s in disguise. /s
rosco1502@reddit
I believe it is slightly larger
spaceman_@reddit
Single and dual AMD Radeon Pro R9700 numbers with llama.cpp, with both ROCm and Vulkan, for IQ4_NL, Q6_K and Q8_0.
Single cards are obviously swapping in the case of the Q8_0 benchmark.
I have not yet tried the new tensor parallelism, because I previously got horrible numbers on both backends. Not sure if this has since been fixed.
unsloth_Qwen3.6-27B-GGUF_IQ4_NL
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
unsloth_Qwen3.6-27B-GGUF_Q6_K
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
unsloth_Qwen3.6-27B-GGUF_Q8_0
Single card, ROCm:
Single card, Vulkan:
Two cards, ROCm:
Two cards, Vulkan:
ilintar@reddit
Q5_K_M with 2x5070, on -sm tensor --spec-default: 52 t/s.
FoxiPanda@reddit
RTX 5090 running UD_Q5_K_XL - ~45tok/s at token 1000, more like 35 at token 100000.
I have not optimized my launcher script at all yet though. YMMV.
RedParaglider@reddit
Man 45 t/s is super useable too.
FoxiPanda@reddit
Strix Halos only have ~250GB/s of mem bandwidth. That's ~7x less than a 5090. It's gonna be 7x slower lol
No_Mango7658@reddit
The 5090 is about 2 TB/s, for anyone who doesn't know.
FoxiPanda@reddit
1792GB/s but who's counting ;)
No_Mango7658@reddit
You just be running stock 😂🤩. The 5090 is amazing for these size models.
FoxiPanda@reddit
I am very much in the state of not catching my 5090 on fire :D
No_Mango7658@reddit
Probably smart
CMPUTX486@reddit
How fast is it on the AMD Max (Strix Halo)?
RedParaglider@reddit
11 t/s on a q4
My_Unbiased_Opinion@reddit
I probably would do either unquantized 3.6 35B MoE or a quantized 3.5 122B MoE on Strix Halo, IMHO.
annodomini@reddit
Yeah, with these really strong dense models coming out, I'm feeling like I need to pick up a desktop chassis with the discrete GPU. It's neat what I can run on my laptop, but I could really use more memory bandwidth.
bigh-aus@reddit
For my uses, I’d go context.
I tried out q4_k_xl on a 3090 with 96k context, about 30tps. At 46k tokens. (Openclaw coding).
FoxiPanda@reddit
Yeah I'm leaning that way too. I'll try it and see how far I can push it above 128K
bigh-aus@reddit
I had it pull a story off the backlog and complete it, but that included one compaction.
psxndc@reddit
I really appreciate you posting your docker command. Thank you!
FoxiPanda@reddit
For sure, no promises it won't change more as I figure out more optimizations, but this one feels really good so far. I actually just started a new post that pulls mostly from this comment of mine and is basically bait for someone who's way more vLLM knowledgeable than I am to tear apart my launch parameters :D
codables@reddit
Can you share your command line params? I’m assuming this is llama.cpp?
FoxiPanda@reddit
The unoptimized version uses this currently... I'm working at the moment so no time to poke at it further, but it will change (there is so much room for improvement from this):
codables@reddit
Thank you for sharing!
FoxiPanda@reddit
Note, I updated the parent comment with an entirely new setup that absolutely smokes this un-optimized llama.cpp setup. See https://old.reddit.com/r/LocalLLaMA/comments/1sss5og/what_speed_is_everyone_getting_on_qwen36_27b/oho2if8/
Limp_Classroom_2645@reddit
Man that's slow
Possible-Pirate9097@reddit
How fast is ur PP?
PinkySwearNotABot@reddit
pretty fast when i want it to be
Far_Cat9782@reddit
/flush ftw
FoxiPanda@reddit
Dense models be dense. Again, no optimization yet. I can probably get it to like... 50-60 with TG with some work. Frankly it is totally usable though. I can't read faster than 20-25tok/s on a single request. Background stuff obviously would be happier being faster.
iportnov@reddit
That's desktop, not laptop I assume?
FoxiPanda@reddit
Correct. I am power limiting to 500W though so my GPU doesn't catch on fire :)
Certain-Cod-1404@reddit
Are you on Linux? If so, do you know the specific command you used? And is it a set-it-once-and-forget-about-it thing, or do you need to set up systemd stuff to run it on startup?
FoxiPanda@reddit
I am not on Linux on this specific system, but uh, there is definitely a way to do this on Linux.
It's something like:
sudo nvidia-smi -pl 500
_hephaestus@reddit
For whatever reason, this is just per session and you’ll likely need a systemd script to have it persist across boots unless nvidia changed something.
FoxiPanda@reddit
Indeed, but really he should just have an LLM agent set this up for him. But yeah, that's something like:
/etc/systemd/system/nvidia-powerlimit.service:

[Unit]
Description=Set NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 500

[Install]
WantedBy=multi-user.target

sudo systemctl enable nvidia-powerlimit
This will vary between different Linux flavors though, I imagine. I am going to assume he's using something like a 2007 build of Gentoo.
Certain-Cod-1404@reddit
Ok, thank you so much. And is there any meaningful drop in performance compared to not limiting wattage?
Ambitious_Fold_2874@reddit (OP)
Does power limiting help with GPU longevity too? In that case I might have to start figuring out how to set this up too :/
FoxiPanda@reddit
I mean, the 12VHPWR / 12V-2x6 power connector is CLEARLY flawed. People would not be constantly reporting melty cables/connectors if it wasn't. Even a 0.1% failure rate in this case is thousands of cards failing.
So, reducing the power a little can at least back away from that hairy edge of the hardware's capability. 500W is still an insane power envelope and that last 100W is pretty diminishing returns as far as speed goes imo.
Gargle-Loaf-Spunk@reddit
well where's the fun in that. `--yolo`
Honest-Ad8881@reddit
9070 XT 16GB, 32GB RAM
Qwen3.6-27B, ctx = 32K
IQ4-XS: 15 token/s
UD-IQ3-XL: 23.5 token/s
UD-IQ3-XXS: 35.5 token/s
into_devoid@reddit
Is this useable?
Big_Mix_4044@reddit
Same as 3.5 27B: 30 tps TG and ~1k tps PP at a 200k context window (with slight degradation as the context grows).
shokuninstudio@reddit
Q4
[ Prompt: 207.6 t/s | Generation: 22.6 t/s ]
Q8 on a MacBook Pro 48GB is producing graphical glitches all over my screen so I shut it down. In theory there is plenty of RAM but llama.cpp has been grabbing more memory than needed lately.
Frank-w618@reddit
You can try oMLX on Mac; it uses much less memory compared to llama.cpp and is also faster.
bigh-aus@reddit
Try this and report back :)
IronColumn@reddit
what kind of diffrence do you see?
IronColumn@reddit
what macbook generation?
simracerman@reddit
UD_Q4_K_XL - 12 t/s. 64k context on 5070 Ti 16GB, partial offload to iGPU using llama.cpp vulkan backend. Just finished a lengthy code review of an app I’ve been building with Opencode. I’m super impressed with the level of depth the 3.6-27B has brought.
Iory1998@reddit
I get 22-23 t/s, Q8, KV FP16, at 170K using 1 RTX 3090 and an RTX 5070 Ti.
gnnr25@reddit
MacBook Air M2 16GB
[ Prompt: 2.8 t/s | Generation 2.2 t/s]
9B when? *cries*
Adventurous_Farm3073@reddit
Dual 5090s power limited to 420W: Unsloth Q8 gets around 40 t/s, Q4 gets around 70 t/s.
CMatUk@reddit
7900 XTX 24GB, 64GB DDR5, 7950X
LM Studio - Vulkan llama.cpp
K/V Cache Quant - Q8
32K Context
Qwen3.6-27b Q4_K_M (unsloth) 40.0 tok/sec
Qwen3.6-27b Q5_K_XL (unsloth) 35.3 tok/sec
Qwen3.6-27b Q6_K (LM Studio) 15.8 tok/sec
lurkatwork@reddit
You sellin that xtx?
Mirayum@reddit
Do you mind sharing your launcher/parameter options?
_ballzdeep_@reddit
"Qwen3.6-27B-UD-Q4_K_XL":
aliases: ["qwen36d", "qwen35d", "Qwen3.6-27B-UD-Q4_K_XL.gguf", "Qwen3.5-27B-UD-Q5_K_XL.gguf"]
timeouts:
responseHeader: 0
cmd: |
${llama}
--model /models/Qwen3.6-27B-UD-Q4_K_XL.gguf
--spec-type ngram-mod --spec-ngram-size-n 16 --draft-min 4 --draft-max 32
--jinja --ctx-size ${OC_CTX} --parallel 1
--fit on --fit-target 0 -fa on -ctk q8_0 -ctv q8_0
-b 4096 -ub 1536 --cache-ram 0 --ctx-checkpoints 12
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
--reasoning-format deepseek
This gives me TG:40 tps and PP:1350tps with 42% ngram acceptance.
Sticking_to_Decaf@reddit
FP8 with speculative decoding (mtp, 2), about 85tps on 1x Pro 6000 max-q.
Loud-Decision9817@reddit
3090 Q3 getting 114 tokens per second but I have my own custom software. 256k context
Steve_Streza@reddit
30 tok/s 7900 XTX on UD-Q4_K_XL. I've put no effort in tuning yet. 90 tok/s on 35B-A3B with UD-Q3_K_XL.
Lazzollin@reddit
2-3 tps on an RTX A5000 with offloading to RAM (CPU: Ryzen 7 9800X3D). I think my settings might be quite far from the most performance I could be getting though, so I just kept working with the 35B.
patricious@reddit
in LM Studio I get 208.30 t/s on a RTX 5090, Q8_0, 262k context size, temp: 0.6, Top K 20, Repeat Penalty 1, Top P 0.95.
For some reason I can't get llama.cpp to run properly; maybe I am not choosing the right settings in the script file.
Kahvana@reddit
You're likely confusing processing speed with generation speed.
Apprehensive-Fly4076@reddit
I think you're mixing it up with Qwen3.6 35B. Are you sure it's the Qwen3.6 27B that came out today? Me and some others on a 5090 are getting around 50 t/s.
patricious@reddit
Yes, my bad, it's the Qwen3.6 35B model and the speed is around 50-60 t/s.
DramaLlamaDad@reddit
You're getting downvoted because this isn't possible unless you have something like 5.4 TB/s of memory bandwidth, and most people here know it. Check the model again, you're probably on something else.
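Back-of-envelope check of that claim, assuming roughly one full pass over the weights per generated token and ignoring KV-cache reads:

```
# A 27B dense model at Q8_0 is roughly 26-28 GB of weights, read once per token:
echo "$((208 * 26)) GB/s"   # ≈ 5.4 TB/s of bandwidth needed, vs ~1.8 TB/s on a 5090
```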
patricious@reddit
forgot to add: UD_Q4_K_XL from Unsloth
mister2d@reddit
Qwen3.6-27B-UD-Q4_K_XL.gguf
28 t/s, all layers on gpu.
ttkciar@reddit
Interesting! I expected PCIe IPC to hit tensor parallelism perf more than that. Thanks for sharing!
A couple of questions, if you don't mind: Is that system PCIe 2.0 or 3.0? And are you using any kind of bridging link between the 3060s?
mister2d@reddit
PCIe 3.0. One gpu in a 8x slot while the other in the 16x slot. No physical linking other than the bus.
Kitchen-Year-8434@reddit
Vllm, mtp 3, FP8, rtx 6k - about 120 t/s.
DeltaSqueezer@reddit
and for prompt processing?
AustinM731@reddit
Have you had any issues running the FP8 KV cache?
JermMX5@reddit
Would you mind posting your startup script with params? We’ve been using llamacpp on our rtx pro 6000 box in aws and want to start looking at vllm
LegacyRemaster@reddit
same
eribob@reddit
Dual rtx 3090, FP8 quant in vllm, tp=2, mtp 2: pp=1650t/s, tg=26t/s
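A minimal vLLM launch for that kind of setup might look like the sketch below; the model ID is an assumption, and the MTP/speculative settings are omitted because the exact flags depend on your vLLM version:

```
# Dual-GPU tensor-parallel serve of an FP8 checkpoint (model ID is a guess)
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90
```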
PassengerPigeon343@reddit
Was looking for my HW on the list, thank you for the baseline, especially including prompt processing.
AdamDhahabi@reddit
3090 + 2x 5070 Ti, all cards around 900GB/s mem bandwidth, no tp and no mtp, running Unsloth Q8 with full unquantized context at 25 t/s. Something seems off with your 26 t/s.
Pleasant-Shallot-707@reddit
I get about 40-50 tok/s on my M5 Max
Eveerjr@reddit
what do you use for inference? I'm getting like half of that on my M5 Max
Eveerjr@reddit
24tok/s on M5 Max with MLX
IronColumn@reddit
What quant? I get 10 on an M1 Max with llama.cpp and the Unsloth Q4_K_M GGUF; I'd honestly expect an M5 to be more than 2.5x faster.
Eveerjr@reddit
I'm trying now with the qwen3.6-27b-nvfp4 mlx version and I get 26tok/s. It's quite pleasing to use tbh, the prompt processing is quite fast. This model is so good it doesn't even feel like it's running local.
Mine is the base model with 32 GPU cores.
monjodav@reddit
not normal lol
Eveerjr@reddit
What would be normal? I just downloaded from lmstudio
verdooft@reddit
[ Prompt: 2.2 t/s | Generation: 1.4 t/s ], Q6_K_XL
UniversalJS@reddit
Are you running it on a Gameboy?
verdooft@reddit
A Notebook without GPU and slow RAM. :-)
l33t-Mt@reddit
13.5 with Nvidia p40.
Late_Night_AI@reddit
You're hurting your performance with that 2060.
For LLMs, when you split a model across GPUs, the work has to pass through every card involved. If one card is much weaker, has less VRAM, or is (as in this case) simply slower, it becomes the bottleneck. You might actually see a performance boost if you don't use the 2060 and offload a little bit to system RAM instead, if you need more than 32GB. Also, lower quants are faster, so if you want a speed boost you could go to a Q6 or Q4 if it doesn't hurt quality too much for your use case.
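(A sketch of what that would look like in practice; the device indices, path, and offload values are illustrative, not the OP's exact setup:)

```
# Hide the 2060 Super so only the two 5060 Ti cards are used; layers that
# don't fit in their combined 32 GB stay on the CPU via partial offload (-ngl).
CUDA_VISIBLE_DEVICES=1,2 ./llama-server \
  -m Qwen3.6-27B-Q6_K.gguf -c 65536 -fa on -ngl 40
```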
rm-rf-rm@reddit
Yeah, feeling it's slow on my end as well.
Q8, llama.cpp, Mac Studio with M3 Ultra. ~20 tps
jedisct1@reddit
~15 tok/s with omlx on an Apple M5.
logic_prevails@reddit
What quant
jedisct1@reddit
Q8
New-Implement-5979@reddit
Single 5060 Ti - Q3_K_M with a 70k context window, I get 21 tokens per second.
maschayana@reddit
M5 Max
NVFP4 MLX by mlx-community
TG = 31.12 t/s (2272 tokens), TTFT = 0.54s
PinkySwearNotABot@reddit
wow. should i not even bother with the GGUF Q6? i have m1 max 64gb
maschayana@reddit
I would go with the 35b a3b. To me so far overall much better experience
PinkySwearNotABot@reddit
but...the benchmarks :(
maschayana@reddit
At least for me it's the balance of quality and speed and the moe just excels
DramaLlamaDad@reddit
Matches my results. On a MacBook Pro M5 Pro 64GB box, basically half of everything on the Max, I get 14.8 tok/sec.
Signor_Garibaldi@reddit
Do you find it usable at these speeds?
DramaLlamaDad@reddit
Obviously, I just got 3.6 set up, but based on past experience, 15 is not great for active coding. Fine for background tasks, which is what I'll use it for. Probably my new overnight code reviewer and daily code change summarizer.
Signor_Garibaldi@reddit
I probably shouldn't ask this on this sub, but isn't it too much hassle for what you could accomplish with an API in a sensible time and at marginal cost, or is it more of a hobby/satisfaction kind of thing?
DramaLlamaDad@reddit
For an example, I just loaded it up and asked it to review 5 files with a total of around 1000 lines of changes in Cline Code. It took about 10 minutes to complete the review but did a decent job. If you've got some housework to do and can start requests, go AFK for a while, and come back later, it works great! If you're a professional trying to get stuff done, about 50 tok/sec is really the minimum you should shoot for, not 10-15 tok/sec. :/
PromptInjection_@reddit
8 tokens / s, Q5, AMD Strix Halo
edsonmedina@reddit
I get about 7.4 t/s on Strix Halo with Q8
Ell2509@reddit
You should be getting more than that.
You are likely setting up your command in such a way that data takes multiple round trips over PCIe. That tanks your speed.
Flashy_Management962@reddit
Use a bf16 KV cache for Qwen models, do not use --fit-target on dense models, and use -sm tensor.
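(Roughly, that advice translates to flags like the following; the model path is a placeholder, and a bf16 KV cache assumes flash attention and a build/card that supports it:)

```
./llama-server -m Qwen3.6-27B-Q8_0.gguf -ngl 99 \
  -fa on -ctk bf16 -ctv bf16 -sm tensor
```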
jacek2023@reddit
FerLuisxd@reddit
Does it fit 100% on that GPU, or with RAM offload? Also, it is Q2, right?
jacek2023@reddit
49/65 means it's offloaded to RAM
zkstx@reddit
Hmm, what are the flags you use? I suspect you could squeeze some more layers (perhaps even the entire model) if you are willing to live with a little less context and a lower fit-target
IronColumn@reddit
10 t/s
m1 max studio 32gb
unsloth q4 k_m
Kahvana@reddit
On 2x ASUS PRIME RTX 5060 Ti 16GB I'm getting ~300 t/s processing and ~20 t/s generating, with nothing in context, Unsloth Q5_K_M, 128k max context. Will edit the message when I get to my PC.
viperx7@reddit
llama.cpp on a 4090+3090 setup I get TG 29t/s and PP 2500t/s
I am struggling with setting up vLLM; I can't seem to figure out the optimal flags and exact model to use. If anyone has a similar setup and would like to share their config, I will be thankful.
skibare87@reddit
Around 80 tok/s with speculative decoding active and 96k context window.
chisleu@reddit
15 tokens per second with an M4 Max / 128GB with the 8-bit quant.
Makers7886@reddit
4x3090 BF16 model and cache + MTP + vLLM with "instruct general" mode is:
ProfessionalSpend589@reddit
Is it worth it to run it at BF16?
I see another user in this thread and wonder: what bad things are you seeing when quantised…
Makers7886@reddit
It eliminates day 0 quant issues, some FP8's don't run on 3090s (w8a8, like Q3.6), and int8/fp8 is too tight for 2 gpus w/max context and concurrency which means I have to use all 4 with an obscene amount of kv cache laying around. Since this is only for my projects/uses I dont need more than 12 concurrent calls and is a waste of gpus. May as well run bf16 model and cache with max context + concurrency. I do blame lower quants when I hit tool issues at large context like 200k+ and see consistency drop off but I have no data around that.
Also, my main go-to is 122B FP8 on 8x3090s, which is faster and consistently beats Q3.5 27B BF16 (in my uses), and I want to give Q3.6 27B its best shot.
robertpro01@reddit
Which mobo?
Makers7886@reddit
romed8-2t + epyc 7502
QuinsZouls@reddit
26 tps using an RX 9070 16GB and turboquant at a 130k context window, using the Vulkan backend.
RoomyRoots@reddit
Around 20t/s on a RX 7800 XT, same as 3.5. I feel that since my last llama.cpp build I got some performance degradation but I don't have time right now to fix it.
Blindax@reddit
With LM Studio, Q8, 128k context, 2k token generation, I get around 7 t/s with a 5090 and 3090 (vs 150 t/s with the 35B). At 80k context I get around 23 t/s. I have noticed issues with both 3.6 versions in LM Studio (thinking loops etc., and apparently optimization issues too).
ziphnor@reddit
n00b here with 2x RTX 5060 Ti 16GB, Intel Core Ultra 235 with 64GB DDR5 (6400).
Using ik-llama.cpp with Q5_K_XL I get 23-24 t/s (~112 pp). This is with memory OC (+6000 MT/s, which is apparently fairly standard on these cards); with standard memory speed I think it was 19-20 t/s.
Embarrassed_Adagio28@reddit
Dual Tesla V100 16GB GPUs run it at 28 tokens per second at Q5 on LM Studio.
ziphnor@reddit
Can you share a bit more info on that setup?
Tormeister@reddit
76.3 tok/s
RTX 5090
vLLM 0.19
fp8_e4m3 KV
cyankiwi/Qwen3.6-27B-AWQ-INT4
logic_prevails@reddit
30 tk/s UD_Q5_K_XL, 5070 ti and 3080
fuse1921@reddit
Getting mid 20s with 27B Uncensored Q6 on 3x 3090 at full context
Dundell@reddit
3.62 t/s on a GTX 1080 Ti + 16GB DDR4... I'll just stick with 3.6 35B MoE, which was 35 t/s. Maybe my 6x RTX 3060s can handle it better in some configuration, for speed and at least 100k context size.
p211@reddit
May I ask how you got the 3.6 35B Moe to 35t/s on your setup?
Dundell@reddit
/home/dundell-discordbot/llamv2/llama.cpp/build/bin/llama-server -m /home/dundell-discordbot/llamv2/Qwen3-5/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj /home/dundell-discordbot/llamv2/Qwen3-5/mmproj-F16_3-6-35b.gguf --ctx-size 100000 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1 --fit on --flash-attn on --no-warmup --host 0.0.0.0 --port 8188 --api-key someapikey -a Qwen3.5-Thinking --temp 1 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --cache-ram 0 --image-min-tokens 1024 --jinja
xMarkv@reddit
+1 I’m also following. Trying to get 3.6 35b on a 3060 but it falls on its face during prompt processing
usuallyalurker11@reddit
I got ~4 t/s on my $800 laptop: Lunar Lake iGPU, Intel Core 140V, 32GB LPDDR5X.
I was not surprised: with Qwen 3.5 9B at the same quant I got ~12 t/s, and this one is even heavier, so it makes sense.
MarionberryWeird4021@reddit
Are you using the internal GPU or something external connected to your computer? Just to know...
toolman10@reddit
RTX 5090, the sweet spot for me in LM Studio is:
unsloth/Qwen3.6-27B-Q6_K (24.37 GB on disk)
ctx 256k
KV Q4
Getting ~50 tk/s
Diecron@reddit
prompt eval time = 14350.90 ms / 38697 tokens ( 0.37 ms per token, 2696.49 tokens per second)
eval time = 442.77 ms / 24 tokens ( 18.45 ms per token, 54.20 tokens per second)
ea_man@reddit
Works fine here with Vulkan on a 6700xt.
I mean not fast but as fine as the old one...
AeroelasticCowboy@reddit
my R9700 with Q5KM is PP @ 980 and TG at 27
Clean_Initial_9618@reddit
What's mmproj? How's it helpful?
Diecron@reddit
It's the vision decoder - lets the model take image inputs as well
ea_man@reddit
Same as the old 3.5, yet I can only use 1/4 of the context size (10K) for the IQ3_XXS on a 12GB GPU due to the bigger size. I hope Bartowski will release a slightly smaller IQ3...
launch:
/home/eaman/llama/bin_vulkan/llama-server \
-m /home/eaman/lm/models/unsloth/Qwen3.6-27B-UD-IQ3_XXS.gguf \
--host 0.0.0.0 \
-np 1 \
--fit-target 20 \
-ctk q4_0 \
-ctv q4_0 \
-fa on \
--temp 0.3 \
--repeat-penalty 1.05 \
--top-p 0.9 \
--top-k 20 \
--min-p 0.04 \
-b 512 \
--ctx-size 10000 \
--jinja \
--reasoning-budget 1 \
--chat-template-kwargs '{"enable_thinking":false}' \
--no-mmap
FinBenton@reddit
Q6_K_XL 210k context, 53 tok/sec output on 5090.
Opteron67@reddit
Fp8 model - dual 5090 102tk/s (single request)
Responsible-Exit68@reddit
RTX5090, UD-Q5_K_XL
1) Small prompt
Generation: 59 tok/s
2) 90k prompt -
Pre-fill: 2187 tok/s
Generation: 47 tok/s
dinerburgeryum@reddit
You can, if you're feeling saucy, move to ik_llama.cpp and use split mode graph for an uplift on multiple cards. I went from 20 tps at full context on a 3090+A4000 to 30 tps. It doesn't seem mind-blowing, but a 50% uplift wasn't nothin.
Ambitious_Fold_2874@reddit (OP)
I don’t know what I’m doing wrong but setting up ik_llama.cpp was a huge PITA and for some reason didn’t help with speeds; but this was a while back and with a different model and setup, so maybe it would work better here
Corosus@reddit
Every time I try ik vs llama.cpp it acts really broken in opencode; just trying now, it was trying to do everything via git commands only and got stuck, but with llama.cpp it's flawless.
BigYoSpeck@reddit
-sm tensor does the same thing now in llama.cpp
legit_split_@reddit
Credit to the amazing Johannes Gaessler, however I think it's not quite as mature as ik_llama's implementation.
dinerburgeryum@reddit
I've never gotten it to work, but I think it's because I'm on heterogeneous cards.
iChrist@reddit
I get the exact same speeds using the same quants (26 t/s, 3090 Ti, Q4, 128k context).
AeroelasticCowboy@reddit
What's your prompt processing speed? And is that 26 t/s at a bench depth setting of 128,000?
kevin_1994@reddit
on RTX 4090 + RTX 3090 using unsloth's Q8_XL quant
using
taskset -c 0,15 /home/kevin/ai/llama.cpp/build/bin/llama-bench -m /home/kevin/ai/models/Qwen3.6-27B-UD-Q8_K_XL.gguf -ub 4096,8192 -b 4096,8192 -ngl 999 -t 16
results:
Wirhoss@reddit
Arc Pro B70, I'm getting ~18 tok/s with UD-Q4_K_XL, still early testing.
car_lower_x@reddit
No idea what’s good or bad but getting 6.15 tok/sec with 7236 tokens on a 5090
0r1g1n0@reddit
```
uv run mlx-openai-server launch \
--model-path mlx-community/Qwen3.6-27B-4bit \
--model-type multimodal \
--served-model-name qwen36-local \
--reasoning-parser qwen3_5 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice
```
M4 Max 36GB: ~22 tokens per second
Maleficent_Bridge_41@reddit
vLLM, BF16, RTX 6k: ~480 t/s at ~4 requests/s via vllm bench
StanPlayZ804@reddit
I'm getting around 4 tokens/s running at BF16 with the full 256K context window.