How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s

Posted by runcertain@reddit | LocalLLaMA | 22 comments

I'm trying to get the high tokens-per-second numbers that I'm seeing on here using the new speculative decoding techniques.

Hardware: 2x3090, AMD 9900X, 32GB RAM, Gigabyte B850 AI TOP. Running Ubuntu 24.04, CUDA 13.0, NVIDIA-SMI 580.105.08

I'm running a specific forked driver version so that I can get the 3090s to communicate via P2P:

nvidia-smi topo -p2p r

        GPU0    GPU1
GPU0    X       OK
GPU1    OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

For DFlash:

I followed this readme: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md

I built beellama (with the 3090 params set) and downloaded the recommended spiritbuun draft files and the unsloth Q5_K_S quant. Getting around 40 t/s.

For MTP:

I built the most recent llama.cpp and tried the MTP versions of Unsloth Qwen3.6 UD-Q4_K_XL and UD-Q8_K_XL. Getting 50ish t/s.

As far as I remember, I was getting 40 t/s on basic Qwen3.5-27B, so where's the 2-3x generation speedup?

Here's an example of some of my commands:

from llama.cpp:

build/bin/llama-server \
  -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf" \
  -ngl 99 -c 32000 -fa on -np 1 \
  --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \
  --port 8082

from llama.cpp:

build/bin/llama-server \
  -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \
  -ngl 99 -c 245600 -fa on -np 1 \
  --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --reasoning off \
  --port 8082

from beellama:

build/bin/llama-server \
  -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-Q5_K_S.gguf" \
  --spec-draft-model "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/dflash-draft-3.6-q4_k_m.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 2048 \
  --host 0.0.0.0 \
  --port 8082 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 245600 \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host --metrics \
  --log-timestamps --log-prefix --log-colors off \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 --top-k 20 --min-p 0.0

Trick-Assignment-828@reddit

The bottleneck with dual 3090s on PCIe P2P (no NVLink) is inter-GPU bandwidth: you're getting ~16 GB/s vs the roughly 112 GB/s (bidirectional) that a 3090 NVLink bridge would give you. Speculative decoding helps but can't fully compensate for that.
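As a rough sanity check on the ~16 GB/s figure, the theoretical per-direction PCIe bandwidth follows from the per-lane transfer rate, the lane count, and the 128b/130b line encoding. This is only a sketch: the x8/x8 lane split is an assumption about how the board divides lanes between two GPUs (the post does not state it), and `pcie_gbps` is a made-up helper name.

```python
# Theoretical one-direction PCIe bandwidth, before protocol overhead.
# Per-lane transfer rates in GT/s; PCIe 3.0, 4.0 and 5.0 all use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gbps(gen: int, lanes: int) -> float:
    """Return the theoretical GB/s per direction for a PCIe link."""
    raw_gbit = GT_PER_S[gen] * lanes        # raw Gbit/s across all lanes
    return raw_gbit * (128 / 130) / 8       # remove encoding overhead, bits -> bytes


if __name__ == "__main__":
    # ASSUMPTION: the lane widths below are illustrative, not read from the OP's system.
    for gen, lanes in [(4, 16), (4, 8), (3, 8), (3, 4)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_gbps(gen, lanes):.2f} GB/s per direction")
```

Real transfers land below these numbers because of packet and protocol overhead, so treat them as upper bounds.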

A few things to check:

For MTP: --spec-draft-n-max 6 might be too aggressive. Start with 3-4 and benchmark; if the draft acceptance rate is low, you're wasting cycles verifying bad tokens. Add --metrics and check tokens_drafted vs tokens_accepted.

For DFlash: turbo4/turbo3_tcq cache types are very new and driver-sensitive. With CUDA 13.0 + 580 driver you should be fine, but try dropping to q8_0/q8_0 first to isolate whether the cache type is hurting you.

The real ceiling: on dual 3090 PCIe you're realistically looking at 60-80 t/s on Qwen3.6 27B Q4 with good speculative decoding. The 2-3x numbers people post are usually on NVLink pairs or single GPU with a fast draft model that has very high acceptance rate.

What's your tokens_accepted_per_drafted ratio showing?

runcertain@reddit (OP)

This is what I got for the below settings:

prompt eval time =   165.91 ms /  33 tokens ( 5.03 ms per token, 198.90 tokens per second)
       eval time = 14611.98 ms / 771 tokens (18.95 ms per token,  52.76 tokens per second)
      total time = 14777.89 ms / 804 tokens
draft acceptance rate = 0.69746 ( 521 accepted / 747 generated)

~/llama.cpp$ build/bin/llama-server \
  -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \
  -ngl 99 -c 245600 -fa on -np 1 \
  --spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --reasoning off \
  --port 8082 \
  --metrics
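For comparing runs, the same numbers can be pulled out of the timing lines with a small script. This is a sketch that assumes the log format shown in this comment; `summarize` is a hypothetical helper. The tokens-per-target-pass figure also assumes each verification pass contributes exactly one token that did not come from the draft.

```python
import re

# Timing lines copied from the run above, one log line per row.
LOG = """\
prompt eval time = 165.91 ms / 33 tokens ( 5.03 ms per token, 198.90 tokens per second)
eval time = 14611.98 ms / 771 tokens ( 18.95 ms per token, 52.76 tokens per second)
total time = 14777.89 ms / 804 tokens
draft acceptance rate = 0.69746 ( 521 accepted / 747 generated)
"""


def summarize(log: str) -> dict:
    """Extract generation speed and draft statistics from llama-server timings."""
    # "eval time" also occurs inside "prompt eval time", so anchor at line start.
    ev = re.search(r"^\s*eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens", log, re.M)
    dr = re.search(r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)", log)
    if ev is None or dr is None:
        raise ValueError("log does not contain the expected timing lines")
    eval_ms, n_tokens = float(ev.group(1)), int(ev.group(2))
    accepted, drafted = int(dr.group(1)), int(dr.group(2))
    return {
        "tokens_per_second": n_tokens / (eval_ms / 1000.0),
        "acceptance_rate": accepted / drafted,
        # ASSUMPTION: every target pass adds one non-draft token to the output.
        "tokens_per_target_pass": n_tokens / (n_tokens - accepted),
    }


if __name__ == "__main__":
    for name, value in summarize(LOG).items():
        print(f"{name}: {value:.3f}")
```

Under that assumption the run averages about three output tokens per pass of the 27B model, against a ceiling of four with --spec-draft-n-max 3.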

runcertain@reddit (OP)

Thanks for this. For MTP I’m getting around 50 t/s after trying --spec-draft-n-max 2 and 3, which gave a small improvement. I will come back with the tokens accepted metric.

For DFlash I’m getting 65 t/s now with much lower KV usage thanks to those turbo quants. For me it seems the clear winner since it’s taking up so much less space than MTP, and the MTP KV cache can’t be quantized.

Now I gotta start shopping for an nvlink…

DeProgrammer99@reddit

Qwen's MTP is trained for predicting the next 3 tokens, not 6, so dial that down first.

runcertain@reddit (OP)

Thanks, I've been experimenting with that param.

Am I missing something about the lack of KV cache quantization with MTP? Suddenly Q4_K_M 27B models need nearly 48GB VRAM at full context.

DeProgrammer99@reddit

No idea about that. Should take a bit over 8 GB for the full context at Q8_0, given that's what my logs say for when I last ran it in full precision at half context.
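The reasoning behind that estimate can be sketched as arithmetic: Q8_0 stores 32 values in 34 bytes (32 int8 values plus one fp16 scale), i.e. 8.5 bits per element against 16 for f16, so doubling the context while switching from f16 to Q8_0 leaves the cache only slightly larger. The 8.0 GB starting point below is an illustrative placeholder, not a measured value, and the helper assumes the cache grows linearly with context, which may not hold exactly for hybrid architectures.

```python
# Bits stored per KV-cache element for a few llama.cpp cache types.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per block of 32.
# q4_0: 32 4-bit values + one fp16 scale = 18 bytes per block of 32.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,   # 8.5
    "q4_0": 18 * 8 / 32,   # 4.5
}


def scale_kv_size(size_gb: float, from_type: str, to_type: str, ctx_ratio: float) -> float:
    """Rescale a known KV-cache size to another cache type and context length.

    ASSUMPTION: cache size is proportional to context length and to the
    bits stored per element.
    """
    return size_gb * ctx_ratio * BITS_PER_ELEMENT[to_type] / BITS_PER_ELEMENT[from_type]


if __name__ == "__main__":
    # Hypothetical input: 8.0 GB observed at f16 with half the context.
    full_q8 = scale_kv_size(8.0, "f16", "q8_0", ctx_ratio=2.0)
    full_f16 = scale_kv_size(8.0, "f16", "f16", ctx_ratio=2.0)
    print(f"full context, q8_0: {full_q8:.2f} GB")
    print(f"full context, f16:  {full_f16:.2f} GB")
```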

Ke5han@reddit

Is it 2 or 3? I remember it being trained to predict the next 2 tokens. I could be wrong, as I am setting that value to 2 instead of 3.

DeProgrammer99@reddit

I don't know why I said it was trained for 3; Qwen recommends 3 in one place and 2 in another place on the model card.

The optimum number depends on the combination of your use case (probably including context length) and your hardware. Take a look at some other people's benchmarks of different numbers for spec-draft-n-max:

https://www.reddit.com/r/LocalLLaMA/comments/1ta4rvs/comment/olb9pl6/

https://www.reddit.com/r/LocalLLaMA/comments/1t7l56a/comment/ol2zudu/

https://www.reddit.com/r/LocalLLaMA/comments/1tfj9jv/qwen_3627b_dense_with_mtp_on_strix_halo_windows/

https://github.com/ggml-org/llama.cpp/pull/22673#issue-4375489925

https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4469129285

https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4375891776

https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4376778281
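Why the optimum shifts can be illustrated with a toy model. Under the simplifying assumption that each drafted token is accepted independently with probability p, a draft of length n yields (1 - p^(n+1)) / (1 - p) tokens per target pass on average. If each drafted token costs a fraction `draft_cost` of a target pass, and verification cost is assumed not to grow with n, the speedup is that expectation divided by 1 + n * draft_cost. The function names and the cost values are hypothetical; measured benchmarks like the ones linked above are the real answer.

```python
def expected_tokens(p: float, n: int) -> float:
    """Expected tokens per target pass for draft length n and acceptance probability p."""
    if p >= 1.0:
        return float(n + 1)
    return (1.0 - p ** (n + 1)) / (1.0 - p)


def estimated_speedup(p: float, n: int, draft_cost: float) -> float:
    """Toy speedup estimate; draft_cost is the cost of one drafted token in target passes.

    ASSUMPTION: verifying n drafted tokens costs the same as one ordinary target pass.
    """
    return expected_tokens(p, n) / (1.0 + n * draft_cost)


def best_n(p: float, draft_cost: float, n_max: int = 8) -> int:
    """Draft length in 1..n_max with the highest estimated speedup."""
    return max(range(1, n_max + 1), key=lambda n: estimated_speedup(p, n, draft_cost))


if __name__ == "__main__":
    # p = 0.7 is close to the acceptance rate reported earlier in the thread;
    # the draft_cost values are made up for illustration.
    for cost in (0.05, 0.1, 0.2):
        n = best_n(0.7, cost)
        print(f"draft_cost={cost}: best n={n}, speedup={estimated_speedup(0.7, n, cost):.2f}x")
```

In this model a more expensive draft or a lower acceptance rate pushes the best n down, which matches the advice to benchmark instead of assuming one setting fits every setup.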

Loud-Swim-2932@reddit

I'd be interested to know if you've ever noticed during your testing that MTP causes tokens to get mixed up. When using it in OpenCode, I've seen things like tags closing too early. That was with Gemma4.

runcertain@reddit (OP)

If I notice that I'll reply here. I'm still trying to wrap my head around the lack of KV quantization with MTP and how a Q4_K_M 27B model fills almost all 48GB of VRAM when running with full context.

Ok-Measurement-1575@reddit

Just add -sm tensor and watch it double. MTP adds like 10t/s on top from my quick play last night.

runcertain@reddit (OP)

When I do that, I'm getting this:

0.04.604.009 E llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented
0.04.604.013 E common_init_result: failed to create context with model '/home/harris/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf'
Segmentation fault (core dumped)

Ok-Measurement-1575@reddit

Stop quanting kv cache.
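Since the failure only shows up after the model starts loading, a small pre-flight check on the argument list can catch the combination earlier. This is a hypothetical helper, not part of llama.cpp; it encodes only the restriction reported in the error above and assumes the defaults are layer split mode and an f16 cache.

```python
import shlex

# Cache types that are not quantized.
UNQUANTIZED = {"f16", "f32", "bf16"}


def flag_value(args: list, names: set, default: str) -> str:
    """Return the value following the last occurrence of any flag in `names`."""
    value = default
    for i, arg in enumerate(args[:-1]):
        if arg in names:
            value = args[i + 1]
    return value


def tensor_split_conflict(args: list) -> bool:
    """True if tensor split mode is combined with a quantized KV cache."""
    # ASSUMPTION: defaults are "layer" split mode and "f16" for both cache types.
    split = flag_value(args, {"-sm", "--split-mode"}, "layer")
    k_type = flag_value(args, {"-ctk", "--cache-type-k"}, "f16")
    v_type = flag_value(args, {"-ctv", "--cache-type-v"}, "f16")
    return split == "tensor" and not {k_type, v_type} <= UNQUANTIZED


if __name__ == "__main__":
    cmd = "build/bin/llama-server -sm tensor --cache-type-k q8_0 --cache-type-v q8_0"
    if tensor_split_conflict(shlex.split(cmd)):
        print("tensor split with a quantized KV cache: expect the error above")
```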

runcertain@reddit (OP)

This gets me 56 t/s but now I'm using like 44 out of 48GB VRAM which is a pretty big tradeoff.

build/bin/llama-server \
  -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \
  -ngl 99 -c 245600 -fa on -np 1 \
  --flash-attn on \
  --spec-type draft-mtp --spec-draft-n-max 3 --host 0.0.0.0 \
  --jinja \
  --reasoning off \
  --port 8082

Ok-Measurement-1575@reddit

I get 64 and roughly +10 with mtp.

You can get about 80 base with vllm if you can stomach 7 minutes for a model change.

Ke5han@reddit

Should the spec type just be "mtp"?

runcertain@reddit (OP)

error while handling argument "--spec-type": unknown speculative type: mtp

usage:
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
                                    comma-separated list of types of speculative decoding to use
                                    (default: none)
                                    (env: LLAMA_ARG_SPEC_TYPE)

Creative-Type9411@reddit

You're running an old build if your spec type is just mtp; they changed it to draft-mtp before merging the branch.

jacek2023@reddit

Do you see better performance with P2P? Can you compare with and without?

MaruluVR@reddit

I tried using the P2P driver. It only helps if both cards are the same model, you are using --split-mode tensor, and each card has a minimum of 8 PCIe 3 lanes; if you go below that you will get a performance loss. (I usually have one of my cards at x4, so I disabled it.)

jacek2023@reddit

I see a tensor mode gain on my 3090s without P2P.

MaruluVR@reddit

I am talking about the reverse: you don't really see a benefit from P2P without split tensor. Even without P2P, split tensor is still good.