How do I get the superfast DFlash / MTP tokens per second that I'm seeing on here? Dual 3090s

Posted by runcertain@reddit | LocalLLaMA | View on Reddit | 22 comments

I'm trying to get these high tokens per second that I'm seeing on here using the new speculative decoding techniques.

Hardware: 2x3090, AMD 9900X, 32GB RAM, Gigabyte B850 AI TOP. Running Ubuntu 24.04, CUDA 13.0, NVIDIA-SMI 580.105.08


I'm running a specific forked driver version so that I can get the 3090s to communicate via P2P:

nvidia-smi topo -p2p r

GPU0    GPU1

GPU0 X OK
GPU1 OK X

Legend:

X = Self OK = Status Ok CNS = Chipset not supported GNS = GPU not supported TNS = Topology not supported NS = Not supported U = Unknown


For DFlash:

I followed this readme: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md

I built beellama (with the 3090 params set) and downloaded the recommended spiritbuun draft files and unsloth q5_k_s. Getting around 40t/s.

For MTP:

I built the most recent llama.cpp and tried the MTP versions of Unsloth Qwen3.6 UD-Q4_K_XL and UD-Q8_K_XL. Getting 50ish t/s.

As far as I remember, I was getting 40 t/s on basic Qwen3.5-27B, so where's the 2-3x speed generation.


Here's an example of some of my commands:

from llama.cpp: build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf" \ -ngl 99 -c 32000 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --port 8082

from llama.cpp: build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf" \ -ngl 99 -c 245600 -fa on -np 1 \ --spec-type draft-mtp --spec-draft-n-max 6 --host 0.0.0.0 \ --cache-type-k q8_0 --cache-type-v q8_0 \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap \ --reasoning off \ --port 8082

from beellama: build/bin/llama-server \ -m "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/Qwen3.6-27B-Q5_K_S.gguf" \ --spec-draft-model "$HOME/.cache/llama.cpp/Qwen3.6/unsloth/dflash-draft-3.6-q4_k_m.gguf" \ --spec-type dflash \ --spec-dflash-cross-ctx 2048 \ --host 0.0.0.0 \ --port 8082 \ -np 1 \ --kv-unified \ -ngl all \ --spec-draft-ngl all \ -b 2048 -ub 512 \ --ctx-size 245600 \ --cache-type-k turbo4 --cache-type-v turbo3_tcq \ --flash-attn on \ --cache-ram 0 \ --jinja \ --no-mmap --mlock \ --no-host --metrics \ --log-timestamps --log-prefix --log-colors off \ --reasoning on \ --chat-template-kwargs '{"preserve_thinking":true}' \ --temp 0.6 --top-k 20 --min-p 0.0