MTP with Dual 3090s on Qwen 27B
Posted by DashinTheFields@reddit | LocalLLaMA | 5 comments
Does anyone know if MTP works with more than one 3090 yet? I see people with 5090s talking about it, but I'd like to know for us poors.
sheetis@reddit
I've been using it with the llama.cpp PR on a pair of AMD 7900 XTXs. Just make sure to use tensor parallelism; it doesn't seem to work well with the row/layer splits right now, because the MTP layers (loaded as their own model) don't span GPUs under the pipeline-parallel variants. TP works around this by presenting the cards as a single virtual Meta() device.
TL;DR -- If it already works for ROCm, CUDA should be set.
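If you want to sanity-check it on your own cards, here's a rough sketch of how I'd compare speed with MTP on vs. off: time one non-streaming request against llama-server's OpenAI-compatible endpoint and divide completion tokens by wall time. The host/port are the server defaults (adjust if yours differ), and the wall time includes prompt processing, so it slightly understates pure generation speed.

```python
# Rough tok/s check against a running llama-server instance.
# Assumes the default host/port and that the server reports OpenAI-style
# "usage" in non-streaming responses (llama-server does).
import time
import requests

def tok_per_sec(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0,
    }
    t0 = time.time()
    r = requests.post(url, json=body, timeout=600)
    r.raise_for_status()
    elapsed = time.time() - t0
    generated = r.json()["usage"]["completion_tokens"]
    return generated / elapsed

print(f"{tok_per_sec('Write a 400-word summary of speculative decoding.'):.1f} tok/s")
```

Run it once with the MTP build and once without, on the same prompt, and compare the two numbers.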
Fluffywings@reddit
Can you detail your config and setup? I've got an XTX and am about to pick up an XT to replace the 2070 Super.
DashinTheFields@reddit (OP)
So I should apply the PR that's been publicized? Sounds like it's worth a try.
How much more context have you been seeing? I'm getting around 130K right now. The cards share the memory, with one GPU doing its thinking; I'm wondering if I could run two servers and point one of the GPUs at other work as well (see the sketch below).
It would be nice to go overboard. I've already been using Qwen 27B for about 50% of what I'd otherwise use Claude for, and it's amazing.
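Something like this is what I have in mind for the two-server idea: pin each llama-server instance to one card with CUDA_VISIBLE_DEVICES and give them different ports. The model paths, ports, and context sizes below are placeholders, and each card would obviously need a model small enough to fit on its own.

```python
# Sketch of "two servers, one GPU each": each process only sees one card.
# Paths/ports/context are placeholders -- a 27B Q8_0 won't fit on a single
# 24 GB card, so this only makes sense with a smaller model or quant per GPU.
import os
import subprocess

def launch(gpu_index, port, model_path):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    cmd = [
        "./llama-server",
        "-m", model_path,     # placeholder GGUF path
        "-ngl", "99",         # offload everything to the visible GPU
        "-c", "32768",        # smaller context so it fits on one card
        "--port", str(port),
    ]
    return subprocess.Popen(cmd, env=env)

servers = [
    launch(0, 8080, "model-a.gguf"),
    launch(1, 8081, "model-b.gguf"),
]
for s in servers:
    s.wait()
```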
sheetis@reddit
I use a Q8_0 quant of Qwen3.6-27b. With it and the full 131072-token context on the dual-GPU setup, I see about 83% VRAM usage on each card (so a bit of free room).
Without MTP at this quant across the 2 cards, I see low-30s tok/sec for generation. With MTP enabled, depending on draft acceptance, I get anywhere from 55 tok/sec (on a 72k-context benchmark I keep for myself) up to mid-70s tok/sec during real work.
So while the draft "model" (it gets loaded as a separate model even though the MTP layers are part of the same set of weights on Qwen) uses some VRAM, everything still fits comfortably.
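For anyone wondering where the memory goes, here's a rough back-of-the-envelope. The layer/head/dim numbers are placeholders (pull the real ones from the model's config), and it ignores compute buffers and the MTP layers themselves, so it only shows the ballpark.

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache at full context.
# All model dimensions below are placeholders -- substitute the real values
# from the model's config.json.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    # K and V tensors, per layer, per token, at fp16 (2 bytes/element)
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt / 1024**3

weights_gib = 27e9 / 1024**3   # Q8_0 is roughly 1 byte per weight, ~27 GB for 27B params
kv_gib = kv_cache_gib(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=131072)
print(f"weights ~{weights_gib:.0f} GiB, KV cache ~{kv_gib:.0f} GiB at fp16")
```

With numbers in that range you land somewhere around 35-40 GiB before compute buffers, which is the same ballpark as the ~83% per-card figure above on two 24 GB cards.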
robertpro01@reddit
It works, and works great my friend.