24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
Posted by mdda@reddit | LocalLLaMA | View on Reddit | 34 comments
I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM).
Results (Q4_K_M models, 128k context):
| Model | tok/s | Key flags |
|---|---|---|
| Qwen 3.6 35B-A3B | ~24 | --n-cpu-moe 30, K=turbo4 V=turbo3 |
| Gemma 4 26B-A4B (no MTP) | ~20 | --n-cpu-moe 20, K=V=turbo3, --flash-attn |
| Gemma 4 26B-A4B + MTP (naive) | ~21 | embedding table silently on CPU |
| Gemma 4 26B-A4B + MTP (fixed) | ~24.5 | --override-tensor-draft "token_embd\.weight=CUDA0" |
The trick is MoE offloading: llama.cpp can park the cold expert weights in system RAM and stream them over PCIe to the GPU, while keeping the hot layers + KV cache in VRAM. The system is fully PCIe bandwidth-limited (the GPU sits at ~40-50% utilisation while PCIe 3.0 x16 is maxed out).
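For reference, the launch looks roughly like this (a sketch rather than my exact command - the model filename is a placeholder, and the turbo4/turbo3 cache types come from the TurboQuant fork, not mainline llama.cpp):

```bash
# Sketch: Qwen 3.6 35B-A3B with MoE expert offload and a quantised KV cache.
# -ngl 99 puts all layers on the GPU, then --n-cpu-moe keeps the expert
# weights of the first 30 layers in system RAM (streamed over PCIe on demand).
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 30 \
  -c 131072 \
  --cache-type-k turbo4 --cache-type-v turbo3
```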
Biggest finding: Gemma 4's MTP speculative decoding barely helps out of the box (~5% gain). Turns out llama.cpp unconditionally keeps the token embedding table on CPU. Normally that's fine (just a get_rows lookup), but Gemma 4's MTP assistant has a tied LM head - so every draft token does a full 262k×1024 matmul across PCIe. Forcing it onto GPU with --override-tensor-draft gives the real ~22% speedup and ~79% draft acceptance rate.
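The fix itself is one extra flag on the Gemma run (again a sketch - the filename is a placeholder, and the fork-specific flags that enable the MTP draft head are omitted; the blog has the full command):

```bash
# Sketch: Gemma 4 26B-A4B with the draft-side embedding table forced onto the GPU.
# (Fork-specific MTP/draft-head flags are not shown here - see the blog post.)
./build/bin/llama-server \
  -m Gemma4-26B-A4B-Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 20 \
  -c 131072 \
  --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn \
  --override-tensor-draft "token_embd\.weight=CUDA0"
```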
Setup pain points (Fedora 42 + Pascal GPU) - roughly sketched as commands just after this list:
- Pin akmod-nvidia to 580xx branch (Pascal is going legacy)
- Force gcc-14 for CUDA 12.9 (newer gcc rejected)
- Patch CUDA's math_functions.h for glibc 2.41 compatibility
- Used the AtomicBot-ai/atomic-llama-cpp-turboquant fork for both TurboQuant cache + Gemma MTP support
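For anyone who wants the shape of it without reading the whole post, the build boiled down to something like this (a sketch: package names, the versionlock pattern and the repo URL are assumptions inferred from the fork's name, and the glibc header patch isn't shown - the blog has every exact command):

```bash
# Pin the driver so a future update doesn't drop Pascal support
sudo dnf install -y dnf5-plugins
sudo dnf versionlock add 'akmod-nvidia*' 'xorg-x11-drv-nvidia*'

# CUDA 12.9 rejects the default gcc, so point nvcc at gcc-14
sudo dnf install -y gcc14 gcc14-c++

# Build the fork with CUDA for Pascal (compute capability 6.1)
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=61 \
      -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14
cmake --build build -j"$(nproc)"
```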
Full blog post with all the grindy build details (every command, and the debugging deep-dive into the MTP embedding table issue)
I'm also planning a YouTube video walkthrough soon - I'll update when that's live.
Happy to answer questions about the setup.
CatTwoYes@reddit
Been running Qwen 3.6 27B Q4_K_M for coding/agentic tasks for a while. Tool calling and single-file edits are rock solid. The quant only shows its teeth on multi-file refactors — the model starts missing cross-file dependencies that fp16 catches. For a $200 machine though, that's a tradeoff I'll take every time. The real bottleneck isn't the quant quality, it's what happens to TG speed when context actually fills up past 32k.
mdda@reddit (OP)
This is good to know - I guess I should look at speeds at different context lengths for each given max_context_length. Lower max values (like 32k - if that's all that works practically) would allow for more layers on the GPU (i.e. faster tok/s in general).
R_Duncan@reddit
Gemma's KV cache doesn't work that way. 1. the model gets dumber with cache quantization; 2. serious prompts (a full 64k context) will detonate your setup.
mdda@reddit (OP)
I'd be happy to test this out systematically - could you suggest something that would score the 64k prompt usage? I'm definitely expecting slowness, but I should also quantify how bad the 4-bit KV cache makes things.
R_Duncan@reddit
Just preprocess 4-5 scripts/code files together and ask something about those.
OsmanthusBloom@reddit
Thanks for the great writeup and summary. Very nice to see reasonable generation speeds with such old and relatively cheap hardware! Also the discovery about MTP issues with the Gemma4 embedding table was a useful finding. I hope that the MTP implementations will eventually take care of this edge case.
I don't see the point of aiming for 128k context on a setup like this. The PP speeds you got were around 50-60 tokens/sec, so it would take around 30 minutes to chug through a prompt of 100k tokens. Of course you may be aiming for some kind of slow analysis tasks where you can just leave your system chugging for half an hour or more, but for example realtime agentic coding will be frustratingly slow with such low PP speeds.
I would suggest dropping the context size to, say, 64k or even less, depending on what you really need for your use case; this will free up half your KV cache VRAM for better uses. Then I would increase ubatch size from the default 512 to 1024 or 2048. That will eat some VRAM (you may have to offload more expert layers into RAM) but should increase PP speeds significantly.
If you want to benchmark PP/TG speeds in more detail, you can either use llama-bench directly (it takes most of the same options as llama-server) or if you need to do it externally through accessing the API provided by a running llama-server, take a look at the llama-benchy utility.
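For example, something like this would sweep prompt length and ubatch size in one go (a rough sketch - the model path is a placeholder, and I'm assuming the fork's llama-bench accepts the same MoE-offload and cache-type flags as its llama-server):

```bash
# Sweep PP/TG speed across context fill and ubatch size
./build/bin/llama-bench \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 30 \
  -ctk turbo4 -ctv turbo3 -fa 1 \
  -p 2048,8192,32768,65536 -n 128 \
  -ub 512,1024,2048
```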
mdda@reddit (OP)
Thanks for the pointers to llama-bench - I'll be doing the context length benchmarking soon (though I'm guessing long context processing is going to make disappointing reading).
I wanted to go for 128k rather than smaller, since I figure that the agentic/coding usecases are what I'd mainly use it for - but in 'slow cooker mode' rather than live/realtime coding tasks.
Looking at how I use Claude Code (Sonnet 4.5/6) at the moment, any decent request takes more than a couple of minutes - at which point it only makes sense to be multi-tasking with other stuff (unless it's reading reddit until you have read it all...). So leaving it cooking for a bunch of time is just an extension of that idea. Even at 20 tok/s it's faster than I could be reading/typing, and it's not me doing it. OTOH, I did find that Qwen was around 2x more verbose than Gemma in its thinking, so that's a different issue.
OsmanthusBloom@reddit
Makes sense. But even if you want to use it for slow cooking, I would still encourage you to try out increasing ubatch size. It may well double your PP speeds and thus halve your cooking times (though TG may suffer a bit).
Personally I tend to set local models for at most 64k context because beyond that the model is not going to be very usable anyway. The model will get more and more confused by the long context (and aggressive KV cache quantization will make this even worse) and also the TG speed will degrade. So it makes sense instead to divide the task into smaller parts that can be processed independently with shorter prompts/context.
mdda@reddit (OP)
Yes - I'll be testing out the ubatch sizing (particularly since others have encouraged me to dig into the context-size effect on tok/s - which I'm guessing will make me sad).
The idea of dividing tasks into smaller pieces is interesting in itself - however, I'm not sure whether 'agentic coding harnesses' have a dial to tune this right now... There are clearly some orchestration choices that make sense for different model sizes (and for mixed model sizes too)
OsmanthusBloom@reddit
Here was someone who managed to massively (~5x) increase PP speeds by upping the ubatch size: https://www.reddit.com/r/LocalLLaMA/comments/1tany5t/comment/olbg29o/
Your 1080 setup will probably bottleneck earlier than that though.
Prudent-Ad4509@reddit
The problem is with Q4. It is better than nothing, but I cannot rely on it for anything except creative writing.
mdda@reddit (OP)
Well actually... We know that some model creators are specifically targeting the Q4 quant level during training - though it would only really work out if the internals of the quants fully match (just independently doing a 4-bit quant of the full model will surely give worse results than using the actual Q4-during-training version).
Prudent-Ad4509@reddit
The creators of one such model have said that this is apparently not worth it and that they will stop doing it. I think it was Kimi, and I think they quietly decided to continue with Q4 training anyway, but this needs to be checked. But this is a huge model, and huge models suffer less from training at Q4 and from being quantized to Q4 from higher original precision.
Client_Hello@reddit
In my experience Q4 handles smaller scripts just fine - definitely useful for bash, powershell, and python for specific tasks.
Prudent-Ad4509@reddit
Like I've said, it is better than nothing. Qwen3.5 122B is still useful even at UD Q3 (which preserves the most important parts at much higher precision than plain Q3), but it already loses proper code-formatting capability. Q3 for medium-size models like 122B and Q4 for small models like 35B/27B is the point where the model starts to fail at edge cases, but hasn't fully collapsed yet.
JustANerd420@reddit
On Qwen3.6 you can actually get 252k context at ~10 tok/s with 8GB VRAM + 32GB RAM:
https://www.youtube.com/watch?v=8F_5pdcD3HY
mdda@reddit (OP)
Good info! I was figuring that 128k was 'a lot', and would rather have the +20% speed, and half the full context (which, TBH, these models might not be that great at using when it gets so large). YMMV
Client_Hello@reddit
Your tests are all with small context, usually under 2000 total tokens. While you reserved 128k, you didn't actually use it. Reserving the larger context consumed VRAM, pushing more layers off to CPU, for a small performance hit.
Your blog post is 14k tokens. Drop it into a prompt and ask the LLM to translate to Spanish. Expect tok/s to fall by 20%
If you actually use the full 128k context it will crawl even slower.
mdda@reddit (OP)
Actually - rather than the blog post, is there something more standardised (like a script) that other people have used? I could leave it chugging and come back with a nice graph.
Client_Hello@reddit
The problem with standardized tests is these models are trained on them. Give it real data to see how it performs.
I suggested your blog post since these models were not trained on this data. Using public domain books or open source code can work, but then you're testing with data the model was trained on.
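If you want something scriptable, here's a rough sketch against a running llama-server (assuming it's on localhost:8080, and blog.md is whatever big local file you feed it - the native /completion endpoint reports prompt/generation timings directly):

```bash
# Build a long prompt from a real local file and read back llama-server's timings
jq -n --rawfile doc blog.md \
   '{prompt: ("Translate the following post to Spanish:\n\n" + $doc), n_predict: 512}' |
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' -d @- |
jq '.timings'
```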
monter72@reddit
First-rate blog post and summary, my hat is off to you. I have similar 20-ish tok/s results with Qwen 3.6 35B-A3B Q4_K_M on an older i5 5600 with 32GB DDR3 and a 1080 Ti 11GB. My sweet spot is --n-cpu-moe 28. I will upgrade tomorrow to a 3060 12GB to test your theory that PCIe is the bottleneck.
mdda@reddit (OP)
Every 1GB of VRAM should help. Have a look at the GPU utilisation (eg: nvidia-smi) : mine definitely capped out lower than I was expecting, with the saturated PCIe 3.0 bus being the bottleneck.
The 3060 should benefit you in several ways : a couple+ more layers in VRAM; PCIe 4.0 (~32GB/s in an x16 slot); and Tensor Cores.
Actually, now that I look at some of the comments in an old thread on TomsHardware, perhaps my PCIe 3.0 link itself (nominally 8GT/s per lane) is running below spec due to the motherboard set-up. I'll dig in!
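For anyone checking the same thing, these are the commands I'd reach for (nothing fork-specific here):

```bash
# What link generation/width has the GPU actually negotiated?
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv

# Live PCIe Rx/Tx throughput (MB/s) while a generation is running
nvidia-smi dmon -s t

# Cross-check from the kernel side: LnkSta shows the negotiated speed/width
sudo lspci -vv -s "$(lspci | grep -i 'VGA.*NVIDIA' | cut -d' ' -f1)" | grep -i LnkSta
```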
Worldly-Entrance-948@reddit
This is seriously impressive, squeezing 24 tok/s out of a 1080 with 128k context and MoE offload feels like the kind of hacky wizardry llama.cpp was born for
mdda@reddit (OP)
Totally : This was my first time digging into llama.cpp (which is why the blog post is painfully long - it has all the gnarly details). But the software is awesome - particularly since it's happy to compile for older hardware.
This is in contrast with Nvidia's ONNX releases which always want a card with Tensor Cores. (Unless someone has found the magic combination to get a decent ONNX version running on them... Please let me know!)
ikkiho@reddit
how's PP look at full 128k tho? tg numbers always look great with moe offload until you actually fill the context, kv quant compounds it. is the mtp embed override going upstream or just living in your fork?
mdda@reddit (OP)
I'm happy to run some context-stress tests too - is there a standardised script that other people have used?
And I didn't create my own fork... : """To add Gemma 4 MTP functionality, I found the AtomicChat GGUFs, which pointed at a more modern fork : AtomicBot-ai/atomic-llama-cpp-turboquant (which is itself forked from TheTom/llama-cpp-turboquant) and is the right combination of features for the MTP head + RotorQuant cache."""
Pretty sure it'll find its way upstream at some point : Also sure that the llama.cpp team is deluged with new stuff at the moment - and the different MTPs (which are much more non-standard than the models themselves) are a pain to add in a way that has good cross-model command-line option compatibility.
julp@reddit
That's interesting about needing to force the MTP onto GPU. Seems like an odd design decision on Google's end.
mdda@reddit (OP)
Actually, the original YouTube video by Codacus that inspired me also showed tests for the Qwen MTP arrangement, but he found that the Qwen speculative decoding didn't accelerate much at all. His explanation (which made sense when I watched it) was that the Qwen MTP add-on model also had some state-space-style layers in it which forced sequential processing - and so didn't net increase speeds. Maybe I'll revisit this, and just check whether the same issue as the Google one applies.
Anecdote time : In 2023, I talked to some people in the Keras team at Google. And the topic of low-sized models came up. They were actually mind-blown that anyone would actively be offloading GPU weights to RAM. Being so accelerator-rich at Google (with the TPUs etc) meant that they had (despite clearly being excellent engineers) never considered that the GPU-poor might ever think of such a thing.
OldEffective9726@reddit
That's a steal for only $200!!!
rm_rf_all_files@reddit
32gb ram machine
mdda@reddit (OP)
What a difference 2 years makes... And maybe this post increases the value people put on these lowly cards 😄
According_Study_162@reddit
I found a 32gb i7 box somebody was throwing out last year on the street. Then again I live in SF :/
mdda@reddit (OP)
At the time (~2 years ago) it wasn't stand-out-cheap, but it was the best value-for-money I could find that I could pick up locally. Buying a separate 1080 would have cost 'actual' dollars (I can see them locally for ~80 USD right now). But I found that when the sale was for a whole machine, sellers attribute almost zero 'extra' value to the graphics card - because it's EOL, etc. Even back then, they were kind enough to double-check with me that I understood how old the GPU was.
kwizzle@reddit
Amazing how moe models make those low VRAM cards useful