80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
Posted by janvitos@reddit | LocalLLaMA | View on Reddit | 141 comments
Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speed with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec on the benchmark found here: https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py
This is on an RTX 4070 Super, so results with other cards might vary.
To run llama.cpp with MTP support, you need to build it from source and apply a draft PR that hasn't yet been merged into the master branch. You can find a very nice guide on how to do that here and also download the Qwen3.6 MTP GGUF: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
llama.cpp command:
llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--host 0.0.0.0 \
--port 8080 \
-fitt 1664 \
-c 131072 \
-n 32768 \
-fa on \
-np 1 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
-ctxcp 128 \
--no-mmap \
--mlock \
--no-warmup \
--spec-type mtp \
--spec-draft-n-max 2 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
The most important parameter here is -fitt 1664. Since part of the model is offloaded to the CPU because of its size, this tells llama.cpp to properly balance the load between your GPU and CPU for the best possible performance, while leaving 1664 MB of free memory for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged into the iGPU), I can use all the available 12GB VRAM for inference. 1664 might be too small if you use your dGPU as your primary GPU.
Benchmark results:
mtp-bench.py
code_python pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
code_cpp pred= 58 draft= 40 acc= 37 rate=0.925 tok/s=81.8
explain_concept pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
summarize pred= 53 draft= 40 acc= 32 rate=0.800 tok/s=75.4
qa_factual pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
translation pred= 22 draft= 16 acc= 13 rate=0.812 tok/s=81.9
creative_short pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
stepwise_math pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
long_code_review pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2
If you have any questions, feel free to ask :)
Cheers.
Otherwise-Way1316@reddit
Thanks for this. Didn't think it was possible. Now achieving 100+ t/s with Qwen 3.6 35B on llama.cpp. Very usable and useful indeed.
Undyne76@reddit
sorry if this is a noob question but the q4 has 24gb so would it fit in 12gb of vram?
otacon6531@reddit
Not even close. I think even with minimal context in llama.cpp it's still around 20gb. But you can always offload to your CPU RAM. It just costs you speed (tokens/second).
Undyne76@reddit
Thanks for the answer. I think what I was missing is the A3B part. I looked it up, and my basic understanding is that only about 3B parameters are active for each token, so if you keep the always-used layers in VRAM and offload the rest of the experts to CPU RAM, the model doesn't have to read the whole 35B per token, and the token speed won't decrease that much even if you can't fit the whole model into VRAM.
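A rough back-of-envelope sketch of why that works out (all numbers below are assumptions for illustration, not measurements from this thread):
import math  # not strictly needed, kept for clarity

# Hedged estimate: per decoded token, only the ~3B *active* params of an A3B
# MoE have to be read, not the full 35B.
active_params   = 3e9    # active parameters per token ("A3B")
bits_per_weight = 4.5    # ballpark for a Q4_K-style quant (assumption)
ram_bandwidth   = 60e9   # bytes/s, rough dual-channel DDR5 figure (assumption)

bytes_per_token = active_params * bits_per_weight / 8
print(f"weights touched per token: {bytes_per_token / 1e9:.2f} GB")
print(f"worst-case CPU-only ceiling: {ram_bandwidth / bytes_per_token:.0f} tok/s")
# In practice the attention/shared layers and some experts stay in VRAM,
# so the real speed sits well above this worst case.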
EducationalGood495@reddit
Hi, I am new to LLMs and planning to buy either a 2080 Ti 11GB or a 3060 12GB to run Qwen 35B with offloading to CPU. Both are second-hand and good value, but the 2080 Ti has 70 watts more power draw and 1 GB less VRAM, while having roughly 2x the bandwidth. What do you think?
PeteInBrissie@reddit
3060 all the way
Rahul159359@reddit
https://youtu.be/8F_5pdcD3HY?si=LSz7gjmJvweFsvmL
ItsRektTime@reddit
I got the following benchmark results on a 3060 12GB and R5 5600 with 32GB RAM:
Resident_Worker_5807@reddit
Do you run on windows or other OS?
ItsRektTime@reddit
It was on a WSL2 Debian distro
mdda@reddit
I've got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand rig (i7-6700 w/ 32 GB RAM + GTX 1080 w/ 8GB VRAM). But apparently I need >4 upvotes before I can post the story...
Still-Notice8155@reddit
Qwen3.6-35B-A3B-MTP on GTX 1070 8GB + i7-11700 16GB
Config: turboquant+MTP | n-cpu-moe 32 | turbo4/turbo3 KV | ctx 131K | ctx-checkpoints 8
---
Gen t/s degradation (attention O(n) cost):
0K: 48 t/s ████████████████████████████████
10K: 31 t/s █████████████████████
30K: 28 t/s ██████████████████
50K: 23 t/s ███████████████
80K: 23 t/s ███████████████ ← DeltaNet plateau
100K: 19 t/s ████████████
125K: 13.6 t/s █████████
Curve flattens 30-80K thanks to 30 DeltaNet O(1) layers. Only 10 attention layers drive degradation.
PP t/s (batch-driven, unaffected by context):
Short prompt (<20 tok): 41 t/s avg — overhead bound
Batched prompt (50+ tok): 135 t/s avg — GPU parallel
At 125K ctx: still 78-95 t/s PP
Draft acceptance: 58-86% depending on task predictability. Lifetime: ~90%.
VRAM: 7.5 GB used, 633 MB free at 131K. Turbo4/turbo3 KV = 590 MB (vs 720 MB q4_0).
RAM: 12 GB used (model no-mmap = 13.2 GB + MoE CPU offload + 500 MB prompt cache). 2 GB free with checkpoints=8.
Improvement over non-MTP baseline:
(context: non-MTP → MTP+turbo = speedup)
5K: 27.4 → 48 = 1.8x
80K: ~7 → 23 = 3.3x
125K: ~3 → 13.6 = 4.5x
The gap widens at high context — MTP saves ~constant time per token regardless of context, while attention cost grows linearly.
Ok_Jury_8311@reddit
can you please share steps to have turboquant+MTP running?
Still-Notice8155@reddit
git clone https://github.com/jmpangilinan/llama-cpp-turboquant.git
DunderSunder@reddit
which quant is this?
Still-Notice8155@reddit
Qwen3.6-35B-A3B IQ4_XS + MTP on GTX 1070 8GB
Hardware
CPU: i7-11700 (8c/16t)
RAM: 32 GB DDR4-3200
GPU: GTX 1070 8GB (Pascal, stock clocks, no OC)
OS: Ubuntu 26.04, CUDA 12.4, driver 580.142
Model
Name: Qwen3.6-35B-A3B (MoE, 256 experts, 8 active, 3B active params)
Quant: IQ4_XS (19.4 GB, 4.37 BPW)
MTP: Q8_0 draft heads, 3-token speculative decoding
Arch: 30 DeltaNet (O(1)) + 10 quadratic attention (O(n)) layers
Context: 131,072 tokens
Server Flags
--n-cpu-moe 35 --no-mmap --parallel 1 --ctx-checkpoints 32
--spec-type mtp --spec-draft-n-max 3
--cache-type-k turbo4 --cache-type-v turbo4
--jinja -c 131072 -fit off
Build: llama.cpp master + PR #22673 (MTP) + turboquant cache patches
Turbo4 KV cache: 4-bit WHT quantization for K and V
Gen Speed vs Context
0–15K: 32.1 t/s
15–40K: 28.1 t/s
40–70K: 24.3 t/s
70–100K: 23.0 t/s
100–131K: 18.1 t/s
Prompt Processing
0–15K: 148 t/s
40–70K: 107 t/s
100–131K: 64 t/s
Draft Acceptance (MTP)
Per-task: 42–89% (varies by difficulty)
Global: 75–80% lifetime
VRAM at 131K
GPU model: 4,578 MB
KV cache: 1,122 MB (turbo4 compressed)
Recurrent: 251 MB
Compute: ~493 MB
Total: ~7.6 GB / 475 MB free
RAM
22 GB used / 9 GB free (32 GB total, --no-mmap)
Have retested with 32GB RAM. Still good performance. I'm not sure about the quality degradation.
Still-Notice8155@reddit
I have tried this benchmark https://github.com/alexziskind1/codeneedle
## Qwen3.6-35B-A3B IQ4_XS + MTP — CodeNeedle Positional Recall
Tests exact line-by-line recall: stuff entire source into context, reproduce
functions verbatim. Pass = ≥8/20 lines match exactly including whitespace.
MTP speculative decoding at n=3, turbo4 quantized KV cache.
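For reference, a rough sketch of that pass criterion (my own interpretation of the description above, not the actual codeneedle code):
# Hedged sketch of the scoring rule as described: the model's reproduction is
# compared line by line against the reference, whitespace included, and a
# function passes when at least 8 of the 20 checked lines match exactly.
def needle_pass(reference_lines, model_lines, needed=8, checked=20):
    pairs = zip(reference_lines[:checked], model_lines[:checked])
    exact = sum(1 for ref, out in pairs if ref == out)
    return exact, exact >= needed

hits, ok = needle_pass(["def f(x):", "    return x + 1"],
                       ["def f(x):", "    return x +1"])
print(hits, ok)  # 1 False -- the missing space fails the second line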
### Results
HTTP no-think: 10/11 PASS (91%), 187/220 lines (85%), 50 total hallucinations
HTTP think: 9/11 PASS (82%), 186/220 lines (85%), 66 total hallucinations
jQuery no-think: 14/16 PASS (88%), 283/320 lines (88%), 319 total hallucinations
jQuery think: 14/16 PASS (88%), 271/320 lines (84%), 43 total hallucinations
### MTP Draft Acceptance
Global Per-task range
HTTP no-think 94% 86-100%
HTTP think 93% 86-100%
jQuery no-think 91% 51-100%
jQuery think 87% 62-100%
Still-Notice8155@reddit
Qwen3.6-35B-A3B-MTP-UD-Q2_K_XL.gguf. I would love to test the Q4_K_M, but I don't have enough RAM for now.
FirefoxMetzger@reddit
what does the turboquant refer to here? K/V cache or model quantization?
Still-Notice8155@reddit
KV cache.
Creative-Type9411@reddit
the guide link is missing? for the "You can find a very nice guide on how to do that here and also download the..."??
chille9@reddit
50 t/s with rtx 4060Ti 16Gb and 32gb ram! Also using the q5 quant at a 98k context! Magnificent.
Loouiz@reddit
Is it stable? Did you make any other adjustments? I'm trying this with a 16GB 4080 Super and 32GB RAM and I'm getting OOM here and there...
chille9@reddit
I've made very small adjustments. I also recompiled using the instructions that OP had listed.
Here's the bat file I run in my llama dir where you can see my settings.
https://pastebin.com/dSkkKX60
It's been really stable for me. I hope you can solve it!
Loouiz@reddit
I've been running your config with a 16GB 4080 Super, 7800X3D, 32GB RAM. It is amazing, but I still get an occasional OOM here and there. Any tips?
janvitos@reddit (OP)
Raise -fitt to something higher. Try 128 increments. If you're using 1536, try 1664 😄
Loouiz@reddit
Oh, I've also been trying to come up with a way to use a 1080 I have gathering dust, but couldn't come up with anything, mainly because of pascal architecture. My only goal is agentic coding. Any ideas or resources you recommend?
leonbollerup@reddit
Sadly.. the quality of the answers... goes to hell.. at least in my tests:
--
This is the prompt:
---
A city is planning to replace its diesel bus fleet with electric buses over the next 10 years. The city currently operates 120 buses, each driving an average of 220 km per day. A diesel bus consumes 0.38 liters of fuel per km, while an electric bus consumes 1.4 kWh per km.
Instructions:
Verify your data
Use tables to represent data where you can
Relevant data:
- Diesel emits 2.68 kg CO₂ per liter.
- Electricity grid emissions currently average 120 g CO₂ per kWh, but are expected to decrease by 5% per year due to renewable expansion.
- Each electric bus battery has a capacity of 420 kWh, but only 85% is usable to preserve battery life.
- Charging stations can deliver 150 kW, and buses are available for charging only 6 hours per night.
- The city's depot can support a maximum simultaneous charging load of 3.6 MW unless grid upgrades are made.
- Electric buses cost $720,000 each; diesel buses cost $310,000 each.
- Annual maintenance costs are $28,000 per diesel bus and $18,000 per electric bus.
- Diesel costs $1.65 per liter; electricity costs $0.14 per kWh.
- Bus batteries need replacement after 8 years at a cost of $140,000 per bus.
- Assume a discount rate of 6% annually.
Tasks:
Determine whether the current charging infrastructure can support replacing all 120 buses with electric buses without changing schedules.
Calculate the annual CO₂ emissions for the diesel fleet today versus a fully electric fleet today.
Project cumulative CO₂ emissions for both fleets over 10 years, accounting for the electricity grid getting cleaner each year.
Compare the total cost of ownership over 10 years for keeping diesel buses versus switching all buses to electric, including purchase, fuel/energy, maintenance, and battery replacement, discounted to present value.
Recommend whether the city should electrify immediately, phase in gradually, or delay, and justify the answer using both operational and financial evidence.
Identify at least three assumptions in the model that could significantly change the conclusion.
---
Result:
leonbollerup@reddit
q-admin007@reddit
Awesome work. I have a 5070 Ti 16GB connected via Oculink to a Strix Halo. Will give it a go later with UD-Q6_K_XL. It seems to be the sweet spot in terms of precision on smaller systems. I also would rather halve my context and use f16 there.
zerozero023@reddit
Nice write-up. The -fitt flag is something I never paid attention to before — makes sense for hybrid GPU/CPU setups. Did you notice any quality difference with Q4_K_XL vs higher quants at this context size?
eliko613@reddit
Great writeup — the -fitt tuning is genuinely underappreciated. Most people just set -ngl 99 and wonder why their CPU is saturated.
A few things that helped me squeeze out a bit more on a similar split setup:
Also been using zenllm.io for quick parameter testing before committing to long runs — handy for dialing in temp/top-p without burning local resources. Not affiliated, just a useful scratch pad.
What's your tok/s looking like on this config?
RaspNAS@reddit
(I'm throwing this as a reply into the Reddit thread where this config was shared, so I'll start right with the body. Let me know if my English is off.) I tried the MTP benchmark on llama.cpp too. Thanks a lot! This ultra-high-speed LLM is insane!!!!
Hardware:
- GPU: RTX 3060 12GB
- CPU: Ryzen 9 5950X (16 threads)
- RAM: DDR4-3200 40GB
- OS: Windows 11 Pro (on Proxmox with PCIe passthrough)
Few-Annual-4415@reddit
hello, can you explain how to generate the mtp model from huggingface model?
Resident_Worker_5807@reddit
can i run it on Windows + Vulkan?
trialbuterror@reddit
Will this work on a 9060 XT 16GB with 16GB DDR4 and a 5600G processor?
How effective is it with coding software?
PeteInBrissie@reddit
I’ve done this today and for some reason OpenCode is looping weirdly compared to the non-MTP setup. If I work it out I’ll share here
PeteInBrissie@reddit
OK. My setup is an R5 5060G, 64GB RAM, RTX 4060 Ti. In OpenCode it was looping like mad until I set my context to 65576. Unfortunately OpenCode is also pushing 18,000 tokens at it, which means an initial reaction time of about 3 minutes - after which it's really quick. Pretty sure I was seeing 90 t/s at one stage last night.
b0ts@reddit
On my 3070 (8GB) with a Ryzen 9 7900 and 64GB DDR5 6400:
zabadey@reddit
Sorry for my dumb question, but does it mean that I can also use it with my 16gb ram mbp m5?
Snoo40301@reddit
Is this using the official llama.cpp or a fork for the MTP ?
pwmcintyre@reddit
legend! i'm finally getting useful results on my 4070 12GB
yoomiii@reddit
wake me up when MTP PR is merged
cognitium@reddit
Are you actually getting good output from that model though? It's the fastest local model I've ever used because only 3B are active at a time, but it'll use half of its context endlessly soliloquizing about how it's a good model that follows the rules and then doesn't follow them.
janvitos@reddit (OP)
Try it with thinking disabled: --chat-template-kwargs '{"enable_thinking": false}' (might be slightly different for Windows).
I feel it's pretty darn good at coding this way. Make sure you use the right launch params for instruct / non-thinking mode though: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
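As a sketch, those settings map onto the launch command from the post like this (keep the rest of the flags — -fitt, -c, --spec-type mtp, etc. — unchanged):
llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--chat-template-kwargs '{"enable_thinking": false}' \
--temp 0.7 \
--top-p 0.80 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0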
cognitium@reddit
Alright, I'll try those. I spent most of yesterday playing with Qwen3.6 35B and 27B and they both have issues with overthinking. The speed of the 35B is what's most impressive.
Substantial-Thing303@reddit
you can also try deepseek's scratchpad grammar on qwen3.6 to cut down on the thinking: https://github.com/noonghunna/club-3090/blob/master/docs/STRUCTURED_COT.md
_bones__@reddit
Getting 60t/s on an RTX3080 12GB with this setup. So quite useful!
I am getting a huge preprocessing time in an existing session, which is a bit weird, as I didn't have that with regular Qwen 3.6 before this, a Q3 that got me 45t/s.
Definitely interesting stuff, thanks for posting.
BitGreen1270@reddit
This is very cool, thanks for sharing. I used the same prompt on the non-MTP and the MTP version and got the following:
Non-MTP - [ Prompt: 80.3 t/s | Generation: 21.6 t/s ]
MTP - [ Prompt: 71.9 t/s | Generation: 28.1 t/s ]
Prompt speed seems to have gone down, but token generation has gone up significantly. This is on my 780m iGPU.
zulutune@reddit
Hey OP thank you so much for this. I have an underutilized 5070ti and I’m going to try this out. Hopefully this weekend.
zulutune@reddit
Btw did you try DeepSeekV4? I’m kinda curious for this model too.
janvitos@reddit (OP)
I've tried DeepSeek V4 cloud and didn't like it at all. When using cloud models for coding, GPT 5.5 is my top choice. In my opinion, its deterministic behavior makes it extremely apt at one shotting large and complex code additions/modifications.
To be honest, I found Qwen3.6 35B A3B local to be in the same league as most other, bigger open LLMs, except GLM 5.1, which can debug and resolve issues that Qwen3.6 cannot.
rz2000@reddit
Have you tried it with different thinking parameters? Using the flash version locally, I’ve found that completely turning off thinking gets good results.
zulutune@reddit
Does that mean you have a macbook with 128GB?
rz2000@reddit
A Mac Studio with 256 GB. I think Mac Studios with 192GB+, or the maxed out M5 MacBook Pro is what antirez was targeting with this inference engine.
In a couple of years this sort of performance will likely be cheap, and if I were Google, OpenAI, or Anthropic, that would worry me more than some of the other open model releases that briefly made the AI market crash.
I haven’t gotten Gemma 4 with MTP acceleration to work very reliably yet, but that is another way that local inference is becoming viable for much more than just hobbyist use.
zulutune@reddit
Gratz, you’re in a different league.
So how does Qwen 3.6 and DS4 compare, what’s your favorite? Do you ever feel the need to use cloud models, or does that level of GB’s really give you the raw power of a CC/Codex?
rz2000@reddit
I haven’t used either for much code assistance.
The “personality” of DeepSeek v4 is much more like GLM 4.6 or 4.7, which I think is pretty good, but without the need to quantize it down to 4 bits. DeepSeek v4 flash fits in 160GB of memory.
For tasks other than coding I find Qwen pretty unbearable. It seems very incurious and very worried about anything that might be innovative.
zulutune@reddit
Interesting observation haha :) thanks for sharing your insights
janvitos@reddit (OP)
I actually coded quite a bit of stuff with it and couldn't justify continuing to use it, even at its cheap price. I tried Flash and Pro and didn't see a difference. I'm even having doubts that DeepSeek V4 is actually running. It feels incredibly similar to DeepSeek V3. I did a quick test:
"From memory, tell me what's the most recent DeepSeek version? (Don't search the web).
Thinking: The user is asking me about the most recent DeepSeek version from memory, and specifically says not to search the web.
Based on my knowledge, the most recent DeepSeek models include:
- DeepSeek-V3 (released around December 2024)
- DeepSeek-R1 (released around January 2025)"
Which suggests it was trained on older data, which makes me think they might be using DeepSeek V3 behind their V4 endpoint. But that's all speculation, right? 😛
zulutune@reddit
Thanks for posting this!
StupidScaredSquirrel@reddit
Why -no-mmap?
janvitos@reddit (OP)
It's a general llama.cpp recommendation when using --mlock, which prevents swapping to disk. --no-mmap loads the entire model into RAM instead of mapping parts in as needed. That prevents disk I/O and makes memory usage more predictable.
BitGreen1270@reddit
I have a 780m igpu and adding --no-mmap makes it use 2GB extra RAM with nothing else changing. My prompt is just a 500 word story in the style of Roald Dahl. Since I only have 32GB, that's a pickle. No difference in tps though - still getting exactly the same. This is for non-MTP though. I'm downloading the MTP version to try out with your params (thanks so much!)
sh4rk1z@reddit
benchmarked mmap and no-mmap less than an hour ago on an RTX 2070S / Ryzen 3950X / 64GB RAM with VRAM limited to 6.25GiB for qwen-3.5-9B-ud-q4-k-xl, so I can use my desktop while using the local model. Results over 3 runs:
- ~1.5% decode speed improvement.
- ~5.2% prompt processing improvement.
- 28 MB less VRAM used.
- std dropped by 10-20x
- no disk I/O, so less wear and tear
I'm still experimenting with some things (turboquant, trellis) and will post once done, and then try Qwen 3.6.
letsgoiowa@reddit
Std?
CircularSeasoning@reddit
The user has entered three letters, "Std", with a question mark, possibly hoping to elicit more information about (Something To Do?) with "Std"? I'm not sure what that means. I should ask for clarification.
Wait! The user's name is 'letsgoiowa' (i.e., "Let's go, Iowa!") so let me research what happens in Iowa in connection with the letters or acronym, "STD"...
[web search content omitted]
Ah.
I should helpfully advise the user to test for: Chlamydia.
All good.
Proceed.
letsgoiowa@reddit
Standard deviation lol
But thanks
sh4rk1z@reddit
😂😂😂
CircularSeasoning@reddit
letsgoiowa looking at me all σ_σ
theowlinspace@reddit
--mmap with --mlock shouldn't use disk I/O after you've loaded the model, because it locks the mapped pages in RAM
janvitos@reddit (OP)
I'd be curious to know which parameter gave you these improvements? mmap or no-mmap?
sh4rk1z@reddit
no-mmap
janvitos@reddit (OP)
Thanks!
StupidScaredSquirrel@reddit
But when is it useful to have mmap then, if --no-mmap still loads everything that's needed?
Marksta@reddit
Because mmap is faster for loading the model on Linux, where it has a real system-level mmap. And if you were to turn the server off and back on again, the model would still be in the page cache. Restarting something big like DeepSeek without mmap would mean waiting a few minutes each time to load it, unload it, load it again...
farkinga@reddit
When the model is big, and the weights will be in system RAM anyway (e.g. a MoE), use mmap (on Linux) to avoid loading the whole model into RAM up front. With mmap, Linux will page the weights into RAM as needed. However, use no-mmap if you have a performance reason to keep all the weights resident in RAM; it should run a little faster with no-mmap, but it takes longer to start.
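As a rough rule of thumb (the model path is just a placeholder):
# lazy paging (default): fast restarts, RAM filled only as weights are touched
llama-server -m model.gguf
# preload and pin: slower start, steadier speed, no disk I/O during generation
llama-server -m model.gguf --no-mmap --mlock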
janvitos@reddit (OP)
To be honest, I would say try both and see what works best for ya 😄
dark-light92@reddit
So it doesn't mmap.
Weird_Night_2176@reddit
Been self-hosting AI for the past few months and finally got it to a point worth sharing. The stack:
- Jetson Orin Nano Super: CrewAI orchestration, 14 AI agents
- Orange Pi 5 Plus: Ollama model server
- Odroid XU4: PostgreSQL memory layer
- Jetson Nano 4GB: Tailscale mesh, network services
Total monthly cost: $8 (electricity + Claude API for final decisions only). The agents run a paper trading desk, generate SEO content for a local business client, write YouTube scripts, and send me a morning briefing every day via WhatsApp. All local, all private, zero cloud dependency.
Documenting the whole build on YouTube if anyone wants to follow along: https://www.youtube.com/@BlackBoxAILab
Happy to answer questions about the hardware setup or the agent architecture.
admajic@reddit
Huh? On a 3090 I'm getting an average of 150 tok/s, topping out at 200 tok/s. Amazing how much offloading destroys your speed.
PrometheusZer0@reddit
what's your setup? Lucebox?
admajic@reddit
Using MTP you need to pull it from git; I did a write-up about it.
janvitos@reddit (OP)
That's because the entire model loads into your VRAM, which is impossible on a 12GB GPU.
MistingFidgets@reddit
Spec decode and MTP are really awesome. I have some benchmark data I want to share but can't post yet, need some upvotes on comments before LocalLLaMA will let me.... help me out here
FirefoxMetzger@reddit
Hm, so the reason this works as well as it does is that you offload layers to host memory (i.e. your total footprint is >12GB) and you increase decode tok/s with speculative decoding using a draft model?
janvitos@reddit (OP)
Exactly!
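For anyone curious what that verify step looks like, here is a toy sketch of the greedy draft-and-verify loop (MTP uses extra prediction heads of the same model rather than a separate draft model, but the accept logic is the same idea; both "models" below are toy stand-ins, not llama.cpp code):
import random

VOCAB = list("abcdefgh")

def target_next(ctx):
    # toy stand-in for the full model: deterministic next token from the context
    return VOCAB[sum(map(ord, ctx)) % len(VOCAB)]

def draft_next(ctx):
    # toy stand-in for the MTP draft head: agrees with the target most of the time
    return target_next(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx, k=2):          # k plays the role of --spec-draft-n-max
    # 1) draft k tokens cheaply
    drafted, c = [], ctx
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c += t
    # 2) check the drafted tokens against the target model
    #    (in the real thing this is one batched forward pass, hence the speedup)
    accepted, c = [], ctx
    for t in drafted:
        if target_next(c) != t:
            break
        accepted.append(t)
        c += t
    # 3) the target's own prediction at the first mismatch comes for free
    return accepted + [target_next(c)]

ctx = "a"
for _ in range(20):
    ctx += "".join(speculative_step(ctx))
print(ctx, len(ctx))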
the_masel@reddit
Interesting. Did you compare it without MTP? With my 5060 Ti 16GB, I get around a 15% increase to 66tok/s. Is this normal?
oviteodor@reddit
Thank you OP
Plastic_Use_4610@reddit
Seems really high for the hardware - well done
alchninja@reddit
Hey, thanks for the info! Could I ask what your CPU and RAM specs are? I'm on a Ryzen 5700X and 32GB DDR4-3600, just trying to get a feel for how much people are able to benefit from having newer CPUs and DDR5.
janvitos@reddit (OP)
Here's my specs:
AMD Ryzen 7 9700X
48GB DDR5-6000
I'm surprised I'm not encountering more issues with the 3 x DIMM RAM config. It's actually running great even with EXPO I 😄
I was able to run the same model (non-MTP) with 32GB, but it was tight. That's why I stole a 16GB DIMM from my son's gaming PC. With 48GB, I have a 10-12 GB buffer at all times when the model is loaded.
alchninja@reddit
Thanks! I bet your son is super happy about his missing RAM stick lol
Yep, getting into local LLMs and seeing how Kubuntu breathed new life into my 9 year old Dell XPS (I don't know how I lived without KDE Plasma for so long) finally pushed me away from Windows for good on all my machines. I still keep it on a partition just for the occasional gaming session with a friend (unfortunately the stuff we play needs Windows) but I can't imagine ever using it as my daily again.
OsmanthusBloom@reddit
Thanks a lot, this is inspiring! I'm trying to see if I can use MTP on my poor 3060 Laptop with just 6GB VRAM.
One stupid question though: how did you get mtp-bench.py working with current llama-server? What command did you use to run it?
For me it just gives 400 Bad Request errors regardless of how I try to run it. I suspect the problem is the call to "/completion" (I think it should be "/v1/completions"?)
janvitos@reddit (OP)
If you look at the end of mtp-bench.py, you see the following line:
ap.add_argument("--url", default="http://127.0.0.1:8080")
If your server is not already running on http://127.0.0.1:8080, you can either modify mtp-bench.py to match your server host/port, or change your server port to match mtp-bench.py, and it should work 😄
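For example, if your server is listening on port 9090 instead (the port here is just a placeholder):
python mtp-bench.py --url http://127.0.0.1:9090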
OsmanthusBloom@reddit
Yeah. My problem was that I was using llama-server with the --models-preset option, which means it will run a proxy server on port 8080 and start separate workers for the requested model. In this mode the REST API is more limited and mtp-bench didn't work. As soon as I switched to the traditional CLI mode (lots of cli options) mtp-bench started working without any options.
janvitos@reddit (OP)
Awesome! Glad you found the issue :)
leonbollerup@reddit
Have you run any test to compare the quality against a “normal” model ?
Due_Steak_1249@reddit
Have you observed any performance degradation as the context window reaches capacity? Historically, a 32k token limit appeared to be the optimal threshold for maintaining accuracy; for instance, Qwen3 reportedly showed a decline from 95% to 75% accuracy when scaling toward 128k.
Conversely, some users suggest that operating significantly below the 128k mark may increase the model's susceptibility to repetitive loops. I am interested in the current state of the art regarding this architecture and your practical experiences using it. It appears that users are currently forced to balance significant trade-offs between context volume and output reliability.
janvitos@reddit (OP)
I've coded quite a bit with Qwen3.6, not as much with MTP though. Did lots of code additions, debugging and refactoring on ~10,000 line projects. Never noticed any degradation at all.
Unfortunately, I realized Qwen3.6 cannot compete against larger models like GPT 5.5 for more demanding coding tasks, and often simply cannot produce any working code. But I still feel like Qwen is very capable for small projects where logic isn't pushed too far. I've had much more success with Qwen than Gemma 4.
IrisColt@reddit
Thanks a lot!!!
coolaznkenny@reddit
Going to utilize this guide once I get my hands on a Steam Machine!
Sufficient_Sir_5414@reddit
How are you balancing the KV cache for the 128k context window alongside the MTP draft model on only 12GB? Did you have to aggressively tune the -fitt parameter or sacrifice context depth to maintain that 80% acceptance rate?
janvitos@reddit (OP)
That's the magic of -fitt: Once you find the sweet spot that doesn't cause any OOM, you get a rock solid local inference setup that can perform very well even on a hybrid GPU/CPU config. No long tuning. No sacrifices. Just a few code analysis / creation runs with the agent to fill the context and test the VRAM limit.
singlegpu@reddit
Any recommendations on where to learn more about these parameters?
janvitos@reddit (OP)
Here you go: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
You can also ask any AI to explain them more in detail 😄 I got some pretty good answers from Gemini.
feik696@reddit
I'm not too experienced with PCs, so I've mostly been using LM Studio; I have the same graphics card as you. However, where LM Studio shows 30 tokens per second, I'm getting half that amount here. It's possible that I've made a mistake with the compilation, but then again, it wouldn't have started in the first place, right?
janvitos@reddit (OP)
I also started with LM Studio, but to be frank, I never got good results with it. When I switched to llama.cpp, it was a night and day difference. LM Studio is a wrapper around llama.cpp that seems to add latency to the process. And you can never really be sure which parameters it passes to llama.cpp. If you can run llama.cpp directly, I'm pretty confident you'll get much better tok/sec!
feik696@reddit
code_python pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=14.1
code_cpp pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=13.4
explain_concept pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=13.2
summarize pred= 53 draft= 40 acc= 32 rate=0.800 tok/s=14.1
qa_factual pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=14.7
translation pred= 22 draft= 16 acc= 13 rate=0.812 tok/s=14.7
creative_short pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=13.1
stepwise_math pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=14.5
long_code_review pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=13.7
Aggregate: {
"n_requests": 9,
"total_predicted": 1419,
"total_draft": 1060,
"total_draft_accepted": 877,
"aggregate_accept_rate": 0.8274,
"wall_s_total": 113.26
}
sirnixalot94@reddit
I haven't tried MTP yet, but I have that same model running on an RTX 4080 16GB with --cpu-moe=20 and I'm getting 105 t/s pp and right at 50 t/s generation speed. I'm going to check this out and see if adding this on top will improve my performance even more. Thanks for the findings!
janvitos@reddit (OP)
Definitely try the -fitt flag. It replaces --cpu-moe and the guesswork. The only thing you need to evaluate is the right amount of reserved memory. So for non-MTP, I started with -fitt 256, but ran into OOM errors here and there. It was rock solid with -fitt 512. You can check your VRAM usage with nvidia-smi. For me, 11800MiB / 12282MiB is pretty much the max I can push.
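A simple way to watch it live while you test (stock nvidia-smi, refreshing every second):
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1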
unrevealedpains@reddit
How would it run on my 4GB VRAM RTX 3050? I know this might be a stupid question, but I am new to all of this.
janvitos@reddit (OP)
Not stupid at all! You should try it 😄 I'd be curious and happy to see the result!
mindinpanic@reddit
Promising! Did you get any issues with the coding agent context?
janvitos@reddit (OP)
Nope 😄
EmelineRawr@reddit
Interesting, I also have a 4070 SUPER and was happy with a 40 tk/sec, I'll try your thing, thanks!!
cookieGaboo24@reddit
Hey! Followed your list to the T, and yet fit crashed out and dropped me down to only 4K ctx when I try to use it. And with MTP my tokens on a 3060 12GB drop from 35 down to 4. Do you perhaps know what the issue could be? Best regards
HavenTerminal_com@reddit
the spec-draft-n-max 2 vs 3 finding is the kind of thing you only figure out by running both. appreciate you logging it.
janvitos@reddit (OP)
And I recommend that everyone test their own values, as I've seen others find success with 3 or 4 😄
FrostWolfDota@reddit
I have a 16GB AMD cpu, will try to reproduce it when I find some time. Never tried using llama.cpp directly, only through LM Studio.
house_monkey@reddit
Wish I could reproduce my 16GB AMD cpu
Independent-Flow3408@reddit
This is a really useful writeup, thanks.
The "-fitt 1664" detail is the part I would have missed. For long-context coding workflows, did you notice the speed dropping mainly from KV/cache pressure, or from CPU/GPU balancing once the context gets large?
Also curious if you tested this with an agent workflow like OpenCode/Continue, or only direct llama.cpp prompting.
janvitos@reddit (OP)
Speed drops towards 50 tok/sec when context has filled up near 128K. But that's still very reasonable and usable. Didn't notice quality degradation.
I've been using this with Opencode for the past few days without any issues or hiccups. I can analyze the entire codebase of a small project, which fills up the context near 75K, and continue working on it for a little while without apparent problems.
So yeah, I would consider this as pretty stable 😄
ai_without_borders@reddit
The 80 tok/s is with 128K context loaded — at shorter contexts (4-8K) you would be pushing 100+ easily. MTP overhead shows up more in prompt processing than in token generation, so the win is biggest on long generation runs vs short QA bursts. Good config though, --no-mmap with --mlock is the right call for sustained throughput.
asterion24r@reddit
Spectacular
iamapizza@reddit
Does it work if you use --fit, --fit-target, and --fit-ctx? Supposedly these args should be taking care of using as much vram as possible.
evilbarron2@reddit
Hmm. I get 100+ tok/sec (as measured by the llama-server WebUI) with Qwen3.6 35B A3B on my 3090 with my prompt.
429_TooManyRequests@reddit
Wow this post is perfect timing. I have a 3080 Ti and was depressed I couldn’t get this exact model working last night. I’ll try it out today and send results!
ducksoup_18@reddit
I have 2 3060s for a total of 24GB VRAM. I'd love to see these kinds of numbers with that setup. Will try.
janvitos@reddit (OP)
You'll likely get even better speeds since the model will probably all fit in VRAM.
masterlafontaine@reddit
What's the prompt processing speed? Usually that's what makes agentic coding the most boring and slow. It's usually about reading, say, 50k tokens, then writing 3k.
ElChupaNebrey@reddit
What is your speed on 27B?
twiddlebit@reddit
27B won't fit in 12GB of VRAM, so probably not very good.
janvitos@reddit (OP)
I haven't even tried it after seeing other people's results. I know it wouldn't be fast enough for real coding use anyways, so I'll wait until some miracle happens or I buy a new GPU 😄
slimdizzy@reddit
I have a 3080 12gb I will try this on. Thanks muchly OP!
janvitos@reddit (OP)
Awesome!
damianzoys@reddit
I got some nice tok/s too, but the hallucinations make it almost impossible to use. It hallucinates tools and directories which aren't there, even with low temperature. Any idea how to fix this?
janvitos@reddit (OP)
Are you getting these hallucinations with MTP only?
To be honest, I haven't noticed any issues with MTP and have used it to do some code work, but no major project yet. No tool issues at all. For my setup, Qwen3.6 is actually much more stable with tools than Gemma 4.
burdzi@reddit
Nice 🤩 does MTP also work for vision? If I give it images?
janvitos@reddit (OP)
I think there might be some issues with vision. You can read about it here: https://github.com/ggml-org/llama.cpp/issues/22867 and the official PR thread: https://github.com/ggml-org/llama.cpp/pull/22673
Fuzilumpkinz@reddit
I’ll try this for sure. I’m getting 40 atm but I’m on a 6700 xt. Curious if I can find any increases