RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context; the --n-cpu-moe flag is the most important part.
Posted by marlang@reddit | LocalLLaMA | View on Reddit | 148 comments
Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had Claude Opus 4.7 (just the $20 sub) build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning — basically did the whole thing autonomously. I just told it what hardware I have and what I wanted to run.
Sharing because the common --cpu-moe advice leaves a 54% speedup on the table on 16GB GPUs.
Hardware
- GPU: RTX 5070 Ti (16GB GDDR7, Blackwell)
- CPU: Ryzen 9800X3D (96MB L3 V-Cache)
- RAM: 32GB DDR5
- Stack: llama.cpp b8829 (CUDA 13.1, Windows x64)
- Model: unsloth/Qwen3.6-35B-A3B-GGUF, UD-Q4_K_M (22.1 GB)
The finding — --cpu-moe vs --n-cpu-moe N
Everyone’s using --cpu-moe, which pushes ALL MoE experts to CPU. On a 16GB GPU with a 22GB MoE model that means only ~1.9 GB of your VRAM goes to model weights — the other ~12 GB sits idle.
--n-cpu-moe N keeps experts of the first N layers on CPU and puts the rest on GPU. With N=20 on a 40-layer model, the split uses VRAM properly.
Benchmarks (300-token generation, Q4_K_M)
| Config | Gen t/s | Prompt t/s | VRAM used |
|---|---|---|---|
| --cpu-moe (baseline) | 51.2 | 87.9 | 3.5 GB |
| --n-cpu-moe 20 | 78.7 | 100.6 | 12.7 GB |
| --n-cpu-moe 20 + -np 1 + 128K ctx | 79.3 | 135.8 | 13.2 GB |
+54% generation speed, +54% prompt speed vs. naive --cpu-moe. Jumping to 128K context is essentially free thanks to -np 1 dropping recurrent-state memory.
Startup command that works
llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--n-cpu-moe 20 ^
-ngl 99 ^
-np 1 ^
-fa on ^
-ctk q8_0 -ctv q8_0 ^
-c 131072 ^
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
--presence-penalty 0.0 --repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--host 0.0.0.0 --port 8080
That’s Unsloth’s “Precise Coding” sampling preset. For general use: --temp 1.0 --presence-penalty 1.5.
Gotchas I hit (well, that Opus hit and fixed)
- -np defaults to auto = 4 slots, which wastes memory on recurrent state (~190 MB). Set -np 1 for single-user setups (OpenCode etc.).
- --fit-target doesn’t help here — -ngl 99 + --n-cpu-moe N already gives you deterministic control.
- -ctk q8_0 -ctv q8_0 is nearly lossless and halves your KV cache vs fp16. 128K ctx only costs 1.36 GB VRAM.
- Qwen3.6 is a hybrid architecture — only 10 layers are standard attention, the other 40 are Gated Delta Net (recurrent). That’s why KV memory is so small (rough size sketch below).
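For a rough sanity check on that 1.36 GB figure, here is a back-of-the-envelope KV-cache estimate in Python. The 10 attention layers and q8_0 cache come from the post; the KV-head count and head dimension are illustrative guesses, not published specs for this model, so treat the output as an order-of-magnitude check only.

```python
# Rough KV-cache size for a hybrid model where only a few layers use standard
# attention. Assumed geometry (4 KV heads x 128 head dim) is a guess for
# illustration; q8_0 is treated as ~1 byte per element, ignoring block overhead.
def kv_cache_gib(ctx, n_attn_layers=10, n_kv_heads=4, head_dim=128, bytes_per_elem=1):
    # 2x for K and V
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

print(f"{kv_cache_gib(131_072):.2f} GiB at 128K ctx")  # ~1.25 GiB, same ballpark as the 1.36 GB above
```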
How to tune N for your GPU
Each MoE layer on GPU costs ~530 MB VRAM. Non-MoE weights are ~1.9 GB fixed. For a 40-layer model:
| GPU VRAM | Recommended N |
|---|---|
| 8 GB | stay with --cpu-moe |
| 12 GB | N=26 |
| 16 GB | N=20 (sweet spot) |
| 24 GB | N=8 (fits almost everything) |
Start conservative, watch VRAM during a long-context generation, then step N down by 2-3 until you have ~2 GB headroom (see the calculator sketch below).
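If you'd rather not do the table math by hand, here is a tiny calculator that just encodes the heuristic above (~530 MB per MoE layer on GPU, ~1.9 GB fixed weights, ~2 GB headroom). The per-layer cost depends on the quant, so treat the output as a starting point for the watch-and-step tuning, not a guaranteed match for the table.

```python
# Suggest --n-cpu-moe N from available VRAM, using the post's rough per-layer
# cost. The headroom covers KV cache and compute buffers and is an assumption here.
def suggest_n_cpu_moe(vram_gb, total_moe_layers=40, fixed_gb=1.9,
                      per_layer_gb=0.53, headroom_gb=2.0):
    budget = vram_gb - fixed_gb - headroom_gb            # VRAM left for expert weights
    on_gpu = max(0, min(total_moe_layers, int(budget / per_layer_gb)))
    return total_moe_layers - on_gpu                     # N = MoE layers kept on CPU

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> --n-cpu-moe {suggest_n_cpu_moe(vram)}")
```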
TL;DR
Replace --cpu-moe with --n-cpu-moe 20, add -np 1, and you get 79 t/s + 128K context on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.
And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end — launch servers in background, parse logs, iterate — without hand-holding. Kind of wild.
Happy to test other configs if anyone wants comparisons.
MykeGuty@reddit
I have the same setup as your configuration and it runs great!! Thanks to you and the community :D
italianguy83@reddit
Does it also work for an RTX 5070 12GB?
Historical_Roll_2974@reddit
I'm getting 30 tokens a second with an rx 9070xt but I'm also using lm studio so I can't get all the customisations
googleaddreddit@reddit
with rx 9070xt I get 42 t/s using vulkan.
Affectionate-Mode766@reddit
56-58 t/s with rx 7800xt using Vulkan
googleaddreddit@reddit
huh
Affectionate-Mode766@reddit
Please share
googleaddreddit@reddit
I did some more testing: https://i.redd.it/aqfjk65dxaxg1.png
googleaddreddit@reddit
I just built with GGML_NATIVE, which is the default anyway, but I had disabled it because of breakage some time ago.
dreamai87@reddit
It’s okay that you are exploring all the possible stuff, but the simple command --fit on will get the best out of your configuration.
marlang@reddit (OP)
Solid tip, I actually went back and tested this properly after your comment. You’re right, --fit on arrives at the same MoE split I calculated manually (20 layers overflowing, 20 on GPU). One command vs hardware math, so yeah, clearly the better advice.
Full numbers for anyone reading:
- --n-cpu-moe 20 (my manual tune)
- --fit on (bare)
- --fit on -c 131072
One caveat people should know: bare --fit on silently reduces your context to 4K because it treats -c as an unset argument and minimizes it for max speed. If you want full context (coding/agentic use), you still have to set -c explicitly — then fit only decides the offload split.
So the final recommendation for a 16GB GPU is basically:
Thanks for pushing back — updated my scripts.
SummarizedAnu@reddit
Do you know why --no-mmap doesn't work for me? It's fine for like 24K context, but after that my PC lags to 1 fps and I have to manually restart using the power button. I'm on Arch Linux with an RTX 3060 and 16GB RAM. Thanks 🙏
marlang@reddit (OP)
Here’s what’s happening: the model is 22 GB total. On your 3060 (12 GB) you can fit maybe 14 MoE layers on GPU, the other 26 MoE layers
stay on CPU = ~14 GB of model in RAM before you even open a context. Then KV cache + compute buffers grow with context size. At 24K
ctx you fit in 16 GB. Above that, you spill past 16 GB.
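As a rough sketch of that arithmetic (reusing the post's ~530 MB-per-MoE-layer estimate; the exact split depends on the quant and context):

```python
# How much of the model stays resident in system RAM on a ~12 GB card.
# KV cache and compute buffers come on top of this and grow with context,
# which is what pushes the total past 16 GB above ~24K ctx.
per_moe_layer_gb, total_moe_layers = 0.53, 40
layers_on_gpu = 14                                  # roughly what a 12 GB card fits
layers_on_cpu = total_moe_layers - layers_on_gpu    # 26 layers stay in RAM
print(f"~{layers_on_cpu * per_moe_layer_gb:.1f} GB of expert weights in system RAM")  # ~13.8 GB
```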
With mmap (default), Linux handles this fine, it just evicts cold model pages from the page cache when pressure hits. With --no-mmap,
every page is pinned in llama.cpp’s heap, so the kernel can only swap it. And because swap on a running model = constant thrashing, the
rest of your system (X, browser, everything) gets evicted first → lockup, needs power button.
Fixes, in order of ease:
SummarizedAnu@reddit
I'm using the IQ3_S model, which is 13 GB, and a context of 60K fills about 1GB of VRAM max. That was a --no-mmap problem where it broke something in Arch, I don't know what.
But now I'm using this.
It's faster than before.
./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja --flash-attn on -np 1 -ngl 30 --fit-ctx 65536 -ncmoe 20 --alias Qwen3.6-35B --fit on --fit-target 512 -b 1024 -ub 512 --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0
But I don't know what's using all that swap RAM.
Wait, I found the command I was using that was causing problems.
It's: ./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja -ngl 30 -c 65536 -ncmoe 20 --no-mmap
But with mmap:
./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja -ngl 30 -c 65536 -ncmoe 20
I had no problem running this, but it was getting around 20-30 tps. Not good but not bad speed at that time.
Illustrious-Bid-2598@reddit
The K value is sensitive to turbo. I would keep ctk at regular q8 with your ctv turbo4.
SummarizedAnu@reddit
Is that for speed?
Illustrious-Bid-2598@reddit
For quality. The speed gain from 8-bit to 4-bit is so fractional that it's worth keeping K at q8, as the quality gain is significant, especially if you're going to rely on tool calling.
ecompanda@reddit
The heuristic gets you 90% there for most single-user inference. Where it gets tricky is batched requests, or when you want to bias toward more layers on GPU at lower context. Manual gives you that control, but for the typical case fit works well.
Rangali-1@reddit
First of all, thank you very much for the great work!
With the latest final version, however, I only get just under 69 tokens/second with my 5070 Ti and AMD Ryzen 9 9950X3D with 64GB RAM.
What am I doing wrong?
Auo98@reddit
How do I do this in Unsloth chat?
relmny@reddit
Have you tried something like (relevant part being the "-ot ..."):
with that I get about 22.4t/s with Q6_K_L, while with (tested out of curiosity):
--fit on -c 131072 -np 1 -fa on -ctk q8_0 -ctv q8_0
I get about 7.1t/s
and with:
--fit on --fit-ctx 128000 --fit-target 512
I get 2.6t/s
Fristender@reddit
Thanks for posting the results! I would say the 2t/s tg lost with manual tuning is worth the 34t/s gain with pp
Life-Screen-9923@reddit
I use 'fit-target 256' to emulate 'ngl 99' on my rtx 3060 12gb
And add 'mlock' and 'no-mmap' for performance
dreamai87@reddit
Thanks, glad it helps you.
My comment was mainly to help with allocating a better split of the model between GPU and CPU.
-c you have to provide, otherwise it takes the default.
Danmoreng@reddit
No, you need to use fit together with fit-ctx and fit-target: https://github.com/Danmoreng/local-qwen3-coder-env#server-optimization-details
-np 1 sounds interesting though, haven't tried that one out; it might improve my config
st0n1th@reddit
Yeah, I tried with 20 on CPU and it left 2GB free on my 5080 GPU, and I noticed more CPU than GPU utilization. Switched back to --fit.
iamapizza@reddit
Thank you for testing with fit. Even though it's doing a lot of the heavy lifting you did at least validate and learn stuff along the way (and teach me something too)
IrisColt@reddit
heh
mrgreatheart@reddit
Thank you, super helpful
Ferilox@reddit
Except when the model has a vision component. Then it kind of struggles.
abu_shawarib@reddit
Command line says it's already on by default.
lolwutdo@reddit
lol no, -fit gives absolute trash performance if you want to specify and use max context with 16gb gpu.
Using lmstudio and maxing out gpu and moe layers with max context gives better performance than whatever the fuck -fit does in llama.cpp; quite literally a jump from 8 tokens per second on lcpp to 70+ tokens per second on lmstudio.
Danmoreng@reddit
Use fit with fit-ctx and fit-target: https://github.com/Danmoreng/local-qwen3-coder-env#server-optimization-details
lolwutdo@reddit
I appreciate it, but that's definitely not just one command like the original comment implies.
SlipperyCorruptor@reddit
I'm doing 210k ctx window, 30 MoE offload. 7600x, 32GB RAM, 5080. Getting:
prompt eval time = 43716.75 ms / 20087 tokens ( 2.18 ms per token, 459.48 tokens per second)
eval time = 22200.84 ms / 907 tokens ( 24.48 ms per token, 40.85 tokens per second)
total time = 65917.60 ms / 20994 tokens
Output: 40.85 Tokens/sec
Honestly.. That thing is impressive AF.
I've put it through some tests and it even correctly identified environmental issue with testcontainers and Rancher:
https://github.com/testcontainers/testcontainers-java/issues/11482
prob saved me like two days of investigation
Illustrious-Bid-2598@reddit
What do you guys think of an RTX 5060, i7-14700F, 32GB RAM, llama.cpp on WSL2 with models stored on the Linux fs, Windows debloated?
Kiro369@reddit
I tried running Qwen3.6-35B-A3B-UD-IQ4_NL_XL and was getting 9 t/s
With your last command it went up to 50 t/s!
Thanks a lot!
sherrytelli@reddit
using unsloth/Qwen3.6-35B-A3B:Q4_K_M
With your final config I am able to get a stable 42-47 tg/s
my pc specs:
RTX 5060ti 16gb
i5-12400f
32gb ddr4 @ 3200 MHz
The model I was previously using: unsloth/glm-4.7-flash-23-23B-A3B:Q4_K_M. Used to get around 32-35 tg/s
thanks for the config :)
nasty84@reddit
How do we change some of these settings in LM Studio? I am only getting 37 tokens per second with the same kind of hardware.
nasty84@reddit
Here are LM Studio settings. Change them as you like based on your rig:
Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 20 ← the key setting
- Try mmap(): ON
- Keep Model in Memory: OFF
texifornian@reddit
Took some work - but on a similar setup-ish...
The Hardware:
Memory and CPU changes weren't letting me run the Q4, but going to Q3 got me to 70 t/s -
.\llama-server.exe ^
 -hf unsloth/Qwen3.6-35B-A3B-GGUF ^
 -m Qwen3.6-35B-A3B-UD-Q3_K_M.gguf ^
 --device CUDA0 ^
 --n-cpu-moe 15 ^
 --mmproj "" ^
 -t 8 ^
 -fa on ^
 -b 2048 ^
 -ub 2048 ^
 -ctk q8_0 ^
 -ctv q8_0 ^
 -c 128000 ^
 --temp 0.6 ^
 --chat-template-kwargs "{\"preserve_thinking\": true}" ^
 --port 8033
raswill0@reddit
Awesome post!
Sharing my results (RTX 5060 Ti 16GB - 32GB DDR5 RAM - AMD Ryzen 7 8700F):
JustSayin_thatuknow@reddit
Amazing job!! I’ll be waiting for your final final final final final command boss! 😅🙏🏻💪💪💪💪
OldPappy_@reddit
Thanks for this. I'm going to try some of these configurations out on my 9070XT.
admajic@reddit
On a 3090 with 24GB VRAM I get 110 tokens/s, and I can afford a car too. Interesting world we live in.
smolpotat0_x@reddit
we are not a car.
admajic@reddit
The car = cost of a video card... geez
SinnersDE@reddit
Thanks for your hard work!
Sharing my results in case somebody cares (RTX 4080 16GB, 32GB DDR4):
.\llamacpp\llama-server.exe -m "./models/Qwen-3.6-35B-A3B-Q4_K_XL/Qwen-3.6-35B-A3B-Q4_K_XL.gguf" --fit on --fit-ctx 128000 --fit-target 256 -np 1 -fa on --no-mmap --mlock -b 2048 -ub 2048 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1 --chat-template-kwargs "{\"preserve_thinking\": true}" --host 0.0.0.0 --port 8033
Getting 58 t/s at low context, dropping to 43 t/s after the context window fills up to 60-70%. Didn't get further.
SinnersDE@reddit
Just a short question:
How do I set the --no-mmap and --mlock flags in a preset.ini?
[ini_THINK_GENERAL_Qwen-3.6-35B-A3B-Q4_K_XL]
m = ./models/Qwen-3.6-35B-A3B-Q4_K_XL/Qwen-3.6-35B-A3B-Q4_K_XL.gguf
temp = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
fit = on
; whether to adjust unset arguments to fit in device memory ('on' or 'off', default: 'on')
fitc = 128000
; minimum ctx size that can be set by --fit option, default: 4096
fitt = 256
; target margin per device for --fit, comma-separated list of values, single value is broadcast across all devices, default: 1024
np = 1
; number of server slots (default: -1, -1 = auto)
fa = on
; set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
b = 2048
; logical maximum batch size (default: 2048)
ub = 2048
; physical maximum batch size (default: 512)
ctk = q8_0
; KV cache data type for K (allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)
ctv = q8_0
; KV cache data type for V (allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)
-no-mmap ???
; --mmap, --no-mmap whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)
-mlock ???
; force system to keep model in RAM rather than swapping or compressing
chat-template-kwargs = {"preserve_thinking":true}
reasoning-budget = -1
milpster@reddit
how do you deal with having such low context?
Emergency-Most1859@reddit
Bro 🔥🔥 Running this model with qwen code and it works better and kinda smarter than Alibaba's cloud Qwen that I've used before. They discontinued the free tier so I started to look for alternatives.. Really impressed with that model quality. Works fine on an RX 6800 with a 7900X3D (changed some flags though).
Dreeseaw@reddit
To add a datapoint, my recently-purchased prebuilt gaming PC (iBP Y40 Pro with a 5080 (16gb vram), 32gb ram, 9800) is executing fat 100k context prompts on the order of 45s, and breezing through opencode driven workflows (largely replacing the analysis portion of an optimization loop I work with).
OP this is black magic. Thank you.
Mister_bruhmoment@reddit
Hey, I basically have the last-gen version of your rig besides the RAM: 4070 Ti Super, R7 7800X3D. Are those settings applicable in LM Studio? I am still figuring out how everything works with LLMs atm.
marlang@reddit (OP)
LM Studio settings for your rig
Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 20 ← the key setting
- Try mmap(): ON
- Keep Model in Memory: OFF
x10der_by@reddit
Wow it's magic. With "GPU Offload" to max and "Offload MoE Experts to CPU" to 20 speed increased from 15 to 50 t/s on my config O_o
moahmo88@reddit
Try this @ 59 t/s with 5070ti:
LM Studio settings for you:
Load Qwen3.6-35B-A3B-UD-Q5_K_M from unsloth (you can use Q5_K_M):
- GPU Offload: max (all the way right)
- Offload MoE Experts to CPU: 24 ← the key setting
nixudos@reddit
Thank you for the tip! I was struggling to get any meaningful speed on a 4090 with the Q6_K_XL size. This doubled my speed from 18 t/s to 42!
Comfortable_Dog1610@reddit
I have the same speed on a 7900 XTX using the Q6_K_XL model. I can sleep tonight if I see you
BubrivKo@reddit
Thanks bro. On my AMD configuration I get an additional 15-20 tk/s :P
The_Dung_Beetle@reddit
Thanks! I get about 25 tok/sec. Rig I've tested this on : 6950XT/5700X3D/32GBDDR4@3200.
monacoax@reddit
What settings would you suggest for a 4090 + 12700K? Thanks for the info!
marlang@reddit (OP)
If you want the full 128K ctx: with 24 GB VRAM the KV cache at Q8_0 eats ~1.4 GB + compute buffers ~0.6 GB + non-MoE weights ~1.9 GB, leaving ~19 GB for MoE experts. Each expert layer costs ~530 MB on GPU, so ~36 of the 40 layers fit (a quick worked budget is sketched after the settings list below).
And leave the CPU thread pool at 8, I think that's best for the 12700K.
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 4 ← the key setting (40 layers minus the ~36 that fit on GPU)
- Try mmap(): ON
- Keep Model in Memory: OFF
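For reference, a minimal sketch of the VRAM budget behind that 36-layer figure, reusing the post's estimates; the ~1 GB safety margin is my own assumption.

```python
# 24 GB card, full 128K context: budget what's left for MoE expert layers.
vram, fixed_weights, kv_q8_128k, buffers, margin = 24.0, 1.9, 1.4, 0.6, 1.0
per_layer, total_moe_layers = 0.53, 40

budget = vram - fixed_weights - kv_q8_128k - buffers - margin   # ~19 GB for experts
on_gpu = min(total_moe_layers, int(budget / per_layer))         # ~36 layers fit
print(f"{on_gpu} MoE layers on GPU -> offload {total_moe_layers - on_gpu} to CPU")
```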
yoohjm@reddit
lmstudio user with a RTX 5070ti here. This is amazing, such a speed up from my previous config and that context length is much more than i thought i would be able to fit
many thanks!
Mister_bruhmoment@reddit
Thank you so much!
Embarrassed_Elk_4733@reddit
Yes, absolutely. My setup is a Ryzen 7 5800X3D + RTX 4070 Ti Super + 32GB DDR4. Running a 128K context window in LM Studio gives me around 39–40 tokens/s. However, when I used the Llama configuration provided by the original poster, the same hardware achieved 45–46 tokens/s in Llama. Sharing this for your reference!
Guilty_Rooster_6708@reddit
This is literally perfect for me. Thanks for the tip on mlock and ub !!
TodayExcellent9756@reddit
Hey, thanks a lot for this!
I’m able to achieve 55 tok/s using your config on these specs: RTX 5070 Ti (16GB) with 32GB DDR4 and an Intel Core i5-14600KF. The coding results are amazing too. Now I won’t get stuck when my Codex runs out of tokens. 🤣
BitGreen1270@reddit
Thank you for sharing. I only have a 780M, which I'm using with Vulkan and gemma:e4b. I assume most of what you've shared is not applicable to me because I don't have dedicated VRAM?
CriticalCup6207@reddit
The --n-cpu-moe flag is doing serious work here. For anyone who hasn't seen it: it offloads the MoE routing to CPU, which frees VRAM for the active expert weights and meaningfully improves throughput on cards that would otherwise bottleneck. On our setup (3090 + i9) we saw ~40% throughput improvement. The 9800X3D's cache size probably also helps with the routing overhead on the CPU side.
Ok-Palpitation-905@reddit
Nice.
moahmo88@reddit
Amazing work! Thanks a million for sharing.
rebelSun25@reddit
Nicely done
Cool-Cap2509@reddit
I just tried it. Getting 24 t/s in processing. What am I doing wrong? I got the same model, 9950X3D + 64GB RAM + 4080 Super. Can you please suggest any solution?
marlang@reddit (OP)
- CUDA0 model buffer size = XXXX MiB — if this is 0 or tiny, nothing's on GPU.
On 64 GB you can safely drop --no-mmap --mlock — you don't need them.
-t 8 --cpu-mask 0xFF
llama-bench.exe -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -fitt 256 -fitc 65536 -fa 1 -ctk q8_0 -ctv q8_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3
Should give 3000+ pp2048 and ~100 tg128.
Cool-Cap2509@reddit
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 26614 MiB of device memory vs. 14997 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 256 MiB, need to reduce device memory by 11873 MiB
llama_params_fit_impl: context size reduced from 262144 to 128000 -> need 2668 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 9456 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 4080 SUPER): 41 layers, 5748 MiB used, 9248 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 4080 SUPER): 41 layers (21 overflowing), 14720 MiB used, 276 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.34 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080 SUPER) (0000:01:00.0) - 15061 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 733 tensors from q35.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
marlang@reddit (OP)
Your setup is actually fine, 21 layers overflow, GPU is being used correctly. This is netting you 24t/s?
Cool-Cap2509@reddit
BTW, am I getting half the token rate because it is one generation older?
Cool-Cap2509@reddit
I didn't use code. I gave it a New Yorker magazine and asked it to summarize an article, and the processing was 24 t/s. Tried different variations with Gemini's help and was able to get 1300 t/s at 50% context; then it slowly drops down to less than 100 t/s. But hey, as you said, I don't need to use those two memory-related lines. That solved the 90% memory usage issue. Thank you for sharing, otherwise I would never have downloaded a 35B model. The highest I had tried was Gemma 4 26B Q4, and that was already slow enough and spilled the VRAM usage.
Cool-Cap2509@reddit
I believe I am using the same model.
.\llama-bench.exe -m q35.gguf -ngl 99 --n-cpu-moe 20 -fa 1 -ctk q4_0 -ctv q4_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
load_backend: loaded CUDA backend from C:\ai\llama\ggml-cuda.dll
load_backend: loaded RPC backend from C:\ai\llama\ggml-rpc.dll
load_backend: loaded CPU backend from C:\ai\llama\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | n_cpu_moe | n_ubatch | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.60 GiB | 34.66 B | CUDA | 99 | 20 | 2048 | q4_0 | q4_0 | 1 | pp2048 | 2456.42 ± 139.48 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.60 GiB | 34.66 B | CUDA | 99 | 20 | 2048 | q4_0 | q4_0 | 1 | tg128 | 77.47 ± 1.05 |
build: 23b8cc499 (8838)
BuildDevv@reddit
As a new player for local llm’s, scrolling through the comments, this community is very supportive. Thanks for the tip y’all!
BP041@reddit
Wow, 79 t/s with Qwen3.6-35B-A3B on consumer hardware is fantastic! This is exactly the kind of optimization that pushes local LLM development forward. At NTU, and in my work with CanMarket's AI infrastructure, we're always looking for ways to maximize inference speed and context handling on diverse hardware setups. Could you elaborate on how much --n-cpu-moe impacted performance for you, or if you encountered any specific bottlenecks you had to tune around?
jadbox@reddit
This is amazing... but why can't Llama do this all automatically for us?
nextgenpotato@reddit
I have the exact same hardware as you do. Trying to run your final final command, I am getting OOM errors. What am I missing? I am on Ubuntu 26.04 and a noob when it comes to llama.cpp
Potential-Leg-639@reddit
A recent Fedora kernel upgrade made my system unstable and I also had strange issues in llama.cpp. With the kernel version from before, everything is fine again.
FatheredPuma81@reddit
That certainly does sound like Opus with its training data being from pretty much around the time they switched from specifying the tensors to the much better --n-cpu-moe command.
fucking_cuntbag@reddit
Thanks for this - I have the same setup and was struggling to get a reasonable tps. I had switched to lower quants, but with this config I can get 80 tps on IQ4.
bdsmmaster007@reddit
I've not fumbled around with local hosting in quite a while, but Qwen intrigues me. Though I'm on AMD and not sure how the support is looking. Can anybody estimate how much I could get on a 7600X and an RX 6800?
New_Spray_7886@reddit
I get 25 t/s with a 6700xt +32gb ram when setting aside vram for full context (prefill is ~300 t/s), so you should be quite workably higher than that. This qwen-3.6 moe is quite a bit more performant than even gemma-4 on these older consumer setups
deRTIST@reddit
I'm at 12 t/s with a 6800 XT and 64GB of RAM, would you mind sharing settings? Anything different than OP?
New_Spray_7886@reddit
Here are two settings.
I use this when I want more ram available to use the computer at the same time (i.e. web-browsing), it is like OPs. Qwen3.6-35B-A3B-IQ4_XS (Bartowski) is 24-25 t/s here @ no context, 22 t/s @ 20% context (50k or so). Q4_K_M is a little slower at 20.5 t/s @ 20% context. I have many quants left to try but I like IQ4_XS so far.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server \
 --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
 --jinja \
 -c 0 \
 -ngl 99 \
 --no-mmap \
 --cpu-moe \
 --n-cpu-moe 186 \
 --min-p 0.0 \
 --top-p 0.95 \
 --top-k 20 \
 --temp 0.6 \
 --parallel 1 \
 --chat-template-kwargs '{"preserve_thinking":true}' \
 --host 127.0.0.1 \
 --port 8033 2>&1 | tee /path/to/log.txt
I use this if I'm not also using the computer - maybe this will be agents running overnight soon. Llama.cpp maximizes the performance by doing the fitting for you, so it is easier than testing how many layers you can offload. ~27.4 tok/s at no context as above.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server \
 --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
 --jinja \
 -c 0 \
 --no-mmap \
 --fit on \
 --min-p 0.0 \
 --top-p 0.95 \
 --top-k 20 \
 --temp 0.6 \
 --parallel 1 \
 --chat-template-kwargs '{"preserve_thinking":true}' \
 --host 127.0.0.1 \
 --port 8033 2>&1 | tee /path/to/log.txt
The logging is nice as if it runs slower than you want you can just ask the LLM to calculate how many --n-cpu-moe layers you should offload by uploading that file & your server start-up command. I tested smaller context sizes and the speeds were very minimally different on my setup so I'm keeping the max currently.
deRTIST@reddit
Oh, you're on XS, that makes sense, I'll try a lower quant then. Right now I'm at 16 tk/s after a couple of hours of hammering at it (apparently for my use case OP was right, n-cpu-moe was actually the way).
Thanks for the tips, I'll try it tomorrow.
zkstx@reddit
I'm getting 60-70 tps TG / 1300 tps PP, with up to 55k context window (100k+ @ Q8 KV) for Qwen3.5 35B IQ3_XXS on my RX6800 XT running llama.cpp compiled for rocm. It handles most things I care about pretty well. Will switch to 3.6 very soon
deRTIST@reddit
Yeah, I think I might need to get a smaller quant honestly; Q4_K_M at the speeds I'm talking about is way too sluggish. Quality is good, but it takes 1h to fully execute a (by the end of it) 65k-token task.
bdsmmaster007@reddit
thanks for the specific numbers :0, motivates me to fumble around
DefNattyBoii@reddit
I think the problem you will face is more with prompt processing speed. I'd recommend checking out the latest Vulkan build of llama.cpp, that's the easiest to get started with. Above 10 tokens/s is usable but not perfect; it depends on the length of your context.
bdsmmaster007@reddit
Thx for the pointers to start, will look into it ^^!
SeriousPanic34@reddit
I'm getting 40 tk/s with 7900 gre 16gb + 32gb ram on core ultra 7 265k with OPs config
bdsmmaster007@reddit
oh damn giving me hope :0, thx for the reply
Artistic_Okra7288@reddit
I’ve been getting about 30tps tg at 1M context on my M5 Max 128GB with q4_0 kv and using unsloth Q8_K_XL gguf.
Nnyan@reddit
Thank you for this I’m just starting my local LLM project and have a few GPU options similar to yours.
Horror-Veterinarian4@reddit
16GB VRAM, nice. I know what my next move is: I want to test and see how Gemma 4 26B (e123abc, whatever the fuck it is) runs compared to this one on my ancient E5-2697 v2 and V100 16GB VRAM.
nikolaiownz@reddit
Almost the same setup I have. I get 72ish tk/s
Thanks for this good thread. I am going to mess around with it next week. From what I saw just tinkering around with it and opencode - this is very good.
MysticOrbit7@reddit
I got 59 tok/s with the Edit 3 config on a 5060 Ti 16GB + 9950X + 128GB DDR6. Has anyone gotten better on this chip?
Late_Session7298@reddit
I’m using oMLX on an M2 Pro Max with 32 gigs, at 128K context with 35 t/s speed.
The simplest setup ever!
konohrik@reddit
Why not use exl2 instead of gguf?
marlang@reddit (OP)
because 22gb model > 16gb vram
quick maths
omidmatin@reddit
Can you guys help me with the best config for my RTX 3090 + 5800X3D? I need a large context window (at least 512K and preferably 1M). I think it's possible with this MoE model on TurboQuant?!
Ill-Stand-6678@reddit
Where did you find a llama.cpp project with working TurboQuant?
omidmatin@reddit
I asked Gemini and just copy-pasted it. Currently I've got it working with 256K context and turbo3, but I used the Q4_UD_S version. It's running at 100 t/s. I can run Q4_UD_XL with 192K context at around 95 t/s, but then I also have to offload the mmproj to RAM.
BassAzayda@reddit
I use the 3 fits, so --fit on --fit-ctx 128000 --fit-target 512. MoE and dense, works a treat every time.
marlang@reddit (OP)
Thank you! my startup scripts get better and better with every comment in here
minceShowercap@reddit
Are you editing OP with the latest update to your startup script?
I've found a few comments confirming that someone has helped and you have updated, but it's hard to know which is latest without you updating OP.
Great thread though. Exactly what I need because I want to try this model later for local coding.
Have you found it to be strong at local coding?
ecompanda@reddit
Yeah, updated the OP twice now. Dropped all the manual layer flags and am just using --fit on instead. Numbers are basically the same, but the config is way cleaner.
marlang@reddit (OP)
I've been comparing it mainly against Gemma 4 and, subjectively, Qwen3.6-35B-A3B is clearly better for coding, and for agentic coding it's miles ahead.
PaceZealousideal6091@reddit
If you have already set the context at 128K, why do you need --fit-ctx 128000? Any reason? Can you explain how these 3 together help?
vialoh@reddit
Does the `--n-cpu-moe` matter for those of us on Apple silicon? I suppose I could just ask AI... 😅
pefman@reddit
Good findings!
Objective-Stranger99@reddit
Your recommendation for 8 GB is wrong. I found --cpu-moe to be up to 20% slower on my GTX 1080 compared to tuning --n-cpu-moe
andy2na@reddit
what is the benefit of N=8 on a 24gb VRAM GPU for Qwen3.6-35B? With q8/q8 cache, you can already fit 256k context with the IQ4_NL quant, and likely still close to that with the Q4_K_M fully on GPU
My llama-swap config:
KptEmreU@reddit
Commenting to save. Great experiment
inquam@reddit
I managed to just squeeze Q5_S with 260K context into my 5090 entirely in VRAM when using Q8_0 for the KV cache.
I was on Qwen 3 Coder a long time and then Qwen Coder Next for a bit. And also a stint on 3.5. But 3.6 seems pretty solid so far.
ecompanda@reddit
79 t/s at 128K is genuinely impressive for a 35B model on consumer hardware. The interesting part is what happens as you actually fill that context. MoE attention at max context can be unpredictable, and some models drop to 30-40 t/s by 100K tokens in. Did you observe any speed drop as the conversation grew, or did it hold steady?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
mr_Owner@reddit
You missed the preserve_thinking flag though. Also play with ubatch size: start at 4096 and drop lower. Ubatch impacts the PP speed and VRAM size.
altdotboy@reddit
I have spent the last week building my own harness. This has proven to be the most important test for my rig.
11118888888855 → 118885 | 79999775555 → 99755 | AAABBBYUDD → ? Solve the pattern, and put your final answer within \boxed{}.
My system would get it correct maybe 1 in 10 times. I had to tune my system settings and prompt to get it right at least 3 times in a row.
What this test exposes is the delicate sensitivity of MoE router gates. Simply put: are your prompts going to the correct experts?
Dense models have an easier time with the question. Give this a shot and see if your system gets it correct 3 times in a row with fresh context each time.
Quantization, incorrect settings, and poor system prompts will hurt your MoE model.
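If you want to run this check against a llama.cpp server rather than by hand, here is a minimal harness sketch. It assumes the OpenAI-compatible /v1/chat/completions endpoint on localhost:8080 (adjust the URL and sampling to your setup); it just prints the \boxed{} answer from a few fresh-context runs so you can judge consistency yourself.

```python
# Send the pattern prompt several times with a fresh context and report the
# extracted \boxed{...} answer from each run.
import re
import requests

PROMPT = ("11118888888855 → 118885 | 79999775555 → 99755 | AAABBBYUDD → ? "
          "Solve the pattern, and put your final answer within \\boxed{}.")

for i in range(3):
    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.6,
    }, timeout=600)
    text = resp.json()["choices"][0]["message"]["content"]
    boxed = re.search(r"\\boxed\{([^}]*)\}", text)
    print(f"run {i + 1}: {boxed.group(1) if boxed else '(no boxed answer)'}")
```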
cesaqui89@reddit
Is it possible to apply this tuning in Ollama?
mrgreatheart@reddit
Thank you. I have a very similar system to yours, and it’s great to know I can run 3.6 so well on it.
Does --fit-ctx 128000 mean a 128K context window in system RAM?
HockeyDadNinja@reddit
I'm running a 5060 ti 16G and 4060 ti 16G with 64G system ram here. A couple days ago I finally started tuning. I've added things from your post and now I'm running Qwen3.6-35B-A3B at Q8. 98k context, a small overflow to CPU.
I'm using opencode and it's doing really well. I can code with this! 27 t/s at the moment. That used 3090 is looking really good right now.
slippery@reddit
You just kicked off my next project. Thanks for the detailed write up!
I'm going to try to get it running on a 12 GB 4070Ti.
AcrobaticChain1846@reddit
hey, I'm trying to run the `unsloth/qwen3.6-35b-a3b UD Q2_K_XL`
I also have same hardware as yours
5070 Ti
64 GB RAM
9950x3d
Can you help me with setting things up?
https://www.reddit.com/r/LocalLLaMA/comments/1soqtry/comment/oguxrw1/
Also I'm getting really slow prompt processing...
I want to know which model you are using like is it q4_k_xl or something else?
Thank you :)
dreamai87@reddit
Okay I saw your comment that brought me here.
Looking at the settings: move the slider to 50% first, then check the performance. Bringing it to 100% puts all experts on CPU, which also reduces performance, but it's still better than what you are getting. So first check at 50%, then look at GPU usage in Task Manager (10/12 or 8/12 GB).
Then reduce from 50% toward 20% and see where you get the best GPU usage; find the balance around 10/12 GB, assuming an RTX 5070 with 12GB VRAM.
I replied on your thread as well
AcrobaticChain1846@reddit
That did the trick.
Now I'm running `qwen3.6-35b-a3b@q5_k_m` with the following settings.
Based on LM Studio logs I'm getting 45-50 tk/s and 300-500 pp/s.
I think this can be further fine-tuned, as I can see my CPU usage is higher compared to GPU, but I'm not sure which parameters to play with.
CPU utilization 62%
GPU Utilization 50% sometimes it spikes to 70-80%
dreamai87@reddit
First eject the model, then bring the slider from 28 down to 22 or somewhere around 20. See if GPU usage is around 10GB of the 12GB (about 90%); leave ~2GB of space for the KV cache.
AcrobaticChain1846@reddit
I have 5070ti so 16GB VRAM + 64GB RAM
MoE on CPU: 22 - my graphics driver froze.
MoE on CPU: 24 - Getting 5tk/s almost 10x performance decrease
MoE on CPU: 25 - Getting 50tk/s and \~80% GPU and \~50% CPU
MoE on CPU: 26 - Getting 16tk/s and \~90-100% GPU and \~40-50% CPU
Note: Evaluation Batch Size was set to 4096 in all scenarios.
I generally use 2048, but I was getting slow prompt processing speeds, so I set it to 4096.
Hope these stats are useful, considering how a single MoE layer more or less on CPU causes such drastic changes in token generation speed.
PiotreksMusztarda@reddit
Confirming on Linux (Ubuntu 26.04, 5070 Ti, CUDA 12.4 with sm_89 PTX fallback), 76 tok/s with your --fit config, and heads up: if you load the vision mmproj, add --no-mmproj-offload or it OOMs right after model load.
kisiel02@reddit
I only get like 15 t/s with an RTX 5070 (19 layers on GPU) and DDR4 RAM sadly. And when compressing the KV cache to q8 I get 25 t/s. Seems like too big of a boost; I have like 10/12GB VRAM and 26/32GB RAM taken.
Several_Newspaper808@reddit
Hey, great info, thanks! I wonder though, how much of the perf is from the DDR5 RAM and whatever PCIe bus speed you have on your motherboard?
frozenYogurtLover2@reddit
Anyone else getting crashes and segfaults (error 139) with prompt cache enabled?
met_MY_verse@reddit
I’m running a smaller quant with less context entirely in VRAM, I’m assuming this is faster than offloading any experts at all?
AncientGrief@reddit
Nice work. Did some testing myself too now. RTX 4090 with 131K context size. Used opencode to create a C# Snake clone with SFML 3.0 ... 75% context used (it had to actually look up the NuGet specs for SFML 3.0 to fix some errors it produced automatically, it's a rather new release afaik) ... works pretty well and was about done in < 5 minutes.
One-shotted it easily.
~159.9 tok/s
With:
And opencode.json:
TurnUpThe4D3D3D3@reddit
Solid numbers! The 4090's 24GB VRAM really flexes here — almost double my gen speed. Makes sense with --fit-target 1536 giving you way more headroom to pack MoE layers on GPU compared to my 256.
One thing I noticed: you're using --split-mode none + --main-gpu 0 with --gpu-layers auto. That's clean for single-GPU. But since the 4090 can likely fit nearly all MoE layers, you might want to check what --fit actually landed on for N (it should log it at startup). Curious if you're hitting the full GPU or still offloading some experts to CPU.
Also, one-shotting a Snake clone with SFML 3.0 lookups at 75% context is a great real-world test. The model handling live package spec retrieval and auto-correcting from it is impressive for a 35B MoE.
^(This comment was generated by GLM-5.1)
o0genesis0o@reddit
I used to do this test by hand with the previous 30B A3B model. Managed to bring tg from 20-ish to 40-ish on my 4060 Ti with 64K max context by playing around with n-cpu-moe.
Ill_Initiative_4233@reddit
Tried this combination on my PC: i7-14700KF, 32GB RAM, 5070 Ti GPU. 3312 tokens, 53s, 61.76 t/s.
Ranmark@reddit
IIRC you can drop your top_p, presence_penalty, and reasoning_budget args, as they have these values by default: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
FriendlyTitan@reddit
Have you tested higher batch and ubatch numbers? I notice that for myself, giving up more experts to cpu and giving vram to batch improves prefill speed massively. Set -b and -ub to 4096 or even higher if you want to experiment.
truthputer@reddit
In my testing context 256k (44 t/s) was slightly faster than context 128k (35 t/s). But my hardware is weird and heavily leans on the CPU with that context size.
Commenting here to remind me to try your config and will update this comment later.
Jackw78@reddit
The prefill speed is either inaccurate due to cold startup, or something is very wrong with the setup. Should be 1K t/s minimum for a 5070 Ti.
marlang@reddit (OP)
You were right. I went back with llama-bench.exe, the right tool instead of a short completion test, and got:
- pp512: 927 t/s
- pp2048: 1068 t/s
- tg128: 82 t/s