Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models

Posted by Far-Usual5771@reddit | LocalLLaMA | 55 comments

As the title says, I found no meaningful speed difference between Linux and Windows when running medium and large MoE models with llama.cpp. I myself kept two operating systems on my computer for a long time because of this misconception. But when I got tired of constantly switching, I decided to check how much performance I’d lose if I moved to Windows.

First, a brief overview of the PC used in these tests:

- CPU: Core Ultra 7 265KF under water cooling, with a slight overclock to 5.6/4.7 GHz core frequencies

- Motherboard: Asus Z890 with three PCIe slots, two of them PCIe 4.0 x4

- RAM: Kingston Fury Beast DDR5 192 GB (4×48 GB) at 6400 MT/s, with slightly reduced voltage and relaxed timings to keep temperatures down

- GPUs: Nvidia GeForce RTX 5080 16 GB + RTX 5060 Ti 16 GB + RTX 5060 Ti 16 GB, all undervolted with a slight memory overclock

- PSU: 1200 W 80 Plus Gold — 1000 W would have been enough, but I went with headroom from the start

Operating systems used: Ubuntu 26.04 with KDE and GNOME — I also ran one test with Xfce — and Windows 11 with all updates installed. The llama.cpp version was the same across the board, built via cmake the day before yesterday, which happened to include a commit for reducing VRAM usage: “llama: use f16 mask for FA to save VRAM”.

Models tested: Qwen 3.5 122B Q8, Qwen 3.5 397B iq4_xs, MiniMax 2.7 Q5.

llama.cpp launch parameters: `-nocb -dio --no-mmap -np 1 -t 15 -tb 15 -c 50000` (for coding, `-c 150000`) `-mg 0 -fa on --reasoning-budget 19000 --reasoning-budget-message " ... reasoning budget exceeded, need to answer." --no-mmproj`. I also set `CUDA_VISIBLE_DEVICES=1,2,0` so that the RTX 5080 is the first device. The only per-OS difference: `-fit on` on Linux, `-fit-target 250` on Windows.
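For anyone who wants to script the same launch on both systems, here is a minimal Python sketch. The flags are copied verbatim from the post and are not re-checked against any particular llama.cpp build. The binary name `llama-server`, the `-m` model argument, and the helper `build_launch` are assumptions for illustration, since the post does not say which binary or model path was used.

```python
import os
import platform

# Flags copied verbatim from the post (not verified against a specific build).
COMMON_FLAGS = [
    "-nocb", "-dio", "--no-mmap",
    "-np", "1", "-t", "15", "-tb", "15",
    "-mg", "0", "-fa", "on",
    "--reasoning-budget", "19000",
    "--reasoning-budget-message", " ... reasoning budget exceeded, need to answer.",
    "--no-mmproj",
]


def build_launch(model_path, system=None, coding=False, binary="llama-server"):
    """Return (argv, env) for one run.

    `binary` and `model_path` are hypothetical placeholders; the post only
    lists the flags, not the executable or the GGUF file that was used.
    """
    system = system or platform.system()
    # 50k context for the general tasks, 150k for the coding task.
    ctx = "150000" if coding else "50000"
    # The only per-OS difference reported in the post.
    fit = ["-fit-target", "250"] if system == "Windows" else ["-fit", "on"]
    argv = [binary, "-m", model_path, *COMMON_FLAGS, "-c", ctx, *fit]
    # Device order from the post, so that the RTX 5080 is listed first.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="1,2,0")
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env)`. Building an argument list instead of a shell string avoids quoting problems with the reasoning-budget message, which starts with a space.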

Results (PP = prompt processing, TG = token generation, both in tokens per second):

- Qwen 3.5 122B: PP 300, TG 28 on Windows; PP 290, TG 28.5 on Linux

- Qwen 3.5 397B: PP 140, TG 16 on Windows; PP 150, TG 15.2 on Linux

- MiniMax 2.7: PP 220, TG 17 on Windows; PP 230, TG 16 on Linux
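To put the numbers above side by side, the short Python sketch below computes how far Windows is ahead of or behind Linux for each model, in percent. The figures are the ones listed in the results; the dictionary layout and the function name `windows_vs_linux` are only illustrative.

```python
# (PP, TG) pairs exactly as reported in the results list.
RESULTS = {
    "Qwen 3.5 122B": {"Windows": (300, 28), "Linux": (290, 28.5)},
    "Qwen 3.5 397B": {"Windows": (140, 16), "Linux": (150, 15.2)},
    "MiniMax 2.7": {"Windows": (220, 17), "Linux": (230, 16)},
}


def windows_vs_linux(results):
    """Percent difference of Windows relative to Linux, as (PP %, TG %).

    Positive means Windows was faster, negative means Linux was faster.
    """
    out = {}
    for model, r in results.items():
        (w_pp, w_tg), (l_pp, l_tg) = r["Windows"], r["Linux"]
        out[model] = (
            round(100 * (w_pp - l_pp) / l_pp, 2),
            round(100 * (w_tg - l_tg) / l_tg, 2),
        )
    return out


if __name__ == "__main__":
    for model, (pp, tg) in windows_vs_linux(RESULTS).items():
        print(f"{model}: PP {pp:+.2f}%, TG {tg:+.2f}%")
```

Every difference comes out under 7% in absolute terms, and the sign flips between models and between PP and TG, so neither system is consistently ahead in these figures.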

All tests were run 4 times each, across the following tasks:

1) A brief article summary with 8k tokens of prompt processing.

2) Translating a portion of a book from Chinese — 20k tokens of prompt processing.

3) A Java test: deliberate errors were introduced in two classes, with a total of 85k tokens of prompt processing. The percentage results were the same across all models.
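As a rough feel for what the PP figures mean for these three tasks, the Python sketch below converts a prompt size and a PP rate into waiting time. It assumes PP is in tokens per second and stays constant over the whole prompt, which is a simplification (speed usually changes as the context fills up). The task labels and the function name `prompt_seconds` are made up for the example.

```python
# Prompt sizes of the three tasks, as listed above.
TASK_PROMPT_TOKENS = {
    "summary": 8_000,
    "translation": 20_000,
    "java": 85_000,
}


def prompt_seconds(prompt_tokens, pp_tokens_per_s):
    """Estimated prompt-processing time in seconds.

    Assumes a constant PP rate across the prompt, which is a simplification.
    """
    if pp_tokens_per_s <= 0:
        raise ValueError("PP rate must be positive")
    return prompt_tokens / pp_tokens_per_s


if __name__ == "__main__":
    # Example: Qwen 3.5 397B at the reported PP of 140 (Windows) and 150 (Linux).
    for task, tokens in TASK_PROMPT_TOKENS.items():
        win = prompt_seconds(tokens, 140)
        lin = prompt_seconds(tokens, 150)
        print(f"{task}: ~{win:.0f} s on Windows, ~{lin:.0f} s on Linux")
```

Under that assumption, the 85k-token Java prompt takes roughly ten minutes of prompt processing on the 397B model on either system.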

WSL turned out to be the slowest. I ran a test with just Qwen 3.5 397B, and the speed dropped from PP 140, TG 16 on native Windows to PP 110, TG 13.5.

I’ve laid out the exact llama.cpp launch parameters, so anyone can easily reproduce the results on their own hardware. Of course, everyone’s setup is different, but the performance ratio won’t change for MoE models with hybrid CPU+GPU offloading.

And running such large models doesn’t require a ton of space, massive power draw, or all the other things people often list. From the wall, the 397B model pulled only 550–600 watts according to the readings. I also attached a photo of the PC — in a closed case, air convection is better with 140 mm fans.

Kornelius20@reddit

I so desperately want to be on Linux, but Windows is the only way my scrappy OCuLink setup even detects the GPU I'm running :(

BoogerheadCult@reddit

Dumb tests: you are not pushing the system hard enough, so there are no visible differences. On resource-constrained systems, such as when running very large models while trying to squeeze every last drop out of your RAM and VRAM, Linux always shines.

It is no coincidence that you can allocate more VRAM and run larger models on Ryzen Max in Linux but can't on Windows.

Sorry OP, you are not technical enough for this kind of comparison. Just stay with Windows, it is for low IQ people anyway.

minus_28_and_falling@reddit

Good to know there's no difference, so I can happily stay on Linux.
