Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models
Posted by Far-Usual5771@reddit | LocalLLaMA | 55 comments
As the title says, there is no speed difference between Linux and Windows when using llama.cpp. I myself kept two operating systems on my computer for a long time because of this misconception. But when I got tired of constantly switching, I decided to check how much performance I’d lose if I moved to Windows.
First, a brief overview of the PC used in these tests:
- CPU: Core Ultra 7 265KF under water cooling, with a slight overclock to 5.6/4.7 GHz core frequencies
- Motherboard: Asus Z890 with three PCIe slots, two of them PCIe 4.0 x4
- RAM: Kingston Beast DDR5 192 GB (4×48 GB) at 6400 MHz, with slightly reduced voltage and relaxed timings to keep temperatures down
- GPUs: Nvidia GeForce RTX 5080 16 GB + RTX 5060 Ti 16 GB + RTX 5060 Ti 16 GB, all undervolted with a slight memory overclock
- PSU: 1200 W 80 Plus Gold — 1000 W would have been enough, but I went with headroom from the start
Operating systems used: Ubuntu 26.04 with KDE and GNOME — I also ran one test with Xfce — and Windows 11 with all updates installed. The llama.cpp version was the same across the board, built via cmake the day before yesterday, which happened to include a commit for reducing VRAM usage: “llama: use f16 mask for FA to save VRAM”.
Models tested: Qwen 3.5 122B Q8, Qwen 3.5 397B iq4_xs, MiniMax 2.7 Q5.
llama.cpp launch parameters: `-nocb -dio --no-mmap -np 1 -t 15 -tb 15 -c 50000` (for coding, `-c 150000`) `-mg 0 -fa on --reasoning-budget 19000 --reasoning-budget-message " ... reasoning budget exceeded, need to answer." --no-mmproj`. The device order was set with `CUDA_VISIBLE_DEVICES=1,2,0` so that llama.cpp starts with the RTX 5080. In addition, Linux used `-fit on` and Windows used `-fit-target 250`.
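For anyone reproducing this, the launch line above can be assembled per OS with a small script. This is only a sketch: the flags are copied verbatim from the post and not re-verified against any particular llama.cpp build, while the `llama-server` binary name, the `-m` model flag, and the `model.gguf` path are assumptions, since the post names neither the binary nor the model files.

```python
import os
import platform
import shlex

# Flags copied verbatim from the post; not re-verified against a llama.cpp build.
COMMON_FLAGS = [
    "-nocb", "-dio", "--no-mmap",
    "-np", "1", "-t", "15", "-tb", "15",
    "-mg", "0", "-fa", "on",
    "--reasoning-budget", "19000",
    "--reasoning-budget-message", " ... reasoning budget exceeded, need to answer.",
    "--no-mmproj",
]

# Per-OS additions quoted in the post.
OS_FLAGS = {
    "Linux": ["-fit", "on"],
    "Windows": ["-fit-target", "250"],
}


def build_command(model_path, os_name, coding=False, binary="llama-server"):
    """Return the argument list for one run.

    `binary` and `model_path` are placeholders (assumptions of this sketch).
    """
    context = "150000" if coding else "50000"
    base = [binary, "-m", model_path, "-c", context]
    return base + COMMON_FLAGS + OS_FLAGS.get(os_name, [])


def build_env(base_env=None):
    """Copy the environment and set the GPU order used in the post."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "1,2,0"
    return env


# Print the command for the current OS instead of launching anything.
command = build_command("model.gguf", platform.system())
print(shlex.join(command))
```

To actually launch it, the list and environment could be passed to `subprocess.run(command, env=build_env())`.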
Results (PP = prompt processing, TG = token generation, both in tokens per second):
- Qwen 3.5 122B: PP 300, TG 28 on Windows; PP 290, TG 28.5 on Linux
- Qwen 3.5 397B: PP 140, TG 16 on Windows; PP 150, TG 15.2 on Linux
- MiniMax 2.7: PP 220, TG 17 on Windows; PP 230, TG 16 on Linux
All tests were run 4 times each, across the following tasks:
1) A brief article summary with 8k tokens of prompt processing.
2) Translating a portion of a book from Chinese — 20k tokens of prompt processing.
3) A Java test — the percentage results were the same across all models. Deliberate errors were introduced in two classes, with a total of 85k tokens of prompt processing.
Well, WSL turned out to be the slowest — I ran a test with just Qwen 3.5 397B, and the speed dropped from PP 140, TG 16 down to 110 PP and 13.5 TG.
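The figures above can be turned into relative differences with a few lines of Python. The numbers are the ones quoted in the post; the helper names are made up for this sketch.

```python
# (PP, TG) as reported in the post, per model and OS.
RESULTS = {
    "Qwen 3.5 122B": {"Windows": (300, 28), "Linux": (290, 28.5)},
    "Qwen 3.5 397B": {"Windows": (140, 16), "Linux": (150, 15.2)},
    "MiniMax 2.7": {"Windows": (220, 17), "Linux": (230, 16)},
}

# The single WSL run from the post (Qwen 3.5 397B only).
WSL_397B = (110, 13.5)


def percent_delta(value, baseline):
    """Signed difference of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def linux_vs_windows(results):
    """Map each model to (PP delta %, TG delta %) of Linux relative to Windows."""
    rows = {}
    for model, by_os in results.items():
        win_pp, win_tg = by_os["Windows"]
        lin_pp, lin_tg = by_os["Linux"]
        rows[model] = (percent_delta(lin_pp, win_pp), percent_delta(lin_tg, win_tg))
    return rows


def wsl_vs_windows(results, wsl):
    """(PP delta %, TG delta %) of the WSL run relative to native Windows."""
    win_pp, win_tg = results["Qwen 3.5 397B"]["Windows"]
    return percent_delta(wsl[0], win_pp), percent_delta(wsl[1], win_tg)


for model, (pp, tg) in linux_vs_windows(RESULTS).items():
    print(f"{model}: Linux vs Windows PP {pp:+.1f}%, TG {tg:+.1f}%")

wsl_pp, wsl_tg = wsl_vs_windows(RESULTS, WSL_397B)
print(f"Qwen 3.5 397B: WSL vs Windows PP {wsl_pp:+.1f}%, TG {wsl_tg:+.1f}%")
```

By these numbers every native Windows/Linux difference stays under 8%, and PP and TG point in opposite directions for each model, while WSL is roughly 21% (PP) and 16% (TG) behind native Windows on the 397B model.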
I’ve laid out the exact llama.cpp launch parameters, so anyone can easily reproduce the results on their own hardware. Of course, everyone’s setup is different, but the performance ratio won’t change for MoE models with hybrid CPU+GPU offloading.
And running such large models doesn’t require a ton of space, massive power draw, or all the other things people often list. From the wall, the 397B model pulled only 550–600 watts according to the readings. I also attached a photo of the PC — in a closed case, air convection is better with 140 mm fans.
Kornelius20@reddit
I so desperately want to be on Linux but windows is the only way my scrappy oculink setup even detects the gpu I'm running :(
BoogerheadCult@reddit
Dumb tests. You are not pushing the system hard enough, so there are not enough differences to see. On resource-constrained systems, where you run very large models and try to squeeze every last drop out of your RAM and VRAM, Linux always shines.
It is no coincidence that you can allocate more VRAM and run larger models on Ryzen Max in Linux but can't do it on Windows.
Sorry OP, you are not technical enough for this kind of comparison, just stay with Windows, it is for low IQ people anyway.
minus_28_and_falling@reddit
Good to know there's no difference, so I can happily stay on Linux.
Far-Usual5771@reddit (OP)
But I wasn't proselytizing to anyone — at least read the thread title to begin with. If you like Linux, that's fine; personally, switching was a hassle for me because: one, games. Two, some of my work software is strictly Windows-only. And finally, three, I use Office 365, and there's no proper office editor on Linux at all. Sure, you can use the web version, but why would I when there's a fully functional native option on Windows.
minus_28_and_falling@reddit
I don't know why you imply I didn't read the thread title or that I think you were proselytizing. You are obviously free to use whatever os you like, no judgement.
silenceimpaired@reddit
Oh, I got it. English isn’t OP’s first language, going by his last comment. His post says, “I decided to check how much performance I’d lose if I moved to Windows.” I think most native speakers would have said, “if I moved BACK to Windows,” considering that his comment makes it clear he wants to stay with Windows. That and the phrase “But I wasn't proselytizing to anyone.”
I would guess India, though it could be a middle eastern country.
Lots of guessing here. OP could just be crazy ;)
Far-Usual5771@reddit (OP)
Yes, you’re right — English isn’t my native language, so the occasional funny slip-up is bound to happen. If anything isn’t clear, please cut me some slack for it not being my main language :))
silenceimpaired@reddit
No problem at all. I am confident you speak far better in English than I in your native language. I was merely pointing out to the other commenter why there was a potential misunderstanding.
Don’t assume people are opposed to your efforts! I appreciate the work you put in. I and others just happen to like Linux and/or hate Windows.
I’m curious, was I close in my guess as to your homeland?
silenceimpaired@reddit
It’s funny… OP said the same thing. I always heard Linux was faster anyway so kind of weird having people concerned Windows is faster.
Far-Usual5771@reddit (OP)
Can I get links to such threads? Because when it comes to claims that Linux is faster, you'll find thousands of topics — which are, in fact, a myth.
sagiroth@reddit
This is the way
CapoDoFrango@reddit
lol, why did you ever think that GNOME, Xfce or KDE would make any difference? This stuff is GUI-agnostic.
def_not_jose@reddit
How's Qwen 3.6 27b Q8 doing on 5080 + 2x 5060 Ti? 3x GPUs means no tensor parallelism, right?
geldonyetich@reddit
Funny enough, I was using some instructions I found to force AI MAX 395+ ROCM support (just replace the GPU with gfx1151) over Ollama and they worked earlier but don't work now. Some mysterious Windows update process killed it; it can no longer detect the GPU that way. So there's one point for Linux, I guess.
thereisonlythedance@reddit
There’s no need to use the flag -fa on. It’s on by default these days.
Far-Usual5771@reddit (OP)
I know, but the danger with default performance-related flags is that if something breaks, you won’t even be able to tell right away what went wrong. When you set them explicitly, it just works every time. Besides, it’s launched from a bat file on Windows anyway.
korino11@reddit
linux can be faster. Because you can make a custom kernel with hard tuning and avoid any mitigations.
Far-Usual5771@reddit (OP)
So will there be actual tests with hardware specs and commands, comparing Linux and Windows, or just more fairy tales? Preferably with at least a MoE 120B model. Or is it going to be the same empty talk about how fast Linux is, which is pure myth? I've provided the hardware, the build, and the exact commands, all easily reproducible.
korino11@reddit
If you had ever built a tweaked custom kernel, then you would understand what opportunities it gives you, and that such opportunities are absent in Windows. Otherwise it is just a waste of time on lamers' questions.
Far-Usual5771@reddit (OP)
More fairytales about super performance. Sure, maybe when you've got an ancient PC that's decades old and you're fighting for every megabyte — then it might matter. But honestly, wouldn't it just be easier to upgrade your PC? Even a Raspberry Pi would be more powerful in that case. And bringing this up as an example in an AI thread about MOE models that weigh at least 130 GB... It just shows a complete lack of understanding of the process. No matter how you compile your kernel, no matter how many services you strip out, how much do you actually save — a couple of gigabytes? Really useful when the model is a 397B one that's 200 GB. The only things that will actually help and affect speed are VRAM and DDR5 overclocking, full stop. That was the entire point of my post.
Honestly, I’d happily switch to any OS — even a command-line-only one — if it gave me at least a 15% improvement. And that "margin of error" percentage everyone keeps going on about is just from not knowing how to figure out the software. It's funny how some people can fine-tune a kernel and configure a whole system, yet find it too hard to open a program's help and spend a couple of minutes understanding its commands.
korino11@reddit
No matter, really?!? Levels of abstraction, latency in data... ok-ok, no matter 😃
Far-Usual5771@reddit (OP)
And of course, we won’t see any numbers or commands — nothing. Naturally, when you hold a fairy tale up against reality, the facts speak for themselves. I see no point in writing any more; everything that’s been said over these last few posts has just been empty words and nothing else.
igor__004@reddit
Since you’re running a hybrid CPU+GPU offloading setup for these massive models (397B on 48GB total VRAM means a lot is hitting system RAM), I’m curious if you noticed any difference in CPU utilization or memory bandwidth bottlenecks between the two OSes?
Usually, the “Linux is faster” argument comes from how the OS scheduler handles CPU-bound workloads and memory mapping ( mmap ), but since you passed --no-mmap and forced direct I/O ( -dio ), that probably leveled the playing field entirely. Did you test if enabling mmap would bring the performance gap back, or does -dio just make it irrelevant now?
Far-Usual5771@reddit (OP)
And why bother, when even llama.cpp itself says that mmap in hybrid mode kills performance?
igor__004@reddit
I was mostly wondering whether the OS gap still shows up at all, but in this setup it may just be a non-factor.
DunderSunder@reddit
why would mmap affect performance?
igor__004@reddit
mmap can matter because it changes how weights are paged and cached between disk and RAM. On very large models, that can affect load time and sometimes throughput if memory pressure is high. With --no-mmap / direct I/O, you’re basically bypassing that path, so the difference can shrink a lot.
DunderSunder@reddit
so if all of the weights are in GPU it shouldn't matter.
igor__004@reddit
Yeah, mostly. If the model is fully GPU-resident, mmap becomes much less important for inference speed. It still can affect loading and host-side memory behavior, but the big performance differences usually show up when weights are not fully on GPU.
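The two loading paths described here can be illustrated in plain Python. This is a generic sketch of read-versus-mmap file access using the standard library, not llama.cpp's actual loader; the temporary file stands in for a model file.

```python
import mmap
import os
import tempfile


def load_by_read(path):
    """Copy the entire file into process memory up front."""
    with open(path, "rb") as f:
        return f.read()


def peek_by_mmap(path, offset, length):
    """Map the file and touch only one slice; the OS pages data in on access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


def make_dummy_file(size):
    """Write a throwaway file standing in for model weights."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * (size // 256))
    return path


path = make_dummy_file(1024 * 1024)
try:
    whole = load_by_read(path)           # whole file resident in memory
    part = peek_by_mmap(path, 4096, 16)  # only the touched pages are needed
    print(len(whole), part == whole[4096:4112])
finally:
    os.remove(path)
```

With the mapped path, which pages stay cached and when they are evicted is up to the OS, which is where memory pressure and OS differences can come into play; with the read path, the data sits in memory the process owns.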
CoolConfusion434@reddit
In my case, and because I run an Intel B70 GPU, Windows and Vulkan is actually faster than both Vulkan and SYCL on Linux. Even with the latest SYCL drivers, Vulkan beats it on Windows. It's been a pain in the butt to redo all my startup scripts in PowerShell but here we are. Also, I literally download and drop the pre-compiled llama/Vulkan binaries from Github releases so even that is faster.
This all says Intel has their work cut out for them. When you have a "generic" driver like Vulkan beat your own driver on your own device, something's missing. Here's hoping they can spare the resources to look into this.
nicholas_the_furious@reddit
Any way you can test 100% VRAM models? I have trouble as soon as my 3rd GPU is engaged. Two is fast; three is fast to start and then slows to a crawl as the KV cache builds up, I assume.
I have always assumed it was a Windows P2P issue.
Far-Usual5771@reddit (OP)
What do you mean, memory load? I already posted the commands, and the VRAM on all my cards is nearly full. If you're using a dense model rather than MoE, the moment you spill over from VRAM, performance will nosedive.
nicholas_the_furious@reddit
So in this case you're always spilling over, meaning that you have lots of model weights sitting in RAM and not only in VRAM. Something I noticed is, with my setup, that when I add a 3rd card and the models are only in VRAM I get a massive slowdown compared to 2 cards on Windows. When I spill into RAM that problem is irrelevant because the bottleneck is now RAM/CPU and not my GPU connectivity.
What I was asking was if you could try a model that you could fit 100% in the VRAM of your GPUs instead of spilling into RAM and see how the performance is between Linux and Windows.
Far-Usual5771@reddit (OP)
Here you need to figure out what’s causing the speed drop when you add a third card – maybe it’s weaker than the first two, or the layers are distributed incorrectly. By the way, if your GPUs are different, definitely use `CUDA_VISIBLE_DEVICES` – it lets you load the main layers onto the more powerful cards. If you don’t use tensor split, even PCIe Gen 3 x4 is enough for the model to work when split across GPUs.
I only tried one run with Qwen 3.6 35B (q8, KV cache f16, MTP). The pp speeds were identical at 3000. But I didn’t use it as an example because with MTP the tg (token generation) speeds jump all the time: first run 130, second 150, and the third one drops to 105.
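The run-to-run jitter quoted above can be quantified quickly. The three TG values are from the comment; using (max - min) / mean as the spread measure is a choice made for this sketch.

```python
from statistics import mean


def relative_spread(samples):
    """Range of the samples as a percentage of their mean."""
    return (max(samples) - min(samples)) / mean(samples) * 100.0


# TG speeds of the three MTP runs mentioned in the comment.
mtp_tg_runs = [130, 150, 105]

print(f"mean {mean(mtp_tg_runs):.1f}, spread {relative_spread(mtp_tg_runs):.1f}%")
```

A spread of roughly 35% of the mean is far larger than the single-digit percentage gaps between Windows and Linux in the main results, so any OS effect would be lost in the noise.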
nicholas_the_furious@reddit
All cards are the same (3090) and the drop happens no matter which PCIe orientation. They're all on CPU lanes and all at least PCIe 4.0 x4. I've tried everything except testing p2p drivers with Linux because I don't have a second SSD to try Linux out on at the moment.
It starts out fast and then slows down exponentially. I've been troubleshooting for months with everything I can think of.
Far-Usual5771@reddit (OP)
You don’t need a second SSD for Linux — just write the Ubuntu image to a USB flash drive, then during installation choose the option “Install alongside Windows” (I think that’s what it was called — sorry if that’s not exact). Then simply allocate a bit of space for Linux and see if the problem goes away there.
ea_man@reddit
There's no way that Windows uses the same VRAM as a properly configured KDE or a light desktop such as LXQt on Linux.
Right now I'm at 200MB of consumption on LXQt running Firefox with YouTube tabs and a bunch of shells, 50MB headless.
Far-Usual5771@reddit (OP)
And I'll repeat what I said in my earlier post. Will there be comparison tests with command descriptions that can be reproduced to see what's actually worth it? Because for a MoE model weighing 140 or even 200 GB, a difference of a couple hundred megabytes of VRAM is nothing.
ea_man@reddit
Sure: show us how much VRAM Windows is wasting to run a desktop + Firefox + a shell.
I have W11 IoT here running on the same hardware as above:
Far-Usual5771@reddit (OP)
And once again, people can’t read, don’t want to check, but still feel the need to voice their opinion without reading. Reddit being Reddit. But let me repeat, maybe this time it’ll actually be read: for a large model, your few hundred megabytes are completely useless. A single layer there weighs gigabytes, so there won’t be any difference.
ea_man@reddit
Dude, don't worry, we know how VRAM is important: just be a dove and post your VRAM usage instead of "fairy tales".
Then maybe someone would take the time to tell you why it's important.
bastonpauls@reddit
My kde plasma uses 900 mb
ea_man@reddit
If that's what you like: good for you. I'd rather use those 780MB for context as I have just 16GB and I want to run dense models.
bastonpauls@reddit
I'm going to try to reduce VRAM usage; I only have 6GB of VRAM (GTX 1660).
sagiroth@reddit
You can squeeze out much more context than on Windows, and run headless. On CachyOS I can SSH into the system, kill the processes holding VRAM and have the entire pool of VRAM available, whereas on Windows that's not really possible. Otherwise, yeah, performance shouldn't be affected.
Far-Usual5771@reddit (OP)
And once again — will there be comparison tests with command descriptions that can be reproduced to see what's actually worth it? Because for a MoE model weighing 140 or even 200 GB, a difference of a couple hundred megabytes of VRAM is nothing.
sagiroth@reddit
Yeah, at that scale it's not a problem at all. I was talking about fitting dense models, so it's probably irrelevant here.
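To put the numbers in this exchange in proportion, here is the desktop overhead as a share of each VRAM pool mentioned in the thread. The overhead and capacity figures are the ones commenters quoted (200 MB for LXQt, 900 MB for KDE Plasma; 6 GB, 16 GB, and 48 GB in total for the OP's three cards); treating 1 GB as 1024 MB is an assumption of this sketch.

```python
def overhead_share(overhead_mb, vram_gb):
    """Desktop VRAM overhead as a percentage of the total VRAM pool."""
    return overhead_mb / (vram_gb * 1024) * 100.0


# Figures quoted by commenters in the thread.
DESKTOP_OVERHEAD_MB = {"LXQt": 200, "KDE Plasma": 900}
VRAM_POOLS_GB = {"GTX 1660": 6, "single 16 GB card": 16, "OP's three cards": 48}

for desktop, overhead in DESKTOP_OVERHEAD_MB.items():
    for pool, size in VRAM_POOLS_GB.items():
        share = overhead_share(overhead, size)
        print(f"{desktop} on {pool}: {share:.1f}% of VRAM")
```

900 MB is about 15% of a 6 GB card and about 5% of a 16 GB card, but under 2% of a 48 GB pool. That fits both sides of the argument: it matters for small cards running dense models and barely registers for large hybrid-offload MoE setups.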
Anbeeld@reddit
Well, it's possible if you have an iGPU.
sagiroth@reddit
You can do it without an iGPU. You can force software rendering. I either do that or use another PC/laptop to connect to it.
gladfelter@reddit
You're getting 6400 MHz and stable system performance with four RAM sticks? What's your secret?
Far-Usual5771@reddit (OP)
As already answered, there are no secrets. A good motherboard, an Intel processor (because it handles memory much better), and a bit of time to dial in frequencies, voltages, timings, and run tests.
korino11@reddit
No secrets at all, you just need a good motherboard with shielding on 12 levels.
FormalAd7367@reddit
very interesting
Dependent-Guitar-473@reddit
thank you for the PC case photos, it helped 😅
Far-Usual5771@reddit (OP)
Added a photo of the internals — not sure why the emphasis on photos, but now there's a photo with the internals anyway.