Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.
Posted by Ok_Mine189@reddit | LocalLLaMA | 65 comments
As a lifelong Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
Setup:
- OS: Windows 11 25H2 vs Lubuntu 26.04
- Engine: Llama.cpp b8929, CUDA 13.1 (official prebuilt binaries on Windows; compiled myself with CMake on Lubuntu)
- CPU: Intel Core i9-14900KF
- RAM: 64GB DDR5 6800 MT/s
- GPU: RTX 5080 16GB VRAM
- Drivers: 596.32 (Windows) / 595.x (Lubuntu)
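Since the Lubuntu binary was self-compiled, build flags are part of the picture. A minimal sketch of a typical llama.cpp CUDA build on Linux (my assumption — OP's exact CMake flags aren't given):

```shell
# Typical llama.cpp CUDA build (sketch; OP's exact flags unknown)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON            # enable the CUDA backend
cmake --build build --config Release -j"$(nproc)"
# binaries land in build/bin/ (llama-cli, llama-bench, ...)
```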
The Results (Averaged)
I ran a 2500+ token prompt against llama-cli across several different models.
(Note: Gemma 4, OSS-20B & Qwen3.6 were fully offloaded to the GPU. Qwen3.5 & OSS-210B were hybrid CPU/GPU runs using -t 8 -tb 8 -fit on)
| Model | Win 11 (Prompt) | Lubuntu (Prompt) | Prompt Diff | Win 11 (Gen) | Lubuntu (Gen) | Gen Diff |
|---|---|---|---|---|---|---|
| Gemma-4-E4B-it (Q8_K_XL) | 6,232 t/s | 7,587 t/s | + 21.7% | 111.7 t/s | 116.7 t/s | + 4.4% |
| Qwen3.5-35B-A3B (Q8_K_XL) | 305 t/s | 742 t/s | + 143.2% | 48.1 t/s | 52.2 t/s | + 8.5% |
| GPT-OSS-20B (MXFP4) | 7,619 t/s | 8,140 t/s | + 6.8% | 195.8 t/s | 206.2 t/s | + 5.3% |
| Qwen3.6-27B (IQ4_XS) | 2,077 t/s | 2,235 t/s | + 7.6% | 43.8 t/s | 46.0 t/s | + 5.0% |
| GPT-OSS-120B (MXFP4) | 310 t/s | 649 t/s | + 109.3% | 43.4 t/s | 44.9 t/s | + 3.4% |
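The diff columns are just the relative speedup computed from the averaged numbers; e.g. for the Gemma prompt-processing row:

```shell
# Relative speedup = (linux - windows) / windows * 100,
# using the averaged prompt numbers from the Gemma row above
awk 'BEGIN { printf "+ %.1f%%\n", (7587 - 6232) / 6232 * 100 }'
# prints: + 21.7%
```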
Takeaways
- Generation Speeds: Lubuntu is consistently about 4% to 8% faster across the board for token generation. It's a nice bump, but maybe not enough to justify an OS swap on its own if you only care about reading speed.
- Prompt Processing (Fully Offloaded): Linux handles prompt evaluation on the GPU noticeably faster. Even on the lower end, it's 6-7% faster, and up to 21% faster on the Gemma 4 run.
- Prompt Processing (CPU/GPU Hybrid): This is where it gets crazy. On the models where Llama.cpp had to lean on the CPU (-t 8 -tb 8), Linux completely obliterated Windows, with 109-143% faster prompt processing.
Raw Run Logs:
Windows 11:
.\llama-cli -m "E:\models\unsloth\gemma-4-E4B-it-GGUF\gemma-4-E4B-it-UD-Q8_K_XL.gguf" -c 8192 -mli -fa on --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja --chat-template-kwargs '{\"enable_thinking\":true}'
[ Prompt: 4038.3 t/s | Generation: 111.6 t/s ][ Prompt: 7341.7 t/s | Generation: 111.8 t/s ][ Prompt: 6432.1 t/s | Generation: 111.9 t/s ][ Prompt: 7116.3 t/s | Generation: 111.7 t/s ]
.\llama-cli -m "E:\models\unsloth\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf" -c 16384 -mli -fa on --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -np 1 --no-mmap --chat-template-kwargs "{\"enable_thinking\":true}" -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 296.5 t/s | Generation: 48.4 t/s ][ Prompt: 308.6 t/s | Generation: 48.0 t/s ][ Prompt: 313.7 t/s | Generation: 48.2 t/s ][ Prompt: 302.1 t/s | Generation: 47.8 t/s ]
.\llama-cli -m "E:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf" -c 32768 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 7651.2 t/s | Generation: 195.6 t/s ][ Prompt: 7661.0 t/s | Generation: 196.6 t/s ][ Prompt: 7653.2 t/s | Generation: 196.6 t/s ][ Prompt: 7510.8 t/s | Generation: 194.6 t/s ]
.\llama-cli -m "E:\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-IQ4_XS.gguf" -c 8192 -mli -fa on --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence_penalty 1.5 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 1859.4 t/s | Generation: 43.2 t/s ][ Prompt: 2132.9 t/s | Generation: 43.0 t/s ][ Prompt: 2153.1 t/s | Generation: 44.5 t/s ][ Prompt: 2166.1 t/s | Generation: 44.5 t/s ]
.\llama-cli -m "E:\models\lmstudio-community\gpt-oss-120b-GGUF\gpt-oss-120b-MXFP4-00001-of-00002.gguf" -c 16384 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -np 1 --no-mmap --jinja -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 324.3 t/s | Generation: 43.3 t/s ][ Prompt: 320.8 t/s | Generation: 43.4 t/s ][ Prompt: 284.9 t/s | Generation: 43.4 t/s ]
Lubuntu 26.04:
./llama-cli -m /home/user/models/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf -c 8192 -mli -fa on --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja --chat-template-kwargs "{\"enable_thinking\":true}"
[ Prompt: 7621.5 t/s | Generation: 116.6 t/s ][ Prompt: 7537.8 t/s | Generation: 116.6 t/s ][ Prompt: 7665.7 t/s | Generation: 116.7 t/s ][ Prompt: 7523.5 t/s | Generation: 116.8 t/s ]
./llama-cli -m /home/user/models/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf -c 16384 -mli -fa on --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -np 1 --no-mmap --chat-template-kwargs "{\"enable_thinking\":true}" -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 739.4 t/s | Generation: 52.3 t/s ][ Prompt: 744.6 t/s | Generation: 52.0 t/s ][ Prompt: 746.3 t/s | Generation: 52.3 t/s ][ Prompt: 741.3 t/s | Generation: 52.2 t/s ]
./llama-cli -m /home/user/models/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -c 32768 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 7819.8 t/s | Generation: 205.7 t/s ][ Prompt: 8250.8 t/s | Generation: 206.4 t/s ][ Prompt: 8254.9 t/s | Generation: 206.9 t/s ][ Prompt: 8237.0 t/s | Generation: 206.0 t/s ]
./llama-cli -m /home/user/models/Qwen3.6-27B-GGUF/Qwen3.6-27B-IQ4_XS.gguf -c 8192 -mli -fa on --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence_penalty 1.5 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 2238.1 t/s | Generation: 46.0 t/s ][ Prompt: 2232.3 t/s | Generation: 46.0 t/s ][ Prompt: 2235.4 t/s | Generation: 46.0 t/s ][ Prompt: 2237.3 t/s | Generation: 46.0 t/s ]
./llama-cli -m /home/user/models/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf -c 16384 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -np 1 --no-mmap --jinja -fit on -fitt 160M -t 8 -tb 8
[ Prompt: 650.0 t/s | Generation: 45.2 t/s ][ Prompt: 647.8 t/s | Generation: 45.0 t/s ][ Prompt: 650.3 t/s | Generation: 44.7 t/s ][ Prompt: 649.0 t/s | Generation: 45.0 t/s ]
DunderSunder@reddit
The diff should be less than 5%. Something is not right.
Ok_Mine189@reddit (OP)
Dunno, maybe it's the prebuilt Windows binaries from the official llama.cpp repo vs. compiling them myself on Linux?
tomByrer@reddit
Windows might have more background tasks running. It has tons of telemetry, likes to hide the icons for running programs (like Steam)...
I used 2 debloaters like this one: https://github.com/Raphire/Win11Debloat
DunderSunder@reddit
Frankly, I don't know, though it would be useful if you got to the bottom of this and found the culprit. Have you tried ik_llama? It's more optimized for hybrid situations.
Ok_Mine189@reddit (OP)
Using ik_llama on Lubuntu I was getting 47-48 t/s generation - vs 52 t/s with llama.cpp. Plus it was awkward as heck.
AvidCyclist250@reddit
It's pure, unadulterated, epic cancer to set up llama.cpp on Windows perfectly. Did you succeed?
Nutsack_VS_Acetylene@reddit
Just grab the prebuilt binaries; avoid the Docker and compile-from-scratch pipelines unless you want to do something that requires them.
an0maly33@reddit
Not sure what you mean. You download 2 zips, extract them, run it.
Ok_Mine189@reddit (OP)
Is it though? I just downloaded the prebuilt binaries, plugged in the provided CUDA .dlls and it worked from the get-go for me :)
mr_Owner@reddit
It's probably due to the Windows Desktop Window Manager (dwm.exe).
When you route your video output through the motherboard's iGPU, you get that perf increase too. Linux doesn't have heavy desktop rendering like Windows; I believe that's presumably the main difference.
javiers@reddit
I am surprised the gap isn’t bigger. Also why windows? Do you hate yourself or your computer?
Ok_Mine189@reddit (OP)
Well, I'm a gamer after hours and Windows still gives better performance for me (for gaming).
Potential-Leg-639@reddit
This is why everyone here has been suggesting Linux for local models for quite some time :)
Pakobbix@reddit
Hmm interesting. I can't verify them with my own setup (Dual Boot Windows 11 Build 26200 + Zorin OS 18).
Unfortunately, Nvidia doesn't support voltage control on Linux, and thus my GPU uses 100% power on Linux for the same performance I get at ~66-75% on Windows (no power control, just undervolting).
And that's currently my biggest "should I do the full switch or not" blocker. Gaming and inference with up to 34% less power over time is just way too good to give up.
a_beautiful_rhind@reddit
It does now with LACT. They found the hidden API.
Pakobbix@reddit
Holy shit.. that's awesome. Bye bye windows :-) thank you for answering. Will test it out when I'm at home.
razorree@reddit
Interesting, maybe a difference in compilation? You downloaded the precompiled version of llama.cpp for Windows? Maybe it was built without some newer extensions like AVX2?
I think llama.cpp prints all the extensions it was built with, like:
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
You could compare them.
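For instance, a quick (hypothetical) check of a captured system_info line for specific extensions:

```shell
# Sketch: check a captured system_info line for specific CPU extensions.
# The $info string below is a shortened example, not OP's actual output.
info='CPU : SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | REPACK = 1 |'
for flag in AVX2 AVX512 REPACK; do
  if printf '%s' "$info" | grep -q "$flag = 1"; then
    echo "$flag: enabled"
  else
    echo "$flag: missing"
  fi
done
# prints:
# AVX2: enabled
# AVX512: enabled
# REPACK: enabled
```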
Ok_Mine189@reddit (OP)
Here's the printout for Gemma 4 on Windows:
I'll add one from Linux in a moment.
Jester14@reddit
The Windows build is kinda full of CUDA bloat. The runs have different thread counts specified, and threads aren't always specified at all.
Ok_Mine189@reddit (OP)
Well, I specified threads explicitly only for the hybrid runs; for the others I relied on the default behavior.
And it seems the default on Windows is 24 threads and on Linux 8, but I reran the Windows Gemma 4 bench with threads forced to 8 and the results were exactly the same, so that ain't it.
Ok_Mine189@reddit (OP)
Which is not surprising, since the model fits fully in VRAM and the CPU has nothing to say in this matter; it's all GPU.
Monkey_1505@reddit
Yeah it would have to be one of the ones that doesn't fit in vram for the thread flag to make a difference.
Monkey_1505@reddit
I find the common advice of 'specify the same number of threads as you have cores' to be correct. Maybe just me, but I get significant slowdowns if I stray too much from this.
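(On a hybrid CPU like the 14900KF, "cores" is ambiguous — OP's -t 8 matches its 8 P-cores. A quick way to inspect the topology with standard Linux tools:)

```shell
# Inspect CPU topology (on a 14900KF: 24 cores total,
# 8 hyper-threaded P-cores + 16 E-cores = 32 logical CPUs)
nproc                                      # number of logical CPUs
lscpu | grep -E '^(Socket|Core|Thread)'    # sockets, cores/socket, threads/core
```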
twack3r@reddit
Is this via WSL2 on Win11? Or directly on Win11?
Ok_Mine189@reddit (OP)
Directly on Win 11.
twack3r@reddit
Did you already compare to WSL2?
Ok_Mine189@reddit (OP)
Nope, I reckon that would yield the worst results, as it incurs the biggest VRAM penalty (running Windows + Linux)?
Danmoreng@reddit
I actually tested Win 11 + WSL and it was slower than pure Win 11. Not worth trying out.
ambient_temp_xeno@reddit
It's not so much that there's an inherent issue with windows (necessarily anyway), it's that the cuda dev guy doesn't care about the windows performance. The difference used to be a lot bigger on my machines.
JamesEvoAI@reddit
There's a noticeable performance difference even when running CPU bound models. Windows has a pretty high baseline of performance overhead just to run the OS itself. Linux on the other hand comes in many possible configurations with different performance overheads, including a distro stripped down to only the bare minimum to run your inference engine. Also not having to constantly collect and send telemetry helps.
mstahh@reddit
It's funny that the world rests on the CUDA dev guy's shoulders.
simracerman@reddit
Who’s he? Can’t we buy him coffee/beer in bulk?
Monkey_1505@reddit
That seems like a plausible explanation. There are pretty big PP/gen differences between hardware libraries or whatever they are called, and even versions of those.
Monkey_1505@reddit
Makes sense. It's the MoEs mainly, and Linux has better RAM management. Still quite a difference with those; noteworthy.
FullstackSensei@reddit
There's not much RAM management involved here. It's all about scheduling. Even if the model fits entirely in VRAM, Windows is significantly slower because Microsoft thinks their telemetry, Copilot, ads, and all the other slop should be given higher priority than whatever you're running on your machine.
draconic_tongue@reddit
"significantly slower" holy delusion
Monkey_1505@reddit
I mean, you could switch all that stuff off, right?
I know there are also pared-down versions of Windows. But I don't think they give a nearly 2x in RAM-spillover LLM models' PP. IDK 🤷♂️
FullstackSensei@reddit
Most of it can't be switched off without jumping through several hoops, and even when you do, the moment any of those things gets an update, Windows Update will reinstall it.
Where did OP say there's 2x RAM spillover? They said 2x performance, not 2x offload.
Monkey_1505@reddit
You misunderstood what I wrote. Wonky ordering but I was saying nearly 2x PP speed.
There are community custom versions of Windows that people make for gaming, with everything non-essential stripped out. You can also permanently turn off updates with an app.
I suspect that would not give the same speed up as we are seeing here.
draconic_tongue@reddit
Windows doesn't use more RAM. What's shown as "usage" is more like a reservation; it's not actually used. You can go over your physical RAM all the time and nothing happens, virtual memory takes over.
FullstackSensei@reddit
I've tried those versions and they have their own issues.
You won't get the same performance even with those versions because the problem is the scheduler. Microsoft screwed that up a couple of years ago. I suspect the best chance would be W10.
Monkey_1505@reddit
I do recall slight frame rate differences between w10 and w11. You can get stripped down versions of windows 10 too. A very testable hypothesis in any case.
FullstackSensei@reddit
Meh, I just run Linux on my LLM machines. Way less headaches anyway.
Monkey_1505@reddit
Yeah, I just meant for testing/knowing whether the statement was true, or if the cause of the difference in PP speed here was something else.
iamapizza@reddit
Looking at all the performance threads posted here, it looks like Linux with a GPU is the sweet spot between performance and value.
You mention Lubuntu, but I assume the distro doesn't really matter? Or does it?
MalabaristaEnFuego@reddit
I'm shocked anyone would run LLMs any other way, just for the amount of extra RAM headroom you have over Windows, and the overall efficiency of the OS.
Windows is all about, "How can I saturate this experience with more features?" Linux is like, "Let's get some shit done today."
ea_man@reddit
Lubuntu runs a light desktop, LXQt, which wastes less VRAM than more sophisticated DEs with expensive eye candy.
Yet it ain't a matter of Lubuntu vs Kubuntu; you can install as many DEs as you want and use the one that suits the job. I usually run a full KDE and keep a stripped-down LXQt for when I have to fit a 15.5GB LM on a 15.9GB GPU. You can of course skip the DE entirely, run just the services, and spend as little as 50MB keeping just text terminals.
If you run just CLI tools like Opencode / Pi you can do with Tmux.
vogelvogelvogelvogel@reddit
same question here, any opinions on how much the distro matters?
sob727@reddit
it does not
it's about the version of the software (linux kernel, cuda) you have installed, not the name on the box
vogelvogelvogelvogel@reddit
ok thx
sob727@reddit
Where it could hypothetically matter is if your distro installs a bunch of unnecessary services that slow down your machine, or if it has a default kernel config that is massively suboptimal, but these are pathological cases.
sob727@reddit
So yeah, it's more about config than the distro name itself.
Ok_Mine189@reddit (OP)
I suspect any performance differences between distros are marginal compared to the gap between Linux and Windows. I went with Lubuntu because it still has the well-supported Ubuntu foundation underneath, while being one of the most lightweight flavors available.
Craftkorb@reddit
People are usually surprised to see that KDE has the best gaming performance in recent benchmarks - But I don't remember seeing LXDE in them.
But then, it shouldn't matter much. And if your machine is a server, then you don't have a desktop on it anyway, waste of resources.
Craftkorb@reddit
Not really LLM, but there's been a recent surge in performance tests between Distros for Gaming. Don't expect too many differences here though: The inference engine is already highly optimized, and as the Linux kernel is mostly the same on different Distros, it doesn't change too much.
Aerthlyomi@reddit
Scheduling is part of the kernel, so apart from the kernel version it should not matter.
ea_man@reddit
And don't forget the amount of VRAM that Windows wastes; on Linux you can get that down to 50-250MB, which means running something like a 15.1GB Qwen + 80k of context at Q4 on a 16GB GPU.
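One way to get there (a sketch using standard systemd commands; assumes an Ubuntu-family install with systemd):

```shell
# Boot to a text-only target so no display manager/compositor holds VRAM
sudo systemctl set-default multi-user.target
sudo reboot
# ...after reboot, see how little is resident on the GPU:
nvidia-smi --query-gpu=memory.used --format=csv
# (revert with: sudo systemctl set-default graphical.target)
```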
DunderSunder@reddit
There is big variance with small prompts. See, you are getting like 8k prompt-processing speed for a 2.5k-token prompt; even a small delay can change pp/s a lot. Try it a few times and watch the numbers fluctuate.
Also, you should be using llama-bench for benchmarking!
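Something like this (a sketch; the model path and sizes are placeholders, not OP's) reports mean and stddev over repeated runs instead of single-shot numbers:

```shell
# llama-bench times prompt processing (pp) and token generation (tg)
# separately, averaging over -r repetitions
./llama-bench -m /path/to/model.gguf -p 2560 -n 128 -r 5 -ngl 99
```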
Ok_Mine189@reddit (OP)
That's why it's averaged over 4 runs each.
Long_comment_san@reddit
Probably some MS arse security thing.
External_Dentist1928@reddit
Nice! Another reason to finally make the switch 👍
UltrMgns@reddit
Microsoft really came a long way... Down.