Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big.
Posted by Ok_Mine189@reddit | LocalLLaMA | 65 comments
As a lifelong Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
Setup:
- OS: Windows 11 25H2 vs Lubuntu 26.04
- Engine: Llama.cpp b8929, CUDA 13.1 (official prebuilt binaries on Windows; compiled myself with CMake on Lubuntu)
- CPU: Intel Core i9-14900KF
- RAM: 64GB DDR5 6800 MT/s
- GPU: RTX 5080 16GB VRAM
- Drivers: 596.32 (Windows) / 595.x (Lubuntu)
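Since the Lubuntu binary was self-compiled, build flags are part of the picture. A minimal sketch of a typical llama.cpp CUDA build on Linux (my assumption — OP's exact CMake flags aren't given):

```shell
# Typical llama.cpp CUDA build (sketch; OP's exact flags unknown)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON            # enable the CUDA backend
cmake --build build --config Release -j"$(nproc)"
# binaries land in build/bin/ (llama-cli, llama-bench, ...)
```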
The Results (Averaged)
I ran a 2500+ token prompt against llama-cli across several different models.
(Note: Gemma 4, OSS-20B & Qwen3.6 were fully offloaded to the GPU. Qwen3.5 & OSS-210B were hybrid CPU/GPU runs using -t 8 -tb 8 -fit on)
| Model | Win 11 (Prompt) | Lubuntu (Prompt) | Prompt Diff | Win 11 (Gen) | Lubuntu (Gen) | Gen Diff |
|---|---|---|---|---|---|---|
| Gemma-4-E4B-it (Q8_K_XL) | 6,232 t/s | 7,587 t/s | + 21.7% | 111.7 t/s | 116.7 t/s | + 4.4% |
| Qwen3.5-35B-A3B (Q8_K_XL) | 305 t/s | 742 t/s | + 143.2% | 48.1 t/s | 52.2 t/s | + 8.5% |
| GPT-OSS-20B (MXFP4) | 7,619 t/s | 8,140 t/s | + 6.8% | 195.8 t/s | 206.2 t/s | + 5.3% |
| Qwen3.6-27B (IQ4_XS) | 2,077 t/s | 2,235 t/s | + 7.6% | 43.8 t/s | 46.0 t/s | + 5.0% |
| GPT-OSS-120B (MXFP4) | 310 t/s | 649 t/s | + 109.3% | 43.4 t/s | 44.9 t/s | + 3.4% |
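The diff columns are just the relative speedup computed from the averaged numbers; e.g. for the Gemma prompt-processing row:

```shell
# Relative speedup = (linux - windows) / windows * 100,
# using the averaged prompt numbers from the Gemma row above
awk 'BEGIN { printf "+ %.1f%%\n", (7587 - 6232) / 6232 * 100 }'
# prints: + 21.7%
```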
Takeaways
- Generation Speeds: Lubuntu is consistently about 4% to 8% faster across the board for token generation. It's a nice bump, but maybe not enough to justify an OS swap on its own if you only care about reading speed.
- Prompt Processing (Fully Offloaded): Linux handles prompt evaluation on the GPU noticeably faster. Even on the lower end, it's 6-7% faster, and up to 21% faster on the Gemma 4 run.
- Prompt Processing (CPU/GPU Hybrid): This is where it gets crazy. On the models where Llama.cpp had to lean on the CPU (-t 8 -tb 8), Linux completely obliterated Windows, with 109-143% faster prompt processing.
Raw Run Logs:
Windows 11:
.\llama-cli -m "E:\models\unsloth\gemma-4-E4B-it-GGUF\gemma-4-E4B-it-UD-Q8_K_XL.gguf" -c 8192 -mli -fa on --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja --chat-template-kwargs '{\"enable_thinking\":true}'
[ Prompt: 4038.3 t/s | Generation: 111.6 t/s ][ Prompt: 7341.7 t/s | Generation: 111.8 t/s ][ Prompt: 6432.1 t/s | Generation: 111.9 t/s ][ Prompt: 7116.3 t/s | Generation: 111.7 t/s ]
.\llama-cli -m "E:\models\unsloth\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf" -c 16384 -mli -fa on --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -np 1 --no-mmap --chat-template-kwargs "{\"enable_thinking\":true}" -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 296.5 t/s | Generation: 48.4 t/s ][ Prompt: 308.6 t/s | Generation: 48.0 t/s ][ Prompt: 313.7 t/s | Generation: 48.2 t/s ][ Prompt: 302.1 t/s | Generation: 47.8 t/s ]
.\llama-cli -m "E:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf" -c 32768 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 7651.2 t/s | Generation: 195.6 t/s ][ Prompt: 7661.0 t/s | Generation: 196.6 t/s ][ Prompt: 7653.2 t/s | Generation: 196.6 t/s ][ Prompt: 7510.8 t/s | Generation: 194.6 t/s ]
.\llama-cli -m "E:\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-IQ4_XS.gguf" -c 8192 -mli -fa on --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence_penalty 1.5 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 1859.4 t/s | Generation: 43.2 t/s ][ Prompt: 2132.9 t/s | Generation: 43.0 t/s ][ Prompt: 2153.1 t/s | Generation: 44.5 t/s ][ Prompt: 2166.1 t/s | Generation: 44.5 t/s ]
.\llama-cli -m "E:\models\lmstudio-community\gpt-oss-120b-GGUF\gpt-oss-120b-MXFP4-00001-of-00002.gguf" -c 16384 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -np 1 --no-mmap --jinja -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 324.3 t/s | Generation: 43.3 t/s ][ Prompt: 320.8 t/s | Generation: 43.4 t/s ][ Prompt: 284.9 t/s | Generation: 43.4 t/s ]
Lubuntu 26.04:
./llama-cli -m /home/user/models/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf -c 8192 -mli -fa on --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja --chat-template-kwargs "{\"enable_thinking\":true}"
[ Prompt: 7621.5 t/s | Generation: 116.6 t/s ][ Prompt: 7537.8 t/s | Generation: 116.6 t/s ][ Prompt: 7665.7 t/s | Generation: 116.7 t/s ][ Prompt: 7523.5 t/s | Generation: 116.8 t/s ]
./llama-cli -m /home/user/models/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf -c 16384 -mli -fa on --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 -np 1 --no-mmap --chat-template-kwargs "{\"enable_thinking\":true}" -t 8 -tb 8 -fit on -fitt 160M
[ Prompt: 739.4 t/s | Generation: 52.3 t/s ][ Prompt: 744.6 t/s | Generation: 52.0 t/s ][ Prompt: 746.3 t/s | Generation: 52.3 t/s ][ Prompt: 741.3 t/s | Generation: 52.2 t/s ]
./llama-cli -m /home/user/models/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -c 32768 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 7819.8 t/s | Generation: 205.7 t/s ][ Prompt: 8250.8 t/s | Generation: 206.4 t/s ][ Prompt: 8254.9 t/s | Generation: 206.9 t/s ][ Prompt: 8237.0 t/s | Generation: 206.0 t/s ]
./llama-cli -m /home/user/models/Qwen3.6-27B-GGUF/Qwen3.6-27B-IQ4_XS.gguf -c 8192 -mli -fa on --temp 1.0 --top-k 20 --top-p 0.95 --min-p 0.0 --presence_penalty 1.5 -ngl all -np 1 --no-mmap --jinja
[ Prompt: 2238.1 t/s | Generation: 46.0 t/s ][ Prompt: 2232.3 t/s | Generation: 46.0 t/s ][ Prompt: 2235.4 t/s | Generation: 46.0 t/s ][ Prompt: 2237.3 t/s | Generation: 46.0 t/s ]
./llama-cli -m /home/user/models/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf -c 16384 -mli -fa on --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -np 1 --no-mmap --jinja -fit on -fitt 160M -t 8 -tb 8
[ Prompt: 650.0 t/s | Generation: 45.2 t/s ][ Prompt: 647.8 t/s | Generation: 45.0 t/s ][ Prompt: 650.3 t/s | Generation: 44.7 t/s ][ Prompt: 649.0 t/s | Generation: 45.0 t/s ]
DunderSunder@reddit
The diff should be less than 5%. Something is not right.
Ok_Mine189@reddit (OP)
Dunno, maybe it's the prebuilt Windows binaries from the official llama.cpp repo vs. compiling them myself on Linux?
tomByrer@reddit
Windows might have more background tasks running. It has tons of telemetry, likes to hide the icons for running programs (like Steam)...
I used 2 debloaters like this one: https://github.com/Raphire/Win11Debloat
DunderSunder@reddit
Frankly, I don't know, though it would be useful if you got to the bottom of this and found the culprit. Have you tried ik_llama? It's more optimized for hybrid situations.
Ok_Mine189@reddit (OP)
Using ik_llama on Lubuntu I was getting 47-48 t/s generation - vs 52 t/s with llama.cpp. Plus it was awkward as heck.
AvidCyclist250@reddit
It's pure, unadulterated, epic cancer to set up llama.cpp on Windows perfectly. Did you succeed?
Nutsack_VS_Acetylene@reddit
Just grab the prebuilt binaries; avoid the Docker and compile-from-scratch pipelines unless you want to do something that requires them.
an0maly33@reddit
Not sure what you mean. You download 2 zips, extract them, run it.
Ok_Mine189@reddit (OP)
Is it though? I just downloaded the prebuilt binaries, plugged in the provided CUDA .dlls and it worked from the get-go for me :)
mr_Owner@reddit
It's probably due to the Windows Desktop Window Manager (dwm.exe).
When you route your video output through the motherboard's iGPU, you get that perf increase too. Linux doesn't have heavy desktop rendering like Windows; I believe that's presumably the main difference.
javiers@reddit
I am surprised the gap isn’t bigger. Also why windows? Do you hate yourself or your computer?
Ok_Mine189@reddit (OP)
Well, I'm a gamer after hours and Windows still gives better performance for me (for gaming).
Potential-Leg-639@reddit
This is why everyone here has been suggesting Linux for local models for quite some time :)
Pakobbix@reddit
Hmm interesting. I can't verify them with my own setup (Dual Boot Windows 11 Build 26200 + Zorin OS 18).
Unfortunately, Nvidia doesn't support voltage control on Linux, and thus my GPU uses 100% power on Linux for the same performance I get at ~66-75% on Windows (no power control, just undervolting).
And that's currently my biggest "should I do the full switch or not" blocker. Gaming and inference with up to 34% less power over time is just way too good to give up.
a_beautiful_rhind@reddit
It does now with LACT. They found the hidden API.
Pakobbix@reddit
Holy shit.. that's awesome. Bye bye windows :-) thank you for answering. Will test it out when I'm at home.
razorree@reddit
Interesting, maybe a difference in compilation? You downloaded the precompiled version of llama.cpp for Windows? Maybe it was built without some newer extensions like AVX2?
I think llama.cpp prints all the extensions it was built with, like:
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
You could compare them.
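For instance, a quick (hypothetical) check of a captured system_info line for specific extensions:

```shell
# Sketch: check a captured system_info line for specific CPU extensions.
# The $info string below is a shortened example, not OP's actual output.
info='CPU : SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | REPACK = 1 |'
for flag in AVX2 AVX512 REPACK; do
  if printf '%s' "$info" | grep -q "$flag = 1"; then
    echo "$flag: enabled"
  else
    echo "$flag: missing"
  fi
done
# prints:
# AVX2: enabled
# AVX512: enabled
# REPACK: enabled
```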
Ok_Mine189@reddit (OP)
Here's the printout for Gemma 4 on Windows:
I'll add one from Linux in a moment.
Jester14@reddit
The Windows build is kinda full of CUDA bloat. The runs have different thread counts specified, and threads aren't always specified at all.
Ok_Mine189@reddit (OP)
Well, I specified threads explicitly only for the hybrid runs; for the others I relied on the default behavior.
And it seems the default on Windows is 24 threads and on Linux 8, but I reran the Windows Gemma 4 bench with threads forced to 8 and the results were exactly the same, so that ain't it.
Ok_Mine189@reddit (OP)
Which is not surprising, since the model fits fully in VRAM and the CPU has nothing to say in this matter; it's all GPU.
Monkey_1505@reddit
Yeah it would have to be one of the ones that doesn't fit in vram for the thread flag to make a difference.
Monkey_1505@reddit
I find the common advice of 'specify the same number of threads as you have cores' to be correct. Maybe just me, but I get significant slowdowns if I stray too much from this.
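(On a hybrid CPU like the 14900KF, "cores" is ambiguous — OP's -t 8 matches its 8 P-cores. A quick way to inspect the topology with standard Linux tools:)

```shell
# Inspect CPU topology (on a 14900KF: 24 cores total,
# 8 hyper-threaded P-cores + 16 E-cores = 32 logical CPUs)
nproc                                      # number of logical CPUs
lscpu | grep -E '^(Socket|Core|Thread)'    # sockets, cores/socket, threads/core
```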
twack3r@reddit
Is this via WSL2 on Win11? Or directly on Win11?
Ok_Mine189@reddit (OP)
Directly on Win 11.
twack3r@reddit
Did you already compare to WSL2?
Ok_Mine189@reddit (OP)
Nope, I reckon that would yield the worst results, as it incurs the biggest VRAM penalty (running Windows + Linux)?
Danmoreng@reddit
I actually tested Win 11 + WSL and it was slower than pure Win 11. Not worth trying out.
ambient_temp_xeno@reddit
It's not so much that there's an inherent issue with windows (necessarily anyway), it's that the cuda dev guy doesn't care about the windows performance. The difference used to be a lot bigger on my machines.
JamesEvoAI@reddit
There's a noticeable performance difference even when running CPU bound models. Windows has a pretty high baseline of performance overhead just to run the OS itself. Linux on the other hand comes in many possible configurations with different performance overheads, including a distro stripped down to only the bare minimum to run your inference engine. Also not having to constantly collect and send telemetry helps.
mstahh@reddit
It's funny that the world rests on the CUDA dev guy's shoulders.
simracerman@reddit
Who’s he? Can’t we buy him coffee/beer in bulk?
Monkey_1505@reddit
That seems like a plausible explanation. There are pretty big PP/gen differences between hardware libraries or whatever they are called, and even versions of those.
Monkey_1505@reddit
Makes sense. It's the MoEs mainly, and Linux has better RAM management. Still quite a difference with those; noteworthy.
FullstackSensei@reddit
There's not much RAM management involved here. It's all about scheduling. Even if the model fits entirely in VRAM, Windows is significantly slower because Microsoft thinks their telemetry, Copilot, ads, and all the other slop should be given higher priority than whatever you're running on your machine.
draconic_tongue@reddit
"significantly slower" holy delusion
Monkey_1505@reddit
I mean, you could switch all that stuff off, right?
I know there are also pared-down versions of Windows. But I don't think they give a nearly 2x in RAM-spillover LLM models' PP. IDK 🤷♂️
FullstackSensei@reddit
Most of it can't be switched off without jumping through several hoops, and even when you do, the moment any of those things gets an update, Windows Update will reinstall it.
Where did OP say there's 2x RAM spillover? They said 2x performance, not 2x offload.
Monkey_1505@reddit
You misunderstood what I wrote. Wonky ordering but I was saying nearly 2x PP speed.
There are community custom versions of Windows that people make for gaming, with everything non-essential stripped out. You can also permanently turn off updates with an app.
I suspect that would not give the same speed up as we are seeing here.
draconic_tongue@reddit
Windows doesn't use more RAM. What's shown as "usage" is more like a reservation; it's not actually used. You can go over your physical RAM all the time and nothing happens, virtual memory takes over.
FullstackSensei@reddit
I've tried those versions and they have their own issues.
You won't get the same performance even with those versions because the problem is the scheduler. Microsoft screwed that up a couple of years ago. I suspect the best chance would be W10.
Monkey_1505@reddit
I do recall slight frame rate differences between w10 and w11. You can get stripped down versions of windows 10 too. A very testable hypothesis in any case.
FullstackSensei@reddit
Meh, I just run Linux on my LLM machines. Way less headaches anyway.
Monkey_1505@reddit
Yeah, I just meant for testing/knowing whether the statement was true, or if the cause of the difference in PP speed here was something else.
iamapizza@reddit
Looking at all the performance threads posted here, it looks like Linux with a GPU is the sweet spot between performance and value.
You mention Lubuntu, but I assume the distro doesn't really matter? Or does it?
MalabaristaEnFuego@reddit
I'm shocked anyone would run LLMs any other way, just for the amount of extra RAM headroom you have over Windows, and the overall efficiency of the OS.
Windows is all about, "How can I saturate this experience with more features?" Linux is like, "Let's get some shit done today."
ea_man@reddit
Lubuntu runs a light desktop, LXQt, which wastes less VRAM than more sophisticated DEs with expensive eye candy.
Yet it ain't a matter of Lubuntu vs Kubuntu; you can install as many DEs as you want and use the one that suits the job. I usually run a full KDE and keep a stripped-down LXQt for when I have to fit a 15.5GB LM on a 15.9GB GPU. You can of course skip the DE entirely, run just the services, and spend as little as 50MB keeping just text terminals.
If you run just CLI tools like Opencode / Pi you can do with Tmux.
vogelvogelvogelvogel@reddit
same question here, any opinions on how much the distro matters?
sob727@reddit
it does not
it's about the version of the software (linux kernel, cuda) you have installed, not the name on the box
vogelvogelvogelvogel@reddit
ok thx
sob727@reddit
Where it could hypothetically matter is if your distro installs a bunch of unnecessary services that slow down your machine, or if it has a default kernel config that is massively suboptimal, but these are pathological cases.
sob727@reddit
So yeah, it's more about config than the distro name itself.
Ok_Mine189@reddit (OP)
I suspect any performance differences between distros are marginal compared to the gap between Linux and Windows. I went with Lubuntu because it still has the well-supported Ubuntu foundation underneath, while being one of the most lightweight flavors available.
Craftkorb@reddit
People are usually surprised to see that KDE has the best gaming performance in recent benchmarks - But I don't remember seeing LXDE in them.
But then, it shouldn't matter much. And if your machine is a server, then you don't have a desktop on it anyway, waste of resources.
Craftkorb@reddit
Not really LLM, but there's been a recent surge in performance tests between Distros for Gaming. Don't expect too many differences here though: The inference engine is already highly optimized, and as the Linux kernel is mostly the same on different Distros, it doesn't change too much.
Aerthlyomi@reddit
Scheduling is part of the kernel, so apart from the kernel version it should not matter.
ea_man@reddit
And don't forget the amount of VRAM that Windows wastes; on Linux you can get that down to 50-250MB, which means running something like a 15.1GB Qwen + 80k of context at Q4 on a 16GB GPU.
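One way to get there (a sketch using standard systemd commands; assumes an Ubuntu-family install with systemd):

```shell
# Boot to a text-only target so no display manager/compositor holds VRAM
sudo systemctl set-default multi-user.target
sudo reboot
# ...after reboot, see how little is resident on the GPU:
nvidia-smi --query-gpu=memory.used --format=csv
# (revert with: sudo systemctl set-default graphical.target)
```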
DunderSunder@reddit
There is big variance with small prompts. See, you are getting like 8k prompt-processing speed for a 2.5k-token prompt; even a small delay can change pp/s a lot. Try it a few times and watch the numbers fluctuate.
Also, you should be using llama-bench for benchmarking!
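Something like this (a sketch; the model path and sizes are placeholders, not OP's) reports mean and stddev over repeated runs instead of single-shot numbers:

```shell
# llama-bench times prompt processing (pp) and token generation (tg)
# separately, averaging over -r repetitions
./llama-bench -m /path/to/model.gguf -p 2560 -n 128 -r 5 -ngl 99
```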
Ok_Mine189@reddit (OP)
That's why it's averaged over 4 runs each.
Long_comment_san@reddit
Probably some MS arse security thing.
External_Dentist1928@reddit
Nice! Another reason to finally make the switch 👍
UltrMgns@reddit
Microsoft really came a long way... Down.