Llama.cpp vs LM Studio on gaming PC
Posted by EaZyRecipeZ@reddit | LocalLLaMA | View on Reddit | 8 comments
Here is my experience: I've been using LM Studio with an RTX 5080 and 64GB RAM on Windows 11. I'm very happy with LM Studio except for the speed. I installed WSL and compiled llama.cpp. After playing with Gemma 4 26B Q8 and Qwen 3 Coder Next unsloth Q4 in llama.cpp, I'm getting double the speed compared to LM Studio. I wish LM Studio provided the same speed, but unfortunately, it doesn't.
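For anyone who wants to reproduce the WSL route, a minimal build sketch (assumes the CUDA toolkit is already installed inside WSL; the model path is a placeholder):

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON            # enable the CUDA backend
cmake --build build --config Release -j
# -ngl 99 offloads all layers to the GPU; model path is a placeholder
./build/bin/llama-server -m ~/models/your-model.gguf -ngl 99
```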
Kyuiki@reddit
My experience is similar with a 4090. But I repurposed my old gaming PC (just went all in on a new build before prices get worse) and switched from Windows + LM Studio to Linux + llama.cpp.
What I noticed with Gemma 4 31B is that the model crashes less, and TTFT (time to first token) is consistent and faster. I went from 6 t/s to around 16 t/s, sometimes faster. I also no longer see the rare cases where prompt processing gets stuck at 0.0% indefinitely.
It just runs so much better. I thought it was from switching to Linux and removing Windows bloat, but your post makes it seem like it's llama.cpp itself that runs better, and I'm seeing that too!
I'm going to try installing llama.cpp on my gaming PC (5090) and route to it during non-gaming times, then to the 4090 when I'm in gaming mode. Even before, inference seemed fast on the 5090: roughly 10 t/s faster using the same model and parameters. I bet it will be blazing fast outside of LM Studio.
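A sketch of what that routing could look like: llama-server exposes an OpenAI-compatible API, so any client can be pointed at whichever box is free (the IP, port, and model path below are placeholders):

```shell
# on the 5090 box: start llama-server listening on the LAN
# ./llama-server -m your-model.gguf -ngl 99 --host 0.0.0.0 --port 8080

# from any other machine, hit the OpenAI-compatible chat endpoint
curl http://192.168.1.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```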
Sabin_Stargem@reddit
Try using KoboldCPP. It has a GUI and uses llama.cpp as the backend.
Southern-Chain-6485@reddit
You can use llama.cpp natively on Windows.
EaZyRecipeZ@reddit (OP)
Hmmm, I didn't know. Do you have a link?
Southern-Chain-6485@reddit
Go to the releases page https://github.com/ggml-org/llama.cpp/releases/tag/b8808 and there you have the builds for Windows. Pick the CUDA versions (download both the windows-x64 CUDA 13 build and the CUDA 13 DLL package), extract both into the same folder, and you're good to go.
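If it helps, a sketch of launching the extracted Windows build from PowerShell (the model path and port are placeholders):

```shell
# run from the folder you extracted both zips into
.\llama-server.exe -m C:\models\your-model.gguf -ngl 99 --port 8080
# then open http://localhost:8080 for the built-in web UI
```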
EaZyRecipeZ@reddit (OP)
Thanks, works great. I don't know how I missed it. Wasted 20 minutes installing WSL and compiling the CUDA toolkit and llama.cpp.
PaceZealousideal6091@reddit
You can compile and run it natively on Windows as well. Use clang or MSVC.
traveddit@reddit
I have an RTX 5080 in a Windows 11 machine, used WSL2 and vLLM, and found good results. You should try it with NVFP4 quant models if you're looking for speed on your card.