LM Studio running very slow compared to Ollama
Posted by EaZyRecipeZ@reddit | LocalLLaMA | 10 comments
I've been using Ollama with Qwen2.5 Coder 14B Instruct Q8 and it works OK on my system. I wanted to try LM Studio, so I downloaded the identical model. When I tried it with Cline in Visual Studio Code, it was very slow. The only setting I changed in LM Studio was GPU Offload, set to MAX; everything else was left at the defaults. What settings should I look at or change? How do I tune it properly?
CPU: AMD 9950X3D
GPU: RTX 5080 (16 GB)
RAM: 64 GB
nickless07@reddit
Set LM Studio to Power User or Developer mode. Go to the Developer tab, turn on Verbose Logging (the three dots on Logging), load the model, and post the output.
EaZyRecipeZ@reddit (OP)
The same model in Ollama takes about 20 seconds. When I try to do the same thing in LM Studio, it takes 4 minutes. Here is the log file: https://pastebin.com/JrhvuvwX
nickless07@reddit
And we are still missing the essential lines. Start at:
[LM Studio] GPU Configuration:
Strategy: priorityOrder
Priority: [1,0]
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[LM Studio] Live GPU memory info (source 'LMS Core'):
GPU 0
This tells us how much memory is available on each GPU (some other process may have reserved VRAM, and so on).
Continue with:
llama_model_load_from_file_impl: using device...
load_tensors...
load_tensors: offloaded 41/41 layers to GPU
This tells us whether all layers are on the GPU or whether some setting in LM Studio prevents it from offloading all of them. And so on.
Anyway, moving on:
n_ctx = 32768
32k context is fine. Flash attention? Which quant? Ollama defaults to 4-bit.
Based on what I can see in the log, you are using Q6 with the KV cache on GPU. The model page for that model on HF shows Q6 is 12.1 GB. That matches the line:
llama_memory_breakdown_print: | - CUDA0 (RTX 5080) | 16302 = 0 + (20630 = 14179 + 6144 + 307) + 17592186040087 |
14179 MB model weights + 6144 MB KV cache + 307 MB other GPU buffer = 20630 MB total. Given that your GPU only has 16 GB, the remainder spills over into the much slower CPU (system) RAM.
You can either lower the context size, drop to a smaller quant, or keep the KV cache in system RAM instead of VRAM.
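By the way, that 6144 MB KV figure checks out. A minimal back-of-envelope sketch, assuming Qwen2.5-14B's published shape (48 layers, 8 KV heads, head_dim 128) and an unquantized fp16 cache:

# Rough fp16 KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x 2 bytes
def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt / 2**20

# Shape assumed from the Qwen2.5-14B HF config
print(kv_cache_mib(48, 8, 128, 32768))  # -> 6144.0 MiB, matching the log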
EaZyRecipeZ@reddit (OP)
Thank you very much for taking the time. You helped a lot. After playing with the settings and disabling "Offload KV Cache to GPU Memory", it started flying. Any tweaks or settings you can recommend for loading a model bigger than my VRAM? Since I have a 16-core CPU, can I utilize it somehow alongside my GPU?
nickless07@reddit
Try Qwen3 30B A3B, or another MoE. Dense models will be very slow unless most of the layers fit into the GPU.
With MoE models you have the option to load only the expert weights in use into VRAM while the inactive ones idle in system RAM. That can have a similar effect to the KV cache change (you will see that load option only on MoE models).
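To see why that helps: generation speed is roughly bounded by how many bytes of weights have to be read per token. A rough comparison, assuming ~14B active parameters for the dense model and ~3B for the MoE (illustrative figures only):

# Bytes of weights touched per generated token ~ active params x bits-per-weight / 8
def active_gib(active_params_bln, bits_per_weight):
    return active_params_bln * 1e9 * bits_per_weight / 8 / 2**30

print("dense 14B @ Q8  :", round(active_gib(14, 8), 1), "GiB/token")  # ~13.0
print("MoE 30B-A3B @ Q4:", round(active_gib(3, 4), 1), "GiB/token")   # ~1.4

Experts that aren't routed to for a given token are never read, which is why parking them in system RAM hurts far less than spilling a dense model's layers.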
NNN_Throwaway2@reddit
Do you know what your context size is set to?
Marksta@reddit
A 14B dense model at Q8 is going to be ~14 GB. You're too tight on VRAM and overflowing into system memory. Go to Q6_K, or you're just going to be fighting your context and the Windows UI for the last ounce of VRAM.
You can turn off Nvidia's automatic offloading (the CUDA sysmem fallback in the driver settings), but the alternative is crashing when you overflow. Linux handles that pretty gracefully; not so sure Windows will.
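For a rough sense of how the quants compare in size, a quick sketch (bits-per-weight figures are approximate llama.cpp values, and ~14.8B is roughly the model's full parameter count):

# Approximate GGUF size: parameter count x bits-per-weight / 8
params = 14.8e9  # Qwen2.5-14B, approx. full parameter count
for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.85)]:
    print(f"{name}: {params * bpw / 8 / 2**30:.1f} GiB")
# -> Q8_0: 14.6 GiB, Q6_K: 11.3 GiB, Q4_K_M: 8.4 GiB

11.3 GiB is the 12.1 GB figure from the HF page; Q6_K leaves a few GB for context on a 16 GB card where Q8 leaves almost nothing.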
EaZyRecipeZ@reddit (OP)
I just posted the log file from LM Studio; please see the first post.
suicidaleggroll@reddit
Qwen2.5 Coder 14B Instruct Q8 is 15.7 GB (at least the Unsloth GGUF is; not sure what you're using). There's no room left for context on a 16 GB card; that model is too big. My guess is that the performance difference you're seeing comes from LM Studio using a bigger context and offloading more of the model into CPU RAM than Ollama does. Either way, you need to drop down from Q8 to leave room for context.
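To put a number on "no room left", a minimal sketch of how many tokens of fp16 KV cache fit next to the Q8 weights (assuming 48 layers, 8 KV heads, head_dim 128, and ~1 GiB of VRAM held by the desktop):

# KV bytes per token: 2 (K+V) x 48 layers x 8 KV heads x 128 head_dim x 2 bytes (fp16)
kv_per_token = 2 * 48 * 8 * 128 * 2                 # 196608 bytes, ~192 KiB
free = 16 * 2**30 - 1 * 2**30 - 15.7e9              # card minus desktop minus weights
print(int(max(free, 0) / kv_per_token), "tokens")   # ~2000 tokens of context, at best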
lumos675@reddit
My experience was exactly the opposite. You need the correct settings for each model.