Question about llama.cpp and OpenCode
Posted by Able_Limit_7634@reddit | LocalLLaMA | View on Reddit | 16 comments
I see a lot of people using llama.cpp with OpenCode, but I don’t really understand why they don’t just use LM Studio or Ollama. What are the advantages?
Also, what would you recommend for a MacBook M4 Pro with 48GB of RAM if my main use case is coding in Dart?
Prigozhin2023@reddit
I use Ollama for the free cloud models, but default to llama.cpp otherwise.
kidousenshigundam@reddit
Once you see how much memory you can save by using llama.cpp on agentic workflows, you'll never go back.
segmond@reddit
There's no LM Studio or Ollama without llama.cpp. That's why. Since you're asking, I'll assume you really want to learn. Just go with the wisdom of the crowd: use llama.cpp with OpenCode. Never use Ollama or LM Studio; if you can code in Dart, you can figure out llama.cpp and OpenCode.
Certain-Cod-1404@reddit
llama.cpp is faster than Ollama and more up to date. Just use llama-swap for easy switching between models.
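For reference, a minimal llama-swap config might look roughly like this (model names, paths, and flags are placeholders; check the exact schema against llama-swap's own docs):

```yaml
# hypothetical config.yaml for llama-swap; paths and model names are placeholders
models:
  "qwen-coder":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen-coder.gguf
      -c 16384 -ngl 99
  "gemma":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma.gguf
      -c 8192 -ngl 99
```

llama-swap listens on one port, starts whichever llama-server the request's `model` field names, and stops the previous one, which is the easy model switching mentioned above.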
pepedombo@reddit
I had various problems with a multi-GPU setup in LM Studio even though it uses llama.cpp (Windows 11).
I compiled llama.cpp to see how it goes; here's a short summary.
ollama:
lm studio:
I ended up with llama.cpp. I downloaded the same qwen 3.6 and I'm able to set it to 100k+ context with f16 cache: 70 tps at the start, 45 tps at 100k. Completely no issues with VRAM utilization, and the prompt cache works as intended, filling my system RAM instead of shared memory.
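For anyone wanting to try a setup along these lines, the llama-server invocation would look roughly like this (the model path is a placeholder, and flags should be checked against `llama-server --help` for your build):

```shell
# sketch: large context with f16 KV cache, all layers offloaded to the GPU
llama-server \
  -m ./models/model.gguf \   # placeholder path to your GGUF file
  -c 102400 \                # ~100k context window
  -ctk f16 -ctv f16 \        # KV-cache types (f16 is the default)
  -ngl 99 \                  # offload all layers to the GPU
  --port 8080                # OpenAI-compatible API on this port
```

A client like OpenCode can then be pointed at `http://localhost:8080` as an OpenAI-compatible endpoint.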
grandnoliv@reddit
I don't know about your first question.
I've got a Mac mini M4 Pro with 48GB of RAM. I recommend using oMLX as the server so you can run MLX models, which are faster than GGUF. LM Studio can serve MLX models too, but there's a cache issue that makes it a lot slower than oMLX for all but the first API call.
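If you'd rather start with the stock MLX tooling before trying oMLX, the mlx-lm package also ships an OpenAI-compatible server; a rough sketch (the model name here is just an example from the mlx-community hub):

```shell
pip install mlx-lm
# serve an MLX model over an OpenAI-compatible API on port 8080
python -m mlx_lm.server --model mlx-community/Qwen2.5-7B-Instruct-4bit --port 8080
```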
If you're looking for model suggestions, Qwen3.6 35B A3B runs very well and fast. Dense 27~31B models are slow, but they can be useful sometimes (Gemma4 and the Qwen3.5 dense models).
ExistingAd2066@reddit
I read that MLX degrades more than llama.cpp at contexts bigger than 32k.
Able_Limit_7634@reddit (OP)
Thanks, I'll give oMLX a try.
Own_Attention_3392@reddit
What do you think those other tools use under the hood?
Llama cpp.
It's just cutting out the middleman.
exact_constraint@reddit
Yup. Just cleaner all around. And I can rebuild llama.cpp immediately for new features. Can be particularly important around new model releases, when llama can get aggressively patched for model support.
There also seems to be a slight speed advantage. Can't quantify it; could just have been improvements to llama.cpp itself vs whatever version was running under LM Studio. But that kinda reinforces my first point lol.
Own_Attention_3392@reddit
Well you can go nuts tweaking the loading parameters with the CLI. I still use lm studio when I'm trying to mess around with something simple or test out mcp tools I'm developing, but I generally just use llama cpp directly for most stuff.
I was using Gemma 4 to transcribe and translate old Russian and Polish documents for my wife's genealogy research today and for that I just shoved lm studio at it and handed her the laptop.
idiotiesystemique@reddit
People are mad at Ollama over the plagiarism drama.
Ollama/lm studio are slightly slower than raw llama.cpp, but far more convenient.
It depends on what's more important for you and your current task. It's also perfectly reasonable to use a wrapper for low-intensity tasks and raw llama.cpp for high-intensity tasks.
fizzy1242@reddit
only nice thing that comes to mind about ollama is the docker-style "pulling" of models, but i'm not a fan of the modelfiles and how it turns .gguf files into those blobs.
also, ollama takes longer to get the latest changes for new model fixes and architectures. llama.cpp is just less hassle in my opinion.
Jeidoz@reddit
Ollama is slow.
LM Studio is OK. It uses llama.cpp under the hood, provides a nice UI, and makes it easy to search for and download models. I mainly use it as a dev server for agentic apps like OpenCode or Qwen Code. In beta builds, LM Studio often ships nice QoL features and fixes for recently released models.
ea_man@reddit
You're wasting VRAM to load LM Studio and losing performance, to do the same thing.
bennyb0y@reddit
Ollama slow