Question about llama.cpp and OpenCode
Posted by Able_Limit_7634@reddit | LocalLLaMA | View on Reddit | 16 comments
I see a lot of people using llama.cpp with OpenCode, but I don’t really understand why they don’t just use LM Studio or Ollama. What are the advantages?
Also, what would you recommend for a MacBook M4 Pro with 48GB of RAM if my main use case is coding in Dart?
Prigozhin2023@reddit
I use Ollama for the free cloud models, but default to llama.cpp otherwise.
kidousenshigundam@reddit
Once you see how much memory you can save by using llama.cpp on agentic workflows, you'll never go back.
segmond@reddit
There's no LM Studio or Ollama without llama.cpp. That's why. Since you're asking, I'll assume you really want to learn. Just go with the wisdom of the crowd: use llama.cpp with OpenCode. Never use Ollama or LM Studio; if you can code in Dart, you can figure out llama.cpp and OpenCode.
Certain-Cod-1404@reddit
llama.cpp is faster than Ollama and more up to date. Just use llama-swap for easy switching between models.
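For reference, a minimal llama-swap config might look roughly like this (model names, paths, and flags are placeholders; check the exact schema against llama-swap's own docs):

```yaml
# hypothetical config.yaml for llama-swap; paths and model names are placeholders
models:
  "qwen-coder":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen-coder.gguf
      -c 16384 -ngl 99
  "gemma":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma.gguf
      -c 8192 -ngl 99
```

llama-swap listens on one port, starts whichever llama-server the request's `model` field names, and stops the previous one, which is the easy model switching mentioned above.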
pepedombo@reddit
I had various problems with a multi-GPU setup in LM Studio even though it uses llama.cpp (Windows 11).
I compiled llama.cpp to see how it goes; here's a short summary.
ollama:
lm studio:
I ended up with llama.cpp. I downloaded the same qwen 3.6 and I'm able to set it to 100k+ context with f16 cache: 70 tps at the start, 45 tps at 100k. Completely no issues with VRAM utilization, and the prompt cache works as intended, filling my system RAM instead of shared memory.
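For anyone wanting to try a setup along these lines, the llama-server invocation would look roughly like this (the model path is a placeholder, and flags should be checked against `llama-server --help` for your build):

```shell
# sketch: large context with f16 KV cache, all layers offloaded to the GPU
llama-server \
  -m ./models/model.gguf \   # placeholder path to your GGUF file
  -c 102400 \                # ~100k context window
  -ctk f16 -ctv f16 \        # KV-cache types (f16 is the default)
  -ngl 99 \                  # offload all layers to the GPU
  --port 8080                # OpenAI-compatible API on this port
```

A client like OpenCode can then be pointed at `http://localhost:8080` as an OpenAI-compatible endpoint.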
grandnoliv@reddit
I don't know about your first question.
I've got a Mac mini M4 Pro with 48GB of RAM. I recommend using oMLX as the server so you can run MLX models, which are faster than GGUF. LM Studio can serve MLX models too, but there's a cache issue that makes it a lot slower than oMLX for all but the first API call.
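If you'd rather start with the stock MLX tooling before trying oMLX, the mlx-lm package also ships an OpenAI-compatible server; a rough sketch (the model name here is just an example from the mlx-community hub):

```shell
pip install mlx-lm
# serve an MLX model over an OpenAI-compatible API on port 8080
python -m mlx_lm.server --model mlx-community/Qwen2.5-7B-Instruct-4bit --port 8080
```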
If you're looking for model suggestions, Qwen3.6 35B A3B runs very well and fast. Dense 27~31B models are slow, but they can be useful sometimes (Gemma4 and the Qwen3.5 dense models).
ExistingAd2066@reddit
I read that MLX degrades more than llama.cpp at contexts bigger than 32k.
Able_Limit_7634@reddit (OP)
Thanks, I'll give oMLX a try.
Own_Attention_3392@reddit
What do you think those other tools use under the hood?
Llama cpp.
It's just cutting out the middleman.
exact_constraint@reddit
Yup. Just cleaner all around. And I can rebuild llama.cpp immediately for new features. Can be particularly important around new model releases, when llama can get aggressively patched for model support.
There also seems to be a slight speed advantage. Can't quantify it; could just have been improvements to llama.cpp itself vs whatever version was running under LM Studio. But that kinda reinforces my first point lol.
Own_Attention_3392@reddit
Well you can go nuts tweaking the loading parameters with the CLI. I still use lm studio when I'm trying to mess around with something simple or test out mcp tools I'm developing, but I generally just use llama cpp directly for most stuff.
I was using Gemma 4 to transcribe and translate old Russian and Polish documents for my wife's genealogy research today and for that I just shoved lm studio at it and handed her the laptop.
idiotiesystemique@reddit
People are mad at Ollama over the plagiarism drama.
Ollama/lm studio are slightly slower than raw llama.cpp, but far more convenient.
It depends on what's more important for you and your current task. It's also perfectly reasonable to use a wrapper for low-intensity tasks and raw llama.cpp for high-intensity tasks.
fizzy1242@reddit
only nice thing that comes to mind about ollama is the docker-style "pulling" of models, but i'm not a fan of the modelfiles and how it turns .gguf files into those blobs.
also, ollama takes longer to get the latest changes for new model fixes and architectures. llama.cpp is just less hassle in my opinion.
Jeidoz@reddit
Ollama is slow.
LM Studio is OK. It uses llama.cpp under the hood, provides a nice UI, and makes it easy to search for and download models. I mainly use it as a dev server for agentic apps like OpenCode or Qwen Code. In beta builds, LM Studio often ships nice QoL features and fixes for recently released models.
ea_man@reddit
You're wasting VRAM to load LM Studio and losing performance, to do the same thing.
bennyb0y@reddit
Ollama slow