MacBook M3, 24GB RAM. What's the best LLM engine?
Posted by Familyinalicante@reddit | LocalLLaMA | 42 comments
Like in the title. I am in the process of moving from a Windows laptop to a MacBook Air M3 with 24GB RAM. I use it for local development in VS Code and need to connect to a local LLM. I've installed Ollama and it works, but of course it's slower than the 3080 Ti 16GB in my Windows laptop. That's not a real problem, because for my purposes I can leave the laptop for hours to see the result (that's the main reason for the transition: the Windows laptop crashes after an hour or so and runs as loudly as a steam engine). My question is whether Ollama is a first-class citizen on Apple or whether there's a much better solution. I don't do anything bleeding edge and use standard models like Llama, Gemma, and DeepSeek for my purposes. I'm used to Ollama and use it in such a manner that all my projects connect to the Ollama server on localhost. I know about LM Studio but didn't use it a lot, as Ollama was sufficient. So, is Ollama OK, or are there much faster solutions, like 30% faster or more? Or is there a special configuration for Ollama on Apple besides just installing it?
DunamisMax@reddit
I’ve been loving Ollama combined with OpenWebUI on my MacBook Pro M4 Pro, and right now IMO Gemma3:12b is the best overall model I can run.
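If anyone wants to replicate that setup, it's roughly this (the OpenWebUI image tag, env var, and ports are the ones their docs usually show, so double-check against the OpenWebUI README):

```
# Pull the model into Ollama (the macOS app serves the API on port 11434 automatically)
ollama pull gemma3:12b

# Run OpenWebUI in Docker and point it at the host's Ollama instance
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 and pick the model in the UI.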
Vaddieg@reddit
OpenWebUI eats all your RAM. Use the llama.cpp server and run Mistral Small 24B.
DunamisMax@reddit
Doesn’t eat mine at all, lol. I can even run Gemma3:33b on my 24GB of RAM; it's the only 33b model that can even run on this hardware. I still run the 12b because it's faster, but the 33b works great even with OpenWebUI.
random-tomato@reddit
I assume you mean 27B? Gemma 3 does not have a 33B variant.
DunamisMax@reddit
Sorry, yes.
Dudmaster@reddit
That doesn't have a built-in vector database or embedding/reranking though, right?
AsliReddington@reddit
Ollama is just hot garbage. Just get a Qwen model in int8 or Mistral Small 3.1 (24B) in int4.
Run it all using llama.cpp installed via brew. Then:
```
llama-server -m *.gguf -ngl 99
```
The OpenAI-compatible endpoints will work with everything out there.
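For example, once the server is up on its default port 8080, anything that speaks the OpenAI API can hit it like this (the model field can be any string; llama-server just serves whatever it loaded):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Hello"}]}'
```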
Awkward-Desk-8340@reddit
Hello,
I understand the reservations about Ollama, but for my part I find it rather stable and practical, especially for a quick setup. On my config with an RTX 4070, performance is honestly decent with models like Mistral or Qwen in 4-bit quantization.
That said, I'm interested in llama.cpp, especially to see how it does in terms of GPU optimization and flexibility (loading GGUF models, CUDA/cuBLAS support, etc.). Do you have any concrete feedback on performance with the CUDA backend compared to Ollama? And possibly a guide to properly compiling llama.cpp with GPU support on Windows or WSL?
AsliReddington@reddit
I can share comparisons between the two on an M4 Pro & an RTX 2070S (Ubuntu though), as out-of-the-box as it can be.
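For the build question, the CMake route on Linux/WSL is roughly this; note the CUDA flag has been renamed across versions (older releases used LLAMA_CUBLAS), so check the current build docs:

```
# Build llama.cpp with the CUDA backend (requires the CUDA toolkit)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run the server with all layers offloaded to the GPU
./build/bin/llama-server -m ./your-model-q4_k_m.gguf -ngl 99
```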
Awkward-Desk-8340@reddit
Hello yes 👍
AsliReddington@reddit
Also in case you weren't aware, Ollama wraps llama.cpp under the hood & is months behind in being up to date.
Awkward-Desk-8340@reddit
So what is Ollama's added value??
Dudmaster@reddit
It has dynamic loading of models and a keep-alive timeout that unloads models when they are not in active use. This is not in base llama.cpp.
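For example, the Ollama API lets you control that per request via the keep_alive field (if I remember the parameter name right):

```
# Keep the model in memory for 10 minutes after this request; pass 0 to unload immediately
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma3:12b", "prompt": "Hello", "keep_alive": "10m"}'
```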
AsliReddington@reddit
It's just noob-friendly is all, & it picks defaults the user doesn't have to bother with at that level of expertise. Kinda like ChromeOS vs Ubuntu.
extopico@reddit
I second this. I’m truly confused as to why Ollama seems popular given its hostile stance towards users wanting to do anything other than what Ollama insists on. It's hard to explain what I mean; just stick with llama.cpp. If you want to chat, llama-server even has a nice GUI.
SkyFeistyLlama8@reddit
Ollama being a llama.cpp wrapper just makes it worse. I guess it's for people who want LLMs to be appliances, whereas working with llama.cpp is more like being an LLM mechanic with a home garage.
Yes_but_I_think@reddit
The reason is they chose a catchy name. Ollama, such a nice name.
loscrossos@reddit
this seems like the real reason… llama.cpp does not flow off the tongue that smoothly
AsliReddington@reddit
Exactly, it's just influencers across platforms who don't use these models in any meaningful way hyping it up. It's always late to support new architectures; new VLMs, anyone?
Glittering-Bag-4662@reddit
What do you recommend instead? TabbyAPI, kobold cpp, Aphrodite?
AsliReddington@reddit
Llama-server from llama.cpp or with any frontend for that matter
MoffKalast@reddit
I really wish we had a list of all known compatible frontends that work with just llama-server and don't require Ollama's API.
AsliReddington@reddit
Any that work with ChatGPT work with llama.cpp/llama-server, save for the image/video stuff.
MoffKalast@reddit
Not necessarily; a lot of them have Claude and OpenAI URLs hardcoded for some odd reason.
AsliReddington@reddit
Not sure about the Claude ones, but most OpenAI frontends let you change the base URL.
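With anything built on the official OpenAI SDKs you can usually just point the environment at llama-server (assuming the frontend respects these variables):

```
# Point OpenAI-SDK-based frontends at a local llama-server instance
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="sk-local-anything"   # llama-server doesn't check the key unless you pass --api-key
```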
Dudmaster@reddit
Does this automatically load and unload the different GGUFs when requested?
Familyinalicante@reddit (OP)
Thank you, so I'll start playing with llama.cpp then 🤗
gptlocalhost@reddit
Our tests using QwQ-32B and Gemma 3 (27B) on an M1 Max 64GB are as follows:
https://youtu.be/ilZJ-v4z4WI
https://youtu.be/Cc0IT7J3fxM
Familyinalicante@reddit (OP)
Thank you, I'll try using this model!
ShineNo147@reddit
You can use llm-mlx or LM Studio. They are 20-30% faster than Ollama.
https://simonwillison.net/2025/Feb/15/llm-mlx/
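If you go the MLX route directly, mlx-lm also ships a CLI and an OpenAI-compatible server. Roughly like this (the model repo below is just an example of a 4-bit quant from the mlx-community org):

```
pip install mlx-lm

# One-off generation with a 4-bit MLX quant
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Write a haiku about autumn"

# Or serve an OpenAI-compatible endpoint on localhost
mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8080
```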
s101c@reddit
MLX has shown itself to be the faster option on every M1-M4 era Mac that I have tested it on. And I can confirm the speed increase, about 20-30 percent.
And it might be a placebo effect, but with similar quantization level between GGUF and MLX models, I found MLX to be slightly more coherent. Again, it could be a placebo thing.
LevianMcBirdo@reddit
Adding speculative decoding increases that speed again by up to 150%, so up to 2.5 times faster. In my experience it's a little less than 2x, but that depends on the task and the models used.
SkyFeistyLlama8@reddit
What speculative decoding config are you using on a Mac, like which combination of small and large models?
LevianMcBirdo@reddit
I only got it to run with MLX versions so far: Qwen 2.5 Coder 14B and 0.5B, both at 4-bit. I tried 1.5B with similar speeds (slower generation, but more accepted tokens), but that could change with a bigger context.
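For anyone on the llama.cpp side, recent llama-server builds can do the same thing with a draft model; something like this, though flag names have shifted between versions, so check llama-server --help:

```
# Speculative decoding: big model plus a small draft model from the same family
llama-server \
  -m qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  -md qwen2.5-coder-0.5b-instruct-q4_k_m.gguf \
  -ngl 99 --draft-max 16
```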
Berberis@reddit
I’ve tried a lot and nothing beats LM Studio for ease of use and options.
extopico@reddit
LM Studio is a very close second in my level of hate, next to Ollama. It's also closed source, bloated to the extreme (multi-GB Docker images) and entirely inflexible. It's made for corporate users that want their own interface, same positioning as LibreChat. You need to dedicate considerable time to figuring out how to work with either of these and then hope they don't introduce breaking changes like LibreChat did with 0.77. So yes, I also do not understand why anyone at all would recommend LM Studio to anyone running local models just for themselves.
jabbrwock1@reddit
Yes, very convenient. You can run it in OpenAI compatible server mode too.
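There's a headless CLI for it too, if I remember correctly (the server defaults to port 1234):

```
# Start LM Studio's local server without opening the GUI, then list loaded models
lms server start
curl http://localhost:1234/v1/models
```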
TrashPandaSavior@reddit
Also supports MLX if OP wants to dabble in that. LM Studio would be my vote too for the MacBook setup.
Unless they just want an API, then I'd recommend `llama-swap` to configure the models OP wants to support and have it run llama.cpp. Using llama-swap means the server can swap models on request, which is a must for me.
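Rough sketch of what that looks like, assuming the config keys from llama-swap's README (a models map with a cmd per entry and a ${PORT} placeholder; double-check the exact schema there):

```
# Minimal llama-swap setup: one YAML config, one proxy process that
# starts/stops the matching llama-server on demand per request
cat > llama-swap.yaml <<'EOF'
models:
  "gemma3-12b":
    cmd: llama-server --port ${PORT} -m /models/gemma-3-12b-it-q4_k_m.gguf -ngl 99
  "qwen2.5-coder-14b":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-14b-q4_k_m.gguf -ngl 99
EOF
llama-swap --config llama-swap.yaml --listen :8080
```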
techczech@reddit
I often find the same model faster on Ollama than LM Studio, even with MLX. But I haven't done systematic testing.
TheProtector0034@reddit
Gemma 3 12B Q8 is the best model you can run with decent performance.
Awkward-Desk-8340@reddit
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file
Is this the basic framework?
So I need to find a tutorial
Chintan124@reddit
Maybe a 14B model at 20 tokens per second? Would that be possible? Has anyone tried?