Running on a macbook, and having issues with crashing? Maybe this will help...
Posted by jonnywhatshisface@reddit | LocalLLaMA | 5 comments
Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirely too stubborn to harness the power of Google to find solutions to my issues. Though, I prefer doing things the hard way, which is rather ironic for someone who is taking an enjoyment in finding ways to build out local AI...
I'm running Qwen3.6 35b A3B on a 14" MBP M2 Max with 64GB ram, which feels like plenty for most local models that are dominating the charts. I'm currently using a 131k context, and I can easily go higher if I can tolerate the long prompt processing time of 1-2 minutes for reloading a session with a massive context. Otherwise, thanks to the KV cache, prompt processing is usually between 3 and 40 seconds for me even once the context is ridiculously huge (i.e. 100k+), and the speed is fantastic for the most part (49 tokens/sec generation, 400+ tokens/sec on prompt processing).
My setup took WEEKS to fine-tune and get stable, so I figured I'd share it with some of you to help spread the love for anyone who was having issues running local models and agentic workflows on macbooks, given I received an onslaught of messages from colleagues, friends and people asking how I managed to make Qwen3.6 stable and use it the way I am (I have a pretty large project and Qwen3.6 is the driver of it, right down to having agents monitoring logs and automatically troubleshooting and fixing issues - which is a scary thought...)
So, a simple rundown, and then a better explanation below...
* Change display refresh rate from ProMotion to 60Hz
* Use GGUF models, NOT MLX
* Run with either llama.cpp or LM Studio (which uses llama.cpp under the hood). Ollama is slow, and to be blunt: horrible.
* Raise the wired memory limit via iogpu.wired_limit_mb. On my 64GB laptop, I have this at 61440
* Use Qwen3.6 35b A3B, either q4 or q6 quant. I find q4 - funny enough - to sometimes have a bit better precision, but I'm still flipping between the two. Make sure preserve_thinking is enabled - without this, it'll loop, fail tool calls and perform like a drunken monkey. Do NOT use the MTP version. It seems like it would be a no-brainer to do it, but it'll actually cut the token generation speed down, not speed it up.
* Use OpenCode - NOT Claude Code. Make sure you set the limits on the model in opencode according to your needs. The output token limit, for example, is low by default and will result in things like tool call failures/loops because the arguments for the tool calls get chopped off.
* Use RAG and persistent memories via MCP. I've moved on to a custom solution I'm building, but I used, and sometimes still use, Serena MCP, which is unbelievably good.
* Leverage the power of SKILLS in OpenCode, and even the ability to make a custom agent that'll automatically start using memories for complex refactors and features. I was able to do incredible things on a 52k line code base with a context size of just 64k thanks to this concept.
Result: I'm running Qwen3.6 35b a3b with 490 tok/s prompt processing and between 49-65 tok/s generation. If I open an old session on a completely cold KV cache that's 80k+ tokens, it will take about 1.5 minutes to process that prompt. Subsequent prompts with KV cache hits take anywhere from 2 to 30 seconds, and in extreme cases where for whatever reason the cache reuse misses, about 50 seconds. However, when reading files and so on, it's not processing the entire context anymore, and this operation is blazingly fast (it's worth noting that my system prompt alone is nearly 50k tokens at this point on one particular project, so your mileage may vary for better or for worse). All in all, it's actually faster for me than Claude through GHCP is, so it's a win.
Now, a more detailed breakdown:
1) MLX - I don't use it. It's unstable - particularly on a 14" macbook that thermal throttles. I stick with GGUF models, and there is a good reason behind it. GGUF (via llama.cpp) pre-allocates all memory up front for both the model and the KV cache, so when you look at the memory usage - what you see is what it will use. MLX allocates on-demand, and you'll notice that after it finishes with a prompt the memory usage drops. Then during prefill and token generation, it's steadily going up again. This massive non-stop allocate/free/allocate/free process results in the system going haywire on reclaiming cache, and this slows down the gpu cores during this time. The WindowServer has an "Interactivity Watchdog" in it that's pinging the GPU cores, and if they don't respond within a certain number of ms, the kernel module will shoot the model in the head and you'll see an error about Interactivity Timeout. This is why MLX feels so unstable to some - and the fact that the 14" models begin thermal throttling makes it even worse because now the speed the cores are operating at has been reduced. So, I stick with GGUF and I have zero model crashes (at least, not anymore)
2) The interactivity watchdog CANNOT be adjusted, configured, disabled or anything else - except in one case: you have no display. If you close your laptop and run it entirely in clamshell mode with zero display on it, and just ssh into it or access the model via API running on it, then you won't ever hit the watchdog issues because it doesn't care about the display if it doesn't have one. Let's be real: that's not practical for most of us. So, the secret sauce? Change your refresh rate from ProMotion to 60Hz. When you do this, you'll notice 2 things. First, the prompt processing and token generation speeds will skyrocket. This is because the GPU memory is unified, and ProMotion refreshes the display up to 120 times per second. Dropping it down from 120Hz to 60Hz cuts the memory bandwidth the WindowServer is using in half, and that bandwidth savings is now available to your model. Second, it doubles the response time threshold for the watchdog, so instead of 8ms - the timeout becomes 16ms. No more interactivity timeouts.
This is a balancing act on a lot of things, and it's also why I said earlier to avoid MTP version of Qwen. The slowdown in token processing and generation, for example, ties the GPU cores up just that much more - and pushes you to the edge of a race against the clock for the hopes that the interactivity watchdog won't shoot your model in the head.
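For anyone who wants to sanity-check the 8ms and 16ms figures, they line up with the time between display refreshes at each rate. The sketch below only does that arithmetic. That the watchdog threshold tracks the frame period is the claim from this post, not something the code verifies, and `frame_interval_ms` is just an illustrative helper name.

```python
def frame_interval_ms(refresh_hz: float) -> float:
    """Time between two display refreshes, in milliseconds."""
    if refresh_hz <= 0:
        raise ValueError("refresh rate must be positive")
    return 1000.0 / refresh_hz


promotion = frame_interval_ms(120)  # ProMotion at its 120Hz ceiling
standard = frame_interval_ms(60)    # fixed 60Hz

print(f"120Hz: {promotion:.2f} ms per frame")
print(f" 60Hz: {standard:.2f} ms per frame")
print(f"headroom gained: {standard / promotion:.1f}x")
```

Halving the refresh rate doubles the per-frame window, which is where the "8ms becomes 16ms" rounding comes from.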
3) Cooling. The default fan thresholds on macOS are crap. Grab the mac fans app and set a custom trigger for the fans for all GPU cluster sensors (my model has 2 clusters). The low temp should be 50, and the high 80 (c). This will result in the fans running at a low speed once the GPU cores reach 50c, and at full speed once they reach 80. It should result in them not exceeding ~81-82c but mostly lingering around the 79-80 marker. No more thermal throttling.
4) Adjust your wired memory limit. By default, macOS only allows up to 85% of the unified memory to be wired for GPU usage. That's fine for the models, but other things use the GPU, too - WindowServer and Chrome, just to name a couple. Raise the limit via sysctl iogpu.wired_limit_mb. They say to leave at least 10GB for the system, I've left about 8 and I've been stable with no issues. I've even left as little as 4 and not had stability problems, but to each their own. It depends on what all you have running while you're running the model.
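To picture what a 50c/80c trigger pair does, here is a minimal sketch of a fan curve. The linear ramp between the two temperatures is an assumption made for illustration. The actual fan app may ramp differently, and `fan_fraction` is a made-up helper, not part of any tool.

```python
def fan_fraction(temp_c: float, low_c: float = 50.0, high_c: float = 80.0) -> float:
    """Fraction of the fan's speed range to use at a given GPU temperature.

    0.0 means the fan's minimum speed and 1.0 means full speed.
    The linear ramp between the two trigger points is an assumption.
    """
    if high_c <= low_c:
        raise ValueError("high_c must be greater than low_c")
    if temp_c <= low_c:
        return 0.0
    if temp_c >= high_c:
        return 1.0
    return (temp_c - low_c) / (high_c - low_c)


for temp in (45, 50, 65, 80, 85):
    print(f"{temp}c -> {fan_fraction(temp):.0%} of the fan range")
```

With these defaults the fans are already halfway up their range at 65c, well before the cores get near throttling territory.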
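The 61440 figure is just (total RAM minus what you leave for the system) expressed in MB. A small sketch of that arithmetic follows. `wired_limit_mb` is an illustrative helper, the sysctl key is spelled `iogpu.wired_limit_mb` on recent macOS as far as I know, and the setting is generally reported not to survive a reboot, so treat the printed command as something to verify on your own machine.

```python
def wired_limit_mb(total_ram_gb: int, reserve_gb: int) -> int:
    """Wired limit in MB that leaves reserve_gb of RAM for the rest of the system."""
    if not 0 < reserve_gb < total_ram_gb:
        raise ValueError("reserve_gb must be between 0 and total_ram_gb")
    return (total_ram_gb - reserve_gb) * 1024


limit = wired_limit_mb(total_ram_gb=64, reserve_gb=4)

# Print the command instead of running it, so nothing changes by accident.
print(f"sudo sysctl iogpu.wired_limit_mb={limit}")
```

Leaving 8GB or 10GB instead of 4GB on a 64GB machine gives 57344 or 55296 respectively.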
5) The runner is important. Use either llama.cpp - or LM Studio if you're wanting a GUI. LM Studio uses llama.cpp under the hood. The only difference is you don't have nearly as much granularity over the command-line options. For example, we had to wait 6 hours for MTP to be available in LM Studio (which, in my opinion, was irrelevant for something like Qwen MoE models). Avoid ollama: it's slow, period. It also downloads the models as chunked, sharded layers that are entirely unusable with any other runner, which is just poor form in my opinion. I personally use llama.cpp for the control, but I use LM Studio to download models because I prefer the clean layout visually when reading them. However, truth be told, since I found Qwen - I've not been downloading any other models, anyway.
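As a rough idea of what "control over the command-line options" looks like, here is a sketch that assembles a llama-server command line. The model path, host and port are placeholders, and this is not the exact command from this post, just common llama.cpp flags (`-m`, `-c`, `-ngl`, `--jinja`, `--host`, `--port`). Check `llama-server --help` on your build before relying on any of them.

```python
import shlex


def llama_server_command(model_path: str, ctx_size: int = 131072,
                         host: str = "127.0.0.1", port: int = 8080) -> list[str]:
    """Build an argv list for llama-server (illustrative flag selection)."""
    return [
        "llama-server",
        "-m", model_path,     # path to the GGUF file
        "-c", str(ctx_size),  # context size in tokens (131072 = 131k)
        "-ngl", "99",         # offload all layers to the GPU
        "--jinja",            # use the model's jinja chat template
        "--host", host,
        "--port", str(port),
    ]


# Hypothetical model path, substitute your own file.
cmd = llama_server_command("models/qwen-q6.gguf")
print(shlex.join(cmd))
```

Building the list in one place makes it easy to keep the context size and other knobs consistent between runs.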
6) Model specific: If using qwen3.6 35b a3b: I've seen people complain about looping problems and tool call issues, etc. This almost entirely boils down to your setup. Firstly, make sure preserve_thinking is enabled. If you're using LM Studio, it's under the inference tab. If you're using llama.cpp or anything else where you need to manually specify the jinja template, just add a {% set preserve_thinking = true %} line to your template. This is absolutely critical for agentic workflows. It will screw up and slaughter every other tool call without it. Also, make sure your harness isn't the issue. OpenCode by default has a max token output limit, and this causes major issues. You need to raise and tweak the limits via your opencode config to prevent it from chopping off the arguments of the tool calls, which makes them fail and basically loop repeatedly with failed tool calls.
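Why a low output limit turns into failed tool calls: tool arguments are typically sent as JSON, and JSON that stops halfway cannot be parsed. The sketch below simulates that with a made-up tool call. `parse_tool_arguments` and the example arguments are hypothetical and are not OpenCode's actual code.

```python
import json


def parse_tool_arguments(raw: str):
    """Return the arguments as a dict, or None if the JSON is incomplete or invalid."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# A made-up "write file" tool call with a long argument.
full = json.dumps({"path": "src/main.py", "content": "print('hello')\n" * 20})

# Simulate generation stopping at the output token limit, halfway through.
cut_off = full[: len(full) // 2]

print(parse_tool_arguments(full) is not None)     # True
print(parse_tool_arguments(cut_off) is not None)  # False
```

The truncated call fails to parse, the harness reports an error, and the model tries the same oversized call again, which is the loop described above.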
7) Do NOT use Claude Code with non-Claude models. I'm convinced they want you to try to do that so that you have a flat out shit experience and run back to their models. It's simply not developed/designed to work that well without their model, period. The experience is going to be poor, and you're going to want to give up on local LLMs.
8) Use RAG and persistent memories. Serena MCP is a turnkey solution to get you started with that world. It provides semantic indexing, search, read and write capabilities that seriously shave down the context size and also simply help the model find what it needs much faster. The persistent memories can be used in all sorts of ways, but I have agents whose entire purpose is dealing with incredibly large code-bases. They leverage the memories to create entire project plans, sub-tasks, patches/diffs and then execute the entire plan once everything is figured out. This enabled me to entirely refactor a 52k line code base and also add a feature into it that totaled out 1600 lines across the entire code base, and literally have it all working immediately without any issues. With a 64k context, no less (I generally use 131k personally).
9) For QWEN models and KV cache: Do NOT quantize the KV cache any smaller than q8. If you go to q4, the model's output quality falls apart. I am not talking about quantized models like q4_K_M - that's a great model. I'm talking explicitly about the K/V cache quantization options. Either leave them alone/untouched if you can, or quantize them no more than q8. The model is resistant to the quantization at q8, meaning minimal precision loss - but it doesn't do so well with q4 at all. Do keep in mind that quantizing it will save some memory usage, but really - only do this IF you NEED to shave down the memory usage. With my 64GB ram, I'm running q6 version of the model (though tbh, I think q4 may be a bit "smarter" as funny as that sounds) with 131k context and it barely uses enough memory for me to even notice. I still have Chrome with 10+ tabs, Word, VS Code, some terminals, my mail and everything else under the sun open with almost no issues. Unless you see memory pressure and you're actually low on memory, there's no reason to quantize the KV cache - you'll just cause more performance issues by doing so.
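To see how much memory KV cache quantization actually buys, here is a back-of-the-envelope estimate. The per-block sizes are those of llama.cpp's f16, q8_0 and q4_0 formats (a block holds 32 values). The layer, head and dimension numbers are placeholders and NOT the real Qwen3.6 values, so read the real ones from your model's GGUF metadata. The formula also assumes a plain attention cache in every layer, which may overestimate for models that mix attention types.

```python
# Bytes needed to store one block of 32 values in each cache type.
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str = "f16") -> int:
    """Estimated KV cache size in bytes for a full context."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys and values
    return values * BYTES_PER_32_VALUES[cache_type] // 32


# Placeholder model dimensions, for illustration only.
dims = dict(n_layers=40, n_kv_heads=4, head_dim=128, n_ctx=131072)

for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(**dims, cache_type=cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

Going from f16 to q8_0 roughly halves the cache, while q4_0 only saves about half of that again, which is a small win for the quality hit described above.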
Reddich07@reddit
Thank you for your effort in crafting this excellent and comprehensive summary. It’s filled with valuable tips and insightful explanations.
Accomplished_Ad9530@reddit
Which version of macOS are you running? Older versions had an issue setting the wired limit. Also, your wired limit of 61440 leaves very little for the rest of the system. I suspect that’s what’s causing instability. The rest about the screen refresh rate and such doesn’t sound right to me.
jonnywhatshisface@reddit (OP)
I agree my wired limit is a bit on the edge leaving only 4GB of wiggle room - but the max peak wired in use is about 30GB in total anyway so kinda irrelevant. Also, macOS would run just fine with only 4GB. After all, they still sell 8GB models…
The reason I raised it higher is I’m also running normal usage in addition to the ai and I didn’t want one chrome tab to end up smacking a model.
I don’t have any instability - I run full non-stop agents 24x7 with zero crashes and running a 131k context. Currently running q6 qwen3.6 35b and wired memory is at 31GB with zero pressure.
The instability was absolutely not memory pressure. It was the interactivity watchdog from a combination of thermal throttling and bandwidth saturation. The biggest contributors to curbing that were the cooling and the refresh rate drop. At 120Hz / ProMotion, the watchdog is hard-coded to kill things if a gpu core doesn’t respond within 8ms. A high batch size of 512 or 1024 would occupy a gpu core just long enough to push it over the threshold whenever thermal throttling or the display refresh's memory bandwidth use added enough latency. Reducing both has 100% eliminated crashes.
I’m running latest Tahoe, btw.
(Friendly heads up - I’m an embedded systems developer and kernel driver developer for Linux, so I’m pretty strong on systems…)
Accomplished_Ad9530@reddit
Fair enough. Years ago I ran into a similar issue on an Intel MBP, so I know what it's like to diagnose and report this sort of thing. Please do report it to Apple, since it seems like they could mitigate it with a scheduler tweak or relax the watchdog under heavy load or something. Best of luck.
jonnywhatshisface@reddit (OP)
I am considering doing that actually, but I’m not so sure they’ll address it. In their minds, running headless on a Mac Studio for example is the right way if you want to leverage the gpu cores for compute. They’d likely argue that anything that impacts the stability of the display is a no go - but it’s still worth reaching out to them nonetheless.