24GB M4 Mac - is Qwen 9B the only option while the system is running?
Posted by sagiroth@reddit | LocalLLaMA | 39 comments
I have a Mac at work on which I want to use a local model for prototyping and basic prompts that need to stay on device. What sort of model can I run that fits at least 64k context? Any setups to share or guides are welcome.
I need to have Firefox open with one tab at minimum. The problem I have is all the crap that runs on the Mac itself by default.
Due_Duck_8472@reddit
Claude Code Pro
sagiroth@reddit (OP)
Nice that you read the post, not just the headline. I already use Claude Team Pro, but I need something ad hoc and local that doesn't burn tokens for RAG prototyping. It doesn't have to be super smart.
Due_Duck_8472@reddit
Another team pro account is your best bet.
sagiroth@reddit (OP)
Yup, that's an option.
jonas-reddit@reddit
Out of curiosity, how much free memory do you have on that running system? On Linux, I can get “reasonable” results with 27B in around 22-23GB of VRAM before I squeeze every last byte out of it by adjusting context size.
sagiroth@reddit (OP)
Yeah, well, that's the beauty of Linux. On my home PC running Linux with 24GB VRAM I run Qwen3.5 27B at 180k context, because there I can disable stuff. With a Mac that is also a company laptop that's not possible, so I really have 16GB of usable RAM at most.
jonas-reddit@reddit
Ouch. That’s quite a lot of overhead. I thought macOS was a bit more efficient. I might have a Mac Mini somewhere with 24GB that I can take a look at, just out of curiosity. Linux without booting into graphical user mode is indeed quite nice for minimizing VRAM usage.
sagiroth@reddit (OP)
Yeah, the problem is that since it's your own managed device, you can disable stuff you don't need or run headless, whereas in this case I can't. I have stuff that checks software, security, Slack.
Enough-Astronaut9278@reddit
try the 35B MoE variant, active params are only like 3B so it fits fine. way better than a dense 9B imo
Enough_Big4191@reddit
with 24gb on mac, qwen 9b is probably the only one that’ll run comfortably with other apps open. for 64k context u’ll need to use offloading tricks or memory-mapped context, otherwise the system will start swapping heavily.
Monk_Boy@reddit
Use oMLX and enable TurboQuant.
blackhawk00001@reddit
What models work for you? I tried majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-3bit but didn't see any improvement when I enabled the turboquant setting. It and the rotorquant version perform the same, very well but crash omlx around 25k context on my m4 24gb.
Monk_Boy@reddit
TurboQuant just means you can run with a larger KV context without the model failing.
Monk_Boy@reddit
I have a MacBook M1 32GB. I run Qwen3.5-4B-oQ8-fp16 (the fp16 is only for M1 or M2). I quantized this with oMLX, and use TurboQuant 4-bit as a setting in oMLX. I like the 4B better than the 9B. I also set the temperature to 0.2 to make the model very logical so it follows instructions well.
blackhawk00001@reddit
I've been tinkering with this also. I'm more familiar with running larger models on workstations but have this 24GB MacBook Air M4 that I do personal projects on and take with me on trips. I'm trying to find a use case for hermes or pi to run local in an sbx environment with a model hosted local with omlx. 2-4GB is reserved for the agent sbx, so that reduces my model options.
I really liked the majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit model but it crashes omlx at 25k tokens. Maybe good for the omlx chat or another wrapper.
As far as models allowing for a 65k context I've landed on three possibilities. I'm trying to stick with omlx but do have llama.cpp installed to use. The Qwen3.5-4B and 9B mtp mxfp4 models (linked in tables, I did not make them) seem to run the best. I think both can be deployed at the same time and be used in an expert/runner fashion. I've been bench testing yesterday and today so I still am not sure how well they will work with tool calling. Hopefully more mtp models show up and I'm hoping for a smaller moe model in the future, 35B a3B is just too big.
I've also tried a few gemma-4 26b a4b models but they take up too much memory and can't go to 65k.
I knew at the time I should have ordered the 32GB but the 24GB was on sale in the store. Even so I can see how an M5 cpu would really help out. Prefill speeds are not super great on M4 but it works.
https://huggingface.co/sleepyeldrazi/Qwen3.5-4B-MXFP4-MTP
https://huggingface.co/sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP
Benchmark Model: Qwen3.5-4B-MXFP4-MTP
Single Request Results

| Test | TTFT (ms) | TPOT (ms) | pp TPS (tok/s) | tg TPS (tok/s) | E2E (s) | Throughput (tok/s) | Peak Mem (GB) |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 3176.6 | 20.13 | 322.4 | 50.1 | 5.733 | 201.0 | 3.20 |
| pp4096/tg128 | 12425.5 | 21.40 | 329.6 | 47.1 | 15.144 | 278.9 | 3.86 |
| pp8192/tg128 | 25426.4 | 22.84 | 322.2 | 44.1 | 28.326 | 293.7 | 4.41 |
| pp16384/tg128 | 54661.2 | 25.30 | 299.7 | 39.8 | 57.874 | 285.3 | 5.29 |
| pp32768/tg128 | 123900.9 | 32.88 | 264.5 | 30.7 | 128.077 | 256.8 | 7.04 |
| pp65536/tg128 | 329667.3 | 51.80 | 198.8 | 19.5 | 336.246 | 195.3 | 10.61 |
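
The columns in the 4B table above appear to be related by simple arithmetic. Here is a minimal sketch, assuming pp TPS is prompt tokens divided by TTFT, tg TPS is generated tokens divided by the time left after the first token, and Throughput is all tokens divided by E2E. These definitions are inferred from the numbers, not taken from the benchmark tool's documentation, and the function name is made up for illustration.

```python
# Two rows copied from the Qwen3.5-4B-MXFP4-MTP table:
# (prompt_tokens, generated_tokens, ttft_ms, e2e_s)
ROWS = [
    (1024, 128, 3176.6, 5.733),
    (65536, 128, 329667.3, 336.246),
]


def derive_rates(prompt_tokens, generated_tokens, ttft_ms, e2e_s):
    """Recompute pp TPS, tg TPS and overall throughput from the raw timings."""
    ttft_s = ttft_ms / 1000.0
    pp_tps = prompt_tokens / ttft_s                # assumed: prefill speed
    tg_tps = generated_tokens / (e2e_s - ttft_s)   # assumed: decode speed
    throughput = (prompt_tokens + generated_tokens) / e2e_s
    return round(pp_tps, 1), round(tg_tps, 1), round(throughput, 1)


if __name__ == "__main__":
    for prompt_tokens, generated_tokens, ttft_ms, e2e_s in ROWS:
        rates = derive_rates(prompt_tokens, generated_tokens, ttft_ms, e2e_s)
        print(prompt_tokens, rates)
```

The recomputed values land within rounding distance of the reported ones, which suggests that at 64k almost all of the wall-clock time (about 330 of 336 seconds) is prefill rather than generation.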
Benchmark Model: Qwen3.5-9B-MXFP4-MTP
Single Request Results

| Test | TTFT (ms) | TPOT (ms) | pp TPS (tok/s) | tg TPS (tok/s) | E2E (s) | Throughput (tok/s) | Peak Mem (GB) |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 5813.3 | 33.72 | 176.1 | 29.9 | 10.095 | 114.1 | 5.60 |
| pp4096/tg128 | 22929.1 | 35.12 | 178.6 | 28.7 | 27.390 | 154.2 | 6.22 |
| pp8192/tg128 | 47272.3 | 36.27 | 173.3 | 27.8 | 51.879 | 160.4 | 6.77 |
| pp16384/tg128 | 99061.6 | 39.56 | 165.4 | 25.5 | 104.085 | 158.6 | 7.65 |
| pp32768/tg128 | 227864.1 | 48.02 | 143.8 | 21.0 | 233.963 | 140.6 | 9.40 |
| pp65536/tg128 | 481132.8 | 64.88 | 136.2 | 15.5 | 489.372 | 134.2 | 11.96 |
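
For the 64k question, the Peak Mem column is the interesting one. Below is a rough sketch that takes the smallest and largest prompt sizes from the two tables above and computes the average memory growth per prompt token. It assumes the reported GB are decimal gigabytes and that growth is roughly linear over the range, which the intermediate rows only loosely support, so treat the result as an order-of-magnitude figure.

```python
# (context_tokens, peak_mem_gb) at the smallest and largest prompt sizes,
# copied from the two benchmark tables above.
PEAK_MEM = {
    "Qwen3.5-4B-MXFP4-MTP": [(1024, 3.20), (65536, 10.61)],
    "Qwen3.5-9B-MXFP4-MTP": [(1024, 5.60), (65536, 11.96)],
}


def kb_per_token(points):
    """Average peak-memory growth per prompt token, in decimal kilobytes."""
    (tokens_lo, mem_lo), (tokens_hi, mem_hi) = points
    # GB -> KB is a factor of 1e6 under the decimal-units assumption.
    return (mem_hi - mem_lo) * 1e6 / (tokens_hi - tokens_lo)


if __name__ == "__main__":
    for name, points in PEAK_MEM.items():
        print(f"{name}: ~{kb_per_token(points):.0f} KB per token")
```

Both models come out at roughly 100 KB per token on average, so in these runs going from 1k to 64k context costs about 6 to 7.5 GB on top of the weights.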
Sufficient-Bid3874@reddit
Qwen35BA3B
Bulky_Blood_7362@reddit
I don't think it will fit in 24GB RAM. Maybe at Q4.
_JustLivingLife_@reddit
It fits at IQ2 and works pretty well; Qwen3.6 A3B is better than Qwen3.5 9B even at IQ2, I believe.
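
Whether a 35B model fits comes down mostly to bits per weight. Here is a back-of-the-envelope sketch, assuming about 4.5 bits per weight for a Q4-class quant and about 2.7 for an IQ2-class quant. Those two figures are rough assumptions for illustration (real files vary with the quantization recipe), and the 16GB budget is the usable RAM the OP mentioned earlier in the thread. KV cache and runtime overhead are not included.

```python
# Approximate bits per weight. These are illustrative assumptions,
# not exact figures for any particular quantized file.
APPROX_BPW = {"Q4-class": 4.5, "IQ2-class": 2.7}

USABLE_RAM_GB = 16  # the OP's stated usable memory on the work laptop


def weight_gb(params_billion, bits_per_weight):
    """Size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for label, bpw in APPROX_BPW.items():
        size = weight_gb(35, bpw)
        verdict = "fits" if size < USABLE_RAM_GB else "does not fit"
        print(f"35B at {label}: ~{size:.1f} GB of weights, {verdict} in {USABLE_RAM_GB} GB")
```

Note that an MoE model with 3B active parameters still has to keep all 35B weights in memory; the small active count helps speed, not footprint.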
QuestionMarker@reddit
I forget the quant but I've happily run it in a 16gb VM on CPU under WSL2. 5-7tps but it did useful work.
sagiroth@reddit (OP)
Will give it a try
_JustLivingLife_@reddit
FWIW i'm running Qwen 3.6 35B A3B IQ2M (without vision) on my 24GB MB
Last_Mastod0n@reddit
Just curious, why without vision? Does it run a lot faster or is it smaller vram usage?
_JustLivingLife_@reddit
Takes less VRAM when I only need it for text
Last_Mastod0n@reddit
Oh wow, thank you. I should look into this. I run q6 with vision and I certainly need vision for my project. But certain parts do not need vision so perhaps I could use higher quant for the non vision parts with vision disabled.
Last_Mastod0n@reddit
Yes qwen 3.5 a3b is the move. Just use the highest quant that you can run. I highly suggest looking at the unsloth models as they preserve more of the higher quant details.
I'm not sure of the process, but find a way to kill as much RAM utilization as possible. I know macOS is picky about anything OS and memory related, but surely there are workarounds.
Kodrackyas@reddit
can confirm, the token output is 100% good
Saraozte01@reddit
I would give Gemma 4 26B A4B at Q4-Q6 a try through ollama. Works pretty well for me and leaves some space for context as well!
maximus_reborn@reddit
yeah, it's the only model that works in 24GB. I am in the same boat.
Saraozte01@reddit
Give ministral 3 14B a shot, surprised me a lot
tonyboi76@reddit
Your binding constraint is the 64k context, not the model. The KV cache at 64k is big, and that is what eats your headroom on top of macOS + Firefox (budget ~6-8GB for the system).
Three levers that actually make this work on 24GB:
1. Raise the GPU memory cap. By default macOS only lets the GPU wire down a fraction of unified memory. Bump it: sudo sysctl iogpu.wired_limit_mb=20480 (on a 24GB machine that leaves ~4GB for the system). This one change often turns "I cannot fit it" into "it runs fine".
2. Use MLX, not llama.cpp. On Apple Silicon MLX is noticeably more memory-efficient and faster. Easiest path is LM Studio with the MLX runtime, or mlx_lm directly.
3. Quantize the KV cache. 64k of fp16 KV is the real hog; dropping it to 8-bit roughly halves that and is basically free in quality.
Model-wise: for a full 64k window in that budget, a 14B-class model at 4-bit (Qwen2.5-14B / Qwen3-14B) leaves the most room for context. Qwen3-30B-A3B is smarter and fast (only 3B active), but at 64k you are fighting for memory. It is doable with the wired_limit bump + KV quant, just tighter. I would start at 14B + 64k, confirm it is stable with Firefox open, then try the 30B MoE if you want more quality and can live closer to the edge.
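
To put a number on the KV cache point, here is a minimal sketch of the usual size estimate for a standard attention cache: 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The layer count, KV head count and head dimension below are illustrative assumptions for a 14B-class model with grouped-query attention, not specs read from any model card, and the 8-bit figure ignores the small overhead of quantization scales.

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_tokens, bytes_per_element):
    """Estimated KV cache size in GiB for a plain (non-hybrid) attention stack."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = K and V
    return elements * bytes_per_element / 2**30


# Hypothetical 14B-class dimensions, chosen for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 65536  # 64k tokens

if __name__ == "__main__":
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 2)
    int8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 1)
    print(f"fp16 KV cache at 64k:  {fp16:.1f} GiB")
    print(f"8-bit KV cache at 64k: {int8:.1f} GiB")
```

Under these assumptions the cache alone is in the same range as the model weights, which is why halving it matters more than shaving the weights by another fraction of a bit.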
sagiroth@reddit (OP)
Somewhat related, but you could at least proofread AI output.
tonyboi76@reddit
ha fair, formatting got away from me. real answer is just two things: bump iogpu.wired_limit_mb so the GPU can actually use the ram, and quantize the kv cache since 64k is what eats it. everything else is detail.
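
For picking the iogpu.wired_limit_mb value, the arithmetic is just total memory minus whatever you want to keep for the system, in MB. Here is a small sketch; the helper name is made up for illustration and the reserve sizes are assumptions you would tune for your own machine.

```python
def wired_limit_command(total_ram_gb, reserve_for_system_gb):
    """Build the sysctl command that caps GPU-wired memory at total minus reserve."""
    if not 0 < reserve_for_system_gb < total_ram_gb:
        raise ValueError("reserve must be positive and smaller than total RAM")
    limit_mb = int((total_ram_gb - reserve_for_system_gb) * 1024)
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"


if __name__ == "__main__":
    # 24 GB machine, 4 GB kept for macOS: reproduces the 20480 value above.
    print(wired_limit_command(24, 4))
    # A more conservative reserve for a managed laptop with Firefox and Slack.
    print(wired_limit_command(24, 8))
```

A larger reserve trades model headroom for stability, which matters on a managed work laptop where background agents cannot be turned off.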
DunderSunder@reddit
You are right to be frustrated! the combination of macOS + Firefox in a 24gb M4 mac with Qwen 9B is a classic pain point of using 24gb macs. You are not alone!
Rare_Potential_1323@reddit
Try REAP models : ) I am thankful they exist
cibernox@reddit
tl;dr; Yes, pretty much.
Technically you may be able to run a 20B model like gpt-oss, but you would have very little ram to do anything else on that computer.
I'd draw the line at ~14B models in Q4.
Saraozte01@reddit
Could try ministral 3 @ 14B and Phi 4 (I think it's around the same size). They're a bit old but they really surprised me.
AmoebaDue6638@reddit
Gemma 3 12B with Q4 quantization should fit comfortably in 24GB with 64k context on M4. Runs great with LM Studio or llama.cpp.
Sufficient-Bid3874@reddit
Bad bot