24GB M4 Mac - is Qwen 9B only option while system is running?

[-]

sagiroth@reddit (OP)

Nice that you read the post, not just the headline. I already use Claude Team Pro, but I need something ad hoc that runs locally and doesn't burn tokens for RAG prototyping. It doesn't have to be super smart.

[-]

Due_Duck_8472@reddit

Another team pro account is your best bet.

[-]

Out of curiosity, how much free memory do you have on that running system? On Linux, I can get “reasonable” results with 27B in around 22-23 GB of VRAM before I squeeze every last byte out of it by adjusting context size.

[-]

sagiroth@reddit (OP)

Yeah well, that's the beauty of Linux. On my home PC, on Linux with 24 GB of VRAM, I run qwen3.5 27b at 180k context, but there I can disable stuff. With a Mac that is also a company laptop, that's not possible, so I really have 16 GB of usable RAM at most.

[-]

jonas-reddit@reddit

Ouch. That’s quite a lot of overhead. I thought macOS was a bit more efficient. I might have a Mac Mini somewhere with 24GB that I can take a look at, just out of curiosity. Linux without booting into graphical user mode is indeed quite nice to minimize VRAM usage.

[-]

sagiroth@reddit (OP)

Yeah, the problem is that since it's your own managed device, you can disable stuff you don't need or run headless, whereas in this case I can't. I have stuff running that checks software, security, Slack.

[-]

Enough-Astronaut9278@reddit

try the 35B MoE variant, active params are only like 3B so it fits fine. way better than a dense 9B imo
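A rough way to sanity-check whether a given model fits: in an MoE model only the active parameters are used per token, but all experts generally still have to be resident in memory, so the weight footprint scales with total parameters, not active ones. The sketch below is back-of-envelope only. The bits-per-weight values are assumptions (real quantized files carry extra scale data), and it ignores KV cache and runtime overhead.

```python
def weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Model sizes are taken from the thread; the bit widths are illustrative assumptions.
candidates = [
    ("35B-A3B MoE @ 3-bit", 35, 3),
    ("9B dense @ 4-bit", 9, 4),
    ("4B dense @ 8-bit", 4, 8),
]

for name, params_b, bits in candidates:
    print(f"{name}: ~{weight_gb(params_b, bits):.1f} GB of weights")
```

By this estimate a 3-bit 35B model needs roughly 13 GB for weights alone, before any context, which is tight if only about 16 GB is usable.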

[-]

Enough_Big4191@reddit

with 24gb on mac, qwen 9b is probably the only one that’ll run comfortably with other apps open. for 64k context u’ll need to use offloading tricks or memory-mapped context, otherwise the system will start swapping heavily.

[-]

Monk_Boy@reddit

Use oMLX and enable TurboQuant.

[-]

blackhawk00001@reddit

What models work for you? I tried majentik/Qwen3.6-35B-A3B-TurboQuant-MLX-3bit but didn't see any improvement when I enabled the turboquant setting. It and the rotorquant version perform the same, very well but crash omlx around 25k context on my m4 24gb.

[-]

Monk_Boy@reddit

TurboQuant just means you can run with a larger KV context without the model failing.

[-]

Monk_Boy@reddit

I have a MacBook M1 32GB. I run Qwen3.5-4B-oQ8-fp16 (the fp16 is only for M1 or M2). I quantized this with oMLX, and use TurboQuant 4-bit as a setting in oMLX. I like the 4B better than the 9B. I also set the temperature to 0.2 to make the model very logical so it follows instructions well.

[-]

blackhawk00001@reddit

I've been tinkering with this also. I'm more familiar with running larger models on workstations but have this 24GB macbook air m4 that I do personal projects on and take with me on trips. I'm trying to find a use case for hermes or pi to run local in an sbx environment with a model hosted local with omlx. 2-4 GB is reserved for the agent sbx, so that reduces my model options.

I really liked the majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-3bit model but it crashes omlx at 25k tokens. Maybe good for the omlx chat or another wrapper.

As far as models allowing for a 65k context I've landed on three possibilities. I'm trying to stick with omlx but do have llama.cpp installed to use. The Qwen3.5-4B and 9B mtp mxfp4 models (linked in tables, I did not make them) seem to run the best. I think both can be deployed at the same time and be used in an expert/runner fashion. I've been bench testing yesterday and today so I still am not sure how well they will work with tool calling. Hopefully more mtp models show up and I'm hoping for a smaller moe model in the future, 35B a3B is just too big.

I've also tried a few gemma-4 26b a4b models but they take up too much memory and can't go to 65k.

I knew at the time I should have ordered the 32GB but the 24GB was on sale in the store. Even so I can see how an M5 cpu would really help out. Prefill speeds are not super great on M4 but it works.

https://huggingface.co/sleepyeldrazi/Qwen3.5-4B-MXFP4-MTP
https://huggingface.co/sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP

Benchmark Model: Qwen3.5-4B-MXFP4-MTP
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test            TTFT(ms)  TPOT(ms)       pp TPS      tg TPS   E2E(s)   Throughput  Peak Mem
pp1024/tg128      3176.6     20.13  322.4 tok/s  50.1 tok/s    5.733  201.0 tok/s   3.20 GB
pp4096/tg128     12425.5     21.40  329.6 tok/s  47.1 tok/s   15.144  278.9 tok/s   3.86 GB
pp8192/tg128     25426.4     22.84  322.2 tok/s  44.1 tok/s   28.326  293.7 tok/s   4.41 GB
pp16384/tg128    54661.2     25.30  299.7 tok/s  39.8 tok/s   57.874  285.3 tok/s   5.29 GB
pp32768/tg128   123900.9     32.88  264.5 tok/s  30.7 tok/s  128.077  256.8 tok/s   7.04 GB
pp65536/tg128   329667.3     51.80  198.8 tok/s  19.5 tok/s  336.246  195.3 tok/s  10.61 GB
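For anyone reading the columns: the derived figures appear to follow from the prompt length, the 128 generated tokens, TTFT and E2E. The sketch below uses relations inferred from the numbers in this table, not taken from the benchmark tool's documentation, so treat them as an assumption. They reproduce the reported values to within about 0.1.

```python
def derived_metrics(prompt_tokens: int, gen_tokens: int,
                    ttft_ms: float, e2e_s: float) -> dict:
    """Recompute the derived benchmark columns from raw timings.

    Inferred (not documented) relations:
      pp TPS     = prompt tokens / time to first token
      tg TPS     = generated tokens / decode time
      TPOT       = decode time / (generated tokens - 1)
      Throughput = all tokens / end-to-end time
    """
    ttft_s = ttft_ms / 1000
    decode_s = e2e_s - ttft_s
    return {
        "pp_tps": prompt_tokens / ttft_s,
        "tg_tps": gen_tokens / decode_s,
        "tpot_ms": decode_s * 1000 / (gen_tokens - 1),
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,
    }

# pp1024/tg128 row of the Qwen3.5-4B-MXFP4-MTP table.
row = derived_metrics(1024, 128, ttft_ms=3176.6, e2e_s=5.733)
for key, value in row.items():
    print(f"{key}: {value:.2f}")
```

One practical reading: at 65k context almost all of the wall-clock time is prefill (TTFT), so pp TPS matters far more than tg TPS for long prompts.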

Benchmark Model: Qwen3.5-9B-MXFP4-MTP
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test            TTFT(ms)  TPOT(ms)       pp TPS      tg TPS   E2E(s)   Throughput  Peak Mem
pp1024/tg128      5813.3     33.72  176.1 tok/s  29.9 tok/s   10.095  114.1 tok/s   5.60 GB
pp4096/tg128     22929.1     35.12  178.6 tok/s  28.7 tok/s   27.390  154.2 tok/s   6.22 GB
pp8192/tg128     47272.3     36.27  173.3 tok/s  27.8 tok/s   51.879  160.4 tok/s   6.77 GB
pp16384/tg128    99061.6     39.56  165.4 tok/s  25.5 tok/s  104.085  158.6 tok/s   7.65 GB
pp32768/tg128   227864.1     48.02  143.8 tok/s  21.0 tok/s  233.963  140.6 tok/s   9.40 GB
pp65536/tg128   481132.8     64.88  136.2 tok/s  15.5 tok/s  489.372  134.2 tok/s  11.96 GB
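On the idea of deploying both models at the same time, the Peak Mem columns allow a quick fit check. This is a minimal sketch under two loud assumptions: that the combined footprint is simply the sum of the two individual peaks, and that the full 24 GB is available to the models, which macOS will not actually grant. The 2 GB reservation is the low end of the 2-4 GB sandbox figure mentioned above.

```python
# Peak Mem (GB) by prompt length, copied from the two benchmark tables.
PEAK_MEM_GB = {
    "4B": {1024: 3.20, 4096: 3.86, 8192: 4.41, 16384: 5.29, 32768: 7.04, 65536: 10.61},
    "9B": {1024: 5.60, 4096: 6.22, 8192: 6.77, 16384: 7.65, 32768: 9.40, 65536: 11.96},
}

def combined_fits(context: int, total_gb: float, reserved_gb: float) -> bool:
    """True if both models' peaks plus the reservation fit in total_gb.

    Assumption: concurrent peak memory is the plain sum of individual peaks.
    """
    needed = sum(peaks[context] for peaks in PEAK_MEM_GB.values()) + reserved_gb
    return needed <= total_gb

for ctx in (16384, 32768, 65536):
    both = PEAK_MEM_GB["4B"][ctx] + PEAK_MEM_GB["9B"][ctx]
    ok = combined_fits(ctx, total_gb=24, reserved_gb=2)
    print(f"{ctx:>6} tokens: both models ~{both:.2f} GB, fits with 2 GB reserved: {ok}")
```

Under these assumptions the pair fits at 16k and 32k context but not with both at 65k, where the two peaks alone add up to about 22.6 GB.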
