Given how good Qwen has become, is it time to grab a 128GB M5 Max?
Posted by Rabus@reddit | LocalLLaMA | View on Reddit | 127 comments
I was on the fence about upgrading my M1 Pro 32GB, but seeing how good Qwen is becoming, isn't it time to start experimenting with local models?
My experience so far was that local models never came close to Opus, but I see that the 27B models are now getting close to Opus 4.5 (???), which sounds exciting!
Gallardo994@reddit
My M5 Max 128GB arrived last week and I've been running quite a few models since then. Before this machine I owned an M4 Max 128GB.
At first, when I compared both side by side, I saw almost no difference in prompt processing or generation speed, and was disappointed. It turns out the llama.cpp backend, especially the one included with LM Studio, just doesn't use the "neural accelerators" properly (there's a PR on the llama.cpp repo that addresses this, but it isn't merged as of today). Only MLX gives a proper speed boost to prompt processing. However, I suggest oMLX, as it has some nice caching techniques that are noticeable.
As for running the 27B version of Qwen on the M5 Max specifically: yes, you can run it, and yes, it's quite impressive for its size. However, it's quite slow to generate even at Q8, and because these models like to think a lot, that's a deal breaker. You have to crank up the presence penalty for it to be bearable. Prompt processing is okay, much faster than the thinking. Just don't expect to go beyond 64K context or you'll be pulling your hair out.
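For a sense of why 64K is a practical ceiling on a 128GB machine, here's a back-of-envelope KV-cache estimate. The layer/head numbers below are hypothetical placeholders, not the real Qwen3.6-27B config, so treat this as an illustration of the scaling, not exact figures:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Hypothetical dimensions for a ~27B dense model (NOT the real Qwen config),
# with fp16 cache entries (bytes_per_elem=2):
gib = kv_cache_bytes(n_tokens=64_000, n_layers=60, n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB of KV cache at 64K context")  # on top of the Q8 weights
```

The cache grows linearly with context, so doubling the window doubles this number, which is why long contexts plus Q8 weights plus OS overhead add up fast.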
I honestly suggest either the 35B version of Qwen or even Qwen3-coder-next, both at Q8. Those are perfect models for that hardware, balancing speed and quality.
Sorry for not attaching any numbers as I'm not sitting in front of said Mac right now. If you want, I can test Qwen3.6-27B at Q8 and MXFP4, both as MLX builds running on oMLX, using the integrated benchmark at different context lengths, in about 12 hours.
More-Curious816@reddit
Do you recommend the M5 Max or the DGX Spark? They both have the same amount of RAM and probably the same price.
Kryohi@reddit
The Spark has much lower memory bandwidth.
brownman19@reddit
IIRC it is far faster for prompt processing, often coming close to data center cards if you use NVFP4.
If you're working with large context (like codebases with a pretty static 50-100K token cache of codebase context), the DGX Spark becomes more usable.
In some cases I might even use it over my M3 Ultra Studio because of that sustained long-context PP throughput.
That being said, if you pair program or use AI as more of a tool, and don't give it long context, then yeah, I'd say the memory bandwidth is a hindrance.
brownman19@reddit
Honestly…depends on which one you plan on adding a second (or third or fourth) of down the line.
Given RDMA is supported now, a Framework Desktop cluster (2x of them with RDMA over Thunderbolt 5 and a high-speed networking dock) is a good alternative.
You can get two of them for a bit more than a maxed-out M5 Max, but you get 256GB of VRAM and a real clustering speedup. It's faster than one device, and by a good margin. You can add two more as well, just like with a Mac Studio or Mac mini.
Alternatively Mac mini cluster with RDMA might be an option. Haven’t looked into it.
Finally, the one I'm probably doing: 3-4 new Intel cards in a Z890 setup with lots of DDR5 (well, not that much, but I can get around 128GB of really fast CAMM memory at around 8000 MT/s for $800 or so). With 96 to 128GB of VRAM on top of that and an undervolt, it should run on a single 1200-1500W PSU. But I'm still looking.
Important_Coach9717@reddit
Now you just need to tell the world how you can get 128GB of RAM for that price…
silentsnake@reddit
If you just need a headless box for inferencing, go with the Spark: stronger compute (for prefill) and vLLM concurrency.
Gallardo994@reddit
I don't own a DGX Spark to give a comprehensive comparison sadly.
However, this M5 Max machine is a full 16-inch laptop. Not only is it a fine AI station, it also wipes the floor with any high-end desktop CPU, has great battery life (unless you're actively running LLMs, of course), and is dead silent most of the time. I would not trade it for a Spark plus a separate Windows laptop.
More-Curious816@reddit
You don't need a separate Windows laptop though. The DGX Spark is a full PC running Linux; you can access it remotely from any device, from your iPhone or a MacBook Air.
Gallardo994@reddit
Of course, a Windows laptop was just an example. A better deal would probably be an M4/M5 Pro laptop alongside the Spark, so you get the benefit of a wonderful CPU for other tasks.
However, if you want a single package solution that can do it all, a maxed out M5 Max is hard to beat.
More-Curious816@reddit
Yeah, the biggest problem though is the price. If I'm going to dream big, just imagine: 1TB of unified RAM in a Studio Ultra, with LPDDR6X and bandwidth comparable to a 5090.
Rabus@reddit (OP)
I am actually considering the 14" as I like the portability, and 15% less power is not a dealbreaker for me over the 16".
But maybe ill change my mind tomorrow lol
candylandmine@reddit
The 14" Max seems to throttle quite a bit. Seems like it really needs the 16" chassis.
Gallardo994@reddit
Having both 14'' and 16'' Macs at home, the difference in size isn't that big. Both can be comfortably used in bed or on a kitchen table. In my daily life the only difference is whether 16'' fits my small backpack or not. So I wouldn't be too worried if I were you
More-Curious816@reddit
Just fixing the table formatting of OP
UPDATE:
oMLX - LLM inference, optimized for your Mac: https://github.com/jundot/omlx
Benchmark model: Qwen3.6-27B-mxfp4 - Single Request Results
Benchmark model: Qwen3.6-27B-8bit - Single Request Results
Caffdy@reddit
Those Qwen3.6-27B-mxfp4 tg (tps) numbers are actually faster than my 3090!
gh0stwriter1234@reddit
Don't forget to enable ngram speculative decoding if you are using it for coding tasks. It doesn't require a draft model but works really well: any time input ends up in the output, it detects that and auto-completes it.
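For anyone curious how that works: n-gram speculation just looks the last few generated tokens up in the existing context and proposes whatever followed them, and the model then verifies the whole draft in one batch. A toy sketch of the lookup step (not llama.cpp's actual implementation):

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the last n tokens against earlier
    context and copying what followed that match. No draft model needed."""
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Search backwards for an earlier occurrence of the same n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == key:
            return tokens[i + n:i + n + max_draft]
    return []  # no match: fall back to normal one-token-at-a-time decoding

# When the model starts re-emitting something from the prompt, the whole
# continuation gets proposed at once (and then verified in a single pass):
ctx = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b", "def", "add"]
print(ngram_draft(ctx, n=2))
```

That's why it shines on coding tasks, where the model constantly echoes file contents back into its output.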
Automatic-Arm8153@reddit
What’s your llama-server command
chimph@reddit
I have just received the same MacBook. The cool thing about local is that you can build with the 35B-Q6 model at high speed and then have the 27B model review everything while you go do something else. Seems to be a killer combo tbh.
jrodder@reddit
I find it interesting that many people are doing it in this order. I have been using the 27B as the plan mode: let it think and build a perfect plan slowly, then hand that final .md plan over to the 35B-A3B for execution.
chimph@reddit
tbh I’ve only had this for a day, and I assume it’s best to plan at 85 tok/s and then not have to babysit it as it reviews at 15 tok/s.. but that will depend on how you work with it tbf
fallingdowndizzyvr@reddit
Dude! Do you think that's even remotely readable?
You don't need to wait for it to merge. Download and run the PR.
Gallardo994@reddit
Looks fine on both the desktop and mobile versions of the reddit web UI, as a scrollable code block. There's also a chad who made a proper table in a comment.
Yes, I know I don't need to wait. I just don't want to go through the hassle of maintaining a thing on my device that homebrew already maintains.
fallingdowndizzyvr@reddit
Replace "www" with "old". Then you'll see an epic run-on sentence.
Gallardo994@reddit
I apologize, but I'm not familiar enough with this version of reddit to predict how a certain user would see the message. The original reply still stands: there's a comment with the results as a table, which renders there just fine.
Fair-Indication2230@reddit
Is it worth buying an M5 Max? Or a decent laptop with Claude Code Max?
Gallardo994@reddit
It depends.
"Is it worth saving my hard-earned money for half a year to run local AI?" Most likely not worth it. In some countries the price of such a unit is around $10k, which is super hard to justify if you're looking for bang for your buck.
However, if you're looking for the absolute best laptop on the market, and if it doesn't hurt your wallet, then it's totally worth it.
Epireve1@reddit
Thanks, I am keeping M4 Max 128GB a little longer
CYTR_@reddit
Yes, please review the MXFP4 with some pp/tg t/s 🙏
Gallardo994@reddit
I've updated the post with both benchmarks, decided to sleep late tonight
Beamsters@reddit
Thanks! Can you please put some numbers of M5 Max on Llama.cpp as a reference point compared to oMLX?
Gallardo994@reddit
That will probably have to wait until tomorrow, as I've got none of these models in GGUF format, my network ain't that fast, and I've never used llama-bench. Will report back once I benchmark it.
Enough_Big4191@reddit
I'd be careful anchoring on "close to Opus"; benchmarks don't show where it breaks. Qwen is strong, but the gap shows up on longer context, edge cases, and consistency. A 128GB M5 Max is great if you actually want to run bigger models locally and experiment a lot. But if most of your work is still high-stakes or complex, you'll probably keep bouncing back to cloud anyway.
Extra-Library-5258@reddit
My numbers:
| Model | Role | RAM | Peak tok/s |
|---|---|---:|---:|
| Qwen3-Coder-Next | Primary coding | ~45 GB | 92.5 |
| Qwen3.6-35B-A3B | Workflow default | ~40 GB | 67.2 |
| Qwen3.5-35B-A3B | Fallback workflow | ~37 GB | 73.7 |
| Qwen3.5-122B-A10B | Precision tier | ~70 GB | 55.2 |
Qwen3-Coder-Next degradation:
128K and still above 50 tok/s!
Extra-Library-5258@reddit
**M5 Max (40c) · Qwen3-Coder-Next · 4bit**
| Context | PP (tok/s) | TG (tok/s) |
|--------:|-------------:|------------:|
| 1k | 2,131 | 93.2 |
| 4k | 3,146 | 90.0 |
| 8k | 3,253 | 87.0 |
| 16k | 3,114 | 79.6 |
| 32k | 2,671 | 66.0 |
| 64k | 1,975 | 51.9 |
| 128k | 1,229 | 34.9 |
| 195.3k | 834 | 20.9 |
**M5 Max (40c) · Qwen3.6-35B-A3B · 8bit**
| Context | PP (tok/s) | TG (tok/s) |
|--------:|-------------:|------------:|
| 1k | 2,174 | 100.9 |
| 4k | 3,683 | 98.1 |
| 8k | 3,942 | 94.0 |
| 16k | 3,846 | 88.0 |
| 32k | 3,286 | 75.9 |
| 64k | 2,428 | 58.4 |
| 128k | 1,557 | 43.7 |
| 195.3k | 1,098 | 28.8 |
u/Rabus
Rabus@reddit (OP)
Nice. I think I could fit in a 64/128K context limit the way I work with stuff, and seeing Opus speeds is crazy to think about. I think I'll grab a 128GB, thanks.
hurdurdur7@reddit
But do you really code with a 4 bit model :-(
Extra-Library-5258@reddit
There are several structured tasks they have been consistently executing with success, so yes!
GeorgeSC@reddit
Just throwing this in as I don't care about the Apple ecosystem, but does anyone here have experience with an AMD Strix Halo 128GB?
From what I can see, the Mac starts stronger by having faster bus speed, but all in all, is the AMD worth it for inference?
I'm thinking of going that way because I could install Bazzite there and have the PC for AI inference during business hours and then use it for Steam play in the after-hours.
cafedude@reddit
On the Framework Desktop (Strix Halo with 128GB) I'm getting upper-teens tok/s with Qwen3.5-27B at 170K context (I've run 3.6-27B but didn't get the perf numbers; should be similar). It's just at the usable threshold for me: any slower and I wouldn't bother with it. With Qwen3-coder-next (an 80B MoE) I get 36 tok/s, which is quite usable.
ProfessionalSpend589@reddit
Your comment is misleading, because you don’t mention what quants you are using.
nesymmanqkwemanqk@reddit
I'm running the 122B Qwen MoE model at a comfortable 20-25 TG with a decently big KV cache; you can do quite well, and I feel like it's better than GPT-5 mini and Haiku, close to Sonnet on some tasks.
xquarx@reddit
Bazzite is not a fun system to install random things on, being atomic; go with the parent distro: Fedora.
But both Mac and Strix have similar bottlenecks as far as I've read.
Objective-Picture-72@reddit
I don't think the M5 Max is good for dense models. It gives you the RAM to hold them, but the tok/s isn't good enough. So either go with NVIDIA GPUs or wait for the M5 Ultra Mac Studio.
PinkySwearNotABot@reddit
is it due to memory bandwidth or what? what causes the slow tok/s?
RedEyed__@reddit
It is due to limited compute power compared to a GPU. Just look at the TOPS figure for an Nvidia RTX 6000 Pro.
Previous_Fortune9600@reddit
Local AGI ftw !
UnhingedBench@reddit
Here is the list of models I can run on my 128GB M4 Max. That should give you an idea of what you could try. Just be aware that bigger models will run slower.
gegtik@reddit
How did you generate this?
AnonsAnonAnonagain@reddit
128GB just isn't enough, in my opinion. A minimum of 256GB is required to run any sufficient model with large context properly.
Caffdy@reddit
let's hope Nvidia gets the memo and updates the Spark with double the memory and double the bandwidth next iteration
AnonsAnonAnonagain@reddit
It will be a long time before a Spark refresh; it's meant to be a taste of big-boi Nvidia, deliberately underpowered.
The next step up from a spark is going to be a $150-300k DGX Station GB300
496GB LPDDR5X @ 396GB/s (system RAM); 252GB HBM3e @ 7.1TB/s (VRAM)
https://nvdam.widen.net/s/jnkrzwnqhj/dgx-station-datasheet
antirez@reddit
27B with thinking enabled is too slow on a MacBook to seriously replace a frontier model. And I'm not even starting on how Qwen 3.6 27B is not on par with GPT/Opus in the real world (not even Kimi K2.6), but I assume you decided it is enough for you after extensively testing 27B with opencode/pi and a cloud provider. Even so, even the fastest MacBook you can buy is too slow for serious inference.
tarruda@reddit
For simple/moderate tasks, I'd say that even Qwen 3.6 35B-A3B is enough. I've been using it daily and found it to be significantly better than any local model for agentic coding I tried before. Plus it is fast enough on my M1 Ultra.
Yes, it cannot do very complex tasks, but you shouldn't be delegating your brain to an LLM anyway. Ideally you'd use it as a code monkey to do things you've already figured out.
Anything above 20 tokens/second generation is good enough for coding with an agent. The main bottleneck with Macs is prompt processing, which the M5 Pro/Max is supposed to fix.
Still, for Macs and Strix Halo devices, a 27B dense model is not the best option. The 35B-A3B and the upcoming (hopefully) 122B-A10B 3.6 will be more interesting.
marscarsrars@reddit
Grab the DGX Spark, it works wonders.
jacek2023@reddit
I am trying to buy a fourth 3090 and it's not easy. So yes, 3090s are a much better choice, but probably not really available.
Ok-Internal9317@reddit
STOP! Have you investigated prompt processing speed? I can bear 10 tok/s token generation, but definitely not waiting minutes for the LLM to even start generating.
You should look at whether the M5 Max can become a legit replacement for real productivity, or whether it's just an expensive toy to brag about.
chibop1@reddit
It depends on your workflow.
On an M3 Max, I get about 200 TK/s at PP with Qwen3.6-27B. This will slow you down a lot if you submit a long new prompt each time, like processing a new PDF with every request.
However, this speed would be just fine as a QA chat assistant.
Also, oMLX makes agentic tools with long system prompts more tolerable by caching prompts both cold (on SSD) and hot (in RAM).
Some people are also fine with queuing work overnight and reviewing the results in the morning.
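The hot/cold prompt caching mentioned above can be pictured as a two-tier store keyed by a hash of the prompt. This is just a conceptual sketch, not oMLX's actual code; the class and all its names are made up:

```python
import hashlib, pickle, tempfile
from pathlib import Path

class TwoTierPromptCache:
    """Conceptual hot (RAM) / cold (SSD) cache for prefilled prompt state."""
    def __init__(self, cold_dir=None):
        self.hot = {}                                   # RAM tier
        self.cold = Path(cold_dir or tempfile.mkdtemp())  # SSD tier

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    def put(self, prompt, kv_state):
        key = self._key(prompt)
        self.hot[key] = kv_state                        # fast path for reuse
        (self.cold / key).write_bytes(pickle.dumps(kv_state))  # survives eviction

    def get(self, prompt):
        key = self._key(prompt)
        if key in self.hot:
            return self.hot[key]
        path = self.cold / key
        if path.exists():                               # promote SSD -> RAM
            self.hot[key] = pickle.loads(path.read_bytes())
            return self.hot[key]
        return None                                     # cache miss: full prefill

cache = TwoTierPromptCache()
cache.put("You are a helpful coding agent...", kv_state=[1, 2, 3])
cache.hot.clear()                                       # simulate RAM eviction
print(cache.get("You are a helpful coding agent..."))   # reloaded from the cold tier
```

The point is that a long system prompt only pays the prefill cost once; later sessions reload the computed state instead of reprocessing the tokens.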
Xidium426@reddit
I'm calling BS on this, there is no way you are getting 200 TK/s. Are you sure it's not 20TK/s?
200TK/s is faster than a RTX5090 or a M5 Max.
Hedede@reddit
It's not faster, 5090 can process 30K tokens at 2K tok/s.
alexp702@reddit
He's talking about prompt processing, which is in line with the M5 Max post earlier.
Turtlesaur@reddit
Maybe he means 35b a3b 😬
chibop1@reddit
Nope, Qwen3.6-27b not 35b.
chibop1@reddit
That's the result I get on oMLX. Keep in mind I'm talking about prompt processing, not generation speed. If you use llama.cpp, I believe Qwen3.6 isn't optimized there yet.
DrBearJ3w@reddit
Prompt processing is not the same as generation speed. 200 seems legit.
silentsnake@reddit
Another issue is how steep the falloff is: PP tok/s at 2K context vs PP tok/s at 65K context. You want it to be as flat as possible. On Strix Halo (Vulkan/ROCm) or on Macs, the slope is real bad. These are the subtle things that make or break usability. On a Spark (Blackwell) it's practically flat and consistent.
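You can put a number on that slope using the M5 Max Qwen3-Coder-Next 4-bit prefill figures posted earlier in the thread:

```python
# Prefill throughput (tok/s) from the M5 Max / Qwen3-Coder-Next / 4-bit table
# posted earlier in this thread:
pp = {4_000: 3_146, 64_000: 1_975, 128_000: 1_229}

def retained(pp, short_ctx, long_ctx):
    """Fraction of short-context prefill speed still available at long context."""
    return pp[long_ctx] / pp[short_ctx]

print(f"64k keeps {retained(pp, 4_000, 64_000):.0%} of 4k prefill speed")
print(f"128k keeps {retained(pp, 4_000, 128_000):.0%} of 4k prefill speed")
```

So on that Mac, prefill at 128K runs at well under half its 4K speed, which is exactly the kind of slope being described here.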
Obvious_Equivalent_1@reddit
Honestly, I'm on an M4 Pro where I luckily over-specced the RAM for Docker, so I've got 48GB. I'm already running several sessions of Claude Code, and the Qwen 3.6 27/35B models are perfect.
While it definitely adds some sluggishness, I have routed every call for Explore(type=haiku) and Search(type=haiku) to the 35B, so the whole Claude execution doesn't feel slower. The planning phase with Opus takes longer for sure, but I run various sessions anyway, and the number of tokens saved has already been amazingly noticeable over the past days.
I'm now testing the 27B as well; it's a great candidate to offload the tooling work I've been running on Sonnet. And also to run a night shift, like a graveyard queue where I let agents write up their verification work to process at night, all practically free after the hardware purchase.
This piece, as you see, is very aimed at CC, but I've noticed within days(!) that I'm already fixing my overdraft issue (I was needing expensive extra API usage on top of my Max 200x plan). And even on an expensive (but not as crazy expensive as an M5 Max 128GB) machine, honestly, even with my older M4 Pro 48GB, these models are in my case already good value per dollar of hardware against real cloud AI spend saved.
ThenExtension9196@reddit
I got an M4 Max 128. Wish I hadn't. Toks are slow af; I don't even bother.
silentsnake@reddit
That's the main reason I switched from Strix Halo to the DGX Spark. Both have similar memory bandwidth, but the Blackwell's compute is on a whole other level! For ReAct agents, slow prompt processing is basically unusable.
PinkySwearNotABot@reddit
does the slow load at the start only occur when you're first loading a new model? or is it slow at the beginning of each response?
what exactly is the bottleneck for the slow PP?
Evening_Ad6637@reddit
The bottleneck is in the processing/computing power.
It's slow every time the model receives new, extensive context. For example, if you start with "Hello", the model responds immediately at (just as an example) 50 tokens per second. So your Mac or computer can generate text at a rate of 50 tps.
Let's say your second message is code you copied and pasted, along with a question; say 20,000 tokens. Even if the Mac could process these 20,000 tokens at 1,000 tps, it would take 20 seconds just to start the response (and that's a pretty optimistic assumption; for example, my M1 Max loaded with Qwen-27B processes tokens at more like 100 tps, so I would wait more than three minutes).
Let's assume your third message is simply "Thank you"; then the model will respond immediately again, since the 20,000 tokens are now cached.
But that is exactly the problem with real-world use cases. Real-world tasks typically involve long, multi-turn conversations with often new, large inputs (code, web searches, PDF extraction, image processing, etc.). That is why local LLMs are useless if the input processing speed is not fast enough. Or that's when MoE models come into play and save your butt.
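The arithmetic above boils down to a one-liner: time to first token on a fresh, uncached prompt is just prompt length divided by prefill speed:

```python
def time_to_first_token(prompt_tokens, prefill_tok_s):
    """Seconds of silence before generation starts on a fresh (uncached) prompt."""
    return prompt_tokens / prefill_tok_s

# The 20K-token paste from the example above:
print(time_to_first_token(20_000, 1_000))  # optimistic machine: 20 seconds
print(time_to_first_token(20_000, 100))    # slower machine: 200 seconds, 3+ minutes
```

Generation speed never enters into it, which is why a machine with fine tok/s can still feel unusable for agentic work.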
JacketHistorical2321@reddit
It's 4x the prompt processing speed compared to the M3. It's not difficult to find. Chill out dude lol
mr_zerolith@reddit
That's still very slow compared to Nvidia or AMD hardware.
Ell2509@reddit
Most sensible reply i have seen in a while.
Technical-Earth-3254@reddit
Don't trust benchmarks. Real-world performance of the 27B is not close to Opus. 3.5 27B wasn't even close to Haiku 4.5. I'm giving it the benefit of the doubt, but don't expect real-world performance close to anything SOTA.
Caffdy@reddit
Haiku is good, but these models definitely are better
__Maximum__@reddit
What quants are you running? What framework? What scaffolding?
In my experience Haiku is dogshit and Qwen 3.6 is very, very good; even with suboptimal scaffolding it handles messy, vague requests like Opus does.
-dysangel-@reddit
It's 1,000,000% SOTA for its size. It isn't as capable as frontier models, but it's definitely punching above its weight.
Technical-Earth-3254@reddit
You probably knew exactly what I meant, but I added the word frontier to my comment.
Song-Historical@reddit
Isn't there an argument that you could use these local models as subagents to save on tokens with the frontier models? Say, to implement code, or to hand off things like decomposing tasks (like when a model can't find a file mid-build when context is already a little depleted).
I don't think most serious people are looking to replace their entire workflow yet. I'm just trying to gauge how far along we are.
KURD_1_STAN@reddit
I don't even see an improvement in intelligence-per-size in 3.6 27B compared to 3.5 27B. Although 3.6 35B was a much better upgrade over 3.5 35B, so I'm hoping it's just a one-off issue.
Only-An-Egg@reddit
I've been really impressed running Qwen3.6-36B-A3B on my Mac Studio w/ 96GB
ImportantFollowing67@reddit
What tokens/second are you getting? I'm getting roughly 30, which is fine for me!
It's not perfect and requires more hand-holding but... it's still very good.
Turbulent_Pin7635@reddit
I'm getting 80 tk/s TG and 2,400 tk/s PP.
M3 Ultra 512, Q8.
benevbright@reddit
But the rest, 400GB, would have nothing to do, right?
Only-An-Egg@reddit
I'm getting about 40. I'm using oMLX with 8bit model and 8bit KV cache.
mr_zerolith@reddit
On the first request, or with some actual context?
It's my experience that whatever number you get on the first tokens is going to be 2-3x lower at the end of the context window.
HealthySkirt6910@reddit
The cost needs to be weighed
putrasherni@reddit
not in laptop form though
ptinsley@reddit
What harness are you all running Qwen in? I gave it what seemed like a pretty trivial task in Aider and learned that Aider can't access the web to look up API docs etc. to get calls right when writing code. Well, either that or Qwen was failing at tool lookups; I ran out of time to look at it and haven't gotten back to it.
gregorskii@reddit
They use opencode.
ea_man@reddit
You see a nice small dense model and you want to buy a slow 128GB mini PC?
You want to buy the fastest 24GB GPU you can afford, then maybe get another one next year.
Slow and big ain't the future.
datbackup@reddit
An M5 Max is not a slow 128GB mini PC.
Do some basic research
ea_man@reddit
Compared to a 5090? It's 2-3x slower. Do I have that wrong?
In my research, coding is a matter of precision, stability, and repeatability for tool usage, which means dense models.
Dense models run better on GPU; MoE generic stuff for single queries runs better on unified memory.
Correct me if I'm wrong.
vick2djax@reddit
Does the Mac sound like it’s about to blast off into space with its fans going crazy?
octoo01@reddit
No, I sometimes forget it's on if there are other sounds in the house. It sounds like a normal, if fairly quiet, laptop with its fans at high.
vick2djax@reddit
Reason I ask is I have an M3 Max with only 36GB, and whenever I spin a model up on it, the fans get really loud; it's the only time I've ever heard the fans kick on lol
Embarrassed_Adagio28@reddit
Macs can run big models, but they are pretty slow. My $600 dual Tesla V100 server runs Qwen3.6 27B Q5 at 28 tokens per second while an M4 Pro runs at 9. Just because Macs can fit big models into memory doesn't mean they are fast enough to be useful. Qwen3.6 35B is almost as smart but 3x faster, so I'd test that on a 16GB GPU if you can before you spend a bunch of money.
MiaBchDave@reddit
You do realize the LLM speed difference between an M5 Max and an M4 Pro, though, right? Generalizing about "Macs" doesn't exactly apply.
Sevenos@reddit
You wouldn't compare an M5 Max with a 16GB card though. That's 4090/5090 territory.
Dontdoitagain69@reddit
I wish Raspberry Pi and the rest, like Orange Pi, started coming out with 128GB; at least you'd save 2 grand on that Apple logo and not pay for the scores of a fake-Geekbenched PC.
fastheadcrab@reddit
You are referring to laptops? If you can use a desktop, I personally think 2x 5090s will be much faster and you can still run FP8.
Large amounts of VRAM like 128GB are better for significantly larger models, but you're either trading off speed or paying a lot (like for the RTX 6000 Pros).
Dontdoitagain69@reddit
All I care about is critical thinking and extraction of logical fallacies; that model doesn't exist.
ImportantFollowing67@reddit
Dude, get a PGX or equivalent imo. And yes, it's time to use both cloud and local....
Rabus@reddit (OP)
Ok I'm definitely behind, I have no clue anymore what a PGX is
illforgetsoonenough@reddit
I wasn't sure what it was either, so I looked it up and it appears to be Lenovo's branding for the DGX Spark: GB10 Blackwell.
https://www.lenovo.com/us/en/p/workstations/thinkstation-p-series/lenovo-thinkstation-pgx-sff/len102s0023
Rabus@reddit (OP)
Jesus what. I get a MacBook I can use daily for that kind of money 😅
ImportantFollowing67@reddit
I'd buy two of these for the price of one of those and I would have twice the RAM? What's the deal? Not sure it's a comparison.
mr_zerolith@reddit
These are really weak, like Macs... basically a 5070 with a lot of RAM.
ImportantFollowing67@reddit
I got an Asus Ascent GX10, which is just a version of the Nvidia PGX, which is a Linux-only small box with 128GB of unified RAM that uses less than 200 watts but puts out something like 1,000 TOPS. I can run 80GB models fully in memory. My ROI calculation puts it at about 2 to 4 years before it makes sense.
But I've theoretically already saved $750 by using local... and I bought it this year.
illforgetsoonenough@reddit
Which models do you run on it?
rorowhat@reddit
Get a strix halo instead
WeUsedToBeACountry@reddit
I have an M5 with 128, and I've been running Qwen3.6 27B all day with Unsloth's quantization and LM Studio, and it's been great. I use opencode with GPT 5.4 as the orchestrator and Qwen for subagents. If the model isn't loaded into memory, it does take a few seconds to get going. Once it's hot, it's fine.
And I have tried oMLX but found it goofy still. I'm just going to wait for LM Studio to properly support MLX I think.
Charming-Author4877@reddit
If you have the budget, get a 5090. The speed will be MUCH better than on a MacBook, and 32GB is enough to run Qwen 3.6 at max or with very high context.
The trend is not larger local models; it's going down to smarter and smaller models.
qubridInc@reddit
If you’re serious about local models, 128GB is finally worth it, but only if you’ll actually use it beyond the hype.
Snoo_27681@reddit
TLDR: If you have $5k you don't really need, it's a great investment.
With the M4 Max 128GB I'm able to run `Qwen3.6-27b-mxfp4` and `Qwen3.6-35B-A3B-mlx-mxfp8`. I got a few LangGraph workflows to solve issues with `Qwen3.6-35B-A3B-mlx-mxfp8`, so I'm hoping the 27B can help with heavier thinking. We will see. I'm assuming the M5 Max is just faster.
I think the value of local rigs is learning about local models; if you try to make local models work, you have to get better at your pipeline and context management. There is no possible way to do any meaningful work by prompting the same way you do with Opus. So it's a very expensive piece of learning equipment that runs some surprisingly decent but super slow models.
Rabus@reddit (OP)
Yeah, I'm pretty deep in agentic development, but not in local. I feel like local is the logical next step to not be reliant on Anthropic.
brickout@reddit
Nope
Its_Powerful_Bonus@reddit
M5 Max works like a charm, but with the RTX 5090 and turboquant around the corner, it might be a better choice in some use cases.
mr_zerolith@reddit
This is underpowered hardware with no upgradeability; it will always be on the slow side.
I'd strongly recommend that if you're going to buy starter hardware, you do it on a PCI Express platform, so that if your usage doesn't match your expectations, you can just add another GPU or three!
bakawolf123@reddit
hard to say
Since the M5 Ultra got delayed, I'm also thinking about one.
But I don't want another laptop tbh; my M1 Pro works just fine in that regard, sitting closed most of the time anyway since I work on a connected display with an external keyboard/mouse.
Really sad they decided this whole CEO swap needed to come first.
msitarzewski@reddit
Depends. It's not the fastest machine to do this stuff on, as many will surely point out... but you can do it at a coffee shop with an Americano in hand. (I'm on that machine now. At Starbucks.)
Confusion_Senior@reddit
Just stream from ssd
odikee@reddit
Don't fall into the trap of the hype AI bros.
jon23d@reddit
I’ve not been able to get it to make me happy. I’m sticking with Minimax for now