Given how good Qwen become, is it time to grab a 128gb m5 max?

[-]

Gallardo994@reddit

My M5 Max 128gb arrived last week and I've been running quite some models since then. Before this machine, I've been owning M4 Max 128gb.

At first, when I decided to compare both side by side, I saw almost no difference in prompt processing speed and generation speed, and was disappointed. Turns out, llama-cpp backend, especially the one included with LM Studio, just doesn't use "neural accelerators" properly (there's a PR on llama-cpp repo that addresses this, but it's not merged in as of today). Only MLX gives proper speed boost to prompt processing. However, I suggest oMLX as it has some nice caching techniques that are noticable.

As for running 27B versions of Qwen on M5 Max specifically. Yes, you can run it. Yes, it is quite impressive for its size. However, it's quite slow to generate even at Q8, and because these models like to think a lot it's a deal breaker. You have to crank up presence penalty for it to be bearable. Prompt processing is okay, much faster than thinking. Just don't expect to go beyond 64K context or you'll be pulling your hair off.

I honestly suggest either 35B version of Qwen or even Qwen3-coder-next, both at Q8. Those are perfect models for that hardware, balancing speed and quality.

Sorry for not attaching any numbers as I'm not sitting in front of said Mac right now. If you want, I can test Qwen3.6-27B Q8 and MXFP4 both MLX running on oMLX using the integrated benchmark at different context lengths, in about 12 hours.

[-]

More-Curious816@reddit

Do you recommend the m5 max or DGX spark? They both have the same amount of ram and probably price ^(probably)

[-]

Kryohi@reddit

Spark has a much lower memory badwidth

[-]

brownman19@reddit

Iirc it is far faster for prompt processing often coming close to data center cards if you use nvfp4

If you’re working with large context (like codebases with a pretty static 50-100k token cache of codebase context), dgx sparx becomes more usable.

In some cases might even use it over my m3 ultra studio bc of that long context sustained pp throughput.

That being said if you pair program or use AI as more of a tool, and don’t give long context, then yeah I’d say the memory bandwidth is hindrance

[-]

brownman19@reddit

Honestly…depends on which one you plan on adding a second (or third or fourth) of down the line.

Given RDMA is supported now, framework desktop cluster (2x of them with RDMA thunderbolt 5 and high speed networking dock) is a good alternative.

You can get 2 of them for a bit more than a maxed out M5 max. But you get 256 gb vram and you get real clustering speed up. It’s faster than one device. And by a good margin. You can add 2 more as well just like Mac Studio or Mac mini

Alternatively Mac mini cluster with RDMA might be an option. Haven’t looked into it.

Finally, I’m doing this one probably. 3-4 new intel cards in a z890 setup with lots of DDR5 (well not that much but I can get like 128gb of rly fast camm memory at like 8000 mt/s for $800 or so. With 96 to 128 gb vram with that, an under volt, and should run on a single 1200-1500w psu. But still looking

[-]

Important_Coach9717@reddit

Now you just need to tell the world how you can get 128GB of ram for that price …

[-]

silentsnake@reddit

If you just need a headless box for inferencing, go with the spark, stronger compute (for prefill) and vLLM concurrency.

[-]

Gallardo994@reddit

I don't own a DGX Spark to give a comprehensive comparison sadly.

However, this M5 Max machine is a full laptop at 16 inches. Not only it's a fine AI station but it also wipes floors with any high end desktop CPU, has great battery life (unless you're actively running LLMs of course), and is dead silent most of the time. I would not trade it off for a Spark + a separate Windows laptop.

[-]

More-Curious816@reddit

You don't need a separate windows laptop tho. The dgx spark is full pc with Linux, you can remotely access it with any device, from your iPhone or MacBook air.

[-]

Gallardo994@reddit

Of course, a Windows laptop was just an example. A better deal would probably be an M4/M5 Pro laptop alongside Spark so that you get the benefit of a wonderful CPU for other tasks.

However, if you want a single package solution that can do it all, a maxed out M5 Max is hard to beat.

[-]

More-Curious816@reddit

Yeah, the biggest problem tho is the price. If I'm going to dream big, now just imagine, 1tb of unified ram in studio ultra with lpddr6x as the model of ram with bandwidth speed comparable to 5090.

[-]

Rabus@reddit (OP)

i am actually considering 14" as i like the portability, and 15% less power is not a dealbraker for me over the 16".

But maybe ill change my mind tomorrow lol

[-]

candylandmine@reddit

The 14" Max seems to throttle quite a bit. Seems like it really needs the 16" chassis.

[-]

Gallardo994@reddit

Having both 14'' and 16'' Macs at home, the difference in size isn't that big. Both can be comfortably used in bed or on a kitchen table. In my daily life the only difference is whether 16'' fits my small backpack or not. So I wouldn't be too worried if I were you

[-]

More-Curious816@reddit

Just fixing the table formatting of OP

UPDATE:

oMLX - LLM inference, optimized for your Mac https://github.com/jundot/omlx Benchmark Model: Qwen3.6-27B-mxfp4 ================================================================================

Single Request Results

Test	TTFT(ms)	TPOT(ms)	pp TPS	tg TPS	E2E(s)	Throughput	Peak Mem
pp1024/tg128	1291.9	28.60	792.6 tok/s	35.2 tok/s	4.924	234.0 tok/s	15.07 GB
pp4096/tg128	4709.4	29.47	869.7 tok/s	34.2 tok/s	8.453	499.7 tok/s	16.49 GB
pp8192/tg128	9832.9	30.62	833.1 tok/s	32.9 tok/s	13.722	606.3 tok/s	17.11 GB
pp16384/tg128	22414.0	33.22	731.0 tok/s	30.3 tok/s	26.632	620.0 tok/s	18.36 GB
pp32768/tg128	47673.0	36.51	687.3 tok/s	27.6 tok/s	52.310	628.9 tok/s	20.86 GB
pp65536/tg128	112320.4	44.77	583.5 tok/s	22.5 tok/s	118.006	556.4 tok/s	25.90 GB
pp131072/tg128	298153.3	61.39	439.6 tok/s	16.4 tok/s	305.950	428.8 tok/s	36.27 GB

oMLX - LLM inference, optimized for your Mac https://github.com/jundot/omlx Benchmark Model: Qwen3.6-27B-8bit ================================================================================

Single Request Results

Test	TTFT(ms)	TPOT(ms)	pp TPS	tg TPS	E2E(s)	Throughput	Peak Mem
pp1024/tg128	1433.3	54.99	714.4 tok/s	18.3 tok/s	8.417	136.9 tok/s	28.34 GB
pp4096/tg128	5084.6	56.11	805.6 tok/s	18.0 tok/s	12.211	345.9 tok/s	29.79 GB
pp8192/tg128	10413.9	57.23	786.6 tok/s	17.6 tok/s	17.682	470.5 tok/s	30.42 GB
pp16384/tg128	24285.2	61.02	674.6 tok/s	16.5 tok/s	32.034	515.4 tok/s	31.67 GB
pp32768/tg128	53538.1	64.27	612.0 tok/s	15.7 tok/s	61.700	533.2 tok/s	34.17 GB
pp65536/tg128	123724.9	71.65	529.7 tok/s	14.1 tok/s	132.825	494.4 tok/s	39.20 GB

[-]

Caffdy@reddit

Those Qwen3.6-27B-mxfp4 tg (tps) numbers are actually faster than my 3090!

[-]

gh0stwriter1234@reddit

Don't forget to enable ngram speculative decoding if you are using it for coding tasks.... it doesn't require a draft model but works really well any time input ends up in the output it detects that and auto completes it.

[-]

Automatic-Arm8153@reddit

What’s your llama-server command

[-]

chimph@reddit

I have just received the same MacBook. The cool thing about local is that you can build with the 35B-Q6 model at fast speed and then have the 27B model review everything while you go do something else. Seems to be a killer combo tbh

[-]

jrodder@reddit

I find it interesting many people are doing it in this order. I have been using 27B to be the plan mode, let it think and build a perfect plan slowly and then hand that final .md plan over to 35A3B for execution.

[-]

chimph@reddit

tbh I’ve only had this for a day and I assume it’s best to plan at 85tok/s and then not have to be babysit as it reviews at 15tok/s.. but that will depend how you work with it tbf

[-]

fallingdowndizzyvr@reddit

Test TTFT(ms) TPOT(ms) pp TPS tg TPS E2E(s) Throughput Peak Mem pp1024/tg128 1291.9 28.60 792.6 tok/s 35.2 tok/s 4.924 234.0 tok/s 15.07 GB pp4096/tg128 4709.4 29.47 869.7 tok/s 34.2 tok/s 8.453 499.7 tok/s 16.49 GB pp8192/tg128 9832.9 30.62 833.1 tok/s 32.9 tok/s 13.722 606.3 tok/s 17.11 GB pp16384/tg128 22414.0 33.22 731.0 tok/s 30.3 tok/s 26.632 620.0 tok/s 18.36 GB pp32768/tg128 47673.0 36.51 687.3 tok/s 27.6 tok/s 52.310 628.9 tok/s 20.86 GB pp65536/tg128 112320.4 44.77 583.5 tok/s 22.5 tok/s 118.006 556.4 tok/s 25.90 GB pp131072/tg128 298153.3 61.39 439.6 tok/s 16.4 tok/s 305.950 428.8 tok/s 36.27 GB

Dude! Do you think that's even remotely readable?

(there's a PR on llama-cpp repo that addresses this, but it's not merged in as of today)

You don't need to wait for it to merge. Download and run the PR.

[-]

Gallardo994@reddit

Looks fine on both desktop and mobile version of reddit web, as a scrollable code block. There's also a chad who made a proper table in a comment.

Yes I know I don't need to wait. I just don't want to go through hassle of maintaining a thing on my device that homebrew already maintains.

[-]

fallingdowndizzyvr@reddit

Looks fine on both desktop and mobile version of reddit web

Replace "www" with "old". Then you'll see an epic runon sentence.

[-]

Gallardo994@reddit

I apologize but I'm not familiar with this version of reddit to predict how a certain user would see the message. The original reply still stands - there's a comment with the results as a table which renders there just fine.

[-]

Fair-Indication2230@reddit

Is it worth buying M5 Max? or decent laptop with claude code max

[-]

Gallardo994@reddit

It depends.

"Is it worth saving my hard earned money for half a year to run local AI?" - most likely not worth it. In some countries the price of such unit is around 10k$ which is super hard to justify if you're looking for a bang for your buck.

However, if you're looking for the absolute best laptop on the market, and if it doesn't hurt your wallet, then it's totally worth it.

[-]

Epireve1@reddit

Thanks, I am keeping M4 Max 128GB a little longer

[-]

CYTR_@reddit

Yes, please review the MXFP4 with some pp/tg t/s 🙏

[-]

Gallardo994@reddit

I've updated the post with both benchmarks, decided to sleep late tonight

[-]

Beamsters@reddit

Thanks! Can you please put some numbers of M5 Max on Llama.cpp as a reference point compared to oMLX?

[-]

Gallardo994@reddit

That will probably have to wait till tomorrow as I've got none of these models in GGUF format and my network ain't that fast, and I've never used llama-bench. Will report back once I benchmark it

[-]

Enough_Big4191@reddit

i’d be careful anchoring on “close to opus,” benchmarks don’t show where it breaks. qwen is strong, but the gap shows up on longer context, edge cases, and consistency. 128gb m5 max is great if u actually want to run bigger models locally and experiment a lot. but if most of your work is still high-stakes or complex, you’ll probably keep bouncing back to cloud anyway.

[-]

Extra-Library-5258@reddit

My numbers:

Model Role RAM Peak-tok/s Qwen3-Coder-Next Primary coding \~45 GB 92.5 Qwen3.6-35B-A3B Workflow default \~40 GB 67.2 Qwen3.5-35B-A3B Fallback workflow \~37 GB 73.7 Qwen3.5-122B-A10B Precision tier \~70 GB 55.2

Qwen3-Coder-Next degradation:

128K and still above 50 tok/s!

[-]

Extra-Library-5258@reddit

**M5 Max (40c) · Qwen3-Coder-Next · 4bit**

| Context | PP (tok/s) | TG (tok/s) |

|--------:|-------------:|------------:|

| 1k | 2,131 | 93.2 |

| 4k | 3,146 | 90.0 |

| 8k | 3,253 | 87.0 |

| 16k | 3,114 | 79.6 |

| 32k | 2,671 | 66.0 |

| 64k | 1,975 | 51.9 |

| 128k | 1,229 | 34.9 |

| 195.3k | 834 | 20.9 |

**M5 Max (40c) · Qwen3.6-35B-A3B · 8bit**

| Context | PP (tok/s) | TG (tok/s) |

|--------:|-------------:|------------:|

| 1k | 2,174 | 100.9 |

| 4k | 3,683 | 98.1 |

| 8k | 3,942 | 94.0 |

| 16k | 3,846 | 88.0 |

| 32k | 3,286 | 75.9 |

| 64k | 2,428 | 58.4 |

| 128k | 1,557 | 43.7 |

| 195.3k | 1,098 | 28.8 |

u/Rabus

[-]

More-Curious816@reddit

[Just fixing the tables formatting, nothing more]

**M5 Max (40c) · Qwen3-Coder-Next · 4bit**

Context	PP (tok/s)	TG (tok/s)
1k	2,131	93.2
4k	3,146	90.0
8k	3,253	87.0
16k	3,114	79.6
32k	2,671	66.0
64k	1,975	51.9
128k	1,229	34.9
195.3k	834	20.9

**M5 Max (40c) · Qwen3.6-35B-A3B · 8bit**

Context	PP (tok/s)	TG (tok/s)
1k	2,174	100.9
4k	3,683	98.1
8k	3,942	94.0
16k	3,846	88.0
32k	3,286	75.9
64k	2,428	58.4
128k	1,557	43.7
195.3k	1,098	28.8

u/Rabus

[-]

Rabus@reddit (OP)

nice. I think i could fit in 64/128k context limit the way i work with stuff and seeing Opus speeds is crazy to think about. I think ill grab a 128gb, thanks

[-]

hurdurdur7@reddit

But do you really code with a 4 bit model :-(

[-]

Extra-Library-5258@reddit

There are several structured tasks they have been consistently executing with success, so yes!

[-]

GeorgeSC@reddit

Just throwing this as I dont care about the apple ecosystem, but anyone here has experience with an amd strix halo 128GB?
from what can I see the mac starts stronger by having a faster bus speed, but after all, is the amd worth it for inference?
im thinking going that way cause I could install bazzite there and have the pc for ai inference during business hours and then use it for steam play in the after hours.

[-]

cafedude@reddit

On the Framework Desktop (Strix Halo with 128GB) I'm getting upper teens tok/sec with the Qwen3.5-27b with 170K context (I've run 3.6-27b but didn't get the perf numbers, should be similar). It's just at the usable threshold for me - any slower and I wouldn't bother with it. With Qwen3-coder-next (and 80B MoE) I get 36tok/sec which is quite useable.

[-]

ProfessionalSpend589@reddit

Your comment is misleading, because you don’t mention what quants you are using.

[-]

nesymmanqkwemanqk@reddit

Im running the 122b qweb moe model at comfortable 20-25 tg, with decent big size kv cache, you can do quite well and i feel like its better than gpt 5 mini and haiku, close to sonnet on some tasjs

[-]

xquarx@reddit

Bazzite is not a fun system to install random things on being atomic, go with the parent dirstro: Fedora.

But both Mac and Strix have similar bottlenecks as far as I've read.

[-]

Objective-Picture-72@reddit

I don't think the M5 Max is good for the dense models. It gives you the RAM size to hold the models but the tk/s isn't good enough. So either go with NVIDIA GPUs or wait for the M5 Ultra MacStudio.

[-]

PinkySwearNotABot@reddit

is it due to memory bandwidth or what? what causes the slow tok/s?

[-]

RedEyed__@reddit

It is due to limited compute power compared to GPU. Just look at TOPS value in Nvidia 6000 PRO

[-]

Previous_Fortune9600@reddit

Local AGI ftw !

[-]

UnhingedBench@reddit

Here is the list of models I can run on my 128GB M4 Max. That should give you an idea of what you could try. Just be aware that bigger models will run slower.

[-]

gegtik@reddit

How did you generate this?

[-]

AnonsAnonAnonagain@reddit

128GB just isn’t enough. In my opinion. A minimum of 256GB required to run any sufficient model with large context properly

[-]

Caffdy@reddit

let's hope Nvidia get the memo and they update the Spark with double the memory and double the bandwidth next iteration

[-]

AnonsAnonAnonagain@reddit

It will be a long time before a spark refresh, it’s meant to be a taste of big boi nvidia, deliberately underpowered.

The next step up from a spark is going to be a $150-300k DGX Station GB300

496GB LPDDR5X 396GB/s RAM 252GB HBM3e 7.1TB/s VRAM

https://nvdam.widen.net/s/jnkrzwnqhj/dgx-station-datasheet

[-]

antirez@reddit

27B with thinking enabled is too slow in a MacBook for serious replacement of a frontier model. And I'm not even starting to tell you how Qwen 3.6 27B is not on par with GPT/Opus in the real world (not even Kimi K2.6), but I assume you decided it is enough for you after extensively testing 27B with opencode/pi and a cloud provider. Even so, even the fastest macbook you can buy is too show for serious inference.

[-]

tarruda@reddit

And I'm not even starting to tell you how Qwen 3.6 27B is not on par with GPT/Opus in the real world (not even Kimi K2.6)

For simple/moderate tasks, I'd say that even Qwen 3.6 35BA3B is enough. I've been using it daily and found it to be significantly better than any local model for agentic coding I tried before. Plus it is fast enough on my M1 ultra.

Yes, it cannot do very complex tasks, but you shouldn't be delegating your brain to an LLM anyway. Ideally you'd use it as a code monkey to do things you already figured it out.

even the fastest macbook you can buy is too show for serious inference

Anything above 20 tokens/second generation is good enough for coding with an agent. The main bottleneck with Macs is prompt processing which M5 pro/max is supposed to fix.

Still, for Macs and Strix halo devices a 27b dense is not the best option. The 35BA3B and upcoming (hopefully) 122B A10B 3.6 will be more interesting.

[-]

marscarsrars@reddit

Grab the dgx sparks work wonders.

[-]

jacek2023@reddit

I am trying to buy fourth 3090 and it's not easy. So yes, 3090 are much better choice but probably not really available.

[-]

Ok-Internal9317@reddit

STOP! Have you investigated speed of prompt processing? I can bear with 10tok/s token generation, but definitely not waiting for minutes for LLM to even start generating.

You should look if the M5 Max can become a legit replacement for real productivity, or is it just an expensive toy to brag

[-]

chibop1@reddit

It depends on your workflow.

On an M3 Max, I get about 200 TK/s at PP with Qwen3.6-27B. This will slow you down a lot if you submit a long new prompt each time, like processing a new PDF with every request.

However, this speed would be just fine as a QA chat assistant.

Also, oMLX makes it more tolerable using agentic tools with long system prompts by utilizing cold (in SSD) and hot (in RAM) prompts caching.

Some people are also fine with queuing work overnight and reviewing the results in the morning.

[-]

Xidium426@reddit

I'm calling BS on this, there is no way you are getting 200 TK/s. Are you sure it's not 20TK/s?

200TK/s is faster than a RTX5090 or a M5 Max.

[-]

Hedede@reddit

It's not faster, 5090 can process 30K tokens at 2K tok/s.

[-]

alexp702@reddit

He’s talking prompt processing which is in line with M5 Max post earlier

[-]

Turtlesaur@reddit

Maybe he means 35b a3b 😬

[-]

chibop1@reddit

Nope, Qwen3.6-27b not 35b.

[-]

chibop1@reddit

That's the result I get on oMLX. Keep it mind I'm talking about prompt processing, not generation speed. If you use Llama.cpp, I believe the speed on Qwen3.6 is not optimized on llama.cpp yet.

[-]

DrBearJ3w@reddit

Prompt processing is not the same as generation speed. 200 seems legit.

[-]

silentsnake@reddit

Another issue is how steep the fall off is. PP toks/sec at 2k context length vs PP toks/sec at 65k context length. You want it to be as flat as possible. On strix halo Vulcan/rocm or on Macs, the slope is real bad. This are the subtle things that makes or breaks usability. On a spark (Blackwell) it's practically flat and consistent.

[-]

Obvious_Equivalent_1@reddit

Honestly I’m on a M4 Pro with luckily calculated overspace of RAM for Dockers to got 48Gb ram. I’m already running several sessions of Claude Code and the Qwen 3.6 27/35B models are perfect.

While it definitely adds some sluggishness, I have routed every call for Explore(type=haiku) and Search(type=haiku) to 35B, to the whole Claude execution doesn’t feel slower. The planning phase with Opus takes longer for sure, but I run various sessions anyway and the amount of tokens saved has already been noticeable amazingly the past days.

I’m now testing for the 27B as well, it’s a great candidate to offload tooling work to which I’ve been running on Sonnet. And also to run night shift, like a GY queue where I let agent write their verification work to to process at night, all practically for free after HW purchase.

This piece as you see if very aimed at CC, but I’ve been noticing within days(!) I’m already fixing my overdraft issue (I was needing extra API expensive usage on top of my Max 200x plan). And these models even on a expensive - but not so crazy expensive as an M5 Max 128Gb - it’s honestly even with my older M4 Pro 48Gb already in my case a good worth per dollar of hardware on real cloud AI consumption saved

[-]

ThenExtension9196@reddit

I got a m4 max 128. Wish I didn’t. Toks are slow af I don’t even bother.

[-]

silentsnake@reddit

That's the main reason I switched from strix halo to dgx spark. Both have similar memory bandwidth. But the Blackwell's compute is on a whole other level! For ReAct agents, slow prompt processing is basically unusable.

[-]

PinkySwearNotABot@reddit

does the slow load on start only occur at the beginning when you're first loading a new model? or is it slow at the beginning of each response?

what exactly is the bottleneck for the slow PP?

[-]

Evening_Ad6637@reddit

The bottleneck is in the processing/computing power.

It‘s slow every time the model receives a new, extensive context. For example, if you start with “Hello”, the model responds immediately at (just as an example) 50 tokens per second. So your Mac or computer can generate text at a rate of 50 tps.

Let’s say your second message is a code you copied and pasted, along with a question. Let’s say 20,000 tokens. Even if the Mac could process/compute these 20,000 tokens at 1,000 TPS, it would take 20 seconds to only start the response (and that’s a pretty optimistic assumption. For example: My m1 max loaded with qwen-27b computes tokens at more like 100 tps, so I would wait more than three minutes).

Let’s assume your third message is simply “Thank you”; then the model will respond immediately again, since the 20,000 tokens are now cached.

But that is exactly the problem with real-world use cases. Real-world tasks typically involve long, multi-tuen conversations with often new, large inputs (code, web searches, PDF extraction, image processing, etc.). That is why local LLMs are useless if the processing speed of the input is not fast enough. Or that’s when MoE models come into play and save your butt.

[-]

JacketHistorical2321@reddit

It's 4x prompt processing speed compared to M3. It's not difficult into to find. Chill out dude lol

[-]

mr_zerolith@reddit

That's still very slow compared to Nvidia or AMD hardware.

[-]

Ell2509@reddit

Most sensible reply i have seen in a while.

[-]

Technical-Earth-3254@reddit

Don't trust benchmarks. Real world performance of the 27B is not close to Opus. 3.5 27B wasn't even close to Haiku 4.5. I'm giving it the benefit of a doubt, but don't expect real world performance close to anything SOTA.

[-]

Caffdy@reddit

Haiku is good, but these models definitely are better

[-]

Maximum@reddit

What quants are you running? What framework? What scaffolding?

In my experience haiku is dogshit and qwen 3.6 is very, very good, even with not optimal scaffolding it handles messy vague requests like opus does.

[-]

-dysangel-@reddit

It's 1,000,000% SOTA for its size. It isn't as capable as frontier models, but it's definitely punching above its weight.

[-]

Technical-Earth-3254@reddit

You probably knew exactly what I meant, but I added the word frontier to my comment.

[-]

Song-Historical@reddit

Isn't there an argument that you could start using these local models as subagents to save on tokens with the frontier models? Let's say to implement code, maybe hand off things like decomposing tasks (like when a model can't find a file mid build when context is already a little depleted).

I don't think most serious people are looking to replace their entire workflow yet. I'm just trying to gauge how far along we are.

[-]

KURD_1_STAN@reddit

I dont even see improvement in intelligence/size in 3.6 27b compared to 3.5 27b. Altho 3.6 35b was a much better upgrade over 3.5 35b, so im hoping it is just an issue.

[-]

Only-An-Egg@reddit

I've been really impressed running Qwen3.6-36B-A3B on my Mac Studio w/ 96GB

[-]

ImportantFollowing67@reddit

What Token/second you getting? I'm getting roughly 30 which is fine for me!
Its not as perfect and requires more hand holding but... It's still very good.

[-]

Turbulent_Pin7635@reddit

I'm getting 80tk/s 2400tk/s of pp

M3U-512 q8

[-]

benevbright@reddit

But rest 400GB would have nothing to do, right?

[-]

Only-An-Egg@reddit

I'm getting about 40. I'm using oMLX with 8bit model and 8bit KV cache.

[-]

mr_zerolith@reddit

on the first request, or with some actual context?

it's my experience that whatever number you get on the first tokens is going to be 2-3x lower at the end of the context window.

[-]

HealthySkirt6910@reddit

The cost needs to be weighed

[-]

putrasherni@reddit

not in laptop form though

[-]

ptinsley@reddit

What harness are you all running qwen in? I gave it what seemed like a pretty trival task in aider and learned that aider can’t access the web to look up api docs etc to get calls right when writing code. Well either that or qwen was failing at tool lookups, I ran out of time to look at it and haven’t gotten back to it

[-]

gregorskii@reddit

They open code

[-]

ea_man@reddit

You see a nice small dense model and you want to buy a slow 126mb mini pc?

You wanna buy the fastest 24GB gpu you can afford, then maybe get an other one next year.

Slow and big ain't the future.

[-]

datbackup@reddit

A m5 max is not a slow 126mb mini pc

Do some basic research

[-]

ea_man@reddit

Compared 5090? It's 2-3x slower. Do I get it wrong?

in my research coding is a matter of precision, stability and repeatability for tools usage = dense models.

Dense models run better in GPU, MoE generic stuff for single query run better on unified.

Correct me if I'm wrong.

[-]

vick2djax@reddit

Does the Mac sound like it’s about to blast off into space with its fans going crazy?

[-]

octoo01@reddit

No, I sometimes forget it's on, if there's other sounds in the house. It sounds like a normal, if not fairly quiet, laptop with its fans at high

[-]

vick2djax@reddit

Reason I ask is I have a M3 Max with only 36GB. And whenever I spin a model up on it, the fans get really loud and it’s the only time I’ve ever heard the fans kick on lol

[-]

Embarrassed_Adagio28@reddit

Macs can run big models but they are pretty slow. My $600 dual tesla v100 server runs qwen3.6 27b q5 at 28 tokens per second while a m4 pro runs at 9. Just because macs can fit big models into memory doesnt mean they are fast enough to be useful. Qwen3.6 35b is almost as smart but 3x faster so id test that on a 16gb gpu if you can before you spend a bunch of money.

[-]

MiaBchDave@reddit

You do realize the LLM speed difference between an M5 Max and M4 Pro, though, right? Generalizing “Macs” doesn’t exactly apply.

[-]

Sevenos@reddit

You won't compare M5 Max with a 16gb card though. That's 4090/5090 territory.

[-]

Dontdoitagain69@reddit

I wish rasberry pi and the rest like orange started coming out with 128gb, at least you you will save 2 gs on that apple logo and pay for scores fake geekbenched pc

[-]

fastheadcrab@reddit

You are referring to laptops? If you can use a desktop I personally think 2x 5090s will be much faster and you can run the FP8 still.

Large amounts of VRAM like 128GB is better for significantly larger models but you either are trading off speed or will be paying a lot (like the RTX 6000 Pros)

[-]

Dontdoitagain69@reddit

all i care is critical thinking and extraction of logical fallacy, that model doesnt exist

[-]

ImportantFollowing67@reddit

Dude Get a PGX or equivalent imo And yes it's time to use both Cloud and local....

[-]

Rabus@reddit (OP)

Ok im def behind, I have no clue anymore what’s PGX

[-]

illforgetsoonenough@reddit

I wasn't sure what it was either, so I looked it up and it appears to be Lenovo's branding for the dgx spark. GB10 Blackwell

https://www.lenovo.com/us/en/p/workstations/thinkstation-p-series/lenovo-thinkstation-pgx-sff/len102s0023

[-]

Rabus@reddit (OP)

Jesus what. I get a MacBook I can use daily for that kind of money 😅

[-]

ImportantFollowing67@reddit

I'd buy two of these for the price of one of those and I would have twice the RAM? What's the deal? Not sure it's a comparison.

[-]

mr_zerolith@reddit

These are really weak like macs.. basically a 5070 with a lot of ram..

[-]

ImportantFollowing67@reddit

I got an Asus Ascent GX10 which is.... Just a version of the Nvidia PGX which .. is a Linux only small box with 128 GB of unified RAM and uses less than 200 watts but it puts out like 1000 tfps. I can run 80 GB models fully in memory... My ROI calculation puts it at about 2 to 4 years before it makes sense.
But I've theoretically already saved $750 from using local.... And I bought it this year.

[-]

illforgetsoonenough@reddit

Which models do you run on it?

[-]

rorowhat@reddit

Get a strix halo instead

[-]

WeUsedToBeACountry@reddit

I have a m5 w/ 128, and I've been running qwen3.6 27b all day with unsloth's quantization and lm studio and its been great. I use opencode with gpt 5.4 as the orchestrator and qwen for sub agents. If the model isn't loaded into memory, it does take a few seconds to get going. Once it's hot its fine.

And I have tried oMLX but found it goofy still. I'm just going to wait for LM Studio to properly support MLX I think.

[-]

More-Curious816@reddit

Just fixing the table formatting of OP

UPDATE:

oMLX - LLM inference, optimized for your Mac https://github.com/jundot/omlx Benchmark Model: Qwen3.6-27B-mxfp4 ================================================================================

Single Request Results

Test	TTFT(ms)	TPOT(ms)	pp TPS	tg TPS	E2E(s)	Throughput	Peak Mem
pp1024/tg128	1291.9	28.60	792.6 tok/s	35.2 tok/s	4.924	234.0 tok/s	15.07 GB
pp4096/tg128	4709.4	29.47	869.7 tok/s	34.2 tok/s	8.453	499.7 tok/s	16.49 GB
pp8192/tg128	9832.9	30.62	833.1 tok/s	32.9 tok/s	13.722	606.3 tok/s	17.11 GB
pp16384/tg128	22414.0	33.22	731.0 tok/s	30.3 tok/s	26.632	620.0 tok/s	18.36 GB
pp32768/tg128	47673.0	36.51	687.3 tok/s	27.6 tok/s	52.310	628.9 tok/s	20.86 GB
pp65536/tg128	112320.4	44.77	583.5 tok/s	22.5 tok/s	118.006	556.4 tok/s	25.90 GB
pp131072/tg128	298153.3	61.39	439.6 tok/s	16.4 tok/s	305.950	428.8 tok/s	36.27 GB

oMLX - LLM inference, optimized for your Mac https://github.com/jundot/omlx Benchmark Model: Qwen3.6-27B-8bit ================================================================================

Single Request Results

Test	TTFT(ms)	TPOT(ms)	pp TPS	tg TPS	E2E(s)	Throughput	Peak Mem
pp1024/tg128	1433.3	54.99	714.4 tok/s	18.3 tok/s	8.417	136.9 tok/s	28.34 GB
pp4096/tg128	5084.6	56.11	805.6 tok/s	18.0 tok/s	12.211	345.9 tok/s	29.79 GB
pp8192/tg128	10413.9	57.23	786.6 tok/s	17.6 tok/s	17.682	470.5 tok/s	30.42 GB
pp16384/tg128	24285.2	61.02	674.6 tok/s	16.5 tok/s	32.034	515.4 tok/s	31.67 GB
pp32768/tg128	53538.1	64.27	612.0 tok/s	15.7 tok/s	61.700	533.2 tok/s	34.17 GB
pp65536/tg128	123724.9	71.65	529.7 tok/s	14.1 tok/s	132.825	494.4 tok/s	39.20 GB

[-]

Charming-Author4877@reddit

If you have the budget, get a 5090. The speed will be MUCH better than on a macbook and 32GB is enough to run both 3.6 qwen at max or very high context.
The tendency is not larger local models, it's going down to smarter and smaller models

[-]

qubridInc@reddit

If you’re serious about local models, 128GB is finally worth it, but only if you’ll actually use it beyond the hype.

[-]

Snoo_27681@reddit

TLDR: If you have $5k you don't really need about it's a great investment.

With the M4 Max 128Gb I'm able to run `Qwen3.6-27b-mxfp4` and `Qwen3.6-35B-A3B-mlx-mxfp8`. I got a few Langraph workflows to solve issues with `Qwen3.6-35B-A3B-mlx-mxfp8` so I'm hoping 27B can help with heavier thinking. We will see. I'm assuming the M5 Max is just faster.

I think the value of the local rigs is learning about local models and then if you try to make local models work you have to get better than your pipeline and context management. There is no possible way to do any meaningful work by prompting the same as you do Opus. So it's a very expensive learning piece of equipment that runs some suprisingly decent but super slow models.

[-]

Rabus@reddit (OP)

yea im pretty deep in agentic development, but not in local. I feel like local is a logical next step to be kind of not reliant on Anthropic.

[-]

brickout@reddit

Nope

[-]

Its_Powerful_Bonus@reddit

M5 max works like a charm, but with rtx5090 and turboquant around the corner it might be better choice in some use cases.

[-]

mr_zerolith@reddit

This is underpowered hardware with no upgradeaboility. it will always be on the slow side.

I'd strongly recommend if you're going to buy starter hardware, do it on a PCI Express platform so that if your usage doesn't match your expectations, you can just add another GPU or three!

[-]

bakawolf123@reddit

hard to say
since m5 ultra got delayed I'm also thinking about one
but I don't want another laptop tbh, my m1pro works just fine in that regard, most of the time sitting closed anyway as I work on connected display and external kb/mouse
really sad they decided this whole CEO swap needs to come first

[-]