M5 vs DGX Spark vs Strix Halo vs RTX 6000
Posted by Signal_Ad657@reddit | LocalLLaMA | 147 comments
Hey guys, super simple. There have been a lot of online debates about the new M5 Macs vs DGX Sparks vs Strix Halo vs dedicated GPUs etc.
So I put them all in a room with good power and cooling and ran everything in parallel with standardized tests for the past 3 days, and published everything to a repo.
A lot of it isn’t a big surprise when you just think about headline numbers and fundamentals. An RTX6000 has a memory bandwidth of ~1,800 GB/s vs ~600 for the M5 vs ~256 for the Spark and Strix. Tokens per second per piece of hardware follows that math and curve pretty well.
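As a back-of-the-envelope illustration of "follows that math": if decode is purely memory-bandwidth-bound, each generated token has to stream the active weights from memory once, so tokens/sec is capped at roughly bandwidth divided by the bytes read per token. The sketch below uses the approximate bandwidth numbers above; the 30 GB weight size is a hypothetical example, not a figure from the benchmark repo, and real results will land below these ceilings.

```python
# Rough ceiling on decode speed for a memory-bandwidth-bound model.
# Bandwidths are the approximate figures quoted in the post (GB/s).
BANDWIDTH_GBPS = {
    "RTX 6000": 1800,
    "M5 Max": 600,
    "DGX Spark": 256,
    "Strix Halo": 256,
}


def decode_ceiling_tps(bandwidth_gbps: float, active_weights_gb: float) -> float:
    """Upper bound on tokens/sec: bandwidth / GB read per generated token."""
    return bandwidth_gbps / active_weights_gb


# Hypothetical workload: ~30 GB of weights touched per token
# (for example a dense ~30B model at 8-bit). This is an assumption.
WEIGHTS_GB = 30.0

for name, bw in BANDWIDTH_GBPS.items():
    print(f"{name:10s} ceiling ~{decode_ceiling_tps(bw, WEIGHTS_GB):.1f} tok/s")
```

The ratios are the useful part: whatever the model, the ceiling scales about 7 : 2.3 : 1 across these three bandwidth tiers. Prefill is compute-bound and does not follow this rule.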
For the price point, and assuming you are ecosystem agnostic, the maxed out M5 is genuinely legit and very aggressively outperforms the DGX Spark. Again, not really a surprise when you look at their memory bandwidth speeds (2x+ memory bandwidth speeds on the M5 with the same total unified memory).
Second thing worth noting was also probably no surprise but the EVO X2 thermals were an issue with extended runs. The MacBook actually surprised me with how well it held up thermally more than anything. It ran for a few days and cruised in the 80°C range. I will say this though, it sounds like a normal gaming laptop when it cooks. There’s a bit of propaganda going on when people say “quiet” with these.
You ramp up an M5 MacBook Pro to cook with local AI and it turns into a blow dryer like every other laptop that’s ever tried to cook with local AI. It’s built like an aircraft carrier and performs really well for what it is, but you will 100% know it’s working when it runs lol.
I’m now swapping back ends and adding data for things like MLX on Mac, different hosting backends on Strix Halo, etc. for how they all impact performance and outputs. The RTX6000 is not the same as the RTX5090 just so the obvious police don’t grab me, but there are a lot of similarities between cards that could make this data useful for someone debating a 5090 PC vs these other machines.
Either way, repo enclosed, hope this helps provide some raw data and numbers for future discussions and debates:
https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests
flyingbanana1234@reddit
M5 Max rn costs 5.5k for 128 GB whereas DGX Spark is 3.8k
Amazing-Accident3535@reddit
This
jonydevidson@reddit
Enjoy the software for the Spark
dllu@reddit
ASUS Ascent GX10 is generally available in the US for $3.5k but the support is kinda iffy; the official DGX Spark is $4.7k.
rpkarma@reddit
Kind of iffy in what way?
potatoears@reddit
asus is notoriously bad at honoring warranty/providing support.
at least for regular mainstream consumer products.
dllu@reddit
Theoretically, system recovery images on NVIDIA's website are only for the founder's edition: https://docs.nvidia.com/dgx/dgx-spark/system-recovery.html
while ASUS provides their own images: https://www.asus.com/us/supportonly/gx10/helpdesk_download/
But I just looked it up and people have been able to flash the regular NVIDIA ones to the ASUS machines so maybe it just works? I also heard that the ASUS customer support are not very knowledgeable about the GB10 devices and aren't very helpful.
Miserable-Dare5090@reddit
It’s the same image. I swapped a hard drive, popped the nvidia image, and it’s working like nothing happened to this day.
rpkarma@reddit
Yeah I’ve used the Nvidia one no worries on mine!
And that’s not that surprising to me, Asus customer support sucks no matter the product lol
But they do have actual service centres where I live unlike Nvidia tbf
PentagonUnpadded@reddit
Different support how? You mean hardware support? Many ASUS / third party GB10 users flash it with the official DGX OS.
I don't see how the 'official' nvidia edition is better besides aesthetics. You can buy your own NVMe drive to turn the 3.5 1tb into a 4+ tb for way cheaper than 4.7k.
Saw there's slight thermal variation between third party and official boards. But an easy solution is a $15 usb fan on either the intake or exhaust. Plus none of them throttle in standard indoor conditions.
Powerful_Ad8150@reddit
No one flashes anything. One thing u need to change is wallpaper. Rest is the same
redzod@reddit
I'm semi-technical, but would we ever see a world in which we'd buy a dedicated inference server to process local models? Sort of like how we all have wifi routers now?
Historical-Internal3@reddit
Cuda or death.
apidekachu@reddit
My only concern with the apple hardware is that i can't replace the ssd. Because I'm assuming there would be a lot of wear and tear for running the system 24/7...
__JockY__@reddit
Give me a pair of 6000s over anything on the market. Except maybe four 6000s.
unjustifiably_angry@reddit
For the price of four 6000s you could have 8 Sparks and a network switch to let them all talk to each other and let you run Kimi K2.6 at usable speed.
__JockY__@reddit
All at 4 tokens/sec. Maybe 10? Nice as a novelty toy, but I challenge you to spend your life in Claude cli trying to use that for interactive work. The prefill on your DGX cluster is gonna be measured in blocks of minutes, not seconds.
Throw MiniMax FP8 on those 4x rtx6ks and it’ll rip 3x concurrent 200k Claude sessions at over 200 tokens/sec for decode with insane prefill speeds. That’s the kind of workhorse I need.
A DGX research cluster will certainly run a stronger model than quad GPUs, granted. But the GPUs are useful for interactive work (and MiniMax at FP8 is still very much SOTA), whereas the cluster would really only be useful for tasks where high latency and slow generation are acceptable.
Colecoman1982@reddit
You'd take a pair of 6000s over three 6000s?
__JockY__@reddit
It is pedantry of this sort, up with which I shall not put.
-- Winston Churchill, allegedly.
Colecoman1982@reddit
"The issue with quotes on the Internet is that you can never know if they are real." - Abraham Lincoln
unjustifiably_angry@reddit
You don't buy one spark, you buy two. So the number is actually ~550 GB/s.
spaceman3000@reddit
I have m3 ultra and strix and I love both. Mind that I paid usd 1800 for strix. At this price you can't beat it. But well at today's prices it doesn't make sense to buy either dgx or strix and just get Mac studio.
iMrParker@reddit
Weirdly, prices for AI hardware are very much "you get what you pay for". A lot of people think that the RTX Pro 6000 is overpriced compared to the competition, but you truly do get a multi-tool GPU. In terms of raw compute, memory bandwidth, as well as the software stack for training/fine-tuning.
Mini PCs and Macbooks/Mac Studios are fantastic machines, but they aren't multi-tools. They are inference machines. Training/fine-tuning anything serious, diffusion models, prompt processing etc. are big weakpoints that some people would prefer to pay more to avoid. But if it works for you, then definitely buy a Macbook
unjustifiably_angry@reddit
I'm not sure I follow your logic. A multitool is a single tool that's a jack of all trades, master of none. That's precisely what a Spark, Halo, or M5 Mac is. A 6000 Pro's only practical application at $8K MSRP (which has gone up I think) is AI.
Yorn2@reddit
As someone that owns and uses both Mac and nVidia hardware, the RTX 6k Pro can run image/video, TTS, and LLM at speeds much faster than my M3 Ultra and I can train with it.
The only major benefit of the M3 Ultra is I get to slowly see the outputs of the models I'm missing out on if I had more RTX 6k Pro cards. :D
iMrParker@reddit
I never implied master of none for the RTX Pro 6000. It is a legit multi-tool that is excellent at 99% of prosumer tasks. The GB10 is a jack of all trades, master of none. And Mac Studios are one-trick ponies. They're almost the only prosumer choice for large model inference, but can't do anything else worth attempting
Shoddy-Tutor9563@reddit
How does M5 perform in parallel requests workload? with vLLM my good old 4090 can serve like 50 requests in parallel with performance figures apple could only dream of.
Have they finally fixed the slow-as-hell prompt processing? Like in the scenario when multiple ppl are using LLM for coding and one is constantly evicting other's KV cache?
King_Kasma99@reddit
How much context do these 50 requests use to not blow the vram up? 5 tokens each?
Shoddy-Tutor9563@reddit
nopes, 8k
one such 4090 is quite successfully serving a very busy tool-calling client-facing chatbot for one of the enterprise :)
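For anyone wondering how 50 requests at 8k each relate to VRAM: KV cache size depends heavily on the model's shape and cache precision. A minimal sketch of the sizing arithmetic is below; the layer count, KV head count and head dimension are hypothetical placeholders (the thread does not say which model is being served), so plug in your own.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """Bytes of KV cache per token: keys + values, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(requests: int, tokens_per_request: int, bytes_per_token: int) -> float:
    """Total KV cache in GiB if every request holds tokens_per_request tokens."""
    return requests * tokens_per_request * bytes_per_token / 2**30


# Hypothetical model shape (an assumption, not from the thread):
# 28 layers, 4 KV heads (grouped-query attention), head_dim 128.
per_token_fp16 = kv_cache_bytes_per_token(28, 4, 128, bytes_per_value=2)
per_token_fp8 = kv_cache_bytes_per_token(28, 4, 128, bytes_per_value=1)

# Worst case: all 50 requests fill their whole 8k window at the same time.
print(f"fp16 KV cache: {kv_cache_gib(50, 8192, per_token_fp16):.1f} GiB")
print(f"fp8 KV cache:  {kv_cache_gib(50, 8192, per_token_fp8):.1f} GiB")
```

This is the worst case. vLLM's paged KV cache only allocates blocks for tokens that actually exist, so a chatbot whose requests rarely reach the full window needs far less than the worst-case figure.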
Southern_Sun_2106@reddit
8K, Jesus... That would be impressive, if it was 2020.
FullOf_Bad_Ideas@reddit
Is a local 4090 serving this chatbot or do you run rented one for prod? Just curious since local hardware doing work in low latency prod seems rare and your comment treats 4090 as fungible items.
King_Kasma99@reddit
What exactly is it used for then? 8k is literally unusable for me
Shoddy-Tutor9563@reddit
For you - sure, it might not be usable. But for my customer, who serves thousands of clients per day - it's very usable. He's the happiest man alive seeing this chatbot raising his sales
FullOf_Bad_Ideas@reddit
Lots of usecases for low context requests. I'm doing translation with 1.8B HY 1.5 at 6k ctx. Concurrency of 128 per gpu, running on 8 GPUs. I translated about 10B tokens this weekend.
unjustifiably_angry@reddit
You're the first person I've seen bring up actually making real-world use of parallelism instead of just using the inflated numbers as an imaginary value proposition, mind giving details of your use-case, model, etc? Very curious.
Old-Magician9787@reddit
All of the agentic harnesses e.g. opencode make use of parallel background agents
Borilentz@reddit
Have a look at oMLX.
CMPUTX486@reddit
I know M5 Pro can be faster, but I need studio form not laptop!
eltonjock@reddit
Why?
spaceman3000@reddit
Thermals and why to pay for the screen, battery, keyboard when you don't need them?
sn2006gy@reddit
The OS wars are ruining this community. Just get on with whatever works. Community is better if we’re all building cool shit instead of comparing penis sizes
Yorn2@reddit
I own an M3 Ultra and two RTX 6000 pros. I feel obligated to comment about my experiences with both because I don't want people buying a Mac and expecting speed or an RTX 6k Pro expecting to run Kimi K2 with no loss.
If I was doing RAG the M3 Ultra would be great and more heavily used, but instead I find it mostly sitting unused. It is usable, though, don't get me wrong, it's just that I'd rather use my two RTX 6k Pros that can run an EXL3 of Qwen 3.5 397B as my workhorse unit since they are definitely faster.
I guess the TL;DR of it is that they each have their use cases, but you need to be totally prepared for those use cases when you buy it. Buy for your needs, not someone else's.
neopolitan77@reddit
Those are hardware comparisons, not OS. And a lot of optimizations are hardware-specific, so it makes sense for the community to cluster a bit along hardware lines. There's a dedicated Strix Halo subreddit where people discuss Vulkan vs ROCm and so on. For NVidia consumer GPUs, VRAM limitations are more important, including questions about offloading. DGX Spark has its own connector for combining multiple units. Some optimizations like MTP benefit everyone equally, but far from all of them do.
sn2006gy@reddit
spare me the hair splitting.
Mauer_Bluemchen@reddit
What - OS vs. hardware is "hair splitting"? Certainly not.
sn2006gy@reddit
Discuss hardware and software, fine. Making images calling people retards, fucking stupid
neopolitan77@reddit
A comparison serves newcomers to decide what to choose and others whether a switch may be warranted.
If you don't want to talk/read about optimizing local LLM setups, then I really don't understand what you're hoping to get out of this community?
sn2006gy@reddit
that is not what this image does whatsoever.
neopolitan77@reddit
There's a link to a Github repo with detailed benchmark results for different configurations, which are summarized in the OP.
sn2006gy@reddit
ok, still doesn’t change the facts.
Crafty-Sell7325@reddit
Agree. This post is just closer to ragebait than being useful
InnovativeBureaucrat@reddit
I find the reactions useful actually. I don’t have either and it’s hard to figure out what difference what makes
Crafty-Sell7325@reddit
I'm glad for you, but the point is that this does not seem to be done in good faith
Swimming-Chip9582@reddit
Ecosystem on Mac sucks ass
Signal_Ad657@reddit (OP)
Haha that’s totally fair. I’m just reporting performance. I’m not naturally a Mac guy either and was more than happy to come in and find out “M5 is hype”. But IF you don’t care that it’s a Mac (or like Macs), it’s genuinely maybe the best unified memory based machine right now and it’s not very close.
Swimming-Chip9582@reddit
oh - it's not so much that it is a Mac, it's really the surrounding LLM ecosystem. The whole MLX and lack of proper VLLM alternative on the Mac really makes it a hassle to use in my context
Longjumping-Move-455@reddit
I have to admit oMLX has been really good in my experience
Naz6uL@reddit
I've been using oMLX + Hermes agent (M3 Max 128GB), and my only concern so far is potential disk degradation, which is critical on Mac since the storage is soldered.
eltonjock@reddit
How does one even know if their MBP is suffering from disk degradation?
alexp702@reddit
Shouldn’t be a problem unless you choose disk based prompt caching - that’s what kills the drive
Longjumping-Move-455@reddit
It can be worth getting another external ssd just for the cache. That’s what I did
koushd@reddit
What’s wrong with llama.cpp on mac?
Signal_Ad657@reddit (OP)
Yeah I’d agree, it’s not an ecosystem built around self hosting 100%.
starkruzr@reddit
not really? not anymore. it's not as comprehensive as Nvidia's is but I'd say at this point you're doing well enough with Mac to get whatever work you need done done.
Xatter@reddit
The best system is the one you can actually buy and use
Ok_Top9254@reddit
Reminder that 16GB Tesla P100 with a fan is still just less than 120 bucks with similar compute and higher bandwidth than the MBP M5 Max in the post...
AsliReddington@reddit
Sure, let me know once you can do anything not limited by memory lol.
Qwen Image Edit?
WAN2.2?
LTX2.3?
Go cry in the infinite loop
beren0073@reddit
Missed opportunity on the caption: "If those kids could process enough tokens they'd be very upset"
ttkciar@reddit
It's a bit more complicated than that.
When a model and its context fits in the RTX6000, the RTX6000 will outperform M5.
The more the model and context overflows the RTX6000's VRAM, the worse it will perform, whereas the M5's performance will hold steady.
Thus large models will infer more quickly on the M5, since the low main memory bandwidth of the RTX6000-equipped PC will limit that system's performance.
Strix Halo doesn't beat either system for any model size for performance, but it's a lot cheaper, and its peak power draw is quite low for its performance, so it's a way to infer with large'ish models at moderate performance on a tight budget. Neither the hardware nor the power bill will break the bank.
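The crossover described above can be sketched with a simple two-tier model: weights that fit in fast memory are read at the fast bandwidth, and whatever spills over is read at the slow one. The 96 GB and 128 GB capacities and the ~1,800 / ~600 GB/s bandwidths come from the thread; the 64 GB/s host-side figure is an assumption (roughly a PCIe 5.0 x16 link), and treating the model as dense with every weight read once per token is a simplification.

```python
def two_tier_decode_tps(weights_gb: float, fast_capacity_gb: float,
                        fast_bw_gbps: float, slow_bw_gbps: float) -> float:
    """Bandwidth-bound tokens/sec when weights may overflow fast memory."""
    in_fast = min(weights_gb, fast_capacity_gb)
    spilled = weights_gb - in_fast
    # Time per token = time to stream the fast part + time to stream the spill.
    seconds_per_token = in_fast / fast_bw_gbps + spilled / slow_bw_gbps
    return 1.0 / seconds_per_token


HOST_BW_GBPS = 64.0  # assumption: spilled weights are read at about PCIe 5.0 x16 speed


def rtx6000_tps(weights_gb: float) -> float:
    return two_tier_decode_tps(weights_gb, 96.0, 1800.0, HOST_BW_GBPS)


def m5_tps(weights_gb: float) -> float:
    # Unified memory: no spill up to 128 GB in this simplified model.
    return two_tier_decode_tps(weights_gb, 128.0, 600.0, HOST_BW_GBPS)


for size_gb in (60, 90, 100, 120):
    print(f"{size_gb:4d} GB  RTX6000 ~{rtx6000_tps(size_gb):5.1f} tok/s   "
          f"M5 ~{m5_tps(size_gb):5.1f} tok/s")
```

Even a small spill dominates the per-token time, because the slow tier is more than an order of magnitude slower than VRAM. MoE models with expert offloading behave better than this dense worst case.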
stoppableDissolution@reddit
The real difference between mac/strix and 6000 is that first two are virtually limited to single request inference while the latter can also train. If you are not going to use it for training or some kind of high-batch processing, you simply dont need it.
john0201@reddit
The M5 max and the strix halo are not in the same performance category. M5 max is similar to an Rtx pro 5000 with more memory, which can certainly be used for training.
Miserable-Dare5090@reddit
really? Whats the prompt processing on the M5 max at 100k depth?
stoppableDissolution@reddit
For Gemma 4 31B, according to https://omlx.ai/benchmarks/hnwitt0f - 300 t/s @ 128k, so like 7 minutes ttft. Not great, but I was expecting it to be worse, tbh.
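The 7-minute figure is just prompt length divided by prefill speed. A quick check, assuming "128k" means 128,000 tokens and that the 300 t/s average holds across the whole prompt:

```python
def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to first token when prefill speed is the only cost considered."""
    return prompt_tokens / prefill_tps


secs = ttft_seconds(128_000, 300)
print(f"{secs:.0f} s = {secs / 60:.1f} min")  # 427 s = 7.1 min
```

Swap in any other prefill rate quoted in the thread to compare waits at the same depth.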
Miserable-Dare5090@reddit
right. The spark has 2000 tps while the m5 max has 300t/s, and we are getting spammed by all these comparisons without actual comparisons.
stoppableDissolution@reddit
Um, 2k tps in q8? I kinda doubt it, because I have ~3k with 6000 pro at 10-15k, but maybe I'm doing something very wrong
Miserable-Dare5090@reddit
On dual Sparks, the 397b Qwen with int4-autoround quantization right now is churning in front of me. When large context is sent from hermes, it gets up to 15000 tokens per second prefill.
It’s still timing out because the bandwidth is SHIT for decode, but that is fixed with extending the time out vars in the hermes files. It took 2 hours but finally built me what looks like working rocm version of lucebox’s Dflash/Pflash optimizations for qwen 27B. It did have to sift through a lot of stuff to get there, so context is currently compacted once (at 80% of 256k) and now onto 73.9k.
When the prefix cache hit rate is high and its turn by turn, it only shows 250tps prefill but thats bc the prompt was tiny. But impressively, these machines do have serious compute. It’s not a blackwell 6000, but I’d estimate that in TP w VLLM two sparks have the compute of 1/2 an rtxpro6000.
Since I bought the two machines on sale, and together they cost…somewhere in the mid 5000s…that’s acceptable to me. I didn’t have a build to stick an rtx6000 in, and these are two separate computers.
How do 250B+ models fare on a single rtx6000?
FullOf_Bad_Ideas@reddit
Vllm prefill and decode speeds aren't good measures, you need to record them outside, through api consumer. For example through llama-bench.
I'd be curious to see your llama-benchy numbers for Qwen 3.5 397B at 4k, 16k, 32k, 64k, 128k context depth and i can share some of mine (397b 3.5bpw exl3 on 8x 3090 ti)
unjustifiably_angry@reddit
At 100K depth Int4 Qwen397B, Spark gets 228 or 1038 PP depending on what specifically you're measuring:
https://spark-arena.com/benchmark/a1d580bb-9d05-4831-a558-c8d02438747c
stoppableDissolution@reddit
I've not tried any of the big moes, so idk. That machine only has 32gb of system ram (lol), and in my past experience moes disintegrate real bad at ultra low quants
Organic-Thought8662@reddit
Looking at that link, the m5 was 300pp, 2.3tg for 128k ctx. That is WAY slower than even my RTX PRO 5000.
I just ran a benchmark with Gemma4 31b Q8 (same quant as in the linked test) and for 128k ctx i get the below:
at 65K ctx, i get
john0201@reddit
You’re not comparing the same benchmark.
In practice there is no x percent faster since they are different platforms. The raw matmul units in an m5 max are about the same as a 5070 or spark.
If you’re doing something like running a model that won’t fit in memory obviously the Mac would crush an rtx 5000, and if you’re doing something that uses fp4 not available in the m5 max the reverse is true.
It is not a coincidence that the 5090 mobile (which is basically a 5070) and m5 max are similar in performance- that’s the most you can stuff into a laptop at the current process nodes. Apple is on a newer process and wins by a big margin on performance per watt (you can’t even use a 5090 mobile at max performance without a power brick).
john0201@reddit
This isn’t really an answerable question without knowing the model and how it’s served.
Generally it’s about half as fast as a 5090.
AdDizzy8160@reddit
and you can connect two Sparks with the internal ConnectX-7 NIC ports. This is often overlooked: Using tensor parallel increases the processing bandwidth or speed by approximately 1.8 times. At the same time, you have 256 GB of unified memory. This makes the Spark more future-proof.
PentagonUnpadded@reddit
+1 to the ports.
If you try to build a Strix cluster to match a DGX cluster, you'll get similar cost thanks to the networking. A 3k Strix plus a $500 networking card is quite similar to a DGX.
If you buy one today and MIGHT add a second later, DGX wins. If you use any nvidia specific tech like Cuda or Int4, the DGX wins. If you need x86, PCIe expansion, want a faster processor or better general linux support the Strix wins.
Tai9ch@reddit
You can direct thunderbolt together a pair of Strix boxes.
unjustifiably_angry@reddit
Spark has a lot of failings but one thing Nvidia got right was predicting sparsity taking over, so memory bandwidth isn't as critical as it once was, and its native FP8 acceleration. Combine the huge RAM and FP8 and you get something ideal for large, accurate sparse models.
Its advertised FP4 performance still seems to be a scam, though Avarok/Atlas has been making progress on making that useful. I'm still not sure I trust FP4 to be accurate enough, at least on a consistent basis, because for it to be useful it needs quantization-aware training and I don't automatically trust nobodies on HF to get that right.
Miserable-Dare5090@reddit
This is the way. You need 2 sparks to see why some of us are converts.
Bonteq@reddit
Are you personally fine-tuning models to your needs? What sort of training are you all doing? And are you seeing legitimate improvement that justifies it?
stoppableDissolution@reddit
Its more of a hobby than something practical, really. Times when any guy could improve a modern model in general capabilities are long gone. I do experiment with RP tuning with some limited success (not good enough to publish, but the direction seems to be correct), and I made a couple of tiny single-purpose models in the past, but I'd hardly call it a return on investment. It is satisfying to tinker with tho.
logos_flux@reddit
Should factor in blackwell nvfp4
john0201@reddit
DGX spark is an RTX 5000
thedudear@reddit
Rtx pro 5000 is ~67 TFLOPS vs 5070s 31 TFLOPS, they aren't remotely similar. Even a 5080 comes in at 56.
john0201@reddit
Yes they are, I own an M5 max and a 5090.
TFLOPs of what? It looks like you’re quoting GPU cores for fp32? And even that appears wrong.
thedudear@reddit
What you own is irrelevant. I too have a 5090, pro 6000, on an Epyc Turin. Who cares.
Those are single precision TFLOPS. But it doesn't matter which way you cut it, if you're comparing two cards within the RTX Blackwell line it scales the same be it BF16, fp8, fp4, etc.
The 5070 is, in every way, no matter what metric you even choose to cherry pick, incomparable to the pro 5000. Not even close. I'm not even sure why you'd compare them, much less defend such a comparison by saying you have a 5090 and M5 max.
john0201@reddit
It’s relevant because I have actual performance numbers.
Single precision means fp32. It most definitely does not scale linearly; this is a pretty basic concept. Maybe put some of this in your favorite language model, since you don’t seem to have a basic understanding of matrix hardware.
thedudear@reddit
Y'know, just for fun, I did some digging and found that Nvidia halves the fp32 accumulate for the consumer class cards. I was wrong, you're even further off than I thought. The 5070 is 2 to 4x slower than the pro 5000.
Otherwise, the TFLOPS scales the same across all data types except where accumulate is in fp32 (with the exception of fp4). Note I did not say linearly, I said the same, as in BF16 and fp16 is going to be 4x fp32, fp8 is 8x fp32 (accumulating in fp8), etc. Broad rule of thumb with some nuance. But the nuance isn't in your favor.
Your actual data saying otherwise, given your track record, is almost certainly in your experiment design or some other bottleneck. A 5070 just isn't in the same class as you're suggesting.
john0201@reddit
I’m glad you’ve looked up what fp32 accumulate is. You’re partially repeating what I just said, but have misunderstood the important part.
This is all basic info on how this hardware works. If you don’t understand it you’ll have a hard time interpreting marketing numbers, low precision integer vs fp, dense vs sparse, cache differences (which can be huge, that is a big differentiator in the B200).
starkruzr@reddit
it's a little frustrating that even with small dense models (looking at you, Qwen 27B) STXH is kind of a dog, IIRC.
ayu-ya@reddit
I need something portable (so not larger than a laptop) and plan on Spark mostly because I want to use it for both LLMs and i2v and from what I looked up, Macs are insanely slow with video gen. If that changed with M5 I wouldn't mind grabbing a 128GB pro instead, I just really can't have one vid taking an hour. If anyone knows if there's much of a difference between M4 and 5 here, please let me know.
Strix could be my 'oh damn I need it NOW, all APIs suitable for my use cases died and my gaming PC exploded but I haven't saved enough yet' emergency option. Can't even dream of big boy GPUs when I need my AI machine to fit into a carry on
unjustifiably_angry@reddit
I don't know if you can run video generation across two Sparks but if you ever plan on using them for LLM you should plan to buy two. Everyone does. A single unit is just too crippled.
FerLuisxd@reddit
5.5k is better than 4k lol
FatheredPuma81@reddit
As a PC builder any system that you can't upgrade sucks imo. I get the appeal of mini PC's with a ton of RAM for AI but I would still rather have a real system I can upgrade/downgrade to suit my needs as they change. Those mini PC's won't change and in the case of at least Apple (haven't checked the others) you can't even upgrade the storage which to someone who spent $400 to get 7.68TB of Gen4 NVME storage is absurd.
MaruluVR@reddit
Yeah but what if you buy the mini pc instead of a gpu you attach to your pre existing pc, you could say gpus suck because they arent upgradable, the vram and main gpu chip are soldered on. In the end its all about perspective.
I personally am running a mini pc with 4 eGPUS as my AI rig though lol
DinoAmino@reddit
Right. The perspective is that GPU internals aren't replaceable. You upgrade them by taking out the old and installing bigger ones. Same exact perspective for the CPU. Can't add cores to it. Need to remove the old one and replace with a "bigger" CPU (that the same mobo can handle, otherwise... )
So, yeah. Depends on how you want to twist up your perspectives to align with your biases.
MaruluVR@reddit
For a mini PC to be useful for AI the memory needs to be fast which means it needs to be soldered on, you know like a GPU. I agree that soldering on the SSD is stupid but thats just apple and not the other AI mini PCs out there.
Also I am more looking at it from the perspective of having to choose between buying a GPU or a mini PC and if you are in it for AI and are looking at current prices of for example the 5090 "upgrading" makes as much financial sense as rebuying a mini pc. IE I dont feel like mini pcs are a bad investment even if you replace them every few years.
sheijo41@reddit
I’m curious to know more about the egpu stack, are you running different models on each?
MaruluVR@reddit
I am using a single egpu adapter that goes to a plx pcie splitter that can connect 4 gpus at 8 pcie lanes each. The GPUs are connected with only 2 lanes to the mini pc but they can communicate with each other at 8 lanes bidirectionally using the nvidia p2p driver.
I have 2 3090s and 2 modded 3080s, the 3090s are running a q8_0 of gemma 4 with unquanted cache at ~125 t/s which gets used for a voice assistant so it has to be loaded 24/7, the 3080s are for whatever I feel like at the time.
Best thing is since it is usb 4 I can fully power down the gpus by ejecting them and idle at 0w when there is a long down time between requests. (with Gemma E4B loaded on CPU in the mean time and once a query is made the gpus wake up automatically)
FatheredPuma81@reddit
GPU's do suck because they aren't upgradable :). I would love nothing more than to slap 96GB onto my 4090 right now.
Ok_Scientist_8803@reddit
Plenty of 48GB cards on taobao right now, I heard that they are pretty solid but haven't tried it myself (they are definitely expensive).
Socketed vram will almost certainly not exist until the slower running LPCAMMs become tried and tested, and then reworked to handle the crazy signals around GPUs.
FatheredPuma81@reddit
You can also buy 22GB RTX 2080 Ti's off Taobao for like $300 (excluding taxes, shipping, tariffs).
Yea sadly I don't see it ever happening😞
FullstackSensei@reddit
Would be nice, except we haven't figured out how to do that with current tech. GDDR6 runs 2-2.5 times as fast as the fastest DDR5 RAM you have, and GDDR7 pushes that beyond 3x. The current state of science and technology hasn't figured out yet how to run modular memory at such speeds, while making the replacement process consumer friendly.
FatheredPuma81@reddit
Even if the performance is worse I think many would still opt to use them though. And there should be options that are in between RAM sticks and Soldered memory. Something like modular RAM chips that socket in similar to a CPU.
unjustifiably_angry@reddit
Normally I'd completely agree with your position but for specialized hardware there's often no way to make it modular without massively increasing price or reducing performance, and in the current market it's not like adding extra RAM would make sense anyway. The Macs and Sparks can be stacked to increase RAM capacity and speed, so there's that at least.
FullstackSensei@reddit
I'd say it all depends on price.
The thing that irks me with these comparisons is how people utterly ignore price, as if everyone has $20k to throw at hardware.
The M5 Max 128GB costs about twice as much as Strix Halo, and the RTX Pro 6000 costs another 2k on top of the Mac. This is as surprising as finding out an F-18 is faster than a private jet, which in turn is faster than a propeller plane.
FatheredPuma81@reddit
Oh yea for sure but right now because I already have a PC if I want to run a 122B model at 4 bit I can just toss another 48GB kit into my system for (way way way too much imo) cheaper than any of these mini PCs and probably get better speed than the Strix. Which basically reaffirms my position because I didn't build this PC with AI in mind at all but I can easily do a couple upgrades here and there and use it for AI. If I didn't have a RAID card using my x4 slot I'd probably throw a cheap GPU in instead but I could always sell the RAID card or buy some dirt cheap parts and move the array.
FullstackSensei@reddit
It's not "just" if you're throwing another 1k or so to get the RAM.
The price of 48GB kit can get you a DDR4 Xeon with at least double the RAM. Said Xeon, while being 7-10 years old, will still be faster than anything you might have in your PC.
Don't treat it as an ideological or political choice. Just shop for the best value for your money, whatever that is.
FatheredPuma81@reddit
That is an ideological view and I'd be inconveniencing myself to a stupid degree if I followed it.
That's $500-$600 worth of brand new RAM vs swapping my 4090 back and forth between my Gaming PC and $600 AI PC or $300 PC + $300 GPU which would be outright worse (though it would be convenient). That's just idiotic. And I wouldn't own a 4090 I'd own an RX 5700 if I followed that blindly.
The ideology I'm pushing is versatility. I bought a Gaming PC, turned it into a Gaming Storage PC, and soon it'll be a Gaming Storage AI PC or maybe I'll spin the Storage and AI portions into their own Best Value PC(s). Which I can do because it's versatile.
neopolitan77@reddit
I'm also sad about the trend towards less modular hardware, but on a technical level it makes sense. You gain something by giving up replaceability, and for regular folks who have a lot of other considerations to keep in mind (budget, noise/heat,...), a SoC with soldered connections may be the best choice.
ThisWillPass@reddit
I mean, maybe? I think we are end game for consumers, feature set is not getting any smaller. With capable ai (not now), I expect software to be hardware agnostic.
Signal_Ad657@reddit (OP)
I totally agree, my primary machine is a custom built tower with Blackwell workstation cards. Being able to swap parts and upgrade / customize is huge.
MrPrevedmedved@reddit
DGX Spark is Nvidia B100 devkit. You can't compare its value just by benchmarking memory bandwidth.
unjustifiably_angry@reddit
Honestly it's not, stuff written for B100 can't run on Spark and stuff written for Spark won't be optimal for B100. Nvidia gimped the cache sizes or something... this is why it's taking so damn long for community efforts to make it run as advertised.
iamegoistman@reddit
Eventually laptops or other mobile devices will throttle and slow down. If you want to work alongside a live model while doing daily tasks, that might be a problem (I assume, if you don't have the 128GB version). Apple still uses unified memory, and that's not a good idea for working with big AI models because the model shares the same memory pool as the system. I have a 48GB macbook m4 pro max ultra but the fan noise annoys me. I'll probably switch to a mini PC or a DGX Spark as a remote device. Especially if local models can't handle advanced tasks at smaller parameter counts, memory will remain the most important problem for local models. I think the DGX is an investment for the future.
celsowm@reddit
Can we run something like vLLM or SGLang for concurrent users on M5? Honest question
insanemal@reddit
Oh just hook two M5s together for more performance.
Oh what's that? No RDMA support? Oh. Ok then.
Miserable-Dare5090@reddit
? There is rdma support, not in the way that you can link up linux boxes with mellanox cards, but no one else has implemented TB/RDMA other than the mac devs. To your point, only for TB5 which is completely arbitrary and hopefully at some point they’ll expand to TB4…
eidrag@reddit
Can't the M5 with TB5 support RDMA now?
Alive_Ad_3223@reddit
How dare you add the RTX 6000 Pro to the comparison list?
FineClassroom2085@reddit
As a 128GB M4 user who regularly uses my dual RTX 6k rig for inference, I have to say, Mac is the better buy right now unless you're doing agentic coding. Part of the reason is there are no perfect models for the 192GB of VRAM. Gemma 4 and Qwen 3.6 27b are beasts for their size, but run just as well on a 5090 as they do on my 6ks. The Mac is much too slow for real agentic work with either of these models. Currently the best model (intelligence + speed) for the dual 6k rig is Qwen 3.5 397b, and it's good, but not frontier level.
If I could afford 1tb of ram the model options would certainly open up a bit.
Feeling-Creme-8866@reddit
"a lot of online discussions"
Apple vs Kiwi vs Orange vs Creme Brulee
Posts like this really hurt, because they completely miss the mark on absolutely everything.
pfn0@reddit
How about comfyui performance Mac vs. GB10? LLMs aren't the only things that the GB10 does "well"
ManySugar5156@reddit
Prefill vs decode numbers matter a lot, but overall seems like RTX6000 wins for big models while M5 holds steady when you’re VRAM bound.
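A back-of-the-envelope sketch of why decode tracks memory bandwidth: for a dense model, every generated token has to stream roughly all the weights once, so bandwidth divided by model size gives an upper bound on single-stream tokens per second. The bandwidth figures are the approximate ones from the post; the 20 GB model size and the function name are assumptions for illustration. Prefill is compute-bound and is not captured by this estimate, and MoE models read only their active parameters per token.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode tokens/s for a dense model.

    Assumes every weight is read once per token and nothing else
    limits throughput. Real numbers will land below this.
    """
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Approximate bandwidths quoted in the post (GB/s)
    machines = {"RTX 6000": 1800, "M5": 600, "Spark / Strix Halo": 256}
    model_gb = 20  # hypothetical quantized dense model size
    for name, bw in machines.items():
        print(f"{name}: <= {decode_ceiling_tps(bw, model_gb):.1f} tok/s")
```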
WiggyWongo@reddit
Me sitting here reading through as if I will ever touch either of these things in my life
Trick-Assignment-828@reddit
cool, i cant afford none!
Miserable-Dare5090@reddit
You omitted prefill numbers from comparisons…
ieatdownvotes4food@reddit
the flaw here is each platform has unique strengths depending on the architecture behind the specific goals.
philmarcracken@reddit
lol
eat_my_ass_n_balls@reddit
Did you do any pre-fill/cache optimizations?
dllu@reddit
john0201@reddit
You’re comparing a laptop with an Nvidia development desktop machine that runs a custom build of Ubuntu.
Who is the one who can’t read?
ATK_DEC_SUS_REL@reddit
With my spark, I don’t have to rely on an abstract framework when I can modify model architecture directly because cuda is easy to manipulate.
shanehiltonward@reddit
Run Colmap. ;) When that fails, run Metashape. ;) When that fails, run WebODM. ;) When that fails...
99OBJ@reddit
Do you know what sub you’re in?
shanehiltonward@reddit
Just showing that if all you are going to do is run certain LLMs on a system with the most limited software choice on Earth, Mac is awesome.
99OBJ@reddit
This sub is literally dedicated to running local LLMs. None of the software you mentioned is even remotely related to that.
Furthermore, it’s ridiculous to suggest that Mac software choice is that limited.
astrogod91@reddit
What does fine-tuning or even pretraining look like with, say, 1B tokens for a 100M-parameter GPT-2, for example? That moves from the memory-bound to the compute-bound realm. The M5 is expected to be significantly slower than even an RTX 5090.
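To put a rough number on the compute side, a common rule of thumb is that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below applies that to 100M parameters and 1B tokens. The sustained throughput value is a placeholder, not a measured figure for any of these machines, and the function names are made up for illustration.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the ~6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def training_hours(flops: float, sustained_tflops: float) -> float:
    """Wall-clock hours at a given *sustained* throughput in TFLOP/s."""
    seconds = flops / (sustained_tflops * 1e12)
    return seconds / 3600.0


if __name__ == "__main__":
    flops = training_flops(100e6, 1e9)  # 100M params, 1B tokens
    print(f"total compute: {flops:.1e} FLOPs")
    # 50 TFLOP/s is a hypothetical sustained rate, not a benchmark result
    print(f"at 50 TFLOP/s sustained: {training_hours(flops, 50):.2f} h")
```

Plugging in each machine's actually achieved training throughput would give the comparison; that number is what the decode benchmarks in this thread don't tell you.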