Help me choose: Unified Memory (Apple Silicon) or 64GB DDR4 for a Budget Home AI Server?
Posted by khazenwastaken@reddit | LocalLLaMA | View on Reddit | 24 comments
Hi folks, I’m a CS student looking to set up my first local LLM server. My goal is to run agents for automation and get help with coding/debugging. Since I'm on a budget, I have to decide between raw capacity and memory bandwidth:
Mac Mini M1 (16GB) / M2 (24GB): Fast inference thanks to unified memory, but very limited in terms of model size.
Refurbished Mini PC (e.g., i5-8500T) with 64GB DDR4: Slow memory, but I can fit models with far more parameters, or run the same models at higher precision.
The Trade-off: I don't mind waiting a bit for the output, but I'm terrified of being stuck with "dumb" models due to the 16GB-24GB RAM limit. Would a larger model running slowly on a 64GB Mini PC be more useful for complex coding than a fast but small model on a Mac?
What’s the sweet spot for a student budget? Speed or VRAM?
FusionCow@reddit
honestly if you get 64gb of ram, you're going to be running at like 0.5t/s speeds, and if you get 16/24gb of ram with apple silicon, the speeds will be alright, but as you said you'll get stuck with dumb models. your money is better spent either saving for something like a 64/128gb mac, or just paying for an API
khazenwastaken@reddit (OP)
Bro, 64/128GB Macs are too expensive for me. I'll be buying either the Mac or the 64GB system second hand, so I just need to decide between those two.
l33t-Mt@reddit
Could always slap some older Nvidia P40s in it over time.
Around $250 on eBay.
VoiceApprehensive893@reddit
what are the speeds for a P40?
like 27b dense / 26b MoE?
JacketHistorical2321@reddit
You'd be wasting your money in either case based on what you want to do. That's the hard truth dude.
UnWiseSageVibe@reddit
You can find the 64GB ones for 1000-1500 used.
PracticlySpeaking@reddit
64GB M1-M2 Mac Mini do not exist — 24GB was the limit.
You aren't getting anywhere near Mac Studio with 64GB for that kind of money, either.
https://imgur.com/a/mac-studio-m1-max-64gb-1tb-66HkYK8
UnWiseSageVibe@reddit
I mean I wasn't referring to M1-M2 just any in general at 64GB
FusionCow@reddit
then yeah I mean it's not worth it for either honestly
PracticlySpeaking@reddit
I have to agree with u/FusionCow — you are better off with subscriptions or APIs to do useful work. The PC without a GPU is going to be close to useless.
If you just want to learn, get the Mac. You will be running small (9b) models, but the RAM requirements are coming down all the time and capability is going up.
YT_Brian@reddit
There have been multiple advances lately, thanks to Google, that reduce the RAM needed for a given capability. Give it a year or two and I wouldn't be surprised if most LLMs end up using that kind of technique, so smaller LLMs won't be as low quality as they are now.
Right now it might be an annoyance but in the future it shouldn't be as much of one.
Still, yeah: if OP can save for a few more months or a year to get a 64/128GB unified-memory machine, that would clearly be far better.
mail4youtoo@reddit
Would help if you mentioned what your budget is
khazenwastaken@reddit (OP)
I can't easily explain because I live in Turkey and there are too many taxes, so any budget I gave you probably wouldn't translate. Just compare the M2 24GB vs. the DDR4 64GB system, please.
KFSys@reddit
Honestly, both options have tradeoffs. The Mac will feel much faster and smoother, but you'll hit that memory ceiling pretty quickly once you try anything beyond smaller models. The 64GB box is the opposite: slower, but with way more room to experiment with larger models.
If your goal is learning and trying different things, I’d probably lean toward the 64GB setup. Even if it’s slower, not being constrained by memory is a big deal when you’re figuring stuff out.
Another option I’ve used is just offloading the heavier stuff to cloud GPUs when needed. That way, you’re not locked into whatever hardware you buy — you can test bigger models occasionally without committing upfront. I’ve done that with DigitalOcean GPU instances and it works fine for that kind of “burst” usage.
90hex@reddit
I'd go with the Mac, if the budget is the same. I have a MacBook Air M2 24GB, and it runs some of the best models really fast. Say, Gemma 4 26B A4B runs at 25 tok/s in MLX on that little thing. It's pretty awesome.
No amount of RAM on a PC will help with inference speed, unless you can stick a decent GPU in there. I'd go with a used 32-64 GB RAM Mac, and M2+ CPU, but at this price you can find a really nice used gaming PC with a GeForce and 16GB of VRAM.
And in a PC, you can always add more RAM...
No
ClickClawAI@reddit
Sell your possessions and buy a Mac Studio lil bro
evia89@reddit
Just use free Google Colab to experiment
If u need an LLM, buy a sub like nano gpt at $8
Or use free Gemma 4 at 1500 requests/day
vick2djax@reddit
The thing about running on Apple unified memory is that you have to share it with the rest of the system. I have a 36GB M3 Max and ended up punting on using it for anything AI, despite trying hard not to. I was getting maybe 26 or 28GB of the 36GB usable for AI, and then the rest of the computer was so slow/frozen it wasn't usable.
The Mac advantage only comes out when you’re in the high end of RAM. Like 128 GB. Be prepared to be disappointed. I was. Thankfully I got an Unraid server with a 7900XT with 20GB VRAM that demolishes the 36 GB M3 Max. Which was not my expectation.
Randomshortdude@reddit
I don't usually take the time to respond to too many posts here on Reddit - but I felt compelled in this instance because it seems like you may potentially make a really bad decision (especially if you go off of the feedback of the other commenters here).
To begin with, the DDR4 setup you mentioned (as an alternative) handicaps you out of the gate because the inference speeds you can obtain will forever be inferior to those of the Mac M2. Even though there is technically more physical memory you can leverage, the benefit you get from it is almost nil ~~because DDR4 memory simply isn't fast enough to give you inference speeds anywhere near what you can expect from a GPU (with VRAM)~~ [edit: not necessarily the bottleneck here; memory bandwidth will fuck you up way before we even get to the clock speed of those RAM sticks]. With an i5-8500T processor, I'm assuming you're further limited to DDR4 with a clock speed around 2400-2800 MHz.
Apple's unified memory setup means that the RAM in that Mac mini will be used as though it were VRAM (which makes a huge difference). For the sake of this comparison, I'll assume you want to run the Qwen3.5-27B model locally, quantized down to 4-bit, which will take up 13-14GB(ish) of your available RAM. That leaves you with roughly 10GB of RAM (on the 24GB Mac M2) for the KV cache. With the help of a few compression methods out there (TurboQuant is out of the question here with no CUDA), you should be able to fit a decent context length for that model without any worries (talking a 32K context length here; it can go higher, but you shouldn't even have to think past 32K).
With that setup, you'd be lucky to eke out more than 1-2 tokens/second on the DDR4 setup you described. With the Mac M2 mini-PC, ~~getting 20-30 tokens/second is more than practical~~ (edit: no it's not if you're using the 27B param model, due to bandwidth constraints - the real number would be closer to 7-8 tokens/s; however, that's still vastly better than what you would get from the i5-8500T, so the point remains). One additional factor you're not accounting for: the Apple M2 mini-PC you're considering comes with an 8-core processor that whips the i5-8500T in benchmarks (remember, the 8500T is an 8th-gen Intel processor released 8 years ago). The M2 is built on 5nm silicon vs. the humongous 14nm node the ancient i5-8500T sits on. On top of that, the M2 comes with a 10-core GPU as well (which is substantial for LLM hosting & inference, considering the billions of matrix multiplications that must be performed).
Memory Bandwidth > Memory: The DDR4 System is Vastly Inferior to the M2
Your idea that there is more "memory" with the DDR4 setup is true. But you have to remember that the reason VRAM > DDR4/DDR5 for LLMs comes down to inference. The speed of token generation (decoding) is limited by memory bandwidth, not by the amount of available RAM. Think of the sink analogy: yes, you may have a bigger sink, but if your goal is to drain water as quickly as possible (i.e., spit tokens out), then the size of the drain matters far more than the size of the sink (give or take; don't overly scrutinize the analogy, you get the gist).
To show you how this impacts things, let's assume you take a 32B param model and quantize it to 4 bits. That equates to ~18GB of storage, right? Subtract that from 64GB of RAM and you have 46GB left - seems sweet, right? However, inference speed is roughly memory bandwidth divided by model size. We determined that a 32B model quantized to 4-bit takes up ~18GB, so that's our denominator. For DDR4-2666 RAM, you're optimistically looking at 36 GB/s of bandwidth (maxing out the theoretical bandwidth is unrealistic; I only tacked on a ~10-15% dropoff from benchmark maximums). With that math, you're looking at a max possible generation speed of about 2 tokens/second (in the absolute best case).
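The back-of-envelope math above can be sketched in a few lines (the bandwidth figures and the quantization overhead factor are my own rough assumptions, not measurements):

```python
# Decode is memory-bound: for a dense model, every weight is read once per
# generated token, so tokens/s <= usable_bandwidth / model_size_in_bytes.

def quantized_size_gb(params_b: float, bits: int, overhead: float = 1.1) -> float:
    """Approximate in-RAM size of a quantized dense model, in GB.
    `overhead` is a guess covering scales/zero-points and unquantized layers."""
    return params_b * bits / 8 * overhead

def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound dense model."""
    return bandwidth_gb_s / model_gb

model_gb = quantized_size_gb(32, 4)    # ~17.6 GB for a 32B model at 4-bit
ddr4 = max_tokens_per_s(36, model_gb)  # dual-channel DDR4-2666, ~36 GB/s usable
m2 = max_tokens_per_s(100, model_gb)   # base M2 unified memory, ~100 GB/s spec

print(f"model ~{model_gb:.1f} GB, DDR4 <= {ddr4:.1f} tok/s, M2 <= {m2:.1f} tok/s")
```

These are ceilings, not expected speeds; real numbers come in lower once compute and cache effects bite.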
Prefill Makes the i5-8500T Impractical; Decoding (Inference) Will Be Orders of Magnitude Slower than on the M2
What's crazy is that the above isn't even the biggest bottleneck of your DDR4 setup. We have to consider the age & capability of the processor in that system (the i5-8500T, an 8th-generation Intel desktop processor). This is a T-series desktop chip that isn't designed for heavy AI workloads; it's designed to operate under a 35-watt TDP limit. And in addition to having fewer threads than the M2 (the M2 has 8; the i5-8500T has 6), it only comes with 6 cores. This matters because later Intel generations actually use hyper-threading, so a later chip like the i5-10500T would have given you 6 cores and 12 threads (versus just 6). Your processor will actually become the bottleneck before you even run into the RAM limitations we discussed.
If you're wondering why: we're asking this processor to run a 32B model stored in RAM at 4-bit, and the i5-8500T can't do math on 4-bit numbers. It has to stream that 18GB quantized model from RAM through the CPU caches (which are also substantially smaller than the M2's L1 + L2), dequantize the weights back to 16-bit or 32-bit floating point so the math (dot products) can be performed, and then discard those intermediates. As if that wasn't enough, this Intel processor doesn't have the AVX-512 extension to its instruction set. Some may argue with me about this, but take that to mean your processor can't do efficient math on 8-bit values either, and it's at least 2x slower than the later Intel processors that can handle this type of math.
Conversely, the M2 chip (in the Mac mini) handles 4-bit quantized weights out of the gate. That alone is a major game changer (putting all the other M2 enhancements to the side). And we're only talking about decoding at this point - we haven't even addressed the processing of the actual prompt.
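To make that dequantize-then-compute round trip concrete, here's a toy sketch (my own illustration, not code from any real inference engine; real kernels do this blockwise with per-group scales, but the extra unpack/dequant step is the same idea):

```python
# A CPU with no native 4-bit support must unpack and dequantize every weight
# back to float before it can do the dot product - paying extra memory traffic
# and compute for the conversion on every single token.

def pack_q4(weights, scale):
    """Quantize floats to 4-bit codes (0..15), packing two codes per byte."""
    codes = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def dequant_dot(packed, scale, x):
    """Unpack to floats, then dot - the unpack/dequant step is the extra cost."""
    w = []
    for byte in packed:
        w.append(((byte >> 4) - 8) * scale)    # high nibble
        w.append(((byte & 0x0F) - 8) * scale)  # low nibble
    return sum(wi * xi for wi, xi in zip(w, x))

packed = pack_q4([0.0, 0.1, -0.2, 0.3], scale=0.1)
print(dequant_dot(packed, 0.1, [1.0, 1.0, 1.0, 1.0]))  # ~0.2
```

Hardware (or instruction sets) that can consume the low-bit codes directly skips the expansion step entirely, which is the advantage being described.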
TTFT (Time to First Token) Speeds for i5-8500T Could be Minutes in Some Cases
Let's go back to the 32B param model example (using this because you stated that the additional memory/headroom was a motivating factor for choosing the i5-8500T over the Mac M2, so it only makes sense to hypothesize a setup where you leverage this supposed advantage).
As we noted before, it's going to take up to 18GB of RAM (quantized at 4-bit). However, the prefill (actually receiving the prompt and 'understanding' it) is largely compute-limited. For your i5-8500T setup you didn't mention any GPU being included (and if there were one, I doubt it would move the needle much here). So we're relying entirely on a 2018 35W desktop processor to compute millions of matrix-matrix calculations during prefill.
Optimistically (and I mean best case), you'll be sitting at your computer for a solid 5 or so minutes before the first token even appears. And when the tokens do start appearing, they'll come at roughly 0.5-1 token/second. This would not be the case for the M2 - at all.
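As a sanity check on that TTFT guess: prefill for a dense model needs roughly 2 × params FLOPs per prompt token, and dividing by a sustained-throughput figure gives an order-of-magnitude estimate. Both throughput numbers below are my own guesses, not benchmarks:

```python
# Order-of-magnitude time-to-first-token estimate for a dense model:
# prefill FLOPs ~= 2 * n_params * prompt_tokens, divided by sustained throughput.

def prefill_seconds(params: float, prompt_tokens: int, sustained_gflops: float) -> float:
    flops = 2.0 * params * prompt_tokens
    return flops / (sustained_gflops * 1e9)

params = 32e9   # dense 32B model
prompt = 500    # a modest coding prompt

i5 = prefill_seconds(params, prompt, 100)    # guessed sustained AVX2 CPU throughput
m2 = prefill_seconds(params, prompt, 2000)   # guessed sustained M2 GPU throughput

print(f"i5-8500T ~{i5/60:.1f} min to first token, M2 ~{m2:.0f} s")
```

Even with generous numbers for the CPU, a modest 500-token prompt lands in the minutes range on the i5 versus seconds on the M2, and a longer prompt scales that gap linearly.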
eGPU Option Now Available for Mac
Recently (and I mean within the last week or so), driver support landed making the M2 compatible with AMD and NVIDIA cards, so eGPU hookups are now in the field of play. Luckily, the M2 mini-PC has two Thunderbolt 4 ports (though it can likely only handle one eGPU at a time). The generation of Thunderbolt matters for actual inference speed in this setup. Your i5-8500T can only handle PCIe 3.0, and since there will likely be no Thunderbolt 3 ports for connection, you'd have to open up the Dell/PC and manually hook up any eGPU setup you want - but that would all be for nil, because you'd hit overhead just from swapping data back and forth between the eGPU and the CPU. Unified memory largely eliminates that bottleneck, so the limit would only be the Thunderbolt 4 bandwidth (40 Gbit/s, so roughly 5 GB/s, I believe).
Conclusion
In no universe should you ever consider getting the i5-8500T over an M2 if your only consideration in making the decision (between one or the other) is local LLM hosting and inference.
Anyone telling you otherwise has no clue what the fuck they're talking about. Respectfully.
CooperDK@reddit
For computer science, you seriously do not want Apple products, especially if you want to work with AI.
DoorStuckSickDuck@reddit
Macs are objectively the best for software dev, and it's not even close if you can afford one. Speaking as someone who has worked at companies using both Linux and Macs.
cakemates@reddit
Well, the Mac is gonna leave you stuck with tiny models that don't perform great... and DDR4 just flat out sucks at AI without a GPU. I would not spend my money on either of those options.
dinerburgeryum@reddit
Woof… neither is gonna be particularly good, but if it were me I’d choose the M2. (Skip M1.)