Why should I **not** buy an AMD AI Max+ 395 128GB right away?
Posted by StyMaar@reddit | LocalLLaMA | 546 comments
With the rise of medium-sized MoE models (gpt-oss-120B, GLM-4.5-Air, and now the incoming Qwen3-80B-A3B) and their excellent performance as local models (well, at least the first two), the relatively low compute and memory bandwidth of the Strix Halo doesn't sound like much of a problem anymore (because of the low active parameter count), and 128GB of VRAM for $2k is unbeatable.
So now I'm very tempted to buy one, but I'm also aware that I don't really need one, so please give me arguments about why I should not buy it.
My wallet thanks you in advance.
atlantageek2@reddit
Just found this thread. These machines have gone up $1k+ since December. I wonder whether OP regrets buying, or is glad they did.
StyMaar@reddit (OP)
Responding from my Bosgame M5, I can tell you that I'm very happy I bought it before it went up.
Now I'm eagerly waiting for the 122B version of Qwen3.6 (I've been using Qwen3.5-122B extensively since its release and I'm very, very happy with it).
Silver-Chipmunk7744@reddit
Take my answer with a grain of salt because I am myself wondering the same thing, but...
I think for low/moderate use, it's probably cheaper to just use cloud-based options to run the big open-source models. But local has the advantage of 100% privacy, and for heavy use it may end up cheaper in the long run.
I also think waiting if possible makes sense, since prices will go down and new tech will eventually come out.
StyMaar@reddit (OP)
That's a good argument, $2k is like 1000 hours of rented H100 in the cloud.
But it also seems to be much more work to manage the LLM running on your own rented GPU, rather than just running llama.cpp locally.
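For anyone checking the math, a minimal break-even sketch; the ~$2/hr H100 rate and 2 h/day usage are assumptions, so plug in your own numbers:

```bash
# Rough break-even: how many rental hours $2k buys, and how many
# days of use that covers at a given hours-per-day rate.
PRICE=2000 RATE_CENTS=200 HOURS_PER_DAY=2   # $2000 total, $2.00/hr assumed
HOURS=$(( PRICE * 100 / RATE_CENTS ))        # 1000 rental hours
DAYS=$(( HOURS / HOURS_PER_DAY ))            # 500 days (~1.4 years)
echo "break-even: $HOURS rental hours, or $DAYS days at $HOURS_PER_DAY h/day"
```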
DavidAdamsAuthor@reddit
This was kind of my problem.
I thought about getting an RTX 5090 for LLM stuff. But that's $4,599 in Australian fake money, or around $2,800 real dollars. For that price, I could rent one for what I worked out to be something like 2,000 hours. Sure, it would be cool for gaming, but I only have 1440p/144Hz monitors, so what was the point?
Assuming I used it for LLM work 2 hours a day (optimistic), that's three years' worth of rentals. In three years, the price of renting a 5090 will presumably go down (or for the same price, I could get something better), so the savings would compound; but in three years, if I bought a card, I would still just have a 5090.
So I bought a 5070 Ti instead ($1,500 AUD), and told myself that if I wanted more, I could spend $1,000 worth of rental credits and still be ahead by like $2,100. So far I've barely used $50.
Honestly if the price came down to below the price of my first used car I could consider it, I'd be content with paying the same amount as a decent cheap holiday; but $2k USD is a LOT for a toy that I just don't use that often.
ImmediateImagination@reddit
The price is now 4000 USD for the Framework Desktop with a standard set of SSD and cosmetics, so it's fast approaching a server-grade Nvidia GPU.
DavidAdamsAuthor@reddit
It's just crazy. That's a ridiculous price.
Ok-Possibility-5586@reddit
Dude that's way too rational.
Freonr2@reddit
Don't underestimate the pain/friction of spinning instances up and down.
StyMaar@reddit (OP)
That's what I'm saying here:
-dysangel-@reddit
You don't buy this necessarily because it's cheaper. IMO you buy it if you want to run regular experiments without worrying about racking up API costs, or if you want to be able to run offline etc. I think it's a good investment if you are seriously interested in running locally. Models and algorithms are only going to continue to get more efficient. So this machine should do you for a while. If you can wait though, it *will* likely be cheaper to just run in the cloud for now, and then buy in a couple of years once all the hardware providers are competing more on VRAM
EdanStarfire@reddit
A lot of my early Subagents experiments were done on local LLMs for exactly this reason. I knew I was gonna be throwing tokens out the door and could afford for it to be slower, so let it crank overnight on things I was working on instead of paying for who knows how much API usage.
Silver-Chipmunk7744@reddit
I'm also thinking of service providers that let you pay per prompt for any LLMs. For open source models it's often less than a cent per request. This is also insanely easy to use.
The problem is probably for people who want 100% privacy.
sudochmod@reddit
Aside from the LLM aspect. It's decent at running games and has a low power draw. I love mine. I hope people buy more of them so that AMD continues to make them.
clare64@reddit
did you have any issues or bugs thus far? and aside from gaming have u tried any local llm stuff?
sudochmod@reddit
I do a ton of local LLM and it’s great.
clare64@reddit
Awesome thank you for sharing!! Ever tried image or video Gen?
Educational_Sun_8813@reddit
i'll get my framework in october! and will start contributing to rocm solutions
pn_1984@reddit
Came back here after 3 months, crying. The RAM alone is now selling for what the whole mini PC cost then.
nickthecook@reddit
Look for a “what’s the best uncensored roleplay model” post from OP in a month. :P
Silver-Chipmunk7744@reddit
Of course lol
But tbh i don't think these service providers care about NSFW unless you are doing something super illegal.
my_name_isnt_clever@reddit
Except in the current American political climate payment processors are cracking down on "adult" materials. I wouldn't trust any US providers.
DavidAdamsAuthor@reddit
It depends on how NSFW.
Gemini is pretty easy to trick into writing the most insane smut you could imagine, but it also is just one model, and a model with some flaws too. Once you've gooned yourself to the point that your body is basically a dried raisin and you've misplaced gallons, you need fresher, more electric fields to explore.
DavidAdamsAuthor@reddit
The AI revolution is carried on the backs of heroic gooners.
rishabhbajpai24@reddit
I don't think the prices for these devices will go down in the US, if only because of the tariffs. Only Chinese mini-PC companies are selling them at under $1900; with future tariffs, the price will go up. US-based companies are already charging a lot.
false79@reddit
The Max+ 395 is probably the best CPU you can get in a small-form-factor device, especially since its single-core clock speed can knock out full desktop processors.
However, the reason I'm not getting it is that the price is premium and the memory bandwidth is a mere 256 GB/s.
My discounted AMD 7900 XTX has 3x the bandwidth. Sure, it's only 24GB, but there are a lot of useful models that fit within 24GB of VRAM.
Few_Size_4798@reddit
I have a 7900 XTX too, but it's like a stove, 300 watts and all that, and they promise quiet operation with it.
ImmediateImagination@reddit
You can downclock the GPU to around 100 watts; it's better for the hardware and still beats the crap out of the Max 395.
Vektast@reddit
just power limit it.
false79@reddit
What you running these days? I've always got qwen3-30b-a3b-thinking-2507 if not qwen3-4b-thinking
Few_Size_4798@reddit
mainly for SD and pixtral tasks
clare64@reddit
are there other options with higher gb/s currently available? or are there confirmed releases? curious what made you want to pass on 256 gb/s...
false79@reddit
I passed on the 256GB/s because I'm coming from the world of GDDR6's ~960GB/s memory bandwidth.
gpt-oss-20b flies at 170tok/s and is meeting 95% of my needs through thoughtful prompt engineering and context management.
tarheelbandb@reddit
What does that extra bandwidth actually get you in terms of productivity? I think it's an important question to answer, especially in the context of a single user.
false79@reddit
Uhh, like everything? As a coder, you need to iterate quickly back and forth. Right now, I offload 10% of my work a day to agents, sometimes 20%. What would take days, I have solutions generated in hours, by breaking a project down into much smaller problems a dense coder LLM can handle, while generating unit tests to go with the artifacts created. The ROI is paid off within a month, if not sooner.
Second to bandwidth (which is related to compute) is context. Having hit limits at 24GB, my appetite for more VRAM has increased, and along with it, one needs even more bandwidth to make effective use of it.
Educational_Sun_8813@reddit
Sorry, but the Strix has a TDP of around 120W, so you can have it running all day.
ttkciar@reddit
My generic advice regarding computer hardware is, "if you can wait, wait; in the long run, hardware prices go down and software support gets better. If you can't wait, then buy."
Since you say that you don't need this right now, you should wait. When you have a genuine need which requires new hardware, buy new hardware then.
cranberrie_sauce@reddit
> if you can wait, wait; in the long run, hardware prices go down
did you post this before ram and nvme prices skyrocketed 4x?
SpecialNothingness@reddit
Just two more years.... hold it... it'll get goood... totally worth it....
lmneozoo@reddit
Now 128gb of ram costs as much as these machines 💀💀💀
nashfrostedtips@reddit
Late post but I got mine for $3200 CAD in, I think, late November, it's now $4800.
Neborodat@reddit
That did not age well.
Em-tech@reddit
Glad we waited and saw prices go up significantly. I should have gotten in on the framework desktop q3 launch. Ughhhhh
roughseasbanshee@reddit
this is kinda the right answer. i've waited four years, i've waited six months, and either way, a new laptop that makes me salivate will be out in weeks. i've decided i'll buy it if i get this job. i also know that if i don't get it i'll wanna rope so i might buy it anyway to cope 😇
Iseus1024@reddit
this tip didn't age too well :D
Objective_Mousse7216@reddit
RAM going up and up in price.
night0x63@reddit
This is what I do... but not because I'm responsible... because I'm lazy... it is a pain to buy and replace one daily-driver PC with another.
HanZolo916@reddit
how about nixOS ;)
findus_l@reddit
About that...
ttkciar@reddit
Heh, yeah, I did not foresee hyperscalers buying all the RAM and sending prices soaring.
In the long run I posit that hardware will get cheaper, though, but maybe not for years now.
findus_l@reddit
Surprisingly, the Bosgame M5 (Ryzen AI Max+ 395, 128GB) doesn't seem to have increased in price yet. I'm seriously considering getting it before they notice... But why hasn't it? It's strange.
Daniel_H212@reddit
This honestly helps. I had the same question as OP and now I realize that yeah, I don't *need* it, so why not wait for better? The 395+ is, at the end of the day, a bit of a first-gen product; why should I be helping them beta test it? Who's to say there won't be better products with more performance, higher bandwidth, and more RAM coming out a year from now?
ASYMT0TIC@reddit
Excepting 3090's, which some people bought in 2020 for gaming, used for three years, and then sold for a profit. That flipped the script in a way I don't think we've seen before.
ttkciar@reddit
Yep, you're right about that. I've been in the hardware game for forty years, and that kind of reversal is rare.
Nevertheless, I posit that in the long run the trend is for hardware to get cheaper as it ages (until it becomes "vintage" or hard to find, at which point it starts getting more expensive again).
Purplekeyboard@reddit
That used to be the case, but hardware isn't really getting much cheaper or better over time any more.
bolmer@reddit
AMD's next APUs are going to be really good for medium-size LLMs, and they won't use expensive VRAM.
Tai9ch@reddit
It certainly feels that way compared to 20 years ago.
On the other hand, hardware right now is improving faster than it was 10 years ago. We had quad-core 2.5 GHz laptops with 8-16GB of RAM for far too long.
ttkciar@reddit
Admittedly, it depends on the hardware.
I've been tracking MI210 prices pretty closely. Two years ago they cost about $13,500, and today they only cost about $4,500.
On the other hand, DDR4 LRDIMM prices have been pretty stable for years now.
Slightly older processors come down in price pretty rapidly, but new processors stay high for a while and very old processors hold steady for several years.
Silver_Jaguar_24@reddit
This is what I keep telling myself every year with smartphones and now I am still stuck with my 2018 Chinese brand phone, waiting to get the best upgrade lol.
tiger_ace@reddit
honestly this is one of the most exciting times in technology for a while
if you view $2k as both a learning AND entertainment investment, then it's incredibly cheap given that you get hardware too
i don't think we should view things right now purely as "hardware gets better over time" like smartphones
Bakoro@reddit
At some point you just have to pull out the wallet.
Do what you can to time the purchase(s), but it's easy to keep putting it off, and eventually you're managing with ancient hardware saying "it's good enough", not understanding how much better you could have it.
I remember when the Pentium 4 was on its way out, and then years later, when it could technically still compute, it literally made no economic sense to keep the thing working. It got to the point where the cost of electricity it used every year alone justified the purchase of a new CPU and motherboard.
You have to keep the total cost, the opportunity cost, and your quality of life all in mind.
RevolutionaryAd7360@reddit
How did this age?
StyMaar@reddit (OP)
Pretty well actually, Qwen3.5-122B fits with its full context length on Q4, and the performance is good IMHO (both in terms of tps and in terms of capability).
RevolutionaryAd7360@reddit
I was thinking these look pretty sweet for the money. I'm late to the game and just bought 2.
Not sure why they aren't more popular, but I'm never going to be confused for the smartest guy in the room. I assume I'm missing something.
NewDependent8219@reddit
Considering the upcoming RAM crisis, I hope your procrastination didn't save your wallet and you bought it. And I also hope you are happy with it.
StyMaar@reddit (OP)
I ended up procrastinating for a while, but then seeing the RAM price skyrocketing was enough of a nudge and I eventually ordered one. I'm supposed to receive it by next week.
bebetterinsomething@reddit
Which one did you order? I see the Framework going for $3k with tax and shipping on top of that.
StyMaar@reddit (OP)
I ordered the bosgame M5, which I got for 1600€ with tax and shipping. But it's now much more expensive as well.
Ornery_Cockroach_824@reddit
I think this could help
https://www.amd.com/en/developer/resources/technical-articles/2026/how-to-run-a-one-trillion-parameter-llm-locally-an-amd.html
profcuck@reddit
This is an absolutely great thread. Thanks to /u/StyMaar for posting it - I'm in exactly the same boat. I just keenly read everything here and like you, I'm not convinced out of the purchase.
The most persuasive argument is the one that I've been struggling with, which is "wait, something better will come, it always does." That's true, but it could also persuade anyone at any time to never buy anything. The real question is whether something big is coming soon, or a year or two away. Big difference.
However, there are "solid" rumors about Strix Medusa, the successor to Strix Halo. Announcements appear to be imminent, and Lisa Su is slated to keynote CES in January 2026. My guess is that whatever is announced then, or at the November financial analyst day, will still be a ways into the future.
Therefore, I've concluded that I won't be blindsided and disappointed by something 50% faster/better shipping in December. And... this seems in any event "good enough" for my use case.
I'm running models on my M4 Max 128gb laptop, and it's great. (Mac haters often say it's awful, but it isn't!). But I have some projects where I want to crank away 24x7, and on my homelab setup I'd love to have it open for family members to play with. But my laptop is my daily driver and I'm going back and forth to work and travel and so on, so it isn't really suitable for all kinds of fun stuff that I want to do.
So - now my only question is which one to buy. :). That'll keep me busy for a few weeks anyway, lol.
clare64@reddit
have u tried the m4 max to be able to handle local video models or even the kind of 24x7 'cranking away' you refer to?
profcuck@reddit
I have not used it for local video models or image models other than in the most simple way. (i.e. I got "Draw Things" working and messed with it for a few minutes). I wish I could give advice but I just don't know that space very well.
In terms of 24x7 cranking, I also haven't done that even though I use it quite a lot for LLMs, but just on an ad hoc basis, asking questions, giving a text for feedback, etc. The problem with 24x7 cranking on my actual laptop is that I use it all day every day for work. I could set up some jobs to run overnight I guess, but I haven't done that yet.
Now, for LLMs I can tell you what you probably already know: you can run big models with 128GB of unified ram, and the M4 Max with 128GB of ram is going to be significantly faster than the AMD Max+ 395. But both struggle with prompt processing and therefore aren't necessarily great for live coding as an example. But for batch processes, both should be just fine, and as compared with most people's Nvidia setups which don't really have enough VRAM, at least you can run smarter models, albeit a bit slowly.
I'm definitely in the camp that if your use case aligns, Macs are definitely the value champion for inference. And if your use case doesn't align, well, it may be disappointing!
clare64@reddit
Exactly what I needed thanks! Yes I want to run big models. Not overly concerned about speed on output or concurrent processing ..just want enough power to not rely on the paid image-gen solutions. Did you wait an unreasonable amount of time when u tried out DrawThings?
profcuck@reddit
Just to try to be useful and because I was curious, I just opened it up. When you start a new project, it throws up a random (?) sample prompt. I just clicked to run this one: "a samurai walking towards a mountain, 4k, highly detailed, sharp focus, grayscale". I used the model "Z Image Turbo 1.0". It took 3 minutes and 5 seconds.
I don't play in this space so I don't know if that's unreasonable or not. Doesn't seem exactly fast for interactive sessions but if you wanted to batch generate images overnight it's probably fine. (15-20 per hour?)
The same prompt in Stable Diffusion v2.1 ran in under 5 seconds. The quality is a lot less but depends on use case as ever.
clare64@reddit
yep, quite reasonable for many use-cases. Thx for the benchmark! Will post a review regardless where I land (mac studio or amd)
profcuck@reddit
I didn't spend more than a couple of minutes on testing it at all. It's on my eternally long to-do list!
Creepy-Douchebag@reddit
I want to learn more about LLMs. This is my reason to buy one.
clare64@reddit
same. did u end up doing so?
Creepy-Douchebag@reddit
Hopefully next month Ramen noodles will meet my dietary needs.
paul_tu@reddit
If you need ComfyUI or any other way to generate images, it's going to be painful.
Local LLMs are generally fine (distilled DeepSeek, different flavours of Qwen, gpt-oss-120b, etc.).
Energy consumption rarely reaches 200W, and noise levels are tolerable (for the GMKtec EVO-X2).
clare64@reddit
do you reckon this is insufficient for local video rendering?
paul_tu@reddit
I'd say it's too slow for AI video generation.
Yet it's possible.
I didn't try it for rendering, BTW.
But take into consideration AMD's traditionally poor codec support.
Some surprises, like AVX512 support in the AI 395 MAX, are still possible.
Successful-Put-4899@reddit
You know those days when everybody tried to automate their home with Raspberry Pis and Home Assistant...
This combo (AI capabilities thanks to the VRAM) actually convinces me to take the step... everything controlled with Ollama, Home Assistant, and n8n, and still plenty of power to run your local services like streaming and NAS/backup, thanks to the millions of available Docker images... the chance to shed so many internet headaches. You also have the ability to train your LLMs on your own setup, so you constantly have a low-paid, private assistant that can look through documentation for you, help you with administration, and manage your documents, all while managing the lights in the house, the temperature, and the mood in the room... the options are getting limitless.
And none of the big tech spying on your data and telling you what you should or should not do.
clare64@reddit
youve nailed it. reckon its possible with the pc op is referring to?
No-Manufacturer-3315@reddit
Memory bandwidth issues
StyMaar@reddit (OP)
256GB/s is more than enough for models with 3-12B active parameters though.
Compute is likely a bigger issue for long context and time to first token.
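For intuition, a back-of-envelope sketch of the decode-speed ceiling that bandwidth implies; the gpt-oss-120b figures (~5.1B active parameters at ~4.25 bits each) are assumptions, and real-world speeds land well below the ceiling:

```bash
# Decode is roughly memory-bound: every token must read all active weights.
# ceiling (tok/s) ~= bandwidth / (active params x bytes per param)
awk 'BEGIN {
  bw     = 256       # GB/s, Strix Halo
  active = 5.1       # billions of active params (assumed, gpt-oss-120b)
  bpp    = 4.25 / 8  # bytes per param at MXFP4
  printf "theoretical ceiling ~ %.0f tok/s\n", bw / (active * bpp)
}'
# prints ~94 tok/s; real-world numbers reported in this thread are ~40 tok/s
```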
clare64@reddit
am i correct in assuming that if your use case doesn't care about "speed", this 395 (with 256GB/s) will be fine?... as long as I can use video models locally and have it running at no cost, could care less if it's fast/slow at this point...
zipzag@reddit
The upcoming Nvidia Spark is 270GB/s; the high-end Mac Mini is 250. This speed is about the state of the art for these more moderately priced SoC computers.
The more expensive Mac Studios are at about 500GB/s and 800GB/s.
These SoC choices are good on memory and low on processing power. Arguably, many people would trade some of the 5090's speed for more memory. What's available today is a bit imbalanced for how most people want to run inference.
s101c@reddit
You cannot use Nvidia Spark as a gaming machine or professional x86 workstation. Many of us here buy an expensive computer to do all of these things.
The Mac is also a matter of taste; not everyone likes macOS.
Original_Finding2212@reddit
It really is about purpose.
General purpose? Sure. Fine-tuning? Spark.
No-Manufacturer-3315@reddit
That's a small number of active parameters... I suspect it excels at MoE models but is crippled on monolithic (dense) ones.
sudochmod@reddit
The processor on this is almost equivalent to a 9950.
fallingdowndizzyvr@reddit
It's simplistic to only consider memory bandwidth. My Max+ 395 is faster than 3x MI50s cobbled together, despite their much faster memory bandwidth.
kezopster@reddit
This is MY question, too! I just got bitten by the local LLM bug. Right now, I'm using my ROG laptop with a 13th-gen i9-13900H + Nvidia 4070. It's enough to show me the possibilities without scratching the itch. So, do I spend $2k on one of these Strix Halos, or slightly less on a desktop with a 5070 Ti (can't afford to go higher than that)? Heck, I've even seen a few laptops around that price with Intel processors. But seriously, I'm about to pull the trigger on a GMKtec EVO-X2 AI mini PC: Ryzen AI Max+ 395, 128GB LPDDR5X-8000 (16GB*8), 2TB PCIe 4.0 SSD. Am I being dumb?
DavidAdamsAuthor@reddit
If you're about to spend $2k on a product like that, why don't you "test drive" it, so to speak, and budget $100 worth of H100 GPU rentals instead?
You'll find that the performance is WAY better, and you might find you're just not using it as much as you thought you might.
clare64@reddit
is there a simple walk-through on how to do this on YouTube somewhere? certainly seems reasonable, like your analogy of test-driving a car... run some local llm stuff to see if it's to your liking
One-Kangaroo911@reddit
LOL, you should have bought it back then!!!! LOL
Early-Type-6300@reddit
Because it cannot run Kimi K2.5 ;-)
Weird-Consequence366@reddit
If you can wait a month or three the price will go down on them a bit.
Hood-Boy@reddit
Aged like milk
Weird-Consequence366@reddit
Funny. I got one for cheap on Black Friday.
yuno_me@reddit
aged like milk
Zc5Gwu@reddit
Just like 3090 prices were supposed to go down… and 4090 and…
Zentrosis@reddit
I bought my 4090 somewhat close to launch when everybody was talking about how overpriced it was.
I don't feel like I made the wrong choice.
kevin_1994@reddit
tfw 4090s are somehow selling for the same price as a 5090 on eBay
SubstanceDilettante@reddit
I bought the 4090 a few weeks after launch from best buy for MSRP
uti24@reddit
I mean, Black Fridays can be nuts in the US, like getting this thing for $1500.
my_name_isnt_clever@reddit
Could be, but I don't even remember the last Black Friday that had actual worthwhile deals rather than artificial "sales" by increasing the price ahead of time. And with the demand for AI hardware I'll believe it when I see it.
uti24@reddit
Every Black Friday, we have posts where someone gets a full PC with 13600K and RTX 4080 for $750 from Micro Center or wherever.
ComingInSideways@reddit
All this depends on ongoing tariffs and trade issues. Tech going down is not necessarily a given now.
boubainlive@reddit
So there is an upside to procrastination 🤔😁
rishabhbajpai24@reddit
I had one 4070, one 3070, one 4080, and one 4090, with a combined RAM of 324 GB. I still bought it. I find it to be the most balanced device: it performs comparably to other systems and consumes very little power. I usually run 30-ish B models with a large context length. The 395 is definitely less powerful for dense models compared to RTXs but performs really well on MoE. Now I have shifted most of my workload onto it, and I don't regret it.
I am a developer, and I need to test my software on different devices, so buying it was the right choice for me (that's how I justify it), but may not be the best choice for you.
Buying 395 depends on your specific use cases. It is not as fast as its other counterparts but cheaper compared to a Mac or Nvidia DGX Sparks.
randylush@reddit
Huh?
Quirky_College_6251@reddit
But that will probably consume 1200 watts under load, while the AMD and Apple machines use about 120W.
rishabhbajpai24@reddit
324 GB is the system RAM (excluding VRAM)
BhaiBaiBhaiBai@reddit
He said RAM, not VRAM. Perhaps he's referring to his whole rig
HiddenoO@reddit
Probably system RAM + VRAM combined.
StyMaar@reddit (OP)
You're not helping man!
(thanks for the answer though ;))
rishabhbajpai24@reddit
Reasons not to buy:
$2k is not zero: a lot of LLM providers have free tiers for chat and API, so for everyday purposes you don't need to spend $2k.
Time to set up the server: For someone new to LLMs and 395, it may take some time to set up everything.
Stability: AMD is not as good as Nvidia when it comes to LLM acceleration. ROCm is not as stable as CUDA. Vulkan is good but still fails sometimes.
Non-LLM workload: Running CUDA-optimized algorithms is not always easy on AMD.
StyMaar@reddit (OP)
$2k is much less than selling my soul, which is roughly what these free-tier API providers are taking from you.
Kramy@reddit
I bought a few $330 CAD MiniPCs (6800U 32GB LPDDR5-6400), so about $200 USD because of Canadian taxes and the exchange rate. I am running Olla and Ollama off of them. It crunches out tokens at a pretty good pace, and now I can fire lots of simultaneous requests at it. I need to get qwen3-vl working and some other models, plus expand OpenWebUI and some other software to have more capabilities. But bit by bit, I'm putting together something very useful, and the learning experience is definitely fun! Can't go wrong spending money on knowledge.
Acrobatic-Rice-4598@reddit
The data is anonymous, what's the problem?
StyMaar@reddit (OP)
That it's 2025 and some people still think anonymity exists in a digital context is beyond me.
Acrobatic-Rice-4598@reddit
Paying €2k for a PC to anonymize your requests, what do you want me to tell you x) Personally, I do not transmit personal data to AI.
StyMaar@reddit (OP)
What are you typing, cryptographically secure randomness? If not, you are in fact transmitting personal data to AI providers, even if you don't recognize it as such.
No-Row-Boat@reddit
If I can add: Linux support is horrible.
Lemonade: no NPU support on Linux. Ollama: no ROCm in the default packages. vLLM: no ROCm support; it was added 3 weeks ago but is not yet in the latest version.
I'm just fighting to get something working on Linux and so far: nothing is.
LsDmT@reddit
What specific models are you running on the 395? Are you using vulkan-radv, vulkan-amdvlk, rocm-6.4.3-rocwmma?
Have you found any solid models working with rocm7_rc-rocwmma yet?
farnoud@reddit
Is splitting model between a few cards viable? Is it better to have all the cards on the same system?
It’s a lot cheaper to buy 3x 3090 than l40s
kkb294@reddit
Same with me. I have a 4090 48GB and a few 4060 Ti 16GB variants in my home rigs. They are bulky, power-hungry, jet-engine-sounding machines under load. But whenever I want to demo something, I load the code and local LLMs onto it and carry it to the office for the demos. It has the local LLM setup (LM Studio, Ollama, llama-cpp-python, Whisper STT, Kokoro TTS).
All I have to do is load the app container into Docker and hook it up with the other layers for the demo.
The only con is that it is very bad at Stable Diffusion 🤦‍♂️
rishabhbajpai24@reddit
That's true. I haven't gotten good results on stable diffusion with it.
xXprayerwarrior69Xx@reddit
Currently I'm really stuck on what to do: should I get a Strix Halo, a Mac Studio, or a DGX Spark? Everything is moving so fast that it's hard to pinpoint what to do. The good thing is that I don't NEED to find the answer, but I really want to experiment and get deeper knowledge. I imagine the right answer is probably the Strix Halo due to the cost/performance ratio, but I'm thinking that maybe Apple is cooking something, and we still don't know when the Spark will release... oh well.
Smart_Government6493@reddit
Hey, it's now three months later; I bet you really regretted not buying it back then lol
nzMike8@reddit
This seems like a good deal, if the Max 395 is what you are after: https://frame.work/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006
CoqueTornado@reddit
Because for the amount of bucks you are going to spend, you can get a UM890 mini PC + a 5060 Ti with 16GB of VRAM and do everything 2 times faster in image/video generation.
https://vladmandic.github.io/sd-extension-system-info/pages/benchmark.html
Here you can find the speed of an XL model on each option (8060S vs 5060 Ti): 1.43 it/s vs 6 it/s. For LLMs, these 220B MoE models are OK, but the pp speed when the context is more than 50k reportedly degrades a lot, so you have to wait like 2 minutes for it to read the context (I've read that somewhere; if anyone can confirm, I would appreciate the information). So it's not really the best thing, but hey, it's a powerful local LLM.
StyMaar@reddit (OP)
This is /r/localllama, not /r/stablediffusion, I don't care about video generation at all.
WhaleFactory@reddit
I definitely did not need one, but I bought one anyway. Now I am running gpt-oss-120b at >40 tps while sipping power. The thing uses less power under load than my other servers do sitting idle.
dougmaitelli@reddit
So, a question: I just got a GMKtec Ryzen AI Max+ 395 and I can only get 20 t/s on a 12B model. I don't know how you can get double that on a model 10 times bigger.
Am I missing something?
deadly_sin_666@reddit
Which one did you get?
j0rs0@reddit
You were supposed to help 😆
Hodr@reddit
He definitely helped. Here's my help: my dipshit nephew, who refuses to get a job, dropped 4 grand on a credit card for rims on a 10-year-old Accord. It puts a presumably decently compensated person dropping $2K on an awesome setup for running local LLMs in perspective.
Bakoro@reddit
Jeez, I make six figures and haven't bought a new computer in 5 years.
The idea of dropping $2k on anything sets off alarms in my brain, even for important stuff that I know is worth it.
$4k on rims?
I'm for real irritated just reading that, because I've known people who do that shit. Later on they'll be saying "it's hard out here, you don't understand" and "why can't I just catch a break?"
randomqhacker@reddit
You're making six figures, you're supposed to be putting money back into the economy. Dipshit nephew is a hero for supporting local business and bringing joy to everyone who sees those beautiful rims.
Plebius-Maximus@reddit
Gonna need a pic of these beautiful, economy-propping rims
BasvanS@reddit
Imagine the worst rims you can think of, the ones that raise the hairs on your back, and then come back, because they’re 5 times worse, and there’s a Geneva Convention against torture that’ll become relevant.
fatboy93@reddit
So, either floaters or spinners, or just normal ones with a really gaudy ass shit colors
No_Debate_8297@reddit
HEY! Those rims are carrying the weight of the whole economy.
h311m4n000@reddit
My daily driver is still rocking a 1080ti and a core i7 9700 and I don't plan on changing it...and I also make 6 figures 😛
pixelpoet_nz@reddit
Yeah he really needs a job to support his penchant for buying rims, we'll call it a ...
pn_1984@reddit
Suddenly we went from comparing local AI setups to comparing ourselves against a random "dipshit nephew". ngl, compared to him we look peachy buying some mini PC.
Hey next time I want to extend using OCULink, I can bookmark this comment and gain more confidence!
Some-Cow-3692@reddit
That's a solid perspective. $2k for a solid local LLM setup is a reasonable investment compared to many other hobbies.
Xp_12@reddit
That's some good help.
Serveurperso@reddit
We'd need to compare with GLM 4.5 Air, because with GPT-OSS-120B a dedicated Linux PC runs at that speed!
StyMaar@reddit (OP)
What kind of pp speed do you get? (How long would it take to process a 5k-15k token context?)
KontoOficjalneMR@reddit
PP is in the thousands of t/s, so 2-10s depending on various factors.
one-wandering-mind@reddit
That seems too high given what I have seen from other folks. Are you sure? Let's just say one 8k prompt: how fast? And what are the factors?
KontoOficjalneMR@reddit
I probably should have written "around a thousand", but I could swear someone listed over 2k.
Anyway, this is the best I found, and like I wrote, lots of things seem to affect it (mostly negatively).
https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796
Gringe8@reddit
To compare: that gets ~100 t/s prompt processing on a 70B, while my 5090 gets 1500+.
ethertype@reddit
Cool. As long as this is a single-user setup, what is the added value of anything faster than human reading speed?
I could sell my 4 3090s and buy a 395. But, knowing myself, I'll probably end up with a 395. And 4 3090s.
I suck at getting rid of stuff.
kryptkpr@reddit
We are talking pp speed not tg speed.. if you're running local coding agents they eat 10K prompt tokens before getting out of bed so at 100 Tok/sec you will wait 2 minutes before generation even begins. Those 4x3090 can do this in seconds.
ethertype@reddit
That is a useful answer, thank you.
CSEliot@reddit
The tradeoff is in the VRAM. A GPU with 24GB of VRAM can deploy fewer/smaller models but run them much faster than an APU with 96GB of shared VRAM.
Gringe8@reddit
That's why I decided against getting something like this. While you CAN fit a bigger model on it, the prompt processing is so slow it wouldn't be useful for my use case. I decided to go with a dual-GPU setup instead of a mini PC for AI stuff.
ItzDaReaper@reddit
What’s your use case?
randomfoo2@reddit
For better benchmarks, be sure to use the latest updates (linked in that post) or these: https://kyuz0.github.io/amd-strix-halo-toolboxes/ - Note, these are pp512/tg128, so best-case numbers. While I do sweeps, I haven't done a lot of long-context tests (since they take forever), but I did run one, and perf gets pretty bad out at 128K... https://strixhalo-homelab.d7.wtf/AI/llamacpp-performance#long-context-length-testing
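For reference, a hedged sketch of how such a long-context sweep can be run with llama.cpp's llama-bench; the model filename is a placeholder, and larger `-p` values expose the prompt-processing drop that the pp512 default hides:

```bash
# llama-bench defaults to pp512/tg128 (best case); sweep longer prompt
# sizes to see how prompt processing degrades with context length.
./llama-bench -m ./gpt-oss-120b.gguf \
  -p 512,4096,16384,65536 \
  -n 128 \
  -fa 1
```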
KontoOficjalneMR@reddit
Figured, but then it's incredibly rare for a model to even support that long context, not to mention making actual use of it.
kmouratidis@reddit
Is `pp512` a reliable test for speed on long prompts? Why not benchmark your own computer? You can install vLLM and run something like this:
KontoOficjalneMR@reddit
What would a benchmark on my computer tell you that the ones from localscore.ai do not?
kmouratidis@reddit
It would give a directly comparable benchmark result that does not assume a GGUF file is being used to deploy a model?
KontoOficjalneMR@reddit
Comparable to what?
MaverickPT@reddit
Wow. That's a big PP
Business-Weekend-537@reddit
Petition to henceforth refer to undervolted vs overclocked performance for this metric as “soft” and “hard” PP from here on out.
CV514@reddit
I have factory calibrated PP then. Wow.
Business-Weekend-537@reddit
You could also call it bone stock
BhaiBaiBhaiBai@reddit
Seconded
Healthy-Nebula-3603@reddit
lol
Gringe8@reddit
The benchmarks below show the pp of a large MoE model getting 63 pp t/s.
Fuzzdump@reddit
Not sure what benchmarks you're referring to, gpt-oss-120b gets 750 pp t/s on ROCm. https://kyuz0.github.io/amd-strix-halo-toolboxes/
Gringe8@reddit
The ones that were linked in the comment below mine
Fuzzdump@reddit
Those appear to be out of date. The updated numbers for that particular model show 182 pp t/s: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench
Sparser MoE models seem to fare much better. Some of the smaller ones get 1k+.
StyMaar@reddit (OP)
How can it be so low?
GreenCap49@reddit
I get around 500 pp; it's pretty fast. Also, gpt-oss-120b has SWA, so only the initial prompt takes a bit longer, then it gets faster.
Cool-Hornet4434@reddit
That's not SWA... If you run the model through a "dry run" it takes a long time to start up, but the first run with an actual prompt is fast. If you skip the dry run with the --no-warmup option, then the first run is slow, and it speeds up after that. SWA is Sliding Window Attention, which only affects how much VRAM is used for the KV cache. But with Oobabooga you can't use any quantization of the KV cache because it's already quantized... the whole model runs with MXFP4, which squeezes the numbers AI models use down from their normal size to just 4.25 bits each.
xjE4644Eyc@reddit
Same. Any model that's less than 100 GB and I'm like "let's do this".
Great for privacy related things where I'm not sure if zero data retention is really zero data retention.
pn_1984@reddit
why do you say so?
xjE4644Eyc@reddit
We're in a bubble and it's going to take down a bunch of companies; a lot of these companies are really skating by financially. And the only valuable thing they have is the data that they're processing. When push comes to shove (e.g. company solvency risk, their own/personal financial well-being), I don't know if many are able to withstand the tempting offer of putting that data up for sale.
jacek2023@reddit
How expensive is it? Because I get twice the t/s on 3x3090.
Aromatic-Low-4578@reddit
I bet you have more than twice the power draw though
jacek2023@reddit
If you want to minimize power usage, the best option is ChatGPT on your phone.
valiant2016@reddit
How is that local?
jacek2023@reddit
It's not, but uses no power ;)
Creepy-Bell-4527@reddit
That's not true. My phone consumes about 3000mAh of power a day.
If you want to minimize power usage, the best option is to live in a cave with no electricity.
Rynn-7@reddit
mAh is a useless unit when comparing usage against a computer. You need to multiply by the phone's voltage to get watt-hours.
Creepy-Bell-4527@reddit
The battery's voltage, but you're right. I was just too lazy to Google whether it was 3.3-3.8 or 5V, so I went with the unit I knew!
rawednylme@reddit
What is the best local model the cave can run?
Creepy-Bell-4527@reddit
If you talk loudly enough it echoes back what you said, I think gpt-2 had similar behaviour.
Cool-Chemical-5629@reddit
And then visit a local shaman and buy some hallucinogenic mushrooms from him. Next, get one longish pointy, sharp stone to use as a chisel and another bigger stone with a nice flat surface to use as a hammer. Consume the mushrooms and start carving whatever comes to your mind into the cave wall. When you come to your senses, it will feel like someone else (the cave's hidden "AI") carved those writings on the wall.
JumpingJack79@reddit
👆 THIS 👆 is a truly local LLM.
No_Afternoon_4260@reddit
Books are cool
valiant2016@reddit
r/LostRedditor ?
JimJava@reddit
It's amazing how people double down on arguments they lost.
sudochmod@reddit
I got mine for $1650 :D
magikowl@reddit
What brand/model and from where?
sudochmod@reddit
Nimo. Just search it. They run specials and I got it during the back to school special.
LowMental5202@reddit
I really want one too, but I've only really used NVIDIA so far. The thing I'm most afraid of is software compatibility.
Awkward-Candle-4977@reddit
AMD should make a GPU card with DIMM slots, because Nvidia won't (they'll say "buy a DGX").
An RDNA 4 Navi 48 chip with 8x 128GB DDR5 DIMMs would be very attractive for inference. It might work faster than multiple RTX 6000 cards, because one DDR5 DIMM's bandwidth is similar to 16 lanes of PCIe 5.
Vibe coding could then use those large models.
BillDStrong@reddit
Next-gen RDNA 5 is supposed to use LPDDR, so if a board partner wanted to, they could. Would they? Who knows.
AppealSame4367@reddit
But is that really worth it? What does oss 120b do for you?
Now that I know gpt-5 on Codex, I wouldn't want to settle for less xD
Apart-Touch9277@reddit
I think having a decent offline position is important enough to invest
tarheelbandb@reddit
It's more like what does Qwen2.5-coder or Qwen3 coder do for you.
TetsujinXLIV@reddit
What was your method of deployment? I tried ollama and got poor performance. I was hoping to run it headless in docker and connect from my desktop.
EdanStarfire@reddit
I'm using LM Studio with Vulkan. Haven't really dug into ROCm yet, as it's working great so far with little effort.
TetsujinXLIV@reddit
Do you use the unsloth GGUFs? I had the 20B model running in a llama.cpp Vulkan Docker container getting about 40 t/s, but when I went to the 120B it dropped to 8 t/s, and I'm wondering if it's my command?
`./llama-server -m /models/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf --threads -1 --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0.0 --host 0.0.0.0 --port 8080`
or if it's because the version of llama.cpp I'm using is behind
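One hedged guess at the culprit: the `-ot ".ffn_.*_exps.=CPU"` override pins every MoE expert tensor to the CPU backend, which helps on discrete GPUs with small VRAM but works against a unified-memory APU where the GPU can hold the whole model. A sketch of the same launch without it (also note that `--top-k` takes an integer):

```bash
# Same server, letting the experts stay on the GPU backend
# (assumes the model fits in the unified memory pool).
./llama-server -m /models/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
  --ctx-size 16384 --n-gpu-layers 99 \
  --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0 \
  --host 0.0.0.0 --port 8080
```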
EdanStarfire@reddit
Just openai/gpt-oss-120b, not an unsloth version. Looks like MXFP4 63.49GB. context=131072, GPU offload 36/36, CPU thread pool=12, offload kv cache to mem, flash attention, kcache quant fp16, vcache quant fp16.
Probably not optimal, but gets the job done.
Random test with a 13,875-token prompt (fully fresh load, nothing cached): 343 tok/s pp speed, 25.81 tok/s output generation.
Eugr@reddit
Ollama lacks proper AMD support, but llama.cpp with Vulkan should run well. I've heard ROCm support has gotten better recently for Strix Halo too.
TetsujinXLIV@reddit
That's probably my problem then. I need to dig into it further tonight.
xjE4644Eyc@reddit
I'm using llama.cpp. It's great.
ThinkExtension2328@reddit
Can you name your specific machine? This sounds tasty
EdanStarfire@reddit
GMKtec EVO-X2 for me. Very happy with the purchase. Got Home Assistant like 75% of the way to fully local with qwen3-coder-30B-A3. Just working on exposing the right devices and tweaking the system prompt.
Secure_Reflection409@reddit
Do you use it for roo/cline?
What kinda token speeds you seeing when you're 50k+ deep?
2CatsOnMyKeyboard@reddit
The 120B just didn't load in LM Studio with a 50K context window. Don't know what the max for that size would be, probably around 20-30K.
EdanStarfire@reddit
I altered the BIOS config to dedicate 96GB to the GPU, and gpt-oss loads fine at 128k context. Runs great.
No_Afternoon_4260@reddit
Factory setting is 96GB for the GPU, but you can set it higher. Have you done that?
Xp_12@reddit
I'd imagine it follows the trend exhibited on this chart.
https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796
henryshoe@reddit
Would you mind telling me exactly what your set up is?
tat_tvam_asshole@reddit
This felt so good to upvote. I too bought a Strix Halo box, within hours of seeing it advertised, straight to Microcenter.
Peter-rabbit010@reddit
How does it compare to an RTX 6000 Pro? I hadn't considered AMD silicon until this post.
you should get some referral money from amd
Accomplished_Bet4312@reddit
I'm also making up excuses to stop myself from getting this :). I already have a desktop, and all I need is a new GPU card.
But it would be good as a tiny machine under my TV. If it can run SteamOS, I can use it mainly as an AI server, and as a Steam console after work.
Commercial-Fly-6296@reddit
Sorry for my lack of knowledge, but does this work well given that most tools use CUDA-compatible libraries and so on?
Also, will this be fast enough to fine-tune and run inference, or maybe even distill? (Compared to an RTX.)
Most_Seaworthiness71@reddit
Does anyone know the actual bandwidth for the CPU cores vs. the GPU cores? I read somewhere that the CPU cores' bandwidth is significantly less than the bandwidth allocated to the GPU cores.
absolutzehro@reddit
Goddammit, I came here to find out more and ended up buying a 128GB mini PC just now. I hate all of you.
mfarmemo@reddit
I own the Framework Desktop (128gb version).
Here's the main reason I'd say NOT to buy: you want everything to "just work". Bleeding-edge hardware needs bleeding-edge software/drivers. Expect bugs, crashes, and reading forums and Reddit in search of workarounds.
I'm happy with my purchase but I also love to tinker with new tech. If that's not you, you should wait.
StyMaar@reddit (OP)
That's a very good answer actually.
What did you have trouble with?
(I've been using Linux as a daily driver for 17 years now, so I'm pretty familiar with the process, but the kids eat into my tinkering budget a lot, so I wouldn't buy something that would be too much of a hassle to run).
mfarmemo@reddit
Fedora crashed during install twice when clicking "allow proprietary software" 🤷‍♂️. Once booted, I began installing my usual apps, then Fedora crashed. It was a weird crash: a complete lock, then a system shutdown... The current kernel didn't have full support, so I had to dig around for kernel updates and workarounds. ROCm drivers were a pain, but it looked like there was better support on Ubuntu, so I switched to Ubuntu. It worked better out of the box, but then Citrix Workspace didn't work at all despite a few hours of troubleshooting (needed for my work). I think NPU support is Windows-only right now, but then Windows doesn't have full ROCm support. I ended up going against my own wishes and installed Win11 in hopes of having the essentials I needed work without significant errors. On Windows, the WiFi driver seems to struggle with DNS resolution during large packet transfers; I spent a few hours debugging that.
I received my Framework on Friday. I spent all of my free time (with two kiddos) trying to get things working the way I wanted, with mixed results. So I settled on Windows so that I could use the apps I need for work by Monday. I plan to dual-boot but am waiting for some solid free time to get that done. Probably another weekend project.
ImEatingSeeds@reddit
Check out CachyOS. Don't let the fact that it's Arch Linux scare you. I've been running it on bleeding-edge hardware without ANY issues for over a year as my daily driver.
Windows was dual-booted on the system as an insurance policy/hedge against the risk that if Linux f*cks me, I still have Windows to fall back on (for work, etc).
At this point, I can confidently say that with the exception of one issue around repository keyring updates that went sideways (which was fixed within a day without needing to reinstall the OS or much tinkering)...Cachy has never left me in a situation where I had to boot into Windows as an emergency fallback.
Even when it comes to gaming, Cachy is so good (and optimized) for it that I almost never boot into Windows for video games either.
...and that's coming from a guy who has never been into "ricing out" his own rig, has never run distros like Arch or Gentoo as daily drivers (ever), and who has always preferred to use distros like PopOS, PinguyOS, Mint, Ubuntu, etc.
Driver/hardware support is superb, always up-to-date, and the installation was stupid-easy. My system's been rock solid and stable...almost to the point where it's boring me.
mfarmemo@reddit
Can confirm. Cachy worked great with minimal tinkering. I ran it through a full workday of coding, local LLM usage, web usage, and virtual meetings with only two issues: headphone output noise and one GPU crash during inference. A much better experience than Fedora or Ubuntu. The Citrix Workspace app even worked, and it never works right on Linux distros.
ImEatingSeeds@reddit
w00t! Great to hear :)
Now…are you brave enough to install hyprland and fully complete your transformation? 😅
mfarmemo@reddit
I'll give it a try. Thanks!
StyMaar@reddit (OP)
That's great feedback, thanks!
Hot_Turnip_3309@reddit
Is it possible I could rent it for a few hours and try to get a few models running? Or, what tokens per second do you get? I wanted to try Qwen3-Next 80B; if it can run that, then I could run it 24/7 to do tasks, like 12,000-24,000 requests per day at 60-200k ctx.
mfarmemo@reddit
tps varies by model architecture, inference setup, and quant type. For example, gpt-oss-120b runs great: ~20-25 tps output with a q4 quant fully loaded into VRAM on the GPU. Qwen3-Next will run just fine. Context length will depend on how much memory is free after the model is loaded, and on other params. There are calculators online that can help you understand the optimal setup and memory requirements. I don't anticipate any issues with the new Qwen model.
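As a rough stand-in for those calculators, a back-of-envelope sketch (weights only; the KV cache and runtime overhead come on top, and the 4.5 bits/param figure is an assumption for a typical q4 GGUF):

```bash
# Weight memory ~= params x bits-per-param / 8
awk 'BEGIN {
  params_b = 80    # billions of params (e.g. Qwen3-Next-80B)
  bits     = 4.5   # assumed effective bits/param for a q4 GGUF
  printf "~%.0f GB for weights alone\n", params_b * bits / 8
}'
# prints ~45 GB, leaving headroom for context on a 128GB machine
```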
Eugr@reddit
Is there anyone here who has GMKTek version? I have Framework Desktop on pre-order, but GMKTek is available now for cheaper. One thing that is stopping me is that I've heard that the fan gets pretty loud and the thermals are not great compared to Framework Desktop, so it starts strong, but then throttles down.
Educational_Sun_8813@reddit
With the Framework you get two M.2 slots and one PCIe x4 slot.
Eugr@reddit
The GMKtec one has two M.2 slots as well, but no PCIe slot. What I also like about the Framework Desktop is that the motherboard is Mini-ITX and uses a standard FlexATX power supply. But I'm in Batch 13 (Q4 shipping).
simracerman@reddit
I'm also in the same batch, shipping Q4. I ordered the 128GB board only. With the parts I have planned, it will come to around $1900. My future addition is an eGPU through OCuLink.
green__1@reddit
I don't see oculink listed anywhere on the framework website, how will that work?
simracerman@reddit
Using one of these: https://www.amazon.com/Compatible-OCuLink-SFF-8612-SFF-8611-External/dp/B0F89XSVYF
xjE4644Eyc@reddit
I have it. The fan is annoying, but I keep it in my server closet so it doesn't bother me. Sometimes it gets into a loop where it sucks 25 watts of power continuously, but most of the time it's at 5-7 watts.
Would buy again.
Eugr@reddit
Thanks!
arades@reddit
Just a note: only 96GB is available to use as VRAM; a minimum of 32GB will always be reserved for the CPU.
For some reason I doubt that only 75% being usable as VRAM will be a big dissuader, especially considering the other 25% can obviously be moved in/out pretty quickly if used for KV cache.
Eugr@reddit
In Linux, you can set it up to use all RAM as unified memory, so you just dedicate 500MB to VRAM in BIOS, the rest will be allocated dynamically as needed (just like on Mac).
d3v3l0pr@reddit
Do you have a link on how to do that?
Eugr@reddit
kyuz0/amd-strix-halo-toolboxes
Look at the host configuration section. The key is these kernel parameters: `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432`
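For anyone unfamiliar, a minimal sketch of making those parameters persistent on a GRUB-based distro (values as above, sizing GTT for 128GB; adjust for your distro's bootloader):

```bash
# /etc/default/grub -- append to the existing kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# then regenerate the config and reboot:
#   sudo update-grub && sudo reboot
```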
d3v3l0pr@reddit
Hey thanks a lot! I've been sitting on this ai max+ 395 and hitting the context limit a lot, can't wait to test this :)
Eugr@reddit
No problem! What AI Max+ 395 box do you have? Framework? GMKTek? Other? I have a Framework Desktop on pre-order, but looks like I can get GMKTek one faster and cheaper, but not sure about thermals/noise...
d3v3l0pr@reddit
I got the latest Beelink GTR9 Pro. It was a preorder, but it still shipped within 10 days, I believe. I haven't tested it to get thermal data. The fan is faintly audible but spins up somewhat under heavy load. It looks nice, is quite small, and has a bunch of IO, so I'm quite happy with it. I'm mostly compiling Rust and hoping to use it for local LLMs and to run some services, and for that it's a beast.
Now I just need to dive deeper into this local LLM rabbit hole and figure out the right agent tools and models to use. I've been trying opencode, but I don't know how to use it yet, and I hit the context limit fast with gpt-oss.
Eugr@reddit
What do you use to run gpt-oss and what context size have you set?
d3v3l0pr@reddit
LM Studio, and the max context size I get without the kernel parameters you posted is 15k with gpt-oss-120b.
Eugr@reddit
You should be able to get full 128k context on your hardware. But you either need to allocate memory dynamically (see my post about kernel parameters) or dedicate 96GB to VRAM, as gpt-oss takes around 64GB when loaded with full context, IIRC. It may fit into 64GB of VRAM, but I'm not sure.
Eugr@reddit
Hmm, that Beelink looks interesting, but it's currently more expensive than the Framework or GMKtec. I guess I'll just wait a bit and see if any of those prices go down... or not; you never know with these tariffs now.
waiting_for_zban@reddit
The issue is also software stack support. Two months ago, I couldn't run it reliably without specifically setting the UMA limit in the BIOS; a few updates later, it's working very well with GTT. Although, again, there are lots of fluctuations in tg and pp.
arades@reddit
That's actually great info, I assumed that was a chipset restriction
Educational_Sun_8813@reddit
You can address 112GB with a normal libre OS.
Zyguard7777777@reddit
112gb as vram under Linux apparently
SuitableAd5090@reddit
Here is a good site on using the device in a home lab scenario https://strixhalo-homelab.d7.wtf/
I really like the way they break it down. It's basically a 7600 XT with tons of RAM. So it's already kind of outdated and on an old AMD architecture. I can't help but feel that the next wave of AI parts they do will be much, much better.
Ok_Warning2146@reddit
If u do video gen, it will be uselessly slow
n1k0v@reddit
What are your thoughts on the 385 32gb ?
StyMaar@reddit (OP)
I don't have any, as I haven't tried it ^^
Particular-Party4655@reddit
I am wondering why there are still no PCIe expansion cards announced based on this chip (the AMD AI Max+ 395). Imagine... That would be a bomb!
StyMaar@reddit (OP)
I'm no hardware specialist at all, but this chip is a full SoC with a CPU in it, it's not a regular GPU.
Why there aren't high-memory, moderate-to-low-bandwidth graphics cards using the same kind of design, simply stacking cheap LPDDR instead of GDDR, is a legitimate question though.
Particular-Party4655@reddit
Yes. I think there is a huge marketing/money reason why we still don't have AI-first, GPU-second PCIe cards for the consumer market. I basically asked this question to troll them a bit. However, for AMD that might be a great marketing move: it would allow them to win this market (local AI for the masses) while Nvidia is focused on high-profile clients owning datacenters... From a hardware perspective it is very easy to turn this SoC into a PCIe card (it already has a PCIe 4.0 x16 interface); the software stack (ROCm-based, all the routing/balancing pipelines) will catch up inevitably.
Ok-Possibility-5586@reddit
Can you explain what you mean by 128GB of VRAM?
This is a laptop right?
It looks like it's 128GB of RAM.
Is it shared VRAM/RAM like on apple or what?
StyMaar@reddit (OP)
This is a chip that was designed by AMD for use in laptops but has been picked up by mini-PC makers, the Framework Desktop being the most notable.
AFAIK it's not exactly the same as Apple, but it's the same idea.
So yes, it's 128GB of LPDDR5X attached directly to the SoC, with 256GB/s of memory bandwidth (lower than true GDDR6X VRAM, but higher than your desktop's system RAM).
Ok-Possibility-5586@reddit
Cool. Did you end up buying the laptop? (I think you said strix halo?). I'm looking at this: HP ZBook Ultra G1a
szab999@reddit
Not OP, but I got this exact HP ZBook Ultra G1a with the Ryzen 395 + 128GB RAM. You can allocate up to 96GB to the GPU. I have it running Debian, but I haven't managed to set up the latest ROCm yet; there are some installation issues with it.
Ok-Possibility-5586@reddit
It would be awesome if you keep us up to date. On paper it sounds like a no-brainer for a daily driver if it can be made to work.
IMO the shared memory thing is a potential NVIDIA killer for the hobby end of the market. Both Intel and AMD should be putting teams onto building out Intel/AMD forks of the major Python libraries.
szab999@reddit
I made it work with podman (docker alternative), running ubuntu 24.04 + rocm 6.4.3 in the container. Ollama is working well, I will do some performance tests tomorrow.
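In case anyone wants to reproduce it, the container setup is roughly this; a sketch assuming the usual ROCm dev image naming and the standard GPU passthrough flags (adjust the tag to whatever is current):
```
# run a ROCm 6.4.3 / Ubuntu 24.04 container with the iGPU passed through
podman run -it --rm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v "$HOME/models:/models" \
  docker.io/rocm/dev-ubuntu-24.04:6.4.3
```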
StyMaar@reddit (OP)
“Strix halo” is just the AMD codename for the “Ryzen AI max+ 395”.
Why AMD insists on using such a horrible naming scheme instead of the codename is beyond me.
But to answer your question, I have zero need for a laptop (I still have one from 2015 that I use every once in a while when I really need a laptop, but that's very rare).
ptyslaw@reddit
Where is it available for $2k? Is it in a laptop or mini pc form?
StyMaar@reddit (OP)
Framework desktop motherboard.
ptyslaw@reddit
Oh that’s freaking cool! I thought it was only available as laptops
Serveurperso@reddit
Honestly we're caught between two stools here. From experience (I have a Ryzen 9 9950X3D + RTX 5090 FE dedicated to AI, Debian CLI), medium-sized MoEs like GPT-OSS 120B (runs at 50 t/s under llama.cpp) and GLM 4.5 Air (25 t/s) sit in the segment that overflows the 32GB of VRAM, and DDR5 at 6600 MT/s (100GB/s) is a handicap, but not enough of one for a DGX Spark or an AMD AI Max+ 395 128GB to bring much. I'd say: if you have nothing, buy an AI Max, but if you already have a good DDR5 PC it won't be great, and it will even be much slower on dense 32B models with the right tuning (Q5_K_M / Q6 imatrix with KV cache in Q8_K and flash attention enabled).
And for these MoEs under llama.cpp, be sure to use --n-cpu-moe
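Something like this, to give the idea; a sketch where the model path and the layer count are illustrative, you tune --n-cpu-moe until the non-expert weights just fill your VRAM:
```
# offload all layers to the GPU, then keep the MoE expert tensors of the
# first 24 layers in CPU RAM so everything else fits in 32GB of VRAM
./llama-server -m gpt-oss-120b.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  -c 32768
```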
tarruda@reddit
If you can spend $2k on an AI Max, just add a few extra hundred and get a Mac Studio M1 Ultra with 128GB. Not only is it significantly faster for running LLMs, but you can allocate up to 125GB to VRAM, which increases the selection of models you can run. For example, it can run Qwen 3 235B at 4-bit quant.
StyMaar@reddit (OP)
I don't want to mess around with Asahi Linux on Mac, and MacOS is a nonstarter for me so I'll pass.
tarruda@reddit
I sympathize with your point of view: Never liked Apple and MacOS, and would never otherwise buy one of their products. I also don't want to mess with Asahi unless it gained official Linux kernel support, which doesn't seem like it will ever happen (they appear to be stuck on M2).
But the truth is that there's nothing that comes close to Apple silicon for running LLMs in a local home lab.
Given this clear superiority, I bit the bullet and just got a Mac Studio, which I use as a locked-down server for running AI software.
To mitigate the downsides of using something like MacOS, I don't connect it directly to my home router. Instead it is connected to a mini pc via cable that acts as a gateway to the Mac, and which blocks access to the internet most of the time. I unlock it to download LLMs, but always block access to apple servers to prevent automatic updates.
If Asahi Linux gained mainstream distro support and Vulkan can run AI software as well as Apple silicon Metal, eventually I might switch to it.
randomfoo2@reddit
While I think Macs can be good and I haven't ever gotten my hands on a high end one to run more extensive real testing, I do think that the 800GB/s theoretical MBW numbers are somewhat misleading. In the llama.cpp Mac performance discussion thread the top of the line (80CU) M3 Ultra w/ 800GB/s of theoretical MBW gets a tg128 of 92.14 tok/s with the Metal backend on the standard llama2-7b q4_0 test model. Not bad!
However, when comparing the same model, Strix Halo gets tg128 52.16 tok/s with the Vulkan backend (RADV driver) - that's 57% of the tg128 perf at 32% of the max theoretical MBW. 🤔
On the flip side, the 3090 (CUDA w/ FA) gets 161.89 tok/s tg128 - that's 76% better performance than the M3 Ultra even though the 3090 only has +20% more theoretical MBW.
(For Strix Halo and 3090, my personal llama-bench numbers corroborate the published results.)
For those interested, be sure to also take a look at the pp512 (compute bound prompt processing/prefill), the numbers are even more stark as a comparison. You don't really get a free lunch when it comes to matmuls/watt.
Icy-Signature8160@reddit
> the same model, Strix Halo gets tg128 52.16 tok/s with the Vulkan

I see 5.18 t/s in your list, what's wrong?
P.S. Can you check how the new Qwen3-Next-80B-A3B performs on Strix Halo?
randomfoo2@reddit
There’s literally no way a 70B is getting 50 t/s tg on Strix Halo lol. Either you’re reading pp/tg wrong or simply the wrong model.
Qwen3 next needs to be quanted to run on Strix Halo and my understanding is that nothing does it yet.
tarruda@reddit
I think memory bandwidth is just one of the factors, and it doesn't have the same impact on all LLMs. GPU performance also matters, and the Mac GPU is definitely less powerful than a 3090 (however, it has a much better performance/power ratio). Here's the M1 Ultra llama-bench for GPT-OSS 120b:
randomfoo2@reddit
Here btw is how Strix Halo performs with the same model and the Vulkan AMDVLK backend (60-90W):
tarruda@reddit
That's quite good. Seems like memory bandwidth doesn't affect MoE that much. Can you run a 70B dense model like Hermes 4? I'm curious if it will be much different from what I'm getting:
randomfoo2@reddit
For a Llama 3 70B it looks like Vulkan RADV performs better than AMDVLK:
tarruda@reddit
As I suspected, memory bandwidth has a more significant impact on LLMs with more active parameters. If we compare the GPT-OSS inference numbers, mine is only 25% better than yours (62.52/50), while for the 70B LLM it is 89% better (9.83/5.18).
Another interesting thing is that pp512 at 0 context is better on the Ryzen AI GPU. IIRC prompt processing is affected mostly by GPU performance.
randomfoo2@reddit
Sure, dense models are more MBW-bound, but you'll note that again Strix Halo gets decently close to its theoretical MBW limit (39.59GB * 5.18 tok/s = 205.08 GB/s of a 256GB/s theoretical, so ~80% of max MBW). OTOH for the Mac it's 39.73GB * 9.83 = 390.55 GB/s, or 49% of the max theoretical MBW. I'm curious what `asitop` reports when you're running your inferencing. Does it ever get anywhere close to 800GB/s? If so that would mean there's some other efficiency going on, since that isn't being reflected by the actual tg speeds.
fatboy93@reddit
Genuine answer: ROCm
But Vulkan is 80% of the way there so idk man.
StyMaar@reddit (OP)
I'm definitely planning to use Vulkan.
RRO-19@reddit
That's a lot of VRAM for local models. Main downsides would be power consumption and whether your use case actually needs that much memory. Most fine-tuning tasks work fine with less.
StyMaar@reddit (OP)
If you want to run gpt-oss-120b or Qwen3-next, you cannot really do it with less memory…
power97992@reddit
Cant u get a m2 max mac studio for a similar price but with more bandwidth but less flops?
StyMaar@reddit (OP)
Software support is going to be a problem as Asahi Linux isn't exactly mature AFAIK (and no, I'm not going to touch Mac OS with a ten feet pole ever again).
power97992@reddit
What is wrong with the Mac OS?
StyMaar@reddit (OP)
If I'm spending two thousand bucks on a computer, it's my computer, and Apple considers the computer they sold as theirs (you can't repair it yourself, even with actual Apple parts, because there's DRM in the parts; they don't give you full access to your OS's internals “for security”, and so on).
Also I had to work with it a bit, and the developer experience sucks.
power97992@reddit
It is hard to repair, but the aluminum build is good, better than plastic, and the ecosystem is convenient.
EmbarrassedAsk2887@reddit
well. before I answer your wallet— can you tell what you finna use the local LLMs for? Is it chat, apis, ide?
StyMaar@reddit (OP)
everything.gif
EmbarrassedAsk2887@reddit
would you be interested if I asked you to buy an SSD and then, a series of steps later, you'd be able to run up to 400B models :)
StyMaar@reddit (OP)
“Is it possible to learn this power?”
Food4Lessy@reddit
The best low-power LLM machines are the Apple Max and the AMD 395, for under $2,000.
A sub-$500 budget means cloud, an Apple Air, or a 4060.
64GB to 128GB of fast memory gets a lot done and saves time as a developer. Runs about 200 LLMs.
Sub-32GB is good for getting your feet wet. Runs 50 LLMs.
sP0re90@reddit
If I understood well, llama.cpp still doesn't support the AMD NPU, so I would wait a bit until it does (and then run models at full power with either LM Studio or Ollama). In the meantime maybe prices will drop or other hardware will be released. This is exactly the same thing that stopped me from spending that $2k.
Intelligent_Bet_3985@reddit
One thing stopping me is that I've heard it's bad for image generation.
GreenCap49@reddit
Some update from strix halo discord: "Wow this new change is phenomenal. Iteration times for Qwen Image have gone down drastically. The same prompt, previously with Ultra Fast (4 steps) that took 2.5 minutes, now completes in 1m 40 seconds with Fast (8 steps).
Iteration speeds have gone from ~20s/iteration to less than 9s/iteration.
Great going!" This is with https://github.com/kyuz0/amd-strix-halo-image-video-toolboxes; I haven't tested it personally yet.
GreenCap49@reddit
It's not bad, Qwen image with 8 step lora takes around 5-10min per image. Define bad :D
Aplakka@reddit
Thanks, that's good to know. I've been interested in knowing how these might work for image or video generation but I haven't seen any benchmarks.
For comparison, RTX 4090 takes about 30 seconds for Qwen image with 8 step LoRA, so it's about 10 or 20 times faster than AMD AI Max+ 395.
GreenCap49@reddit
Yeah, really depends on your needs. For one-off image generation it's definitely enough, but if you run a pipeline you're maybe better off with a GPU. But then you can fit gpt-oss-120b or Qwen Next on it completely =)
_bani_@reddit
I built a 5 x 3090 rig so I can run things like gpt-oss-120b, it's pretty fast.
Intelligent_Bet_3985@reddit
That's about how I'd define bad, yes.
Freonr2@reddit
5090 would be what, 45-60 seconds?
LumpyWelds@reddit
What about Image understanding (I2T)?
Edzomatic@reddit
Is that quantized or the full model?
ASYMT0TIC@reddit
Not if you buy it for the motherboard and jam your 4090 in there with it.
Freonr2@reddit
MoE LLMs are really what it is ideal for. Pretty much everything else will be relatively slow given the memory bandwidth and compute.
Dollar for dollar, a 5090 for $2k is going to absolutely stomp the 395 for dense diffusion models since even the larger (14B, 20B) models will fit in VRAM with only a few minor compromises (i.e. picking the right GGUF quant to fit, but these work very well).
I would not buy the 395 to run Wan video, Qwen-Image, Flux, etc.
Maybe we'll start seeing large MOE diffusion models later on, but there's no guarantee.
Rich_Repeat_22@reddit
I bet the guide using ComfyUI + ROCm On windows for RDNA3 & RDNA4 dGPUs will work the same with RDNA3.5 the 395 has 🤔
mycall@reddit
Zen6 is around the corner
Rich_Repeat_22@reddit
Medusa will still have the same iGPU; however, it seems on Zen 6 the NPU is integrated and beefier.
Time will tell. Imho we need big NPUs instead of GPUs to run LLMs.
mycall@reddit
I mostly agree about NPUs, although when I use LM Studio on my HX 370, it uses Vulkan instead of the NPU. This might be due to NPU memory bandwidth. I hope that changes.
Rich_Repeat_22@reddit
LM Studio is bad when it comes to these APUs.
Murder_Teddy_Bear@reddit
I have one and am very happy. (Hooked up an eGPU with a 4070 ti super to it.)
JumpingJack79@reddit
😮 What product do you have? And how did you hook up the eGPU?
pn_1984@reddit
Many mini-PCs come with OCuLink nowadays to expand with a proper GPU later. I, for one, find it very useful because it lets me stagger my expenses.
JumpingJack79@reddit
Are you sure? I'm not finding any viable AI MAX 395 products with Oculink 🤔
pn_1984@reddit
You might need a dedicated docking station but Minisforum M1 Pro-285H is probably something which might be suitable.
JumpingJack79@reddit
No, not suitable. The primary requirement here is the Ryzen 395 with the 128 GB (V)RAM. Without that it's just a regular PC with a GPU, which I already have.
simracerman@reddit
Lookup Aoostar AG01.
tarheelbandb@reddit
I wonder how an RTX 4080 with the MoE layers offloaded to the CPU would perform vs the APU alone.
igorwarzocha@reddit
I keep on asking people to try offloading the experts to the iGPU whenever I see someone with an eGPU hooked up to the thing...
(Doable with any combo via Vulkan, you just put device vulkan0/1 into your -ot regex-style command.)
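Something along these lines; a sketch that assumes the eGPU enumerates as Vulkan0 and the iGPU as Vulkan1, the model path is a placeholder, and the expert-tensor name pattern varies by model architecture:
```
# dense/attention weights land on the eGPU, expert tensors get pushed to the iGPU
./llama-server -m model.gguf -ngl 99 \
  --device Vulkan0,Vulkan1 \
  -ot "ffn_.*_exps=Vulkan1"
```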
Potential-Leg-639@reddit
Oculink?
Otherwise-Variety674@reddit
I swear, if I had the money, I wouldn't think twice and would have bought it yesterday. It is just goddamn worth it.
CryptographerKlutzy7@reddit
It kicks arse seriously, I grabbed two, and regret neither.
sedition666@reddit
Interesting, what do you use both of them for out of curiosity?
CryptographerKlutzy7@reddit
Well, AI boxes mostly, but they also kick booty running games, doing dev work, etc. So one is being used as a dev/gaming machine (different partitions), which I throw in a backpack when I'm heading to friends, conferences, work, etc.; the other is a dedicated AI box which I have at home.
They are a crazy good "desktop"/ dev box.
The AI box is because we have a bunch of private data stuff, but we still need agentic coders for automated debugging, so we can't use the cloud LLMs for that stage, so we push to the box for that.
It works well enough. No regrets getting the boxes at all. They are way better than they have any right to be. ABSOLUTELY max them out on memory.
sedition666@reddit
Very cool I am extremely jealous!
Daniel_H212@reddit
Heres the reason I'll give you:
The AI Max+ 395 is a bit of a first gen product. It's the first foray into large unified memory systems for consumer AI by AMD. Competing products from Intel will come out, and subsequent generation products from AMD will come out too. 128 GB of RAM is a lot today, but will it really be a lot a few years down the line? Keep in mind it isn't upgradable, so if you have a PC now, I'd just spend a few hundred to get a bunch of RAM to let you start running large MoEs now, and wait for bigger and better versions of this kind of machine to come out in the future, rather than beta testing a new product niche for AMD.
sleepingsysadmin@reddit
>My wallet thanks you in advance.
Sorry wallet, no help today.
I was asking GPT5 recently to try to estimate AMD and Intel's product cycles and for when a 256gb option with that much more memory bandwidth might come. It was thinking 2027-2028.
Intel cancelled their option and are moving toward enterprise options. So nothing for us.
Apple isn't likely to bring a 256GB Mac Mini until 2028.
The amd 128gb might be a best in slot for quite a long time.
ailee43@reddit
Intel cancelled battlematrix?
layer4down@reddit
Why not consider M3 Studio 512GB?
tarheelbandb@reddit
Because it's twice as much and the gains may not be worth it?
layer4down@reddit
Sure if an LLM inference workstation or server is the prime use case then it may be overkill. I happen to love my M2 as a daily driver outside of AI/ML. Being able to run larger models is just icing (PP performance notwithstanding).
tarheelbandb@reddit
Curious. I was not under the impression that we were having a conversation about usage outside of AI use cases.
layer4down@reddit
That is true of the OP's main thread. But u/sleepingsysadmin brought up the Mac Mini. It wasn't clear they were aware that there's a larger 512GB unit on the market as well.
tarheelbandb@reddit
Fair.
ASYMT0TIC@reddit
It's five times as much in fact.
tarheelbandb@reddit
I was being conservative and looking at lowest specs models and used, but yeah. I don't think of myself as cheap, but I can't find the value, especially in my use case.
Freonr2@reddit
Because it is $10k.
sleepingsysadmin@reddit
I don't hate myself enough to use anything but Linux.
ashirviskas@reddit
Depending on your self hate and pain tolerance levels, you could use linux on apple hardware
Freonr2@reddit
Next hope for good value is probably the next gen Ryzen AI chip, based on rumors and guessing maybe it'll be the Ryzen 495+ with 256GB and ~450GB/s+ and maybe $3500 or less, but late 2026 is probably optimistic and 2027 might be more realistic.
Anything from Apple is going to be pricey. Current Mac Studio M3 Ultra 256GB is $5600. Maybe value will improve a bit but I'd be surprised to see a decent 256GB system from Apple for less than $4500.
MaverickPT@reddit
That 256 GB Mac mini is gonna cost like 4k-5k🥲
HenkPoley@reddit
Just for the RAM option. 🫣
sleepingsysadmin@reddit
cost much more than $, the soul damage from owning apple products...
tarheelbandb@reddit
Crazy because I think the APU is how AMD stays competitive in the face of sweatys that live and die by Nvidia and Mac. I'm still trying to get a straight answer on how much more productive an individual developer can be on that same amount of memory with Apple silicon or the equivalent in Nvidia GPUs. My best guess is that the return drops off a cliff.
junkmailkeep@reddit
I just got mine. Is there anything I need to do for better performance? I know ROCm support is still getting worked on and works better on Linux.
Negatrev@reddit
Only a fairly abstract reason. But while the tech is moving so fast, your options and requirements can change from month to month.
So if you don't really need it right now, you should wait, save your money and see what becomes...impossible to resist.
Kindly_Elk_2584@reddit
Because it's an AMD GPU lol. You can't run anything besides a selection of LLM models. And it's just 4070-level power; I'd rather invest that money into a 5090.
profcuck@reddit
Hey /u/StyMaar you may find this useful:
https://docs.google.com/spreadsheets/d/1mmob8me7STljG6r7EvmJBuhJTqoBaAtba2f_9RU7Ef4/edit?gid=0#gid=0
As I am in the midst of upgrading my homelab to 10gbe, the Beelink looks like the one for me. Let us know which one you get?
Left-Language9389@reddit
Anyone got a link to this machine so I can see what OP is not buying?
Cloakk-Seraph@reddit
I mean, I've been rocking one for a while. I've anecdotally noticed that while there's plenty of VRAM on tap, it's not as responsive or quick as the dedicated AMD 7900 XTX in my desktop. Also, cue the usual "AI is harder with AMD vs Nvidia" (not that I'd know what that's like).
lodg1111@reddit
Counterintuitively, a bigger model doesn't work as well in some tasks. For example, when I was writing a cover letter, gpt-oss-120B acted very unprofessionally, producing subheadings and bullet points; even after I told it to fix that, it kept repeating the error.
I switched to gpt-oss-20B, which was already very decent: proper paragraphs, no short forms, no bullet points, no subheadings. It writes a much better email.
tronathan@reddit
Do these machines suffer from the expensive input token issue that affects MLX, etc? e.g. Are you going to have 60 second time-to-first-token?
amztec@reddit
I will buy it, simple reason:
Why not buy it? Even if I were being cheap, this is a good deal.
Massive-Question-550@reddit
Many MoE models still aren't that great, and the good ones are too large to fit on the 395. Also the prompt processing is very slow compared to most GPUs, and expandability isn't great. If they released a more powerful desktop version that you could connect multiple devices to, then that would really be something you would want to keep long term.
Ok-Hawk-5828@reddit
You’re either playing with it or having it do 24/7 workflows.
If toy: You’re better off using cloud APIs. A 3090 stack is better at most things. DGX is more versatile. Studio ultra seems more fun.
If 24/7 tool: is it really that much more useful than a 64GB AGX Xavier for $250-300?
The 395 fits very nicely into its price range and is very competitive in said range but it seems to be bought by people who don’t know exactly what they’re going to do with it and that is not a good sign.
ijustwanttolive23@reddit
Unless cloud providers start offering true ZDR in the privacy policy you should act like everything is public...
Ok-Hawk-5828@reddit
I don’t care if everything is public if a solution meets my needs.
I use local for context management and 24/7 workflows that are cost prohibitive to run in the cloud. Building out local AI setups for <1% utilization sounds like insanity.
ijustwanttolive23@reddit
Do you feel the same about gaming PCs, your car, etc? There are freedom, consistency, and privacy benefits to owning, even if utilization isn't super high.
Ok-Hawk-5828@reddit
I've never had a gaming device since I was a kid. I have a truck that does what I need it to do, and if it doesn't, I rent instead of buying more cars or something that could meet every imaginable need. Utilization is everything when it comes to quickly depreciating assets.
StyMaar@reddit (OP)
Thanks but no thanks. I'm not surrendering my privacy for convenience or small savings.
Except at real estate, power consumption and noise. That's a big deal since my computer is in the living room…
Wat? DGX is ARM (so no gaming) and will likely only support a custom Nvidia distro based on an obsolete Ubuntu. It may be a much better machine for ML scientists, but it's the opposite of “more versatile”.
xeikeo@reddit
I just put a pre-order in for a frame work desktop 😭😭😭 maxed out.
audioen@reddit
So it's a nice general desktop computer: it can do gaming, and LLMs in a limited way for one user, and I appreciate the compact size and silence, which is what I get using the HP Z2 Mini G1a box. It's more like $4000, though.
I predict that in about 1 year, it will be considered a lower end computer for LLM. So within a year, these AI Max boxes will start to look like paperweights and the competitors will likely be several times faster and some probably ship with more RAM. The Ryzen 395+ is kind of discount Apple -- more attractive performance/price ratio, but that's about it. Slow compute means: stable diffusion is slow, video generation is slow, dense LLM models are slow. It can literally only run MoE stuff and very small < 10B models at usable speed.
I love my little computer, but I'm expecting to be selling it within the year.
ijustwanttolive23@reddit
Correct me if I am wrong, but a Mac Studio would be noticeably faster, right? And if you get the 256GB or higher, your options are a lot better?
StyMaar@reddit (OP)
Indeed, but it would be much more expensive. I could almost buy two 5080s to put in the Strix Halo's PCIe slots at this price.
$5600 though, that's close to three times the price of the 395 I'm talking about.
Also I don't want to tinker with Asahi Linux and I'm not going to run a closed OS on my computer.
ijustwanttolive23@reddit
True. I forget the price.
Question: Can the AMD AI Max+ 395 use an external GPU? Could you hook up a 3090 and offload a chunk of a model?
StyMaar@reddit (OP)
AFAIK it depends on which exact model you buy, but the Framework Desktop has a PCIe x4 socket (+ 2 M.2 sockets, one of which can easily be repurposed into another PCIe x4).
I'm not 100% sure about that, but I've read plenty of times that the number of PCIe lanes isn't that big of a deal for LLM inference, as there isn't much data to move from one GPU to another, but don't quote me on that.
cidiousx@reddit
I saw this thread and am debating getting a box myself.
Currently I'm running a full-blown server with an Intel 245K + 96GB 7200 + 2x A5000 24GB (48GB total). If I wanted to expand the VRAM, I'd have to go for some (sketchy custom) SKU 4090 48GB cards or break the bank.
I bought the A5000 cards for $1100 USD new a pop and they go for $1300-1350 USD here locally now second hand. I could sell them and make a profit. In return I could grab a MAX+ 395 box with 128GB soldered RAM.
The benefit of that swap would also be freeing up a lot of the other parts in the build that could strengthen some of my other server builds.
My gripe with it is that the RAM is soldered and nothing about the box can be upgraded. It will just depreciate, and without being able to upgrade it, it's not a long-term solution.
Any thoughts?
DevDuderino@reddit
9/10, would highly recommend. Finding so many random uses for cheap AI I didn't know I had before.
NegativeKarmaSniifer@reddit
Can you list a few? I'm just curious.
DevDuderino@reddit
Let's see.
Podcast transcription and summarization, I use a combination of whisper.cpp and gpt-oss:120b to extract a structured representation of the segments and topics covered.
FM radio monitoring, have an rtl-sdr tuned to a local news station. Every 60s clips are processed through a whisper->llm pipeline, anything 'important' is included in an hourly news summary I get via email.
Light 'agent' work. Crush CLI running with QwenCode 30b works surprisingly well with MCP servers (for simple tasks like DB queries).
I still use gemini+claude for long-context tasks where accuracy is important. Being able to just run batch tasks constantly without worrying about token costs has been great.
flammafex@reddit
> FM radio monitoring, have an rtl-sdr tuned to a local news station. Every 60s clips are processed through a whisper->llm pipeline, anything 'important' is included in an hourly news summary I get via email.

I get that, but how do you test for importance? Simple keyword matching? LLM judges?
DevDuderino@reddit
So what I do is check against a running list of "news topics" I have stored in a SQLite DB.
Right now "important" means either something completely new that isn't in the existing topic tracking, or new information matching a Google Alerts-style subject. So the Charlie Kirk stuff popped up since it was a new story. "National guard deployment Chicago" is a subject I have set up to be included as an immediate alert.
It's all done through rules I have set up in SQL, so once the structured data is extracted it's fairly simple to generate the "digest".
flammafex@reddit
ty
DevDuderino@reddit
The LLM is mostly just transforming the raw audio into a structure I can script against
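Roughly, the capture side is shaped like this; just a sketch, with placeholder frequency and model paths:
```
# grab 60 seconds of FM audio and transcribe it with whisper.cpp
rtl_fm -f 101.1M -M wbfm -s 200000 -r 16000 - | \
  sox -t raw -r 16000 -e signed -b 16 -c 1 - clip.wav trim 0 60
./whisper-cli -m models/ggml-base.en.bin -f clip.wav -otxt
# the resulting .txt transcript then goes to the local LLM for topic extraction
```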
mlexx@reddit
curious too
No_Bake6681@reddit
$2000 is a lot of tokens
StyMaar@reddit (OP)
That's a lot of free training data to give to the corporation hosting the API for sure. I'd rather keep that for myself.
No_Bake6681@reddit
Ah ok so privacy is a key goal of yours.
What is your usecase? Coding?
StyMaar@reddit (OP)
Yes. Not that what I do with LLMs is necessarily confidential (though sometimes it is, because it contains customers' information), but mostly as a matter of principle.
A bit of coding, but lots of various stuff LLMs work well at: getting unstuck on the first draft of an important email or other written document (it doesn't matter if pretty much all of it gets rewritten, the LLM is there to overcome the “blank page syndrome”), getting rephrasing suggestions, summarizing technical documents, extracting relevant information from logs, building valid JSON requests for an API from natural language input, etc.
I've been using a mix of Mistral, Gemma and Qwen, with a good chunk of the work being done on CPU because I only have 8GB of VRAM.
As you can guess, it's pretty slow, and I've been contemplating buying a better GPU for a while, but I've been disappointed by new models and didn't feel great about buying a used card, given that they sometimes have defects that can be hard to detect.
Soggy-Camera1270@reddit
Yeah but it's not just an AI tool either, is it.
No_Bake6681@reddit
Sorta agree... as a reminder, OP wants to be dissuaded.
If having another laptop-class computer is useful, then sure.
If all they want to do is build with AI (same), then I've surmised that the 395+ is more of a novelty and a distraction from that true goal.
Yo op, what are your goals?
Ornery-Delivery-1531@reddit
The software is not there yet; wait for the final ROCm 7.0/6.5 and for other libraries to be stable on it. This thing needs to be good not only for LLMs, but for other use cases like TTS or Stable Diffusion too.
So, once this happens, Nvidia Digits may finally ship and it might be a better choice then.
Rich_Repeat_22@reddit
Digits is double the price, has a pathetic mobile ARM CPU, and is restricted to an Ubuntu-based ARM NVIDIA OS, since drivers don't exist even for Windows on ARM.
StyMaar@reddit (OP)
Digits still isn't out. Also, it will allegedly be twice as expensive, and it's ARM (so I couldn't use it to play occasional video games) and will likely require a custom Nvidia distro based on an old Ubuntu.
Thanks but no thanks.
tarheelbandb@reddit
Just got my shipping update from Bosgame on mine. Despite the site's shipping page saying local US shipping, the tracking says it's coming from Hong Kong. Maybe it will still get here in 5 days.
randomfoo2@reddit
It's your money and time, but here are some downsides/gotchas to be aware of that I outlined here: https://www.reddit.com/r/LocalLLaMA/comments/1nc0dgg/comment/nd87zji/
At the end of the day, you're basically buying a Radeon RX 7600 w/ 120GBish of memory, so those are going to be your pros and cons. If you can get what you want working on a not-well-supported RDNA3 chip and don't need a lot of compute or memory bandwidth (and get a wicked fast CPU as well) it's a good deal. If you're expecting anything as nice as an experience you get from an Nvidia card, or don't want to spend your time poking around to get things working, then maybe you might be better off with other options, but you'll end up with a more complex/power hungry or more expensive system (or both).
There's a dedicated Strix Halo Homelab wiki and Discord you can visit to learn more: https://strixhalo-homelab.d7.wtf/ (there's an AI capabilities section there that goes into detail on what works/doesn't).
Clipthecliph@reddit
Cause you gotta double it, be future proof!
Quick_Rest@reddit
It lets you run those models, but not at "decent" speeds. This means it's mostly great for testing, but you'll really want a larger GPU cluster or opt for cloud-hosted options to get any real work done (e.g. looking at code).
I have a Strix Point (AI HX 375) laptop with 64 GB of RAM. It's about ~1/3 the GPU performance and ~1/2 the memory bandwidth. Sure, it can run models like OSS 20B, but only at ~25 TPS.
BillDStrong@reddit
Because what is coming is going to be so much better.
In seriousness, if you need it now, then get it now. If not, then don't. I would suggest you make sure to get a platform that offers some expansion capability, such as the Framework with its x4 PCIe port, or one of the models that has OCuLink ports. This will allow you to add additional GPUs later.
Why would that be important? AMD, for instance, in its next gen RDNA 5 GPUs will be pairing with LPDDR5 memory instead of the much more expensive GDDR6/7, so they will be able to offer cards with much more VRAM. GPU VRAM is the only way you have of expanding the amount of memory on these machines, so if you want to run a model that is much larger, you need a solution.
Now, AMD has not announced sizes yet, and these are scheduled for more than a year or so away, but their use of essentially desktop memory means they, or their partners, could offer a card with memory on the order of professional cards at much cheaper prices. Even the threat of this should force Nvidia's hand on memory, so with future-proofing you win, but if not, you don't.
StyMaar@reddit (OP)
Very interesting, where did you get that from?
BillDStrong@reddit
Watching all the leak videos. In this case, Moore's Law Is Dead, who has a very good track record, released it, and others have confirmed that is what the documents they have seen say.
momono75@reddit
Okay. I will try to stop you.
Z.ai has a fixed-price subscription plan for GLM 4.5. It's $6 per month, with more than three times the limit of Claude Code Pro.
StyMaar@reddit (OP)
I'm not willing to give up my freedom to save a few bucks. Cloud LLMs are a non starter for me.
momono75@reddit
How about Nvidia's Spark, if you don't mind the cost so much? And even if you don't choose it, other products' prices might go down once the Spark is released.
StyMaar@reddit (OP)
ARM means I can't use it as my daily driver (I play a small but non-zero amount of video games). Also, the custom Linux distro thing on previous equivalents from Nvidia doesn't inspire much confidence.
momono75@reddit
I see. Yeah, that will work better than the usual gaming mini PCs, or Steam Deck. I failed to stop you.
StyMaar@reddit (OP)
Also, did you see that: https://www.reddit.com/r/LocalLLaMA/comments/1ndz1k4/pny_preorder_listing_shows_nvidia_dgx_spark_at/
If that's true then it means the Spark will be as expensive as a 395 and a 5090 combined!
I'd buy both rather than just the Spark any day if I wanted to spend $4300 …
momono75@reddit
What?! So expensive... I'll buy the 395 now.
FabioTR@reddit
Two reasons:
1) ROCm support for Strix Halo is still not perfect. It will definitely improve in the next months.
2) I think some Chinese producers will soon release Strix Halo motherboards, with RAM and CPU integrated, and you will be able to get better and cheaper systems.
prusswan@reddit
If you can afford to and have some idea of how you can realize the value, just go ahead and do it. Time is money.
$2k is really not a lot, as long as you put it to good use.
randomqhacker@reddit
$1000 difference from 32GB to 128GB models tells you they are charging way too much right now. At least wait for Black Friday.
gofiend@reddit
Really want someone to make a mATX version with like two PCI x8 slots. 128GB + 2x 24-32 GB GPUs would be absurdly powerful.
StyMaar@reddit (OP)
You can't really have that many PCIe lanes unfortunately, AFAIK there are just 12 lanes available on the CPU.
But you can still use two PCIe x4 slots if you repurpose one of the M.2 slots.
I have no idea how much PCIe lane width impacts performance for LLM inference though.
Educational_Sun_8813@reddit
It will be fine, you just have to load the model. If you already have an Nvidia card, just run nvtop to see how little memory transfer happens during inference.
StyMaar@reddit (OP)
That was my intuition but using nvtop is a good idea actually, thanks.
gofiend@reddit
Yeah repurposing would be fine. Would be nice if someone turned this into an integrated motherboard that did the repurposing. Seems silly to have tons of storage bandwidth (NVME is cheap) when you could stick another GPU on. Pretty sure it's a net win (especially with MOEs) even if you have slow PCI (since you won't be using tensor parallelism anyway).
NWDD@reddit
Since most boards are ITX or smaller, you should have space in a matx case to repurpose for 2 gpus without much of a hassle. If you're willing to sacrifice all M.2 it might be possible to run five gpus (three at 64gbps and two at 40gbps).
The 64Gbps bandwidth is high enough that you shouldn't notice it other than for model loading / hot-swapping (or games that stream a lot of assets), and you should still have enough SATA headroom (up to 24~32 Gbps with proper RAID, depending on the board you're using).
To me the most annoying thing is Framework Desktop having the capped pcie slot, chinese manufacturers selling motherboards without using most connectivity and minisforum shipping a full computer instead of a standalone motherboard. It's criminal.
gofiend@reddit
Not sure it's reasonable to run GPUs at under PCIe 4.0 x4? Otherwise, it's a great idea.
+100 re flubbing the PCIe slots. Heck, add $500 to the device and just give us a bunch of slowish PCIe slots instead of M.2.
NWDD@reddit
it is not optimal, but good enough:
- It was common for homelab setups to train local GANs and other ml models https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
- When loading a model, streaming from disk will be bound at 64gbps (so if you are trying to optimize model loading and you'll be loading them from disk, there won't be a significant difference between loading from a pcie4 nvme into a x4 gpu vs loading into a x8 gpu)
- When gaming, streaming from disk in pcie4 is bound at 64gbps.
- When gaming, for other operations (like passing vertices or indices to the GPU) you still have ~133 megabytes per frame at 60fps, so it shouldn't have an impact on most AAA games (where the world is mostly static) or indie games. There doesn't seem to be a lot of information specifically benchmarking something like a 5090/4090 at PCIe 4.0 x4, but you can see that people were gaming just fine on PCIe 3.0 x8, which is equivalent, with GPUs from back then, like the 1080, that are still good enough to play most games: https://youtu.be/XJuj16gRoBI?si=FSYukwpwqeuyEEvs&t=311
fallingdowndizzyvr@reddit
There are 16 PCIe lanes.
If you only repurpose one NVMe slot, then you only get one x4 slot. If you want two x4 slots, then you have to use both NVMe slots. Which is reasonable since you can just use a USB drive.
Eugr@reddit
Framework Desktop has two M.2 slots and one PCIe x4 slot on the motherboard.
pieonmyjesutildomine@reddit
Can someone help me figure out what PC this guy is talking about? I thought they were talking about the new Beelink with the AI Max+ 395 and 128GB, but as I read I think it's a different one. Are Beelinks good for this same use case?
StyMaar@reddit (OP)
I'm not stuck on a particular model, even though the Framework has a PCIe port which can be convenient in the long run as it leaves room for extensibility.
Potatomato64@reddit
Medusa halo with RDNA5 coming in 2027 might be significantly better than strix halo with RDNA3.5 as its also a first generation product line
StyMaar@reddit (OP)
2027 is way too far away in the future for my patience though.
slvrsmth@reddit
If you don't need it, wait. The LLM field is too green yet, too much churn is going on. Great hardware today might gather dust next week.
If $2k is not pocket change, and you don't have very clear use case, wait. It's an expense. When the field gets boring and slow to change, you might position it as an investment.
kacoef@reddit
maybe you want cuda.
kwa976@reddit
What OS are you all running? Linux mostly?
StyMaar@reddit (OP)
Linux exclusively.
gthing@reddit
What's the best model you could run on that? Figure out how much that model costs from a provider like DeepInfra, and how many billions of tokens you'd have to put through it to make your investment worthwhile (at, say, $0.10 per million tokens, $2k buys about 20 billion tokens). And the API will be much faster.
StyMaar@reddit (OP)
API = no privacy = hard pass.
superdav42@reddit
For the same price you could rent something equivalent on vastai for over a year.
StyMaar@reddit (OP)
I could likely rent a H100 for a year, yes.
CondiMesmer@reddit
Because you will save a significant amount of money and effort just using cloud computing instead. Also upgrades are free that way. Why would you ever buy hardware?
StyMaar@reddit (OP)
You do what you want with your privacy, and good for you if you don't mind having no control over “upgrades” (thinking about the people crying about the disappearance of GPT-4o…), but what are you even doing on this sub?
squareOfTwo@reddit
maybe there will be GPUs which are cheaper overall in a year or two.
StyMaar@reddit (OP)
I'm not holding my breath…
SilentLennie@reddit
Compatibility.
Check out this guy:
https://www.youtube.com/watch?v=wCBLMXgk3No
StyMaar@reddit (OP)
Will give it a look, thanks
Xamanthas@reddit
Because you never buy the 1st gen of anything. Ignore the rest imo. Only suckers willing to part with their money buy 1st gen.
_hypochonder_@reddit
>the 128GB of VRAM for $2k is unbeatable.
You can build a system with 4x AMD MI50 32GB for under $2k and it would be faster.
>https://www.reddit.com/r/LocalLLaMA/comments/1lspzn3/128gb_vram_for_600_qwen3_moe_235ba22b_reaching_20/
StyMaar@reddit (OP)
Sure, but it will be much more noisy, consume a lot of power and take a lot more space. I don't really want a bulky workstation in my living room.
lost_mentat@reddit
We need food, water and shelter. That's all we really need; perhaps some need human company, though this is not universal. So I didn't need to buy the RTX 6000 Pro Blackwell with 96GB of VRAM, a 32-core Threadripper and 256GB of ECC RAM, but I did anyway. Because now I can run scientific simulations and wind-tunnel simulations that others are doing more competently, and I can locally run various LLMs, which I don't need to because I can talk to more advanced LLMs via API for pennies. I didn't need it but I wanted it. And now I want more…
ThenExtension9196@reddit
Cuz it’s not nvidia.
StyMaar@reddit (OP)
I'm asking for argument against the purchase, not in favor.
spaceman_@reddit
I bought a 64GB laptop with the chip before summer, because I thought "40GB will fit every model I can run at reasonable token rates anyway; whatever the 128GB can fit that the 64GB can't would be too slow to be useful."
I have regrets now.
FlyByPC@reddit
If you don't need it now, prices for tech generally go down, historically. I'd expect the same tech to be less expensive / more capable, next year.
It's generally always a great time to buy new tech compared to last year, but generally always a terrible idea compared to buying next year -- unless you could really use whatever it is, now.
alpha_epsilion@reddit
Generation 1 product is often for suckers and guinea pigs
Technical_Ad_440@reddit
Wasn't it that the Nvidia one does 20 tokens or something a second while the current AMD one does 2 tokens a second, and the new one is gonna be 4 tokens? They can run big models, but really slowly, is what I got from it, unless I was looking at the wrong info, which, with how many sites are random BS articles, is very possible, probably likely.
If these are indeed fast and can run LLMs and video models as well as image ones, I will buy one. But I think that was the other flaw: they mostly run text LLMs; for video you still need Nvidia and such, unless that was also some more BS I came across when looking into AMD AI rigs.
I was looking at Nvidia's Jetson server stuff to do video and images, that's what I was recommended; it seems affordable, can get 128GB, and is very upgradable at around 3k.
dispalt@reddit
Hey there, AMD CEO here, you should definitely buy one
pengy99@reddit
It's essentially a first gen product. I would expect similar solutions over the next couple years to get drastically better with higher bandwidth and possibly even more memory. Unless you have a real professional use case I would just wait.
Freonr2@reddit
Probably great for gpt oss and glm air as you state, and any potential future 80-200B MOEs, particularly low active%. Could be a solid gaming desktop if you dual boot win/steamos. CPU is strong, GPU is at least very solid.
Probably will suck for everything else, like diffusion models. No cuda, worse compatibility. Less community knowledge/support.
seppe0815@reddit
Sitting in the Apple ecosystem, but the memory speed is too low for video and image gen on this Ryzen... but anyway, have fun.
flanconleche@reddit
Dooooooo it, got my Framework Desktop and loving this thing to death.
green__1@reddit
so, as a complete newbie who knows nothing, I'm curious about this. so far everything I've read says Nvidia is the only one ever worth considering for any generative AI workloads, and that discrete GPU with dedicated vram is the only way to go.
but those same people say that the only thing that really matters is vram, and 128gb is a heck of a lot of that compared to any dedicated nvidia card within a lightyear of that pricing ballpark.
I'm just in the process of scraping together the entrance fees for the localllama club, but something like this sure feels like a nice option! (looking for a lot of general purpose llm plus some occasional image gen, on something that can also run a good desktop Linux distro as my daily driver to replace my current PC that is old enough that it still stores data by chiseling it onto some tablets)
camwasrule@reddit
Please don't forget the big issue is prompt processing, everyone. The larger the conversation gets, the longer the prompt processing takes too. Probably better off running a 32B model with that much VRAM and using the rest of the VRAM for the KV cache.
Morganross@reddit
REAL ANSWER:
At best you get 2.5 million tokens per DAY (that's roughly 30 tok/s sustained around the clock).
Even if it is cost-neutral over the long term, it's a drop in the bucket compared to how many tokens you'll use per day.
Clear-Ad-9312@reddit
Why you should not: it is completely overkill for most people, but at the same time only enough for entry-level MoE models like the ones you mentioned. It is also possible to wait 2 to 3 years and get a system that will absolutely be way better but cost a bit more, or the same as this one. On the other hand, API costs are way cheaper than local.
Another downside is that it is completely stuck with the current configuration. You can't realistically upgrade the CPU or RAM; plus, once DDR6 comes out, the benchmarks will completely shift toward CPU+RAM being a viable choice vs the very expensive GPUs.
mr_zerolith@reddit
It's like half the speed of a 5090, and a single 5090 doesn't currently have the compute power to run a substantial model (~100B).
I'd honestly buy a 5090 today but in the future, the better hardware is:
- Nextgen Rubin-based nvidia
- Apple M5
But today, you will regret buying anything but the strongest thing you can get your hands on for under 10k
Don't bet on Qwen reducing your hardware needs by 4x, that probably isn't happening, and Qwen3 isn't that smart compared to other models in the first place.
AlwaysLateToThaParty@reddit
Half the speed? More like one eighth? 250GB/s vs 1750+GB/s? I'm pretty sure that's the bandwidth comparison.
mr_zerolith@reddit
Hmm i'm thinking more of tokens delivered per sec rather than just bandwidth
It is a different architecture after all and the unseen variable is latency.
By tokens/sec i'd consider this hardware on the weak side.
AlwaysLateToThaParty@reddit
Tokens per second pretty much directly relates to memory bandwidth (roughly, decode speed tops out at bandwidth divided by the bytes of weights read per token: 256GB/s over a ~60GB dense model is at best ~4 tok/s, which is why low-active-parameter MoEs are the sweet spot). So yes, it is on the weak side. I mean, there is a pretty good use case for playing with it, but anything relating to image creation uses CUDA, so Nvidia; if you want to load a medium-sized MoE model, though, it would be fun to play with. Not very fast, but fun.
Euphoric_Ad9500@reddit
Poor memory bandwidth
superminhreturns@reddit
What problem are you trying to solve with an LLM? We all love to have the new shiny toy, and if you have the budget then sure, why not. But you have to ask yourself what you are going to use this for. Would it not be cheaper to use a cloud service ($20 bucks a month)?
If it’s going to be potentially utilize it to make money (start a business) and you need to practice then go for it. If not, really think why you need it.
Don’t get me wrong, I’m a strong advocate for local LLMs. I have a dual 3090 at home, 2 RTX 6000s at work, and all of my personal development SFF PCs have at least 16GB of VRAM for LLMs. But all of my devices have a justification: email rewriting, promotional messages, text summarization, sentiment analysis, etc. Note: for code generation I still use a cloud service (Claude Code).
Basically think hard (no pun intended) on why you need this new shiny toy.
tarheelbandb@reddit
I am absolutely a novice "vibe coder". I hit Claude's $20 ceiling every night after about 2 hours. IMO the cloud services are where you start, on-prem is what you get when you want more, and cloud is what you need to scale. Like so many things, cloud services seem really geared towards enterprise use cases or deep-pocketed hobbyists.
superminhreturns@reddit
It really depends. My motto is if I can use local llm to do the job, then go with local llm. An example: I used to spend around $400 a month for translation using cloud service. I got smart and utilize local llm to perform the translation.
tarheelbandb@reddit
I mean, you are saving that much every year. Such a great use case that pays for itself in less than 6 months.
I used to manually transcribe audio and video recordings for practically peanut$. I wouldn't be surprised if I contributed to that, directly (straight up translating your files) or indirectly (providing the training data for whatever LLM now provides that service).
zipzag@reddit
Claude is the most expensive, Grok the least. You get a lot of tokens with Gemini too. Then there's the Chinese models on OpenRouter.
My Youtube TV subscription cost almost $90/month. LLMs are cheap
Educational_Sun_8813@reddit
you don't need justification for free software and 120W TDP
atape_1@reddit
Because we are still waiting for the Nvidia thingy, that will maybe, someday launch.
-dysangel-@reddit
Soon after that was announced, the Mac Studio with 512GB VRAM and better bandwidth than the DIGITS came out, and I opted for that instead. Glad I did not wait!
xxPoLyGLoTxx@reddit
Super jealous! It's definitely on my "upgrade someday" list. I can't wait to see what m4 ultra brings.
-dysangel-@reddit
Iirc they're skipping the M4 Ultra? So the next one is probably the M5 Ultra
xxPoLyGLoTxx@reddit
Oh really? That's interesting. Cripes by then we'll be at 1tb unified memory lol.
beragis@reddit
An Ultra is basically two Maxes bonded together, so an M4 Ultra would have had around 1.08TB/s if it had come out, and with an M5 I would expect around 1.25TB/s.
That would be around 25 percent above the 4090's bandwidth. Not sure how the Ultra's 80 GPU cores compare with CUDA cores, but the extra headroom would likely put it near a 4090 or possibly even an H100, with far less power draw.
zipzag@reddit
1-1.5 TB/s is apparently what can be made later next year. But I don't expect new SOC type builds at that speed before 2027.
xxPoLyGLoTxx@reddit
Definitely good for competition. I love having the extra vram you get with Mac versus a single discrete gpu. Even though the gpu is rated faster with memory speed, it’s often slower in the real world because part of the model gets offloaded to ddr4 or ddr5.
zipzag@reddit
M4 Ultra thermals apparently wouldn't work in the form factor.
StyMaar@reddit (OP)
Isn't it going to be twice as expensive for the same memory bandwidth?
Eugr@reddit
It is. It should have much higher GPU performance, so prompt processing should be faster, but on the other hand its CPU side will likely be weaker than the AMD option. And it's ARM-based and officially supports only NVidia's Linux distro (hopefully other ARM distros will run on it too). But if you need CUDA and high GPU compute, it's not a bad option either.
atape_1@reddit
Yep, it won't be any faster. Its only selling point is CUDA; if you don't need it, the AMD machine is great, and also much easier to use, since the Nvidia thing will have an ARM-based CPU.
false79@reddit
I believe the speed will be the same in comparison.
valdev@reddit
This is the only argument I have.
On my 7950x system with 192 GB of RAM I get about 12 tk/s with gpt-oss-120b. And I am pretty confident that with enough tweaking it could be closer to 25 tk/s.
Granted, when I load it onto my 5090, 4090 and 4x 3090s it runs closer to 100 tk/s, but it also threatens to burn down my house and financial future.
The 7950X build not only gives you 192GB of RAM but room to grow and acquire the GPUs.
MingMecca@reddit
Curious to hear about what kind of tweaks you would do with your setup to get it closer to 25 tk/s. I have a 7950x3D with 128gb RAM and I'm always looking for tips on inference speed.
valdev@reddit
It almost all comes down to the RAM speed: optimizing the timings, getting the frequencies as high as they can go. Right now my RAM is set up terribly -- poor timings and only running at 3800 MT/s (of its rated 6400).
Optimizing this is a small nightmare though; anything above 96GB for Ryzen is no-man's-land.
MingMecca@reddit
Yeah those four mem sticks really mess up the memory stability on this chipset unfortunately. Sounds like you're running into the same headache that I am.
rorowhat@reddit
It's worth it, and it's also an excellent gaming machine to boot
bayareaecon@reddit
I’m currently in the process of building something with 4x MI50s. Honestly still kinda new to this, and someone can definitely tell me I was stupid and should have got the AI Max. I'm hoping to keep my all-in cost under $1.5K.
Educational_Sun_8813@reddit
yeah, and then the electricity bill
johnkapolos@reddit
Why not a DGX Spark?
Eugr@reddit
It was announced long ago and still isn't released. And it will be about 2x more expensive.
johnkapolos@reddit
Last time I checked, the base model was supposed to be $3K.
Eugr@reddit
There is an ASUS one with 1TB storage for $3K on the NVidia reservation page. The NVidia one only has a 4TB option, for $4K over there. The ASUS website doesn't list the price at all, so I'll believe it when I see it shipping. I suspect the price will be higher.
Educational_Sun_8813@reddit
you can replace the drive with a bigger one in the ASUS
one-wandering-mind@reddit
Because it is slow, you probably won't use it much, and faster things are coming out like the Nvidia spark and companies building their own versions of that.
But then again who knows when the spark will come out or if it will be at retail so buy it if you want :) then post your performance benchmarks so I can decide on buying it.
Educational_Sun_8813@reddit
All the little Sparks are the same; they come in different boxes, but functionally they will be the same. The advantage of that platform is they have InfiniBand (now owned by Nvidia), so you can bond two units for a very fast 256G interconnect.
CubicleHermit@reddit
Depends. Desktop or laptop?
Desktop? Can't see any reason why not, although I haven't seen pricing for DGX Spark with a comparable amount of memory, and the desktops available are kind of questionable in terms of form factor and build quality.
Laptop? I would if I could. Unfortunately, there aren't any good laptops available with it.
Eugr@reddit
The MSRP for DGX Spark is $4K for the 4TB storage option from NVidia, and the cheapest is $3K for the 1TB version from ASUS.
CubicleHermit@reddit
If so, the Corsair or Framework seems like a relative bargain.
Eugr@reddit
And AMD one is already released, there are people using it, the performance is known, and you can get one now (sort of). The only information we have on DGX Spark is from NVidia website, and it's not very detailed, so who knows what the actual performance would be like.
Educational_Sun_8813@reddit
I'll get my Framework probably in October, to go alongside a few 3090s and 512GB of RAM.
VanagearDevGuy@reddit
The only issue I've been having with mine is getting ComfyUI set up for Qwen generations. I think this is just a time issue (and a me issue :P), since I see on forums people getting generations in less than 30 seconds, so I know the ROCm release later this year will solve my woes. But for local language models and even some Unreal Editor 5 development, it's very nice!
HornyGooner4401@reddit
I don't know about your use case, but OpenRouter gives you 1,000 free requests as long as you have $10 in your account.
Models like GLM 4.5 Air and GPT-OSS have a free tier, and so far I haven't reached that limit.
basitmustafa@reddit
There is no good reason not to. My M4 Max MBP is gathering dust b/c macOS has become so bloated and annoying; having Arch + Hyprland on my 128GB Strix Halo Flow Z13 is just nothing less than a joy.
The_GSingh@reddit
$$$$ (hope this helps) /s
archieve_@reddit
How about a second-hand M1 Ultra Mac Studio? Their prices are similar.
tarheelbandb@reddit
Looooool. Did you read my post from yesterday?
sub_RedditTor@reddit
Memory bandwidth too slow.
Can only add one GPU, via the PCIe 4.0 x4 interface.
Memory not upgradable.
Wait a little longer for a better CPU.
Or build a Threadripper 7000 series system.
__some__guy@reddit
Because it can only run MoE models and it's only twice as fast as a regular DDR5 system.
NeverEnPassant@reddit
Because prompt processing is going to be unusable for larger contexts. A 5090 + fast system ram will be faster in practice for MoE.
GangstaRIB@reddit
If you already have a solution to run your models then it’s probably best to wait. If not have at it hoss.
spookperson@reddit
I was super super close to ordering an AMD AI Max+ 395 128GB this week. I think I have avoided the temptation for now. I'll post some links/thoughts related to my decision to buy or not buy.
I am most interested in batch/concurrency situations (2-5 users or concurrent tasks at the same time). I would be a lot more interested in Strix Halo if it went up to 256-512gb (though here is some data on clustering them: https://www.jeffgeerling.com/blog/2025/i-clustered-four-framework-mainboards-test-huge-llms )
In my mind, here are some alternatives for comparison to the 128GB Strix Halo for LLMs. 1) a regular PC with something like 128GB of ram and a graphics card running ktransformers, 2) a Mac Studio with at least 128gb of ram (running MLX or GGUF), 3) maybe Project Digits but that still hasn't come out yet.
Strix Halo obviously has faster memory than a regular PC with system ram and a graphics card - but here are some sample benchmarks for ktransformers on consumer hardware (Core i9-14900KF + dual-channel DDR5-4000 MT/s) + RTX 4090: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md
Apple hardware is pretty good for a single user at a time but the prompt processing is not super fast and concurrency (multiple users or multiple tasks at the same time) is not easy. Here are some llama.cpp benchmarks for a bunch of different m-series chips: https://github.com/ggml-org/llama.cpp/discussions/4167 and this PR has some notes on llama.cpp speed in high-throughput mode with llama-batched-bench: https://github.com/ggml-org/llama.cpp/pull/14363
So I would love to try Strix Halo for something like ktransformers (connecting a GPU over the PCIe x4 link) or for running vllm to get high concurrency (since vllm on Apple is CPU-only). I found these benchmarks on vllm on the Framework Desktop: https://github.com/lhl/strix-halo-testing/tree/main/vllm (92 tok/s on the highest end Mac for batch size 1 q4 compared to 357ish tok/s on batch size 16 with strix halo vllm q4 - but I don't have bs=16 results to compare for llama.cpp or vllm on Mac).
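For reference, here's a hedged sketch of what a batched throughput test along the lines of those links looks like in vLLM: 16 prompts generated in one batch, with aggregate tok/s measured. The model name is a placeholder, and on Strix Halo you'd need a ROCm build of vLLM.

```python
# Hedged sketch of a bs=16 throughput test with vLLM.
# The model is a placeholder; pick any ~q4 model that fits your memory.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ")
params = SamplingParams(max_tokens=256, temperature=0.8)

prompts = [f"Write a haiku about machine #{i}." for i in range(16)]  # bs=16
t0 = time.time()
outputs = llm.generate(prompts, params)  # one batched generate call
dt = time.time() - t0

total = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total} tokens in {dt:.1f}s -> {total / dt:.0f} tok/s aggregate")
```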
SeriousObjective6727@reddit
There will be something way better coming out next year....
Dtjosu@reddit
More likely: faster and better, at the same price or lower.
Content_Cup_8432@reddit
I think a 256GB version with the 495 will be available at the same price next year, with better bandwidth and ROCm support.
Vatnik_Annihilator@reddit
I'm not going to get one but I'm really excited for the next iteration. Hopefully they will improve the memory bandwidth.
DanielKramer_@reddit
because you will feel sad when better stuff comes out next year
but then, if you buy it next year, you will feel sad when better stuff comes out next next year
redoubt515@reddit
I'm not going to try to talk you out of it, because I think it's a solid option at a semi-reasonable (if high) price.
But you are operating under a false assumption:
You have 128GB of LPDDR5x system RAM. It is not VRAM.
It is faster than your typical system RAM, but slower than your typical GPU memory.
I haven't heard of this before; is this confirmed, or rumor/speculation?
vulcan4d@reddit
It's overrated, and that keeps prices high. It's good as a toy or as an intro to LLMs.
http206@reddit
Because there will be lots of these things coming out this year, some better than others, some cheaper than others, and unless you have a concrete use for one now you're better off waiting a few months.
(This is, at least, what I tell myself)
Healthy-Nebula-3603@reddit
IF it had 512GB or, better, 1024GB, and RAM speeds of 800-1000 GB/s, then I WOULD.
SillyLilBear@reddit
I would recommend against it unless you just want to experiment; it isn't useful for anything practical. I've managed to get 44 tokens/sec on GPT-OSS 120B Q8, which sounds fantastic, but time to first token is still slow due to painfully slow prompt processing, and even a little context starts slowing it down very fast.
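If you want to reproduce that observation yourself, a quick way is to time prefill (time to first token) separately from decode using streaming. A minimal llama-cpp-python sketch, with the model path as a placeholder:

```python
# Minimal sketch: measure time-to-first-token (prefill) vs. decode speed.
# Model path is a placeholder; use any local GGUF you have.
import time
from llama_cpp import Llama

llm = Llama(model_path="gpt-oss-120b-Q8_0.gguf", n_ctx=32768)

prompt = "word " * 8000  # deliberately long context to stress prefill
t0 = time.time()
ttft, n = None, 0
for _chunk in llm(prompt, max_tokens=64, stream=True):
    if ttft is None:
        ttft = time.time() - t0  # dominated by prompt processing
    n += 1
decode = (n - 1) / max(time.time() - t0 - ttft, 1e-9)
print(f"TTFT {ttft:.1f}s, then {decode:.1f} tok/s decode")
```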
NearbyBig3383@reddit
Why spend money on a board when you can pay $10 to an API provider and get practically unlimited requests? Think about it: $2,000 vs. $10 is absurdly more affordable, don't you think?
fabkosta@reddit
Just out of curiosity, which computer would you buy it in?
StyMaar@reddit (OP)
I'd just buy the framework motherboard as I don't really need yet another case…
fabkosta@reddit
Ah, got it.
I must say: I'm one of those thinking the Framework PC is pretty cool.
The only argument against it: the longer you wait, the more bang for the buck you'll get.
But - I am sure you know that already. ;)
fabkosta@reddit
By the way, if/when you buy it, let us know your experience!
dobkeratops@reddit
Could it fine-tune, albeit slowly?
Could it run video models as part of a cluster?
I was very impressed with some LLMs on a humble CPU; I think the AI Max does look like an interesting device.
fallingdowndizzyvr@reddit
The price has dropped since release. I bought at the pre-order discount, but since then it's gotten even cheaper.
FORLLM@reddit
I'm inclined to wait for better, more VRAM. 128GB isn't cool. 1TB is cool. I could be delusional, but I suspect there will be devices that get us there at increasingly reasonable prices in a year or two; the AMD AI Max and Nvidia Spark are encouraging steps in that direction. As much as I'm encouraged by reports about Kimi, Qwen, etc., I suspect I'd be a little disappointed acquiring hardware now, not just in a "hardware is always getting better/cheaper" kind of way, but in a "current hardware doesn't fit my market at all yet" kind of way. Adjacent to that, one of the recent videos I watched on the AI Max mentioned a number of driver issues (sorry, I don't recall which video, though I probably saw it in this sub, if that helps). A couple of years on, I bet those drivers will purr.
I think the hardware and software may reach mutually sweet spots in price and performance in the next couple of years, though. And if Nvidia has enough Broadcom/custom-silicon problems with their big-tech ordering, they may get more eager to repackage silicon and sell it to us nobodies for reasonable prices again. I'd rather spend $5k in a couple of years on something that's bang on what I want than reach now and get hardware that's disappointing on its own, running models that aren't even quite what I'm hoping for yet, with immature drivers. And I want to run audio models, video models, models I haven't even heard of yet. The market I want my AI rig for is still in very early innings.
On the other hand, if you find $2k easier to part with than 2 years of waiting, your wallet may need to just take one for the team. Sorry, StyMaar's wallet!
Antique-Ad1012@reddit
We are nowhere near consumer hardware for useful local AI. Save your money and wait a few years; the technology exists, but it takes time for the industry to adopt everything at scale and make it affordable.
I'm using an M2 Ultra, and I will keep using online services for now.
Few_Size_4798@reddit
Save up 6-8 thousand and wait for Intel Battlematrix; the prices that are appearing now are completely unreasonable, of course.
And then there's the energy consumption.
I'll save up (I'm waiting to see what Minisforum offers, maybe around $1,500), but I'll also buy one for my collection.
Secure_Reflection409@reddit
My bet is that Qwen has cooked that 80B to sit perfectly inside 48GB; that would be my number one reason for maybe holding off.
Maybe treat yourself to both :D
jonahbenton@reddit
The FW is super fast for normal computer use, but for LLMs it's several times slower than Nvidia (3090s, A6000s) in interactive use cases with large models and long context. I think people are getting good results with MoE, but I haven't gotten there yet.
nostriluu@reddit
I'm waiting for a Thinkpad with this chip. A desktop doesn't make sense to me because desktops are about expansion, right? And I'd want a CUDA GPU for some tasks.
StyMaar@reddit (OP)
You have two PCIe x4 slots available (if you repurpose one of the M.2 slots); doesn't that count?
Fair, but I have no use for CUDA personally.
nostriluu@reddit
That's only about 8 GB/s (PCIe 4.0 is ~2 GB/s per lane, times four lanes). You'd be losing a lot of performance from a GPU.
CubicleHermit@reddit
The existing desktops (Framework, Corsair, and a few others) are all minis with limited expandability, as you'd expect from a laptop chip that only works with soldered RAM.
Chance-Studio-8242@reddit
I too am curious about AMD AI Max+ ROI