Why should I **not** buy an AMD AI Max+ 395 128GB right away?
Posted by StyMaar@reddit | LocalLLaMA | 546 comments
With the rise of medium-sized MoE models (gpt-oss-120B, GLM-4.5-Air, and now the incoming Qwen3-80B-A3B) and their excellent performance as local models (well, at least the first two), the relatively low compute and memory bandwidth of the Strix Halo doesn't sound like much of a problem anymore (because of the low active parameter count), and 128GB of VRAM for $2k is unbeatable.
So now I'm very tempted to buy one, but I'm also aware that I don't really need one, so please give me arguments about why I should not buy it.
My wallet thanks you in advance.
atlantageek2@reddit
Just found this thread. These machines have gone up $1k+ since December. I wonder whether OP regrets buying, or is glad they did.
StyMaar@reddit (OP)
Responding from my Bosgame M5, I can tell you that I'm very happy I bought it before it went up.
Now I'm eagerly waiting for the 122B version of Qwen3.6 (I've been using Qwen3.5-122B extensively since its release and I'm very, very happy with it).
Silver-Chipmunk7744@reddit
Take my answer with a grain of salt because I am myself wondering the same thing, but...
I think for low/moderate use, it's probably cheaper to just use cloud-based options to run the big open-source models. But local has the advantage of 100% privacy, and for heavy use it may end up cheaper in the long run.
I also think waiting if possible makes sense, since prices will go down and new tech will eventually come out.
StyMaar@reddit (OP)
That's a good argument, $2k is like 1000 hours of rented H100 in the cloud.
But it also seems to be much more work to manage the LLM running on your own rented GPU, rather than just running llama.cpp locally.
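For anyone checking the math, a minimal break-even sketch; the ~$2/hr H100 rate and 2 h/day usage are assumptions, so plug in your own numbers:

```bash
# Rough break-even: how many rental hours $2k buys, and how many
# days of use that covers at a given hours-per-day rate.
PRICE=2000 RATE_CENTS=200 HOURS_PER_DAY=2   # $2000 total, $2.00/hr assumed
HOURS=$(( PRICE * 100 / RATE_CENTS ))        # 1000 rental hours
DAYS=$(( HOURS / HOURS_PER_DAY ))            # 500 days (~1.4 years)
echo "break-even: $HOURS rental hours, or $DAYS days at $HOURS_PER_DAY h/day"
```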
DavidAdamsAuthor@reddit
This was kind of my problem.
I thought about getting an RTX 5090 for LLM stuff. But that's $4,599 in Australian fake money, or around $2,800 real dollars. For that price, I could rent one for what I worked out to be something like 2,000 hours. Sure, it would be cool for gaming, but I only have 1440p/144Hz monitors, so what was the point?
Assuming I used it for LLM work 2 hours a day (optimistic), that's three years' worth of rentals. In three years, the price of renting a 5090 will presumably go down (or for the same price, I could get something better), so the savings would compound; but in three years, if I bought a card, I would still just have a 5090.
So I bought a 5070 Ti instead ($1,500 AUD), and told myself that if I wanted more, I could spend $1,000 worth of rental credits and still be ahead by like $2,100. So far I've barely used $50.
Honestly if the price came down to below the price of my first used car I could consider it, I'd be content with paying the same amount as a decent cheap holiday; but $2k USD is a LOT for a toy that I just don't use that often.
ImmediateImagination@reddit
The price is now 4000 USD for the Framework Desktop with a standard set of SSD and cosmetics, so it's fast approaching a server-grade Nvidia GPU.
DavidAdamsAuthor@reddit
It's just crazy. That's a ridiculous price.
Ok-Possibility-5586@reddit
Dude that's way too rational.
Freonr2@reddit
Don't underestimate the pain/friction of spinning instances up and down.
StyMaar@reddit (OP)
That's what I'm saying here:
-dysangel-@reddit
You don't buy this necessarily because it's cheaper. IMO you buy it if you want to run regular experiments without worrying about racking up API costs, or if you want to be able to run offline etc. I think it's a good investment if you are seriously interested in running locally. Models and algorithms are only going to continue to get more efficient. So this machine should do you for a while. If you can wait though, it *will* likely be cheaper to just run in the cloud for now, and then buy in a couple of years once all the hardware providers are competing more on VRAM
EdanStarfire@reddit
A lot of my early Subagents experiments were done on local LLMs for exactly this reason. I knew I was gonna be throwing tokens out the door and could afford for it to be slower, so let it crank overnight on things I was working on instead of paying for who knows how much API usage.
Silver-Chipmunk7744@reddit
I'm also thinking of service providers that let you pay per prompt for any LLMs. For open source models it's often less than a cent per request. This is also insanely easy to use.
The problem is probably for people who want 100% privacy.
sudochmod@reddit
Aside from the LLM aspect. It's decent at running games and has a low power draw. I love mine. I hope people buy more of them so that AMD continues to make them.
clare64@reddit
did you have any issues or bugs thus far? and aside from gaming have u tried any local llm stuff?
sudochmod@reddit
I do a ton of local LLM and it’s great.
clare64@reddit
Awesome thank you for sharing!! Ever tried image or video Gen?
Educational_Sun_8813@reddit
i'll get my framework in october! and will start contributing to rocm solutions
pn_1984@reddit
Came back here after 3 months, crying. The RAM alone is now selling for what the whole mini PC cost then.
nickthecook@reddit
Look for a “what’s the best uncensored roleplay model” post from OP in a month. :P
Silver-Chipmunk7744@reddit
Of course lol
But tbh i don't think these service providers care about NSFW unless you are doing something super illegal.
my_name_isnt_clever@reddit
Except in the current American political climate payment processors are cracking down on "adult" materials. I wouldn't trust any US providers.
DavidAdamsAuthor@reddit
It depends on how NSFW.
Gemini is pretty easy to trick into writing the most insane smut you could imagine, but it also is just one model, and a model with some flaws too. Once you've gooned yourself to the point that your body is basically a dried raisin and you've misplaced gallons, you need fresher, more electric fields to explore.
DavidAdamsAuthor@reddit
The AI revolution is carried on the backs of heroic gooners.
rishabhbajpai24@reddit
I don't think the prices for these devices will go down in the US, if only because of the tariffs. Only Chinese mini-PC companies are selling them at under $1900; with future tariffs, the price will go up. US-based companies are already charging a lot.
false79@reddit
The Max+ 395 is probably the best CPU you can get in a small-form-factor device, especially since its single-core clock speed can knock out full desktop processors.
However, the reason I'm not getting it is that the price is premium and the memory bandwidth is a mere 256 GB/s.
My discounted AMD 7900 XTX has 3x the bandwidth. Sure, it's only 24GB, but there are a lot of useful models that fit within 24GB of VRAM.
Few_Size_4798@reddit
I have a 7900 XTX too, but it's like a stove, 300 watts and all that, and they promise quiet operation with it.
ImmediateImagination@reddit
You can downclock the GPU to around 100 watts; it's better for the hardware and still beats the crap out of the Max 395.
Vektast@reddit
just power limit it.
false79@reddit
What you running these days? I've always got qwen3-30b-a3b-thinking-2507 if not qwen3-4b-thinking
Few_Size_4798@reddit
mainly for SD and pixtral tasks
clare64@reddit
are there other options with higher gb/s currently available? or are there confirmed releases? curious what made you want to pass on 256 gb/s...
false79@reddit
I passed on the 256GB/s because I'm coming from the world of GDDR6's ~960GB/s memory bandwidth.
gpt-oss-20b flies at 170tok/s and is meeting 95% of my needs through thoughtful prompt engineering and context management.
tarheelbandb@reddit
What does that extra bandwidth actually get you in terms of productivity? I think it's an important question to answer, especially in the context of a single user.
false79@reddit
Uhh, like everything? As a coder, you need to iterate quickly back and forth. Right now, I offload 10% of my work a day to agents, sometimes 20%. What would take days, I have solutions generated in hours, by breaking a project down into much smaller problems a dense coder LLM can handle, while generating unit tests to go with the artifacts created. The ROI is paid off within a month, if not sooner.
Second to bandwidth (which is related to compute) is context. Having hit limits at 24GB, my appetite for more VRAM has increased, and along with it, one needs even more bandwidth to make effective use of it.
Educational_Sun_8813@reddit
Sorry, but the Strix has a TDP of around 120W, so you can have it running all day.
ttkciar@reddit
My generic advice regarding computer hardware is, "if you can wait, wait; in the long run, hardware prices go down and software support gets better. If you can't wait, then buy."
Since you say that you don't need this right now, you should wait. When you have a genuine need which requires new hardware, buy new hardware then.
cranberrie_sauce@reddit
> if you can wait, wait; in the long run, hardware prices go down
did you post this before ram and nvme prices skyrocketed 4x?
SpecialNothingness@reddit
Just two more years.... hold it... it'll get goood... totally worth it....
lmneozoo@reddit
Now 128gb of ram costs as much as these machines 💀💀💀
nashfrostedtips@reddit
Late post but I got mine for $3200 CAD in, I think, late November, it's now $4800.
Neborodat@reddit
That did not age well.
Em-tech@reddit
Glad we waited and saw prices go up significantly. I should have gotten in on the framework desktop q3 launch. Ughhhhh
roughseasbanshee@reddit
this is kinda the right answer. i've waited four years, i've waited six months, and either way, a new laptop that makes me salivate will be out in weeks. i've decided i'll buy it if i get this job. i also know that if i don't get it i'll wanna rope so i might buy it anyway to cope 😇
Iseus1024@reddit
this tip didn't age too well :D
Objective_Mousse7216@reddit
RAM going up and up in price.
night0x63@reddit
This is what I do... but not because I'm responsible... because I'm lazy... it is a pain to buy and replace one daily-driver PC with another.
HanZolo916@reddit
how about nixOS ;)
findus_l@reddit
About that...
ttkciar@reddit
Heh, yeah, I did not foresee hyperscalers buying all the RAM and sending prices soaring.
In the long run I posit that hardware will get cheaper, though, but maybe not for years now.
findus_l@reddit
Surprisingly, the Bosgame M5 (Ryzen AI Max+ 395, 128GB) doesn't seem to have increased in price yet. I'm seriously considering getting it before they notice... But why hasn't it? It's strange.
Daniel_H212@reddit
This honestly helps. I had the same question as OP and now I realize that yeah, I don't *need* it, so why not wait for better? The 395+ is, at the end of the day, a bit of a first-gen product; why should I be helping them beta test it? Who's to say there won't be better products with more performance, higher bandwidth, and more RAM coming out a year from now?
ASYMT0TIC@reddit
Excepting 3090's, which some people bought in 2020 for gaming, used for three years, and then sold for a profit. That flipped the script in a way I don't think we've seen before.
ttkciar@reddit
Yep, you're right about that. I've been in the hardware game for forty years, and that kind of reversal is rare.
Nevertheless, I posit that in the long run the trend is for hardware to get cheaper as it ages (until it becomes "vintage" or hard to find, at which point it starts getting more expensive again).
Purplekeyboard@reddit
That used to be the case, but hardware isn't really getting much cheaper or better over time any more.
bolmer@reddit
AMD's next APUs are going to be really good for medium-size LLMs, and they won't use expensive VRAM.
Tai9ch@reddit
It certainly feels that way compared to 20 years ago.
On the other hand, hardware right now is improving faster than it was 10 years ago. We had quad-core 2.5 GHz laptops with 8-16GB of RAM for far too long.
ttkciar@reddit
Admittedly, it depends on the hardware.
I've been tracking MI210 prices pretty closely. Two years ago they cost about $13,500, and today they only cost about $4,500.
On the other hand, DDR4 LRDIMM prices have been pretty stable for years now.
Slightly older processors come down in price pretty rapidly, but new processors stay high for a while and very old processors hold steady for several years.
Silver_Jaguar_24@reddit
This is what I keep telling myself every year with smartphones and now I am still stuck with my 2018 Chinese brand phone, waiting to get the best upgrade lol.
tiger_ace@reddit
honestly this is one of the most exciting times in technology for a while
if you view $2k as both a learning AND entertainment investment, then it's incredibly cheap given that you get hardware too
i don't think we should view things right now purely as "hardware gets better over time" like smartphones
Bakoro@reddit
At some point you just have to pull out the wallet.
Do what you can to time the purchase(s), but it's easy to keep putting it off, and eventually you're managing with ancient hardware saying "it's good enough", not understanding how much better you could have it.
I remember when the Pentium 4 was on its way out, and then years later, when it could technically still compute, it literally made no economic sense to keep the thing working. It got to the point where the cost of electricity it used every year alone justified the purchase of a new CPU and motherboard.
You have to keep the total cost, the opportunity cost, and your quality of life all in mind.
RevolutionaryAd7360@reddit
How did this age?
StyMaar@reddit (OP)
Pretty well actually, Qwen3.5-122B fits with its full context length on Q4, and the performance is good IMHO (both in terms of tps and in terms of capability).
RevolutionaryAd7360@reddit
I was thinking these look pretty sweet for the money. I'm late to the game and just bought 2.
Not sure why they aren't more popular, but I'm never going to be confused for the smartest guy in the room. I assume I'm missing something.
NewDependent8219@reddit
Considering the upcoming RAM crisis, I hope your procrastination didn't save your wallet and you bought it. And I also hope you are happy with it.
StyMaar@reddit (OP)
I ended up procrastinating for a while, but then seeing the RAM price skyrocketing was enough of a nudge and I eventually ordered one. I'm supposed to receive it by next week.
bebetterinsomething@reddit
Which one did you order? I see the Framework going for $3k with tax and shipping on top of that.
StyMaar@reddit (OP)
I ordered the bosgame M5, which I got for 1600€ with tax and shipping. But it's now much more expensive as well.
Ornery_Cockroach_824@reddit
I think this could help
https://www.amd.com/en/developer/resources/technical-articles/2026/how-to-run-a-one-trillion-parameter-llm-locally-an-amd.html
profcuck@reddit
This is an absolutely great thread. Thanks to /u/StyMaar for posting it - I'm in exactly the same boat. I just keenly read everything here and like you, I'm not convinced out of the purchase.
The most persuasive argument is the one that I've been struggling with, which is "wait, something better will come, it always does." That's true, but it could also persuade anyone at any time to never buy anything. The real question is whether something big is coming soon, or a year or two away. Big difference.
However, there are "solid" rumors about Strix Medusa, the successor to Strix Halo. Announcements appear to be imminent, and Lisa Su is slated to keynote CES in January 2026. My guess is that whatever is announced then, or at the November financial analyst day, will still be a ways into the future.
Therefore, I've concluded that I won't be blindsided and disappointed by something 50% faster/better shipping in December. And... this seems in any event "good enough" for my use case.
I'm running models on my M4 Max 128gb laptop, and it's great. (Mac haters often say it's awful, but it isn't!). But I have some projects where I want to crank away 24x7, and on my homelab setup I'd love to have it open for family members to play with. But my laptop is my daily driver and I'm going back and forth to work and travel and so on, so it isn't really suitable for all kinds of fun stuff that I want to do.
So - now my only question is which one to buy. :). That'll keep me busy for a few weeks anyway, lol.
clare64@reddit
have u tried the m4 max to be able to handle local video models or even the kind of 24x7 'cranking away' you refer to?
profcuck@reddit
I have not used it for local video models or image models other than in the most simple way. (i.e. I got "Draw Things" working and messed with it for a few minutes). I wish I could give advice but I just don't know that space very well.
In terms of 24x7 cranking, I also haven't done that even though I use it quite a lot for LLMs, but just on an ad hoc basis, asking questions, giving a text for feedback, etc. The problem with 24x7 cranking on my actual laptop is that I use it all day every day for work. I could set up some jobs to run overnight I guess, but I haven't done that yet.
Now, for LLMs I can tell you what you probably already know: you can run big models with 128GB of unified ram, and the M4 Max with 128GB of ram is going to be significantly faster than the AMD Max+ 395. But both struggle with prompt processing and therefore aren't necessarily great for live coding as an example. But for batch processes, both should be just fine, and as compared with most people's Nvidia setups which don't really have enough VRAM, at least you can run smarter models, albeit a bit slowly.
I'm definitely in the camp that if your use case aligns, Macs are definitely the value champion for inference. And if your use case doesn't align, well, it may be disappointing!
clare64@reddit
Exactly what I needed thanks! Yes I want to run big models. Not overly concerned about speed on output or concurrent processing ..just want enough power to not rely on the paid image-gen solutions. Did you wait an unreasonable amount of time when u tried out DrawThings?
profcuck@reddit
Just to try to be useful and because I was curious, I just opened it up. When you start a new project, it throws up a random (?) sample prompt. I just clicked to run this one: "a samurai walking towards a mountain, 4k, highly detailed, sharp focus, grayscale". I used the model "Z Image Turbo 1.0". It took 3 minutes and 5 seconds.
I don't play in this space so I don't know if that's unreasonable or not. Doesn't seem exactly fast for interactive sessions but if you wanted to batch generate images overnight it's probably fine. (15-20 per hour?)
The same prompt in Stable Diffusion v2.1 ran in under 5 seconds. The quality is a lot less but depends on use case as ever.
clare64@reddit
yep, quite reasonable for many use-cases. Thx for the benchmark! Will post a review regardless where I land (mac studio or amd)
profcuck@reddit
I didn't spend more than a couple of minutes on testing it at all. It's on my eternally long to-do list!
Creepy-Douchebag@reddit
I want to learn more about LLMs. This is my reason to buy one.
clare64@reddit
same. did u end up doing so?
Creepy-Douchebag@reddit
Hopefully next month Ramen noodles will meet my dietary needs.
paul_tu@reddit
If you need ComfyUI or any other way to generate images, it's going to be painful.
Local LLMs are generally fine (distilled DeepSeek, different flavours of Qwen, gpt-oss-120b, etc.).
Energy consumption rarely reaches 200W, and noise levels are tolerable (for the GMKtec EVO-X2).
clare64@reddit
do you reckon this is insufficient for local video rendering?
paul_tu@reddit
I'd say it's too slow for AI video generation.
Yet it's possible.
I didn't try it for rendering, BTW.
But take into consideration AMD's traditionally poor codec support.
Some surprises, like AVX512 support in the AI 395 MAX, are still possible.
Successful-Put-4899@reddit
You know those days when everybody tried to automate their home with Raspberry Pis and Home Assistant...
This combo (AI capabilities thanks to the VRAM) actually convinces me to take the step... everything controlled with Ollama, Home Assistant, and n8n, and still plenty of power to run your local services like streaming and NAS/backup, thanks to the millions of available Docker images... the chance to shed so many internet headaches. You also have the ability to train your LLMs on your own setup, so you constantly have a low-paid, private assistant that can look through documentation for you, help you with administration, and manage your documents, all while managing the lights in the house, the temperature, and the mood in the room... the options are getting limitless.
And none of the big tech spying on your data and telling you what you should or should not do.
clare64@reddit
youve nailed it. reckon its possible with the pc op is referring to?
No-Manufacturer-3315@reddit
Memory bandwidth issues
StyMaar@reddit (OP)
256GB/s is more than enough for models with 3-12B active parameters though.
Compute is likely a bigger issue for long context and time to first token.
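For intuition, a back-of-envelope sketch of the decode-speed ceiling that bandwidth implies; the gpt-oss-120b figures (~5.1B active parameters at ~4.25 bits each) are assumptions, and real-world speeds land well below the ceiling:

```bash
# Decode is roughly memory-bound: every token must read all active weights.
# ceiling (tok/s) ~= bandwidth / (active params x bytes per param)
awk 'BEGIN {
  bw     = 256       # GB/s, Strix Halo
  active = 5.1       # billions of active params (assumed, gpt-oss-120b)
  bpp    = 4.25 / 8  # bytes per param at MXFP4
  printf "theoretical ceiling ~ %.0f tok/s\n", bw / (active * bpp)
}'
# prints ~94 tok/s; real-world numbers reported in this thread are ~40 tok/s
```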
clare64@reddit
am i correct in assuming that if your use case doesn't care about "speed", this 395 (with 256GB/s) will be fine?... as long as I can use video models locally and have it running at no cost, could care less if it's fast/slow at this point...
zipzag@reddit
The upcoming Nvidia Spark is 270GB/s; the high-end Mac Mini is 250. This speed is about the state of the art for these more moderately priced SoC computers.
The more expensive Mac Studios are at about 500GB/s and 800GB/s.
These SoC choices are good on memory and low on processing power. Arguably, many people would trade some of the 5090's speed for more memory. What's available today is a bit imbalanced for how most people want to run inference.
s101c@reddit
You cannot use Nvidia Spark as a gaming machine or professional x86 workstation. Many of us here buy an expensive computer to do all of these things.
The Mac is also a matter of taste; not everyone likes macOS.
Original_Finding2212@reddit
It really is about purpose.
General purpose? Sure. Fine-tuning? Spark.
No-Manufacturer-3315@reddit
That's a small number of active parameters... I suspect it excels at MoE models but is crippled on monolithic (dense) ones.
sudochmod@reddit
The processor on this is almost equivalent to a 9950.
fallingdowndizzyvr@reddit
It's simplistic to only consider memory bandwidth. My Max+ 395 is faster than 3x MI50s cobbled together, despite their much faster memory bandwidth.
kezopster@reddit
This is MY question, too! I just got bitten by the local LLM bug. Right now, I'm using my ROG laptop with a 13th-gen i9-13900H + Nvidia 4070. It's enough to show me the possibilities without scratching the itch. So, do I spend $2k on one of these Strix Halos, or slightly less on a desktop with a 5070 Ti (can't afford to go higher than that)? Heck, I've even seen a few laptops around that price with Intel processors. But seriously, I'm about to pull the trigger on a GMKtec EVO-X2 AI mini PC: Ryzen AI Max+ 395, 128GB LPDDR5X-8000 (16GB*8), 2TB PCIe 4.0 SSD. Am I being dumb?
DavidAdamsAuthor@reddit
If you're about to spend $2k on a product like that, why don't you "test drive" it, so to speak, and budget $100 worth of H100 GPU rentals instead?
You'll find that the performance is WAY better, and you might find you're just not using it as much as you thought you might.
clare64@reddit
is there a simple walk-through on how to do this on YouTube somewhere? certainly seems reasonable, like your analogy of test-driving a car... run some local llm stuff to see if it's to your liking
One-Kangaroo911@reddit
LOL, you should have bought it back then!!!! LOL
Early-Type-6300@reddit
Because it cannot run Kimi K2.5 ;-)
Weird-Consequence366@reddit
If you can wait a month or three the price will go down on them a bit.
Hood-Boy@reddit
Aged like milk
Weird-Consequence366@reddit
Funny. I got one for cheap on Black Friday.
yuno_me@reddit
aged like milk
Zc5Gwu@reddit
Just like 3090 prices were supposed to go down… and 4090 and…
Zentrosis@reddit
I bought my 4090 somewhat close to launch when everybody was talking about how overpriced it was.
I don't feel like I made the wrong choice.
kevin_1994@reddit
tfw 4090s are somehow selling for the same price as a 5090 on eBay
SubstanceDilettante@reddit
I bought the 4090 a few weeks after launch from best buy for MSRP
uti24@reddit
I mean, Black Fridays can be nuts in the US, like getting this thing for $1500.
my_name_isnt_clever@reddit
Could be, but I don't even remember the last Black Friday that had actual worthwhile deals rather than artificial "sales" by increasing the price ahead of time. And with the demand for AI hardware I'll believe it when I see it.
uti24@reddit
Every Black Friday, we have posts where someone gets a full PC with 13600K and RTX 4080 for $750 from Micro Center or wherever.
ComingInSideways@reddit
All this depends on ongoing tariffs and trade issues. Tech going down is not necessarily a given now.
boubainlive@reddit
So there is an upside to procrastination 🤔😁
rishabhbajpai24@reddit
I had one 4070, one 3070, one 4080, and one 4090, with a combined RAM of 324 GB. I still bought it. I find it to be the most balanced device: it performs comparably to other systems and consumes very little power. I usually run 30-ish B models with a large context length. The 395 is definitely less powerful for dense models compared to RTXs but performs really well on MoE. Now I have shifted most of my workload onto it, and I don't regret it.
I am a developer, and I need to test my software on different devices, so buying it was the right choice for me (that's how I justify it), but may not be the best choice for you.
Buying 395 depends on your specific use cases. It is not as fast as its other counterparts but cheaper compared to a Mac or Nvidia DGX Sparks.
randylush@reddit
Huh?
Quirky_College_6251@reddit
But that will probably consume 1200 watts under load, while the AMD and Apple machines use about 120W.
rishabhbajpai24@reddit
324 GB is the system RAM (excluding VRAM)
BhaiBaiBhaiBai@reddit
He said RAM, not VRAM. Perhaps he's referring to his whole rig
HiddenoO@reddit
Probably system RAM + VRAM combined.
StyMaar@reddit (OP)
You're not helping man!
(thanks for the answer though ;))
rishabhbajpai24@reddit
Reasons not to buy:
$2k is not zero: a lot of LLM providers have free tiers for chat and API, so for everyday purposes you don't need to spend $2k.
Time to set up the server: For someone new to LLMs and 395, it may take some time to set up everything.
Stability: AMD is not as good as Nvidia when it comes to LLM acceleration. ROCm is not as stable as CUDA. Vulkan is good but still fails sometimes.
Non-LLM workload: Running CUDA-optimized algorithms is not always easy on AMD.
StyMaar@reddit (OP)
$2k is much less than selling my soul, which is roughly what these free-tier API providers are taking from you.
Kramy@reddit
I bought a few $330 CAD MiniPCs (6800U 32GB LPDDR5-6400), so about $200 USD because of Canadian taxes and the exchange rate. I am running Olla and Ollama off of them. It crunches out tokens at a pretty good pace, and now I can fire lots of simultaneous requests at it. I need to get qwen3-vl working and some other models, plus expand OpenWebUI and some other software to have more capabilities. But bit by bit, I'm putting together something very useful, and the learning experience is definitely fun! Can't go wrong spending money on knowledge.
Acrobatic-Rice-4598@reddit
The data is anonymous, what's the problem?
StyMaar@reddit (OP)
That it's 2025 and some people still think anonymity exists in a digital context is beyond me.
Acrobatic-Rice-4598@reddit
Paying €2k for a PC to anonymize your requests, what do you want me to tell you x) Personally, I do not transmit personal data to AI.
StyMaar@reddit (OP)
What are you typing, cryptographically secure randomness? If not, you are in fact transmitting personal data to AI providers, even if you don't recognize it as such.
No-Row-Boat@reddit
If I can add: Linux support is horrible.
Lemonade: no NPU support on Linux. Ollama: no ROCm in the default packages. vLLM: no ROCm support; it was added 3 weeks ago but is not yet in the latest version.
I'm just fighting to get something working on Linux and so far: nothing is.
LsDmT@reddit
What specific models are you running on the 395? Are you using vulkan-radv, vulkan-amdvlk, rocm-6.4.3-rocwmma?
Have you found any solid models working with rocm7_rc-rocwmma yet?
farnoud@reddit
Is splitting model between a few cards viable? Is it better to have all the cards on the same system?
It’s a lot cheaper to buy 3x 3090 than l40s
kkb294@reddit
Same with me. I have a 4090 48GB and a few 4060 Ti 16GB variants in my home rigs. They are bulky, power-hungry, jet-engine-sounding machines under load. But whenever I want to demo something, I load the code and local LLMs onto it and carry it to the office for the demos. It has the local LLM setup (LM Studio, Ollama, llama-cpp-python, Whisper STT, Kokoro TTS).
All I have to do is load the app container into Docker and hook it up with the other layers for the demo.
The only con is that it is very bad at Stable Diffusion 🤦‍♂️
rishabhbajpai24@reddit
That's true. I haven't gotten good results on stable diffusion with it.
xXprayerwarrior69Xx@reddit
Currently I'm really stuck on what to do: should I get a Strix Halo, a Mac Studio, or a DGX Spark? Everything is moving so fast that it's hard to pinpoint what to do. The good thing is that I don't NEED to find the answer, but I really want to experiment and get deeper knowledge. I imagine the right answer is probably the Strix Halo due to the cost/performance ratio, but I'm thinking that maybe Apple is cooking something, and we still don't know when the Spark will release... oh well.
Smart_Government6493@reddit
Hey, it's now three months later; I bet you really regretted not buying it back then lol
nzMike8@reddit
This seems like a good deal, if the Max 395 is what you are after: https://frame.work/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006
CoqueTornado@reddit
Because for the amount of bucks you are going to spend, you can get a UM890 mini PC + a 5060 Ti with 16GB of VRAM and do everything 2 times faster in image/video generation.
https://vladmandic.github.io/sd-extension-system-info/pages/benchmark.html
Here you can find the speed of an XL model on each option (8060S vs 5060 Ti): 1.43 it/s vs 6 it/s. For LLMs, these 220B MoE models are OK, but the pp speed when the context is more than 50k reportedly degrades a lot, so you have to wait like 2 minutes for it to read the context (I've read that somewhere; if anyone can confirm, I would appreciate the information). So it's not really the best thing, but hey, it's a powerful local LLM.
StyMaar@reddit (OP)
This is /r/localllama, not /r/stablediffusion, I don't care about video generation at all.
WhaleFactory@reddit
I definitely did not need one, but I bought one anyway. Now I am running gpt-oss-120b at >40 tps while sipping power. The thing uses less power under load than my other servers do sitting idle.
dougmaitelli@reddit
So, a question: I just got a GMKtec Ryzen AI Max+ 395 and I can only get 20 t/s on a 12B model. I don't know how you can get double that on a model 10 times bigger.
Am I missing something?
deadly_sin_666@reddit
Which one did you get?
j0rs0@reddit
You were supposed to help 😆
Hodr@reddit
He definitely helped. Here's my help: my dipshit nephew, who refuses to get a job, dropped 4 grand on a credit card for rims on a 10-year-old Accord. It puts a presumably decently compensated person dropping $2K on an awesome setup for running local LLMs in perspective.
Bakoro@reddit
Jeez, I make six figures and haven't bought a new computer in 5 years.
The idea of dropping $2k on anything sets off alarms in my brain, even for important stuff that I know is worth it.
$4k on rims?
I'm for real irritated just reading that, because I've known people who do that shit. Later on they'll be saying "it's hard out here, you don't understand" and "why can't I just catch a break?"
randomqhacker@reddit
You're making six figures, you're supposed to be putting money back into the economy. Dipshit nephew is a hero for supporting local business and bringing joy to everyone who sees those beautiful rims.
Plebius-Maximus@reddit
Gonna need a pic of these beautiful, economy-propping rims
BasvanS@reddit
Imagine the worst rims you can think of, the ones that raise the hairs on your back, and then come back, because they’re 5 times worse, and there’s a Geneva Convention against torture that’ll become relevant.
fatboy93@reddit
So, either floaters or spinners, or just normal ones with a really gaudy ass shit colors
No_Debate_8297@reddit
HEY! Those rims are carrying the weight of the whole economy.
h311m4n000@reddit
My daily driver is still rocking a 1080ti and a core i7 9700 and I don't plan on changing it...and I also make 6 figures 😛
pixelpoet_nz@reddit
Yeah he really needs a job to support his penchant for buying rims, we'll call it a ...
pn_1984@reddit
Suddenly we went from comparing local AI setups to comparing ourselves against a random "dipshit nephew". ngl, compared to him we look peachy buying some mini PC.
Hey next time I want to extend using OCULink, I can bookmark this comment and gain more confidence!
Some-Cow-3692@reddit
That's a solid perspective. $2k for a solid local LLM setup is a reasonable investment compared to many other hobbies.
Xp_12@reddit
That's some good help.
Serveurperso@reddit
We'd need to compare with GLM 4.5 Air, because with GPT-OSS-120B a dedicated Linux PC runs at that speed!
StyMaar@reddit (OP)
What kind of pp speed do you get? (How long would it take to process a 5k-15k token context?)
KontoOficjalneMR@reddit
PP is in the thousands of t/s, so 2-10s depending on various factors.
one-wandering-mind@reddit
That seems too high given what I have seen from other folks. Are you sure? Let's just say one 8k prompt: how fast? And what are the factors?
KontoOficjalneMR@reddit
I probably should have written "around a thousand", but I could swear someone listed over 2k.
Anyway, this is the best I found, and like I wrote, lots of things seem to affect it (mostly negatively).
https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796
Gringe8@reddit
To compare: that gets ~100 t/s prompt processing on a 70B, while my 5090 gets 1500+.
ethertype@reddit
Cool. As long as this is a single-user setup, what is the added value of anything faster than human reading speed?
I could sell my 4 3090s and buy a 395. But, knowing myself, I'll probably end up with a 395. And 4 3090s.
I suck at getting rid of stuff.
kryptkpr@reddit
We are talking pp speed not tg speed.. if you're running local coding agents they eat 10K prompt tokens before getting out of bed so at 100 Tok/sec you will wait 2 minutes before generation even begins. Those 4x3090 can do this in seconds.
ethertype@reddit
That is a useful answer, thank you.
CSEliot@reddit
The tradeoff is in the VRAM. A GPU with 24GB of VRAM can deploy fewer/smaller models but run them much faster than an APU with 96GB of shared VRAM.
Gringe8@reddit
That's why I decided against getting something like this. While you CAN fit a bigger model on it, the prompt processing is so slow it wouldn't be useful for my use case. I decided to go with a dual-GPU setup instead of a mini PC for AI stuff.
ItzDaReaper@reddit
What’s your use case?
randomfoo2@reddit
For better benchmarks, be sure to use the latest updates (linked in that post) or these: https://kyuz0.github.io/amd-strix-halo-toolboxes/ - Note, these are pp512/tg128, so best-case numbers. While I do sweeps, I haven't done a lot of long-context tests (since they take forever), but I did run one, and perf gets pretty bad out at 128K... https://strixhalo-homelab.d7.wtf/AI/llamacpp-performance#long-context-length-testing
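For reference, a hedged sketch of how such a long-context sweep can be run with llama.cpp's llama-bench; the model filename is a placeholder, and larger `-p` values expose the prompt-processing drop that the pp512 default hides:

```bash
# llama-bench defaults to pp512/tg128 (best case); sweep longer prompt
# sizes to see how prompt processing degrades with context length.
./llama-bench -m ./gpt-oss-120b.gguf \
  -p 512,4096,16384,65536 \
  -n 128 \
  -fa 1
```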
KontoOficjalneMR@reddit
Figured, but then it's incredibly rare for a model to even support that long context, not to mention making actual use of it.
kmouratidis@reddit
Is `pp512` a reliable test for speed on long prompts? Why not benchmark your own computer? You can install vLLM and run something like this:
KontoOficjalneMR@reddit
What would a benchmark on my computer tell you that the ones from localscore.ai do not?
kmouratidis@reddit
It would give a directly comparable benchmark result that does not assume a GGUF file is being used to deploy a model?
KontoOficjalneMR@reddit
Comparable to what?
MaverickPT@reddit
Wow. That's a big PP
Business-Weekend-537@reddit
Petition to henceforth refer to undervolted vs overclocked performance for this metric as “soft” and “hard” PP from here on out.
CV514@reddit
I have factory calibrated PP then. Wow.
Business-Weekend-537@reddit
You could also call it bone stock
BhaiBaiBhaiBai@reddit
Seconded
Healthy-Nebula-3603@reddit
lol
Gringe8@reddit
The benchmarks below show the pp of a large MoE model getting 63 pp t/s.
Fuzzdump@reddit
Not sure what benchmarks you're referring to, gpt-oss-120b gets 750 pp t/s on ROCm. https://kyuz0.github.io/amd-strix-halo-toolboxes/
Gringe8@reddit
The ones that were linked in the comment below mine
Fuzzdump@reddit
Those appear to be out of date. The updated numbers for that particular model show 182 pp t/s: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench
Sparser MoE models seem to fare much better. Some of the smaller ones get 1k+.
StyMaar@reddit (OP)
How can it be so low?
GreenCap49@reddit
I get around 500 pp; it's pretty fast. Also, gpt-oss-120b has SWA, so only the initial prompt takes a bit longer, then it gets faster.
Cool-Hornet4434@reddit
That's not SWA... If you run the model through a "dry run" it takes a long time to start up, but the first run with an actual prompt is fast. If you skip the dry run with the --no-warmup option, then the first run is slow, and it speeds up after that. SWA is Sliding Window Attention, which only affects how much VRAM is used for the KV cache. But with Oobabooga you can't use any quantization of the KV cache because it's already quantized... the whole model runs with MXFP4, which squeezes the numbers AI models use down from their normal size to just 4.25 bits each.
xjE4644Eyc@reddit
Same. Any model that's less than 100 GB and I'm like "let's do this".
Great for privacy related things where I'm not sure if zero data retention is really zero data retention.
pn_1984@reddit
why do you say so?
xjE4644Eyc@reddit
We're in a bubble and it's going to take down a bunch of companies; a lot of these companies are really skating by financially. And the only valuable thing they have is the data that they're processing. When push comes to shove (e.g. company solvency risk, their own/personal financial well-being), I don't know if many are able to withstand the tempting offer of putting that data up for sale.
jacek2023@reddit
How expensive is it? Because I get twice the t/s on 3x3090.
Aromatic-Low-4578@reddit
I bet you have more than twice the power draw though
jacek2023@reddit
If you want to minimize power usage, the best option is ChatGPT on your phone.
valiant2016@reddit
How is that local?
jacek2023@reddit
It's not, but uses no power ;)
Creepy-Bell-4527@reddit
That's not true. My phone consumes about 3000mAh of power a day.
If you want to minimize power usage, the best option is to live in a cave with no electricity.
Rynn-7@reddit
mAh is a useless unit when comparing usage against a computer. You need to multiply by the phone's voltage to get watt-hours.
Creepy-Bell-4527@reddit
The battery's voltage, but you're right. I was just too lazy to Google whether it was 3.3-3.8 or 5V, so I went with the unit I knew!
rawednylme@reddit
What is the best local model the cave can run?
Creepy-Bell-4527@reddit
If you talk loudly enough it echoes back what you said, I think gpt-2 had similar behaviour.
Cool-Chemical-5629@reddit
And then visit a local shaman and buy some hallucinogenic mushrooms from him. Next, get one longish pointy, sharp stone to use as a chisel and another bigger stone with a nice flat surface to use as a hammer. Consume the mushrooms and start carving whatever comes to your mind into the cave wall. When you come to your senses, it will feel like someone else (the cave's hidden "AI") carved those writings on the wall.
JumpingJack79@reddit
👆 THIS 👆 is a truly local LLM.
No_Afternoon_4260@reddit
Books are cool
valiant2016@reddit
r/LostRedditor ?
JimJava@reddit
It's amazing how people double down on arguments they lost.
sudochmod@reddit
I got mine for $1650 :D
magikowl@reddit
What brand/model and from where?
sudochmod@reddit
Nimo. Just search it. They run specials and I got it during the back to school special.
LowMental5202@reddit
I really want one too, but I've only really used NVIDIA so far. The thing I'm most afraid of is software compatibility.
Awkward-Candle-4977@reddit
AMD should make a GPU card with DIMM slots, because Nvidia won't (they'll say "buy a DGX").
An RDNA 4 Navi 48 chip with 8x 128GB DDR5 DIMMs would be very attractive for inference. It might work faster than multiple RTX 6000 cards, because one DDR5 DIMM's bandwidth is similar to 16 lanes of PCIe 5.
Vibe coding could then use those large models.
BillDStrong@reddit
Next-gen RDNA 5 is supposed to use LPDDR, so if a board partner wanted to, they could. Would they? Who knows.
AppealSame4367@reddit
But is that really worth it? What does oss 120b do for you?
Now that I know gpt-5 on Codex, I wouldn't want to settle for less xD
Apart-Touch9277@reddit
I think having a decent offline position is important enough to invest
tarheelbandb@reddit
It's more like what does Qwen2.5-coder or Qwen3 coder do for you.
TetsujinXLIV@reddit
What was your method of deployment? I tried ollama and got poor performance. I was hoping to run it headless in docker and connect from my desktop.
EdanStarfire@reddit
I'm using LM Studio with Vulkan. Haven't really dug into ROCm yet, as it's working great so far with little effort.
TetsujinXLIV@reddit
Do you use the unsloth GGUFs? I had the 20B model running in a llama.cpp Vulkan Docker container getting about 40 t/s, but when I went to the 120B it dropped to 8 t/s, and I'm wondering if it's my command?
`./llama-server -m /models/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf --threads -1 --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0.0 --host 0.0.0.0 --port 8080`
or if it's because the version of llama.cpp I'm using is behind
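One hedged guess at the culprit: the `-ot ".ffn_.*_exps.=CPU"` override pins every MoE expert tensor to the CPU backend, which helps on discrete GPUs with small VRAM but works against a unified-memory APU where the GPU can hold the whole model. A sketch of the same launch without it (also note that `--top-k` takes an integer):

```bash
# Same server, letting the experts stay on the GPU backend
# (assumes the model fits in the unified memory pool).
./llama-server -m /models/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
  --ctx-size 16384 --n-gpu-layers 99 \
  --temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0 \
  --host 0.0.0.0 --port 8080
```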
EdanStarfire@reddit
Just openai/gpt-oss-120b, not an unsloth version. Looks like MXFP4 63.49GB. context=131072, GPU offload 36/36, CPU thread pool=12, offload kv cache to mem, flash attention, kcache quant fp16, vcache quant fp16.
Probably not optimal, but gets the job done.
Random test with a 13,875-token prompt (fully fresh load, nothing cached): 343 tok/s pp speed, 25.81 tok/s output generation.
Eugr@reddit
Ollama lacks proper AMD support, but llama.cpp with Vulkan should run well. I've heard ROCm support has gotten better recently for Strix Halo too.
TetsujinXLIV@reddit
That's probably my problem then. I need to dig into it further tonight.
xjE4644Eyc@reddit
I'm using llama.cpp. It's great.
ThinkExtension2328@reddit
Can you name your specific machine? This sounds tasty
EdanStarfire@reddit
GMKtec EVO-X2 for me. Very happy with the purchase. Got Home Assistant like 75% of the way to fully local with qwen3-coder-30B-A3. Just working on exposing the right devices and tweaking the system prompt.
Secure_Reflection409@reddit
Do you use it for roo/cline?
What kinda token speeds you seeing when you're 50k+ deep?
2CatsOnMyKeyboard@reddit
The 120B just didn't load in LM Studio with a 50K context window. Don't know what the max for that size would be, probably around 20-30K.
EdanStarfire@reddit
I altered the BIOS config to dedicate 96GB to the GPU, and gpt-oss loads fine at 128k context. Runs great.
No_Afternoon_4260@reddit
Factory setting is 96GB for the GPU, but you can set it higher. Have you done that?
Xp_12@reddit
I'd imagine it follows the trend exhibited on this chart.
https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796
henryshoe@reddit
Would you mind telling me exactly what your set up is?
tat_tvam_asshole@reddit
This felt so good to upvote. I too bought a Strix Halo box, within hours of seeing it advertised, straight to Microcenter.
Peter-rabbit010@reddit
How does it compare to an RTX 6000 Pro? I hadn't considered AMD silicon until this post.
you should get some referral money from amd
Accomplished_Bet4312@reddit
I'm also making up excuses to stop myself from getting this :). I already have a desktop, and all I need is a new GPU card.
But it would be good as a tiny machine under my TV. If it can run SteamOS, I can use it mainly as an AI server, and as a Steam console after work.
Commercial-Fly-6296@reddit
Sorry for my lack of knowledge, but does this work well given that most tools use CUDA-compatible libraries and so on?
Also, will this be fast enough to fine-tune and run inference, or maybe even distill? (Compared to an RTX.)
Most_Seaworthiness71@reddit
Does anyone know the actual bandwidth for the CPU cores vs. the GPU cores? I read somewhere that the CPU cores' bandwidth is significantly less than the bandwidth allocated to the GPU cores.
absolutzehro@reddit
Goddammit, I came here to find out more and ended up buying a 128GB mini PC just now. I hate all of you.
mfarmemo@reddit
I own the Framework Desktop (128gb version).
Here's the main reason I'd say NOT to buy: you want everything to "just work". Bleeding-edge hardware needs bleeding-edge software/drivers. Expect bugs, crashes, and reading forums and Reddit in search of workarounds.
I'm happy with my purchase but I also love to tinker with new tech. If that's not you, you should wait.
StyMaar@reddit (OP)
That's a very good answer actually.
What did you have trouble with?
(I've been using Linux as a daily driver for 17 years now, so I'm pretty familiar with the process, but the kids eat into my tinkering budget a lot, so I wouldn't buy something that would be too much of a hassle to run).
mfarmemo@reddit
Fedora crashed during install twice when clicking "allow proprietary software" 🤷‍♂️. Once booted, I began installing my usual apps, then Fedora crashed. It was a weird crash: a complete lock, then a system shutdown... The current kernel didn't have full support, so I had to dig around for kernel updates and workarounds. ROCm drivers were a pain, but it looked like there was better support on Ubuntu, so I switched to Ubuntu. It worked better out of the box, but then Citrix Workspace didn't work at all despite a few hours of troubleshooting (needed for my work). I think NPU support is Windows-only right now, but then Windows doesn't have full ROCm support. I ended up going against my own wishes and installed Win11 in hopes of having the essentials I needed work without significant errors. On Windows, the WiFi driver seems to struggle with DNS resolution during large packet transfers; I spent a few hours debugging that.
I received my Framework on Friday. I spent all of my free time (with two kiddos) trying to get things working the way I wanted, with mixed results. So I settled on Windows so that I could use the apps I need for work by Monday. I plan to dual-boot but am waiting for some solid free time to get that done. Probably another weekend project.
ImEatingSeeds@reddit
Check out CachyOS. Don't let the fact that it's Arch Linux scare you. I've been running it on bleeding-edge hardware without ANY issues for over a year as my daily driver.
Windows was dual-booted on the system as an insurance policy/hedge against the risk that if Linux f*cks me, I still have Windows to fall back on (for work, etc).
At this point, I can confidently say that with the exception of one issue around repository keyring updates that went sideways (which was fixed within a day without needing to reinstall the OS or much tinkering)...Cachy has never left me in a situation where I had to boot into Windows as an emergency fallback.
Even when it comes to gaming, Cachy is so good (and optimized) for it that I almost never boot into Windows for video games either.
...and that's coming from a guy who has never been into "ricing out" his own rig, has never run distros like Arch or Gentoo as daily drivers (ever), and who has always preferred to use distros like PopOS, PinguyOS, Mint, Ubuntu, etc.
Driver/hardware support is superb, always up-to-date, and the installation was stupid-easy. My system's been rock solid and stable...almost to the point where it's boring me.
mfarmemo@reddit
Can confirm. Cachy worked great with minimal tinkering. I ran it through a full workday of coding, local LLM usage, web usage, and virtual meetings with only two issues: headphone output noise and one GPU crash during inference. A much better experience than Fedora or Ubuntu. The Citrix Workspace app even worked, and it never works right on Linux distros.
ImEatingSeeds@reddit
w00t! Great to hear :)
Now…are you brave enough to install hyprland and fully complete your transformation? 😅
mfarmemo@reddit
I'll give it a try. Thanks!
StyMaar@reddit (OP)
That's great feedback, thanks!
Hot_Turnip_3309@reddit
Is it possible I could rent it for a few hours and try to get a few models running? Or, what tokens per second do you get? I wanted to try Qwen3-Next 80B; if it can run that, then I could run it 24/7 to do tasks, like 12,000-24,000 requests per day at 60-200k ctx.
mfarmemo@reddit
tps varies by model architecture, inference setup, and quant type. For example, gpt-oss-120b runs great: ~20-25 tps output with a q4 quant fully loaded into VRAM on the GPU. Qwen3-Next will run just fine. Context length will depend on how much memory is free after the model is loaded, and on other params. There are calculators online that can help you understand the optimal setup and memory requirements. I don't anticipate any issues with the new Qwen model.
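As a rough stand-in for those calculators, a back-of-envelope sketch (weights only; the KV cache and runtime overhead come on top, and the 4.5 bits/param figure is an assumption for a typical q4 GGUF):

```bash
# Weight memory ~= params x bits-per-param / 8
awk 'BEGIN {
  params_b = 80    # billions of params (e.g. Qwen3-Next-80B)
  bits     = 4.5   # assumed effective bits/param for a q4 GGUF
  printf "~%.0f GB for weights alone\n", params_b * bits / 8
}'
# prints ~45 GB, leaving headroom for context on a 128GB machine
```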
Eugr@reddit
Is there anyone here who has GMKTek version? I have Framework Desktop on pre-order, but GMKTek is available now for cheaper. One thing that is stopping me is that I've heard that the fan gets pretty loud and the thermals are not great compared to Framework Desktop, so it starts strong, but then throttles down.
Educational_Sun_8813@reddit
With the Framework you get two M.2 slots and one PCIe x4 slot.
Eugr@reddit
The GMKtec one has two M.2 slots as well, but no PCIe slot. What I also like about the Framework Desktop is that the motherboard is Mini-ITX and uses a standard FlexATX power supply. But I'm in Batch 13 (Q4 shipping).
simracerman@reddit
I'm also in the same batch, shipping Q4. I ordered the 128GB board only. With the parts I have planned, it will come to around $1900. My future addition is an eGPU through OCuLink.
green__1@reddit
I don't see oculink listed anywhere on the framework website, how will that work?
simracerman@reddit
Using one of these: https://www.amazon.com/Compatible-OCuLink-SFF-8612-SFF-8611-External/dp/B0F89XSVYF
xjE4644Eyc@reddit
I have it. The fan is annoying, but I keep it in my server closet so it doesn't bother me. Sometimes it gets into a loop where it sucks 25 watts of power continuously, but most of the time it's at 5-7 watts.
Would buy again.
Eugr@reddit
Thanks!
arades@reddit
Just a note: only 96GB is available to use as VRAM; a minimum of 32GB will always be reserved for the CPU.
For some reason I doubt that only 75% being usable as VRAM will be a big dissuader, especially considering the other 25% can obviously be moved in/out pretty quickly if used for KV cache.
Eugr@reddit
In Linux, you can set it up to use all RAM as unified memory, so you just dedicate 500MB to VRAM in BIOS, the rest will be allocated dynamically as needed (just like on Mac).
d3v3l0pr@reddit
Do you have a link on how to do that?
Eugr@reddit
kyuz0/amd-strix-halo-toolboxes
Look at the host configuration section. The key is these kernel parameters: `amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432`
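For anyone unfamiliar, a minimal sketch of making those parameters persistent on a GRUB-based distro (values as above, sizing GTT for 128GB; adjust for your distro's bootloader):

```bash
# /etc/default/grub -- append to the existing kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# then regenerate the config and reboot:
#   sudo update-grub && sudo reboot
```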
d3v3l0pr@reddit
Hey thanks a lot! I've been sitting on this ai max+ 395 and hitting the context limit a lot, can't wait to test this :)
Eugr@reddit
No problem! What AI Max+ 395 box do you have? Framework? GMKTek? Other? I have a Framework Desktop on pre-order, but looks like I can get GMKTek one faster and cheaper, but not sure about thermals/noise...
d3v3l0pr@reddit
I got the latest Beelink GTR9 Pro. It was a preorder, but it still shipped within 10 days, I believe. I haven't tested it to get thermal data. The fan is faintly audible but spins up somewhat under heavy load. It looks nice, is quite small, and has a bunch of IO, so I'm quite happy with it. I'm mostly compiling Rust and hoping to use it for local LLMs and to run some services, and for that it's a beast.
Now I just need to dive deeper into this local LLM rabbit hole and figure out the right agent tools and models to use. I've been trying opencode, but I don't know how to use it yet, and I hit the context limit fast with gpt-oss.
Eugr@reddit
What do you use to run gpt-oss and what context size have you set?
d3v3l0pr@reddit
LM Studio, and the max context size I get without the kernel parameters you posted is 15k with gpt-oss-120b.
Eugr@reddit
You should be able to get full 128k context on your hardware. But you either need to allocate memory dynamically (see my post about kernel parameters) or dedicate 96GB to VRAM, as gpt-oss takes around 64GB when loaded with full context, IIRC. It may fit into 64GB of VRAM, but I'm not sure.
Eugr@reddit
Hmm, that Beelink looks interesting, but it's currently more expensive than the Framework or GMKtec. I guess I'll just wait a bit and see if any of those prices go down... or not; you never know with these tariffs now.
waiting_for_zban@reddit
The issue is also software stack support. Two months ago, I couldn't run it reliably without specifically setting the UMA limit in the BIOS; a few updates later, it's working very well with GTT. Although, again, there are lots of fluctuations in tg and pp.
arades@reddit
That's actually great info, I assumed that was a chipset restriction
Educational_Sun_8813@reddit
You can address 112GB with a normal libre OS.
Zyguard7777777@reddit
112gb as vram under Linux apparently
SuitableAd5090@reddit
Here is a good site on using the device in a home lab scenario https://strixhalo-homelab.d7.wtf/
I really like the way they break it down. It's basically a 7600 XT with tons of RAM. So it's already kind of outdated and on an old AMD architecture. I can't help but feel that the next wave of AI parts they do will be much, much better.
Ok_Warning2146@reddit
If u do video gen, it will be uselessly slow
n1k0v@reddit
What are your thoughts on the 385 32gb ?
StyMaar@reddit (OP)
I don't have any, as I haven't tried it ^^
Particular-Party4655@reddit
I am wondering why there are still no PCIe expansion cards announced based on this chip (the AMD AI Max+ 395). Imagine... That would be a bomb!
StyMaar@reddit (OP)
I'm no hardware specialist at all, but this chip is a full SoC with a CPU in it, it's not a regular GPU.
Why there aren't high-memory, moderate-to-low-bandwidth graphics cards using the same kind of design, simply stacking cheap LPDDR instead of GDDR, is a legitimate question though.
Particular-Party4655@reddit
Yes. I think there is a huge marketing/money reason why we still don't have AI-first, GPU-second PCIe cards for the consumer market. I basically asked this question to troll them a bit. However, for AMD that might be a great marketing move: it would allow them to win this market (local AI for the masses) while Nvidia is focused on high-profile clients owning datacenters... From a hardware perspective it is very easy to turn this SoC into a PCIe card (it already has a PCIe 4.0 x16 interface); the software stack (ROCm-based, all the routing/balancing pipelines) will catch up inevitably.
Ok-Possibility-5586@reddit
Can you explain what you mean by 128GB of VRAM?
This is a laptop right?
It looks like it's 128GB of RAM.
Is it shared VRAM/RAM like on apple or what?
StyMaar@reddit (OP)
This is a chip that was designed by AMD for use in laptops but has been picked up by mini-PC makers, the Framework Desktop being the most notable.
AFAIK it's not exactly the same as Apple, but it's the same idea.
So yes, it's 128GB of LPDDR5X attached directly to the SoC, with 256GB/s of memory bandwidth (lower than true GDDR6X VRAM, but higher than your desktop's system RAM).
Ok-Possibility-5586@reddit
Cool. Did you end up buying the laptop? (I think you said strix halo?). I'm looking at this: HP ZBook Ultra G1a
szab999@reddit
Not OP, but I got this exact HP ZBook Ultra G1a with the Ryzen 395 + 128GB RAM. You can allocate up to 96GB to the GPU. I have it running Debian, but I haven't managed to set up the latest ROCm yet; there are some installation issues with it.
Ok-Possibility-5586@reddit
It would be awesome if you keep us up to date. On paper it sounds like a no-brainer for a daily driver if it can be made to work.
IMO the shared memory thing is a potential NVIDIA killer for the hobby end of the market. Both Intel and AMD should be putting teams onto building out Intel/AMD forks of the major Python libraries.
szab999@reddit
I made it work with podman (docker alternative), running ubuntu 24.04 + rocm 6.4.3 in the container. Ollama is working well, I will do some performance tests tomorrow.
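In case anyone wants to reproduce it, the container setup is roughly this; a sketch assuming the usual ROCm dev image naming and the standard GPU passthrough flags (adjust the tag to whatever is current):
```
# run a ROCm 6.4.3 / Ubuntu 24.04 container with the iGPU passed through
podman run -it --rm \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v "$HOME/models:/models" \
  docker.io/rocm/dev-ubuntu-24.04:6.4.3
```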
StyMaar@reddit (OP)
“Strix halo” is just the AMD codename for the “Ryzen AI max+ 395”.
Why AMD insists on using such a horrible naming scheme instead of the codename is beyond me.
But to answer your question, I have zero need for a laptop (I still have one from 2015 that I use every once in a while when I really need a laptop, but that's very rare).
ptyslaw@reddit
Where is it available for $2k? Is it in a laptop or mini pc form?
StyMaar@reddit (OP)
Framework desktop motherboard.
ptyslaw@reddit
Oh that’s freaking cool! I thought it was only available as laptops
Serveurperso@reddit
Honestly we're caught between two stools here. From experience (I have a Ryzen 9 9950X3D + RTX 5090 FE dedicated to AI, Debian CLI), medium-sized MoEs like GPT-OSS 120B (runs at 50 t/s under llama.cpp) and GLM 4.5 Air (25 t/s) sit in the segment that overflows the 32GB of VRAM, and DDR5 at 6600 MT/s (100GB/s) is a handicap, but not enough of one for a DGX Spark or an AMD AI Max+ 395 128GB to bring much. I'd say: if you have nothing, buy an AI Max, but if you already have a good DDR5 PC it won't be great, and it will even be much slower on dense 32B models with the right tuning (Q5_K_M / Q6 imatrix with KV cache in Q8_K and flash attention enabled).
And for these MoEs under llama.cpp, be sure to use --n-cpu-moe
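Something like this, to give the idea; a sketch where the model path and the layer count are illustrative, you tune --n-cpu-moe until the non-expert weights just fill your VRAM:
```
# offload all layers to the GPU, then keep the MoE expert tensors of the
# first 24 layers in CPU RAM so everything else fits in 32GB of VRAM
./llama-server -m gpt-oss-120b.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  -c 32768
```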
tarruda@reddit
If you can spend $2k on an AI Max, just add a few extra hundred and get a Mac Studio M1 Ultra with 128GB. Not only is it significantly faster for running LLMs, but you can allocate up to 125GB to VRAM, which increases the selection of models you can run. For example, it can run Qwen 3 235B at 4-bit quant.
StyMaar@reddit (OP)
I don't want to mess around with Asahi Linux on Mac, and MacOS is a nonstarter for me so I'll pass.
tarruda@reddit
I sympathize with your point of view: Never liked Apple and MacOS, and would never otherwise buy one of their products. I also don't want to mess with Asahi unless it gained official Linux kernel support, which doesn't seem like it will ever happen (they appear to be stuck on M2).
But the truth is that there's nothing that comes close to Apple silicon for running LLMs in a local home lab.
Given this clear superiority, I bit the bullet and just got a Mac Studio, which I use as a locked-down server for running AI software.
To mitigate the downsides of using something like MacOS, I don't connect it directly to my home router. Instead it is connected to a mini pc via cable that acts as a gateway to the Mac, and which blocks access to the internet most of the time. I unlock it to download LLMs, but always block access to apple servers to prevent automatic updates.
If Asahi Linux gained mainstream distro support and Vulkan can run AI software as well as Apple silicon Metal, eventually I might switch to it.
randomfoo2@reddit
While I think Macs can be good and I haven't ever gotten my hands on a high end one to run more extensive real testing, I do think that the 800GB/s theoretical MBW numbers are somewhat misleading. In the llama.cpp Mac performance discussion thread the top of the line (80CU) M3 Ultra w/ 800GB/s of theoretical MBW gets a tg128 of 92.14 tok/s with the Metal backend on the standard llama2-7b q4_0 test model. Not bad!
However, when comparing the same model, Strix Halo gets tg128 52.16 tok/s with the Vulkan backend (RADV driver) - that's 57% of the tg128 perf at 32% of the max theoretical MBW. 🤔
On the flip side, the 3090 (CUDA w/ FA) gets 161.89 tok/s tg128 - that's 76% better performance than the M3 Ultra even though the 3090 only has +20% more theoretical MBW.
(For Strix Halo and 3090, my personal llama-bench numbers corroborate the published results.)
For those interested, be sure to also take a look at the pp512 (compute bound prompt processing/prefill), the numbers are even more stark as a comparison. You don't really get a free lunch when it comes to matmuls/watt.
Icy-Signature8160@reddit
> the same model, Strix Halo gets tg128 52.16 tok/s with the Vulkan

I see 5.18 t/s in your list, what's wrong?
P.S. Can you check how the new Qwen3-Next-80B-A3B performs on Strix Halo?
randomfoo2@reddit
There’s literally no way a 70B is getting 50 t/s tg on Strix Halo lol. Either you’re reading pp/tg wrong or simply the wrong model.
Qwen3 next needs to be quanted to run on Strix Halo and my understanding is that nothing does it yet.
tarruda@reddit
I think memory bandwidth is just one of the factors, and it doesn't have the same impact on all LLMs. GPU performance also matters, and the Mac GPU is definitely less powerful than a 3090 (however, it has a much better performance/power ratio). Here's the M1 Ultra llama-bench for GPT-OSS 120b:
randomfoo2@reddit
Here btw is how Strix Halo performs with the same model and the Vulkan AMDVLK backend (60-90W):
tarruda@reddit
That's quite good. Seems like memory bandwidth doesn't affect MoE that much. Can you run a 70B dense model like Hermes 4? I'm curious if it will be much different from what I'm getting:
randomfoo2@reddit
For a Llama 3 70B it looks like Vulkan RADV performs better than AMDVLK:
tarruda@reddit
As I suspected, memory bandwidth has a more significant impact on LLMs with more active parameters. If we compare the GPT-OSS inference numbers, mine is only 25% better than yours (62.52/50), while for the 70B LLM it is 89% better (9.83/5.18).
Another interesting thing is that pp512 at 0 context is better on the Ryzen AI GPU. IIRC prompt processing is affected mostly by GPU performance.
randomfoo2@reddit
Sure, dense models are more MBW-bound, but you'll note that again Strix Halo gets decently close to its theoretical MBW limit (39.59GB * 5.18 tok/s = 205.08 GB/s of a 256GB/s theoretical, so ~80% of max MBW). OTOH for the Mac it's 39.73GB * 9.83 = 390.55 GB/s, or 49% of the max theoretical MBW. I'm curious what `asitop` reports when you're running your inferencing. Does it ever get anywhere close to 800GB/s? If so that would mean there's some other efficiency going on, since that isn't being reflected by the actual tg speeds.
fatboy93@reddit
Genuine answer: ROCm
But Vulkan is 80% of the way there so idk man.
StyMaar@reddit (OP)
I'm definitely planning to use Vulkan.
RRO-19@reddit
That's a lot of VRAM for local models. Main downsides would be power consumption and whether your use case actually needs that much memory. Most fine-tuning tasks work fine with less.
StyMaar@reddit (OP)
If you want to run gpt-oss-120b or Qwen3-next, you cannot really do it with less memory…
power97992@reddit
Cant u get a m2 max mac studio for a similar price but with more bandwidth but less flops?
StyMaar@reddit (OP)
Software support is going to be a problem as Asahi Linux isn't exactly mature AFAIK (and no, I'm not going to touch Mac OS with a ten feet pole ever again).
power97992@reddit
What is wrong with the Mac OS?
StyMaar@reddit (OP)
If I'm spending two thousand bucks on a computer, it's my computer, and Apple considers the computer they sold as theirs (you can't repair it yourself, even with actual Apple parts, because there's DRM in the parts; they don't give you full access to your OS's internals “for security”, and so on).
Also I had to work with it a bit, and the developer experience sucks.
power97992@reddit
It is hard to repair, but the aluminum build is good, better than plastic, and the ecosystem is convenient.
EmbarrassedAsk2887@reddit
well. before I answer your wallet— can you tell what you finna use the local LLMs for? Is it chat, apis, ide?
StyMaar@reddit (OP)
everything.gif
EmbarrassedAsk2887@reddit
would you be interested if I asked you to buy an SSD and then, a series of steps later, you'd be able to run up to 400B models :)
StyMaar@reddit (OP)
“Is it possible to learn this power?”
Food4Lessy@reddit
The best low-power LLM machines are the Apple Max and the AMD 395, for under $2,000.
A sub-$500 budget means cloud, an Apple Air, or a 4060.
64GB to 128GB of fast memory gets a lot done and saves time as a developer. Runs about 200 LLMs.
Sub-32GB is good for getting your feet wet. Runs 50 LLMs.
sP0re90@reddit
If I understood well, llama.cpp still doesn't support the AMD NPU, so I would wait a bit until it does (and then run models at full power with either LM Studio or Ollama). In the meantime maybe prices will drop or other hardware will be released. This is exactly the same thing that stopped me from spending that $2k.
Intelligent_Bet_3985@reddit
One thing stopping me is that I've heard it's bad for image generation.
GreenCap49@reddit
Some update from strix halo discord: "Wow this new change is phenomenal. Iteration times for Qwen Image have gone down drastically. The same prompt, previously with Ultra Fast (4 steps) that took 2.5 minutes, now completes in 1m 40 seconds with Fast (8 steps).
Iteration speeds have gone from ~20s/iteration to less than 9s/iteration.
Great going!" This is with https://github.com/kyuz0/amd-strix-halo-image-video-toolboxes; I haven't tested it personally yet.
GreenCap49@reddit
It's not bad, Qwen image with 8 step lora takes around 5-10min per image. Define bad :D
Aplakka@reddit
Thanks, that's good to know. I've been interested in knowing how these might work for image or video generation but I haven't seen any benchmarks.
For comparison, RTX 4090 takes about 30 seconds for Qwen image with 8 step LoRA, so it's about 10 or 20 times faster than AMD AI Max+ 395.
GreenCap49@reddit
Yeah, really depends on your needs. For one-off image generation it's definitely enough, but if you run a pipeline you're maybe better off with a GPU. But then you can fit gpt-oss-120b or Qwen Next on it completely =)
_bani_@reddit
I built a 5 x 3090 rig so I can run things like gpt-oss-120b, it's pretty fast.
Intelligent_Bet_3985@reddit
That's about how I'd define bad, yes.
Freonr2@reddit
5090 would be what, 45-60 seconds?
LumpyWelds@reddit
What about Image understanding (I2T)?
Edzomatic@reddit
Is that quantized or the full model?
ASYMT0TIC@reddit
Not if you buy it for the motherboard and jam your 4090 in there with it.
Freonr2@reddit
MoE LLMs are really what it is ideal for. Pretty much everything else will be relatively slow given the memory bandwidth and compute.
Dollar for dollar, a 5090 for $2k is going to absolutely stomp the 395 for dense diffusion models since even the larger (14B, 20B) models will fit in VRAM with only a few minor compromises (i.e. picking the right GGUF quant to fit, but these work very well).
I would not buy the 395 to run Wan video, Qwen-Image, Flux, etc.
Maybe we'll start seeing large MOE diffusion models later on, but there's no guarantee.
Rich_Repeat_22@reddit
I bet the guide using ComfyUI + ROCm On windows for RDNA3 & RDNA4 dGPUs will work the same with RDNA3.5 the 395 has 🤔
mycall@reddit
Zen6 is around the corner
Rich_Repeat_22@reddit
Medusa will still have the same iGPU; however, it seems on Zen 6 the NPU is integrated and beefier.
Time will tell. Imho we need big NPUs instead of GPUs to run LLMs.
mycall@reddit
I mostly agree about NPUs, although when I use LM Studio on my HX 370, it uses Vulkan instead of the NPU. This might be due to NPU memory bandwidth. I hope that changes.
Rich_Repeat_22@reddit
LM Studio is bad when it comes to these APUs.
Murder_Teddy_Bear@reddit
I have one and am very happy. (Hooked up an eGPU with a 4070 ti super to it.)
JumpingJack79@reddit
😮 What product do you have? And how did you hook up the eGPU?
pn_1984@reddit
Many mini-PCs come with OCuLink nowadays to expand with a proper GPU later. I, for one, find it very useful because it lets me stagger my expenses.
JumpingJack79@reddit
Are you sure? I'm not finding any viable AI MAX 395 products with Oculink 🤔
pn_1984@reddit
You might need a dedicated docking station but Minisforum M1 Pro-285H is probably something which might be suitable.
JumpingJack79@reddit
No, not suitable. The primary requirement here is the Ryzen 395 with the 128 GB (V)RAM. Without that it's just a regular PC with a GPU, which I already have.
simracerman@reddit
Lookup Aoostar AG01.
tarheelbandb@reddit
I wonder how an RTX 4080 with the MoE layers offloaded to the CPU would perform vs the APU alone.
igorwarzocha@reddit
I keep on asking people to try offloading the experts to the iGPU whenever I see someone with an eGPU hooked up to the thing...
(Doable with any combo via Vulkan, you just put device vulkan0/1 into your -ot regex-style command.)
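Something along these lines; a sketch that assumes the eGPU enumerates as Vulkan0 and the iGPU as Vulkan1, the model path is a placeholder, and the expert-tensor name pattern varies by model architecture:
```
# dense/attention weights land on the eGPU, expert tensors get pushed to the iGPU
./llama-server -m model.gguf -ngl 99 \
  --device Vulkan0,Vulkan1 \
  -ot "ffn_.*_exps=Vulkan1"
```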
Potential-Leg-639@reddit
Oculink?
Otherwise-Variety674@reddit
I swear, if I had the money, I wouldn't think twice and would have bought it yesterday. It is just goddamn worth it.
CryptographerKlutzy7@reddit
It kicks arse seriously, I grabbed two, and regret neither.
sedition666@reddit
Interesting, what do you use both of them for out of curiosity?
CryptographerKlutzy7@reddit
Well, AI boxes mostly, but they also kick booty running games, doing dev work, etc. So one is being used as a dev/gaming machine (different partitions), which I throw in a backpack when I'm heading to friends, conferences, work, etc.; the other is a dedicated AI box which I have at home.
They are a crazy good "desktop"/ dev box.
The AI box is because we have a bunch of private data stuff, but we still need agentic coders for automated debugging, so we can't use the cloud LLMs for that stage, so we push to the box for that.
It works well enough. No regrets getting the boxes at all. They are way better than they have any right to be. ABSOLUTELY max them out on memory.
sedition666@reddit
Very cool I am extremely jealous!
Daniel_H212@reddit
Heres the reason I'll give you:
The AI Max+ 395 is a bit of a first gen product. It's the first foray into large unified memory systems for consumer AI by AMD. Competing products from Intel will come out, and subsequent generation products from AMD will come out too. 128 GB of RAM is a lot today, but will it really be a lot a few years down the line? Keep in mind it isn't upgradable, so if you have a PC now, I'd just spend a few hundred to get a bunch of RAM to let you start running large MoEs now, and wait for bigger and better versions of this kind of machine to come out in the future, rather than beta testing a new product niche for AMD.
sleepingsysadmin@reddit
>My wallet thanks you in advance.
Sorry wallet, no help today.
I was asking GPT5 recently to try to estimate AMD and Intel's product cycles and for when a 256gb option with that much more memory bandwidth might come. It was thinking 2027-2028.
Intel cancelled their option and are moving toward enterprise options. So nothing for us.
Apple isn't likely to bring a 256GB Mac Mini until 2028.
The amd 128gb might be a best in slot for quite a long time.
ailee43@reddit
Intel cancelled battlematrix?
layer4down@reddit
Why not consider M3 Studio 512GB?
tarheelbandb@reddit
Because it's twice as much and the gains may not be worth it?
layer4down@reddit
Sure if an LLM inference workstation or server is the prime use case then it may be overkill. I happen to love my M2 as a daily driver outside of AI/ML. Being able to run larger models is just icing (PP performance notwithstanding).
tarheelbandb@reddit
Curious. I was not under the impression that we were having a conversation about usage outside of AI use cases.
layer4down@reddit
That is true of the OP's main thread. But u/sleepingsysadmin brought up the Mac Mini. It wasn't clear they were aware that there's a larger 512GB unit on the market as well.
tarheelbandb@reddit
Fair.
ASYMT0TIC@reddit
It's five times as much in fact.
tarheelbandb@reddit
I was being conservative and looking at lowest specs models and used, but yeah. I don't think of myself as cheap, but I can't find the value, especially in my use case.
Freonr2@reddit
Because it is $10k.
sleepingsysadmin@reddit
I don't hate myself enough to use anything but Linux.
ashirviskas@reddit
Depending on your self hate and pain tolerance levels, you could use linux on apple hardware
Freonr2@reddit
Next hope for good value is probably the next gen Ryzen AI chip, based on rumors and guessing maybe it'll be the Ryzen 495+ with 256GB and ~450GB/s+ and maybe $3500 or less, but late 2026 is probably optimistic and 2027 might be more realistic.
Anything from Apple is going to be pricey. Current Mac Studio M3 Ultra 256GB is $5600. Maybe value will improve a bit but I'd be surprised to see a decent 256GB system from Apple for less than $4500.
MaverickPT@reddit
That 256 GB Mac mini is gonna cost like 4k-5k🥲
HenkPoley@reddit
Just for the RAM option. 🫣
sleepingsysadmin@reddit
cost much more than $, the soul damage from owning apple products...
tarheelbandb@reddit
Crazy because I think the APU is how AMD stays competitive in the face of sweatys that live and die by Nvidia and Mac. I'm still trying to get a straight answer on how much more productive an individual developer can be on that same amount of memory with Apple silicon or the equivalent in Nvidia GPUs. My best guess is that the return drops off a cliff.
junkmailkeep@reddit
I just got mine. Is there anything I need to do for better performance? I know ROCm support is still getting worked on and works better on Linux.
Negatrev@reddit
Only a fairly abstract reason. But while the tech is moving so fast, your options and requirements can change from month to month.
So if you don't really need it right now, you should wait, save your money and see what becomes...impossible to resist.
Kindly_Elk_2584@reddit
Because it's an AMD GPU lol. You can't run anything besides a selection of LLM models. And it's just 4070-level power; I'd rather invest that money into a 5090.
profcuck@reddit
Hey /u/StyMaar you may find this useful:
https://docs.google.com/spreadsheets/d/1mmob8me7STljG6r7EvmJBuhJTqoBaAtba2f_9RU7Ef4/edit?gid=0#gid=0
As I am in the midst of upgrading my homelab to 10gbe, the Beelink looks like the one for me. Let us know which one you get?
Left-Language9389@reddit
Anyone got a link to this machine so I can see what OP is not buying?
Cloakk-Seraph@reddit
I mean, I've been rocking one for a while. I've anecdotally noticed that while there's plenty of VRAM on tap, it's not as responsive or quick as the dedicated AMD 7900 XTX in my desktop. Also, cue the usual "AI is harder with AMD vs Nvidia" (not that I'd know what that's like).
lodg1111@reddit
Counterintuitively, a bigger model doesn't work as well in some tasks. For example, when I was writing a cover letter, gpt-oss-120B acted very unprofessionally, producing subheadings and bullet points; even after I told it to fix that, it kept repeating the error.
I switched to gpt-oss-20B, which was already very decent: proper paragraphs, no short forms, no bullet points, no subheadings. It writes a much better email.
tronathan@reddit
Do these machines suffer from the expensive input token issue that affects MLX, etc? e.g. Are you going to have 60 second time-to-first-token?
amztec@reddit
I will buy it, simple reason:
Why not buy it? Even if I were being cheap, this is a good deal.
Massive-Question-550@reddit
Many MoE models still aren't that great, and the good ones are too large to fit on the 395. Also the prompt processing is very slow compared to most GPUs, and expandability isn't great. If they released a more powerful desktop version that you could connect multiple devices to, then that would really be something you would want to keep long term.
Ok-Hawk-5828@reddit
You’re either playing with it or having it do 24/7 workflows.
If toy: You’re better off using cloud APIs. A 3090 stack is better at most things. DGX is more versatile. Studio ultra seems more fun.
If 24/7 tool: is it really that much more useful than a 64GB AGX Xavier for $250-300?
The 395 fits very nicely into its price range and is very competitive in said range but it seems to be bought by people who don’t know exactly what they’re going to do with it and that is not a good sign.
ijustwanttolive23@reddit
Unless cloud providers start offering true ZDR in the privacy policy you should act like everything is public...
Ok-Hawk-5828@reddit
I don’t care if everything is public if a solution meets my needs.
I use local for context management and 24/7 workflows that are cost prohibitive to run in the cloud. Building out local AI setups for <1% utilization sounds like insanity.
ijustwanttolive23@reddit
Do you feel the same about gaming PCs, your car, etc? There are freedom, consistency, and privacy benefits to owning, even if utilization isn't super high.
Ok-Hawk-5828@reddit
I've never had a gaming device since I was a kid. I have a truck that does what I need it to do, and if it doesn't, I rent instead of buying more cars or something that could meet every imaginable need. Utilization is everything when it comes to quickly depreciating assets.
StyMaar@reddit (OP)
Thanks but no thanks. I'm not surrendering my privacy for convenience or small savings.
Except at real estate, power consumption and noise. That's a big deal since my computer is in the living room…
Wat? DGX is ARM (so no gaming) and will likely only support a custom Nvidia distro based on an obsolete Ubuntu. It may be a much better machine for ML scientists, but it's the opposite of “more versatile”.
xeikeo@reddit
I just put a pre-order in for a frame work desktop 😭😭😭 maxed out.
audioen@reddit
So it's a nice general desktop computer: it can do gaming, and LLMs in a limited way for one user, and I appreciate the compact size and silence, which is what I get using the HP Z2 Mini G1a box. It's more like $4000, though.
I predict that in about 1 year, it will be considered a lower end computer for LLM. So within a year, these AI Max boxes will start to look like paperweights and the competitors will likely be several times faster and some probably ship with more RAM. The Ryzen 395+ is kind of discount Apple -- more attractive performance/price ratio, but that's about it. Slow compute means: stable diffusion is slow, video generation is slow, dense LLM models are slow. It can literally only run MoE stuff and very small < 10B models at usable speed.
I love my little computer, but I'm expecting to be selling it within the year.
ijustwanttolive23@reddit
Correct me if I am wrong, but a Mac Studio would be noticeably faster, right? And if you get the 256GB or higher, your options are a lot better?
StyMaar@reddit (OP)
Indeed, but it would be much more expensive. I could almost buy two 5080s to put in the Strix Halo's PCIe slots at this price.
$5600 though, that's close to three times the price of the 395 I'm talking about.
Also I don't want to tinker with Asahi Linux and I'm not going to run a closed OS on my computer.
ijustwanttolive23@reddit
True. I forget the price.
Question: Can the AMD AI Max+ 395 use an external GPU? Could you hook up a 3090 and offload a chunk of a model?
StyMaar@reddit (OP)
AFAIK it depends on which exact model you buy, but the Framework Desktop has a PCIe x4 socket (+ 2 M.2 sockets, one of which can easily be repurposed into another PCIe x4).
I'm not 100% sure about that, but I've read plenty of times that the number of PCIe lanes isn't that big of a deal for LLM inference, as there isn't much data to move from one GPU to another, but don't quote me on that.
cidiousx@reddit
I saw this thread and am debating getting a box myself.
Currently I'm running a full-blown server with an Intel 245K + 96GB 7200 + 2x A5000 24GB (48GB total). If I wanted to expand the VRAM, I'd have to go for some (sketchy custom) SKU 4090 48GB cards or break the bank.
I bought the A5000 cards for $1100 USD new a pop and they go for $1300-1350 USD here locally now second hand. I could sell them and make a profit. In return I could grab a MAX+ 395 box with 128GB soldered RAM.
The benefit of that swap would also be freeing up a lot of the other parts in the build that could strengthen some of my other server builds.
My gripe with it is that the RAM is soldered and nothing about the box can be upgraded. It will just depreciate, and without being able to upgrade it, it's not a long-term solution.
Any thoughts?
DevDuderino@reddit
9/10, would highly recommend. Finding so many random uses for cheap AI I didn't know I had before.
NegativeKarmaSniifer@reddit
Can you list a few? I'm just curious.
DevDuderino@reddit
Let's see.
Podcast transcription and summarization, I use a combination of whisper.cpp and gpt-oss:120b to extract a structured representation of the segments and topics covered.
FM radio monitoring, have an rtl-sdr tuned to a local news station. Every 60s clips are processed through a whisper->llm pipeline, anything 'important' is included in an hourly news summary I get via email.
Light 'agent' work. Crush CLI running with QwenCode 30b works surprisingly well with MCP servers (for simple tasks like DB queries).
I still use gemini+claude for long-context tasks where accuracy is important. Being able to just run batch tasks constantly without worrying about token costs has been great.
flammafex@reddit
> FM radio monitoring, have an rtl-sdr tuned to a local news station. Every 60s clips are processed through a whisper->llm pipeline, anything 'important' is included in an hourly news summary I get via email.

I get that, but how do you test for importance? Simple keyword matching? LLM judges?
DevDuderino@reddit
So what I do is check against a running list of "news topics" I have stored in a SQLite DB.
Right now "important" means either something completely new that isn't in the existing topic tracking, or new information matching a Google Alerts-style subject. So the Charlie Kirk stuff popped up since it was a new story. "National guard deployment Chicago" is a subject I have set up to be included as an immediate alert.
It's all done through rules I have set up in SQL, so once the structured data is extracted it's fairly simple to generate the "digest".
flammafex@reddit
ty
DevDuderino@reddit
The LLM is mostly just transforming the raw audio into a structure I can script against
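Roughly, the capture side is shaped like this; just a sketch, with placeholder frequency and model paths:
```
# grab 60 seconds of FM audio and transcribe it with whisper.cpp
rtl_fm -f 101.1M -M wbfm -s 200000 -r 16000 - | \
  sox -t raw -r 16000 -e signed -b 16 -c 1 - clip.wav trim 0 60
./whisper-cli -m models/ggml-base.en.bin -f clip.wav -otxt
# the resulting .txt transcript then goes to the local LLM for topic extraction
```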
mlexx@reddit
curious too
No_Bake6681@reddit
$2000 is a lot of tokens
StyMaar@reddit (OP)
That's a lot of free training data to give to the corporation hosting the API for sure. I'd rather keep that for myself.
No_Bake6681@reddit
Ah ok so privacy is a key goal of yours.
What is your usecase? Coding?
StyMaar@reddit (OP)
Yes. Not that what I do with LLMs is necessarily confidential (though sometimes it is, because it contains customers' information), but mostly as a matter of principle.
A bit of coding, but lots of various stuff LLMs work well at: getting unstuck on the first draft of an important email or other written document (it doesn't matter if pretty much all of it gets rewritten, the LLM is there to overcome the “blank page syndrome”), getting rephrasing suggestions, summarizing technical documents, extracting relevant information from logs, building valid JSON requests for an API from natural language input, etc.
I've been using a mix of Mistral, Gemma and Qwen, with a good chunk of the work being done on CPU because I only have 8GB of VRAM.
As you can guess, it's pretty slow, and I've been contemplating buying a better GPU for a while, but I've been disappointed by new models and didn't feel great about buying a used card, given that they sometimes have defects that can be hard to detect.
Soggy-Camera1270@reddit
Yeah but it's not just an AI tool either, is it.
No_Bake6681@reddit
Sorta agree... as a reminder, OP wants to be dissuaded.
If having another laptop-class computer is useful, then sure.
If all they want to do is build with AI (same), then I've surmised that the 395+ is more of a novelty and a distraction from that true goal.
Yo op, what are your goals?
Ornery-Delivery-1531@reddit
The software is not there yet; wait for the final ROCm 7.0/6.5 and for other libraries to be stable on it. This thing needs to be good not only for LLMs, but for other use cases like TTS or Stable Diffusion too.
So, once this happens, Nvidia Digits may finally ship and it might be a better choice then.
Rich_Repeat_22@reddit
Digits is double the price, has a pathetic mobile ARM CPU, and is restricted to an Ubuntu-based ARM NVIDIA OS, since drivers don't exist even for Windows on ARM.
StyMaar@reddit (OP)
Digits still isn't out. Also, it will allegedly be twice as expensive, and it's ARM (so I couldn't use it to play occasional video games) and will likely require a custom Nvidia distro based on an old Ubuntu.
Thanks but no thanks.
tarheelbandb@reddit
Just got my shipping update from Bosgame on mine. Despite the site's shipping page saying local US shipping, the tracking says it's coming from Hong Kong. Maybe it will still get here in 5 days.
randomfoo2@reddit
It's your money and time, but here are some downsides/gotchas to be aware of that I outlined here: https://www.reddit.com/r/LocalLLaMA/comments/1nc0dgg/comment/nd87zji/
At the end of the day, you're basically buying a Radeon RX 7600 w/ 120GBish of memory, so those are going to be your pros and cons. If you can get what you want working on a not-well-supported RDNA3 chip and don't need a lot of compute or memory bandwidth (and get a wicked fast CPU as well) it's a good deal. If you're expecting anything as nice as an experience you get from an Nvidia card, or don't want to spend your time poking around to get things working, then maybe you might be better off with other options, but you'll end up with a more complex/power hungry or more expensive system (or both).
There's a dedicated Strix Halo Homelab wiki and Discord you can visit to learn more: https://strixhalo-homelab.d7.wtf/ (there's an AI capabilities section there that goes into detail on what works/doesn't).
Clipthecliph@reddit
Cause you gotta double it, be future proof!
Quick_Rest@reddit
It lets you run those models, but not at "decent" speeds. This means it's mostly great for testing, but you'll really want a larger GPU cluster or opt for cloud-hosted options to get any real work done (e.g. looking at code).
I have a Strix Point (AI HX 375) laptop with 64 GB of RAM. It's about ~1/3 the GPU performance and ~1/2 the memory bandwidth. Sure, it can run models like OSS 20B, but only at ~25 TPS.
BillDStrong@reddit
Because what is coming is going to be so much better.
In seriousness, if you need it now, then get it now. If not, then don't. I would suggest you make sure to get a platform that offers some expansion capability, such as the Framework with its x4 PCIe port, or one of the models that has OCuLink ports. This will allow you to add additional GPUs later.
Why would that be important? AMD, for instance, in its next gen RDNA 5 GPUs will be pairing with LPDDR5 memory instead of the much more expensive GDDR6/7, so they will be able to offer cards with much more VRAM. GPU VRAM is the only way you have of expanding the amount of memory on these machines, so if you want to run a model that is much larger, you need a solution.
Now, AMD has not announced sizes yet, and these are scheduled for more than a year or so away, but their use of essentially desktop memory means they, or their partners, could offer a card with memory on the order of professional cards at much cheaper prices. Even the threat of this should force Nvidia's hand on memory, so with future-proofing you win, but if not, you don't.
StyMaar@reddit (OP)
Very interesting, where did you get that from?
BillDStrong@reddit
Watching all the leak videos. In this case, Moore's Law Is Dead, who has a very good track record, released it, and others have confirmed that is what the documents they have seen say.
momono75@reddit
Okay. I will try to stop you.
Z.ai has a fixed-price subscription plan for GLM 4.5. It's $6 per month, with more than three times the limit of Claude Code Pro.
StyMaar@reddit (OP)
I'm not willing to give up my freedom to save a few bucks. Cloud LLMs are a non starter for me.
momono75@reddit
How about Nvidia's Spark, if you don't mind the cost so much? And even if you don't choose it, other products' prices might go down once the Spark is released.
StyMaar@reddit (OP)
ARM means I can't use it as my daily driver (I play a small but non-zero amount of video games). Also, the custom Linux distro thing on previous equivalents from Nvidia doesn't inspire much confidence.
momono75@reddit
I see. Yeah, that will work better than the usual gaming mini PCs, or Steam Deck. I failed to stop you.
StyMaar@reddit (OP)
Also, did you see that: https://www.reddit.com/r/LocalLLaMA/comments/1ndz1k4/pny_preorder_listing_shows_nvidia_dgx_spark_at/
If that's true then it means the Spark will be as expensive as a 395 and a 5090 combined!
I'd buy both rather than just the Spark any day if I wanted to spend $4300 …
momono75@reddit
What?! So expensive... I'll buy the 395 now.
FabioTR@reddit
Two reasons:
1) ROCm support for Strix Halo is still not perfect. It will definitely improve in the next months.
2) I think some Chinese producers will soon release Strix Halo motherboards, with RAM and CPU integrated, and you will be able to get better and cheaper systems.
prusswan@reddit
If you can afford to and have some idea of how you can realize the value, just go ahead and do it. Time is money.
$2k is really not a lot, as long as you put it to good use.
randomqhacker@reddit
$1000 difference from 32GB to 128GB models tells you they are charging way too much right now. At least wait for Black Friday.
gofiend@reddit
Really want someone to make a mATX version with like two PCI x8 slots. 128GB + 2x 24-32 GB GPUs would be absurdly powerful.
StyMaar@reddit (OP)
You can't really have that many PCIe lanes unfortunately, AFAIK there are just 12 lanes available on the CPU.
But you can still use two PCIe x4 slots if you repurpose one of the M.2 slots.
I have no idea how much PCIe lane width impacts performance for LLM inference though.
Educational_Sun_8813@reddit
It will be fine, you just have to load the model. If you already have an Nvidia card, just run nvtop to see how little memory transfer happens during inference.
StyMaar@reddit (OP)
That was my intuition but using nvtop is a good idea actually, thanks.
gofiend@reddit
Yeah repurposing would be fine. Would be nice if someone turned this into an integrated motherboard that did the repurposing. Seems silly to have tons of storage bandwidth (NVME is cheap) when you could stick another GPU on. Pretty sure it's a net win (especially with MOEs) even if you have slow PCI (since you won't be using tensor parallelism anyway).
NWDD@reddit
Since most boards are ITX or smaller, you should have space in a matx case to repurpose for 2 gpus without much of a hassle. If you're willing to sacrifice all M.2 it might be possible to run five gpus (three at 64gbps and two at 40gbps).
The 64Gbps bandwidth is high enough that you shouldn't notice it other than for model loading / hot-swapping (or games that stream a lot of assets), and you should still have enough SATA headroom (up to 24~32 Gbps with proper RAID, depending on the board you're using).
To me the most annoying thing is Framework Desktop having the capped pcie slot, chinese manufacturers selling motherboards without using most connectivity and minisforum shipping a full computer instead of a standalone motherboard. It's criminal.
gofiend@reddit
Not sure it's reasonable to run GPUs at under PCIe 4.0 x4? Otherwise, it's a great idea.
+100 re flubbing the PCIe slots. Heck, add $500 to the device and just give us a bunch of slowish PCIe slots instead of M.2.
NWDD@reddit
it is not optimal, but good enough:
- It was common for homelab setups to train local GANs and other ml models https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
- When loading a model, streaming from disk will be bound at 64gbps (so if you are trying to optimize model loading and you'll be loading them from disk, there won't be a significant difference between loading from a pcie4 nvme into a x4 gpu vs loading into a x8 gpu)
- When gaming, streaming from disk in pcie4 is bound at 64gbps.
- When gaming, for other operations (like passing vertices or indices to the GPU) you still have ~133 megabytes per frame at 60fps, so it shouldn't have an impact on most AAA games (where the world is mostly static) or indie games. There doesn't seem to be a lot of information specifically benchmarking something like a 5090/4090 at PCIe 4.0 x4, but you can see that people were gaming just fine on PCIe 3.0 x8, which is equivalent, with GPUs from back then, like the 1080, that are still good enough to play most games: https://youtu.be/XJuj16gRoBI?si=FSYukwpwqeuyEEvs&t=311
fallingdowndizzyvr@reddit
There are 16 PCIe lanes.
If you only repurpose one NVMe slot, then you only get one x4 slot. If you want two x4 slots, then you have to use both NVMe slots. Which is reasonable since you can just use a USB drive.
Eugr@reddit
Framework Desktop has two M.2 slots and one PCIe x4 slot on the motherboard.
pieonmyjesutildomine@reddit
Can someone help me figure out what PC this guy is talking about? I thought they were talking about the new Beelink with the AI Max+ 395 and 128GB, but as I read I think it's a different one. Are Beelinks good for this same use case?
StyMaar@reddit (OP)
I'm not stuck on a particular model, even though the Framework has a PCIe port which can be convenient in the long run as it leaves room for extensibility.
Potatomato64@reddit
Medusa halo with RDNA5 coming in 2027 might be significantly better than strix halo with RDNA3.5 as its also a first generation product line
StyMaar@reddit (OP)
2027 is way too far away in the future for my patience though.
slvrsmth@reddit
If you don't need it, wait. The LLM field is too green yet, too much churn is going on. Great hardware today might gather dust next week.
If $2k is not pocket change, and you don't have very clear use case, wait. It's an expense. When the field gets boring and slow to change, you might position it as an investment.
kacoef@reddit
maybe you want cuda.
kwa976@reddit
What OS are you all running? Linux mostly?
StyMaar@reddit (OP)
Linux exclusively.
gthing@reddit
What's the best model you could run on that? Figure out how much that model costs from a provider like DeepInfra, and how many billions of tokens you'd have to put through it to make your investment worthwhile (at, say, $0.10 per million tokens, $2k buys about 20 billion tokens). And the API will be much faster.
StyMaar@reddit (OP)
API = no privacy = hard pass.
superdav42@reddit
For the same price you could rent something equivalent on vastai for over a year.
StyMaar@reddit (OP)
I could likely rent a H100 for a year, yes.
CondiMesmer@reddit
Because you will save a significant amount of money and effort just using cloud computing instead. Also upgrades are free that way. Why would you ever buy hardware?
StyMaar@reddit (OP)
You do what you want with your privacy, and good for you if you don't mind having no control over “upgrades” (thinking about the people crying about the disappearance of GPT-4o…), but what are you even doing on this sub?
squareOfTwo@reddit
maybe there will be GPUs which are cheaper overall in a year or two.
StyMaar@reddit (OP)
I'm not holding my breath…
SilentLennie@reddit
Compatibility.
Check out this guy:
https://www.youtube.com/watch?v=wCBLMXgk3No
StyMaar@reddit (OP)
Will give it a look, thanks
Xamanthas@reddit
Because you never buy the 1st gen of anything. Ignore the rest imo. Only suckers willing to part with their money buy 1st gen.
_hypochonder_@reddit
>the 128GB of VRAM for $2k is unbeatable.
You can build a system with 4x AMD MI50 32GB for under $2k and it would be faster.
>https://www.reddit.com/r/LocalLLaMA/comments/1lspzn3/128gb_vram_for_600_qwen3_moe_235ba22b_reaching_20/
StyMaar@reddit (OP)
Sure, but it will be much more noisy, consume a lot of power and take a lot more space. I don't really want a bulky workstation in my living room.
lost_mentat@reddit
We need food, water and shelter. That's all we really need; perhaps some need human company, though this is not universal. So I didn't need to buy the RTX 6000 Pro Blackwell with 96GB of VRAM, a 32-core Threadripper and 256GB of ECC RAM, but I did anyway. Because now I can run scientific simulations and wind-tunnel simulations that others are doing more competently, and I can locally run various LLMs, which I don't need to because I can talk to more advanced LLMs via API for pennies. I didn't need it but I wanted it. And now I want more…
ThenExtension9196@reddit
Cuz it’s not nvidia.
StyMaar@reddit (OP)
I'm asking for argument against the purchase, not in favor.
spaceman_@reddit
I bought a 64GB laptop with the chip before summer, because I thought "40GB will fit every model I can run at reasonable token rates anyway; whatever the 128GB can fit that the 64GB can't would be too slow to be useful."
I have regrets now.
FlyByPC@reddit
If you don't need it now, prices for tech generally go down, historically. I'd expect the same tech to be less expensive / more capable, next year.
It's generally always a great time to buy new tech compared to last year, but generally always a terrible idea compared to buying next year -- unless you could really use whatever it is, now.
alpha_epsilion@reddit
Generation 1 product is often for suckers and guinea pigs
Technical_Ad_440@reddit
Wasn't it that the Nvidia one does 20 tokens or something a second while the current AMD one does 2 tokens a second, and the new one is gonna be 4 tokens? They can run big models, but really slowly, is what I got from it, unless I was looking at the wrong info, which, with how many sites are random BS articles, is very possible, probably likely.
If these are indeed fast and can run LLMs and video models as well as image ones, I will buy one. But I think that was the other flaw: they mostly run text LLMs; for video you still need Nvidia and such, unless that was also some more BS I came across when looking into AMD AI rigs.
I was looking at Nvidia's Jetson server stuff to do video and images, that's what I was recommended; it seems affordable, can get 128GB, and is very upgradable at around 3k.
dispalt@reddit
Hey there, AMD CEO here, you should definitely buy one
pengy99@reddit
It's essentially a first gen product. I would expect similar solutions over the next couple years to get drastically better with higher bandwidth and possibly even more memory. Unless you have a real professional use case I would just wait.
Freonr2@reddit
Probably great for gpt oss and glm air as you state, and any potential future 80-200B MOEs, particularly low active%. Could be a solid gaming desktop if you dual boot win/steamos. CPU is strong, GPU is at least very solid.
Probably will suck for everything else, like diffusion models. No cuda, worse compatibility. Less community knowledge/support.
seppe0815@reddit
Sitting in the Apple ecosystem, but the memory speed is too low for video and image gen on this Ryzen... but anyway, have fun.
flanconleche@reddit
Dooooooo it, got my Framework Desktop and loving this thing to death.
green__1@reddit
so, as a complete newbie who knows nothing, I'm curious about this. so far everything I've read says Nvidia is the only one ever worth considering for any generative AI workloads, and that discrete GPU with dedicated vram is the only way to go.
but those same people say that the only thing that really matters is vram, and 128gb is a heck of a lot of that compared to any dedicated nvidia card within a lightyear of that pricing ballpark.
I'm just in the process of scraping together the entrance fees for the localllama club, but something like this sure feels like a nice option! (looking for a lot of general purpose llm plus some occasional image gen, on something that can also run a good desktop Linux distro as my daily driver to replace my current PC that is old enough that it still stores data by chiseling it onto some tablets)
camwasrule@reddit
Please don't forget the big issue is prompt processing, everyone. The larger the conversation gets, the longer the prompt processing takes too. Probably better off running a 32B model with that much VRAM and using the rest of the VRAM for the KV cache.
Morganross@reddit
REAL ANSWER:
At best you get 2.5 million tokens per DAY (that's roughly 30 tok/s sustained around the clock).
Even if it is cost-neutral over the long term, it's a drop in the bucket compared to how many tokens you'll use per day.
Clear-Ad-9312@reddit
Why you should not: it is completely overkill for most people, but at the same time only enough for entry-level MoE models like the ones you mentioned. It is also possible to wait 2 to 3 years and get a system that will absolutely be way better but cost a bit more, or the same as this one. On the other hand, API costs are way cheaper than local.
Another downside is that it is completely stuck with the current configuration. You can't realistically upgrade the CPU or RAM; plus, once DDR6 comes out, the benchmarks will completely shift toward CPU+RAM being a viable choice vs the very expensive GPUs.
mr_zerolith@reddit
It's like half the speed of a 5090, and a single 5090 doesn't currently have the compute power to run a substantial model (~100B).
I'd honestly buy a 5090 today but in the future, the better hardware is:
- Nextgen Rubin-based nvidia
- Apple M5
But today, you will regret buying anything but the strongest thing you can get your hands on for under 10k
Don't bet on Qwen reducing your hardware needs by 4x, that probably isn't happening, and Qwen3 isn't that smart compared to other models in the first place.
AlwaysLateToThaParty@reddit
Half the speed? More like one eighth? 250GB/s vs 1750+GB/s? I'm pretty sure that's the bandwidth comparison.
mr_zerolith@reddit
Hmm i'm thinking more of tokens delivered per sec rather than just bandwidth
It is a different architecture after all and the unseen variable is latency.
By tokens/sec i'd consider this hardware on the weak side.
AlwaysLateToThaParty@reddit
Tokens per second pretty much directly relates to memory bandwidth (roughly, decode speed tops out at bandwidth divided by the bytes of weights read per token: 256GB/s over a ~60GB dense model is at best ~4 tok/s, which is why low-active-parameter MoEs are the sweet spot). So yes, it is on the weak side. I mean, there is a pretty good use case for playing with it, but anything relating to image creation uses CUDA, so Nvidia; if you want to load a medium-sized MoE model, though, it would be fun to play with. Not very fast, but fun.
Euphoric_Ad9500@reddit
Poor memory bandwidth
superminhreturns@reddit
What problem are you trying to solve with an LLM? We all love to have the new shiny toy, and if you have the budget then sure, why not. But you have to ask yourself what you are going to use this for. Would it not be cheaper to use a cloud service ($20 bucks a month)?
If it’s going to be potentially utilize it to make money (start a business) and you need to practice then go for it. If not, really think why you need it.
Don’t get me wrong, I’m a strong advocate for local LLMs. I have a dual 3090 at home, 2 RTX 6000s at work, and all of my personal development SFF PCs have at least 16GB of VRAM for LLMs. But all of my devices have a justification: email rewriting, promotional messages, text summarization, sentiment analysis, etc. Note: for code generation I still use a cloud service (Claude Code).
Basically think hard (no pun intended) on why you need this new shiny toy.
tarheelbandb@reddit
I am absolutely a novice "vibe coder". I hit Claude's $20 ceiling every night after about 2 hours. IMO the cloud services are where you start, on-prem is what you get when you want more, and cloud is what you need to scale. Like so many things, cloud services seem really geared towards enterprise use cases or deep-pocketed hobbyists.
superminhreturns@reddit
It really depends. My motto is if I can use local llm to do the job, then go with local llm. An example: I used to spend around $400 a month for translation using cloud service. I got smart and utilize local llm to perform the translation.
tarheelbandb@reddit
I mean, you are saving that much every year. Such a great use case that pays for itself in less than 6 months.
I used to manually transcribe audio and video recordings for practically peanut$. I wouldn't be surprised if I contributed to that, directly (straight up translating your files) or indirectly (providing the training data for whatever LLM now provides that service).
zipzag@reddit
Claude is the most expensive, Grok the least. You get a lot of tokens with Gemini too. Then there's the Chinese models on OpenRouter.
My Youtube TV subscription cost almost $90/month. LLMs are cheap
Educational_Sun_8813@reddit
you don't need justification for free software and 120W TDP
atape_1@reddit
Because we are still waiting for the Nvidia thingy, that will maybe, someday launch.
-dysangel-@reddit
Soon after that was announced, the Mac Studio with 512GB VRAM and better bandwidth than the DIGITS came out, and I opted for that instead. Glad I did not wait!
xxPoLyGLoTxx@reddit
Super jealous! It's definitely on my "upgrade someday" list. I can't wait to see what m4 ultra brings.
-dysangel-@reddit
Iirc they're skipping the M4 Ultra? So the next one is probably the M5 Ultra
xxPoLyGLoTxx@reddit
Oh really? That's interesting. Cripes by then we'll be at 1tb unified memory lol.
beragis@reddit
An Ultra is basically two Maxes bonded together, so an M4 Ultra would have had around 1.08TB/s if it had come out, and with an M5 I would expect around 1.25TB/s.
That would be around 25 percent above the 4090's bandwidth. Not sure how the Ultra's 80 GPU cores compare with CUDA cores, but the extra headroom would likely put it near a 4090 or possibly even an H100, with far less power draw.
zipzag@reddit
1-1.5 TB/s is apparently what can be made later next year. But I don't expect new SOC type builds at that speed before 2027.
xxPoLyGLoTxx@reddit
Definitely good for competition. I love having the extra vram you get with Mac versus a single discrete gpu. Even though the gpu is rated faster with memory speed, it’s often slower in the real world because part of the model gets offloaded to ddr4 or ddr5.
zipzag@reddit
M4 Ultra thermals apparently wouldn't work in the form factor.
StyMaar@reddit (OP)
Isn't it going to be twice as expensive for the same memory bandwidth?
Eugr@reddit
It is. It should have much higher GPU performance, so prompt processing should be faster, but on the other hand its CPU side will likely be weaker than the AMD option. And it's ARM-based and officially supports only NVidia's Linux distro (hopefully other ARM distros will run on it too). But if you need CUDA and high GPU compute, it's not a bad option either.
atape_1@reddit
Yep, it won't be any faster. Its only selling point is CUDA; if you don't need it, the AMD machine is great, and also much easier to use, since the Nvidia thing will have an ARM-based CPU.
false79@reddit
I believe the speed will be the same in comparison.
valdev@reddit
This is the only argument I have.
On my 7950x system with 192 GB of RAM I get about 12 tk/s with gpt-oss-120b. And I am pretty confident that with enough tweaking it could be closer to 25 tk/s.
Granted, when I load it onto my 5090, 4090 and 4x 3090s it runs closer to 100 tk/s, but it also threatens to burn down my house and financial future.
The 7950X build not only gives you 192GB of RAM but room to grow and acquire the GPUs.
MingMecca@reddit
Curious to hear about what kind of tweaks you would do with your setup to get it closer to 25 tk/s. I have a 7950x3D with 128gb RAM and I'm always looking for tips on inference speed.
valdev@reddit
It almost all comes down to the RAM speed: optimizing the timings, getting the frequencies as high as they can go. Right now my RAM is set up terribly -- poor timings and only running at 3800 MT/s (of its rated 6400).
Optimizing this is a small nightmare though; anything above 96GB for Ryzen is no-man's-land.
MingMecca@reddit
Yeah those four mem sticks really mess up the memory stability on this chipset unfortunately. Sounds like you're running into the same headache that I am.
rorowhat@reddit
It's worth it, and it's also an excellent gaming machine to boot
bayareaecon@reddit
I’m currently in the process of building something with 4x MI50s. Honestly still kinda new to this, and someone can definitely tell me I was stupid and should have got the AI Max. I'm hoping to keep my all-in cost under $1.5K.
Educational_Sun_8813@reddit
yeah, and then the electricity bill
johnkapolos@reddit
Why not a DGX Spark?
Eugr@reddit
It was announced long ago and still isn't released. And it will be about 2x more expensive.
johnkapolos@reddit
Last time I checked, the base model was supposed to be $3K.
Eugr@reddit
There is an ASUS one with 1TB storage for $3K on the NVidia reservation page. The NVidia one only has a 4TB option, for $4K over there. The ASUS website doesn't list the price at all, so I'll believe it when I see it shipping. I suspect the price will be higher.
Educational_Sun_8813@reddit
you can replace the drive with a bigger one in the ASUS
one-wandering-mind@reddit
Because it is slow, you probably won't use it much, and faster things are coming out like the Nvidia spark and companies building their own versions of that.
But then again who knows when the spark will come out or if it will be at retail so buy it if you want :) then post your performance benchmarks so I can decide on buying it.
Educational_Sun_8813@reddit
All the little Sparks are the same; they come in different boxes, but functionally they will be the same. The advantage of that platform is they have InfiniBand (now owned by Nvidia), so you can bond two units for a very fast 256G interconnect.
CubicleHermit@reddit
Depends. Desktop or laptop?
Desktop? Can't see any reason why not, although I haven't seen pricing for DGX Spark with a comparable amount of memory, and the desktops available are kind of questionable in terms of form factor and build quality.
Laptop? I would if I could. Unfortunately, there aren't any good laptops available with it.
Eugr@reddit
The MSRP for DGX Spark is $4K for the 4TB storage option from NVidia, and the cheapest is $3K for the 1TB version from ASUS.
CubicleHermit@reddit
If so, the Corsair or Framework seems like a relative bargain.
Eugr@reddit
And AMD one is already released, there are people using it, the performance is known, and you can get one now (sort of). The only information we have on DGX Spark is from NVidia website, and it's not very detailed, so who knows what the actual performance would be like.
Educational_Sun_8813@reddit
I'll get my Framework probably in October, to go alongside a few 3090s and 512GB of RAM.
VanagearDevGuy@reddit
The only issue I've been having with mine is getting ComfyUI set up for Qwen generations. I think this is just a time issue (and a me issue :P), since I see on forums people getting generations in less than 30 seconds, so I know the ROCm release later this year will solve my woes. But for local language models and even some Unreal Editor 5 development, it's very nice!
HornyGooner4401@reddit
I don't know about your use case, but OpenRouter gives you 1,000 free requests as long as you have $10 in your account.
Models like GLM 4.5 Air and GPT-OSS have a free tier, and so far I haven't reached that limit.
basitmustafa@reddit
There is no good reason not to. My M4 Max MBP is gathering dust b/c macOS has become so bloated and annoying; having Arch + Hyprland on my 128GB Strix Halo Flow Z13 is just nothing less than a joy.
The_GSingh@reddit
$$$$ (hope this helps) /s
archieve_@reddit
How about a second-hand M1 Ultra Mac Studio? Their prices are similar.
tarheelbandb@reddit
Looooool. Did you read my post from yesterday?
sub_RedditTor@reddit
Memory bandwidth too slow.
Can only add one GPU, via the PCIe 4.0 x4 interface.
Memory not upgradable.
Wait a little longer for a better CPU.
Or build a Threadripper 7000 series system.
__some__guy@reddit
Because it can only run MoE models and it's only twice as fast as a regular DDR5 system.
NeverEnPassant@reddit
Because prompt processing is going to be unusable for larger contexts. A 5090 + fast system ram will be faster in practice for MoE.
GangstaRIB@reddit
If you already have a solution to run your models then it’s probably best to wait. If not have at it hoss.
spookperson@reddit
I was super super close to ordering an AMD AI Max+ 395 128GB this week. I think I have avoided the temptation for now. I'll post some links/thoughts related to my decision to buy or not buy.
I am most interested in batch/concurrency situations (2-5 users or concurrent tasks at the same time). I would be a lot more interested in Strix Halo if it went up to 256-512gb (though here is some data on clustering them: https://www.jeffgeerling.com/blog/2025/i-clustered-four-framework-mainboards-test-huge-llms )
In my mind, here are some alternatives for comparison to the 128GB Strix Halo for LLMs. 1) a regular PC with something like 128GB of ram and a graphics card running ktransformers, 2) a Mac Studio with at least 128gb of ram (running MLX or GGUF), 3) maybe Project Digits but that still hasn't come out yet.
Strix Halo obviously has faster memory than a regular PC with system ram and a graphics card - but here are some sample benchmarks for ktransformers on consumer hardware (Core i9-14900KF + dual-channel DDR5-4000 MT/s) + RTX 4090: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md
Apple hardware is pretty good for a single user at a time but the prompt processing is not super fast and concurrency (multiple users or multiple tasks at the same time) is not easy. Here are some llama.cpp benchmarks for a bunch of different m-series chips: https://github.com/ggml-org/llama.cpp/discussions/4167 and this PR has some notes on llama.cpp speed in high-throughput mode with llama-batched-bench: https://github.com/ggml-org/llama.cpp/pull/14363
So I would love to try Strix Halo for something like ktransformers (connecting a GPU over the PCIe x4 link) or for running vllm to get high concurrency (since vllm on Apple is CPU-only). I found these benchmarks on vllm on the Framework Desktop: https://github.com/lhl/strix-halo-testing/tree/main/vllm (92 tok/s on the highest end Mac for batch size 1 q4 compared to 357ish tok/s on batch size 16 with strix halo vllm q4 - but I don't have bs=16 results to compare for llama.cpp or vllm on Mac).
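For reference, here's a hedged sketch of what a batched throughput test along the lines of those links looks like in vLLM: 16 prompts generated in one batch, with aggregate tok/s measured. The model name is a placeholder, and on Strix Halo you'd need a ROCm build of vLLM.

```python
# Hedged sketch of a bs=16 throughput test with vLLM.
# The model is a placeholder; pick any ~q4 model that fits your memory.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ")
params = SamplingParams(max_tokens=256, temperature=0.8)

prompts = [f"Write a haiku about machine #{i}." for i in range(16)]  # bs=16
t0 = time.time()
outputs = llm.generate(prompts, params)  # one batched generate call
dt = time.time() - t0

total = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total} tokens in {dt:.1f}s -> {total / dt:.0f} tok/s aggregate")
```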
SeriousObjective6727@reddit
There will be something way better coming out next year....
Dtjosu@reddit
More likely: faster and better, at the same price or lower.
Content_Cup_8432@reddit
I think a 256GB version with the 495 will be available at the same price next year, with better bandwidth and ROCm support.
Vatnik_Annihilator@reddit
I'm not going to get one but I'm really excited for the next iteration. Hopefully they will improve the memory bandwidth.
DanielKramer_@reddit
because you will feel sad when better stuff comes out next year
but then, if you buy it next year, you will feel sad when better stuff comes out next next year
redoubt515@reddit
I'm not going to try to talk you out of it, because I think it's a solid option at a semi-reasonable (if high) price.
But you are operating under a false assumption:
You have 128GB of LPDDR5x system RAM. It is not VRAM.
It is faster than your typical system RAM, but slower than your typical GPU memory.
I haven't heard of this before; is this confirmed, or rumor/speculation?
vulcan4d@reddit
It's overrated, and that keeps prices high. It's good as a toy or as an intro to LLMs.
http206@reddit
Because there will be lots of these things coming out this year, some better than others, some cheaper than others, and unless you have a concrete use for one now you're better off waiting a few months.
(This is, at least, what I tell myself)
Healthy-Nebula-3603@reddit
IF it had 512GB or, better, 1024GB, and RAM speeds of 800-1000 GB/s, then I WOULD.
SillyLilBear@reddit
I would recommend against it unless you just want to experiment; it isn't useful for anything practical. I've managed to get 44 tokens/sec on GPT-OSS 120B Q8, which sounds fantastic, but time to first token is still slow due to painfully slow prompt processing, and even a little context starts slowing it down very fast.
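If you want to reproduce that observation yourself, a quick way is to time prefill (time to first token) separately from decode using streaming. A minimal llama-cpp-python sketch, with the model path as a placeholder:

```python
# Minimal sketch: measure time-to-first-token (prefill) vs. decode speed.
# Model path is a placeholder; use any local GGUF you have.
import time
from llama_cpp import Llama

llm = Llama(model_path="gpt-oss-120b-Q8_0.gguf", n_ctx=32768)

prompt = "word " * 8000  # deliberately long context to stress prefill
t0 = time.time()
ttft, n = None, 0
for _chunk in llm(prompt, max_tokens=64, stream=True):
    if ttft is None:
        ttft = time.time() - t0  # dominated by prompt processing
    n += 1
decode = (n - 1) / max(time.time() - t0 - ttft, 1e-9)
print(f"TTFT {ttft:.1f}s, then {decode:.1f} tok/s decode")
```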
NearbyBig3383@reddit
Why spend money on a board when you can pay $10 to an API provider and get practically unlimited requests? Think about it: $2,000 vs. $10 is absurdly more affordable, don't you think?
fabkosta@reddit
Just out of curiosity, which computer would you buy it in?
StyMaar@reddit (OP)
I'd just buy the framework motherboard as I don't really need yet another case…
fabkosta@reddit
Ah, got it.
I must say: I'm one of those thinking the Framework PC is pretty cool.
The only argument against it: the longer you wait, the more bang for the buck you'll get.
But - I am sure you know that already. ;)
fabkosta@reddit
By the way, if/when you buy it, let us know your experience!
dobkeratops@reddit
Could it fine-tune, albeit slowly?
Could it run video models as part of a cluster?
I was very impressed with some LLMs on a humble CPU; I think the AI Max does look like an interesting device.
fallingdowndizzyvr@reddit
The price has dropped since release. I bought at the pre-order discount, but since then it's gotten even cheaper.
FORLLM@reddit
I'm inclined to wait for better, more VRAM. 128GB isn't cool. 1TB is cool. I could be delusional, but I suspect there will be devices that get us there at increasingly reasonable prices in a year or two; the AMD AI Max and Nvidia Spark are encouraging steps in that direction. As much as I'm encouraged by reports about Kimi, Qwen, etc., I suspect I'd be a little disappointed acquiring hardware now, not just in a "hardware is always getting better/cheaper" kind of way, but in a "current hardware doesn't fit my market at all yet" kind of way. Adjacent to that, one of the recent videos I watched on the AI Max mentioned a number of driver issues (sorry, I don't recall which video, though I probably saw it in this sub, if that helps). A couple of years on, I bet those drivers will purr.
I think the hardware and software may reach mutually sweet spots in price and performance in the next couple of years, though. And if Nvidia has enough Broadcom/custom-silicon problems with their big-tech ordering, they may get more eager to repackage silicon and sell it to us nobodies for reasonable prices again. I'd rather spend $5k in a couple of years on something that's bang on what I want than reach now and get hardware that's disappointing on its own, running models that aren't even quite what I'm hoping for yet, with immature drivers. And I want to run audio models, video models, models I haven't even heard of yet. The market I want my AI rig for is still in very early innings.
On the other hand, if you find $2k easier to part with than 2 years of waiting, your wallet may need to just take one for the team. Sorry, StyMaar's wallet!
Antique-Ad1012@reddit
We are nowhere near consumer hardware for useful local AI. Save your money and wait a few years; the technology exists, but it takes time for the industry to adopt everything at scale and make it affordable.
I'm using an M2 Ultra, and I will keep using online services for now.
Few_Size_4798@reddit
Save up 6-8 thousand and wait for Intel Battlematrix; the prices that are appearing now are completely unreasonable, of course.
And then there's the energy consumption.
I'll save up (I'm waiting to see what Minisforum offers, maybe around $1,500), but I'll also buy one for my collection.
Secure_Reflection409@reddit
My bet is that Qwen has cooked that 80B to sit perfectly inside 48GB; that would be my number one reason for maybe holding off.
Maybe treat yourself to both :D
jonahbenton@reddit
The FW is super fast for normal computer use, but for LLMs it's several times slower than Nvidia (3090s, A6000s) in interactive use cases with large models and long context. I think people are getting good results with MoE, but I haven't gotten there yet.
nostriluu@reddit
I'm waiting for a Thinkpad with this chip. A desktop doesn't make sense to me because desktops are about expansion, right? And I'd want a CUDA GPU for some tasks.
StyMaar@reddit (OP)
You have two PCIe x4 slots available (if you repurpose one of the M.2 slots); doesn't that count?
Fair, but I have no use for CUDA personally.
nostriluu@reddit
That's only about 8 GB/s (PCIe 4.0 is ~2 GB/s per lane, times four lanes). You'd be losing a lot of performance from a GPU.
CubicleHermit@reddit
The existing desktops (Framework, Corsair, and a few others) are all minis with limited expandability, as you'd expect from a laptop chip that only works with soldered RAM.
Chance-Studio-8242@reddit
I too am curious about AMD AI Max+ ROI