Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM!

[-]

BankjaPrameth@reddit

I don’t want to be that guy but with increased in RAM doesn’t means you can run bigger model effectively. The prefill speed is the weakness of this device.

I think it would be good for running multiple small models for various tasks instead.

[-]

Question - I am not sure what you mean by prefill. I assume that's the loading time of the model into RAM from HDD before being shunted to the GPU? If so, wouldn't that be a 1 time delay while you load the model initially, and if you pin the model to the GPU it remains until you stop it? (Its what I do now with my GPU - slow load up front, pin to memory, then its there whenever I need it)

[-]

unjustifiably_angry@reddit

Say you have a fair-sized codebase or a long conversation. If you need to perform a new operations on that codebase or you come back to an existing conversation after it's been cleared from memory, all relevant data needs to be read into memory. It's like if you want to answer a question for a test that's based on a book. You need to read the entire book before you can answer the question. This is prefill.

Most benchmarks focus on token generation because it's a flashy number people can instantly related to. Anyone who has ever used AI knows the difference between a fast response and a slow one. Usually these toke generation-only benchmarks come from the same disingenuous shithead influencers will do with 4B models, or who pretend massive concurrency matters worth a damn in any practical situation.

In practical use, token generation is often secondary to prefill. If I'm working on a coding project I'm often going back and forth between conversations on different topics because the conversation's context history is extremely important. Having prefill in the thousands makes this relatively painless, I might need to wait 15-60 seconds for a reply or operation to start. If prefill is in the hundreds or tens per second... just give up. You will be waiting 10+ minutes for the first token.

This is one of the places DGX Spark really shines, despite its faults. At 64K prefill depth on Qwen3.5-122b (for example) it still reads at >1000 tokens per second, so less than a minute to complete prefill.

[-]

Narrow-Belt-5030@reddit

Ah got it now - so once the model is loaded, when you send your prompt the prefill is the time it takes to load your prompt into the model .. got it. Thank you !

[-]

unjustifiably_angry@reddit

Not just the prompt (the thing you type) but everything that goes along with the prompt (everything you typed previously and everything that must be read from the project).

[-]

Kong28@reddit

So prefill is getting everything ready for the LLM to perform inference? It happens before inference?

[-]

unjustifiably_angry@reddit

Exactly. It's also commonly referred to as TTFT, or "time to first token".

[-]

Kong28@reddit

Ah now that's a term I have heard, got it, thank you! I guess I never really stopped to think about what happens before inference is initiated.

[-]

Narrow-Belt-5030@reddit

Well, yes, system prompt + message history + new message (I write companions and used to building APIs .. just never new the prefill name meaning)

Thank you again.

[-]

BankjaPrameth@reddit

Prefill is prompt processing speed. It the measurement for model thinking speed. For short context like general chat, you might not notice the difference. But for long context like agentic coding where you need to process multiple files, it can take up to 5-10 minutes before the model can reply you if this speed is slow.

It’s the matrix that new user might not talk about much. Most of user focus only on token generation speed. And it’s only half of the story.

[-]

unjustifiably_angry@reddit

Less than half, if you're doing any sort of professional work. Ask it for a minor change, and if it needs to read the whole project to make that change you'll be waiting minutes just to output a few tokens.

[-]

PromptInjection_@reddit (OP)

Yes, prefill is a huge weakness. I own a Strix Halo myself.

However, Medusa Halo will have significantly more compute and memory bandwidth. Therefore, I'm optimistic.

Furthermore, this puts NVIDIA under pressure to release a DGX Spark 2.

[-]

edsonmedina@reddit

> Medusa Halo will have significantly more compute and memory bandwidth

do you have a source?

[-]

GregsWorld@reddit

It's rumered to use LPDDR6 https://www.tomshardware.com/pc-components/cpus/amds-future-medusa-halo-apus-could-use-lpddr6-ram-new-leak-suggests-ryzen-ai-max-500-series-could-have-80-percent-more-memory-bandwidth

[-]

edsonmedina@reddit

Perfect! Thank you! I wonder what the price tag will look like.

[-]

unjustifiably_angry@reddit

Better RAM is one thing but they don't want to sabotage their "real" AI hardware; they might do what everyone does with GPUs and use faster RAM at the same time as they make the bus width narrower to save on die space and maximize profit.

They might not, too, but at this point I'm not letting myself get optimistic.

[-]

JohnBooty@reddit

Yeah, that's the elephant in the room.

None of the existing players will ever sell anything that will remotely encroach on their server hardware business in terms of performance.

[-]

unjustifiably_angry@reddit

Everyone shits on DGX Spark, often for good reasons, but a motivated person can buy 4 of them and actually have a really good local setup. 512GB of unified RAM at 1TB/s bandwidth for $16K.

Not to directly compare them since there's a huge gulf in bandwidth but for example an H100 with 80GB of VRAM is worth like $40K.

[-]

JohnBooty@reddit

Great point.

[-]

Mental-At-ThirtyFive@reddit

5K easily with the economy inflation + compute hardware inflation.

At $10 for a coffee, lose your daily coffee for 18 months, you will be good to go with Medusa Halo when it arrives

[-]

ProfessionalJackals@reddit

At $10 for a coffee, lose your daily coffee for 18 months

Buy yourself a coffee maker for 250, and some good quality beans for like 15 a pack. Your making way better coffee for like 0.4 bucks per coffee (inc milk).

So now you save $9 per day and still have your coffee :)

[-]

BannedGoNext@reddit

If I already make my own coffee, and don't eat avacados.. what then? Blowjobs behind the wendy's dumpsters for cash?

[-]

JohnBooty@reddit

If somebody starts giving out H200's, the list of things I'm not willing to do behind that dumpster becomes very small indeed.

[-]

crantob@reddit

With an applicable nick, to boot ;)

[-]

JohnBooty@reddit

I can always ask the LLM how to regain my dignity later

[-]

edsonmedina@reddit

Please send location

[-]

Django_McFly@reddit

If it counts for anything, the Apple II with 48kb of memory launched at ~$2600MSRP in 1977. Average car in 1977 was $5600.

It won't be anywhere remotely close to like I could buy a brand new car for the price of two of these and imo it's hyper unlikely to cost even half of what that Apple II adjusted for inflation cost: $14k+.

[-]

crantob@reddit

Points to this. I was that kid in 1981 who asked his dad for a 48kB apple and got myself sat down for a bit of a talk about what our toyota cost, and how that apple would mean mom couldn't get a car...

[-]

Exotic_Accident3101@reddit

It would use both RDNA 5 and have a higher memory bandwith with lpddr6

[-]

DefNattyBoii@reddit

Maybe spark 2 will be more than a glorified SM120 black"brown"well paperweight for curious devs

[-]

unjustifiably_angry@reddit

The software side has been steadily improving lately although it's still a strong recommendation to buy two in order to get a much more useful level of performance. RAM bandwidth imposes a hard cap on how fast inference can be but with a pair of them you're at least getting 0.5 TB/s, and prefill is fairly decent. Troubled as Spark may be, in terms of local hardware there's not really a better choice.

[-]

Kong28@reddit

You seem super knowledgeable! I keep getting my posts deleted for some reason, so hopefully you can answer in the comments.

I think local LLMs are so fucking cool, but what's the capability actually like? I feel like they get so hyped up on Twitter, and I'm ready to spend $5k on some 3090s, but are they even capable? Should I just be spending my money on subscriptions?

[-]

unjustifiably_angry@reddit

Every "wow look how amazing this model is" video I've seen online has given a dishonest impression aimed at engagement with naive viewers and the things demonstrated have little or no real-world value. If you type a one-shot prompt and get some slop out the other end, everyone else can also do this - you've gained nothing and lost time. The code you'll generate is basically useless no matter how good it appears. If you want to make something useful, you need to build a project properly: small steps, architected properly.

The capability of local models is now quite good, it's probably on par with the best online models from 4-6 months ago, but AI is moving very fast and 4-6 months does make a big difference. However, in my opinion the importance of objective LLM "intelligence" has begun to diminish in favor of how effective your usage is. If you expect the computer to read your mind, results will be the same as expecting a human employee to read your mind. LLMs are not Google or an encyclopedia, there are rarely "correct" or even "best" results in the traditional sense. Whether it's "worth it" depends on your use-case (hobby vs profit) and existing skill level with programming. If you're not already a programmer, you should start small on simple projects and learn best practices over time. Ask the LLM itself for recommendations on how your projects should be structured, ask it to build an architecture document to describe how it should work, and then ensure it adheres to it as you carefully build and test each function.

If you want my advice on what to do with your money, I would suggest signing up for a monthly subscription, seeing if/how it limits you, and then deciding based on that if it makes sense to buy local hardware. At the rate I was burning through Claude and Gemini tokens I could only work for an hour or two a day, and even then I had to avoid experimentation because I had to always worry about limits. This is the biggest advantage you get with local hardware: you've paid upfront so you're incentivized to use it rather than avoid using it.

I seriously doubt local LLMs will ever fully replace the utility of online LLMs, but local LLMs can certainly reduce your need for them to an extent that you could save a lot of money in the long run.

I started making projects with local LLMs about 6 months ago and I'm certainly nothing like a real "coder" at this point, but I doubt I would be in a better place today had I spent the same amount of time and money on formal education. My class would probably still be covering things I already knew from my very limited existing coding knowledge. From what I understand, after a year in a CS course your culminating project might something as simple as a Snake game or Tetris clone, etc. But on the other hand, I'm certainly learning things in the wrong order and leaving big gaps in my overall understanding. I think I'm much more inclined towards being a project manager than a coder though - I doubt I could possibly memorize all the various syntax that's required, for starters - and that sort of manager role is what you fit into when using LLMs.

Regarding 3090s, I would never suggest that. It is a newbie trap and such an obvious one it makes me consider conspiracy theories like some sort of vast international coalition of COVID-era 3090 buyers trying to break even. When you realize how limiting 24GB is (you will on the very first day) you'll immediately be throwing good money after bad. One 3090 can barely run anything of value, but what if I buy another one? Oh, but now I'm on the verge of being able to do this other thing... hmm... I do have a third slot on my motherboard... And then it's "is my output bad because I'm using a version of this model that's too quantized, or is it just not a good model? Maybe I'll try a bigger model with more quantized kv-cache? Maybe the opposite?" ... then you think, "I could sell these and get a couple 5090s", etc.

You have three real immediate options:

Try an online subscription to at least see if they might be good enough for your usage pattern.
Online hardware rental (ie runpod, etc) to preview how "local" models perform before spending big on your own hardware
Purchase a real professional card tailored to local AI

If I was using online LLMs for my projects I would be spending over $1000 a month on API fees, easily. On some occasions I'd probably be spending that much in as little as a week. A larger upfront investment quickly pays for itself but only if your usage is routinely very high. But if that is the case, you should be looking at an RTX 5000 at minimum, ideally an RTX 6000 (both Blackwell generation, not Ada, etc). These are very expensive but they are, in my opinion, the real entry-level.

Another option is a pair of DGX Sparks (or their third-party variants) - these are the cheapest way to get 256GB of unified RAM for AI but they're hard to recommend without knowing your skill level with Linux and they'll only be about half as fast as 3090 in raw performance. Claude (etc) can actually guide you through the initial setup very effectively, but don't hear "256GB" and assume you can run a model of that size with respectable performance - an RTX 5000 Pro, despite its much smaller 48GB of VRAM, will run over twice as fast. And a 96GB RTX 6000 Pro will run more like 3-4x as fast.

That said, there are constantly things just over the horizon which might radically improve their token generation performance, and DGX Spark's prefill performance is actually quite decent. The overall performance picture might be very different in as little as a month or two - though these improvements will also apply to all other hardware.

I would not trade my single RTX 6000 with its 96GB of VRAM for six RTX 5090s with 192GB of VRAM. The thought of selling my DGX Sparks (Asus GX10 variant) however is constantly in the back of my mind, because despite their greater theoretical capability, the ease of use with my 6000 is so much greater.

If you want some hard numbers:
- 1 DGX Spark can run Qwen3.6-35B @ 4-bit quantization at up to 90 tokens per second
- 1 RTX 6000 Pro can run the same model @ F16 (no quantization) at up to 185 tokens per second
- 1 DGX Spark can run Qwen3.6-27B @ FP8 (its best, fastest quantization) at up to 18 tokens per second
- 1 RTX 6000 Pro can run Qwen3.6-27B @ 6-bit quantization at, IIRC, around 80 tokens per second
- 2 DGX Sparks can run Qwen3.5-122B @ 4-bit quantization at up to 40 tokens per second
- 1 RTX 6000 Pro can run Qwen3.5-122B @ 6-bit quantization at up to 120 tokens per second
- 2 DGX SParks can run Qwen3-Coder-Next @ FP8 at up to 60 tokens per second
- 1 RTX 6000 Pro can run Qwen3-Coder-Next @ 6-bit quantization at up to 145 tokens per second

The numbers I quoted for 6000 Pro with 6-bit quantization might've been 5-bit quantization, I don't remember for sure. I go back and forth on those, sometimes 5-bit is drastically faster, sometimes they're about the same. In rare cases such as Qwen3.6-35B, F16 is actually faster than Q8.

[-]

segmond@reddit

10% more compute, what they need is to add more lanes to the CPU so it can get 2 PCI slots. take this pair it with cheap budget GPUs 3060/4060/5060 and they will eat up Nvidia's consumer's market.

[-]

Kong28@reddit

Sorry for the newb question, this thing is unified memory right? So how does adding a 12gb vRAM card help out? I'm just wondering as I have a 3060 still and would love to be able to incorporate it somehow.

[-]

segmond@reddit

more memory, faster prompt processing. 3060 is really a bad example, I think it's on the same wavelength as current halo, but I think 5060 will probably be much faster. Frankly if such a thing was possible, I would not add anything less than a 3090.

[-]

Django_McFly@reddit

This x 1000. I'd even give up some of the IO if it meant more more lanes on a PCIe slot.

take this pair it with cheap budget GPUs 3060/4060/5060 and they will eat up Nvidia's consumer's market.

Nvidia's consumer market share will get eaten up by people buying Nvidia GPUs to pair with their new computers. 👍

[-]

StableLlama@reddit

That's what I'm also hoping for - a few more PCI lanes to be able to insert a (nvidia...) GPU. That's then the perfect local AI workstation that stays in the shelf and can be accessed through the network

[-]

unjustifiably_angry@reddit

Agree strongly about Medusa Halo. I plan to buy one as a general-purpose PC unless AMD massively shits the bed somehow. They do seem to specialize in grasping defeat from the jaws of victory.

[-]

Huge-Safety-1061@reddit

This is the correct response. Folks already noticing how rekt the 395 is on qwen 3.6 27b (likely most denses) are correct that more BW was what was needed, not more RAM imo. Produce will however be endlessly hyped by the typical local ai hw Youtubers and netted into a low perf local ai experience likely.

[-]

Kong28@reddit

What's your recommendation for local hardware? Worth investing or just wait and use cloud compute until it catches up?

[-]

2Norn@reddit

it's kinda effectively a useless device for most people here

[-]

rpkarma@reddit

What does Strix Halo’s prefill look like currently?

[-]

Herr_Drosselmeyer@reddit

Pretty bad. From this thread: https://www.reddit.com/r/LocalLLaMA/comments/1t31tk7/mistral_medium_35_128_on_amd_ryzen_ai_max_395/

AI Inference Performance (Mistral-Medium 128B)
One of the standout features is the 128GB unified memory, allowing for ultra-large model offloading.
Model: Mistral-Medium-128B-Q4_K_M (\~75GB)
Token Generation (TG):** 1.57 tokens/sec (Sustained) Prompt Processing (PP):** 32.10 tokens/sec
VRAM Utilization: 79Gi (Unified Memory)
Peak Power: 145.0W (Prefill/Bursts)
Peak Noise: 46 dBA
Note: Successfully offloaded the entire 128B model to the iGPU with \~40Gi remaining for context.

1.57 tokens/sec generation is already pretty dire, but 32 tokens/sec prompt processing means about 16 minutes for 32k context. So really, it can't meaningfullyl run any dense model, despite having sufficient RAM.

[-]

rpkarma@reddit

Oh god. I though my Spark was bad :/

[-]

unjustifiably_angry@reddit

A second spark and a network cable elevates the overall value pretty drastically, if you can find one at a fair price. Close to a linear doubling in performance.

But yeah, prefill is one of the places Spark market-leading in terms of dedicated AI hardware.

[-]

rpkarma@reddit

Yeah I’m thiiiiiis close to buying another Asus GX10. $6500AUD hurts though, and at that price I’m not far of what I would’ve paid for an RTX Pro 6000… but I would have double the VRAM even if it is like 1/4th the bandwidth.

[-]

unjustifiably_angry@reddit

Yeah it's a very rough proposition. 6000 Pro is the best choice and 96GB does put you in the league of being able to comfortably run most local models, but then there's the class above like MiniMax, GLM, etc, that would need to be quantized to hell to fit even in 96GB.

I saw the way things were headed and bought two in January just before Spark prices spiked. There are almost certainly people out there who bought a single Spark, found themselves in the same boat as you, and want to sell it. Maybe you can score one for a more fair price. Don't get suckered by the people on eBay with cheap sparks and zero feedback!

[-]

rpkarma@reddit

Oh $6500AUD is the normal price by the way haha. Thats not ram increase prices. We just get shafted here in Aus.

Pro 6000 is $15,000 AUD

And yeah that’s it: being able to run M2.7 or StepFun 3.5 Flash and such sounds awesome. A cut above the 27B I’m stuck with now

[-]

ProfessionalJackals@reddit

Spark as basically a dedicated prefill box connected to other hardware for inference, but I have no idea how complicated that is.

One of the youtubers tested this and the results are mheh ... Your better off just keeping everything on a single Spark, then a Spark prefill > Halo ... Or just getting two Sparks.

[-]

unjustifiably_angry@reddit

Suspected as much, it sounded overengineered.

[-]

BankjaPrameth@reddit

Compare this Spark benchmark https://spark-arena.com/leaderboard with the one u/Internal_Werewolf_48 provided for Strix Halo https://kyuz0.github.io/amd-strix-halo-toolboxes/

The prefill speed is night and day different. I hope that AMD will be able to close this gap one day though.

[-]

Internal_Werewolf_48@reddit

That’s for a dense model. MOEs are better. Check out the various benchmarks https://kyuz0.github.io/amd-strix-halo-toolboxes/

[-]

rpkarma@reddit

Yeah I understand that, I’m not surprised about decode per se. The prefill is woeful though, that’s a lot worse than the Spark gets even if decode is similar for those huge models.

[-]

fallingdowndizzyvr@reddit

It's dire but it's not as dire as that thread portrays. There's just something wrong with what's happening there. Here are the numbers I just got running llama.cpp.

| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| mistral3 ?B Q4_K - Medium      |  74.13 GiB |   125.03 B | ROCm,Vulkan |  99 |    0 |           pp512 |         64.69 ± 0.17 |
| mistral3 ?B Q4_K - Medium      |  74.13 GiB |   125.03 B | ROCm,Vulkan |  99 |    0 |           tg128 |          2.78 ± 0.00 |

[-]

Herr_Drosselmeyer@reddit

Interesting, that's basically double what the other poster got, so he must be doing something wrong.

[-]

fallingdowndizzyvr@reddit

That poster doesn't say what he's using, but a clue is "Utilized HIPFIRE_MMQ=1". If it is hipfire, from what I've seen it's not as performant as llama.cpp overall.

[-]

unjustifiably_angry@reddit

That's a dense model, no? IIRC they're not only drastically slower in token generation but also in prefill. Not really relevant now that MoE architectures have become so common.

That said, I recall quite clearly that my RTX 6000 Pro gets well over a thousand tokens per second on Mistral Large (123b) so 32 tokens per second really is... fairly rough.

[-]

Herr_Drosselmeyer@reddit

I don't think the Strix Halo will do better with MoE models, at least compared to other solutions. MoE's will be usable on it, but I'm pretty sure they would run proportionally faster on 6000 PRO.

[-]

Hytht@reddit

It's because RDNA3.5 is lackluster at compute/inference, even Intel's panther lake has almost twice the INT8 TOPs of 8060S, the much weaker 17W Lunar lake has similar to that of strix halo as well. I'm waiting for Medusa Halo and Nova/Razer lake AX, strix halo isn't that compelling.

[-]

SpicyWangz@reddit

Being able to minimax at Q4 sounds awesome

[-]

2Norn@reddit

yeah 1 token per second is gonna be amazing lol

[-]

segmond@reddit

so, do you have a better alternative at the same price point? we love to whine around here. this weakness you speak off, can you tell us how. you can overcome it at the same price point for the mass?

[-]

BankjaPrameth@reddit

At same price point with this amount of ram? I don't have alternative too. You get what you pay for. For user who don't work with long context much, it would be fine. Or with long context and you are willing to wait before each respond, it's fine too.

In the end, it all depends on the use case and your patience. But for me, I'm happy to pay more to get DGX Spark for current gen hardware.

My comment is just a reminder for newbie. Since prefill or prompt processing speed is mostly overlook when talking about hardware performance. Let's hope AMD can do better on optimizing prefill speed for current or next gen hardware.

[-]

tamerlanOne@reddit

Magari in futuro ci saranno architetture più performanti con prefil tipo Mamba & Co

[-]

BobbyL2k@reddit

The additional memory could be a huge boon to its lacking prompt processing speed. Now you can juggle multiple contexts without having to wear out your SSDs and skip the need to reprocess the prompt altogether.

[-]

No_Lingonberry1201@reddit

Why would a bigger model wear out your SSD? I thought that llama.cpp can mmap images and then you're only reading from the SSD. Am I missing something?

[-]

jimmytoan@reddit

192GB is genuinely useful headroom - you can fit a full Llama 3.1 405B at Q4 quantization in that, or run a 70B model at higher quality without any layer offloading to system RAM. But the bandwidth figure matters more than capacity for most inference workloads. Strix Halo runs at \~256GB/s theoretical and real-world was closer to 170-200GB/s in practice. If Gorgon Halo keeps the same memory bus width, the extra capacity helps fit models but doesn't increase tokens/sec.

[-]

Awkward-Candle-4977@reddit

It will be better as npu card. The cpu is useless for ai anyway.

Amd shouldn't just follow what nvidia does

[-]

KnownAd4832@reddit

Either way you look at it - more options the better.

[-]

Monkey_1505@reddit

Yeah, this is quite promising for the leak that medusa would have 256, twice the bandwidth and 8x pcie slots.

[-]

randomfoo2@reddit

While I'm sure some people will enjoy the extra memory, a couple notes from someone's that done very extensive testing on Strixt Halo (and a lot of kernel work on RDNA3):

Memory bandwidth looks like it remains the same? 256GB/s theoretical. On Strix Halo the best measured GPU MBW I got (using ROCm/rocm_bandwidth_test) was 212 GB/s (83% max theoretical), and the best on llama.cpp (Llama-2-7B tg testing) was \~180GB/s (70%)
What's worse though is while theoretical max FP16 TFLOPS is \~59.4, the fastest I found w/ mamf-finder was about 37 TFLOPS (hipBLASLt), about 62% efficiency. Many shapes are much worse.
Note, at long context, I believe compute is actually what's killing decode speed. While the AMD APUs remain on RDNA3, this won't change. I would be hesitant to recommend Gorgon Halo even for LLM inference in 2026/2027

If Medusa Halo moves to RDNA5 or whatever has a better architecture for AI/ML, great, if not you'd be much better off with basically anything else (Mac Studio, GPU+workstation/server w/ K-Transformers, probalby even a DGX Spark).

[-]

Zc5Gwu@reddit

Anyone know if adding an egpu via oculink or usb4 can help improve pre-fill?

[-]

notdba@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/ - It helps a bit if we sacrifice VRAM and set a large -ub size. Not great, will not do it again.

[-]

PromptInjection_@reddit (OP)

Yes, here mostly stays the same, it's just a Strix Halo Refresh.
Medusa Halo will bring real progress.

[-]

-UndeadBulwark@reddit

I cant wait for Medusa Halo I wanted to go Strix but I couldnt get a board in time I am slightly regretting getting my 9070

[-]

No-Assist-4041@reddit

Having done a lot of kernel work on RDNA3/3.5 too, I would say that the max FP16 TFLOPs is lower than the theoretical max due to clocks just not reaching that high. And hipBLASLt still leaves room for more TFLOPs on the table, as it's possible to get ~45 TFLOPs on average. It's just unfortunate that it seems like there's still some work to be done for RDNA3.5 as kernel performance always seems to fluctuate

[-]

randomfoo2@reddit

It's not just clock I think, even when I look at the clock and run the math, everything's still much lower than it "should" be. I think there are a lost of things, but one of the main ones is that RDNA3 was not designed for AI and their advertising is basically a lie.

V_WMMA_F16_16X16X16_F16 executes a 16×16×16 matmul over the wave, takes 16 cycles to retire on a single SIMD32. *BUT* while NVIDIA tensor cores have dedicated hardware that runs concurrently with the SIMD pipeline, but for AMD WMMA is the SIMD pipeline since it's an ALU instruction. So every WMMA op blocks the same VGPR ports that scalar ops would use, and you can't overlap WMMA with FP32 accumulation (have to unpack). Since the max throughput number assumes back-to-back WMMA issue with no dependent ops between them, and since there's always non-WMMA work that needs to be done (scaling, softmax, masking, etc), WMMA is going to be displaced and you'll never get max FLOPS.

The other thing I found w/ my testing has been that VGPRs way too low to hide latency. Also LDS traffic sucks, especially if you're doing FA. Oh, and the compiler still sucks too. WMMA instructions get scheduled w/ dependent ops too close, redundant packing, bad memory wait counts, all kinds of things that stall the pipeline, so if you want anything better you need to tune by hand (well, these days, by AI).

[-]

No-Assist-4041@reddit

Definitely agreed on the above - also the compiler for RDNA3.5 indeed sucks when it comes to VGPR management and makes it hard to optimize well. I've got some FP16/BF16 GEMM implementations in HIP on my github (which is what I used to get the higher TFLOPs than hipBLASLt numbers) and the RDNA3 tuning is "stable", while RDNA3.5 is a pain in the ass with inconsistent performance (for now I've managed to beat hipBLASLt consistently at least, but after manually unrolling with templates - otherwise when I restart my system some configs just perform worse out of the blue). It's annoying that rocprof doesn't seem to work with RDNA3.5 either right now (constantly getting missing counters, so the best I can do is just benchmark manually).

I think I shared my github repos with you a long time ago but if you're ever curious I can share them again in case you want to take a look.

[-]

randomfoo2@reddit

Yeah, can you post and I'll look again, I have a loop going on right now (several hundred iterations in) in the background running some tunes for my W7900 for fun so I'll take another look and see if any of it applies (I stopped doing RDNA3.5 stuff btw since I use it as my workstation and if you poke at it too hard it kills the WM or worse).

[-]

No-Assist-4041@reddit

Haha I'm running some iterations on my RDNA3.5 station right now too for tuning purposes, but after this I might just keep it as my workstation (I hooked up an eGPU USB4 dock for my RDNA3 card, might consider RDNA4 or just wait until RDNA5).

Here's my WMMA version here which has a bunch of stuff that might not even be necessary haha; there's a single precision variant that I need to experiment further with to squeeze out more from the RDNA3.5. Might try a FA implementation later on if time permits

[-]

CommunityTough1@reddit

I don't think this would be resolved with the DGX or Mac Studio. The DGX has the same memory bandwidth as the 395 and similar compute. The Mac Studio has ~3x the memory bandwidth, but suffers from weaker compute (26 TFLOPS vs. your measured 37 on the Strix), making long context prompt processing slower on the Mac.

[-]

randomfoo2@reddit

DGX has similar memory bandwidth but the compute is not so similar...

Since I already made the table a while back from my Strix Halo wiki guide... https://strixhalo.wiki/AI/AI_Capabilities_Overview

Spec	AMD Ryzen AI Max Plus 395	NVIDIA DGX Spark
CPU	16x Zen5 5.1 GHz	10x Arm Cortex-X925, 10x Arm Cortex-A725
GPU	RDNA 3.5	Blackwell
GPU Cores	40	192
GPU Clock	2.9 GHz	2.5 GHz
Memory	128GB LPDDR5X-8000	128GB LPDDR5X-8533
Memory Bandwidth	256 GB/s	273 GB/s
FP16 TFLOPS	59.39	62.5
FP8 TFLOPS		125
INT8 TOPS (GPU)	59.39 (same as FP16)	250
INT4 TOPS (GPU)	118.78	500
INT8 TOPS (NPU)	50

On paper, BF16/FP16 is pretty close, however FP8 is already 2X and INT8 is 4X on DGX. This is just hardware - in practice rocBLAS and hipBLASLt for RDNA3.5 are also simply not very performant...

What does this mean practically? Looking at the most recent benchmarks from the DGX llama.cpp thread benchmarks I eyeballed similar models to kyuz0's Strix Halo benchmarks and while none of the exact quants, from what I looked at, prefill for DGX is currently about 2-5X faster than Strix Halo.

[-]

ZZer0L@reddit

The currently available M5 chips have 2-4 x faster PP compared to M1-M4. The assumption is that M5 Ultra will be even faster

[-]

No_Mango7658@reddit

No thanks 😒

[-]

whodoneit1@reddit

What a waste of VRAM as having that much VRAM with that lower memory bandwidth is going to be a crap. 96GB should be plenty in that thing.

[-]

sine120@reddit

Bandwidth needs to be improved before larger unified devices become super relevant. MiniMax is pretty efficient but unless that's the only model you're targeting, I want a model to be running hopefully at or above 20 tg and 600pp.

[-]

UnbeliebteMeinung@reddit

I am gonna wait for medusa

[-]

geldonyetich@reddit

Yeah me too, I hear it’ll be a bit faster.

I really liked my Halo Strix chip though. Glad to see them iterating the architecture.

[-]

UnbeliebteMeinung@reddit

I wish they would just solder gdrr7 with that ai max chip. This cant be that expensive.... We need more bandwith compute isnt the bottleneck.

[-]

geldonyetich@reddit

I think they soldered mine in with the GMTEK Evo V2 I bought, and they charged just as much! The trouble is sourcing it pre-solder.

Although AI demand is part of it, supply and distribution is the bigger issue, and it's a bummer that's going to slow down sales of what could have been an extremely influential chip.

[-]

UnbeliebteMeinung@reddit

What? The GMTEK Evo-X2 still has only the lddr5

[-]

geldonyetich@reddit

Yeah but they solder it onto the board.

[-]

CyberRenegade@reddit

I will be interested in the speed of these, especially compared to the upcoming M5 Mac Minis/Studios

[-]

Technical-Earth-3254@reddit

Bandwidth will be the most interesting spec

[-]

sleepingsysadmin@reddit

this is a minor change from amd's point of view. the supplied memory modules are simply more dense.

Probably no actual improved memory bandwidth.

So it wont cost much more.

GPT 120b a10b will likely only run marginally faster than minimax 230b a10b but big difference in intelligence and now you can load minimax is the difference.

given my tendency to be riding 200,000 context with minimax all the time. I do wonder what speeds ill be getting, but i will be buying :)

[-]

unjustifiably_angry@reddit

So it wont cost much more.

50% more memory and that memory is 50% more dense making it >50% more valuable since density greatly increases complexity and reduces yield. Price will increase massively and be no better for AI since the bandwidth is the same.

By the time Medusa Halo comes out there will almost certainly be a second-generation DGX Spark, this time not based on a repurposed laptop/console chip.

AMD will, once again, arrive on the scene just in time to be largely irrelevant. At least for our purposes. For regular consumers it's pretty exciting, I plan to get one.

[-]

notdba@reddit

For some reason, the 24GB DDR5 RDIMM has lower per-GB price.

Current price in taobao:

16GB - 2000 (was 2400 few weeks ago)
24GB - 2800 (was 2900 few weeks ago)
32GB - 4800 (was 5700 few weeks ago)

Before, it was almost half-price compared to 32GB, and only slightly more expensive than 16GB.

[-]

PromptInjection_@reddit (OP)

Gorgon Halo is just a Strix Halo Refresh. Massive improvements come with "Medusa Halo" (2027)

[-]

ttkciar@reddit

To circumvent paywall: https://archive.ph/qbJXJ