AMD Strix Halo rumored to have APU with 7600 XT performance & 96 GB of shared VRAM
Posted by 1ncehost@reddit | LocalLLaMA | View on Reddit | 60 comments
https://www.techradar.com/pro/is-amd-planning-a-face-off-with-apple-and-nvidia-with-its-most-powerful-apu-ever-ryzen-ai-max-395-is-rumored-to-support-96gb-of-ram-and-could-run-massive-llms-in-memory-without-the-need-of-a-dedicated-ai-gpu
Looks like the next AMD high-end laptop chips are going to be at least somewhat decent for LLMs. ROCm doesn't currently officially support APUs, but maybe that will change. However, llama.cpp's Vulkan kernels support them and are basically the same speed as the ROCm kernels in my testing on other AMD hardware.
Unfortunately the memory for the iGPU is dual channel DDR5, but at least it's up to 96 GB.
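A minimal sketch of what that looks like in practice with the llama-cpp-python bindings, assuming a build compiled with the Vulkan backend; the model path is a placeholder, so treat it as illustrative rather than a tested recipe for this chip:

```python
# Minimal sketch of GPU-offloaded inference through llama.cpp's Python
# bindings. Assumes llama-cpp-python was built with the Vulkan backend
# enabled; the GGUF path below is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the (i)GPU
    n_ctx=4096,
)

out = llm("Explain unified memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```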
medialoungeguy@reddit
Susan just focus on getting rocm to work.
Rich_Repeat_22@reddit
It does work mate. I have a 7900XT and everything worked out of the box. In addition there is a ROCm update this week supporting APUs. The above has a good GPU and a 60 TOPS NPU to leverage through XDNA.
I do wonder if we can use both running LLMs 🤔
b0tbuilder@reddit
If only they would release a 7900xtx with 48gb of ram. I would pay… seriously
Rich_Repeat_22@reddit
At a reasonable price... because this card already exists, but it costs $3,600 (the W7900). You could have almost four 7900 XTXs (96GB VRAM) for that money.
Matt_1F44D@reddit
Not on Windows. Moan all you want about it, but it's much easier to train and run models with an Nvidia GPU.
Rich_Repeat_22@reddit
The 7900XT runs on Windows. I have it running in the work environment all day. Both Ollama and LM Studio use ROCm without issue on W10.
Matt_1F44D@reddit
I have the 7900 XTX, and sure, running models on Ollama is super easy, but actually trying to train your own models (not LLMs) is a pain in the ass on all major libraries. If you're using CUDA you just enable it; you can't do that with AMD.
AMD is ass for ML on Windows.
b0tbuilder@reddit
I do it on dual Radeon VII cards all the time for object recognition and instance segmentation.
carnyzzle@reddit
Microsoft just needs to quit being lazy and get DirectML to the point that it doesn't matter what GPU you have
AnomalyNexus@reddit
CPU, GPU and NPU are all gonna use the same memory bandwidth, so I don't think it's gonna scale like that.
mumbo1134@reddit
That's going to take too goddamn long, frankly. They need to actually get appealing hardware into our hands while they get their shit together so that some of us dumbasses can actually start building out libraries and tooling for ROCm. Right now you would have to be 100% out of your mind to buy AMD for anything ML related.
medialoungeguy@reddit
Lol. Well said.
xadiant@reddit
AMD needs to get their shit together and not let Nvidia dominate a multi-trillion dollar market. Smh
Longjumping-Bake-557@reddit
Literally any GPU with 48GB of VRAM and CUDA support would fly off the shelves.
mumbo1134@reddit
They can't control CUDA support on any reasonable timeframe, but they sure as shit can (and should) put out a 48GB card for us to start working with. But I guess they haven't had enough of getting their asses kicked to reach that conclusion.
Sad_Schedule_9253@reddit
AMD just released the Radeon 7800 XT 48GB version.
Longjumping-Bake-557@reddit
There was an open-source implementation of CUDA called ZLUDA, which they themselves shut down.
mumbo1134@reddit
It's tricky legal waters, hence it's tough to commit to that in the near-term.
BiteFancy9628@reddit
Nvidia has now open-sourced a large percentage of their drivers, so that should make it easier.
Synthetic451@reddit
That is completely unrelated to CUDA. You're referring to the recently open-sourced kernel modules for Linux, right? Those don't change the landscape all that much. All they've done is push proprietary functionality into the GSP firmware, which is closed source, and open up the bare minimum to interface with the Linux kernel. And none of that is even related to CUDA.
LumpyWelds@reddit
W7900-PRO is 48GB but it's just under $4K. AMD doesn't want to undercut and cannibalize their Instinct line.
Only-Letterhead-3411@reddit
100% this. If AMD offers consumers more VRAM compared to Nvidia, I'm pretty sure a lot of developers and model trainers will move over to AMD hardware and ROCm will get much more support very quickly.
xadiant@reddit
According to stuff I've read, the main problem with AMD is consistency. Decade-old Nvidia cards support CUDA, while AMD cards are all over the place. They need to open-source a CUDA rival, pay open-source developers handsomely, pay bug bounties, and sell the damn cards. If it's 5 times cheaper for twice the performance, shit will be figured out in record time, especially by Chinese and Russian devs if I had to guess.
Not sure why a rando on the internet like me can figure this out but AMD can't.
Rich_Repeat_22@reddit
The above supports ROCm and additionally has a 60 TOPS NPU. It's a pre-order for me, as I need a work laptop in addition to running LLMs etc.
roshanpr@reddit
How many tokens?
SmellsLikeAPig@reddit
This will be so expensive that almost nobody will buy it just for AI.
randomfoo2@reddit
Here are the common ways on r/LocalLlama I see to get to 96GB:
$3000 3 x 32GB MI100 (used)
$3200 4 x 24GB 3090 (used; might draw close to 2000W from the wall)
$5400 128GB M3 Max MBP w/ 2TB of storage (only 28 TFLOPS?)
$8000 2 x 48GB A6000 (Ampere)
While it has a lot of RAM, which is especially tempting to people here, it's worth keeping in mind that 4070 Mobile laptops (which this APU is competing with) are usually going for $1500-2000. I could see AMD charging a fair bit more (in the $2.5-3K range) for the extra RAM, and the unified memory being really popular w/ the mobile workstation crowd. I think if they go too high it becomes a tough sell for people here, but there aren't a lot of other low-power/mobile options for that much VRAM. I guess it'll have some competition on the AI front w/ the upcoming Nvidia Jetson Thor board.
kryptobolt200528@reddit
No way they'll sell any of these at $2K to the general public; the price needs to be around $1.25-1.5K.
randomfoo2@reddit
I actually don't think they plan on selling the 128GB version to the "general public", but I also think your general idea of what high end laptops cost in 2024/2025 must be way off. An ASUS Vivobook S 14 w/ a Ryzen 365 and 24GB of RAM is $1200. An Asus TUF A16 w/ an HX 370, 32GB, RTX 4060 is $1700. Zephyrus G16s with a 4070 are going for $2300+.
It's incredibly rare for any laptop to have 128GB of RAM (most non-workstations max out at 96GB of dual channel DDR5, although maybe this will change once 64GB SODIMMs become more widely available - Halo would still have double the MBW though). On Newegg, the cheapest 128GB laptops I could find are some desktop replacement laptops starting at $3500 (they come w/ a 4080, but are also 8lb).
Assuming AMD is aiming for the gamer market, I think they're going to have to try to undercut 4070 gaming laptop pricing with 48GB or 64GB models, but the 128GB version? Well, if they want any AI developer interest, they'll probably need a steep discount vs an Nvidia Jetson Thor. If they're looking at the mobile workstation market (eg Thinkpad P16, HP ZBook, etc), then the pricing could be pretty grim.
kryptobolt200528@reddit
There's a reason why these laptops are selling like crap, except for the G16. We've literally had 8000-series + RTX 4070 laptops go on sale for $1K USD plenty of times.
The fact that AMD charges manufacturers double the price for Ryzen 9 Strix Point (compared to the 8000 series) for literally a 5-10% improvement at best doesn't help either. There's a reason why virtually no manufacturer other than ASUS seems to bother with the Ryzen 9 offering (the exclusive deal with ASUS expired quite some time ago).
And they definitely seem to be targeting the gamer/developer market; ThinkPads, which are predominantly used in traditional office jobs, don't have any business having that performant of an APU.
LippyBumblebutt@reddit
The RTX 4070 Mobile has 15.5 TFLOPS.
I couldn't quickly find leaked Strix Halo FLOPS. Strix Point has 12 TFLOPS half precision, 6 single precision. Scaling from its 16 CUs to Halo's rumored 40 (/16*40) gives 30 TFLOPS half and 15 TFLOPS single precision.
If you can make use of the NPU, Halo likely has 40 TOPS. But the M3 Max has 410 GB/s of bandwidth, while Strix Halo will likely be limited to 270.
So while a 128GB Strix Halo might be available for $3000 and thus be quite a bit cheaper, it probably won't be faster for AI workloads. For LLM inference it will likely be ~30% slower.
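A quick sanity check of that scaling, assuming the rumored CU counts (16 for Strix Point, 40 for Strix Halo) and the bandwidth figures quoted above:

```python
# Back-of-the-envelope scaling of Strix Point's iGPU specs to Strix Halo.
# CU counts and bandwidths are rumored figures from the discussion above.
point_cus, halo_cus = 16, 40
point_fp16_tflops = 12.0

halo_fp16_tflops = point_fp16_tflops / point_cus * halo_cus
print(f"Estimated Halo FP16: {halo_fp16_tflops:.0f} TFLOPS")  # ~30

m3_max_bw_gbs, halo_bw_gbs = 410, 270
deficit = 1 - halo_bw_gbs / m3_max_bw_gbs
print(f"Bandwidth deficit vs M3 Max: {deficit:.0%}")  # ~34%, hence "roughly 30% slower"
```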
randomfoo2@reddit
If Halo has the same NPU as Strix Point, that'd be 50 TOPS (Block FP16), so that could outperform the Halo GPU. Note that you can't just go by FLOPS even for similar architectures - for example, I have a Phoenix 7940HS w/ a 780M that has a theoretical 16.59 TFLOPS (7.4X less than a 7900 XTX), but in practice in llama-bench the pp512 is closer to 12X slower.
The MI100 has 184.6 FP16 TFLOPS and 1.23TB/s MBW, raw specs better than an RTX 4090 (165.2 Tensor FP16 TFLOPS, 1008 GB/s MBW), but in practice performance is around a 3090 (71 FP16 Tensor TFLOPS, 936.2 GB/s MBW).
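To make the "can't just go by FLOPS" point concrete, here's the arithmetic implied by those numbers; the 7900 XTX theoretical figure is backed out of the stated 7.4X gap rather than quoted from a spec sheet:

```python
# Theoretical gap vs measured gap between a 780M (Phoenix) and a 7900 XTX.
# 16.59 TFLOPS and the 7.4X / 12X ratios come from the comment above; the
# 7900 XTX figure is derived from them, not from a datasheet.
tflops_780m = 16.59
theoretical_gap = 7.4
tflops_7900xtx = tflops_780m * theoretical_gap  # ~122.8 TFLOPS FP16

measured_pp512_gap = 12.0
# The 780M reaches only ~60% of the relative utilization its specs suggest.
print(f"Spec-normalized efficiency of the 780M: {theoretical_gap / measured_pp512_gap:.0%}")
```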
For those interested in an analysis of llama.cpp performance between a 4090, 3090, and 7900 XTX, here's an o1 output that does a good job summarizing relative MFU / MBW (real world vs theoretical): https://chatgpt.com/share/66ff502b-72fc-8012-95b4-902be6738665
These are using an up-to-date (HEAD) build: d5ed2b92 (3878).
You can see there's a lot of performance left on the table on the AMD side. For those interested in following along on some of that, you can take a look here https://github.com/ggerganov/llama.cpp/pull/7011 or https://github.com/ggerganov/llama.cpp/pull/8082
On the Mac side, MLX seems to be the way to go, but I don't really have a high-performance Mac to compare to (and no one who owns one ever bothers to post standard llama-bench numbers), but they can run their own math and compare to how the Nvidia and AMD cards perform, I guess.
__some__guy@reddit
I plan to buy it for casual AI use and work.
If it has a PCIe slot for an Nvidia GPU, that is.
I don't care if it costs €2000+.
kryptobolt200528@reddit
It's supposed to be a laptop chip; I wouldn't expect any OEM to put two freakin' GPUs in a single machine. That would make no sense - the whole point of a strong APU is to replace a dGPU.
__some__guy@reddit
Well, the demand is there.
I think it'll probably find its way into mini-PCs.
A regular desktop board is probably wishful thinking, but I'd be fine with any connector that gives me at least x4 speed to an Nvidia GPU.
mumbo1134@reddit
What costs are we talking here? Like <2K? That's not bad if so.
b3081a@reddit
It's a perfect replacement for the 9950X in a lot of use cases, even if you don't care about the GPU. The limited memory and (per-CCD) fabric bandwidth of AM5 really hurt performance even for stuff like code compilation. If someone builds a mini workstation with Strix Halo, it will be a perfect environment for software devs.
fasti-au@reddit
It'll take a while for the sentiment of AMD being B-grade or the black sheep of GPUs to change.
Ever since the ET4000 vs Riva 128 era, AMD has always been better but harder to sell, because it just never gets the early support for new tech. Even now, Windows 11 released FUBAR for AMD in general. Even with Intel chips being toasters, AMD is still only really there as the cheaper option or as the cutting edge for Threadripper use.
Intel, Microsoft and Nvidia. And OpenAI and Microsoft basically make the trinity automatic, because AMD won't sacrifice the Microsoft market for their own CPUs at the low end. It'll compete, but I think the high end is where they want to hit - home cluster up to medium-business size. Everyone else will play it safe because all the money is shared around at that scale, i.e. A4000 to H100, and companies that can compete but don't dominate a market.
The big 3 sort of make it all redundant to some extent up top. It's the middle market that chip companies can still catch: inference offerings for one-to-five-rack IT departments that don't cloud everything out to others but actually develop.
curios-al@reddit
That's the theory. The practice (with an 8840HS, and the situation will NOT improve in the next couple of years for sure because of issues with AMD drivers) is the following: despite having 96GB of RAM, the BIOS allows assigning only up to 8GB as VRAM. The rest (up to 46GB) is available as GTT memory, which the ROCm/Vulkan drivers treat differently, and despite that memory being directly accessible by the GPU, performance drops massively. To compare, a 14-16B model with Q8_0 quantization (which doesn't fit into dedicated VRAM) runs about 4 times faster in llama.cpp on pure CPU vs ROCm/Vulkan.
So the ability to load a big model gives you essentially nothing, because the iGPU doesn't accelerate it due to driver limitations.
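One rough way to sanity-check a setup like this is to back out the effective bandwidth from the observed generation speed, since token generation is roughly memory-bandwidth bound. A sketch with made-up placeholder numbers (not measurements from that machine):

```python
# Token generation streams (nearly) all model weights once per token, so
# effective bandwidth ~= model size * tokens/sec. Placeholder numbers only.
model_size_gb = 15.0   # e.g. a ~14B model at Q8_0
observed_tok_s = 3.0   # whatever llama-bench reports for tg128

effective_bw_gbs = model_size_gb * observed_tok_s
print(f"Implied effective bandwidth: {effective_bw_gbs:.1f} GB/s")

# Compare against the platform's theoretical peak (~90 GB/s for dual-channel
# DDR5-5600). A large gap points at the GTT path or driver overhead rather
# than raw memory speed.
```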
No_Afternoon_4260@reddit
That thing has plenty of CPU cores. I don't have the numbers, but maybe crazy fast RAM will be able to make CPU inference fast enough?
Noxusequal@reddit
It is rumored that it's up to 96GB of VRAM, and I think up to 128GB of RAM.
NEEDMOREVRAM@reddit
Any idea how much the flagship model will cost? And...no way to run two of these suckers at once for double the vram?
Noxusequal@reddit
It's first and foremost a laptop platform, so I doubt we will see two on one board soon. For price, a lowball guess would probably be around $2.5K and the high end about $4K, but I don't know. I think it will probably be comparable to Apple Ultra implementations and a few hundred bucks cheaper because it's not Apple xD
NEEDMOREVRAM@reddit
Why don't AMD or Intel put out a GPU with 96GB of VRAM and price it "affordably"???
And yes I know CUDA is Nvidia.
But why can't AMD or Intel come up with their own version of CUDA (or better)?
Why are we (people like you and me) essentially chained to suckle on Jensen's Nvidia teets whenever we want to run our AI rigs?
Rich_Repeat_22@reddit
128GB RAM system, up to 96GB allocated to GPU.
Glebun@reddit
No, we're talking 128GB RAM, 96 of which can be dedicated to VRAM
b3081a@reddit
Did you do that testing on Windows or Linux? On my Strix Point laptop I had the VRAM set to 2GB, and I've been running Llama 3.1 8B Q8 on Linux/ROCm for quite a while with the build option GGML_HIP_UMA=1, and the performance seems to be as expected for the bandwidth it has.
curios-al@reddit
On both: Windows 11 (CPU/Vulkan) and Fedora Rawhide Linux (CPU/Vulkan/ROCm) with ROCm 6.2.0. I also specified that option when building ROCm-enabled llama.cpp. When the model fits in VRAM, llama.cpp generates tokens faster, even though the memory bandwidth should be the same.
b3081a@reddit
That's interesting. From my experience running in dedicated VRAM is indeed faster than UMA or GTT, but it should be roughly in the 10-20% range and the gap shouldn't be as large as 4x. Maybe I should do some more testing with larger models.
curios-al@reddit
The 4x is valid only if the model doesn't fit into dedicated VRAM, i.e. if GTT is used.
Rick_06@reddit
FYI: https://community.amd.com/t5/gaming/maximizing-gaming-performance-on-amd-ryzen-ai-300-series-with/ba-p/704594
Rich_Repeat_22@reddit
Wasn't there an article this week about ROCm update for AMD APUs?
Wrong-Historian@reddit
"Unfortunately the memory for the igpu is dual channel DDR5"
Wasn't that for Strix Point, and shouldn't Strix Halo be quad-channel LPDDR5X?
Rich_Repeat_22@reddit
It has LPDDR5X-8333.
Cantflyneedhelp@reddit
I have not seen any indication that Halo will be quad channel. What I've seen is that it's supposed to be 256-bit though, which would put the bandwidth at ~480 GB/s (7500 MHz) or 512 GB/s (8000 MHz).
randomfoo2@reddit
The current leaks suggest 256-bit LPDDR5X-8000: https://videocardz.com/181601/amd-testing-strix-halo-apu-with-128gb-memory-config
This should be about 256GB/s of theoretical MBW.
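The arithmetic behind that figure, assuming the leaked 256-bit bus and LPDDR5X-8000:

```python
# Theoretical memory bandwidth from the leaked bus width and transfer rate.
bus_width_bits = 256      # leaked, not confirmed
transfer_rate_mts = 8000  # LPDDR5X-8000

bandwidth_gbs = bus_width_bits / 8 * transfer_rate_mts / 1000
print(f"Theoretical MBW: {bandwidth_gbs:.0f} GB/s")  # 256 GB/s
```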
Better_Story727@reddit
256GB/s for either read or write; 512GB/s total I/O bandwidth.
shroddy@reddit
It probably is, but there are still rumors it is in the 500GB/s range. But that might also be something like "theoretical bandwidth in typical gaming workloads when taking the 3D cache into account", not the actual bandwidth that we need for LLM inference.
DUFRelic@reddit
256-bit means "quad channel". Technically it's octo-channel, as every LPDDR5X module has two 32-bit half-width channels.
Noxusequal@reddit
Yes, OP is wrong.