Corsair desktop PC with Ryzen 395 and 128GB of unified RAM, has anyone tested it for LLMs? Seems like a good price
Posted by Acu17y@reddit | LocalLLaMA | View on Reddit | 66 comments
AI-Agent-Payments@reddit
The bandwidth number that actually matters here is around 256 GB/s on the Strix Halo, which sounds impressive until you compare it to a single 3090 at 936 GB/s. In practice I was hitting roughly 12-15 tok/s on a Q4 Qwen 72B, which is fine for single-user inference but falls apart the moment you add a second concurrent request. The unified memory architecture also means your system RAM and VRAM are competing for the same pool, so background processes you forget about will quietly eat into your effective context headroom.
No-Comfortable-2284@reddit
256 GB/s memory bandwidth. Doesn't matter how much RAM it has, it's not running anything bigger than single-digit-billion-parameter models at any meaningful speed for inference.
Zealousideal-Lie8829@reddit
For running llama.cpp you don't need any of the poorly supported parts of the software stack
Fit-Produce420@reddit
It's a Strix Halo system; you can google them and read up, multiple manufacturers make a similar system.
The other poster is right, they are bandwidth and compute limited. They're fun for a hobbyist but not really fast enough for agentic coding, so you could just buy a lower-capacity, higher-bandwidth GPU and be in the same spot.
Also, the AMD software stack sucks. It seems like they have a couple folks poking at it, but compared to CUDA it sucks ass. The software stack alone is enough to avoid AMD (and Intel).
anykeyh@reddit
Not completely agreeing on this. Qwen 3.6 35B MoE runs at 50 tok/s with the Vulkan backend, and 65~70 tok/s with MTP (llama.cpp). 256k context easy, no KV compression needed.
That's not fast, but it's not slow either, and with proper harnessing you can do quite a lot of agentic tasks. The MoE is pretty good if you carefully frame your tasks and practice step-by-step automated checks (TDD, etc.).
Also, the 128GB allows you to run other models, e.g. image and sound generation, run a full desktop for agentic search, etc...
Now the only issue is to find (or build) your agent harness.
But all of this for under 100W when active (TDP is 120W, but it's memory bound so it doesn't use the full TDP), and less than 5W when idle...
Fit-Produce420@reddit
I prefer larger models.
And like I said, if you're only going for a 36B model you don't need 128GB; you could get two GPUs and run on those. If you already have a computer, it's cheaper than buying a Strix.
I own one, I like it, but it isn't the right solution for a 36B model that fits on cards.
anykeyh@reddit
YMMV, my stack for example:
- Qwen 36B MoE for Agentic LLM
- Qwen 9B for vision & OCR tasks
- Z Image via ComfyUI for image generation
- Whisper for STT (not yet completely setup)
- PGVector + Qwen3-Embedding-8B for RAG; open webui, caddy etc...
- Custom agent harness
Basically a jack of all trades, with every kind of generation (except video) you can imagine.
Running Ubuntu LTS; deployed using Ansible; accessible remotely (I connect through SSH passthrough to one of my bare-metal servers in the cloud).
I don't think a machine with dual GPUs would be doable in this context. And my electric bill appreciates the 5W idle on a machine running 24/7, which I would not enjoy otherwise ;-)
wllmsaccnt@reddit
I use mine for smaller models (27B dense is about the max that makes sense to me, even with MTP). I got the Strix Halo so that I could test training models, which requires more VRAM. It's slower, and Hugging Face support for ROCm is messy, but training models of the same size on Nvidia GPUs would cost several times as much (if you factor in needing a full PC before buying the GPUs).
Maybe a desktop with several Intel Arc Pro B70s would also be an alternative, and would perform better for inference, but the Intel backends have the same issues as ROCm, and leaving a full desktop running all day can be awkward.
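One sanity check worth doing before any training run, hedged since it assumes a ROCm wheel of PyTorch is already installed: ROCm builds of torch reuse the torch.cuda API, so you can confirm the iGPU is actually visible before the Hugging Face libraries get involved.

```bash
# Should print True plus the device name if the ROCm PyTorch build sees the Strix Halo iGPU
python3 -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no ROCm device visible')"
```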
Acu17y@reddit (OP)
Yes, I was reading a bit; some people speak highly of it while others don't, so I wanted to know if anyone here had tried it in daily use with large models.
Regarding the AMD issue, honestly, I have a custom desktop with a 7900 XTX and couldn't be happier with it for AI work, productivity, and any other task. I'd recommend it to anyone. But in this case I was trying to recommend a PC with much more memory to my friend, and the one in the link seemed like a good price, but apparently the bandwidth isn't great, yeah.
Middle_Bullfrog_6173@reddit
For running llama.cpp you don't need any of the poorly supported parts of the software stack. Use the Vulkan backend and everything just works. But the RDNA "3.5" + unified memory combo is quirky and not perfectly supported in ROCm even now, a year after release.
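Roughly what that looks like, as a sketch (assumes a recent llama.cpp checkout and the Vulkan SDK/driver installed; the model path is a placeholder):

```bash
# Build llama.cpp with the Vulkan backend only, no ROCm/HIP stack required
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Serve a model fully offloaded to the iGPU
./build/bin/llama-server -m /path/to/model.gguf -ngl 99 -fa 1 -c 131072 --port 8080
```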
fallingdowndizzyvr@reddit
How so? I run ROCm on my Strix Halo. Sure it was quirky many months ago. But it's been fine for months now.
Middle_Bullfrog_6173@reddit
I'm still seeing crashes in some random compute paths within the drivers; the latest was when trying to fine-tune Gemma. And yes, it was within the ROCm driver itself.
Acu17y@reddit (OP)
I use my AMD card for everything AI related. Text, image, video, sounds, etc... 10/10 honestly
Middle_Bullfrog_6173@reddit
I also have a discrete AMD GPU, which has been supported fine. And the 3D drivers on the Strix have been working as well. Just ROCm (though I haven't tried 7.13 yet) has been buggy.
MrShrek69@reddit
I've used mine for professional work and it's okay. The issue with pro work is that you need CUDA. No one knew CUDA would be the platform everything works on, and Nvidia doesn't want to share their proprietary drivers, so AMD has been reverse engineering it and it's just not there yet. We don't know if it ever will be, but it's crazy how far it's come already; it's just not usable for professional work as of now.
lol-its-funny@reddit
Personally I wouldn’t buy it. I bought the same hardware but by another OEM for around $1900 and it’s good. It’s got a TON of memory, and is fast “enough” (\~4x usual ddr 4/5, these are 256 gb/s memory bandwidth). Really fast for MoE models, which is where is shines (and is the likely future of models too). But right now they’re just expensive. I’d just use APIs like deepseek or MiniMax or codex/openai. Unified memory is the future, absolutely zero doubt.
Sorry it’s probably not what you wanted to hear.
IamVeryBraves@reddit
You're not kidding, holy cow, $3k+ for a lot of these machines that used to go for $1800-2200.
For $3.5k I could get an ASUS Ascent GX10 (their version of the DGX Spark), stick with CUDA, and enjoy the NVFP4 that everyone raves about.
What I wanna ask a Strix Halo user is: do you think $2500 for a Strix Halo is worth it? Because I kinda regret not jumping on it when it was $2500 a few weeks ago lol.
mjsxi__@reddit
Its memory bandwidth is too slow; even though you can fit a large model in there, working with it will be a pain (depending on what you want to do, I guess).
fallingdowndizzyvr@reddit
And is slower overall.
https://www.reddit.com/r/LocalLLaMA/comments/1le951x/gmk_x2amd_max_395_w128gb_first_impressions/
mjsxi__@reddit
Yeah, and I was talking about how bandwidth is the limiting factor for this device, so I talked about bandwidth and made a comparison to another unified memory device. No shit the SH performs better; it's a newer chip that's beefier in every spec, of course it's going to perform better overall.
fallingdowndizzyvr@reddit
Yeah, but are you just talking about bandwidth for bandwidth's sake, or are you talking about bandwidth being the "limiting factor" for performance? Again, overall, even though it has higher bandwidth, a device with lower bandwidth is faster. Bandwidth is not the be-all and end-all that some people think. It's not that simple.
Acu17y@reddit (OP)
256 GB/s isn't enough, you say? I'd like to recommend it to a relative of mine who would use it for programming and other important projects.
KickLassChewGum@reddit
A 5090 sits at 1.7TB/s, for comparison.
Acu17y@reddit (OP)
Yes, and it also costs a bit more perhaps. ;((((
oShievy@reddit
It’s literally the cost of a strix halo + ddr5 ram + psu + cpu lmao. At that point you’re in an entirely different price bracket
mjsxi__@reddit
So a 27B dense model (which would be on the smaller end of models you could run with that memory amount) at a 4-bit quant (which is also not ideal, since the memory size could fit 8-bit or unquantized) would run at like 12-17 tok/s, which is slow…
Bumped up to 8-bit (which would be more in line with the memory size), it would run at like 6-8 tok/s, so extremely slow.
Imo not really worth it for the price. Others might feel differently.
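Those figures line up with a back-of-the-envelope calc, assuming decode is purely bandwidth-bound and only ~60% of the 256 GB/s peak is usable (my assumptions, not measurements from this box):

```bash
# tok/s ≈ effective_bandwidth_GBps / GB_of_weights_read_per_token
EFF=$(echo "256 * 0.6" | bc -l)                                            # ~154 GB/s effective
echo "27B dense @ 4-bit (~13.5 GB/token): $(echo "$EFF / 13.5" | bc -l)"   # ~11 tok/s
echo "27B dense @ 8-bit (~27 GB/token):   $(echo "$EFF / 27" | bc -l)"     # ~6 tok/s
echo "MoE, ~3B active @ 4-bit (~1.5 GB):  $(echo "$EFF / 1.5" | bc -l)"    # ~100 tok/s ceiling; real runs land lower
```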
Acu17y@reddit (OP)
Oh, okay, really slow, I thought it was faster.
I have a 7900 XTX in my custom PC, and I run Qwen 3.6 MoE at about 90 or 100 tokens/s, and 60+ with a 256,000 context size for big projects.
I wanted to recommend the one in the link so he could maybe use larger models than the one I use, but at this point I'd be better off building him a PC with a GPU like mine.
MrShrek69@reddit
I think it really comes down to what you want to do. I use 27B "slow" (I still think it's pretty fast) to do planning and such. I'm a software engineer. Then I use smaller MoE models in parallel to implement it. But running multiple instances quickly overwhelms my Strix Halo machine. Still, it gets shit done and it's at home, so I love it. If you want to do agentic programming, having more than one instance of a llama.cpp server is nice because then you can start using more agents. The real benefit for me is just the fact that I have 128 GB of VRAM. It really just lets you do whatever you want; I can support most context sizes, etc. It's just a totally different experience than having one or two traditional GPUs.
mjsxi__@reddit
Just for comparison, your GPU is around 1000 GB/s.
LevianMcBirdo@reddit
You use a dense model. Why, when there are comparable MoEs that will be a lot faster on this machine, and now with MTP potentially doubling the speed? Yeah, it's not a dream pairing with the 27B, but with the 122B and MTP?
floconildo@reddit
256 GB/s is the theoretical limit; actual bandwidth sits around half of it. I mostly run a SH, a pretty beefy machine, but dense/big models are a pain.
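If you want to see what a given unit actually sustains rather than the spec-sheet number, llama.cpp ships a benchmark tool; a minimal sketch, with the model path as a placeholder:

```bash
# Reports pp (prompt processing) and tg (token generation) rates for the chosen model
./build/bin/llama-bench -m /path/to/model.gguf -ngl 99
```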
Acu17y@reddit (OP)
Ok, thanks for the info (;
Massive-Question-550@reddit
256 GB/s is quite slow once you get above 30B parameters, especially for programming, where you need it spitting out lines fast (e.g. 20-50 t/s) vs the 3-5 t/s someone might need for creative writing.
At 70B Q4 I think it slows to around 3-4 t/s, so it's pretty limiting compared to a stack of GPUs, especially if they can run in parallel, since you could reach the equivalent of 2 TB/s with four 3090s.
sagiroth@reddit
A 3090 is 3x faster, so you're looking at about 20 tok/s if you optimise. Not great, not terrible.
Acu17y@reddit (OP)
Ok, not bad but not great either, thanks
kant12@reddit
I have a Corsair and a Nimo and I will say the build quality is certainly better with the Corsair. It's bigger and runs cooler and it doesn't feel like the motherboard is moving around on me when I connect things to it. However, I have had zero actual issues with both.
If your goal is to run larger models with large context and treat it like an assistant that goes off for an hour and completes tasks you're too busy to do yourself then you won't be disappointed.
DoorStuckSickDuck@reddit
I like strix halo and I recommend you check out the forum for it. They are indeed slower than running a server with GPUs, but they are very cheap electricity wise and quiet. They're excellent machines if you plan to run it 24/7 for automation or other uses.
They can load big models, or models in parallel, since they have 128GB of RAM to work with. For example, I run 4x parallel lanes of Qwen 3.6 35B, each with 128k context, and I get ~900 tok/s prefill and ~40 tok/s generation. The driver support is getting significantly better (and AMD is finally funding it properly), and there's a pretty good community around getting these machines working.
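For reference, a sketch of how the parallel lanes can be set up (my flag choices and a placeholder model path, not necessarily the exact command used here): llama-server splits the total context across slots, so -np 4 with -c 524288 gives four slots of 128k each.

```bash
# Four parallel slots, each with 524288/4 = 128k context
llama-server -m /path/to/Qwen3.6-35B-A3B-Q8_0.gguf \
  -ngl 99 -fa 1 \
  -c 524288 -np 4 \
  --host 0.0.0.0 --port 8080
```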
Massive-Question-550@reddit
how much of a penalty is there running an Nvidia gpu with the 395 since you can't use cuda or tensor cores?
DoorStuckSickDuck@reddit
Wym? You run two separate instances: I run llama.cpp with the Lemonade ROCm binaries for the Strix Halo GPU, and the llama.cpp CUDA binaries for the 3070 I'm using in my eGPU. Surprisingly, it works without issues on Ubuntu.
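Roughly what the two-instance setup looks like, assuming separately obtained ROCm and CUDA builds of llama.cpp (paths, ports and model names below are placeholders):

```bash
# Instance 1: ROCm build serving a big MoE on the Strix Halo iGPU
~/llama-rocm/bin/llama-server -m /models/big-moe.gguf -ngl 99 -fa 1 --port 8080 &

# Instance 2: CUDA build serving a smaller model on the 3070 eGPU
CUDA_VISIBLE_DEVICES=0 ~/llama-cuda/bin/llama-server -m /models/small-model.gguf -ngl 99 -fa 1 --port 8081 &
```

Clients just point at the two different ports; each build only talks to its own GPU, so the instances don't step on each other.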
JayTheProdigy16@reddit
who told you this? You can run CUDA x ROCm simultaneously if you build llama.cpp with the right backends. I've been doing that for a while
MrShrek69@reddit
It's a pretty big difference in general. But as an inference platform at home it's amazing. I love it, but there's def more tinkering involved (I love that stuff). It's getting more mature by the day and the community is great. I see it only being a better option with time, but rn it's def niche. If you're looking for a gaming PC as well, this might be the move. I use mine for gaming and LLMs. I couldn't justify a GPU worth the same price as an entire computer. I got mine at launch, so also rn things are different.
sernamenotdefined@reddit
I'm thinking of ordering the OCuLink card and dock from Minisforum and sticking an AI Pro R9700 in it for dense models, and running MoE models on the iGPU.
sernamenotdefined@reddit
To be honest, if you are going to use an eGPU, get a Minisforum MS-S1. It has an x16 (electrically x4) low-profile slot for OCuLink adapters, so you don't have to fiddle with a flimsy adapter and you have a proper way to bring it outside the case.
My first test with local LLMs only was building a fairly large but simple CRUD web app using Gemma4, Qwen 3.6 27B and Qwen3-Coder-Next. The dense models are not super fast, but Qwen runs acceptably. I put quite some time into writing a good specification with logical and testable implementation steps and set it to work.
The time on the documentation paid off: I ended up with all functionality implemented without even having to look at it. I just let it chug along for a day. Then some manual fine-tuning and UI work (using the Qwen MoE for completion/suggestions, as it is more than fast enough to not be annoying).
First app I let local AI build from scratch without touching my Codex account, and it ran as a test while I was working my day job.
Is it as good as Codex (or Claude)? No, but with some extra effort in advance it is good enough if you are on a budget. (I also used the MS-S1 as the workstation, so everything ran off a single machine: local AI and my entire development environment.)
BuilderUnhappy7785@reddit
Hey I’m planning to go a similar route. I am fairly experienced writing software spec but new to working with coding agents. Would you be up for posting or DMing me your spec doc? It would be very helpful as a point of reference. Thanks!
myidealab@reddit
For those going the MS-S1 route: YouTube - Alex Ziskind - Three months wrong about why my 4-node AMD cluster was slow
sernamenotdefined@reddit
Yeah just watched that, good link!
I'm not clustering, if I really need a large model I'll use Codex.
sudochmod@reddit
What settings are you using for qwen 35b with 900 tokens prefill?
michaelsoft__binbows@reddit
What's the MTP speedup on the likes of the 27B? My 5090 sees about a 2x speed jump, nearly erasing the speed advantage of the 35B and thereby making it obsolete.
u23043@reddit
About double for code in my tests today. But 35B is still 3x the tg speed since it improves about 25% as well.
And 35B prompt processing is still *way* faster, so while it is nice 27B dense is closing the gap (and is arguably usable now for some tasks), overall task times still feel slow.
wllmsaccnt@reddit
On a Strix Halo, using recommended Unsloth parameters on llama.cpp with their 27B MTP model, I went from ~11 to ~19 (both tested with the MTP model, just toggling the speculative decoding off then on). The 35B A3 model gives me about 55 without MTP.
MrShrek69@reddit
The PCIe port isn't a full x16 slot, right? It's only x8?
Fast_Paper_6097@reddit
I have this one in particular. I *almost* sent it back and then decided not to at the last second, now that MTP is making progress in llama.cpp.
It actually runs MoEs very well, and I have high hopes for the latest Gemma MTP.
Setting it all back up again tonight.
Fast_Paper_6097@reddit
Biggest gripe about this particular device: there's no PCIe expansion capability. If I want to cluster, I have to sacrifice an NVMe slot and use a riser/adapter.
high_on_meh@reddit
Had mine about a month. Experimented with models a bit but I'm mainly running with llama.cpp using the following:
llama-server -m models/Qwen3.6-35B-A3B/BF16/Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf --host 192.168.20.189 -c 524288 -n 131072 -ngl 99 -fa 1 --no-mmap --threads 2 -b 4096 -ub 4096 --n-cpu-moe 0 --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --min-p 0.00 --chat-template-kwargs '{"preserve_thinking": true, "enable_thinking": true}'

Been using it on a personal project almost every day and this is pretty representative of the performance I've been getting:

prompt eval time = 443.40 ms / 40 tokens (11.09 ms per token, 90.21 tokens per second)
eval time = 4631.51 ms / 81 tokens (57.18 ms per token, 17.49 tokens per second)
total time = 5074.91 ms / 121 tokens

Why did I get this Strix Halo and not one of the others? I have no idea why, but my price delivered was $2500. They show up as $3400 now.

Let me know if you have any questions.
RiseStock@reddit
I have a box from Nemo PC that looks exactly like this one, with the same specs and it cost me about $2K
fallingdowndizzyvr@reddit
That's probably because Corsair doesn't make this. It's a rebrand.
That was then. This is now. Before they restocked the Corsair was $2200. I bought my X2 for $1800. Now it's $3000.
RiseStock@reddit
Yeah they're also charging about 3K now. I would wait until it goes on sale.
fallingdowndizzyvr@reddit
The longer you wait, the more expensive it's going to be. The M5 is like $200 more today than it was a couple of weeks ago.
brickout@reddit
Slow but low power consumption
ImportancePitiful795@reddit
When it comes to AMD 395s, like 10 different mini PCs use the same PCB etc. So imho get the cheapest.
Just FYI, I bought the Bosgame M5 for €1200 LESS, 2 months ago.
SirNobby@reddit
Managed to buy the GMKtec one at 2100 euro in December and I like it. Running all sorts of workflows and apps and a few different AI models on it, on Ubuntu. It's about 215 GB/s memory bandwidth.
MrShrek69@reddit
There are a lot better options than the Corsair one. The comments are right; take a look at all the different options manufacturers are making. There are desktops, laptops, tablets. They also work as great gaming PCs rn.
fallingdowndizzyvr@reddit
Dude, just look up any number of threads on Strix Halo. They are all effectively the same in mini-pc form.
shuozhe@reddit
GMKtec & Bosgame were both ~1.5k€ at release. Currently it's somewhat close to a 5090...
It was on my list, but tps was just too low, even for background tasks. Better to have a fast Qwen 3.6 than one of the bigger models that takes 3-4h for a task.
Hood-Boy@reddit
https://strixhalo.wiki/ + discord
https://github.com/kyuz0/amd-strix-halo-toolboxes
Trick-Assignment-828@reddit
You can use larger LLMs, but slowly.