On Strix Halo, what option do I have if 128GB unified RAM is not enough?
Posted by heshiming@reddit | LocalLLaMA | View on Reddit | 38 comments
Windows 11 lets me allocate 96GB of unified RAM as VRAM. I can fit a 90+GB model, like a Q5 quant of Qwen3.5-122B-A10B, under llama.cpp with decent performance for coding. What would be the better option if I needed a larger model?
I understand one option is to buy another Strix Halo and have llama.cpp span the computation via RPC. But the current state of RPC, and the benchmarks in AMD's tutorial with a 4x cluster, weren't convincing; it appears to be more of an experiment than a practical setup.
I could also get an eGPU dock. But the best card any vendor claims to support is the RTX 5090 with 32GB of VRAM. So for any model that can't fit into 32GB (my use case), transfer rate is going to be a significant issue, which might prevent full utilization of the eGPU? And I don't see anything on the market that supports something like the RTX Pro 6000 with its 96GB of VRAM.
Which option is the better one or is there no point trying to pursue this configuration? Thanks!
notdba@reddit
I have a 128GB strix halo and a RTX 3090 eGPU connected via oculink. The slow PCIe 4.0 x4 is indeed a massive issue for PP. I shared a bit about this before in https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/
There is really no good option to expand a strix halo setup. It doesn't have enough PCIe lanes for good interconnect speed with anything else. I learnt it the hard way.
Fit-Produce420@reddit
Yep that's definitely the downside to strix!
You can run Linux and get to about 112-116GB, for one thing.
Also, you can split layers or tensors between devices.
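In llama.cpp that maps to the `--split-mode` and `--tensor-split` flags; a minimal sketch (the model path and the 2,1 ratio are placeholder values, not a recommendation -- tune the ratio to each device's free VRAM):

```shell
# split by whole layers (default) or by tensor rows across two devices;
# --tensor-split 2,1 puts ~2/3 of the model on device 0, ~1/3 on device 1
llama-server -m model.gguf \
  --split-mode layer \
  --tensor-split 2,1 \
  -ngl 99   # offload all layers to the GPU backends
```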
VoiceApprehensive893@reddit
set gtt to the entire ram
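For reference, on the amdgpu driver the GTT size is a boot-time module parameter; a sketch assuming a ~120 GiB target (the value is in MiB, and whether your kernel honors it alongside the TTM limits may vary):

```shell
# add to the kernel command line (value is in MiB; 122880 MiB = 120 GiB):
#   amdgpu.gttsize=122880
# then verify what the driver actually reserved after reboot:
cat /sys/class/drm/card0/device/mem_info_gtt_total
```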
Fit-Produce420@reddit
You can OOM if you do this, and kernel panic.
audioen@reddit
You can go straight up to about 124 GB. I think all you have to do (if using Vulkan; maybe not needed for ROCm) is add ttm.pages_limit=32505856 iommu=pt to the kernel command line. The latter is a performance optimization: it drops the IOMMU from memory management altogether, which probably saves some translation lookaside buffer activity or similar indirection.
And that's how the constant is calculated: 32505856 pages of 4096 bytes each is exactly 124 GiB.
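A quick sanity check of that constant (pure arithmetic):

```shell
# ttm.pages_limit counts 4096-byte pages: 32505856 pages = exactly 124 GiB
echo $(( 32505856 * 4096 / 1024 / 1024 / 1024 ))   # prints 124
```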
cunasmoker69420@reddit
yeah I'm running 124GB also
Fit-Produce420@reddit
Yes, 124GB is about the max that leaves enough for Linux + mcp servers + ide + web client.
I found that keeping it below 113GB helped on some kernels; there was a bug.
ShengrenR@reddit
"Appears to be more of an experiment rather than a use case." - honestly... kind of Strix Halo in its entirety there. The big challenge is that the low memory bandwidth hurts more and more as you go larger and larger. If you're even thinking of NVIDIA GPUs I really wouldn't think of that as an extension to the Strix.. that, to me at least, would be an entirely new build - it'd be real odd to stick a sports car on the front of a bus.. you can do it though. For just a massive pool of unified RAM, likely Macs.. and hope the M5 gets a huge option that doesn't cost a fortune. You can always try sticking Intel or AMD GPUs together for savings over NVIDIA, and your hardware will be cheaper, but the software layer will likely need a lot more tinkering.
If it's for work and they have cash lying around go beg for a https://www.nvidia.com/en-us/products/workstations/dgx-station/ lol.. hasn't worked for me yet.. but I'll keep trying my luck.
RealPjotr@reddit
Yes, experiment. A test before the real stuff coming 2027-2028.
I think you'll see a lot of options coming with DDR6, more channels and more RAM, 256-512 GB options. Like the Apple Ultras, but with competitive pricing, just hope RAM prices can come down a bit.
brainmydamage@reddit
you really truly think there's going to be any consumer availability for ddr6? lol.
RealPjotr@reddit
RemindMe! 2 years
RemindMeBot@reddit
I will be messaging you in 2 years on 2028-04-11 19:03:43 UTC to remind you of this link
catplusplusok@reddit
Bandwidth requirements depend on active parameters, not total parameters; large A10B models run fine so long as they fit.
ShengrenR@reddit
Right, if you stay A10B and just have more and more experts, nbd, but most of the huge ones have larger active counts too. DeepSeek and Kimi are mid-30s.. maybe the big Qwen3.5, but that's still 17B active, which is going to be a lot for the Strix potentially. Not impossible, it just starts to really push the edge of useful imo.
xspider2000@reddit
On Linux you can allocate up to 126GB of memory to the iGPU. An eGPU is another way to increase total available VRAM; the interface between the eGPU and the Strix Halo doesn't bottleneck your PP and TG. I recently wrote 2 posts about that.
Look_0ver_There@reddit
When you use layer-mode model splitting there actually isn't a whole lot of data that flows from card to card, or from machine to machine (or machine to card if in eGPU mode). It's only a few hundred KB/sec. The biggest factor is latency.
I have two Strix Halos and run them in RPC mode, linked via a USB4 cable. For single-user throughput, the performance hit for splitting the exact same model/quant across the two machines, vs running it locally on one, is about 10-15%. i.e. striping isn't terrible, but it's also not a magic pill.
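For anyone wanting to try a setup like this, the llama.cpp RPC flow is roughly the following (IP and port are placeholders; the rpc-server binary is built when llama.cpp is configured with -DGGML_RPC=ON):

```shell
# on the remote Strix Halo: expose its backend over the network
rpc-server --host 0.0.0.0 --port 50052

# on the local machine: stripe the model across local + remote backends
llama-server -m model.gguf -ngl 99 --rpc 192.168.100.2:50052
```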
I do recommend that you try the Unsloth IQ3_XXS quant of MiniMax M2.5 on Linux on the single Strix Halo. This can be run up to 200K context depth and it'll range from ~38tps initially down to ~15tps at 150K+ context. Keep the context window smaller and compact more frequently if you want it to run faster.
This is all being done on Linux (Fedora 43) though. If you want to stick with Windows then I have no experience there.
The other solution that doesn't cost an arm and a leg is to grab 3x Radeon AI Pro R9700 cards (they retail for about $1250-1350 each) at 32GB each. Avoid the PowerColor brand as the fans they use are noisy AF. The XFX or ASRock ones use fans that barely make noise until you're pushing the cards super hard. This solution runs 2.5-3x faster than the Strix Halo when running the exact same model.
You can fit a Q5_S quant of Qwen3.5-122B-A10B in the 96GB this gives you, and it runs PP2048 @ 1150t/s and TG512 @ 36t/s for a single user using llama.cpp. You can push up to 80t/s TG when running in agentic mode (lots of parallel requests).
You can even keep the Strix Halo, build out your 96GB GPU-based solution (it'll cost you about $5K), and RPC over USB4 between the two; that'll give you ~210GB of usable VRAM that you can stripe with RPC. In this scenario the half that runs on the Strix Halo runs at Strix Halo speeds, while the half that runs on the GPUs runs 3x faster, netting you a rough ~1.8x speed-up.
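That ~1.8x is consistent with roughly two-thirds of the work landing on the 3x-faster GPU half; a back-of-the-envelope check (the 2/3 fraction is an assumption on my part, not a measured figure):

```shell
# per-token time scales as (1-f)*1 + f/3 for fraction f on the fast half,
# so overall speedup = 1 / ((1-f) + f/3)
awk 'BEGIN { f = 2/3; printf "%.2fx\n", 1 / ((1 - f) + f / 3) }'   # prints 1.80x
```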
The above are all just suggestions. I'm not saying that it's perfect (nothing really is under $20K).
The other solution is a 256GB Mac Studio. That will cost you about the same as all of the above, but will run a bit slower than all of the above. It is a MUCH cleaner solution though.
If the M5 Ultra based Mac Studio ever does get released, that may very well be the most compact solution to what you're after, but it'll cost you a lot more than hacking something up.
Just my 2c.
cunasmoker69420@reddit
Can you elaborate more on this last part about the parallel agentic mode?
Look_0ver_There@reddit
Breaking it down into a high-level description without focusing too much on the details.
It just means that when you're using a coding harness (that's the example that applies to me), or something else (like OpenClaw I guess - I've never used it) that spawns off a bunch of agents at the same time, rather than them taking turns running at 36t/s in sequence, 4 requests (the default for llama-server is --parallel 4) will run at the same time. This allows llama-server to keep the GPUs busier 'cos there's (almost) always something that needs doing within some portion of the overall inferencing process. With 4 clients, those 4 clients effectively run at 20t/s each (20 x 4 = 80). The individual clients run a little more slowly, but more is being done overall.
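Concretely, that's something like the following (note that llama-server divides the -c context budget across the active slots, so 4 slots out of 131072 gives each request roughly 32K):

```shell
# 4 concurrent server slots; total KV-cache context is shared between them
llama-server -m model.gguf -ngl 99 --parallel 4 -c 131072
```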
cunasmoker69420@reddit
Neat I actually just tried that and the speed loss is definitely not as big as I assumed it would be. Now if there was only a way to have the model output text on one request while prompt processing on another request. It looks like prompt processing pauses everything, which I guess makes sense.
Look_0ver_There@reddit
vLLM is much better at doing what you're describing, but it's also fairly memory hungry, and kind of fragile on AMD GPUs. I've gotten it to work a number of times, but you definitely cannot ride the knife edge of squeezing out every last byte of VRAM memory from it like you can with llama.cpp.
Still, if you pick a smaller model, vLLM will stomp all over llama.cpp for total throughput when it comes to agentic use. For me I find it's generally more hassle than I can be bothered with as I'm constantly swapping models and tinkering, but if you're wanting to set up a long running stable inferencing platform, then vLLM will be better.
There's also SGLang which is meant to be better again, but after a whole day of trying to make it work on AMD GPUs I gave up in frustration.
heshiming@reddit (OP)
Thanks, all very practical suggestions.
ProfessionalSpend589@reddit
Yes, utilisation will always be less than 100% if you’re not using tensor parallelism.
For example, 2x Strix Halos with RPC will be utilised at 50% each; 4x will be utilised at 25%.
Having an eGPU actually can help if you want to run a smaller model in parallel. That’s what I currently do - I run Gemma 4 26B A4B as a general fast model (I want to use it on mobile so fast is critical to save on battery) and then I have a Qwen 3.5 397B as a smarter, but slower model ready for heavy thinking.
El_90@reddit
I'm in the same position
I'm quite happy with q4 122b Moe for architect, then 27b for coding.
Even doubling RAM to 256GB really only gets you a better quant; you still can't run SOTA at anything useful, so I've accepted there's no easy slight step up - it's a rebuild from scratch.
I'm just hoping ~90GB models continue to stay popular
Dontdoitagain69@reddit
You can build anything with small models, there's a lot of hype. Just figure it out dude, it takes a lot of experimenting, and if you know a design pattern that is compatible with your model you are set. I have 1.2TB of RAM, I'm using a 30B to do real estate forecasting, and I can build a blockchain node using old Phi models. Don't ever listen to anyone saying you need a lot of RAM. This is a fact, no BS. DON'T USE THIS SUB AS A SOURCE OF TRUTH. It's good for some model news and entertainment. RAM chasing is BS.
phido3000@reddit
RPC is very experimental. I didn't have much luck with it really. It worked, but didn't make things faster, and had all sorts of bottlenecks. I don't think a 5090 is worth it for a Strix unless you can fit your model into it. I would look at a 7900 XT or a 9700 Pro.
You start just building a completely different machine.
Epyc with DDR4 is what I have. 8 channels, and loads of PCIe. But even then I can't really afford 4x Pro 6000s... And its memory performance is no better than Strix. But I can have loads more RAM, so large models can run, just very slowly.
Fit-Produce420@reddit
RPC is more for fitting a larger model, not speed unfortunately.
Final-Rush759@reddit
Some Strix Halo machines have a PCIe slot for a GPU.
Middle_Bullfrog_6173@reddit
Currently there are basically two reasonable price points with local models, plus one very expensive one. The small dense route, up to ~30B, where dGPUs shine. And the sparse MoEs, where Strix and some Macs get you a "cheapish" entry point, but with limited speeds and no real upgrade path.
The large models that are a significant enough upgrade are IMO unreachable with any reasonable local setup. Either way too much money or way too many compromises compared to the more reasonable models. Unless running a business/lab setup for many users.
So you are basically already at a "local minimum" of the optimised inference setup. I personally don't think any current models justify a change, but it's always use case dependent. I have a Strix for work and am waiting for next-gen models to know how to upgrade my obsolete personal setup.
No_Success3928@reddit
you have the option of getting a refund and buying multiple RTX pro 6000s
sunshinecheung@reddit
Use Qwen 3.5 27B
inthesearchof@reddit
The thing with unified memory on Strix Halo is that dense models crush your tok/s. That is why almost everyone runs MoEs.
ZCEyPFOYr0MWyHDQJZO4@reddit
There's really nothing stopping you from putting a 96GB Blackwell in an external enclosure AFAIK. You're just gonna be limited by the 40 Gbps connection and your wallet. Whether it's a good idea to spend $9k on a GPU and slap it in an eGPU case is a different matter.
heshiming@reddit (OP)
Thanks, though somehow I feel like quantizing and REAP result in a major drop in accuracy, and Qwen3.5 is the only option that's sort of resilient. Other options like MiniMax need pretty high quants to perform well.
Fit-Produce420@reddit
Step 3.5 flash at 4 bits is smaller than minimax m2.5 (just barely) and runs pretty nicely. It's a good coding model.
Thepandashirt@reddit
Honestly you’re heading down an expensive path. Mac Studio used to be the next reasonable step up, but those are back ordered for months. Only other option is going 96GB Blackwell but those are expensive and need an expensive home.
But honestly I’m not sure how much use you’re going to get out of more VRAM. The trend seems to be keeping medium to larger models closed weights while releasing weights of smaller models. That’s what google and alibaba have done with their latest models. So going more than 128GB of ram gets you into this weird space with not a lot of LLM options. Most open source models moving forward will be runnable on 24-32GB vram since that’s what most people have. Most people don’t have 128GB+ of vram to run medium sized models so there isn’t much incentive to develop or release models that size. The medium sized models also cannibalize frontier model usage so there really is no business incentive for releasing them either.
heshiming@reddit (OP)
Thanks for the insight!
BumblebeeParty6389@reddit
If 128GB isn't enough, the most logical option is probably selling it and buying a Mac Ultra instead.
catplusplusok@reddit
I run this model on my Thor dev kit: catplusplus/MiniMax-M2.5-REAP-172B-A10B-NVFP4. It's pretty capable, and I don't think a larger model would perform well even if I had more memory. On an AMD box you would want a different quant (AWQ?), but you should be able to run this model with ~120K token context given 8-bit KV cache. The key is switching to Linux and clearing the filesystem cache before starting the inference engine.