On Strix Halo, what option do I have if 128GB unified RAM is not enough?

Posted by heshiming@reddit | LocalLLaMA | 38 comments

Windows 11 lets me allocate 96GB of unified RAM as VRAM. I can fit a 90+GB model, such as a Q5 quant of Qwen3.5-122B-A10B, under llama.cpp and get decent performance for coding. What would be the better option if I needed a larger model?

I understand one option is to buy a second Strix Halo and have llama.cpp span the computation across both machines via its RPC backend. But the current state of RPC, and the benchmarks in AMD's tutorial with a 4x cluster, weren't convincing; it appears to be more of an experiment than a real use case.
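For reference, the RPC setup I'm describing would look roughly like this (a sketch, assuming llama.cpp is built with the RPC backend enabled; the hostname and model path are placeholders):

```shell
# On the second Strix Halo box: expose its GPU over RPC
./rpc-server --host 0.0.0.0 --port 50052

# On the main box: split the model between local memory and the remote worker
./llama-server -m ./model.gguf \
  --rpc second-halo:50052 \
  -ngl 99   # offload all layers; llama.cpp distributes them across devices
```

Per-token activations have to cross the network between boxes, which is presumably why the clustered benchmarks look the way they do.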

I could also get an eGPU dock. But the best card any vendor claims to support is the RTX 5090 with 32GB of VRAM. So for any model that can't fit entirely into 32GB (my use case), the remaining layers would have to stay in unified memory, and the transfer rate over the eGPU link could prevent full utilization of the card. And I don't see any dock on the market that supports something like the RTX Pro 6000 with its 96GB of VRAM.
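The partial-offload scenario I mean would be something like this (a sketch; the layer count and model path are placeholders you'd tune so the offloaded layers fit in the card's 32GB):

```shell
# Offload only as many layers as fit in the eGPU's VRAM;
# the remaining layers run from unified memory, so activations
# cross the eGPU link on every token.
./llama-server -m ./model.gguf -ngl 40
```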

Which option is the better one, or is there no point in pursuing this configuration? Thanks!