Impact of mixing architecture

Posted by zakadit@reddit | LocalLLaMA

For context:

As planned after my previous post, I now have a decent amount of VRAM to work with:

2x RTX 3090

maybe 2 more coming soon, if needed

1x RTX 4060

8x RX 6600 XT

1x RX 6700 XT

1x RX 9060 XT

(plus 12 to 20 more RTX 3060s coming soon, and 2 more 3090s if needed)

I’ve been pretty hyped to finally start building something with all of this, but from what I’ve read, mixing CUDA and Vulkan/ROCm seems like it can get messy pretty quickly.

Is that actually a big deal in practice, or is it manageable if everything is configured properly over my llama.cpp RPC setup?
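For reference, this is roughly the RPC layout I have in mind, lifted from the llama.cpp RPC docs (IPs and ports below are placeholders, and I haven't actually tried it across mixed backends yet):

    # on each GPU box, using a build compiled with -DGGML_RPC=ON
    # (one rpc-server per backend you want to expose)
    ./rpc-server -H 0.0.0.0 -p 50052

    # on the main box: spread layers across local GPUs + the RPC workers
    ./llama-server -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052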

Right now, I’m thinking about splitting the CUDA and Vulkan/ROCm GPUs instead of trying to force everything together.

But I’m not sure what the cleanest way to do that would be…

Should I go for something like 2 llama.cpp / llama-server instances?

I ask because I've heard that multi-machine inference can get pretty slow or annoying, even over fast Ethernet, so I'm trying to avoid building something that sounds good on paper but performs badly in real use.
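Roughly what I mean by two instances (flag names from memory, so double-check against --help; the ports, paths, and device indices are just examples):

    # instance 1: CUDA build, pinned to the NVIDIA cards
    CUDA_VISIBLE_DEVICES=0,1,2 ./build-cuda/bin/llama-server -m qwen.gguf -ngl 99 --port 8080

    # instance 2: Vulkan build, pinned to the AMD cards
    # (GGML_VK_VISIBLE_DEVICES if the build supports it, otherwise check --list-devices / --device)
    GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./build-vulkan/bin/llama-server -m other.gguf -ngl 99 --port 8081

Each instance would serve its own model on its own port, and a small reverse proxy in front could route requests, so the two backends never have to mix inside one process.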

At the same time, I feel like each of these GPUs should still be capable of running decent models on their own, especially with the right GGUF quants.
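The back-of-the-envelope VRAM math I've been using (very rough, assuming ~0.6 GB per billion params for Q4_K_M, before KV cache and context):

    32B model x ~0.6 GB/B ≈ 19-20 GB  -> just fits on a single 24 GB 3090
     8B model x ~0.6 GB/B ≈  ~5 GB    -> comfortable on an 8 GB 6600 XT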

For now I'm mostly chasing the DeepSeek models, but Qwen3.6 (uncensored 35b) is my current go-to.

(I've only tested it with the 4060 & the 3090 so far, and damn, it's impressive.)