Impact of mixing architectures
Posted by zakadit@reddit | LocalLLaMA | 8 comments
As planned after my previous post, I now have a decent amount of VRAM to work with:

- 2x RTX 3090 (maybe 2 more coming soon, if needed)
- 1x RTX 4060
- 8x RX 6600 XT
- 1x RX 6700 XT
- 1x RX 9060 XT
- (12 to 20 more RTX 3060 coming soon, + 2 3090 if needed)
I’ve been pretty hyped to finally start building something with all of this, but from what I’ve read, mixing CUDA and Vulkan/ROCm seems like it can get messy pretty quickly.
Is that actually a big deal in practice, or is it manageable if everything is configured properly over RPC?
Right now, I’m thinking about splitting the CUDA and Vulkan/ROCm GPUs instead of trying to force everything together.
But I’m not sure what the cleanest way to do that would be…
Should I go for something like two separate llama.cpp / llama-server instances?
I ask because I've heard that multi-machine inference can become pretty slow or annoying, even with high-speed Ethernet, so I'm trying to avoid building something that sounds good on paper but performs badly in real use.
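To make that concrete, here's a rough sketch of the two-instance idea: just spawning a CUDA build and a Vulkan build of llama-server side by side. The binary names, model path, ports, and device indices are placeholders, and I haven't verified the Vulkan device env var against every llama.cpp build:

```python
import os
import subprocess

# Instance 1: CUDA build of llama-server, pinned to the NVIDIA cards.
cuda_env = os.environ.copy()
cuda_env["CUDA_VISIBLE_DEVICES"] = "0,1"  # placeholder indices for the 3090s

cuda = subprocess.Popen(
    ["./llama-server-cuda", "-m", "model-Q4_K_M.gguf", "--port", "8080"],
    env=cuda_env,
)

# Instance 2: Vulkan build, pinned to the AMD cards.
vulkan_env = os.environ.copy()
vulkan_env["GGML_VK_VISIBLE_DEVICES"] = "0,1,2"  # placeholder indices

vulkan = subprocess.Popen(
    ["./llama-server-vulkan", "-m", "model-Q4_K_M.gguf", "--port", "8081"],
    env=vulkan_env,
)

cuda.wait()
vulkan.wait()
```

Each instance then serves its own OpenAI-compatible endpoint, and a small router in front can pick between them.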
At the same time, I feel like each of these GPUs should still be capable of running decent models on their own, especially with the right GGUF quants.
I'm ultimately chasing a DeepSeek model, but for now I think Qwen3.6 (uncensored 35b) is my go-to
(I've only tested it with the 4060 & 3090, and damn, it's impressive).
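On the "right GGUF quants" point, here's the back-of-the-envelope sizing I've been using to guess what fits where; the bits-per-weight values are approximations for common llama.cpp quants, and real files plus KV cache will differ:

```python
# Very rough GGUF sizing: params * bits-per-weight / 8, plus KV-cache headroom.
QUANT_BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def fits(params_b: float, quant: str, vram_gb: float, kv_gb: float = 1.5) -> bool:
    """Check whether weights plus a rough KV-cache allowance fit in VRAM."""
    weights_gb = params_b * QUANT_BPW[quant] / 8
    return weights_gb + kv_gb <= vram_gb

for quant in QUANT_BPW:
    print(f"35b {quant}: 12 GB 3060 -> {fits(35, quant, 12)}, "
          f"24 GB 3090 -> {fits(35, quant, 24)}")
```

By that estimate, nothing at 35b fits a single 12 GB card, and only Q4_K_M and below squeeze onto a 24 GB 3090.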
HopePupal@reddit
So your 20 or 30 low-end GPUs are going into what, exactly? Just focus on building one dual- or quad-3090 rig.
zakadit@reddit (OP)
The 6600 XTs were actually a bad move; I don't know how I thought they were 12 GB (that was the 6700 XT, oops).
As for the 3060s, I think they're pretty good, 12 GB is more than decent.
Although that wasn't the question at all: 12 to 20 3060s & 8 6600 XTs + the 6700 XT and 9060 XT won't get me anywhere? I'd be disappointed for real…
HopePupal@reddit
probably nothing but a large electrical bill, yeah. multiple GPUs aren't useful unless they can fit an entire model each, or you're using them on a server motherboard with enough PCIe bandwidth to let them communicate with each other — there's a reason people in this sub have big expensive 6-slot or 8-slot boards that run on Xeon or EPYC CPUs. it doesn't sound like you have a plan to link these up.
zakadit@reddit (OP)
Disappointed as hell, you've replied twice off-topic (and with a wrong guess) for nothing.
zakadit@reddit (OP)
I've got two boards with 6 slots, actually 😅. And I'm not worried about the electrical bill…
alphatrad@reddit
Dude!! Stop wasting your multi-GPU setup with llama.cpp!!
Use vLLM or ExLlamaV2 for tensor parallelism; llama.cpp only does pipeline parallelism.
But yes, don't mix.
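In vLLM, tensor parallelism is a single argument. A minimal sketch (the model name and shard count are just illustrative):

```python
from vllm import LLM, SamplingParams

# Shard one model across two matched GPUs with tensor parallelism.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # illustrative model, swap in your own
    tensor_parallel_size=2,             # one shard per 3090
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Note that tensor parallelism wants matched GPUs, which is one more reason not to mix the NVIDIA and AMD cards in one instance.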
zakadit@reddit (OP)
Got it!