Can VRAM from 2 brands be combined?
Posted by tonyleungnl@reddit | LocalLLaMA | View on Reddit | 81 comments
Just starting into AI and ComfyUI, using a 7900XTX 24GB. It's not going as smoothly as I had hoped, so now I want to buy an Nvidia GPU with 24GB.
Q: Can I use only the Nvidia card for compute, with the VRAM of both cards combined? Do both cards need to have the same amount of VRAM?
lly0571@reddit
ComfyUI may not work.
For LLMs, the llama.cpp Vulkan backend can probably make both GPUs work together, but that backend is not fully optimized.
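Roughly something like this, if you want to try it (a sketch; the model path is a placeholder and I haven't tested this exact setup):
```bash
# Build llama.cpp with the Vulkan backend; both AMD and Nvidia expose Vulkan
# through their normal graphics drivers, so no extra compute toolkit is needed.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Offload all layers; llama.cpp will split them across every Vulkan device it finds.
./build/bin/llama-cli -m ./models/your-model.gguf -ngl 99 -p "Hello"
```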
fallingdowndizzyvr@reddit
The llama.cpp Vulkan backend is as fast or faster than ROCm/CUDA.
Evening_Ad6637@reddit
No, that's not true! The text generation speed is slightly slower under Vulkan, but acceptable.
But the prompt processing speed will suffer immensely.
fallingdowndizzyvr@reddit
It is true. I and others have shown it to be true multiple times.
https://www.reddit.com/r/LocalLLaMA/comments/1kabje8/vulkan_is_faster_tan_cuda_currently_with_llamacpp/
https://www.reddit.com/r/LocalLLaMA/comments/1iw9m8r/amd_inference_using_amdvlk_driver_is_40_faster/
Vulkan is even faster now than it was then.
Evening_Ad6637@reddit
Okay, wtf, I even upvoted your post back then, so I must have tested it myself to agree. I still can't believe it xD
I'll have to test it myself again lol
If I talked some bullshit, then sorry, my fault. But that would mean Nvidia users only need CUDA for training, and it's otherwise obsolete, right?
fallingdowndizzyvr@reddit
Dude, it's totally cool. In fact, props for posting that. Not many people would.
For most people, yes.
Evening_Ad6637@reddit
Okay, so I could at least reproduce the results for one card; for the other one, unfortunately not. I have to mention that for convenience I used LM Studio. Tomorrow I'll try with llama.cpp directly and with other models. But it's already very interesting. Here are the results from my quickie:
On an old mining card, Vulkan is approximately 5% FASTER than CUDA in text generation.
Device: NVIDIA CMP 30HX
Vulkan: time-to-first-token 0.44 s, text generation 49.5 tok/s
CUDA: time-to-first-token 0.07 s, text generation 46.2 tok/s
On a 3090 Ti, Vulkan is approximately 13% SLOWER than CUDA in text generation.
Device: NVIDIA RTX 3090 Ti
Vulkan: time-to-first-token 0.14 s, text generation 136.0 tok/s
CUDA: time-to-first-token 0.02 s, text generation 154.1 tok/s
fallingdowndizzyvr@reddit
Please use llama-bench. That's the point of it: to keep as many variables constant as possible. Ideally only one variable should change, Vulkan vs CUDA. That's how benchmarking is done. You can't do that by using LM Studio.
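For example (model path is a placeholder; same file and same settings for both runs):
```bash
# One build per backend, everything else identical
./build-vulkan/bin/llama-bench -m ./models/your-model.gguf -ngl 99 -p 512 -n 128
./build-cuda/bin/llama-bench   -m ./models/your-model.gguf -ngl 99 -p 512 -n 128
```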
Evening_Ad6637@reddit
I know, I know, it was just a quick and dirty vibe check.
AppearanceHeavy6724@reddit
Not on Nvidia.
fallingdowndizzyvr@reddit
Yes. Yes it is.
https://www.reddit.com/r/LocalLLaMA/comments/1kabje8/vulkan_is_faster_tan_cuda_currently_with_llamacpp/
AppearanceHeavy6724@reddit
I need to check myself.
a_beautiful_rhind@reddit
Run ComfyUI in an AMD environment and the LLM in the opposite environment. Install both drivers on the host system.
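Something along these lines (a sketch; the ROCm wheel index version is whatever PyTorch currently ships, so adjust it):
```bash
# Venv 1: ComfyUI on the 7900XTX with the ROCm build of PyTorch
python -m venv ~/venvs/comfy-rocm
source ~/venvs/comfy-rocm/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Separately: llama.cpp built against CUDA for the Nvidia card
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
```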
fallingdowndizzyvr@reddit
For LLMs, yes, you can "combine" the RAM and run larger models. The cards do not have to be the same anything.
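For example, with llama.cpp you can even weight how much of the model each card gets if their free VRAM differs (a sketch; the split ratio and model path are made up):
```bash
# Offload everything, giving roughly 60% of the layers to the first GPU and 40% to the second
./build/bin/llama-cli -m ./models/big-model.gguf -ngl 99 --tensor-split 60,40
```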
But since you mention ComfyUI, I take it you want to do image/video gen too. It won't help for that. Other than maybe Wan, I don't know of a model that can be split across GPUs for image/video gen. You might be able to run different parts of the workflow on different GPUs to conserve RAM, but you might as well do offloading.
CommunityTough1@reddit
Correct me if I'm wrong, but I don't think you can combine the VRAM from AMD and Nvidia cards.
CatalyticDragon@reddit
You indeed can! Pipeline, tensor, data, expert: there are many types of parallelism, and they all work with a mix of GPUs.
a_beautiful_rhind@reddit
Only with Vulkan. I dunno what PyTorch does if you split across AMD + Nvidia. Probably fails.
fallingdowndizzyvr@reddit
No. You can do it running CUDA on Nvidia and ROCm on AMD. It's not only with Vulkan.
a_beautiful_rhind@reddit
Splitting the same model?
fallingdowndizzyvr@reddit
Yes. You can split a model between a GPU running CUDA and a GPU running ROCm. I've posted that so many times. I'm surprised this is news to you.
a_beautiful_rhind@reddit
It's news to me that you can do it without using vulkan.
fallingdowndizzyvr@reddit
What are my favorite things about llama.cpp? Vulkan and RPC. You can use CUDA and ROCm together through RPC. Spin up an RPC server using CUDA and then run the master llama-cli using ROCm.
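Roughly like this (a sketch; port and paths are placeholders, and the ROCm flag was spelled GGML_HIPBLAS on older versions):
```bash
# Build with CUDA + RPC and start an rpc-server that owns the Nvidia card
cmake -B build-cuda -DGGML_CUDA=ON -DGGML_RPC=ON && cmake --build build-cuda -j
./build-cuda/bin/rpc-server -p 50052

# Build with ROCm + RPC and run the master, pointing it at the rpc-server
cmake -B build-rocm -DGGML_HIP=ON -DGGML_RPC=ON && cmake --build build-rocm -j
./build-rocm/bin/llama-cli -m ./models/big-model.gguf -ngl 99 --rpc 127.0.0.1:50052
```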
a_beautiful_rhind@reddit
Now now.. that's quite the caveat. RPC has overhead. It's not the same as running one llama.cpp instance and having it use both cards to split the same model. If you can't do that, then it's still kinda like it was.
m18coppola@reddit
Yeah, you just have to enable all the needed backends in the cmake flags, and then they will show up as available devices in llama.cpp
fallingdowndizzyvr@reddit
I've never been able to get that to work. Have you? It doesn't seem like it should work, since llama.cpp is very ifdef-heavy. So if it's ifdef'd for CUDA, that overrides the ifdef for ROCm.
m18coppola@reddit
It works because it just builds the shared library multiple times. You'd have one .so/.dll file for the CUDA ifdefs and another .so/.dll file for the ROCm ifdefs. See the "Notes about GPU-accelerated backends" here. Pinging u/a_beautiful_rhind too; I think this was added back when they deprecated Makefile support.
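Something like this (a sketch; the flag names have moved around between llama.cpp versions, so check the current build docs):
```bash
# Build each backend as its own loadable library in one tree
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release -j

# The CUDA and Vulkan devices should then both show up at runtime
./build/bin/llama-server --list-devices
```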
a_beautiful_rhind@reddit
The Makefile thing was at the end of April, I think. I remember having to switch to ccmake to save build parameters.
The docs say you can build it with all backends included, but I didn't know they'd play nice.
fallingdowndizzyvr@reddit
Ah... that would explain it. I haven't tried in a while. Definitely pre-cmake.
a_beautiful_rhind@reddit
When did they add that? Wouldn't stuff like FA be incompatible across kernels?
fallingdowndizzyvr@reddit
Yes. Yes, you can. I do it all the time. I recently posted numbers again doing it.
https://www.reddit.com/r/LocalLLaMA/comments/1le951x/gmk_x2amd_max_395_w128gb_first_impressions/
Such_Advantage_6949@reddit
Even for LLMs, not really. Even within a single brand, there are a lot of driver and compatibility issues.
fallingdowndizzyvr@reddit
That's not true at all. I run AMD, Intel, Nvidia and for a bit of spice a Mac all together to run big models. It couldn't be easier.
Such_Advantage_6949@reddit
How do you run it? How many backends actually support this? Is the speed as fast as running with the same brand?
fallingdowndizzyvr@reddit
The easy thing to do is to use Vulkan on all of them except the Mac. For that, use Metal. If you must, you could run ROCm/CUDA instead, but why?
Having the same brand doesn't really change anything. Having the same card doesn't really change anything. What does matter is whether you do tensor parallel. For that you would need identical cards and a motherboard with enough at-least-x4 slots to host those cards. But that's not what OP is asking about.
Such_Advantage_6949@reddit
Does running Vulkan give the same speed as CUDA on an Nvidia card?
fallingdowndizzyvr@reddit
Yes, sometimes even better. There have been threads about it. Go look.
Such_Advantage_6949@reddit
Nah, I am happy with vLLM and tensor parallel. I don't think vLLM supports Vulkan, so it will be slower regardless.
fallingdowndizzyvr@reddit
So you don't have any experience with using multiple types of GPUs then do you? You are just making stuff up.
SashaUsesReddit@reddit
So now he's making stuff up too since you don't know what vllm is or how tensor parallelism works?
fallingdowndizzyvr@reddit
LOL. So you admit you were making stuff up.
Ah... try reading. You just proved that you don't know what "vllm is or how tensor parallelism works". You just exposed your lie, since if you did, you would know you can't get tensor parallelism working with "multiple types of GPUs".
Dude, is there nothing you don't lie about?
SashaUsesReddit@reddit
My lie? What is that exactly?
So now you are saying that there are issues with multi brand? Contrary to your other comments?
Such_Advantage_6949@reddit
No point arguing with him, actually. I don't think he has experience with any such setup. Even with the same brand, I have issues getting the driver and CUDA working for my 5090 alongside my 4090 and 3090, lol. Let alone other brands.
SashaUsesReddit@reddit
Fair, hah. Blackwell has been a real pain for implementation. DM me if you want some early nvfp4 cuts of vLLM for your 5090. Just finished the sample dockers last week.
LLM arguments on Reddit are my late-night whiskey sport lol.. I know I shouldn't, but it's entertaining ha
Such_Advantage_6949@reddit
I am not even there yet; right now I am trying to figure out why my 5090 shows artifacts if I use more than 2 screens on Linux ☹️
SashaUsesReddit@reddit
Open driver or proprietary?
I only have luck with the 575 open
Such_Advantage_6949@reddit
I am trying both, and different versions of 575, 573, etc. Can you give me the exact version you use so I can try? You download it from the Nvidia website, right?
SashaUsesReddit@reddit
Yeah, direct from Nvidia with their keyring, with nvidia-open and the GL libs etc. I'm headed to bed, but I can send you what's in my install in the morning with the exact versions.
Such_Advantage_6949@reddit
Thank you in advance.
SashaUsesReddit@reddit
Driver Version: 575.57.08 CUDA Version: 12.9
Are you on Ubuntu? I can send you my full install stack.
fallingdowndizzyvr@reddit
I literally said it in the post you responded to. I literally answered the question you are asking in the very next sentence. Go read it.
You really don't understand how anything works, do you?
Evening_Ad6637@reddit
No, of course not! The text generation speed is slightly slower under Vulkan, but really acceptable.
But the prompt processing speed will suffer immensely.
FieldProgrammable@reddit
There seem to be some strange comments in this thread. I would say that if you want an easy time setting this up, then absolutely do not mix brands. Just mixing different generations of the same brand can be a problem, let alone getting two very different compute platforms to behave optimally with each other. My advice: if you want more VRAM, stick with AMD and live with the consequences (namely that it has less support than CUDA for many ML tasks beyond LLMs). If you want a CUDA card for that reason, then expect not to be able to share a model between them.
In terms of ComfyUI, diffusion models are much less tolerant of multi-GPU setups than LLMs. You would need a special set of "Multi-GPU" nodes just to do anything, and those are really designed for putting the VAE and embedding models on a separate GPU from the latent space and diffusion model. Splitting the diffusion model itself can be done with something like the DisTorch multi-GPU node, but that isn't particularly stable and won't perform nearly as well as a single GPU.
It might be theoretically possible, with hours of research, to get an LLM running in one particular configuration with Vulkan. But do yourself a favour and spend that time, money and energy doing something you enjoy rather than fighting obscure driver and library conflicts based on random anonymous forums.
fallingdowndizzyvr@reddit
That's absolutely not true. It's trivially simple to mix brands.
Have you ever tried? I do it all the time. It's trivial.
Ah... what? It's trivial to get Vulkan working on one GPU or a gaggle of GPUs together. It's far easier to get Vulkan working than CUDA or ROCm. Vulkan is built into the driver for pretty much any GPU. There's nothing to install. Just download your LLM program that supports Vulkan and go. It's the closest thing to "plug and play".
Do yourself a favor and give Vulkan a try, since it's clear you have never even tried and thus are speaking from a position of ignorance.
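If you want to sanity-check what the driver already exposes (vulkaninfo ships in the vulkan-tools package on most distros):
```bash
# Lists every GPU the installed Vulkan drivers expose -- nothing else to install
vulkaninfo --summary | grep -i devicename
```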
FieldProgrammable@reddit
And it's pretty clear you're speaking from a position of arrogance.
fallingdowndizzyvr@reddit
I'd rather people speak truth from a position of arrogance than made-up lies from a position of ignorance.
FieldProgrammable@reddit
You'd rather spew subjective statements like "this is trivial". Have you even asked what OS OP is running? You seem to have a high opinion of your own knowledge; perhaps once OP has bought both an AMD and an Nvidia card and is struggling, you can provide him with technical support in getting it running.
fallingdowndizzyvr@reddit
I'd rather people speak truth from a position of arrogance than made-up shit from a position of ignorance.
SashaUsesReddit@reddit
No mixing and matching of brands.
Also, some mixing and matching can work with same-brand GPUs... but it's hit or miss depending on the application and the compute level required (fp16, fp8, etc.)
reacusn@reddit
What if you use Vulkan on the Nvidia GPU? Is that possible?
SashaUsesReddit@reddit
Device drivers and libs will have conflicts all over the place. If you had trouble just with AMD, this would be hell
fallingdowndizzyvr@reddit
That's just user error. I don't have those problems.
SashaUsesReddit@reddit
So.. your performance is just terrible as a consequence
fallingdowndizzyvr@reddit
LOL. You said you couldn't even do it because "drivers and libs will have conflicts all over the place". Now you say the "performance is just terrible". How would you know? You've never been able to do it.
There are no "device drivers and libs" conflicts, let alone all over the place. And the performance is just fine. There is a performance penalty for going multi-GPU, but that's because it's multi-GPU and thus there is a loss of efficiency.
SashaUsesReddit@reddit
That's absolutely not the case. There are drivers and libs that will break all over the place. P2P memory won't function correctly without heavy system root load, there will be serious function-level issues when trying to do fp16 or fp8 operations, and tensor parallelism will scale negatively if you can even get it to actually work (real parallelism, not just slow memory sharding).
Being broken, to me, includes the perf being a total waste of time and money.
fallingdowndizzyvr@reddit
That is absolutely not the case. Please stop making stuff up.
SashaUsesReddit@reddit
I'm sure you also have a car with 4 different-size wheels and are happy it gets up to 10 mph.
Grow up. This person is looking to actively spend money lol
fallingdowndizzyvr@reddit
Still making stuff up I see. What you said doesn't even make any sense. You don't have any understanding of how multi-gpu works do you?
SashaUsesReddit@reddit
Yeah man... I do this for a living. I'm one of the people writing P2P, fp4/6/8 and NCCL fixes for vLLM lol
I hope you find whatever peace you want, buddy. I'm not telling you that you're bad for having something that works for you. I'm saying that if he's going to spend money, there's a much better way of accomplishing the goal.
fallingdowndizzyvr@reddit
Uh huh. Sure..... you don't even know you can't do tensor parallel with vllm on "multiple types of GPUs". Even an absolute newb to vllm knows that.
SashaUsesReddit@reddit
That's what I've been saying this whole time, dipshit.
Some applications may work, but compute types and P2P will break.
fallingdowndizzyvr@reddit
Well then, clearly you don't know that you don't need to use tensor parallelism to do multi-GPU. Not at all. I guess you haven't googled that yet.
fallingdowndizzyvr@reddit
Your BS machine never stops does it? Your BS doesn't even make any sense. You have no idea what you are talking about.
fallingdowndizzyvr@reddit
As a community, we should speak about things we know about. Things we have experience doing. Not making stuff up when we have no idea what we are talking about.
fallingdowndizzyvr@reddit
That's not true at all. I run AMD, Intel, Nvidia and for a bit of spice a Mac all together to run big models.
SashaUsesReddit@reddit
Oof. Sorry for your performance.
fallingdowndizzyvr@reddit
How would you know? You've never done it.
SashaUsesReddit@reddit
Good comment, enjoy your duct tape.
I'm here to make and suggest good purchases for the community. Why encourage him to do this when you know it'll be crap?
fallingdowndizzyvr@reddit
You only seem to be here to make up stuff about things you know nothing about.
Rich_Repeat_22@reddit
If you are using Windows, before you buy another card, please have a look at this guide to using ROCm with the 7900XTX on Windows with ComfyUI.
https://youtu.be/gfcOt1-3zYk
I used it and it works on the 7900XT; as you can see from the comments, it can be used with all 7000 and 9000 series cards within 10 minutes.
Threatening-Silence-@reddit
Vulkan.
altoidsjedi@reddit
For autoregressive LLM models, especially using frameworks like llama.cpp, you CAN mix, match, and combine GPUs/VRAM. So you can run a single large LLM across 2 or more GPUs.
However, for image and video diffusion models, such as those you might be using through ComfyUI, you generally cannot split the model to run across multiple GPUs, even if they are totally identical in model, make, and VRAM capacity.
Each diffusion model must fit and run entirely within a single GPU. The model architecture does not allow for splitting the model or pooling GPUs together. There may be some exceptions to this, but none that I'm familiar with.
What you can perhaps do instead, however, is run two separate instances of a diffusion model simultaneously, one on each GPU.
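For example, something like this (a sketch; I believe ComfyUI has a --cuda-device flag, but check main.py --help, and with a mixed AMD + Nvidia box each instance would need its own ROCm or CUDA PyTorch install):
```bash
# One ComfyUI instance per GPU, each on its own port
python main.py --cuda-device 0 --port 8188   # first GPU
python main.py --cuda-device 1 --port 8189   # second GPU
```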