Can VRAM from 2 brands be combined?
Posted by tonyleungnl@reddit | LocalLLaMA | View on Reddit | 81 comments
Just starting into AI and ComfyUI, using a 7900XTX 24GB. It's not going as smoothly as I had hoped, so now I want to buy an Nvidia GPU with 24GB.
Q: Can I use only the Nvidia card for compute, with the VRAM of both cards combined? Do both cards need to have the same amount of VRAM?
lly0571@reddit
ComfyUI may not work.
For LLMs, the llama.cpp Vulkan backend can probably make both GPUs work together, but that backend is not fully optimized.
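Roughly something like this, if you want to try it (a sketch; the model path is a placeholder and I haven't tested this exact setup):
```bash
# Build llama.cpp with the Vulkan backend; both AMD and Nvidia expose Vulkan
# through their normal graphics drivers, so no extra compute toolkit is needed.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Offload all layers; llama.cpp will split them across every Vulkan device it finds.
./build/bin/llama-cli -m ./models/your-model.gguf -ngl 99 -p "Hello"
```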
fallingdowndizzyvr@reddit
The llama.cpp Vulkan backend is as fast or faster than ROCm/CUDA.
Evening_Ad6637@reddit
No, that's not true! The text generation speed is slightly slower under Vulkan, but acceptable.
But the prompt processing speed will suffer immensely.
fallingdowndizzyvr@reddit
It is true. I and others have shown it to be true multiple times.
https://www.reddit.com/r/LocalLLaMA/comments/1kabje8/vulkan_is_faster_tan_cuda_currently_with_llamacpp/
https://www.reddit.com/r/LocalLLaMA/comments/1iw9m8r/amd_inference_using_amdvlk_driver_is_40_faster/
Vulkan is even faster now than it was then.
Evening_Ad6637@reddit
Okay, wtf, I even upvoted your post back then, so I must have tested it myself to agree. I still can't believe it xD
I'll have to test it myself again lol
If I talked some bullshit, then sorry, my fault. But that would mean Nvidia users only need CUDA for training, and it's otherwise obsolete, right?
fallingdowndizzyvr@reddit
Dude, it's totally cool. In fact, props for posting that. Not many people would.
For most people, yes.
Evening_Ad6637@reddit
Okay, so I could at least reproduce the results for one card; for the other one, unfortunately not. I have to mention that for convenience I used LM Studio. Tomorrow I'll try with llama.cpp directly and with other models. But it's already very interesting. Here are the results from my quickie:
On an old mining card, Vulkan is approximately 5% FASTER than CUDA in text generation.
Device: NVIDIA CMP 30HX
Vulkan: time-to-first-token 0.44 s, text generation 49.5 tok/s
CUDA: time-to-first-token 0.07 s, text generation 46.2 tok/s
On a 3090 Ti, Vulkan is approximately 13% SLOWER than CUDA in text generation.
Device: NVIDIA RTX 3090 Ti
Vulkan: time-to-first-token 0.14 s, text generation 136.0 tok/s
CUDA: time-to-first-token 0.02 s, text generation 154.1 tok/s
fallingdowndizzyvr@reddit
Please use llama-bench. That's the point of it: to keep as many variables constant as possible. Ideally only one variable should change, Vulkan vs CUDA. That's how benchmarking is done. You can't do that by using LM Studio.
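For example (model path is a placeholder; same file and same settings for both runs):
```bash
# One build per backend, everything else identical
./build-vulkan/bin/llama-bench -m ./models/your-model.gguf -ngl 99 -p 512 -n 128
./build-cuda/bin/llama-bench   -m ./models/your-model.gguf -ngl 99 -p 512 -n 128
```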
Evening_Ad6637@reddit
I know, I know, it was just a quick and dirty vibe check.
AppearanceHeavy6724@reddit
Not on Nvidia.
fallingdowndizzyvr@reddit
Yes. Yes it is.
https://www.reddit.com/r/LocalLLaMA/comments/1kabje8/vulkan_is_faster_tan_cuda_currently_with_llamacpp/
AppearanceHeavy6724@reddit
I need to check myself.
a_beautiful_rhind@reddit
Run ComfyUI in an AMD environment and the LLM in the opposite environment. Install both drivers on the host system.
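Something along these lines (a sketch; the ROCm wheel index version is whatever PyTorch currently ships, so adjust it):
```bash
# Venv 1: ComfyUI on the 7900XTX with the ROCm build of PyTorch
python -m venv ~/venvs/comfy-rocm
source ~/venvs/comfy-rocm/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Separately: llama.cpp built against CUDA for the Nvidia card
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
```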
fallingdowndizzyvr@reddit
For LLMs, yes, you can "combine" the RAM and run larger models. The cards do not have to be the same anything.
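For example, with llama.cpp you can even weight how much of the model each card gets if their free VRAM differs (a sketch; the split ratio and model path are made up):
```bash
# Offload everything, giving roughly 60% of the layers to the first GPU and 40% to the second
./build/bin/llama-cli -m ./models/big-model.gguf -ngl 99 --tensor-split 60,40
```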
But since you mention ComfyUI, I take it you want to do image/video gen too. It won't help for that. Other than maybe Wan, I don't know of a model that can be split across GPUs for image/video gen. You might be able to run different parts of the workflow on different GPUs to conserve RAM, but you might as well do offloading.
CommunityTough1@reddit
Correct me if I'm wrong, but I don't think you can combine the VRAM from AMD and Nvidia cards.
CatalyticDragon@reddit
You indeed can! Pipeline, tensor, data, expert: there are many types of parallelism, and they all work with a mix of GPUs.
a_beautiful_rhind@reddit
Only with Vulkan. I dunno what PyTorch does if you split across AMD + Nvidia. Probably fails.
fallingdowndizzyvr@reddit
No. You can do it running CUDA on Nvidia and ROCm on AMD. It's not only with Vulkan.
a_beautiful_rhind@reddit
Splitting the same model?
fallingdowndizzyvr@reddit
Yes. You can split a model between a GPU running CUDA and a GPU running ROCm. I've posted that so many times. I'm surprised this is news to you.
a_beautiful_rhind@reddit
It's news to me that you can do it without using vulkan.
fallingdowndizzyvr@reddit
What are my favorite things about llama.cpp? Vulkan and RPC. You can use CUDA and ROCm together through RPC. Spin up an RPC server using CUDA and then run the master llama-cli using ROCm.
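Roughly like this (a sketch; port and paths are placeholders, and the ROCm flag was spelled GGML_HIPBLAS on older versions):
```bash
# Build with CUDA + RPC and start an rpc-server that owns the Nvidia card
cmake -B build-cuda -DGGML_CUDA=ON -DGGML_RPC=ON && cmake --build build-cuda -j
./build-cuda/bin/rpc-server -p 50052

# Build with ROCm + RPC and run the master, pointing it at the rpc-server
cmake -B build-rocm -DGGML_HIP=ON -DGGML_RPC=ON && cmake --build build-rocm -j
./build-rocm/bin/llama-cli -m ./models/big-model.gguf -ngl 99 --rpc 127.0.0.1:50052
```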
a_beautiful_rhind@reddit
Now now.. that's quite the caveat. RPC has overhead. It's not the same as running one llama.cpp instance and having it use both cards to split the same model. If you can't do that, then it's still kinda like it was.
m18coppola@reddit
Yeah, you just have to enable all the needed backends in the cmake flags, and then they will show up as available devices in llama.cpp
fallingdowndizzyvr@reddit
I've never been able to get that to work. Have you? It doesn't seem like it should work, since llama.cpp is very ifdef-heavy. So if it's ifdef'd for CUDA, that overrides the ifdef for ROCm.
m18coppola@reddit
It works because it just builds the shared library multiple times. You'd have one .so/.dll file for the CUDA ifdefs and another .so/.dll file for the ROCm ifdefs. See the "Notes about GPU-accelerated backends" here. Pinging u/a_beautiful_rhind too; I think this was added back when they deprecated Makefile support.
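Something like this (a sketch; the flag names have moved around between llama.cpp versions, so check the current build docs):
```bash
# Build each backend as its own loadable library in one tree
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release -j

# The CUDA and Vulkan devices should then both show up at runtime
./build/bin/llama-server --list-devices
```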
a_beautiful_rhind@reddit
The Makefile thing was at the end of April, I think. I remember having to switch to ccmake to save build parameters.
The docs say you can build it with all backends included, but I didn't know they'd play nice.
fallingdowndizzyvr@reddit
Ah... that would explain it. I haven't tried in a while. Definitely pre-cmake.
a_beautiful_rhind@reddit
When did they add that? Wouldn't stuff like FA be incompatible across kernels?
fallingdowndizzyvr@reddit
Yes. Yes, you can. I do it all the time. I recently posted numbers again doing it.
https://www.reddit.com/r/LocalLLaMA/comments/1le951x/gmk_x2amd_max_395_w128gb_first_impressions/
Such_Advantage_6949@reddit
Even for LLMs, not really. Even within a single brand, there are a lot of driver and compatibility issues.
fallingdowndizzyvr@reddit
That's not true at all. I run AMD, Intel, Nvidia and for a bit of spice a Mac all together to run big models. It couldn't be easier.
Such_Advantage_6949@reddit
How do you run it? How many backends actually support this? Is the speed as fast as running with the same brand?
fallingdowndizzyvr@reddit
The easy thing to do is to use Vulkan on all of them except the Mac. For that, use Metal. If you must, you could run ROCm/CUDA instead, but why?
Having the same brand doesn't really change anything. Having the same card doesn't really change anything. What does matter is whether you do tensor parallel. For that you would need identical cards and a motherboard with enough at-least-x4 slots to host those cards. But that's not what OP is asking about.
Such_Advantage_6949@reddit
Does running Vulkan give the same speed as CUDA on an Nvidia card?
fallingdowndizzyvr@reddit
Yes, sometimes even better. There have been threads about it. Go look.
Such_Advantage_6949@reddit
Nah, I am happy with vLLM and tensor parallel. I don't think vLLM supports Vulkan, so it will be slower regardless.
fallingdowndizzyvr@reddit
So you don't have any experience with using multiple types of GPUs then do you? You are just making stuff up.
SashaUsesReddit@reddit
So now he's making stuff up too since you don't know what vllm is or how tensor parallelism works?
fallingdowndizzyvr@reddit
LOL. So you admit you were making stuff up.
Ah... try reading. You just proved that you don't know what "vllm is or how tensor parallelism works". You just exposed your lie, since if you did, you would know you can't get tensor parallelism working with "multiple types of GPUs".
Dude, is there nothing you don't lie about?
SashaUsesReddit@reddit
My lie? What is that exactly?
So now you are saying that there are issues with multi brand? Contrary to your other comments?
Such_Advantage_6949@reddit
No point arguing with him, actually. I don't think he has experience with any such setup. Even with the same brand, I have issues getting the driver and CUDA working for my 5090 alongside my 4090 and 3090, lol. Let alone other brands.
SashaUsesReddit@reddit
Fair, hah. Blackwell has been a real pain for implementation. DM me if you want some early nvfp4 cuts of vLLM for your 5090. Just finished the sample dockers last week.
LLM arguments on Reddit are my late-night whiskey sport lol.. I know I shouldn't, but it's entertaining ha
Such_Advantage_6949@reddit
I am not even there yet; right now I am trying to figure out why my 5090 shows artifacts if I use more than 2 screens on Linux ☹️
SashaUsesReddit@reddit
Open driver or proprietary?
I only have luck with the 575 open
Such_Advantage_6949@reddit
I am trying both, and different versions of 575, 573, etc. Can you give me the exact version you use so I can try? You download it from the Nvidia website, right?
SashaUsesReddit@reddit
Yeah, direct from Nvidia with their keyring, with nvidia-open and the GL libs etc. I'm headed to bed, but I can send you what's in my install in the morning with the exact versions.
Such_Advantage_6949@reddit
Thank you in advance.
SashaUsesReddit@reddit
Driver Version: 575.57.08 CUDA Version: 12.9
Are you on Ubuntu? I can send you my full install stack.
fallingdowndizzyvr@reddit
I literally said it in the post you responded to. I literally answered the question you are asking in the very next sentence. Go read it.
You really don't understand how anything works, do you?
Evening_Ad6637@reddit
No, of course not! The text generation speed is slightly slower under Vulkan, but really acceptable.
But the prompt processing speed will suffer immensely.
FieldProgrammable@reddit
There seem to be some strange comments in this thread. I would say that if you want an easy time setting this up, then absolutely do not mix brands. Just mixing different generations of the same brand can be a problem, let alone getting two very different compute platforms to behave optimally with each other. My advice: if you want more VRAM, stick with AMD and live with the consequences (namely that it has less support than CUDA for many ML tasks beyond LLMs). If you want a CUDA card for that reason, then expect not to be able to share a model between them.
In terms of ComfyUI, diffusion models are much less tolerant of multi-GPU setups than LLMs. You would need a special set of "Multi-GPU" nodes just to do anything, and those are really designed for putting the VAE and embedding models on a separate GPU from the latent space and diffusion model. Splitting the diffusion model itself can be done with something like the DisTorch multi-GPU node, but that isn't particularly stable and won't perform nearly as well as a single GPU.
It might be theoretically possible, with hours of research, to get an LLM running in one particular configuration with Vulkan. But do yourself a favour and spend that time, money and energy doing something you enjoy rather than fighting obscure driver and library conflicts based on random anonymous forums.
fallingdowndizzyvr@reddit
That's absolutely not true. It's trivially simple to mix brands.
Have you ever tried? I do it all the time. It's trivial.
Ah... what? It's trivial to get Vulkan working on one GPU or a gaggle of GPUs together. It's far easier to get Vulkan working than CUDA or ROCm. Vulkan is built into the driver for pretty much any GPU. There's nothing to install. Just download your LLM program that supports Vulkan and go. It's the closest thing to "plug and play".
Do yourself a favor and give Vulkan a try, since it's clear you have never even tried and thus are speaking from a position of ignorance.
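If you want to sanity-check what the driver already exposes (vulkaninfo ships in the vulkan-tools package on most distros):
```bash
# Lists every GPU the installed Vulkan drivers expose -- nothing else to install
vulkaninfo --summary | grep -i devicename
```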
FieldProgrammable@reddit
And it's pretty clear you're speaking from a position of arrogance.
fallingdowndizzyvr@reddit
I'd rather people speak truth from a position of arrogance than made-up lies from a position of ignorance.
FieldProgrammable@reddit
You'd rather spew subjective statements like "this is trivial". Have you even asked what OS OP is running? You seem to have a high opinion of your own knowledge; perhaps once OP has bought both an AMD and an Nvidia card and is struggling, you can provide him with technical support in getting it running.
fallingdowndizzyvr@reddit
I'd rather people speak truth from a position of arrogance than made-up shit from a position of ignorance.
SashaUsesReddit@reddit
No mixing and matching of brands.
Also, some mixing and matching can work with same-brand GPUs... but it's hit or miss depending on the application and the compute level required (fp16, fp8, etc.)
reacusn@reddit
What if you use Vulkan on the Nvidia GPU? Is that possible?
SashaUsesReddit@reddit
Device drivers and libs will have conflicts all over the place. If you had trouble just with AMD, this would be hell
fallingdowndizzyvr@reddit
That's just user error. I don't have those problems.
SashaUsesReddit@reddit
So.. your performance is just terrible as a consequence
fallingdowndizzyvr@reddit
LOL. You said you couldn't even do it because "drivers and libs will have conflicts all over the place". Now you say the "performance is just terrible". How would you know? You've never been able to do it.
There are no "device drivers and libs" conflicts, let alone all over the place. And the performance is just fine. There is a performance penalty for going multi-GPU, but that's because it's multi-GPU and thus there is a loss of efficiency.
SashaUsesReddit@reddit
That's absolutely not the case. There are drivers and libs that will break all over the place. P2P memory won't function correctly without heavy system root load, there will be serious function-level issues when trying to do fp16 or fp8 operations, and tensor parallelism will scale negatively if you can even get it to actually work (real parallelism, not just slow memory sharding).
Being broken, to me, includes the perf being a total waste of time and money.
fallingdowndizzyvr@reddit
That is absolutely not the case. Please stop making stuff up.
SashaUsesReddit@reddit
I'm sure you also have a car with 4 different-size wheels and are happy it gets up to 10 mph.
Grow up. This person is looking to actively spend money lol
fallingdowndizzyvr@reddit
Still making stuff up I see. What you said doesn't even make any sense. You don't have any understanding of how multi-gpu works do you?
SashaUsesReddit@reddit
Yeah man... I do this for a living. I'm one of the people writing P2P, fp4/6/8 and NCCL fixes for vLLM lol
I hope you find whatever peace you want, buddy. I'm not telling you that you're bad for having something that works for you. I'm saying that if he's going to spend money, there's a much better way of accomplishing the goal.
fallingdowndizzyvr@reddit
Uh huh. Sure..... you don't even know you can't do tensor parallel with vllm on "multiple types of GPUs". Even an absolute newb to vllm knows that.
SashaUsesReddit@reddit
That's what I've been saying this whole time, dipshit.
Some applications may work, but compute types and P2P will break.
fallingdowndizzyvr@reddit
Well then, clearly you don't know that you don't need to use tensor parallelism to do multi-GPU. Not at all. I guess you haven't googled that yet.
fallingdowndizzyvr@reddit
Your BS machine never stops does it? Your BS doesn't even make any sense. You have no idea what you are talking about.
fallingdowndizzyvr@reddit
As a community, we should speak about things we know about. Things we have experience doing. Not making stuff up when we have no idea what we are talking about.
fallingdowndizzyvr@reddit
That's not true at all. I run AMD, Intel, Nvidia and for a bit of spice a Mac all together to run big models.
SashaUsesReddit@reddit
Oof. Sorry for your performance.
fallingdowndizzyvr@reddit
How would you know? You've never done it.
SashaUsesReddit@reddit
Good comment, enjoy your duct tape.
I'm here to make and suggest good purchases for the community. Why encourage him to do this when you know it'll be crap?
fallingdowndizzyvr@reddit
You only seem to be here to make up stuff about things you know nothing about.
Rich_Repeat_22@reddit
If you are using Windows, before you buy another card, please have a look at this guide to using ROCm with the 7900XTX on Windows with ComfyUI.
https://youtu.be/gfcOt1-3zYk
I used it and it works on the 7900XT; as you can see from the comments, it can be used with all 7000 and 9000 series cards within 10 minutes.
Threatening-Silence-@reddit
Vulkan.
altoidsjedi@reddit
For autoregressive LLM models, especially using frameworks like llama.cpp, you CAN mix, match, and combine GPUs/VRAM. So you can run a single large LLM across 2 or more GPUs.
However, for image and video diffusion models, such as those you might be using through ComfyUI, you generally cannot split the model to run across multiple GPUs, even if they are totally identical in model, make, and VRAM capacity.
Each diffusion model must fit and run entirely within a single GPU. The model architecture does not allow for splitting the model or pooling GPUs together. There may be some exceptions to this, but none that I'm familiar with.
What you can perhaps do instead, however, is run two separate instances of a diffusion model simultaneously, one on each GPU.
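For example, something like this (a sketch; I believe ComfyUI has a --cuda-device flag, but check main.py --help, and with a mixed AMD + Nvidia box each instance would need its own ROCm or CUDA PyTorch install):
```bash
# One ComfyUI instance per GPU, each on its own port
python main.py --cuda-device 0 --port 8188   # first GPU
python main.py --cuda-device 1 --port 8189   # second GPU
```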