RX 7900 XTX (24 GB) + RX 6800 XT (16 GB)?
Posted by xeeff@reddit | LocalLLaMA | 21 comments
i bought an RX 7900 XTX a few days ago and i wasn't planning on buying a new power supply to have them both plugged in, but would it be possible to "combine" the VRAM from both cards for a single model? i understand it would still come with some overhead, but it'd be better than not being able to run a model at all
the other thing i'm considering is running a different model/set of models on the RX 6800 XT (like embedding, a smaller one for conversation titles or managing memories, etc.) while using the RX 7900 XTX primarily for qwen3.6-27b
either way i'd need to buy a new power supply (currently only got 850 W), so i thought i may as well ask if option A (combining 24 + 16 to run bigger/better models despite the different cards) is possible
p_235615@reddit
you can power limit both cards and you should be good with power.
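For example, with rocm-smi on Linux (wattage values are illustrative, check your cards' defaults first):
# cap board power per card; these settings reset on reboot
rocm-smi -d 0 --setpoweroverdrive 280   # RX 7900 XTX (stock ~355 W)
rocm-smi -d 1 --setpoweroverdrive 200   # RX 6800 XT (stock ~300 W)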
xeeff@reddit (OP)
realised i've got 750 W not 850 W. i don't think that'll work :(
ThisGonBHard@reddit
As someone with a similar setup (4090 + 5060 Ti), there is quite a bit of overhead depending on the tool you use.
Ell2509@reddit
Use Linux. ROCm. Llama.cpp. layer split.
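Rough sketch of the layer split (model path and ratio are placeholders; --tensor-split weights the split by the 24/16 GB VRAM ratio):
# split layers across both GPUs, weighted by VRAM
./build/bin/llama-server -m model.gguf --split-mode layer --tensor-split 24,16 -ngl 99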
Nindaleth@reddit
I haven't had any luck running on heterogeneous ROCm GPUs yet, any tips on compilation and running?
I compile using
cmake -S . -B build -DGGML_VULKAN=OFF -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100;gfx1030" -DGPU_TARGETS="gfx1100;gfx1030" -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16
With any runtime combination of parameters I crash on ggml_cuda_compute_forward: SCALE failed. I had to unset HSA_OVERRIDE_GFX_VERSION (it can't be set per-GPU, it seems) or else I would fail a short moment earlier. This is on ROCm 6.4.2.
orinoco_w@reddit
Oops tried to reply and missed.. see my other comment
orinoco_w@reddit
Here's what I used... Using the current pull from ggml llama.cpp and ROCm 7.2.2 with torch 2.10
export LLAMACPP_ROCM_ARCH="gfx908,gfx1100"
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_OPENSSL=ON \
  -DLLAMA_CURL=ON && \
cmake --build build --config Release -j$(nproc)
Works just fine with the 7900 XTX and MI100.
./build/bin/llama-server -fa 1 -ts 5/8 --hf-repo unsloth/Qwen3.6-27B-GGUF:Q8_0 --ctx-size 262144 --host 0.0.0.0 --port 8000
I have found that flash attention only seems to play nicely with _0 quants and the tensor split mode is buggy as hell, but I'm getting good tps: ~1200 t/s prompt processing (on -p 4096 tests with llama-bench) and 23 t/s text generation.
ea_man@reddit
You won't get the same speed as a single card with 40 GB of VRAM because data has to pass through PCIe and you'll run as fast as the slowest memory, but if you have slow system RAM and you often offload, it will sure help a lot.
Nindaleth@reddit
Hey, I run almost the same setup! 7900 XTX + 6700 XT in my case, "just" 36 GB combined VRAM for me. Got it set up about a week ago so it's very new for me. My specific 7900 XTX takes four slots and it took me a lot of time to find a motherboard that can fit two GPUs in a non-monstrous case under that constraint.
It allows me to run Qwen 3.6-35B-A3B in Q6_K fully offloaded with 200K context on Vulkan, pretty cool stuff! I haven't tried ROCm yet.
I just run llama-server with --parallel 2 --kv-unified so that the initial session titling happens in the background while I'm able to run a single subagent in OpenCode without reprocessing everything. This lets me reach >100K context (of the 200K total) in the main agent without any issues, because the subagent usually stays below that. (A rough sketch of the full command is at the end of this comment.)
I used to have a 500 W PSU, and for the upgrade I was torn between an 850 W and a 1000 W unit; I bought the stronger one so that I don't have to upgrade again in case I manage to score a second 7900 XTX in the future. My CPU runs in ECO mode and both GPUs run power limited and undervolted, so I have plenty of PSU headroom. I don't do it just to save my wallet, but also to push out more tokens before the GPU slows momentarily due to thermal throttling, and to generate less heat.
If you have an ATX 3.0-compliant PSU, the transient spike handling is built-in.
I agree with this other comment: for your 7900 and 6800, just power limit, undervolt and/or underclock; you can keep your current PSU as long as you have enough connectors to power the GPUs.
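The full invocation sketched out is roughly this (model filename is illustrative, flags as described above):
# fully offloaded, 200K context, two parallel slots sharing one unified KV cache
./build/bin/llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 \
  --ctx-size 200000 --parallel 2 --kv-unified \
  --host 0.0.0.0 --port 8080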
xeeff@reddit (OP)
awesome comment, would give an award. you're goated as fuck
Then-Topic8766@reddit
Go for it. I have the Nvidia version of your setup, an RTX 3090 + RTX 4060 Ti, so 24+16 GB VRAM. Every GB of VRAM matters. It works like a charm. The second card adds a lot of good options: you can load larger models, two models at a time, or a bigger LLM plus a smaller diffusion model (e.g. Z-Image). I have a 1000 W PSU and power limited the cards to 260 and 120 W.
BigYoSpeck@reddit
Yes, in llama.cpp with either Vulkan or ROCm (or, if you feel crazy, both) you can split across both cards to use the combined VRAM (I did it when I bought my first 7900 XTX, until I replaced the 6800 XT with another 7900 XTX)
Performance-wise, splitting a model that would fit on either card alone will degrade performance (I'm not sure you can use the tensor split method, which on matching cards gives a slight speed boost), but if the model or context didn't fit on a single card anyway, that point is moot. For bigger MoE models with expert layers offloaded to CPU it will also be faster, as now you can offload fewer layers (see the sketch below)
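Recent llama.cpp builds have a convenience flag for the MoE expert offload (model name and layer count are placeholders, assuming a build new enough to have --n-cpu-moe):
# keep attention and shared tensors on GPU, push the expert tensors of the
# first 8 layers to CPU; tune the count until the rest fits in VRAM
./build/bin/llama-server -m some-moe-model.gguf -ngl 99 --n-cpu-moe 8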
Power limit them though, and do more than just lowering their maximum allowed wattage: set a lower max clock, because even if your system's maximum combined sustained power load is within your PSU's specs, there can be transient spikes which, if you have a good PSU, will trip its protection, and if it's a not-so-great PSU, eventually kill it
I have a 1000 W power supply and two 7900 XTX. Even when their combined total board power was only at ~700 W and the CPU was taking a leisurely 70 W, starting a new prompt could trip the power supply. Limiting their clocks to 2.6 GHz barely made any performance difference but cut their power usage by enough
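On AMD this can be done with rocm-smi; a sketch, assuming the driver allows manual clock ranges (MHz values are illustrative):
# cap the shader clock range to tame transient spikes
# (may require the amdgpu OverDrive feature to be enabled)
rocm-smi -d 0 --setsrange 500 2600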
As long as you aren't running both cards at peak power consumption, and assuming your PSU has enough PCIe power cables, 850 W would be enough with their power and clocks limited, coupled with a mild undervolt
Krillian58@reddit
Using the small one for embedding, summarizing, reranking etc. is a good use depending on your workload. You could also use it for the KV cache of a bigger model: use 22-23 GB of the 7900 XTX and cram a 1M-token KV cache onto the other, not in full FP16 of course. Potentially run a draft model on it at the same time to speed up the main model's inference, since the bigger model might be slow (see the sketch below).
But yeah, the minimum power requirement should be met.
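For the draft model idea, llama.cpp's speculative decoding can pin the draft to the second GPU; a rough sketch (model names are placeholders, check llama-server --list-devices for your actual device names):
# main model on GPU 0, small draft model on GPU 1 for speculative decoding
./build/bin/llama-server \
  -m big-model.gguf --device ROCm0 -ngl 99 \
  -md draft-model.gguf --device-draft ROCm1 -ngld 99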
taking_bullet@reddit
There's no need to change a high-quality 850 W PSU (unless you need more 8-pin connectors). Set the lowest power limit on both cards and everything will be fine.
xeeff@reddit (OP)
I was thinking about whether that'd be the case, but it seems risky and I hadn't done the math to see if everything checks out. thanks for the suggestion, I'll have to see if this works out :)
One-Pain6799@reddit
Running the main model on the XTX and a smaller embedding model on the 6800 works fine with Ollama, but 850 W won't be enough with both cards under load; you'll need to upgrade the PSU
xeeff@reddit (OP)
I recommend you switch from Ollama to llama.cpp
a 5min read of how to use llama.cpp or going through unsloth's guide should help you. no reason to use Ollama in 2026
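for example, your two-model idea would be two llama-server instances, one per GPU, something like this (untested sketch; model names are placeholders, ROCR_VISIBLE_DEVICES controls which GPU each process sees):
# main model on the 7900 XTX
ROCR_VISIBLE_DEVICES=0 ./build/bin/llama-server -m main-model.gguf -ngl 99 --port 8000 &
# embedding model on the 6800 XT
ROCR_VISIBLE_DEVICES=1 ./build/bin/llama-server -m embed-model.gguf --embedding --port 8001 &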
LagOps91@reddit
yes, you can distribute weights across multiple GPUs. the exact overhead i'm not sure about, some data needs to be moved for sure, but multi-GPU setups are common and for the larger models it's impossible to fit them on just a single card.
xeeff@reddit (OP)
yeah i'm aware of this but i mostly see people do multi-gpu with the same kind of GPU rather than mix-and-match so I wasn't sure
Miserable-Dare5090@reddit
Yes, all possible. But I would try the 27B on the 24 GB card and the 35B MoE on the 16 GB card with RAM offloading; you should get both models going
xeeff@reddit (OP)
i've got 32 GB of DDR4 as well but i didn't like the offload as it significantly slows everything down, unfortunately. I appreciate it though.