5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4

Posted by DavidBolkonsky@reddit | LocalLLaMA | View on Reddit | 9 comments

Hi guys, just want to share a Frankenstein build I put together that is surprisingly decent.

I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to an RX 9070, then I found a great deal on a 5070ti that I couldn't pass up, thinking I would sell the 9070.

I ran Qwen 3.5 9B as well as various Stable Diffusion models on the 5070ti no problem, as expected. However, I'd been dreaming of running bigger models and wanted to see if I could make pooled VRAM from these two cards work.

After a lot of tinkering, I am now running Qwen3.6-35B-A3B-UD-Q4_K_M in llama.cpp on the Vulkan backend at over 100 tps with a 64K context window.

Another use I've found for this setup is running two turboquant llama.cpp forks side by side. Alternatively, in SillyTavern, I put the 9070 on text generation (about 50 tps) and the 5070ti on image generation, since CUDA is better for Stable Diffusion.
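For the side-by-side setup, the trick is pinning each llama.cpp instance to one GPU. A sketch of how that could look, assuming a recent llama.cpp build whose `--device` flag accepts Vulkan device names (the exact names on your machine should be confirmed with `--list-devices`; the model filenames and ports here are placeholders):

```shell
# List the Vulkan devices llama.cpp can see (names like Vulkan0 / Vulkan1 are assumptions):
.\llama-server.exe --list-devices

# Run two independent servers, one pinned to each card:
.\llama-server.exe -m model-a.gguf --device Vulkan0 --n-gpu-layers 99 --port 8080
.\llama-server.exe -m model-b.gguf --device Vulkan1 --n-gpu-layers 99 --port 8081
```

Each instance then only allocates on its own card, so the two workloads don't fight over VRAM.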

Taking this a bit further, I think this is a decent way to get a cheap 32GB VRAM setup. I got both pretty much at MSRP, which comes to just shy of $1300 total. The 9070 has a 256-bit bus and 644.6 GB/s memory bandwidth, way better than a 5070 or 5060ti, at only about 2/3 the cost of another 5070ti.
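For context on why an A3B model flies on this kind of bandwidth: decode speed on a single card is roughly capped by memory bandwidth divided by bytes read per token. A back-of-envelope sketch, where the ~3B active parameters and ~0.56 bytes/param for Q4_K_M are assumptions rather than measurements (awk assumed available, e.g. via Git Bash on Windows):

```shell
# Rough single-card decode ceiling for an A3B MoE at Q4 (assumed numbers, not measured):
awk 'BEGIN {
  active = 3e9 * 0.56    # bytes read per token: ~3B active params at ~0.56 bytes/param
  bw     = 644.6e9       # RX 9070 memory bandwidth in bytes/s
  printf "%.0f tps ceiling\n", bw / active
}'
# prints: 384 tps ceiling
```

The real number lands well below that once you add attention, KV-cache reads, and PCIe hops between the two cards, so 100+ tps measured is in a believable range.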

llama setup: .\llama-cli.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -n -1 --temp 1.0 --top-k 20 --n-gpu-layers 99 --split-mode layer --main-gpu 0 --cache-type-k q4_0 --cache-type-v q4_0 --ctx-size 65536
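With `--split-mode layer`, llama.cpp decides how many layers land on each card; if one card ends up noticeably fuller than the other, the split can be biased by hand with `--tensor-split` (relative weights per device). A sketch assuming both cards contribute 16 GB, so an even split; the ratio is a tuning knob, not a measured optimum:

```shell
# Same command as above, but with an explicit per-GPU layer ratio (1:1 for 16 GB + 16 GB):
.\llama-cli.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -n -1 --temp 1.0 --top-k 20 --n-gpu-layers 99 --split-mode layer --tensor-split 1,1 --main-gpu 0 --cache-type-k q4_0 --cache-type-v q4_0 --ctx-size 65536
```

Nudging the ratio (e.g. `--tensor-split 3,2`) can help if one card also holds the display output or the KV cache ends up lopsided.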

Curious if anyone else has a similar setup, or has any tips or advice on how to make mine better.