5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4
Posted by DavidBolkonsky@reddit | LocalLLaMA | 9 comments
Hi guys, just want to share a Frankenstein build I put together that turned out surprisingly decent
I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to an RX 9070, then I found a great deal on a 5070ti that I couldn't pass up, thinking I would sell the 9070
I ran Qwen 3.5 9B as well as various Stable Diffusion models on the 5070ti with no problem, as expected. However, I've been dreaming of running bigger models and wanted to see if I could make pooled VRAM from these two cards work.
After a lot of tinkering, I am now running Qwen3.6-35B-A3B-UD-Q4_K_M in llama.cpp on Vulkan at over 100 tps with a 64K context window.
Another use I've found for this setup is running two instances of the turboquant llama.cpp fork side by side. Alternatively, in SillyTavern, I put the 9070 on text generation (about 50 tps) and the 5070ti on image generation, since CUDA is better for Stable Diffusion.
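A rough sketch of how the two backends can be pinned to separate cards (the GGML_VK_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES variables, the device indices, and the model path are assumptions; check how your llama.cpp build and drivers enumerate the GPUs first):

# Text generation on the RX 9070 via the Vulkan backend of llama.cpp
# (assumes the 9070 shows up as Vulkan device 0; model path is a placeholder)
$env:GGML_VK_VISIBLE_DEVICES = "0"
.\llama-server.exe -m "E:\AI\Models\your-text-model.gguf" --n-gpu-layers 99 --port 8080

# Image generation on the 5070ti: CUDA only enumerates NVIDIA cards,
# so the lone 5070ti is device 0 regardless of the AMD card
$env:CUDA_VISIBLE_DEVICES = "0"
# ...then launch the Stable Diffusion backend (ComfyUI, A1111, etc.) from this shell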
Thinking about this a bit further, I think it's a decent way to get a cheap 32GB VRAM setup. I got both cards at pretty much MSRP, just shy of $1300 total. The 9070 has a 256-bit bus and 644.6 GB/s memory bandwidth, way better than a 5070 or 5060ti, and it costs only about 2/3 as much as another 5070ti.
llama setup:
.\llama-cli.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
    -n -1 --temp 1.0 `
    --top-k 20 --n-gpu-layers 99 `
    --split-mode layer --main-gpu 0 `
    --cache-type-k q4_0 --cache-type-v q4_0 `
    --ctx-size 65536
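If you'd rather serve it over the OpenAI-compatible API (which is what the llama-benchy run further down talks to), a llama-server version of roughly the same setup might look like the sketch below; the even --tensor-split is an assumption based on both cards having 16GB, so tune it to whatever each card actually has free:

.\llama-server.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
    --n-gpu-layers 99 --split-mode layer --main-gpu 0 `
    --tensor-split 16,16 `
    --cache-type-k q4_0 --cache-type-v q4_0 `
    --ctx-size 65536 --port 8080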
Curious if anyone else has a similar setup, or any tips or advice on how to make mine better.
grumd@reddit
This setup is very interesting, could you test Qwen 3.6 27B at Q6_K? I'm running it on a 5080 + 6900xt and the speeds are bad; I wonder how much better the 9070 is
DavidBolkonsky@reddit (OP)
I ran several prompts, first with the KV cache at q8_0:
"write me a story, 2000 tokens, about the Middle Ages" = [ Prompt: 61.8 t/s | Generation: 24.4 t/s ]
"build me a landing page" = [ Prompt: 8.5 t/s | Generation: 23.9 t/s ]
I fed it all three volumes of Frankenstein: "what is the 6th word from volume 1, chapter 1?" = [ Prompt: 141.6 t/s | Generation: 11.6 t/s ], and the answer was correct
grumd@reddit
Generation speeds seem pretty good. Sadly, the prompt processing speeds don't tell me much: the first two prompts were very short, and I don't know the length of the last one. Something stable like pp4096 would be great. https://github.com/eugr/llama-benchy This is really easy to use if you have uvx.
DavidBolkonsky@reddit (OP)
uvx llama-benchy --base-url http://127.0.0.1:8080/v1 --model Qwen3.6-27B-Q6_K.gguf --pp 4096 --tg 128 --depth 0 4096 --latency-mode generation
Installed 50 packages in 2.07s
[transformers] PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.7)
Date: 2026-05-15 13:12:23
Benchmarking model: Qwen3.6-27B-Q6_K.gguf at http://127.0.0.1:8080/v1
Concurrency levels: [1]
Error loading tokenizer: Qwen3.6-27B-Q6_K.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=`
Falling back to 'gpt2' tokenizer as approximation.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 100%|████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 3.71MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 143kB/s]
vocab.json: 1.04MB [00:00, 19.1MB/s]
merges.txt: 456kB [00:00, 70.1MB/s]
tokenizer.json: 1.36MB [00:00, 27.6MB/s]
Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...
Saved text to cache: C:\Users\David\.cache\llama-benchy\cc6a0b5782734ee3b9069aa3b64cc62c.txt
[transformers] Token indices sequence length is longer than the specified maximum sequence length for this model (171736 > 1024). Running this sequence through the model will result in indexing errors
Total tokens available in text corpus: 171736
Warming up...
Warmup (User only) complete. Delta: 8 tokens (Server: 30, Local: 22)
Warmup (System+Empty) complete. Delta: 13 tokens (Server: 35, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: generation...
Average latency (generation): 917.39 ms
Running test: pp=4096, tg=128, depth=0, concurrency=1
Run 1/3 (batch size 1)...
No token_ids in response, using local tokenization
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=4096, tg=128, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
grumd@reddit
Thanks so much! That's over twice the prefill of 6900XT. Pretty cool!
DavidBolkonsky@reddit (OP)
I got my hands on a 4070ti Super. I was expecting better results since it has higher memory bandwidth than the 9070 and can run CUDA instead of Vulkan, but the difference is actually not that big, if there's one at all.
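For a more controlled CUDA vs Vulkan comparison, llama.cpp's bundled llama-bench can run fixed prompt/generation sizes (a sketch; the model path is an assumption and the pp/tg sizes just mirror what grumd asked about):

# run once with the CUDA build and once with the Vulkan build, same card, and compare
.\llama-bench.exe -m "E:\AI\Models\Qwen3.6-27B-Q6_K.gguf" -p 4096 -n 128 -ngl 99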
xspider2000@reddit
--cache-type-k q4_0 --cache-type-v q4_0 is bad, especially a low quant on the key cache. Use KV q8.
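In other words, just swap the two cache flags in the llama-cli command from the original post (a sketch; everything else stays the same):

# before
--cache-type-k q4_0 --cache-type-v q4_0
# after
--cache-type-k q8_0 --cache-type-v q8_0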
DavidBolkonsky@reddit (OP)
B660, so my 9070 is only running at x4.
Ok_Conversation3488@reddit
What motherboard did you get? I run a 12600KF, but my motherboard only has one x16 slot. I bought a 9070xt; I think I'll swap my mobo instead of selling the B580.