5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4
Posted by DavidBolkonsky@reddit | LocalLLaMA | 9 comments
Hi guys, just want to share a Frankenstein build I put together that turned out surprisingly decent
I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to an RX 9070, then I found a great deal on a 5070ti that I couldn't pass up, thinking I would sell the 9070
I ran Qwen 3.5 9B as well as various Stable Diffusion models on the 5070ti with no problem, as expected. However, I've been dreaming of running bigger models and wanted to see if I could make pooled VRAM from these two cards work.
After a lot of tinkering, I am now running Qwen3.6-35B-A3B-UD-Q4_K_M in llama.cpp on Vulkan at over 100 tps with a 64K context window.
Another use I've found for this setup is running two instances of the turboquant llama.cpp fork side by side. Alternatively, in SillyTavern, I put the 9070 on text generation (about 50 tps) and the 5070ti on image generation, since CUDA is better for Stable Diffusion.
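A rough sketch of how the two backends can be pinned to separate cards (the GGML_VK_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES variables, the device indices, and the model path are assumptions; check how your llama.cpp build and drivers enumerate the GPUs first):

# Text generation on the RX 9070 via the Vulkan backend of llama.cpp
# (assumes the 9070 shows up as Vulkan device 0; model path is a placeholder)
$env:GGML_VK_VISIBLE_DEVICES = "0"
.\llama-server.exe -m "E:\AI\Models\your-text-model.gguf" --n-gpu-layers 99 --port 8080

# Image generation on the 5070ti: CUDA only enumerates NVIDIA cards,
# so the lone 5070ti is device 0 regardless of the AMD card
$env:CUDA_VISIBLE_DEVICES = "0"
# ...then launch the Stable Diffusion backend (ComfyUI, A1111, etc.) from this shell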
Thinking about this a bit further, I think it's a decent way to get a cheap 32GB VRAM setup. I got both cards at pretty much MSRP, just shy of $1300 total. The 9070 has a 256-bit bus and 644.6 GB/s memory bandwidth, way better than a 5070 or 5060ti, and it costs only about 2/3 as much as another 5070ti.
llama setup:
.\llama-cli.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
    -n -1 --temp 1.0 `
    --top-k 20 --n-gpu-layers 99 `
    --split-mode layer --main-gpu 0 `
    --cache-type-k q4_0 --cache-type-v q4_0 `
    --ctx-size 65536
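If you'd rather serve it over the OpenAI-compatible API (which is what the llama-benchy run further down talks to), a llama-server version of roughly the same setup might look like the sketch below; the even --tensor-split is an assumption based on both cards having 16GB, so tune it to whatever each card actually has free:

.\llama-server.exe -m "E:\AI\Models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" `
    --n-gpu-layers 99 --split-mode layer --main-gpu 0 `
    --tensor-split 16,16 `
    --cache-type-k q4_0 --cache-type-v q4_0 `
    --ctx-size 65536 --port 8080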
Curious if anyone else has a similar setup, or any tips or advice on how to make mine better.
grumd@reddit
This setup is very interesting, could you test Qwen 3.6 27B at Q6_K? I'm running it on a 5080 + 6900xt and the speeds are bad; I wonder how much better the 9070 is
DavidBolkonsky@reddit (OP)
I ran several prompts, first with the KV cache at q8_0:
"write me a story, 2000 tokens, about the Middle Ages" = [ Prompt: 61.8 t/s | Generation: 24.4 t/s ]
"build me a landing page" = [ Prompt: 8.5 t/s | Generation: 23.9 t/s ]
I fed it all three volumes of Frankenstein: "what is the 6th word from volume 1, chapter 1?" = [ Prompt: 141.6 t/s | Generation: 11.6 t/s ], and the answer was correct
grumd@reddit
Generation speeds seem pretty good. Sadly, the prompt processing speeds don't tell me much: the first two prompts were very short, and I don't know the length of the last one. Something stable like pp4096 would be great. https://github.com/eugr/llama-benchy This is really easy to use if you have uvx.
DavidBolkonsky@reddit (OP)
uvx llama-benchy --base-url http://127.0.0.1:8080/v1 --model Qwen3.6-27B-Q6_K.gguf --pp 4096 --tg 128 --depth 0 4096 --latency-mode generation
Installed 50 packages in 2.07s
[transformers] PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.7)
Date: 2026-05-15 13:12:23
Benchmarking model: Qwen3.6-27B-Q6_K.gguf at http://127.0.0.1:8080/v1
Concurrency levels: [1]
Error loading tokenizer: Qwen3.6-27B-Q6_K.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=`
Falling back to 'gpt2' tokenizer as approximation.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 100%|████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 3.71MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 143kB/s]
vocab.json: 1.04MB [00:00, 19.1MB/s]
merges.txt: 456kB [00:00, 70.1MB/s]
tokenizer.json: 1.36MB [00:00, 27.6MB/s]
Downloading book from https://www.gutenberg.org/files/1661/1661-0.txt...
Saved text to cache: C:\Users\David\.cache\llama-benchy\cc6a0b5782734ee3b9069aa3b64cc62c.txt
[transformers] Token indices sequence length is longer than the specified maximum sequence length for this model (171736 > 1024). Running this sequence through the model will result in indexing errors
Total tokens available in text corpus: 171736
Warming up...
Warmup (User only) complete. Delta: 8 tokens (Server: 30, Local: 22)
Warmup (System+Empty) complete. Delta: 13 tokens (Server: 35, Local: 22)
Running coherence test...
Coherence test PASSED.
Measuring latency using mode: generation...
Average latency (generation): 917.39 ms
Running test: pp=4096, tg=128, depth=0, concurrency=1
Run 1/3 (batch size 1)...
No token_ids in response, using local tokenization
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Running test: pp=4096, tg=128, depth=4096, concurrency=1
Run 1/3 (batch size 1)...
Run 2/3 (batch size 1)...
Run 3/3 (batch size 1)...
Printing results in MD format:
grumd@reddit
Thanks so much! That's over twice the prefill of 6900XT. Pretty cool!
DavidBolkonsky@reddit (OP)
I got my hands on a 4070ti Super. I was expecting better results since it has higher memory bandwidth than the 9070 and can run CUDA instead of Vulkan, but the difference is actually not that big, if there's one at all.
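For a more controlled CUDA vs Vulkan comparison, llama.cpp's bundled llama-bench can run fixed prompt/generation sizes (a sketch; the model path is an assumption and the pp/tg sizes just mirror what grumd asked about):

# run once with the CUDA build and once with the Vulkan build, same card, and compare
.\llama-bench.exe -m "E:\AI\Models\Qwen3.6-27B-Q6_K.gguf" -p 4096 -n 128 -ngl 99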
xspider2000@reddit
--cache-type-k q4_0 --cache-type-v q4_0 is bad, especially a low quant on the key cache. Use KV q8.
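In other words, just swap the two cache flags in the llama-cli command from the original post (a sketch; everything else stays the same):

# before
--cache-type-k q4_0 --cache-type-v q4_0
# after
--cache-type-k q8_0 --cache-type-v q8_0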
DavidBolkonsky@reddit (OP)
B660, so my 9070 is only running at x4.
Ok_Conversation3488@reddit
What motherboard did you get? I run a 12600KF, but my motherboard only has one x16 slot. I bought a 9070xt; I think I'll swap my mobo instead of selling the B580.