Performance Benchmark - Qwen3.5 & Gemma4 on dual GPU setup (RTX 4070 + RTX 3060)
Posted by DracoTorpedo@reddit | LocalLLaMA | 16 comments
Hi everyone,
Been following a lot of local LLM talk in this forum lately—learned quite a bit from you all! This is my first post, hopefully not my last. I wanted to share some interesting benchmarks I did in my free time testing out a dual-GPU setup.
Hardware Specs:
- CPU: Ryzen 7 7700X (slightly undervolted for lower temps; performance is effectively stock)
- RAM: 32 GB DDR5 @ 6000 MHz
- Motherboard: MSI B650 Tomahawk Wifi
- GPU Setup:
- Primary: RTX 4070 (12 GB) at PCIe 4.0 x16
- Secondary: RTX 3060 (12 GB) at PCIe 4.0 x2 (Note: this is a new addition. My mobo only allows x2 on the second slot, via the chipset, but I wanted more VRAM for bigger models without breaking the bank.)
Software Setup:
- OS: Win 11 + latest Nvidia drivers (595.97)
- LMStudio v0.4.11 Build 1 (Latest as of writing)
- I started with Ollama a year ago but graduated to LMStudio because it makes downloading models and tweaking settings so much easier for an enthusiast like me. I briefly tried llama.cpp on a production server in the past, but LMStudio's UI and ease of setup alone keep bringing me back 😅
- Split Strategy: Priority Order: 1. RTX 4070, 2. RTX 3060
- Model Loading Guardrails: Relaxed
The "Llama_benchy" Metrics:
- pp12000: Prompt processing / prefill speed on a 12,000-token input (simulates my opencode usage).
- tg32: Short generation speed (quick replies).
- tg4096: Sustained generation speed (long outputs).
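For anyone who wants to reproduce this style of test outside LMStudio, llama.cpp's bundled llama-bench covers the same three metrics. This is only a sketch (the model path is a placeholder, and exact flag behavior may vary by build):

```shell
# Sketch: reproducing pp12000 / tg32 / tg4096 with llama.cpp's llama-bench.
# -p sets the prefill length, -n the generation lengths (comma-separated),
# -ngl 99 offloads all layers to GPU, -fa 1 enables flash attention.
llama-bench -m ./Qwen3.5-35B-A3B-Q4_K_S.gguf -p 12000 -n 32,4096 -ngl 99 -fa 1
```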
I’ve had a blast with the Qwen3.5 series lately—especially the 35BA3B model. It was already fast on my old setup (4070 + RAM offload), but adding the RTX 3060 gives me way more headroom. I tested these 4 models:
- Bartowski Qwen3.5 35BA3B Q4KS @ 50k context
- Jackrong qwopus3.5-27b-v3 Q4KM @ 50k context
- Unsloth Gemma4-26BA4B Q4KM @ 60k context
- Unsloth Gemma4-31B-IT Q4KM @ 15k context (Higher context wouldn't fit in my VRAM)
All models used max_concurrent_preds=1, full GPU offload, and flash attention enabled.
Benchmark Results:
[Prompt Processing Speed - Dual GPU]()
[Token Generation - Dual GPU]()
[Time to first response - Dual GPU]()
Analysis:
- Gemma4 26B-A4B vs Qwen3.5 35B-A3B: Gemma4 was slightly faster on prompt processing (around 15.6% faster), but for actual token generation Qwen3.5 wins hands down: at least 20% faster on short outputs and 29% faster on long ones (tg4096). In terms of output usefulness, though, Gemma4 could still win this for me in the future (once I've tested its quality); from other posts and comparisons, Gemma4 appears to be quite token-efficient!
- The Speed: Seeing speeds around 79 tok/s was honestly astonishing—so much so that the LMStudio UI was actually struggling to keep up! 😂
- The "Big Boys" (Qwopus-27b-v3 vs Gemma4-31B-IT): There was a noticeable drop in speed compared to the MoE models. Qwopus is 11% faster than Gemma4-31B in tg4096 and 20% faster in prompt processing. Even though the prompt-processing gain was sizable, the generation speed "felt" similar (18.23 tok/s for 27B vs 16.29 tok/s for 31B).
- The Context Trade-off: The extra ~4B parameters in Gemma4-31B really weigh down my context window (only 15k vs 50k with Qwopus). This might be a dealbreaker for coding, though it may still be useful for deep architectural tasks. The architecture is still quite new; hopefully further refinements optimize it the way Qwen3.5 has been optimized.
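The percentage figures above are just throughput ratios. As a quick sanity check on the dense-model numbers quoted above (note the result lands at roughly 11-12% depending on which side you use as the baseline):

```python
def pct_faster(fast: float, slow: float) -> float:
    """Relative speedup of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0

# Dense models, tg4096 (tok/s figures quoted in the post)
qwopus_tg, gemma31_tg = 18.23, 16.29
print(f"Qwopus vs Gemma4-31B: {pct_faster(qwopus_tg, gemma31_tg):.1f}% faster")
```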
The "New GPU" Comparison
I wanted to see how much the RTX 3060 actually helped my favorite model, Qwen3.5 35B-A3B, compared to my old setup (4070 + CPU + RAM offload):
[Prompt Processing - Dual vs Single GPU]()
[Token Generation Throughput - Dual vs Single GPU]()
[Time to first response - Dual vs Single GPU]()
Analysis:
- The Speed Buff is bonkers!!
- Prompt Processing: This was my Achilles' heel before: every prompt of 10k–30k tokens took forever. With the new setup, prompt processing is around 1.5x faster!
- Token Generation: For long context (tg4096), it’s about 44% faster (79 tok/s). It's crazy to see these kinds of speeds on a home setup.
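Why prompt-processing speed matters so much for long prompts falls out of a simple latency model: time to first token is roughly prompt_tokens / pp_speed, and the total response time adds output_tokens / tg_speed on top. A sketch with illustrative numbers (the pp speed here is an assumed value, since my pp charts are images; the 79 tok/s is the quoted tg4096 figure):

```python
def response_time(prompt_tokens, output_tokens, pp_speed, tg_speed):
    """Rough latency model: prefill time plus generation time, in seconds.
    Ignores sampling overhead and any PCIe transfer costs."""
    ttft = prompt_tokens / pp_speed           # time to first token
    total = ttft + output_tokens / tg_speed   # plus full generation
    return ttft, total

# pp_speed=2000 is an assumed, illustrative value; tg_speed=79 is from the post.
ttft, total = response_time(12_000, 4_096, pp_speed=2_000, tg_speed=79)
print(f"TTFT ~ {ttft:.1f}s, total ~ {total:.1f}s")
```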
VRAM & Utilization Notes: I didn't get perfect readings (mostly just Task Manager), so take this with a grain of salt. The RTX 4070 hovered around 40-45% utilization, while the 3060 was between 50-60%.
The memory split was a bit weird; despite the 4070 being primary, the 3060 always seemed to take a slightly larger chunk of VRAM (about 300–400 MB more), excluding the base Windows usage.
- Qwopus 27B: RTX 3060: 10.9 GB | RTX 4070: 10.4 GB
- Qwen3.5 35B: RTX 3060: 11.3 GB | RTX 4070: 10.9 GB
- Gemma4 31B: RTX 3060: 11.4 GB | RTX 4070: 10.4 GB
- Gemma4 26B: RTX 3060: 9.7 GB | RTX 4070: 11.5 GB (the only exception where the 4070 holds the larger share; possibly part of why this model has the faster prompt processing speed)
Conclusions:
- No regrets on the 3060 purchase. I’m still not sure how much the PCIe 4.0 x2 slot is holding me back, but so far it seems decent. If anyone has insights on testing that bottleneck, let me know!
- Qwen3.5 35B-A3B is my bread and butter for coding. I'm just waiting for some Opus-distilled finetunes (Jackrong, any updates?!) to help cut down the excessive thinking time, which is so far my only issue with the Qwen3.5 series.
- Qwopus 27B v3 runs fast enough that I can finally start testing its actual output quality.
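Some rough math on the PCIe 4.0 x2 question: with the pipeline-style layer split that LMStudio/llama.cpp use, only one hidden-state vector per generated token crosses the link, so a narrow link barely matters for tg speed. A back-of-the-envelope sketch (the hidden size is an assumed, illustrative number, not the model's actual one):

```python
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding ~ 1.97 GB/s per lane.
lanes = 2
link_bytes_per_s = lanes * 16e9 * (128 / 130) / 8

# Per generated token, a pipeline split ships one hidden-state vector across
# the link. hidden_size=5120 is an assumed, illustrative value.
hidden_size = 5120
bytes_per_token = hidden_size * 2          # fp16 activations

tok_per_s = 79                             # the post's tg4096 figure
needed_bytes_per_s = bytes_per_token * tok_per_s

print(f"link: {link_bytes_per_s/1e9:.2f} GB/s, needed: {needed_bytes_per_s/1e6:.2f} MB/s")
```

Tensor parallelism is a different story: it needs inter-GPU reductions at every layer, which is where an x2 link would actually start to hurt.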
Final advice: If you're on the fence about a dual-GPU setup, go for it! Just keep realistic expectations: it's amazing for hobbyist use, and honestly it's a lot of fun to hunt for deals, install the cards, and play around.
If anyone has suggestions to improve my setup or tools for objective quality testing, please let me know!
Closing remarks: I had Gemma4-26B-A4B correct this text for grammar at the end. It was quite fast, but it kept insisting that Qwen2.5 and Gemma2 are the latest models, and it added that I would lose credibility if I didn't use the correct version numbers 😂
mr_Owner@reddit
I get higher pp/s with a single RTX 4070S 12 GB.
But your ~20 tg/s for the 27B Q4KM seems a lot higher than I would expect.
What is your PCIe bandwidth?
I think you should really, really use llama.cpp and set the ubatch size higher than LMStudio's default of 512.
DracoTorpedo@reddit (OP)
Really?? What RAM/CPU config do you have? I've been kind of putting off trying llama.cpp because of the UI thing. Good point on the ubatch size: I will try increasing it!!
unjustifiably_angry@reddit
Additional GPUs make LLMs run slower, not faster. You get more capacity at the cost of making every input a relay race.
DracoTorpedo@reddit (OP)
I'm gonna stop at this number of GPUs!! I got a pretty good deal on the 3060, so it's not a bad investment for a hobby. Any further addition would be a pretty expensive change to my setup, and it's not within my range. I'm also looking into ik_llama.cpp for possible graph parallelism, which should be like the hybrid tensor parallelism llama.cpp has as experimental!! But my bottleneck is again PCIe 4.0 x2 on the 3060, so I'm not sure how much benefit anything beyond pipeline parallelism will bring.
mr_Owner@reddit
4070S 12 GB + 4x16 GB DDR5 6000 (64 GB) and an AMD 9800X3D.
Qwen3.5 35B (mudler APEX i-quant) runs at 44 tg/s, 2100 pp/s.
Gemma4 26B IQ4_XS: 33 tg/s, 2000 pp/s.
Ctx size 100k, ubatch 4096, batch 4096, np=1, flash attention on, K and V cache at q4, which gives good KLD thanks to attn_rot.
DracoTorpedo@reddit (OP)
I tried llama.cpp today with batch 2048 and ubatch 1024 (larger ubatches don't seem to throw errors) and got a pretty good boost to pp speeds: it's now around 2600 tok/s. Tg speed took a slight hit to 77 tok/s, but I guess this trade-off is worth it. I read about the rot quant too!! Excited for spectral quant in the future... I can now run 110k context at q8 KV quantization (close to lossless, if all the data I've read is correct).
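For reference, settings like the ones described above map onto llama.cpp server flags roughly like this. It's only a sketch: the model path and --tensor-split ratio are placeholders, and the flash-attention flag syntax varies across llama.cpp versions, so check against your build:

```shell
# Sketch of the tuned llama.cpp settings described above.
# Model path and -ts (tensor split across the two GPUs) are placeholders.
llama-server -m ./Qwen3.5-35B-A3B-Q4_K_S.gguf \
  -c 110000 -b 2048 -ub 1024 \
  -ngl 99 -ts 12,12 \
  -fa -ctk q8_0 -ctv q8_0
```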
Cooproxx@reddit
If you have multiple GPUs, does this just allow multiple models to run in parallel, or can they actually combine VRAM to run larger models like Qwen3 27B?
unjustifiably_angry@reddit
You will almost always get usable output sooner from a single large model than multiple slow models in a trenchcoat even if the large model is slower per token.
TheFunSlayingKing@reddit
Both. I'm currently running a dual setup with Vulkan; I can run Qwen 27B, though it has some hiccups, but that's mainly due to my VRAM capacity. You can also assign one model to one GPU and another model to a different GPU,
i.e. Qwen 9B on CUDA 0, Qwen 3B on CUDA 1.
DracoTorpedo@reddit (OP)
I could do both: run smaller models (that fit in my VRAM) in parallel, or fit a bigger model split across GPUs.
unjustifiably_angry@reddit
VRAM: It's important
The sooner you can get onto llama.cpp the better; LM Studio's VRAM allocation system is a crock of shit.
pepedombo@reddit
A few days ago I was asking about issues with LMStudio, because I'm running a 5070 Ti + 5060 Ti + 4060 Ti; for a few days I had 2x 5070 + 5060 + 4060. The point with LMStudio is that it does not manage VRAM to its full extent the way Ollama does, though Ollama will choose a random GPU if the model weights are much smaller than total VRAM.
Ollama will fill all available VRAM across all GPUs, including context, and it seems to keep a constant tps during the whole session.
LMStudio pushes smaller buffers, and the splits are uneven. Tps depends on the current context size: the more context, the slower. Once I had issues while using Gemma4: it started to push into my system RAM instead of VRAM. Sometimes I see higher CPU usage. Even though I have enough VRAM, LMS pumps up shared memory a little bit ("limit model offload" is on, and as it says, it may still use shared memory).
Since then I've played around with Ollama mostly. I'm switching back to LMS to test Gemma 31B and Qwen 27B vs Ollama.
Medical-Welcome-6924@reddit
Hey, I've also been thinking about adding a 12 GB 3060 to my current setup, so this is juicy information. Thanks for the tests! Seems worth it to me.
Eyelbee@reddit
I'd sell the 4070 for a 3090
DracoTorpedo@reddit (OP)
I was seriously considering this too!! But it's quite expensive in my region (even used), and I'd have to upgrade other parts of my setup as well!! This is more of a lazy upgrade to play around with.
DracoTorpedo@reddit (OP)
Sure thing!! Just keep an eye on the PCIe bandwidth of the free slot 😃