GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s
Posted by Fit-Courage5400@reddit | LocalLLaMA | View on Reddit | 89 comments
Hey all,
Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B.
What I’m targeting
- Context: 64K+ (ideally higher later)
- Speed: 30+ tok/s @ tg128 minimum
- Power: not critical, but lower is a bonus
From what I’ve seen, these dense models are way more demanding than MoE.
Why not MoE?
I’m already running MoE just fine on P40s:
- Gemma 4 26B MoE
- ~32K ctx
- ~42+ tok/s @ tg128
So now I want to move to dense models for better quality / reasoning.
Budget
- ~2500 AUD (~$1800 USD)
- GPU only (already have CPU / RAM / board)
- Ignore PCIe lane limits for now
Options I’m considering
A. 2× 9070 XT (16GB)
B. 1× R9 9700 (32GB)
C. 2× 7900 XTX (24GB)
D. 1× RTX Pro 4000 (24GB)
N. 1× Intel Arc Pro B70 (32GB, maybe future option, but not now)
My current understanding (please correct me)
- 16GB cards → basically forced into pipeline parallel, so per-GPU compute matters a lot
- 2× 7900 XTX should have the best raw throughput
- RTX Pro 4000 maybe similar class, but VRAM limits context flexibility
- 32GB single card (R9 9700) is attractive for KV cache / long ctx, BUT:
- perf ≈ 9070 XT?
- price = ~2× 9070 XT + extra GPU…
- 2× 9070 XT might be best “budget parallel” option
Concerns (based on what I’ve seen here)
- KV cache is brutal on Gemma 4 31B: "massive KV cache… biggest drawback"
- Even people with large VRAM struggle with higher quants / context
- 24GB seems like the minimum viable tier for 31B dense
- Long context scaling is still very hardware-sensitive
- Multi-GPU scaling (esp PCIe) seems very inconsistent depending on backend
What I want to know
If you’ve actually run Qwen3.5 27B / Gemma 4 31B (dense):
- What GPU are you using?
- What real tok/s are you getting (esp @ 64K+)
- Does multi-GPU actually scale well or just look good on paper?
- Is 32GB single GPU > dual 16/24GB in practice?
- Any regrets / “don’t buy this” advice?
Bonus question
If you had ~$1800 today, would you:
- go multi-GPU AMD (cheap + raw compute)
- or single high-VRAM card (simpler + better ctx)
Appreciate any real benchmarks / configs 🙏
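Edit: for anyone sanity-checking my KV-cache worry, this is the napkin math I've been using (just a sketch; the layer/head numbers below are placeholder guesses, not the real Gemma 4 31B or Qwen 3.5 27B configs):

```python
# Napkin KV-cache estimate: 2 tensors (K and V) per layer, GQA heads only.
# NOTE: the layer/head numbers used below are placeholders, NOT real configs.
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB at the given context length (f16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# A hypothetical ~30B dense config (48 layers, 8 KV heads, head_dim 128):
print(kv_cache_gib(ctx=65536, n_layers=48, n_kv_heads=8, head_dim=128))  # 12.0 GiB at f16
print(kv_cache_gib(65536, 48, 8, 128, bytes_per_elem=1))                 # 6.0 GiB at q8_0
```

So even with a q8_0 KV cache, 64K can eat a meaningful slice of a 16GB card before the weights are even counted.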
ChukwuOsiris@reddit
Not an option you asked for, but dual 3090s: Qwen3.5-27B-UD-Q5_K_XL, PP & TG measured every 20K of context up to 200K:
| test | t/s |
| --------------: | -------------------: |
| pp4096 | 1847.78 ± 4.74 |
| tg512 | 34.51 ± 0.37 |
| pp4096 @ d20000 | 1486.08 ± 0.63 |
| tg512 @ d20000 | 32.53 ± 0.10 |
| pp4096 @ d40000 | 1222.48 ± 9.61 |
| tg512 @ d40000 | 30.99 ± 0.06 |
| pp4096 @ d60000 | 1050.72 ± 21.75 |
| tg512 @ d60000 | 29.51 ± 0.18 |
| pp4096 @ d80000 | 924.71 ± 3.13 |
| tg512 @ d80000 | 28.18 ± 0.12 |
| pp4096 @ d100000 | 818.69 ± 13.18 |
| tg512 @ d100000 | 26.93 ± 0.06 |
| pp4096 @ d120000 | 740.77 ± 1.04 |
| tg512 @ d120000 | 26.02 ± 0.06 |
| pp4096 @ d140000 | 668.45 ± 3.31 |
| tg512 @ d140000 | 24.93 ± 0.04 |
| pp4096 @ d160000 | 613.00 ± 3.53 |
| tg512 @ d160000 | 23.99 ± 0.04 |
| pp4096 @ d180000 | 565.57 ± 0.68 |
| tg512 @ d180000 | 23.10 ± 0.04 |
| pp4096 @ d200000 | 524.21 ± 0.63 |
| tg512 @ d200000 | 22.33 ± 0.04 |
Brah_ddah@reddit
Why would you be forced into pipeline parallel instead of tensor parallel?
fulgencio_batista@reddit
I have dual RTX 5060 Tis and run Qwen3.5-27B-NVFP4 at 62 t/s tg512 in vLLM with MTP. Cost me $1k.
a-babaka@reddit
Is it with concurrency=1? Could you share run config?
fulgencio_batista@reddit
Yep concurrency of 1. You can check my config here. If you have any advice for improving it or need any help feel free to send me a dm.
https://www.reddit.com/r/LocalLLaMA/comments/1smqqx5/comment/ogjveqq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
a-babaka@reddit
That's interesting. What motherboard do you have? Are both cards working on pcie 5.0 x8?
Fit-Courage5400@reddit (OP)
Wow, that already matches 5090-class numbers. Are you running each card at PCIe 5.0 x16? Also, what CLI command do you use to launch it?
fulgencio_batista@reddit
I'm honestly not sure what type of PCIe slots my mobo has - the spec sheet is confusing. But 'nvtop' tells me 'PCIe GEN 5@ 8x' and 'GEN 3@ 4x'. The one at x8 is probably because the 5060 Ti is only an x8 card.
My server command is here: https://www.reddit.com/r/LocalLLaMA/comments/1smqqx5/comment/ogjveqq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
finevelyn@reddit
It's not matching the 5090. If you use MTP with the 5090 it will be twice as fast. Also Gemma 31B doesn't have MTP. Also MTP is currently only supported in vllm, which will have various tradeoffs compared to llama.cpp.
TLDR if you go cheap, it will be slower and more limiting.
Mir4can@reddit
Hey, I've got the same and get around 40 tps. Can you share your model, run command, etc.?
fulgencio_batista@reddit
RemarkableGuidance44@reddit
Nvidia has stopped making 5060 Tis with 16GB of VRAM.
Glittering-Call8746@reddit
Omg why..
New-Implement-5979@reddit
because they need that RAM for more expensive GPUs
chuckbeasley02@reddit
The Gemma 4 26B MoE is better than the 31B dense model
gpalmorejr@reddit
I'm not 100% sure what you'd need to achieve that... larger dense models are ROUGH to squeeze tokens from. But one thing I can impart: multiple GPUs do not generally run the model in parallel unless you're running more than one instance of it. They usually run in series. One GPU gets some layers and the other gets the rest, and they work one at a time. So they don't really scale for compute; they scale for VRAM. If you have 2× of some GPU, you still only get 1× of that GPU's speed, just on a larger VRAM pool. (For most home implementations, at least from the research I've done.)
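A toy way to see it (purely illustrative Python, not tied to any real backend): with a layer split, the cards take turns, so per-token latency is the sum of each card's share, and two equal cards land at 1×, not 2×.

```python
# Toy latency model, purely illustrative (not any real backend):
# in a pipeline/layer split the GPUs take turns, so per-token latency is
# the SUM of each card's share of the layers, not the max.
def pipeline_token_time(layer_time_ms, layers_per_gpu):
    return sum(n * layer_time_ms for n in layers_per_gpu)

print(pipeline_token_time(1.0, [48]))      # one card, all 48 layers
print(pipeline_token_time(1.0, [24, 24]))  # two cards, 24 layers each: same time
```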
Fit-Courage5400@reddit (OP)
Yeah exactly, multi-GPU at home is mostly just about VRAM pooling, not real compute scaling.
Unless you’ve got something like NVLink (and even then it’s not perfect), you usually end up with the model being split across cards and running more like layer sharding rather than true parallel execution — so the speedup is pretty limited, it’s mostly just “can it fit” rather than “does it run faster”.
That’s why I still think dual 3090s with NVLink is kind of the sweet spot in the secondhand world right now — decent bandwidth, good VRAM per dollar, and at least you’re not completely stuck on PCIe.
But yeah, power draw and heat is another story 😅
Upstairs_Tie_7855@reddit
Not exactly true. I've got 3× Tesla P40s, and with parallelism enabled I get a massive speed boost on Gemma 4 31B. Of course, you'll need x16 PCIe lanes for best performance.
Fit-Courage5400@reddit (OP)
How fast is the Gemma 4 31b model on your 3× P40s? ~30 t/s?
Upstairs_Tie_7855@reddit
Haven't measured it tbh. Currently at work, will do a benchmark later
Fit-Courage5400@reddit (OP)
Thanks! I actually have 4× P40s, and I’m using a PCIe x16 slot split into 4× x4 lanes for them. Tensor parallelism doesn’t seem to work very well in my setup right now, so I’m mostly relying on pipeline parallelism.
I’m really looking forward to your results. If the performance looks good, I might consider upgrading my motherboard as well.
Upstairs_Tie_7855@reddit
So it's 15 t/s, and without it it's about 6 t/s. I don't think it's true tensor parallelism; I'm talking about pipeline parallelism + row split. (I assume that's why PCIe lanes are so important in this setup.) You might get up to 20 t/s with 4 cards.
gpalmorejr@reddit
If you guys are using P40s and really large models, tensor parallelism is much more of an option. I apologize, I was still operating under the consumer-hardware assumption, where we really don't have such options and the latency overhead to sync across cards would be higher than the math to process a layer for inference. With models that big and those data-center-type GPUs, you do have tensor parallelism available, though it's best used for larger models, of course. Y'all are a little outside my playzone/budget, so I'll let someone who knows more about it talk. Lol
Makers7886@reddit
gpalmorejr@reddit
To be honest, if you were in enough VRAM, the power of one of those cards would probably be enough to get a reasonable token rate. The issue is probably mostly bandwidth.
zeitplan@reddit
I run a 9070 XT and a 9060 XT and get around 300 pp/s and 15 tok/s with qwopus 27b Q4 in llama.cpp. My mainboard config is also not optimal and I'm using Vulkan; ROCm goes out of memory fast. Even with the full 262K context, speed stays roughly the same.
Glad-Mode9459@reddit
I use 27B Q4_K_L on an RX 9070 XT with an old RX 6600, 98K context at Q8, and get around 18 t/s on Windows; on Linux it's even faster. Without the RX 6600, with an IQ4_XS quant, ~13 t/s.
zeitplan@reddit
Hm, llama.cpp or what are you using? My 9060 XT is only connected at PCIe 4.0 x4.
Fit-Courage5400@reddit (OP)
Thanks for sharing. Yeah, the 9060 XT isn’t very fast, and it ends up dragging down the 9070 XT’s performance.
But if you’re running MoE models, the cost-performance ratio of the 9060 XT 16G is probably among the best.
zeitplan@reddit
Its happily chugging along, most pain i have with the preprocessing. Better PCIE Slot could help the 9060XT out here. Got a good deal on it, so I took my chances.
DeepBlue96@reddit
Hello, 3090 user here (bought used for 700€ 3 years ago). On qwen3.5-27b q4_k_m it does 23-25 tk/s generation, 1600 tk/s prompt ingestion. Context deployed: 131072.
Context was quantized (I discovered such a thing existed 4 weeks ago lol).
How, with llama.cpp:
You might want to add --reasoning false; it's a waste of tokens and time, output doesn't improve, and it's already good enough for most coding lol. Still, I prefer qwen3.5-35b-a3b: close enough code quality, better understanding, not to mention the 4× speed xD
ForsookComparison@reddit
tossup between r9700 and 2x7900xtx's
If TG is your main concern and you don't mind the extra cost/heat/power/lanes, then the 7900 XTXs are the clear winner here.
Fit-Courage5400@reddit (OP)
Yeah, you nailed my dilemma exactly 😄
I’ve literally got both options sitting in my cart right now… just not sure which one I should remove.
Appreciate the insight 🙏
veinamond@reddit
I was considering the same topic. I think you should aim either at 2× R9700 (one now, one later) or 2× 7900 XTX; buy now and forget about it until your itch makes you upgrade to server/workstation platforms and/or Sparks / Strix Halos / Mac Studios.
DistanceSolar1449@reddit
Why not 2x 3090s and have better CUDA and VLLM support? That gets you much faster tensor parallelism.
Prudent-Promotion512@reddit
I have dual 3090s and was having trouble getting the model running with decent context. Can you share advice on a good setup?
Holiday_Bowler_2097@reddit
Qwen 3.5 27b bartowski q6_k_l with full context (-ctk q8_0 -ctv q8_0) llama.cpp vulkan. Llama-swap shows Opencode session stats 45-55 t/s decode on downvolted (400w-) rtx5090 32gb. ~45 t/s at 200k+ context.
Strix halo with oculink +5090. Halo to play with moe models (5090 works as prefill booster and vram extender in this case in tensor split layer mode)
32gb card for sure. 24gb is too small. Need Q5 quant at least, Q5-Q6 is where models like qwen 3.5 27b are not too lobotomized for coding, and need ctx 100k+ anyway.
PsychicAnomaly@reddit
what about qwopus models?
Holiday_Bowler_2097@reddit
I'm more interested in https://www.reddit.com/r/LocalLLaMA/comments/1s1t5ot/rys_ii_repeated_layers_with_qwen35_27b_and_some/ but too lazy to actually try making a GGUF (
Btw, Qwen 3.6 35B is released, so I guess I won't even try qwopus )
cristianlukas@reddit
Come here to Argentina on vacation; secondhand 3090s are 600 USD.
EvilGuy@reddit
I have a 3090 and I can do 128k context on qwen 27b at around 45 tokens a second. Gemma I have not really played around much with but I believe I was doing at least 65k context at 30-something tokens a second. Could probably do a bit better with some tweaking.
Thanks-Suitable@reddit
I am also looking for the same setup right now, and I share your concerns about secondhand 3090s (Europe btw). What would be interesting for me: if you want to run qwen27b, consider the pipeline with Dflash models; if you can get those to work, you boost your tokens/second and only need to focus on prompt processing for agentic coding applications. I would be very curious to see if anybody has this setup up and running! I would love to chat!
iLaurens@reddit
I live in a place with high power costs too, so I went for the RTX Pro 4000 SFF Blackwell. Consumes only 70W.
Am able to run Gemma 4 31B UD_4_k_xl quant with 70k context (headless server, so can use full GPU vram). It's not fast because of limited bandwidth of the GPU at 16t/s and 650pp/s. But with a small Gemma 4 E2B q2 as speculative decode I get about 19 t/s on creative tasks but 30 t/s on coding tasks. That's pretty decent!
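The reason coding gains more than creative tasks: the draft gets accepted more often there. The usual napkin model for this (a simplification; it assumes an i.i.d. per-token acceptance probability p and ignores the draft model's own cost):

```python
# Simplified speculative-decoding model: the target model verifies k drafted
# tokens per pass; each is accepted independently with probability p.
# ASSUMPTION: i.i.d. acceptance; the draft model's own cost is ignored.
def spec_decode_speedup(p, k):
    """Expected tokens emitted per target-model pass: (1 - p^(k+1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

print(round(spec_decode_speedup(0.3, 4), 2))  # low acceptance: creative-ish text
print(round(spec_decode_speedup(0.6, 4), 2))  # high acceptance: code-ish text
```

Which is the right shape: modest gains on creative writing, bigger gains on code, on top of the same 16 t/s baseline.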
ixdx@reddit
RTX 5070 Ti + RTX 5060 Ti
llama-bench -ctk f16 -ctv f16 -fa 1
build: b8763 (ff5ef82)
With a 128k context, KV=f16, and mmproj, Q4_K_L fits into VRAM.
Without mmproj, Q6_K fits.
Qwen3.5-27B performance barely drops when the context is filled to 128k (I haven't tested it with a larger context size).
Without mmproj, for Q4_K_L it is possible to fit a maximum of 80k context (KV=f16), for Q5_K_S - 70k.
exact_constraint@reddit
R9700 running llama.cpp w/ Vulkan. Qwen3.5 27B starts at about 30tps, drops to around 23 in OpenCode when I’m bumping up against the context limits. Been using it every day.
picosec@reddit
A single 24GB GPU (3090, 3090 Ti, 4090) can run Qwen 3.5 27B or Gemma 4 31B at decent rates (30-40 tokens/s) with 4-bit quants (like UD-Q4_K_XL), though context size with Gemma 4 31B is limited to more like ~32K at F16. Dual 24GB cards should be better as far as quantization goes. I haven't tested with a 7900 XTX, though I have one sitting in a box.
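Rough sizing, if it helps (plain bits-per-weight arithmetic; real GGUF files add some overhead for quant scales and layers kept at higher precision, and the ~4.85 bpw figure is an assumed average for a Q4_K_M-style mix):

```python
# Plain bits-per-weight arithmetic; real GGUF files add overhead for quant
# scales and the layers left unquantized, so treat this as a floor.
def quant_size_gib(params_b, bits_per_weight):
    return params_b * 1e9 * bits_per_weight / 8 / 1024**3

# ~4.85 bpw is an ASSUMED average for a Q4_K_M-style mix of a 31B model:
print(round(quant_size_gib(31, 4.85), 1))
```

Call it ~17.5 GiB of weights, so on a 24GB card only ~6GB is left for KV cache and buffers, which is why ~32K at F16 is about the ceiling.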
wil_is_cool@reddit
If running a NVFP4 27B in vllm with full context + vision encoder you will be using about 60gb VRAM FYI. If running a different quant, or on llamacpp that will be lower, but less efficient for parallel users. (I was surprised too, I thought it would be lower before I implemented)
Fit-Courage5400@reddit (OP)
What’s your context setup like — is the KV cache running in FP16?
60GB VRAM is higher than I expected too…
wil_is_cool@reddit
Yeah full FP16 cache. It's on a RTX Pro 6000 so I wasn't too concerned. That gets 110% max length concurrency with 19.24 GiB KV cache memory. Model: 26.27 GiB. Total with all the CUDA graphs etc 61.7 GiB.
Command:
wsl.exe docker run --rm --runtime nvidia --gpus all \
  --name qwen35-27b-nvfp4-container --ipc=host \
  -v ~/models:/models -v ~/vllm-cache:/root/.cache/vllm \
  -p ${PORT}:8000 vllm/vllm-openai:v0.19.0 \
  /models/Sehyo_Qwen3.5-27B-NVFP4 \
  --trust-remote-code --gpu-memory-utilization 0.6 --max-model-len -1 \
  --enable-prefix-caching --enable-chunked-prefill --enable-auto-tool-choice \
  --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs "{\"enable_thinking\": false}" \
  --mm-encoder-tp-mode data --mm-processor-cache-type shm \
  --served-model-name ${MODEL_ID} Qwen3.5-27B-NVFP4 agent:Qwen3.5-27B agent-vision:Qwen3.5-27B agent:latest agent-vision:latest \
  --speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": 1}"
comanderxv@reddit
How fast is the Prompt processing in your setup at 100k context?
lmyslinski@reddit
1) Why is GPU utilization set to only 0.6? 2) This seems highly inefficient for that much VRAM. I'm running 27B at Q4 with 128K context in vLLM on 48GB of VRAM, and I could likely squeeze a bit more out of it.
That said, the model does degrade with tool calling over 64K context; I have yet to figure out why.
ProfessionalSpend589@reddit
You don’t say what quant you want to run the Qwen model, but here a user tested it in q6 with Radeon AI Pro R9700: https://www.reddit.com/r/LocalLLaMA/comments/1sh1u4k/comment/ofc0i41/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
And I honestly think if you want Gemma 4 31B, you'd want more than 32GB VRAM for speed. I get the feeling that when a large context spills into RAM, the latency over PCIe is noticeable (although it may be my setup).
There’s a trick to reduce the RAM requirements: https://github.com/ggml-org/llama.cpp/discussions/21480
AurumDaemonHD@reddit
Why not consider secondhand rtx3090. Here on bazaar starting at 900.
Fit-Courage5400@reddit (OP)
Mainly because they’re not that easy to find here in Australia at a good price.
Also, a lot of 3090s have likely been used for mining long-term, which adds some risk. On top of that, the Samsung process + thermal issues (the “hot dragon” reputation) means I’d probably end up needing to mod it with better cooling or even go water.
Considering all that, I’d rather just go with a new card.
ormandj@reddit
Just get blower cards like the Gigabyte Turbo. I bought three used (miner cards like the ones you're afraid of), cracked them open and cleaned them up, used PTM and some new thermal pads, and had them back up and running for around $20 each at perfectly fine temps. It was around an hour per card to disassemble, clean out all the dust and garbage from running in an open-frame system, and cut all the thermal pads to size, but even at full load, packed three deep beside 100Gb Mellanox cards, they run just fine.
Fit-Courage5400@reddit (OP)
Yeah, before these cards get a second life in a new setup, they probably need a proper “spa treatment” first 😄
AurumDaemonHD@reddit
I took a Strix and a Zotac from a mining shop for like 1100 EUR altogether in Europe. They actually know what they're doing, and undervolting is better than heavy, sporadic gamer use.
If you benchmark a card for a good amount of time and look for defects (mainly thermal), I'd say you can minimize the risk.
But obtaining one can be difficult. At a good price they survive a day on the bazaar. I used to check daily.
Fit-Courage5400@reddit (OP)
Yeah, I get that — secondhand cards can definitely be cheaper, but there’s always a trade-off.
You’re basically inheriting whatever the previous owner did to the card, and if something was pushed too far, you might end up dealing with it later… sometimes it just dies out of nowhere.
That said, if there’s a reliable source for 3090s, I’d definitely consider it. Appreciate you sharing your experience 👍
ryfromoz@reddit
Not in Australia, mate. I hunted hard for the four I got.
Puzzleheaded_Base302@reddit
I don't think you can get 64K context length with a 24GB VRAM card. If you quantize the model to Q3, the output quality will be bad.
GrungeWerX@reddit
You don't need to pay attention to that. I have a 24GB card and run 100K context. I get speeds of 26 tok/s. RTX 3090 Ti, KV cache q8/q8.
Fit-Courage5400@reddit (OP)
Yeah, Q3 isn’t great overall, but TurboQuant might be worth a try. I’ll test it once I get the new GPU(s).
https://www.reddit.com/r/LocalLLaMA/comments/1s9ig5r/turboquant_isnt_just_for_kv_qwen3527b_at_nearq4_0/
Puzzleheaded_Base302@reddit
I think at the moment TurboQuant slows down the token rate, so it still needs more work.
fastheadcrab@reddit
2x 5060 Ti or 5090
Fit-Courage5400@reddit (OP)
What kind of speed are you getting on 2× 5060 Ti on your setup?
fastheadcrab@reddit
With a single 5090 I easily exceed 50 t/s on Qwen3.5-27B.
I'd imagine you could get 30-40 t/s with dual 5060 Tis if you set up tensor parallel.
You will not be able to get close to a 64K KV cache for Gemma 4 31B in 24GB of VRAM.
Fit-Courage5400@reddit (OP)
Thanks for your inspiration—RTX 5090 is the king!
https://arxiv.org/html/2601.09527v1
fastheadcrab@reddit
Yeah that's fairly accurate but with small 8B 4-bit models like the one tested in the paper, parallel scaling will be less effective since there's more overhead
DataPhreak@reddit
I get 40 tok/s with the strix halo, tested at over 100k context.
Fit-Courage5400@reddit (OP)
Are you sure you’re actually running Qwen 3.5 27B?
https://www.reddit.com/r/StrixHalo/comments/1rhgi9p/qwen_3527b_how_was_your_experience/
DataPhreak@reddit
Oh, you're right. I'm using the sparse models.
ryfromoz@reddit
You also can go with two b60s at $899 aud each giving you a total 48GB vram
Fit-Courage5400@reddit (OP)
VRAM is definitely a constraint, but for 30B dense models, compute is honestly the bigger bottleneck for the B60s.
MotokoAGI@reddit
With llama.cpp I get 24 tk/s with 31B @ Q8 on multiple 3090s sitting on PCIe 4.0 x8.
lionellee77@reddit
My 3090 desktop runs Gemma 4 31B with UD-Q4_K_XL, 73K context at Q8_0. Slightly over 30 token/s
Fit-Courage5400@reddit (OP)
That’s a pretty solid setup — kinda makes me want a 3090 now 😄
So… you looking to sell it? haha
lionellee77@reddit
To support 30tk/s on 30B dense, you need at least 24GB with 900GB/s memory bandwidth. Both 3090 and 7900 XTX meet the requirements.
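The napkin math behind that, if anyone's curious (an upper bound: it only counts reading every weight once per token and ignores KV-cache reads, activations, and prompt processing):

```python
# Bandwidth-bound ceiling on decode speed: every weight read once per token.
# Ignores KV-cache reads and all other overhead, so real numbers come in lower.
def max_tg_tps(bandwidth_gbs, params_b, bytes_per_param):
    return bandwidth_gbs / (params_b * bytes_per_param)

# ~31B dense at ~4.5 bits/weight (a Q4_K-ish assumption), 3090-class ~936 GB/s:
print(round(max_tg_tps(936, 31, 4.5 / 8), 1))
```

Real hardware lands at maybe 50-70% of that ceiling, which is exactly the 30 tok/s ballpark.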
Fit-Courage5400@reddit (OP)
Glad it helped 👍
anthonyg45157@reddit
How do you have this setup sharing context with RAM? Can you share your settings if using llama CPP or something similar
lionellee77@reddit
llama-server -a gemma-4-31B-it -np 1 -m /home/lm/gemma-4-31B-it/gemma-4-31B-it-UD-Q4_K_XL.gguf --mmproj /home/lm/gemma-4-31B-it/mmproj-BF16.gguf --host 0.0.0.0 --port 8000 --flash-attn on --temp 1.0 --top-p 0.95 --top-k 64 -c 73728 --cache-type-k q8_0 --cache-type-v q8_0
VRAM is almost full after hosting the model.
Puzzleheaded_Base302@reddit
RTX PRO 4500 32GB at $2899. just enough for what you need. I managed 95-115K context and 36tps on LM Studio (llama.cpp)
Fit-Courage5400@reddit (OP)
RTX PRO 4500 is solid, but it really hurts the wallet 😅
putrasherni@reddit
AMD R9700 / Intel B70 / NVIDIA 5090
Minimum-Lie5435@reddit
Yeah, dual 3090s are the best option for price/performance. With vLLM and 262K context I can get 65 tps with tp=2 on the dense models.
catplusplusok@reddit
Sounds like a perfect use case for Intel Arc Pro B70, 32GB will fit these models in 4 bit comfortably.
Gesha24@reddit
In the future, when maybe all the issues are fixed, sure. Right now, no. The best token generation I can get for Qwen 3.5 27B Q4 is 20 tokens/sec, and that comes with very slow ingestion (500-ish). That's when running llama.cpp with SYCL. Otherwise I can run vLLM, which has a bunch of issues with chat formatting and tool calling and generates only 9 tokens/sec, but can ingest at 2500/sec...
Fit-Courage5400@reddit (OP)
I’d love for the B70 to be a viable option, honestly.
But from what I’ve seen so far, real-world performance is only around ~20 tok/s at best, which is a bit below what I’m aiming for.
Feels like it might be another case like the A770 — where the hardware is decent, but it takes a couple of years for drivers and the software stack to really mature.
So yeah, maybe worth waiting and seeing how Intel + the community improve things, but right now it still feels a bit early.
Nutty_Praline404@reddit
Check discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1smlvni/qwen3535b_running_well_on_rtx4060_ti_16gb_at_60/
Fit-Courage5400@reddit (OP)
Thanks — I’m not factoring in MoE requirements here, since my P40 24GB setup already handles them fine (~42 tok/s).