I've got $3000 to make Qwen3.5 27B Q4 run, what do I need?
Posted by NetTechMan@reddit | LocalLLaMA | View on Reddit | 76 comments
I'm having a hard time determining the hardware I need to run a model like this, and I'm a bit confused by the number of resources publicly available. Is there a centralized hardware benchmark platform for these models, or is it all just hearsay from the community?
Along those lines, how could I make 3k stretch to work? I'm looking for about 15-20t/s.
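A rough way to sanity-check any hardware suggestion against a 15-20 t/s target: dense-model token generation is approximately memory-bandwidth-bound, so tokens/s can't exceed bandwidth divided by weight size. The figures below are illustrative assumptions, not measured specs.

```python
# Back-of-envelope ceiling: for a dense model, token generation is roughly
# memory-bandwidth-bound, so t/s is at most bandwidth / weight bytes.
# All numbers here are assumptions for illustration, not measured specs.

def max_tg_tokens_per_s(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on token generation speed (ignores KV cache reads and overhead)."""
    weight_gb = params_b * bits_per_weight / 8  # GB of weights streamed per token
    return bandwidth_gb_s / weight_gb

# A 27B dense model at ~4.5 bits/weight (typical for a Q4 GGUF) is ~15 GB of weights.
for name, bw in [("~640 GB/s card", 640), ("~960 GB/s card", 960)]:
    print(f"{name}: ceiling of about {max_tg_tokens_per_s(27, 4.5, bw):.0f} t/s")
```

Real-world numbers land well below this ceiling, so the 20-30 t/s figures reported later in this thread are consistent with it.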
Woof9000@reddit
If I had that much money to spare, I'd get R9700.
But I'm linux-only type of guy, so I'm always biased towards team red.
ProfessionalSpend589@reddit
It’s not that fast: https://www.reddit.com/r/LocalLLaMA/comments/1sh1u4k/comment/ofc0i41/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
Woof9000@reddit
No idea how that person got those numbers. OP is looking for 15-20tps, and my current system with 2x 9060xt does up to 15tps, so if the R9700 can't reach 20tps (with Qwen3.5 27B Q4), then most likely there's some issue somewhere in the software stack. Gotta update the kernel, Vulkan SDK, llama.cpp/vllm, or smth.
HopePupal@reddit
quick update to this, re-benched today after two changes:
- Ryzen 1800X → Ryzen 5900XT
- PCIe ReBAR enabled (previous GPU wasn't ReBAR-capable but i forgot to turn it on after upgrading)
software, config, etc. entirely the same other than that.
now hitting between 19 and 24 t/s TG for Q6_K, depending on context depth. i'll run another bench with ReBAR off to see if it was that or the CPU.
Woof9000@reddit
New numbers look much closer to what I'd expect from that card.
HopePupal@reddit
see this is why we share numbers, i just got a ~50% TG boost for free. thanks r/LocalLLaMA
Woof9000@reddit
I forgot we even had those settings. ReBAR and Above 4G Decoding were just something I enabled years ago and completely forgot existed.
But it's great to hear you figured it out.
ProfessionalSpend589@reddit
That user tested with Q6_K. Q4 would be faster.
Woof9000@reddit
That user's setup is not known, so for all intents and purposes it's merely an edge case.
Here's an example of more thorough and better documented testing (with TG up to 30 tps):
https://github.com/ggml-org/llama.cpp/discussions/21043
HopePupal@reddit
y'all could have just mentioned me. those numbers are on llama.cpp (not sure about the build), Linux 6.17 (Fedora Bazzite), Vulkan backend, Ryzen 1800X, card in a PCIe Gen 3 x16 slot, 2x 16 GB DDR4 modules (not sure about the clock speed). llama settings basically just default with auto-fit.
Woof9000@reddit
Thanks. That looks like an older system, but the rest of the HW shouldn't matter much. I'm fairly sure your numbers would improve with a more recent kernel, Mesa, and a fresh compile of llama.cpp against the latest Vulkan SDK.
HopePupal@reddit
i actually dropped in a 5900XT yesterday, so if the CPU was a bottleneck before, it's less of one now. i agree that it's probably not, though. (i upgraded because that machine also hosts agent isolation VMs.)
newer kernel should also unblock ROCm
Woof9000@reddit
Probably a newer kernel would help. Apart from that, I don't have much confidence in llama-bench these days. At least on my system (dual 9060), during actual use with llama.cpp server I get a fairly consistent 14-15 tps, while whenever I run llama-bench the results are all over the place, from 7 to 16 tps, depending on the settings/options I use, day of the week, or phase of the moon. Really not sure what's going on with it and I have no patience for figuring it out.
HopePupal@reddit
i was using llama-benchy anyway, which tests by calling any OpenAI-compatible HTTP API
NetTechMan@reddit (OP)
That actually works for my budget. Let me look into those.
HopePupal@reddit
OP i'm now hitting 19-25 t/s on Q6_K and 28-30 on Q4_K_M with my R9700 (after some troubleshooting elsewhere in this thread) so that should definitely do it for ya
SexyAlienHotTubWater@reddit
If you want to buy one card, it's a convenient option, but it's very very noisy. I just returned mine.
2x7900 xtx is around the same price, but with 3x the aggregate bandwidth and 3x the aggregate FLOPs, with 48GB VRAM. If you can find a dual-GPU board, that's probably a better option.
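The "3x the aggregate bandwidth" claim checks out against commonly quoted spec-sheet figures; treat the numbers below as assumptions rather than authoritative specs.

```python
# Sanity-checking the "3x aggregate bandwidth" claim with commonly quoted
# spec-sheet figures (assumptions for illustration, not authoritative).
r9700_bw = 640   # GB/s, commonly quoted for the R9700 (256-bit GDDR6)
xtx_bw = 960     # GB/s per 7900 XTX
aggregate = 2 * xtx_bw
print(f"2x 7900 XTX: {aggregate} GB/s vs R9700: {r9700_bw} GB/s "
      f"(~{aggregate / r9700_bw:.1f}x)")
```

Caveat: llama.cpp's default layer-split across two GPUs doesn't fully exploit aggregate bandwidth for a single request, so the real-world speedup is smaller than the raw ratio.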
Woof9000@reddit
Here's a decently documented example of somebody testing Qwen3.5-27B Q4 on R9700 with llama.cpp and getting up to 30tps:
https://github.com/ggml-org/llama.cpp/discussions/21043
Blackdragon1400@reddit
What? NVIDIA has always had better driver support in Linux lmao
AMD lags in support across all platforms except closed ones (consoles)
LuckyZero@reddit
you know that picture of Linus Torvalds giving the finger? That finger was for Nvidia.
mustafar0111@reddit
As someone who has both Nvidia and AMD GPU's you have no idea at all what you are talking about.
If you bought something from Nvidia in the R9700 Pro's price range, the R9700 Pro would run circles around it.
Woof9000@reddit
"NVIDIA has always had better driver support in Linux" - LOL
I used Nvidia GPUs on Linux for a couple of years, and it was such an "amazing" experience that the moment the 9060 hit the shelves, I unloaded all my Nvidia stock to eBay, even knowing initial driver support for the new cards would be less than stellar.
And now, after half a year with it, I can confidently say I've never been happier with my system. Low power usage, low cost, great for all kinds of uses, gaming and AI, not just one or the other. Previously I had to maintain two systems, one for gaming and another for AI; now one system rules them all.
lemon07r@reddit
Just a 7900 XTX would do and is probably the cheapest way to get what you want. These go for around 700 USD in my area, sometimes a lil less. RTX 3090s are a good option too, but usually more expensive, and not as fast for non-AI stuff. Vulkan and ROCm support for LLM inference is great, and pretty easy to set up now too. Since the model will fit completely in VRAM, it will be a lot faster than your 15-20t/s target.
flavio_geo@reddit
This is correct.
I use a 7900 XTX (24GB VRAM),
running Qwen3.5 IQ4_XS Unsloth with 240k context and F16 mmproj on llama.cpp Vulkan:
pp512 = 950 t/s
tg128 = 41 t/s
grumd@reddit
Do you get better pp speed with ubatch 2048-4096? I usually test pp4096 btw to see results closer to real world
flavio_geo@reddit
Going higher than ubatch 1024 yields very little gain in PP, and costs valuable VRAM space
grumd@reddit
For me with 122b it's worth it, I get 1000pp with 1024 ub and 1500pp with 2048ub. I just increase ub until it's not worth it anymore
pwlee@reddit
Yes, I have 2 of them and 27b Q4 can run on a single GPU. Expect 25-30t/s generation, 200-500t/s prompt processing.
I'd recommend llama.cpp since vllm was difficult for me to set up using debian 13.
InuRyu@reddit
with 2, can you realistically run q6 or q8?
d4t1983@reddit
What kinda performance do you get with two compared to one out of interest?
GMerton@reddit
I think you need to be a bit more specific about your budget. Do you have a pc and just need to swap GPU? Do you care about electricity cost? Without buying used stuff, a pc without GPU is probably $1000. Electricity difference between Mac and discrete GPU can be another $700 over 2 years if you run them near 24/7.
I did quite a lot of research for my use case (mainly deep research). Will just wait for M5 Max that will hopefully get released in a few months.
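The ~$700-over-2-years electricity figure pencils out under plausible assumptions; the wattage delta and rate below are illustrative guesses, not measurements.

```python
# Rough check of the electricity-cost gap between a Mac and a discrete-GPU
# box running near 24/7. Wattage delta and rate are illustrative guesses.
idle_delta_w = 200        # assumed extra average draw of the discrete-GPU box
hours = 24 * 365 * 2      # two years of near-24/7 uptime
rate_per_kwh = 0.20       # assumed USD per kWh
cost = idle_delta_w / 1000 * hours * rate_per_kwh
print(f"~${cost:.0f} extra over two years")
```

At lower duty cycles or cheaper electricity the gap shrinks proportionally, so this mainly matters for always-on setups.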
FalconX88@reddit
Alternative to the 3090s if you want to just plug and play: The best Mac Studio you can get, although for Q4 probably 24GB of memory would already be enough.
Automatic-Arm8153@reddit
27b is a dense model and would be painfully slow on any Mac. Definitely best to stick to GPU only for this.
Source: own Mac max and some Nvidia GPU’s
hellomyfrients@reddit
depends on your use case
i use models for personal assistant, minutes to hours is fine (hours for 20+ skillchain calls), being smart/more active parameters wins over speed any time
if you want to build a "chatbot", yeah of course not. a deep researcher? probably fine
IMO we should embrace lower tok/s more, it opens up way more actually practical meaningful use cases that happen at speeds that are still great for peoples' lives (a lot less hardware $ and power, too)
my_story_bot@reddit
Qwen3.6 35b A3b just dropped, that will be smarter and faster than 3.5 27b
Erdnalexa@reddit
That’s what I was about to say, it's SOOO much better
my_story_bot@reddit
Right! Very impressive release.
Erdnalexa@reddit
Spent the evening getting help setting up my server, and never felt the need to fall back to Claude or read the documentation myself. First time that's happened with a local model. After the terrible day I’ve had with Claude Opus 4.7, it even felt better.
Anxious_Comparison77@reddit
To research local llms, they are all junk
StardockEngineer@reddit
5090 is the only way.
putrasherni@reddit
Two R9700
RedParaglider@reddit
Works on the Strix Halo. I think it's a lot to spend at current prices, but it is in your budget.
HopePupal@reddit
speaking as someone who was trying to run 27B on a Strix Halo, there's a reason i own an R9700 now
imonlysmarterthanyou@reddit
As a fellow strix halo owner, what speeds are you seeing?
HopePupal@reddit
on my R9700? https://www.reddit.com/r/LocalLLaMA/comments/1sh1u4k/comment/ofc0i41/
people were debating elsewhere in this thread what might be changed to go faster, so treat that as the floor. roughly 4x faster PP than the same quant on the Strix.
RedParaglider@reddit
Yep. I will say that the new qwen 3.6 MOE is pretty nice on the strix, but it's truly a capability machine not a speed machine.
Klutzy-Snow8016@reddit
Get 2x3090 and use an 8-bit quant (like this one: https://huggingface.co/Qwen/Qwen3.5-27B-FP8) with an inference engine that supports the model's built-in speculative decoding.
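For intuition, speculative decoding amounts to a draft/verify loop: a cheap draft proposes several tokens, the big model verifies them, and only the first mismatch costs a correction. The "models" below are hypothetical toy functions, not a real inference engine.

```python
# Toy sketch of speculative decoding's accept/reject loop. The draft and
# target "models" are stand-in functions, not a real engine; in practice
# the target verifies all k drafted tokens in a single forward pass.

def speculative_step(draft, target, prefix, k=4):
    """Draft k tokens, keep the prefix the target agrees with, plus one correction."""
    proposed = [draft(prefix + i) for i in range(k)]  # cheap guesses
    accepted = []
    for i, tok in enumerate(proposed):
        if target(prefix + i) == tok:
            accepted.append(tok)                      # draft was right, keep it
        else:
            accepted.append(target(prefix + i))       # target's correction
            break
    return accepted

# Toy models that agree on 3 out of every 4 positions:
draft = lambda pos: pos % 7
target = lambda pos: pos % 7 if pos % 4 != 3 else -1
print(speculative_step(draft, target, prefix=0, k=4))  # 4 tokens for ~1 target pass
```

When the draft's acceptance rate is high (as with a model's built-in draft head), you get several tokens per expensive forward pass, which is where the speedup comes from.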
SillyLilBear@reddit
5090 is the goat at this price range
mkMoSs@reddit
You can do what I did: 4x RTX 5060 Ti for a total of 64GB VRAM.
I use a consumer-grade motherboard; 2 GPUs are plugged directly into the board, and I bought 2 NVMe-to-PCIe extender/adapters for the other 2.
1000w PSU is more than enough for all of that.
I run it with vllm. I don't recommend llama.cpp for parallelism.
I get ~50t/s with Qwen3.5 27B NVFP4.
dunnolawl@reddit
If you're looking for a platform with the best creature comforts (no weird form factors, fits into a standard case), the best value would be something like this:
HUANANZHI H12D 8D with EPYC 7532 (cheapest EPYC with full 8-channel DDR4 memory bandwidth) ~$600.
8x 8GB DDR4 RDIMM (8GB DIMMs can still be had on the cheap if you make an offer to an eBay seller, since demand for low-capacity memory chips is low). ~$200
Without the GPUs you'd be looking at around ~$1,200.
For the GPUs you can find a lot of opinions for options (3090, R9700, Intel Arc Pro B70, 7900XTX), but if you're looking at pure value I still think the MI50 16GB can be a contender for pure text inference. You can get 4x MI50 16GB for ~$400 shipped from China and get within the ballpark of ~15 t/s at reasonable context on current llama.cpp (if llama.cpp ever gets proper Tensor parallelism support then the speed will massively increase on a multi GPU setup).
Comparing apples to apples, 3090 vs 2x MI50 (16GB) on current llama.cpp:
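Whatever the head-to-head llama.cpp numbers, the raw dollars-per-GB math works out like this, using prices quoted in this thread (used-market prices vary, so treat these as snapshots):

```python
# $/GB-of-VRAM comparison using prices quoted elsewhere in this thread.
# Used-market prices fluctuate; these are snapshots, not gospel.
options = {
    "4x MI50 16GB": (400, 64),   # ~$400 shipped for four cards
    "RTX 3090":     (800, 24),
    "Intel B70":    (1000, 32),
    "7900 XTX":     (700, 24),
}
for name, (price, vram) in options.items():
    print(f"{name:>14}: ${price / vram:.0f}/GB ({vram} GB for ${price})")
```

Of course $/GB ignores bandwidth, compute, power draw, and software support, which is exactly where the MI50s give ground.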
Darth_Candy@reddit
Are you looking at $3,000 for GPUs on an existing machine or $3,000 for a new machine? Either way, the new Intel B70 (32GB VRAM for $1000) is the best bang-for-your-buck VRAM-wise at the moment (Intel support is best on vLLM, although most people here use llama.cpp on NVIDIA). At Q4, you can get almost 32k token context on one of these cards.
Why Qwen 3.5 27B Q4 specifically? MoE models are generally a lot more efficient on VRAM usage since the activations scale better, so Gemma 4 26B-A4B and Qwen 3.5 (or 3.6) 35B-A3B are worth looking into.
Regarding your first question, there are plenty of model benchmarks and GPU benchmarks, but there are so many possible permutations (backend, hardware, quantization, specific benchmark, etc.) and things change so quickly that no "single source of truth" has really emerged.
HopePupal@reddit
you should be able to get a lot more than 32k context for 27B on any 32 GB card 🤨
also, the reason anyone runs dense models is because they're a lot smarter than MoEs of roughly the same total parameter count. i found Qwen 3.5 35B-A3B nearly unusable for the kind of coding i do.
Darth_Candy@reddit
I just ran it through the APXML online VRAM calculator and got ~30GB VRAM usage for the proposed setup. Is that calculator wrong?
And you’re right, dense is probably the right call since OP is only looking for 15-20 tps.
HopePupal@reddit
i think it's wrong for Qwen Next and 3.5 specifically. 3/4 of the layers on those models are deltanet, not regular quadratic attention, and deltanet has much smaller KV cache requirements.
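Rough KV-cache math shows why having only a quarter of the layers use full attention matters so much. The architecture numbers below are hypothetical placeholders for illustration, NOT the real Qwen3.5 config.

```python
# Why fewer full-attention layers shrink the KV cache. Layer counts, head
# counts, and head_dim here are hypothetical placeholders, not the real
# Qwen3.5 config.

def kv_cache_gb(attn_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    # 2x for K and V; fp16 cache (2 bytes per element) by default
    return 2 * attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

ctx = 131072  # 128k context
full = kv_cache_gb(attn_layers=48, n_kv_heads=8, head_dim=128, ctx=ctx)
hybrid = kv_cache_gb(attn_layers=12, n_kv_heads=8, head_dim=128, ctx=ctx)  # 1/4 of layers
print(f"all-attention: {full:.1f} GB, hybrid: {hybrid:.1f} GB KV cache at 128k")
```

Deltanet layers keep a fixed-size recurrent state that doesn't grow with context, which is why only the full-attention layers are counted here; a calculator that assumes all layers are quadratic attention will overestimate by roughly the layer ratio.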
No_War_8891@reddit
NVIDIA man and a good motherboard with 2 fast pci-e slots
sagiroth@reddit
Heh u dont need as much to achieve that. I have a 3090 and it runs q4 at 200k context at 40+
dunnolawl@reddit
I don't think that's a realistic number. You can't even load Qwen3.5-27B-Q4_K_M with full context and on my 3090 with Q8_0 KV the performance is around half that:
prompt eval time = 175018.21 ms / 133454 tokens (1.31 ms per token, 762.51 tokens per second)
eval time = 36414.87 ms / 799 tokens (45.58 ms per token, 21.94 tokens per second)
total time = 211433.09 ms / 134253 tokens
def_not_jose@reddit
Sounds like quantized cache numbers, and you don't want to do that when coding
sagiroth@reddit
q8
bobaburger@reddit
I think it's realistic if the 3k includes the full PC, not just the card
NetTechMan@reddit (OP)
Seriously? Do you have benchmarks for this? If so that's absurd
sagiroth@reddit
No benchmarks, but plenty of resources on this sub. I run q5 at around 100k as it's more than I need really. No offload to RAM either.
mohelgamal@reddit
you need to get as much hardware as you can afford, because in 2 months there will be another model that you will want to run, and prices aren’t going down anytime soon.
trevorbg@reddit
You could get a DGX spark
ketosoy@reddit
My old MacBook M1 Max and my Strix Halo hit 10-12 tokens per second out of the box. I know that’s below your target, but it’s easy. One is at your budget, the other quite a bit below.
tecneeq@reddit
Two Intel B70 for 64GB VRAM in any PC that runs Linux. You could run Q8 and 256k context.
Dolboyob77@reddit
Just buy a mini pc with an AMD 395+ and 128GB unified memory, you will be fine. Under 3k ))))
ProfessionalSpend589@reddit
No, it’ll be slow. Dense models require a GPU to run decently.
Source: me. I bought a GPU to run 20B-30B dense models faster than Strix Halo.
Dolboyob77@reddit
I use a mini pc with dual intel arc pro b70 but many users say that strix halo is fantastic for big models so I thought it would suit you… nevermind then )))
ProfessionalSpend589@reddit
I’m not OP :)
It’s great for MoEs and to test things out.
But its iGPU is weak and RAM bandwidth is slow. If a dense model can fit on a GPU it’ll be several times faster (at least 3 times on modern GPUs with hardware optimisations).
Either_Pineapple3429@reddit
Buy an old server workstation like a T7910 ($300) and then buy 2-3 3090s ($800-$1000)
urekmazino_0@reddit
Buy 2x3090s
SkinnyCTAX@reddit
Just grab two rtx 3090s off marketplace, usually around $800, and then figure out the cheapest mobo and ram setup you can build around it with decent pcie support.
mslindqu@reddit
They've dried up pretty hard around here. And now eBay is full of scam listings... anyone local seems to want like 1k... ridiculous.
SmallHoggy@reddit
OP, the PCIe lane splitting is important. Most boards cannot split x8/x8 electrically, so check before you buy.
xeeff@reddit
buy an 2x 3090 and buy me a couple too, keep the rest