I've got $3000 to make Qwen3.5 27B Q4 run, what do I need?
Posted by NetTechMan@reddit | LocalLLaMA | View on Reddit | 76 comments
I'm having a hard time determining the hardware I need to run a model like this, and I'm a bit confused by the number of resources publicly available. Is there a centralized hardware benchmark platform for these models, or is it all just hearsay from the community?
Along those lines, how could I make 3k stretch to work? I'm looking for about 15-20t/s.
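A rough way to sanity-check any hardware suggestion against a 15-20 t/s target: dense-model token generation is approximately memory-bandwidth-bound, so tokens/s can't exceed bandwidth divided by weight size. The figures below are illustrative assumptions, not measured specs.

```python
# Back-of-envelope ceiling: for a dense model, token generation is roughly
# memory-bandwidth-bound, so t/s is at most bandwidth / weight bytes.
# All numbers here are assumptions for illustration, not measured specs.

def max_tg_tokens_per_s(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on token generation speed (ignores KV cache reads and overhead)."""
    weight_gb = params_b * bits_per_weight / 8  # GB of weights streamed per token
    return bandwidth_gb_s / weight_gb

# A 27B dense model at ~4.5 bits/weight (typical for a Q4 GGUF) is ~15 GB of weights.
for name, bw in [("~640 GB/s card", 640), ("~960 GB/s card", 960)]:
    print(f"{name}: ceiling of about {max_tg_tokens_per_s(27, 4.5, bw):.0f} t/s")
```

Real-world numbers land well below this ceiling, so the 20-30 t/s figures reported later in this thread are consistent with it.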
Woof9000@reddit
If I had that much money to spare, I'd get R9700.
But I'm linux-only type of guy, so I'm always biased towards team red.
ProfessionalSpend589@reddit
It’s not that fast: https://www.reddit.com/r/LocalLLaMA/comments/1sh1u4k/comment/ofc0i41/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
Woof9000@reddit
No idea how that person got those numbers. OP is looking for 15-20tps, and my current system with 2x 9060xt does up to 15tps, so if the R9700 can't reach 20tps (with Qwen3.5 27B Q4), then most likely there's some issue somewhere in the software stack. Gotta update the kernel, Vulkan SDK, llama.cpp/vllm, or smth.
HopePupal@reddit
quick update to this, re-benched today after two changes:
- Ryzen 1800X → Ryzen 5900XT
- PCIe ReBAR enabled (previous GPU wasn't ReBAR-capable but i forgot to turn it on after upgrading)
software, config, etc. entirely the same other than that.
now hitting between 19 and 24 t/s TG for Q6_K, depending on context depth. i'll run another bench with ReBAR off to see if it was that or the CPU.
Woof9000@reddit
New numbers look much closer to what I'd expect from that card.
HopePupal@reddit
see this is why we share numbers, i just got a ~50% TG boost for free. thanks r/LocalLLaMA
Woof9000@reddit
I forgot we even had those settings. ReBAR and Above 4G Decoding were just something I enabled years ago and completely forgot existed.
But it's great to hear you figured it out.
ProfessionalSpend589@reddit
That user tested with Q6_K. Q4 would be faster.
Woof9000@reddit
That user's setup is not known, so for all intents and purposes it's merely an edge case.
Here's an example of more thorough and better documented testing (with TG up to 30 tps):
https://github.com/ggml-org/llama.cpp/discussions/21043
HopePupal@reddit
y'all could have just mentioned me. those numbers are on llama.cpp (not sure about the build), Linux 6.17 (Fedora Bazzite), Vulkan backend, Ryzen 1800X, card in a PCIe Gen 3 x16 slot, 2x 16 GB DDR4 modules (not sure about the clock speed). llama settings basically just default with auto-fit.
Woof9000@reddit
Thanks. That looks like an older system, but the rest of the HW shouldn't matter much. I'm fairly sure your numbers would improve with a more recent kernel, Mesa, and a fresh compile of llama.cpp against the latest Vulkan SDK.
HopePupal@reddit
i actually dropped in a 5900XT yesterday, so if the CPU was a bottleneck before, it's less of one now. i agree that it's probably not, though. (i upgraded because that machine also hosts agent isolation VMs.)
newer kernel should also unblock ROCm
Woof9000@reddit
Probably a newer kernel would help. Apart from that, I don't have much confidence in llama-bench these days. At least on my system (dual 9060), during actual use with llama.cpp server I get a fairly consistent 14-15 tps, while whenever I run llama-bench the results are all over the place, from 7 to 16 tps, depending on the settings/options I use, day of the week, or phase of the moon. Really not sure what's going on with it and I have no patience for figuring it out.
HopePupal@reddit
i was using llama-benchy anyway, which tests by calling any OpenAI-compatible HTTP API
NetTechMan@reddit (OP)
That actually works for my budget. Let me look into those.
HopePupal@reddit
OP i'm now hitting 19-25 t/s on Q6_K and 28-30 on Q4_K_M with my R9700 (after some troubleshooting elsewhere in this thread) so that should definitely do it for ya
SexyAlienHotTubWater@reddit
If you want to buy one card, it's a convenient option, but it's very very noisy. I just returned mine.
2x7900 xtx is around the same price, but with 3x the aggregate bandwidth and 3x the aggregate FLOPs, with 48GB VRAM. If you can find a dual-GPU board, that's probably a better option.
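The "3x the aggregate bandwidth" claim checks out against commonly quoted spec-sheet figures; treat the numbers below as assumptions rather than authoritative specs.

```python
# Sanity-checking the "3x aggregate bandwidth" claim with commonly quoted
# spec-sheet figures (assumptions for illustration, not authoritative).
r9700_bw = 640   # GB/s, commonly quoted for the R9700 (256-bit GDDR6)
xtx_bw = 960     # GB/s per 7900 XTX
aggregate = 2 * xtx_bw
print(f"2x 7900 XTX: {aggregate} GB/s vs R9700: {r9700_bw} GB/s "
      f"(~{aggregate / r9700_bw:.1f}x)")
```

Caveat: llama.cpp's default layer-split across two GPUs doesn't fully exploit aggregate bandwidth for a single request, so the real-world speedup is smaller than the raw ratio.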
Woof9000@reddit
Here's a decently documented example of somebody testing Qwen3.5-27B Q4 on R9700 with llama.cpp and getting up to 30tps:
https://github.com/ggml-org/llama.cpp/discussions/21043
Blackdragon1400@reddit
What? NVIDIA has always had better driver support in Linux lmao
AMD lags in support across all platforms except closed ones (consoles)
LuckyZero@reddit
you know that picture of Linus Torvalds giving the finger? That finger was for Nvidia.
mustafar0111@reddit
As someone who has both Nvidia and AMD GPU's you have no idea at all what you are talking about.
If you bought something from Nvidia in the R9700 Pro's price range, the R9700 Pro would run circles around it.
Woof9000@reddit
"NVIDIA has always had better driver support in Linux" - LOL
I used Nvidia GPUs on Linux for a couple of years, and it was such an "amazing" experience that the moment the 9060 hit the shelves, I unloaded all my Nvidia stock to eBay, even knowing initial driver support for the new cards would be less than stellar.
And now, after half a year with it, I can confidently say I've never been happier with my system. Low power usage, low cost, great for all kinds of uses, gaming and AI, not just one or the other. Previously I had to maintain two systems, one for gaming and another for AI; now one system rules them all.
lemon07r@reddit
Just a 7900 XTX would do and is probably the cheapest way to get what you want. These go for around 700 USD in my area, sometimes a lil less. RTX 3090s are a good option too, but usually more expensive, and not as fast for non-AI stuff. Vulkan and ROCm support for LLM inference is great, and pretty easy to set up now too. Since the model will fit completely in VRAM, it will be a lot faster than your 15-20t/s target.
flavio_geo@reddit
This is correct.
I use a 7900 XTX (24GB VRAM),
running Qwen3.5 IQ4_XS Unsloth with 240k context and F16 mmproj on llama.cpp Vulkan:
pp512 = 950 t/s
tg128 = 41 t/s
grumd@reddit
Do you get better pp speed with ubatch 2048-4096? I usually test pp4096 btw to see results closer to real world
flavio_geo@reddit
Going higher than ubatch 1024 yields very little gain in PP, and costs valuable VRAM space
grumd@reddit
For me with 122b it's worth it, I get 1000pp with 1024 ub and 1500pp with 2048ub. I just increase ub until it's not worth it anymore
pwlee@reddit
Yes, I have 2 of them and 27b Q4 can run on a single GPU. Expect 25-30t/s generation, 200-500t/s prompt processing.
I'd recommend llama.cpp since vllm was difficult for me to set up using debian 13.
InuRyu@reddit
with 2, can you realistically run q6 or q8?
d4t1983@reddit
What kinda performance do you get with two compared to one out of interest?
GMerton@reddit
I think you need to be a bit more specific about your budget. Do you have a pc and just need to swap GPU? Do you care about electricity cost? Without buying used stuff, a pc without GPU is probably $1000. Electricity difference between Mac and discrete GPU can be another $700 over 2 years if you run them near 24/7.
I did quite a lot of research for my use case (mainly deep research). Will just wait for M5 Max that will hopefully get released in a few months.
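The ~$700-over-2-years electricity figure pencils out under plausible assumptions; the wattage delta and rate below are illustrative guesses, not measurements.

```python
# Rough check of the electricity-cost gap between a Mac and a discrete-GPU
# box running near 24/7. Wattage delta and rate are illustrative guesses.
idle_delta_w = 200        # assumed extra average draw of the discrete-GPU box
hours = 24 * 365 * 2      # two years of near-24/7 uptime
rate_per_kwh = 0.20       # assumed USD per kWh
cost = idle_delta_w / 1000 * hours * rate_per_kwh
print(f"~${cost:.0f} extra over two years")
```

At lower duty cycles or cheaper electricity the gap shrinks proportionally, so this mainly matters for always-on setups.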
FalconX88@reddit
Alternative to the 3090s if you want to just plug and play: The best Mac Studio you can get, although for Q4 probably 24GB of memory would already be enough.
Automatic-Arm8153@reddit
27b is a dense model and would be painfully slow on any Mac. Definitely best to stick to GPU only for this.
Source: own Mac max and some Nvidia GPU’s
hellomyfrients@reddit
depends on your use case
i use models for personal assistant, minutes to hours is fine (hours for 20+ skillchain calls), being smart/more active parameters wins over speed any time
if you want to build a "chatbot", yeah of course not. a deep researcher? probably fine
IMO we should embrace lower tok/s more, it opens up way more actually practical meaningful use cases that happen at speeds that are still great for peoples' lives (a lot less hardware $ and power, too)
my_story_bot@reddit
Qwen3.6 35b A3b just dropped, that will be smarter and faster than 3.5 27b
Erdnalexa@reddit
That’s what I was about to say, it's SOOO much better
my_story_bot@reddit
Right! Very impressive release.
Erdnalexa@reddit
Spent the evening getting help setting up my server, and never felt the need to fall back to Claude or read the documentation myself. First time that's happened with a local model. After the terrible day I’ve had with Claude Opus 4.7, it even felt better.
Anxious_Comparison77@reddit
To research local llms, they are all junk
StardockEngineer@reddit
5090 is the only way.
putrasherni@reddit
Two R9700
RedParaglider@reddit
Works on the Strix Halo. I think it's a lot to spend at current prices, but it is in your budget.
HopePupal@reddit
speaking as someone who was trying to run 27B on a Strix Halo, there's a reason i own an R9700 now
imonlysmarterthanyou@reddit
As a fellow strix halo owner, what speeds are you seeing?
HopePupal@reddit
on my R9700? https://www.reddit.com/r/LocalLLaMA/comments/1sh1u4k/comment/ofc0i41/
people were debating elsewhere in this thread what might be changed to go faster, so treat that as the floor. roughly 4x faster PP than the same quant on the Strix.
RedParaglider@reddit
Yep. I will say that the new qwen 3.6 MOE is pretty nice on the strix, but it's truly a capability machine not a speed machine.
Klutzy-Snow8016@reddit
Get 2x3090 and use an 8-bit quant (like this one: https://huggingface.co/Qwen/Qwen3.5-27B-FP8) with an inference engine that supports the model's built-in speculative decoding.
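For intuition, speculative decoding amounts to a draft/verify loop: a cheap draft proposes several tokens, the big model verifies them, and only the first mismatch costs a correction. The "models" below are hypothetical toy functions, not a real inference engine.

```python
# Toy sketch of speculative decoding's accept/reject loop. The draft and
# target "models" are stand-in functions, not a real engine; in practice
# the target verifies all k drafted tokens in a single forward pass.

def speculative_step(draft, target, prefix, k=4):
    """Draft k tokens, keep the prefix the target agrees with, plus one correction."""
    proposed = [draft(prefix + i) for i in range(k)]  # cheap guesses
    accepted = []
    for i, tok in enumerate(proposed):
        if target(prefix + i) == tok:
            accepted.append(tok)                      # draft was right, keep it
        else:
            accepted.append(target(prefix + i))       # target's correction
            break
    return accepted

# Toy models that agree on 3 out of every 4 positions:
draft = lambda pos: pos % 7
target = lambda pos: pos % 7 if pos % 4 != 3 else -1
print(speculative_step(draft, target, prefix=0, k=4))  # 4 tokens for ~1 target pass
```

When the draft's acceptance rate is high (as with a model's built-in draft head), you get several tokens per expensive forward pass, which is where the speedup comes from.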
SillyLilBear@reddit
5090 is the goat at this price range
mkMoSs@reddit
You can do what I did: 4x RTX 5060 Ti for a total of 64GB VRAM.
I use a consumer-grade motherboard; 2 GPUs are plugged directly into the board, and I bought 2 NVMe-to-PCIe extender/adapters for the other 2.
1000w PSU is more than enough for all of that.
I run it with vllm. I don't recommend llama.cpp for parallelism.
I get ~50t/s with Qwen3.5 27B NVFP4.
dunnolawl@reddit
If you're looking for a platform with the best creature comforts (no weird form factors, fits into a standard case), the best value would be something like this:
HUANANZHI H12D 8D with EPYC 7532 (cheapest EPYC with full 8-channel DDR4 memory bandwidth) ~$600.
8x 8GB DDR4 RDIMM (8GB DIMMs can still be had on the cheap if you make an offer to an eBay seller, since demand for low-capacity memory chips is low). ~$200
Without the GPUs you'd be looking at around ~$1,200.
For the GPUs you can find a lot of opinions for options (3090, R9700, Intel Arc Pro B70, 7900XTX), but if you're looking at pure value I still think the MI50 16GB can be a contender for pure text inference. You can get 4x MI50 16GB for ~$400 shipped from China and get within the ballpark of ~15 t/s at reasonable context on current llama.cpp (if llama.cpp ever gets proper Tensor parallelism support then the speed will massively increase on a multi GPU setup).
Comparing apples to apples, 3090 vs 2x MI50 (16GB) on current llama.cpp:
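Whatever the head-to-head llama.cpp numbers, the raw dollars-per-GB math works out like this, using prices quoted in this thread (used-market prices vary, so treat these as snapshots):

```python
# $/GB-of-VRAM comparison using prices quoted elsewhere in this thread.
# Used-market prices fluctuate; these are snapshots, not gospel.
options = {
    "4x MI50 16GB": (400, 64),   # ~$400 shipped for four cards
    "RTX 3090":     (800, 24),
    "Intel B70":    (1000, 32),
    "7900 XTX":     (700, 24),
}
for name, (price, vram) in options.items():
    print(f"{name:>14}: ${price / vram:.0f}/GB ({vram} GB for ${price})")
```

Of course $/GB ignores bandwidth, compute, power draw, and software support, which is exactly where the MI50s give ground.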
Darth_Candy@reddit
Are you looking at $3,000 for GPUs on an existing machine or $3,000 for a new machine? Either way, the new Intel B70 (32GB VRAM for $1000) is the best bang-for-your-buck VRAM-wise at the moment (Intel support is best on vLLM, although most people here use llama.cpp on NVIDIA). At Q4, you can get almost 32k token context on one of these cards.
Why Qwen 3.5 27B Q4 specifically? MoE models are generally a lot more efficient on VRAM usage since the activations scale better, so Gemma 4 26B-A4B and Qwen 3.5 (or 3.6) 35B-A3B are worth looking into.
Regarding your first question, there are plenty of model benchmarks and GPU benchmarks, but there are so many possible permutations (backend, hardware, quantization, specific benchmark, etc.) and things change so quickly that no "single source of truth" has really emerged.
HopePupal@reddit
you should be able to get a lot more than 32k context for 27B on any 32 GB card 🤨
also, the reason anyone runs dense models is because they're a lot smarter than MoEs of roughly the same total parameter count. i found Qwen 3.5 35B-A3B nearly unusable for the kind of coding i do.
Darth_Candy@reddit
I just ran it through the APXML online VRAM calculator and got ~30GB VRAM usage for the proposed setup. Is that calculator wrong?
And you’re right, dense is probably the right call since OP is only looking for 15-20 tps.
HopePupal@reddit
i think it's wrong for Qwen Next and 3.5 specifically. 3/4 of the layers on those models are deltanet, not regular quadratic attention, and deltanet has much smaller KV cache requirements.
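Rough KV-cache math shows why having only a quarter of the layers use full attention matters so much. The architecture numbers below are hypothetical placeholders for illustration, NOT the real Qwen3.5 config.

```python
# Why fewer full-attention layers shrink the KV cache. Layer counts, head
# counts, and head_dim here are hypothetical placeholders, not the real
# Qwen3.5 config.

def kv_cache_gb(attn_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    # 2x for K and V; fp16 cache (2 bytes per element) by default
    return 2 * attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

ctx = 131072  # 128k context
full = kv_cache_gb(attn_layers=48, n_kv_heads=8, head_dim=128, ctx=ctx)
hybrid = kv_cache_gb(attn_layers=12, n_kv_heads=8, head_dim=128, ctx=ctx)  # 1/4 of layers
print(f"all-attention: {full:.1f} GB, hybrid: {hybrid:.1f} GB KV cache at 128k")
```

Deltanet layers keep a fixed-size recurrent state that doesn't grow with context, which is why only the full-attention layers are counted here; a calculator that assumes all layers are quadratic attention will overestimate by roughly the layer ratio.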
No_War_8891@reddit
NVIDIA man and a good motherboard with 2 fast pci-e slots
sagiroth@reddit
Heh u dont need as much to achieve that. I have a 3090 and it runs q4 at 200k context at 40+
dunnolawl@reddit
I don't think that's a realistic number. You can't even load Qwen3.5-27B-Q4_K_M with full context and on my 3090 with Q8_0 KV the performance is around half that:
prompt eval time = 175018.21 ms / 133454 tokens (1.31 ms per token, 762.51 tokens per second)
eval time = 36414.87 ms / 799 tokens (45.58 ms per token, 21.94 tokens per second)
total time = 211433.09 ms / 134253 tokens
def_not_jose@reddit
Sounds like quantized cache numbers, and you don't want to do that when coding
sagiroth@reddit
q8
bobaburger@reddit
I think it's realistic if the 3k includes the full PC, not just the card
NetTechMan@reddit (OP)
Seriously? Do you have benchmarks for this? If so that's absurd
sagiroth@reddit
No benchmarks, but plenty of resources on this sub. I run q5 at around 100k as it's more than I need really. No offload to RAM either.
mohelgamal@reddit
you need to get as much hardware as you can afford, because in 2 months there will be another model that you will want to run, and prices aren’t going down anytime soon.
trevorbg@reddit
You could get a DGX spark
ketosoy@reddit
My old MacBook M1 Max and my Strix Halo hit 10-12 tokens per second out of the box. I know that’s below your target, but it’s easy. One is at your budget, the other quite a bit below.
tecneeq@reddit
Two Intel B70 for 64GB VRAM in any PC that runs Linux. You could run Q8 and 256k context.
Dolboyob77@reddit
Just buy a mini pc with an AMD 395+ and 128GB unified memory, you will be fine. Under 3k ))))
ProfessionalSpend589@reddit
No, it’ll be slow. Dense models require a GPU to run decently.
Source: me. I bought a GPU to run 20B-30B dense models faster than Strix Halo.
Dolboyob77@reddit
I use a mini pc with dual intel arc pro b70 but many users say that strix halo is fantastic for big models so I thought it would suit you… nevermind then )))
ProfessionalSpend589@reddit
I’m not OP :)
It’s great for MoEs and to test things out.
But its iGPU is weak and RAM bandwidth is slow. If a dense model can fit on a GPU it’ll be several times faster (at least 3 times on modern GPUs with hardware optimisations).
Either_Pineapple3429@reddit
Buy an old server workstation like a T7910 ($300) and then buy 2-3 3090s ($800-$1000)
urekmazino_0@reddit
Buy 2x3090s
SkinnyCTAX@reddit
Just grab two rtx 3090s off marketplace, usually around $800, and then figure out the cheapest mobo and ram setup you can build around it with decent pcie support.
mslindqu@reddit
They've dried up pretty hard around here. And now eBay is full of scam listings... anyone local seems to want like 1k... ridiculous.
SmallHoggy@reddit
OP, the PCIe lane splitting is important. Most boards cannot split x8/x8 electrically, so check before you buy.
xeeff@reddit
buy an 2x 3090 and buy me a couple too, keep the rest