Hardware Choice for 27b to 31b models.
Posted by rebelSun25@reddit | LocalLLaMA | 104 comments
I've come to a point where I find the 27b and 31b models quite impressive.
I have a 16 GB AMD Radeon 7800 XT. It performs quite well, and it was $700. Here is my question:
Is the performance hit of a dual-GPU approach worth it if I save around $400 over a single larger card? Is 32 GB even a meaningful step up, and is running an R9700 Pro alongside a second 7800 XT for a total of 48 GB a more realistic requirement for models this size?
I would like to have more VRAM for running these models. I could go with dual 16 GB cards or a single larger card, but here's the cost difference:
A)
Sell 7800xt for $550.
Buy a single R9700 Pro, 32 GB, $1900 + tax. Final cost $1600.
B)
Add a second 7800 XT, $550 on the second-hand market. Final cost $700 + $550.
C)
Add an R9700 Pro: total price $1900 + tax, plus the $700 already spent.
Price isn't the deciding factor; I'm only outlining the difference so it can be weighed against performance, to decide if it's even worth it.
The memory bandwidth of these cards is the same; the difference is that there's a second PCIe device in the mix.
I've been using llama.cpp and like it, but vLLM is an option if a dual-GPU setup runs better on it.
Radiant_Condition861@reddit
Dual 3090s with NVLink. I get 30-150 tok/s with KV cache and model quantization; the 3090 has INT4 acceleration, and 5-step speculative decode is the big speed boost, depending on cache hits.
Kofeb@reddit
Wow. Thank you! 🙏
E1Extrano@reddit
What motherboard do you have for the 3090s?
Radiant_Condition861@reddit
Dell Precision 7865. I love this thing. Rack mountable too.
https://www.microcenter.com/product/663925/dell-precision-7865-tower-workstation-desktop-computer
kapitanfind-us@reddit
Two questions for you:
1) why do you have a custom chat template? 2) what is compressed-tensors quantization?
Apologies if I'm lazy, I will start googling next 🙄
Radiant_Condition861@reddit
I was troubleshooting the tool-calling problem. I thought it was the template, but it turned out I had to increase the output tokens (maxTokens) in pi to 64k.
It's a technique for compressing the model's weights. If I understand it correctly, this setting keeps the compressed weights staged in memory for faster inference.
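In case it helps, here is a minimal sketch (not my exact setup) of loading a compressed-tensors quantized checkpoint through vLLM's Python API; the model ID and sizes are placeholders, not anything I'm specifically recommending:

```python
# Minimal sketch: serving a compressed-tensors quantized model with vLLM.
# The model ID below is a placeholder; substitute the checkpoint you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-27b-w4a16",    # hypothetical compressed-tensors checkpoint
    quantization="compressed-tensors",  # weights stored in the compressed-tensors format
    tensor_parallel_size=2,             # split across two GPUs (e.g. dual 3090s)
    max_model_len=32768,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain KV cache quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```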
drallcom3@reddit
30+ with those? Damn. With my single 5060 16gb I get like 1.
2Norn@reddit
i mean if ur loading the same model as him yeah that makes sense
Kahvana@reddit
Personally I wouldn't use the 7800 XT, it's very power hungry compared to other options.
I'm running 2x 5060 Ti. 20 t/s generation (degrading to 14 t/s at 100k context) isn't fast, but it gets the job done. For that price and energy draw, it's well worth it.
If I had to buy a new GPU today, I would've gotten the R9700 Pro instead. Still very good energy use, and 32 GB on a single card (helps with fitting model layers) vs 32 GB split across two cards (some models might not fit, as layers can't always be offloaded neatly). The only downsides I've heard are the very loud blower-style fan and the lack of CUDA.
With the single R9700 Pro, you also leave room to expand later in a consumer case / motherboard. Well worth it.
foxpro79@reddit
Just bought the 9700 for a test and will return it for the noise. Cannot believe how loud it is. On par with having a hair dryer right beside you.
horeaper@reddit
Buy a custom water cooler for it, works great.
ProfessionalSpend589@reddit
I tolerate it (I have two behind my PC table), but most people will probably not like it.
The blower fan is definitely not as silent as a Noctua fan.
Spare-Ad-4810@reddit
Dual 3090s are always the budget GPU option.
starkruzr@reddit
that's over like $2K at this point at least on eBay in the US, idk that we can call it budget anymore
2Norn@reddit
compared to the alternatives, yes, it's budget
starkruzr@reddit
"pair of 5060Tis" makes a lot more sense budget-wise in terms of both available precision levels and total VRAM per dollar
2Norn@reddit
5060 ti has less than half the speed of 3090 tho... and its same price as used 3090...
u get less vram less speed just to have brand new and blackwell which wont do anything for you
Tenuous_Fawn@reddit
A new 5060ti 16gb is $550, a used 3090 is $1000
2Norn@reddit
a 16gb 5060 ti and used 3090 are the same price here about 600usd
Tenuous_Fawn@reddit
Here it's 1000 usd https://imgur.com/a/WEBajuY
2Norn@reddit
idk whats that supposed to prove
are you claiming entire used 3090 supply worldwide goes through that website?
starkruzr@reddit
they may not be global prices but they are national
2Norn@reddit
so? news flash: things don't cost the same in 2 different countries, more at 5...
Tenuous_Fawn@reddit
It's supposed to prove that the market price for a used 3090 here is about $1000 on ebay
2Norn@reddit
okay so its 1000 there and 600 here
what now?
Tenuous_Fawn@reddit
What is the value of knowledge? What is the purpose of living? What is the market price of a used RTX 3090? These are the mysteries pursued by the great philosophers of our day, u/Tenuous_Fawn and u/2Norn
munkiemagik@reddit
People can 'demand' $1000 but they can f*(k right off if they think most of us are going to give them $1000 for a battered 3090, greedy begging knob jocks!
Just keep an eye on eBay especially the non-FE cards. If you do a filter for completed + auctioned you will see they generally tend to end up selling for a lot less. Good luck hunting mate!
Tenuous_Fawn@reddit
This is with sold + auctioned:
https://imgur.com/a/WEBajuY
Hodler-mane@reddit
even a single 3090 at int4 on 27b dense will be a good choice for using it at home. 2x opens that up to 8 parallel sessions
Caffdy@reddit
what do you mean by that? interacting with just one instance of 27B@Q4 on my 3090 already makes it use full power
Hodler-mane@reddit
you can batch requests and get a much higher aggregate tps.
Instead of one session at 90 tps, you can do 2 at, say, 70; 4 at 55; 8 at 40 (these numbers aren't exact).
8 sessions at 40 would give you a total of 320 tps.
But to do this you need 8x the KV cache stored in VRAM, and a single 3090 only has room for about one. So if you get another card, you get the benefit of its additional processing power to increase overall tps, plus an additional 24 GB of VRAM just for KV caches = 8 sessions.
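If you want to see what that looks like in practice, here's a minimal sketch of firing parallel requests at a local OpenAI-compatible server (llama-server or vLLM) and measuring aggregate throughput; the URL, model name, and session count are placeholders:

```python
# Minimal sketch: issue N parallel chat requests against a local
# OpenAI-compatible endpoint and report aggregate tokens/sec.
# Endpoint URL, model name and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
N_SESSIONS = 8  # how many requests to keep in flight at once

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="local-27b",  # placeholder; use whatever the server reports
        messages=[{"role": "user", "content": f"Summarize topic #{i} in 200 words."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=N_SESSIONS) as pool:
    total_tokens = sum(pool.map(one_request, range(N_SESSIONS)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} aggregate tok/s")
```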
Hedede@reddit
Full power doesn't mean the card is fully utilised, since it's still burning power while waiting for data to arrive from memory.
samandiriel@reddit
Confirmed; it's what we have set up, and the hype is true - it's the sweet spot in terms of power/price/performance trade off for non-millionaires amongst us
Spare-Ad-4810@reddit
Same, it's great. The Pro WS X570-ACE is a great budget board to pair with it, along with a Ryzen 9 5950X and 128 GB of RAM. I regret not going bigger, seeing what prices are a couple of years later. I paid $260 for the RAM in Feb 2024; the same RAM is now $1100. What the fuck.
RomanticDepressive@reddit
Have you OCed the memory for more bandwidth? So far my 5950x+dual 3090s feel more limited by system ram/pcie bandwidth. I got up to 3600mhz @ cl 16 but feel I could maybe do slightly better
Spare-Ad-4810@reddit
Actually the opposite: I got better performance and stability clocking down to 3200 and FCLK to 1600, and changing a lot of motherboard settings, like making sure the PCIe gen is not on auto, and some others.
RomanticDepressive@reddit
Interesting… I'm currently 2+ years stable at my current settings. My biggest hurdle seems to be my FCLK at 1800. I'd love to break beyond that, but it takes time to test.
Spare-Ad-4810@reddit
Ill do some benchmarks this week and we can compare
TheOnlyBen2@reddit
I agree, but I am surprised by how few resources I can find on optimized configurations for this setup. Or am I not looking in the right places?
Spare-Ad-4810@reddit
I mean, this sub alone has a lot.
TheOnlyBen2@reddit
Available information quickly becomes outdated as new models come out.
Spare-Ad-4810@reddit
That's the benefit of using a 6-year-old card: hardware-wise, the information stays pretty much the same, same with the Linux OS setup. Then once you've decided that, ChatGPT or whatever your preferred flavor is can help you decide on ollama vs llama.cpp.
In fact, you'd be hard-pressed to find a build more thoroughly explored on this sub than the dual 3090 setup.
TheOnlyBen2@reddit
Well, I have 2 RTX 3090 with NVLINK and kinda struggle to see what my best setup would be.
You have to pick a quant format (AWQ, GGUF), a quant provider, and decide whether you should parallelize by tensor, by layer, or maybe run on a single card.
Then whether you should use vLLM or llama.cpp, mostly depending on your quant. But wait, do you really need a quant or can you go FP8? What about the quant for your KV cache? Should you quant the model harder for more context? What's the sweet spot between all those parameters?
Wait, it also depends on your use case, so one setting for one person is not good for another.
So you ask Gemini or ChatGPT and their advice is crap. I tried and got worse perf than the bare config.
So I suppose I'm doing something wrong somewhere, but despite this being a common config, it doesn't feel easy to find optimized recommendations.
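For context, here's a minimal sketch of the llama.cpp knobs I mean, launched from Python; the model path and values are placeholders, and the exact flag names can differ between llama.cpp versions:

```python
# Minimal sketch: launching llama-server with the knobs discussed above
# (GPU offload, tensor/layer split, KV cache quantization, context size).
# Paths and values are placeholders; flag spellings vary across llama.cpp versions.
import subprocess

cmd = [
    "llama-server",
    "-m", "/models/some-27b-q8_0.gguf",  # hypothetical GGUF quant
    "-ngl", "99",                        # offload all layers to the GPUs
    "--split-mode", "row",               # split tensors across both cards ("layer" is the other option)
    "--tensor-split", "1,1",             # even split between the two GPUs
    "--ctx-size", "65536",               # context length to reserve KV cache for
    "-ctk", "q8_0", "-ctv", "q8_0",      # quantize K/V caches (V-cache quant may need flash attention, depending on build)
]
subprocess.run(cmd, check=True)
```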
Spare-Ad-4810@reddit
That's part of the fun of a local rig: testing and benchmarking all of those. And with agent systems crazy easy to set up, you could have Hermes or oh my agent run benchmarks for a few days and tell you definitively what your best setup is. Though it sounds like you're not even sure what exactly you need, so good luck buddy.
TheOnlyBen2@reddit
Well, either there are plenty of resources, or figuring it out by yourself is part of the fun. Pick one, buddy.
Spare-Ad-4810@reddit
It's both. Definitely both. You may find 50% of one guide helpful and 50% of another. If you want your hand held, just say so.
sleepingsysadmin@reddit
You'll find that the AMD R9700 has poor memory bandwidth, and running a dense model on it will get 20-25 TPS. Closer to 10-15 when you have any reasonable amount of context in use.
The people telling you to buy a 3090 aren't telling you that you're only getting ~120k context.
Long story short.
You're going 5090 or RTX pro 5000 for 27b.
Kagemand@reddit
Prompt processing also matters a lot, and you don't need more than 32GB of VRAM for 27B. So I'd say 2x 9070 is still a good option?
sleepingsysadmin@reddit
2x 9700 means you're going server CPU and mobo and still ending up about 50% slower than a 5090. You don't just double your bandwidth in multi-GPU setups. Though yes, tensor and row splits can be better, but I'm doubtful.
While basically being the same price as the 5090.
Kagemand@reddit
I am not sure a server CPU and mobo are really necessary for 2x GPUs. Sure, one of the cards will be on 4x PCIe, but my impression is that it isn't a huge problem; correct me if I am wrong.
TiK4D@reddit
I have 2x R9700s on my ASRock X870 Taichi Creator board at x8/x8. It's not great, but very usable.
Kagemand@reddit
Out of interest, what does not great mean?
TiK4D@reddit
My goal was to run smaller dense models with a large context super quickly, but I just didn't do enough research before buying. With qwen-3.6-27b I get about 17 tok/sec, which I just don't think justifies how much I spent on everything.
akira3weet@reddit
Are you sure you are running tensor parallel? That sounds like single r9700 speed.
Kagemand@reddit
17 tok/sec sound low for a dual gpu setup.
TiK4D@reddit
Any AI I ask seems to say it's about what's expected with my hardware.
sleepy_roger@reddit
Inference token generation has very little to do with PCIe bandwidth; it mostly affects prefill.
sleepingsysadmin@reddit
>Sure, one of the cards will be on 4x PCIe on a consumer board, but my impression is that it isn’t a huge problem, but sure correct me if I am wrong.
We are trying to compare apples to apples here. If you're allowing very significant limiters to further reduce performance, then now you need 3x r9700 to get similar performance, maybe.
rebelSun25@reddit (OP)
Thanks for being realistic about it. 🤝 That's why I'm asking, hoping others know much better than I can guess.
sleepingsysadmin@reddit
Lets be further realistic.
If AI is your hobby.
Spending $5000 on a 5090 sounds like a lot.
But $5000 in golf clubs?
$5000 in tires, rims, and supercharger? Cobb stage 1?
It's really not unreasonable, and the resale value of the 5090 will hold for 5+ years.
Oh boy, I'm really convincing myself lol.
munkiemagik@reddit
So too will the resale value of an RTX 6000 Pro 😂
sleepy_roger@reddit
$5000 on a 5090 is insane. They're $3200-3500 at Micro Center. Every day I realize how lucky I was to get 2 FEs from Best Buy for $2000 each.
For $5k I'd go with the RTX Pro 5000 with 48GB of VRAM, or 2x 5090s.
Radiant_Condition861@reddit
did you upgrade your house's electrical circuit or cut back on power consumption via nvidia-smi -pl?
sleepy_roger@reddit
For the 3090 box I run them all at 250W; I've got a 20-amp circuit in the room that runs that box. The most it pulls is like 1600W-ish. The 2x 5090 box runs on a different circuit; the most it pulls is around 1100 or so. I have those limited to 460W each.
Caffdy@reddit
what about your 5090s? did you power limit them as well? if so, to which power target?
sleepy_roger@reddit
Yeah 460w for both
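In case it's useful, a minimal sketch of setting caps like that from Python by shelling out to nvidia-smi; the wattages are just the numbers above, and applying them usually needs root/admin:

```python
# Minimal sketch: cap each GPU's power draw with nvidia-smi -pl.
# The limits mirror the figures mentioned above; adjust for your cards.
# Requires nvidia-smi on PATH and, typically, root/admin privileges.
import subprocess

POWER_LIMITS_W = {0: 460, 1: 460}  # GPU index -> watts (e.g. two 5090s)

# Persistence mode keeps driver state loaded between processes
# (power limits still reset on reboot).
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)

for gpu_index, watts in POWER_LIMITS_W.items():
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )
    print(f"GPU {gpu_index} power limit set to {watts} W")
```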
rebelSun25@reddit (OP)
It's not a hobby. I'm an old dev who is doing this every day, and I have created processes which use LLMs, but running off-site. Our office uses proper hardware, but my personal machine wasn't created with LLMs in mind.
With the advent of good 30b dense models, I want to move my testing in house and that's the situation I find myself in. $5k is perfectly reasonable to be honest as I expect this card to last me for a few years
ProfessionalSpend589@reddit
If money is not an issue - get a card with more than 32GB of VRAM.
You'll be able to run any model in the 32GB-GPU range, but at good quality (Q8 or better), context won't spill over to other hardware, and you'll avoid communication over the slow PCIe bus.
sleepingsysadmin@reddit
>It's not a hobby. I'm an old dev who is doing this every day, and I have created processes which use LLMs, but running off-site.
If this is a business write-off, there's little to no justification for trying to pinch pennies on a 9700.
The big difference between the 5090 and the RTX Pro 5000 is wattage. You're going to want the RTX Pro 5000 unless you have the power supply to back 600+ watts for the GPU alone.
2Norn@reddit
idk man i looked at the prices 5090 costs about 6x 3090
sure i want the goodies but i dont want it that much
slap 2x 3090 call it a day get 40tks
RoterElephant@reddit
RTX PRO 4500 also exists as a 32GB alternative.
sleepingsysadmin@reddit
That's true, but that memory bandwidth... It's slower than a 3090 and about the same speed as the R9700, while being twice the price? No thanks. As far as I'm concerned, that card doesn't exist.
Caffdy@reddit
896 vs 936GB/s, the difference is not that big
sleepy_roger@reddit
Bump to a 5000 then, speed is 1.34 TB/s, under a 5090 but a bit faster than a 3090.
This_Maintenance_834@reddit
The RTX PRO 4500 runs faster than any non-Nvidia consumer card on the market.
It gets you 36 tps on Qwen3.6-27b-Q4 without any tuning.
Anbeeld@reddit
You absolutely can fit 200k+ into 3090, even if with trade-offs. But I bought mine so cheap I can't complain.
sleepingsysadmin@reddit
>You absolutely can fit 200k+ into 3090, even if with trade-offs. But I bought mine so cheap I can't complain.
Are you saying that you kv cache quantize? Or like run Q2? Yikes.
Anbeeld@reddit
I'm running Q4 with TurboQuant (turbo3). It's not like a 27B model is gonna solve world hunger just because you don't quantize the cache, while 200k context is something I need right now for the task it's doing at this exact moment.
And I'm on Windows, which steals a ton of VRAM just by existing; on Linux you can squeeze out more and use the virtually lossless turbo4 for K or for both K and V, add vision, etc.
sleepingsysadmin@reddit
Ok, I'll give you the TurboQuant thing.
How's that working out for you? Stable?
LirGames@reddit
For coding, I personally wouldn't go the TurboQuant direction until the perfect implementation comes around on main llama.cpp.
The introduction of rotations a few weeks ago has made Q8 comparable to FP16; before that, all my tests at Q8 broke down at around 40-50K context (normal on real codebases). Now I feel comfortable with Q8, tested up to 96K context with very few issues (easy to spot if you don't vibe code but review whatever the model is generating).
Savantskie1@reddit
I've upgraded my setup to two MI50 32GBs and plan to buy a third one soon. And I'm glad I did. Yeah, it's a bit slow for pp, but way better than the 7900 XT and 6800 I traded for the MI50s.
ea_man@reddit
for running 27B: sell 7800xt buy 7900 xtx
That should cost little vs the other alternatives
orinoco_w@reddit
I'd go with a 48GB card. I strongly recommend sizing for your next upgrade even if you can't afford it now.
I'm running a 7900xtx and an mi100 (bought the mi100 over a year ago), and 32GB isn't quite enough to run long context on the one card, and mixing and matching two architectures, especially AMD ones, is painful as all heck.
I find really inconsistent results between different quants, single cards, two cards, flash attention on/off, tensor/layer model splits, and then different model architectures, and it's never clear whether it's just buggy llama.cpp features, quant impacts, differences between gfx1100 and gfx908, etc.
My 52GB of VRAM (usable; I need to leave ~2GB for the OS since the 7900xtx is my display) is enough to run 256k context with 27B at Q8 and maintain 20 tps above 120k context, but the amount of time I lose to rebuilding llama.cpp and testing just to get new models stable is painful.
My advice: go with one 48GB card, or a 72GB card (or even one 96GB card).
Two identical 32GB cards have limited upgrade potential. My next capacity goal is "be able to run 120B-range models", so I'm after 96GB of VRAM, as going from 54 to 64 isn't much of a jump.
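As a rough way to sanity-check how much VRAM a context target eats, here's a back-of-the-envelope KV cache estimate; the layer/head numbers below are illustrative placeholders, not any specific 27B model's real config (check the model's config.json):

```python
# Back-of-the-envelope KV cache size for a dense model:
#   bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context_tokens
# The architecture numbers are illustrative placeholders; real models
# (and sliding-window layers) can change this a lot.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context / 1024**3

layers, kv_heads, head_dim = 48, 8, 128  # hypothetical 27B-class dense model

for ctx in (32_768, 131_072, 262_144):
    fp16 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 2.0)  # unquantized FP16 cache
    q8 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 1.0)    # roughly Q8-quantized cache
    print(f"{ctx:>7} tokens: ~{fp16:.1f} GiB fp16 KV, ~{q8:.1f} GiB q8 KV")
```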
boulderingfanatix@reddit
How much are you guys finding 3090s for? Can't seem to find one for less than $1k. Is that basically standard?
2Norn@reddit
i think it heavily depends on where you live
i can find some for 700-750
starkruzr@reddit
not finding any for much less than $1100 on eBay
2Norn@reddit
like i said it depends on where you live
Mikolai007@reddit
$1900 will give you several years of usage of even stronger models through APIs. People afraid to send personal data through the API are sus.
buttplugs4life4me@reddit
As much as I was always a defender of AMD and always tried to support projects like ZLUDA, in the end it's just not worth the time anymore for me, and you've got to know what you're signing up for.
The thing that made the cup overflow was recently running a model on my 6950 XT and hitting driver timeouts right at the end of the generation.
In addition, when trying to debug basically anything with the Radeon Developer Suite, it always said that the profile cannot be opened. Thanks. That's after 2 hours spent trying to even get profiling to work, since it likes to cause driver timeouts as well.
Now to be clear, this was Windows. Linux is likely better, but since you didn't say what you use, it's important to know. That, and more exotic options like ik_llama.cpp and others usually don't support Vulkan or ROCm. IMHO Vulkan should always be the first target, but unfortunately it's usually CUDA instead.
rpkarma@reddit
Maybe I’m biased but to your last point: that’s because writing CUDA kernels is easier than Vulkan, at least for me
FinancialBandicoot75@reddit
27B is the safe choice due to MoE, unless you have 64GB of VRAM or a Mac with 64GB.
This_Maintenance_834@reddit
If the purpose is solely to run local LLMs, you should get an Nvidia card. If you want to play with other brands, then the 9700 32GB should be better.
Toastti@reddit
Sell everything and buy two 3090s
Or, if you wanted to sell the GPUs plus one kidney, buy an RTX Pro 6000.
triynizzles1@reddit
Buy yourself a single RTX 8000 48GB. They cost about the same as all of the other proposed solutions, but it's a single card. I have one in my setup and it can run Gemma 4 and Qwen 3.6 at max context length, no problem.
Fortunato_NC@reddit
What you paid for your 7800XT is a sunk cost and is irrelevant to your decision. You should be evaluating your options based on improved results versus new costs incurred.
gladfelter@reddit
"Sunk cost" implies that the cost cannot be recouped. OP is willing to sell the card. This is an optimization problem.
GoodTip7897@reddit
It is partially right to call it a sunk cost. The $150 is sunk because the GPU is being treated as $550 resale, but it is valued at $700 in other parts of the problem...
Enough_Big4191@reddit
dual gpu looks nice on paper but the pcie bottleneck hits fast, especially for inference. you’ll get the vram, but tokens/sec usually drops enough that it feels worse than a single bigger card. 32gb is a meaningful step up for 27–31b, mainly for less aggressive quant and fewer headaches. if u care about smooth usage over max capacity, single larger gpu tends to be the better experience.
_Viral19@reddit
For 27B/31B, 32 GB is a meaningful step because it lets you run higher quants and longer context without aggressive offload, but two midrange GPUs usually help capacity more than speed since tensor splitting and PCIe sync can eat into the win. If llama.cpp is your main stack, I'd only choose dual 7800 XTs if your goal is fitting bigger models cheaper; if you want the cleanest single-user latency, the 32 GB card is usually the nicer buy even at worse $/GB. Have you benchmarked the exact models you care about with partial offload first, so you know whether you're solving a hard VRAM limit or just chasing a better quant/context combo?
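If you do want that baseline first, here's a minimal sketch of timing single-stream generation against a local OpenAI-compatible server (llama-server or vLLM); the URL, model name, and prompt are placeholders:

```python
# Minimal sketch: measure single-stream generation speed against a local
# OpenAI-compatible server. URL, model name and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.time()
resp = client.chat.completions.create(
    model="local-27b",  # placeholder; use whatever the server reports
    messages=[{"role": "user", "content": "Write a 500-word overview of PCIe generations."}],
    max_tokens=1024,
)
elapsed = time.time() - start
out_tokens = resp.usage.completion_tokens  # note: elapsed also includes prompt processing
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```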
boutell@reddit
I don't have speed numbers to share with you, but I wonder if a Macbook Pro with an M5 Pro and 48GB of RAM might be the overall win? $2,599 is a little steep, but you're getting a great machine in general and I suspect the best overall throughput per watt. The unified RAM is a solid win. My question would be whether it can keep up with the cards under discussion in this thread performance-wise.
wwabbbitt@reddit
Unified RAM is significantly slower than VRAM, so it's more suited for MoE models.
Qwen3.5/3.6 27b and Gemma 31b are dense models more suited for VRAM
Rangizingo@reddit
Slower, yes, but perfectly usable. Source: I have a 4090 gaming rig and a 48GB MacBook Pro. The 4090 is obviously faster, but that doesn't make the Mac slow per se; the 4090's GPU is just wildly fast.
Elusive_Spoon@reddit
If going Mac, I'd get at least a Max-level chip for the all-important memory bandwidth. Could even get an M3 or something. The M5 speeds up prompt processing, but MLX can help with that even on an older chip (I'm on M1 Max).
sleepy_roger@reddit
If you like AMD get 2 9700s. If price isn't a factor skip it all and just get a 6000 pro and call it a day, or one of the smaller variants like the pro 5000 72gb or 48gb.