Hardware Choice for 27b to 31b models.
Posted by rebelSun25@reddit | LocalLLaMA | 104 comments
I've come to a point where I find the 27b and 31b models quite impressive.
I have a 16 GB AMD Radeon 7800 XT. It performs quite well, and it was $700. Here is my question:
Is the performance hit of a dual-GPU approach worth it if I save around $400 over a single larger card? Is 32 GB even a meaningful step up, and is running an R9700 Pro alongside a second 7800 XT for a total of 48 GB a more realistic requirement for models this size?
I would like to have more VRAM for running these models. I could go with dual 16 GB cards or a single larger card, but here's the cost difference:
A)
Sell 7800xt for $550.
Buy a single R9700 Pro, 32 GB, $1900 + tax. Final cost $1600.
B)
Add a second 7800 XT, $550 on the second-hand market. Final cost $700 + $550.
C)
Add an R9700 Pro: total price $1900 + tax, plus the $700 already spent.
Price isn't the deciding factor; I'm only outlining the difference so it can be weighed against performance, to decide if it's even worth it.
The memory bandwidth of these cards is the same; the difference is that there's a second PCIe device in the mix.
I've been using llama.cpp and like it, but vLLM is an option if a dual-GPU setup runs better on it.
Radiant_Condition861@reddit
Dual 3090s with NVLink. I get 30-150 tok/s with KV cache and model quantization; the 3090 has INT4 acceleration, and 5-step speculative decode is the big speed boost, depending on cache hits.
Kofeb@reddit
Wow. Thank you! 🙏
E1Extrano@reddit
What motherboard do you have for the 3090s?
Radiant_Condition861@reddit
Dell Precision 7865. I love this thing. Rack mountable too.
https://www.microcenter.com/product/663925/dell-precision-7865-tower-workstation-desktop-computer
kapitanfind-us@reddit
Two questions for you:
1) why do you have a custom chat template? 2) what is compressed-tensors quantization?
Apologies if I'm lazy, I will start googling next 🙄
Radiant_Condition861@reddit
I was troubleshooting the tool-calling problem. I thought it was the template, but it turned out I had to increase the output tokens (maxTokens) in pi to 64k.
It's a technique for compressing the model's weights. If I understand it correctly, this setting keeps the compressed weights staged in memory for faster inference.
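In case it helps, here is a minimal sketch (not my exact setup) of loading a compressed-tensors quantized checkpoint through vLLM's Python API; the model ID and sizes are placeholders, not anything I'm specifically recommending:

```python
# Minimal sketch: serving a compressed-tensors quantized model with vLLM.
# The model ID below is a placeholder; substitute the checkpoint you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-27b-w4a16",    # hypothetical compressed-tensors checkpoint
    quantization="compressed-tensors",  # weights stored in the compressed-tensors format
    tensor_parallel_size=2,             # split across two GPUs (e.g. dual 3090s)
    max_model_len=32768,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain KV cache quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```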
drallcom3@reddit
30+ with those? Damn. With my single 5060 16gb I get like 1.
2Norn@reddit
i mean if ur loading the same model as him yeah that makes sense
Kahvana@reddit
Personally I wouldn't use the 7800 XT, it's very power hungry compared to other options.
I'm running 2x 5060 Ti. 20 t/s generation (degrading to 14 t/s at 100k context) isn't fast, but it gets the job done. For that price and energy draw, it's well worth it.
If I had to buy a new GPU today, I would've gotten the R9700 Pro instead. Still very good energy use, and 32 GB on a single card (helps with fitting model layers) vs 32 GB split across two cards (some models might not fit, as layers can't always be offloaded neatly). The only downsides I've heard are the very loud blower-style fan and the lack of CUDA.
With the single R9700 Pro, you also leave room to expand later in a consumer case / motherboard. Well worth it.
foxpro79@reddit
Just bought the 9700 for a test and will return it for the noise. Cannot believe how loud it is. On par with having a hair dryer right beside you.
horeaper@reddit
Buy a custom water cooler for it, works great.
ProfessionalSpend589@reddit
I tolerate it (I have two behind my PC table), but most people will probably not like it.
The blower fan is definitely not as silent as a Noctua fan.
Spare-Ad-4810@reddit
Dual 3090s are always the budget GPU option.
starkruzr@reddit
that's over like $2K at this point at least on eBay in the US, idk that we can call it budget anymore
2Norn@reddit
compared to the alternatives, yes, it's budget
starkruzr@reddit
"pair of 5060Tis" makes a lot more sense budget-wise in terms of both available precision levels and total VRAM per dollar
2Norn@reddit
5060 ti has less than half the speed of 3090 tho... and its same price as used 3090...
u get less vram less speed just to have brand new and blackwell which wont do anything for you
Tenuous_Fawn@reddit
A new 5060ti 16gb is $550, a used 3090 is $1000
2Norn@reddit
a 16gb 5060 ti and used 3090 are the same price here about 600usd
Tenuous_Fawn@reddit
Here it's 1000 usd https://imgur.com/a/WEBajuY
2Norn@reddit
idk whats that supposed to prove
are you claiming entire used 3090 supply worldwide goes through that website?
starkruzr@reddit
they may not be global prices but they are national
2Norn@reddit
so? news flash: things don't cost the same in 2 different countries, more at 5...
Tenuous_Fawn@reddit
It's supposed to prove that the market price for a used 3090 here is about $1000 on ebay
2Norn@reddit
okay so its 1000 there and 600 here
what now?
Tenuous_Fawn@reddit
What is the value of knowledge? What is the purpose of living? What is the market price of a used RTX 3090? These are the mysteries pursued by the great philosophers of our day, u/Tenuous_Fawn and u/2Norn
munkiemagik@reddit
People can 'demand' $1000 but they can f*(k right off if they think most of us are going to give them $1000 for a battered 3090, greedy begging knob jocks!
Just keep an eye on eBay especially the non-FE cards. If you do a filter for completed + auctioned you will see they generally tend to end up selling for a lot less. Good luck hunting mate!
Tenuous_Fawn@reddit
This is with sold + auctioned:
https://imgur.com/a/WEBajuY
Hodler-mane@reddit
even a single 3090 at int4 on 27b dense will be a good choice for using it at home. 2x opens that up to 8 parallel sessions
Caffdy@reddit
what do you mean by that? interacting with just one instance of 27B@Q4 on my 3090 already makes it use full power
Hodler-mane@reddit
you can batch requests and get a much higher aggregate tps.
Instead of one session at 90 tps, you can do 2 at, say, 70; 4 at 55; 8 at 40 (these numbers aren't exact).
8 sessions at 40 would give you a total of 320 tps.
But to do this you need 8x the KV cache stored in VRAM, and a single 3090 only has room for about one. So if you get another card, you get the benefit of its additional processing power to increase overall tps, plus an additional 24 GB of VRAM just for KV caches = 8 sessions.
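If you want to see what that looks like in practice, here's a minimal sketch of firing parallel requests at a local OpenAI-compatible server (llama-server or vLLM) and measuring aggregate throughput; the URL, model name, and session count are placeholders:

```python
# Minimal sketch: issue N parallel chat requests against a local
# OpenAI-compatible endpoint and report aggregate tokens/sec.
# Endpoint URL, model name and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
N_SESSIONS = 8  # how many requests to keep in flight at once

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="local-27b",  # placeholder; use whatever the server reports
        messages=[{"role": "user", "content": f"Summarize topic #{i} in 200 words."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=N_SESSIONS) as pool:
    total_tokens = sum(pool.map(one_request, range(N_SESSIONS)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} aggregate tok/s")
```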
Hedede@reddit
Full power doesn't mean the card is fully utilised, since it's still burning power while waiting for data to arrive from memory.
samandiriel@reddit
Confirmed; it's what we have set up, and the hype is true - it's the sweet spot in terms of power/price/performance trade off for non-millionaires amongst us
Spare-Ad-4810@reddit
Same, it's great. The Pro WS X570-ACE is a great budget board to pair with it, along with a Ryzen 9 5950X and 128 GB of RAM. I regret not going bigger, seeing what prices are a couple of years later. I paid $260 for the RAM in Feb 2024; the same RAM is now $1100. What the fuck.
RomanticDepressive@reddit
Have you OCed the memory for more bandwidth? So far my 5950x+dual 3090s feel more limited by system ram/pcie bandwidth. I got up to 3600mhz @ cl 16 but feel I could maybe do slightly better
Spare-Ad-4810@reddit
Actually the opposite: I got better performance and stability clocking down to 3200 and FCLK to 1600, and changing a lot of motherboard settings, like making sure the PCIe gen is not on auto, and some others.
RomanticDepressive@reddit
Interesting… I'm currently 2+ years stable at my current settings. My biggest hurdle seems to be my FCLK at 1800. I'd love to break beyond that, but it takes time to test.
Spare-Ad-4810@reddit
Ill do some benchmarks this week and we can compare
TheOnlyBen2@reddit
I agree, but I am surprised by how few resources I can find on optimized configurations for this setup. Or am I not looking in the right places?
Spare-Ad-4810@reddit
I mean, this sub alone has a lot.
TheOnlyBen2@reddit
Available information quickly becomes outdated as new models come out.
Spare-Ad-4810@reddit
That's the benefit of using a 6-year-old card: hardware-wise, the information stays pretty much the same, same with the Linux OS setup. Then once you've decided that, ChatGPT or whatever your preferred flavor is can help you decide on ollama vs llama.cpp.
In fact, you'd be hard-pressed to find a build more thoroughly explored on this sub than the dual 3090 setup.
TheOnlyBen2@reddit
Well, I have 2 RTX 3090 with NVLINK and kinda struggle to see what my best setup would be.
You have to pick a quant format (AWQ, GGUF), a quant provider, and decide whether you should parallelize by tensor, by layer, or maybe run on a single card.
Then whether you should use vLLM or llama.cpp, mostly depending on your quant. But wait, do you really need a quant or can you go FP8? What about the quant for your KV cache? Should you quant the model harder for more context? What's the sweet spot between all those parameters?
Wait, it also depends on your use case, so one setting for one person is not good for another.
So you ask Gemini or ChatGPT and their advice is crap. I tried and got worse perf than the bare config.
So I suppose I'm doing something wrong somewhere, but despite this being a common config, it doesn't feel easy to find optimized recommendations.
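For context, here's a minimal sketch of the llama.cpp knobs I mean, launched from Python; the model path and values are placeholders, and the exact flag names can differ between llama.cpp versions:

```python
# Minimal sketch: launching llama-server with the knobs discussed above
# (GPU offload, tensor/layer split, KV cache quantization, context size).
# Paths and values are placeholders; flag spellings vary across llama.cpp versions.
import subprocess

cmd = [
    "llama-server",
    "-m", "/models/some-27b-q8_0.gguf",  # hypothetical GGUF quant
    "-ngl", "99",                        # offload all layers to the GPUs
    "--split-mode", "row",               # split tensors across both cards ("layer" is the other option)
    "--tensor-split", "1,1",             # even split between the two GPUs
    "--ctx-size", "65536",               # context length to reserve KV cache for
    "-ctk", "q8_0", "-ctv", "q8_0",      # quantize K/V caches (V-cache quant may need flash attention, depending on build)
]
subprocess.run(cmd, check=True)
```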
Spare-Ad-4810@reddit
That's part of the fun of a local rig: testing and benchmarking all of those. And with agent systems crazy easy to set up, you could have Hermes or oh my agent run benchmarks for a few days and tell you definitively what your best setup is. Though it sounds like you're not even sure what exactly you need, so good luck buddy.
TheOnlyBen2@reddit
Well, either there are plenty of resources, or figuring it out by yourself is part of the fun. Pick one, buddy.
Spare-Ad-4810@reddit
It's both. Definitely both. You may find 50% of one guide helpful and 50% of another. If you want your hand held, just say so.
sleepingsysadmin@reddit
You'll find that the AMD R9700 has poor memory bandwidth, and running a dense model on it will get 20-25 TPS. Closer to 10-15 when you have any reasonable amount of context in use.
The people telling you to buy a 3090 aren't telling you that you're only getting ~120k context.
Long story short.
You're going 5090 or RTX pro 5000 for 27b.
Kagemand@reddit
Prompt processing also matters a lot, and you don't need more than 32GB of VRAM for 27B. So I'd say 2x 9070 is still a good option?
sleepingsysadmin@reddit
2x 9700 means you're going server CPU and mobo and still ending up about 50% slower than a 5090. You don't just double your bandwidth in multi-GPU setups. Though yes, tensor and row splits can be better, but I'm doubtful.
While basically being the same price as the 5090.
Kagemand@reddit
I am not sure a server CPU and mobo are really necessary for 2x GPUs. Sure, one of the cards will be on 4x PCIe, but my impression is that it isn't a huge problem; correct me if I am wrong.
TiK4D@reddit
I have 2x R9700s on my ASRock X870 Taichi Creator board at x8/x8. It's not great, but very usable.
Kagemand@reddit
Out of interest, what does not great mean?
TiK4D@reddit
My goal was to run smaller dense models with a large context super quickly, but I just didn't do enough research before buying. With qwen-3.6-27b I get about 17 tok/sec, which I just don't think justifies how much I spent on everything.
akira3weet@reddit
Are you sure you are running tensor parallel? That sounds like single r9700 speed.
Kagemand@reddit
17 tok/sec sound low for a dual gpu setup.
TiK4D@reddit
Any AI I ask seems to say it's about what's expected with my hardware.
sleepy_roger@reddit
Inference token generation has very little to do with PCIe bandwidth; it mostly affects prefill.
sleepingsysadmin@reddit
>Sure, one of the cards will be on 4x PCIe on a consumer board, but my impression is that it isn’t a huge problem, but sure correct me if I am wrong.
We are trying to compare apples to apples here. If you're allowing very significant limiters to further reduce performance, then now you need 3x r9700 to get similar performance, maybe.
rebelSun25@reddit (OP)
Thanks for being realistic about it. 🤝 That's why I'm asking, hoping others know much better than I can guess.
sleepingsysadmin@reddit
Lets be further realistic.
If AI is your hobby.
Spending $5000 on a 5090 sounds like a lot.
But $5000 in golf clubs?
$5000 in tires, rims, and supercharger? Cobb stage 1?
It's really not unreasonable, and the resale value of the 5090 will hold for 5+ years.
Oh boy, I'm really convincing myself lol.
munkiemagik@reddit
So too will the resale value of an RTX 6000 Pro 😂
sleepy_roger@reddit
$5000 on a 5090 is insane. They're $3200-3500 at Micro Center. Every day I realize how lucky I was to get 2 FEs from Best Buy for $2000 each.
For $5k I'd go with the RTX Pro 5000 with 48GB of VRAM, or 2x 5090s.
Radiant_Condition861@reddit
did you upgrade your house's electrical circuit or cut back on power consumption via nvidia-smi -pl?
sleepy_roger@reddit
For the 3090 box I run them all at 250W; I've got a 20-amp circuit in the room that runs that box. The most it pulls is like 1600W-ish. The 2x 5090 box runs on a different circuit; the most it pulls is around 1100 or so. I have those limited to 460W each.
Caffdy@reddit
what about your 5090s? did you power limit them as well? if so, to which power target?
sleepy_roger@reddit
Yeah 460w for both
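In case it's useful, a minimal sketch of setting caps like that from Python by shelling out to nvidia-smi; the wattages are just the numbers above, and applying them usually needs root/admin:

```python
# Minimal sketch: cap each GPU's power draw with nvidia-smi -pl.
# The limits mirror the figures mentioned above; adjust for your cards.
# Requires nvidia-smi on PATH and, typically, root/admin privileges.
import subprocess

POWER_LIMITS_W = {0: 460, 1: 460}  # GPU index -> watts (e.g. two 5090s)

# Persistence mode keeps driver state loaded between processes
# (power limits still reset on reboot).
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)

for gpu_index, watts in POWER_LIMITS_W.items():
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )
    print(f"GPU {gpu_index} power limit set to {watts} W")
```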
rebelSun25@reddit (OP)
It's not a hobby. I'm an old dev who is doing this every day, and I have created processes which use LLMs, but running off-site. Our office uses proper hardware, but my personal machine wasn't created with LLMs in mind.
With the advent of good 30b dense models, I want to move my testing in house and that's the situation I find myself in. $5k is perfectly reasonable to be honest as I expect this card to last me for a few years
ProfessionalSpend589@reddit
If money is not an issue - get a card with more than 32GB of VRAM.
You'll be able to run any model in the 32GB-GPU range, but at good quality (Q8 or better), context won't spill over to other hardware, and you'll avoid communication over the slow PCIe bus.
sleepingsysadmin@reddit
>It's not a hobby. I'm an old dev who is doing this every day, and I have created processes which use LLMs, but running off-site.
If this is a business write-off, there's little to no justification for trying to pinch pennies on a 9700.
The big difference between the 5090 and the RTX Pro 5000 is wattage. You're going to want the RTX Pro 5000 unless you have the power supply to back 600+ watts for the GPU alone.
2Norn@reddit
idk man i looked at the prices 5090 costs about 6x 3090
sure i want the goodies but i dont want it that much
slap 2x 3090 call it a day get 40tks
RoterElephant@reddit
RTX PRO 4500 also exists as a 32GB alternative.
sleepingsysadmin@reddit
That's true, but that memory bandwidth... It's slower than a 3090 and about the same speed as the R9700, while being twice the price? No thanks. As far as I'm concerned, that card doesn't exist.
Caffdy@reddit
896 vs 936GB/s, the difference is not that big
sleepy_roger@reddit
Bump to a 5000 then, speed is 1.34 TB/s, under a 5090 but a bit faster than a 3090.
This_Maintenance_834@reddit
The RTX PRO 4500 runs faster than any non-Nvidia consumer card on the market.
It gets you 36 tps on Qwen3.6-27b-Q4 without any tuning.
Anbeeld@reddit
You absolutely can fit 200k+ into 3090, even if with trade-offs. But I bought mine so cheap I can't complain.
sleepingsysadmin@reddit
>You absolutely can fit 200k+ into 3090, even if with trade-offs. But I bought mine so cheap I can't complain.
Are you saying that you kv cache quantize? Or like run Q2? Yikes.
Anbeeld@reddit
I'm running Q4 with TurboQuant (turbo3). It's not like a 27B model is gonna solve world hunger just because you don't quantize the cache, while 200k context is something I need right now for the task it's doing at this exact moment.
And I'm on Windows, which steals a ton of VRAM just by existing; on Linux you can squeeze out more and use the virtually lossless turbo4 for K or for both K and V, add vision, etc.
sleepingsysadmin@reddit
Ok, I'll give you the TurboQuant thing.
How's that working out for you? Stable?
LirGames@reddit
For coding, I personally wouldn't go the TurboQuant direction until the perfect implementation comes around on main llama.cpp.
The introduction of rotations a few weeks ago has made Q8 comparable to FP16; before that, all my tests at Q8 broke down at around 40-50K context (normal on real codebases). Now I feel comfortable with Q8, tested up to 96K context with very few issues (easy to spot if you don't vibe code but review whatever the model is generating).
Savantskie1@reddit
I've upgraded my setup to two MI50 32GBs and plan to buy a third one soon. And I'm glad I did. Yeah, it's a bit slow for pp, but way better than the 7900 XT and 6800 I traded for the MI50s.
ea_man@reddit
for running 27B: sell 7800xt buy 7900 xtx
That should cost little vs the other alternatives
orinoco_w@reddit
I'd go with a 48GB card. I strongly recommend sizing for your next upgrade even if you can't afford it now.
I'm running a 7900xtx and an mi100 (bought the mi100 over a year ago), and 32GB isn't quite enough to run long context on the one card, and mixing and matching two architectures, especially AMD ones, is painful as all heck.
I find really inconsistent results between different quants, single cards, two cards, flash attention on/off, tensor/layer model splits, and then different model architectures, and it's never clear whether it's just buggy llama.cpp features, quant impacts, differences between gfx1100 and gfx908, etc.
My 52GB of VRAM (usable; I need to leave ~2GB for the OS since the 7900xtx is my display) is enough to run 256k context with 27B at Q8 and maintain 20 tps above 120k context, but the amount of time I lose to rebuilding llama.cpp and testing just to get new models stable is painful.
My advice: go with one 48GB card, or a 72GB card (or even one 96GB card).
Two identical 32GB cards have limited upgrade potential. My next capacity goal is "be able to run 120B-range models", so I'm after 96GB of VRAM, as going from 54 to 64 isn't much of a jump.
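As a rough way to sanity-check how much VRAM a context target eats, here's a back-of-the-envelope KV cache estimate; the layer/head numbers below are illustrative placeholders, not any specific 27B model's real config (check the model's config.json):

```python
# Back-of-the-envelope KV cache size for a dense model:
#   bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context_tokens
# The architecture numbers are illustrative placeholders; real models
# (and sliding-window layers) can change this a lot.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context / 1024**3

layers, kv_heads, head_dim = 48, 8, 128  # hypothetical 27B-class dense model

for ctx in (32_768, 131_072, 262_144):
    fp16 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 2.0)  # unquantized FP16 cache
    q8 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 1.0)    # roughly Q8-quantized cache
    print(f"{ctx:>7} tokens: ~{fp16:.1f} GiB fp16 KV, ~{q8:.1f} GiB q8 KV")
```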
boulderingfanatix@reddit
How much are you guys finding 3090s for? Can't seem to find one for less than $1k. Is that basically standard?
2Norn@reddit
i think it heavily depends on where you live
i can find some for 700-750
starkruzr@reddit
not finding any for much less than $1100 on eBay
2Norn@reddit
like i said it depends on where you live
Mikolai007@reddit
$1900 will give you several years of usage of even stronger models through APIs. People afraid to send personal data through the API are sus.
buttplugs4life4me@reddit
As much as I was always a defender of AMD and always tried to support projects like ZLUDA, in the end it's just not worth the time anymore for me, and you've got to know what you're signing up for.
The thing that made the cup overflow was recently running a model on my 6950 XT and hitting driver timeouts right at the end of the generation.
In addition, when trying to debug basically anything with the Radeon Developer Suite, it always said that the profile cannot be opened. Thanks. That's after 2 hours spent trying to even get profiling to work, since it likes to cause driver timeouts as well.
Now to be clear, this was Windows. Linux is likely better, but since you didn't say what you use, it's important to know. That, and more exotic options like ik_llama.cpp and others usually don't support Vulkan or ROCm. IMHO Vulkan should always be the first target, but unfortunately it's usually CUDA instead.
rpkarma@reddit
Maybe I’m biased but to your last point: that’s because writing CUDA kernels is easier than Vulkan, at least for me
FinancialBandicoot75@reddit
27B is the safe choice due to MoE, unless you have 64GB of VRAM or a Mac with 64GB.
This_Maintenance_834@reddit
If the purpose is solely to run local LLMs, you should get an Nvidia card. If you want to play with other brands, then the 9700 32GB should be better.
Toastti@reddit
Sell everything and buy two 3090s
Or, if you wanted to sell the GPUs plus one kidney, buy an RTX Pro 6000.
triynizzles1@reddit
Buy yourself a single RTX 8000 48GB. They cost about the same as all of the other proposed solutions, but it's a single card. I have one in my setup and it can run Gemma 4 and Qwen 3.6 at max context length, no problem.
Fortunato_NC@reddit
What you paid for your 7800XT is a sunk cost and is irrelevant to your decision. You should be evaluating your options based on improved results versus new costs incurred.
gladfelter@reddit
"Sunk cost" implies that the cost cannot be recouped. OP is willing to sell the card. This is an optimization problem.
GoodTip7897@reddit
It is partially right to call it a sunk cost. The $150 is sunk because the GPU is being treated as $550 resale, but it is valued at $700 in other parts of the problem...
Enough_Big4191@reddit
dual gpu looks nice on paper but the pcie bottleneck hits fast, especially for inference. you’ll get the vram, but tokens/sec usually drops enough that it feels worse than a single bigger card. 32gb is a meaningful step up for 27–31b, mainly for less aggressive quant and fewer headaches. if u care about smooth usage over max capacity, single larger gpu tends to be the better experience.
_Viral19@reddit
For 27B/31B, 32 GB is a meaningful step because it lets you run higher quants and longer context without aggressive offload, but two midrange GPUs usually help capacity more than speed since tensor splitting and PCIe sync can eat into the win. If llama.cpp is your main stack, I'd only choose dual 7800 XTs if your goal is fitting bigger models cheaper; if you want the cleanest single-user latency, the 32 GB card is usually the nicer buy even at worse $/GB. Have you benchmarked the exact models you care about with partial offload first, so you know whether you're solving a hard VRAM limit or just chasing a better quant/context combo?
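If you do want that baseline first, here's a minimal sketch of timing single-stream generation against a local OpenAI-compatible server (llama-server or vLLM); the URL, model name, and prompt are placeholders:

```python
# Minimal sketch: measure single-stream generation speed against a local
# OpenAI-compatible server. URL, model name and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.time()
resp = client.chat.completions.create(
    model="local-27b",  # placeholder; use whatever the server reports
    messages=[{"role": "user", "content": "Write a 500-word overview of PCIe generations."}],
    max_tokens=1024,
)
elapsed = time.time() - start
out_tokens = resp.usage.completion_tokens  # note: elapsed also includes prompt processing
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```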
boutell@reddit
I don't have speed numbers to share with you, but I wonder if a Macbook Pro with an M5 Pro and 48GB of RAM might be the overall win? $2,599 is a little steep, but you're getting a great machine in general and I suspect the best overall throughput per watt. The unified RAM is a solid win. My question would be whether it can keep up with the cards under discussion in this thread performance-wise.
wwabbbitt@reddit
Unified RAM is significantly slower than VRAM, so it's more suited for MoE models.
Qwen3.5/3.6 27b and Gemma 31b are dense models more suited for VRAM
Rangizingo@reddit
Slower, yes, but perfectly usable. Source: I have a 4090 gaming rig and a 48GB MacBook Pro. The 4090 is obviously faster, but that doesn't make the Mac slow per se; the 4090's GPU is just wildly fast.
Elusive_Spoon@reddit
If going Mac, I'd get at least a Max-level chip for the all-important memory bandwidth. Could even get an M3 or something. The M5 speeds up prompt processing, but MLX can help with that even on an older chip (I'm on M1 Max).
sleepy_roger@reddit
If you like AMD get 2 9700s. If price isn't a factor skip it all and just get a 6000 pro and call it a day, or one of the smaller variants like the pro 5000 72gb or 48gb.