5090 worth it?
Posted by UteForLife@reddit | LocalLLaMA | View on Reddit | 19 comments
I really want to run something like GLM 4.6 or GPT OSS locally. Is this really something a 5090 could do?
Steus_au@reddit
Honestly I gave up on running big models locally, not worth it. The GLM API is so cheap, I spend less than a dollar a day. It would never pay off even a single 5090.
lumos675@reddit
Which GLM model do you use? For me it can easily eat a dollar in less than an hour, specifically GLM 4.6.
Steus_au@reddit
GLM 4.6 is about $1 per million tokens. Are you sure you can read 1,000 A4 pages in one hour? You're a superman then.
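Rough math behind that comparison (the tokens-per-page and 5090 price figures below are illustrative assumptions, not exact numbers):

```python
# Back-of-envelope: how much text ~$1 of GLM 4.6 API usage buys at ~$1 per
# million tokens, and how long $1/day takes to match a 5090's price.
# All constants are rough assumptions for illustration.

price_per_million_tokens = 1.00   # USD, ballpark quoted above
tokens_per_a4_page = 650          # ~500 words/page * ~1.3 tokens/word (assumption)
gpu_price_usd = 2500              # rough 5090 street price (assumption)

pages_per_dollar = 1_000_000 / price_per_million_tokens / tokens_per_a4_page
years_to_match_gpu = gpu_price_usd / 1.00 / 365   # at $1/day of API spend

print(f"~{pages_per_dollar:,.0f} A4 pages of output per dollar")        # ~1,500 pages
print(f"~{years_to_match_gpu:.1f} years of $1/day to equal one 5090")   # ~6.8 years
```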
lumos675@reddit
Maybe because I'm using Cline or Roo Code and they waste too many tokens. Idk
Steus_au@reddit
Fair. The best performance you can get on a Mac is with 30B models, but they won't match any of the big cloud ones.
prusswan@reddit
The smallest quant for GLM 4.6 is nearly 100GB. If you are serious about this you will need something better than a 5090.
power97992@reddit
With a good CPU plus 128 GB of RAM, you can run it with a 5090 at Q2…
power97992@reddit
If you want to run 4.6 at Q8 with a reasonably large context and okay speed, you will need two 5090s, a good CPU, and 360 GB of DDR5 RAM. Otherwise buy 4 RTX 6000 Pros plus an RTX 5090 (if you want semi-fast speed).
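As a sanity check on those sizes, a minimal sketch of the GGUF weight math, assuming GLM 4.6 is ~355B total parameters (its commonly cited size) and ignoring KV cache and runtime overhead:

```python
# Rough GGUF weight footprint: size ≈ params * effective_bits_per_weight / 8.
# 355B total parameters is an assumption based on GLM 4.6's published spec;
# effective bits/weight include quantization block metadata.

total_params = 355e9

for name, bits_per_weight in [("Q2_K   (~2.6 bpw)", 2.6),
                              ("Q4_K_M (~4.8 bpw)", 4.8),
                              ("Q8_0   (~8.5 bpw)", 8.5)]:
    gb = total_params * bits_per_weight / 8 / 1e9
    print(f"{name} -> ~{gb:,.0f} GB of weights")

# Q2_K -> ~115 GB (close to the "smallest quant is nearly 100GB" figure above)
# Q8_0 -> ~377 GB (close to the "360 GB of DDR5" figure above)
```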
Ummite69@reddit
If you want to run LLMs larger than ~50 GB, you can almost skip worrying about the GPU. I've run Deepseek Qwen3 thinking, a model around 200GB, on a PC with 192 GB of RAM AND a 5090, and before that with a 3090. In such a setup the GGUF inference lives mostly in PC RAM, so the GPU's VRAM doesn't improve the resulting performance (t/s) much.
So if you are OK waiting 10-30 minutes for a full answer (depending on the provided context and the size of the result), but want the best quality an LLM can give you, go for a 256GB DDR5 PC, then the biggest GPU (VRAM) you can afford, which will have at least GDDR6 memory.
If you want speed and some quality, one of the best scenarios may be to have a LOT of 'cheap' GPUs and use GGUF inference, which can combine the VRAM of every GPU as if it were a single one. So if you take 4 GPUs with 16 GB each (like the 5060 Ti or 5070 Ti), you will be only slightly slower than two 5090s combined, but for maybe around 1/4th to 1/8th of the price depending on current pricing. But depending on the scenario you may need a good PCIe connection, otherwise PCIe transfers will slow down the inference and the GPUs will simply wait for data.
I'll soon try 5090 + 3090 + 5090 in a Thunderbolt 5 configuration, with 256GB of DDR5 PC RAM. Not sure what the resulting speed will be or exactly what the biggest LLM I'll be able to run is, but I'll keep that little surprise for myself since this has been my hobby for 2-3 years now...
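For anyone wanting to try that kind of RAM-plus-GPU split, a minimal sketch with llama-cpp-python (the file name, layer count, and split ratios are illustrative placeholders, not a measured configuration):

```python
# Minimal sketch of partial GPU offload with llama-cpp-python.
# Layers that don't fit in VRAM stay in system RAM and run on the CPU;
# with several GPUs, tensor_split divides the offloaded layers between them.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.6-Q2_K.gguf",  # placeholder file name
    n_gpu_layers=40,                 # offload as many layers as fit; the rest stay in RAM
    tensor_split=[0.4, 0.2, 0.4],    # example ratio across three cards (e.g. 5090 + 3090 + 5090)
    n_ctx=16384,
    n_threads=16,                    # CPU threads matter a lot once layers live in RAM
)

out = llm("Summarize why partial offload is memory-bandwidth bound.", max_tokens=128)
print(out["choices"][0]["text"])
```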
mr_zerolith@reddit
GLM 4.6 is going to run like sheet because you have to keep >75% of the model in CPU RAM.
GPT OSS? The 20b model will work, but it's not very smart.
Smartest model that runs on this GPU is SEED OSS 36B right now, which provides intelligence on the level of two notches down from Deepseek R1. With Q8 on the context cache and a small Q4 quant, you can get ~80k context, which is good enough for most purposes (rough fit math below).
But do keep in mind that 5090s create a ton of heat, even if you downclock and undervolt the card. So don't expect to be able to sit next to this beast while gaming or using it.
Heat is the reason i'm waiting for the next generation of hardware.
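A quick fit check for that claim, treating SEED OSS 36B as ~36B dense parameters (the exact KV-cache size per token depends on the model's layer and head layout, so it's left as a leftover budget here):

```python
# Does a ~36B model at Q4 plus a Q8 KV cache fit in a 5090's 32 GB?
# Back-of-envelope only; ~4.8 bits/weight for a Q4_K_M-style quant and the
# 2 GB overhead reserve are assumptions.

vram_gb = 32
params = 36e9
weights_gb = params * 4.8 / 8 / 1e9       # ~21.6 GB of weights
kv_budget_gb = vram_gb - weights_gb - 2   # ~2 GB reserved for activations/overhead

print(f"Weights ~{weights_gb:.1f} GB, leftover for Q8 KV cache ~{kv_budget_gb:.1f} GB")
# Roughly 8-9 GB of Q8 KV cache is what makes an ~80k context plausible,
# subject to the model's actual layer count and GQA configuration.
```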
GaryDUnicorn@reddit
A 5090 on GLM 4.6? No. But 7 of them will :D In all honesty, get the RTX 6000 Pro if you want big models; your performance tanks passing calculations between a half dozen cards.
Also, GLM 4.6 is great, even the EXL3 quant at 3.0bpw h6 is super good.
SimilarWarthog8393@reddit
GLM 4.6 & GPT OSS 120B are two different monsters. OSS 120B is more than doable. I can run that model on my laptop with a 4070 @ 15 t/s. GLM on the other hand is 3x the size and has 6x the active parameters, so your CPU & RAM will make a significant difference in whether it's feasible.
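The "3x the size, 6x the active parameters" comparison, written out with the commonly published parameter counts (treat them as approximate):

```python
# Why GPT-OSS-120B offloads so much better than GLM 4.6: per-token compute
# and memory traffic scale with the *active* parameters, not the total.
# Figures are the commonly published specs, rounded.

gpt_oss_total, gpt_oss_active = 117e9, 5.1e9
glm46_total, glm46_active = 355e9, 32e9

print(f"Total size ratio:   ~{glm46_total / gpt_oss_total:.1f}x")    # ~3x
print(f"Active param ratio: ~{glm46_active / gpt_oss_active:.1f}x")  # ~6x
```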
Particular-Panda5215@reddit
Only worth it when the model fits completely in VRAM, otherwise you can't saturate the GPU.
The gpt-oss-20b will fit, the 120b does not, and GLM will also not fit on one card.
silenceimpaired@reddit
Yes… but… an MoE not fitting matters less than it did for the dense models of the past. Depending on your application.
ForsookComparison@reddit
Also yes.. but.. no matter how fast your VRAM is, it'll be waiting on your (likely dual-channel) DDR4/DDR5 for the same amount of time. Why blow $3K+ on 32GB of 2TB/s VRAM when a $150 Alibaba Mi50 will feel virtually identical for large MoEs that have over half the model in system memory?
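A crude way to see that bottleneck (bandwidth figures are typical numbers, and the model is assumed to be a ~32B-active MoE at Q4 with half its active weights spilled to system RAM):

```python
# Decode speed on a memory-bound MoE is roughly limited by how long it takes
# to stream the active weights for each token from every memory pool.
# All numbers below are rough, illustrative assumptions.

active_params = 32e9           # a GLM-4.6-class MoE (assumption)
bytes_per_param = 0.6          # ~4.8 bits/weight at Q4 (assumption)
frac_in_system_ram = 0.5       # half the active weights live in system RAM

ddr5_dual_channel_gbps = 90    # ~DDR5-6000 dual channel
vram_bandwidth_gbps = {"5090 (GDDR7)": 1800, "Mi50 (HBM2)": 1000}

ram_bytes = active_params * bytes_per_param * frac_in_system_ram
vram_bytes = active_params * bytes_per_param * (1 - frac_in_system_ram)

for card, bw in vram_bandwidth_gbps.items():
    seconds_per_token = ram_bytes / (ddr5_dual_channel_gbps * 1e9) + vram_bytes / (bw * 1e9)
    print(f"{card}: ~{1 / seconds_per_token:.1f} tok/s")

# Both land near ~9 tok/s: the DDR5 half of the read dominates either way.
```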
silenceimpaired@reddit
Ooo… an interesting take… I think the one thing that has me second-guessing my two 3090s is image and video AI.
panchovix@reddit
For LLMs I would rather get multiple 3090s, or wait for the 5000 Supers in Q1 2026 (5070TiS and 5080S 24GB).
Consistent-Donut-534@reddit
Rent one on a cloud provider and see for yourself
a_beautiful_rhind@reddit
It's worth it for image/video gen. For LLMs you might want 2-4 of them paired with many channels of fast RAM.