5090 worth it?
Posted by UteForLife@reddit | LocalLLaMA | View on Reddit | 19 comments
I really want to run something like GLM 4.6 or GPT OSS locally. Is this really something a 5090 could do?
Steus_au@reddit
Honestly I gave up on running big models locally, not worth it. The GLM API is so cheap, I spend less than a dollar a day. It would never pay off even a single 5090.
lumos675@reddit
Which GLM model do you use? For me it can easily eat a dollar in less than an hour, specifically GLM 4.6.
Steus_au@reddit
GLM 4.6 is about $1 per million tokens. Are you sure you can read 1,000 A4 pages in one hour? You're a superman then.
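Rough math behind that comparison (the tokens-per-page and 5090 price figures below are illustrative assumptions, not exact numbers):

```python
# Back-of-envelope: how much text ~$1 of GLM 4.6 API usage buys at ~$1 per
# million tokens, and how long $1/day takes to match a 5090's price.
# All constants are rough assumptions for illustration.

price_per_million_tokens = 1.00   # USD, ballpark quoted above
tokens_per_a4_page = 650          # ~500 words/page * ~1.3 tokens/word (assumption)
gpu_price_usd = 2500              # rough 5090 street price (assumption)

pages_per_dollar = 1_000_000 / price_per_million_tokens / tokens_per_a4_page
years_to_match_gpu = gpu_price_usd / 1.00 / 365   # at $1/day of API spend

print(f"~{pages_per_dollar:,.0f} A4 pages of output per dollar")        # ~1,500 pages
print(f"~{years_to_match_gpu:.1f} years of $1/day to equal one 5090")   # ~6.8 years
```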
lumos675@reddit
Maybe because I'm using Cline or Roo Code and they waste too many tokens. Idk
Steus_au@reddit
Fair. The best performance you can get on a Mac is with 30B models, but they won't match any of the big cloud ones.
prusswan@reddit
The smallest quant for GLM 4.6 is nearly 100GB. If you are serious about this you will need something better than a 5090.
power97992@reddit
With a good CPU plus 128 GB of RAM, you can run it with a 5090 at Q2…
power97992@reddit
If you want to run 4.6 at Q8 with a reasonably large context and okay speed, you will need two 5090s, a good CPU, and 360 GB of DDR5 RAM. Otherwise buy 4 RTX 6000 Pros plus an RTX 5090 (if you want semi-fast speed).
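As a sanity check on those sizes, a minimal sketch of the GGUF weight math, assuming GLM 4.6 is ~355B total parameters (its commonly cited size) and ignoring KV cache and runtime overhead:

```python
# Rough GGUF weight footprint: size ≈ params * effective_bits_per_weight / 8.
# 355B total parameters is an assumption based on GLM 4.6's published spec;
# effective bits/weight include quantization block metadata.

total_params = 355e9

for name, bits_per_weight in [("Q2_K   (~2.6 bpw)", 2.6),
                              ("Q4_K_M (~4.8 bpw)", 4.8),
                              ("Q8_0   (~8.5 bpw)", 8.5)]:
    gb = total_params * bits_per_weight / 8 / 1e9
    print(f"{name} -> ~{gb:,.0f} GB of weights")

# Q2_K -> ~115 GB (close to the "smallest quant is nearly 100GB" figure above)
# Q8_0 -> ~377 GB (close to the "360 GB of DDR5" figure above)
```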
Ummite69@reddit
If you want to run LLMs larger than ~50 GB, you can almost skip worrying about the GPU. I've run Deepseek Qwen3 thinking, a model around 200GB, on a PC with 192 GB of RAM AND a 5090, and before that with a 3090. In such a setup the GGUF inference lives mostly in PC RAM, so the GPU's VRAM doesn't improve the resulting performance (t/s) much.
So if you are OK waiting 10-30 minutes for a full answer (depending on the provided context and the size of the result), but want the best quality an LLM can give you, go for a 256GB DDR5 PC, then the biggest GPU (VRAM) you can afford, which will have at least GDDR6 memory.
If you want speed and some quality, one of the best scenarios may be to have a LOT of 'cheap' GPUs and use GGUF inference, which can combine the VRAM of every GPU as if it were a single one. So if you take 4 GPUs with 16 GB each (like the 5060 Ti or 5070 Ti), you will be only slightly slower than two 5090s combined, but for maybe around 1/4th to 1/8th of the price depending on current pricing. But depending on the scenario you may need a good PCIe connection, otherwise PCIe transfers will slow down the inference and the GPUs will simply wait for data.
I'll soon try 5090 + 3090 + 5090 in a Thunderbolt 5 configuration, with 256GB of DDR5 PC RAM. Not sure what the resulting speed will be or exactly what the biggest LLM I'll be able to run is, but I'll keep that little surprise for myself since this has been my hobby for 2-3 years now...
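For anyone wanting to try that kind of RAM-plus-GPU split, a minimal sketch with llama-cpp-python (the file name, layer count, and split ratios are illustrative placeholders, not a measured configuration):

```python
# Minimal sketch of partial GPU offload with llama-cpp-python.
# Layers that don't fit in VRAM stay in system RAM and run on the CPU;
# with several GPUs, tensor_split divides the offloaded layers between them.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.6-Q2_K.gguf",  # placeholder file name
    n_gpu_layers=40,                 # offload as many layers as fit; the rest stay in RAM
    tensor_split=[0.4, 0.2, 0.4],    # example ratio across three cards (e.g. 5090 + 3090 + 5090)
    n_ctx=16384,
    n_threads=16,                    # CPU threads matter a lot once layers live in RAM
)

out = llm("Summarize why partial offload is memory-bandwidth bound.", max_tokens=128)
print(out["choices"][0]["text"])
```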
mr_zerolith@reddit
GLM 4.6 is going to run like sheet because you have to keep >75% of the model in CPU RAM.
GPT OSS? The 20b model will work, but it's not very smart.
Smartest model that runs on this GPU is SEED OSS 36B right now, which provides intelligence on the level of two notches down from Deepseek R1. With Q8 on the context cache and a small Q4 quant, you can get ~80k context, which is good enough for most purposes (rough fit math below).
But do keep in mind that 5090s create a ton of heat, even if you downclock and undervolt the card. So don't expect to be able to sit next to this beast while gaming or using it.
Heat is the reason i'm waiting for the next generation of hardware.
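A quick fit check for that claim, treating SEED OSS 36B as ~36B dense parameters (the exact KV-cache size per token depends on the model's layer and head layout, so it's left as a leftover budget here):

```python
# Does a ~36B model at Q4 plus a Q8 KV cache fit in a 5090's 32 GB?
# Back-of-envelope only; ~4.8 bits/weight for a Q4_K_M-style quant and the
# 2 GB overhead reserve are assumptions.

vram_gb = 32
params = 36e9
weights_gb = params * 4.8 / 8 / 1e9       # ~21.6 GB of weights
kv_budget_gb = vram_gb - weights_gb - 2   # ~2 GB reserved for activations/overhead

print(f"Weights ~{weights_gb:.1f} GB, leftover for Q8 KV cache ~{kv_budget_gb:.1f} GB")
# Roughly 8-9 GB of Q8 KV cache is what makes an ~80k context plausible,
# subject to the model's actual layer count and GQA configuration.
```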
GaryDUnicorn@reddit
A 5090 on GLM 4.6? No. But 7 of them will :D In all honesty, get the RTX 6000 Pro if you want big models; your performance tanks passing calculations between a half dozen cards.
Also, GLM 4.6 is great, even the EXL3 quant at 3.0bpw h6 is super good.
SimilarWarthog8393@reddit
GLM 4.6 & GPT OSS 120B are two different monsters. OSS 120B is more than doable. I can run that model on my laptop with a 4070 @ 15 t/s. GLM on the other hand is 3x the size and has 6x the active parameters, so your CPU & RAM will make a significant difference in whether it's feasible.
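The "3x the size, 6x the active parameters" comparison, written out with the commonly published parameter counts (treat them as approximate):

```python
# Why GPT-OSS-120B offloads so much better than GLM 4.6: per-token compute
# and memory traffic scale with the *active* parameters, not the total.
# Figures are the commonly published specs, rounded.

gpt_oss_total, gpt_oss_active = 117e9, 5.1e9
glm46_total, glm46_active = 355e9, 32e9

print(f"Total size ratio:   ~{glm46_total / gpt_oss_total:.1f}x")    # ~3x
print(f"Active param ratio: ~{glm46_active / gpt_oss_active:.1f}x")  # ~6x
```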
Particular-Panda5215@reddit
Only worth it when the model fits completely in VRAM, otherwise you can't saturate the GPU.
The gpt-oss-20b will fit, the 120b does not, and GLM will also not fit on one card.
silenceimpaired@reddit
Yes… but… an MoE not fitting matters less than it did for the dense models of the past. Depending on your application.
ForsookComparison@reddit
Also yes.. but.. no matter how fast your VRAM is, it'll be waiting on your (likely dual-channel) DDR4/DDR5 for the same amount of time. Why blow $3K+ on 32GB of 2TB/s VRAM when a $150 Alibaba Mi50 will feel virtually identical for large MoEs that have over half the model in system memory?
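A crude way to see that bottleneck (bandwidth figures are typical numbers, and the model is assumed to be a ~32B-active MoE at Q4 with half its active weights spilled to system RAM):

```python
# Decode speed on a memory-bound MoE is roughly limited by how long it takes
# to stream the active weights for each token from every memory pool.
# All numbers below are rough, illustrative assumptions.

active_params = 32e9           # a GLM-4.6-class MoE (assumption)
bytes_per_param = 0.6          # ~4.8 bits/weight at Q4 (assumption)
frac_in_system_ram = 0.5       # half the active weights live in system RAM

ddr5_dual_channel_gbps = 90    # ~DDR5-6000 dual channel
vram_bandwidth_gbps = {"5090 (GDDR7)": 1800, "Mi50 (HBM2)": 1000}

ram_bytes = active_params * bytes_per_param * frac_in_system_ram
vram_bytes = active_params * bytes_per_param * (1 - frac_in_system_ram)

for card, bw in vram_bandwidth_gbps.items():
    seconds_per_token = ram_bytes / (ddr5_dual_channel_gbps * 1e9) + vram_bytes / (bw * 1e9)
    print(f"{card}: ~{1 / seconds_per_token:.1f} tok/s")

# Both land near ~9 tok/s: the DDR5 half of the read dominates either way.
```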
silenceimpaired@reddit
Ooo… an interesting take… I think the one thing that has me second-guessing my two 3090s is image and video AI.
panchovix@reddit
For LLMs I would rather get multiple 3090s, or wait for the 5000 Supers in Q1 2026 (5070TiS and 5080S 24GB).
Consistent-Donut-534@reddit
Rent one on a cloud provider and see for yourself
a_beautiful_rhind@reddit
It's worth it for image/video gen. For LLMs you might want 2-4 of them paired with many channels of fast RAM.