One Bottleneck After Another - First GPU & now RAM
Posted by pmttyji@reddit | LocalLLaMA | 10 comments
So many threads like this on multiple subs for the last couple of months.
Terrible timeline for some *sigh*
Massive-Question-550@reddit
Let's be clear, buying an RTX 6000 Pro was never cheap or a good value.
Kitchen-Year-8434@reddit
I’m on the fence on the good value part. Being able to get 170 t/s on gpt-oss-120b with 2.5-4K t/s prompt processing has changed when and how I think about using AI. Basically everywhere, all the time, on everything, just to see how it does.
80 t/s on glm-4.5-air at Q5. These are all eminently usable models for real local work, and the combination of latency and gen speed really does make a difference.
So it depends on what value you’re looking for I guess.
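If you want to sanity-check throughput numbers like these on your own hardware, a minimal timing sketch is below. It assumes the llama-cpp-python bindings and a local GGUF file; the model path is hypothetical, and the commenter doesn't say which runtime produced their numbers.

```python
# Rough tokens/sec measurement with llama-cpp-python (pip install llama-cpp-python).
# Assumption: a local GGUF file; the path below is hypothetical.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-120b-Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to the GPU if it fits
    n_ctx=4096,
    verbose=False,
)

prompt = "Explain the difference between dense and MoE transformers."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

gen = out["usage"]["completion_tokens"]
print(f"{gen} tokens in {elapsed:.2f}s -> {gen / elapsed:.1f} t/s")
```

Note this lumps prompt processing and generation into one wall-clock number; the figures quoted above report them separately, which is the more careful way to benchmark.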
Long_comment_san@reddit
Idk, is it worth like 5-10 grand? If that earns you money, sure, but I wanted 24-32 GB of VRAM at ~$800-1200 for cool roleplays, and now I'm kinda baffled sitting on my ass with my 4070 and its 12 GB of VRAM. God saved us with MoE models there. A job is one thing; casual guys like me are kind of screwed for a couple of years. It's mining all over again.
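Worth spelling out why MoE is the thing that saves 12 GB cards: only a few experts fire per token, so weights parked in system RAM cost far less per token than they would in a dense model of the same file size. A minimal partial-offload sketch, again assuming llama-cpp-python and a hypothetical GGUF path:

```python
# Partial GPU offload for a large MoE model on a 12 GB card.
# Sketch only; the path and layer count are hypothetical and need tuning
# to whatever actually fits in your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.5-air-Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=20,  # keep ~12 GB worth of layers on the GPU, rest on CPU
    n_ctx=8192,
    verbose=False,
)

print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```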
Kitchen-Year-8434@reddit
As disappointing as it is, my answer is: it depends. At that price point (say 5-10k), there are a lot of directions to go, and each offering kind of has its own niche based on hardware limitations and a balance of tradeoffs between speed, size, power consumption, and compute.
MoE is a godsend for sure, but tbh I remain more impressed with gemma-3-27b as a model for pretty much anything other than code gen, and the QAT version of it, while not quite small enough for 12GB of VRAM, is still quite modest: 16.8GB at Q4 (link) or just over 14GB at 3.0bpw with exllamav3.
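A quick way to reproduce size figures like these from first principles: weights take roughly params × bits-per-weight / 8 bytes, and real files run somewhat larger because embeddings and some tensors stay at higher precision. A back-of-envelope sketch:

```python
# Back-of-envelope quantized model size: params * bpw / 8 bytes.
def est_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # 1e9 * params_billion weights * bpw bits / 8 bits-per-byte, expressed in GB
    return params_billion * bits_per_weight / 8

for bpw in (3.0, 4.0, 5.0):
    print(f"gemma-3-27b @ {bpw} bpw ~ {est_size_gb(27, bpw):.1f} GB")
# ~10.1 GB @ 3.0 bpw, ~13.5 @ 4.0, ~16.9 @ 5.0. Real files sit above these
# floors (embeddings and some tensors are kept at higher precision), and
# "Q4" quants average more than 4 bits per weight.
```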
12GB of VRAM on a 4070 makes a ton of sense for gaming; that's a great footprint there. It's just that when it comes to LLMs and the VRAM these huge, sparse, redundant models need, it turns out this isn't exactly the workload GPUs were designed for. It's pretty amazing we've gotten as far as we have with general-purpose GPU architecture, but take a peek at what Groq is doing for inference, or Google with their TPUs, and you realize we're all kind of hammering square pegs into round holes with our current approach to inference.
CoralBliss@reddit
I feel this and am going through a similar experience. I spent $1200 on my computer; I don't have that kind of money to play with like everyone else right now. 5 to 10k? Fuck.
Salt_Discussion8043@reddit
You're meant to utilise AI to make more money LOL, not do roleplay only
MitsotakiShogun@reddit
"buying a rtx 6000 pro" wasn't, but what about 4-8x?!
CharmingRogue851@reddit
"When's AI bubble burst?"
Looool, real
Disastrous_Meal_4982@reddit
For real, we need slow deflation that tapers off to a steady market asap. If the bubble bursts, it's going to be more pain, not less.
Aggressive-Bother470@reddit
There's no waiting for a better deal in this game.
It's buy now or get fucked later.