What is the best inference model you have tried at 64GB VRAM and 128GB VRAM?
Posted by seoulsrvr@reddit | LocalLLaMA | 5 comments
I'm using the model to ingest and understand large amounts of technical data. I want it to make well-reasoned decisions quickly.
I've been testing with 32GB VRAM up to this point, but I'm migrating to new servers and want to upgrade the model.
Eager to hear impressions from the community.
Glittering-Koala-750@reddit
What GPU are you running at those levels of VRAM?
RobotRobotWhatDoUSee@reddit
gpt-oss for 128GB. I use it for statistical programming and it is very, very good at that task.
-dysangel-@reddit
At 64GB, probably still Qwen 3 32B for me. At 128GB, GLM 4.5 Air and gpt-oss-120b.
seoulsrvr@reddit (OP)
Thanks!
How do GLM 4.5 Air and gpt-oss-120b compare to one another, in your opinion?
-dysangel-@reddit
They seem similar in speed and capability, but GLM generates more aesthetic, colourful outputs and feels more human-like to talk to. In my brief testing I'd say gpt-oss feels more task-based and defaults to making colour schemes grey, etc. (same as GPT-5 does, actually).
I haven't spent much time with gpt-oss because I found the Harmony format messy and confusing for compatibility with local agents, so I've been waiting for Cline/Roo/LM Studio etc. to catch up. I did manage a successful test with it using the Codex CLI.
LM Studio did add a Harmony runtime a couple of weeks ago, and Cline and the others have had some time to iterate, so I should probably try gpt-oss more seriously again.
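For anyone wanting to try the same setup, here is a minimal sketch of querying gpt-oss through LM Studio's OpenAI-compatible local server, which handles the Harmony formatting on the server side. The model identifier, port, and prompt below are assumptions for illustration, not values taken from this thread; check the LM Studio server tab for the exact ones.

```python
# Minimal sketch: chat with a locally served gpt-oss model via LM Studio's
# OpenAI-compatible endpoint (default http://localhost:1234/v1).
# The model name and prompt are placeholders, not confirmed by the thread.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed identifier; use the name LM Studio shows
    messages=[
        {"role": "system", "content": "You are a concise statistical programming assistant."},
        {"role": "user", "content": "Write an R function that computes a bootstrap CI for the mean."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
```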