96GB Vram. What to run in 2026?
Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 88 comments
I was all set on doing the 4x 3090 route, but with the current releases of Qwen 3.5 and Gemma 4 I am having second thoughts. 96GB of VRAM seems to be in a weird spot: it's not enough to run the larger models and more than needed for the mid-size models. What are you running as your main model?
ValeKokPendek@reddit
4 instances of your favourite model
inthesearchof@reddit (OP)
Maybe minimax 2.7 q2 when released or qwen3.6 122b?
VoidAlchemy@reddit
I was asking myself the same question, started a post here with some preliminary comparisons: https://www.reddit.com/r/LocalLLaMA/comments/1sjsokz/minimaxm27_vs_qwen35122ba10b_for_96gb_vram_full/
VoidAlchemy@reddit
96GB is great, and if you use ik_llama.cpp's
-sm graphor try the mainline llama.cpp experimental feature-sm tensoryou can use all 4x of your GPUs for "tensor parallel" kind of operation similar to vLLM etc.My "daily driver" is opencode plus ubergarm/Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) with 256k uncompressed kv-cache which I designed to fit snuggly onto 2x older A6000 GPUs (basically 48gb vram 3090s).
I personally find it better than Qwen3.5-27B dense and definitely better than gemma-4-31b-it dense, both of which are slower too given more active weights.
Your rig is great, no need for fomo, enjoy what you have! Cheers!
No_Algae1753@reddit
I use Q4_K_XL. Which of these is better?
VoidAlchemy@reddit
The IQ5_KS is the best available for a 96GB VRAM offload; it requires ik_llama.cpp (ik did many of the mainline llama.cpp quantization implementations, like the ones in your quant; I'm using his newer stuff).
Nepherpitu@reddit
Qwen 3.5 122B at AWQ, GPTQ, or NVFP4 fits with 200K+ context, running on vLLM at 110+ tps.
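For reference, a minimal launch sketch for that kind of setup; the model ID, quant choice, and context length here are assumptions, not the commenter's exact command:

```shell
# Sketch only: model ID and values are illustrative.
vllm serve Qwen/Qwen3.5-122B-A10B-AWQ \
  --tensor-parallel-size 4 \
  --max-model-len 204800 \
  --gpu-memory-utilization 0.95
```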
inthesearchof@reddit (OP)
Qwen 27B seems to match or come close to the 122B MoE, but there is the world-knowledge difference
Nepherpitu@reddit
Speed is a quality as well. 27B is slower.
Panthau@reddit
exactly, around 10-15t/s on my strix halo 128gb
Trashposter666@reddit
That's exactly why I returned mine. Just too slow.
Panthau@reddit
Well, I get around 28t/s on Qwen 3.5 122B at Q4... for me, that's fine. If you want something more productive, it will be way more expensive... that's how it's always been.
inthesearchof@reddit (OP)
It's frustrating when those companies are crippling their consumer solutions to around 250GB/s bandwidth.
uti24@reddit
I mean, companies are not entirely crippling them. The AMD AI Max memory configuration is 4-channel 8000 MT/s DDR5; that's as good as it gets, nothing crippled here. I would also want faster memory, but how? More channels? It would cost even more and be an even more niche product, since a PC with otherwise similar specs already costs about half as much.
Mil0Mammon@reddit
Medusa halo is rumored to have 6 channels and lpddr6, so about 2.7x the BW
uti24@reddit
How come 2.7x the BW?
Strix Halo has 4 channels, so if Medusa really has 6, then it's 1.5x; maybe at 10,000 MT/s it's closer to 1.9x.
So significantly better but not fantastically better
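The channel arithmetic can be sketched out; the figures below are the ones floated in this thread (and plain 64-bit channels are assumed), not confirmed specs:

```python
# Peak bandwidth = channels * bytes per transfer * transfer rate.
def bandwidth_gbs(channels: int, mt_per_s: int, bits_per_channel: int = 64) -> float:
    return channels * (bits_per_channel / 8) * mt_per_s / 1000

strix = bandwidth_gbs(4, 8000)        # 4ch at 8000 MT/s  -> 256 GB/s
medusa = bandwidth_gbs(6, 10000)      # rumored: 6ch at 10000 MT/s
print(strix, medusa, medusa / strix)  # 256.0 480.0 1.875
```

With 64-bit channels the rumored config lands near 1.9x; the 2.7x figure presumably assumes LPDDR6's different per-channel layout on top of the extra channels.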
Panthau@reddit
You gotta go with what's available. I paid 2700 euros for that device, which is less than one 5090 but 4x the RAM. Depending on your goals, you might have different needs... but for most, it's a great price tag for what the device delivers.
inthesearchof@reddit (OP)
I remember when they were selling for $1600. Best value for larger MoE models at low watts
nostriluu@reddit
Apparently it's the limits of mainstream PC technology more than "crippling" their technology. It takes denser circuits that are more difficult to design and produce, and billions of dollars to make a chip plant to current specs (JEDEC). The market would have to be badly warped for all the companies to agree not to produce fast RAM if it were feasible. Making a PC out of commodity parts is the strength and weakness of the PC market. Apple largely designs their own parts (aside from RAM modules) and puts everything in one package (UMA), so they have to spend more on research and production but take more profit because of their difference.
The PC market has been stuck with slow RAM for a while, most enthusiast PCs from the past decade were about gaming so focused on the GPU and CPU rather than fast RAM. A workstation PC has fast RAM, but because it's still based on commodity parts rather than putting everything on one package, they're huge and power hungry and due to the bifurcation, expensive.
Karyo_Ten@reddit
For AMD I understand; it's way cheaper, and it wasn't designed for AI inference at first. Nvidia though... they wanted it as an AI platform from the get-go. And at that price... they put in a 5070-class GPU that is crippled by the bandwidth
IZaYaCI@reddit
Via vLLM? Can't get it to run even on 8x 3090, can you share your command/params?
Nepherpitu@reddit
Key points:
- All GPUs have 0GB reserved by the system; all 24GB are free. No GUI, no DE, no WM, only a terminal under Ubuntu Server.
- `--max-num-seqs 2` - the default will try to capture graphs for 16 users, which is too much for 96GB VRAM.
- `--attention-backend flashinfer` - free performance!
- `--max-num-batched-tokens 4096` - the default is 2048, but 4096 gives faster PP.
- `--gpu-memory-utilization 0.955` - ONLY achievable if 0GB is reserved by the system. After the weights load I have `Available KV cache memory: 2.64 GiB` in the logs, then `Auto-fit max_model_len: reduced from 262144 to 223872 to fit in available GPU memory (2.64 GiB available for KV cache)`. For comparison, CachyOS with KDE uses `2.677Gi/31.843Gi` on my 5090; the OS uses more VRAM than is available for KV cache. There are ZERO chances you will be able to run this model on GPUs shared with an OS desktop.

IZaYaCI@reddit
Thanks so much! I will try that. I was trying to run the 122B 8-bit model, maybe that's the reason; spent like 2 days trying various configs with max-num-seqs 1 and max-model-len 2048, and nothing worked :D
Also I did the P2P patch for the 3090-s
Already lost hope with vllm, gave up on running Qwen3.5-35B on all 8 gpus also
My current setup is Qwen3.5-35B on 4 gpus with 262k context for main agent, and 4 gpus on same model with 50k context for sub-agents
eribob@reddit
Are you running them bare metal? I tried to do the P2P patch yesterday but failed. I am running in a VM in proxmox though so maybe P2P does not work there.
Nepherpitu@reddit
FP8 should fit across 8 cards as well
anzzax@reddit
Any reason you prefer AWQ to int4-AutoRound? I'm using 'Intel/Qwen3.5-122B-A10B-int4-AutoRound' now, so asking whether I should switch to AWQ.
Nepherpitu@reddit
Autoround doesn't support tp=4, has worse quality overall (on par with GPTQ and nvfp4).
anzzax@reddit
I see, I'm on a DGX Spark so it's tp=1 for me. I'll check AWQ; from what I read here and on the Nvidia forum it looked like int4-AutoRound is better than AWQ and GPTQ. NVFP4 is a different story.
Radiant_Condition861@reddit
any attempts to get that in sleep mode? level 1 crashes the system a lot. level 2 seems to force a re-tuning...
https://docs.vllm.ai/en/latest/features/sleep_mode/
IZaYaCI@reddit
Also, can you help me understand, you have expert-parallel commented out, is it right that it's either tensor-parallel or expert-parallel?
Nepherpitu@reddit
Nope, it's just that expert parallel is slower than tensor parallel alone
robertpro01@reddit
What's your PP?
Nepherpitu@reddit
Who knows, vllm benchmarks are hard. Somewhere between 4000 and 8000 tps up to 60K context. Around 3.5K at 180K.
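If you want a number anyway, vLLM ships a benchmarking CLI; the subcommand and flag names below are assumptions that vary by version, so check `vllm bench serve --help` first:

```shell
# Hypothetical benchmark run against an already-running server;
# exact flag names vary across vLLM versions.
vllm bench serve \
  --model Qwen/Qwen3.5-122B-A10B-AWQ \
  --dataset-name random \
  --random-input-len 60000 --random-output-len 256 \
  --num-prompts 4
```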
FriendlyTitan@reddit
You can try a Q3 quant of qwen3.5 397b (IQ3_XXS).
I tried something similar but at 2x scale with GLM5.1 on 192GB of VRAM: IQ3_XXS with full context (200k) on llama.cpp, -fit on, -b and -ub 4096. I got pp at ~550-600 t/s and tg at ~20-22 t/s. With concurrent requests (-np 3) tg maxes out at 30 t/s, with no improvement to pp. Would appreciate it if anyone has advice on what to improve. I haven't tried ik_llama.cpp, which iirc many people recommend for this hybrid inference scenario (CUDA + CPU + i-quant).
This was painfully slow for my use so it was just an experiment. I ran qwen 397b Q3_K_XL most of the time with decent success and speed (fully in vram).
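For the hybrid CUDA+CPU case on mainline llama.cpp, the common pattern is to offload all layers and then route the MoE expert tensors back to CPU with a tensor override; a sketch (the model path and sizes are hypothetical):

```shell
# Sketch: dense/shared tensors stay on GPU, expert tensors go to system RAM.
# -ot matches tensor names by regex; "exps" hits the ffn_*_exps expert weights.
./llama-server \
  --model /models/GLM-5.1-IQ3_XXS.gguf \
  -ngl 99 -c 200000 -b 4096 -ub 4096 \
  -ot "exps=CPU"
```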
PaMRxR@reddit
Give it a try with ik_llama.cpp, I use it specifically for Qwen3.5 122B-A10B with 2x24GB in VRAM + 35GB in RAM, which in scale I think is kinda similar to what you are trying. I'm getting 1000 pp/s, a lot better than llama.cpp.
Dontdoitagain69@reddit
Is this for you or a team? If this is just for you, you are bottlenecking 4 expensive compute units through a horrible PCIe bus. It's not a unified 96GB, it's 4x 24GB
NNN_Throwaway2@reddit
Qwen 3.5 27b at bf16 and 397b at q2 with expert offloading. If you really want speed, then 122b at q4 but I’m not personally a fan of that one.
dobkeratops@reddit
if you live somewhere with cheap electricity, there's no such thing as too many GPUs
Eyelbee@reddit
Don't do the 4x3090. Isn't worth it. If I had 96gb I'd still run the same models that fit in 24gb, but at bf16 to make use of extra vram.
Nobby_Binks@reddit
96gb opens up a whole other level. Now you can easily run 120B models with decent context.
Eyelbee@reddit
That's precisely why it's not worth it. Difference is so minuscule.
ormandj@reddit
What? Taking qwen 3.5 for example, the 122b vs. 27b, you get around the same 'coding' performance, far more world knowledge, and higher performance - IF you have the VRAM.
Eyelbee@reddit
122B is actually worse in 90% of the tasks
Veearrsix@reddit
Just got GLM-5.1 running on my 128GB Studio, slow as balls right now. But with a smaller quant it could fit in 96GB.
Whole-Scene-689@reddit
are there quants under 1 bit 🤣
Cupakov@reddit
How did you fit GLM-5.1 on 128gb? Are you offloading to SSD?
Veearrsix@reddit
Yeah, streaming experts into memory from SSD.
xspider2000@reddit
what numbers of pp and tg u get?
inthesearchof@reddit (OP)
The rumored mac studio M5 ultra 512gb appeared to be the dream machine for 10k before the ram crisis.
ironmatrox@reddit
Will 512 Mac studio m5 ultra even appear? 🤞
FoundNil@reddit
It’s actually a great spot for large context. You can do gemma4 31B 8bit quant with 256k context.
-Ellary-@reddit
I would go for big GLM 4.6-4.7 at IQ4XS with partial offload.
Status_Record_1839@reddit
Qwen3.5 235B at Q4 fits in 96GB and it's a completely different league than the 72B. If you're doing any serious reasoning or long context work, the jump is worth it.
lemondrops9@reddit
Do you mean Qwen3 235B ? and I believe only a Q3 would fit.
Cupakov@reddit
It’s an LLM you’re talking to
LikeSaw@reddit
I still don't understand what the point is of LLMs/bots interacting with random reddit posts, replying, etc. Like why??? What for???
Cupakov@reddit
I have no idea either man, I think the most credible option is that people are farming these accounts to sell them later for nefarious purposes
Plenty_Coconut_1717@reddit
Go with Qwen3 235B (quantized). Best performance you can squeeze out of 96GB VRAM right now.
Long_comment_san@reddit
What? This is ancient! Do you use Maverick too? Better take Qwen 400b quantized over 235b!
marsxyz@reddit
400b quantized to fit 96gb , or as I call it, the lobotomy special
Long_comment_san@reddit
Still gonna be far, far better than 235b.
Makers7886@reddit
agreed, as a 235b enjoyer myself the new 122b replaces it.
jacek2023@reddit
I have 3x3090 and I am trying to buy fourth one because it's useful for 120B models, but also small models like 20-40B could use longer context, not to mention TP which makes everything faster on multiple GPUs
ParaboloidalCrest@reddit
But needless to say, the 4th GPU requires a lot of rearranging in the case, more than one riser, bifurcation, hooking up another PSU... it's a royal pain in the ass.
Been thinking about it but I think I'm happy with qwen35-122b iq4xs with 256k of context.
jacek2023@reddit
I failed to fit two 3090s into my desktop, so I switched to an open frame, and now a fourth card and cooling are not an issue
ParaboloidalCrest@reddit
Yeah, those 3090s are chunky. I was lucky to find 3x ASRock 7900 XTX Creator Edition cards, which are blower-style, so all of them fit on the mobo easily. The 4th would be another story of course.
Long_comment_san@reddit
96GB is completely pointless. 48GB with dual 3090s is all you need. It can fit any 30B-class model at Q6-Q8 with plenty left over for context. You can also run any MoE (pretty much on a single 3090, actually) and offload quite a few layers to VRAM to speed it up. It can also fit GLM 4.7 Flash and Qwen 35B-A3B fully if you really need speed. I would definitely target dense 30-50B models though.
By doubling to 96GB you're going to need a massive power supply, and it's going to be hot and loud. Thing is, going from 48 to 96, the only thing you currently gain is faster large MoE models. That's literally it.
Thistlemanizzle@reddit
You sound like you know what you're talking about. I have a 5070 with 12GB of VRAM, but I also have 96GB of regular RAM. My experience with hybrid GPU+RAM inference has been poor, but I'm just starting out.
In your opinion, is it basically VRAM or nothing? Or is hybrid okay, but only worth a small amount of extra RAM, with little benefit from having this much on hand?
Long_comment_san@reddit
You can really only use MoE models for hybrid. With 12 gigs you can fit all the smaller MoE models with loads of context. You should be golden with 120B native 4-bit quants. Qwen, Nemotron, and Mistral all have ~120B models with about 6B active parameters each, and you can probably fit a Q6 of one of those in your RAM
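A back-of-the-envelope fit check; the 120B-total/6B-active figures are this comment's rough numbers, while the bits-per-weight and the ~80 GB/s dual-channel DDR5 bandwidth are my assumptions:

```python
# Rough GGUF-style model size: total params * bits per weight.
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Upper bound on decode speed: each token reads all active weights once.
def tg_upper_bound(active_b: float, bits_per_weight: float, bw_gbs: float) -> float:
    return bw_gbs * 1e9 / (active_b * 1e9 * bits_per_weight / 8)

print(model_size_gb(120, 6.5))               # 97.5 GB: Q6-ish is borderline in 96 GB
print(model_size_gb(120, 5.5))               # 82.5 GB: Q5-ish fits with headroom
print(round(tg_upper_bound(6, 6.5, 80), 1))  # 16.4 t/s ceiling at ~80 GB/s
```

So a Q6 of a 120B model is actually right at the edge of 96GB of RAM; a Q5 fits comfortably, and with only ~6B active the bandwidth ceiling stays usable.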
TacGibs@reddit
🤡
90hex@reddit
Gemma 4 122B A10B, Kimi 1T, Gemma4 26B etc. If you have plenty of RAM on the side you can load even larger models (say Kimi or GLM). Strangely Gemma 4 31B seems to beat most larger models from last year on many benchmarks, so that’s my favorite so far. It even beats Opus in some silly tests.
_-_David@reddit
Gemma 4 122b-a10b doesn't exist... "Kimi 1T" instead of Kimi K2.5.. You're bot-spicious
90hex@reddit
Woops I was missing a couple details. Thanks for pointing it out ! I swear I’m not a bot. I’m just a forgetful human.
anomaly256@reddit
And em-dashes — don't forget em-dashes
90hex@reddit
Yeah alright some models avoid them now. Clever bastards.
inthesearchof@reddit (OP)
The mysterious gemma4 124b
inthesearchof@reddit (OP)
Do you work at google? Gemma4 31b after getting fixes is turning out to be very nice
90hex@reddit
Nope just a happy user. I like both Qwen3.5 and Gemma4. Gemma4 seems even better now that I use the 31B and 26B variants.
Bird476Shed@reddit
GLM as main model, when it fails try with Qwen instead.
ambient_temp_xeno@reddit
2x 3090 and gemma 4 31b seems like the move*
*this week.
NoahFect@reddit
If you like image generation models, HunyuanImage-3 runs pretty well on a 96GB rig (RTX6000 in my case.)
spicypicsforsharing@reddit
can it run on multiple GPUs?
lemondrops9@reddit
Doubt it, I have the same issue. There are some workarounds in ComfyUI, but ComfyUI drives me nuts so I haven't tried in a while.
spicypicsforsharing@reddit
me either. maybe a weekend project
lemondrops9@reddit
Let me know if you get it going.
spicypicsforsharing@reddit
will do!
AurumDaemonHD@reddit
You can always run parallel agentic workflows with multi-batch on smaller models. Or have each GPU load a separate vLLM instance, or use pipeline parallel or TP with batching, idk tbh.
This is what the claw folk have been doing: you spin up a local server, point claw at it to go crazy in a sandbox, and come back to whatever monstrosity you created.
jikilan_@reddit
Should be up to 120B+ for 96GB with high KV cache