96GB Vram. What to run in 2026?
Posted by inthesearchof@reddit | LocalLLaMA | View on Reddit | 88 comments
I was all set on doing the 4x 3090 route, but with the current releases of Qwen 3.5 and Gemma 4 I am having second thoughts. 96GB of VRAM seems to be in a weird spot: it's not enough to run the larger models and more than needed for the mid-size models. What are you running as your main model?
ValeKokPendek@reddit
4 instances of your favourite model
inthesearchof@reddit (OP)
Maybe minimax 2.7 q2 when released or qwen3.6 122b?
VoidAlchemy@reddit
I was asking myself the same question, started a post here with some preliminary comparisons: https://www.reddit.com/r/LocalLLaMA/comments/1sjsokz/minimaxm27_vs_qwen35122ba10b_for_96gb_vram_full/
VoidAlchemy@reddit
96GB is great, and if you use ik_llama.cpp's
-sm graphor try the mainline llama.cpp experimental feature-sm tensoryou can use all 4x of your GPUs for "tensor parallel" kind of operation similar to vLLM etc.My "daily driver" is opencode plus ubergarm/Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW) with 256k uncompressed kv-cache which I designed to fit snuggly onto 2x older A6000 GPUs (basically 48gb vram 3090s).
I personally find it better than Qwen3.5-27B dense and definitely better than gemma-4-31b-it dense, both of which are slower too given more active weights.
Your rig is great, no need for fomo, enjoy what you have! Cheers!
No_Algae1753@reddit
I use Q4_K_XL. Which of these is better?
VoidAlchemy@reddit
The IQ5_KS is the best available for a 96GB VRAM offload; it requires ik_llama.cpp (ik did many of the mainline llama.cpp quantization implementations, like the ones in your quant; I'm using his newer stuff).
Nepherpitu@reddit
Qwen 3.5 122B at AWQ, GPTQ, or NVFP4 fits with 200K+ context, running on vLLM at 110+ tps.
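For reference, a minimal launch sketch for that kind of setup; the model ID, quant choice, and context length here are assumptions, not the commenter's exact command:

```shell
# Sketch only: model ID and values are illustrative.
vllm serve Qwen/Qwen3.5-122B-A10B-AWQ \
  --tensor-parallel-size 4 \
  --max-model-len 204800 \
  --gpu-memory-utilization 0.95
```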
inthesearchof@reddit (OP)
Qwen 27B seems to match or come close to the 122B MoE, but there is the world-knowledge difference
Nepherpitu@reddit
Speed is a quality as well. 27B is slower.
Panthau@reddit
exactly, around 10-15t/s on my strix halo 128gb
Trashposter666@reddit
That's exactly why I returned mine. Just too slow.
Panthau@reddit
Well, I get around 28t/s on Qwen 3.5 122B at Q4... for me, that's fine. If you want something more productive, it will be way more expensive... that's how it's always been.
inthesearchof@reddit (OP)
It's frustrating when those companies are crippling their consumer solutions to around 250GB/s bandwidth.
uti24@reddit
I mean, companies are not entirely crippling them. The AMD AI Max memory configuration is 4-channel 8000 MT/s DDR5; that's as good as it gets, nothing crippled here. I would also want faster memory, but how? More channels? It would cost even more and be an even more niche product, since a PC with otherwise similar specs already costs about half as much.
Mil0Mammon@reddit
Medusa halo is rumored to have 6 channels and lpddr6, so about 2.7x the BW
uti24@reddit
How come 2.7x the BW?
Strix Halo has 4 channels, so if Medusa really has 6, then it's 1.5x; maybe at 10,000 MT/s it's closer to 1.9x.
So significantly better but not fantastically better
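The channel arithmetic can be sketched out; the figures below are the ones floated in this thread (and plain 64-bit channels are assumed), not confirmed specs:

```python
# Peak bandwidth = channels * bytes per transfer * transfer rate.
def bandwidth_gbs(channels: int, mt_per_s: int, bits_per_channel: int = 64) -> float:
    return channels * (bits_per_channel / 8) * mt_per_s / 1000

strix = bandwidth_gbs(4, 8000)        # 4ch at 8000 MT/s  -> 256 GB/s
medusa = bandwidth_gbs(6, 10000)      # rumored: 6ch at 10000 MT/s
print(strix, medusa, medusa / strix)  # 256.0 480.0 1.875
```

With 64-bit channels the rumored config lands near 1.9x; the 2.7x figure presumably assumes LPDDR6's different per-channel layout on top of the extra channels.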
Panthau@reddit
You gotta go with what's available. I paid 2700 euros for that device, which is less than one 5090 but 4x the RAM. Depending on your goals, you might have different needs... but for most, it's a great price tag for what the device delivers.
inthesearchof@reddit (OP)
I remember when they were selling for $1600. Best value for larger MoE models at low watts
nostriluu@reddit
Apparently it's the limits of mainstream PC technology more than "crippling" their technology. It takes denser circuits that are more difficult to design and produce, and billions of dollars to make a chip plant to current specs (JEDEC). The market would have to be badly warped for all the companies to agree not to produce fast RAM if it were feasible. Making a PC out of commodity parts is the strength and weakness of the PC market. Apple largely designs their own parts (aside from RAM modules) and puts everything in one package (UMA), so they have to spend more on research and production but take more profit because of their difference.
The PC market has been stuck with slow RAM for a while, most enthusiast PCs from the past decade were about gaming so focused on the GPU and CPU rather than fast RAM. A workstation PC has fast RAM, but because it's still based on commodity parts rather than putting everything on one package, they're huge and power hungry and due to the bifurcation, expensive.
Karyo_Ten@reddit
For AMD I understand; it's way cheaper, and it wasn't designed for AI inference at first. Nvidia though... they wanted it as an AI platform from the get-go. And at that price... they put in a 5070-class GPU that is crippled by the bandwidth
IZaYaCI@reddit
Via vLLM? Can't get it to run even on 8x 3090, can you share your command/params?
Nepherpitu@reddit
Key points:
- All GPUs have 0GB reserved by the system; all 24GB are free. No GUI, no DE, no WM, only a terminal under Ubuntu Server.
- `--max-num-seqs 2` - the default will try to capture graphs for 16 users, which is too much for 96GB VRAM.
- `--attention-backend flashinfer` - free performance!
- `--max-num-batched-tokens 4096` - the default is 2048, but 4096 gives faster PP.
- `--gpu-memory-utilization 0.955` - ONLY achievable if 0GB is reserved by the system. After the weights load I have `Available KV cache memory: 2.64 GiB` in the logs, then `Auto-fit max_model_len: reduced from 262144 to 223872 to fit in available GPU memory (2.64 GiB available for KV cache)`. For comparison, CachyOS with KDE uses `2.677Gi/31.843Gi` on my 5090; the OS uses more VRAM than is available for KV cache. There are ZERO chances you will be able to run this model on GPUs shared with an OS desktop.

IZaYaCI@reddit
Thanks so much! I will try that. I was trying to run the 122B 8-bit model, maybe that's the reason; spent like 2 days trying various configs with max-num-seqs 1 and max-model-len 2048, and nothing worked :D
Also I did the P2P patch for the 3090-s
Already lost hope with vllm, gave up on running Qwen3.5-35B on all 8 gpus also
My current setup is Qwen3.5-35B on 4 gpus with 262k context for main agent, and 4 gpus on same model with 50k context for sub-agents
eribob@reddit
Are you running them bare metal? I tried to do the P2P patch yesterday but failed. I am running in a VM in proxmox though so maybe P2P does not work there.
Nepherpitu@reddit
FP8 should fit across 8 cards as well
anzzax@reddit
Any reason you prefer AWQ to int4-AutoRound? I'm using 'Intel/Qwen3.5-122B-A10B-int4-AutoRound' now, so asking whether I should switch to AWQ.
Nepherpitu@reddit
Autoround doesn't support tp=4, has worse quality overall (on par with GPTQ and nvfp4).
anzzax@reddit
I see, I'm on a DGX Spark so it's tp=1 for me. I'll check AWQ; from what I read here and on the Nvidia forum it looked like int4-AutoRound is better than AWQ and GPTQ. NVFP4 is a different story.
Radiant_Condition861@reddit
any attempts to get that in sleep mode? level 1 crashes the system a lot. level 2 seems to force a re-tuning...
https://docs.vllm.ai/en/latest/features/sleep_mode/
IZaYaCI@reddit
Also, can you help me understand, you have expert-parallel commented out, is it right that it's either tensor-parallel or expert-parallel?
Nepherpitu@reddit
Nope, it's just that expert parallel is slower than tensor parallel alone
robertpro01@reddit
What's your PP?
Nepherpitu@reddit
Who knows, vllm benchmarks are hard. Somewhere between 4000 and 8000 tps up to 60K context. Around 3.5K at 180K.
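If you want a number anyway, vLLM ships a benchmarking CLI; the subcommand and flag names below are assumptions that vary by version, so check `vllm bench serve --help` first:

```shell
# Hypothetical benchmark run against an already-running server;
# exact flag names vary across vLLM versions.
vllm bench serve \
  --model Qwen/Qwen3.5-122B-A10B-AWQ \
  --dataset-name random \
  --random-input-len 60000 --random-output-len 256 \
  --num-prompts 4
```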
FriendlyTitan@reddit
You can try a Q3 quant of qwen3.5 397b (IQ3_XXS).
I tried something similar but at 2x scale with GLM5.1 on 192GB of VRAM: IQ3_XXS with full context (200k) on llama.cpp, -fit on, -b and -ub 4096. I got pp at ~550-600 t/s and tg at ~20-22 t/s. With concurrent requests (-np 3) tg maxes out at 30 t/s, with no improvement to pp. Would appreciate it if anyone has advice on what to improve. I haven't tried ik_llama.cpp, which iirc many people recommend for this hybrid inference scenario (CUDA + CPU + i-quant).
This was painfully slow for my use so it was just an experiment. I ran qwen 397b Q3_K_XL most of the time with decent success and speed (fully in vram).
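For the hybrid CUDA+CPU case on mainline llama.cpp, the common pattern is to offload all layers and then route the MoE expert tensors back to CPU with a tensor override; a sketch (the model path and sizes are hypothetical):

```shell
# Sketch: dense/shared tensors stay on GPU, expert tensors go to system RAM.
# -ot matches tensor names by regex; "exps" hits the ffn_*_exps expert weights.
./llama-server \
  --model /models/GLM-5.1-IQ3_XXS.gguf \
  -ngl 99 -c 200000 -b 4096 -ub 4096 \
  -ot "exps=CPU"
```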
PaMRxR@reddit
Give it a try with ik_llama.cpp, I use it specifically for Qwen3.5 122B-A10B with 2x24GB in VRAM + 35GB in RAM, which in scale I think is kinda similar to what you are trying. I'm getting 1000 pp/s, a lot better than llama.cpp.
Dontdoitagain69@reddit
Is this for you or a team? If this is just for you, you are bottlenecking 4 expensive compute units through a horrible PCIe bus. It's not a unified 96GB, it's 4x 24GB
NNN_Throwaway2@reddit
Qwen 3.5 27b at bf16 and 397b at q2 with expert offloading. If you really want speed, then 122b at q4 but I’m not personally a fan of that one.
dobkeratops@reddit
if you live somewhere with cheap electricity, there's no such thing as too many GPUs
Eyelbee@reddit
Don't do the 4x3090. Isn't worth it. If I had 96gb I'd still run the same models that fit in 24gb, but at bf16 to make use of extra vram.
Nobby_Binks@reddit
96gb opens up a whole other level. Now you can easily run 120B models with decent context.
Eyelbee@reddit
That's precisely why it's not worth it. Difference is so minuscule.
ormandj@reddit
What? Taking qwen 3.5 for example, the 122b vs. 27b, you get around the same 'coding' performance, far more world knowledge, and higher performance - IF you have the VRAM.
Eyelbee@reddit
122B is actually worse in 90% of the tasks
Veearrsix@reddit
Just got GLM-5.1 running on my 128GB Studio, slow as balls right now. But with a smaller quant it could fit in 96GB.
Whole-Scene-689@reddit
are there quants under 1 bit 🤣
Cupakov@reddit
How did you fit GLM-5.1 on 128gb? Are you offloading to SSD?
Veearrsix@reddit
Yeah, streaming experts into memory from SSD.
xspider2000@reddit
what numbers of pp and tg u get?
inthesearchof@reddit (OP)
The rumored mac studio M5 ultra 512gb appeared to be the dream machine for 10k before the ram crisis.
ironmatrox@reddit
Will 512 Mac studio m5 ultra even appear? 🤞
FoundNil@reddit
It’s actually a great spot for large context. You can do gemma4 31B 8bit quant with 256k context.
-Ellary-@reddit
I would go for big GLM 4.6-4.7 at IQ4XS with partial offload.
Status_Record_1839@reddit
Qwen3.5 235B at Q4 fits in 96GB and it's a completely different league than the 72B. If you're doing any serious reasoning or long context work, the jump is worth it.
lemondrops9@reddit
Do you mean Qwen3 235B ? and I believe only a Q3 would fit.
Cupakov@reddit
It’s an LLM you’re talking to
LikeSaw@reddit
I still don't understand what the point is of LLMs/bots interacting with random reddit posts, replying, etc. Like why??? What for???
Cupakov@reddit
I have no idea either man, I think the most credible option is that people are farming these accounts to sell them later for nefarious purposes
Plenty_Coconut_1717@reddit
Go with Qwen3 235B (quantized). Best performance you can squeeze out of 96GB VRAM right now.
Long_comment_san@reddit
What? This is ancient! Do you use Maverick too? Better take Qwen 400b quantized over 235b!
marsxyz@reddit
400b quantized to fit 96gb , or as I call it, the lobotomy special
Long_comment_san@reddit
Still gonna be far, far better than 235b.
Makers7886@reddit
agreed, as a 235b enjoyer myself the new 122b replaces it.
jacek2023@reddit
I have 3x3090 and I am trying to buy fourth one because it's useful for 120B models, but also small models like 20-40B could use longer context, not to mention TP which makes everything faster on multiple GPUs
ParaboloidalCrest@reddit
But needless to say, the 4th GPU requires a lot of rearranging in the case, more than one riser, bifurcation, hooking up another PSU... it's a royal pain in the ass.
Been thinking about it but I think I'm happy with qwen35-122b iq4xs with 256k of context.
jacek2023@reddit
I failed to fit two 3090s into my desktop, so I switched to an open frame, and now a fourth card and cooling are not an issue
ParaboloidalCrest@reddit
Yeah, those 3090s are chunky. I was lucky to find 3x ASRock 7900 XTX Creator Edition cards, which are blower-style, so all of them fit on the mobo easily. The 4th would be another story of course.
Long_comment_san@reddit
96GB is completely pointless. 48GB with dual 3090s is all you need. It can fit any 30B-class model at Q6-Q8 with plenty left over for context. You can also run any MoE (pretty much on a single 3090, actually) and offload quite a few layers to VRAM to speed it up. It can also fit GLM 4.7 Flash and Qwen 35B-A3B fully if you really need speed. I would definitely target dense 30-50B models though.
By doubling to 96GB you're going to need a massive power supply, and it's going to be hot and loud. Thing is, going from 48 to 96, the only thing you currently gain is faster large MoE models. That's literally it.
Thistlemanizzle@reddit
You sound like you know what you're talking about. I have a 5070 with 12GB of VRAM, but I also have 96GB of regular RAM. My experience with hybrid GPU+RAM inference has been poor, but I'm just starting out.
In your opinion, is it basically VRAM or nothing? Or is hybrid okay, but only worth a small amount of extra RAM, with little benefit from having this much on hand?
Long_comment_san@reddit
You can really only use MoE models for hybrid. With 12 gigs you can fit all the smaller MoE models with loads of context. You should be golden with 120B native 4-bit quants. Qwen, Nemotron, and Mistral all have ~120B models with about 6B active parameters each, and you can probably fit a Q6 of one of those in your RAM
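A back-of-the-envelope fit check; the 120B-total/6B-active figures are this comment's rough numbers, while the bits-per-weight and the ~80 GB/s dual-channel DDR5 bandwidth are my assumptions:

```python
# Rough GGUF-style model size: total params * bits per weight.
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Upper bound on decode speed: each token reads all active weights once.
def tg_upper_bound(active_b: float, bits_per_weight: float, bw_gbs: float) -> float:
    return bw_gbs * 1e9 / (active_b * 1e9 * bits_per_weight / 8)

print(model_size_gb(120, 6.5))               # 97.5 GB: Q6-ish is borderline in 96 GB
print(model_size_gb(120, 5.5))               # 82.5 GB: Q5-ish fits with headroom
print(round(tg_upper_bound(6, 6.5, 80), 1))  # 16.4 t/s ceiling at ~80 GB/s
```

So a Q6 of a 120B model is actually right at the edge of 96GB of RAM; a Q5 fits comfortably, and with only ~6B active the bandwidth ceiling stays usable.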
TacGibs@reddit
🤡
90hex@reddit
Gemma 4 122B A10B, Kimi 1T, Gemma4 26B etc. If you have plenty of RAM on the side you can load even larger models (say Kimi or GLM). Strangely Gemma 4 31B seems to beat most larger models from last year on many benchmarks, so that’s my favorite so far. It even beats Opus in some silly tests.
_-_David@reddit
Gemma 4 122b-a10b doesn't exist... "Kimi 1T" instead of Kimi K2.5.. You're bot-spicious
90hex@reddit
Woops I was missing a couple details. Thanks for pointing it out ! I swear I’m not a bot. I’m just a forgetful human.
anomaly256@reddit
And em-dashes — don't forget em-dashes
90hex@reddit
Yeah alright some models avoid them now. Clever bastards.
inthesearchof@reddit (OP)
The mysterious gemma4 124b
inthesearchof@reddit (OP)
Do you work at google? Gemma4 31b after getting fixes is turning out to be very nice
90hex@reddit
Nope just a happy user. I like both Qwen3.5 and Gemma4. Gemma4 seems even better now that I use the 31B and 26B variants.
Bird476Shed@reddit
GLM as main model, when it fails try with Qwen instead.
ambient_temp_xeno@reddit
2x 3090 and gemma 4 31b seems like the move*
*this week.
NoahFect@reddit
If you like image generation models, HunyuanImage-3 runs pretty well on a 96GB rig (RTX6000 in my case.)
spicypicsforsharing@reddit
can it run on multiple GPUs?
lemondrops9@reddit
Doubt it, I have the same issue. There are some workarounds in ComfyUI, but ComfyUI drives me nuts so I haven't tried in a while.
spicypicsforsharing@reddit
me either. maybe a weekend project
lemondrops9@reddit
Let me know if you get it going.
spicypicsforsharing@reddit
will do!
AurumDaemonHD@reddit
You can always run parallel agentic workflows with multi-batch on smaller models. Or have each GPU load a separate vLLM instance, or use pipeline parallel or TP with batching, idk tbh.
This is what the claw folk have been doing: you spin up a local server, point claw at it to go crazy in a sandbox, and come back to whatever monstrosity you created.
jikilan_@reddit
Should be up to 120B+ for 96GB with high KV cache