48GB VRAM users, what are your daily drivers? Do you wish you had more VRAM? What would you run if you did?

[-]

pArbo@reddit

I have 96GB in a strix-halo setup and I'm achieving what feels like sonnet level results at about 50-60 tokens/s and Q_8 132k context. its pretty dope. I know it can be a lot faster with a discrete GPU but these results are very cool.

[-]

Borkato@reddit (OP)

What model?

[-]

pArbo@reddit

qwen3.6-35b

[-]

Borkato@reddit (OP)

Dude qwen 3.6 35B is my favorite model. It can do almost everything lol

[-]

LORD_CMDR_INTERNET@reddit

Qwen 3.6 27b Q8 with 150k is a perfect fit for 48GB

[-]

HaggardSummaries@reddit

Unsloth Q8 fits full 262k context, been the daily driver on my dual 3090 since it came out.

[-]

rwa2@reddit

Been playing around with the Qwen 35b A3B FP8 and got these llmperf benchmarks on a RTX 6000 Ada

Will try to repeat on the 27b dense model soon!

[-]

Borkato@reddit (OP)

Why no llama cpp

[-]

stoppableDissolution@reddit

Former 48gb user. Used to mainly run either q4 of various llama3 70b tunes or full-precision mistral small, then q6 gemma4 31b.

Got 96gb now and still running almost exclusively gemma but now with few hundred thousands of context and much faster! Plus occasionally qwen27 and q4 mistral medium. One hell of financially irresponsible decision but no regrets.

[-]

ROS_SDN@reddit

Yeah you'd say the move to 96GB wasn't that big a jump for the models you run, but more the QoL of running multiple models/contexts etc?

[-]

stoppableDissolution@reddit

Yea, exactly. They are not really making models for that range and running huge moes is still virtually impossible, but it is such a relief to be able to just send it without having to close the browser before launching lcpp, lol. And it is significantly faster than 2x3090, and having the ram in one big chunk also really helps, and I can load more than one mid-sized model indeed. And, well, now I can finally finetune without renting runpod!

So not next level of the models, but very very big qol.

[-]

Borkato@reddit (OP)

Is it bad that I lowkey hope they DON’T ever make models for 50-100GB vram anymore lmao. I hope I don’t get my 48GB and instantly wish I had more…

[-]

ROS_SDN@reddit

You'll always want more

[-]

Borkato@reddit (OP)

How good is mistral medium? Would it be worth trying it at Q2 😂

[-]

Icy_Butterscotch6661@reddit

Why Gemma instead of Qwen?

[-]

Royal-Elderberry6050@reddit

There’s no such thing as “enough ram”

[-]

SillyLLM@reddit

Literally 640K

[-]

raika11182@reddit

Gemma 4 31B Q8 GGUF is the daily driver. You can get a nice context size and split the workload across two GPUs. Using GGUF's because these are old P40 cards.

Already in the leftover space, if I'm good with 32K context, I can run an image model on one card, and also get TTS, STT, etc. loaded.

I'm not sure what I'd do with more VRAM, but at the moment Gemma 4 has been the best daily driver local model experience I've had. I've played with larger models (albeit at slow speed once I start eating into RAM as well), and at the moment most of the models just feel kind of "last gen" compared to Gemma 4.

I guess I'd like to play with that new Mistral model if had some more space and horsepower. I've recently started to become very suspicious of quantization and avoid going below Q8 on anything.

[-]

pdawes@reddit

I've recently started to become very suspicious of quantization and avoid going below Q8 on anything.

Would you be willing to share your overall sense of what led you to feel this way?

[-]

nicholas_the_furious@reddit

I have 2x 3090s and I feel the same thing. The Q8s are much more polished and less likely to miss something. If Q4 is 95% of a Q8, that's still a 1 in 20 chance it fucks up something and you don't realize it.

[-]

Borkato@reddit (OP)

What Q8 do you run? I’m getting a second 3090 and I’m so excited haha

[-]

nicholas_the_furious@reddit

Qwen 3.6 27B primary and if I need speed then 35BA3B. They both fit with full context. I also run full KV, no quant.

Even then, you'll notice a soft wall around 150k tokens. That should be your compact checkpoint or like 120k.

[-]

Borkato@reddit (OP)

Wait, so qwen 27B Q8 fits completely on 2 3090s with full context and no kv quanting with full context like 250k or whatever? 😮

[-]

nicholas_the_furious@reddit

Yes 100%. I don't have any monitors plugged in so no VRAM overhead. Otherwise you're looking at like 200k context. Which is still plenty.

[-]

Borkato@reddit (OP)

I’m so surprised!! What’s your tps?

[-]

raika11182@reddit

It depends what you use it for tbh. It's hard to put your finger on it but I can feel it in the way it writes and communicates.

Think of it this way: it's literally HALF the data of Q8, and a quarter of f16. You're losing a whole lot in there, especially in terms of depth and all of that... Well... Training. I don't do anything important, but Q4 just doesn't "feel" right to me in things like RPGs or whatever. It just lacks that spark thay makes AI cool and feels closer and closer to that "fancy autocomplete" vibe.

[-]

wgaca2@reddit

i run q6 at 15-20 t/s on 2x3090, what speeds do you get?

[-]

munkiemagik@reddit

I get 20tk/s on Q8 with my 2x 3090. I cant remember if at all/how much difference it makes but I use:

--batch-size 4096
--ubatch-size 1024
--spec-type ngram-mod
--spec-ngram-mod-n-match 16
--spec-draft-n-max 4

I'm probably doing something wrong as usual but it works well enough for me to not bother digging into it any further.

[-]

cu-pa@reddit

what use case do you handle for 20 tk/s?

[-]

Vaping_Cobra@reddit

I am still a bit sad the industry gave up on the 70B model family for the most part. It is mostly sub 8B models, 24-32B models and then it seems to jump to 100B+. Even if I were to add a third P40 to my server, it would only be used for adding more context or secondary services for TTS/STT or image recognition.

If we still had high quality SOTA 70B models I might consider a third P40 just to run one at Q4-Q6, but as it stands the next step would be to add two more p40's to bring the pool up to 92GB and allow running 120B size models at Q4 and that requires hardware with a lot of PCIe lanes so you start hitting even more walls with 3+ cards.

[-]

himefei@reddit

We all know why, if they don’t fuk up, a 70b Qwen3.6 or Gemma4 we’ll be very close to frontier models performance

[-]

MarcusAurelius68@reddit

I can squeeze this into my 40GB of VRAM in Q8. Not a ton of extra context space though.

[-]

Far-Low-4705@reddit

Wouldn’t u rather run Q4 and get the extra speed boost?

I find I almost always prefer that, I can’t really say there have ever been any problems that Q8 could solve while Q4 couldn’t

[-]

IgnisIason@reddit

2nd on Gemma 4. I like it even over larger models

[-]

munkiemagik@reddit

Congrats, from 32 > 48 is a nice bump, gives you that little bit more room for more useful context. with 32GB you are anxiously watching your context fill up rapidly right from the get go and being right on the limit of your 32GB VRAM

I used to (technically still do) have a 'convertible' LLM server. ie it used to flip-flop between 48GB <> 80GB VRAM

2x 3090 +/- 1x 5090

This was back when i used GPT-OSS-120B and GLM4.5-Air a lot and all the other models of that era.

Since qwen3.6 I just cant be bothered to go through the hassle of taking my SFF case apart (its genuinely a ballache as its a deshrouded MSI Ventus 5090 shoehorned into a FormD T1 case which needs to be dismanteld to get such large GPU in or out) to transfer the 5090 into the LLM rig.

LIke everyone else here -

Would I like more VRAM - hell yes

Do I think its worth it for me to pay another £XXXX or whatever a 5090/6000/5000 pro costs - NOPE

In fact I probably run qwen3.6 27b and 35b (both 6bit quants) more on the single 5090 more than I do on the dual 3090 for now and if i need more capable I prefer to just consume paid cloud tokens.

I still from time to time ponder over what to do next, the other day I was tempted to sell the 3090's to grab an AMD PRO 48GB, with intention to get a 2nd AMD 48gb at some indeterminate time later.

Sometimes I get a bit daft/reckless and almost decide to just order an RTX 6000 Pro becasue I have no self control but what stops me is that I dont really have any real use-case for 96GB VRAM, I feel like if I was to invest my time and money 128GB+ is the next step up that I need to aim for. If I did dump the 3090s in favour of AMD 48GB, then eventually x3 of those would put me in a nice spot at 144GB for considerably less than Nvidia offerings.

[-]

Ornery_Hall@reddit

My currently is 5090+ 24G pro4000,which give me 56G of VRAM, most of work was handled by 5090 while pro 4000 was used as a memory tank. what I like about pro4000 is its a single slot unit with only 145W, which can fit in the same case with 5090 monster but still have good airflow to keep the system quiet and cool. The C/P ratio for pro 4000 is very good, 24G for $1499 is way below 1G VRAM per $100 for graphic card. I would invest in another pro4000 if I can find a raise cable that can sit under 5090.

[-]

aeroumbria@reddit

MTP pushed the comfort zone for the current wave of 30B class models slightly above 48GB at Q8. Otherwise it is working really well so far. I did feel a lot more comfortable with a 64GB setup though, on which you can pretty much dump the unsloth default parameters and forget about tensor split shuffling for best performance.

[-]

Kal-LZ@reddit

Gemma4 26B with 2xR9700 for most tasks. Tried Qwen3.6 27B but it's a bit slow (24-29 tokens) even with MTP.

Maybe add a 3rd card to try GPT OSS 120B

[-]

Weird_Llama_317@reddit

Same as xza_nomad33. I get consistently 50-60tps with dual R9700 (linux).

You should check your setup. My start command is below:

llama-server \
-m ~/aimodels/Qwen3.6-27B-Q8_0.gguf \
--host 0.0.0.0   --port 8080 \
--spec-type draft-mtp   --spec-draft-n-max 5 \
-sm tensor \
-fa on \
-ngl 999 \
-c 262144 \
-np 2

Model is https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF

[-]

xza_nomad33@reddit

My 2x R9700 does near 50t/s with mtp. Running Qwen 3.6 27B Q8.

[*]
flash-attn   = on      
stop-timeout = 1200     
n-gpu-layers = 999      
jinja        = true
split-mode = tensor
#tensor-split= 0.5,0.5
threads = 16
; fit-target = 512
; fit = on
; no-mmap=true
; direct-io=true
; swa-full=true
cache-ram=20000
kv-unified = true
parallel=1


[Qwen3.6-27B-Q8_0]
model = /models/MTP/Qwen3.6-27B-Q8_0.gguf
fit = on
fit-target = 2048,1024
mmproj = /models/MTP/Qwen3.6-27B-mmproj-F16.gguf
temp             = 0.6
top-p            = 0.95
top-k            = 20
min-p            = 0.0
presence-penalty = 0.0
repeat-penalty   = 1.0
#cache-type-k = bf16 #q8_0
#cache-type-v = bf16 #q8_0
#chat-template-kwargs = {"preserve_thinking":true}
spec-type = draft-mtp #,ngram-mod,ngram-map-k4v
spec-draft-n-max = 3
; spec-ngram-mod-n-match = 24
; spec-ngram-mod-n-min = 48
; spec-ngram-mod-n-max = 64
; spec-ngram-map-k4v-size-n = 16
; spec-ngram-map-k4v-size-m = 24
; spec-ngram-map-k4v-min-hits = 1


batch-size = 2048
ubatch-size = 1024
alias = default[*]
flash-attn   = on      
stop-timeout = 1200     
n-gpu-layers = 999      
jinja        = true
split-mode = tensor
; tensor-split= 0.5,0.5
threads = 16
; fit-target = 512
; fit = on
; no-mmap=true
; direct-io=true
; swa-full=true
cache-ram=20000
kv-unified = true
parallel=1


[Qwen3.6-27B-Q8_0]
model = /models/MTP/Qwen3.6-27B-Q8_0.gguf
fit = on
fit-target = 2048,1024
mmproj = /models/MTP/Qwen3.6-27B-mmproj-F16.gguf
temp             = 0.6
top-p            = 0.95
top-k            = 20
min-p            = 0.0
presence-penalty = 0.0
repeat-penalty   = 1.0
; cache-type-k = bf16 #q8_0
; cache-type-v = bf16 #q8_0
; chat-template-kwargs = {"preserve_thinking":true}
spec-type = draft-mtp #,ngram-mod,ngram-map-k4v
spec-draft-n-max = 3
; spec-ngram-mod-n-match = 24
; spec-ngram-mod-n-min = 48
; spec-ngram-mod-n-max = 64
; spec-ngram-map-k4v-size-n = 16
; spec-ngram-map-k4v-size-m = 24
; spec-ngram-map-k4v-min-hits = 1


batch-size = 2048
ubatch-size = 1024
alias = default

```

[-]

Xylildra@reddit

46Gb VRAM here. But just recently upgraded to 58, very soon 70! My daily was Skyfall 31B by “TheDrummer” it’s wonderful, BUT… People are swearing by Gemma 31B with a finetune. I’ve never used it yet, but it should be great from what I’ve read. Hope this helps.

[-]

illcuontheotherside@reddit

Unsloths google gemma4 32b q4 xl with googles latest jinja chat template.

Seriously underrated model.

[-]

-dysangel-@reddit

Do you wish you had more VRAM?

Who is ever going to say "no" to this?

[-]

National_Meeting_749@reddit

This. I'd have about 1TB of VRAM if that didn't cost as much as some houses.

[-]

yuicebox@reddit

1 tb of vram has gotta be way more than 99.99% of houses, right?

[-]

National_Meeting_749@reddit

If you do it with these then that's a bargain of only 130K in GPU's after taxes. Estimate 70K more for everything that surrounds those GPUs and you're at 200K.

Nothing to scoff at, more than I have, but I wouldn't say more than 99.9% of houses.

[-]

CubicleHermit@reddit

How much would the electricity cost to run it?

[-]

yuicebox@reddit

honestly, not as bad as I imagined. I was just thinking of the ungodly costs of Nvidia's server nodes.

[-]

National_Meeting_749@reddit

Well, we're trying to get 1tb vram here, gddr will work. We won't be greedy and ask for hbm 😉

[-]

HugoCortell@reddit

Not if it's all K80s

[-]

jamesbuniak@reddit

i have a lot of 16 on ebay right now, want them?

[-]

slalomz@reddit

Depends on what you mean by VRAM (1TB is just 2 maxed out Mac Studios!) or how many GPUs you're willing to put up with to get there.

[-]

yuicebox@reddit

RIP 512gb Mac Studio, gone but not forgotten :'(

[-]

Ok_Technology_5962@reddit

The legend, the budget vram king

[-]

Creative-Type9411@reddit

its a million dollars for 4 of the 200s with the module to slot them into, like 998k

[-]

tenkawa7@reddit

More

[-]

Mediaright@reddit

Lord, show me how to say no to this.

[-]

SanTrades@reddit

keep adding bois, VRAM to the moon!

[-]

Sofakingwetoddead@reddit

RTX 6000 Pro and yes. If I had more vram I would run GLM 5.1

[-]

jikilan_@reddit

After 48, u would want to have 72 or 96.

72 for context, 96 for TP=4. Then 96 is not big enough for 120b at q8.

I use qwen and Gemma.

[-]

kevin_1994@reddit

I'm 4090 + 3090. Running qwen 3.6 27b q8 with 156k q8_0 kv at about 1200 pp/s and 50 tg/s (with speculative decoding n=3)

Works great with opencode and for anything really

[-]

triynizzles1@reddit

With 48gb vram i can run most 30b and below at max context length. My opinion, 32 to 48 isnt huge because the models you have access to are the same but you can run higher quant or longer context. If you have another 64 gb of system memory, you can run 120b models on both cards at decent speeds.

[-]

fallingdowndizzyvr@reddit

I have 2x7900xtxi. That gives me 48GB. I don't use that because I also have Strix Halo which gives me 128GB. 48GB is not enough.

[-]

sheetis@reddit

I've been pretty happy with Qwen3.6-27B:Q8_0 myself with the recent MTP improvements. After regularly seeing 70 tok/sec+, I don't think I could go back to a slower speed.

This is especially true since I can't think of a model that would fit on that 128GB that wouldn't on the 48 (other than maybe BF16 of the dense Qwen 3.6).

For me after this model, I'm looking at wanting 512 GB to fit things that I would be confident were upgrades.

If I had both, I'd probably still have the primary driving model running on the faster hardware with a subagent configured for the more memory for specific tasks.

[-]

appakaradi@reddit

Qwen 3.7 27B AWQ on vLLM.

[-]

jonahbenton@reddit

Qwen 3.6 27b 8bit quant under opencode works very well for technical tasks/programming. Opencode itself does well up to 160k context or so, then falls over. Post compaction work is not as high quality.

[-]

ThePixelHunter@reddit

Does compaction in OpenCode just chop out the middle of context? Or is it like Kilo Code, prompting the model to summarize the session and then starting fresh from there?

[-]

jonahbenton@reddit

I'm not sure. It produces a summary that is decent but concise, and context only shrinks by half or so, so I think it might do both- chop out a whole bunch and then append the summary. It does not work that well for me. Just seeding a new session with the summary is ok but not really useful enough. Building up a new session from scratch seems to be the way (in my work).

[-]

Accomplished_Ad_4604@reddit

my 48gb vram dual rtx 3090's is turned off :/ i need the smartest models and fastest models and beside having enough vram to run kimi k2.6 i don't think we even close

[-]

MindRuin@reddit

GPUs: 2× RTX 3090 (24 GB each) — open-air rig, Ryzen 9 5900X, 128 GB DDR4, 1600 W PSU

VRAM extension: GreenBoost — ~48 GB GDDR6X + ~96 GB DDR4 tier + NVMe spill → ~144 GB effective for weights/KV (MoE-friendly: cold experts sit in T2)

Fleet: same Tailscale mesh — two always-on NUCs (16 GB each): one for warm memory (FastAPI + pgvector/Surreal), one for voice (Kokoro CPU TTS + embeddings). primary rig does the heavy lifting.

What I actually run (flagship)

Qwen3.5-122B-A10B — MoE (~10B active / 122B total), Q5_K_M, MTP speculative decoding llama.cpp fork (llama-server), tensor split 0.5 / 0.5 across both cards, ngl 20, KV q8_0, 16k ctx, MTP draft depth 3 Live numbers (desktop still on): TTFT ~1.66 s, sustained ~8 tok/s, MTP acceptance ~47–50% Side lane when GPUs are free: Qwen 3.6 27B AWQ on vLLM, tensor parallel both 3090s → ~92 tok/s

[-]

Badger-Purple@reddit

I just tried 122B in the strix halo with MTP, and it was going at 20+ tps beyond 16K

[-]

MindRuin@reddit

Nice! That's exciting. I've only really benchmarked so far. Have you been using it for actual text-inference? I'm wondering how big of a leap it is in terms of performance in comparison to the more common family of params lately which tends to stop at around 31b. I know things have gotten more efficient and thus smaller in size, but I'd still hope that the larger models hold a major advantage due to sheer density. And at this level of parameters, it's not even considered consumer anymore.

[-]

6OMPH@reddit

gemma4 and qwen (i forget which one, i tend to use gemma4 for regular stuff and qwen for coding)

[-]

mr_kandy@reddit

You can have 128Gb on DGX Spark and find that vram is not everything 😄

[-]

Maleficent-Ad5999@reddit

What gpu are you using now and what are you getting?

[-]

BuilderUnhappy7785@reddit

Rtx 8000 is probably the best value for 48gb imo

[-]

eleqtriq@reddit

27b as the planner with Qwen3 Coder Next as the muscle.

[-]

PassengerPigeon343@reddit

Currently Qwen 3.6 35B with a whisper STT model also running in memory. Plenty of space for both and performance is great.

I still need to play with Qwen 3.6 27B with MTP and give the Gemma 4 models another try. I was still having issues with Gemma on the latest batch and llama.cpp updates and Qwen worked flawlessly so I stuck with it.

[-]

Thrumpwart@reddit

Qwen 3.6 27B RYS-XL. Anything else would be uncivilized.

[-]

abnormal_human@reddit

Qwen 3.6 27B in Q8 is my workhorse for 48GB situations currently. I’ve pushed billions of tokens through it doing batch processing.

If I need more speed and task is less demanding, 35B A3B is good too.

Nothing wrong with the Gemma’s they are great too but I generally have been developing agent flows against the Qwen family and little differences need to be evaled / fixed before it will be as productive.

[-]

IgnisIason@reddit

Personally Gemma 4

[-]

soferet@reddit

Gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking, q8, with KV cache at f16. If I could run 96gb I totally would, but my server won't support the RTX 6000 Blackwell (I have the Ada). When it's time to upgrade the server, I'll upgrade VRAM too. But I adore Gemma-4, so I'll stick with that. Also looking to add Qwen2-audio to process audio tokens.

[-]

ansmo@reddit

Still Qwen 27b, just with larger context and/or higher quant.

[-]

Ell2509@reddit

I drive a VW for my daily driver. And at the weekends, the same VW.

Happy Hepatitis Testing Day! (Really, Google it!)

[-]

CrookedCasts@reddit

How is 48gb for non coding? Particularly voice and document processing workflows?

[-]

silenceimpaired@reddit

4bit 70b models, 8bit 30b models… everyone always wants more VRAM… especially with MoEs. I think 48gb is a good stopping point.

[-]

nizus1@reddit

gemma-4-31B-it-uncensored-heretic-GGUF q4 is the best now but I run into limits on context length so I also use Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive q4.
Previous favorite was the 102B parameter ggml-c4ai-command-r-plus-iq2_m
It's a couple years old now but still so smart. Just doesn't do the new agentic work flow stuff well

[-]

SSSHash@reddit

interesting

[-]

eddietheengineer@reddit

Club-3090 dual! https://github.com/noonghunna/club-3090/blob/master/docs/DUAL_CARD.md I hadn’t gotten results that were usable until I switched to that. Game changer!