TheaterFire

Joined the 48GB VRAM Dual Hairdryer club. Frankly a bit of a disappointment: deepseek-r1:70b works fine, but qwen2.5:72b still seems to be too big. The 32b models apparently provide almost the same code quality, and for general questions the online big LLMs are better. Meh.

Posted by ChopSticksPlease@reddit | LocalLLaMA | View on Reddit | 117 comments

custodiam99@reddit

Well you can run Qwen 2 72b q\_5 on RTX 3060 12GB and 48GB DDR5 RAM! lol If you have time...but it IS cheap.
View on Reddit #50217113

ChopSticksPlease@reddit (OP)

Yep, I tested bigger models on a single RTX 3090 and was getting around 1 tps. So while it can be run, I can't wait an hour or two to get a full response :)
View on Reddit #50267661

custodiam99@reddit

It depends on the task. Rewriting, grammar, translation, basic text analysis works with 9b models. Summary and deeper analysis works with Qwen 32b and Mistral 24b models. Reasoning works with Fuse01 32b q\_8, but it is better with R1 Llama 3 70b q\_5. For nuanced questions I use Dracarys2 72b q\_5, but it is quite rare. So there can be a fully functioning LLM ecosystem on a very "weak" home PC.
View on Reddit #50273994

MachineZer0@reddit

How much time are we talking here? I’m getting 17 tok/s on dual RTX 3090 on exl2 4.65bpw. Fits with 10k context.
View on Reddit #50218909

custodiam99@reddit

16k context and 1.1 t/s. It IS useable.
View on Reddit #50219290

MachineZer0@reddit

Damn, that’s like DeepSeek R1 671B Q4 running on my quad E7-8893v4 with 576gb RAM and 6x Titan V. 40mins of inference at 1.2 tok/s pulling 700-800w.
View on Reddit #50232073
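
For a sense of scale, the figures quoted above (1.2 tok/s, 40 minutes, 700-800 W) can be turned into a token count and an energy estimate. This is only a rough sketch: the 750 W value is an assumed midpoint of the quoted range, not a measured draw.

```python
def generation_stats(tokens_per_second, minutes, watts):
    """Return (tokens generated, energy in kWh) for one inference run."""
    seconds = minutes * 60
    tokens = tokens_per_second * seconds
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return tokens, kwh


# 750 W is an assumption: the midpoint of the 700-800 W quoted in the thread.
tokens, kwh = generation_stats(tokens_per_second=1.2, minutes=40, watts=750)
print(f"{tokens:.0f} tokens, {kwh:.2f} kWh")  # 2880 tokens, 0.50 kWh
```

So a 40-minute run at that speed is a bit under 3,000 tokens for roughly half a kilowatt-hour.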

custodiam99@reddit

Nice!
View on Reddit #50239030

frivolousfidget@reddit

Online inference will always (with current tech and prices) be better, cheaper and faster. That is not exactly the point here.
View on Reddit #49571633

davew111@reddit

Depends on the GPU and model size. I transcribe and summarize phone calls on my PC at work with an nvidia A2. It's the cheapest even with our energy prices:

- local: $0.02 / hr
- remote API (Groq): $0.11 / hr
- remote hardware (Runpod): $0.43 / hr
View on Reddit #49577195
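
A local figure around $0.02 / hr is plausible as pure electricity cost. The sketch below assumes a 60 W board power for the A2 and a hypothetical price of $0.30 per kWh; neither number comes from the thread, so swap in your own.

```python
def electricity_cost_per_hour(watts, price_per_kwh):
    """Cost of running a device at a constant power draw for one hour."""
    return watts / 1000 * price_per_kwh


# Assumptions, not from the thread: 60 W draw, $0.30 per kWh.
local = electricity_cost_per_hour(watts=60, price_per_kwh=0.30)
print(f"local: ${local:.3f}/hr")

# Hourly prices quoted in the thread, relative to the quoted local cost.
quoted = {"local": 0.02, "Groq API": 0.11, "Runpod": 0.43}
ratios = {name: price / quoted["local"] for name, price in quoted.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x local")
```

This ignores hardware purchase cost, which matters less if the machine is already there for other work.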

slavik-f@reddit

\> local: $0.02 / hr

Who are you paying for local?
View on Reddit #49596526

perelmanych@reddit

As I understand this is electricity cost.
View on Reddit #50033734

davew111@reddit

Nobody, it's just a Python script.
View on Reddit #49603067

frivolousfidget@reddit

Same model?
View on Reddit #49578300

davew111@reddit

Yes, Whisper-large-v3
View on Reddit #49603291

FullOf_Bad_Ideas@reddit

If you can do a workload in batches, local can come out cheaper. What model do you use for transcription? Can it do batches of smaller requests? Do you do summaries with batching?
View on Reddit #49600927

davew111@reddit

Whisper for transcription, Mistral Nemo to summarize. I transcribe at the end of the day. I could run it every hour since it's just a python script and typically takes less than an hour to process a day's worth.
View on Reddit #49602908

Pedalnomica@reddit

\*almost always. There's weird edge cases... If you're batching a ton on 3090s, you'll come out ahead.
View on Reddit #49574047

frivolousfidget@reddit

Considering cost, electricity and a model of similar size and performance online?
View on Reddit #49574230

ReadyAndSalted@reddit

Depreciation? Of a GPU? Considering how few consumer GPUs are manufactured nowadays, if anything they're an appreciating asset/investment lmao.
View on Reddit #49912659

psilent@reddit

Yeah, especially since a lot of models are available for free online. Openrouter has a whole list of models hosted in various places, freely available as APIs: Llama 3.3 70b, Deepseek r1, Google's experimental ones including Gemini pro 2. Some are rate limited but not the smaller models, which are still bigger and faster than what you can jam into 48GB VRAM.
View on Reddit #49581533

FullOf_Bad_Ideas@reddit

Are there any small llm's that are hosted for free without rate limits? Think 200 concurrent requests with total generation throughput with at least 3000 t/s that I can use 24/7?
View on Reddit #49601418

psilent@reddit

I’m not totally sure what the rate limits are, but check out Openrouter; they might be able to provide details on the specifics. My guess is unless it’s really small there’s going to be some kind of cost for that heavy workload.
View on Reddit #49604306

FullOf_Bad_Ideas@reddit

I had a workload like this when I was making finetunes as personal experimentation. Local GPU was very cost effective, I think I processed 8B+ input tokens with 500M output tokens in like 40-80 GPU hours, my memory is pretty hazy on it but that's the scale. 7B model.
View on Reddit #49606140
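
The token counts and GPU hours recalled above imply a sustained throughput that can be worked out directly. A quick sketch using only the (admittedly hazy) numbers from the comment:

```python
def throughput(tokens, gpu_hours):
    """Average tokens per second over a given number of GPU hours."""
    return tokens / (gpu_hours * 3600)


INPUT_TOKENS = 8e9     # "8B+ input tokens"
OUTPUT_TOKENS = 500e6  # "500M output tokens"

for hours in (40, 80):  # the quoted 40-80 GPU hour range
    prefill = throughput(INPUT_TOKENS, hours)
    decode = throughput(OUTPUT_TOKENS, hours)
    print(f"{hours} h: ~{prefill:,.0f} input t/s, ~{decode:,.0f} output t/s")
```

That works out to tens of thousands of input tokens per second and a few thousand output tokens per second, which is only reachable with heavy batching.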

FullOf_Bad_Ideas@reddit

Yes. You can get faster&cheaper because apis don't have prefix caching. In extreme cases, 3090 can do prefill at 80000 t/s on 7B INT8 model.
View on Reddit #49601147

L3Niflheim@reddit

Most likely a proper mechanic can do a better job with that old motorbike in your garage. Not as fun though!
View on Reddit #49597617

AppearanceHeavy6724@reddit

> and for general questions the online big LLMs are better. Meh

What did you expect (shrug)? Qwen2.5:72b should work absolutely fine on 48GB, as it is only 1 GiB bigger at Q4 than 70b. Qwen coder 32b is going to be better than 72b at some tasks, as 72b is general purpose and 32b is a coder.
View on Reddit #49570150
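
The "only about 1 GiB bigger" claim can be sanity-checked with a back-of-the-envelope estimate of weight size from parameter count and bits per weight. The parameter counts (72.7B and 70.6B) and the roughly 4.85 bits per weight for a Q4_K_M quant are assumptions used for illustration; real GGUF files can come out larger, and the thread itself reports 47.4 GB for one Qwen2.5 72B Q4_K_M file.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


Q4_K_M_BPW = 4.85  # assumed average bits per weight, approximate

qwen_72b = weight_size_gb(72.7, Q4_K_M_BPW)   # assumed parameter count
llama_70b = weight_size_gb(70.6, Q4_K_M_BPW)  # assumed parameter count
diff = qwen_72b - llama_70b

print(f"72b: {qwen_72b:.1f} GB, 70b: {llama_70b:.1f} GB, diff: {diff:.2f} GB")
```

The estimate covers weights only; KV cache and runtime buffers come on top, which is where context length starts to matter.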

ChopSticksPlease@reddit (OP)

Weird, ollama ps says: qwen2.5:72b 424bad2cc13f 54 GB 9%/91% CPU/GPU, and I think context length etc. is set to default :S
View on Reddit #49570309

kovnev@reddit

You've got 48gb of vram and you're downloading models in that way? Bro... Get your VRAM to tell you about quantization.
View on Reddit #49572597

mzinz@reddit

Can you explain more on this? What’s wrong with this approach?
View on Reddit #49582666

kovnev@reddit

You appear to be downloading non-quantized versions. Are you using the ollama site? Don't. Get on huggingface and look up quantized versions of the models you want.

Use their Open LLM Leaderboard to scope out models here - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

Stick to 'Only Official Providers' until you know what you're doing, as a lot of the custom tuned models are centered entirely around achieving the best benchmark scores, and make sacrifices in other areas.

You can get Q4 versions that are like 1/8th the size of the full f32, and sacrifice little accuracy. Or a Q6 is I think 1-3% precision loss, or something like that, and they're **5x smaller**. The upshot is you can fit much better models entirely in vram. With your 48gb, I'd be trying out quants of 70-72B models. You can even quantize your context, at basically no precision loss, but that's a different topic.

To download quantized models on huggingface, find the model you want, and go to its page. Click 'Quantizations' (on the right). Sort those by most downloaded. Click the top downloaded GGUF version. You can click on the stuff on the right like Q4_K_M, or Q6_K, and you'll get a tab pop out with the size. When you find the size you like, click 'Use This Model' at the top of the popout, then click the Ollama button. Copy the link, open a command prompt and paste it in. It'll download it for you, and then run it in the command prompt window.

Here's a link for a Qwen2.5-32B, which is one of the top performing models. This one is at Q6_K. Paste it into command prompt if you wanna give it a go:

ollama run hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Q6_K

Hope this helps, and wish this was the first thing I read 😆.
View on Reddit #49787655
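
The size ratios mentioned above depend on what you compare against. The sketch below uses approximate average bits per weight for a few GGUF quant types (assumed values; the exact figure varies by model) and compares them with both f32 and the f16/bf16 format that most models are actually published in.

```python
# Approximate average bits per weight; assumed values for illustration.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def shrink_factor(quant, baseline_bits):
    """How many times smaller a quant is than an unquantized baseline."""
    return baseline_bits / BITS_PER_WEIGHT[quant]


for quant in BITS_PER_WEIGHT:
    vs_f32 = shrink_factor(quant, 32)
    vs_f16 = shrink_factor(quant, 16)
    print(f"{quant}: {vs_f32:.1f}x smaller than f32, {vs_f16:.1f}x smaller than f16")
```

Under these assumptions a Q6_K is indeed close to 5x smaller than f32, while a Q4_K_M lands nearer 1/6.6 than 1/8 because the K-quants store scales and keep some tensors at higher precision.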

mzinz@reddit

Good info here! Can’t you also download Quantized models through ollama by clicking on Tags to view all model versions?
View on Reddit #49795340

YouDontSeemRight@reddit

Yes you can. He gives good info but ignore the comment about Ollama. Ollama is a great tool to manage LLMs. Sometimes we don't need all the control in the world or want anything more than what's needed. Sure, it would be nice if they had speculative decoding, but I'm patient. My vLLM docker image has its place as well.
View on Reddit #49805178

AttentionIsAllYouGet@reddit

Ollama is just a lobotomized wrapper around llamacpp. It saves you the hassle of setting up a llamacpp server (which isn't even hard to begin with), but comes with basically no customizations that are possible in llamacpp. If you got the 2x3090 setup - might as well spend an extra hour learning to roll your own llamacpp server. Moreover, llamacpp is not even a competitive option for inference on GPU, better off using vLLM.
View on Reddit #49600650

mzinz@reddit

So the TLDR is that llamacpp and vLLM are more performant and customizable than ollama?
View on Reddit #49641284

YouDontSeemRight@reddit

Does llama.cpp facilitate the transfer of one LLM to another? Where Ollama shines is ease of use, control, and rapid swapping of models. If I make a vLLM server it's running one model for the duration of the task because it took ten minutes to start up. If I'm running Ollama, I can switch between multiple models depending on the task at hand. It's definitely not as flexible. Just a heads up though, the Ollama speculative decoding MR was stashed because they were writing their own inference engine to decouple from llama.cpp.
View on Reddit #49612536

AppearanceHeavy6724@reddit

Because you loaded Q6 quant of model. You need to run IQ4 quant.
View on Reddit #49572196

Dry-Judgment4242@reddit

With exl2 at 4.25bpw you should be able to fit up to 65k context length for Qwen2.5 72b if you run Q4.
View on Reddit #49570552

mgr2019x@reddit

Use KV cache quantisation: to q8_0 or q4_0 with llama.cpp, or to Q4, Q6, Q8 with exllama v2.
View on Reddit #49574910
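
To see why cache quantisation matters at long context, here is a rough KV cache size estimate. The model shape (80 layers, 8 KV heads, head dimension 128) is what I understand Qwen2.5 72B to use, so treat it as an assumption; the 8.5 and 4.5 bits per element for q8_0 and q4_0 follow from their block layout (32 values plus one 16-bit scale).

```python
def kv_cache_gib(context_tokens, bits_per_element=16,
                 layers=80, kv_heads=8, head_dim=128):
    """Approximate KV cache size in GiB for one sequence."""
    # One K and one V vector per layer, per KV head, per token.
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / 2**30


CONTEXT = 32 * 1024  # 32k tokens

for name, bits in (("f16", 16), ("q8_0", 8.5), ("q4_0", 4.5)):
    print(f"{name}: {kv_cache_gib(CONTEXT, bits):.2f} GiB")
```

With these assumptions a 32k context costs about 10 GiB at f16, a little over 5 GiB at q8_0 and under 3 GiB at q4_0, which is the difference between fitting and spilling on a 48GB setup.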

MoffKalast@reddit

Qwen dies with cache quants though, it may fit but the performance will suffer.
View on Reddit #49585160

fizzy1242@reddit

8 bit kv cache is still solid from my experience. It really takes a hit at 4 bits.
View on Reddit #49752494

mgr2019x@reddit

In my experience there are no such issues...
View on Reddit #49598216

cm8t@reddit

It’s the context window that pushes it over the limit.
View on Reddit #49590311

AppearanceHeavy6724@reddit

yes, this is also true.
View on Reddit #49590638

a_beautiful_rhind@reddit

One does not run local models to ask the capital of france.
View on Reddit #49572457

InterstellarReddit@reddit

Then what are we supposed to do if not ask the capital of France ?
View on Reddit #49574950

a_beautiful_rhind@reddit

Ask for the capital of your pants.
View on Reddit #49578675

Vivarevo@reddit

Recommendations for best model for this? Asking for a friend.
View on Reddit #49744065

a_beautiful_rhind@reddit

Currently I went back to evathene. I also like monstral v2. The llama tunes keep breaking after 6-8k context for some reason. They're good at the start and then get to looping and alliteration. Kind of kills the oomdmay. Mistral never breaks and qwen somewhere in the 18-20k range. When I get my 4th 3090 back I'm trying wizard again at higher BPW, assuming it works. Don't think I gave that one enough of a chance.
View on Reddit #49771762

ModPiracy_Fantoski@reddit

Eiffel Tower's location.
View on Reddit #49580522

martinerous@reddit

Ask it for financial capital (because you must be out of it after buying expensive GPUs).
View on Reddit #49589980

AnhedoniaJack@reddit

Ask it to show you a cool math proof.
View on Reddit #49583090

eidrag@reddit

lower bracket france
View on Reddit #49575091

Radiant_Dog1937@reddit

Wait is that everyone's default test question?
View on Reddit #49589503

L3Niflheim@reddit

write me a poem about cheese
View on Reddit #49597457

SanFranPanManStand@reddit

uh... is it just porn, or are folks doing something more interesting?
View on Reddit #49587342

SteveRD1@reddit

Dear GPT, please tell me the name of the descendant of Conrad Hilton famed for her s\*x tape!
View on Reddit #49587019

Evening_Ad6637@reddit

XD
View on Reddit #49574104

Hex30_03@reddit

You just need to implement a basic RAG/online search on top of running the model and it will be very good for general questions. I don't know if it will be as good as big consumer LLMs like ChatGPT, but for sure it will be better than it is now.
View on Reddit #49572112
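
As a very rough idea of what "a basic RAG" means, the toy sketch below retrieves the most relevant snippet by word overlap and pastes it into the prompt. Everything here is hypothetical and stdlib only: the function names, the sample documents and the prompt wording are made up for illustration, and a real setup would use embeddings and an actual search or document index.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Prepend the retrieved context to the question for the local model."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "The RTX 3090 has 24 GB of GDDR6X memory.",
    "Whisper is a speech recognition model.",
    "Paris is the capital of France.",
]
print(build_prompt("What is the capital of France?", docs))
```

The resulting prompt is what you would send to the local model; the model then only has to read the answer out of the context instead of recalling it from its weights.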

YouDontSeemRight@reddit

What RAG and online search do you have set up?
View on Reddit #49593350

someonesmall@reddit

you could use openwebui
View on Reddit #49739479

SteveRD1@reddit

Dumb question perhaps...why can't we just download a model that includes this? Is it something we absolutely have to do ourselves after downloading?
View on Reddit #49587103

OmarBessa@reddit

Just focus on the illegal questions. It's your new safe space.
View on Reddit #49724619

ThenExtension9196@reddit

Why do they report as 3090? My modded 4090 doesn’t do that.
View on Reddit #49644373

ChopSticksPlease@reddit (OP)

these are regular 3090
View on Reddit #49653400

ThenExtension9196@reddit

Ah okay
View on Reddit #49696193

13henday@reddit

I was in a similar spot when I first got my second 3090. However, since then I’ve gotten a real handle on RAG and fine tuning and that’s where the dividends were. 32b q6 coder with 64k context is enough to reliably RAG dependencies and have the entirety of my working file in context. I work with proprietary Fortran based languages that tend to break most base AI coders. 72b vl with large context can reliably cross reference and collect info from multiple images. 72b fine tuned on engineering reports writes and summarizes really well.
View on Reddit #49641567

waiting_for_zban@reddit

> I was in a similar spot when I first got my second 3090. However, since then I’ve gotten a real handle on rag and fine tuning and that’s where the dividends were. 32b q6 coder with 64k context is enough to reliably rag dependencies and have the entirety of my working file in context. I work with proprietary Fortran based languages that tend to break most base AI coders. 72b vl with large context can reliably cross reference and collect info from multiple images. 72b fine tuned on engineering reports writes and summaries really well.

I think the community would find this really valuable, if you do a write up of your approach!
View on Reddit #49660241

manzked@reddit

Did you try vLLM with quants? There you can disable CUDA graph capture and reduce the max model len (context window). With this you can squeeze it onto your cards, given that 70b was running and 72b was not.
View on Reddit #49655448
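
The vLLM settings mentioned above map onto a handful of server flags. The sketch below only assembles the command line so it can be inspected before running it; the model name and the 8192 token limit are placeholder assumptions, and whether a given quant actually fits in 2x24GB still has to be tested.

```python
import shlex


def vllm_serve_command(model, gpus=2, max_model_len=8192, eager=True):
    """Build a `vllm serve` command for a multi-GPU, memory-constrained setup."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(gpus),     # split each layer across GPUs
        "--max-model-len", str(max_model_len),   # cap the context window
    ]
    if eager:
        args.append("--enforce-eager")           # skip CUDA graph capture
    return args


# Model name is an assumed example of a pre-quantized checkpoint.
cmd = vllm_serve_command("Qwen/Qwen2.5-72B-Instruct-AWQ")
print(shlex.join(cmd))
```

Skipping CUDA graphs saves some VRAM at the cost of speed, and a smaller max model length shrinks the KV cache that vLLM reserves up front.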

MachineZer0@reddit

For a second I thought these were the 48gb 4090s. Inquiring minds want to know about the performance of those.
View on Reddit #49574871

slavik-f@reddit

Yep. I ordered such last week. Can't wait to try it... [https://www.c2-computer.com/products/new-parallel-nvidia-rtx-4090d-48gb-gddr6-256-bit-gpu-blower-edition](https://www.c2-computer.com/products/new-parallel-nvidia-rtx-4090d-48gb-gddr6-256-bit-gpu-blower-edition)
View on Reddit #49596831

manzked@reddit

Crazy … modified rtx cards… https://www.igorslab.de/en/rtx-4090d-with-48-gb-and-rtx-4080-with-32-gb-chinese-ki-companies-rely-on-more-vram/
View on Reddit #49655366

waiting_for_zban@reddit

Is this a trusted website? Never heard of it. And they don't seem to accept paypal. Also what's the catch, good-luck-with-the-warranty I guess?
View on Reddit #49606671

eidrag@reddit

or where to buy them
View on Reddit #49575116

countjj@reddit

Sorry what cards are those?
View on Reddit #49647323

ChopSticksPlease@reddit (OP)

Afox RTX 3090 turbo fan AF3090-24GD6H7
View on Reddit #49653374

xanduonc@reddit

48gb is optimal for 32b models at high quants with decent context sizes though.
View on Reddit #49629822

ChopSticksPlease@reddit (OP)

Well, I'm not saying local LLMs are meh, it's just that to me, after trying it, the best value seems to be a single 24GB GPU like RTX 3090 / 4090 / A5000 to run coding 32b LLMs locally and integrate the online monsters with Open Web UI. The second GPU simply didn't add much value, so in that regard it is a bit of a disappointment.
View on Reddit #49575629

SteveRD1@reddit

I think the 5090 is better value for a cheapish card...if you can wait six months. That extra bandwidth plus RAM is nice.
View on Reddit #49587266

I_AM_BUDE@reddit

If it doesn't light itself on fire or come with 8 disabled ROPs lul.
View on Reddit #49599538

SteveRD1@reddit

I can live without the ROPs, but would prefer my house to remain intact it's true!!:)
View on Reddit #49626657

caetydid@reddit

This is interesting to hear since I was already suspecting the same. A model with twice the param count simply does not double the utility. However, this might shift with new models which push the entry barrier. How about deepseek-r1? I believe you could at least run the q4 1776 from perplexity at decent speeds - for me it is 1 tok/sec with one rtx3090.
View on Reddit #49605690

ChopSticksPlease@reddit (OP)

Regular deepseek-r1:70b ran at something around 16 tps, which imho is really usable. The GPU utilization was around 50%, I guess PCIe 3.0 is the bottleneck in my system.
View on Reddit #49611920
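
One way to judge whether 16 tps is "good" is to compare it with a memory-bandwidth ceiling: for single-stream generation every weight is read once per token. The sketch assumes the roughly 43 GB model size quoted elsewhere in the thread and the 3090's 936 GB/s memory bandwidth, and it assumes the two cards process their layers one after the other (layer split), so their bandwidth does not add up.

```python
def max_tokens_per_second(weights_gb, bandwidth_gb_s):
    """Upper bound on decode speed when every weight is read once per token."""
    return bandwidth_gb_s / weights_gb


# Assumptions: ~43 GB of weights, 936 GB/s per RTX 3090, GPUs working in turn.
ceiling = max_tokens_per_second(weights_gb=43, bandwidth_gb_s=936)
observed = 16
print(f"ceiling ~{ceiling:.1f} t/s, observed {observed} t/s "
      f"({observed / ceiling:.0%} of ceiling)")
```

Under those assumptions 16 tps is already around three quarters of the ceiling, and roughly 50% utilization per card would be consistent with the GPUs simply taking turns rather than with a PCIe bottleneck.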

ParaboloidalCrest@reddit

Exactly! The returns diminish quickly beyond 32B.
View on Reddit #49594084

silenceimpaired@reddit

I don’t know if I can agree. 4bit 70b/72b models on 48gb is a valuable option not easily available to 24gb. Anything below 4bit has serious performance impacts… also it feels like you can run 5bit quite a bit faster on 48gb than on 24gb… maybe I’m misremembering that… and 5bit performance-wise is pretty close to 8bit.
View on Reddit #49598887

ParaboloidalCrest@reddit

70b models require double the investment for how much increased intelligence exactly? Can you quantify it?
View on Reddit #49601729

skrshawk@reddit

For writing use-cases 70b vs 32b is the difference between a model that can consistently keep multiple characters straight, as well as keep thoughts to themselves, and knowing who saw what happen between scenes. At least in my experience. I don't consider anything under 70b these days.
View on Reddit #49622743

silenceimpaired@reddit

Nah, it’s subjective all the way down. How do you hold a moonbeam in your hand? I can just say I spent $700 to move to 48gb and I haven’t regretted it. I’ve felt the pull to move higher… with Digits, but I’m content enough. It feels like smaller models cannot create large paragraphs that hold together as well as larger models do, and larger models have a larger space to statistically guess what should happen. As an example: a person is pushed out of a window on the seventh floor… (Small model) person climbs to their feet and yells up, "I’m going to call the police." (Large model) person has a pool of blood around them. The small model gets that pushing someone out the window is not legal or good, but misses the fact that the person fell seven stories.
View on Reddit #49603915

FullOf_Bad_Ideas@reddit

Umbrella runs llama 3 70b INT4 at 6 t/s on single rtx 4060 TI/3090. At low context. It's something.
View on Reddit #49602201

silenceimpaired@reddit

What?! What is this umbrella? … I want to run 8bit at 6 t/s.
View on Reddit #49604134

FullOf_Bad_Ideas@reddit

https://github.com/Infini-AI-Lab/UMbreLLa I don't think it supports multi-GPU, or 8bit
View on Reddit #49605960

silenceimpaired@reddit

Tragic. But thanks.
View on Reddit #49615359

synn89@reddit

You shouldn't have any problem running a 72b at 32k context. Use a 4.5bpw EXL2 quant with q4 cache. I'd recommend against ollama and using something like Text Gen Web UI for more control over the quant size and cache setup.
View on Reddit #49584314

tmvr@reddit

Theoretically it should fit, but the Qwen2.5 72B GGUFs on HF are larger somehow. The Q4\_K\_M from Bartowski (and LM Studio as they repost his ones) is 47GB+ and the one from Qwen is 44GB+ which is a problem with 48GB VRAM.
View on Reddit #49590116

skrshawk@reddit

Difference between EXL2 and GGUF. Also, running TabbyAPI with tensor parallel is a massive performance gain over anything LCPP.
View on Reddit #49622621

Papabear3339@reddit

Local really shines when fine tuned on company data. Company documents, contracts, metadata, database samples, etc. This isn't a world engine, it is a small but powerful agent that can be tuned to do a specific set of tasks very very well. Think of it in that context and you will understand why local can be so amazing when done right.
View on Reddit #49617595

GodComplecs@reddit

I plan on running multiple LLMs, VLMs etc on my 2x3090, already do it with my single. That is really optimal if you do some real work with them. Oh and you can serve them to small businesses if you need. I get 60tks+ on my single already so...
View on Reddit #49602577

_wOvAN_@reddit

https://preview.redd.it/mbn9kermhale1.png?width=1282&format=png&auto=webp&s=2700934d67f1f8e60bc3ea08aa3663a2282ef996
View on Reddit #49579876

ChopSticksPlease@reddit (OP)

Nice, but clearly the GPU pool is underutilized. So whats the point? Heating? ;)
View on Reddit #49580284

_wOvAN_@reddit

small model now running, can run 70b fp16, but still not enough to run big models or big contexts
View on Reddit #49581242

danielv123@reddit

Are those on 1x mining risers?
View on Reddit #49581993

_wOvAN_@reddit

yep, miner board, pcie 2, 1x
View on Reddit #49582262

DefNattyBoii@reddit

Does the PCIe bandwidth influence speed or not?
View on Reddit #49598144

Awwtifishal@reddit

In layer split mode it should not, since only a small amount of data is being transferred.
View on Reddit #49600978
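
The "small amount of data" in layer split mode can be estimated: during generation only the hidden state of the current token crosses each GPU boundary. The sketch assumes a hidden size of 8192 (typical for 70B class models), 16-bit activations and about 500 MB/s for a PCIe 2.0 x1 link; all three are assumptions for illustration.

```python
def layer_split_traffic(hidden_size, tokens_per_second,
                        boundaries=1, bytes_per_value=2):
    """Bytes per second crossing GPU boundaries during token generation."""
    return hidden_size * bytes_per_value * boundaries * tokens_per_second


PCIE2_X1_BYTES_PER_S = 500e6  # assumed theoretical bandwidth of a PCIe 2.0 x1 link

# One boundary between two GPUs, 16 tokens per second.
traffic = layer_split_traffic(hidden_size=8192, tokens_per_second=16)
share = traffic / PCIE2_X1_BYTES_PER_S
print(f"{traffic / 1e6:.2f} MB/s, {share:.3%} of a PCIe 2.0 x1 link")
```

That is a tiny fraction of even a x1 riser for generation. Prompt processing pushes one hidden state per prompt token, and loading the model over x1 is slow, so those are the places where narrow links hurt.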

No_Afternoon_4260@reddit

The question is where did you get these?
View on Reddit #49600882

JustinPooDough@reddit

This is why I won't bother buying more than my 3090. I use it for prototyping locally, running basic stuff, then outsource complex stuff to cheap cloud services. Even Claude Sonnet with token caching and effective context management is really affordable. If I feel like being even cheaper, Gemini Flash 2.0 and Thinking 2.0 are both basically free, and - IMO - excellent models.
View on Reddit #49576436

L3Niflheim@reddit

But you could have ***two*** 3090s. Double the fun right?
View on Reddit #49597719

koalfied-coder@reddit

I mean yes one really needs 96gb for 70b 3.3 8bit which is nice.
View on Reddit #49587566

Such_Advantage_6949@reddit

Your setup is meh, i have 4x4090/3090 and i still find it meh. What do u expect from model that is one tenth of the size of model from the big guy? Have hardware that can run the full size deep seek 671b then you can tell me if it is still meh
View on Reddit #49574621

ChopSticksPlease@reddit (OP)

Nope, I was just curious whether the second GPU was going to add much value, and it clearly didn't. Happy with a single RTX 3090 to run coding LLMs.
View on Reddit #49575694

thesuperbob@reddit

I dunno, I'm in a similar situation, the second 3090 allows me to run Deepseek R1 distills smoothly, and do everything I did before but better, faster, and with a larger context. I'd say it's a decent step up. Obviously it's nothing compared to online models, and once again there are superior models that seem to be just out of reach at 48GB VRAM...
View on Reddit #49583781

UsualResult@reddit

> and for general questions the online big LLMs are better. Meh

Might I recommend looking at benchmarks next time before you plunk down some money? There are ways to theorize about the experience without buying a bunch of GPUs.
View on Reddit #49582845

You_Wen_AzzHu@reddit

The only reason we use local LLM for work is because the data simply can't be sent to any vendor due to compliance.
View on Reddit #49581069

siegevjorn@reddit

Try vLLM with tensor parallelism, should be much faster than ollama.
View on Reddit #49577368

ElephantWithBlueEyes@reddit

Even those big cloud models are meh in some cases. Depends on what you want them to do. I got an RTX 4080 with 16GB before I knew I'd be into LLMs, and 16GB is just not enough (24GB is the way). Currently testing Open WebUI + any backend. I can squeeze in just qwen 2.5 coder 14b + gemma 2 9b for general questions to access them from my laptop. 48GB VRAM is pretty much fine for that kind of scenario when you want to use multiple models locally. Why locally? Because I'm digging through our team codebase (I'm QA), so I don't want code to go outside. And also because many cloud LLMs aren't available without VPN, except Deepseek.
View on Reddit #49576591

LienniTa@reddit

i actually used 3090 as a hairdryer back in 2022-23. Launched stable diffusion inference, apex legends, and in apex legends shooting range was throwing fire grenades and STARING at them. Gets hair dry pretty fast.
View on Reddit #49575750

_hypochonder_@reddit

Qwen2.5-72B-Instruct-Q4\_K\_M.gguf is 47.4 GB. deepseek-r1:70b is 43GB. You can try qwen2-72b-instruct-imat-IQ4\_XS.gguf (40GB) or qwen2-72b-instruct-imat-IQ4\_NL.gguf (41.3GB) instead.
View on Reddit #49571566

AppearanceHeavy6724@reddit

qwen2.5. not qwen2
View on Reddit #49572268