Joined the 48GB VRAM Dual Hairdryer club. Frankly, a bit of a disappointment: deepseek-r1:70b works fine, but qwen2.5:72b still seems to be too big. The 32b models apparently provide almost the same code quality, and for general questions the big online LLMs are better. Meh.


custodiam99@reddit

Well, you can run Qwen 2 72b q\_5 on an RTX 3060 12GB and 48GB of DDR5 RAM! lol If you have time... but it IS cheap.


ChopSticksPlease@reddit (OP)

Yep, I tested bigger models on a single RTX 3090 and was getting around \~1 tps. So while it can be run, I can't wait an hour or two to get a full response :)


[-]

It depends on the task. Rewriting, grammar, translation, basic text analysis works with 9b models. Summary and deeper analysis works with Qwen 32b and Mistral 24b models. Reasoning works with Fuse01 32b q\_8, but it is better with R1 Llama 3 70b q\_5. For nuanced questions I use Dracarys2 72b q\_5, but it is quite rare. So there can be a fully functioning LLM ecosystem on a very "weak" home PC.


MachineZer0@reddit

How much time are we talking here? I’m getting 17 tok/s on dual RTX 3090 on exl2 4.65bpw. Fits with 10k context.


custodiam99@reddit

16k context and 1.1 t/s. It IS useable.


MachineZer0@reddit

Damn, that’s like DeepSeek R1 671B Q4 running on my quad E7-8893v4 with 576gb RAM and 6x Titan V. 40mins of inference at 1.2 tok/s pulling 700-800w.


custodiam99@reddit

Nice!


frivolousfidget@reddit

Online inference will always (with current tech and prices) be better, cheaper and faster. That is not exactly the point here.


davew111@reddit

Depends on the GPU and model size. I transcribe and summarize phone calls on my PC at work with an NVIDIA A2. It's the cheapest, even with our energy prices:

- local: $0.02 / hr
- remote API (Groq): $0.11 / hr
- remote hardware (Runpod): $0.43 / hr


slavik-f@reddit

\> local: $0.02 / hr

Who are you paying for local?


perelmanych@reddit

As I understand this is electricity cost.
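If the local figure is electricity only, it comes down to power draw times tariff. A minimal sketch of that arithmetic follows; the 60 W draw and the $0.33/kWh price are made-up illustrative inputs, not numbers from this thread.

```python
def electricity_cost_per_hour(power_watts: float, price_per_kwh: float) -> float:
    """Cost of running a device for one hour at a constant power draw."""
    kilowatts = power_watts / 1000.0
    return kilowatts * price_per_kwh


# Hypothetical inputs: both values are assumptions chosen for illustration.
cost = electricity_cost_per_hour(power_watts=60, price_per_kwh=0.33)
print(f"${cost:.3f} per hour")
```

Swap in the measured wall power of the whole machine rather than the GPU's rated TDP for a more honest number.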


davew111@reddit

Nobody, it's just a Python script.


frivolousfidget@reddit

Same model?


davew111@reddit

Yes, Whisper-large-v3


FullOf_Bad_Ideas@reddit

If you can do a workload in batches, local can come out cheaper. What model do you use for transcription? Can it do batches of smaller requests? Do you do summaries with batching?


davew111@reddit

Whisper for transcription, Mistral Nemo to summarize. I transcribe at the end of the day. I could run it every hour since it's just a python script and typically takes less than an hour to process a day's worth.


Pedalnomica@reddit

\*almost always. There are weird edge cases... If you're batching a ton on 3090s, you'll come out ahead.


frivolousfidget@reddit

Considering cost, electricity and a model of similar size and performance online?


ReadyAndSalted@reddit

Depreciation? Of a GPU? Considering how few consumer GPUs are manufactured nowadays, if anything they're an appreciating asset/investment lmao.


psilent@reddit

Yeah, especially since a lot of models are available for free online. OpenRouter has a whole list of models hosted in various places, freely available as APIs: Llama 3.3 70b, DeepSeek R1, Google's experimental ones including Gemini Pro 2. Some are rate limited, but not the smaller models, which are still bigger and faster than what you can jam into 48GB of VRAM.


FullOf_Bad_Ideas@reddit

Are there any small LLMs that are hosted for free without rate limits? Think 200 concurrent requests with a total generation throughput of at least 3000 t/s that I can use 24/7?


psilent@reddit

I’m not totally sure what the rate limits are, but check out OpenRouter; they might be able to provide details on the specifics. My guess is that unless it’s really small, there’s going to be some kind of cost for that heavy a workload.


FullOf_Bad_Ideas@reddit

I had a workload like this when I was making finetunes as personal experimentation. Local GPU was very cost effective, I think I processed 8B+ input tokens with 500M output tokens in like 40-80 GPU hours, my memory is pretty hazy on it but that's the scale. 7B model.
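For scale, those hazy numbers can be turned into average throughput. This is only a back-of-the-envelope sketch using the token counts and the 40-80 GPU-hour range quoted above; it says nothing about peak rates or how prefill and decode were interleaved.

```python
def tokens_per_second(total_tokens: float, gpu_hours: float) -> float:
    """Average throughput over the whole run."""
    return total_tokens / (gpu_hours * 3600)


input_tokens = 8e9      # "8B+ input tokens" from the comment
output_tokens = 500e6   # "500M output tokens" from the comment

for hours in (40, 80):
    prefill = tokens_per_second(input_tokens, hours)
    decode = tokens_per_second(output_tokens, hours)
    print(f"{hours} GPU-hours: ~{prefill:,.0f} input t/s, ~{decode:,.0f} output t/s")
```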


FullOf_Bad_Ideas@reddit

Yes. You can get faster and cheaper because APIs don't have prefix caching. In extreme cases, a 3090 can do prefill at 80000 t/s on a 7B INT8 model.


L3Niflheim@reddit

Most likely a proper mechanic can do a better job with that old motorbike in your garage. Not as fun though!


AppearanceHeavy6724@reddit

> and for general questions the online big LLMs are better. Meh

What did you expect (shrug)? Qwen2.5:72b should work absolutely fine on 48GB, as it is only 1 GiB bigger at Q4 than the 70b. Qwen coder 32b is going to be better than the 72b at some tasks, as the 72b is general purpose and the 32b is a coder.


ChopSticksPlease@reddit (OP)

Weird, `ollama ps` says: `qwen2.5:72b 424bad2cc13f 54 GB 9%/91% CPU/GPU`, and I think context length etc. is set to default :S
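The 9%/91% split means part of the 54 GB did not fit in VRAM and spilled to system RAM. A rough sketch of reading the same split programmatically is below. It assumes a local Ollama server on the default port and that the `/api/ps` response carries `size` and `size_vram` fields per model; check the API docs for your version before relying on it.

```python
import json
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM."""
    size = model["size"]
    return model["size_vram"] / size if size else 0.0


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Ask a running Ollama server which models are currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])


def report() -> None:
    """Print the GPU/CPU split per loaded model. Needs a running server."""
    for m in loaded_models():
        frac = gpu_fraction(m)
        print(f"{m['name']}: {m['size'] / 1e9:.0f} GB, {frac:.0%} GPU / {1 - frac:.0%} CPU")
```

Call `report()` with the server running. Anything noticeably below 100% GPU means layers are being offloaded to CPU, which is usually where the speed goes.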


kovnev@reddit

You've got 48gb of vram and you're downloading models in that way? Bro... Get your VRAM to tell you about quantization.


mzinz@reddit

Can you explain more on this? What’s wrong with this approach?


kovnev@reddit

You appear to be downloading non-quantized versions. Are you using the ollama site? Don't. Get on huggingface and look up quantized versions of the models you want.

Use their Open LLM Leaderboard to scope out models here - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/ Stick to 'Only Official Providers' until you know what you're doing, as a lot of the custom tuned models are centered entirely around achieving the best benchmark scores, and make sacrifices in other areas.

You can get Q4 versions that are like 1/8th the size of the full f32, and sacrifice little accuracy. Or a Q6 is I think 1-3% precision loss, or something like that, and they're **5x smaller**. The upshot is you can fit much better models entirely in VRAM. With your 48GB, I'd be trying out quants of 70-72B models. You can even quantize your context, at basically no precision loss, but that's a different topic.

To download quantized models on huggingface, find the model you want, and go to its page. Click 'Quantizations' (on the right). Sort those by most downloaded. Click the top downloaded GGUF version. You can click on the stuff on the right like Q4_K_M, or Q6_K, and you'll get a tab pop out with the size. When you find the size you like, click 'Use This Model' at the top of the popout, then click the Ollama button. Copy the link, open a command prompt and paste it in. It'll download it for you, and then run it in the command prompt window.

Here's a link for a Qwen2.5-32B, which is one of the top performing models. This one is at Q6_K. Paste it into a command prompt if you wanna give it a go:

`ollama run hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Q6_K`

Hope this helps, and wish this was the first thing I read 😆.
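The size ratios above follow from bits per weight. Here is a rough estimator, assuming a nominal 72B parameter count and idealized bit widths; real GGUF quants such as Q4_K_M mix precisions and average somewhat more than their nominal bit count, so actual files come out larger.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


PARAMS = 72e9  # nominal parameter count, an assumption for illustration

for label, bpw in [("f32", 32), ("f16", 16), ("8 bpw", 8), ("6 bpw", 6), ("4 bpw", 4)]:
    print(f"{label:>6}: {model_size_gb(PARAMS, bpw):6.1f} GB")
```

At an idealized 4 bits the weights are one eighth of f32, which is where the "1/8th" figure comes from. Context (KV cache) memory comes on top of this.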


mzinz@reddit

Good info here! Can’t you also download Quantized models through ollama by clicking on Tags to view all model versions?


YouDontSeemRight@reddit

Yes you can. He gives good info but ignore the comment about Ollama. Ollama is a great tool to manage LLM's. Sometimes we don't need all the control in the world or want anything more than what's needed. Sure it would be nice if they had speculative decoding but I'm patient. My vLLM docker image has its place as well.


AttentionIsAllYouGet@reddit

Ollama is just a lobotomized wrapper around llamacpp. It saves you the hassle of setting up a llamacpp server (which isn't even hard to begin with), but comes with basically no customizations that are possible in llamacpp. If you got the 2x3090 setup - might as well spend an extra hour learning to roll your own llamacpp server. Moreover, llamacpp is not even a competitive option for inference on GPU, better off using vLLM.


mzinz@reddit

So the TLDR is that llamacpp and vLLM are more performant and customizable than ollama?


YouDontSeemRight@reddit

Does llama.cpp facilitate the transfer of one LLM to another? Where Ollama shines is ease of use, control, and rapid swapping of models. If I make a vLLM server, it's running one model for the duration of the task because it took ten minutes to start up. If I'm running Ollama, I can switch between multiple models depending on the task at hand. It's definitely not as flexible. Just a heads up though: the Ollama speculative decoding MR was stashed because they were writing their own inference engine to decouple from llama.cpp.


AppearanceHeavy6724@reddit

Because you loaded a Q6 quant of the model. You need to run an IQ4 quant.


Dry-Judgment4242@reddit

With exl2 at 4.25bpw you should be able to fit up to 65k context length for Qwen2.5 72b if you run Q4.


mgr2019x@reddit

KV cache quantisation: to q8_0 or q4_0 with llama.cpp, or to Q4, Q6, Q8 with exllama v2.
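To see what cache quantisation buys, here is a rough KV cache size estimate. The layer count, KV head count and head dimension are assumed values for a Qwen2.5-72B-class model (check the model's `config.json`); the bytes per element follow the q8_0 and q4_0 block layouts (34 and 18 bytes per 32 values).

```python
# Assumed architecture values, not read from any model file.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128

# Storage cost per cached element for each cache type.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context_tokens: int, cache_type: str) -> float:
    """Size of the K and V caches for one sequence, in GiB."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K plus V
    total_bytes = context_tokens * elems_per_token * BYTES_PER_ELEM[cache_type]
    return total_bytes / 2**30


for cache_type in BYTES_PER_ELEM:
    print(f"{cache_type:>5}: {kv_cache_gib(32768, cache_type):.2f} GiB at 32k context")
```

Under these assumptions a 32k context costs about 10 GiB at f16, which is why a model that "fits" on paper can still spill once the context grows.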


MoffKalast@reddit

Qwen dies with cache quants though, it may fit but the performance will suffer.


fizzy1242@reddit

8 bit kv cache is still solid from my experience. It really takes a hit at 4 bits.


mgr2019x@reddit

In my experience there are no such issues...


cm8t@reddit

It’s the context window that pushes it over the limit.


AppearanceHeavy6724@reddit

yes, this is also true.


a_beautiful_rhind@reddit

One does not run local models to ask the capital of france.


InterstellarReddit@reddit

Then what are we supposed to do if not ask the capital of France ?


a_beautiful_rhind@reddit

Ask for the capital of your pants.


Vivarevo@reddit

Recommendations for best model for this? Asking for a friend.


a_beautiful_rhind@reddit

Currently I went back to evathene. I also like monstral v2. The llama tunes keep breaking after 6-8k context for some reason. They're good at the start and then get to looping and alliteration. Kind of kills the oomdmay. Mistral never breaks and qwen somewhere in the 18-20k range. When I get my 4th 3090 back I'm trying wizard again at higher BPW, assuming it works. Don't think I gave that one enough of a chance.


ModPiracy_Fantoski@reddit

Eiffel Tower's location.


martinerous@reddit

Ask it for financial capital (because you must be out of it after buying expensive GPUs).


AnhedoniaJack@reddit

Ask it to show you a cool math proof.


eidrag@reddit

lower bracket france


Radiant_Dog1937@reddit

Wait is that everyone's default test question?


L3Niflheim@reddit

write me a poem about cheese


SanFranPanManStand@reddit

uh... is it just porn, or are folks doing something more interesting?


SteveRD1@reddit

Dear GPT, please tell me the name of the descendant of Conrad Hilton famed for her s\*x tape!


Evening_Ad6637@reddit

XD


Hex30_03@reddit

You just need to implement basic RAG/online search on top of running the model and it will be very good for general questions. I don't know if it will be as good as big consumer LLMs like ChatGPT, but for sure it will be better than it is now.
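The retrieval half of that idea can be sketched in a few lines. This is a toy version under loud assumptions: word-overlap cosine similarity stands in for a real embedding model, the documents are invented, and the resulting prompt would still need to be sent to whatever local model you run.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    return sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Prepend retrieved context to the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical document store, for illustration only.
DOCS = [
    "Paris is the capital of France.",
    "The RTX 3090 has 24 GB of VRAM.",
    "Whisper is a speech-to-text model.",
]
print(build_prompt("What is the capital of France?", DOCS, k=1))
```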


YouDontSeemRight@reddit

What RAG and online search do you have set up?


someonesmall@reddit

you could use openwebui


SteveRD1@reddit

Dumb question perhaps...why can't we just download a model that includes this? Is it something we absolutely have to do ourselves after downloading?


OmarBessa@reddit

Just focus on the illegal questions. It's your new safe space.


ThenExtension9196@reddit

Why do they report as 3090? My modded 4090 doesn’t do that.


ChopSticksPlease@reddit (OP)

these are regular 3090


ThenExtension9196@reddit

Ah okay


13henday@reddit

I was in a similar spot when I first got my second 3090. However, since then I’ve gotten a real handle on RAG and fine tuning, and that’s where the dividends were. 32b q6 coder with 64k context is enough to reliably RAG dependencies and have the entirety of my working file in context. I work with proprietary Fortran based languages that tend to break most base AI coders. 72b vl with large context can reliably cross reference and collect info from multiple images. 72b fine tuned on engineering reports writes and summarizes really well.


waiting_for_zban@reddit

> I was in a similar spot when I first got my second 3090. However, since then I’ve gotten a real handle on rag and fine tuning and that’s where the dividends were. 32b q6 coder with 64k context is enough to reliably rag dependencies and have the entirety of my working file in context. I work with proprietary Fortran based languages that tend to break most base AI coders. 72b vl with large context can reliably cross reference and collect info from multiple images. 72b fine tuned on engineering reports writes and summaries really well.

I think the community would find this really valuable if you did a write-up of your approach!


manzked@reddit

Did you try vLLM with quants? There you can disable CUDA graph capture and reduce the max model len (context window). With this you can squeeze it onto your cards, which should help if 70b was running and 72b was not.
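Those knobs map onto vLLM's Python API roughly as follows. This is a sketch rather than a tested recipe for this rig: the AWQ model id, the 8192 token limit and the 0.95 memory fraction are assumptions, and whether it actually fits on 2x24GB depends on the quant you pick.

```python
def engine_kwargs(model: str, n_gpus: int = 2, max_len: int = 8192) -> dict:
    """Collect vLLM engine options for a tight-VRAM, two-GPU setup."""
    return {
        "model": model,
        "tensor_parallel_size": n_gpus,   # split each layer across the GPUs
        "max_model_len": max_len,         # cap the context to shrink the KV cache
        "enforce_eager": True,            # skip CUDA graph capture to save memory
        "gpu_memory_utilization": 0.95,   # assumed value; lower it if you hit OOM
    }


def generate(prompt: str, model: str = "Qwen/Qwen2.5-72B-Instruct-AWQ") -> str:
    """Load the model and generate once. Requires vLLM and the GPUs."""
    # Imported lazily so engine_kwargs() works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model))
    outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
    return outputs[0].outputs[0].text
```

Eager mode trades some speed for memory, so it is worth re-enabling CUDA graphs once the model fits with room to spare.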


MachineZer0@reddit

For a second I thought these were the 48gb 4090s. Inquiring minds want to know about the performance of those.


slavik-f@reddit

Yep. I ordered such last week. Can't wait to try it... [https://www.c2-computer.com/products/new-parallel-nvidia-rtx-4090d-48gb-gddr6-256-bit-gpu-blower-edition](https://www.c2-computer.com/products/new-parallel-nvidia-rtx-4090d-48gb-gddr6-256-bit-gpu-blower-edition)


manzked@reddit

Crazy … modified rtx cards… https://www.igorslab.de/en/rtx-4090d-with-48-gb-and-rtx-4080-with-32-gb-chinese-ki-companies-rely-on-more-vram/


waiting_for_zban@reddit

Is this a trusted website? Never heard of it. And they don't seem to accept paypal. Also what's the catch, good-luck-with-the-warranty I guess?


eidrag@reddit

or where to buy them


countjj@reddit

Sorry what cards are those?


ChopSticksPlease@reddit (OP)

Afox RTX 3090 turbo fan AF3090-24GD6H7


xanduonc@reddit

48gb is optimal for 32b models at high quants with decent context sizes though.


ChopSticksPlease@reddit (OP)

Well, I'm not saying local LLMs are meh. It's just that, to me, after trying it, the best value seems to be a single 24GB GPU like an RTX 3090 / 4090 / A5000 to run coding 32b LLMs locally and integrate the online monsters with Open WebUI. The second GPU simply didn't add much value, so in that regard it is a bit of a disappointment.


SteveRD1@reddit

I think the 5090 is better value for a cheapish card...if you can wait six months. That extra bandwidth plus RAM is nice.


I_AM_BUDE@reddit

If it doesn't light itself on fire or comes with 8 disabled ROPs lul.


SteveRD1@reddit

I can live without the ROPs, but would prefer my house to remain intact it's true!!:)


caetydid@reddit

This is interesting to hear since I was already suspecting the same. A model with twice the param count simply does not double the utility. However, this might shift with new models which push the entry barrier. How about deepseek-r1? I believe you could at least run the q4 1776 from perplexity at decent speeds - for me it is 1 tok/sec with one rtx3090.


ChopSticksPlease@reddit (OP)

Regular deepseek-r1:70b ran at something around 16 tps, which IMHO is really usable. GPU utilization was around 50%; I guess PCIe 3.0 is the bottleneck in my system.


ParaboloidalCrest@reddit

Exactly! The returns diminish quickly beyond 32B.


silenceimpaired@reddit

I don’t know I can agree. 4bit 70b/72b models on 48gb is a valuable option not easily available to 24gb. Anything below 4bit has serious performance impacts… also it feels you can run 5bit quite a bit faster on 48gb than on 24gb… maybe I’m misremembering that… and 5 bit performance wise is pretty close to 8bit


ParaboloidalCrest@reddit

70b models require double the investment for how much increased intelligence exactly? Can you quantify it?


skrshawk@reddit

For writing use-cases 70b vs 32b is the difference between a model that can consistently keep multiple characters straight, as well as keep thoughts to themselves, and knowing who saw what happen between scenes. At least in my experience. I don't consider anything under 70b these days.


silenceimpaired@reddit

Nah, it’s subjective all the way down. How do you hold a moonbeam in your hand? I can just say I spent $700 to move to 48gb and I haven’t regretted it. I’ve felt the pull to move higher… with Digits but I’m content enough. It feels like smaller models cannot create large paragraphs that hold together as well as larger models do, and larger models have a larger space to statistically guess what should happen. As an example: a person is pushed out of a window on the seventh floor… (Small model) person climbs to their feet and yells up I’m going to call the police. (Large model) person has a pool of blood around them. The small model gets that pushing someone out the window is not legal or good but misses the fact the person fell seven stories.


FullOf_Bad_Ideas@reddit

Umbrella runs llama 3 70b INT4 at 6 t/s on single rtx 4060 TI/3090. At low context. It's something.


silenceimpaired@reddit

What?! What is this umbrella? … I want to run 8bit at 6 t/s.


FullOf_Bad_Ideas@reddit

https://github.com/Infini-AI-Lab/UMbreLLa I don't think it supports multi-GPU, or 8bit


silenceimpaired@reddit

Tragic. But thanks.


synn89@reddit

You shouldn't have any problem running a 72b at 32k context. Use a 4.5bpw EXL2 quant with q4 cache. I'd recommend against ollama and using something like Text Gen Web UI for more control over the quant size and cache setup.


tmvr@reddit

Theoretically it should fit, but the Qwen2.5 72B GGUFs on HF are larger somehow. The Q4\_K\_M from Bartowski (and LM Studio as they repost his ones) is 47GB+ and the one from Qwen is 44GB+ which is a problem with 48GB VRAM.


skrshawk@reddit

Difference between EXL2 and GGUF. Also, running TabbyAPI with tensor parallel is a massive performance gain over anything LCPP.


Papabear3339@reddit

Local really shines when fine tuned on company data. Company documents, contracts, metadata, database samples, etc. This isn't a world engine, it is a small but powerful agent that can be tuned to do a specific set of tasks very very well. Think of it in that context and you will understand why local can be so amazing when done right.


GodComplecs@reddit

I plan on running multiple LLMs, VLMs etc on my 2x3090, already do it with my single. That is really optimal if you do some real work with them. Oh and you can serve them to small businesses if you need. I get 60tks+ on my single already so...


_wOvAN_@reddit

https://preview.redd.it/mbn9kermhale1.png?width=1282&format=png&auto=webp&s=2700934d67f1f8e60bc3ea08aa3663a2282ef996


ChopSticksPlease@reddit (OP)

Nice, but clearly the GPU pool is underutilized. So whats the point? Heating? ;)


_wOvAN_@reddit

small model now running, can run 70b fp16, but still not enough to run big models or big contexts


danielv123@reddit

Are those on 1x mining risers?


_wOvAN_@reddit

yep, miner board, pcie 2, 1x


DefNattyBoii@reddit

Does the PCIe bandwidth influence speed or not?


Awwtifishal@reddit

In layer split mode it should not, since only a small amount of data is being transferred.
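A back-of-the-envelope check of that claim: in layer split mode only the hidden state crosses from one GPU to the next. The sketch below assumes a hidden size of 8192 (typical for 70B-class models), fp16 activations, and roughly 500 MB/s for a PCIe 2.0 x1 link; it ignores latency and protocol overhead.

```python
def transfer_time_us(hidden_size: int, tokens: int, link_mb_per_s: float,
                     bytes_per_value: int = 2) -> float:
    """Microseconds to move the hidden states for `tokens` across one GPU boundary."""
    payload_bytes = hidden_size * tokens * bytes_per_value
    return payload_bytes / (link_mb_per_s * 1e6) * 1e6


# Assumed values, for illustration only.
HIDDEN = 8192
LINK_MB_S = 500.0

print(f"decode, 1 token:       {transfer_time_us(HIDDEN, 1, LINK_MB_S):.1f} us")
print(f"prefill, 2048 tokens:  {transfer_time_us(HIDDEN, 2048, LINK_MB_S) / 1000:.1f} ms")
```

Under these assumptions, per-token generation barely notices the slow link, while long prompts and initial model loading are where a x1 riser hurts.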


No_Afternoon_4260@reddit

The question is where did you get these?


JustinPooDough@reddit

This is why I won't bother buying more than my 3090. I use it for prototyping locally, running basic stuff, then outsource complex stuff to cheap cloud services. Even Claude Sonnet with token caching and effective context management is really affordable. If I feel like being even cheaper, Gemini Flash 2.0 and Thinking 2.0 are both basically free, and - IMO - excellent models.


L3Niflheim@reddit

But you could have ***two*** 3090s. Double the fun right?


koalfied-coder@reddit

I mean yes one really needs 96gb for 70b 3.3 8bit which is nice.


Such_Advantage_6949@reddit

Your setup is meh, i have 4x4090/3090 and i still find it meh. What do u expect from model that is one tenth of the size of model from the big guy? Have hardware that can run the full size deep seek 671b then you can tell me if it is still meh


ChopSticksPlease@reddit (OP)

Nope, I was just curious whether the second GPU would add much value, and it clearly didn't. Happy with a single RTX 3090 to run coding LLMs.


thesuperbob@reddit

I dunno, I'm in a similar situation, the second 3090 allows me to run Deepseek R1 distills smoothly, and do everything I did before but better, faster, and with a larger context. I'd say it's a decent step up. Obviously it's nothing compared to online models, and once again there are superior models that seem to be just out of reach at 48GB VRAM...


UsualResult@reddit

> and for general questions the online big LLMs are better. Meh

Might I recommend looking at benchmarks next time before you plunk down some money? There are ways to theorize about the experience without buying a bunch of GPUs.


You_Wen_AzzHu@reddit

The only reason we use local LLM for work is because the data simply can't be sent to any vendor due to compliance.


siegevjorn@reddit

Try vLLM with tensor parallelism, should be much faster than ollama.


ElephantWithBlueEyes@reddit

Even those big cloud models are meh in some cases. Depends on what you want them to do. I got an RTX 4080 with 16GB before I knew I'd be into LLMs, and 16GB is just not enough (24GB is the way). Currently testing Open WebUI + any backend. I can squeeze in just qwen 2.5 coder 14b + gemma 2 9b for general questions and access them from my laptop. 48GB of VRAM is pretty much fine for that kind of scenario, when you want to use multiple models locally. Why locally? Because I'm digging through our team codebase (I'm QA), so I don't want code to go outside. And also because many cloud LLMs aren't available without a VPN, except DeepSeek.


LienniTa@reddit

i actually used 3090 as a hairdryer back in 2022-23. Launched stable diffusion inference, apex legends, and in apex legends shooting range was throwing fire grenades and STARING at them. Gets hair dry pretty fast.


_hypochonder_@reddit

Qwen2.5-72B-Instruct-Q4\_K\_M.gguf is 47.4 GB. deepseek-r1:70b is 43 GB. You can try qwen2-72b-instruct-imat-IQ4\_XS.gguf (40 GB) or qwen2-72b-instruct-imat-IQ4\_NL.gguf (41.3 GB) instead.
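A quick way to compare those files against 48 GB is to look at what is left over for context. This sketch uses the sizes quoted above and treats everything as the same unit; the 1.5 GB reserved for runtime buffers is a guess, not a measured figure.

```python
VRAM_GB = 48.0

# File sizes as quoted in this comment.
FILES_GB = {
    "Qwen2.5-72B-Instruct-Q4_K_M.gguf": 47.4,
    "deepseek-r1:70b": 43.0,
    "qwen2-72b-instruct-imat-IQ4_XS.gguf": 40.0,
    "qwen2-72b-instruct-imat-IQ4_NL.gguf": 41.3,
}


def headroom_gb(file_gb: float, vram_gb: float = VRAM_GB, reserved_gb: float = 1.5) -> float:
    """VRAM left for the KV cache after weights and an assumed runtime reserve."""
    return vram_gb - reserved_gb - file_gb


for name, size in FILES_GB.items():
    left = headroom_gb(size)
    status = "fits" if left > 0 else "too big"
    print(f"{name:<40} {size:5.1f} GB -> {left:+5.1f} GB headroom ({status})")
```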


117 Comments