Is local AI the actual endgame? (M5 Mac Studio vs. Dual 3090s)

[-]

it's easy to see it, if it can run for millions of users simultaneously on servers then it can run locally no problem at all. If RAM prices stay high, that is an incentive for businesses to get into the RAM production business which in turn will bring us into the next phase where 256GB of RAM is the new norm for your laptop.

[-]

Hydroskeletal@reddit

boromir.gif - One does not simply walk into the RAM production business

There's also some very real materials constraints shaped by geopolitics

[-]

Radium@reddit

Businesses and people tend to flock to where the money is. That's RAM production right now.

[-]

Hydroskeletal@reddit

Sure, but that might not meaningfully budge until well into the 2030s. GPUs have fluctuated price wise but they've never tanked as the demand is always going up.

[-]

Radium@reddit

Terabytes will become the new Gigabytes for RAM soon.

[-]

ContextLengthMatters@reddit

Yes. Anything you do now is simply for the love of the game. If you think you are positioning yourself for the final showdown right now you are going to be left behind in less than a year. Hardware obviously moves a lot slower than the models and their harnesses, but it will move.

Any money you spend in this space should be for continued learning and experimentation. You don't want to think about this like you are investing in a future that no one can predict.

[-]

starkruzr@reddit

this is a pretty exciting time to be experimenting, tho.

[-]

MuDotGen@reddit

Honestly, even if I haven't made any actual local solutions that work for me at work, I've been following this sub for only a couple months now and experimenting locally and learning so much. It's helped me improve my understanding of LLMs and AI in general, so I work on a production ready AI Chatbot with AWS and Bedrock and understand better what models to use, prompt Caching and effects on cost, reasoning, etc. I recently upgraded it with a gatekeeper that acts as a semantic router and helps classify queries better since it can utilize Bedrock's grammar constrained decoding, among other concepts for tuning hyperparameters, etc. It's all converging toward the things I dreamed of as a kid (Tron, Mega Man Battle Network net navis, etc.), so it's like a dream to experiment and learn about the latest tech.

[-]

ContextLengthMatters@reddit

Sure. But no one in this sub should be advertising it as anything but. I have a bunch of money in equipment and I would not be recommending others to do the same if they aren't just trying to experiment.

[-]

imp_12189@reddit

What if I love to experiment but have no money? Hire me :C

[-]

starkruzr@reddit

I think there are some limited applications that can be reliable today. handwriting recognition with VLMs is one example. "full" foundation model flexibility, probably not absent a LARGE gear purchase.

[-]

ContextLengthMatters@reddit

These types of buyers are very specific and you can tell when they come into the sub. Most already have an idea of what they are trying to get into.

[-]

NoShoulder69@reddit

I think yes, they will have much bigger role that it is now.

[-]

Equivalent-Repair488@reddit

Wait for the bubble to pop for cheap hardware, model development will never stop, like how the internet didn't after the dotcom crash, but if you are like me and already enjoy using local models, why wait?

[-]

-Crash_Override-@reddit

Wait for the bubble to pop for cheap hardware,

The disconnect from reality is truly astounding

[-]

Equivalent-Repair488@reddit

Explain then

[-]

-Crash_Override-@reddit

You thinking there is going to be a bubble. Tjat that bubble will pop. That that will result in cheap hardware flooding the market.

Not only are you taking copium. You're mainlining it lmao

[-]

Equivalent-Repair488@reddit

You haven't given any reason as to the contrary. If you just keep insulting me without explaining any proper point, I'll just ignore your opinions as the ramblings of a clueless redditor lol.

Cyclically Adjusted Price-to-Earnings, a statistic that had indicated major market corrections in the past like black tuesday, black monday and the dotcom crash is at another all time high since dotcom. Meaning investor perceptions and expectations are higher than the entities in said ETFs or markets collective earnings, it is unsustainable, and will result in a correction eventually unless earning catch up, but I find that unlikely because the share prices are currently exorbitant.

When a correction happens, demand for data centers will drop, demand for the input items like enterprise GPUs will drop, perhaps even existing current ongoing projects will get cancelled and GPUs liquidated. If the demand dropped one day all of a sudden, do you think companies are willing to continue building datacenters, pay onsite workers to maintain, operational costs like power, when there is practically no utilisation?

Will be a crash as big as the dotcom? Unsure, but likely not, ai bubble sentiment is already tampering investor expectations. But the share prices cannot sustainably be so many times more than earnings.

[-]

-Crash_Override-@reddit

Starting by saying I actually work in finance (specifically leading AI for a large financial firm- you may even own some of their funds).

Cyclically Adjusted Price-to-Earnings,

This is an absolutely rediculous statistic. Firstly, go compare it to the P/Es of the dotcom buble and you'll see an astronomical difference. While we do have some 'meme stocks' that defy valuation, the overwhelming majority of tech stocks right now are within an acceptable range. Weve seen the definition of acceptable P/E creep up over the past 20+ years precisely because of concentration of capital in the USs premier industry...tech.

Furthermore, the key companies at play here...anthropic and oai...are not impacted by P/E becausr they are not publicly traded, they have no problem securing PE money hand over fist, and the model builders/players who are...google, meta, msft, nvda, etc could have plenty of capital, amazing revenue, and are doing just fine.

When a correction happens, demand for data centers will drop, demand for the input items like enterprise GPUs will drop, perhaps even existing current ongoing projects will get cancelled and GPUs liquidated.

According to who? Why do you think a correction will impact the strategies of microsoft, meta, google, etc...any indication of, say openai getting cold feet on hynix memory orders, hynix will turn right around and sell it to anthropic.

We just had a correction thanks to Iran. We had on in 2022 from rare hikes. Not a single company changed their course of action. Hell some of the most bullish news came out during that time.

ai bubble sentiment is already tampering investor expectations

So far in 2026 we have seen company share price get rewarded for clear and aggressive ai investment (google, meta, etc..), punished when they keep steady or pull back a bit (msft, tsla). PE has made the largest investments ever in AI companies. The market as a whole has been surging to ATH on tech investment.

But still. Lets say a market crash does happen. And all these companies, the ones who are the cornerstone to american success get cold feet....you think someone at nvda will just start posting a bunch of H200s on ebay for dirt cheap? No. China will buy them. Europe will buy them. Other companies who want their own data centers will buy them. Competitions will buy them. Sure as shit not consumers.

In short, if there is a point when companies like google stop building data centers. And the demand for this hardware evaporates, we have much much bigger things to worry about than scoring cheap ram.

Also worth noting that AI in corporate america is no longer just a fun thought experiment like it was in 2025. Its making adding vast amounts of value. Its become integral and cant be removed....so much so msft just said for our org (F500 so 14k-ish people), they will be switching to API based billing for github copilot. This represents a 4x increase (almost $6M/yr)...and were just like, well, guess we have no choice. AI companies have corporate america by the balls.

OAI and Anthropic are expected to IPO later this year, and when they do, their books will be open. We already know revenue for these companies is strong. They will be aiming to show a significant leap towards profitability. Which I expect will be true.

[-]

Important_Quote_1180@reddit

Consumer hardware isn’t what brings in the $$$ anymore. Companies want that enterprise expense account. Hobbyists are vacuuming up hardware. 3090’s going for $1100+ seems insane but it’s actually just an amazing card for ai and you can see the disconnect from 5080s betting very undesirable in comparison. H100s or Blackwell 6000s are going to be traded around after the next generation comes out, but if that enterprise customer needs to spend 60k for a card they are probably going to have a bad time. The market is struggling unless you are nVidia

[-]

Equivalent-Repair488@reddit

I agree fully.

But it's honestly just super insane that even the A100s can command 5 digit prices. What I hope is that the market correction will bring down prices for the enterprise cards back down to like under 5k USD hopefully if datacenters get cancelled and existing GPUs are liquidated to recoup costs. But the economy will be rough, way way worse than even now if that were the case.

[-]

Party-Log-1084@reddit (OP)

So far i dont have experience with local ones. Whats the best way / tools to start with? Got a Gaming PC with Linux Mint, Ryzen 9 5950X, 64GB RAM, AMD RX 7800 XT (16GB VRAM)

[-]

ea_man@reddit

get https://huggingface.co/mradermacher/Qwen3.6-27B-GGUF

llama-server \
 -m /home/eaman/lm/models/mradermacher/Qwen3.6-27B/Qwen3.6-27B.i1-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -np 1 \
        --fit-target 20 \
        -ctk q4_0 \
        -ctv q4_0 \
        -fa on \
        --temp 0.45 \
        --top-p 0.9 \
        --top-k 35 \
        --min-p 0.05 \
        --repeat-penalty 1.05 \
        --presence_penalty 1.5 \
        -b 512 \
        --jinja  \
        --no-mmap \
        --reasoning-budget 1 \
        --chat-template-kwargs '{"enable_thinking":false}' \
        --no-mmap \

Qwen3.6-35B-A3B.i1-IQ3_M.gguf

llama-server \
 -m /home/eaman/lm/models/mradermacher/Qwen3.6-35B-A3B.i1-IQ3_M.gguf \
        --host 0.0.0.0 \
        -np 1 \
        -ctk q4_0 \
        -ctv q4_0 \
        -fa on \
        --temp 0.35 \
        --top-p 0.9 \
        --top-k 40 \
        --min-p 0.05 \
        --repeat-penalty 1.05 \
        -b 512 \
        --fit-target 60 \
        --jinja  \
        --chat-template-kwargs '{"enable_thinking":false}' \
        --reasoning-budget 1 \
        --no-mmap \
        --swa-full \

And run those on a Lubuntu, not windows, not LM studio. Use llama.cp

[-]

boutell@reddit

I don't think 27B is going to run well in 16GB RAM, except maybe at tiny context sizes. His 7800 XT ought to kick ass with 35B-A3B though, because it's an MoE model and doesn't require pulling 100% of the weights into memory 100% of the time and also uses less VRAM for k-v cache. I like OP's chances with the quant you recommended there or even 4 bit.

[-]

ea_man@reddit

What do you mean?

On a 9700xt is gonna do \~40tok/sec. Context size is equal to other 16gb cards.

Model File	Context Size (Tokens)
`Qwen3.6-27B.i1-IQ4_XS.gguf` (KV 4_0)	93,952
`Qwen3.6-27B.i1-Q4_K_S.gguf`	70,656
`Qwen3.6-27B.i1-IQ4_XS.gguf` (KV 8_0)	51,712
`Qwen3.6-27B.i1-Q4_K_M.gguf`	22,784

[-]

boutell@reddit

How much VRAM do these scenarios require? I thought it would be more than 16GB.

[-]

ea_man@reddit

I just wrote there, that is 16GB cards.

[-]

boutell@reddit

I see why it just barely fits with XS. Still makes me nervous LOL. My money to waste I guess if I want to have more RAM.

[-]

boutell@reddit

Thanks, I may buy one. Qwen says you shouldn't go below 128k but it doesn't seem to hurt as badly with 27b as it does with the MoE.

[-]

boutell@reddit

Didn't he say he had something with considerably less RAM than a 9700?

[-]

ea_man@reddit

he said AMD RX 7800 XT (16GB VRAM), ok so it's maybe \~32tok/sec.

[-]

mechkbfan@reddit

Is there a tool that calculates a baseline for you?

How do you determine variables?

I'm new here and wanting to dive into it but damn there's so many variables

[-]

ea_man@reddit

You mean like temperature?

Usually the provider should give defaults, es. https://unsloth.ai/docs/models/qwen3.5#thinking-mode

[-]

mechkbfan@reddit

Thanks, appreciate it

[-]

NoahFect@reddit

Those are some really low temperatures. Is that mradermacher's recommendation, or something you've converged on yourself?

[-]

ea_man@reddit

My preference, for agent use, low quants, low KV quantization, small tasks that I check

[-]

Sabin_Stargem@reddit

I recommend KoboldCPP if you want a relatively simple backend that can leverage RAM+GPU. It is compatible with Windows and Linux, maybe Mac for Arm CPUs?

https://github.com/LostRuins/koboldcpp/releases

0000

My two pence on hardware: wait until DDR6 boards start coming out. That will cause many enthusiasts to start dumping their older hardware, so you can adopt much of their kit at a lower price. That would be at least a year off.

[-]

samandiriel@reddit

We started out with a similar box. It will be meh without at least 48GB of VRAM, in our experience (we went the dual 3090 route, and 128GB RAM).

CachyOS is definitely way better for your host OS than Mint for this application, IMO. I have mint for my workstation, and we were using mint originally for the gaming rig too, but the improvements we got from Cachy were worth the pain of switching.

[-]

Equivalent-Repair488@reddit

You can always use Openrouter to try specific base models that are open source, and are able to run on your potential hardware options. You can use VRAM calculators to estimate, there are plenty online

16gb of vram is also plenty to try out small models quantized.

Or your can straight up rent 3090s from like Runpod, make sure all layers are offloaded to them. Although none of these options are 100% to the experience you will get from just testing with the actual hardware.

But if you are satisfied with the model's quality itself, imagine the hardware likely being slightly slower unless you are willing to spend time to learn to optimise your final purchased and running local setup.

[-]

New_Zone5490@reddit

how are you so confident the current state of ai is a "bubble" though?

[-]

Equivalent-Repair488@reddit

Not confident. The whole situation has alot of unprecedented factors.

But one thing is for sure, Cyclically Adjusted Price-to-Earnings cannot be this sustainably high, as well as may startups getting funding slapping the word ai on their product. A major correction is on the way, as big dotcom? 🤷‍♂️ Anti AI sentiment has been tampering investor expectations already.

But the PE ratio is not sustainable in the long run, a major correction WILL happen, or the earnings catches up to the price which is astronomical and I can't see how.

Then again, by a few years time, the big GPU players likely will have new releases, better processes, saturate the market more with better hardware, or competitors wanting a piece of the cake will provide substitutes enough for competitors like the Taalas LLM ASICs, or the Bolt Graphics Zeus GPU (which I am skeptical of).

[-]

looselyhuman@reddit

You don't think all the commercial models will go closed weights/hosting only at some point? We'll always have what's already out there, but we might get cut off from advanced models at some point.

[-]

t4a8945@reddit

I bought 2x Asus Ascent GX10 with that in mind: "if price go up, then I'm holding gold ; if price go down, I can make my cluster bigger"

Very happy with the 5K€ purchase, can't complain

[-]

FastHotEmu@reddit

what's the memory bandwidth on it? isn't it like 270GB/sec? seems very low for inference

[-]

t4a8945@reddit

Yes, current high-performing models on my setup generate tokens at 20 to 40 tps.

That's not lightning fast, but that's not very slow either. Totally usable.

[-]

FastHotEmu@reddit

ive got an epyc server that was a fraction of the cost of one of those and hits 400Gb/sec with 256GB of ram. Although, admittedly, I bought the ram before the crisis

[-]

t4a8945@reddit

That's great! Have you tried these models? What kind of inference perf do you get?

[-]

FastHotEmu@reddit

what should i try?

[-]

t4a8945@reddit

MiniMax M2.7 at Q4 and Qwen 3.6 27B-FP8 are the one I run right now ; and they are both really great.

[-]

FastHotEmu@reddit

what numbers do you get? i've been running qwen 3.6 27B on my 3090 but a quantized version. I should try it on the epyc!

[-]

Important_Quote_1180@reddit

OP, I don’t think Mac is the future for AI this year. Slow and not a good experience imo. Even the new Mac M5 chips are IMO extremely underwhelming and the high memory configs are laughable cost wise. 3090s still best bang for your buck but hers a way to let you try it before you buy it. Rent time on a card and see if you get what you want from it. If you are like me though, experimenting and working will eventually require an investment and the market is not kind to fresh entrants right now. Or just still with cloud services like GLM5.1 DeepSeek or OpenAI(this one is so subsidized right now but the party will end soon).

[-]

cmndr_spanky@reddit

I just mess around with models that fit on my gaming PC GPUs and learn. Just accept that the frontier models from openAI and Anthropic are still far ahead, and open source models have barely caught up.. but they still have a lot of use cases that would be overkill for Claude. For example simpler constrained agents and RAG systems can work great on smaller local models.

Local AI coding is much more challenging though and IMO you’re better off just using Claude and not overthinking it. Even with a $7k Mac Studio you aren’t going to come that close to Claude at least for complex coding / engineering tasks.

But as I said, I still get a ton of use out qwen 35b for a lot of other things

[-]

qfox337@reddit

If you're buying new hardware and willing to spend that kind of cash, I'd say neither and get a RTX 6000 Blackwell or a RTX 5000 Blackwell. You'll have an upgrade path to add a second (cheaper if years down the road). But it depends what you're doing. The Mac will be okay if you're using it as a chatbot and probably insufferable if you're "vibe coding". I honestly feel pretty impatient with Qwen3.6 35B-A3B at 100 tok/s because the "thought" tokens take a while, and it's not always clear when to turn it off (it does sometimes help, especially with strict output instructions). 2x RTX 5090 is the best deal but it's basically impossible to find them at retail price.

Deepseek Flash is the sweet spot for me, low cost, fast, significantly better practical quality (much as Qwen 3.6 was a huge advance for smaller models) and supports JSON restrictions on output (without writing complicated grammars). But the license is clear they can log your requests, like basically any other primary model provider. I think a Chinese company is less likely to surveil me in ways that are harmful than an American one, but obviously local models are much better than any other option in that regard. Imo, best to do a little due diligence checking a few requests before using any cloud model, especially if you use one of the extremely featureful vibe coding tools that can all too easily upload files you didn't mean to.

[-]

txgsync@reddit

I will just copy/paste the same comment I made in response to a person wondering why I reference spreadsheets whenever this topic comes up.

We aren’t quite to the endgame for the tech but it’s quite close.

VisiCalc didn't make mainframes smaller. It made the use case portable to a machine that was already cheap. I'm old enough to watch the industry pull off the same trick, four times now, over fifty years.

1979 — VisiCalc. Dan Bricklin watches a professor at HBS update a paper financial model on a chalkboard, has the idea, and ships the first electronic spreadsheet for the Apple II. The Apple II cost $2,000. The mainframe time-sharing services doing financial modeling — IDC, ITS, the rest — cost thousands a month plus per-CPU-second fees. Within twelve months businessmen are buying Apple IIs as "VisiCalc accessories" — they don't want a personal computer, they want the spreadsheet, and the cheap hardware comes along for the ride. The mainframe didn't get smaller. The demand for "I need a five-year cash flow model by Tuesday" got routed to a $2K box. Bricklin's own retrospective on bricklin.com is worth reading from the source.

1980 — 8087 math coprocessor. Floating-point used to be a mainframe / minicomputer thing. Engineering workstations from DEC, Sun, and Apollo sold a moat made out of fast FP and structural FORTRAN. Intel ships the 8087 — FP coprocessor for the 8086. I remember installing math coprocessors onto motherboards in the early 1990s. Eventually, the 486DX integrates floating point on-die.

Fun aside: one of my earliest gigs was soldering pins onto 486 "SX" CPUs en masse for customers when I worked out of a computer shop in Las Vegas. If you simply soldered some pins back on? You could enable the coprocessor. It made certain operations -- notably, video encoding for the dial-up-porn shop upstairs that hired us go dramatically faster.

By 1996 there's MMX, by 1999 SSE, but they didn't gain much traction in the consumer marketplace for a few years. FP goes from "expensive allocated resource you submit a job to" to "system call." Sun peaks in 2000 and is sold for parts to Oracle in 2010. Racks of minicomputers in datacenters didn't get smaller. The thing they were for — fast numerical work — moved to the cheap commodity Intel box.

1996–2009 — GPU eats SGI. Silicon Graphics sold $50K–$100K+ workstations to Hollywood, NASA, and the DoD, in addition to the small-time computer game studio I worked at at the time. SGI's moat was dedicated 3D silicon. Three SGI alumni — Sellers, Smith, Tarolli — leave in 1994 to found 3Dfx and put 3D acceleration on a $300 add-in card for PC gamers.

I owned every generation of 3dfx card through the Voodoo II, and used two in SLI mode religiously when "nvidia sucked".

It made me laugh when NVIDIA shipped GeForce 256 in 1999 calling it "the first GPU." Tedium's history of that moment is great. SGI's own people had pitched a PCI graphics card internally; it got killed because the margins were too low for SGI's existing customers. Textbook Christensen. By 2005 a $400 commodity GPU renders what an SGI Onyx farm rendered for Toy Story in 1995. SGI files Chapter 11 in 2009. The massive rendering farms didn't necessarily get smaller. The job moved to farms of the cheap commodity boxes, and an individual user with enough spare time could do similar things at home.

2024–now — AI coprocessor. Same pattern, fourth time. Every Copilot+ PC ships with 40+ TOPS of NPU. Apple Silicon has the Neural Engine on-die since the A11. Qualcomm Hexagon, Intel NPU on Lunar Lake, AMD XDNA on Ryzen AI. And then there's Taalas — hardwired Llama 3.1 8B at 17,000 tokens/sec on a single ASIC, claimed 1000× perf-per-watt versus an H100. Their pitch: two months from model weights to silicon.

With the recent drop of Gemma 4 31B from Google, you could cast the model into ASIC by end of summer. Near-frontier-class tool-using model capability in your laptop's coprocessor for the cost of metal mask layers, running thousands of times faster than on GPU.

The part that should make every API-tier AI subscription investor reach for the antacids: Apple just signed a multi-year deal with Google for Gemini at ~$1B/year. Same Google that published Gemma open weights, including small variants explicitly designed for on-device deployment. IMHO, Apple is less interested in paying for frontier API access than in licensing the substrate: a model family they can distill, quantize, and bake into silicon they already control, the way they baked H.265 into the Media Engine and FP into the FPU. ANE running a distilled-from-Gemini model on every iPhone in 2027 is the likely play.

The Christensen frame, The Innovator's Dilemma, talks about this in great detail (and is worth the read!): low-end disruption starts worse-than-incumbent on metrics incumbents care about (raw capability), better on metrics incumbents don't care about (cost, latency, locality, privacy, offline operation), and eventually crosses the "good enough" threshold on the headline metric. At which point the incumbents' margin structure becomes a trap. OpenAI can't run a $20/month consumer subscription business when the laptop and phone you already own do 95% of the work for free. They can sell frontier capability to enterprises that need GPT-6-class reasoning — that will probably remain a real business for some time — but it's a much smaller business than the consumer subscription thesis their valuation is built on.

Now, I know this endgame sounds like sci-fi. But I'm an old fart. I've traced the same trajectory repeatedly for half a century of lived experience in the IT business. I think within a few years, the internet stops being a content delivery network and becomes a prompt and context exchange. Tutorials, blog posts, recipes, how-tos, lookups — your local model generates them on demand from indexed/cached source material. What flows over the wire is the prompt, the citations, occasional fresh facts, the recipes for getting good output. The browser becomes a thin client for a local model that drafts most of what you read against material the model already has. Certainly not this year. Probably not next year. But the trajectory seems like it's following exactly the four cases above, and we're at the "wait, is the cheap box actually good enough?" moment that always precedes the demand curve flattening.

[-]

Zyj@reddit

I think local AI will play a bigger role once many people

want their own AI assistants and
realize the privacy implications

We‘ll need more hardware like Strix Halo or Medusa Halo.

But then again, this singularity makes it difficult to predict what will happen. Perhaps confidential compute is the answer for most.

[-]

slippery@reddit

Local hardware will never match a multi gigawatt data center running 10+ trillion parameter models. It might be good for certain things, maybe a lot things in the future, but it will be orders of magnitude below the frontier.

[-]

Curious-Function7490@reddit

It's where I want to go.

I think the AI boom is crazy and the bubble will burst. Token cost is an expression of this.

LLMs (predictive text) are very useful technically and should be adopted under controlled circumstances.

This comment was written under the cloud of a 5 day head cold but you get the general idea.

[-]

Pleasant-Shallot-707@reddit

There is no M5 Studio right now

[-]

Ok_Warning2146@reddit

If u need image gen video gen, then 3090

[-]

_derpiii_@reddit

I'm a digital nomad, so opted for a 'dual-use' middleground of M5 Max 128GB. IMHO, that's the sweet spot.

But if I had a desktop, kind of a nobrainer to lean towards linux + mult-slot GPU.

[-]

ObjectiveOctopus2@reddit

The end game is most AI running local. Super advanced ai running in the cloud.

[-]

Individual-Cup-7458@reddit

Firstly, yes, local AI is the actual end game. The cost is a one-off hardware payment that is free to use for life [of the hardware] and all your data stays with you. What's not to like?

If you want to keep hardware costs down there are viable alternatives to Nvidia that work extremely well, particularly for LLMs.

A 24GB AMD Radeon RX 7900 XTX is a fraction of the price of its Nvidia equivalent.

I am a programmer and my 24GB 7900 XTX lets me easily run 30B models (and higher) at a rate and quality that, day to day, is not noticeably different to commercial offerings.

[-]

boutell@reddit

Yes, the business models of the big AI players are completely unrealistic and yes, there is a huge reckoning coming in cloud AI pricing and yes, we're all going to feel it in our wallets.

But the exact timing of when a local graphics card will be able to both deliver enough smarts *and* work out to a lower total cost including the electricity is tricky and I wouldn't treat it as a practical investment.

That being said... I'm tempted myself.

Claude Code subscriptions are said to cost Anthropic between 8 and 13 times as much as they charge for them.

Assuming 10x, that's $200/mo for Pro or $1,000/mo for 5x Max.

Let's say they start charging that for real and you want to replace it.

Electricity costs vary, but let's say you're paying 20 cents per kwh (much, much more in California, a little less in Philadelphia).

If you were using Pro, it's unlikely you'd break even, even on the hardware costs, in less than 2 years.

If you were using Max though, you'd break even pretty damn quick on hardware.

There is also electricity to consider. With 6 hours of heavy use a day, an unusually efficient rig at 250W would cost you 1.25 kwh per day, $7.50/mo, $90/year. That would get you Qwen 3.5 35B A3B at pretty good speeds. If that model does it for you, great. Some say it's comparable to Sonnet. I think it's pretty great but requires way more correction and guidance than Sonnet.

Now let's think about a more powerful rig capable of running Qwen 3.5 27B at decent speeds. That's going to set you back more like $3,500. Again, if you're replacing Max... sure! Hardware cost is no problem.

Electricity cost closes in on 1 kwh per hour, geez. Almost a microwave oven. But still, you're talking just about $36/mo if you're paying 20 cents per kwh and using it 6 hours a day.

So in this scenario... you totally come out ahead... IF:

(1) These models actually are good enough for you, or good enough to greatly reduce your need for cloud models, AND

(2) The AI market really does transition soon to charging the true cost to the consumer.

If #1 doesn't happen, you're gonna feel pretty dumb. So make sure you evaluate that hardware thoroughly by renting a card in the cloud.

If #2 doesn't happen, and you did this for financial reasons, you lost money. But maybe you see it as insurance.

These aren't the only possibilities. Just thinking out loud!

[-]

mechkbfan@reddit

As a newbie but software engineer of 15+ years, I think so

Once we get good enough models that can do code assistant of Opus 4.6 on GPU & RAM that's ~$1500 USD, that's good enough for me.

[-]

lightskinloki@reddit

Local will win for majority usecases with cloud inference being mostly for enterprise use. I use nemotron super and qwen 3.6 for almost everything now and only go to claude for opus when I really really need it which is becoming far less often. Am a solo game dev and environmental educator. Once 1m context is solved on mid grade consumer hardware there will be almost no reason for the average person to ever use cloud models again other than convenience

[-]

Liringlass@reddit

The brutal truth is they are not the same and I fear the gap will grow.

Local ai is getting better at equivalent size giving us better options that are viable for some tasks, and the hardware you’re thinking about gives you good options.

But frontier models are in the trillion parameters now and still climbing. Even the small models like haiku, gemini flash are leagues ahead.

The best thing to do before investing is to open an open router account and test the models you will be able to run (30b class i would say) for a few weeks. Then you will know with just a few dollars invested whether you can rely on those, at least for the privacy involving use cases

[-]

datbackup@reddit

The best thing to do before investing is to open an open router account and test the models you will be able to run (30b class i would say) for a few weeks. Then you will know with just a few dollars invested whether you can rely on those, at least for the privacy involving use cases

u/Party-Log-1084 listen to this man, OP, he speaks the truth here

A big part of what you learn from running locally is a sense of what size models can do what, using what harness

If money isn’t tight I would go ahead and put together your 2x3090 build, I doubt we see anything that touches 3090 price-performance ratio for at least a year and even if/when something gets released that competes on this ratio, it’s not like the 3090s all suddenly become worthless bricks or like demand for tokens won’t be sufficient to continue driving 3090 sales.

M5 studio should be roughly equal in performance to 3090 but price is another matter

[-]

hackcasual@reddit

I think waiting is the play. Hardware prices are at all time highs, mostly due to over buying. A few data center cancellations, a little catch up in production and hardware drops in price.

[-]

rorowhat@reddit

Nothing is going to change anytime soon, micron is sold out of memory till 2027 at least.

[-]

Kahvana@reddit

Why go for the best if it works with "good enough"? Try out Gemma4-31B and Qwen3.6-27B on openrouter or nvidia nim or their own website, If it's good enough for what you need, spend the cash on the hardware.

GPUs are almost always better than CPU inference unless it's about RAM size. Two used RTX 3090s will do the job just fine, even better with NVLINK bridge.

Personally I'm really happy with my dual ASUS PRIME RTX 5060 Ti 16GB setup. It's very slow by most standards on this sub (Bartowski's Qwen 3.6-27B Q4_K_L with 256k Q4_0 context. Gen: 20 t/s at 32k, 10 t/s at 250k).

Even if it's slow, it conforms to all my soft factors (no 12vhpwr, very silent (barely audible even under load), very little heat (60c tops), very little electricity use (300W including my 55w monitor during heavy inference). It also costed little over 1400EU (2x gpu, 1x asus proart x870e) for me at the time, that price is still doable today (2x gpu, 1x asus proart b850 neo).

[-]

OddDesigner9784@reddit

Definitely not endgame. I think local ai is great because you can essentially run it forever and would use it over any long running task with a frontier model. That being said the frontier is expanding and will have easier tooling to work with it. There are things that will only be accomplishable with frontier ai

[-]

An0nynn0u5@reddit

It's worth investing in 2 3090s. A small local model for privacy and automation, and cloud models for heavy work. Small models should be able to rival current large models in a year, if the bubble doesn't burst.
The Mac Studio, if it still ships with 512gb memory, is expected to run large models slowly. I don't think it would be worth it for heavy work like coding, but well worth it for light or occasional tasks where you don't want your data sold.

[-]

boogityxracing@reddit

I'd echo the others here saying that waiting is the most economical option. Any long-term investment in hardware now is going to be a bad one, so if you really want the most bang for your buck, there's nothing worth buying right now.

That said, if you can afford to burn some money, I don't think buying hardware now is the worst idea, either. You just have to go into knowing it's more of an investment in your own knowledge about running and working with local LLMs than an actual long-term economic investment in hardware.

I have a couple personal AI servers, a mini PC with a 4090 eGPU and a Framework Desktop, neither of which is as powerful as I'd like or likely to be a great long-term investment, but I still appreciate having them just to be able to play with models locally. They can both run decent-sized models at decent speeds for the price, and being able to run models on my own hardware has inspired me to try out use cases I wouldn't have ever thought to try with a hosted offering. Not even for privacy reasons, necessarily, but just because it forces me to think more about the models' capabilities and ways I might want to experiment with them.

[-]

maschayana@reddit

4 to 7k 😂

[-]

synn89@reddit

I have a couple dual 3090 servers, a M1 Ultra with 128GB RAM and prefer the M1 Ultra by a wide margin. When 70B models were the norm, the dual 3090 was more attractive since it could run those at 4bit pretty well. It was a trade off between being a little faster than the Mac vs the Mac having low power consumption and a low footprint. But with MOE models today it's no contest, the 128GB of unified RAM is so much better.

A used M1 with 128GB or M2 with 192GB would've been the good "get it now" cheapish option, but I think the used prices may have gone up a lot. I'm waiting to see what Apple does with the M5 Ultras and am may try to snag a 256GB one of those if the price isn't insane.

The above changes though if you need something to run imaging models. For that, you basically need Nvidia.

[-]

alexpolo3@reddit

What model are you running on the dual 3090?

[-]

Party-Log-1084@reddit (OP)

I dont have dual 3090 yet. I am just asking if it does make sense to buy too. In germany, they are 1K each (used ones) which is really high price

[-]

Craftkorb@reddit

Huh, just had a look at Kleinanzeigen. Just a few months ago they dipped into the 650 range, now a "fair" price seems to be 850€. Still, damn.

Anyhow. Cloud will always be more powerful, they just have other hardware options. Butt even if a good local models is only 80% as good as a cloud one, that's still a really good model. I do loads with Qwen3.6 27B fp8 on my 2x3090 machine.

But before you buy hardware, just try these models on openrouter.ai for cheap. You could rent some GPUs in vast.ai for more tinkering. And if you like what you see, you can still invest the money. If you're contend with less speed, there are options like the aging Nvidia P40 or the AMD MI30, look on eBay for Chinese sellers.

[-]

samandiriel@reddit

This is the best answer, OP.

That being said, we have a dual 3090 set up and it is pretty amazing all things considered. It will never touch cloud hosted high end models, tho, just like my hobby car will never match Indy 500 racers.

[-]

Pretend_Engineer5951@reddit

It depends on what's your use case. I dived into vibe coding with local LLMs on Strix Halo. After 3-4 months of probing different projects, tasks, agents, models I began to feel disappointed on some things.

Coding can't be gradual. Either code works or not. On projects with high complexity any local model failed me when I needed to implement a feature. Of course having enough time even "stupid" LLM will handle the work through many failures. But hidden bugs are coming out later. The main thought I want to expose: I cannot rely on this tool. Maybe one day I'll workout optimal workflow with step-by-step manual verification. But "Claude just works".

On the other hand, local LLMs did me a favour when I

- needed to understand how complex things in a code work

- do code review

- need to gather information on tickets and do analysis report (KPI things)

- do simple bootstrapping or small/medium module refactoring

Plus model updates are becoming more frequent what keeps a hope that we'll see much more sophisticated free LLMs in the nearest future.

[-]

Material_Policy6327@reddit

The end game is making it so no one can use their own hardware easily

[-]

RiotNrrd2001@reddit

The hardware we are using now is going to look like a slide rule in five to ten years. GPUs weren't designed for AI, we've just pressed them into service because that's all that's available.

ChatGPT, the product that brought AI to the general publics attention, was released four years ago.

The typical time it takes to bring a chip from design to reality is about five years. If a company started working on an AI focused chip the day ChatGPT was released, it would still be in the pipeline today.

Additionally, local AI is getting better and cheaper. My own setup is basically a potato. I have a "gaming machine" from seven or eight years ago with literally one of the worst GPUs Nvidia ever made, one that tops out at 6GB and does not support half-duplex, something every other GPU in the world supports. Nevertheless, the (quantized) local AIs that I can run in LM Studio on this potato are significantly better than ChatGPT 3.5. New optimizations are occurring fairly regularly, as well. I expect that the models I'm running now will be inferior just a year or two from now. We're still in the DOS days of AI. Maybe we're up to DOS 5 now instead of DOS 3.3, but it's still DOS. But someone somewhere is working on the "Windows" version.

Next year, or maybe the following, I expect to start seeing relatively inexpensive chips optimized specifically for AI start appearing. There are some prototypes out there now, but they're, of course, stupidly expensive because the economies of scale haven't kicked in, those chips are still "high end" the way Pentiums were "high end" once.

I remember reading a letter to the editor in PC Magazine back around 1994, where the author was musing about whether they should upgrade from a 10 MB hard drive to a 40 MB hard drive. The response was they should wait. Lo and behold, within a few months the standard size you'd get in the stores was 200 MB. A year later we were getting gigabyte drives. Had the person splurged on the 40 MB drive (which was several hundred dollars at the time), they'd have been very sad not too long after.

I think we're in the same situation as back then. Yes, you could pony up for a 286 DOS box, but 486 machines running Windows are on the horizon. I'd make do with what you have for as long as you can, MUCH better things are on the way, no reason to sink tons of money into something that will go obsolete faster than you can say "obsolete".

[-]

Marshall_Lawson@reddit

one that tops out at 6GB and does not support half-duplex, something every other GPU in the world supports. Nevertheless, the (quantized) local AIs that I can run in LM Studio on this potato are significantly better than ChatGPT 3.5.

what models are you running?

[-]

RiotNrrd2001@reddit

The ones I tend to use the most are glm-4.7-flash and the Gemma4 models. Qwen3.6 tends to get into endless loops, so I don't use that one as much, although I'll still break it out now and then.

The models are generally 4-bit quantized. None of them run particularly fast, usually less than 10 tps, but that's fine for what I use them for (non-production screwing around, basically).

[-]

redmctrashface@reddit

But models will likely become more and more effective, no? So, having something like an nvidia dgx spark makes you confortable for few years I guess

[-]

JurassicSharkNado@reddit

It's worth it enough for me to set up the infrastructure and start learning about it, and planning upgrades. I'm working on setting up a system where I can drop in a mini PC and easily spin it up and get it integrated into a distributed LLM setup.

But I'm under no illusions that my cobbled together setup that started as a former gaming rig will ever match a giant server rack filled with hundreds of thousands of dollars of dedicated AI hardware.

[-]

Party-Log-1084@reddit (OP)

Sounds quite interessting. Which specific Hardware do you use?

[-]

JurassicSharkNado@reddit

Right now I have my AM5 based desktop, former gaming rig. I swapped out the 6950xt GPU for a workstation grade r9700 with 32 GB VRAM. Wanted to do dual GPU and keep the 6950xt in there, but can achieve that in my mobo without a x16 to x8x8 bifurcation card. So I bought an oculink egpu dock for that and an AMD HX370 based mini PC (gmktec evo x1). I'd like one of the amd 395 based mini PCs with 128GB unified RAM but the prices are 50%+ higher than they were a year ago and new APUs are on the horizon (AMD Gorgon and Medusa) so I'll wait and see what those bring. Pooling resources using llama.cpp RPC, so a combined 48GB VRAM (+10GB if I add the iGPU on the mini PC) and pooled 96 GB RAM. Since it's just two devices for now, plan is to connect them with USB4, comes on the mini PC and just bought an add in pcie card for the workstation. Laptop is a GPD win max 2 that also has an AMD HX370, can pool that in if I want, actually tested that with the egpu dock first then bought the gmktec.

Quick AI generated summary table of my brain dump below, let's see if it formatted properly...

Current setup:

Device	CPU/APU	GPU	VRAM	Notes
Desktop (workstation)	AM5	Radeon AI PRO R9700	32 GB	Primary workstation
Mini PC (GMKtec EVO X1)	Ryzen AI HX 370	Radeon 890M iGPU + RX 6950 XT via OCuLink dock	~10 GB shared + 16 GB	6950 XT lives here permanently
Laptop (GPD Win Max 2)	Ryzen AI HX 370	Radeon 890M iGPU	~10 GB shared	Optional pool member

Pooled totals (desktop + mini PC):

Resource	Amount
Discrete VRAM	48 GB (32 + 16)
With mini PC iGPU	~58 GB
System RAM (pooled)	96 GB
Interconnect	USB4 (native on mini PC, PCIe AIC on desktop)
Orchestration	llama.cpp RPC

Future plans: Holding off on a Strix Halo (Ryzen AI Max+ 395) 128 GB mini PC — prices up 50%+ YoY, and waiting to see what AMD Gorgon and Medusa APUs deliver.

[-]

Miriel_z@reddit

AnythingLLM is probably a good start. There are other good alternatives too. Will involve some learning curve. You can actually fir quantized 27B-30B model in 16GB VRAM. the context will be offloaded to CPU though, so will be kinda slow-ish for response.

[-]

ea_man@reddit

You just have to use the right quants and you won't offload, even better use linux.

es:

| Model File | Context Size (Tokens) |
| :--- | :--- |
| `Qwen3.6-27B.i1-IQ4_XS.gguf` (KV 4_0) | 93,952 |
| `Qwen3.6-27B.i1-Q4_K_S.gguf` | 70,656 |
| `Qwen3.6-27B.i1-IQ4_XS.gguf` (KV 8_0) | 51,712 |
| `Qwen3.6-27B.i1-Q4_K_M.gguf` | 22,784 |

[-]

Miriel_z@reddit

It is up to OP for quality/speed balance. The key point is OP does not need to invest before jumping into LLM. The rig is already very decent.

[-]

ea_man@reddit

Agreed, nowadays a 16GB is a nice starting point to learn the ropes, yet I'd rather use a sw env that is easier to debug if that's the point.

I mean 16GB keeps you honest, you get a big hit whenever you screw up your config 😛

[-]

Miriel_z@reddit

I started from 6GB. And it was still decent, not tried coding though.

[-]

ea_man@reddit

He coding is a bitch because you often start with a \~12k context prompt and then you have to parse your existing code, also I would not say that 4B models are much useful for anything other than quick tests.

I mean 27B it's where it's at nowadays, you need 12-16gb to use it.

[-]

ea_man@reddit

You should wait to buy expensive hw, coz now it's stupid expensive.

Yet right now with your 16gb gpu you can already run the "good" models at Q4, so go start to build the sw enviroment (linux, llama.cp). In a year or two prices should go down, small models get smaller every month...

[-]

Miriel_z@reddit

Start small, get comfortable, research models. Then you can clearly identify what you want and what you need for it. You can easily start from 8B-12B models. With quantization, you can run these on 6-8GB VRAM, which will not burn a hole in your pocket.

[-]

Party-Log-1084@reddit (OP)

So far i dont have experience with local ones at all. Whats the best way / tools to start with? Got a Gaming PC with Linux Mint, Ryzen 9 5950X, 64GB RAM, AMD RX 7800 XT (16GB VRAM) and the Dell one mentioned above.

[-]

Toastti@reddit

Qwen 3.6 35B at a 3 bit quant would be perfect here. Basically state of the art and the best you can get on that hardware. And fast

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?show_file_info=Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf