Need a brutally honest answer: what can realistically be achieved on consumer hardware?
Posted by wewerecreaturres@reddit | LocalLLaMA | View on Reddit | 69 comments
I have a PC with a 4090. I’m also in need of a new MacBook generally.
From a code quality and speed perspective as compared to things like Sonnet/Opus/Codex/etc…
What can realistically be achieved with a 4090?
M5 Pro 64GB?
M5 Max 128GB?
Or do I just keep paying for the big boy subscriptions and call it a day?
This isn’t a money thing, I can afford the M5 Max, but am not going to waste money for no real value.
mlhher@reddit
It depends on the model you use, the quant you use and the harness you use. With 24GB VRAM you can do quite a lot (even if some people may feel personally attacked when told that 5GB VRAM is more than enough and they don't need 3x 3090s).
I am happily running Qwen3.5-35B-A3B at ~30t/s (will be 3.6 now; it works very well) for nearly all of my dev work using Late, and it works flawlessly, often needing no guidance whatsoever.
A big issue is that tools like OpenCode, OpenClaw and Claude Code throw useless context and bloated prompts at your LLM, actively degrading its reasoning capabilities before you have even told it what you want. That is also why they push for bigger models: they cannot handle context and always assume you are using some big beefy cloud model.
binary@reddit
Why Late over Pi? https://pi.dev
mlhher@reddit
From a quick glance I cannot see the context separation Late provides.
I have not seen any agent do that so far; even if subagents are starting to pop up, no harness to my knowledge enforces them (yet). This is the major reason local agents fail so often: all these tools assume you are simply throwing some big cloud model at the problem.
Further, I am not sure what it uses as its default system prompt, so I can't judge it in this context, but compared to OpenCode's and Claude Code's obnoxiously large system prompts, Late's system prompt is ~1k tokens.
mouseofcatofschrodi@reddit
I hope to check out your tool soon. Looks very promising. How is the prompt processing when you use Late compared to others that use huge contexts for tasks? I guess it should help a lot? If yes, that's big.
Qwen3.5 (and now 3.6) 35B is already super capable; I tested it with Claude Code and it worked very well. But the prompt processing takes ages, rendering it useless.
mlhher@reddit
You've hit the exact points why I built Late!
The prompt processing delay in other tools is usually caused by two things: massive context bloat (10k token system prompts), and dynamic variable injection (like system time) that constantly breaks the LLM's KV cache, forcing re-computes.
Late fixes this by using a static, ~1k token prompt for the orchestrator, which caches perfectly in llama.cpp. For the actual coding, it spawns ephemeral subagents which get a fresh, empty context containing only the specific pieces they need to edit (and nothing more). Once they finish, the context is destroyed (ensuring your context window is not bloated by editing operations). The orchestrator never edits code and only acts as the lead architect, potentially also increasing its performance (it has only a single task now: architecting).
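In pseudo-Python, the idea looks roughly like this (an illustrative sketch of the concept, not Late's actual code; all names are made up):

```python
STATIC_ORCHESTRATOR_PROMPT = (
    "You are the lead architect. Plan and delegate; never edit code yourself."
)
STATIC_SUBAGENT_PROMPT = "You are an editor. Apply exactly the change you are given."

def run_subagent(task: str, files: dict) -> str:
    """Fresh, minimal context: only the task and the files it needs.

    The full edit history lives and dies here; only a short summary
    goes back to the orchestrator, keeping its window clean.
    """
    context = [STATIC_SUBAGENT_PROMPT, task]
    context += [f"### {path}\n{body}" for path, body in files.items()]
    # ... model call and file edits would happen here ...
    return f"done: {task} ({len(files)} file(s) touched)"

def orchestrator_context(history: list, user_msg: str) -> list:
    # The prompt prefix is byte-identical every turn, so llama.cpp can
    # reuse its KV cache. Injecting e.g. the current time here would break that.
    return [STATIC_ORCHESTRATOR_PROMPT, *history, user_msg]
```

The key property is that the orchestrator's prefix never changes between turns, while each subagent's context is built from scratch and thrown away.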
mouseofcatofschrodi@reddit
I'm testing it. Pretty impressed; you have solved local coding for me :)
I had Late changing the whole styling of a PWA app for around 40 min with qwen3.6 35b.
Some thoughts:
a) create plan
b) create the subagents
c) check the results
Could the architect be broken into 3, so that each one has 1/3 of the context? Or some other solution: after every step it writes a short summary of what happened, dies, and a new architect is born that picks up the torch of information? Or, in general, any other idea for compressing or solving the architect's long context.
3) Coding agents get a huge jump in effectiveness if they get feedback data (and access to the internet to search for documentation if they need it). Right now the agents did their tasks without getting any feedback. With a "task <--> check result" loop they could work until the task is really done. Not even the architect did this last part correctly: it ran some TypeScript tests, but couldn't run npm run dev and check the errors in the terminal, nor open the browser (Playwright) and visually check the frontend, or the errors (or console.logs, if it needs to debug) in the console.
If the subagents (or the architect) could have feedback from the terminal, browser console and (if the model is multimodal) screenshots, it would make them "smarter" and much more effective.
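Something like this loop is what I mean (a rough sketch; the `agent_step` callback standing in for a model call is hypothetical):

```python
import subprocess

def run_until_green(agent_step, check_cmd: list, max_iters: int = 5) -> bool:
    """Let the agent act, then feed build/test output back until it passes.

    `agent_step(feedback)` is a placeholder for one agent turn that edits
    code while seeing the last failure; `check_cmd` is e.g. a build or
    test command whose stdout/stderr becomes the next round's feedback.
    """
    feedback = ""
    for _ in range(max_iters):
        agent_step(feedback)  # agent edits code, seeing the last failure
        result = subprocess.run(check_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # task is really done
        feedback = result.stdout + result.stderr  # e.g. `npm run dev` errors
    return False
```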
After a plan is done, the implementation_plan.md should be deleted automatically (as should any other noisy files that aren't necessary anymore).
From the user perspective (macbook, terminal on VS Studio), I found that the "ctrl+g" and "tab" don't do anything. In order to start a new conversation with a fresh architect I had to close the terminal and open another one.
Thank you for your work and sharing it, it is amazing. That's why I'm taking also my time to write you this report. I hope it helps or brings some ideas, or at least the positive feedback gives you some motivation :)
mlhher@reddit
Thanks a lot for the kind words! Seeing someone push a model for 40 minutes validates the whole thesis. You actually get it.
It’s funny, as LocalLLaMA just deleted my post calling the repo 'AI slop', with inflammatory comments, because the mainstream is still completely obsessed with dumping everything into their shiny 10k token wrappers. Building for the few people who actually understand VRAM/context bottlenecks is an uphill battle, so getting a field report like this is huge and certainly helps with the motivation, thank you!
Regarding your suggestions:
1) Lifecycle
You are correct with how it currently works. I've explicitly decided to avoid killing the architect to avoid games of telephone. With the current architecture, if requirements shift midflight or new information is gained, the orchestrator knows exactly how to pivot. If we respawn it, I'm worried critical context will be lost (summarization is fine for the subagent's tasks, but likely not for the orchestrator). This would require benchmarking to prove out, though I'm open to exploring it (and any kind of architectural suggestions!) if it squeezes out more performance.
2) Feedback for agents
Both the orchestrator and the subagents actually do read terminal stdout and errors invisibly (I hide it to prevent UI bloat). If your npm build has errors (and they are printed to stdout) Late will definitely see them. For something not covered you can hook up an MCP server for e.g. browser logs. I have not implemented image support or similar since I do not personally use it. If people ask for it I might change it (always welcome for feature requests, suggestions etc.).
3) Artifact cleanup
Funny you mention this, because I originally had it clean up the implementation_plan.md in earlier testing. I found that sometimes (not often, but it does happen) after finishing an implementation you only realize far later that some minuscule deep detail is wrong. In that case it is significantly easier to just start another instance and tell Late "hey, check this part of the implementation plan and why it does not do x". I just put the file into .gitignore now. There should not be any other files generated (Qwen models seem to reliably clean up "waste", e.g. some foo_test.go they wrote to test an assumption). If you find any kind of lingering artifacts, please do report them!
4) UX shortcuts
ctrl+g (yes, that is Emacs muscle memory) only aborts the active subagent session but keeps the orchestrator going. I might change this; it depends entirely on how people perceive it. This is just the version I use daily, so it ships with my "personal" settings (an example being the "y,n" keys during tool validation being propagated; I did not mind that, but someone pointed it out and I fixed it).
And a minor note on the "fresh start". That is how Late is supposed to work. You start Late, prompt for one thing and then start another instance and prompt for something else. That is also why the worktree support is in there.
Thanks for the deep dive, highly appreciated and happy about all the reports and positive traction!
mouseofcatofschrodi@reddit
Hey, there are so many people in this sub that jump just to comment that some post is AI Slop... Funny that the community for AI hates AI being used... I think there is not a single post I wrote where nobody said my text was AI written (they never were, I even leave the texts uncorrected). Also don't get why would they delete your post, your tool is exactly the kind of project everyone interested in local ai should root for.
They also deleted a post of mine that didn't infringe any rules (nothing extraordinary, just a post asking about how the local world could look like in the next 5 years).
mlhher@reddit
You're welcome! Trying to see how far I can get it now before others start stealing the concept lol.
For the architect: I was thinking about the exact same thing quite a lot. I did not find any better way without sacrificing performance (going back to the telephone game), so I left it like this. A major point for me was not just making it faster but also proving I can make it smarter (again, as you said before, the agent is likely smarter with the same model simply due to the orchestration architecture).
I am very sensitive to bullshit and seeing all these bloated Python/JavaScript wrappers wasting code for shiny UIs is tingling all my senses :)
mouseofcatofschrodi@reddit
Hi! Here again. I was doing some more testing and found out that, though the sub-agents start super fast (almost no prompt processing), the architect always creates a lot of steps whatever the task is (15-20 is the average from my tests so far). That means it does as much work for easy tasks (that could be done in 1-3 steps) as it would for much more complex tasks.
As an example, I asked it to create a 3D human in a single HTML file. Left is what qwen 3.6 35b did with Late, right is what it did with opencode:
Late created around 20 subagents that worked super fast... but 20 times. It worked for around half an hour on my laptop. I was checking the plan.md and realized that sometimes sub-agents did things in the first steps that were planned for later. That means some sub-tasks were done at least twice.
Opencode also had an agent for planning and an agent for implementation. It worked for around 5 min and achieved a more detailed result in fewer steps, much faster.
Both are fully 3D and you can rotate them with the mouse. I tried the same prompt as a one-shot test (no extra agent harness) but the HTML had errors (it was not able to do it in one shot).
mlhher@reddit
These both look quite funny lol though impressive for where we are right now with local llms.
As for the plan: phases should be lean (e.g. 5 or so). Steps can easily be bigger; that is just natural, as the agent explicitly plans out what has to be done atomically. Though if steps explode (e.g. 20 steps but 5 phases), I would definitely read the implementation plan and challenge it before proceeding. The number of steps depends completely on the given task's complexity; I will come back to this, as your task specifically is a special case where I can easily see Late tripping more easily with dumber (relatively speaking) models. The agent is explicitly instructed to decompose steps to avoid ambiguity. If a task is complex (like 3D rigging) it will naturally have more steps, and if a prompt is ambiguous it will likely produce more tasks even if fewer are needed. I think this is simply a case where the model's intelligence (specifically its lack thereof, again see below) shows very clearly.
The harness generally expects the developer to read the implementation plan and verify it before accepting it. This is a trade-off that has to be done (if llms could do it from start to finish without any supervision AGI would be here already lol).
As for the task itself (the 3D limbs): I think this is simply a case where forcing decomposition might (or might not) hurt smaller models that do not have the "intelligence". I have never done 3D rigging or anything remotely similar, so I can only use my "internal knowledge" here. I also would not rule out that you simply got hit by probabilistic chance combined with Qwen's typical overthinking and its not reading the implementation plan. This is also why context pollution is so deadly; if the implementation plan was already ambiguous (or overly complicated) for whatever reason, this compounds very easily.
For any task you do I would highly suggest to actually read the implementation plan and potentially challenge it if necessary before proceeding.
Skeptic-AI-This-User@reddit
What would you use instead of OpenCode/OpenClaw/Claude?
mlhher@reddit
https://github.com/mlhher/late
It solves the issue by having ephemeral subagents do the "tedious" implementation tasks while keeping the main planner agent separate. When the planner wants to make an edit, it has to instruct a subagent to do it. That alone saves multiple thousands of tokens of pollution from the main planner agent in most cases; with ambiguous/complex goals this can easily reach tens of thousands of tokens saved.
When the subagent is done with its edit it just returns a minimal neat summary of what it did to the planner agent ensuring the context is kept lean and clean.
teraflop@reddit
The README says:
Sorry to be the bearer of bad news, but the sandboxing is extremely poor. It just checks the first word of the command line, and checks against a few fixed regexes for malicious patterns. This isn't even remotely sufficient to stop dangerous commands.
For example, a command like `wget http://www.google.com/` presents a permission prompt, but `echo foo; wget http://www.google.com/` is executed with no prompt at all.
mlhher@reddit
This was actually a great test for Qwen3.6 running under Late and it worked pretty well. I pushed the new commit if you are interested gladly check it out!
teraflop@reddit
Thank you for not getting defensive. Unfortunately, the sandboxing is still badly broken and the README is still very misleading. If Qwen3.6 wrote the new wording for you then it's clearly misunderstanding the problem.
For example, this command still gets through the filter:
`echo >(wget https://www.google.com/)`
I'm sure you could pretty easily paper over this particular example too, but there are many other possible ways to get malicious code execution. In particular, there are many ways for the model to silently write files to the current working directory, and then prompt the user to run a command that looks innocuous but actually does something malicious to escape the sandbox. That one isn't so easily fixed.
Again, the core of the issue is that you're relying on brittle shell parsing even though your README claims that's not what you're doing. If you want to use a brittle security mechanism, fine, but you should be honest about it.
Your code is relying on shell parsing to provide a sandbox. It has no other security mechanism and that's why it's so broken. Calling it a "transparent, trust-but-verify guardrail system" serves no purpose except to obscure what it's doing. (That is, you told Qwen to make your system secure, and all it did was try to persuade you of its security without actually fixing it.)
Calling this "late parsing" is meaningless. The parsing happens before bash sees the command line. Commands like "ls" and "grep" are not "known-safe".
There is nothing restricting the shell to the current working directory except for shell parsing, which is easily broken.
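To make the flaw concrete, here is a toy version of a first-word allowlist (my own illustration, not Late's actual code) and how trivially shell operators bypass it:

```python
# A naive "sandbox" that allowlists on the first word of the command line.
# bash executes the whole line, not just word one, so any shell operator
# after a safe word sneaks arbitrary commands past the check.

SAFE_FIRST_WORDS = {"ls", "cat", "grep", "echo"}

def naive_is_safe(cmd: str) -> bool:
    """Return True if the first word is allowlisted (the broken approach)."""
    words = cmd.split()
    return bool(words) and words[0] in SAFE_FIRST_WORDS

# "wget ..." alone is flagged, but prefixing any safe word defeats the check:
assert naive_is_safe("wget http://www.google.com/") is False
assert naive_is_safe("echo foo; wget http://www.google.com/") is True  # bypass!
```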
mlhher@reddit
No worries, getting defensive is usually a sign of a lost cause regardless of which party.
You're right, and rather than dealing with individual cases I've addressed the root cause:
- The auto-approve whitelist has been trimmed to actual read-only commands only (removing e.g. echo and find)
- Any command containing shell metacharacters now requires confirmation, regardless of the base command.
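In sketch form, the new rule is roughly this (an illustration of the intent described above, not the exact shipped code):

```python
import re

# Commands that only read state; everything else needs user confirmation.
READ_ONLY = {"ls", "cat", "grep", "head", "tail"}

# Shell metacharacters: compound commands, substitutions, redirects, etc.
METACHARS = re.compile(r"[;&|<>$`()\n]")

def needs_confirmation(cmd: str) -> bool:
    """Auto-approve only plain read-only commands; confirm everything else."""
    if METACHARS.search(cmd):
        return True  # any operator forces a prompt, regardless of base command
    words = cmd.split()
    return not words or words[0] not in READ_ONLY
```

With this rule, `ls -la` is auto-approved, while `echo foo; wget ...` and `echo >(wget ...)` both trip the metacharacter check.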
As for the README, I removed all "jailing" and "locking" language. The section now explicitly states that it is a convenience heuristic and not a security sandbox.
Appreciate you pushing on this, and again no worries.
And a funny side note (don't worry, I am not trying to use this as a security argument lol just thought it was quite funny): I have asked Qwen five times to run your command and it actually refused every single time telling me it looks very suspicious.
mlhher@reddit
No need to be sorry that is a real and important point to make!
I did grapple with this initially, using e.g. bwrap, seccomp or similar features, but then explicitly decided against them. This was done for multiple reasons, including ensuring that it does not rely on Linux-only features.
Further, since this is not an "OpenClaw"-like harness, I personally deemed the attack surface to be rather small (for me personally); the cwd is still locked, and while you can break out of it with tricks, we are handling common cases, and models will not just maliciously inject weird shell tricks without external instructions (going back to OpenClaw).
My idea was: if other people start using Late and think this is a genuine problem, we might move to a genuine shell parser (I have already checked some packages). That is for the future, though.
Thanks for the feedback though highly appreciated and actually happy someone reads the code lol
teraflop@reddit
If that's the tradeoff you're happy with for your own use, then fair enough. But I think it's really irresponsible to advertise "Sandboxed Execution" in your README when the sandbox is so trivially broken.
You should remove that section or at least dramatically soften the wording, to make it clear that the LLM can run arbitrary commands without trying very hard.
mlhher@reddit
I completely agree with you!
I just pushed an update changing the section clarifying that it’s not a true sandbox, but simply a heuristic fast-path to speed up safe reads, while relying entirely on the user as the final security gate. (I also pushed a commit parsing compound operators like && and ; based on your example).
Thanks for holding my feet to the fire on the wording here and thanks for the non-inflammatory wording lol. I am used to worse :)
Skeptic-AI-This-User@reddit
Haha, hadn’t even realized! Does RAG factor at all in your setup?
mlhher@reddit
There is no RAG pipeline whatsoever. The main agent acquires all context it requires before writing an implementation plan. If the subagent needs something that was not provided to it directly by the orchestrator agent it will go ahead and fetch it itself. The subagents are exclusively to do the "implementation" parts ensuring the orchestrator can focus on architecting (and adjusting along the way if needed). Since the context windows are isolated this ensures the orchestrators context never gets polluted.
The tool does have MCP support if you need it though.
And no worries lol always happy to help!
NexusMT@reddit
this sums up pretty much my experience. it runs very well with 64k context; the problem is that when using opencode it constantly iterates over the prompt history and sucks up the 64k like it’s nothing. I’m yet to try some plugins on opencode to reduce bloat, but I might need to look for an alternative. Late might be what I’m looking for.
ScipyDipyDoo@reddit
What can you do on a single 5090?
Budget-Juggernaut-68@reddit
Sonnet?
None.
Next.
PressTilde@reddit
If you want the best of what AI can provide, you have to pay for it.
My brutally honest answer is to do both. Pay for the big AI, and use your 4090 with smaller models. I wouldn’t bother with a mac. The money spent on that would cover years and years of the highest value coding plan/agent and you’ll get substantially more done by using faster/better models in the meantime.
The 4090 runs things like qwen 35b a3b or the Gemma 31b and 26b models at speed and remarkably well, and it’s likely that same card will run models on par with opus 4.6 in the future. The level of advancement has been insane. Since I first bought my 4090 we’ve gone from barely running 8b llama models at 4k context to a 35b sitting here at 256k context, churning away at hundreds of tokens per second.
Also, don’t sleep on the ability to run multiple small agents with that card.
wewerecreaturres@reddit (OP)
Need a new Mac either way, fuckin Xcode, so it’s a matter of which model on that front.
I think what I’ve learned in this thread is that if I want to tinker, I most definitely can, but that’s about it for now - at least for what I need from it.
Thanks for sharing!
PressTilde@reddit
I have my “old” iMac retina for Xcode shenanigans. I getcha though. Makes sense.
Local LLM is all about the tinkering right now. :)
Your 4090 will likely be a sweet spot for messing around for years.
mouseofcatofschrodi@reddit
For me, the meeting-recording function from AnythingLLM is the first local use that is really useful. Every week I have around 1-3 meetings. With the tool I can:
- record the meetings and get the transcript (that's already great in itself)
- get a summary with all the important stuff, action points and so on
- chat with the meeting. Especially useful for meetings that contained a lot of info (for example technical data; I can then query the meeting and get a tutorial out of it)
As soon as prompt processing gets faster, I could imagine doing a lot of agentic stuff to spare Codex tokens for the difficult stuff
Long_comment_san@reddit
A 4090 with 48GB seems absurdly good, but it's a lottery.
Jeidoz@reddit
A 4090 with 48GB is probably a Chinese modification. For most "western" markets the common RTX 4090 has only 24GB of VRAM, and I suppose OP's GPU has 24GB too.
Long_comment_san@reddit
Yeah, it's a Chinese item. But I've seen somebody here say that it works just fine. It's a questionable thing, but you do get double the VRAM per $ relative to something like an RTX 5000.
Annual_Award1260@reddit
With LLMs it is essentially impossible to compete with the latest models like Claude Opus. But you can do all sorts of neural network training of your own.
Suspicious_Body50@reddit
The 4090 makes local AI genuinely useful, but it doesn’t make Sonnet/Opus obsolete. After a day of heavy benchmarking and a local overhaul, here is my blunt assessment:
If the model fits in 24GB VRAM, the RTX 4090 is almost always the superior raw-speed option. The Mac's appeal is capacity: you can fit much larger models locally, but that doesn’t magically grant them frontier-model reasoning. It’s more "I can run bigger stuff" rather than "this beats the cloud."
Local is king for privacy, offline use, and heavy lifting (summarizing massive logs/files). However, for high-end coding quality and reliability, I still wouldn't trust a local weights model over the top-tier cloud providers.
RTX 4090: Best value for pure speed-per-dollar in local AI.
M5 Max 128GB: Best if you need massive local capacity on a mobile workstation.
I personally went a different route and have a 48GB VRAM NVLink setup with two 3090 FEs and custom cooling. If you want to run the biggest models with the most context on consumer hardware, this is the only answer right now.
The 4090 and the 5090 are faster than the 3090 and are better for models that fit on one card, but for anything that spans past 24GB of VRAM the dual 3090s are king.
Still the best answer if your goal is the highest quality output, not just local ownership. If money isn’t the bottleneck, keep the subscriptions regardless. Buy the Mac if you want the ecosystem and capacity; stick with the 4090 if you want the fastest local inference possible.
wewerecreaturres@reddit (OP)
Out of curiosity, why 2x 3090 instead of 2x 4090? Or 2x 5090?
Not accounting for cost or availability, logic would say 2x 4090 is better than 2x 3090
Suspicious_Body50@reddit
Because 3090 is the last consumer card that supports NVLink which allows you to connect the cards or pool their VRAM together, bypassing the motherboard for communication. This speeds up data transfer between GPUs significantly.
Since newer cards like the 4090 or 5090 lack physical NVLink bridges, they have to rely on the PCIe bus. For LLM inference and complex AI workloads, that 112 GB/s interconnect on the 3090s prevents the massive performance hit you'd otherwise get when your model is split across two cards.
samandiriel@reddit
Are you using yours as a code assistant, planning assistant, or document coauthor? We are pondering whether the NVLink would be worth it for our dual-3090 setup. The high price point gives us pause, given that our reading says the use cases tend to be things like training or massive continuous loads on 70b models.
Suspicious_Body50@reddit
Before purchasing the NVLink I came across the same information about how it was only good for training or massive loads, but I can tell you first-hand that the NVLink is useful for pooling your VRAM. I don't know why the information is so heavily skewed against using it, but it makes a world of difference when using large models. Without it, the bottleneck of moving data across the PCIe lanes often leads to noticeable 'stutter' during inference. With NVLink, the communication between the cards is so much faster that the 48GB of VRAM feels like one fluid pool, making 70B models feel snappy rather than sluggish.
samandiriel@reddit
Do you have any metrics you could share? I'd love to see some hard stats if you happen to have any or could easy-peasy run some! It would def help us make our own decision.
wewerecreaturres@reddit (OP)
Thank you for the explanation!
Suspicious_Body50@reddit
I'm actually configuring my computer to take on tasks as a sub-agent of Claude Code and Codex right now, and will post that information in this forum when I have successfully reduced my token usage by outsourcing a lot of the work to my own computer... I just wanted to give you an idea of what is possible and what I am personally exploring.
Special_Animal2049@reddit
Brutally honest, huh.. are you prompting Reddit?
wewerecreaturres@reddit (OP)
Sometimes a little prompt engineering is required to get the truth
Due-Function-4877@reddit
I'll bite. Without an expensive server rig for huge open models, the smaller local options (that you can run on a single consumer card) can't do much of anything that's novel unless you know what you're doing and you can already do it yourself, because you will be mostly doing it yourself... I still find myself feeling sorry I tried using an agent at all... I think it's a bad intern with no common sense. It still helps out, though.
I like the option for autocomplete. It's also nice to let the agent handle boilerplate tasks and scripts. Beyond that, things get dicey fast.
wewerecreaturres@reddit (OP)
Ah yes, the hard truth has arrived. Thank you!
substandard-tech@reddit
If you make money with it, pay money for the good stuff.
Local coding is a hobby at this point
Turbulent_Pin7635@reddit
4090 - Porn
4x 4090 - local LLMs as good as top tier @ 8 months ago
Mashic@reddit
Just try it, get llama.cpp/lm studio and try qwen3.6 35B, qwen3.5 27B, Gemma 4 31B and 26B and see if they can do the tasks you require or not.
breqa@reddit
On getting Qwen3.5-35B-A3B at ~30t/s: for some reason I’m getting 5-10t/s if I use it with vscode+copilot chat
createthiscom@reddit
If you wait for ram prices to come down you can do 1 tb of ddr5 and a blackwell 6000 pro. It’s still technically consumer hardware. Just high end.
SweetHomeAbalama0@reddit
The limit of what can realistically be achieved on consumer hardware is determined by two primary variables: 1) the inquirer's comfort working with technology, 2) the inquirer's budget.
What can be achieved on a 4090 alone is... not quite Sonnet-level capability, but it can still do some very cool things. You can have a lot of fun with some ~32b LLMs and image/video generation; definitely some potential there.
That said, Kimi K2.5/GLM 5.1/Deepseek-tier models can in many ways be comparable to the big closed models, coding quality included. Not quite 1:1, but I think for most people's uses, "close enough" is an apt description. Getting them running on consumer hardware is achievable with the right approach (we're talking up to 1T parameters), albeit a technical challenge to overcome.
I usually rotate between Kimi K2.5 and Deepseek V3.2 and use them pretty much daily on a 256GB VRAM AI server (8x3090 + 2x5090). I find myself using Gemini less and less every day, and I never need to use ChatGPT. Output quality is rarely if ever an issue, and speed, at least for me, isn't an issue; most "issues" we run into come down to user error with using the appropriate chat template and providing the proper prompt/context to get the desired output.
AutomataManifold@reddit
Go to one of the online providers that lets you call any model or rent a cloud server, pick the model you're interested in, try it out. For $10 you can get direct experience with the approximate speed and quality you can expect to get. (Harder to approximate the macs, but you can at least try the different models)
-dysangel-@reddit
An M5 Max will run this model at a speed similar to my M3 Ultra. I think you'd get even faster prompt processing, slightly slower inference.
https://www.reddit.com/r/MacStudio/comments/1sjktdh/minimax_27_running_subagents_locally/
https://www.reddit.com/r/LocalLLaMA/comments/1sk70ph/local_minimax_m27_gta_benchmark/
defective@reddit
What can realistically be achieved on consumer hardware pales in comparison to what you get if you subscribe. You might find some things that a local AI can do almost as well as the big boys, but if those things aren't the only things you do, then you'll need a sub for the rest, so you almost might as well use your sub for everything.
Since you are even entertaining the idea that maybe you should just pay for cloud AIs to do everything, I assume that you aren't interested at all in privacy, and are fine with handing every question and discussion you ever have with a model to the companies which are now MOST equipped to analyze the absolute shit out of every thought you have and infer shockingly accurate things about you (or imagine crazily wrong things about you) and your life and store them forever and provide them to any government who asks or any hacker/scammer who can get a human to click on a phishing email.
Therefore the only thing that I can think of that one in your situation would use local AI for is for running a derestricted/hereticized/refusal-suppressed model so you can get some prompts by it that the fat cats won't allow their models to answer.
wewerecreaturres@reddit (OP)
It’s a bit of both! Is privacy great? Absolutely. Is it worth it to me for substandard results? Less so.
I’d also really love to learn more about setting up and running locally
defective@reddit
Well there's always that tradeoff then. So I suppose that your ultimate solution is close to mine -- I have local consumer stuff that I always try to use first, when possible, and I spill over into subscriptions stuff when it just can't handle it. This minimizes somewhat my privacy risk and entertains me.
I think it's worth it to keep abreast of the consumer-runnable models, and methods of hosting them/using them. They are always getting better, and you never know when all available cloud AI companies will make some policy decision that excludes you from using them -- maybe they will make them too expensive or do something so immoral that you can't stand them. Always good to have an idea of what open-source is capable of in case you're faced with a tough choice.
Local models on consumer hardware are definitely useful. You can check out https://swe-rebench.com/ and see that Gemma4 and some Qwens that can fit in your 4090 are actually somewhat competitive with online models. (Those can both run acceptably on CPU/RAM too, as they are MoEs.)
One of the biggest differentiators usually between local and online stuff can be online research. If you give your local stuff access to Searxng so that it can search the web, it can ground itself and perform even better, and look up specific things. You'd have to do some experimentation, but I'm happy with what I have gotten local models to do, and it's still getting better (for now).
MurmurRunner@reddit
What do you have lying around? Computer and phone wise?
wewerecreaturres@reddit (OP)
Lying around in what sense? Daily drivers? Extra hardware?
iPhone 17 Pro Max
PC 1: 7800x3d + 4090
PC 2: 5800x3d + 4070
MBP M1 being replaced with M5, model TBD
MurmurRunner@reddit
Have you tried EXO?
wewerecreaturres@reddit (OP)
Haven’t tried anything yet! Wanting to feel out the community’s thoughts so I don’t waste my time
MurmurRunner@reddit
Well if it doesn't work let me know. I'm working on a side project that allows distributed training and inference.
FinnE145@reddit
Look up some of the websites that offer cloud inference for open source models and try it out. They will usually have some sort of free trial. Look up which models are best at each vram amount and give it a try with your own use case to see if they are capable enough.
Adorable_Weakness_39@reddit
Macbooks are great for MoE models afaik. The Qwen3.6 MoE just came out, looks impressive from the benchmarks, and is ~20GB
brickout@reddit
Tinkering. And memory compression tricks will make it more capable in the future.
If buying unified RAM, I'd hold out for 256 or 512GB. Just play with your 4090 for now and maybe add a used 3090/4090 for funsies.
AdamDhahabi@reddit
I really like Qwen 3.5 122b MoE and would tell anyone to invest $5K to run it fast, locally. But you already have your 4090 which can run Qwen 3.5 27b dense quite fine. So not worth it for you to splurge on a M5 Max 128GB.
You could go low budget or big budget from here. A Frankenstein build by adding 2x 3090 or go for a single RTX Pro 6000 + your 4090. You could run Minimax M2.7 at fast speed, which you can't do with your current setup.
Luke2642@reddit
I've been thinking about this too, looking at:
https://omlx.ai/benchmarks?chip=M5&chip_full=&model=&quantization=4bit&context=65536&pp_min=&tg_min=
Gemma 4 26B runs quite acceptably, ~40 tok/s with 64k context, but obviously it's still going to be significantly lower-IQ than Claude Sonnet is today. It's not only brains but speed you need to consider.
I'd speculate that the big models, GLM, Kimi, Deepseek, and the smaller Minimax, are all going to release either smaller or smaller and smarter models as time goes on.
Gemma 4 has shown what is possible with only 3B active, 26B total. Advancements like this will compound: dozens of papers are published per day, and you only need a couple per month compounding.
It doesn't seem unlikely that we'll have an Opus 4.6-strength model running locally by Christmas as a ~100B MoE with ~10B active parameters. If that's true, then the M5 128GB is worth it?
I don't have such deep pockets, but I'm thinking an M1 Max 64gb at around £1000 is good value.
cride20@reddit
To be fair, a lot. I have a notebook with 48GB of RAM and I'm running the 122b-a10b qwen model for making prototypes of my ideas. I built a minimalistic agentic framework for it, and it does the whole project setup, installs everything to make sure it works, and tests whether it builds and runs how it's supposed to. Yes, it takes a lot of time since I'm using CPU inference at this point, but damn, it's so easy. I have an idea and just feed it to the AI, and after an hour I have a working POC (Qwen3.5-122b-a10b-UD_Q2_K_XL)
Velocita84@reddit
Miriel_z@reddit
4090 gives you 24gb vram. Should be sufficient for 30B models finetuned for coding. Do models research, try, switch if unhappy.