This is where we are right now, LocalLLaMA
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 411 comments
the future is now
ArtdesignImagination@reddit
ok and?
Low-Opening25@reddit
this is false. Running an LLM on a laptop, the battery would be done in 30 mins.
Pleasant-Shallot-707@reddit
They have plugs on airplanes now
Low-Opening25@reddit
on some
Pleasant-Shallot-707@reddit
And there’s no reason to think he doesn’t have a plug available to him
Low-Opening25@reddit
ok stickler
Pleasant-Shallot-707@reddit
lol just accept your comment provided zero value and learn from this
N1AK@reddit
Plenty of flights have power these days, and decent internet; I've run Qwen 3.6 33B A3B on a MacBook Pro for over 90 mins on battery. I used to do work flights overnight and try to sleep; now I tend to do daytime flights, as I've got decent internet and power so can work pretty much as normal.
Dry_Yam_4597@reddit
Cool cool.
But this type of dramatic writing.
Is super annoying.
It's as if the writer wants to share something dramatic.
They can just calm their tits down.
yaosio@reddit
This morning I fired up my laptop and started my LLM.
All I did was tell it to make a world changing app.
I didn't make a harness.
I didn't tell it how to do it.
I didn't tell it how to change the world.
It made an app of a purple monkey that sits on your desktop.
Friendly, fast, purple.
If you're not already doing this you're left behind.
nomickti@reddit
Purple. Monkey. Dishwasher.
Tommy3Tanks@reddit
Sounds like a strong password.
iamapizza@reddit
Interior. Crocodile Alligator?
No, I drive a Chevrolet.
Movie theater.
nomickti@reddit
I myself Mr. Gerbik. half-shark, half-man, skin like alligator.
Carrying a dead walrus. check it.
BadgerOfDoom99@reddit
Dangling in space.
A thousand lives yet unlived.
Hairy Donkey balls.
marastinoc@reddit
Bears.
Beets.
Battlestar Galactica.
Orson_Welles@reddit
It’s not just a dishwasher. It’s a dishwisher.
chris415@reddit
no, dishwursher
rasmadrak@reddit
iWash... It just wash.
ovrlrd1377@reddit
uWish
vabello@reddit
Man
Woman
Person
Camera
TV
Covfefe
octoo01@reddit
Banzi Buddy?
KindaSortaGood@reddit
That shit was straight malware
randylush@reddit
I call this. The second revolution of AI.
Bonzupii@reddit
GitHub repo to the purple monkey app?
You had me at "Friendly, fast, purple."
I know this is probably 100% a joke.
But I'm holding out hope...
That the purple monkey app is real.
Please don't let me down.
yaosio@reddit
You won't believe this. My LLM? It isn't just intelligence.
It's time travel.
It went back in
time
and made the purple monkey
https://en.wikipedia.org/wiki/BonziBuddy
Bonzupii@reddit
Brb making a BonzupiiBuddy
Larimus89@reddit
What model you find best?
Poromenos@reddit
This isn't humans. This isn't LLMs. This is Claude writing.
Ok_Scientist_8803@reddit
Reminds me of:
This is not just food.
This is M&S food
gnnr25@reddit
But what if this is
is the next
e = mc^(2) + AI
The possibilities are memefull
qzrz@reddit
When the formula was invented there was no AI, so for them AI = 0. That's why it works!
beryugyo619@reddit
it's still funny how it just implies one or more of A and I equals 0
the_ai_wizard@reddit
you should post this shit to LinkedIn and watch the marketers run with this
FujiKeynote@reddit
(most people miss this)
taurusApart@reddit
It's LinkedIn Speak.
Which is essentially 4chan greentext.
But for insufferable corporate assholes.
seanmacproductions@reddit
Best take I’ve seen all year
SkyFeistyLlama8@reddit
LinkedIn being Corporate 4chan is.
TheSEOVicc@reddit
Yeah it’s better to connect sentences more and not go full gooroo spam mode on LI
Downtown-Key9504@reddit
Real. Got me wanting to say “ With a soul-stirring sigh of finality, I surrendered a titan of earthly burden to the porcelain abyss, severing the heavy chains of internal discord. A crystalline clarity now permeates my being, as if the very stars have realigned to honor this profound and hallowed evacuation of the spirit. “
Every morning
KarmaBitesDogma@reddit
This is the most amusing post I’ve read on any sub in the last year, and, of equal importance, I’m hoping to hell that an actual Homo sapiens has crafted it.
BlobbyMcBlobber@reddit
It's just a LinkedIn hype post. Not only is the styling annoying, but the conclusion is also way too exaggerated. This is not the second AI revolution or whatever. 99.999% of people are not even interested in running local AI or buying the hardware for it. I love local AI but it's a tiny niche and will probably stay that way for a long time.
txgsync@reddit
I hate that you are right. Even on an M5 Max, waiting minutes on prefill is nowhere near where it needs to be.
I am noodling on a Pi approach that focuses intensely on re-using KV cache prefixes to try to keep prefill times reasonable on large contexts. But it’s like… sure, I could do that. Or throw $0.30 at Opus and it will finish that job in seconds instead of minutes.
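The core of that idea is just measuring how much of the already-cached prompt a new request shares, so only the tail needs prefill. A minimal sketch of the bookkeeping (illustrative names only; a real backend would have to map this onto its own KV-cache slots):

```python
# Minimal sketch of KV-cache prefix reuse: count how many leading tokens a
# new prompt shares with the previously cached prompt, so only the
# non-matching tail has to be prefilled again.
def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

if __name__ == "__main__":
    cached = [101, 7, 7, 42, 9, 13]    # tokens prefilled on the last turn
    new = [101, 7, 7, 42, 5, 6, 8]     # next turn: same prefix, new tail
    keep = shared_prefix_len(cached, new)
    print(f"reuse {keep} cached positions, prefill only {len(new) - keep} tokens")
```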
conscientious_obj@reddit
Also, once you think about his argument it kind of falls flat. Allow me to doubt my own headstart: I built a Frankenstein setup after weeks of experimenting with llama.cpp, vllm, and mlx, finally deciding to go with llama.cpp plus some bleeding-edge turboquant PR that works reasonably well when it doesn't crash with my Pi terminal app.
Any person joining open models when they are more mature will be able to nullify my advantage in 1 day without wasting the time and money I am currently pouring into this hobby.
WhichWall3719@reddit
It's infecting newspaper writing too, absolutely insufferable trying to read about some event that happened locally when it's being
spread out.
On five lines.
With half of a detail or location.
On each line.
mutexsprinkles@reddit
Tabloids have been a sentence per paragraph forever, that's not a new thing. It's because their audience is not very good at reading or thinking.
LinkedIn dramaspam is different because their audience is...oh.
Ok_Scientist_8803@reddit
This feels like:
A linkedin/instagram post.
I opened a LLM on my MacBook.
It coded me a whole fizzbuzz program in python.
It would have taken my employees £500.
This cost me £0.
Embrace local LLMs or be left behind.
This is the era of AI
"Comment AI and I'll send you an instruction pack to start a 7 figure business"
Shawnj2@reddit
All of these people are probably just Mac minis running openclaw with the Claude api in a tech bro’s basement
Dry_Yam_4597@reddit
"so much compute", as karpathy would say, used for no good other than cringy spam.
Jungle_Llama@reddit
Yup, Monkey Tennis
Foreign_Risk_2031@reddit
His entire life depends on what he says becoming true
Dry_Yam_4597@reddit
How else will he get promoted.
How else can he sacrifice his family for long meetings.
How else can he make those presentations that no one will care about in two days from now.
How else can he prove he's worth it.
If not by pleasing others like him.
Craving for attention.
or something :P gosh these people...
Foreign_Risk_2031@reddit
He’s the CTO of hugging face
Dry_Yam_4597@reddit
He may be.
But there is always a bigger title to get.
A better job to chase.
here_n_dere@reddit
True fact: his profile says open to work 😅
McSendo@reddit
I remember one of their "free" courses would ask you to connect to HF's API (it has a free quota and will ask you to put credit in), when you could've just loaded the 100MB model on your LOCAL machine.
They need to make their money i guess.
Torodaddy@reddit
100mb model? Is that the intelligence of a labradoodle?
last_llm_standing@reddit
I know the dude personally, he's a fine lad and actually engaging
Dry_Yam_4597@reddit
I apologize then, nothing personal.
But man that writing style.
Frank_Lamingo@reddit
let's be honest here. he didn't write this
saposmak@reddit
Be that as it may:
If you care about your career, don't stop reading.
I'm imparting wisdom right now.
I've discovered a higher truth, and you need to listen.
This is farts running on a MacBook Pro, full blast.
The future is leaving you behind.
I'm CTO, I know what I'm talking about.
victorsmonster@reddit
it's giving r/LinkedInLunatics
Hot_Growth_9643@reddit
lol that’s the sole reason he used his ai on the plane to write this bollocks!
arekkushisu@reddit
The James Clear way of writing.. lol
Whyme-__-@reddit
I don’t know if you remember, but it’s better than how people on Quora used to write.
Question: “what’s your favorite color” Answer: “So I was in elementary school but now I’m 45 years old and I found this color palette on the floor bla bla bla….”
kbderrr@reddit
Yeah, the whole post feels like it was vibed and I think bro thought he was posting on LinkedIn.
LocoLanguageModel@reddit
People often ask me if it's possible to run something powerful locally.
I always tell them the same thing.
How dare you speak to me.
Andrew_hl2@reddit
I fucking hate Threads on AI stuff... every other post is like this.
Endflux@reddit
I do love me some Haikus
unculturedperl@reddit
They forgot to bold random sentences to truly convey the impact.
Ok_Study3236@reddit
Twitter was already bad, but then they started paying everyone for tweets; it's officially a radioactive toilet now
Zeeplankton@reddit
This is how everyone on Xshitter writes it's so annoying
cms2307@reddit
Yeah this seems to be some type of mental disorder that affects Twitter users, especially right wingers
Evening_Ad6637@reddit
No it’s not especially right wingers. It feels like almost everyone on twitter has this disorder.
The final and most deadly stage of this disease is when you start every fucking post with:
"🚨 BREAKING … "
Geez..
Icy_Distribution_361@reddit
Totally. But it gets followers probably. Don’t ask which class though
ttkciar@reddit
Setting people's expectations too high is going to cause backlash, when first-time users fire up Qwen3.6-27B and it falls far short of Sonnet, let alone Opus.
Qwen3.6-27B is really good for its size, and certainly good enough for agentic code-gen for most people/use-cases, but Chaumond is overstating its abilities by rather a lot.
Turtlesaur@reddit
On the plus side, because of this post I learned Pi coding agent doesn't have anything to do with running something in a raspberry Pi
WhoTookPlasticJesus@reddit
That's what I assumed until I read your post.
Why on Earth would you name anything computing-related "pi"?
Regular_Working6492@reddit
It was originally called shitty-coding-agent and the author Mario Zechner only grudgingly renamed it to something more serious.
4onen@reddit
If I remember correctly, the author was going for an un-googleable name on purpose, so he wouldn't have to deal with as many issues if people did actually pick it up, because people couldn't find the repo.
gnurcl@reddit
Same
watergoesdownhill@reddit
Yeah, at least some value came out of that post.
CraftedCalm@reddit
Wait, it’s not?
amejin@reddit
Thank you.
Additional-Acadia954@reddit
I was annoyed when I understood that too
DavethegraveHunter@reddit
That’s what I assumed it was, too…
BloodyShirt@reddit
These people must be getting paid by Apple.. I tried a few Qwen models on my 128GB M5 Max hoping for something sort of reminiscent of Sonnet (I didn't even dare think of Opus) and it spent 20 minutes delivering nothing useful at all, just acting as a small space heater and firing off new tools. Maybe I'm the outlier using this stuff for coding and large mature repos, but I don't think there's an option out there, even at a few hundred grand, that gets me locally AI-enabled the way I've become accustomed to with cloud-based infra.
dtdisapointingresult@reddit
There's no "a few Qwen models", it's just 27B that's great, MAYBE 122B/397B (I didn't try those).
If you are using 35B A3B, I don't care what the benchmarks say, it's nowhere close to 27B. People like it because it's fast, but that speed doesn't take into account the fact that it relies a lot on trial and error due to not having the intelligence to approach the problem properly. 27B may be slower but it's going to make smarter decisions, and the session length might end up the same.
Even with 27B, you gotta maximize your odds of success by meeting it half-way, it's a small local model after all.
I'm also currently looking into doing more extreme things, like a search-and-replace proxy to patch out bloat from Claude Code's system prompt. I mean, look at this shit: https://raw.githubusercontent.com/asgeirtj/system_prompts_leaks/refs/heads/main/Anthropic/claude-code.md I bet I can shave off 5k toks of pointless guardrails nonsense and tools I know I'll never use, like GitHub. This should slightly improve accuracy on everything due to better attention.
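A rough sketch of the kind of proxy I mean, assuming an OpenAI-compatible chat endpoint; the upstream URL, port, and the strings to strip are placeholders, and it ignores streaming:

```python
# Hypothetical search-and-replace proxy: strip known boilerplate from the
# system prompt, then forward the request unchanged to the real endpoint.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080/v1/chat/completions"  # placeholder backend
BLOAT = [
    "# Committing changes with git",   # placeholder substrings to remove
    "# Creating pull requests",
]

class TrimProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        for msg in body.get("messages", []):
            if msg.get("role") == "system" and isinstance(msg.get("content"), str):
                for chunk in BLOAT:
                    msg["content"] = msg["content"].replace(chunk, "")
        data = json.dumps(body).encode()
        req = urllib.request.Request(UPSTREAM, data=data,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:  # error handling omitted
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9000), TrimProxy).serve_forever()
```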
ClintonKilldepstein@reddit
I did try 122B and 397B. 397B is worth the download, 122B was not. Qwen3.6-35B is excellent for some modes like Orchestrator, Architect & Ask tasks. I save 27B strictly for coding and debug.
my_name_isnt_clever@reddit
Have you actually tried the 3.6 35b? I've replaced 3.5 122b with it, it's that good.
dtdisapointingresult@reddit
I did for a couple of nights. It could do the basics well but failed to impress me. I think my issue is that I use this tool like a pair programmer, interactively, where I'm looking at what it's doing in real time. I don't just leave it running overnight and come back to something that works. So I notice when it's completely off-base.
To give you an example, I had a launcher script that was giving a JSON error. It's a bash script that calls docker, which runs 'bash -c "command -arg1 -arg2 ..."'. It was failing on the JSON I pass as arg2, with a 'json.loads' error from the app inside Docker.
Well, 35B didn't even consider quote escaping. It just kept throwing shit at the wall, making random fixes. It even started reading the source code of the app inside the Docker container.
This level of intelligence is a dealbreaker for me. I don't care about results if they're produced this way, even if they end up being correct after exhausting every other option. It cannot possibly lead to maintainable code.
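For what it's worth, the fix it never considered is a one-liner: shell-quote the JSON before it goes through bash -c. A hypothetical illustration (command and image names made up):

```python
# Hypothetical illustration of the quote-escaping fix: shell-quote the JSON
# so it survives the `bash -c` layer intact and reaches the app as one argv.
import json
import shlex

payload = json.dumps({"mode": "demo", "items": ["a", "b"]})         # the arg2 JSON
inner = f"some-command -arg1 foo -arg2 {shlex.quote(payload)}"      # made-up command
cmd = ["docker", "run", "--rm", "some-image", "bash", "-c", inner]  # made-up image
print(cmd)  # inspect first; run with subprocess.run(cmd) once it looks right
```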
my_name_isnt_clever@reddit
That's very different from my experience, what do you use as the coding scaffolding?
I set it up at the same time as hermes-agent, and I haven't had to use any other models so far. I have it do tasks for research and maintaining a LLM wiki, managing a Minecraft server through tmux, coding in python and TS for LLM tooling, and coding in Nix for my NixOS setup. It handles all these tasks cleanly with minimal issues. I've had it get stuck in a loop maybe three times since it dropped.
dtdisapointingresult@reddit
I mostly use Qwen Code, but I try to use Claude Code about 20% of the time to have something to compare to (same model on both). I'm reading up on Pi right now, it might end up being my primary.
Currently most of my local tasks are some form of LLM tooling. For example doing tests, getting a model to run with certain parameters, trying to get cool apps that don't work on ARM (my DGX Spark) to build.
I think any language task like maintaining a wiki should be considered easy for any model. Managing a Minecraft server would depend on the sort of work involved: starting/stopping services and following easy docs should pose no issues.
But if you say it's doing good at coding in Python and TS, then that surprises me. Maybe Ubuntu + bash + docker is a harder task than I give it credit for.
my_name_isnt_clever@reddit
This is what I'll say, compared to Qwen 3.5 122b it's just as capable at agentic tasks, but it's not as intuitive with the unexpected. It usually does a great job but sometimes needs a nudge in the right direction more than larger models. It's worth it for the speed IMO, but we will see how I feel about 3.6 122b.
I'm experimenting with having the local agent delegate planning to a cloud frontier model with deeper thinking, then the local agent implements from there. That pattern seems like a great middle ground so far.
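In rough code terms, the split I'm playing with looks something like this; both endpoints are OpenAI-compatible, and every URL, key, and model name here is a placeholder:

```python
# Sketch of the "cloud planner, local implementer" pattern: a frontier model
# writes the plan, the local model does the implementation work.
from openai import OpenAI

planner = OpenAI(base_url="https://api.example.com/v1", api_key="sk-...")  # cloud endpoint (placeholder)
worker = OpenAI(base_url="http://localhost:8080/v1", api_key="local")      # local server (placeholder)

task = "Add a /healthz endpoint to the Flask app in app.py"

plan = planner.chat.completions.create(
    model="frontier-model",  # placeholder name
    messages=[{"role": "user",
               "content": f"Write a short numbered implementation plan for: {task}"}],
).choices[0].message.content

result = worker.chat.completions.create(
    model="local-model",     # placeholder name
    messages=[
        {"role": "system", "content": "Implement the plan exactly; output a unified diff."},
        {"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan}"},
    ],
).choices[0].message.content

print(result)
```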
BloodyShirt@reddit
Thanks for the advice! I haven’t really devoted much time to it tbh just started exploring but again, haven’t had a ton of time to tinker.
chodtoo@reddit
I can’t run anything above a 10B model on my Mac Mini M4 Pro 48GB. I can only reliably run gemma4:e4b or qwen3.5:9b, Qwen being way better at reasoning.
tmvr@reddit
What do you mean? You have 36GB of default VRAM allocation, of course you can run much larger models. The dense models like Qwen3.6 27B or Gemma 4 31B will be slow of course even at Q4 sizes due to the 276GB/s max bandwidth, but the Qwen3.6 35B A3B or Gemma 4 26B A4B will fit even at Q6 or Q8 and large context while giving you very fast decode speeds.
chodtoo@reddit
If anyone is successful running qwen3.5:26b on a Mac Mini M4 48GB I would like to know your ollama config.
bnightstars@reddit
If Qwen3.5-9B was that good, Qwen3.6-9B will be a great alternative to 35B for Mac users.
LewdKantian@reddit
I run the same hardware and looping 35B A3B in Ralphify produces meaningful code for a lot of my projects. Just "one-shotted" a Lightrag pipeline with local LLM and Obsidian integration and MacOS functionality through Karabiner and Hammerspoon. Pretty decent for a small model like this. Looping, bite-sized tasks to iterate over and clear success criteria help a lot.
BloodyShirt@reddit
I’m sure I’ve got plenty of room for improvement but it dropped the ball pretty hard just trying to consume my mature repos and memories. Next 10 hour flight without WiFi maybe I’ll have time to play with it again but for now.. Claude’s got me hooked unfortunately
Crafty_Peanut_2653@reddit
what kind of macbook lol. dont u need like 32 gigs ram
FullstackSensei@reddit
I think it really depends on how you prompt it. I've been using both 3.6 models since they came out and haven't felt the need to fire up Minimax or 3.5 397B since. I did a couple of comparisons with Minimax 2.7 Q8_K_XL and 27B was on par.
I give the model a pretty detailed description of what I want, at least a full page's worth of what to do, where, and how to do it. The prompt also points the model to where it can find documentation in the project, and I encourage it to use it (surprisingly, this "encouragement" really works). The system prompt also sets a bunch of guidelines, such as (again) encouraging the model to split tasks into multiple bite-sized tasks and to write a markdown file in a temp directory within the project documenting what each task did. Again, this works quite effectively. It leaves very little for the model to guess about and lets it focus on the coding part.
Within this scope and controlled way of doing things, both Qwen 3.6 models are very capable.
CatConfuser2022@reddit
Please share your instructions if possible :)
FullstackSensei@reddit
I treat the LLM like a junior dev fresh out of uni, who's on their first day on the job. So, I point it to the documentation directory, which (thanks to LLMs) has a "directory" markdown file summarizing the contents of each documentation file. I instruct it to read the requirements, specifications and architecture documents. I instruct it to follow existing conventions it sees in the source. I instruct it to create a markdown file in a temporary directory inside the project detailing what it did and how. I instruct it to break the task into bite-sized sub-tasks and to create those sub-tasks, and instruct it to pass all the above instructions as part of the prompt to the agents of those sub-tasks.
The rest is specific to what I want to do. I give very detailed instructions on what needs to be done, where it needs to be done, and how I want it done. Before submitting the prompt to the agent, I paste it in a chat and ask the LLM to point out any ambiguities or contradictions in the language of the prompt and ask me about them. If there are any, the LLM will point them out. If there aren't, the LLM will come back with a bunch of silly questions, and if so, it's ready to get cooking.
Imaginary-Unit-3267@reddit
Would it be accurate to say that getting it to understand exactly what you want it to do is a large portion of the entire problem?
FullstackSensei@reddit
Isn't that always the problem, not only with LLMs, but also in real life?
If you can express your thinking clearly, you can communicate with anyone and anything effectively. It's not that easy, but not that hard either if you try to be conscious about what implicit assumptions you're making that the other might not be aware of or know. That's why I use the junior dev on their first day on the job analogy, and then rubber duck it in a chat with the LLM if I'm still not sure.
It goes very much against the trend of vibe coding things fast, but my objective is to delegate work and still have it done the way I want it, so I can maintain it. It's 10x slower than one shotting a few lines, but still 10x faster than writing the code by hand.
philmarcracken@reddit
If you can make a flow chart, the LLM can build it for you.
It doesn't have to be about what lawyers do, which is straighten English out and avoid loopholes. That's why legalese is so verbose: it's written the way a computer might try to understand it, completely literally. And it's also why they get paid the big bucks, because English is so full of holes.
FullstackSensei@reddit
I agree that prose is quite.... verbose. It's something I have been thinking about for a while. But what's the alternative?
Flow charts can also take time to create, and I don't know how to pass them to the LLM in an effective manner. Mermaid is nice, but LLMs seem to frequently make mistakes spitting it out that I don't trust it as an input format. Thought about UML, but same problem, plus it takes more time.
I've stuck with prose also because it's what LLMs have been trained on, issue -> code.
The thing about "legalese" with LLMs for coding tasks is that it significantly reduces how big and how good the model has to be to complete a certain task.
Imaginary-Unit-3267@reddit
I agree. For me, the reason I don't just vibe code things is precisely because I'm not a dev, I'm not a genius programmer, and I know that if I don't make sure I understand everything every step of the way, whatever the AI produces will be unmaintainable for me. I am finding myself very ironically being forced to learn software engineering just to make a helper for my (independent, non-academic) philosophy research, which is what I'm actually interested in!
VertigoOne1@reddit
I always tell the devs to think like this. You know things, the LLM knows things, and among the things it knows is how to translate languages really, really well; not just English to German, but English to Java and TypeScript, and TypeScript to C#. It needs to know what you are saying really well to translate really well, so the more effort you put in, the better it does. This is not vibecoding, this is systems design and engineering. Every time I've been let down by an LLM it was ultimately my own fault. Be honest, are you leaning on Opus as a crutch to make up for your laziness? It will let you down too, just like a genius senior dev will if you give him crap.
miversen33@reddit
Humans have this problem with humans too :)
Pyros-SD-Models@reddit
I also think people are speaking from belief rather than actual experience, because they haven’t really tried Qwen3.6-27B. For coding agent tasks, Qwen3.6-27B inside Pi mops the floor with Sonnet inside Claude Code.
Or they’re judging adjacent tasks, but yeah, obviously Qwen3.6-27B will not meticulously search half the internet and write the most perfect plan ever. It can do it, but it doesn’t extract the learnings as well as something like Opus or GPT-Pro would. But nobody is talking about that, since OP is clearly referring to coding tasks, not planning tasks.
Double_Cause4609@reddit
Is it possible that both he and you are correct?
Is it possible that he has a strong prior in software engineering, and in his field of expertise he's able to manage the agent in ways that are limited in scope such as to also limit the difference between different models?
For his use case, the models may actually genuinely be quite close.
But to an average vibe coder who is not directing the model to do the right thing, who is unclear about their requirements, or who expects too much out of a single step of the pipeline, it's possible that there may be a much larger difference in a less constrained environment.
Poromenos@reddit
No, I suspect it's the other way around: To an experienced, professional developer, these models are very far apart. To an average vibe coder who YOLOs a bunch of tickets to the model, maybe they can't tell them apart, sure.
mrjackspade@reddit
As an experienced programmer, even Claude Opus is infuriatingly stupid a lot of the time.
Just as a non-programming example (for reference): I'm having an issue connecting to a server, where ~70% of the connections fail.
Claude runs a test, one IPV4 and one IPV6 connection. IPV4 fails and IPV6 succeeds.
Claude then confidently states that my issue is caused by IPV4 connections failing.
Claude does things like this and I wonder how the fuck anyone even succeeds to vibe code anything without existing software developer experience.
aw2xcd@reddit
One more example: I had an Opus 4.6-generated Mac app failing to start because the splash screen image was missing, and its solution was to do all kinds of tests to check if the bundled logo exists and fail silently, without suggesting that the image was missing. Luckily I picked this up in the PR, but imagine all the things that get through because I don't have the mental capacity to read the thousands of lines this thing spits out every minute.
mrjackspade@reddit
I have wasted so many fucking hours debugging because Claude defaults to failing silently for everything, even mission critical functions.
I was having it work on a Reddit client, and its first draft caught and swallowed errors on ANY call, returning null or empty collections whenever an error occurred.
Poromenos@reddit
Simple: They don't. You need to correct its plans a lot. However, after you've agreed on a good plan, I've found that the subsequent implementation has very, very few bugs.
LosingID_583@reddit
It's possible for LLMs to one-shot, especially if you ask them a boilerplate-style task like a simple 2D game or app. I imagine that this is the level of apps that most non devs ever succeed at vibe-coding.
Houdinii1984@reddit
I was bragging to my work mate that I one-shot a complicated pipeline dashboard that had a ton of moving parts. And it did. Once I left planning mode, got code generated, and saw it, I only had minor annoyances to fix. But the planning session on that dash took days because I kept coming up with edge cases that would certainly pop up.
That same day my hubby, a completely non-programmer, made a tool for work, and got it to create his tool in mostly one shot, too.
So both of us are sitting there talking about our one shot apps, and both apps weren't even on the same plane of existence.
I don't really have a point outside 'its all relative' but it's kinda neat to exist in this time period where words are randomly gaining and losing (sometimes simultaneously) meaning in real time.
optomas@reddit
Your suspicion has merit. I can offer a counter-example, however. A very specific use case: C11 OpenGL CUDA interop, scientific visualization. The preamble is given by "wc coding_practices.md": 158 lines, 1033 words, 8570 bytes.
The primary difference is context length, not code quality. Which is kind of a feature for developing programmers, no? Enforces separation of concerns in a very non-forgiving way. Once the habit of limiting translation unit length to 200 LOC is burned in ... it's difficult to not think in terms of modules and nodes.
TLDR; Not in my experience, but I come from the era of hardware limitations. I was already writing in a style that naturally fits into local LLM limitations. For monolithic programmers, I think your suspicion is spot on.
ttkciar@reddit
I can see how you might think that, if you didn't know I was a senior software engineer with 47 years of programming experience.
Or maybe I'm just too old to use these new-fangled tools correctly? /s
More seriously, my perception is that it's the other way around -- to inexperienced programmers, it seems like the less-capable models are better at codegen than they really are, because their standards for code quality are lower.
Either way, it is possible that both he and I are correct (like you said), because there are subjective and skill-relative factors impacting the perception of codegen competence.
xienze@reddit
I think there's a similar dynamic happening even with experienced developers. There's definitely a certain kind of developer that produces heaps and heaps of absolutely dogshit spaghetti code that does in fact work and meets the requirements quickly. Solve the immediate problem and move on to the next thing as quickly as possible is their MO. I can totally understand the love these kind of developers have for AI. It's probably producing code very similar to what they already crank out, perhaps even better. And when requirements shift or bugs come up, what does the AI do? Tack more shit onto the function that's already 3000 lines long and call it a day. Just like these guys do. And, it'll generate loads of unit tests to boot!
The problem is that this code, much like the code they've been writing their entire career, doesn't really handle edge cases, new requirements, and unexpected scenarios with the kind of elegance that more, shall we say, thoughtfully-written code does. And that's why I think the other class of experienced developer can't stand AI code.
RTDForges@reddit
I’m not as experienced as you. I’m in the 20+ years area, so you definitely have more knowledge and experience. Having said that I’ve been working to seriously understand AI as a tool and in the process found that a lot of my experience as a developer made me much worse at using AI. And I personally can say that after doing extensive work on my local infrastructure I get about 95% of what Claude gave me out of my local models now. A LOT of the magic isn’t the LLM itself. And Claude is a tool made for the general consumer base. Take an individual who can create that same type of tool but tailored to their specific case and it seems like way less of an outlandish claim.
Imaginary-Unit-3267@reddit
Can you expand on exactly what those "blind spots" are? What existing habits did you have that don't serve you well with AI, and what did you have to learn to replace them?
d5vour5r@reddit
I'm similar, 30+ years, and I find the local models are great at one-shot coding but struggle on reasonably sized projects even with specific direction. I still use a mix of local and frontier models. I get comments from younger co-workers and friends who try these models after seeing posts like the plane thing and are then disappointed with the results. I've even seen people buy M5 Pro/Max laptops and then, after seeing the results of local coding, regret dropping that much money on hardware.
_PunyGod@reddit
I regret that Apple wouldn’t let me pay more for 256/512GB or 1TB of memory
MastodonFarm@reddit
Or…he might be full of shit because he has an axe to grind (as telegraphed by the comment about “monopolized closed source”).
tertain@reddit
Did you know software engineers can use Opus too? Opus is much more capable.
ttkciar@reddit
Sir, this is LocalLLaMA.
olibui@reddit
Same knowledge. Frontier models with 2T params beat a local model any day. Stop trying to lie to yourself
EbbNorth7735@reddit
Not going to lie. I don't find Gemini Pro any better than Qwen3.5 122B at many tasks. Perhaps my use cases aren't hard enough or obscure enough but I'm guessing I'm going to be completely blown away by Qwen 3.6 122B and wonder what in the hell I would need anything better for. There's a point where if you are able to feed a model the required information in context and it can determine the next step to take I can't imagine we'll need another model. Sure more models will continue to be released but for the average user there will be a point when a local model can perform 99% of the tasks they throw at it. After that the next Gen will just be a slightly smaller model able to do the same work load and so on. It feels more about harness development/comparison than model comparison.
Haiku-575@reddit
Your perspective might change if you tried Opus. Gemini does feel closer to Qwen 3.6 27B than to Opus, so I'll hand you that at least.
ruuurbag@reddit
Not to take away anything from anyone’s points, but Gemini 3 Pro is pretty terrible at coding. It’s not really a fair point of comparison when talking about closed source frontier models.
Glebun@reddit
It's the same as not being able to tell the difference between a shitty and great cut of steak if you cook them well-done.
ddchbr@reddit
Good take
ChemistNo8486@reddit
I don’t know… I have a 5090 and I have been using Q4 with 130K of context on the VRAM and the results are insane.
Probably not as good as Sonnet 4.6, but it's definitely as good as 4.5; I can just leave the tasks there and bro will be working for 45-50 minutes on a single prompt without messing up its quality.
I know that my setup is not how the average user will use 3.6-27B, but I don’t think the guy is over-hyping, at least for the non-super-technical level of coding that most people need. The bottleneck is the price.
Plabbi@reddit
Have you tried different quantization of either the model or KV cache?
I have a 5090 and am using Q5_M / KV Q8 and can fit the entire 262k context in VRAM, but don't really know if I should sacrifice context size for either a better model or F16 KV.
Would be nice if there was some standard test for this.
pyrojoe@reddit
I just set up Qwen 3.6-27B on my 5090 last night in vLLM. I'm using cyankiwi/Qwen3.6-27B-AWQ-INT4 with an fp8 KV cache and 200k context. (Everything just barely fits.) I haven't had it write any code yet, but it runs pretty fast and seems intelligent.
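For reference, the same idea through vLLM's offline Python API would look roughly like this; the model name is the one above, the rest of the values are guesses, not a recommendation:

```python
# Rough sketch: AWQ INT4 weights plus an fp8 KV cache to squeeze a long
# context onto a single 32GB card. Values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/Qwen3.6-27B-AWQ-INT4",
    quantization="awq",
    kv_cache_dtype="fp8",          # quantized KV cache
    max_model_len=200_000,         # ~200k context, as described above
    gpu_memory_utilization=0.95,   # leave a little headroom
)

out = llm.generate(["Write a haiku about prefill."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```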
How's the math work out that you can fit the full context in VRAM? I'd expect you to be slightly over in your setup.
Plabbi@reddit
I am running Unsloth q5_k_m (21.35GB file size) using llama-server in Win 11 and after filling the context with data I still have 1GB VRAM left according to HWiNFO64. The context is around 10GB.
Performance varies from 50 t/s when empty down to 30 t/s when full context.
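For anyone wondering how these context sizes pencil out, the generic per-sequence KV-cache estimate is below; the actual numbers depend on layer count, KV-head count, and head dimension, so treat it as back-of-envelope only:

```latex
% Per-sequence KV cache size, before any fancy compression.
% The factor 2 accounts for storing both K and V.
\[
\text{KV bytes} \approx 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}}
  \times d_{\text{head}} \times n_{\text{ctx}} \times \text{bytes per element}
\]
% bytes per element: 2 for f16, 1 for q8_0/fp8-style caches.
```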
Zc5Gwu@reddit
Is he though? I don’t think most people realize how strong a 30b model actually is. It’s rare that the dense model would hallucinate common facts for example. It’s like Wikipedia in your pocket.
ResidentPositive4122@reddit
Yes, 1000%. The creators of dsv4, a 1.6T model, have openly said that there is still a gap to Opus.
Thing is, the small models are really cool, have become truly useful and we're lucky to have them. But exaggerating about their capabilities doesn't do any good. I'll take a local gpt5-mini / haiku level model any day of the week, and be happy about it. I think the small qwens, gemmas, even gpt-oss-20b can be used for real work, in the right setup and with a lot of elbow grease. But having used the SotA models as well, I agree with OOP 100%. Let's keep it real.
Far-Low-4705@reddit
honestly, i dont use closed models anymore, just because local models are free and i dont get rate limited after 5 messages like u do on free tiers, and local models score better than low end free closed models, so i wouldnt know
But, imo, i really do think we have local models better than haiku... haiku kinda gets destroyed on benchmarks.
And ik benchmarks arent everything, but they do mean something. and i mean ofc closed will always be better, but the real question is if local models are better than the last usable closed models.
do we have a local model better than haiku 3.5 - yes.
imho, once a local model becomes capable enough, like qwen 3.6, it doesnt really matter for the majority of use cases.
novelide@reddit
Given it's pretty easy to rack up $20/month in electricity, I think a fairer comparison is with the $20 tier on cloud models. But when you hit usage limits with the equivalent of 1 prompt/hour (approximately what I get with Opus 4.7), local models still win in many cases even though the capabilities are definitely much lower.
Far-Low-4705@reddit
not really, my pc is already gonna be on, im only running a 35b a3b MOE model, and power draw cant be anymore than 200w tops. also 90% of the time the LLM is idle waiting for a request.
I recently graduated college, and i mostly just used them to check my work/math for engineering problems, and with the free tier, you couldnt use thinking models, or if u could u got like 5 messages/day, and it sucked.
It was just nice to be able to be more liberal in the messages, and if i wanted to, i could regenerate the response 3-5 times and see if the LLM got the same answer each time to see how confident it was.
FullstackSensei@reddit
Have the Qwen people said 3.6 27B is on par with Opus in everything?
ResidentPositive4122@reddit
No, the bloke in the plane did.
FullstackSensei@reddit
Please read my comment here
2Norn@reddit
im sorry but its mumbo jumbo
you basically said "just prompt better idk use markdown instructions or something"
and then a single piece of anecdotal evidence
do that 300 times for varying tasks of varying hardness levels and if u still think that then i'll believe you
FullstackSensei@reddit
Quite frankly, I couldn't care less whether you believe me or not. If you can't understand what I said, ask your local LLM to explain it to you. ✌🏻
2Norn@reddit
that's like your problem man
you are the one who thinks 27b is like opus, not me
it sits between sonnet and haiku, a bit closer to sonnet, anyone who thinks its like opus is hard coping
FullstackSensei@reddit
I'm running half a dozen instances of it in parallel and I'm quite happy with it. If that's offending to you, that's actually your problem, not mine.
spawncampinitiated@reddit
your happiness is not a benchmark
2Norn@reddit
i dont care what you use. use gemma or bonsai or whatever and think its like opus. thats not what im interested in.
but fact of the matter is the claim is not true. unnecessary hype that fools people.
best thing u can do with models like this is use them as worker/executor only and hope that they can give you sonnet/5.4 mini/glm 5 turbo etc performances or at least come very close. but more often than not they are closer to nano or haiku. but it's getting better.
InsideYork@reddit
Oi!
oe_throwaway_1@reddit
they give cameras to ANYBODY these days
iMakeSense@reddit
What do those setups look like
HopePupal@reddit
i think that's actually the worst test of a small dense model. they're great for staying on task and keeping their shit together over a long context window. they're less great for world knowledge — i don't expect something like Qwen 27B to be able to store that much trivia, and in fact it happily made up a bunch of shit about the neighborhoods of the major city i live in (transit lines that don't exist, etc.) that larger models have more room to store.
cromagnone@reddit
You’re not wrong. Last night I was idly watching old films on Netflix while 3.6 27B was downloading over a shitty connection. Both the film and the download finished at the same time so my test run command was “write me a feminist critique of “In The Line of Fire” starring Clint Eastwood.” It was like I’d dropped acid.
tat_tvam_asshole@reddit
that's why we have tool calling. the data doesn't have to all reside in the model weights and in fact very often is better to craft a response treating the Internet like a RAG database
sellyme@reddit
So's the $20 USB I loaded a Kiwix install and db dump on to in 2009.
Wikipedia in your pocket is cool but it's not exactly revolutionary any more.
coding9@reddit
Enjoying 4 minutes of coding when your laptop burns you and uses the entire battery haha. Plugged in, a little better but sooo slow
WhyNoAccessibility@reddit
It all depends on the memory and processor. I first started with a MacBook air M1 at 8gb which kept crashing out (hot swap); but now upgraded to a MacBook pro M4 pro 24GB.
It can still be slow sometimes for very large prompts. But I'm not having the constant pink screen restarts and battery drain
Due_Duck_8472@reddit
No it doesn't, it's about architecture, and component selection.
A Macbook isn't designed to be a compute center.
It will die a horrible fiery death and the battery will be toast in a few prompts.
WhyNoAccessibility@reddit
It's not designed to be sure, but that doesn't mean that it can't be run as one.
If you're not running anything else it wouldn't toast the battery. Two months of me trying this and my battery health is still above 96% (for a 2020 MacBook it's not that bad). The battery didn't toast instantly either, I'd get at least a few hours out of it.
I just couldn't do jack else
gusfromspace@reddit
30b is incredible, even something like 14b is good, if you know how to feed it the right data.
diddlysquidler@reddit
Also Mac battery will last about 40 minutes.
AngelOfLastResort@reddit
What model would you say is pretty close to Sonnet in performance that could run locally? Is it important to have a good RAG setup?
ttkciar@reddit
According to benchmarks, GLM-5.1 ranks slightly better than Claude Sonnet and slightly worse than Claude Opus. I cannot say from experience, though, since my hardware is insufficient to host it.
It really depends on your use-case. For general Q&A I have found that Wikipedia-backed RAG is really great for improving the quality of inference and cutting down on hallucinations, but for creative writing RAG does nothing whatsoever.
RAG is a really complicated and nuanced technology. There's an entire subreddit dedicated to it: r/RAG
AngelOfLastResort@reddit
Sorry if this is a stupid question but how much vram would you need to run GLM 5.1 locally? I see it has a total of 754 billion parameters with only 40 billion active at a time. With no mention of quantisation.
Grok said that good RAG was essential to a good local LLM coding experience lol!
ttkciar@reddit
My go-to quantization is Q4_K_M; I have yet to regret using it.
At that level of quantization, GLM-5.1 weights would take about 468GB, the inference overhead (mostly context K/V caches) would be another 56GB, and if it's a multi-GPU rig there would be about 14GB of overhead per GPU beyond the first.
For a four-GPU setup, that would come to 566GB of VRAM.
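Back-of-envelope for where those numbers come from, assuming Q4_K_M averages roughly 5 bits (about 0.62 bytes) per weight and that the per-GPU overhead applies to the three cards beyond the first:

```latex
\begin{align*}
\text{weights} &\approx 754\times 10^{9} \times 0.62\ \text{bytes} \approx 468\ \text{GB}\\
\text{total (4 GPUs)} &\approx 468 + 56 + 3\times 14 = 566\ \text{GB}
\end{align*}
```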
You're not going to get that on a laptop, but could cluster about ten MI210 on old Xeons for about $60K and gang them together with llama.cpp's rpc-server, or wait for the first-generation eight-GPU hyperscaler servers to age out and appear on the second-hand market (probably some time 2030'ish), or if you're rich you could buy one now for about a quarter-million dollars.
AngelOfLastResort@reddit
Okay, so it's not a local model then? Offline sure but not local.
ttkciar@reddit
No, you can download the weights from Huggingface, and I have done exactly that. If you're willing to wait for pure-CPU inference you can host it on a fairly inexpensive $2K Xeon with a buttload of DDR4 memory, but that would be far too slow for interactive use.
AngelOfLastResort@reddit
You can't run it on a desktop or a laptop. So it's not local. It's built for multi GPU server environments. It's not a local model.
There isn't a desktop you can configure today that would be able to run it. Not even 2 x 5090s. So it's not a local model.
ttkciar@reddit
It's a model you can run on your own hardware, so it's local.
AngelOfLastResort@reddit
Homelab!=local
Weekest_links@reddit
Also people need to know their MacBook Air or Pro might not have the VRAM or GPU/CPU cores to even handle 27B
InnovativeBureaucrat@reddit
I heard about Gemma and jumped on the local LLM bandwagon and so far it’s just been embarrassing. I thought I had a powerful system76 laptop (the fan is certainly powerful, and runs constantly) but turns out you need a Mac with shared memory to do much and this 4gb GPU is a joke.
This computer was $1700 in 2022. I thought it would be able to do more.
I’m thinking I should lean into cloud hosting, but it’s been exhausting to figure out.
I installed Hermes and Gemma but it takes 5 minutes to respond to a prompt
tmvr@reddit
You don't need a MacBook, but you do need a bit more VRAM. With an 8GB GPU and DDR5 system RAM you will get very usable speeds with the 35B A3B MoE model.
IrisColt@reddit
The conceited goal is actually selling more MacBook Pros, heh
bluehands@reddit
This is interesting to me because my immediate response was, "so what?"
It is 1995 and getting online isn't easy, fast or very useful. People who just reject the internet now can feel smug & superior for another 10 or 15 years but everyone is online by 2010 or 2015.
Except with AI it isn't going to be 15 or 20 years before it is absolutely everywhere; it's going to be 3 or 5 at most.
Time-Heron-2361@reddit
Context -> that's something no one in the local LLM community mentions. What good is it that I can run a model on my 48GB laptop if the context cannot exceed 32k? It's practically useless.
ttkciar@reddit
Yeah, that's a whole can of worms, but it's a highly relevant can of worms.
Agentic codegen really needs a lot of context, which means not only do you need a high context limit (and memory to match), but also a model whose competence does not drop off too rapidly at long context.
Also, the impact of K and V cache quantization on inference competence is more pronounced for codegen than it is for other kinds of tasks, which means your options to stretch memory are even more limited -- q8_0 is the most you want for codegen, and turboquant doesn't save you.
These issues are frequently masked from the user's perspective, at least at first, because non-agentic tasks frequently do not require high context (fewer than 2K tokens, in the common case, barring RAG) so K and V caches fit in VRAM, making inference very fast. It is not until they try to use that model for "serious" work that the cache spills to system memory and performance tanks.
These measurements are relevant:
https://old.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma_4_and_qwen_36_with_q8_0_and_q4_0_kv_cache/
olibui@reddit
Hey. Whatever gets you likes right? Facts are irrelevant
JacketHistorical2321@reddit
So...?
cmdr-William-Riker@reddit
I think the expectations for sonnet and opus are too high. It falls short constantly
Time_Cat_5212@reddit
Okay, so dum dums dum dum, and it gives everyone else more time to get ahead.
The more dum dums blame people and make stupid assumptions, the better, IMO.
kiwibonga@reddit
Skill issue.
bigsmokaaaa@reddit
Here's the thing though: when you start using opus/5.5 as an orchestrator, it offloads all the heavy lifting locally, and you save 85% with very little downside. Not much help on a plane tho, but that model's pretty good without it
Ok-Measurement-1575@reddit
"Who the hell wrote this?!!"
Plabbi@reddit
What quantization did you use?
victorsmonster@reddit
lol of course he's the CTO of hugging face
This may all be true but this is the least objective source to get any information from
jacek2023@reddit (OP)
What source of information do you think is the best?
victorsmonster@reddit
https://wikifeet.com/
Clipthecliph@reddit
To be fair, here you can actually separate the noise from what really works in the comments of posts like yours. Until now, I had the impression that the model would just work, but had some doubts. Now, after reading the comments, it's clear it doesn't. So thanks anyway.
Organic-Importance9@reddit
I have phi4-mini on every PC I own because it's easier than digging through the offline manuals to figure out bash and zsh commands.
whoisyurii@reddit
isn't gemma 4 e2b / e4b a better option these days?
Organic-Importance9@reddit
Probably. I have that on one computer. Phi4 is just kind of habit
No_Count2837@reddit
Qwen3.6 27B "feels" like Opus. Not sure about that one. Maybe I haven’t tested enough.
kwinz@reddit
"MoST PeOpLe HaVEn't reAlIzeD ThIs yet."
KaiserFerdl@reddit
That’s why you got a 48GB RAM MacBook Pro.
roguefunction@reddit
What are the MacBook specs?
twistsouth@reddit
All of them.
g0pherman@reddit
How much ram the macbook needs to run that?
imetatroll@reddit
I have yet to work with a model that is even close to open ai's newest models. Maybe if I bought a special rig that cost thousands of dollars? But I suspect even then the token response speed would be terrible.
Chinmay101202@reddit
moment 2.0 soon.
sammcj@reddit
Wonder why they're running llama.cpp instead of MLX on Apple Silicon? Seems like throwing a lot of performance away (even with larger context sizes)
kelembu@reddit
how much performance are we talking about?
sammcj@reddit
20-35% roughly. Yeah I believe Ollama did or at least partially, however it's way behind things like oMLX
stinkycatuncle@reddit
https://youtube.com/@stinkyloud?si=LUVD6HN4WFuZeC9B
fxj@reddit
try opencode with it. thank me later
awsom82@reddit
What is he trying to achieve? Why is he using Pi?
shamerli@reddit
I’ve been doing this and lecturing on it for the past 3 years!
Legal-Tie-2121@reddit
M5?
Glad-Programmer-5505@reddit
Nice
OutlandishnessIll466@reddit
Yes it is super cool for the entire 15 minutes the battery lasts 😄 I never heard my MacBook's fan until I ran a model on it.
scott2449@reddit
Try Gemma with something like Forge 🤩
Chinmay101202@reddit
yep, 2.0 coming.
Express_Quail_1493@reddit
🔥🔥🔥
Clipthecliph@reddit
I have 16gb ram, is there any way it would still work? Even with swap if necessary
pppreddit@reddit
I am running 27B via omlx (Qwen3.6-27B-bf16) on my M4 Max 128gb and it takes forever to respond. omlx dashboard shows 38.8 tok/s for prompt processing and 3.7 tok/s generation
Pleasant-Shallot-707@reddit
Yeah. M4 is mid for pp
pppreddit@reddit
tbh, I am disappointed in how many mistakes it makes in the process, such as duplicating lines, then correcting itself, then going back and forth making corrections, it's such a waste of time
Melodic_Reality_646@reddit
says the dude rocking a 128GB RAM M5 Max… in GPU-poor language that’s like LinusTechTips saying a private jet is affordable.
ea_man@reddit
Well yesterday I bought a used GPU for 250 to run QWEN 27B.
root0777@reddit
Which gpu you got?
ea_man@reddit
AMD 6800
No-Refrigerator-1672@reddit
Even counting in all the crazy price hikes we have now, building a PC that can run Qwen 3.6 27B fast enough to use for agents is below $1000, if you accept buying second-hand parts. While I do agree that this is a significant amount, there are many people who will spend this much on a phone, so it's fair.
phreaqsi@reddit
Legit question.
If you had $1000 right now, what would you buy with it to run Qwen 3.6 27B?
I don't mind second hand (although it'll be hard for me to source locally).
CheatCodesOfLife@reddit
Don't do MI50s.
No-Refrigerator-1672@reddit
If I had to target $1000 exactly, and limit myself to eBay and local markets, I'd choose 2x V100 SXM2 16GB cards with PCIe adapters - those go for $270 apiece on eBay, so a pair will have enough VRAM to run a Q4-Q6 quantized version at reasonable speed. Then I'd aim for a used AM4 motherboard and Ryzen 2600G - you'd find a combo for another $150. Make sure to buy a G-series CPU, as the V100 has no graphical output, and your system will fail to boot without integrated graphics. To run this system, it'll be enough to use 16GB of DDR4 memory, as we're going to have the entire model in VRAM, so it's $100 for RAM. Then another $100 for a 750W PSU that can handle dual cards, and, say, $50 for a case - and you've got yourself an almost complete system for $940; I did not include SSDs and HDDs in the spec. You can do better buying parts from China, more on that later.
I'll address some possible criticism beforehand. First, a popular choice for running self-hosted AI is the AMD Mi50 32GB. Although it's capable of running a 27B model within a single card, it has some major limitations on the software compatibility side of things, and right now its price/performance ratio is very bad. Go for it if you can find one at $250, but it's not worth paying more. However, I'd insist on running Nvidia Volta or newer, because then you can run vLLM, which has a huge performance advantage over llama.cpp, especially for running multiple agents in parallel. People could also note that V100s have pretty bad idle power consumption, so you don't want to run this setup 24/7, but for a workstation that you use, say, 4 hours a day it'll be acceptable. Also, you can potentially buy the CMP100-210 16GB, the same chip as the V100 but in a mining package for just $180 - however, mining versions have a severely crippled PCIe bus, so its performance in a dual-card setup will be terrible; look it up at your own risk.
Alternatively, if we assume that you already have a decent PC and only need GPUs, there are more interesting options for you. Alibaba.com provides quite a lot of upgraded gaming GPUs with double the VRAM. You can get a 2080 Ti 22GB at 270 EUR, a 3080 20GB at 370 EUR, and an honourable mention to the 4080 32GB at 1300 EUR; all prices exclude import taxes. A single 2080 Ti 22GB is a fantastic replacement for both my 2x V100 idea and the Mi50: it still allows you to fit a 27B model in Q4, have some KV cache space, and decimate both options in terms of speed for a very comparable price. A pair of 2080 Ti 22GB will cost you a bit under $1000 when you factor in shipping fees, import taxes, etc., and has enough VRAM to run the same model at very long context lengths, which will improve your coding opportunities significantly. A pair of 3080 20GB will be just a bit over $1000, and a 4080 32GB will be significantly over budget, but it has the advantage of running the model on a single card at low power consumption and very good speed. If you're interested, I've detailed my experience with a dual modded 3080 setup in this post, including details on how to purchase things from Alibaba.
u/LPitkin, I'm tagging you too so I don't have to send out this response twice.
victorsmonster@reddit
lol yeah simple as that
No-Refrigerator-1672@reddit
That's actually how the entire world works: you either save money and spend your time and effort, or buy a quick and ready to go solution for significantly more money. Applicable to any field ever.
Acceptable_Pear_6802@reddit
Mac mini m4, 32gb ram, 256gb ssd
soshulmedia@reddit
How about a single used MI50 32GB in whatever rig you can build around it. I can run Qwen3.5 27B (didn't test 3.6 yet) @ UD_Q6_K_XL, 32kctx, ~15+ tok/s for short prompts.
LPitkin@reddit
I had no idea that it could be that affordable. Can you give me an example build?
ea_man@reddit
I run Qwen3.6-27B.i1-IQ3_XXS on a 6700xt, it costs like 200$.
I just bought a 6800 for 260 to run a bigger q4.
bnolsen@reddit
From what I understand you want to hit q8 with 3.6 27B if possible. I'm on a Strix Halo so I run it q8_k_xl
soshulmedia@reddit
I have qwen3.5 on a single MI50 and it works, see my other comment above.
AshuraBaron@reddit
You don't think a $5,000+ laptop is affordable? What are you, poor? /s
Time_Cat_5212@reddit
I thought my $2500 laptop was expensive until it lasted me 5+ years. $500 a year for a device I spend 40+ hours a week using is not a bad deal!
AshuraBaron@reddit
Wealthy people have the luxury of thinking long term, while poor people do not. They can't afford the upfront cost, so they are locked into cheaper options that do not last as long. It's a vicious cycle that is a feature, not a bug.
SufficientPie@reddit
In what sense is a vicious cycle a feature
Imaginary-Unit-3267@reddit
It's a feature if you're one of the elites who loves (nonconsensually) pissing on everyone else.
Time_Cat_5212@reddit
Yeah everything's just a big conspiracy by the elites to fuck everyone else over
Sheesh. I don't miss being 20 years old
SufficientPie@reddit
Oh I see
TFABAnon09@reddit
Aka the Sam Vimes Boot Theory of Economics
Time_Cat_5212@reddit
It's neither a feature nor a bug; it's just the way resources work.
Ea-Nasir, however, could not Klarna a shipment of copper bars. Today, we have options!
vulgrin@reddit
Easily fixed! Just tell Claude “make me unpoor” /s
iMakeSense@reddit
r/povertylocalllama
Toastti@reddit
On an M5 MacBook Air (32GB RAM ideally) the Qwen 3.6 32B actually runs really well, totally usable at smaller quants. Just need to make sure prompt caching is on and do expect to wait a bit for an initial response. But for sure usable
jacek2023@reddit (OP)
I’ve used Linux on the desktop since 1997, and throughout all that time I’ve seen people think that open source is about saving money. They believe we use “free software” because we don’t want to pay for things.
DominusIniquitatis@reddit
Kind of a problem with how it goes in English. In my language, we essentially have "free" for "as in freedom" and "costless" for "as in beer". No ambiguity.
jacek2023@reddit (OP)
I recommend reading https://www.gnu.org/philosophy/free-sw.html and https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
(In Polish these words also sound different)
DominusIniquitatis@reddit
Already read the first one years ago. :)
Plabbi@reddit
Yeah those people.. I am totally not using it just because it's free.
AvidCyclist250@reddit
to not be shackled
for some, i guess that only means financially
tat_tvam_asshole@reddit
I'm just here for the free beer.
Ell2509@reddit
Wow. I had no idea but that makes sense.
Koksny@reddit
It's M5 Max, your comparison would be more apt if it were DGX Spark.
0xd34db347@reddit
That's a fairly small expense for a professional tool in a first world country. Hell my landscapers truck and the riding lawnmower it pulls both cost multiple times more.
69_________________@reddit
My M1 Max 64GB runs Qwen3.6 27B GGUF, and you can get the same laptop used for around $1,300
jasmine_tea_@reddit
Pretty much
Individual_Zombie457@reddit
What does it have to do with anything?
He's talking about the model capabilities, not his laptop performance. He didn't even mention which M chip he has or the RAM.
InterstellarReddit@reddit
⚰️⚰️⚰️
AllNamesAreTaken92@reddit
You can measure how irrelevant and unreflective your opinion is by noticing you are making a comparison to a model you acknowledge you have never used.
You should be able to figure out you shouldn't post this all by yourself.
Terrible-Reputation2@reddit
"Headstart" makes it sound like we're in a race. Can we just not? This tech is clearly going to take over most areas in our lifetimes. People talk about a wave and say to ride it so you don't get crushed, but I think we humans actually have a choice to make collectively. Does this need to be something that crushes people so a few can become ultra-rich trillionaires? Or can we try to control that human greed and let the few get very, very rich, while also raising everyone's floor so they can live fulfilling lives? I'd say the most fulfilling way to live for most of us isn't trying to get and stay ahead of some crushing wave until we finally burn out and get crushed by it too.
SamSlate@reddit
what are you doing that's non-trivial on a macbook?
Ptxcv@reddit
What's with this overhyped-style writing lately... it's annoying.
Kalcinator@reddit
Every time someone says "I'm not gonna lie [...]" I immediately assume they usually lie
electrosaurus@reddit
These sorts of posts are just, gross.
I_HAVE_THE_DOCUMENTS@reddit
Might it just be a person who is really excited about something? What makes this "self-service hype"?
electrosaurus@reddit
He co-owns Huggingface...
Pleasant-Shallot-707@reddit
So?
MartiniCommander@reddit
Running Qwen3.6 27B in OpenClaw so far has been a pain. Maybe it's using oMLX to host it. But the Gemma 4 31B is much more responsive. Qwen starts a task and just never lets me chat with it again but Gemma 4 stays live and I've been very impressed with it. Running the Gemma 4 31B RotorQuant 8bit version and it's nice and snappy. The Gemma 4 26B-A4 is flat out instant. I haven't tried the Qwen3.6 35B-A3B but that will be in the mix.
Pleasant-Shallot-707@reddit
I did notice that 3.6 likes to just stop mid conversation.
Downtown-Art2865@reddit
most of these “it’s close to opus” takes quietly assume a very tight loop and curated context
Pleasant-Shallot-707@reddit
So, well designed tools and discipline….this shit is coming for everyone, even cloud model users as providers move to token based pricing and hallucinations and failed attempts start costing real money
covertpirates@reddit
I tried Qwen3.6 27B on open code (unsloth q8), but found it wasn’t performing very well. Qwen3.5 (same quant) did a lot better. Not sure if it’s my config or maybe I just got to wait for an update.
Pleasant-Shallot-707@reddit
3.6 fixed issues in 3.5 so it’s weird you’re getting better output from it
LetterheadFresh5728@reddit
Ya it's great for making a python script to write a poem if you have 15 minutes
These chatgpt LinkedIn bots are driving me insane
NitinJadhav@reddit
One day this will happen. Man just wants to be first to comment.
DrDisintegrator@reddit
But if that was Claude Mythos, it would be asking you where you'd like to have the plane land.
:)
Status_Contest39@reddit
Thanks to Qwen!
gurilagarden@reddit
I wish this were true, but it's not, yet. It strongly depends on your language, task complexity, and error tolerance. The quality difference between local and frontier is very measurable, especially when comparing <40b to frontier. I have to this point compared locals to frontier on 7 different projects ranging from full-stack web apps, to python projects, and small rust apps, using the same harnesses and processes, and the output differences are visibly and functionally apparent. I recently stumbled into a meaningful quality gap between sonnet and opus when reasoning through a python-based desktop app. So my question for this asshole is simple. Where exactly would the headstart be? Even when we inevitably get there in the next few weeks, all it means is that my grandmother will be able to vibe-code doom, so if anything further advancement only levels the playing field.
Nsiem@reddit
"huge headstart for the second AI revolution"
We've been hearing this since the inception of LLMs, we were going to be "left behind" if we didn't prompt engineer, we were "not gonna make it" if we weren't using RAG.
All of this is funny to me because the AI models got better and left these old "methods" behind, I didn't need to build out a complex prompt engineering system to code because now we have that built into our models, I didn't need to learn RAG because agentic models come out of the box. It's always a "race" yet I never felt like I had to learn everything as it was coming out because by the time I would have learned it or integrated it into my workflow, it became obsolete. I utilize Claude code daily, I build out AI workflows in my projects easily, I leverage skills and plan mode to remove the need to prompt engineer perfectly, I work with my model to come to a solution rather than needing to handhold it all the way through completion.
I didn't participate in the race and yet I'm all caught up with those who are supposed to be "leaving me behind" 😂
MikePounce@reddit
I compiled llama.cpp with CUDA, and regardless of using the Q8 or Q4 GGUF of Qwen3.6-27B from Unsloth, I get 10 tokens per second on an RTX 5090. It is usable but rather slow. Am I doing something wrong?
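Edit: for reference, this is roughly the command I'm testing with (the model filename is a placeholder for the Unsloth GGUF). From what I've read, the usual culprit is forgetting to offload the layers, so that's what I'm double-checking:

```bash
# with CUDA compiled in, the layers still have to be offloaded explicitly;
# without --n-gpu-layers the weights sit in system RAM and you get CPU speeds
./build/bin/llama-cli \
  --model ./models/Qwen3.6-27B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  -p "quick sanity check prompt"

# confirm VRAM is actually being used while it generates
nvidia-smi
```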
NitinJadhav@reddit
Helped, thx
SnuffleBag@reddit
I have nothing against local LLMs, but what exactly is the head-start here?
I can probably count on one hand the number of hours I spend every year where I don't have Internet but wish I did because I needed it for work. So I can now spend those handful of hours using a significantly worse coding agent instead of just working on something else?
smallDeltaBigEffect@reddit
Cloud inference is extremely unprofitable at the moment and prices will just go up from now on. A 70b model from last year is now worse than a 9b one. The trend will continue, for both I guess
SnuffleBag@reddit
Sure, but right now that's not my problem.
For this future that's supposedly now, the price of that laptop pays for about 6 years of a pro/max cloud model that currently gives significantly better results.
Yes, prices will go up for sure. Quality and/or latency will become worse. But that future is not here yet. And at the end of the day, I'll still be on the hook for some $8000 laptop to participate in that future when it does arrive - in no small part thanks to cloud inference vacuuming up all the components.
I have no doubt local inference will become huge and hugely important. But the practical future is not here yet when it comes to comparing to frontier cloud models.
smallDeltaBigEffect@reddit
Curious to see. Let’s look back in a year
!remindme 1 year
RemindMeBot@reddit
I will be messaging you in 1 year on 2027-04-25 12:32:28 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
Parent commenter can delete this message to hide from others.
Healthy_Bedroom5837@reddit
Sure is, check out Box on Android, it's got a lot! https://github.com/jegly/Box
akira3weet@reddit
qwen3.6-27b is very cool, best part is it's a good size to barely fit in my setup, 16GB+6GB VRAM.
nsshing@reddit
While I do believe we are gonna run an Opus 4.6 on a MacBook Pro, a Qwen 27B cannot do many useful things.
OverjoyedBanana@reddit
Sovereignty: dude is running a Chinese model that has been produced god knows how, with illegally downloaded media.
spencer_kw@reddit
every time someone claims a 27b model matches opus i ask them to run it on a codebase they actually know well. not a benchmark, not a toy project, their actual production code with all the weird conventions and edge cases
the models are genuinely impressive for their size but the overclaiming does more harm than good. sets people up to be disappointed and makes the whole local community look like it can't self-assess
GFrings@reddit
I mean you can also just point them at the actual benchmarks
arguingwithabot@reddit
Ya that should be the benchmark: can you use it for a production code base. Gemma 4 on a maxed out macbook m4/m5 actually comes pretty close for my team but we still can’t justify moving off Anthropic right now.
Maybe someday! But for finance dept it’s a capex vs opex question and capex isn’t favored these days across the board.
It does seem that local/edge AI is encroaching on frontier SaaS but it’s not always the practitioners choice.
Funkahontas@reddit
Oh definitely someday. Maybe this year lol
deepspace86@reddit
Yeah, I always ask people to pull down an open source project and have their hyped-up model identify one small piece of tech debt to fix with a TDD workflow, and it typically does not go well.
9r4n4y@reddit
So do you think 27b or 35b matches opus 4.5?
matrik@reddit
With all due respect to qwen team, comparing a 27B model with a 5T model? Dude..
sooki10@reddit
While I do love the model, and it is impressive for local coding, it is quite far from opus and he should avoid that comparison as it weakens his point.
Zeeplankton@reddit
I fucking hate twitter. This guy is maybe ok, but it's full of this exact type of weird hyperbole and lying.
Like 3.6 is good but no, it's not opus, and no, you're not getting actual work done with it on a MBP. the TPS crushes utility.
tmvr@reddit
It's not even Sonnet 4.5 imho. Well, the "old" Sonnet 4.5 from a few weeks ago, before the recent shenanigans. Whatever it was doing end of this week, it felt like a different model compared to the one I was using for months before. I stuck to 4.5 even after the release of 4.6 so I've noticed when it changed how it behaves.
tat_tvam_asshole@reddit
honestly, truly honestly, I work in the field and if the sophistication of open source agentic orchestration could approach what flagship has, you'd be surprised how much closer in real capability they are/could be. so much of the intelligence isn't even in the model itself per se
s-Kiwi@reddit
Claude Code source was leaked, we can literally 1:1 copy what flagship has
tat_tvam_asshole@reddit
lol, you think client side is all there is?
how quaint
dan-lash@reddit
Don’t even need to, you can use CC with your own models / servers
Important_Quote_1180@reddit
LoRA adapters are making this less true in my workflow in research and creative writing. Game blueprints are still a frontier but it's getting better every day it seems.
jacek2023@reddit (OP)
I understand that based on benchmarks reddit people say that "you are not allowed to compare local models to sota models". But in my home project I replaced codex/GPT 5.4 with pi/gemma 26B and it's fun to work this way. So I can do things no matter what reddit thinks.
ILikeBubblyWater@reddit
Close to Opus? Yeah right.
jrexthrilla@reddit
I thought this was linkedinlunatics
thecuriousrealbully@reddit
Can we have a small LLM that transforms text to unslop it, unhype it, and make it normal human text?
TronAres25@reddit
I still don’t understand how any of this works lol
havnar-@reddit
Not using oMLX for the world to see, what a putz
4DXP@reddit
Very cool. How is the speed? Tested it before and it was pretty annoying
sergeialmazov@reddit
Ok, that's why American airlines adopted internet on board so quickly
andy_potato@reddit
Qwen 3.6 is a good model. But putting it in the same league as Claude or Codex is just delusional.
debtofmoney@reddit
How much memory is needed to run Qwen 3.6 27B on an MBP, and which quantization model is most suitable?
jokedoem123@reddit
Am I the only one having trouble running these LLMs on my 16GB RAM (which I found to be pretty decent for any other task)?
Itchy_elbow@reddit
Are you running the MLX optimized version? Runs a lot faster on apple silicon via LM Studio than the GGUF. Glm-4.7-flash is also pretty decent, as is gemma4:26b, all on apple silicon with full GPU offload if you have more than 24GB RAM.
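If you'd rather skip LM Studio, the mlx-lm CLI does the same thing. A minimal sketch, where the repo name is a placeholder for whatever MLX conversion you actually use:

```bash
# install the MLX runtime for Apple silicon
pip install mlx-lm

# one-off generation with a hypothetical 4-bit MLX quant
mlx_lm.generate \
  --model mlx-community/Qwen3.6-27B-4bit \
  --prompt "explain mutexes in one paragraph" \
  --max-tokens 256

# or serve an OpenAI-compatible endpoint for coding agents
mlx_lm.server --model mlx-community/Qwen3.6-27B-4bit --port 8080
```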
olstrom@reddit
I'm starting to wonder if it is not part of the marketing strategy of Apple and NVIDIA to pay all these people to share their shitty analysis. How is local close to frontier models in the cloud? And why would Apple silicon be way better compared to cheaper solutions? They never talk about performance. I can also run a model on a Raspberry Pi Zero.
Winston-Turtle@reddit
offtopic: should i use llama cpp when i’m on mac? or should i use mlx-lm instead?
CheatCodesOfLife@reddit
llama.cpp, unless the model isn't supported by it, e.g. deepseek-4
Lucky_Yam_1581@reddit
Before leaving the US I had the choice to trade my 32 GB M1 Max for a 64 GB Mac mini for just 500-600 dollars. I thought I'd wait until I could save up for 128 GB. Worst decision ever! This model could run so well on even 64 GB!
Synor@reddit
Burning hot screaming laptops in plane. Yes.
I_HAVE_THE_DOCUMENTS@reddit
There's no way a laptop fan would ever be audible on a plane over the background noise.
IntroductionLive4027@reddit
Just vibe coded a mitmproxy plugin to block yt ads using qwen 5.6 on a 5080
rkh4n@reddit
does it work though?
tainted_vagina@reddit
It's interesting to watch these models improve so quickly. Buying a laptop in a year that can run the latest local model better than the current sonnet will be a very interesting time for AI.
I do wonder at what point, if any, these large companies find themselves looking at subscription numbers and realising their long term investment strategy may not work out.
AntisocialTomcat@reddit
Please stop shitting on this guy, he's both inspiring and chill. Like in the restaurant business, it's pretty rare. On another note, he's the CTO of Hugging Face, so not a clown like Musk or Altman, he's pretty knowledgeable. I agree this kind of LinkedIn cringey writing style is spreading like wildfire, though.
Icy_Distribution_361@reddit
I don't understand these people who seem to have a need to write long hype posts on x.com. Or maybe I do, because it's always a subtle form of self-aggrandizement; me so smart for doing this.
Dry_Yam_4597@reddit
Have you been in a corporate office recently? It's a madhouse. It's as if they've all gone mad, or simpler, corporate selects the worst among the gossipers and the drama queens.
Icy_Distribution_361@reddit
No I have not. I’m a therapist luckily. I don’t do corporate office
Dry_Yam_4597@reddit
You must have a lot of clients from that area :)
Icy_Distribution_361@reddit
Some. I work in a country with good health care, so we also see a lot of people who wouldn’t be able to pay for it themselves. All layers of society really.
tat_tvam_asshole@reddit
can confirm
En-tro-py@reddit
Well... You don’t get to the top without a little NPD, now, do you…?
rdsf138@reddit
The person is just quite literally praising open source. I have no idea where you read the rest, but Reddit does have the proclivity of character-assassinating people for no good reason.
darthrobe@reddit
I did this today as well. No WiFi necessary.
Pretty_Challenge_634@reddit
Sick, that jet engine gonna be working overtime powering a refrigerator.
the_ai_wizard@reddit
One thing ive not understood w/ open source: how often can we expect them to retrain the model with CURRENT_INFORMATION?
Tons of APIs change, language updates, etc. Do they work well enough with tools like search?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Local_Phenomenon@reddit
I just want to feel the same as he does. It is cool not gonna lie but I know that flagship is where it's at.
Kinky_No_Bit@reddit
Hardware is now catching up to what AI is demanding of it. As we improve the hardware, and the software is improved, each generation will get better. We are barely at gen 2 right now for hardware. Wait till we can have local hardware designed to run AI, built in, and running for a few generations of improvement. Combined with next gen RAM. its gonna be really interesting.
dealingwitholddata@reddit
Is there a guide for setting this up? Haven't done local models since OG llama, right in the llama.cpp CLI
Due_Duck_8472@reddit
But it's all lies ... LIES ... you can't run a model like that with any meaningful productivity, on a small cheap laptop .. IT'S.JUST.NOT.POSSIBLE.YOU. STUPI.....
What is really up with all these false witnesses on this board, spewing out "facts" and pure fantasies .. claiming that "Yes it's possible to outsmart a 1.5T model with a tiny quant of a 27B model".
LIES!
And for what?! The "algorithm"? For likes? For kicks and giggles?
I tried .. it works, horribly slow, and it's stupider than the village idiot.
Plabbi@reddit
He didn't say that it outsmarted Opus, he just said it was getting close to being as good.
And I am running Qwen 3.6 27B Q5_M and it's not slow, I am getting 50 t/s which works fine.
Ok-Employment6772@reddit
Its nowhere near opus, but the small local models are indeed getting very good
AvidCyclist250@reddit
why is the noob overselling?
Neex@reddit
If I wanted to read posts on X.com, I would go to X.com…
ptear@reddit
The masses will likely be using one of the frontier models, including people using Starlink on their flight. At least that's where I think the mainstream will be.
Beginning_Ad1977@reddit
What would be an on-par laptop setup, or the minimal requirements, to get performance comparable to a MacBook Pro on a Linux machine?
jacek2023@reddit (OP)
I don't use a MacBook for LLMs, but I think 3090s are a better choice.
2Norn@reddit
that is a bit sensational but we are getting closer for sure.
new generation nvidia cards will come, new macbook pro, mac studio are right around the corner, models getting better, quantization getting better. tech is moving forward from like 5-6 different routes in parallel.
we are just on the verge. i give it 6 months and then we can have actual sonnet performance on the go forever free.
jacek2023@reddit (OP)
"new generation nvidia cards will come" I was hoping for new GPUs from AMD/Intel
logic_prevails@reddit
Rip battery life though
InterstellarReddit@reddit
Don't worry, you can plug into the shitty airline outlets that give you max 45W or something like that
Acceptable_Drink_434@reddit
Did you know that power outlets and chargers can be used to send data packets? 😈
logic_prevails@reddit
My shit always maxes it out lol
phreaqsi@reddit
nah, it's in airplane mode.
/s
otterquestions@reddit
They think qwen is going to be opensource forever, like it’s a charitable exercise or something.
dwittherford69@reddit
Has bro ever used Opus? This would be closer to Sonnet 3.5
kiwibonga@reddit
Fuck Apple and Fuck Macs.
But that is cool.
jacek2023@reddit (OP)
I don't consider myself an Apple fanboy, see this picture and judge for yourself ;)
https://www.reddit.com/r/LocalLLaMA/comments/1osnnfn/how_to_build_an_ai_computer_version_20/
kiwibonga@reddit
I don't need a flowchart to know I would stab a mac with a screwdriver...
Popular-Factor3553@reddit
What's a quant?
Better-Struggle9958@reddit
One question: do you often work with your LLM on a plane? Do you have space to put your laptop in front of you on the plane? Or is it that business class is trying to prove something to us?
microdave0@reddit
MutinybyMuses@reddit
1000 per second?! That’s rookie numbers
power97992@reddit
Qwen 3.6 27b is probably worse than sonnet 4.6… he is overhyping it but u can get good results with glm 5.1 and ds v4 pro and flash.
BeaveItToLeever@reddit
No probably about it, it's certainly still quite a long way from Sonnet 4.6. If there was a local equivalent of Sonnet 4.6 that could run on a MacBook, that would be an actual revolution. Sonnet is still an incredibly impressive model
power97992@reddit
Q2-Q3 MiniMax 2.7 can run on a MacBook, but at this quant, quality may deteriorate a lot
iamapizza@reddit
This is where we are? Frankly embarrassing to be associated with future LinkedIn lunatics like this.
tken3@reddit
How much Ram would you need for a model like this?
o0genesis0o@reddit
Love that model, but this is so BS that it would cause more harm than good. That model is nowhere near Opus class, and running a dense 27B on whatever M-series chip that MacBook has is going to be crawling in agentic coding, making it unusable.
Informal_Warning_703@reddit
Okay, but saying “monopolistic closed source API” makes you sound like an unhinged ideologue, which calls into question your ability to accurately assess the quality of closed source models compared to open source models… or actually just open weights models.
_lavoisier_@reddit
Compared to Opus? Lol, of course not
Odd-Government8896@reddit
Dude's laptop battery lasted 36 seconds. Long enough to snap the pic. What a fun flight. Hope it was local.
G1fty_14@reddit
I've just been doing some coding with the same setup. I found that for simpler work, it's quite powerful. It did get stuck in some tasks and I had to help it find its way, and in one particular task, I had to do the implementation myself.
With that said, the fact that it's running locally on my laptop and still produces some good stuff is quite incredible
iMrParker@reddit
Lol the 16" MacBook pro fans are loud as hell when doing inference. I can't imagine sitting next to this guy on the plane. I guess the plane would drown out the sound
AshuraBaron@reddit
New feature idea for Airpods Pro 5. ANC specifically tuned to macbook pro fan sound during local LLM use.
Direct_Turn_1484@reddit
Yeah the plane is pretty loud too. Guy's probably sitting in first class too, so a lot of people are going to have fancy noise cancelling headphones.
OopsWrongSubTA@reddit
qwen3.6-27B, pi-dev, llama.cpp/ollama, ... OK.
But how do you 'sandbox' pi-dev ? (I'm a linux user)
ResidentPositive4122@reddit
Devcontainers for 99.99% of usecases. Quick, easy to do per project, integrated w/ vscode and comes with lots of batteries included (port fwd, etc).
micro-vms + some glue for the cases where you're afraid the model is gonna try to find priv esc in the devcontainer :)
Grouchy_Ad_4750@reddit
You could lock it inside a Docker container and only expose it to the dir with the code
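Something like this, as a minimal sketch (the image and paths are just examples): the agent only sees the bind-mounted project directory, and --network none keeps it fully offline, which is fine since inference is local anyway.

```bash
# drop the coding agent into a throwaway container that can only
# touch the current project directory and has no network access
docker run --rm -it \
  --network none \
  --volume "$PWD":/workspace \
  --workdir /workspace \
  node:22-bookworm \
  bash
```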
Fit-Produce420@reddit
This is why we didn't get an open weight 130B dense Gemma 4 that was leaked - it's too good, there's no need to pay per token and it fits on reasonable hardware.
toothpastespiders@reddit
I've been stubborn as hell about not upgrading my system since costs skyrocketed. I'd do it for a 130b dense gemma 4, no question. I'd probably do it for a 130b MoE. I'm loving the 31b. But man, I just keep thinking what the same model bumped up that much would be like.
jacek2023@reddit (OP)
based on my experience with gemma 26B this may be true, 124B was a threat to Gemini
Fit-Produce420@reddit
It was a threat to everyone.
They'll release it when they need something to show off, and it won't be a trillion parameters.
goatchild@reddit
'Most people haven't...' Fuck off
maraluke@reddit
Before it’s revealed this photo is generated by gpt-image-2
Important_Quote_1180@reddit
I'm about to do exactly this but with Qwen3.6 29B A3B. It's a REAP of the 35B MoE. My CC downloaded and one-shot the config, getting 60-80 tok/s, going down to 40 after 200k context. We live in magical times
the_koom_machine@reddit
op unironically takes his news from people who pay for the twitter blue checkmark
ForeverPrior2279@reddit
Is llama.cpp better than omlx for mac?
AXYZE8@reddit
No, oMLX is the best app/engine you can use on a Mac.
If you are wondering about this post - the guy from the post is a co-founder of Hugging Face. Hugging Face acquired GGML (so llama.cpp) 2 months back https://reddit.com/r/LocalLLaMA/comments/1r9vywq/ggmlai_has_got_acquired_by_huggingface/
DarkArtsMastery@reddit
I confirm that. I have been personally transitioning to local-first in the last few weeks and I'd say for 95% of cases local is definitely there with the quality of big proprietary models.
bobaburger@reddit
Great. But tbh, I don't think it's safe and polite to run local LLM on an airplane, mid flight 😂
Crafty-Confidence975@reddit
He’s got the right word there. It “feels” like using a coding agent with a frontier model. Because it doesn’t fail immediately and seems to be doing stuff. But it’s definitely not on the level of the frontier models.
magnus-m@reddit
power consumption and speed is a concern. also agent harnesses speed up by using sub-agents, and therefore need concurrent request support -> more vram/ram needed.
still cool. i have used oss-20b on a plane, but for searching in codex and not agentic coding.
jacek2023@reddit (OP)
"power consumption and speed is a concern" - look at all the people crying about Claude Code limits, it's just a start
ProfessionalJackals@reddit
Copilot just released GPT 5.5 for a 7.5x multiplier. GPT 5.4 was 1x... You think Claude Code people are crying, things are going even more wacky over there. Very sure that the older cheaper models are going to go out the door fast.
stoppableDissolution@reddit
Oh, for sure. It feels like stealing to have 5.4 run for 10-15 minutes at the cost of one request, and it absolutely could not last long, lol. Was nice tho.
jacek2023@reddit (OP)
I use Claude Code for work so I see CC crying. I also pay for ChatGPT Plus so I am familiar with Codex crying :)
magnus-m@reddit
yeah it is getting good. i really like the release of good small/med size model
cosmicr@reddit
/r/linkedinlunatics
rebelSun25@reddit
Sigh. Such a hyperbolic statement, and it encourages others to do the same. I just saw a post on X where someone got qwen to create a simplistic FPS camera walkabout 3D demo and called it a "Complete raycasting engine"
Let's do better