Doing real coding work locally for the first time
Posted by mouseofcatofschrodi@reddit | LocalLLaMA | View on Reddit | 43 comments
I thought it would take way longer (and a MacBook of the future) to do real coding locally. But it is happening in front of my eyes right now!
I'm using qwen3.5 35b (mlx 4bit, running on omlx). It is not comparable to the big models, but it is the first that is starting to cross the line into being productive agentically. It is intelligent enough not only to answer in a chat, but to solve problems, write code, and use tools. And it is FAST.
The other part of the equation is how to give it the powers to do agentic tasks. Most tools I've tried (claude code, opencode, codex cli, etc.) rely on gigantic prompt injections. They are so heavy that prompt processing takes ages and RAM explodes. So I thought I wouldn't be able to use any local model agentically until I get a new laptop. Maybe with an M7 or M8 lol.
But then I started testing pi (pi.dev), and with it I've already been able to complete 3 real tickets on a real project!
It seems very efficient at understanding the project and reading only the necessary code. It did one ticket in one shot, consuming only around 7K tokens!!
For the other 2 I had to paste back some errors from the browser console (I guess this could improve by adding a rule to check with Playwright before finishing the task).
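That "check the browser console before finishing" rule could be sketched like this. This is a minimal sketch, assuming Playwright's Python sync API; the URL is a placeholder, and the wiring into the agent is left out:

```python
# Sketch: collect browser console errors after loading a page, so the
# agent can self-check instead of me pasting errors back by hand.


def filter_console_errors(messages: list[tuple[str, str]]) -> list[str]:
    """Keep only the text of error-level messages from (type, text) pairs."""
    return [text for kind, text in messages if kind == "error"]


def collect_console_messages(url: str) -> list[tuple[str, str]]:
    """Load `url` headlessly with Playwright and record all console messages."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    messages: list[tuple[str, str]] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Record every console message (type and text) as it arrives.
        page.on("console", lambda msg: messages.append((msg.type, msg.text)))
        page.goto(url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return messages


# Usage (placeholder URL):
#   errors = filter_console_errors(collect_console_messages("http://localhost:3000"))
```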
The only annoying problem so far is that qwen3.6 sometimes starts looping in its thinking. I'm using the official sampling settings for coding with reasoning:
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Also I have 126K context configured in omlx. Maybe the problem is the 4-bit mlx quant?
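For reference, those sampling settings translated into a chat request look roughly like this. A minimal sketch assuming omlx exposes an OpenAI-style chat endpoint; the model id and everything outside the sampling params are my assumptions, not something from omlx docs:

```python
# The official qwen3.6 thinking-mode sampling settings for coding.
THINKING_CODING_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,
}


def build_request(prompt: str, model: str = "qwen3.6-35b-mlx-4bit") -> dict:
    """Merge the sampling params into a chat-completions-style request body."""
    return {
        "model": model,  # placeholder id, adjust to however omlx names it
        "messages": [{"role": "user", "content": prompt}],
        **THINKING_CODING_SAMPLING,
    }


# Usage: POST the result as JSON to the local server, e.g.
#   requests.post("http://localhost:10240/v1/chat/completions",
#                 json=build_request("Fix the failing test"))
# (port and endpoint path are placeholders)
```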
nmqanh@reddit
I used the 8-bit mlx quant and still had loop problems at temp=0.6. After testing for a while, temp=1 gives me few to almost no loops now.
mouseofcatofschrodi@reddit (OP)
Will test that
fathergoat_adventure@reddit
How'd you make out with temp=1?
mouseofcatofschrodi@reddit (OP)
I tried it; so far it still loops :(
The funny thing is, many times qwen3.6 35B actually SOLVES the ticket and the task is fully completed, but then it keeps thinking and ends up looping after the coding is done.
fathergoat_adventure@reddit
I've had the same experience with Qwen3.6 35B.
mouseofcatofschrodi@reddit (OP)
you got any solution??
fathergoat_adventure@reddit
Lots of cursing; it doesn't fix the loop issue, but it makes me feel a little better.
Clean_Initial_9618@reddit
What's the difference between mlx models and gguf? Is one better?
mouseofcatofschrodi@reddit (OP)
https://famstack.dev/guides/mlx-vs-gguf-apple-silicon/#community-update-what-was-actually-going-on-with-qwen35-a3b
isugimpy@reddit
People are going to give you all kinds of suggestions here, but the one I'll give is to switch to Qwen3.6. It's a distinct improvement over 3.5, particularly for coding.
Maleficent-Ad5999@reddit
How much better is it compared to the 3.5 27B model for coding?
mouseofcatofschrodi@reddit (OP)
I didn't test the 3.5 27B. From what I read they should have similar performance. Anyway, now we have the 3.6 27B, so looking forward to the 3.7 35B!
mouseofcatofschrodi@reddit (OP)
oops, I meant that I was already using that model. It was a typo from writing too fast (it is also 35B...)
Icy_Host_1975@reddit
the context explosion from tools like claude code/opencode is mostly their scaffolding: system prompts, tool schemas, and file trees all injected before your first token. for the playwright browser-console check specifically, playwright MCP dumps the full a11y tree each step, which wrecks local context fast. vibe browser runs as an MCP server inside your actual logged-in browser and only sends ranked interactive elements, so the per-step token cost is a fraction of full playwright. vibebrowser.app/mcp
mouseofcatofschrodi@reddit (OP)
For playwright I would say you don't need an MCP. If the model has terminal access, it can use it. Is the vibe browser MCP more efficient than the model using playwright directly from the terminal?
Chupa-Skrull@reddit
no it's not, this is a shitty bot
mouseofcatofschrodi@reddit (OP)
thanks :)
themoregames@reddit
Would love to try it for a week with a DGX Spark.
benevbright@reddit
feel free to try my tiny tool. Pi is great for sure, but Pi inserts the thinking blocks into the context, so the context bloats super quickly. https://www.npmjs.com/package/ai-agent-test . this one is just focused on staying really simple/small.
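Dropping thinking blocks from the history could look something like this. A minimal sketch of the idea only; the `<think>` tag name and the message shape are assumptions, not how ai-agent-test actually does it:

```python
import re

# Matches a reasoning span plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(messages: list[dict]) -> list[dict]:
    """Return a copy of the chat history with reasoning spans removed
    from assistant messages; other roles pass through untouched."""
    cleaned = []
    for m in messages:
        if m.get("role") == "assistant":
            # Copy the message so the original history stays intact.
            m = {**m, "content": THINK_RE.sub("", m.get("content", ""))}
        cleaned.append(m)
    return cleaned
```

Resending only the stripped copy each turn keeps the context from growing with every reasoning span the model emits.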
themoregames@reddit
Sounds like a bug?
mouseofcatofschrodi@reddit (OP)
wow, there are a lot of tools being developed right now. Just this week I heard of kon and late too. What does your tool do differently from pi (besides not keeping the thinking block in the context)?
benevbright@reddit
And I have a feeling that Pi is not meant for small/mid models like qwen3.6 (in the author's articles he never mentions local models or anything). You'll notice that the context runs out of space very quickly.
mouseofcatofschrodi@reddit (OP)
thanks for answering. Somehow in my mind I'm starting to think we not only need benchmarks on models, but on:
- quants (different ggufs, different mlx): speed, context and RAM usage, tool use, etc.
- BUT ALSO the tooling around them (harnesses, agentic tools and so on): steps to solve problems, how they rank, how they use tokens and context, etc. In the last days I've been trying to test so many things that it is a mess (omlx vs LM Studio; quants against quants; tools against tools).
benevbright@reddit
definitely. And also Mac (slow but large RAM; can't really run dense models) vs Nvidia (fast but small RAM; can run dense).
benevbright@reddit
Pi has already become a giant, and it's becoming the backbone of a lot of agentic software, not only coding apps: ecosystem, extensions, and so on. The code base has also gotten pretty big already. Mine is just a simple toy; you can come read the code and see how it works instantly. :) But it also works for professional coding work lol.
benevbright@reddit
I'm also very happy with qwen3.6-35b btw. (8bit)
RMK137@reddit
Not my project, but I've been keeping an eye on this coding agent written in Go.
https://github.com/mlhher/late
puncia@reddit
This is really good, I've been using it the last few days alongside qwen3.6 and got very decent results
mouseofcatofschrodi@reddit (OP)
how does it compare to pi for you? I tried Late and liked the concept of ephemeral agents executing the steps of a plan (to reduce context usage), but so far pi seems more mature and efficient.
puncia@reddit
Haven't tried it yet, but I was actually planning to have it generate the same kind of project I did with Late and compare them that way. It will take some time first though.
mouseofcatofschrodi@reddit (OP)
would be glad to know when/if you do!
einmaulwurf@reddit
What device with how much RAM are you using?
mouseofcatofschrodi@reddit (OP)
MacBook Pro (M3 Pro) with 36 GB unified memory, running it with omlx.
Any ideas on why the loops?
_hephaestus@reddit
Are you setting preserve_thinking to true? It's a new qwen3.6 flag that needs to be set for the model to recognize that it already thought something through.
mouseofcatofschrodi@reddit (OP)
yes, in omlx. Tried with and without "force" and got the loops anyway. I wonder if it is indeed related to this, since the loops occur in the thinking.
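For reference, this is roughly what my setup looks like. A hypothetical config sketch only; apart from preserve_thinking and the 126K context mentioned above, every field name here is an assumption about omlx's config shape, not its documented format:

```json
{
  "model": "qwen3.6-35b-mlx-4bit",
  "context_length": 126000,
  "preserve_thinking": true
}
```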
fail_violently@reddit
What are the specs of your machine?
mouseofcatofschrodi@reddit (OP)
MacBook Pro, M3 Pro, 36GB
No-Mountain3817@reddit
Switch to Qwen 3.6, and you will definitely see the improvement.
sinevilson@reddit
The llama.cpp code changes pushed on the 18th, yeah those'll put an end to operational workflows that just worked. Too many variables renamed/removed; things don't work any more. Have to update all that now, in every workflow. Oh wait! What are we talking about again?
Dany0@reddit
Try running one of the REAPs. If there isn't an mlx version available, there are ggufs out there.
Performance on English coding and general knowledge should be unaffected; just multilingual, creative writing, and maybe EQ take a hit.
BidWestern1056@reddit
try out npcsh and incognide as well :)
https://github.com/npc-worldwide/npcsh
https://github.com/npc-worldwide/incognide
the-xero@reddit
Try the unsloth UD 4-bit mlx quant... it's better!
mouseofcatofschrodi@reddit (OP)
downloading it :)