Newbie vibe coding experience: Shifting from Claude Sonnet 4.6 to Qwen3.6-35B-A3B-UD-Q6_K

Posted by sooki10@reddit | LocalLLaMA | View on Reddit | 32 comments

This is really just a post for those with shallow understanding of all this stuff, those not yet ready or capable of diving into the deeper end of vibe coding/llms. It might not be a helpful post for anyone more advanced than that.

I have been working on a Python Pygame project for about two months. It is now sitting at roughly 30k lines of code across 55 modules. I have been using Visual Studio Code, Copilot Pro+, and around three times the cost of pro+ in additional premium requests per month.

I initially started with Claude Opus, which was brilliant, but it became too expensive. I then moved to Claude Sonnet 4.6, which worked reasonably well at first. But over time I started seeing more and more messages like, “Sorry, the response hit the length limit. Please rephrase your prompt.” It also began struggling to resolve some bugs, even after many prompt attempts.

Generally, the thinking and reasoning periods seemed to get longer without producing useful outcomes, which meant tokens were being spent for very little return. I tried several ways to minimise this, but the same issues kept coming up.

I decided to install Ollama and Cline and use Qwen3.6... which has been going really well. It has already solved a few bugs that Sonnet seemed unable to resolve. I do need to be more mindful with prompts and context window management, but that feels like less of an obstacle than the issues I was having with Sonnet.

When my Copilot Pro+ allowance refreshes, I plan to get Claude Opus to review the code and give me a sense of how well Qwen3.6 has handled things. If the review is positive, I think that may be the end of my Copilot subscription for now.

I also want to acknowledge that before leaving Opus, I used it to modularise the program from one large monolithic Python file into smaller files and modules, with each file responsible for a specific part of the game. I think that made a big difference and helped both Sonnet and Qwen3.6 work much more effectively. For any newbie coders, I do think there is good merit in getting Claude Opus to setup and structure your program initially.

For context, my hardware is probably above average, with a 5090 and a 4000 Pro (56 GB of VRAM) , running a 250k context on Qwen3.6 within Cline.

[-]

uti24@reddit

So copilot has this free tier, like 20 low-end requests for a months, with Claude Haiku (even worse then Sonnet), but it is there.

And comparing it to Qwen3.6-35B/27B, even Claude Haiku looks better. I had a problem refactoring 1000 line js file into separate files, I spend couple of hours with 27B to do that step by step, and copilot free tier Claude Haiku done that in a single request in like, 5 minutes. So experience could be different.

Qwen3.6 27/35 is first local models feels good enough to somewhat substitute paid cloud services, but still, not as good as even simpler models.

[-]

scslmd@reddit

1000 lines takes a few hours? What's your pp and tps looking like?

[-]

uti24@reddit

tg about 20t/s, pp I don't know, whatever 3090 doing. It's not only about raw speed, it was not able to refactor whole file by itself, I had to steer it

[-]

scslmd@reddit

Perfectly understandable, I have to give it really good guard rails for any refactoring or features. Are you on MTP? Should almost double that tg to 35-40.

[-]

uti24@reddit

No, I’m using LMStudio, and it didn’t support it a week ago. Well, it still doesn’t.

Also, speed mostly wasn’t a problem. Even after creating a plan with detailed steps (like 5 steps, mostly moving and reusing things) and executing those simpler steps in separate sessions, it still broke somewhat often or didn’t do what was expected (for example, it started failing to execute commands, looping and even eracing whole step when 80% was done and looks ok), so I had to rerun steps too. It wasn’t too bad though, I didn’t even notice it until about 2 hours into the process.

[-]

scslmd@reddit

Yes, hallucinations can be a problem. Honestly I was hesitant to change to llama.cpp at first but it is so much faster compared to LM studio. Here is a snippet of the instructions I use:

Follow these strict constraints and execution steps. Do not compress, truncate, or omit any existing logic, edge cases, or comments.

1. Refactoring Rules

Split the monolithic file into smaller, logically distinct files based on single-responsibility principles (e.g., separating core logic, types/interfaces, helpers, and data handling).
Maximum Function Length: No single function or method may exceed 70 lines of code. Split complex functions into smaller, descriptive helper functions.

2. Output Format (Atomic Task Plan)

Instead of dumping all the code at once, break the refactoring down into a sequence of atomic, deterministic tasks. For each proposed new file, output an independent task block using this exact format:

TASK [X]: Create `[Filename]`

Purpose: [Brief 1-sentence description of this file's responsibility]
Expected Endpoints / Exports: [List specific function names, classes, or interfaces exported by this file]
Dependencies: [List which other newly created files or external libraries this file imports]

[-]

uti24@reddit

Is it working for you? I have notice, after 30k tokens context it starting to drift somewhat and after 60k it feels almost a wall for me. And for refactoring like that, in a single step it need at least 100k.

1000 line file already about 20k tokens, i'd say? and agent needs to read it multiple times during refactoring.

Expected Endpoints / Exports: [List specific function names, classes, or interfaces exported by this file]

Ahh, but that can be done during the second stage, when executing smaller steps?

[-]

scslmd@reddit

Clears context after every task or use subagents. Because your breaking it down, you're using small ctx size only so there is less chance of drift. Your not going after 1000 lines at one time.

YourNightmar31@reddit

With that amount of vram you should probably run Qwen3.6 27B instead.

JaredsBored@reddit

And sell the RTX PRO 4000, and get a second RTX 5090. You're stuck with an uneven split llama.cpp only setup, but if you go matching pair 5090s, you upgrade to vLLM. That's a stupid fast setup with 27B at Fp8

TheWaffleKingg@reddit

How does that setup look with 2x 3090s? Got mine in last week and ive been using llama.cpp. its gret but the speed leaves a little to be desired

dinerburgeryum@reddit

Sell it to me if you’re in the US. I’m serious DM me lol

I know you mean OP but dude I have an Mi50 posted for sale on another subreddit and you had me so hyped thinking I had a buyer when I saw this notification

philmarcracken@reddit

Yep, slower is fast if the model is more bigly smart(like me).

trialbuterror@reddit

Get low tokens in 9060xt 16gb and 48gb ddr4 ubuntu 24.04

JLeonsarmiento@reddit

I use Zai GLM-5.1 to plan everything, let my local Qwen3.6-35b Q4 implement everything, then call GLM again if something is not working as expected. Balance cheap powerful API with unlimited local is my fórmula today.

Otherwise-Director17@reddit

Definitely run 27B instead, just as fast with mtp and uses less VRAM for significantly better quality.

Miserable-Dare5090@reddit

I dont know if I would say a 27B dense model can ever be “just as fast” as a model with 1/10 the active parameters. It can be 1/10 as fast, for sure, and maybe even 1/2 the speed, but the same? Logically impossible.

On a RTX 5090 I get around 150-180 tokens per second for both at int4. MTP spec decoding on 27b and no spec decoding on 35b

Calm-Republic9370@reddit

It sounds to me like you are running too large for a scope of a question. But i have really enjoyed 27B with 128K context and i work on large projects. Just not the entire project at the same time.

danihend@reddit

Codex offers free tier use - you can use that to review. Gemini also has free tier on API.Then there's Openrouter, Opencode, Kilocode, probably more. All have free tier to some degree. Worth sticking them in the mix here and there for review/implementation and comparison. Definitely wouldn't rely on 3 Qwen 35 fkr all work as it's just not reliable. It will completely invent things if given half a chance. Fantastic Model still, and I hear 27b is even better.

Creative-Type9411@reddit

you can run q8 with f16 cache, use and MTP version it will make your speeds fly

ai-christianson@reddit

That is a solid frame. The gap is rarely just raw parameter count. It is about how you structure the interaction. When you treat the model as an executable, you stop relying on ad-hoc prompting and start passing structured context. Things like project structure files, explicit tool definitions, and persistent memory/state files change the output quality more than the model size does. The hybrid workflow works because you use the cloud model for the heavy architectural lift and then hand off to the local model with that structure already in place. It makes the local run reliable instead of fragile.

IgnisIason@reddit

Are you doing this professionally? Honestly I think most people just prefer using frontier models for serious work. Local models are fun to play with but I'm having trouble finding a practical use.

haragon@reddit

What does the UD mean?

blaz3d7@reddit

Unsloth Dynamic?

Ty!

SnooPeripherals5499@reddit

Yep

smicky@reddit

This is super helpful. I’m in a similar boat…setting up my local stack with a RTX3090 (24gb) with a primary purpose of vibe coding some basic web apps but also a more in-depth trading algorithm and automation.

The way you are approaching this is the way I was planning on but hadn’t really heard from someone on results. On the trading app, I keep running into the credit limit…but prepping it with Claude and then turning it over to my local stack for the actual coding is where I think the sweet spot will be.

messydata_nerd@reddit

The modularisation step you did with Opus before switching is honestly underrated advice. I mostly heard people try to squeeze everything into one giant context and then wonder why the model loses track :)) so breaking it into smaller focused files really changes everything

Some-Cauliflower4902@reddit

Newbie-ish but have been vibe coding even before the term was coined. With the last generation of cloud models, large code files never end well. This will apply to Qwen3.6, which could compare to sonnet 3.x for my use case. For me, if I don’t have time then Opus is the go to. Smaller things like mini games, web crawler, data extraction scripts, any job related & privacy sensitive things, all stay local.

RefactorEverything@reddit

I think you need to evaluate how you're using each model. Its not just throw random question in and get result, and some need a bit more guidance. Context matters, tooling (MCP, memory) matters. Once you start thinking of a model as an executable, and provide it the right parameters, everything starts to change.

1. Refactoring Rules

2. Output Format (Atomic Task Plan)

TASK [X]: Create [Filename]

TASK [X]: Create `[Filename]`