Qwen 3.6 27B overdoing it
Posted by WhatererBlah555@reddit | LocalLLaMA | View on Reddit | 51 comments
Although I'm very impressed with Qwen3.6 and is my most used model, I feel that sometimes it being too proactive and start doing things I didn't ask, from creating tests for the last modification to reverting changes I made - eg removing an hardcoded value - that it thinks are instead useful to keep, and still others.
Are you also getting the same behaviour? If so, how do you counter it? Change the prompt? Use different temperature or other parameters?
ea_man@reddit
For me the problematic model is rather 35B A3B, I started using that coz it's 3x fast yet it spends 3x tokens thinking, wait, I should check again, let's count any banana, omfg l'm a banaaanaaa!
goldcakes@reddit
So every time a model processes a token (input or output, including thinking), the activations change and it's been shown there's 'internal thinking' that emerges.
So it can still be helpful.
ea_man@reddit
Or you can apologetically --reasoning off the chatty bot, that will teach it manners!
Borkato@reddit
I almost never use reasoning on, literally like less than 1% of my prompts have it on. Does it really make that much of a difference? It works great with it off lol
ea_man@reddit
It's a reasoning model, ofc it's been trained to reason through.
Try to make a one page app of some 4k ctx with and without reasoning and see your self.
UniForceMusic@reddit
Qwen is a HELPFUL assistent by default. You can tune him down a little with the system prompt
Weekly_Comfort240@reddit
Naw, just run Qwen Code in YOLO mode on a sandboxed environment and slap it into /plan mode if it gets too frisky.
Durian881@reddit
Make it a reluctant assistant, only do as told and nothing more ;p
UniForceMusic@reddit
"You are a 35 year old developer with a mortgage. You suspect layoffs are coming, but at the same time you don't want to slave away your precious, so you're also quiet quitting. Adjust your motivation and proactivity levels accordingly"
cinnapear@reddit
It's like I'm there.
Xyklone@reddit
Lmfao, this is genuinely hilarious
tuura032@reddit
Lmao
sagiroth@reddit
Dude...
rpkarma@reddit
First of all I didn’t say you could tell everyone about me
Blues520@reddit
Real
Ok-Measurement-1575@reddit
Suddenly, it becomes obvious what system prompt they're using for opus.
goldcakes@reddit
You can Google for system prompts used in harnesses, e.g.
"You are an agent for Claude Code, Anthropic's official CLI for Claude. Given the user's message, you should use the tools available to complete the task. Complete the task fully—don't gold-plate, but don't leave it half-done" ...
The differences just come to different RL and post-training.
ambassadortim@reddit
I actually like it's default behavior.
Complete_Mango7069@reddit
funny
Real-Discussion-7712@reddit
I’ve seen this more with coding agents than pure chat. What helps is making the “permission boundary” explicit: ask it to first list the exact files/actions it intends to touch, and add something like “do not refactor, add tests, or revert unrelated changes unless I explicitly ask.”
Lowering temperature can help a bit, but the bigger difference for me is forcing a plan/checkpoint before edits.
peanutbuttergoodness@reddit
You need an agents.md file.
It should have things like: - When asked a direct question, simply answer the question rather than taking action.
- Do not assume, always validate information/data and ask clarifying questions.
Hot_Turnip_3309@reddit
This is why I use 3.5
datbackup@reddit
This sounds like a harness problem, not a model problem
randomjapaneselearn@reddit
i'm using cline and kinda new to this, i had the same problem of OP, i'm open to tips.
alfirusahmad@reddit
I agree. Some problem are not from the model, it night be from persona, workflow, prompt system
soyalemujica@reddit
Adjust the system prompt, be specific with directions in what can it and it cant do, just don’t prompt it away and expect it to understand what do you need without
randomjapaneselearn@reddit
i'm using cline so i guess that it set a system prompt by itself...
i asked it to plan to refactor code, write an implementation plan and follow it.
in the end it changed some functions logic, added more questions to a list of existing questions.
i was using 35BA3B, i tried again with the same model, same implementation plan and the same prompt that cline generated to "follow the plan" and it didn't do mistakes, then i tried with 27B and it changed things again, i tried Q4, Q5 and few more tests and it sometimes happen, sometimes not...
i'm open to tips also ideas to replace cline
AndThenFlashlights@reddit
Qwen 3.5 and 3.6 desperately want specificity and context, which I kinda like. They're great with clear direction and objectives, but the model will start arguing with itself in the thinking process if it's not sure what the user wants - it's not as proactive about asking the user for clarification the way Claude tends to be. Watching the thinking process actually helped me a lot to better understand what I'm missing in a prompt.
thoquz@reddit
Mind sharing your system prompt?
randomjapaneselearn@reddit
same here but with 35B-A3B, i asked it to refactor code and it changed functions, there was a list of questions and it added a few more questions to the list...
right now i'm using cline, i'm open to tips
Equal_Jellyfish_4771@reddit
yeah same issue here... lowering temp to 0.3-0.5 helps but system prompt is key. what framework are you using it with?
fasti-au@reddit
Run 35b with Speckit, and your problem disappears. 27b is dense and thinks always a3b is MOE, which means it's like instruct, but i i bad ad 'prose' btw PROSE is the token for almost all human language. You use non prose, you don't get 2 rounds of is this words or rules type in your layers......its both behavioural and design.
moe is i have things to do -codex
dense what things do i have to do. -gpt
Truth-Does-Not-Exist@reddit
it seems to self doubt alot and stray a bunch of diffferent directions in it's thinking, it needs a reasoning fine tune
audioen@reddit
I think mostly it is not messing up. Sometimes it does unwanted change. I have to review the agent's work before I can commit it -- if for nothing else than that it doesn't touch any unexpected files. It is rare but it happens often enough that I need to skim through a git diff to be sure that the changes are related to what I actually want done, and that sometimes I find that the agent didn't realize which component I wanted to change and it can have implemented the entire change into wrong file.
I find it rarely going off the rails, unless the code looks on superficial analysis to be completely incorrect, in which case it can helpfully attempt to fix it for you. To stop this, I typically request agent to write documentation explaining why something is done the way it is, so that it will stop trying to change it in the future. If you provide useful documentation, you will help yourself and the context-free agent that later stumbles on the same code and likely concludes again that it's something that it must change.
Project documentation helps. AGENTS.md file can cover exceptions and special cases. It can define coding style, and I find that the agent tries very hard to observe your instructions. At the same time, I advice not making the file extremely long or trying to cover lots of use cases by writing tons of examples, because long system is also counterproductive in terms of polluting the context and inference, and any mistakes in examples or discussion will just confuse the model and degrade performance you get out of the model one way or other. Watching the first reasoning traces after any changes is critical, especially if it suddenly spends dozens of seconds and writes 1000 tokens of reasoning, as this indicates that the model is arguing with itself about what you want or how it should interpret one clause or another.
Potential-Leg-639@reddit
Looks like your plans aren‘t detailled enough or you are starting right away with coding. Plan as long as you can and do it as detailled as possible.
Hefty-Elk-7435@reddit
Sampling parameters.
Turn your temperature down until your agent is only as proactive / imaginative as suits the task in hand. Depending on your inferencing endpoint you can usually pass temperature: as a parameter when you send the request.
--
If it's still a problem, then there's probably an issue with your system prompt.
For a while I had "**Don't ask, just do it**" as part of the session start instructions. The structure of the file headings got slightly messed up and my agent started interpreting it as a general instruction.
Next time you run into excessively proactive behaviour ask your agent to trace what was in their immediate context when they were doing it. Ask them how your L1 files look from their perspective and whether some of the instructions they are seeing are potentially confusing.
xienze@reddit
I don't think this is great advice. Qwen publishes the recommended sampling parameters. Which implies that's what they tested against and felt most comfortable recommending to users.
For something like this, you really just need to update your
AGENTS.mdor similar to be very clear about what your expectations are. And don't just add something like "don't ask, just do it" and think it will be clear and unambiguous to an LLM. Take that simplified set of instructions and ask an LLM how it could be phrased for maximum effectiveness THEN feed that to your model.Also keep in mind that these models are non-deterministic and sometimes just don't follow all your instructions to the letter. It happens, and it probably happens more often when people are running smaller quants and quantized KV cache. Gotta roll with it.
goldcakes@reddit
Asking the LLM to self-trace is an underrated tip. Something like Qwen 3.6 27B has enough intelligence to often (of course, not always) point you in the right direction.
Don't trust what it says fully though, of course. It can be a hallucination.
yes2matt@reddit
I have wondered about temperature, top_p, top_k in an agent-using-tools situation.
I never thought to prompt self-examination.
However, Hermes on Qwen3.5 decided to edit its own config.yaml and I got to spend some time fixing that :/
MT_Carnage@reddit
maybe i need qwen. claude is the opposite. ill give it 5 tasks and it'll decide it needs stop and give me a life story after 2 fixes
goldcakes@reddit
exactly, I'm not sure if it's a bug (I've emptied all memory/configs) or some funky a/b test, but opus will randomly sometimes be ridiculously lazy, I hate how it often tells me to go to sleep or get some rest... even at 10am...
MT_Carnage@reddit
you need to get rest so dario can end white collar jobs smh
Prudent-Ad4509@reddit
I’ve started to inform the harness that btw I’ve modified this and that. This usually prevents it from reverting my changes.
Also, I’ve observed this behavior mostly after reverting certain steps in the session. I’ve switched to forking the session form a particular step instead, but I have not done much testing.
Endurance_Beast@reddit
What are you using with it?
Fair-Television5497@reddit
Every model does that.
This is what helped me:
https://www.reddit.com/r/ClaudeCode/comments/1ta7zbk/karpathys_claudemd_cuts_claude_mistakes_to_11/
Sofakingwetoddead@reddit
No, I don't get the same behavior because I have an instruction packet that is required to be read at the beginning of each new session. I CAN get that behavior if I want, and sometimes I do want it, but my coders have their temperature and top_p tuned to be less exploratory.
Generally, the Qwen team recommendeds 0.6 temperature and 0.95 top_p, and it's perfectly fine for coders. It gives them enough freedom to explore while not hallucinating into worse solutions. But that's ONLY IF you've established a protocol of correct behavior.
Rhetorically - How do you take your complaints and convert them into behavior restraints? How do you tell the coder to read the onboarding packet you created, at the start of every session? Think about it - how do you actually want him to behave? Write-out the behaviors you want into instructions that are required reading at the start of each session. IE - "always clean up" "always test your work" "never assume" "ask questions if....."
That should be enough, alternatively you can adjust the temperature down. At 0.1 the coder will work with blinders on, but if you have a bad implementation, 0.1 coder will have a hard time recognizing the implementation needs to be gutted and replaced. However, he'll try to make the existing code work instead of trying to rewrite.
MaxKruse96@reddit
Using pi.dev and a few skills, as well as a strict workflow depending on task in system prompt helps. Those workflows include:
Idea finding: explore project, give suggestions, report back to user
Bug finding: identify the part of the code that might be bugged, write a failing test that *should* work but currently doesnt, also write edgecase tests in the same swoop + write a "failure with wrong input is expected" type test
Feature: draft out interfaces first, then write tests that satisfy the tests, then write tests
With that rough framework, i manage to shoehorn 3.6 27b Q4 (and Q6, currently testing) to work on 1-3 Tasks at once in the same prompt ("Work on X, Y and Z"), depending on complexity and depth of the tasks of course. Using just normal inference params.
jacek2023@reddit
Agentic coding is the art of creating good rules (AGENTS.md, etc)
askoma@reddit
In my harness it’s totally depends on the prompt and sampling parameters, but it’s true, qwen is like a proactive junior, doing a lot of stuff I didn’t expect.
FullstackSensei@reddit
Models in general amplify gaps in communication. Anything you don't say, the model has to make a probabilistic guess of what that could be.
Don't assume the model thinks like you or knows what's going in your head just because you think that's the rational or logical thing.
Main_Problem_2696@reddit
Lower temperature to 0.2-0.3. Add "only do exactly what I ask, nothing extra" to system prompt. Explicitly tell it which lines not to touch.Used Runable to track prompt tuning experiments. Clean reference sheet in 20 minutes. Made repeating good configs way easier.