Qwen3.6-35B - Terrible instruction following when using context files (with vanilla pi-agent). Model issue or am I doing something wrong?
Posted by FusionX@reddit | LocalLLaMA | View on Reddit | 7 comments
First of all, I am really impressed with Qwen 35B's first-class agentic behaviour and tool-calling support. I've been exploring it for general tasks where I prompt the model to research and analyze using tool calls and scripts, and it has exceeded my expectations. Until now.
During some of the runs, I noticed a few common mistakes that kept cropping up due to the nature of the task itself. Nothing that an AGENTS.md couldn't fix. So, I added a couple of (3-4) simple instructions to address them. Here is where things go wrong: the model completely IGNORES these prior instructions unless I explicitly remind it during the actual chat. (Yes, the context file is pre-filled; I confirmed that.)
Example:
- AGENTS.md instruction: Never read a file directly into the context window without knowing its size. A large file might overload the context window.
- User prompt: explore list.txt and analyze.
- Result: It tries to read list.txt directly without bothering to check the size.
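For reference, the behavior the instruction asks for boils down to a size check before reading. A minimal sketch in Python (the threshold and helper name are mine, purely for illustration):

```python
import os

MAX_BYTES = 32_000  # hypothetical threshold; tune to your context budget

def read_if_small(path: str):
    """Read a file into the context only if it is under the size threshold."""
    if os.path.getsize(path) > MAX_BYTES:
        # Too large: the agent should summarize, sample, or page through it instead
        return None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```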
Am I doing something wrong? I'm really betting on this being a configuration issue because the model had otherwise been exceeding my expectations. I tried a lot of things, from changing quants to removing llama.cpp params, to find the culprit, but nothing has helped so far.
Setup:
bartowski's Qwen3.6-35B-Q5_K_L with officially recommended sampling parameters for general tasks (tried coding params too, same result), and latest llama.cpp build on linux with CUDA 13.2
llama-server --model models/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF/Qwen_Qwen3.6-35B-A3B-Q5_K_L.gguf -fitt 128 -fa on --jinja --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' -ctk q8_0 -ctv q8_0 -c 128000
Using it with (latest) vanilla pi coding agent.
PhilippeEiffel@reddit
Maybe add:
--chat-template-kwargs '{"enable_thinking": true}' \
--reasoning on \
FusionX@reddit (OP)
Nah, still didn't work, even with a positive system prompt. Cloud models work without any issue in the same setup.
It looks like the reasoning is much shorter in pi (compared to directly using it through llama-server), but I don't yet know why.
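One way to compare the two paths is to send the same system + user messages straight to llama-server's OpenAI-compatible /v1/chat/completions endpoint and check whether the AGENTS.md text actually arrives as the system message. A sketch (host/port, sampling values, and message text are placeholders):

```python
import json
from urllib import request

def build_payload(system_text: str, user_text: str) -> dict:
    """Mirror what the agent should be sending: AGENTS.md content as the system message."""
    return {
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.6,
        "top_p": 0.95,
    }

def chat(payload: dict, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If the direct call honors the instructions but the agent doesn't, the agent is likely mangling or truncating the system message (or its reasoning settings) before it reaches the server.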
NigaTroubles@reddit
Maybe your instructions are the problem. Also try modifying model settings, temperature, etc.
Serprotease@reddit
An old, yet still valid, rule for small-ish models -> Do not use “negative” instructions.
“Don’t do that…” phrasing needs to be avoided.
Prefer: “Files smaller than X should be read directly into the context. Files bigger than X should … Use fileInfo.py to get information about file size.”
This also works for bigger models.
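The comment doesn't show what fileInfo.py looks like; a minimal sketch of such a helper could be:

```python
#!/usr/bin/env python3
"""Print basic file info so the agent can decide how to read a file."""
import os
import sys

def file_info(path: str) -> dict:
    """Return the path and size in bytes for the given file."""
    st = os.stat(path)
    return {"path": path, "bytes": st.st_size}

if __name__ == "__main__" and len(sys.argv) > 1:
    info = file_info(sys.argv[1])
    print(f"{info['path']}: {info['bytes']} bytes")
```

The agent can then be told to run this script first and choose how to read the file based on the reported size.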
Long_comment_san@reddit
I forget this rule too often. Just went to edit my system prompt and it fixed a lot of my issues.
SimilarWarthog8393@reddit
This model seems to fall into loops more frequently than 3.5, even with the same parameters, so I went back to 3.5.
Purpose-Effective@reddit
I’m using the same model, but with my own quant.
It sounds like it could be one of two things. First, you might have it on nothink. There is a setting to switch the model between think and nothink; I have it set up to use think or nothink when I specifically say so, so it switches automatically.
It could also be your context window. If you’re giving it too little context it won’t work properly. I use llama-server to extend the context to 1M tokens and use a memory system that’s basically an improved version of OpenViking, which was originally built for openclaw. That way I keep coherence near the limit of the context window.
Also, Qwen 3.6 Plus from the free chats on their website is the best you can get for debugging anything related to their models.