llama-server + qwen (code): acknowledges tasks but silently stops working, requiring constant nudging.
Posted by Althar93@reddit | LocalLLaMA | 9 comments
Hey all,
I am new to the world of LLMs, and specifically local LLMs.
I am currently trying to get a stable setup with Qwen Code, using my local llama-server as the provider. The model I am using is 'gemma-4-e2b-it-Q8_0', because it is small & seems to work really well overall.
---
My issue is that when using qwen, I will prompt the model to perform a task. It will usually do the initial legwork & confirm the request, but then more often than not it tells me it is working on the task, when in fact it just stops & goes idle.
I am able to get it unstuck by continuously nudging it to 'continue' or 'resume work', but it keeps going idle again and again.
---
Any ideas or hints as to what might be causing this? Should I be looking at the model I use, some server setup, or could this simply be because my hardware is too weak for this kind of work (I have an RX 6700XT)?
Althar93@reddit (OP)
I ended up switching to a different model instead (Qwen3.5 9B), which fits on my GPU and is fast enough.
Speed-wise, it is nothing like running Claude Code Pro, but it seems stable enough & I can have it run in the background whilst I work on something else. Moreover, unlike Gemma 4, it does not give up, at least not until I blow through the context.
Regarding blowing through the context, I know I can clear it manually & segment the work, but I'll have to do some research, as I am interested to know if there are ways to have the SLM/LLM do this itself, perhaps synthesizing/condensing the context at regular intervals and force-clearing its own context before resuming.
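A minimal sketch of that idea, assuming an OpenAI-style list of chat messages. The `summarize` function here is a stand-in (a real setup would call the model itself to condense the old messages), and the token count is a crude word-based estimate:

```python
# Sketch: periodically condense old chat history so an agent can keep
# running without blowing through the context window.

def estimate_tokens(messages):
    # Crude approximation: count whitespace-separated words.
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder: a real implementation would ask the LLM to write
    # a short summary of these messages and return its reply.
    return "Summary of %d earlier messages." % len(messages)

def compact_history(messages, budget=2000, keep_recent=4):
    """If the history exceeds `budget` tokens, replace everything but
    the last `keep_recent` messages with a single summary message."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

Run between turns, this keeps the recent exchange intact while collapsing everything older into one message, which is roughly what tools mean by "context compaction".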
CommonPurpose1969@reddit
Gemma 4 E2B and E4B both keep doing exactly that. They give up fast. Too fast. They describe the task, but won't execute it. Or try, fail, and then ask for help.
Even with the updated models, the latest fixes in llama.cpp, and the latest chat scripts, it still happens.
You need to let the SLM plan the tasks, update the plan with each executed task, and have it (using ReACT?) evaluate the plan to decide whether it should stop because it has the final answer, or be automatically nudged to continue until it is done.
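A rough sketch of that outer loop, with a stub in place of the real model call. In practice `call_model` would POST the message list to llama-server's OpenAI-compatible chat endpoint; the `TASK_COMPLETE` marker and the "continue" nudge text are assumptions, not anything prescribed by ReACT:

```python
# Sketch: an outer loop that keeps nudging the model until it signals
# completion, instead of relying on the model to keep going by itself.

DONE_MARKER = "TASK_COMPLETE"

def run_until_done(call_model, task, max_turns=10):
    messages = [
        {"role": "system",
         "content": "Work through the task step by step. "
                    "Reply with %s when everything is finished." % DONE_MARKER},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if DONE_MARKER in reply:
            return messages  # model says it has the final answer
        # Model went quiet: nudge it automatically instead of by hand.
        messages.append({"role": "user", "content": "continue"})
    return messages  # gave up after max_turns nudges
```

A fuller ReACT-style version would also keep a task plan in the prompt and have the model update it each turn, but even this bare loop replaces the manual 'continue' nudging described above.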
Althar93@reddit (OP)
Ok, interesting. A lot of acronyms and concepts I know nothing about, so I'll need to get reading, but that does make sense. Thanks!
Althar93@reddit (OP)
Or I could try a larger model (which I can afford to run on my CPU, though it would be a lot slower).
CommonPurpose1969@reddit
You'd be surprised at what you can achieve with a 4B model, ReACT, and proper planning.
Althar93@reddit (OP)
That's fair enough. At this point I mainly wanted to understand what was happening and why the task fails silently.
I'll definitely play around.
eugene20@reddit
Gemma support has had numerous bugs and many llama.cpp updates, so make sure you have the latest builds of everything. Even the models themselves were re-uploaded recently to fix a chat template (though I read you can use a Python script to merge that change in instead).
Konamicoder@reddit
I suggest taking the raw output of the model from your llama-server logs, pasting it into your LLM chat window, explaining the issue you are experiencing, and asking your LLM to analyze the root cause. That should give you some direction on how to fix it. Most likely you will need to tweak settings in your instruction template to optimize for your particular model.
Althar93@reddit (OP)
Thanks, I'll try that.
I did enable --verbose mode on my llama-server, and I can see that it has completed all tasks, as far as the server is aware.