Qwen3.5-9B is actually quite good for agentic coding
Posted by Lualcala@reddit | LocalLLaMA | 142 comments
I have to admit I am quite impressed. My hardware is an Nvidia Geforce RTX 3060 with 12 GB VRAM so it's quite limited. I have been "model-hopping" to see what works best for me.
I mainly did my tests with Kilo Code, but sometimes I tried Roo Code as well.
Originally I used a customized Qwen 2.5 Coder for tool calls. It was relatively fast but would usually fail at tool calls.
Then I tested multiple Unsloth quantizations of Qwen 3 Coder. The 1-bit quants were also relatively fast but usually failed at tool calls as well. However, I've been using UD-TQ1_0 for code completion with Continue and it has been quite good, better than what I got from the smaller Qwen 2.5 Coder models. The 2-bit quants worked a little better (they would still fail sometimes), but they started feeling really slow and kind of unstable.
Then, similarly to my original tests with Qwen 2.5, I tried a version of Qwen3 also optimized for tool calls (14B). My experience was significantly better but still a bit slow; I should probably have gone with 8B instead. I noticed that these general Qwen versions that are not optimized for coding worked better for me, probably because they were smaller and fit better, so instead of trying Qwen3-8B I went with Qwen3.5-9B, and this is where I got really surprised.
I finally had the agent working for more than an hour, doing fairly significant work and capable of going on by itself without getting stuck.
I know every setup is different, but if you are running on consumer hardware with limited VRAM, I think this represents amazing progress.
TL;DR: Qwen 3.5 (9B) with 12 GB VRAM actually works very well for agentic tool calls. Unsloth Qwen3 Coder 30B UD-TQ1_0 is good for code completion.
cogitech2@reddit
I'm running the same video card and I agree 100%. Qwen3.5-9b is the clear winner for agentic work on this card.
My setup is Qwen3.5-9B-UD-Q6_K_XL running with 128k context (KV = Q8), flash attention on. 1GB of VRAM free for comfort. Hermes Agent runs great on this setup. I have auto context compaction set at 75%. When it hits that limit, there is a significant delay, but that's better than just crashing or looping or whatever, and I usually don't get to that point anyway.
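If anyone wants to reproduce something close to this, a llama.cpp-style launch would look roughly like the sketch below; the model filename, layer count and port are placeholders, not my literal command:
# rough sketch of the setup above (128k context, Q8 KV cache, flash attention on)
llama-server \
  -m Qwen3.5-9B-UD-Q6_K_XL.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn on \
  -ngl 99 \
  --port 8080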
I tried a lot of models. A LOT. Nothing comes close in terms of VRAM efficiency. Any other model in this size range uses VRAM (for context) way, way less efficiently and forces you down to 16-32k context which is virtually useless for agentic AI, IMO.
The combination of Q6 for the model and Q8 for KV is solid AF. Robotic, but solid. I don't know about you, but for agent work I want Spock, not Timothy Leary or Aldous Huxley.
ea_man@reddit
I use the standard Qwen3.5-35B-A3B with my 12GB 6700 XT. It gives me 30 tok/sec (no thinking) while the 9B gives me 40, so I guess that with 12GB of VRAM a MoE is the best thing. I can run it with around 40k context and it usually manages to edit / apply code.
Also, as a generalist LM it works better for learning / explaining.
zuubumafuma@reddit
What will this do?
juandann@reddit
how much system RAM do you have?
MikeSouto@reddit
do you mind sharing the command? I'm getting 22 with a 6800XT (vulkan backend) using the MXFP4_MOE
ea_man@reddit
Qwen_Qwen3.5-35B-A3B-GGUF
For reference, serving 9B omnicode:
eval time = 235.03 ms / 11 tokens ( 21.37 ms per token, 46.80 tokens per second)
You see that you can use a much bigger --ctx-size (13276 there), yet if you wanna go fast a 4k ctx will give you speed.
MikeSouto@reddit
Thanks! yesterday I got almost 40 with 65k context using the llama.cpp ui
command:
llama-server \
  -hf bartowski/Qwen_Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --mlock \
  --cache-ram 0 \
  --ctx-size 65536 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --fit on \
  --flash-attn on \
  --parallel 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --device Vulkan0 \
  --host 0.0.0.0 \
  --port 80 \
  --threads 8
ea_man@reddit
Aye, most comes from --fit.
If you use an Unsloth quant you may get one or two tokens more, yet barto is more straight for code.
kastaldi@reddit
which quant do you use with Qwen3.5-35B-A3B ? I have a RTX 3060 12GB...
ea_man@reddit
I use Qwen3.5-35B-A3B-Q4_K_M
kastaldi@reddit
Thanks, I prefer starting from a specific quantization and finding the sweet spot instead of downloading a lot of GBs blindfolded.
ea_man@reddit
No prob. If you use software like LM Studio, there's usually a recommended / favourite version in the listing.
Q4 is usually good for low memory, if it's a small model you can go Q6, then it's up to how much context size you wanna have available.
nullmove@reddit
Saw this earlier: https://huggingface.co/Tesslate/OmniCoder-9B-GGUF
Might be of interest to you.
Lualcala@reddit (OP)
Thanks! Will definitely take a look
yay-iviss@reddit
When you test it, share your results.
I have also been positive in my tests of Qwen3.5; even the 0.8B has made good tool calls, using some MCPs to do web search or control a browser with Playwright.
And the 9B has made a website, not as good and beautiful as frontier models or even GLM 4.7, but it finished the job.
fulgencio_batista@reddit
How are you getting 0.8B to make tool calls? Mine loops infinitely unless I disable thinking
yay-iviss@reddit
The loop happens at every size for me, but it stops looping if I set the temperature higher. I'm using LM Studio, where the default temperature is 0.1 and that causes the looping behaviour; setting it to 0.5 resolves it.
But I have heard that llama.cpp with the right parameters should work better.
StartupTim@reddit
Hey there, if you don't mind, could you explain how a higher temp helps with this?
Thanks
MathematicianWise999@reddit
Hi, my friend posted this photo from his very short testing of OmniCoder-9B-GGUF
He said it was installed by Zed for him, and he wasn't aware of the prompt Zed put in it, but as you can see it produced quite a prolonged, complicated, weird-looking hallucination with very strange recommendations at the end...
So my friend decided to immediately uninstall the bugger.
Cheers.
SidneyFong@reddit
Not an expert, but higher temperatures make the inference engine more likely to choose lower-probability tokens. So if the model tends to loop, the higher temperature gives the inference process some chance to choose a different token instead of the highest-probability one and break the loop.
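A toy Python sketch of the math (not any particular engine's code), just to show how temperature flattens the token distribution:
import math

def token_probs(logits, temperature):
    # Logits are divided by temperature before the softmax:
    # low T sharpens the distribution, high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

logits = [5.0, 3.0, 1.0]         # the "loop" token vs. two alternatives
print(token_probs(logits, 0.1))  # [1.0, 0.0, 0.0]       -> almost always repeats
print(token_probs(logits, 0.5))  # [0.982, 0.018, 0.0]
print(token_probs(logits, 1.0))  # [0.867, 0.117, 0.016] -> alternatives get a real chance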
yay-iviss@reddit
I don't know exactly why, but there are some default parameters that should be used with these models; the page on Hugging Face lists them, and using them with an updated system (llama.cpp) that supports this arch makes these loop errors happen less.
It's like the default one is misconfigured.
There are more parameters than just the temperature to adjust; there's also a penalty setting to configure, but LM Studio doesn't support changing everything, and for my uses just this was enough.
StartupTim@reddit
Aight thanks for the info, I appreciate it!
General-Economics-85@reddit
They literally said it stops looping?
LoafyLemon@reddit
Qwen recommends setting the temp to 0.6 for coding tasks with thinking enabled. Check the Unsloth page for the quick recommended settings per task for Qwen 3.5.
Otherwise-Variety674@reddit
For me, I deleted my currently downloaded Qwen 3.5 model and redownloaded the most recently released Unsloth version; they fixed the looping problem recently.
Lualcala@reddit (OP)
Well, it looked promising, but for some reason my results weren't that good (maybe I did something wrong). Roo Code just said that OmniCoder didn't support tool calls, and Kilo Code allowed me to use it but it indeed failed at tool calls and just printed the XML of the tool call instead of actually doing it :/
Lualcala@reddit (OP)
I need to correct myself here, I switched from Ollama to LM Studio and it's working properly now (not sure why), I'm eager now to see what it can do!
RMK137@reddit
Any idea how to disable thinking in llama.cpp? Tried both --chat-template-kwargs '{"enable_thinking":false}' and --reasoning-budget 0 and neither worked.
Weary_Long3409@reddit
This arg works via litellm. I put that arg on litellm config, then every connection through litellm shuts thinking down.
Awwtifishal@reddit
This has worked for me in the API call JSON:
"chat_template_kwargs": {"enable_thinking":false}
Make sure you have jinja enabled (which is the default nowadays) and that you're not using a custom jinja template. The GGUF files I've tried have the correct one, but a possibility is that you have a bad one.
FUS3N@reddit
This model is insanely good btw. About a year ago I made a local assistant with something like 40 tools it had access to, and around that time every 8B-9B model failed to generate any meaningful text with all those tools and that much context, but this thing just does every complex task I give it, completing them using multiple tools, and even does better than 2.5 Flash did back then. Crazy how far small models have come.
ducksoup_18@reddit
Have people done A/B tests on this vs Unsloth's 9B model? I'm using the 9B model currently, plus the 4B model (specifically for agentic stuff for my home assistant), spanned across two 3060 12GB GPUs, and I have been impressed with both. I have an enormous context for the 9B model as I have additional unused space on one of the cards. Just wondering if I could use this model in place of the 9B Unsloth and get more context at similar performance.
No-Statistician-374@reddit
Well it IS a coding finetune of that model, so for coding tasks it should slot in the same spot quite nicely... not sure why it would give you more context though? Unless you're going to run it at a lower quant, but that would make the comparison invalid...
ducksoup_18@reddit
Oh, for some reason I thought the model size was smaller than Unsloth's 9B model. My misunderstanding.
No-Statistician-374@reddit
Oh damn, we're already getting coding finetunes of Qwen3.5... here's hoping we'll get one for the 35B soon! ^^ This is definitely going in my stack though, much thanks!
ScoreUnique@reddit
Looks pretty sick
blacklandothegambler@reddit
I have the same GPU, but I notice that Qwen3.5 9B keeps stopping on agentic coding tasks in opencode. What's your setup specifically? Are you using an Unsloth model? Ollama?
darkregan11@reddit
Hey, same issue here. I'm running Qwen3.5:9b locally and opencode is not able to handle the responses from Ollama. I ran it on an RX 9060 XT GPU with 16 GB VRAM and on a MacBook Air with 16 GB unified memory, and I'm getting the same result on both of them. Does anyone know which model is compatible with opencode?
d4prenuer@reddit
Hey mate, I'm having the same problem, did you find any fix?
darkregan11@reddit
I have been able to make some small progress. In the opencode config.json file, I added a "type":"openai" parameter to the provider.ollama.options node; after that at least the plan agent starts to 'think' and log it, but when I change to the build agent it is not able to perform actions like creating or editing files. It seems additional tags would need to be added for every tool call in the model response, pretty similar to
/blablabla/myfile
hello
So I think a lot of time would have to be invested in it, and that doesn't make sense to me.
Hopefully a new custom model helps with it; it looks like the heavier models have better tooling compatibility, not the medium ones like 9B or 12B. :(
Eventually I'm going to try it again in a few months when another model makes noise.
d4prenuer@reddit
I found the fix. On my side the problem was due to an Ollama environment variable that controls the Ollama server's context window; I made a bunch of tests to find the right value and in the end I was able to make it work very well.
The env var is OLLAMA_CONTEXT_LENGTH
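Something like this (the value is just an example, tune it to whatever fits your VRAM):
# raise the Ollama server's default context window, then restart the server
# (if Ollama runs as a systemd service, set the variable in the service environment instead)
export OLLAMA_CONTEXT_LENGTH=32768
ollama serve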
Lualcala@reddit (OP)
I'm using Ollama on Linux (specifically CachyOS). I'm actually using the GPU as an eGPU, and I have 24GB of RAM. I'm using the model available from the Ollama library (although I'm considering switching so I can test the Unsloth and Bartowski models).
I have tried many context sizes; 65K worked quite well for me, and I tried extending it to 75K today and it continued working fine.
Temperature was set at 0.25
linuxid10t@reddit
Qwen3.5-9B managed to completely mess up my build system and then delete the project today. I'm not terribly convinced lol. Seriously though, it works well sometimes, but other times it falls flat on its face. Using LM Studio and Claude Code on the RTX 4060.
prototype__@reddit
Could you please explain your setup? Are you using Claude to direct a local LM which does the coding?
linuxid10t@reddit
The local LLM runs on LM Studio, which provides an Anthropic endpoint. As of January, you can point Claude Code at ANY Anthropic endpoint. This means you can run a server locally on the same computer and point Claude Code at it. I posted a project to GitHub that sets up the environment variables in a nice, easy-to-use GUI. https://github.com/linuxid10t/HaiClaude-linux-qt6
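Under the hood it's basically just environment variables; a rough sketch (the port and model name depend on what your local server exposes, these are placeholders):
# point Claude Code at a local Anthropic-compatible endpoint
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_AUTH_TOKEN="local-dummy-key"
export ANTHROPIC_MODEL="qwen3.5-9b"
claude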
juandann@reddit
How is the token consumption with Claude Code? I heard it embeds quite a large system prompt on each run?
prototype__@reddit
TY very much, kind internet stranger!
yay-iviss@reddit
It's not as smart or as good as frontier models; the point is that it can work and do things, and even get them right within a few attempts.
On the other hand, older FOSS models would only manage that if they were bigger than 60B.
That's the comparison, apples to apples.
linuxid10t@reddit
As long as it doesn't do anything too stupid, it can be okay. I can run GLM-4.7-Flash at 6 tok/sec or Qwen3.5-9b at 30 tok/s. Even if it makes mistakes, you can iterate fast enough that sometimes it's faster. The biggest issue is just how much thinking it uses; it starts to think so much that you run out of context or it gets lost. I was playing with a Qwen3.5-27B model with Opus 4.6 CoT distilled into it and the whole overthinking problem was much better. I need to see if there is a similar 9B model.
thewhitewulfy_@reddit
Yes, there is a similar 9b model and a 4b as well
thewhitewulfy_@reddit
Look for jackrong or crow 9b distill
linuxid10t@reddit
Jackrong has it. Downloading now
juandann@reddit
With Jackrong, do you enable / not restrict the engine's reasoning option? Also, may I know what engine you use for inference?
Awwtifishal@reddit
Take a look at the new reasoning budget settings in llama.cpp. Qwen 3.5 works well with the official reasoning budget message that Qwen themselves use.
Then you can use the JSON parameter thinking_budget_tokens or the CLI argument --reasoning-budget.
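For example, to turn thinking off entirely at the server level (a sketch; the model path is a placeholder):
llama-server -m Qwen3.5-9B-Q6_K.gguf --jinja --reasoning-budget 0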
sinevilson@reddit
It's a must to turn any external thinking off, along with additional parameters to make it sane. It thinks more than a convict in court.
sinevilson@reddit
This is a good opportunity to test a response using the model in question (hope you don't mind), after my adjustments to attitude, sanity and performance for my architecture.
Trick question - thinking - all models "think"; apps like Ollama will cause more "thinking", creating expenses and performance hits. I'll reference Ollama because it's what I'm currently building with. The short version is basically: set it in your Modelfile and build the LLM from there, with the parameters and a different name to recognize it. A toggle in any app is a disappointing substitute for parameters.
Stream Chat Response: true
Seed:
Temperature:
...
think: false ← HERE - DISABLES ADDITIONAL VISIBLE THINKING
num_ctx:
"Thinking" apps like Ollama don't magically make models "think" more—Modelfile parameters do. Here's the truth:
You're being tricked if you think switching to a "thinking-enabled" app will unlock deeper reasoning. All models—Qwen3.5:9B, Llama 3.1, or anything else—respond to the same inference parameters regardless of the frontend. The "thinking" toggle is just UI theater when the real control lives in your config.
The actual levers (examples across frameworks):
Ollama Modelfile:
LM Studio (Advanced Config):
llama.cpp server (llamafile):
vLLM/OpenWebUI:
Stop hunting for apps. Build your Modelfile/server config properly, set think: false (or equivalent), and chain with Granite3-Condensed or similar if you need actual reasoning guardrails. App toggles are for casual users; parameters are for engineers.
I think the model responded accurately, so I'm posting the chat response. It took 5 seconds, which is acceptable to me. I'm a patient person.
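For the record, the request-level equivalent through the plain Ollama API looks roughly like this (the model tag is an example, and it assumes a recent Ollama build with thinking support):
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "think": false,
  "messages": [{"role": "user", "content": "write a unit test for foo()"}]
}'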
crazyclue@reddit
How do you turn the thinking off?
medialoungeguy@reddit
Do you think the 9b is better than the 35b?
grumd@reddit
35B is better; 9B does nonsensical things a lot of the time (at Q6).
kayteee1995@reddit
Yes! True! I tried the 9B Unsloth quant, and then Hauhauc's Aggressive one, but both went into loops and failed after 3-5 turns.
Bamny@reddit
Will try 35B tonight, I tried 9B on my 2x3060 12GBs with Openclaw and was disappointed. Shifted to gpt-oss:20b -> also semi disappointed. Might try 35B and shift to Hermes tbh
grumd@reddit
Try 35B at Q6 while offloading the experts to RAM; I get 70+ t/s with 10+ GB of the model in my RAM. Also try 27B. I have no experience with dual GPUs, but if they share their VRAM and you effectively have 24GB, then it's enough for 27B at good quants like Q4-Q5 with a lot of context, which will be your best option. 122B-A10B takes up sooooo much memory, but it's not even higher quality than 27B. I'm currently barely running 27B at IQ4-XS with almost 50k context, with -ngl 55 (so only 16 t/s), and it's way better quality than 9B Q6, 35B Q6 and 122B IQ3-XXS.
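In llama.cpp terms, "offloading experts to RAM" looks roughly like the sketch below; the model path and expert-layer count are examples (tune --n-cpu-moe, or use the --override-tensor regex on older builds, until your VRAM fits):
# keep attention/shared weights on the GPU, push some MoE expert tensors to system RAM
llama-server \
  -m Qwen3.5-35B-A3B-Q6_K.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  --ctx-size 49152 \
  --flash-attn on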
Lualcala@reddit (OP)
I haven't tested 35b yet, but for my hardware, I'm quite sure it will run super slow. So yeah.
medialoungeguy@reddit
35b is MOE and it runs at the speed of a 3B. Not only will it run on your setup, but it will run much faster.
General-Economics-85@reddit
how do you make it run faster if you can't fit 35b in your gpu?
ea_man@reddit
Use --fit-target 256 with llama.cpp; on a 12GB 6700xt it gives me 10 tok/sec less than 9B, yet it's a better model.
medialoungeguy@reddit
Only the active 3b params need to be in GPU. The other 17gb gets offloaded to ram.
General-Economics-85@reddit
How do you get that to work with llama.cpp then? When I tried it, it would just run as slowly as it would on CPU.
zilled@reddit
Which agentic tool are you using?
You mentioned Continue for the past experiments.
Are you still using this one?
Did you try others?
Do you use some specific settings?
(My current situation is that I can fit a Qwen3.5-27B on my system with decent t/s, but the results differ A LOT depending on the agentic tool I'm using.)
Lualcala@reddit (OP)
I've been using Kilo Code as my main agentic tool.
I'm just using Continue for code completion and some simple prompts now.
I'm still trying different context sizes, but around 65K-75K is working fine for me. I also reduced the number of files from my open tabs it can read (because I tend to leave them open and they stack like crazy), and I set my temperature to 0.25 (see the Modelfile sketch below).
However, I haven't done any kind of standard testing; I'm always asking for different stuff and seeing how well it does. I've just been liking Kilo Code a bit more than Roo Code.
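For reference, those settings can be baked into an Ollama Modelfile instead of being set per request; something like this (the base tag is just the variant I happen to pull):
# Modelfile: a local variant with the context size and temperature baked in
FROM qwen3.5:9b-q4_K_M
PARAMETER num_ctx 65536
PARAMETER temperature 0.25
Then ollama create qwen3.5-kilo -f Modelfile (the name is arbitrary) and point Kilo Code at the new tag.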
zilled@reddit
Nice!
It's starting to work quite well on my side. Thanks a lot!
What would you use for code indexing?
Can you actually use any indexing model for that?
Or is there a specific model you need for it to work properly with Qwen3.5-9B?
junior600@reddit
Which quant do you use with the rtx 3060?
Lualcala@reddit (OP)
I'm running qwen3.5:9b-q4_K_M using Ollama
Rude_Marzipan6107@reddit
Bartowski or Unsloth?
Lualcala@reddit (OP)
For Qwen 3.5, I used the one from the Ollama library. Not Bartowski or Unsloth
I used Unsloth for Qwen 3 Coder
thewhitewulfy_@reddit
Use Bartowski for Q4_K_M. It has the best responses at a similar quant and works great with CC and opencode.
Lualcala@reddit (OP)
I may consider switching away from Ollama. Right now there is a bug (https://github.com/ollama/ollama/issues/14575) where it fails to load those variants from Hugging Face.
thewhitewulfy_@reddit
Yes, I had already found that out in preliminary research.
adriabama06@reddit
Ollama self made
Danmoreng@reddit
Enjoy some extra performance by building llama.cpp from source instead of using Ollama: https://github.com/Danmoreng/local-qwen3-coder-env
StartupTim@reddit
Hey there, which would you say you like more, Roo or Kilo? How good are they for local hosting? Do you use the OpenAI API to talk to your local models?
Thanks!!
Lualcala@reddit (OP)
I felt both were kind of similar, but I liked Kilo a bit more because of small UI details that feel nice; it also has a few more features, like being able to generate commit messages, which also worked quite well for me.
I'm running my models locally through Ollama, which supports OpenAI API compatibility. Both tools work fine; I still need to test them a bit more.
hesperaux@reddit
How do you use Qwen3.x for fill-in-the-middle completion? I can't get that to work at all. I am still using 2.5 Coder for code completion.
Lualcala@reddit (OP)
I'm using Continue with VS Code. It's still not as fast or as smart as something like Copilot, but it's better than having nothing. It just worked without too much setup.
yay-iviss@reddit
Which tool do you use for code completion?
hesperaux@reddit
I use ProxyAI in JetBrains, Zed, and Neovim with the minuet plugin. I just run the model locally with llama.cpp in Docker. I had a poor experience with Continue and Roo, but ProxyAI is nice. Do you have any special config to tell the plugin how to do the FIM template or anything like that?
yay-iviss@reddit
I don't have anything; I stopped using local tools for code completion because Continue, which used to be good, isn't anymore and doesn't work, and I cannot download the old version that worked very well with Qwen2.5 Coder.
So I was curious whether you have some setup. I will try Zed with local AI again.
hesperaux@reddit
Qwen2.5 Coder works out of the box with llama.cpp and Zed once you configure it to talk to llama. It's just that the code completions I get aren't as relevant as what I got from Copilot with GPT-4.1. I know it's a lot to ask, but I was hoping. I use Q8 quants from Unsloth with 64K context. Probably overkill. And it's dedicated to FIM. Response is just as fast if not faster than Copilot. I still haven't taken the time to tune the parameters, which might help. Honestly I'm thinking that a reranker might make a big difference here. It gets the first completion eagerly, then if you sit there it will add more options to cycle through. I rarely want the first option. But I've got the output token count set pretty high and the temperature is a bit creative (0.6). I want to test it with like 64 output tokens and 0.1 temp but I've been busy. I just wish I could use a smarter model like Qwen3 or 3.5. I wonder how results would improve. I also wonder if it could load more of the file into context for better results, but I'm not sure what controls that.
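One way to poke at those knobs directly is llama.cpp's raw infill endpoint, roughly like this (the port, snippet and values are just examples for testing the 64-token / 0.1-temp idea):
curl http://localhost:8083/infill -d '{
  "input_prefix": "def add(a, b):\n    ",
  "input_suffix": "\n\nprint(add(2, 3))\n",
  "n_predict": 64,
  "temperature": 0.1
}'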
Smigol2019@reddit
Also, I tried using Zed and configured an "OpenAI API compatible" server... using llama.cpp router mode. But I get some errors about the response body and such. Can you share your config?
hesperaux@reddit
// FIM Edit Predictions
// Connects to llm-fim (Qwen2.5-Coder-7B-Instruct Q8_0) on GPU 4
//
// Option A: Direct connection (no TLS, use if Zed can reach port 8083)
"edit_predictions": {
  "mode": "eager",
  "provider": "open_ai_compatible_api",
  "open_ai_compatible_api": {
    "api_url": "https://ai/v1/completions",
    "model": "llm-fim",
    "prompt_format": "qwen",
    "max_output_tokens": 128,
  },
  "disabled_globs": [
    "**/.env*",
    "**/*.pem",
    "**/*.key",
    "**/*.cert",
    "**/*.crt",
    "**/secrets.yml",
  ],
},
Replace api_url with whatever the endpoint URL is for your inference server. The model name is whatever you called it (my llama.cpp server is configured to serve "llm-fim", which points to qwen2.5-coder in this case).
This cannot be configured via the UI. You have to put this into the Zed settings.json.
hesperaux@reddit
Sure. I'll get back to you tonight after work. How do I do the remind me thing?... !remindme 4 hours
RemindMeBot@reddit
I will be messaging you in 4 hours on 2026-03-13 01:33:31 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
Parent commenter can delete this message to hide from others.
bqlou@reddit
If I understand correctly, you have a dedicated agent in Kilo that is responsible for tool calling? Do you have an example of how to do that? Couldn't find docs about this…
Lualcala@reddit (OP)
Yeah, here is a straightforward guide
https://kilo.ai/docs/ai-providers/ollama
And one for Roo as well:
https://docs.roocode.com/providers/ollama
JorG941@reddit
What quant did you use, and what tk/s are you getting?
Lualcala@reddit (OP)
I'm getting around 45 tk/s with qwen3.5:9b-q4_K_M
JorG941@reddit
processing or output?
Lualcala@reddit (OP)
output
The processing was around 400 tk/s
Background-Bass6760@reddit
The fact that a 9B model can handle agentic coding workflows on a 3060 is a significant signal about where this space is heading. A year ago you needed 70B+ parameters and serious hardware to get usable agent behavior. The capability floor is rising fast at the small end of the spectrum.
What makes this interesting from an architecture perspective is the implication for local-first development workflows. If your coding agent runs entirely on consumer hardware with acceptable quality, the dependency on API providers becomes optional rather than mandatory. That changes the economics and the privacy model simultaneously.
Curious how it handles longer context windows and multi-file edits. The benchmarks usually test single-turn generation, but the real test for agentic coding is whether the model can maintain coherent intent across a sequence of file reads, edits, and tool calls without losing the thread...
MotokoAGI@reddit
A year ago the models were not really being trained for multistep agentic workflows... The limit was not the hardware, but the training...
Background-Bass6760@reddit
Yeah, that's actually accurate upon further reflection... It's hard to remember what things were even like 6 months ago in a space that changes every week and is only accelerating...
ea_man@reddit
Yup, the cool thing is that you can use the big cloud guys for planning and delegate agent EDIT / APPLY to these smaller QWENs and save a lot of credits.
Background-Bass6760@reddit
I love that it's heading in this direction. The big guys will no longer be able to successfully gate-keep as the local models get more intelligence density.
I saw Altman in a clip talking about intelligence as a utility, like water or electricity. I'm just thinking, yeah, if at that point the open source market doesn't swallow OpenAI and spit it back out.
ea_man@reddit
Those people are madmen; they are so greedy they will eat their own legs and guts.
Those agentic workflows and Anthropic PR reviews will whiplash those fools who are now buying tokens like candy.
qubridInc@reddit
Nice finding. Qwen3.5-9B running stable agent loops on a 12GB 3060 is actually pretty impressive for consumer hardware.
Feels like the sweet spot right now is ~8–10B models that fully fit in VRAM, rather than pushing bigger quants that slow everything down.
ea_man@reddit
I would say the sweet spot on 12GB is a MoE like Qwen3.5-35B-A3B (30 t/s compared to 40 t/s for the 9B), but on my system they run well on llama.cpp and at 1/3 of that performance with LM Studio.
MotorAlternative8045@reddit
I actually tested it with my openClaw setup and I can see it can properly call tools, search the web and handle everything that I've thrown at it so far. Maybe it's finally time to cut off my subscriptions.
-dysangel-@reddit
enter stage left: the people who keep trying to tell you that lower quants can't do anything useful even though you're showing they can do something useful
sine120@reddit
3.5 Quantizes amazingly well. I find the sweet spot is the Q4 k-quants, but the UD-IQ3_XXS have been great for the 27/35B models in my 16GB VRAM. Haven't had any issues.
xeeff@reddit
I would really benefit from the extra RAM when using UD-Q2_K_XL; do you mind testing how it performs for you?
sine120@reddit
What works for you may not work for me. IQ3 keeps the KL divergence low enough for coding and tool calls. If you don't care about those as much, you can get away with less.
vman81@reddit
Similar setup to mine. I did notice that tool calling was basically broken for my openclaw + qwen3.5 when upgrading past 0.17.5, so if this saves anyone half a day of debugging, this is for you.
Lucky-Necessary-8382@reddit
Openclaw workflows break after every update. My ass.
pitzips@reddit
I ran into the same bug this morning. Here's the GitHub issue I started following to know when it's fixed: https://github.com/ollama/ollama/issues/14745
zaidkhan00690@reddit
What kind of agentic coding work is it able to do? How is the quality, and how fast is it?
Lualcala@reddit (OP)
I've been using it to generate test cases for a flutter project. I gave it one test file example from my project.
When it creates the new file, it usually has compilation errors, but so far it has been able to fix them, then run the tests and attempt to fix failing ones. Depending on the issues it may take more time than others.
I haven't tried implementing a whole new feature yet but I may try it soon.
Tanzious02@reddit
I'm new to using LLMs properly; how do you use it?
Do you just prompt it in LM Studio, or use something like Claude Code with the local model?
tiga_94@reddit
Claude Code or Roo or something like that.
And then set up searching and scraping MCP servers (server.dev gives 5000 free tokens, you can use it just for scraping, and the unlimited free duckduckgo.com API for searches) and you're good to go.
RestaurantHefty322@reddit
We run background agents on smaller models for cost reasons and the biggest lesson is that benchmark scores lie about agentic performance. A model that scores well on HumanEval can still fall apart in a 20-step agentic loop because error recovery matters way more than first-shot accuracy.
The pattern that made 9B models actually usable for us was constraining the action space hard. Instead of giving the model a dozen tools and hoping it picks the right sequence, we use structured output schemas with explicit state fields - the model fills in a typed action and we validate before execution. Catches most of the "delete the project" type failures before they happen.
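A stripped-down Python sketch of what constraining the action space looks like (simplified, not our production code):
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"read_file", "write_file", "run_tests", "finish"}

@dataclass
class Action:
    name: str
    path: Optional[str] = None
    content: Optional[str] = None

def parse_and_validate(raw: str) -> Action:
    # The model is forced to emit JSON in this shape; anything outside the schema
    # is rejected before we ever touch the filesystem.
    action = Action(**json.loads(raw))
    if action.name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action.name}")
    if action.name == "write_file" and not (action.path and action.content):
        raise ValueError("write_file requires both path and content")
    if action.path and action.path.startswith(("/", "..")):
        raise ValueError("refusing to touch paths outside the project directory")
    return action  # only a validated action gets executed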
The other thing nobody mentions is context window pressure in long agentic sessions. A 9B model with 32k context filling up with tool call history degrades way faster than a 70B in the same situation. We ended up doing aggressive context pruning between steps - keep the last action result and the original goal, drop everything in between. Counterintuitive but the model makes better decisions with less history than with a bloated context full of stale intermediate states.
SlaveZelda@reddit
Is it better or worse than the 35b one
grumd@reddit
27B > 35B-A3B > 9B
LoafyLemon@reddit
In my tests, 9B pulled ahead of the MoE at times... I'm not too surprised given the active parameters.
yay-iviss@reddit
Worse
AvidCyclist250@reddit
Yeah feels to me like it's about 15-20% dumber. Pretty good for its size
sleepingsysadmin@reddit
It benches around gpt-oss-120b (high). It's shocking how good it is at that size.
greeneyestyle@reddit
Woah 😮 I wasn’t aware of this. That’s crazy for a 9B model.
SemiconductingFish@reddit
I'm still pretty new to this stuff and still trying to get Qwen3.5 9B to work on my 12GB VRAM (more like <10GB VRAM if I account for baseline usage). What KV cache size did you use? I got an OOM-type error when I tried running an AWQ version on vLLM with just a 4k cache size.
My_Unbiased_Opinion@reddit
Try 27B UD-IQ2_XXS. You might like it. The model is super smart and is a big step up from 9B even if quanted down to hell. Run KVcache at Q8.
Important-Farmer-846@reddit
I would appreciate it if you tried this and compared it with the Unsloth base version: Qwen3.5 9B Crow. In my personal experience, it's way better.
sammybeta@reddit
I will try it tonight with my 4070 ti super, so a bit more VRAM. I was frustrated previously with a similar sized model (can't remember which).
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
sniffton@reddit
I've been thinking of doing a draft-and-edit setup using Qwen3.5 9B and something bigger/slower.
Kiansjet@reddit
Oh how I regret getting a 3060 ti instead of the 3060 12g
altomek@reddit
For completion you can use small models like:
granite-4.0-h-1b or granite-4.0-h-350m.
DeliciousGorilla@reddit
What does “completion” mean?
AvidCyclist250@reddit
It's also my golden spot for offline and online RAG. Quite the supermodel for 16gb VRAM.
gaspipe242@reddit
3.5-27B has me SERIOUSLY stanning. It is standing toe-to-toe with Devstral 2 for agentic loads and reasoning. (I don't use these models for coding... but more for agentic loads, tasks, testing (managing Playwright testing fleets, etc.)... But holy smoke, Qwen3.5 27B is one of the most impressive models I've used in a while. (Mistral's 20B+ models, too, have shocked me.)
switchbanned@reddit
I tried using Kilo Code with 3.5-9B, running in LM Studio, and it failed at tool calling every time I tried using the model. I could have been doing something wrong.
sinevilson@reddit
I’m not entirely convinced by the hype surrounding Qwen3.5:9B. It shows strong potential, but it absolutely requires a full Modelfile rebuild or at least a deep rewrite with fine‑grained parameter tuning. The default configuration doesn’t bring out its best performance — you’ll need to explicitly define system prompts, context handling, and sampling parameters to align it properly. In my experience, it also benefits from being chained with something like Granite4 or Granite3‑Condensed to stabilize outputs and maintain logical coherence across longer sessions.
TastyStatistician@reddit
You should try qwen3.5-4b so you can have a larger context window.
StrikeOner@reddit
Mhh, I haven't gotten out of my loop of playing with the 35B model so far, but well... thanks for reminding me. I'm going to give that a try.
loxotbf@reddit
Running an agent for over an hour on a 12GB card shows how much efficiency has improved with smaller models.