Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config
Posted by NoConcert8847@reddit | LocalLLaMA | 44 comments
Hardware
| Component | Details |
|---|---|
| Machine | MacBook Pro (Mac14,6) |
| Chip | Apple M2 Max — 12-core CPU (8P + 4E) |
| Memory | 64 GB unified memory |
| Storage | 512 GB SSD |
| OS | macOS 15.7 (Sequoia) |
AI Agent Setup
I'm using the pi coding agent as my primary development assistant. It's a local-first AI coding agent that connects to local models via llama.cpp.
Model: Qwen3.6-35B-A3B (running via llama.cpp)
How pi Connects to llama-server
The pi agent communicates with llama-server via the OpenAI-compatible API. Configuration lives in ~/.pi/agent/models.json:
{
"providers": {
"llama-cpp": {
"baseUrl": "http://127.0.0.1:8080/v1",
"api": "openai-completions",
"apiKey": "ignored",
"models": [{ "id": "Qwen3.6-35B-A3B", "contextWindow": 131072, "maxTokens": 32768 }]
}
}
}
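Before pointing pi at the server, you can sanity-check the same OpenAI-compatible endpoint directly. A minimal sketch (the base URL and model id come from the config above; the helper names are mine, not part of pi):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # matches baseUrl in models.json


def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "model": "Qwen3.6-35B-A3B",  # must match the model id in models.json
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def chat(prompt: str) -> str:
    """POST to llama-server's /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ignored",  # llama-server ignores the key
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If `chat("Say hi in one word.")` returns text, pi's provider config should work as-is.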
The Command
llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \
-c 131072 \
-n 32768 \
--no-context-shift \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.00 \
--presence-penalty 0.00 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--batch-size 4096 \
--ubatch-size 4096
Parameter Breakdown
| Flag | Value | Why |
|---|---|---|
| `-hf unsloth/...:UD-Q5_K_XL` | HuggingFace repo | unsloth's custom UD quantization, a good quality/size tradeoff (~19 GB) |
| `-c 131072` | 128K context | This model supports a massive context window; set it high for long documents or extended conversations |
| `-n 32768` | 32K output tokens | Allows long single-turn generations without hitting the generation limit |
| `--no-context-shift` | Off | Disables context shifting during generation, keeping long responses coherent |
| `--chat-template-kwargs` | `preserve_thinking: true` | Keeps the model's reasoning/thinking blocks intact in the output |
| `--batch-size 4096` | 4096 | Logical batch size: higher means faster prompt processing, but needs more memory |
| `--ubatch-size 4096` | 4096 | Physical batch size, kept equal to the logical batch for consistency |
Sampling Parameters
The sampling parameters (--temp, --top-p, --top-k, --repeat-penalty, --presence-penalty) are taken directly from unsloth's recommended config for Qwen3.6. I use these as-is since they're the official recommendations from the model's creators and produce good results out of the box.
PermanentLiminality@reddit
I will be trying a very similar setup. Same model and quant, but on a PC with 3x P40 GPUs.
I've been using Opencode for a while and I find that my context can exceed 100k, so I run with the full 262144 context in case I need it. It uses about 32 GB of VRAM. Is pi lighter?
NoConcert8847@reddit (OP)
Pi is probably much lighter. It can do most things that I can throw at it so far. I've not had a super positive experience with opencode
promobest247@reddit
yeah it's faster than any agent because it has a small system prompt
FusionX@reddit
Are we talking about the same quant? It's definitely nowhere near 19GB
NoConcert8847@reddit (OP)
It was a typo. I meant 29gb
BrewHog@reddit
Are you just showing your config? Or did you have any questions?
This looks like a great setup. What is your impression of this setup so far?
NoConcert8847@reddit (OP)
It's incredible. I was blown away by the fact that this is literally a model running on MY laptop that is performing so well - both in terms of intelligence and speed.
Ok_Blacksmith2405@reddit
Wouldn't an MLX version be better? And TurboQuant for the KV cache, to get a big context window without using so much RAM?
NoConcert8847@reddit (OP)
Unsloth quants benchmark better. KV cache quantization made things much slower for me, which I think was because of having to enable flash attention. Not sure why that would be the case but I've not run into memory issues so far
OldPappy_@reddit
What sort of tokens/sec do you get with your setup?
NoConcert8847@reddit (OP)
~50 tok/s
kovrik@reddit
Want to know that too.
I have a MacBook Pro M1 32GB and Qwen3.6 35B A3B is slow as hell for me. Gemma4 27B A4B is much faster. Not sure what I am doing wrong…
Fearless_Theory2323@reddit
I have the same setup, try that one: bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS
I'm getting 32t/s
2Norn@reddit
i so regret not buying 5090. i completely made my decision based on gaming and went with 5080 back then...
nicksterling@reddit
I’m happy to see pi getting more love. The extension system is incredible and being able to customize my harness is great. I added Claude Code plugin support via extensions so I’m not losing any compatibility. I’m surprised how well it works with models like Qwen 3.6 and Gemma 4
SnooPaintings8639@reddit
Is there an extension "marketplace" of some kind? Or do you happen to know how I can disable reasoning responses in chat history? They're ten times the size of the actual responses.
By default pi keeps all the tokens in history, making each task I give to qwen 3.6 nearly 100k tokens long, plus another 50k for fixing bugs from first attempts. This means I have to restart the session after every single task, making it very non-interactive.
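From what I understand, an extension for this would mostly just need to delete the model's think blocks from past assistant turns before the history is resent. A rough sketch of the idea, assuming the reasoning is wrapped in `<think>...</think>` tags (how pi stores history internally may differ):

```python
import re

# Matches a whole reasoning block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Drop <think> blocks from assistant turns so they don't bloat history."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```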
liftheavyscheisse@reddit
you can ask your clanker to build an extension. but why not use /tree and summarize the branch instead?
Stutturdreki@reddit
The 'marketplace' : https://pi.dev/packages
rm-rf-rm@reddit
this is really good to hear. I'm going to skip migrating to opencode (from claude code) and do pi instead.
Can you link the extensions you are referring to?
Several-Tax31@reddit
I'm also thinking of switching. I'm starting to hear very good things about pi.
shovepiggyshove_@reddit
I've been using it for months now, it's my go-to tool for agentic coding. It feels super lightweight and customizable. It forces you to build/adapt everything yourself (skills, extensions, workflow).
arcanemachined@reddit
OpenCode is good too. Much nicer UI/UX than Claude Code IMO. Just feels more coherent.
Pi's also really cool. It's like having a robot that can build itself an extra arm if you just ask it to. Very cool platform.
Shoddy_Cook_864@reddit
Try this project out, it's a free open source project that lets you use large models like Kimi K2 with claude code for free by utilizing NVIDIA Cloud.
Github link: https://github.com/Ujwal397/Arbiter/
sine120@reddit
Qwen + Pi has been working really well for me for coding. I just need to get a better search setup and I think I can start phasing out gemini day to day.
hailnobra@reddit
This has by far been my favorite part of Qwen 3.6. This thing is a data-consuming machine when you hand it search tools. I have it set up with openwebui as a front end, with SearXNG for metasearch and Crawl4AI for scraping. I also have a small scout model running Llama 3.2 3B Instruct that extracts the right text per Qwen's instructions, so Qwen doesn't destroy its own context just searching.
After I gave Qwen these tools and a system prompt explaining them, it was like a kid that just got their favorite toy for Christmas. Qwen will search the world for a perfect answer if you don't rein it in (I think I've seen it go as high as 21 searches and 14 scrapes before it came back with an answer it liked once... that ate about 90K tokens by the time it got done, even with the scout model paring down the scrape content)
schizzz8@reddit
Very cool. Can you share more details about your setup?
hailnobra@reddit
Sure thing.
Qwen 3.6 is running on a Strix Halo system with 96GB of RAM (75GB allocated to GTT). The host OS is CachyOS, and llama-server currently runs in the amd-strix-halo-toolboxes:rocm-7.2.1 container from kyuz0 for full compatibility with the 8060S (I get about double the PP speed with this over the standard ROCm container, though I may switch to vulkan soon to try out some of the turboquant builds). I also run stable diffusion on Forge Neo with Flux.2 on this same server. Here is my docker setup for Qwen 3.6:
command: >
llama-server
-m /models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
--mmproj /models/mmproj-BF16.gguf
-c 524288
-ngl 999
--host 0.0.0.0
-fa on
--no-mmap
-ctk q8_0
-ctv q8_0
-np 4
--jinja
--chat-template-kwargs '{"preserve_thinking": true}'
--reasoning-budget 8192
--reasoning-budget-message " [Thinking budget reached. Finalizing the current research step and providing the answer now.]"
--batch-size 4096
--ubatch-size 4096
--metrics
I plan to try hooking up agentic tools in the future to this which is why I have it set to -np 4 and such a high context amount.
I have this loaded into OpenWebUI, where I set up a workspace for it along with 2 functions: one called web_search that calls the SearXNG endpoint, and another called scrape_and_scout that calls Crawl4AI with a URL and sends the output straight to my scout model rather than piping it directly back to Qwen. Once the scout model completes, the function passes the scouted info back to Qwen to do with what it wants. OpenWebUI, SearXNG and Crawl4AI run in a separate docker stack alongside Gluetun, with split tunneling to help with privacy for searches and to let me relocate my IP to countries that aren't blocked as much by scrapers (I've actually found Poland to work quite well).
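For reference, the SearXNG side of a web_search function like that can be a single JSON request. A sketch with assumed names and port (this is not my actual script, and `format=json` has to be enabled in SearXNG's settings.yml):

```python
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://127.0.0.1:8888"  # assumed port for the SearXNG container


def build_search_url(query: str) -> str:
    """Build a SearXNG JSON-API search URL for the given query."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{SEARXNG_URL}/search?{params}"


def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Query SearXNG and return trimmed title/url/snippet dicts for the model."""
    with urllib.request.urlopen(build_search_url(query)) as resp:
        results = json.load(resp).get("results", [])
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content")}
        for r in results[:max_results]
    ]
```

Trimming each result down to title/url/snippet is what keeps raw search output from flooding the model's context.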
The scout, Llama 3.2 3B Instruct, runs in a separate llama-server docker container on a 3070 in an eGPU enclosure connected to the same Strix Halo system. This works amazingly well, and I was able to give the scout a 49K context window while still fitting it entirely on the card without llama.cpp yelling at me. This model is insanely fast at scouting the pages that OpenWebUI hands it from Crawl4AI, so it does not add much extra time to Qwen's workflow. Here is my docker command string for the scout model:
command: >
llama-server
-m /models/Llama-3.2-3B-Instruct-Q6_K_L.gguf
-c 49152
-np 1
-ngl 999
--host 0.0.0.0
-ctk q8_0
-ctv q8_0
-fa on
--batch-size 4096
--ubatch-size 2048
--no-mmap
-t 4
My one complaint at the moment is that there are times when Qwen gets a bit too excited and forgets the constraints I have placed in the system prompt, so I need to figure out a hard limiter on the tools so it doesn't lose its mind, go down 20+ search-and-scrape rabbit holes trying to find an answer, and fill its context window. Since each tool call is seen as a new command, Qwen has a hard time counting how many times it has run a tool in a session and just goes wild. Other than that occasional issue, it is quite fun watching Qwen look for something, not be happy with the web snippets or scrapes, change its prompt, try again, and keep refining until it is happy enough to give an answer. It is certainly not as creative as Gemma 4, but its tool calling is absolutely bonkers (I could not twist Gemma 4's arm hard enough to make it like tools).
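One blunt fix I'm considering is counting calls in the tool wrapper itself instead of asking the model to count. A hypothetical sketch (not OpenWebUI API, just the idea):

```python
class ToolCallLimiter:
    """Wrap a tool function and refuse calls past a hard per-session cap."""

    def __init__(self, tool, max_calls: int):
        self.tool = tool
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, *args, **kwargs):
        if self.calls >= self.max_calls:
            # The returned string goes back to the model as the tool result,
            # so it sees the limit instead of the tool silently failing.
            return "[Tool limit reached. Answer with the information you already have.]"
        self.calls += 1
        return self.tool(*args, **kwargs)
```

Since the counter lives outside the model, it can't be talked out of the limit no matter how unhappy Qwen is with its snippets.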
Djagatahel@reddit
How did you end up configuring crawl4ai? I tried it a few months ago and I remember the UI being absolutely not intuitive
hailnobra@reddit
Honestly, I just deployed a docker container for this in my gluetun AI frontend stack and then tied in a python script that lets OpenWebUI send the call from Qwen to crawl4ai. Here is the crawl4AI section (ports are up with gluetun so I still have access to the webUI for myself, but I have never personally used the UI and just let OpenWebUI handle it).
Here is the python script I put into OpenWebUI tools that handles the scrape and sends it to the scout model for summarization (built with some help from Qwen and Gemini to get it working)
Momsbestboy@reddit
What system prompt do you use to explain/enforce it?
hailnobra@reddit
Here is my current prompt. It seems to have issues following the search count at the moment, so I don't think this is the right approach to get it to stop going forever. Need to figure out what else I can try. Everything else is working great. Would love suggestions if you have any.
Momsbestboy@reddit
Maybe just copy & paste the prompt to your llm and ask it for an opinion? And then copy & paste the stuff to chatgpt for a second one?
Whenever I am stuck with the local llm, I push the question to chatgpt to see if it finds a different approach
hailnobra@reddit
Done this with both Qwen itself and with Gemini to try different refinement methods. This was the latest attempt. Need to spend more time with more ideas because Qwen is still escaping the 10 web_search limit if it isn't happy with what it finds.
No-Consequence-1779@reddit
Very interesting. I want to try this.
hailnobra@reddit
Posted a bit more info on the configuration to another person in this thread. Absolutely recommend giving Qwen tools.
promobest247@reddit
me too, i use pi. it's very good & fast locally with extensions & skills. i installed many extensions: lsp, web_access (websearch), plannator (similar to claude code's ultraplan), teams
Thrynneld@reddit
I've been running a similar setup, but have gone a slightly different way when I discovered that at least for solving benchmarks, disabling thinking actually gave better results, so I run with:
--chat-template-kwargs '{"enable_thinking": false}'
Give it a shot, I was surprised to see qwen 3.6 35b at q4 basically one-shot all 225 polyglot benchmark exercises
Worried-Squirrel2023@reddit
the pi extension system is what sold me too. opencode is great out of the box but the moment you want to add a custom tool or hook, pi is way less painful. for a 64GB M2 Max that setup is probably the best price/perf you can get without buying nvidia.
esp_py@reddit
How much token per second are you getting?
Clean_Initial_9618@reddit
Hi, I have an RTX 3090 and can run IQ4_NL at 120 t/s. Can I use this model for coding? I've been trying, but it either loops too much or the results aren't that great.
No-Consequence-1779@reddit
Very cool. My R9700 gets 120 too.
Dismal-Effect-1914@reddit
What is your pp/tg ?
Durian881@reddit
I was using Qwen3.6-35B-A3B with Qwen Code and it worked pretty well too, coding a web UI, making tool calls and using skills. It did have some problems repairing and restarting Hyperledger Besu nodes that had stopped syncing.
uniVocity@reddit
Thanks for sharing! I'm too lazy to research configs and I've been stuck with LMStudio and whatever defaults come with the models for a while.
Will try this out to see if it makes much of a difference.