Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

Component	Details
Machine	MacBook Pro (Mac14,6)
Chip	Apple M2 Max — 12-core CPU (8P + 4E)
Memory	64 GB unified memory
Storage	512 GB SSD
OS	macOS 15.7 (Sequoia)

Flag	Value	Why
`-hf`	`unsloth/...:UD-Q5_K_XL`	HuggingFace model repo with unsloth's custom UD quantization — good quality/size tradeoff (\~19 GB)
`-c 131072`	128K context	This model supports a massive context window — set it high for long documents or extended conversations
`-n 32768`	32K output tokens	Allows long single-turn generations without hitting the generation limit
`--no-context-shift`	Off	Prevents context shifting during generation — keeps long responses coherent
`--chat-template-kwargs`	`preserve_thinking: true`	Keeps the model's reasoning/thinking blocks intact in the output
`--batch-size 4096`	4096	Logical batch size — higher = faster prompt processing, needs more memory
`--ubatch-size 4096`	4096	Physical batch size — kept equal to logical batch for consistency

[-]

PermanentLiminality@reddit

I will be trying a very similar setup. Same model and quant, but on a PC with 3x P40 GPUs.

I've been using Opencode for a while and I find that my context can exceed 100k so I've run using the full 262144 context in case I need it. Uses about 32gb of VRAM. Is Pi lighter?

[-]

NoConcert8847@reddit (OP)

Pi is probably much lighter. It can do most things that I can throw at it so far. I've not had a super positive experience with opencode

[-]

promobest247@reddit

yeah it's faster than any agent because it has small system prompt

[-]

FusionX@reddit

unsloth/...:UD-Q5_K_XL

good quality/size tradeoff (~19 GB)

Are we talking about the same quant? It's definitely nowhere near 19GB

[-]

BrewHog@reddit

Are you just showing your config? Or did you have any questions?

This looks like a great setup. What is your impression of this setup so far?

[-]

NoConcert8847@reddit (OP)

It's incredible. I was blown away by the fact that this is literally a model running on MY laptop that is performing so well - both in terms of intelligence and speed.

[-]

Ok_Blacksmith2405@reddit

Not better MLX version ? For KV cache TurboQuant to get big context window to not waste so many RAM?

[-]

Unsloth quants benchmark better. KV cache quantization made things much slower for me, which I think was because of having to enable flash attention. Not sure why that would be the case but I've not run into memory issues so far

[-]

OldPappy_@reddit

What sort of tokens/sec do you get with your setup?

[-]

NoConcert8847@reddit (OP)

~50 tok/s

[-]

kovrik@reddit

Want to know that too.

I have MacBook Pro M1 32GB and Qwen3.6 35B A3B is slow as hell for me. Gemma4 27B A4B is much faster. Not sure what I am doing wrong…

[-]

Fearless_Theory2323@reddit

I have the same setup, try that one: bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

I'm getting 32t/s

[-]

2Norn@reddit

i so regret not buying 5090. i completely made my decision based on gaming and went with 5080 back then...

[-]

nicksterling@reddit

I’m happy to see pi getting more love. The extension system is incredible and being able to customize my harness is great. I added Claude Code plugin support via extensions so I’m not losing any compatibility. I’m surprised how well it works with models like Qwen 3.6 and Gemma 4

[-]

SnooPaintings8639@reddit

Is there an extension "marketplace" of some kind? Or do you yourself know how can I diable reasoning responses from chat history? They're ten times the size of the actual responses.

By default pi keeps all the tokens in history, making each task I give to qwen 3.6 nearly 100k tokens long, and another 50k for fixing bugs from first attempts. This means I have to restart the session after every single task, making it very non interactive.

[-]

liftheavyscheisse@reddit

you can ask your clanker to build an extension. but why not use /tree and summarize the branch instead?

[-]

Stutturdreki@reddit

The 'marketplace' : https://pi.dev/packages

[-]

rm-rf-rm@reddit

this is really good to hear. Im going to skip migrating to opencode (from claude code) and do pi instead.

Can you link the extensions you are referring to?

[-]

Several-Tax31@reddit

I'm also thinking of switching. I start to hearing very good things about pi

[-]

shovepiggyshove_@reddit

I've been using it for months now, it's my go-to tool for agentic coding. It feels super lightweight and customizable. It forces you to build/adapt everything yourself (skills, extensions, workflow).

[-]

arcanemachined@reddit

OpenCode is good too. Much nicer UI/UX than Claude Code IMO. Just feels more coherent.

Pi's also really cool. It's like having a robot that can build itself an extra arm if you just ask it to. Very cool platform.

[-]

Shoddy_Cook_864@reddit

Try this project out, its a free open source project that lets you use large models like Kimi K2 with claude code for completely free by utilizing NVIDIA Cloud.

Github link: https://github.com/Ujwal397/Arbiter/

[-]

sine120@reddit

Qwen + Pi has been working really well for me for coding. I just need to get a better search setup and I think I can start phasing out gemini day to day.

[-]

hailnobra@reddit

This has by far been my favorite part of Qwen 3.6. This thing is a data consuming machine when you hand it search tools. I have it setup with openwebui as a front end and I use SearXNG for metasearch along with Crawl4AI for scraping. I have a small scout model running Llama 3.2 3B instruct that extracts the right text per Qwens instructions so Qwen doesn't destroy its own context just searching.

After I gave Qwen these tools and a system prompt explaining them it was like a kid that just got their favorite toy for Christmas. Qwen will search the world for a perfect answer if you don't reign it in (I think I've seen it go as high as 21 searches and 14 scrapes before it came back with an answer it liked once...that ate about 90K tokens by the time it got done even with the scout model pairing down the scrape content)

[-]

schizzz8@reddit

Very cool. Can you share more details about your setup?

[-]

hailnobra@reddit

Sure thing.

Qwen 3.6 is running on a Strix Halo system with 96GB of RAM (75GB allocated to GTT). Host OS is running on CachyOS and llama server currently running on the amd-strix-halo-toolboxes:rocm-7.2.1 container from kyuz0 for full compatibility with the 8060s (I get about double PP with this over the standard ROCm container, though I may be switching this to vulkan to try out some of the turboquant builds soon). I also run stable diffusion on Forge Neo with Flux.2 on this same server. Here is my setup on the docker for Qwen 3.6:

command: >

llama-server

-m /models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf

--mmproj /models/mmproj-BF16.gguf

-c 524288

-ngl 999

--host 0.0.0.0

-fa on

--no-mmap

-ctk q8_0

-ctv q8_0

-np 4

--jinja

--chat-template-kwargs '{"preserve_thinking": true}'

--reasoning-budget 8192

--reasoning-budget-message " [Thinking budget reached. Finalizing the current research step and providing the answer now.]"

--batch-size 4096

--ubatch-size 4096

--metrics

I plan to try hooking up agentic tools in the future to this which is why I have it set to -np 4 and such a high context amount.

I have this loaded into openwebUI and I setup a workspace for it along with 2 functions. One for web_search that calls the searXNG endpoint and the other that is called scrape_and_scout that calls crawl4AI with a URL and then instructs the output to be sent straight to my scout model rather than being piped directly back to Qwen. Once the scout model completes, the function passes the scouted info back to Qwen to do with what it wants. OnewebUI, SearXNG and Crawl4AI are running in a separate docker alongside Gluetun with split tunneling to help with privacy for searches and to let me relocate my IP to countries that aren't blocked as much by scrapers (I actually have found Poland to work quite well).

I have the scout Llama 3.2 3B Instruct on a separate llama server docker container that is running to a 3070 on an eGPU connected to the same strix halo system. This works amazingly well and I was able to give the scout a context window of 49K without making llama.cpp yell at me to fit it entirely on the card. This model is insanely fast at scouting the pages that openwebUI hands it from crawl4AI, so it does not add much extra time to Qwen's workflow. Here is my docker command string for the scout model:

command: >

-m /models/Llama-3.2-3B-Instruct-Q6_K_L.gguf

-c 49152

-np 1

-ngl 999

--host 0.0.0.0

-ctk q8_0

-ctv q8_0

-fa on

--batch-size 4096

--ubatch-size 2048

--no-mmap

-t 4

My one complaint at the moment is that there are times that Qwen will get a bit too excited and forget the constraints I have placed in the system prompt, so I need to figure out a hard limiter on the tools so it doesn't lose its mind and go down 20+ search and scrape rabbit holes trying to find an answer and fill its context window. As each tool call is seen as a new command, Qwen has a hard time counting how many times it has run a tool in a session and just goes wild. Other than that occational issue, it is quite fun watching Qwen look for something, not be happy with web-snippits or scrapes, change it's prompt, try again, and continue refining info until it is happy enough to give an answer. It is certainly not as creative as Gemma 4, but it's tool calling is absolutely bonkers (I could not twist Gemma 4's arm hard enough to make it like tools).

[-]

Djagatahel@reddit

How did you end up configuring crawl4ai? I tried it a few months and I remember the UI being absolutely not intuitive

[-]

hailnobra@reddit

Honestly, I just deployed a docker container for this in my gluetun AI frontend stack and then tied a python script in that lets openwebUI send the call from Qwen to crawl4ai. Here is the crawl4AI section (ports are up with gluetun so I still have access to the webUI for myself, but I have never personally used the UI and just let OpenWebUI handle it

  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
    network_mode: service:gluetun
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} 
      - MAX_CONCURRENT_TASKS=5
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11235/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      gluetun:
        condition: service_healthy
    restart: unless-stopped

Here is the python script I put into OpenwebUI tools that handles the scrape and sending to the scout model for summarization (built with some help from Qwen and Gemini to get it working)

"""
title: Scout Scraper Tool
description: Scrapes a website, tokenizes to prevent context overflow, and uses a scout AI to extract data.
requirements: tiktoken, requests, pydantic
version: 1.0.1
"""

import requests
import json
import tiktoken
from pydantic import BaseModel, Field
from typing import Optional

class Tools:
    def __init__(self):
        # Crawl4AI setup
        self.crawl_api_url = "http://localhost:11235/crawl"
        self.crawl_api_token = (
            "<crawl4AI_API_token_Here>"
        )

        # Scout LLM setup (Llama-3.2-3B on Llama.cpp)
        self.scout_api_url = "http://<IPaddress:port_for_llama.cpp_server>/v1/chat/completions"

        # Initialize the tokenizer
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def scrape_and_scout(self, url: str, query_context: str) -> str:
        """
        Scrapes a website and uses a scout AI to extract specific technical information.
        Use this when standard search snippets lack sufficient depth.
        :param url: The full URL to scrape.
        :param query_context: Specific instructions on what facts or code to extract from the page.
        """
        headers = {
            "Authorization": f"Bearer {self.crawl_api_token}",
            "Content-Type": "application/json",
        }
        payload = {
            "urls": [url],
            "priority": 10,
            "magic_mode": True,
            "wait_for": "networkidle",
        }

        try:
            # 1. Scrape the URL
            crawl_resp = requests.post(
                self.crawl_api_url, headers=headers, json=payload, timeout=30
            )
            crawl_resp.raise_for_status()
            data = crawl_resp.json()

            if not (data.get("success") and data.get("results")):
                return f"Scrape failed: {data.get('error', 'Unknown error')}"

            result = data["results"][0]

            # --- DATA CLEANING FIX v2 ---
            md_data = result.get("markdown", {})

            if isinstance(md_data, dict):
                # Forums get wiped by magic_mode. Always grab raw_markdown first.
                markdown = md_data.get("raw_markdown", "")
                if not markdown:
                    markdown = md_data.get("fit_markdown", "")
            else:
                markdown = str(md_data)

            # Absolute fallback: if markdown fails, grab the raw text/html
            if not markdown or markdown.strip() in ["", "None"]:
                markdown = result.get("text", result.get("html", ""))

            if not markdown or markdown.strip() in ["", "None"]:
                return "Scrape successful, but the page returned absolutely zero text."
            # -------------------------

            # 2. Tokenizer Fallback / Truncation
            max_input_tokens = 42000
            tokens = self.tokenizer.encode(str(markdown))

            if len(tokens) > max_input_tokens:
                markdown = self.tokenizer.decode(tokens[:max_input_tokens])
                markdown += "\n\n[SYSTEM WARNING: Document was truncated due to length limits. Extract relevant data from the available text above.]"

            # 3. Pass to Scout LLM for optimization
            scout_payload = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a technical data extraction scout. Read the following website markdown and extract ONLY the information relevant to the user's query context. Be concise, retain all technical accuracy, code blocks, and configurations. If the answer is not in the text, explicitly state that.",
                    },
                    {
                        "role": "user",
                        "content": f"Query Context: {query_context}\n\nWebsite Content:\n{markdown}",
                    },
                ],
                "temperature": 0.1,
                "max_tokens": 4096,
            }

            scout_resp = requests.post(
                self.scout_api_url, json=scout_payload, timeout=120
            )
            scout_resp.raise_for_status()
            scout_data = scout_resp.json()

            optimized_text = scout_data["choices"][0]["message"]["content"]
            return f"--- Scout Data Extracted from {url} ---\n{optimized_text}"

        except Exception as e:
            return f"Tool execution error: {str(e)}"

[-]

Momsbestboy@reddit

After I gave Qwen these tools and a system prompt explaining them it

What system prompt do you use to explain/enforce it?

[-]

hailnobra@reddit

Here is my current prompt. It seems to have issues following the search count at the moment, so I don't think this is the right approach to get it to stop going forever. Need to figure out what else I can try. Everything else is working great. Would love suggestions if you have any.

# Environmental Context
- Current Date: {{CURRENT_DATE}}
- Reality Check: You are operating in real-time. Any information in your internal training data is considered "historical." ALWAYS trust tool output from `search_web` and `scrape_and_scout` as the primary source of truth.

# System Identity & Resource Profile
You are a High-Performance Research Agent. You operate on a high-resource home lab system where depth, accuracy, and exhaustive detail are prioritized over token efficiency. 
- **Efficiency Paradox:** In this environment, "saving tokens" or "being brief" is considered a failure. 
- **Tool Speed:** The `scrape_and_scout` tool is a high-speed, low-latency operation. It is your preferred method for data acquisition.

# Universal Research Protocol
You must follow this linear 3-phase execution model for EVERY query, regardless of subject matter.

### PHASE 1: Discovery (STRICT ALLOCATION: 10 web_search calls)
- **Track Your Count Explicitly:** Before every single `web_search` call, you must output: `[SEARCH_COUNT: X/10]`
- **The 10-Call Limit:** Your search allocation is exactly 10 calls. Upon reaching `[SEARCH_COUNT: 10/10]`, your ONLY permitted action is to transition immediately to Phase 3.
- **The Circuit Breaker:** If a search fails to yield a highly relevant URL and you are at your 10-call limit, you must ABORT the discovery phase.
- **Graceful Transition:** If you exhaust your 10 attempts without definitive data, state exactly: "Search allocation exhausted. Synthesizing the best available information." and proceed to Phase 3.
- **Snippet Policy:** Search snippets are metadata only. Use them strictly to select the best URL to scout.

### PHASE 2: Mandatory Deep Scout
- **The "At Least One" Rule:** You must execute `scrape_and_scout` at least ONCE per response to verify facts and extract full context.
- **Resilience Protocol:** If the scouted page is unhelpful or lacks necessary depth, check your current `SEARCH_COUNT`. If the count is less than 10, use a remaining search call to find a new URL. If the count is exactly 10, your ONLY permitted action is to transition immediately to Phase 3.

### PHASE 3: Synthesis & Response
- **Integrity:** Data retrieved via `scrape_and_scout` always supersedes internal training data.
- **Zero-Guessing:** If the exhaustive research process does not yield a definitive answer, state exactly what sources were checked and what data is missing rather than interpolating.

# Output Architecture
**1. Direct Answer**
Provide a clear, conversational, and highly detailed answer. Use clean Markdown (tables, bolding, lists) to ensure the information is scannable and comprehensive.

**2. Strategic Analysis (The "Second Set of Eyes")**
After the direct answer, provide a "Strategic Analysis" section:
- **Critical Insights:** Highlight nuances, hidden details, or "Gotchas" found during the scouting phase that were not apparent in the initial search snippets.
- **Forward Context:** Provide proactive advice or "next steps" the user should consider based on the discovered information.

[-]

Momsbestboy@reddit

Need to figure out what else I can try.

Maybe just copy&paste the prompt to your llm and ask it for an opinion? And then copy&past the stuff to chatgpt for a second one?

Whenever I am stuck with the local llm, I push the question to chatgpt to see if it finds a different approach

[-]

hailnobra@reddit

Done this with both Qwen itself and with Gemini to try different refinement methods. This was the latest attempt. Need to spend more time with more ideas because Qwen is still escaping the 10 web_search limit if it isn't happy with what it finds.

[-]

No-Consequence-1779@reddit

Very interesting. I want to try this.

[-]

hailnobra@reddit

Posted a bit more info on the configuration to another person in this thread. Absolutely recommend giving Qwen tools.

[-]

promobest247@reddit

metoo , i use pi it's very good & fast locally with extenstions &skills i installed many extension : lsp web_access (websearch) plannator ( similar ultraplan claude code) teams

[-]

Thrynneld@reddit

I've been running a similar setup, but have gone a slightly different way when I discovered that at least for solving benchmarks, disabling thinking actually gave better results, so I run with:

--chat-template-kwargs {"enable_thinking":false}

Give it a shot, I was surprised to see qwen 3.6 35b at q4 basically one-shot all 225 polyglot benchmark exercises

[-]

Worried-Squirrel2023@reddit

the pi extension system is what sold me too. opencode is great out of the box but the moment you want to add a custom tool or hook, pi is way less painful. for a 64GB M2 Max that setup is probably the best price/perf you can get without buying nvidia.

[-]

Hardware

AI Agent Setup

How pi Connects to llama-server

The Command

Parameter Breakdown

Sampling Parameters