Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?
Posted by boutell@reddit | LocalLLaMA | View on Reddit | 116 comments
I'm running Qwen3.6-35B-A3B-UD-Q4_K_M on an M2 Macbook Pro with 32GB of RAM. I'm using quite recent builds of llama.cpp and opencode.
To avoid llama-server crashing outright due to memory exhaustion, I have to set the context window to 32768 tokens. This turns out to be important.
As a hopefully reasonable test, I gave opencode a task that Claude Code was previously able to complete with Opus 4.7. The project isn't huge, but the task involves rooting around the front and back end of an application and figuring out a problem that did not jump out at me either (and I was the original developer, pre-AI).
The results are really tantalizing: I can see it has figured out the essentials of the bug. But before it can move on to implementation, compaction always seems to throw out way too much info.
If I disable the use of subagents, it usually survives the first compaction pass with its task somewhat intact, because I'm paying for one context, not two.
But when I get to the second compaction pass, it pretty much always loses its mind. The summary boils down to my original prompt, and it even misremembers the current working directory name (!), coming up with a variant of it that of course doesn't exist. After that it's effectively game over.
After reading a lot about how Qwen is actually better than most models with regard to RAM requirements, I've come to the conclusion that (1) 32768 is the biggest context I can get away with, and (2) it just ain't enough.
Has anyone had better results under these or very similar constraints?
Thanks!
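For reference, a hedged sketch of the kind of llama-server invocation I'm describing (model path is illustrative, not my literal command; the flags are standard llama.cpp options):

```shell
# Illustrative launch on a 32GB M2: cap context at 32768 to avoid OOM,
# with the q8_0 KV cache mentioned below. Filename/path is hypothetical.
llama-server \
  -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 32768 \
  --flash-attn on \
  -ctk q8_0 -ctv q8_0 \
  --port 8080
```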
mbrodie@reddit
Running Q8 3.6 35b a3b on 2 x 7900xtx through llama.cpp, which seems to be the only harness willing to support gfx1100, to its detriment: the performance is subpar, especially on simultaneous connections.
Anyway I use a headless opencode server on the server I have him on and whisper code for phone / opencode desktop for windows
It’s taken a bit to get here, like 3 days of benchmarking, testing settings, changing flags, looking for fixes and workarounds.
But I can finally run him on 2 parallel 262k streams with basically no crashing out from refusing to dump anything from memory.
But it comes at a small cost: he only runs at like 75 tps.
I’m not finished though I’ll keep optimising and stuff until I’m getting proper speeds with his systems working properly.
But yea I get him doing actual coding and work and in my eyes he’s what Claude 4.7 should have been when he’s actually running good.
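Not my exact command, but a sketch of a llama-server launch along these lines (two parallel slots sharing one context budget; filenames are illustrative):

```shell
# Two 262144-token slots: the total -c budget must cover both (2 x 262144).
# Model filename is hypothetical; flags are standard llama.cpp options.
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  --n-gpu-layers 999 \
  --split-mode layer \
  -c 524288 -np 2 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080
```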
BringMeTheBoreWorms@reddit
That t/s is actually pretty good for q8 over 2 xtx cards. What build of llama.cpp are you using, and any special settings? I'm just playing around on mine and getting around 65 t/s on that same model.
mbrodie@reddit
Always the latest of everything: llama.cpp, ROCm, everything at the most bleeding edge for fixes. I also cherry-pick performance PRs.
Thinking set to false is temporary while they fix the failure to free context from RAM when it dumps context.
Average TPS across the 5 tests we ran today:
| Quant | Avg TPS |
|---|---|
| APEX-I-Quality-1GPU | 89.22 |
| Opus-Q5_K_M | 74.40 |
| APEX-I-Balanced | 73.62 |
| Q6_K | 72.70 |
| AesSedai-Q6_K | 71.83 |
| UD-Q5_K_XL | 71.58 |
| Q8_0_TEXTONLY | 70.32 |
All benchmarks are designed and tested on my own codebase for real world actual applicable scenarios so it gives a really good snapshot at how they perform in my environment with the exact same context
1. Q6_K (bartowski)
2. AesSedai-Q6_K
3. APEX-I-Balanced
4. Q8_0_TEXTONLY
5. Opus-Q5_K_M
6. APEX-I-Quality-1GPU
7. UD-Q5_K_XL
That was the order from best to worst
BringMeTheBoreWorms@reddit
I've just got my multi-build matrix scripts working, so I'm trying out as many versions and combinations as I can, including the turboquants. Are there any PRs that you've found are particularly worth keeping an eye on?
mbrodie@reddit
Here’s the current list of PRs/issues we’ve been following or directly using:
Using directly:
- ggml-org/llama.cpp#22094: we are using this as a local patch in our current llama image build
- ggml-org/llama.cpp#21771 (common/chat.cpp)

Following in llama.cpp:
- ggml-org/llama.cpp#21831: very relevant to Qwen3.6
- ggml-org/llama.cpp#22127: `--cache-ram 0` still logs "prompt cache enabled"; cosmetic/misleading, not the core bug
- ggml-org/llama.cpp#22135: status: open issue
- ggml-org/llama.cpp#21757 (`--kv-dynamic`): interesting for long-context memory behavior
- ggml-org/llama.cpp#21741 (`--clear-idle` to `--cache-idle-slots`): we already adapted to this in the updated launcher
- ggml-org/llama.cpp#22051: already part of newer builds we wanted
- ggml-org/llama.cpp#22073: already included in newer builds
- ggml-org/llama.cpp#22114

Following in vLLM:
- vllm-project/vllm#37826 (gfx1100/gfx110x): big one for your 7900 XTX setup
- vllm-project/vllm#37712 (wvSplitK path): status: open PR
- vllm-project/vllm#40308: very relevant to Qwen3.6 hybrid behavior
- vllm-project/vllm#38502: status: open PR
- vllm-project/vllm#37472 (`--language-model-only`): mattered for testing

Following in Qwen:
- QwenLM/Qwen3.6#131: this is what I have the AI tracking currently
BringMeTheBoreWorms@reddit
Groovy!
Acu17y@reddit
On my 7900XTX with qwen3.6 35b a3b Q4_K_M I get 90 token/s on Arch Linux with ROCm 7.2.2.
BringMeTheBoreWorms@reddit
is that split on 2 cards or running on one?
Acu17y@reddit
On One XTX
BringMeTheBoreWorms@reddit
This is Brutus! 2 7900 XTX GPUs and a 6900xt I had
BringMeTheBoreWorms@reddit
yeah that's ok for a single card. 2 xtx combined gives you 48gb to play with, but slower t/s. I get 120 t/s on qwen 3.6 q4 running on a single xtx, but it drops to ~60-odd if I bump it to q8 over 2 xtx gpus with a big context.
Acu17y@reddit
Oh ok, I didn't know that ;)
Out of curiosity, what OS and client do you use?
boutell@reddit (OP)
Fascinating. From what little I think I know, that... shouldn't work. Each card has only 24GB RAM, which is equivalent to my Mac if we are very cautious about what my terminal windows and browser take up. So how are you able to do 256k context rather than 32k and q8 rather than Q4? I'm not doubting you, I'm wondering what I missed.
Far_Course2496@reddit
He's offloading what doesn't fit in vram into system ram, or rather llama cpp is. That's why he's getting slow speeds. If it was all in vram he'd get 100+t/s
boutell@reddit (OP)
Thank you! That makes sense. So not an option for my particular setup.
BringMeTheBoreWorms@reddit
It's all in VRAM; qwen 3.6 q8 is 38GB, leaving 10GB for context, plenty for that length. I'm running the same model right now crammed with 300000 context split over 3 sessions (100000 per session). The slowdown is because splitting models over multiple AMD GPUs actually slows things down, but it gives you access to a bigger memory base.
mbrodie@reddit
Why shouldn’t it work? All harnesses pool RAM with auto-fit or layered offload.
BringMeTheBoreWorms@reddit
Could you deploy separate instances to each card and then get the jump in t/s from a single-GPU deployment?
politerate@reddit
Vulkan and ROCm
BringMeTheBoreWorms@reddit
Not bad! Makes it a damn fast model for coding. I ran the q8 model over two cards earlier and hammered it today.
Slowed down over time to ~50 t/s with 3 sessions with 100k each.
politerate@reddit
I also have a dual mi50 build, which runs q8 xl but it's much slower. I haven't really tested big contexts, it starts at 50tps with zero context.
BringMeTheBoreWorms@reddit
Still nice to have something to offload work to. I was curious to know if the r9700 might be worth a go as well. Slower memory, but rdna4.
mbrodie@reddit
i assume i could, there is definitely a performance hit to running on dual cards.
i'd probably have to drop down to a Q4 to do that, but that being said... when he's actually fixed and working right i've had him at 92 tps as is, split across cards with llama.cpp
as soon as everything is fixed and optimised he should be pretty decent, i've seen multiple reports of people's results getting 150+
BringMeTheBoreWorms@reddit
I get between 100 and 120 t/s with 3.6 q4_m. I have 2 x 7900xtx as well, so playing with that setup. Am thinking of keeping 27b on one of them still though.
erdholo@reddit
Use turboquant the Tom turboquant plus
ipcoffeepot@reddit
There are builds of llama.cpp with turboquant now. You should be able to ~6x your context size. That's going to be crucial. I don't think you can do a lot of non-trivial agentic coding stuff on 32k tokens; all the exploration tool calls and thinking rip through that.
retireb435@reddit
is that merged into main yet?
Gesha24@reddit
So far claude code is my favorite agent, but 32K context is way too low for it. I was hitting a limit at 100K when I asked it to figure out the API and it had to look up some specs. See if you can squeeze more context with k:v quantization; maybe you could get to at least 80K, where it should be OK-ish?
boutell@reddit (OP)
(I used -ctk q8_0 -ctv q8_0, which claude suggested would be a conservative setting, going from 16 bit to 8 bit for the k:v cache.)
Gesha24@reddit
Yes, that's reasonable. Just for your reference, I am running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf + default quant for k:v (which I believe is f16) on Radeon AI 9700, which is a 32G VRAM card and I am hitting 89% VRAM utilization with 260K context. So if you can figure out a way to free a few GB of RAM, you can squeeze a q8 cache in there with decent size.
gasgarage@reddit
i'm using the same gguf and gpu here; works fine with 200k context on vulkan, but it eventually stops, needing a "continue" every now and then. Don't know why. My conf:
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --no-context-shift --keep 4096 -b 2048 -ub 4096 --no-mmap --chat-template-kwargs '{"preserve_thinking": true}'
Gesha24@reddit
Qwen thinks that --no-context-shift and --keep 4096 effectively cancel each other out. I have not used either of those. But to be fair, I don't think I have reached 200K with any agentic workload either. I did verify I can reach 250K context with a very large log file through the web, but most of my agentic workloads sit around 100K tokens, occasionally peaking to 150K tops.
DistanceSolar1449@reddit
Qwen 3.5/3.6 35b uses 20.48KB per token, aka 5.0GB of ram at full context bf16 lol
Plus 144MiB of SSM cache.
So Q8 saves you like 2.5GB only. Going even smaller is definitely not worth it. You save like 1GB but you make the model super brain damaged.
In fact, I don’t even suggest Q8. Only 1 in 4 layers are stored in KV cache, so reducing KV cache really impacts Qwen 3.5/3.6. If you use Q8, at least use Turboquant/attn-rot.
5GB at full context BF16 = 2.5GB at 128K token context = 1.25GB at 64K token context.
You’re better off sticking with BF16 kv cache without quantization, set context size to 64k tokens, and then use a smaller IQ4_XS or Q3 instead.
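A quick back-of-envelope check of those numbers (taking the quoted 20.48 KB/token at BF16 on trust, with KB read as 1000 bytes so the 5.0 GiB figure works out):

```python
# KV-cache sizing from the per-token figure quoted above.
BYTES_PER_TOKEN_BF16 = 20.48e3  # 20.48 KB/token at bf16, per the comment
GIB = 1024 ** 3

def kv_cache_gib(n_tokens: int, scale: float = 1.0) -> float:
    """KV-cache size in GiB; scale=0.5 approximates q8_0 vs bf16."""
    return n_tokens * BYTES_PER_TOKEN_BF16 * scale / GIB

print(round(kv_cache_gib(262144), 2))       # full 256K context, bf16 -> 5.0
print(round(kv_cache_gib(131072), 2))       # 128K context -> 2.5
print(round(kv_cache_gib(65536), 2))        # 64K context -> 1.25
print(round(kv_cache_gib(262144, 0.5), 2))  # q8_0 at full context -> 2.5
```

So q8_0 at full context saves about 2.5 GiB, matching the "Q8 saves you like 2.5GB only" claim.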
boutell@reddit (OP)
Yeah, I wish I understood all of these different variations of 4-bit better. I will experiment.
It does sound like this specific model should not be particularly cache-hungry. But in practice I keep seeing the same thing, which is that I'm fine until I get past about 32k of context.
DistanceSolar1449@reddit
https://www.reddit.com/r/unsloth/s/yTi2OiWyPp
ja-mie-_-@reddit
have you tried raising iogpu.wired_limit_mb? the default holds more memory for the os than it really needs in most cases. also look into mlx over llama.cpp. mlx roughly doubled generation speed for me on an m4 max
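For anyone trying this, a sketch (the 26 GB value is an arbitrary example; `iogpu.wired_limit_mb` is the real macOS sysctl, but pick a limit that leaves headroom for the OS):

```shell
# Raise the GPU wired-memory ceiling on Apple Silicon (resets on reboot).
sudo sysctl iogpu.wired_limit_mb=26624   # ~26 GB of the 32 GB, example value
sysctl iogpu.wired_limit_mb              # confirm the new limit
```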
Independent_Solid151@reddit
you can use k at q5_0 and v at turbo3; find the TheTom/llama.cpp turbo quant fork.
YourNightmar31@reddit
Kv cache is so cheap with qwen that using turboquant barely changes anything.
ZealousidealBunch220@reddit
It does change a lot.
DistanceSolar1449@reddit
Qwen 35b uses 625MB for Q8 kv cache at 64K tokens lol. Switching to Q5 saves you what, 250MB?
There’s like 0 reason for OP to use Q5.
amelech@reddit
Does it work with ROCm?
Express_Quail_1493@reddit
I always regret going below KVCacheType=q8.
mbrodie@reddit
You should get an AI to look into this. There are currently issues with him crashing out to OOM using quantized cache and flash attn; it's a known bug. There are several big known bugs currently, and the harnesses aren't fully compatible with him yet, it seems.
I spent 3 days constantly researching and optimising things.
I made another post with more info in this thread.
boutell@reddit (OP)
This is a cool idea! Unfortunately, when I tried it, qwen IMMEDIATELY got confused about the name of the current working directory. Just straight dropped a letter in the directory name like five sentences in, and that was game over.
On a restart it was even worse 😜
I assume this is a direct consequence of an extremely "lossy JPEG" k:v cache, which makes intuitive sense. So for now I'm concluding that this is just not a viable strategy with opencode.
cakemates@reddit
that might be a consequence of not having enough context at 32k, Claude Code system prompt is roughly 16,500 to 25,000 tokens leaving almost nothing for your project.
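Quick arithmetic on that budget (the system-prompt figure is the rough estimate above):

```python
ctx_window = 32768       # OP's maximum workable context
system_prompt = 25000    # upper end of the Claude Code prompt estimate
remaining = ctx_window - system_prompt
print(remaining)         # -> 7768 tokens left for the actual project
```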
boutell@reddit (OP)
I'm using opencode because I have read it is more friendly to small context windows, but that doesn't mean it's not the same problem.
hdmcndog@reddit
If you want an agent harness with a really minimal system prompt, try Pi (pi.dev). But be careful, it doesn’t have a permission system.
boutell@reddit (OP)
Thanks. Good to know. I prefer to use OS level permissions anyway.
alchninja@reddit
I've been using Qwen3.5-35B-A3B:UD-Q4_K_XL on OpenCode with Q8 quantization for both K and V, it's occasionally messed up a file path here or there but not so much that it's been a problem for me. I feel like you might just be running into unpatched issues with 3.6. You can try leaving the K unquantized and just using Q8 for V, that should probably improve things?
mbrodie@reddit
Drop down to one of the q6s it’s marginal degradation based off the performance charts and you’ll have more overhead for kv
boutell@reddit (OP)
I'm on a Q4 model already. So q6 would be higher requirements not lower.
mbrodie@reddit
My gosh, I'm sorry, I could have sworn I read you're running a Q8. Yeah, that's rough… for what it's worth, there are known bugs around his checkpoint system, flash attn, llama.cpp and stuff.
Get ChatGPT or something to look into it all; someone might have come up with a workaround for your specific system.
I had glm 5.1, Claude and ChatGPT all run deep research reports scouting the GitHub’s, reddits etc… looking for community prs etc…
They found a bunch of open PRs and tickets directly relating to the Qwen 3.6 issues and user workarounds for now etc!
It’s a shame because when he’s working he’s actually fantastic
SmartCustard9944@reddit
Just system prompt plus tools is ~22k
KillerX629@reddit
I know claude has its own cap also. How can you increase it?
Gesha24@reddit
Claude CLI agent? Haven't seen it hit any caps. I believe Anthropic is at 1M context and from agent's perspective, it is talking to Anthropic backend.
boutell@reddit (OP)
Claude points out the official model card for this model says, "The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.6 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities."
So it's kinda right there on the label, "must be this tall to ride this ride." Maybe that's my answer.
thejosephBlanco@reddit
I really like PI, I find myself using it more and more and everything less and less. And getting results.
DistanceSolar1449@reddit
Use IQ4_XS or Q3_XL
boutell@reddit (OP)
yeah IQ4_XS is clearly an improvement so far.
hamiltop@reddit
I'm starting to run it on my AMD minipc with a 760M and 32GB DDR5 and opencode.
Here's my config and stats:
```
Model:
- --model Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf (Unsloth dynamic 3-bit XL quant, ~15.5 GB weights)
- --mmproj mmproj-F32.gguf (vision projector, ~1.7 GB)
Memory / context:
- --ctx-size 131072 (128k)
- --n-gpu-layers 999 (full GPU offload — 41/41 layers)
- --cache-type-k q8_0 / --cache-type-v q8_0 (KV cache quantized, ~850 MiB at load)
CPU load 3.93 1.92 1.22 psi10 cpu 0.1% mem 0.0% io 0.2%
RAM 27.8/30.2 GB (92%) swap 5.6/16.0 GB
GPU util 80% pwr 38.2W tmp 75C clk 2600/2600MHz vram 1.0/1.0G gtt 19.9/25.0G
SRV rss 0.8G anon 0.8G file 0.1G swap 0.0G pids 3 (llama-serverx3)
Perf
- Short-context query (~5k): ~90 t/s pp, ~21 t/s gen — 1k-token reply in ~50s total
- Mid-context (~30k): ~80 t/s pp, ~17 t/s gen — same reply in ~60s
- Long-context (~60k): ~65 t/s pp, ~16 t/s gen — same reply in ~65s
```
It's good enough to do very exhaustive tasks in a loop. Stuff like "Please examine every single file for performance and security issues. Track already examined files in AUDIT.md". I can let that run overnight and it'll find stuff for me to dig in on in the morning.
boutell@reddit (OP)
Yes, using a smaller quant seems to be key. I'm using IQ4_XS in my latest iterations and it's definitely better.
amitspf@reddit
You can use the AlienSkyQwen apple kernels it will reduce KV cache by 16x and you can probably get upto 512k context on your M2 mac
boutell@reddit (OP)
[80% sure is joke]
howardhus@reddit
i think you are hitting the context problem that most people fail to understand.
I see lots of posts of people claiming to be able to "run" some llm with 128k or 256k context "with no problems", but what they really mean is that they can "start" some llm with that context "limit".
what people miss is that „context“ is measured in tokens and those depend on the quantization and parameters of the model.
just ask any llm how much RAM a 128k context will use on a 27B model:
yeah, you can start a model with 128k but when you actually use it your RAM explodes
boutell@reddit (OP)
That's the initial result I got too. But, something I've been learning through this post: ask that model to do research on qwen 3.6 35b+a3b specifically. These Qwen models use linear attention for most layers, and conventional, expensive attention for just a few layers. So the RAM cost is much lower than you'd think. Whereas reducing the model size itself by 5GB by going from Q4_M to IQ4_XS is making a big difference for me so far...
However, to your point, my tests so far have only pushed the context into the low 50's before completing a first pass. So I'm not declaring victory here. You could still be right, context could still be the killer on my machine, but what I'm reading about qwen 3.6 suggests that's not it. It's more that the model weights are uncomfortably close to the ceiling, plus RAM reserved by the OS, plus chrome and vscode being pigs (I'm closing them for these tests now).
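A toy calculation of why the hybrid design is cheap, using the "only 1 in 4 layers keep a KV cache" figure quoted elsewhere in this thread. The layer count and head dimensions below are made-up placeholders, not the real Qwen config:

```python
# Toy comparison: per-token KV bytes for a dense-attention model vs a
# hybrid where only 1 in 4 layers keeps a KV cache. Dimensions are hypothetical.
n_layers, n_kv_heads, head_dim, bytes_bf16 = 40, 8, 128, 2

def kv_bytes_per_token(kv_layer_fraction: float) -> int:
    # K and V each store n_kv_heads * head_dim elements per caching layer.
    layers_with_kv = int(n_layers * kv_layer_fraction)
    return layers_with_kv * 2 * n_kv_heads * head_dim * bytes_bf16

dense = kv_bytes_per_token(1.0)    # every layer caches
hybrid = kv_bytes_per_token(0.25)  # 1 in 4 layers caches
print(dense, hybrid, dense // hybrid)  # hybrid is 4x cheaper per token
```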
PiratesOfTheArctic@reddit
I'm using that on my laptop, an i7, 4core, 32gb ram, it works.. to a degree for me (!) some things it's incredibly quick on, others, I make a pot of tea and it's spitting out code. Its helping with a python project
boutell@reddit (OP)
It's so interesting. What are the details of your setup? What flags and so on?
PiratesOfTheArctic@reddit
I'll get them for you later today if that's ok, just on a train on mobile(!) I used Claude to give me the flags based on my technical spec, no Idea if they are right!
I worked out after a few code rewrites to start a new conversation, seems better keeping track, I also use the 9B and 4B versions, Gemini seems to question life when used
boutell@reddit (OP)
Oh yeah no worries, this is a side project (as long as Claude Code + Opus 4.7 continues to mostly work most days...)
serbideja@reddit
On my 32 GB RAM Mac I managed to squeeze a 256k context size with qwen3.6:35b q4_k_m, with green memory pressure and no swap written. It behaved almost as well as qwen3.5:27b. Here is my llama cpp command:
The most important parts there on a unified memory Mac are `--cache-ram 0` and `--ctx-checkpoints 1`, because those will eat a lot of RAM.
boutell@reddit (OP)
Very interesting! Performance tradeoffs, but anything is better than swapping...
BringMeTheBoreWorms@reddit
This is more of an opencode issue and how it handles session state. I have found that compaction is handled much more efficiently if you set up opencodes compaction agent to point to a smaller faster model running on its own.
This stops the current context from being heavily maintained along with the compacted context. But the bigger your main model's context, the better.
I do wonder if opencode does this a little too frequently though.
boutell@reddit (OP)
Ah. So compaction itself is a bit of a "delegate to an agent, double context" situation?
boutell@reddit (OP)
Thank you for all the feedback!
A few main insights I heard:
* KV cache is not actually that much of a pig with Qwen 3.5 or 3.6 MoE because they use a lot of linear attention layers.
* So the behavior I'm seeing is probably a "straw breaking the camel's back" moment.
* The model weights are the real pig, along with other applications on my Mac. Sure, I'm "just" running Chrome and vscode, but that's two instances of Chromium right there and modern web apps are pigs.
* Not all Q4 quants are created equal. Some are significantly smaller, and if you're right on the edge that matters.
So I downloaded the IQ4_XS quant (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf) and tried that with the context size set to 131072 (128K).
With no other changes, opencode was able to complete its first attempt at the task. Context got into the low 50K range.
At one point I saw evidence the Mac was swapping hard, so I closed Chrome and vscode, which definitely made a big difference. Swap-related tasks disappeared from Activity Monitor.
So... yes! I can run Qwen 3.6 35B-A3B with considerably larger context on this Mac, as long as I use an aggressive 4-bit quantization and close other apps.
So far, the jury is still out on whether the model is smart enough for the task. It described the issue pretty well, but the solution it implemented is worse than the original problem.
The jury is also still out on whether I can really use 128K context, since this first pass on the problem only reached the low 50K range. But if everyone's math is right, this will not be the breaking point.
I don't expect models to one-shot things any more than I expect humans to do so. So later, when I don't need my Mac to do my job, I'll close all other apps again and ask it to iterate on the problem using Playwright until it finds a solution. I did the same previously with Opus 4.7.
Since Opus 4.7 already solved this problem once, this is just for science. Very interested to see if a local model can finish the job!
BFirebird101@reddit
Are you not getting slow turn latency (time between turns)? I have an M4 Max and tested the same 4-bit model, and while tok/s is decent, the turn latency is absurd (using oMLX).
sword-in-stone@reddit
ask it to maintain notes in an MD file as it works, then compaction is not a problem, just ask it to read the notes
DeepBlue96@reddit
Try disabling reasoning (-rea off). Still, with 32gb you should be able to fit the model extremely well with a context of 128k. Did you perhaps download it in fp16?
Justin-Philosopher@reddit
I'm actually running it in hermes agent with 2x3090s using vllm and awq 4 bit. Works pretty well. I have it set to 256k context and to compact around 50%. Currently adding new features to my vocal trainer for byzantine microtonal chant, written in cpp. I use glm-5.1 to create a plan and then use qwen to build it out and burn tokens. It's noticeably slower than the cloud glm-5.1 that I'm using. Sometimes I have to nudge it when no tool gets called. But it never made malformed tool calls, like glm 5.1 sometimes does, where the tool calls end up written into the messages.
International-Fly127@reddit
what sort of tps are you getting?
boutell@reddit (OP)
Are you keeping KV cache in system ram?
instant_king@reddit
I use it for image recognition and outputting json with analysis and judgement if this is a-roll or b-roll in a process for AI video editing. Works amazingly well.
Express_Quail_1493@reddit
Exactly why I use a tiny coding agent that has the basics, and I only allow the LLM to use the bare minimum of what it needs, to keep the context window for raw task execution. I'm using pi-coding-agent, only a 1k system prompt. Lots of coding harnesses use so much system prompt it's exhaustive.
boutell@reddit (OP)
This makes sense to me. Over engineered for dumber models.
ReentryVehicle@reddit
You can also try Qwen 3.5 27B (will be slower, but Q4_K_M fits with ~100k context in 24GB RAM). It tends to also think a bit less by default.
I would suggest to disable automatic compaction, it is stupid IMO. It doesn't make sense to force compaction before doing a single task.
"compaction": { "auto": false },
boutell@reddit (OP)
Thank you. I will look into this. With the small context window I was forced to configure, compaction was certainly inevitable, but if I can significantly expand it with 27b, it might not be.
xristiano@reddit
I'm using it with pi an RTX 3090 (24GB) and the following settings. I am impressed.
ExecStart = "${llama-cpp-cuda}/bin/llama-server -m /models/unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /models/unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf --alias local --host 0.0.0.0 --port 8081 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --fit on --ctx-size 131072";
Plenty_Coconut_1717@reddit
Yeah, same boat on M2 32GB. Qwen3.6-35B feels smart but context just dies after 1-2 compactions in OpenCode. Tried 32k and it still forgets shit. For real coding agents, 128k+ seems mandatory like the model card says. Sticking with smaller context models for now.
Simple-Fault-9255@reddit
I recommend using goose tbh it's slightly better than open code.
AwkwardBall@reddit
SAIHM (Sovereign AI Horizontal Memory) is what you need. SAIHM is a protocol that leverages Filecoin, Storj, Arweave, and IPFS, built on the COTI V2 network using Garbled Circuits technology. Find it at:
saihm coti global
With dots between. 😉
Use one of the Quickstart prompts. It will start saving your tokens and reducing context window immediately. In your case, set up a swarm with shared memories using your existing agents. To save even more, because SAIHM is agent and platform agnostic, you might be able to use free tier agents from any provider (or build your own) and then add those into your swarm with shared memories.
It sounds like you are pretty adept at doing things with agents already. You’ll ❤️❤️❤️ SAIHM. 💯
R_Duncan@reddit
Errors I see in this config:
- Too small context, they advised to give at least 128k (use at q8_0 if needed)
- Missing jinja, they advised is mandatory
- Missing temp, top_k, top_p
Worried-Squirrel2023@reddit
ran into the exact same wall with 32k context. the model is smart enough to understand the bug but the context window is too small to hold the fix and the understanding at the same time. after compaction it basically forgets what it figured out. ended up splitting tasks into smaller chunks manually instead of asking it to do one big thing. annoying but it works way better than fighting the context limit.
makkalot@reddit
You can try pi agent since opencode starts at 10-12k context with its system prompt.
benevbright@reddit
Yeah, Pi is recommended. https://www.npmjs.com/package/ai-agent-test My tool sends even smaller: 3k.
caetydid@reddit
you might want to use preserve-thinking:true ... from your problem description it really looks like this could be the cause
inaem@reddit
Did anyone manage to make qwen work with claude code?
I keep seeing errors even though it seems to be working.
Grouchy-Bed-7942@reddit
Use the oMLX backend instead of llamacpp and test the kv turboquantification!
PairOfRussels@reddit
Use -ncmoe to put some (or even all) experts in dram freeing up vram for larger context.
logic_prevails@reddit
I super regret not getting a 64gb mac… if only I could have known local ai was gonna take off before I bought it 3 years ago
grandchester@reddit
On my M4 Pro Mac mini with 64GB RAM, I am running Qwen3.6-35B-A3B-RotorQuant-MLX-6bit (also was using Qwen3.6-35B-A3B-4bit but RotorQuant was much faster for prompt processing). It does really well with tool calling, but I almost always get stuck in a thinking loop. I haven't been able to figure it out. I feel like if I can get past that it will be working really well. So I'm going to keep playing with it.
mbrodie@reddit
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' --reasoning-budget 8192 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
I don’t wanna assume, but if you haven't already, try this. I’ve never had him spiral using these settings.
sisyphus-cycle@reddit
I wonder if that’s due to the preserve thinking (extra data in kv cache is downside), the explicit budget, or both? Good to know will try some tests out
grandchester@reddit
That was my thought too, but disabling preserve thinking doesn’t seem to help much.
Cute_Obligation2944@reddit
Disable thinking entirely if you're using a multi agent harness, tools, and pyright (e.g. opencode).
sisyphus-cycle@reddit
Agreed. I’ve found qwen to be rather verbose with its reasoning tokens, except when using an explicit harness like open code or pi. I ran some tests with no system prompt and hitting llama server directly and it averaged 3-4k reasoning tokens for leet code medium/hard example questions. I now get why Anthropic has been trying so hard with adaptive reasoning lol. Should be relatively straightforward to fine tune a super small 200-300m model specifically to map inputs to reasoning budget per chat completion req. honestly an LSTM hybrid or other simpler approach might work if you do it right. I wish I didn’t have a real job and other responsibilities lmao, would just do this all day
grandchester@reddit
I appreciate this! I've been messing with all these settings but will try this combo. I've been trying to keep the temp lower so the tool use is more consistent but will experiment. It feels like it is so close. Maybe it is still just the model and we need another generation or two to really get it over the hump, but 3.6 for the first time on my hardware is showing local could be a viable path forward which is very exciting.
boutell@reddit (OP)
Thank you for the data point! What context size?
simracerman@reddit
Try opencode. 32k won’t do any real work. A minimum of 64k is a start and you would need to shave real tokens off the input, use subagents and minimize the use of MCPs/plugins.
Ill_Evidence_5833@reddit
Well, q5_k_m has been giving better results with claude code than q4_k_m.
PattF@reddit
I tried, but even in the 100k range I kept getting into a loop: hitting the trigger to compact, then after reading the handoff trying the same thing and hitting the limit again. It’s frustrating. I need more RAM. Right now I’ve gone back to 3.5 9B just so I can bump the context.
iTrejoMX@reddit
I think you need to use a smaller quantization. For 100k tokens you will need more RAM. Try q3_m.
PattF@reddit
That’s Q3_K_S
iTrejoMX@reddit
Ah yeah that one
SettingAgile9080@reddit
I think you should revisit the k:v cache quantization - it probably went dumb due to a combination of the model being below minimum viable context length + quantization... if you can get the context window size up, quantization's effects should lessen. Try:
- flash-attn computes the attention matrix more efficiently, so it's faster and uses much less VRAM, unlocking larger context lengths on smaller consumer systems.
- no-mmap forces loading the entire model at start; it takes longer, but once it is loaded it is faster, and most importantly on a smaller system it will give you an early warning if it is going to blow up.
- jinja is required for the template kwargs.
Dial back to -c 65535 if it still crashes. The quality hit on KV cache should be offset by giving it more context window.
Turning off enable_thinking helps in low-context environments. preserve_thinking is specific to Qwen 3.6 and keeps the model's suppressed thinking tokens in the KV cache, so it can still reference its own internal reasoning even though the blocks aren't emitted in the output.
Also try a smaller quant: Q3_K_M drops from 22.1GB to 16.6GB, bringing the model to less than half of your total memory and leaving more space for context. Agentic use like tool calling seems more tolerant of less capable models as long as it has the context window to orchestrate (at 32K context, opencode would get stuck in constant loops for me; at 128K it runs non-stop and retries when it is too dumb to get it first time around).
I'm on a 20GB Ada 4000 and able to run this thing with 128K context without an OOM crash so far. It is the first time I've felt a local model be somewhat useful for agentic coding in terms of competency + inference speed... not replacing my Claude Max sub any time soon but it is actually usable for simple tasks and long-running jobs. I can even run it with the mmproj weights for multimodal if I offload a bunch of tensors to CPU. The memory accounting is a bit different with unified memory but can confirm that Qwen 3.6 seems to be a step up in terms of running on smaller memory systems, so there may be hope for you yet... good luck!
PaceZealousideal6091@reddit
KV cache at q8_0 shouldn't be as debilitating as you have described. It must be an issue of the low context limit set by you that it forgets the path. I suggest you move to UD Q4_K_S. It's much smaller and would give you enough bandwidth to play around with context. 32k is too low for agentic tool use.
whichsideisup@reddit
If you want it to behave like Claude you need 128k minimum and probably FP8 on the model.
Jeidoz@reddit
FYI: You can use `"plugin": ["opencode-lmstudio@latest"]` or `"plugin": ["opencode-plugin-llama.cpp@latest"]` in your OpenCode config to automatically retrieve all models from an active Dev Server in LM Studio or a running instance of Llama.cpp, without needing to manually type them in the config file. May be more useful if you like to define custom configs per project.