Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?
Posted by boutell@reddit | LocalLLaMA | View on Reddit | 116 comments
I'm running Qwen3.6-35B-A3B-UD-Q4_K_M on an M2 Macbook Pro with 32GB of RAM. I'm using quite recent builds of llama.cpp and opencode.
To avoid llama-server crashing outright due to memory exhaustion, I have to set the context window to 32768 tokens. This turns out to be important.
As a hopefully reasonable test, I gave opencode a task that Claude Code was previously able to complete with Opus 4.7. The project isn't huge, but the task involves rooting around the front and back end of an application and figuring out a problem that did not jump out at me either (and I was the original developer, pre-AI).
The results are really tantalizing: I can see it has figured out the essentials of the bug. But before it can move on to implementation, compaction always seems to throw out way too much info.
If I disable the use of subagents, it usually survives the first compaction pass with its task somewhat intact, because I'm paying for one context, not two.
But when I get to the second compaction pass, it pretty much always loses its mind. The summary boils down to my original prompt, and it even misremembers the current working directory name (!), coming up with a variant of it that of course doesn't exist. After that it's effectively game over.
After reading a lot about how Qwen is actually better than most models with regard to RAM requirements, I've come to the conclusion that (1) 32768 is the biggest context I can get away with, and (2) it just ain't enough.
Has anyone had better results under these or very similar constraints?
Thanks!
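For reference, a hedged sketch of the kind of llama-server invocation I'm describing (model path is illustrative, not my literal command; the flags are standard llama.cpp options):

```shell
# Illustrative launch on a 32GB M2: cap context at 32768 to avoid OOM,
# with the q8_0 KV cache mentioned below. Filename/path is hypothetical.
llama-server \
  -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 32768 \
  --flash-attn on \
  -ctk q8_0 -ctv q8_0 \
  --port 8080
```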
mbrodie@reddit
Running Q8 3.6 35b a3b on 2 x 7900xtx through llama.cpp, which seems to be the only harness willing to support gfx1100, to its detriment: the performance is subpar, especially on simultaneous connections.
Anyway I use a headless opencode server on the server I have him on and whisper code for phone / opencode desktop for windows
It’s taken a bit to get here, like 3 days of benchmarking, testing settings, changing flags, looking for fixes and workarounds.
But I can finally run him on 2 parallel 262k streams with basically no crashing out from refusing to dump anything from memory.
But it comes at a small cost: he only runs at like 75 tps.
I’m not finished though I’ll keep optimising and stuff until I’m getting proper speeds with his systems working properly.
But yea I get him doing actual coding and work and in my eyes he’s what Claude 4.7 should have been when he’s actually running good.
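Not my exact command, but a sketch of a llama-server launch along these lines (two parallel slots sharing one context budget; filenames are illustrative):

```shell
# Two 262144-token slots: the total -c budget must cover both (2 x 262144).
# Model filename is hypothetical; flags are standard llama.cpp options.
llama-server \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  --n-gpu-layers 999 \
  --split-mode layer \
  -c 524288 -np 2 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080
```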
BringMeTheBoreWorms@reddit
That t/s is actually pretty good for q8 over 2 xtx cards. What build of llama.cpp are you using, and any special settings? I'm just playing around on mine and getting around 65 t/s on that same model.
mbrodie@reddit
Always the latest of everything: llama.cpp, ROCm, everything at the most bleeding edge for fixes. I also cherry-pick performance PRs.
Thinking set to false is temporary while they fix the failure to free context from RAM when it dumps context.
Average TPS across the 5 tests we ran today:
| Quant | Avg TPS |
|---|---|
| APEX-I-Quality-1GPU | 89.22 |
| Opus-Q5_K_M | 74.40 |
| APEX-I-Balanced | 73.62 |
| Q6_K | 72.70 |
| AesSedai-Q6_K | 71.83 |
| UD-Q5_K_XL | 71.58 |
| Q8_0_TEXTONLY | 70.32 |
All benchmarks are designed and tested on my own codebase for real world actual applicable scenarios so it gives a really good snapshot at how they perform in my environment with the exact same context
1. Q6_K (bartowski)
2. AesSedai-Q6_K
3. APEX-I-Balanced
4. Q8_0_TEXTONLY
5. Opus-Q5_K_M
6. APEX-I-Quality-1GPU
7. UD-Q5_K_XL
That was the order from best to worst
BringMeTheBoreWorms@reddit
I've just got my multi-build matrix scripts working, so I'm trying out as many versions and combinations as I can, including the turboquants. Are there any PRs that you've found are particularly worth keeping an eye on?
mbrodie@reddit
Here’s the current list of PRs/issues we’ve been following or directly using:
Using directly:
- ggml-org/llama.cpp#22094: we are using this as a local patch in our current llama image build
- ggml-org/llama.cpp#21771 (common/chat.cpp)

Following in llama.cpp:
- ggml-org/llama.cpp#21831: very relevant to Qwen3.6
- ggml-org/llama.cpp#22127: `--cache-ram 0` still logs "prompt cache enabled"; cosmetic/misleading, not the core bug
- ggml-org/llama.cpp#22135: status: open issue
- ggml-org/llama.cpp#21757 (`--kv-dynamic`): interesting for long-context memory behavior
- ggml-org/llama.cpp#21741 (`--clear-idle` to `--cache-idle-slots`): we already adapted to this in the updated launcher
- ggml-org/llama.cpp#22051: already part of newer builds we wanted
- ggml-org/llama.cpp#22073: already included in newer builds
- ggml-org/llama.cpp#22114

Following in vLLM:
- vllm-project/vllm#37826 (gfx1100/gfx110x): big one for your 7900 XTX setup
- vllm-project/vllm#37712 (wvSplitK path): status: open PR
- vllm-project/vllm#40308: very relevant to Qwen3.6 hybrid behavior
- vllm-project/vllm#38502: status: open PR
- vllm-project/vllm#37472 (`--language-model-only`): mattered for testing

Following in Qwen:
- QwenLM/Qwen3.6#131: this is what I have the AI tracking currently
BringMeTheBoreWorms@reddit
Groovy!
Acu17y@reddit
On my 7900XTX with qwen3.6 35b a3b Q4_K_M I get 90 token/s on Arch Linux with ROCm 7.2.2.
BringMeTheBoreWorms@reddit
is that split on 2 cards or running on one?
Acu17y@reddit
On One XTX
BringMeTheBoreWorms@reddit
This is Brutus! 2 7900 XTX GPUs and a 6900xt I had
BringMeTheBoreWorms@reddit
yeah that's ok for a single card. 2 xtx combined gives you 48gb to play with, but slower t/s. I get 120 t/s on qwen 3.6 q4 running on a single xtx, but it drops to ~60-odd if I bump it to q8 over 2 xtx gpus with a big context.
Acu17y@reddit
Oh ok, I didn't know that ;)
Out of curiosity, what OS and client do you use?
boutell@reddit (OP)
Fascinating. From what little I think I know, that... shouldn't work. Each card has only 24GB RAM, which is equivalent to my Mac if we are very cautious about what my terminal windows and browser take up. So how are you able to do 256k context rather than 32k and q8 rather than Q4? I'm not doubting you, I'm wondering what I missed.
Far_Course2496@reddit
He's offloading what doesn't fit in vram into system ram, or rather llama cpp is. That's why he's getting slow speeds. If it was all in vram he'd get 100+t/s
boutell@reddit (OP)
Thank you! That makes sense. So not an option for my particular setup.
BringMeTheBoreWorms@reddit
It's all in VRAM; qwen 3.6 q8 is 38GB, leaving 10GB for context, plenty for that length. I'm running the same model right now crammed with 300000 context split over 3 sessions (100000 per session). The slowdown is because splitting models over multiple AMD GPUs actually slows things down, but it gives you access to a bigger memory base.
mbrodie@reddit
Why shouldn’t it work? All harnesses pool RAM with auto-fit or layered offload.
BringMeTheBoreWorms@reddit
Could you deploy separate instances to each card and then get the jump in t/s from a single-GPU deployment?
politerate@reddit
Vulkan and ROCm
BringMeTheBoreWorms@reddit
Not bad! Makes it a damn fast model for coding. I ran the q8 model over two cards earlier and hammered it today.
Slowed down over time to ~50 t/s with 3 sessions with 100k each.
politerate@reddit
I also have a dual mi50 build, which runs q8 xl but it's much slower. I haven't really tested big contexts, it starts at 50tps with zero context.
BringMeTheBoreWorms@reddit
Still nice to have something to offload work to. I was curious to know if the r9700 might be worth a go as well. Slower memory, but rdna4.
mbrodie@reddit
i assume i could, there is definitely a performance hit to running on dual cards.
i'd probably have to drop down to a Q4 to do that, but that being said... when he's actually fixed and working right i've had him at 92 tps as is, split across cards with llama.cpp
as soon as everything is fixed and optimised he should be pretty decent, i've seen multiple reports of people's results getting 150+
BringMeTheBoreWorms@reddit
I get between 100 and 120 t/s with 3.6 q4_m. I have 2 x 7900xtx as well, so playing with that setup. Am thinking of keeping 27b on one of them still though.
erdholo@reddit
Use turboquant the Tom turboquant plus
ipcoffeepot@reddit
There are builds of llama.cpp with turboquant now. You should be able to ~6x your context size. That's going to be crucial. I don't think you can do a lot of non-trivial agentic coding stuff on 32k tokens; all the exploration tool calls and thinking rip through that.
retireb435@reddit
is that merged into main yet?
Gesha24@reddit
So far claude code is my favorite agent, but 32K context is way too low for it. I was hitting a limit at 100K when I asked it to figure out the API and it had to look up some specs. See if you can squeeze more context with k:v quantization; maybe you could get to at least 80K, where it should be OK-ish?
boutell@reddit (OP)
(I used -ctk q8_0 -ctv q8_0, which claude suggested would be a conservative setting, going from 16 bit to 8 bit for the k:v cache.)
Gesha24@reddit
Yes, that's reasonable. Just for your reference, I am running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf + default quant for k:v (which I believe is f16) on Radeon AI 9700, which is a 32G VRAM card and I am hitting 89% VRAM utilization with 260K context. So if you can figure out a way to free a few GB of RAM, you can squeeze a q8 cache in there with decent size.
gasgarage@reddit
i'm using the same gguf and gpu here; works fine with 200k context on vulkan, but it eventually stops, needing a "continue" every now and then. Don't know why. My conf:
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --no-context-shift --keep 4096 -b 2048 -ub 4096 --no-mmap --chat-template-kwargs '{"preserve_thinking": true}'
Gesha24@reddit
Qwen thinks that --no-context-shift and --keep 4096 effectively cancel each other out. I have not used either of those. But to be fair, I don't think I have reached 200K with any agentic workload either. I did verify I can reach 250K context with a very large log file through the web, but most of my agentic workloads sit around 100K tokens, occasionally peaking to 150K tops.
DistanceSolar1449@reddit
Qwen 3.5/3.6 35b uses 20.48KB per token, aka 5.0GB of ram at full context bf16 lol
Plus 144MiB of SSM cache.
So Q8 saves you like 2.5GB only. Going even smaller is definitely not worth it. You save like 1GB but you make the model super brain damaged.
In fact, I don’t even suggest Q8. Only 1 in 4 layers are stored in KV cache, so reducing KV cache really impacts Qwen 3.5/3.6. If you use Q8, at least use Turboquant/attn-rot.
5GB at full context BF16 = 2.5GB at 128K token context = 1.25GB at 64K token context.
You’re better off sticking with BF16 kv cache without quantization, set context size to 64k tokens, and then use a smaller IQ4_XS or Q3 instead.
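A quick back-of-envelope check of those numbers (taking the quoted 20.48 KB/token at BF16 on trust, with KB read as 1000 bytes so the 5.0 GiB figure works out):

```python
# KV-cache sizing from the per-token figure quoted above.
BYTES_PER_TOKEN_BF16 = 20.48e3  # 20.48 KB/token at bf16, per the comment
GIB = 1024 ** 3

def kv_cache_gib(n_tokens: int, scale: float = 1.0) -> float:
    """KV-cache size in GiB; scale=0.5 approximates q8_0 vs bf16."""
    return n_tokens * BYTES_PER_TOKEN_BF16 * scale / GIB

print(round(kv_cache_gib(262144), 2))       # full 256K context, bf16 -> 5.0
print(round(kv_cache_gib(131072), 2))       # 128K context -> 2.5
print(round(kv_cache_gib(65536), 2))        # 64K context -> 1.25
print(round(kv_cache_gib(262144, 0.5), 2))  # q8_0 at full context -> 2.5
```

So q8_0 at full context saves about 2.5 GiB, matching the "Q8 saves you like 2.5GB only" claim.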
boutell@reddit (OP)
Yeah, I wish I understood all of these different variations of 4-bit better. I will experiment.
It does sound like this specific model should not be particularly cache-hungry. But in practice I keep seeing the same thing, which is that I'm fine until I get past about 32k of context.
DistanceSolar1449@reddit
https://www.reddit.com/r/unsloth/s/yTi2OiWyPp
ja-mie-_-@reddit
have you tried raising iogpu.wired_limit_mb? the default holds more memory for the os than it really needs in most cases. also look into mlx over llama.cpp. mlx roughly doubled generation speed for me on an m4 max
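For anyone trying this, a sketch (the 26 GB value is an arbitrary example; `iogpu.wired_limit_mb` is the real macOS sysctl, but pick a limit that leaves headroom for the OS):

```shell
# Raise the GPU wired-memory ceiling on Apple Silicon (resets on reboot).
sudo sysctl iogpu.wired_limit_mb=26624   # ~26 GB of the 32 GB, example value
sysctl iogpu.wired_limit_mb              # confirm the new limit
```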
Independent_Solid151@reddit
you can use k at q5_0 and v at turbo3; find the TheTom/llama.cpp turbo quant fork.
YourNightmar31@reddit
Kv cache is so cheap with qwen that using turboquant barely changes anything.
ZealousidealBunch220@reddit
It does change a lot.
DistanceSolar1449@reddit
Qwen 35b uses 625MB for Q8 kv cache at 64K tokens lol. Switching to Q5 saves you what, 250MB?
There’s like 0 reason for OP to use Q5.
amelech@reddit
Does it work with ROCm?
Express_Quail_1493@reddit
I always regret going below KVCacheType=q8.
mbrodie@reddit
You should get an AI to look into this. There are currently issues with him crashing out to OOM using quantized cache and flash attn; it's a known bug. There are several big known bugs currently, and the harnesses aren't fully compatible with him yet, it seems.
I spent 3 days constantly researching and optimising things.
I made another post with more info in this thread.
boutell@reddit (OP)
This is a cool idea! Unfortunately, when I tried it, qwen IMMEDIATELY got confused about the name of the current working directory. Just straight dropped a letter in the directory name like five sentences in, and that was game over.
On a restart it was even worse 😜
I assume this is a direct consequence of an extremely "lossy JPEG" k:v cache, which makes intuitive sense. So for now I'm concluding that this is just not a viable strategy with opencode.
cakemates@reddit
that might be a consequence of not having enough context at 32k, Claude Code system prompt is roughly 16,500 to 25,000 tokens leaving almost nothing for your project.
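Quick arithmetic on that budget (the system-prompt figure is the rough estimate above):

```python
ctx_window = 32768       # OP's maximum workable context
system_prompt = 25000    # upper end of the Claude Code prompt estimate
remaining = ctx_window - system_prompt
print(remaining)         # -> 7768 tokens left for the actual project
```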
boutell@reddit (OP)
I'm using opencode because I have read it is more friendly to small context windows, but that doesn't mean it's not the same problem.
hdmcndog@reddit
If you want an agent harness with a really minimal system prompt, try Pi (pi.dev). But be careful, it doesn’t have a permission system.
boutell@reddit (OP)
Thanks. Good to know. I prefer to use OS level permissions anyway.
alchninja@reddit
I've been using Qwen3.5-35B-A3B:UD-Q4_K_XL on OpenCode with Q8 quantization for both K and V, it's occasionally messed up a file path here or there but not so much that it's been a problem for me. I feel like you might just be running into unpatched issues with 3.6. You can try leaving the K unquantized and just using Q8 for V, that should probably improve things?
mbrodie@reddit
Drop down to one of the q6s it’s marginal degradation based off the performance charts and you’ll have more overhead for kv
boutell@reddit (OP)
I'm on a Q4 model already. So q6 would be higher requirements not lower.
mbrodie@reddit
My gosh, I'm sorry, I could have sworn I read you're running a Q8. Yeah, that's rough… for what it's worth, there are known bugs around his checkpoint system, flash attn, llama.cpp and stuff.
Get ChatGPT or something to look into it all; someone might have come up with a workaround for your specific system.
I had glm 5.1, Claude and ChatGPT all run deep research reports scouting the GitHub’s, reddits etc… looking for community prs etc…
They found a bunch of open PRs and tickets directly relating to the Qwen 3.6 issues and user workarounds for now etc!
It’s a shame because when he’s working he’s actually fantastic
SmartCustard9944@reddit
Just system prompt plus tools is ~22k
KillerX629@reddit
I know claude has its own cap also. How can you increase it?
Gesha24@reddit
Claude CLI agent? Haven't seen it hit any caps. I believe Anthropic is at 1M context and from agent's perspective, it is talking to Anthropic backend.
boutell@reddit (OP)
Claude points out the official model card for this model says, "The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.6 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities."
So it's kinda right there on the label, "must be this tall to ride this ride." Maybe that's my answer.
thejosephBlanco@reddit
I really like PI, I find myself using it more and more and everything less and less. And getting results.
DistanceSolar1449@reddit
Use IQ4_XS or Q3_XL
boutell@reddit (OP)
yeah IQ4_XS is clearly an improvement so far.
hamiltop@reddit
I'm starting to run it on my AMD minipc with a 760M and 32GB DDR5 and opencode.
Here's my config and stats:
```
Model:
- --model Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf (Unsloth dynamic 3-bit XL quant, ~15.5 GB weights)
- --mmproj mmproj-F32.gguf (vision projector, ~1.7 GB)
Memory / context:
- --ctx-size 131072 (128k)
- --n-gpu-layers 999 (full GPU offload — 41/41 layers)
- --cache-type-k q8_0 / --cache-type-v q8_0 (KV cache quantized, ~850 MiB at load)
CPU load 3.93 1.92 1.22 psi10 cpu 0.1% mem 0.0% io 0.2%
RAM 27.8/30.2 GB (92%) swap 5.6/16.0 GB
GPU util 80% pwr 38.2W tmp 75C clk 2600/2600MHz vram 1.0/1.0G gtt 19.9/25.0G
SRV rss 0.8G anon 0.8G file 0.1G swap 0.0G pids 3 (llama-serverx3)
Perf
- Short-context query (~5k): ~90 t/s pp, ~21 t/s gen — 1k-token reply in ~50s total
- Mid-context (~30k): ~80 t/s pp, ~17 t/s gen — same reply in ~60s
- Long-context (~60k): ~65 t/s pp, ~16 t/s gen — same reply in ~65s
```
It's good enough to do very exhaustive tasks in a loop. Stuff like "Please examine every single file for performance and security issues. Track already examined files in AUDIT.md". I can let that run overnight and it'll find stuff for me to dig in on in the morning.
boutell@reddit (OP)
Yes, using a smaller quant seems to be key. I'm using IQ4_XS in my latest iterations and it's definitely better.
amitspf@reddit
You can use the AlienSkyQwen apple kernels it will reduce KV cache by 16x and you can probably get upto 512k context on your M2 mac
boutell@reddit (OP)
[80% sure is joke]
howardhus@reddit
i think you are hitting the context problem that most people fail to understand.
I see lots of posts of people claiming to be able to "run" some llm with 128k or 256k context "with no problems", but what they really mean is that they can "start" some llm with that context "limit".
what people miss is that „context“ is measured in tokens and those depend on the quantization and parameters of the model.
just ask any llm how much RAM a 128k context will use on a 27B model:
yeah, you can start a model with 128k but when you actually use it your RAM explodes
boutell@reddit (OP)
That's the initial result I got too. But, something I've been learning through this post: ask that model to do research on qwen 3.6 35b+a3b specifically. These Qwen models use linear attention for most layers, and conventional, expensive attention for just a few layers. So the RAM cost is much lower than you'd think. Whereas reducing the model size itself by 5GB by going from Q4_M to IQ4_XS is making a big difference for me so far...
However, to your point, my tests so far have only pushed the context into the low 50's before completing a first pass. So I'm not declaring victory here. You could still be right, context could still be the killer on my machine, but what I'm reading about qwen 3.6 suggests that's not it. It's more that the model weights are uncomfortably close to the ceiling, plus RAM reserved by the OS, plus chrome and vscode being pigs (I'm closing them for these tests now).
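A toy calculation of why the hybrid design is cheap, using the "only 1 in 4 layers keep a KV cache" figure quoted elsewhere in this thread. The layer count and head dimensions below are made-up placeholders, not the real Qwen config:

```python
# Toy comparison: per-token KV bytes for a dense-attention model vs a
# hybrid where only 1 in 4 layers keeps a KV cache. Dimensions are hypothetical.
n_layers, n_kv_heads, head_dim, bytes_bf16 = 40, 8, 128, 2

def kv_bytes_per_token(kv_layer_fraction: float) -> int:
    # K and V each store n_kv_heads * head_dim elements per caching layer.
    layers_with_kv = int(n_layers * kv_layer_fraction)
    return layers_with_kv * 2 * n_kv_heads * head_dim * bytes_bf16

dense = kv_bytes_per_token(1.0)    # every layer caches
hybrid = kv_bytes_per_token(0.25)  # 1 in 4 layers caches
print(dense, hybrid, dense // hybrid)  # hybrid is 4x cheaper per token
```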
PiratesOfTheArctic@reddit
I'm using that on my laptop, an i7, 4core, 32gb ram, it works.. to a degree for me (!) some things it's incredibly quick on, others, I make a pot of tea and it's spitting out code. Its helping with a python project
boutell@reddit (OP)
It's so interesting. What are the details of your setup? What flags and so on?
PiratesOfTheArctic@reddit
I'll get them for you later today if that's ok, just on a train on mobile(!) I used Claude to give me the flags based on my technical spec, no Idea if they are right!
I worked out after a few code rewrites to start a new conversation, seems better keeping track, I also use the 9B and 4B versions, Gemini seems to question life when used
boutell@reddit (OP)
Oh yeah no worries, this is a side project (as long as Claude Code + Opus 4.7 continues to mostly work most days...)
serbideja@reddit
On my 32 GB RAM Mac I managed to squeeze a 256k context size with qwen3.6:35b q4_k_m, with green memory pressure and no swap written. It behaved almost as well as qwen3.5:27b. Here is my llama cpp command:
The most important parts there on a unified memory Mac are `--cache-ram 0` and `--ctx-checkpoints 1`, because those will eat a lot of RAM.
boutell@reddit (OP)
Very interesting! Performance tradeoffs, but anything is better than swapping...
BringMeTheBoreWorms@reddit
This is more of an opencode issue and how it handles session state. I have found that compaction is handled much more efficiently if you set up opencodes compaction agent to point to a smaller faster model running on its own.
This stops the current context from being heavily maintained along with the compacted context. But the bigger your main model's context, the better.
I do wonder if opencode does this a little too frequently though.
boutell@reddit (OP)
Ah. So compaction itself is a bit of a "delegate to an agent, double context" situation?
boutell@reddit (OP)
Thank you for all the feedback!
A few main insights I heard:
* KV cache is not actually that much of a pig with Qwen 3.5 or 3.6 MoE because they use a lot of linear attention layers.
* So the behavior I'm seeing is probably a "straw breaking the camel's back" moment.
* The model weights are the real pig, along with other applications on my Mac. Sure, I'm "just" running Chrome and vscode, but that's two instances of Chromium right there and modern web apps are pigs.
* Not all Q4 quants are created equal. Some are significantly smaller, and if you're right on the edge that matters.
So I downloaded the IQ4_XS quant (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf) and tried that with the context size set to 131072 (128K).
With no other changes, opencode was able to complete its first attempt at the task. Context got into the low 50K range.
At one point I saw evidence the Mac was swapping hard, so I closed Chrome and vscode, which definitely made a big difference. Swap-related tasks disappeared from Activity Monitor.
So... yes! I can run Qwen 3.6 35B-A3B with considerably larger context on this Mac, as long as I use an aggressive 4-bit quantization and close other apps.
So far, the jury is still out on whether the model is smart enough for the task. It described the issue pretty well, but the solution it implemented is worse than the original problem.
The jury is also still out on whether I can really use 128K context, since this first pass on the problem only reached the low 50K range. But if everyone's math is right, this will not be the breaking point.
I don't expect models to one-shot things any more than I expect humans to do so. So later, when I don't need my Mac to do my job, I'll close all other apps again and ask it to iterate on the problem using Playwright until it finds a solution. I did the same previously with Opus 4.7.
Since Opus 4.7 already solved this problem once, this is just for science. Very interested to see if a local model can finish the job!
BFirebird101@reddit
Are you not getting slow turn latency (time between turns)? I have an M4 Max and tested the same 4-bit model, and while tok/s is decent, the turn latency is absurd (using oMLX).
sword-in-stone@reddit
ask it to maintain notes in an MD file as it works, then compaction is not a problem, just ask it to read the notes
DeepBlue96@reddit
Try disabling reasoning (-rea off). Still, with 32gb you should be able to fit the model extremely well with a context of 128k. Did you perhaps download it in fp16?
Justin-Philosopher@reddit
I'm actually running it in hermes agent with 2x3090s using vllm and awq 4 bit. Works pretty well. I have it set to 256k context and to compact around 50%. Currently adding new features to my vocal trainer for byzantine microtonal chant, written in cpp. I use glm-5.1 to create a plan and then use qwen to build it out and burn tokens. It's noticeably slower than the cloud glm-5.1 that I'm using. Sometimes I have to nudge it when no tool gets called. But it never made malformed tool calls, like glm 5.1 sometimes does, where the tool calls end up written into the messages.
International-Fly127@reddit
what sort of tps are you getting?
boutell@reddit (OP)
Are you keeping KV cache in system ram?
instant_king@reddit
I use it for image recognition and outputting json with analysis and judgement if this is a-roll or b-roll in a process for AI video editing. Works amazingly well.
Express_Quail_1493@reddit
Exactly why I use a tiny coding agent that has the basics, and I only allow the LLM to use the bare minimum of what it needs, to keep the context window for raw task execution. I'm using pi-coding-agent, only a 1k system prompt. Lots of coding harnesses use so much system prompt it's exhaustive.
boutell@reddit (OP)
This makes sense to me. Over engineered for dumber models.
ReentryVehicle@reddit
You can also try Qwen 3.5 27B (will be slower, but Q4_K_M fits with ~100k context in 24GB RAM). It tends to also think a bit less by default.
I would suggest to disable automatic compaction, it is stupid IMO. It doesn't make sense to force compaction before doing a single task.
"compaction": { "auto": false },
boutell@reddit (OP)
Thank you. I will look into this. With the small context window I was forced to configure, compaction was certainly inevitable, but if I can significantly expand it with 27b, it might not be.
xristiano@reddit
I'm using it with pi an RTX 3090 (24GB) and the following settings. I am impressed.
ExecStart = "${llama-cpp-cuda}/bin/llama-server -m /models/unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /models/unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf --alias local --host 0.0.0.0 --port 8081 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --fit on --ctx-size 131072";
Plenty_Coconut_1717@reddit
Yeah, same boat on M2 32GB. Qwen3.6-35B feels smart but context just dies after 1-2 compactions in OpenCode. Tried 32k and it still forgets shit. For real coding agents, 128k+ seems mandatory like the model card says. Sticking with smaller context models for now.
Simple-Fault-9255@reddit
I recommend using goose tbh it's slightly better than open code.
AwkwardBall@reddit
SAIHM (Sovereign AI Horizontal Memory) is what you need. SAIHM is a protocol that leverages Filecoin, Storj, Arweave, and IPFS, built on the COTI V2 network using Garbled Circuits technology. Find it at:
saihm coti global
With dots between. 😉
Use one of the Quickstart prompts. It will start saving your tokens and reducing context window immediately. In your case, set up a swarm with shared memories using your existing agents. To save even more, because SAIHM is agent and platform agnostic, you might be able to use free tier agents from any provider (or build your own) and then add those into your swarm with shared memories.
It sounds like you are pretty adept at doing things with agents already. You’ll ❤️❤️❤️ SAIHM. 💯
R_Duncan@reddit
Errors I see in this config:
- Too small context, they advised to give at least 128k (use at q8_0 if needed)
- Missing jinja, they advised is mandatory
- Missing temp, top_k, top_p
Worried-Squirrel2023@reddit
ran into the exact same wall with 32k context. the model is smart enough to understand the bug but the context window is too small to hold the fix and the understanding at the same time. after compaction it basically forgets what it figured out. ended up splitting tasks into smaller chunks manually instead of asking it to do one big thing. annoying but it works way better than fighting the context limit.
makkalot@reddit
You can try pi agent since opencode starts at 10-12k context with its system prompt.
benevbright@reddit
Yeah, Pi is recommended. https://www.npmjs.com/package/ai-agent-test My tool sends even smaller: 3k.
caetydid@reddit
you might want to use preserve-thinking:true ... from your problem description it really looks like this could be the cause
inaem@reddit
Did anyone manage to make qwen work with claude code?
I keep seeing errors even though it seems to be working.
Grouchy-Bed-7942@reddit
Use the oMLX backend instead of llamacpp and test the kv turboquantification!
PairOfRussels@reddit
Use -ncmoe to put some (or even all) experts in dram freeing up vram for larger context.
logic_prevails@reddit
I super regret not getting a 64gb mac… if only I could have known local ai was gonna take off before I bought it 3 years ago
grandchester@reddit
On my M4 Pro Mac mini with 64GB RAM, I am running Qwen3.6-35B-A3B-RotorQuant-MLX-6bit (also was using Qwen3.6-35B-A3B-4bit but RotorQuant was much faster for prompt processing). It does really well with tool calling, but I almost always get stuck in a thinking loop. I haven't been able to figure it out. I feel like if I can get past that it will be working really well. So I'm going to keep playing with it.
mbrodie@reddit
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' --reasoning-budget 8192 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
I don’t wanna assume, but if you haven't already, try this. I’ve never had him spiral using these settings.
sisyphus-cycle@reddit
I wonder if that’s due to the preserve thinking (extra data in kv cache is downside), the explicit budget, or both? Good to know will try some tests out
grandchester@reddit
That was my thought too, but disabling preserve thinking doesn’t seem to help much.
Cute_Obligation2944@reddit
Disable thinking entirely if you're using a multi agent harness, tools, and pyright (e.g. opencode).
sisyphus-cycle@reddit
Agreed. I’ve found qwen to be rather verbose with its reasoning tokens, except when using an explicit harness like open code or pi. I ran some tests with no system prompt and hitting llama server directly and it averaged 3-4k reasoning tokens for leet code medium/hard example questions. I now get why Anthropic has been trying so hard with adaptive reasoning lol. Should be relatively straightforward to fine tune a super small 200-300m model specifically to map inputs to reasoning budget per chat completion req. honestly an LSTM hybrid or other simpler approach might work if you do it right. I wish I didn’t have a real job and other responsibilities lmao, would just do this all day
grandchester@reddit
I appreciate this! I've been messing with all these settings but will try this combo. I've been trying to keep the temp lower so the tool use is more consistent but will experiment. It feels like it is so close. Maybe it is still just the model and we need another generation or two to really get it over the hump, but 3.6 for the first time on my hardware is showing local could be a viable path forward which is very exciting.
boutell@reddit (OP)
Thank you for the data point! What context size?
simracerman@reddit
Try opencode. 32k won’t do any real work. A minimum of 64k is a start and you would need to shave real tokens off the input, use subagents and minimize the use of MCPs/plugins.
Ill_Evidence_5833@reddit
Well, q5_k_m has been giving better results with claude code than q4_k_m.
PattF@reddit
I tried, but even in the 100k range I kept getting into a loop: hitting the trigger to compact, then after reading the handoff trying the same thing and hitting the limit again. It’s frustrating. I need more RAM. Right now I’ve gone back to 3.5 9B just so I can bump the context.
iTrejoMX@reddit
I think you need to use a smaller quantization. For 100k tokens you will need more RAM. Try q3_m.
PattF@reddit
That’s Q3_K_S
iTrejoMX@reddit
Ah yeah that one
SettingAgile9080@reddit
I think you should revisit the k:v cache quantization - it probably went dumb due to a combination of the model being below minimum viable context length + quantization... if you can get the context window size up, quantization's effects should lessen. Try:
- flash-attn computes the attention matrix more efficiently, so it's faster and uses much less VRAM, unlocking larger context lengths on smaller consumer systems.
- no-mmap forces loading the entire model at start; it takes longer, but once it is loaded it is faster, and most importantly on a smaller system it will give you an early warning if it is going to blow up.
- jinja is required for the template kwargs.
Dial back to -c 65535 if it still crashes. The quality hit on KV cache should be offset by giving it more context window.
Turning off enable_thinking helps in low-context environments. preserve_thinking is specific to Qwen 3.6 and keeps the model's suppressed thinking tokens in the KV cache, so it can still reference its own internal reasoning even though the blocks aren't emitted in the output.
Also try a smaller quant: Q3_K_M drops from 22.1GB to 16.6GB, bringing the model to less than half of your total memory and leaving more space for context. Agentic use like tool calling seems more tolerant of less capable models as long as it has the context window to orchestrate (at 32K context, opencode would get stuck in constant loops for me; at 128K it runs non-stop and retries when it is too dumb to get it first time around).
I'm on a 20GB Ada 4000 and able to run this thing with 128K context without an OOM crash so far. It is the first time I've felt a local model be somewhat useful for agentic coding in terms of competency + inference speed... not replacing my Claude Max sub any time soon but it is actually usable for simple tasks and long-running jobs. I can even run it with the mmproj weights for multimodal if I offload a bunch of tensors to CPU. The memory accounting is a bit different with unified memory but can confirm that Qwen 3.6 seems to be a step up in terms of running on smaller memory systems, so there may be hope for you yet... good luck!
PaceZealousideal6091@reddit
KV cache at q8_0 shouldn't be as debilitating as you have described. It must be an issue of the low context limit set by you that it forgets the path. I suggest you move to UD Q4_K_S. It's much smaller and would give you enough bandwidth to play around with context. 32k is too low for agentic tool use.
whichsideisup@reddit
If you want it to behave like Claude you need 128k minimum and probably FP8 on the model.
Jeidoz@reddit
FYI: You can use `"plugin": ["opencode-lmstudio@latest"]` or `"plugin": ["opencode-plugin-llama.cpp@latest"]` in your OpenCode config to automatically retrieve all models from an active Dev Server in LM Studio or a running instance of Llama.cpp, without needing to manually type them in the config file. May be more useful if you like to define custom configs per project.