Best config for Qwen3.6?

Posted by CatSweaty4883@reddit | LocalLLaMA | View on Reddit | 21 comments

With all the high praise for the model all around, I also want to try it on my own. I have an rtx3060 12gb vram and 16gb system ram. How may I load the 27b model in my system? Or is it even possible? Tasks I want to do are: coding, some visual reasoning and agentic tasks.

[-]

Sharp_Classroom9686@reddit

just go with 35b MOE 32K Context , Q4K, and use a good Agentic Tool like Forge. Dont use OpenCode. maybe you can get 25/30tks

[-]

redblood252@reddit

How is forge better than opencode? I’m not arguing I’m wondering. I have little knowledge on agentic tools

[-]

Sharp_Classroom9686@reddit

In OpenCode, a single task typically consumes at least 25k tokens of context when using prompt-based workflows. The same tends to happen with ClaudeCode.

With Forge, however, you can achieve similar results while using only around 5–7k tokens of context.

If you’re running a local model on limited hardware (e.g., 8GB or 16GB), this difference in how context is handled becomes a game changer.

[-]

redblood252@reddit

I do have qwen 27b iq3 on 16gb vram. So it sounds good. I use superpowers and subagents with opencode. Is there something equivalent you would recommend?

[-]

Sharp_Classroom9686@reddit

Forge has native Claude Code plugin support — drop the plugin in .forge/plugins/ or symlink it from \~/.claude/plugins/ and it shows up under /plugins. Honest caveat: only gstack has been tested end-to-end so far, but I’ll try superpowers today and report back.
Subagents are first-class. Built-in registry (explorer, reviewer, tester, debug, summarizer, refactorer, docs, commit, builder) plus whatever your plugins ship. spawn_subagents fans out in parallel — goroutines + semaphore, configurable concurrency. Explore mode is built around it for read-only analysis.

[-]

redblood252@reddit

I used opencode for coding. If you say forge has natively reviewer/tester/debug/refactorer/docs that's most of what I was. Will I need to do anything specific to have these work? Or are these built in plugins already using well curated prompts?

[-]

Sharp_Classroom9686@reddit

Just use /agent name prompt -- give it a try. I'm hungry for feedback

[-]

redblood252@reddit

Getting the same error here: https://github.com/tailcallhq/forgecode/pull/3255

But with llama.cpp

[-]

Sharp_Classroom9686@reddit

mb. https://github.com/defexnicolas/forge

[-]

redblood252@reddit

Doesn't work properly using llama.cpp I get this error: Assistant message must contain either 'content' or 'tool_calls'.

If there is already something called forge why did you call your project forge as well? I've seen at least 3 different projects called 'forge'

[-]

Sharp_Classroom9686@reddit

what do you want to test i can do the run for you i has qwen3.6 27b

[-]

redblood252@reddit

It’s coding related tasks. Off the top of my head: - full documentation of a component (overview/setup/configuration/architecture/troubleshooting) - review code for correctness - refactor to get rid of dead code/useless code - simplify the codebase - curate tests (LLMs tend to make hundreds of useless tests but forget logic related tests which are more critical)

[-]

llama-server --model models/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf \
--port 8080 \
--host 127.0.0.1 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--temperature 0.6 \
--flash-attn on \
--cache-type-k q5_1 \
--cache-type-v q4_1 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--ctx-size 131072 \
--n-cpu-moe 32 \
--mmproj models/mmproj-F16.gguf \
--chat-template-kwargs '{"preserve_thinking": true}'

This one takes around 10GB in VRAM for me.

[-]

Jester14@reddit

Just use -fit

[-]

ps5cfw@reddit

You don't.

Your best best is the 35b MoE, which can run at acceptable speeds at q4, but not 27b, no.

[-]

CatSweaty4883@reddit (OP)

Is 3.5 9B the best I get from qwen family of models? :((

[-]