Just to drive the speed question home, I have 3090s at home and a Pro 6000 Blackwell Max Q at work. On identical inference workloads that completely fit in the VRAM of both setups the Blackwell is like 10-15% faster.
It doesn’t matter how many 3090s you have, as long as the workload fits in VRAM. For example, I ran Gemma4:26b on a single 3090 and also forced it to split across both 3090s. Same prompt, and there was a 0.0003% difference in speed.
I mentioned the difference with the Blackwell card because a lot of folks expect a crazy improvement in speeds; unfortunately the performance doesn’t scale like that.
I am putting it to the test on my 2x4090. I can fit the Q6_XL with Q8 cache and the full context window. It's coding at 20-25 tk/sec with the context window 50% full. It takes a while to ingest large context, but otherwise it's chugging along quite nicely.
I honestly don’t understand why Gemma4 scores this low. I’ve been using the latest 31B and its coding results have been cleaner than 3.6 35B’s in almost every case, and it was able to do tool calling more accurately for the Xcode MCP, while Qwen just gave up or got stuck in a loop. Gemma4, in my experience, needs more detail in the prompt, but the results are better. Qwen often adds things I didn’t ask for and has less chance of one-shotting the problem.
Gemma is a great model for its size, but Qwen 3.6 seems to be incredible. I would go Gemma at this size, but the 122b Qwen 3.5 has been my favourite local-capable model so far (Strix Halo, 128gb). A 3.6 in the ~100 billion parameter class is going to be amazing if it follows these smaller models' capability.
It doesn't even make sense for them on a technical level - they are designed to service literally as many requests as possible from all kinds of domains. Why in the world would you want any part of their knowledge base to be unloaded at any time?
Kimi, GLM, MiniMax, Xiaomi, Gemini (stated in docs), and GPT (leaked) are all MoEs. The only unknown is Claude - there's no public information on whether it's dense or MoE, but it's very normal to assume it's MoE just like all the others.
Claude not being MoE would explain their huge compute issues though. :) Although they did speed up Opus recently, so maybe they moved to MoE recently? Or just made it more sparse.
To be fair, the model it is beating is effectively a 17B expert, but with much higher memory and a bit of help as needed. You don't get to keep all of that intelligence in MoE models, unfortunately.
The funny thing is that it's not even a new generation, just a minor update within the same generation. I've seen Anthropic and OpenAI slap a big round version number on models with a much smaller performance gap.
It's got looping and other obvious issues, I have free access to it but mostly use Sonnet 4.6 or GPT 5.4.
Sonnet is really reliable and stable
Something is very strange about Opus 4.6 & 4.7, they act like a large model that is excessively quantized. Opus 4.5 was not like this. I wonder if this is a side effect of them using TPUs. Gemini acts the same way.
Fr, this happened after Jan. I could one-shot a whole project with 10 words on Opus 4.5, and now 4.6 acts so dumb - it basically feels like another Gemini 3 Pro. I mean, it's still better, but it really disappoints remembering what Opus used to do 🥀
Yeah, I did notice that adaptive thinking button - what does it really do??? It's not the same as the previous one, which was enhanced thinking or something? I thought it was just cosmetic and they renamed it "adaptive thinking".
I use gemini 3.1 pro in antigravity but haven't really used it in a while now. Maybe I should give it another shot. I was comparing opus with 3 pro and not 3.1 pro btw :)
This might be a side effect of adaptive thinking, the responses come almost immediately and the chat is muddled with looping content that should have reasonably been expected to be in the thinking block
I feel like the best way to describe it is that the intern is just as smart as the wizard but not as wise. Having fewer parameters means it's going to know less, but it handles the common tasks we ask of it really well.
Great! That's important to know for a couple of reasons: they're official, so they're based on something, but since they come from Qwen, they're also designed to make 3.6 look good.
5060ti user here. I run Qwen 3.5 27b HauHau aggressive uncensored in IQ4_XS with medium context, which is absolutely fine quality. I expect to run 3.6 the same.
The HauHauCS version is 15.1GB in size. Qwen context does not eat much memory.
Here is some proof in a picture so you don't need to listen to all these people talking out of their asses who say IQ3 works at best. Sorry for the bad quality, I'm on my phone atm. But you can see I can load all layers plus 30k context at Q8 into the 5060ti with IQ4_XS. If you are willing to offload some layers to RAM and sacrifice t/s, then context size goes brrrrrr.
Not just for your 5060ti, but for anyone with only 16gb of VRAM: you will need it heavily quantized, or any spillover to system RAM will dramatically slow it down.
There is also the problem of GPU bandwidth limitations with dense models. You are not going to get anywhere close to the same t/s with these. You will probably want some speculative decoding going on as well.
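A quick back-of-envelope check makes the 16GB point concrete. This is a minimal sketch only: the bits-per-weight figures are approximate averages for each quant type, and real GGUF files also carry embeddings and metadata, so treat the result as a rough lower bound.

```python
# Approximate GGUF file size from parameter count and bits-per-weight,
# then compare against available VRAM. bpw values are rough averages.
APPROX_BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85, "IQ4_XS": 4.25, "Q3_K_S": 3.5}

def approx_size_gb(params_b: float, quant: str) -> float:
    """Approximate model file size in GB (params given in billions)."""
    return params_b * APPROX_BPW[quant] / 8

print(f"27B @ IQ4_XS ~ {approx_size_gb(27, 'IQ4_XS'):.1f} GB")
print(f"27B @ Q8_0   ~ {approx_size_gb(27, 'Q8_0'):.1f} GB")
```

By this estimate a 27B model at IQ4_XS lands around 14 GB, roughly consistent with the ~15 GB file size mentioned above, which is why 16 GB cards need an aggressive quant to leave headroom for the KV cache.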
I have no idea. I don't typically use the dense models due to the 5060ti's bandwidth, despite having 4 of them. For the Qwen 3.5 models, if I wanted more "intelligence" I would instead use the 122b model, as it ran faster on my system (64gb system RAM + 64gb VRAM) than the 27b dense model.
You probably want to stick with the 35B-A3B model (MoE) - the 27B might require a bit too much quantization. (There is only an FP8 available so far anyway, so you'll need to wait regardless; that one is going to be about 27 gigs.)
Try a few, but if you quantize kv cache and trim context, you should be able to run Q3 variants pretty well. I'm going to try Q4 to see if it's tolerable because I have the same GPU, but Q3_K_S is my fallback because I know it works for me for Qwen3.5 27b and Gemma 4 31b.
I just ran LM Studio with Qwen3.6-26b (Q4_K_M), and with a 16000 context size it runs at about 3.08 tokens/sec at the beginning of a new chat. I have 16 GB VRAM and 32 GB of RAM.
Have not yet tested coding with Pi; I just downloaded it and ran first tests.
Oh snaps, I am setting up a dual agent/sub-agent approach on 2x RTX 3090s with Qwen3.6-35B-A3B and Gemma-4-26B-A4B-it. I wonder how good this 27B is at chain of thought, to replace Gemma-4-26B-A4B-it.
I'm thinking Qwen3.6-35B-A3B as primary builder and architect (agent) while Gemma-4-31B-it or Qwen3.6-27B as debater (sub-agent)! If the CoT on the Qwen3.6-27B is near as good as the Gemma 4, then I may be able to squeeze in more layers on a RTX 3090 or use a higher quant. This is quite exciting!
Nice, you're running Q4 quants of each? What context? I have a dual-GPU setup too (5090+4090) and I'm totally maxing out one model at Q8_XL and max context. It's very tempting to have a dense and a MoE model running simultaneously at Q4. I know it could be vibes/placebo, but I'm concerned about the quality drop if I go to Q4. However, I like your idea of an adversarial sub-agent (this could be what I need to drop quants). May I know how exactly you are running this sub-agent scene? Pi? Or which harness? When is the sub-agent invoked?
I'm testing the duo agent all this week so will let you know! I'm using OpenCode, it's primarily via agent config prompt, so will have to see how that fares.
Builder: Qwen3.6-35B-A3B Q4_K_M, 262k context, q8_0 KV cache. Invokes the sub-agent with (paraphrasing): "If it's easy, go with it. If it's complex, ping the Debater up to 3 times. Be mindful of security, the coding practices in AGENTS.md, and the architecture and DTOs defined in master_plan.md. Ask clarifying questions."
Debater: Gemma-4-31B-it Q4_K_M, 32-64k context, q4_0 KV cache. Max steps: 6. Init prompt (paraphrasing): "Red team, be strict with coding, security, memory leaks, edge cases, laws, only read and respond, no praising, follow format for response (Issue, Priority, Recommendation, Status)."
phone-a-friend seems to be a really good way of keeping agent sessions on track and getting past difficult problems, if we can have fast MoE models crunching through stuff and asking for help from 122b's or whatever for tricky stuff that's probably optimal, with Opus as a last resort.
Wow! you are the man! 🤩 Share performance after running bro. I want one for myself, but unsure if the speed will be usable and whats the max I can accommodate on a macbook. Otherwise will wait for ultra.
Ya I’m def looking to get the studio ultra when it drops based on what I can see this thing do. It’s arriving tomorrow so I’ll post up a few thoughts on performance. I’ve got a narrow focus and have built an interface that’ll wire up to Qwen so I’m not using it for open ended things, which will hopefully keep it slim and fast. We’ll see. If it’s too slow then I’ll just enjoy my $5k coffee table ornament 🤣
I'll say that for the GGUF, the KV cache seems to behave more like Gemma4 31b than Qwen3.5 27b. I was able to squeeze a lot in with Qwen3.5 27b, but with 3.6 27b my cache is doubling the model's footprint.
This is a great model, but right now my LM Studio Q8 has fewer code issues than my Unsloth Q8_K_XL. That's unusual. I wonder if anyone else is experiencing this. I tried a couple of different quants, and Q6_K_XL kept looping around 30-40k tokens without presence penalty.
Initial thoughts:
The Unsloth ones feel jagged: much more ambitious, but also more glitchy. When instructed to iterate to fix things, it makes changes, but with less meaningful differences than I expected. I'm still playing with it between other tasks, but I'm curious about anyone else's experiences.
LM Studio is the lazy way. (Turn off "keep model in memory" and use mmap to load models - that setting keeps the model in system memory, which just doubles the memory required to load it on unified systems.)
Yeah, thanks - then how do I get the model? I have tried to pull it and get "An attempt was made to access a socket in a way forbidden by its access permissions".
The capability density doubles every 3 to 3.5 months. So an 800B now will be matched by a 400B in 3-ish months, that by a 200B at 6 or 7 months, and finally a 100B at 10 to 12 months. We're talking MoE; that's roughly equal to a 30-35B dense. So yes.
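Taking that doubling claim at face value, the arithmetic can be sketched in a few lines (a sketch only - the 3.25-month doubling period is the commenter's estimate, not an established constant):

```python
# "Capability density doubles every ~3.25 months": the parameter count
# needed to match a fixed level of capability halves each period.
def equivalent_size_b(size_b: float, months: float, doubling_months: float = 3.25) -> float:
    """Model size (billions) that matches today's `size_b` after `months` months."""
    return size_b / 2 ** (months / doubling_months)

for months in (0, 3.25, 6.5, 9.75):
    print(f"after {months:5.2f} months: ~{equivalent_size_b(800, months):.0f}B matches today's 800B")
```

Which reproduces the 800B -> 400B -> 200B -> 100B progression in the comment.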
I just set up an M5 Pro 64gb and it works pretty well: a 128k context window in LM Studio, running on average at 55-70 tps depending on the type of prompt. It uses about 32gb of RAM for this.
Do you use it for coding - similar to how a workflow looks for someone using Claude Code in the CLI, you know, parallel agents and ~100-200k tokens?
How does it compare to that? Is it getting close, or is context still a big issue?
Yes, I can use it at around 100k context, although I don't use agents in parallel or vibe code; most of the time I just use it for bug fixing (I have 48gb of memory, so context isn't really an issue).
Qwen is definitely not known for benchmaxxing. Even back in the Model 2 series, it was clear that Qwen was doing something different with their training. For example, it’s well known that their pre-training data sets were significantly more math-heavy than others.
In my private tests, Qwen-3.6-35B actually indeed delivered better results than Opus-4.7
This means that it’s either actually Anthropic who’s doing benchmaxxing, or Qwen has managed to benchmaxx real-life tasks
I did not have the same results as you. Granted, I was running a quant, but they were worlds apart for me. I was having Opus 4.7 review the code Qwen 3.6 was generating, and there were a lot of kickbacks.
And I think that points out how flawed the argument is. If models that fit in 24 GB of VRAM were actually beating Claude in real world use, then the hyperbolic claims would be about new models beating 'that' model rather than Claude - we'd all be looking at Nemotron, Phi, or whatever as the top tier model family to beat instead.
It's ridiculous that this is even a controversial opinion. I'm incredibly grateful for the Qwen family of models. I think 3.5 27b was one of the best releases we've had in ages. But Qwen absolutely has a reputation for benchmaxxing. At least outside this sub. Doesn't make the models bad. Just means qwen has specific strategies for training models that align with the major benchmarks.
Every lab is benchmaxxing one way or another; it's their target goal, after all - the first thing people look at when new models are released. Of course they're going to try to be the best.
Screw the haters, they don't get it. Some of us DO make our own applications... and they tend to work because we put time into crafting them for the model. I've been working on this for ~10 months and Qwen3.5 made it shine. Qwen3.6 is actually paying attention to the system prompt and figured out parallel tool execution. The MoEs tend to send single tool calls for some reason; the 27B likes to group them - that is one difference I am seeing already. Anyhow, I am also loving the recent trend of people realizing that local models work best with curated tool sets. Heck, all they really need is good prompts and guarded bash access nowadays, in all honesty. OK, I am rambling, but I laughed at the haters and came back to this comment after work to brag about my own 'harness', because yes, some of us do that. I mean, this IS LocalLLaMA...
I need to discuss the ramifications of sharing it with my employer first, because I built a lot of it on the clock and they know about it, so it might be an issue to just open source it now. It gets used in an offline environment for work purposes. It would need a bit of cleanup and some other features before releasing, too.
Also, the available tools on the market are finally catching up to the capabilities, so this is less unique in its abilities now (though I am convinced I was the first to have native docx export, including LaTeX to OMML, tables, lists, etc., months ago lol). So I have thought about putting it out there and might be able to someday.
Explain "harness" for grandpa. I love AI. I'm just getting into running local models on my Apple M5 Pro and Framework 395+ AI AMD APU w/128GB RAM. Using LM Studio & Ollama. Thanks!
Mention Ollama, and people will get riled up on this sub.
I think Ollama is an okay starting point for a lot of people since it is rather plug and play.
But if you want to get a bit more serious with local models, you will want to look into llama.cpp (https://github.com/ggml-org/llama.cpp) (on which ollama is heavily based without attribution), and llama-swap (https://github.com/mostlygeek/llama-swap) for managing multiple models, switching them out, etc.
llama.cpp is much more performant than Ollama, allows for greater customization, and is faster with updates.
No need for llama-swap, now llama.cpp server has model loading/swapping/unloading built in: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#model-presets
A harness is basically the new buzzword - a pretty large umbrella term for a way to give LLMs tools. Think of it like a mech suit for a person: a harness is the mech suit for an LLM. Also, I recommend you drop Ollama. Just imo, tho, hahahah.
Would OpenClaw and Hermes be considered harnesses? What about Qwen Coder CLI? Oh, and what's wrong with Ollama? I kinda like it better than LM Studio, I guess because I don't tweak any settings. Just `ollama serve`, `ollama run qwen` - it seems so simple and intuitive?
OpenClaw, yes. Hermes, unsure. I don't know too much about Hermes, but I believe it's a model with Claw-like features rather than a program with tools like OpenClaw.
The simplicity makes you lose some capabilities and speed. By going with llama-server you can fine-tune (or copy settings from others) and get better results. Sadly, it's a bit more work, but after some time the thing just works. Also, llama.cpp updates frequently and drops good optimizations regularly, faster than Ollama (which reuses llama.cpp anyway).
Think of the LLM like a horse. It's beautiful and it can run and jump etc... but it's kind of hard to get any work out of it as-is. Put a harness on that horse, now you can get it to pull a plow, carry people, etc.
llama-server, and I use the settings on the Qwen 3.6 Hugging Face page. If you go there, it will tell you the temperature and sampling parameters it's optimal with. The general-assistant parameters are what I use, and I switch on the fly to the coding parameters. I used Gemini to create my harness, and I'm writing my own MCP servers now. I call it KokoCode but haven't released it yet; I keep adding features and functions.
I think the relative difference might not be as big now that the MoE is fixed. But still, equivalent dense models are better in ways that aren't always captured in benchmarks (world knowledge) but are still evident in daily work. ~60 on terminal-bench here is incredible already, though.
Hmm. In my experiments (with unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL), tool calling works like 50/50.
For 3.6 35B A3B it worked very well. Maybe I'm doing something wrong...
According to Qwen's official WeChat account, it appears this is the final open-source model in the Qwen 3.6 series. Of course, we can also hope that this is simply a typo.
Just thinking about releases around 120B params: we have GPT-OSS 120B from Aug 2025, and Qwen 3.5 122B and Nemotron 3 Super 120B from this year.
My point was that it's just not a very common size (no equivalent Gemma or Mistral models). Also, there is a trend towards releasing small models for domestic use and retaining the larger models for cloud-only inference. I can absolutely see a strategy where the ~120B models are considered good enough to eat into cloud inference profitability.
But yes, this is a guess, and I may be hallucinating things in the tea leaves that don't exist!
Mistral Small 4 is the same size. The only one missing is Gemma 4, which is rumored to have one of the same size too (Gemma 4 124B), just not released yet.
They included it in the poll so I'm assuming they'll release all 4 - 9b, 35b, 27b, 122b. They didn't include 397b in their twitter poll though, so they might not open source the big one
Yeah, but it’s so slow. I have 64gb of DDR4 RAM so I can make it work, but 10 tk/s is so slow compared to the 40 tk/s you could get with 3.5 9B. I can sacrifice intelligence if it’s just agentic coding; a smaller model can correct its mistakes faster than a MoE can finish outputting its first message.
3060 12GB, 32GB DDR4. Getting around ~28 t/s with Q4_K_XL at 32768 context length (can go even higher, tested at around 80k with 131072 allocated, got ~23 t/s). All llama.cpp settings are basically default.
It's not really knowledge imo, it's more about nuance.
It understands the nuance of your prompt better, and sees things that aren't explicitly said, even in code - especially when doing generic SWE coding, and when doing much harder, more complex, or low-level coding tasks.
For me the biggest difference I see is nuance. (Obviously knowledge is better too, but it's not that big of a factor imo.)
Yep my opinion based on my current understanding is that these new smaller models are getting really good at agentic coding workflows, tool calling, etc...
Definitely not as good for world knowledge and writing, like you say. BUT when they're fast and cheap, search and iteration is easy
Having used both 3.5 397B at Q4 and 3.6 35B at Q8 side by side for agentic coding: within this scope, I can say they're practically matched. But keep in mind this is a pretty narrow scope, and one that is very much a beaten path.
I'm sure if you go to more obscure programming languages, or tasks unrelated to programming, the 397B will win.
Also depends on what you are programming. If you do some low complexity UI+backend+database coding, you don't really benefit from the more clever models. If you do some complex refactoring, algorithm design, heavy math and solve difficult problems, the more powerful models are able to figure things out better.
Haven't tried really complex stuff with 3.6, but I can say I did try fairly complex tasks on large projects, and 3.6 35B held up well. 3.5 couldn't handle much simpler tasks.
I do have some low-level C++ tasks I want to test 35B and 27B with. We'll see how it holds up.
Yeah, realistically there will be a bigger difference in real world use compared to benchmarks. However, I do think the gap has closed in a meaningful way, and what they have been able to achieve with the 30-billion class of models is truly impressive. Anybody with a strong gaming computer can run a 30-billion-class model; the 397B takes $10,000 worth of hardware.
You asked why the smaller one is just as good as the big one. It's because the small one is newer and updated and the bigger one hasn't been updated yet, however long ago the last release was
If we're seeing performance this great out of a dense medium sized model, why doesn't someone do a dense large model again? It seems like the last great one was llama 3.3 70b. Is it that expensive to train big dense models but the 400B sparse models are cheap? It seems like if we had a qwen 4 70b it could sweep the board.
I think it's because of the lower inference speeds. Given that agentic usage is trending, dense models fall behind MoEs speed-wise. I can get 200 t/s with the FP8 35B-A3B, but with the 27B I get around 60-ish t/s generation speeds.
Do my eyes deceive me? Does it beat full-size Qwen 3.5? Wtf. It trades punches with Opus 4.5 (I know, not the newest Opus), but fuk, it's 27B - you can run it locally. Opus 4.5 is probably hundreds of billions of parameters.
I am always a bit sad when I see the hype surrounding new models and the benchmarks not transferring to my actual use cases at all. Of course, we need some kind of eval given the rapid-fire release of models, but benchmarks have become worthless for my own use cases.
Hope someone finds a solution for this eventually, because eval time is limited (I only have so many hours every day 😅)
Recently I created explain.toml, which has been a tremendous help before I let it execute:
description = "Explain Following Prompt ARGS: "
prompt = """
## Expected Format
The command follows this format: `/explain `
## Behavior
Check if a file named "Explained-Prompts.md" exists; if it does not exist, create the file.
Make a copy of the existing "Explained-Prompts.md" in case a mistake while appending replaces the file content, so it can be restored easily.
Avoid executing the prompt.
Analyze the prompt.
Explain in detail what is understood from the prompt.
Explain the goals from what is understood from the prompt.
Explain the non-goals from what is understood from the prompt.
Explain the plan of action from the understood prompt.
Explicitly and in detail explain how the prompt could be improved; list what is ambiguous and implicit, then how it could be made unambiguous and explicit.
Give a detailed improved prompt that is explicit and without any ambiguity.
Update the "Explained-Prompts.md" file by appending the following.
You don't need to try a tiny model to know it's nowhere close to one of the best behemoth models to date. If I claim a newly released Toyota SUV is not as fast as last year's Ferrari, you won't need proof, will you?
The car analogy doesn’t really hold here.
Cars scale pretty linearly with horsepower. More power, more performance. LLMs don’t work like that.
A smaller model can absolutely match or even beat larger ones on specific tasks.
That comes down to training quality, data, and optimization and not just raw parameter count.
A 27B model won’t beat frontier models overall. But saying it’s “nowhere close” is just not right. The gap has narrowed a lot and for many tasks, smaller models are already somewhat competitive.
SOTA for regular coding tasks / modest apps or websites is increasingly not a distant goalpost. Local models can probably already compete. SOTA that will remain SOTA is probably one-shotting massive applications and very niche specialized knowledge domains.
Which this isn't claiming... This one claims to be comparable to, not beat, a model from 2 releases and 6 months ago. I get the skepticism, but the guy isn't saying "it will probably fall short" - he straight up states it as fact.
See, this is why I love languages other than English. If he said it in Spanish, for example, in the subjunctive mood, the speculative aspect would be embedded in the writing automatically. Either way, que tengas buen día ("have a good day")! (I amuse myself)
Seriously. I'm a little shocked by how many posts I scrolled through that are seriously stating that this is comparable to claude. Not just beating it in benchmarks, but extrapolating that to mean it will deliver real world performance at that level.
It's just bizarre to me that anyone getting serious use out of local models can still take the big benchmarks at face value. I think they can typically be suggestive of a model's strengths and weaknesses. But that's about it.
397B smokes it in real codebases. Tried it this morning. Anyone who thinks a 27b dense can match the context understanding of a model 10x its size is delusional.
The general rule of thumb is that a MoE like 35B A3B is roughly equal to a dense model of sqrt(a*b) parameters: sqrt(35B*3B)=10.25B.
This rule doesn't seem to be holding up perfectly anymore as recent MoEs have done better than the rule would suggest, but it's still a useful ballpark estimator and explains why the 27B dense model is significantly better than the 35B A3B model. The dense model is, however, much slower. The MoE only uses 3B parameters per token, which is a massive reduction in compute.
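The rule of thumb above is just a geometric mean, and can be sketched in a couple of lines:

```python
import math

# Rule of thumb from the comment above: a MoE with T total and A active
# parameters behaves roughly like a dense model of sqrt(T * A) parameters.
def dense_equivalent_b(total_b: float, active_b: float) -> float:
    return math.sqrt(total_b * active_b)

print(f"35B-A3B ~ dense {dense_equivalent_b(35, 3):.2f}B")  # sqrt(105) ≈ 10.25
```

As noted, recent MoEs tend to land above this estimate, so treat it as a ballpark floor rather than a prediction.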
35B is a Mixture of Experts; you only activate around 3B parameters per token. This is a 27B dense model: you hit every weight every token, and unsurprisingly it will almost always outperform the MoE despite the MoE's larger total size. The MoE model will be significantly faster, however.
Everyone is saying "faster", so I'll add: cheaper. My GPU sweats much, much less in such a setup, making the USD per token significantly better for MoE models. The same can be observed with API providers.
So, cheaper and much faster, and only slightly worse.
Yes, it is better, but it is also a different architecture. "35b" usually includes the "A3B" when people talk about it: it has 35b trained parameters, but to answer a prompt it routes the data through a number of "experts" (Mixture of Experts), meaning it only uses 3 billion parameters to answer the question. This means the MoE model is (1) larger, because you still need the 35b parameters stored somewhere, (2) faster, because prompts only pass through 3B parameters, and (3) lower in performance/intelligence, because prompts only pass through 3B parameters.
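Rough numbers for the tradeoff described above (a sketch only; real memory use and speed also depend on quantization, KV cache, and memory bandwidth):

```python
# The MoE costs more memory (all 35B weights must stay resident) but far
# less compute per token (only ~3B active), versus the 27B dense model.
moe_total_b, moe_active_b = 35, 3
dense_b = 27

print(f"storage ratio (MoE/dense): {moe_total_b / dense_b:.2f}x")   # ~1.30x more to store
print(f"compute ratio (dense/MoE): {dense_b / moe_active_b:.0f}x")  # ~9x more FLOPs per token
```

That 9x per-token compute gap is why the dense model is so much slower at generation even though it is smaller on disk.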
Truth be told, I can't get Qwen3.6-35B-A3B to outperform Qwen3-Coder-Next. Running both at bf16 on an M3U256 in Claude Code (although I think I'm about to swap that out for a customized Pi rather than deal with the closed-source bullshit from Anthropic anymore).
Will try Qwen3.6-27B at 8bit and see how that goes. I'm not really concerned with speed; just intelligence/accuracy. Would love to have a new go-to coding model.
what about forge code as a harness? it seems to beat claude code with opus too.
I really like Qwen3-Coder-Next as it is running fast and provides good results if you steer it well. I'd like to see it in comparison to this new Qwen3.6 27B model and the MoE model 35B-A3B, but I can't find some good sources.
You can even pick one of their free models for the first few minutes and have that model set up the opencode config file for you to run your local model.
Just spent 3 weeks benchmarking and fine-tuning Qwen3.5-27B and iterating with variants. Now this drops. 🤦‍♂️ But it makes me happy, of course.
That said, I think there’s some suspect stuff with the benchmarks shown here. Gemma4-31B is absolutely better than the Qwen3.5-27B in my testing in multiple areas.
This is a disruptive level of intelligence gain on every iteration, ladies and gents. This is how the AI bubble will pop: big AI companies will go to shit, and local 4090s are going to get even more expensive.
I don't believe these kinds of advancements pop the bubble as long as we are so constrained by power and compute limits in datacenters, especially as capacity demands keep going up.
Extremely verbose, keeps on thinking and thinking without making much progress. Even if the results are good (yet to be seen), it's extremely slow due to being verbose and being dense.
Meh.
Please, if you can, leave a comment on the Hugging Face page. I'm just a regular guy who hopes Qwen never stops helping us out. So comment to your heart's content, please, all.
Dense like Patrick Star? No, not quite. But we flew past worms, and we're getting close to the number of neurons in a fruit fly. About 1 or 2 orders of magnitude to go.
Nah, it's just that locomotion is a surprisingly cognitively hard task. Each fruit fly has a (current) supercomputer's worth of compute in it just to bumblefuck around.
If I have 64GB of memory (MacOS) and want a large context window (128k-256k) should I be using 27B Q4_K_M, 27B Q6_K, the 35B A3B Q4_K_M model, or a different configuration?
I know running bigger models often gives better results, but sometimes the differences are negligible, and smaller models have much more usable speeds.
Obviously, after thousands of model launches, we know that real world use is different than benchmarks... but holy shit, this is unbelievable! Just in time for all the companies clamping down on usage!
I’m thinking there are quite a few folks who only ever look at the benchmarks and never run the models. If you have used the recent Qwen models you know that actual use matches the benchmarks pretty well but you don’t get as much “out of the box” freedom as they require some tweaking. I think many just use the models a couple of times expecting them to work like a cloud model with harnesses built for them and then are surprised when they do not function that way. My two cents as to where the “collective we” are at.
While they really are impressive, they have had a tendency to overclaim; the best example is QwQ-32B, which at the time was touted as competing with DS V3, which turned out to be false.
OK. I hooked it up to Claude Code and let it rip on a text processing problem I had. I've run this same problem on Qwen 3.5 122B, Qwen 3.6 35B-A3B, and now finally Qwen 3.6 27B. It took over 40 minutes to process nine relatively smallish text files, 10-30KB each. Qwen 3.5 122B (MoE) took 30 minutes. I tried Qwen 3.5 397B, but... I didn't have the 5-6 hours it would have taken to crunch this project.
Qwen 3.6 27B was the only model to give me a separate file documenting its discrepancy findings for each of the nine source files, and it did exceptionally well to boot. Qwen 3.6 35B-A3B is awesome for super fast code, but Qwen 3.6 27B seemed to have a deeper intellectual grasp of the actual problem. This is honestly a lot of fun.
Am I a bitch for feeling a bit exhausted by how fast this stuff's moving? I just barely finish benchmarking and tuning my setup, and then there's a new thing that makes the previous thing look like shit.
And that's on my PC. I have a Spark cluster that takes even longer to get going. I can run 122b models on my PC, so the Sparks... I either need to buy two more so they can run truly huge shit, or possibly just sell them, because my $8K GPU is often equally useful but much faster than my $8K worth of Sparks.
Thanks Qwen! This is the best open model I have seen so far in this "weight-class". The only model so far which actually works and does tool calling perfectly fine!
I just started feeding it my benchmarks. Its grasp of literary stylistic commentary is insane. It picks up on everything Gemma 4 does... and then a whole lot more.
Okay, am I the only one who no longer believes these benchmarks!?
Or is it just that local models don't work as well for me (maybe I don't know how to use them properly), or that these benchmarks are heavily exaggerated?
The Qwen 3.6 MoE model is also theoretically very close to Opus. However, in practice, the responses I get from Qwen are significantly worse than those from Opus. Opus manages to understand me and get the job done with a single prompt, whereas with Qwen, I often have to further clarify what I want to happen, or it simply fails to provide an accurate answer, especially if it's something it hasn't been well-trained on.
I always suggest people make their own benchmarks, based on their own real world needs, and test models against that. I'm willing to bet that anyone who does so will get disillusioned about the worth of the big well known benchmarks pretty quickly. Real world problems are messy with tons of uncontrolled variables that won't have one to one matches in a LLM's training data. Meaning they have more need of, for lack of a better term, intelligence.
Yes, size matters here. Opus has more world knowledge and needs much less guidance. With detailed and precise spec qwen might get close, but such spec is 80% of the job. So benchmarks will be deceiving - they are based on a known set of topics which can be emphasized in the training data. But try to code in a specialised domain like statistics or bioinformatics with a short prompt - qwen will fail and opus will nail.
Benchmarks can be indicative, that's for sure, but there comes a point where you're being gaslit so hard and everyone is falling for it (not you, just generally speaking).
Guys, it's a 27b dense model that's scoring the same or better on a repeatable benchmark than a model more than 10x its size from the SAME generation? C'mon guys, use your heads: put the 27b against the 397b in serious production tasks, in dynamic environments that require contextual reasoning. The model with 10x the parameters will be innately more intelligent in real-world applications, especially within the same generation.
That's true. Single-GPU models are fun to play with until real work needs to be done. Or at least it ends up with Opus doing the research and development and writing a detailed how-to for the stupid local model on how to run those things.
It's a combination of multiple things. These benches are run at fp16 with an fp16 KV cache, which no one runs on a 3090. They're benchmaxxed to hell. And they're using harnesses specifically designed for successfully completing these bench runs, which no one has access to.
Anyone tested the 'preserve thinking' concept or know how it works technically? I'm trying to understand whether it's KV caching or actually holding the intermediate thinking in context between requests.
Before whenever I ran 3.5 in a coding agent, it would do a task and when I sent a follow up message I'd see a lot of the prompt being re-processed due to the deleted thinking
Now the experience is much better with llama.cpp since the caching makes follow up responses start quickly.
I haven' tested with 27b yet, but for 35b it makes a lot of difference when the model is repeating things in the context (such as when editing files and outputting the fully modified version)
If you don't activate it, every time you send a request the thinking from the LLM's last response is discarded before answering your new query. Since the last message changed, the KV checkpoint is invalidated and llama.cpp will re-parse all the messages so far (all stripped of their thinking), so you get a delay before it starts processing the actual new tokens of your new request.
With preserve, the thinking is not discarded, so it stays in context, checkpoints work, and there's no delay, but the thinking will eat up some context. (On the other hand, by discarding it the model sometimes has to re-think almost the same thing each time, so keeping it isn't necessarily a bad trade.)
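The effect described above can be sketched with a toy prefix-reuse check (message strings standing in for token sequences; this is just an illustration of prefix caching, not llama.cpp's actual cache code):

```python
# A prefix cache can only reuse KV entries up to the first position that
# differs between the cached context and the new request.

def reusable_prefix(cached, new):
    """Count how many leading items match between two sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["system", "user msg 1", "<think>...</think>", "answer 1"]

# preserve_thinking on: the follow-up request keeps the thinking block,
# so the entire cached history is still a prefix of the new request.
with_thinking = cached + ["user msg 2"]

# preserve_thinking off: the thinking block is stripped, so everything
# after "user msg 1" shifts and the cache is invalidated from there on.
without_thinking = ["system", "user msg 1", "answer 1", "user msg 2"]

print(reusable_prefix(cached, with_thinking))     # full history reused
print(reusable_prefix(cached, without_thinking))  # reuse stops early
```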
Before that, as a user of pi, I can tell you the symptom I saw: when I prompted it, it would process, start writing, and all the intermediate tool calls would be lightning fast (no matter if much bigger than any small message I sent it) until the task was done. But as soon as I sent a new message there would be a long delay, and I could see in the llama.cpp log that it was re-parsing the entire context.
So I highly recommend you use that configuration, especially if your pp speed is not great (mine is 100 t/s with 3.6 q8). For agentic use you will see an enormous difference in total wall-clock time.
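For reference, the flags involved look roughly like this in llama.cpp (a sketch with placeholder model path and port; the preserve_thinking kwarg is passed through the chat template, and flag availability depends on your build):

```shell
# Sketch of a llama-server launch that keeps reasoning blocks in context,
# so the prefix KV cache stays valid across follow-up requests.
# Model filename and port are placeholders.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  --port 8081 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --cache-reuse 1024
```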
This one is really close to an Opus 4.5-level model. On agentic coding specifically, the 27B beats Qwen 397B across every major benchmark they shared: SWE-Bench Verified goes from 76.2 to 77.2, Terminal-Bench 2.0 jumps from 52.5 to 59.3, and SkillsBench nearly doubles from 30.0 to 48.2.
Here's how Qwen 3.6 27B compared to Qwen 3.6 35B-A3B & which one to choose.
Poor thing. And mine has had 11 new SOTA-in-some-way tts voices in the same time haha
Oh god, that reminds me that I just heard of KokoClone this morning.. Now to have codex 5.4 set up a qwen3.6-27b agent that will go research and install it... Unless Spud drops before I finish browsing.
I am currently using faster-qwen3-tts because some absolute legend made it faster and support streaming. But, it does absolutely spike my GPU utilization.
Pocket-TTS has been great. No real complaints there. It just works.
LuxTTS came out like 3 days later and was basically the same thing with more language support if memory serves..
Omnivoice is the new hotness, but I've only tried it for like 10 minutes and thought, "Oh, this is nice" because I don't have any problems that I need to fix when it comes to TTS. With those other options I'm already at faster than real-time generation, streaming support, voice cloning, etcetera.
And KokoClone is the latest thing to catch my attention because Kokoro is UNBELIEVABLY FAST, and clean and pleasant. It isn't emotive, and you can't clone voices. So if KokoClone can do voice cloning well, it instantly becomes the best (non-emotive) tts available just by virtue of the fact that you could run it at like 50x real time on a crap CPU.
If you have any questions about specific scenarios, like maybe you want something expressive but don't need voice cloning, feel free to ask. I have tried at least a dozen tts models recently, and each shines in its own way.
like maybe you want something expressive but don't need voice cloning
that would be a good start, yeah
Besides those, I know that not many models offer voice cloning, but, of those who do, which one do you really think is the best currently (before going the paid rute, aka 11Labs)?
Actually, you'd be blown away by how many offer good quality voice cloning these days. I've got to say that omnivoice was very clean and a good quality clone, as well as being quite fast. What are you aiming to use the tts for? I have different standards for a voice assistant versus the program I have narrate books for me.
Is anyone using it on a MacBook? Can someone tell me how much RAM you would need to run this with 100k context length at 4-bit precision without any offloading?
I had a binned M4 Pro/48GB MBP and I ran 3.5 27b @ 8-bit MLX with 100k+ context just fine. Not fast, but fine. My current M5 Pro/64GB is obviously a step up, especially with prefill.
Have you tried using some of the new speculative decoding methods like DFlash or even DTree? Depending on what you do, coding presumably, they could really speed things up a lot according to preliminary benchmarks.
It's a mixed bag with the M5 and new models right now. MLX can yield a prefill speed improvement of more than 3X, but e.g. Qwen 3.6 and some other new hotness doesn't work right (or at all) yet.
My dream is Qwen 3.6 27b 8-bit MLX with working MTP and prefill boost, which should give something over 20t/s generation and several hundred t/s prefill--i.e. 2-3X+ what my M4 Pro was doing with 3.5 27b.
8-bit models in the 30-35b range are context constrained with 48GB--I basically ran out of RAM before I ran out of CPU. Now I just run whatever I want and don't bother to conserve RAM by closing other programs, etc.
I suggest you upgrade your RAM to 32GB. Then, as long as you ensure the 35B Q4 model can fit in RAM, it should be usable. Although RAM is more expensive than before, it's still much cheaper than a graphics card.
You don't need to put all the weights in VRAM, because that would require too much VRAM capacity. As long as you ensure the 3B activated parameters fit completely into VRAM, you can get a relatively decent speed.
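A quick back-of-envelope helps with sizing decisions like this. The layer/head counts below are made-up illustrative values, not the actual Qwen architecture:

```python
# Back-of-envelope sizing for quantized weights plus KV cache.
# All architecture numbers here are illustrative assumptions.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB at a given quantization."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: int) -> float:
    """K plus V cache footprint in GiB."""
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

# 27B dense at ~4.5 bits/weight (Q4_K_M-ish)
weights = weight_gib(27, 4.5)
# hypothetical config: 48 layers, 8 KV heads, head_dim 128, 100k ctx, fp16 cache
kv = kv_cache_gib(48, 8, 128, 100_000, 2)
print(f"weights ~ {weights:.1f} GiB, kv cache ~ {kv:.1f} GiB")
```

Quantizing the KV cache to q8 roughly halves the second number, which is why it shows up so often in the configs people post here.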
1000 t/s PP and 35 t/s TG in ik_llama.cpp with 2x RTX5060Ti 16GB (32 GB total), graph split. Using 140k context at q8_0. It's a pretty good number if the compaction triggers at 128k
It's about exactly 3x slower than 35B A3B (~3000 t/s PP and 100 t/s TG)
But quality is better so far. The difference is easy to feel, it's more or less the same as the difference between the two 3.5 models of the same size.
Fr can we all just take a moment and say thank you China? While America is trying to fuck the world over left and right, China seems to be the only thing that’s giving us any sort of leverage at all.
Now imagine if they let us have their EVs
In America, too.
Qwen keeps shipping at a pace that's genuinely hard to track. The 27B size is particularly interesting because it sits in the sweet spot for consumer GPU deployments - fits comfortably in 24GB VRAM at Q4 with decent throughput. Curious how the instruction following and context handling compares to Qwen2.5-32B which was already a strong performer at the same tier. Does anyone know if this uses the same MoE architecture as the larger variants or is it a dense model?
I like to do story telling with the models so situational awareness and context are very important (like keeping track of who is in the room) so I have been using Q6 models which I *think* handle that better.
That's a good idea to try. Being stateless, it only knows what it can get from the system prompt or the past visible chat log. Outputting notes would help it keep track at the cost of feeling less immersive and taking up space in an already small context window. I read that Silly Tavern has a RAG database to maintain small details invisibly but haven't tried it yet.
How does this compare to the new Qwen3.6-35b-a3b? I also don't really understand the difference between qwen3.6-35b and qwen3.6-35b-a3b. Would be nice if someone can explain the difference between all 3
qwen3.6-35b and qwen3.6-35b-a3b mean the same model: Qwen 3.6 35b total, with 3b activated parameters (mixture of experts). Every input goes through only 3b parameters.
Qwen 3.6 27b is a dense model: every input goes through all 27b parameters. The result is better if all else is equal, but generation is slower. You can also fit more context with the 27b because the weights are a bit smaller.
I get 50 t/s with 27b, 170 t/s with 35b-a3b on a RTX 5090.
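That gap is roughly what you'd expect from decode being memory-bandwidth-bound: each generated token reads the active weights about once. A toy ceiling estimate, with assumed bandwidth and bytes-per-weight figures (real throughput lands well below these ceilings):

```python
# Decode is roughly memory-bandwidth-bound: each new token has to stream
# the active weights from VRAM. These are assumed illustrative figures,
# not measured specs.

BANDWIDTH_BYTES_S = 1.8e12   # assumed ~RTX 5090-class memory bandwidth
BYTES_PER_WEIGHT = 0.56      # ~4.5 bits/weight at a Q4-ish quantization

def decode_ceiling_tps(active_params_billion: float) -> float:
    """Upper bound on tokens/sec from weight reads alone."""
    return BANDWIDTH_BYTES_S / (active_params_billion * 1e9 * BYTES_PER_WEIGHT)

dense_27b = decode_ceiling_tps(27)  # all 27B weights read per token
moe_a3b = decode_ceiling_tps(3)     # only ~3B active weights read per token
print(f"27B dense ceiling ~{dense_27b:.0f} t/s, 35B-A3B ceiling ~{moe_a3b:.0f} t/s")
```

The measured 170 vs 50 t/s ratio is smaller than the raw weight-read ratio because KV-cache reads, attention compute, and MoE routing overhead aren't free.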
Would a 9b (if they release weights in this range) be useful for speculative decode to speed up 27b? Or is there some architectural reason in 27b design where speculative decode isn't a thing?
So the real question here is why Qwen seems to have a hard time scaling up. Qwen 3.6 27b and 35b are amazing, punching way above their weight (hehe), but Qwen 3.6 Max, supposedly the beefiest model of the family, the top-class frontier model, barely gets into SOTA territory.
Okay, now this will really start to undercut Western SOTA models. Even enterprises will deploy this model in their harness where it fits, not to mention that smaller models such as GPT mini, Haiku, and Flash will go kaput (they'll only be used by their own SOTA models in the harness).
The 27b will require significantly more VRAM at the same 4-bit quantization. I suspect that even at 3-bit this model would still beat OSS 20b in many ways, but it would be slower. So your answer depends on the answer to the question "what place?"
Because dense models are for dGPUs, whereas MoE models are for mini-PCs, Macbooks and laptops without a dGPU. Qwen3.6 35B A3B runs at 10 tokens/sec on DDR4-3200 on my laptop, whereas Qwen3.6 27B would be unusable.
Hardly anyone has a dGPU with 20GB of VRAM or more, which is what you'd need to run Qwen3.6 27B at a decent level.
I was debating this for a while: considering quantization KLD, is it better to run the 27b at q4 with q8 KV, or the 35b MoE at q8 with f16 cache? Same speed for both on my setup with 20GB VRAM and 64GB RAM. What do you think, does the quality gain of q8 over q4 outweigh the 27b outperforming the MoE?
Terminal bench 2.0 rules explicitly disallow modifying timeouts or resources available. Each terminal bench task has timeout (usually under 1h, mostly under 30 mins) and resources configured in the docker container by the task creator and they are chosen that way to test specific model aspects.
Inflection point #2. The first one was Opus 4.5 back in November. If the gap to the MoE model is the same in the 3.6 series as it was in 3.5, given how good the 3.6 MoE is in a harness, this is a blessing.
Namra_7@reddit
Benchmarks
davl3232@reddit
I can't believe we're getting so close to opus 4.5 levels with 2 3090s
cmplx17@reddit
how is it to run with 2 x 3090? i just have one 3090 but wondering if it’s worth getting another one. does the speed scale 2x?
Outpost_Underground@reddit
Just to drive the speed question home, I have 3090s at home and a Pro 6000 Blackwell Max Q at work. On identical inference workloads that completely fit in the VRAM of both setups the Blackwell is like 10-15% faster.
BillDStrong@reddit
How many 3090's? Without that, we can't really judge as well the difference.
Outpost_Underground@reddit
It doesn’t matter how many 3090s as long as the workload fits in the VRAM. For example, I ran Gemma4:26b on a single 3090 and I also forced it to split across both 3090s. Same prompt, and there was a 0.0003% difference in speed.
I mentioned the difference with the Blackwell card because a lot of folks expect a crazy improvement in speeds; unfortunately the performance doesn’t scale like that.
davl3232@reddit
> does the speed scale 2x?
Not really, you only get a speed up if models didn't fully fit in vram before.
Ardalok@reddit
Isn't there a speed boost from multiple cards with NVLink in vLLM?
kmp11@reddit
I am putting it to the test on my 2x4090. I can fit the Q6_XL with Q8 KV and the full context window. It's coding at 20-25 tk/s with the context window 50% full. It takes a while to ingest large context, but otherwise it's chugging along quite nicely.
Subject-Tea-5253@reddit
You will find this video interesting: https://www.youtube.com/watch?v=xS5wao4H4u4&t
It basically shows how prompt processing and token generation are affected by the number and types of GPUs you use.
Mashic@reddit
Why 2? You only need 1, and maybe even an RTX 5060 Ti or RTX 3060 12GB with quants.
florinandrei@reddit
I run all my models on a Raspberry Pi Zero.
BillDStrong@reddit
No, see you are confusing your terminal with your cloud provider. They aren't the same thing. /s
AreYouSERlOUS@reddit
Underrated comment
davl3232@reddit
you're right, a single 3090 could do it with a different quant.
Potential-Leg-639@reddit
1 is not enough for serious stuff and context.
coder543@reddit
95000 context at full KV (with multimodal loaded) is not horrible.
Mashic@reddit
But at least you can test it and use it for small stuff.
Cuddlyaxe@reddit
This is the future we want
Local_Phenomenon@reddit
My Man! Ai bro
mixbits@reddit
💯
_raydeStar@reddit
Meanwhile Opus tightens their limits and makes it more expensive to get in -- the perfect storm for a good local push
Super_Sierra@reddit
Sorry, but most local models have barely caught up to Claude 2, this is benchmaxxed and overfit to shit.
_raydeStar@reddit
What I really see lacking in local models is context limits and a good harness to give proper direction.
Context can't be solved easily, but at least a memory bank can be created to hold onto important information, and you can scrape by.
A harness can be built -- it can perform mathematical functions, solve logic problems, do web lookups, and perform basic tasks.
maybe qwen 27B cant perform as well as opus 4.5 on all tasks, but it doesnt need to
kommentiertnicht@reddit
do you have some resources / links you could share on the 2x 3090 setup?
Zc5Gwu@reddit
WTF why is this so good.
JLeonsarmiento@reddit
the bitchslapping and floor mopping of Gemma4 is brutal...
shansoft@reddit
I honestly don’t understand why Gemma4 scores this low. I’ve been using the latest 31B and its coding results have been cleaner than 3.6 35B in almost every case, and it was able to do tool calling more accurately for the Xcode MCP while Qwen just gave up or got stuck in a loop. Gemma4, in my experience, needs more detail in the prompt, but the results are better. Qwen often adds things I didn’t ask for and has less chance of one-shotting a problem.
MeateaW@reddit
Gemma is a great model for its size, but Qwen 3.6 seems incredible. I would go Gemma at this size, but the 122b Qwen 3.5 has been my favourite local-capable model so far (Strix Halo 128GB). A 3.6 in the ~100 billion parameter range is going to be amazing if it follows these smaller models' capability.
funkyman228@reddit
I think the issue is MOE vs dense, in your case and the benchmark.
BillDStrong@reddit
Looking at the benchmarks, they are claiming this 27B beats their own Qwen3.5 397B-A17B model. If that is real, that might just explain it.
corpo_monkey@reddit
The opposite, don't stop now!
I cannot wait the gemma 5 vs qwen 4 battle.
YearnMar10@reddit
Qwen fired their best tech lead - not sure if it will come to that
awittygamertag@reddit
Different use cases tho. Gemma4 is a little Gemini and inherits all of the conversational skills of Google models. Qwen workhorse.
Septerium@reddit
In my testing, gemma 4 is still a better generalist model, even though it gets demolished by Qwen in coding tasks
JLeonsarmiento@reddit
I like the rhythm and prose too. But crunching benchmark numbers is a sport here.
AbeIndoria@reddit
Gemma-4 is still a very good model at anything that's not coding/agentic work. Qwen struggles there.
JLeonsarmiento@reddit
I like that it thinks less to be honest. That make it better for chat for example.
DragonfruitIll660@reddit
Initial use for similar sized quants seems to give the edge to Gemma in my opinion quality wise.
mikewilkinsjr@reddit
Dammit, I JUST rebuilt around Gemma4 at the house. :D
Looks like tonight it's going to caffeine and downloads.
9gxa05s8fa8sh@reddit
bro this is a screenshot of the market crash lol
Legal_Dimension_@reddit
This is likely f16 rather than q4
Healthy-Nebula-3603@reddit
Nowadays Q4_K_M/L with imatrix is only slightly worse than F16.
Accomplished_Mode170@reddit
Sorry you got downvoted for a parameter when the OP is the one who dropped the /s
Old-Independent-6904@reddit
I feel like this shows how inefficient agents have been compared to what they could be. Exciting!
mister2d@reddit
TIL: Claude Opus is MoE
2Norn@reddit
i mean ofc all sotas are
AttitudeImportant585@reddit
having worked on one of them, this is false
CrispyToken52@reddit
Let me guess. Meta?
AttitudeImportant585@reddit
llama was never sota
kitanokikori@reddit
It doesn't even make sense for them to be on a technical level - they are designed to service literally as many requests as possible from all kinds of domains, why in the world would you want any part of their knowledge base to be unloaded at any time
AttitudeImportant585@reddit
you're underestimating the compute available and optimizations made for that specific architecture for a particular chip
2Norn@reddit
which one and when is a good question.
Kimi, GLM, MiniMax, Xiaomi, Gemini (stated in docs), and GPT (leaked) are all MoEs. The only unknown is Claude; there's no public knowledge of whether it's dense or MoE, but it's very normal to assume it's MoE just like all the others.
Thomas-Lore@reddit
Claude not being MoE would explain their huge compute issues though. :) Although they did speed up Opus recently, so maybe they moved to MoE recently? Or just made it more sparse.
mister2d@reddit
Obviously.
Comfortable-Rock-498@reddit
Look carefully, there is a divider between MoE and Opus
rc_ym@reddit
You do not get enough upvotes.
Cuz it looked like it was also saying Opus 4.5 was open sourced, which also isn't true.
Successful-Brick-783@reddit
It’s not necessarily; there is a line dividing them, it’s just faint.
_-_David@reddit
I just realized one of the bars isn't qwen3.5-35b... it's qwen3.5-**397b**
I have no idea why that shocked me more than the Claude comparison, but it did
social_tech_10@reddit
Incredible that it can beat a model 14X larger in 10 of the 12 benchmarks!!
BillDStrong@reddit
To be fair, the model it is beating is effectively a 17B expert, but with much more memory and a bit of help as needed. You don't get to keep all of that intelligence, unfortunately, in MoE models.
Still damn impressive.
Plasmx@reddit
Me too. Maybe because it’s the same company releasing a new generation and just slashing the prior gen big models.
Lorian0x7@reddit
The funny thing is that it's not even a new generation, just a minor update within the same generation. I've seen Anthropic and OpenAI slap a big round number on models with a much smaller performance gap.
PassengerPigeon343@reddit
I’m sorry, we’re comparing these to CLAUDE now?! Hell yeah.
RelationshipLong9092@reddit
lol
lmao
sk1kn1ght@reddit
Do they imply that it beats opus? Are we for real? Like not negatively. Like are we for real? Repeating it to myself got me goosebumps
florinandrei@reddit
If goosebumps is what you're after, there's a guy at the street corner who can sell you good stuff.
Next_Pomegranate_591@reddit
Technically Claude Opus 4.7 beats 4.6, but I heard it's shit. We never know without testing. Well, Qwen Plus users would know, I guess?
TheMegosh@reddit
Apparently the 4.7 model was condensing prompts to 200k tokens instead of 1 mil, per their changelog. I'd bet that's what made it bad.
TokenRingAI@reddit
It's got looping and other obvious issues, I have free access to it but mostly use Sonnet 4.6 or GPT 5.4.
Sonnet is really reliable and stable
Something is very strange about Opus 4.6 & 4.7, they act like a large model that is excessively quantized. Opus 4.5 was not like this. I wonder if this is a side effect of them using TPUs. Gemini acts the same way.
Next_Pomegranate_591@reddit
Frr, this happened after Jan. I would one-shot a whole project with 10 words on Opus 4.5, and now 4.6 acts so dumb; it feels basically like another Gemini 3 Pro. I mean, it's still better, but it really disappoints remembering what Opus used to do 🥀
Thomas-Lore@reddit
Keep in mind they also changed reasoning effort around that time and now it is almost zero due to adaptive thinking.
Next_Pomegranate_591@reddit
Yeah, I did notice that adaptive thinking button. What does it really do? It's not the same as the previous one, which was enhanced thinking or something? I thought it was just cosmetic and they renamed it "adaptive thinking".
Next_Pomegranate_591@reddit
I use gemini 3.1 pro in antigravity but haven't really used it in a while now. Maybe I should give it another shot. I was comparing opus with 3 pro and not 3.1 pro btw :)
TokenRingAI@reddit
This might be a side effect of adaptive thinking, the responses come almost immediately and the chat is muddled with looping content that should have reasonably been expected to be in the thinking block
MadSprite@reddit
I feel like the best way to describe it is that the intern is just as smart as the wizard but not as wise. Having fewer parameters means it's going to know less, but it handles the common tasks we ask of it really well.
vogelvogelvogelvogel@reddit
i can relate
Non-Technical@reddit
These metrics have no visible source.
some_random_guy111@reddit
Straight from qwen on an X post
Non-Technical@reddit
Great! That's important to know for a couple of reasons: they're official, so they're based on something, but since they come from Qwen they're also designed to make 3.6 look good.
JustFinishedBSG@reddit
Man it has to be benchmaxxed to the tits otherwise it’s supremely embarrassing for the « frontier » labs
kaeptnphlop@reddit
They are showing us that Qwen3.6-35B-A3B is on par with 3.5-397B-A17B too. 🤔
vinigrae@reddit
That’s extremely suspicious
Zeeplankton@reddit
god damn wtf
Automatic-Arm8153@reddit
F it we ball
Automatic-Arm8153@reddit
And I was just about to go to sleep. Maybe next time sleep.. maybe next time
Perfect-Flounder7856@reddit
😂😂😂or😭😭😭 I'm right there with you my sleep is shit but everything is so exciting. Any one have an ambien?.
ApprehensiveAd3629@reddit
which gguf quant is possible to run in a 5060 ti 16gb?
Careful_Swordfish_68@reddit
5060 Ti user here. I run Qwen 3.5 27b HauHau aggressive uncensored in IQ4_XS with medium context, which is absolutely fine quality. Expect to run 3.6 the same.
mintybadgerme@reddit
But that's 15.4GB in size. How do you get a decent context out of that?
Pablo_the_brave@reddit
The best i1 IQ4_XS quants are 14.7GB. With KV cache K at q8 and KV cache V at turbo2 you will get 75k ctx... Works great.
mintybadgerme@reddit
Thanks for your help. Is that an unsloth quant?
Pablo_the_brave@reddit
This one model https://huggingface.co/mradermacher/Qwen3.5-27B-i1-GGUF/resolve/main/Qwen3.5-27B.i1-IQ4_XS.gguf?download=true
Compile turboquant from TheTom: https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache
My llama.cpp config:
--models-preset "$CONFIG_PATH" \
--models-max 1 \
--host 0.0.0.0 \
--port 8081 \
-t 8 \
--parallel 1 \
--cont-batching \
--keep -1 \
--chat-template-file "$DIR/chat_template.jinja" \
--chat-template-kwargs '{"preserve_thinking": true}' \
--defrag-thold 0.3 \
--cache-reuse 1024 \
--jinja \
--temp 0.15 \
--top-k 1 \
--min-p 0.1 \
--spec-type ngram-mod \
--spec-ngram-size-n 24 \
--draft-min 4 \
--draft-max 64 \
--repeat-last-n 512 \
--repeat-penalty 1.05 \
and the model.ini with the rest of the settings (I'm using the router)
[Qwen3.5-27B]
model = models/Qwen3.5-27B.i1-IQ4_XS.gguf
ctx-size = 75000
n-gpu-layers = 99
cache-type-k = q8_0
cache-type-v = turbo2
batch-size = 512
ubatch-size = 128
flash-attn = true
no-mmap = true
Careful_Swordfish_68@reddit
HauHauCS Version is 15.1GB in size. Qwen context does not eat much memory.
Here is some proof in a picture so you don't need to listen to all these people talking out of their asses saying IQ3 works at best. Sorry for the bad quality, I'm on my phone atm. But you can see I can load all layers plus 30k context at Q8 KV into the 5060 Ti with IQ4_XS. If you are willing to offload some layers to RAM and sacrifice some t/s, then context size goes brrrrrr.
mintybadgerme@reddit
Thanks very much. Please send the image again, it didn't come through properly.
Careful_Swordfish_68@reddit
Huh, for me it shows up fine. Weird. You See it now?
mintybadgerme@reddit
yep. :) thanks
see_spot_ruminate@reddit
Not just for your 5060 Ti, but for anyone with only 16GB of VRAM: you will need it heavily quantized, or any spillover to system RAM will dramatically slow it down.
There is also the problem of GPU bandwidth limitations with dense models. You are not going to get anywhere close to the same t/s as with the MoEs. You will probably want some speculative decoding going on as well.
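Classic draft-model speculative decoding in llama.cpp looks roughly like this; the model filenames are placeholders (a small same-family model is typically used as the draft), flag names vary between llama.cpp versions, and the gains depend on how often the draft's tokens get accepted:

```shell
# Sketch: main model generates, the small draft model proposes token runs
# that the main model verifies in a single batched pass.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  -md Qwen3.6-small-draft-Q4_K_M.gguf \
  --draft-min 4 \
  --draft-max 16
```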
Careful_Swordfish_68@reddit
IQ4_XS works perfectly fine at medium context. Source: I got a 5060ti.
mintybadgerme@reddit
Better performance - UD-Q3_K-XL (14.5GB) or Q3_K_M (13.6GB)??
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
see_spot_ruminate@reddit
I have no idea. I do not typically use the dense models due to the 5060ti bandwidth despite having 4 of them. For the qwen 3.5 models, if I wanted to have more "intelligence" I would instead use the 122b model as it ran faster on my system (64gb system ram + 64gb vram) than the 27b dense model.
overand@reddit
You probably want to stick with the 35B-A3B model (MoE); the 27B might require a bit too much quantization. (Regardless, only an FP8 is available so far, so you'll need to wait either way, as that one is going to be about 27 gigs.)
Careful_Swordfish_68@reddit
IQ4_XS works perfectly fine at medium context and is ok quality. No need for Q3.
logic_prevails@reddit
Yeah and moe offload can be very usable if you have a good CPU/RAM
luncheroo@reddit
Try a few, but if you quantize kv cache and trim context, you should be able to run Q3 variants pretty well. I'm going to try Q4 to see if it's tolerable because I have the same GPU, but Q3_K_S is my fallback because I know it works for me for Qwen3.5 27b and Gemma 4 31b.
Professional-Bear857@reddit
Made a gguf: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF
TheOriginalOnee@reddit
Anything that can be used with 16GB?
film_man_84@reddit
I just ran LM Studio with Qwen3.6-27b (Q4_K_M); with a 16000 context size it runs at about 3.08 tokens/sec at the beginning of a new chat. I have 16 GB VRAM and 32 GB of RAM.
Have not yet tested with coding using Pi, just downloaded and made first tests.
grumd@reddit
IQ3_XXS from unsloth
Professional-Bear857@reddit
Try the unsloth quants on huggingface
Professional-Bear857@reddit
And a smaller version https://huggingface.co/sm54/Qwen3.6-27B-Q4_K_M-GGUF
Awwtifishal@reddit
You shouldn't quantize ssm_* so much. They should be at least q8. I usually add something like this to llama-quantize:
--tensor-type ssm_alpha=F16 --tensor-type ssm_beta=F16 --tensor-type ssm_out=Q8_0
snorkelvretervreter@reddit
Thanks! Perfect for a 24GB consumer GPU
Fringolicious@reddit
You're a real one, thanks man
twisted_nematic57@reddit
Wtf how is it so small? I remember Qwen3.5 27b being like 30gb
soyalemujica@reddit
Thank you so very much!
Dany0@reddit
Thank you, real G
Small_Ninja2344@reddit
Might work on a m4 pro 24gb ?
mmkzero0@reddit
It is insane what a 27B model can do, and this is only a “small update”
cleversmoke@reddit
Oh snaps, I am setting up a dual agent/sub-agent approach on 2x RTX 3090s with Qwen3.6-35B-A3B and Gemma-4-26B-A4B-it. I wonder how well this 27B is on chain of thought, to replace Gemma-4-26B-A4B-it.
cosmicnag@reddit
So which one is sub agent
cleversmoke@reddit
I'm thinking Qwen3.6-35B-A3B as primary builder and architect (agent) while Gemma-4-31B-it or Qwen3.6-27B as debater (sub-agent)! If the CoT on the Qwen3.6-27B is near as good as the Gemma 4, then I may be able to squeeze in more layers on a RTX 3090 or use a higher quant. This is quite exciting!
cosmicnag@reddit
nice, you're running Q4 quants of each? What context? Even I have a dual GPU setup (5090+4090), totally maxing out one model in Q8XL at max context. It's very tempting to have a dense and a MoE model running simultaneously at Q4. I know it could be vibes/placebo, but I'm concerned about quality drop if I go to Q4. However, I like your idea of an adversarial subagent (this could be what I need to drop quants). May I know how exactly you are running this subagent scene? Pi? Or which harness? When is the subagent invoked?
cleversmoke@reddit
I'm testing the duo agent all this week so will let you know! I'm using OpenCode, it's primarily via agent config prompt, so will have to see how that fares.
Builder: Qwen3.6-35B-A3B Q4_K_M, 262k context, q8_0 KV cache. Invokes sub-agent with (paraphrasing): "If it's easy, go with it. If it's complex, ping the Debater up to 3 times. Be mindful of security, coding practices in AGENTS.md, and architecture and DTOs defined master_plan.md. Ask clarifying questions."
Debater: Gemma-4-31B-it Q4_K_M, 32-64k context, q4_0 KV cache. Max steps: 6. Init prompt (paraphrasing): "Red team, be strict with coding, security, memory leaks, edge cases, laws, only read and respond, no praising, follow format for response (Issue, Priority, Recommendation, Status)."
cosmicnag@reddit
Nice!
ozspook@reddit
phone-a-friend seems to be a really good way of keeping agent sessions on track and getting past difficult problems, if we can have fast MoE models crunching through stuff and asking for help from 122b's or whatever for tricky stuff that's probably optimal, with Opus as a last resort.
cleversmoke@reddit
Agreed! 122B may be too much for my RTX 3090 though unless heavily quanted? I'm testing the phone a friend approach, do you have experience with it?
challis88ocarina@reddit
Kindly quantized: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
lucidparadigm@reddit
What's the hardware requirements
Maleficent-Pea-3494@reddit
I don’t know what it would take, but I know what I’m using. M5 Max 128gb
iBornToWin@reddit
Wow! You are the man! 🤩 Share performance after running it, bro. I want one for myself, but I'm unsure if the speed will be usable and what's the max I can accommodate on a MacBook. Otherwise I'll wait for the Ultra.
Maleficent-Pea-3494@reddit
Ya I’m def looking to get the studio ultra when it drops based on what I can see this thing do. It’s arriving tomorrow so I’ll post up a few thoughts on performance. I’ve got a narrow focus and have built an interface that’ll wire up to Qwen so I’m not using it for open ended things, which will hopefully keep it slim and fast. We’ll see. If it’s too slow then I’ll just enjoy my $5k coffee table ornament 🤣
rjames24000@reddit
also would like to know hardware requirements.. would be great if I can run this smoothly on a single 5090
wen_mars@reddit
I'm running the unsloth q5 with q8 k/v cache at max context length and it works great
GCoderDCoder@reddit
I'll say that for the GGUF, the KV cache seems to behave more like Gemma 4 31B than Qwen 3.5 27B. I was able to squeeze a lot in with Qwen 3.5 27B, but 3.6 27B is doubling the model size for my cache.
This is a great model, but right now my LM Studio Q8 has fewer code issues than my unsloth Q8_K_XL. That's unusual. I wonder if anyone else is experiencing this. I tried a couple of different quants, and Q6_K_XL kept looping around 30-40k tokens without presence penalty.
Initial thoughts: the unsloth ones feel jagged. Much more ambitious, but also more glitches. When instructed to iterate to fix things, it makes changes, but with less meaningful differences than I expected. I'm still playing with it between other tasks, but curious about anyone else's experiences.
RelationshipLong9092@reddit
Maybe below 8k context... but it looks just barely too big to me.
bonobomaster@reddit
Everything you got plus your soul!
AuroraFireflash@reddit
Is there an MLX FP8 quant yet?
woahitsraj@reddit
Yup https://huggingface.co/unsloth/Qwen3.6-27B-MLX-8bit
Dubious-Decisions@reddit
What are you using to run this? Would like to run the MLX flavor but ollama doesn't seem to support it.
MeateaW@reddit
LM Studio is the lazy way. (Turn off "keep model in memory" and use mmap to load models; otherwise the model is kept in system memory too, which just doubles the memory required to load it on unified-memory systems.)
AuroraFireflash@reddit
ty ty, can't look while at the office due to the firewall blocks
cafedude@reddit
Does that one not run on llama.cpp?
bytwokaapi@reddit
Please do the needful
CORKYCHOPS@reddit
I'm new to this, can you download this for using offline? if so how do you do that?
BustyMeow@reddit
This is the purpose of local models
CORKYCHOPS@reddit
yeah thanks, then how do I get the model? I have tried to pull and get "An attempt was made to access a socket in a way forbidden by its access permissions"
Dooquann@reddit
guys I'm completely new to this, can I run this on my 8GB AMD GPU?
egrueda@reddit
Resident_Bell_4457@reddit
Do you guys think, at this pace, in a year I could run a workflow similar to what Claude Code currently offers with local LLMs on a 64GB MacBook Pro?
EbbNorth7735@reddit
The capability density doubles every 3 to 3.5 months. So an 800B now will be matched by a 400B in roughly 3 months, that by a 200B at 6 or 7 months, and finally a 100B at 10 to 12 months. We're talking MoE; that's roughly equal to a 30-35B dense. So yes.
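Taking the "density doubles every 3 to 3.5 months" premise at face value (it's a rough extrapolation, not a law), the arithmetic above is just repeated halving:

```python
# Mechanizes the comment's arithmetic: if capability density doubles every
# ~3.5 months, today's N-parameter capability fits in N / 2^(months/3.5)
# parameters later. The doubling period itself is a rough, contested premise.

def equivalent_size(params_b: float, months: float, doubling_months: float = 3.5) -> float:
    """Parameter count needed after `months` to match a `params_b` model today."""
    return params_b / (2 ** (months / doubling_months))

# 800B -> ~400B after one doubling, ~100B after three doublings (about 10.5 months)
```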
skyyyy007@reddit
I just set up an M5 Pro 64GB and it works pretty well: 128k context window in LM Studio, running on average at 55-70 tps depending on the type of prompt. Uses about 32GB of RAM for this.
Beginning-Window-115@reddit
you should say that you are running a different model or people are gonna think you're running qwen3.6 27b
skyyyy007@reddit
Edited to reflect the model, thanks for highlighting that👍🏻
Resident_Bell_4457@reddit
That is impressive. Would you mind giving a few updates after you've used it?
Like how it performed in real-life tasks, what is your experience? I'd really appreciate your insight.
the3dwin@reddit
https://www.canirun.ai/
the3dwin@reddit
Also LM Studio Tells you whether a model runs on your hardware
kmp11@reddit
Let's see what happens when models start getting released as 1-bit using some offshoot of TurboQuant. This is really the next step for local models.
Beginning-Window-115@reddit
m5 pro gives me 70tok/s on 3.6 35b moe and around 14token/s on the 27b dense
Resident_Bell_4457@reddit
Do you use it for coding, similar to how a workflow looks for a person using Claude Code in the CLI, you know, parallel agents, ~100-200k tokens? How does it compare to that? Is it getting closer, or is context still a big issue?
Beginning-Window-115@reddit
Yes, I can use it at ~100k context, although I don't use agents in parallel or vibe code; most of the time I just use it for bug fixing (I have 48GB of memory, so context isn't really an issue).
Borkato@reddit
You can literally do that now with 35BA3B. It’s god tier, which means this is going to be titan tier
Fringolicious@reddit
I can't be reading this right - a 27B model that's as strong as Opus 4.5 pretty much across the board? Fucking hell
amunozo1@reddit
Qwen are famous for benchmaxxing. I really doubt this is comparable to Opus in any way.
Evening_Ad6637@reddit
Qwen is definitely not known for benchmaxxing. Even back in the Qwen 2 series, it was clear that Qwen was doing something different with their training. For example, it's well known that their pre-training datasets were significantly more math-heavy than others.
In my private tests, Qwen-3.6-35B actually delivered better results than Opus-4.7.
This means that either it's actually Anthropic who's doing the benchmaxxing, or Qwen has managed to benchmaxx real-life tasks.
EbbNorth7735@reddit
Been running qwen3.5 122B and finding it amazing combined with Cline for agentic coding
bladezor@reddit
I did not have the same results as you. Granted, I was running a quant, but they were worlds apart for me. I had Opus 4.7 review the code Qwen 3.6 was generating, and there were a lot of kickbacks.
Beginning-Window-115@reddit
lol no they arent
Top-Rub-4670@reddit
Whether qwen is or isn't we can both agree that r/LocalLLaMa is famous for claiming that <> DESTROYS Claude in all their tests.
toothpastespiders@reddit
And I think that points out how flawed the argument is. If models that can fit in 24 GB of VRAM were actually beating Claude in real-world use, then the hyperbolic claims would be about new models beating *that* model rather than Claude, because we'd all be looking at Nemotron, Phi, or whatever as the top-tier model family to beat rather than Claude.
some_user_2021@reddit
Did you check?
Beginning-Window-115@reddit
have I checked qwen models? yes
toothpastespiders@reddit
It's ridiculous that this is even a controversial opinion. I'm incredibly grateful for the Qwen family of models. I think 3.5 27b was one of the best releases we've had in ages. But Qwen absolutely has a reputation for benchmaxxing. At least outside this sub. Doesn't make the models bad. Just means qwen has specific strategies for training models that align with the major benchmarks.
Caffdy@reddit
Every lab is benchmaxxing one way or another; it's their target goal after all, the first thing people refer to when new models are released. Of course they're going to try to be the best.
Ueberlord@reddit
Damn, I was just wrapping up my tests of Qwen3.6 35B vs Qwen3.5 27B.
High hopes for 3.6 27B though, the 35B variant of 3.6 was way better than the previous version!
Far_Cat9782@reddit
3.6 35b is godly. Especially with the right harness
WeUsedToBeACountry@reddit
what harness are you using
FeiX7@reddit
pi
ab2377@reddit
which one is that pi?
ozspook@reddit
pi.dev
cuberhino@reddit
Any advice on using it? Never tried it installing now
florinandrei@reddit
Yeah, they chose the most googleable name in the world for it, lol.
ab2377@reddit
😂
Far_Cat9782@reddit
My own. I think that's the best for any model.
ionizing@reddit
Screw the haters, they don't get it. Some of us DO make our own applications, and they tend to work because we put time into crafting them for the model. I've been working on mine for ~10 months and Qwen 3.5 made it shine. Qwen 3.6 is actually paying attention to the system prompt and figured out parallel tool execution; the MoE tends to send single tool calls for some reason, while 27B likes to group them, which is one difference I am seeing already. Anyhow, I am also loving the recent trend of people realizing that local models work best with curated tool sets. Heck, all they really need is good prompts and guarded bash access nowadays, in all honesty. OK, I am rambling, but I laughed at the haters and came back to this comment after work to brag about my own 'harness', because yes, some of us do that. I mean, this IS LocalLLaMA...
Tamitami@reddit
Can you share this? I'm also on cachyos and I mainly use forge-code
ionizing@reddit
I need to discuss the ramifications of sharing it with my employer first, because I built a lot of it on the clock and they know about it, so it might be an issue to just open source it now. It gets used in an offline environment for work purposes. It would need a bit of cleanup and some other features before releasing too.
Also, the available tools on the market are finally catching up to the capabilities, so this is less unique in its abilities now (though I am convinced I was the first to have native docx export, including LaTeX to OOXML, tables, lists, etc., months ago lol). So I have thought about putting it out there and might be able to someday.
LewisTheScot@reddit
My girlfriend goes to another school ah comment
JohnnyLovesData@reddit
No, he's peggers
2Norn@reddit
probably pi
redboy33@reddit
Explain "harness" for grandpa. I love AI. I'm just getting into running local models on my Apple M5 Pro and a Framework with the AMD Ryzen AI Max+ 395 APU w/128GB RAM. Using LM Studio & Ollama. Thanks!
GreenHell@reddit
Mention Ollama, and people will get riled up on this sub.
I think Ollama is an okay starting point for a lot of people since it is rather plug and play.
But if you want to get a bit more serious with local models, you will want to look into llama.cpp (https://github.com/ggml-org/llama.cpp) (on which ollama is heavily based without attribution), and llama-swap (https://github.com/mostlygeek/llama-swap) for managing multiple models, switching them out, etc.
llama.cpp is much more performant than Ollama, allows for greater customization, and gets updates faster.
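For anyone eyeing the jump from Ollama, a bare-bones llama-server invocation might look like this. The model path is a placeholder, and flag spellings occasionally change between builds, so double-check with llama-server --help:

```shell
# Serve a local GGUF on an OpenAI-compatible endpoint (http://localhost:8080/v1).
# -ngl 99 offloads all layers to the GPU, -c sets the context window,
# and -ctk/-ctv quantize the KV cache to q8_0.
llama-server \
  -m ./models/Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0 \
  --port 8080
```

Note that quantizing the V cache may require flash attention to be enabled, depending on the build.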
skirmis@reddit
No need for llama-swap, now llama.cpp server has model loading/swapping/unloading built in: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#model-presets
DefNattyBoii@reddit
FYI, Ollama usage patterns are massive cancer; it just teaches you the wrong skills for local LLM tinkering.
GreenHell@reddit
I think that is a bit harsh.
And I too started with Ollama before moving to llama.cpp and llama-swap.
ComplexType568@reddit
A harness is basically the new buzzword: a pretty large umbrella term for a way to give LLMs tools. Think of it like a mech suit for a person; a harness is the mech suit for an LLM. Also, I recommend you drop Ollama, just imo tho hahahah
arcanemachined@reddit
It's not a buzzword. It's a new class of tool, which is why it needs its own word to describe it.
ThisWillPass@reddit
We used to say "how you hold it" for lack of better words. "Harness" is the cleaned-up version.
redboy33@reddit
Would openclaw and Hermes be considered harnesses? What about Qwen Coder CLI? Oh, and what's wrong with Ollama? I kinda like it better than LM Studio, I guess because I don't tweak any settings. Just "ollama serve", "ollama run qwen"; it seems so simple and intuitive.
redonculous@reddit
Openclaw yes. Hermes unsure. I don’t know too much about Hermes but believe it’s a model with claw like features, rather than a program with tools like openclaw.
Happy to be corrected though
adam_suncrest@reddit
hermes would qualify as a harness yes, at its core it's a coding agent with more bells and whistles
New_Comfortable7240@reddit
The simplicity makes you lose some capabilities and speed. By going with llama-server you can fine-tune (or copy the settings from others) and get better results. Sadly, it's a bit more work, but after some time the thing just works. Also, llama.cpp updates frequently and drops good optimizations regularly, faster than Ollama (which reuses llama.cpp anyway).
BasicBelch@reddit
Ollama is just a shitty wrapper around llama.cpp. Just use llama.cpp directly, it will be faster too.
Far_Cat9782@reddit
Yes
markole@reddit
Harness to an LLM is what a Zord to a Power Ranger is.
ASYMT0TIC@reddit
Think of the LLM like a horse. It's beautiful and it can run and jump etc... but it's kind of hard to get any work out of it as-is. Put a harness on that horse, now you can get it to pull a plow, carry people, etc.
txgsync@reddit
LLMs just output text. A harness is anything that allows that LLM’s output text to do anything other than output text.
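A toy version of that definition, with the model faked by a canned function so the loop is self-contained (a real harness would call a chat endpoint and use a real tool-calling schema):

```python
# A minimal "harness" in the sense described above: the LLM only emits text,
# and the harness is the loop that turns special text (tool calls) into
# actions and feeds results back. The model is a stand-in function here.
import json

TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def fake_model(messages):
    """Stand-in for an LLM: asks for a tool once, then gives a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return "The answer is 5."

def run_harness(user_prompt, model=fake_model, max_steps=4):
    messages = [{"role": "user", "content": user_prompt}]
    out = ""
    for _ in range(max_steps):
        out = model(messages)
        try:
            call = json.loads(out)            # model asked for a tool
        except json.JSONDecodeError:
            return out                        # plain text: final answer
        result = TOOLS[call["tool"]](call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return out
```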
dxplq876@reddit
3.6 35B? where are you seeing this? I'm only seeing 27B on hugging face
StardockEngineer@reddit
I use a minimal harness and it’s godly. Harness has nothing to do with it.
choicechoi@reddit
i wonder your setting too
Far_Cat9782@reddit
Llama server, and I use the settings on the Qwen 3.6 Hugging Face page. If you go there, it will tell you the optimal temperature and sampling parameters. The general-assistant parameters are what I use, and I switch on the fly to the coding parameters. I used Gemini to create my harness, and it's writing its own MCP servers now. I call it kokocode but haven't released it yet; I keep adding features and functions.
KedMcJenna@reddit
And what quant (the other guys have the harness side covered, I'm a quant guy)
Far_Cat9782@reddit
I'm using unsloth 3.6 35B Q6_K_XL. The Q4_K_M works just as well.
JuniorDeveloper73@reddit
the new buzzword
motorsportlife@reddit
Which harness are you running with it
KURD_1_STAN@reddit
Def, and I had high hopes for Qwen 3.6 27B because of the 35B, but the benchmarks seem disappointing.
nullmove@reddit
I think the relative difference might not be as big now that the MoE is fixed. But still, equivalent dense models are better in ways that aren't always captured in benchmarks (world knowledge) but are still evident in daily work. ~60 on Terminal-Bench here is already incredible, though.
namakoo1@reddit
It’s so frustrating not being able to run it on my local setup. 😭 I really need more VRAM. Maybe it’s finally time for an upgrade!
CurrentNew1039@reddit
Now release qwen 3.6 9b which beats qwen 3.5 27b or 35b, that would be awesome
BringMeTheBoreWorms@reddit
Let's hope it gets rid of looping
abu_shawarib@reddit
Might be worth it to try to preserve thinking, or a slightly higher quant like Q5.
BringMeTheBoreWorms@reddit
Havent had it loop yet!
bilinenuzayli@reddit
Doesn't loop for me
BringMeTheBoreWorms@reddit
That’s awesome!
iportnov@reddit
Hmm. In my experiments (with unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL), tool calling works like 50/50.
For 3.6 35B A3B it worked very well. Maybe I'm doing something wrong...
Expert_Development64@reddit
The initial Unsloth quantization is not recommended.
Teetota@reddit
Unsloth is famous for this. They usually fix tool calling a few days later.
adam_suncrest@reddit
densocrats it's time to eat 🍽️
some_user_2021@reddit
Dense meat is back on the menu!
Local_Phenomenon@reddit
My Menus!
IrisColt@reddit
*starts eating like prime Harold Saxon*
Ok-Internal9317@reddit
Today's another sleepless night!
My_Unbiased_Opinion@reddit
Moeblicans can sit on the sidelines today!
lolpezzz@reddit
Haven't followed the trend for months, what's the hype for this one?
FluxFlicker@reddit
Anyone knows if they will be releasing a new version of 122B?
emaiksiaime@reddit
And a very sparse coder like the 80b qwen3 next! That would be awesome!
grumd@reddit
Release Qwen3.6-Coder-80B-A10B and I will stop wasting huggingface bandwidth every day
KURD_1_STAN@reddit
Hopefully they make coder specific one like qwen3 but with 6b active instead
PengLaiDoll@reddit
According to Qwen's official WeChat account, it appears this is the final open-source model in the Qwen 3.6 series. Of course, we can also hope that this is simply a typo.
nickludlam@reddit
Who knows. They might even skip it entirely because that size is just so much less common now.
More-Curious816@reddit
Actually, that size is more common now that we have 128GB unified-memory PCs any customer [who has money, or likes to eat rocks] can purchase:
DGX Spark, AMD Strix Halo, and Apple's newest laptops plus the Studio, their desktop equivalent.
coder543@reddit
by what metric?
nickludlam@reddit
Just thinking about the releases that are around 120B params, we have GPT OSS 120 from Aug 2025, Qwen 3.5 122B and Nemotron 3 Super 120B from this year.
My point was that it's just not a very common size (no equivalent Gemma or Mistral models). Also there is a trend towards releasing small models for domestic use, and retaining the larger models for cloud-only inference. I can absolutely see a strategy where it's thought that the ~120B models are good enough to eat into cloud inference profitability.
But yes, this is a reckon, and I may be hallucinating things in the tealeaves which don't exist!
coder543@reddit
Mistral Small 4 is the same size. The only one missing is Gemma 4, which is rumored to have one of the same size too (Gemma 4 124B), just not released yet.
nickludlam@reddit
I had no idea Mistral Small 4 was that size! Thanks for the correction
grumd@reddit
They included it in the poll so I'm assuming they'll release all 4 - 9b, 35b, 27b, 122b. They didn't include 397b in their twitter poll though, so they might not open source the big one
TassioNoronha_@reddit
where did you see this poll?
grumd@reddit
https://x.com/ChujieZheng/status/2039909917323383036
TassioNoronha_@reddit
thanks mate
pmttyji@reddit
Of course, we're getting it. But let them cook that one into the best with more stuff.
Thrumpwart@reddit
Soon I hope!
electricarchbishop@reddit
Alibaba please please please give us 9B!! My poor 3060 can’t handle these things!!
DominusIniquitatis@reddit
Try the 35B one (unless you're specifically interested in the dense model, of course).
electricarchbishop@reddit
Yeah, but it’s so slow. I have 64gb DDR4 ram so I can make it work but 10tk/s is so slow compared to the 40tk/s you could get with 3.5 9B. I can sacrifice intelligence if it’s just agentic coding, a smaller model can correct its mistakes faster than a MoE can finish outputting its first message.
DominusIniquitatis@reddit
3060 12GB, 32GB DDR4. Getting around ~28 t/s with Q4_K_XL at 32768 context length (can go even higher, tested at around 80k with 131072 allocated, got ~23 t/s). All llama.cpp settings are basically default.
electricarchbishop@reddit
Oh yeah, that’s probably why. I’m using ollama, lol.
OS-Software@reddit
Yeah, Ollama can't properly CPU offload MoE models. Just use llama.cpp or LM Studio instead.
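This is the llama.cpp feature being referenced: MoE expert tensors can be pinned to CPU RAM while the attention layers stay on the GPU. A sketch, assuming a recent llama.cpp build; the exact flags have changed over time, so verify against llama-server --help:

```shell
# Keep layers on the GPU (-ngl 99) but pin the MoE expert weights to CPU RAM,
# so only attention/shared tensors occupy VRAM. Model path is a placeholder.
llama-server \
  -m ./models/Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 32768

# Older builds use a tensor-override regex instead:
#   llama-server -m model.gguf -ngl 99 -ot "ffn_.*_exps.*=CPU"
```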
DominusIniquitatis@reddit
Just in case, I could run Qwen Next 80B one (also A3B) on the same hardware just fine.
electricarchbishop@reddit
At what quant???
DominusIniquitatis@reddit
IQ4_XS. But probably can go to Q4_K_XL also, given that it's the active parameter count (that 3B) that matters the most, not the total one.
electricarchbishop@reddit
Interesting, I’ll have to look into that. Thanks!
Dion-AI@reddit
Time to load the tests again...
Eyelbee@reddit
boston101@reddit
Hahahaha perfect img for us nerds
NeuroPalooza@reddit
I see a bunch of benchmarks posted, but how does this compare to GLM 4.7 flash for creative writing? Is it uncensored out of the box?
WhyLifeIs4@reddit
Benchmarks
pmttyji@reddit
It beats 397B on 10/12 items
ZBoblq@reddit
According to whatever that is supposed to measure 35b-a3b is nearly equal with 397b-a17b, which doesn't make much sense either.
pmttyji@reddit
But for broader knowledge, big models are best.
The current Qwen 3.6 27B & 35B models are great and awesome for consumer GPUs.
AvidCyclist250@reddit
web mcp is best. Good thing we can do tool calls eh
Thomas-Lore@reddit
Some things are not available on the internet and yet large llms know them from books and other offline data.
Imaginary-Unit-3267@reddit
This is what scihub and libgen are for. (And physical libraries plus OCR, if need be.)
commenterzero@reddit
Like how the streets work
Far-Low-4705@reddit
It's not really knowledge imo, it's more about nuance.
It understands the nuance of your prompt better, and sees things that aren't implicitly said, even in code, especially when going beyond generic SWE coding to much harder, more complex, or low-level coding tasks.
For me the biggest difference I see is nuance. (Obviously knowledge is better too, but it's not that big of a factor imo.)
huffalump1@reddit
Yep my opinion based on my current understanding is that these new smaller models are getting really good at agentic coding workflows, tool calling, etc...
Definitely not as good for world knowledge and writing, like you say. BUT when they're fast and cheap, search and iteration is easy
FullstackSensei@reddit
Having used both 3.5 397B at Q4 and 3.6 35B at Q8 side by side for agentic coding, I can say that within this scope they're practically matched. But keep in mind this is a pretty narrow scope, and one that is very much a beaten path.
I'm sure if you go to more obscure programming languages, or tasks unrelated to programming, 397B will win.
RevolutionaryGold325@reddit
Also depends on what you are programming. If you do some low complexity UI+backend+database coding, you don't really benefit from the more clever models. If you do some complex refactoring, algorithm design, heavy math and solve difficult problems, the more powerful models are able to figure things out better.
FullstackSensei@reddit
Haven't tried really complex stuff with 3.6, but I can say I did try fairly complex tasks on large projects and 3.6 35B held well. 3.5 couldn't handle much simpler tasks.
I do have some low-level C++ tasks I want to test 35B and 27B with. We'll see how they hold up.
snmnky9490@reddit
The 35b moe is 3.6 whereas the 397b is 3.5
ZBoblq@reddit
So what? There was at most maybe a month between their releases.
etaoin314@reddit
Yeah, realistically there will be a bigger difference in real-world use compared to benchmarks. However, I do think the gap has closed in a meaningful way, and what they have been able to achieve with the 30-billion class of models is truly impressive. Anybody with a strong gaming computer can run a 30-billion-class model, while the 397B takes $10,000 worth of hardware.
snmnky9490@reddit
What do you mean "so what?"
You asked why the smaller one is just as good as the big one. It's because the small one is newer and updated and the bigger one hasn't been updated yet, however long ago the last release was
Puzzleheaded_Base302@reddit
maybe there will be a qwen3.6-397b
clyspe@reddit
If we're seeing performance this great out of a dense medium sized model, why doesn't someone do a dense large model again? It seems like the last great one was llama 3.3 70b. Is it that expensive to train big dense models but the 400B sparse models are cheap? It seems like if we had a qwen 4 70b it could sweep the board.
Medium_Question8837@reddit
I think it's because of the lower inference speeds. Given that agentic usage is trending, dense models fall behind MoE speed-wise. I can get 200 t/s with the FP8 35B-A3B, but with the 27B I get around 60ish t/s generation speed.
AvocadoArray@reddit
Don’t forget Seed OSS 36b. It was my daily driver until 3.5 and Gemma.
Mochila-Mochila@reddit
Mirrin' dat Skills Bench score
Perfect-Flounder7856@reddit
🤯
anarchist1312161@reddit
Excellent, going to see how it performs on my 7900 XTX.
hackiv@reddit
Do my eyes deceive me? Does it beat full-size Qwen 3.5? Wtf. It trades punches with Opus 4.5 (I know, not the newest Opus), but fuck, it's 27B, you can run it locally. Opus 4.5 is probably hundreds of billions of parameters.
Free-Combination-773@reddit
In benchmarks. In real life scenarios it's nowhere close to Opus. But for the size it must be really good
tommitytom_@reddit
I just scroll past benchmarks, I decided they're meaningless a long time ago. The only real benchmark is trying it for yourself
Long_War8748@reddit
I am always a bit sad when I see the hype surrounding new models and the benchmarks not transferring to my actual use cases at all. Of course, we need some kind of eval with the rapid-fire release of models, but benchmarks have become worthless for my own use cases.
Hope someone finds a solution for this eventually, because eval time is limited (I only have so many hours every day 😅)
the3dwin@reddit
Use specs and markdown files like OpenSpec.
Also custom commands like "/explain"
Recently I created explain.toml, which has been a tremendous help before I get it to execute:
description = "Explain Following Prompt ARGS:"
prompt = """
## Expected Format
The command follows this format: `/explain`
## Behavior
Check if a file named "Explained-Prompts.md" exists; if it does not exist, create the file.
Make a copy of the existing "Explained-Prompts.md" in case there is a mistake in appending to the file and the file content gets replaced upon update; the copy can be used to restore easily.
Avoid executing the prompt.
Analyze the prompt.
Explain in detail what is understood from the prompt.
Explain the goals from what is understood from the prompt.
Explain the non goals from what is understood from the prompt.
Explain the plan of action from the understood prompt.
Explicitly and in detail explain how the prompt could be improved; list out what is ambiguous and implicit, then how it could be made explicit and without ambiguity.
Give a detailed improved prompt that is explicit without any ambiguity.
Update "Explained-Prompts.md" file by adding to the end of the file with following.
Add to end of Explained-Prompts.md file:
----------------------------------------
###### Prompt:
[PROMPT]
###### Understood Explanation:
[UNDERSTOOD EXPLANATION]
###### Goals:
[GOALS]
###### Non Goals:
[NON GOALS]
###### Plan:
[PLAN]
###### Improvement:
[IMPROVEMENT]
[LIST OF AMBIGUOUS IMPLICIT TEXTS]
###### Improved Prompt:
[IMPROVED PROMPT]
----------------------------------------
"""
metigue@reddit
Do you have assumed knowledge in your real world use cases?
It's best to treat smaller models like this as pure tools. Give them the detail and knowledge to execute on and they'll blow you away.
Or YOLO it with web searching and hope they find the details you want.
_-_David@reddit
Source: Trust me bro. It's been out for a full hour.
Caladan23@reddit
You must be new to LLMs...
_-_David@reddit
Nope.
Free-Combination-773@reddit
You don't need to try a tiny model to know it's nowhere close to one of the best behemoth models to date. If I claim a newly released Toyota SUV is not as fast as last year's Ferrari, you won't need proof for it, will you?
chaitanyasoni158@reddit
The car analogy doesn’t really hold here. Cars scale pretty linearly with horsepower. More power, more performance. LLMs don’t work like that.
A smaller model can absolutely match or even beat larger ones on specific tasks.
That comes down to training quality, data, and optimization and not just raw parameter count.
A 27B model won’t beat frontier models overall. But saying it’s “nowhere close” is just not right. The gap has narrowed a lot and for many tasks, smaller models are already somewhat competitive.
AdventurousFly4909@reddit
The source is based on thousands of model releases which claim to beat sota.
trusty20@reddit
SOTA for regular coding tasks / modest apps or websites is increasingly not a distant goalpost. Local models can probably already compete. SOTA that will remain SOTA is probably one-shotting massive applications and very niche specialized knowledge domains.
_-_David@reddit
Which this isn't claiming. This one claims to be comparable to, not beat, a model two releases and six months old. I get the skepticism, but the guy isn't saying "It will probably fall short." He straight up states it as fact.
See, this is why I love languages other than English. If he said it in Spanish for example, in the subjunctive mood, the speculative aspect would be embedded in the writing automatically. Either way, que tengas buen dia! (I amuse myself)
Both_Opportunity5327@reddit
And I bet you he is right.
toothpastespiders@reddit
Seriously. I'm a little shocked by how many posts I scrolled through that are seriously stating that this is comparable to claude. Not just beating it in benchmarks, but extrapolating that to mean it will deliver real world performance at that level.
It's just bizarre to me that anyone getting serious use out of local models can still take the big benchmarks at face value. I think they can typically be suggestive of a model's strengths and weaknesses. But that's about it.
blutosings@reddit
The bench comparisons are to Opus 4.5, not the latest 4.7. Also, Opus is a much larger model.
Healthy-Nebula-3603@reddit
Qwen models are really good at coding... I can believe that 3.6 27B dense is this good.
Look up Bijan on YouTube.
vinigrae@reddit
You’re right, the results in bijans tests were surprising
laterbreh@reddit
397B smokes it in real codebases. Tried it this morning. Anyone thinking a 27B dense can match the context understanding of a model 10x its size is delusional.
hackiv@reddit
Wish there was a real-world benchmark that could represent such a workload.
Atom_101@reddit
You are looking at the real-world benchmarks: Reddit comments. That's what I am trying to gauge here lol.
LeonidasTMT@reddit
Isn't it expected to beat 3.5 qwen 27b?
hackiv@reddit
That's not full size qwen 3.5
jon23d@reddit
I’m confused. I am using qwen3.6 35b — is this somehow better?
notgreat@reddit
The general rule of thumb is that a MoE like 35B A3B is roughly equal to a dense model of sqrt(a*b) parameters: sqrt(35B*3B)=10.25B.
This rule doesn't seem to be holding up perfectly anymore as recent MoEs have done better than the rule would suggest, but it's still a useful ballpark estimator and explains why the 27B dense model is significantly better than the 35B A3B model. The dense model is, however, much slower. The MoE only uses 3B parameters per token, which is a massive reduction in compute.
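The heuristic in one helper, for plugging in other configurations (this is purely the rule of thumb described above, not a measured result):

```python
import math

def moe_dense_equivalent(total_b: float, active_b: float) -> float:
    """Geometric-mean rule of thumb: an MoE ~ a dense model of sqrt(total * active)."""
    return math.sqrt(total_b * active_b)

# 35B-A3B: sqrt(35 * 3) ~= 10.25B dense-equivalent by this heuristic
```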
BasicBelch@reddit
yeah why does 35b even exist if the smaller model is somehow better?
Puzzleheaded-Drama-8@reddit
35B is much faster though
jon23d@reddit
Five times as fast for me
mxforest@reddit
Because 35B runs much faster than 27B. Like 4-5x faster at the very least, because it activates only ~3B params per token, not all of them like the 27B.
frzen@reddit
At any moment only 3B parameters are active with 35B-A3B, but with 27B it's constantly the full 27B. Hope that makes sense.
dinerburgeryum@reddit
35B is Mixture of Experts; you only activate ~3B parameters per token. This is a 27B dense model: you hit every weight every token, and unsurprisingly it will almost always outperform the MoE model despite the MoE's larger total size. The MoE model will be significantly faster, however.
SnooPaintings8639@reddit
Everyone is saying "faster" so I'll add: cheaper. My GPU does sweat much, much less in such setup, making the USD per token significantly better in MoE models. Same can be observed in API providers.
So, cheaper and much faster, and only slightly worse.
Aromatic_Bed9086@reddit
Yes, it is better, but it is also a different architecture. "35B" usually means the 35B-A3B when people talk about it. It has 35B trained parameters, but to answer a prompt it routes the data through a number of "experts" (Mixture of Experts), meaning it only uses about 3 billion parameters per token. This means the MoE model is (1) larger, because you still need to store all 35B parameters somewhere, (2) faster, because each token only passes through 3B parameters, and (3) lower performance/intelligence, because each token only passes through 3B parameters.
kondrag@reddit
27b is a dense model. 35b is MOE.
SheepherderSerious51@reddit
I used to pray for times like this
Snoo_27681@reddit
What a time to be alive, and have RAM...
V0dros@reddit
Now praying for the hardware to run this
debackerl@reddit
And the Chinese delivered!
Glad-Pea9524@reddit
Qwen is Chinese?
Bulky_Book_2745@reddit
yes
Maleficent-Ad5999@reddit
Me too.. now our prayers have been answered..
More-Curious816@reddit
God bless the Qwen.
True_Requirement_891@reddit
We need to keep this momentum going
GoTrojan@reddit
35B IQ3 or 27B 4_K_M?
jake-writes-code@reddit
Truth be told I can't get Qwen3.6-35B-A3B to outperform Qwen3-Coder-Next. Running both at bf16 on an M3U 256 in Claude Code (although I think I'm about to swap it out for a customized Pi rather than deal with the closed-source bullshit from Anthropic anymore).
Will try Qwen3.6-27B at 8bit and see how that goes. I'm not really concerned with speed; just intelligence/accuracy. Would love to have a new go-to coding model.
Tamitami@reddit
what about forge code as a harness? it seems to beat claude code with opus too.
I really like Qwen3-Coder-Next as it runs fast and provides good results if you steer it well. I'd like to see it compared to this new Qwen3.6 27B model and the 35B-A3B MoE, but I can't find any good sources.
AuroraFireflash@reddit
Which quant? MoE's are really sensitive about how quant is done from my limited understanding.
jake-writes-code@reddit
bf16
Artistic_Okra7288@reddit
Same. Qwen3-Coder-Next has been my go-to for agentic coding. It's still not great, but I queued up Qwen3.6-27B and am giving it a roll.
social_tech_10@reddit
Opencode works well with Qwen models
You can even pick one of their free models for the first few minutes and have that model set up the opencode config file for you to run your local model.
Blues520@reddit
Let us know how it performs compared to qwen3-coder-next
Weak-Shelter-1698@reddit
But can it do creative writing? please tell me yes.
Technical_Ad_6106@reddit
yes
FusionCow@reddit
gemma 4 still the goat for that
Weak-Shelter-1698@reddit
🔥🔥🔥🔥
Endothermic_Nuke@reddit
Just spent three weeks benchmarking and fine-tuning Qwen3.5-27B and iterating with variants. Now this drops. 🤦♂️ But it makes me happy, of course.
That said, I think there’s some suspect stuff with the benchmarks shown here. Gemma4-31B is absolutely better than the Qwen3.5-27B in my testing in multiple areas.
Kodrackyas@reddit
This is a disruptive level of intelligence gain on every iteration. Ladies and gents, this is how the AI bubble will pop: big AI companies will go to shit and local 4090s are going to get even more expensive
cdshift@reddit
I don't believe these kinds of advancements pop the bubble as long as we are so constrained by power and compute limits in datacenters, especially as capacity demands keep going up.
CyberAttacked@reddit
27B model better than opus ?! Who the fuck hurt Qwen ?
cdshift@reddit
Im always skeptical about benchmarks alone. Im holding out for more tests from the community here on their private benchmarks and personal usecases
bunny_go@reddit
Extremely verbose, keeps on thinking and thinking without making much progress. Even if the results are good (yet to be seen), it's extremely slow due to being verbose and being dense. Meh.
Express_Quail_1493@reddit
Please, if you can, leave a comment on the Hugging Face page. I'm just a regular guy who hopes Qwen never stops helping us out, so comment to your heart's content, please, all.
ihatebeinganonymous@reddit
Is it dense?
x10der_by@reddit
it's densin time
logic_prevails@reddit
Are you dense?
Dany0@reddit
Dense like Patrick Star? No. Not quite. But we flew past worms and we're getting close to the number of neurons in a fruit fly. About 1 or 2 orders of magnitude to go
Jokes aside model seems baller
Finanzamt_Endgegner@reddit
bro this fruit fly is smarter than most people on the internet 😭
Dany0@reddit
Nah it's just that locomotion is a surprisingly cognitively hard task. Each fruitfly has a (current) supercomputer's worth of compute in it just to bumblefuck around
Finanzamt_Endgegner@reddit
Well tbf its not that much compute in reality, biological brains are fairly sparse no?
inddiepack@reddit
Denser than you believe.
reto-wyss@reddit
It is the densest.
mumblerit@reddit
so dense baby
ab2377@reddit
very!
Sevealin_@reddit
Yes
Xeoncross@reddit
If I have 64GB of memory (MacOS) and want a large context window (128k-256k) should I be using
27B Q4_K_M, 27B Q6_K, the 35B A3B Q4_K_M model, or a different configuration? I know running bigger models often gives better results, but sometimes the differences are negligible, and smaller models have much more usable speeds.
the3dwin@reddit
https://www.canirun.ai/
QuantumCatalyzt@reddit
What is your setup with models like this for agentic coding?
tecneeq@reddit
5090:
/home/kst/bin/llama-b8838/llama-server \
  --hf-repo unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL --alias Qwen3.6:27b \
  --no-mmap --host 0.0.0.0 --port 11337 --gpu-layers 99 --fit on --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --temperature 0.6 \
  --top-k 20 --top-p 0.95 --n-predict 32768 --ctx-size 196608
neur0__@reddit
Thank you 🐐
the3dwin@reddit
can one of you explain? I have no idea what I'm looking at in that long terminal command.
Sylvers@reddit
Any idea how it compares to Kimi 2.6?
the3dwin@reddit
When public online benchmarks don't have a good comparison, I use Gemini Deep Research to ask.
Also found the following but have not tested: https://github.com/sammy995/Local-LLM-Arena
RazsterOxzine@reddit
But can it tell what this spells: []D [] []V[] []D [] [][]
Sir-Draco@reddit
Obviously, after thousands of model launches, we know that real-world use is different from benchmarks... but holy shit, this is unbelievable! Just in time for all the companies clamping down on usage!
toothpastespiders@reddit
After reading through the comments I would argue that we, as a collective whole, have not realized that yet. Somehow.
Sir-Draco@reddit
I’m thinking there are quite a few folks who only ever look at the benchmarks and never run the models. If you have used the recent Qwen models you know that actual use matches the benchmarks pretty well but you don’t get as much “out of the box” freedom as they require some tweaking. I think many just use the models a couple of times expecting them to work like a cloud model with harnesses built for them and then are surprised when they do not function that way. My two cents as to where the “collective we” are at.
Healthy-Nebula-3603@reddit
Qwen models are as good as their benchmarks.
No_Weather8173@reddit
While they really are impressive, they have had a tendency to overclaim, best example of which is qwq-32b which at the time was touted to compete with DS V3, which turned out to be false
Healthy-Nebula-3603@reddit
qwq is an ancient model ... I remember that model was overthinking like crazy ;)
It's much better now with qwen 3.5 dense .... I haven't tested with 3.6 dense yet
Only 200 tokens for hello ;)
Weekly_Comfort240@reddit
OK. I hooked it up to Claude Code and let it rip on a text processing problem I had. I've run this same problem on Qwen 3.5 122B, Qwen 3.6 35B-A3B, and now finally Qwen 3.6 27B. It took over 40 minutes to process nine relatively smallish text files, 10-30KB each. Qwen 3.5 122B (MoE) took 30 minutes. I tried Qwen 3.5 397B but... I didn't have the 5-6 hours it would have taken to crunch this project.
Qwen 3.6 27B was the only model to give me a separate file documenting its discrepancy findings for each of the nine source files, and it did exceptionally well to boot. Qwen 3.6 35B-A3B is awesome for super fast code, but Qwen 3.6 27B seemed to have a deeper intellectual grasp of the actual problem. This is honestly a lot of fun.
the3dwin@reddit
Curious to know more about the tasks you ran, the files, and the prompt.
vulcan4d@reddit
Hope they do another 80B or 122B MoE. If you've got a bit of RAM, these are a great balance of speed vs performance.
the3dwin@reddit
https://www.canirun.ai/ may help
pixelpoet_nz@reddit
I so badly want something I can run with 2x 128 GB Strix Halo and have lots of context.
unjustifiably_angry@reddit
Am I a bitch for feeling a bit exhausted by how fast this stuff's moving? I just barely finish benchmarking and tuning my setup and then there's a new thing that makes the previous thing look like shit.
And that's on my PC. I have a Spark cluster that takes even longer to get going, I can run 122b models on my PC so the Sparks... I either need to buy two more so they can run truly huge shit or possibly just sell them, because my $8K GPU is often equally useful but much faster than my $8K worth of Sparks.
Diligent-Detective97@reddit
it's so slow on a 3090, fully in VRAM, with qwen3.6-27b@q3_k_xl using opencode
JsThiago5@reddit
How many t/s and PP do you get with a 3090?
m0lest@reddit
Thanks Qwen! This is the best open model I have seen so far in this "weight-class". The only model so far which actually works and does tool calling perfectly fine!
QuinsZouls@reddit
Running the model with vulkan backend and turboquant setup at 26t/s with 131k context windows using a RX 9070 16gb:
Tool calling seems fine like 90% success rate on cline
IrisColt@reddit
I just started feeding it my benchmarks. Its grasp of literary stylistic commentary is insane. It picks up on everything Gemma 4 does... and then a whole lot more.
toothpastespiders@reddit
Have you tested 3.5 27b on it too? I'm really curious how the two compare with each other on anything related to the humanities.
AloneSYD@reddit
3.5 has overthinking issues
BubrivKo@reddit
Okay, am I the only one who no longer believes in these benchmarks!?
Or is it just that local models don't work as well for me (I don't know how to use them properly), or that these benchmarks are heavily exaggerated?
The Qwen 3.6 MoE model is also theoretically very close to Opus. However, in practice, the responses I get from Qwen are significantly worse than those from Opus. Opus manages to understand me and get the job done with a single prompt, whereas with Qwen, I often have to further clarify what I want to happen, or it simply fails to provide an accurate answer, especially if it's something it hasn't been well-trained on.
toothpastespiders@reddit
I always suggest people make their own benchmarks, based on their own real world needs, and test models against that. I'm willing to bet that anyone who does so will get disillusioned about the worth of the big well known benchmarks pretty quickly. Real world problems are messy with tons of uncontrolled variables that won't have one to one matches in a LLM's training data. Meaning they have more need of, for lack of a better term, intelligence.
Teetota@reddit
Yes, size matters here. Opus has more world knowledge and needs much less guidance. With detailed and precise spec qwen might get close, but such spec is 80% of the job. So benchmarks will be deceiving - they are based on a known set of topics which can be emphasized in the training data. But try to code in a specialised domain like statistics or bioinformatics with a short prompt - qwen will fail and opus will nail.
laterbreh@reddit
Benchmarks can be indicative thats for sure, but there comes a point where youre being gaslit so hard and everyone is falling for it (not you just generally speaking).
Guys, it's a 27B dense model that's scoring the same or better on a repeatable benchmark than a model more than 10x its size, in the SAME generation? C'mon guys, use your heads: apply the 27B against the 397B in serious production tasks, in dynamic environments that require contextual reasoning. The model with 10x the parameters will be innately more intelligent in real-world applications, especially within the same generation.
odikee@reddit
That's true. Single-GPU models are fun to play with until real work needs to be done. Or at least it ends up with Opus researching, developing, and writing a detailed how-to for the stupid local model on how to run those things.
Ok_Mammoth589@reddit
It's a combination of multiple things. These benches are for fp16 weights with an fp16 KV cache, which no one runs on a 3090. They're benchmaxxed to hell. And they're using harnesses specifically designed for successfully completing these bench runs, which no one has access to.
So yeah, you're not crazy
RickyRickC137@reddit
GGUF re-upload when?
Adventurous-Paper566@reddit
Unsloth are available.
mintybadgerme@reddit
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
Blues520@reddit
Based
blazze@reddit
Gemma 4 is getting brutalized by a lot of these releases; it has to improve.
lookitsthesun@reddit
Absolutely not lol. Qwen models are cool but the epitome of benchmark maxxed. And people fall for it every single time
seppe0815@reddit
it is sota ! stop crap talk
surrealerthansurreal@reddit
Anyone tested the ‘preserve thinking’ concept or know how it works technically? I’m trying to understand if it’s K-V caching or actually holding intermediate thinking in context between requests
tarruda@reddit
Before, whenever I ran 3.5 in a coding agent, it would do a task, and when I sent a follow-up message I'd see a lot of the prompt being re-processed due to the deleted thinking
Now the experience is much better with llama.cpp since the caching makes follow up responses start quickly.
Caffdy@reddit
how do I enable the preserve thinking option on llama.cpp?
tarruda@reddit
Here's the full script I'm using:
Caffdy@reddit
does -ngram-mode make any difference?
tarruda@reddit
I haven't tested with 27b yet, but for 35b it makes a lot of difference when the model is repeating things in the context (such as when editing files and outputting the fully modified version)
dry3ss@reddit
If you don't activate it, every time you send a request, the old thinking from the LLM's last response is discarded before answering your new query. Since the last message changed, the KV checkpoint is invalidated and llama.cpp will re-parse all the messages so far (all stripped of their thinking), so you will have a delay before it starts processing the actual new tokens of your request.
With preserve, the thinking is not discarded, so it stays in context, checkpoints work, and there's no delay, but the thinking will eat up some context (however, by discarding it, the model sometimes has to re-think almost the same thing each time, so keeping it is not necessarily a bad thing).
Before that, as a user of pi, I can tell you that the symptom I saw was this: when I prompted it, it would process, start writing, and all the intermediate tool calls would be lightning fast (no matter if much bigger than any small message I sent it) until the task was done, but as soon as I sent a new message there would be a long delay, and I could see in the llama.cpp log that it was re-parsing the entire context.
So I highly recommend you use that configuration, especially if your PP speed is not so great (mine is 100 t/s with 3.6 q8); for agentic use you will see an enormous difference in total wall-clock time
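A toy illustration of the mechanism described above (the "token" sequences are hypothetical; llama.cpp's prefix cache works on real token IDs, but the principle is the same: only the shared leading span of the context can be reused):

```python
def common_prefix_len(a, b):
    """Number of leading tokens two sequences share; a llama.cpp-style
    prefix cache can only reuse KV entries for this span."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical "token" sequences for two consecutive agent requests.
# Without preserve-thinking, the previous reply's <think> block is stripped,
# so the new prompt diverges right after the first user turn:
with_think    = ["sys", "user1", "<think>", "plan", "</think>", "reply1", "user2"]
without_think = ["sys", "user1", "reply1", "user2"]

print(common_prefix_len(with_think, with_think[:-1]))  # 6 -> whole history reused
print(common_prefix_len(with_think, without_think))    # 2 -> cache invalidated early
```

The longer the agent session, the bigger the span that gets needlessly re-processed in the stripped case, which is exactly the delay described above.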
techzexplore@reddit
This one is really close to an Opus 4.5-level model. On agentic coding specifically, 27B beats Qwen 397B across every major benchmark they shared: SWE-Bench Verified goes from 76.2 to 77.2, Terminal-Bench 2.0 jumps from 52.5 to 59.3, and SkillsBench nearly doubles from 30.0 to 48.2. Here's how Qwen 3.6 27B compares to Qwen 3.6 35B-A3B and which one to choose.
autisticit@reddit
Slop
TwoPlyDreams@reddit
No preserve thinking?
aalluubbaa@reddit
Openclaw completely locally with just an rtx 5090????
AvidCyclist250@reddit
Well, I'm already doing Hermes locally with the qwen 3.6 MoE. It's fine. With a 4080 and 64 GB RAM
bilinenuzayli@reddit
What's the TPS
AvidCyclist250@reddit
60
cosmicnag@reddit
Definitely doable now
Fast_Paper_6097@reddit
My poor AI - she’s had a new brain every 2 weeks for the past year
_-_David@reddit
Poor thing. And mine has had 11 new SOTA-in-some-way tts voices in the same time haha
Oh god, that reminds me that I just heard of KokoClone this morning.. Now to have codex 5.4 set up a qwen3.6-27b agent that will go research and install it... Unless Spud drops before I finish browsing.
Caffdy@reddit
in your experience, which one is the best one (tts) or the ones who are worth the trouble trying?
_-_David@reddit
I am currently using faster-qwen3-tts because some absolute legend made it faster and support streaming. But, it does absolutely spike my GPU utilization.
Pocket-TTS has been great. No real complaints there. It just works.
LuxTTS came out like 3 days later and was basically the same thing with more language support if memory serves..
Omnivoice is the new hotness, but I've only tried it for like 10 minutes and thought, "Oh, this is nice" because I don't have any problems that I need to fix when it comes to TTS. With those other options I'm already at faster than real-time generation, streaming support, voice cloning, etcetera.
And KokoClone is the latest thing to catch my attention because Kokoro is UNBELIEVABLY FAST, and clean and pleasant. It isn't emotive, and you can't clone voices. So if KokoClone can do voice cloning well, it instantly becomes the best (non-emotive) tts available just by virtue of the fact that you could run it at like 50x real time on a crap CPU.
If you have any questions about specific scenarios, like maybe you want something expressive but don't need voice cloning, feel free to ask. I have tried at least a dozen tts models recently, and each shines in its own way.
Caffdy@reddit
that would be a good start, yeah
Besides those, I know that not many models offer voice cloning, but, of those who do, which one do you really think is the best currently (before going the paid rute, aka 11Labs)?
_-_David@reddit
Actually, you'd be blown away by how many offer good quality voice cloning these days. I've got to say that omnivoice was very clean and a good quality clone, as well as being quite fast. What are you aiming to use the tts for? I have different standards for a voice assistant versus the program I have narrate books for me.
Caffdy@reddit
which one would you recommend for each case? what about rp
_-_David@reddit
Try omnivoice. It seems to be the gold standard currently. If that doesn't do what you need it to, we can go from there.
kmp11@reddit
top notch from team Qwen... Making a model that fits perfectly in consumer hardware instead of weird sizes that are unusable for most.
AVijha@reddit
Is anyone using it on a MacBook? Can someone tell me how much RAM you'd need to run this with a 100k context length at 4-bit precision without any offloading?
MrPecunius@reddit
What would you offload to on a Mac? Disk?
I had a binned M4 Pro/48GB MBP and I ran 3.5 27b @ 8-bit MLX with 100k+ context just fine. Not fast, but fine. My current M5 Pro/64GB is obviously a step up, especially with prefill.
swoonz101@reddit
What do you mean by not fast? I’ve got an M4 Max with 128 GB of RAM, how many tokens/s can I expect realistically?
MrPecunius@reddit
~8-9 t/s with Qwen3.5 27b 8-bit MLX = not fast
M4 Max should be somewhat less than double that. With 128GB, I'd run something like Qwen3.5 122b a10b or unquantized(!) Qwen3.6 35b a3b.
No_Weather8173@reddit
Have you tried using some of the new speculative decoding methods like DFlash or even DTree? Depending on what you do, coding presumably, they could really speed things up a lot according to preliminary benchmarks.
MrPecunius@reddit
It's a mixed bag with the M5 and new models right now. MLX can yield a prefill speed improvement of more than 3X, but e.g. Qwen 3.6 and some other new hotness doesn't work right (or at all) yet.
My dream is Qwen 3.6 27b 8-bit MLX with working MTP and prefill boost, which should give something over 20t/s generation and several hundred t/s prefill--i.e. 2-3X+ what my M4 Pro was doing with 3.5 27b.
SolitaryShark@reddit
how many tps do you get with the m5 pro?
MrPecunius@reddit
Your best source of information is here:
https://omlx.ai/compare
FunConversation7257@reddit
I was considering getting a 48 gig model to run these new models, any specific reason you decided to jump for the 64 gigs?
MrPecunius@reddit
8-bit models in the 30-35b range are context constrained with 48GB--I basically ran out of RAM before I ran out of CPU. Now I just run whatever I want and don't bother to conserve RAM by closing other programs, etc.
AuroraFireflash@reddit
27B * 0.5 = 13.5 GB, figure another 4-8GB for context, way too limited on a 16GB option. Passable on a 24GB MBP, just fine on 32GB.
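The arithmetic behind that estimate, as a rough sketch (real GGUF quants mix bit widths per tensor, so actual file sizes differ a bit, and context/KV cache comes on top):

```python
def weight_gb(params_b, bits_per_weight):
    """Approximate weight size in GB: billions of params * bytes per param.
    Ignores embeddings, KV cache, and per-tensor overhead in real GGUF files."""
    return params_b * bits_per_weight / 8

print(weight_gb(27, 4))  # 13.5 -> the "27B * 0.5" estimate above
print(weight_gb(27, 8))  # 27.0 for an 8-bit quant
```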
KyrosDesu@reddit
Already? I was just starting to get used to the 35B MoE. I hope this dense one is faster than the previous one.
Darkoplax@reddit
now we wait for the 9b for the poors
SweetBluejay@reddit
If you have 8GB VRAM, the 35B is a better choice than the 9B.
Darkoplax@reddit
i have 6gb vram
SweetBluejay@reddit
I suggest you upgrade your RAM to 32GB. Then, as long as you ensure the 35B Q4 model can fit in RAM, it should be usable. Although RAM is more expensive than before, it's still much cheaper than a graphics card.
Darkoplax@reddit
I have 32GB RAM already. I just thought, from reading every thread here and there, that we should never let the LLM spill into RAM and keep it all in VRAM?
SweetBluejay@reddit
You don't need to put all the weight in VRAM, because that would require too much VRAM capacity. As long as you ensure that the 3B activation parameters can fit completely into VRAM, you can guarantee a relatively decent speed.
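A back-of-envelope for why partial offload stays usable with MoE: decode on a memory-bound system is roughly bandwidth divided by the bytes of *active* weights streamed per token. The numbers below are illustrative assumptions, not measurements:

```python
def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Rough decode-speed ceiling for a memory-bound setup: every generated
    token streams all active weights through memory once, so
    t/s is approximately bandwidth / active-weight bytes."""
    active_gb = active_params_b * bytes_per_param  # GB touched per token
    return bandwidth_gb_s / active_gb

# Illustrative assumptions: 3B active params at ~Q4 (0.5 bytes/param),
# served from dual-channel DDR4-3200 at roughly 50 GB/s.
print(round(tokens_per_sec(3, 0.5, 50), 1))  # ~33 t/s ceiling; real speeds are lower
```

This is why a 3B-active MoE can stay interactive from system RAM while a 27B dense model at the same bandwidth would be roughly 9x slower.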
Whole-Impression-709@reddit
What is the memory footprint on that?
TheArchivist314@reddit
Cool can it run on 12gb of vram ?
kiwibonga@reddit
1000 t/s PP and 35 t/s TG in ik_llama.cpp with 2x RTX5060Ti 16GB (32 GB total), graph split. Using 140k context at q8_0. It's a pretty good number if the compaction triggers at 128k
It's about exactly 3x slower than 35B A3B (~3000 t/s PP and 100 t/s TG)
But quality is better so far. The difference is easy to feel, it's more or less the same as the difference between the two 3.5 models of the same size.
I need a second 32GB AI box so I can run both.
laterbreh@reddit
Are we benchmaxxing yet dad?
27b meets or beats its 397b sibling? Yep. Believable.
IrisColt@reddit
mother of God...
laterbreh@reddit
For those of you thinking you're matching the 397B version on these benchmarks with a 27B dense model, you're smoking crack.
Tried it on 5 tasks; the 397B smokes it in real-world agentic code in real codebases.
keyehi@reddit
how low are they planning to go? 4b?
Technical_Split_6315@reddit
No ways its opus 4 level
logic_prevails@reddit
Jesus Christ it’s jason bourne
Gold-Debt-5957@reddit
Does it run on a 64 GB Mac M1 Max? I'm new :v
ernexbcn@reddit
Yes, but very slow.
AlecTorres@reddit
Good enough for me :v
ricardofiorani@reddit
Nice
PinkySwearNotABot@reddit
Fr can we all just take a moment and say thank you China? While America is trying to fuck the world over left and right, China seems to be the only thing that’s giving us any sort of leverage at all.
Now imagine if they let us have their EVs In America, too.
PinkySwearNotABot@reddit
So according to this — no reason to keep 35B?
jimmytoan@reddit
Qwen keeps shipping at a pace that's genuinely hard to track. The 27B size is particularly interesting because it sits in the sweet spot for consumer GPU deployments - fits comfortably in 24GB VRAM at Q4 with decent throughput. Curious how the instruction following and context handling compares to Qwen2.5-32B which was already a strong performer at the same tier. Does anyone know if this uses the same MoE architecture as the larger variants or is it a dense model?
Adventurous-Paper566@reddit
It's a dense model.
Djagatahel@reddit
You're replying to a bot
TheOriginalOnee@reddit
14B please, I need something that fits into my 16GB peasant VRAM
zsydeepsky@reddit
since Qwen team claimed that 3.6-27B beats 3.5-397B-A17B in every benchmark, and compared to where 3.6-35B-A3B currently stands...
guys, we are literally having a Claude Sonnet 4.6, running locally.
Super_Sierra@reddit
No we don't, this shit is pure sloponium.
Hungry_Audience_4901@reddit
Literally cumming right now
SnooPaintings8639@reddit
Is it that good at ERP too?
Super_Sierra@reddit
Qwen always sucked for that because they overfit the models to shit.
Glum-Atmosphere9248@reddit
Finally no AI comment
Caffdy@reddit
chat are we ballin' yet?
Dany0@reddit
No mistakes!
More-Curious816@reddit
Do it in Ralph loop mode.
Non-Technical@reddit
Wonderful! That’s the sweet spot for dense models on my machine.
mxforest@reddit
Which machine is that?
Non-Technical@reddit
Mac Studio M4 Max with only 36GB ram.
mxforest@reddit
Yeah that should be enough. I have the 128GB M4 Max so I much prefer 122B. Roughly same numbers as 27B dense but much much faster.
Non-Technical@reddit
I like to do story telling with the models so situational awareness and context are very important (like keeping track of who is in the room) so I have been using Q6 models which I *think* handle that better.
mxforest@reddit
You can also try asking the model to maintain notes on the current state of everybody and update it after every interaction.
Non-Technical@reddit
That's a good idea to try. Being stateless, it only knows what it can get from the system prompt or the past visible chat log. Outputting notes would help it keep track at the cost of feeling less immersive and taking up space in an already small context window. I read that Silly Tavern has a RAG database to maintain small details invisibly but haven't tried it yet.
LycanWolfe@reddit
Qwen3.6-27B-DFlash When?
caetydid@reddit
hoooly shite!
Southern_Sun_2106@reddit
Unsloth posted both ggufs, wwufs, and mlx
caetydid@reddit
found them already... iq4_xs is running at ~25 t/s on my rtx3090
Diligent-Detective97@reddit
What are the best settings for a 3090 and 64gb ram?
tecneeq@reddit
llama-server \
  --hf-repo unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL --alias Qwen3.6:27b \
  --no-mmap --host 0.0.0.0 --port 11337 --gpu-layers 99 --fit on --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --temperature 0.6 \
  --top-k 20 --top-p 0.95 --n-predict 32768 --ctx-size 65536
You probably can get away with a bit more context, like maybe 98304.
Diligent-Detective97@reddit
thx :)
stkt_bf@reddit
I know the constraints are tight, but is it possible to switch over from Sonnet 4.6? Should I just go out and buy an RTX 6000 300W?
Cesar55142@reddit
i really wanna see the LocalLLaMA opinion on user benchmarks for this. If it holds the reported benchmarks this will be a fun month.
Wise-Chain2427@reddit
3.6 35B was already received positively; the 27B dense should be even more so
Caffdy@reddit
more winning more better
Constandinoskalifo@reddit
+1
Healthy-Nebula-3603@reddit
Probably very good as 3.5 dense in coding was great ..much better than Gemma 4.
Haeppchen2010@reddit
Oh I was just tweaking my 3.6 MoE settings and enjoying the speed….. 🫣
Spirited_Maybe7374@reddit
How does this compare to the new Qwen3.6-35b-a3b? I also don't really understand the difference between qwen3.6-35b and qwen3.6-35b-a3b. Would be nice if someone can explain the difference between all 3
tecneeq@reddit
qwen3.6-35b and qwen3.6-35b-a3b mean the same model: the Qwen 3.6 35B, a mixture-of-experts model with 3B active parameters. Every input goes through about 3B parameters.
Qwen 3.6 27b is a dense model; every input goes through all 27B parameters. The result is better if all else is equal, but speed is slower. You can get more context with 27b because the weights are a bit smaller.
I get 50 t/s with 27b, 170 t/s with 35b-a3b on a RTX 5090.
VoiceApprehensive893@reddit
why are we comparing a 16 gigabyte vram card model to an expensive ass frontier model from <half a year ago
tecneeq@reddit
Because it's as clever. The only difference is it doesn't have as many facts and thus hallucinates more. Easily solved with web searches.
redpandafire@reddit
Which model for 16GB?
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/tree/main
tecneeq@reddit
Depends on the context. Q3_K_S with 32k context and q8_0 for the K/V cache should fit.
Enki_40@reddit
Would a 9b (if they release weights in this range) be useful for speculative decode to speed up 27b? Or is there some architectural reason in 27b design where speculative decode isn't a thing?
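There's no obvious architectural blocker: generic speculative decoding only needs a smaller draft model with the same tokenizer. A toy sketch of the draft-and-verify loop (greedy version with made-up integer "models"; real implementations compare full probability distributions rather than argmax):

```python
def speculative_step(draft, verify, prefix, k=4):
    """One round of draft-and-verify speculative decoding (greedy sketch).
    `draft` and `verify` are next-token functions for the small and large
    model; real implementations compare full distributions, not argmax."""
    # Small model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Large model checks the proposals in one (conceptually batched) pass.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        big = verify(ctx)
        if big == t:
            accepted.append(t)    # agreement: token comes almost for free
            ctx.append(t)
        else:
            accepted.append(big)  # first disagreement: keep big model's token
            break
    return accepted

# Toy "models" over integer tokens: draft counts up, verifier agrees until 3.
draft  = lambda ctx: ctx[-1] + 1
verify = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
print(speculative_step(draft, verify, [1]))  # [2, 3, 0]
```

The speedup depends entirely on the acceptance rate, so a hypothetical 9B sibling trained on the same data would likely be a good draft for the 27B, while methods like DFlash/DTree bake the drafting into the model itself.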
bcell4u@reddit
Qwen 3.6 4b for the vram poors please 🥺
LH-Tech_AI@reddit
YEAH!
viperx7@reddit
Our prayers have been answered
IBM296@reddit
How is this better than Qwen 3.6 35B??
corpo_monkey@reddit
Much better. Read the thread or google moe vs dense.
spaceman_@reddit
Can't wait for 122B soon...
Orolol@reddit
So the real question here is: why does Qwen seem to have a hard time scaling up? Qwen 3.6 27B and 35B are amazing, punching way above their weight (hehe), but Qwen 3.6 Max, supposedly the beefiest model of the family, the top-class frontier model, barely gets into SOTA territory.
NikolaTesla13@reddit
My guess would be GPUs, you can do a lot more training on a 27b model than a 1T one. Also more experiments, that can help a lot.
tmactmactmactmac@reddit
this is an arms race and I'm here for it
Blanketsniffer@reddit
Okay, now this will really start to undercut Western SOTA models. Even enterprises will deploy this model in their harnesses where it fits. Not to mention smaller models such as GPT mini, Haiku, and Flash will go kaput (they'll only be used by their own SOTA models in the harness).
MLExpert000@reddit
We are just testing it in Inferx. net. If anyone wants try it’s available. Chek it out.
pixelpoet_nz@reddit
Sorry, I don't take recommendations from spammers (post history) who struggle to write 5-letter words like "check". Thanks though.
MLExpert000@reddit
English is my third language, so yeah . typo on me. Just sharing since we’re testing it. Appreciate the patience.
StateSame5557@reddit
I made a mxfp4 in text mode that would fit on a smaller Mac
The model is really good even at mxfp4
https://huggingface.co/nightmedia/Qwen3.6-27B-mxfp4-Text-mlx
DOAMOD@reddit
Very first test: much better, less thinking and fewer syntax errors than 3.6-A3B
Informal-Victory8655@reddit
Can I use it in place of OpenAI GPT-OSS 20B?
_-_David@reddit
The 27b will require significantly more VRAM at the same 4bit quantization. I would suspect that at 3bit this model would still probably beat OSS 20b in many ways, but be slower. So your answer depends on the answer to the question "What place?"
Kodrackyas@reddit
Holy shit, are we deadass? That is a fucking frontier-model level of intelligence
meca23@reddit
122 B next please!
Mashic@reddit
Let's have a competition between gemma and qwen every month: gemma 4.1 > qwen 3.7 > gemma 4.2 > qwen 3.8 ...
SweetSeagul@reddit
Couldn't help but imagine this.
BasicBelch@reddit
So then why does Qwen3.6 35B exist if a smaller model beats it in every use case?
misha1350@reddit
Because dense models are for dGPUs, whereas MoE models are for mini-PCs, Macbooks and laptops without a dGPU. Qwen3.6 35B A3B runs at 10 tokens/sec on DDR4-3200 on my laptop, whereas Qwen3.6 27B would be unusable.
Hardly anyone has a dGPU with 20GB of VRAM or more to be able to run Qwen3.6 27B on a good level.
kweglinski@reddit
35b is a moe which does laps around 27 dense speed wise while performance wise is not as significant difference (as speed)
Adventurous-Paper566@reddit
35B MoE works with a 16Gb GPU.
charmander_cha@reddit
We urgently need the 9B version
aparamonov@reddit
I was debating this for a while: considering quantization KLD, is it better to run 27B Q4 with Q8 KV, or the 35B MoE at Q8 with f16 cache? Same speed for both on my setup with 20GB VRAM and 64GB RAM. What do you think, does Q8 vs Q4 outweigh the 27B outperforming the MoE?
Beginning-Window-115@reddit
just dont mess with the kv cache
cosmicnag@reddit
Holy fuk
Healthy-Nebula-3603@reddit
Pfff who need sleep ....
Status_Contest39@reddit
wordless... it is crazily amazing
SnooPaintings8639@reddit
I can't wait to get back home and set it up. So hyped. If it's better than the 35B version, it's guaranteed to be awesome!
qwen_next_gguf_when@reddit
Need to switch to this new 27b in all my flows.
Comfortable-Rock-498@reddit
I love this development and can't wait to try the model, but the terminal bench scores are 'non standard'
> * Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs.
Terminal-Bench 2.0 rules explicitly disallow modifying timeouts or the resources available. Each Terminal-Bench task has a timeout (usually under 1h, mostly under 30 mins) and resources configured in the Docker container by the task creator, and they are chosen that way to test specific model aspects.
XE004@reddit
Me want qwen 3.6 VL 4b 8q uncensored.
_-_David@reddit
Let's. Fuckin. GO
robbievega@reddit
after the shit show that is Opus 4.7, I can't wait to see if it really codes as well as these benchmarks promise
odikee@reddit
How could you be so naive?
Adventurous-Gold6413@reddit
YURRRRR
Crafty_Top_9366@reddit
How is it better than the 35B version?
Look_0ver_There@reddit
Dense vs MoE
27B will be much slower though ('cos it's dense)
Crafty_Top_9366@reddit
Phi
Healthy-Nebula-3603@reddit
Wait ...WHAT!!!
acetaminophenpt@reddit
Thanks!!!
random-trader@reddit
Not good at all. Almost useless. I gave it a try many times. Each time it failed to do anything.
silenceimpaired@reddit
Agentic use seems a strong focus.
ab2377@reddit
because the world's software is being built by agents controlled by humans, at speeds unimaginable just a few months ago.
Barry_Jumps@reddit
Are you tired of winning yet?
FullOf_Bad_Ideas@reddit
that's impressive
I think I'll switch over from Qwen 3.5 397B to this, it would be awesome if we get DFLash for it soon (or 3.5 version works fine)
deRTIST@reddit
qwen gods, i'm praying for 14b, make it happen :)
Technical-Earth-3254@reddit
This looks impressive. I wonder if it can now deliver the same real world performance as 4.5 Haiku or GPT 5 mini. Probably not, but we will get there.
DOAMOD@reddit
The king
Opteron67@reddit
holly s..
FunkySaucers@reddit
Holy s**
2muchnet42day@reddit
Whole-y s***
iwannaforever@reddit
The king is here
NoConcert8847@reddit (OP)
Seems to be better than Opus 4.5 😭
2Norn@reddit
benchmaxxing is a real thing so i wouldn't take it too seriously
let's just use it and decide
Dany0@reddit
I can confirm 3.6 35b was definitely benchmaxxed at least a little. Still a good model though
YogurtExternal7923@reddit
Alr why the F is it going toe to toe with opus? There's no way..
axiomatix@reddit
Inflection point #2.. first one was opus 4.5 back in November. If the gap to the moe model is the same in the 3.6 series as it was in 3.5, given how good the 3.6 moe is in a harness.. this is a blessing.
fragment_me@reddit
Someone uploaded a Q6_K GGUF: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF/tree/main
Yasuuuya@reddit
Absolutely insane - if it's anything like 3.6 35B was.. then we're entering a new era here.
ambient_temp_xeno@reddit
"we're pleased to share the first open-weight variant of Qwen3.6."
gaslighting but we'll forgive this time.
Borkato@reddit
Holy fucking shit YESSS
__some__guy@reddit
Just as I thought we wouldn't get a 27B 3.6.
Let's see if thinking is actually usable on that one.
Mancho_United@reddit
I, for one, welcome our new overlord
vogelvogelvogelvogel@reddit
woohoooo
Mountain_Chicken7644@reddit
And right when I got to work too...
gamblingapocalypse@reddit
Well… that’s quite the headline to wake up to.
densewave@reddit
Lets gooo
power97992@reddit
So they won't release the weights of 3.6 397B A17B?
Comacdo@reddit
Holy fucking shit
iMrParker@reddit
They never miss
nunodonato@reddit
Interesting that they still don't recommend the qwen3 xml parser
Voxandr@reddit
So do we have hope for 122B?
fragment_me@reddit
Lord have mercy!
ridablellama@reddit
holy crap no way
fishhf@reddit
GGUF when?
Blues520@reddit
That was quick!