Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose?

Posted by Storge2@reddit | LocalLLaMA | View on Reddit | 71 comments

Hello guys,
has anybody tested both on Evals and Benchmarks to see the difference?

I am running a DGX Spark 128GB machine and am contemplating which model to choose for Coding (Opencode) and Chat (Openwebui) - of course the speed will be higher with the 35B but has anybody here checked the Quality and Performance on Benchmarks for these two models? what are your experiences?

Artificial Analysis ranks the 35B 3.6 higher than the 122B 3.5 on Coding, on Agentic Use Cases and on the general Index.

Now i am worried that it's gonna perform worse than the 3.6 in terms of long running tool calling tasks. and in terms of its "Intelligence" / IQ. What are your experiences so far?

[-]

mangoking1997@reddit

Why don't you just try them? Why worry, just test it and see what you prefer, you already have the hardware.

[-]

Havarem@reddit

This! I'm a teacher and the number of time people waited 15 minutes for me to be available to ask me a question they could have just tested in those 15 minutes is astronomical. Ok in that case it will take more than 15 minutes but still why nit do it :)

[-]

jopereira@reddit

I figured this out long time ago and even if I was available, I make them wait/try to solve the problem first. With time, the calls numbers went down a lot and people gained skills.

[-]

Havarem@reddit

My reaction is "let's try it"

[-]

jopereira@reddit

Me too, and I think most of us do the same. It's the curiosity that moves us. I think we still have Grok Code Fast 1 (optimized) for free in Kilo Code but I've spent a day working with Qwen3.6 just because it does the same (for me, in that particular case) and I've even solved a problem I was unable to solve with Grok (embedded systems).

[-]

TheItalianDonkey@reddit

To be fair, how do you "test" LLMs in 15 mins?

Local benchmarks take anywhere from hours to days.

What remains is testing on feelings and asking if we should take the car to the carwash, which is ... meh ... from a production standpoint imho.

[-]

Storge2@reddit (OP)

Yeah you are right, will do sir.

[-]

iamapizza@reddit

Send it to me for testing. I'll definitely need to run a number of tests though before telling you which one you could have gone for.

[-]

DaniDubin@reddit

For my usecase - Hermes Agent, doing long-context conversion with lots of tool calls, Qwen3-122B is much smarter and consistent. Qwen-3.6-35B breaks after ~50-60k tokens, keeps repeating wrong solution and generally performs worse.

[-]

HornyGooner4402@reddit

OP you know you can download, try both, and delete the one you don't like, right? It took me a year to finally delete the old models I don't like or use, but it can be done

[-]

Storge2@reddit (OP)

Will test it. thanks.

[-]

HornyGooner4402@reddit

Good luck, let us know your results

[-]

xenophonf@reddit

Benchmarking anything is notoriously difficult to get right. It requires a lot of skill and experience. Blowing someone off by telling them to, in effect, become qualified subject-matter experts without also coaching them on how is... unhelpful, to say the least.

[-]

nakedspirax@reddit

The OP needs to test the models for their use case. There's no blowing someone off here. They are both very capable models but you have to test it yourself.

[-]

NotumRobotics@reddit

We keep an archive, deleting them feels like murder.

[-]

mecshades@reddit

I'm glad I am not the only one that feels this way. Is this data hoarding?

[-]

Makers7886@reddit

doomsday prepping for the cultured

[-]

ProfessionalSpend589@reddit

I let live only the models with the proper weights and lineage.

My NAS (4TB) is not enough to support all models anyway.

[-]

Refefer@reddit

I bought a NAS to never feel that pain again. I'm up to 9 tbs of models now :D

[-]

Excellent-Skirt8115@reddit

Ah didn't know you're active here Dario

[-]

jikilan_@reddit

Ollama? 😅

[-]

Evening-Fox9785@reddit

Qwen 3.6 at Q8_K_XL is what I am running over Qwen 3.5 122B IQ_3_S, it may be marginally less capable but speed makes up for it

[-]

TheItalianDonkey@reddit

One thing i don't understand, if you're able to run Q8 K XL, you should be able to run Q6 of 122b ... why are you running Q3?

[-]

Evening-Fox9785@reddit

i’ll try IQ4_NL

i have around 70gb vram and 60gb ram

[-]

TheItalianDonkey@reddit

Math isn't mathing?

[-]

Evening-Fox9785@reddit

Q6 is 100 gb+ no? I’d like to try it but I don’t think i’ll be able to fit the max context 262k tokens without quantizing the kv cache

[-]

TheItalianDonkey@reddit

you're actually right, my bad - you can run q8 on about 60, and for the q6 122b its more about 90-100

[-]

Storge2@reddit (OP)

Will try and check to see how they perform.

[-]

Impossible_Car_3745@reddit

used both on 2x rtx pro 6000. qwen 3.6 35b-a3b wins in all aspects. The performances are exactly like the bench ,i e., a bit better than 3.5- 122b . and is super fast. with mtp it gives 300 tps. just..plazingly fast

[-]

meca23@reddit

Sorry for stupid question. Buy what is MTP?

[-]

m_mukhtar@reddit

multi token prediction. It predicts multiple tokens per infersnce step and It makes token generation faster for models that support it. But you have to use an inference engine that implements it (as far as i know llama.cpp does NOT have it implemented yet. I know vllm have it implemented)

[-]

ubrtnk@reddit

I use qwen 3.6 for the families default model on a pair of 4080 and I get a little over 100t/s at 131k context. Might be time to look at vllm for 3.6 vs llama-swap with cpp

[-]

Voxandr@reddit

How about 3.6 knowledge Breath compare to 122b. have you tested it with niche frameworks like svelte? 122b can handle svelte well.

[-]

Apprehensive-View583@reddit

It’s fine cause knowledge can be extended by searching and rag, if it’s fast enough you don’t need that much knowledge

[-]

DarkEye1234@reddit

Small projects are reasonably well working. But you will need context7 all the time with 35b model

[-]

Impossible_Car_3745@reddit

svelte? no test

[-]

Storge2@reddit (OP)

Thats insane speed.

[-]

Prudent-Ad4509@reddit

One extra thing to keep in mind. I run 3-bit quant of 122B for coding and it works most of the time better than 35B with 8-bit quant. But I've recently tried to task them both with visual mechanical tasks aaaand... poof. Total collapse. The one with 3-bit even started to forget its working directory.

So, as long as you use them only for coding, you can experiment and switch between them. But when you move significantly far away from coding, quantization becomes a much bigger issue than lower knowledge base.

[-]

a-babaka@reddit

Tried both. Both are bad in real java monolith project.

[-]

AdamDhahabi@reddit

Which model are you preferring for your use case?

[-]

a-babaka@reddit

Codex 5.4. none of local does the job well yet. I spent bunch of time to test qwen models. Despite everyone's delight, they behaved badly in my work project.

[-]

Terminator857@reddit

122B q4 worked better for me. 3.6 q8 got stuck in a loop. Haven't had that issue with 122B.

[-]

Dry_Yam_4597@reddit

In terms of tool calling 3.6 is an absolute beast.

[-]

rorowhat@reddit

Even compared to the 122B model???

[-]

DistanceSolar1449@reddit

All the newest Chinese models are trained on a trillion tokens of openclaw data, and it shows

[-]

milpster@reddit

i tried the 122b model and the 27b model just before switching to 3.6 and they both appeared way dumber than 3.6

[-]

danish334@reddit

I can attest to that.

[-]

Dry_Yam_4597@reddit

Just to be clear - i wasnt comparing. All i said is that it's doing an amasing job. So if one wants to save money they can use it just fine.

[-]

Steus_au@reddit

second to it

[-]

Storge2@reddit (OP)

I will try that one.

[-]

Steus_au@reddit

qwen3.6 does better for me but i’m not coding. better because its faster on my 5060ti and actually listens what i ask and capable use tools like tavily when needed.

[-]

AustinM731@reddit

3.6 feels smarter somehow. If you have tools available in your environment, it is very good at using them and will ground itself with Internet searches if you feed it a MCP like Brave or Tavily. I was running 122b as my daily driver, but I have since switched to 3.6 in the past few days.

[-]

Lucis_unbra@reddit

122B for anything knowledge related, and at least GLSL programming... Although Gemma 31B runs circles around Qwen for that language at least.

3.6 does patch up a bunch of issues 3.5 had. When it tried to do glsl, the 35b moe would usually change its mind during the code generation, even after reasoning. It doesn't do that anymore.

I tired using 3.6 for a demo, making a simple path tracer. Gemma made one mistake, flipped the camera, but had no issues.

Qwen 3.6 kept making mistakes.

I'd try both 122B and 3.6, and if possible, Gemma 4 31B.

They all hit different areas differently. But, 3.6 is shaping up nicely.

[-]

ravage382@reddit

I used both for actual agent based work last week using skill files and they both have their place..

122B is better all around out of the box, but its bigger and the speed drop snowballs pretty fast in my setup around 45k tokens of PP. I would give it my initial prompt, 1 or more skill files and then have it do something. By the time its ran for a few minutes, the context would start to pile up. At that point, my cache may or may not break and I have to reprocess everything for the next prompt of "Take the information you learned and update the skill files."

More often than not, I would have to wait 10 minutes for the PP to finish because the cache was broken. What I found was Qwen 3.6 was just as capable of looking over all the data that Qwen 3.5 122b had just churned and could make an update to the skill file, while only taking 45 seconds to PP and produce the update.

I did see there were some llama.cpp improvements to caching for those and speculative decoding, so it may be better today when I am using it.

The other thing I noticed is if I had 3.6 35b use the skill that had been created by 122b, it performed just as well as 122b did using the same skill file.

[-]

PassengerPigeon343@reddit

Depending on your use case and hardware your results may vary but for me, the speed of 3.6 makes it the easy choice. Fast tool calls, fast information processing, fast output. It’s amazing.

[-]

AlwaysLateToThaParty@reddit

I find the 122b heretic mxfp4_moe model the best all rounder for 75GB of VRAM. 35B may be good at some other use-cases, but i haven't felt any need to change. Maybe if we get a 122B 3.6 model.

[-]

Due_Net_3342@reddit

try step 3.5 flash it is better than 122b

[-]

Front-Relief473@reddit

123g/128g after step fun deployed iq4xs, oh, I don't do anything else.

[-]

Due_Net_3342@reddit

not true, i am using it on strix halo 128gb with 128k context and q8 kv cache

[-]

BankjaPrameth@reddit

I find 122B is better than 35B. It’s slower for sure but it can get things done more correctly and thoroughly. So I decided to stick with 122B.

However, last week 122B got stuck with the problem for hours so I decided to try free 397B via Ollama Cloud and find myself stunning on the quality difference. 397B easily solved mostly everything in single run (Hit the 5hr limit in like 10 minutes though).

They said with single DGX Spark, you leave the $1,700 ConnectX-7 port unused. So…. I just received my second Spark and still waiting for QSFP cable to connect between them to run 397B on dual Spark.

I hope you don’t find yourself follow my steps.

[-]

AncientGrief@reddit

Qwen3.5 397B A17B UD-IQ4_XS 4-Bit Quant for dual Spark? or 3-Bit? Wonder how good these version perform vs the Cloud variant.

[-]

BankjaPrameth@reddit

It will be int4-AutoRound running via vLLM https://huggingface.co/Intel/Qwen3.5-397B-A17B-int4-AutoRound

[-]

East-Ferret6439@reddit

tu peux essayer avec ce moteur d'inférence aussi, il devrait être beaucoup plus rapide et flexible:
deharoalexandre-cyber/EIE: A generic, policy-driven, multi-model GGUF inference server. TurboQuant-native. CUDA + ROCm

[-]

ang3l12@reddit

Does it support RPC to use two hosts though?

[-]

Storge2@reddit (OP)

Yeah I am scared too hope I don't get dragged into the Compute hole.

[-]

floconildo@reddit

Care to expand a bit more on your 397B plans with DGX Spark? I’m in the research phase of bumping my specs and running 397B would be very nice if I manage to do so at proper speeds and without spending tens of thousands 😄

[-]

BankjaPrameth@reddit

Sure! I follow the these links

https://forums.developer.nvidia.com/t/qwen3-5-397b-a17b-run-in-dual-spark-but-i-have-a-concern/361967
https://github.com/eugr/spark-vllm-docker

And if you can wait until I got my cable to test I can report back the result later.

My rush buy was because I got it at around $3,812 per unit and I believe this price or updated model won’t show up again for a long time.

[-]