Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose?
Posted by Storge2@reddit | LocalLLaMA | View on Reddit | 117 comments
Hello guys,
has anybody tested both on Evals and Benchmarks to see the difference?
I am running a DGX Spark 128GB machine and am contemplating which model to choose for Coding (Opencode) and Chat (Openwebui) - of course the speed will be higher with the 35B but has anybody here checked the Quality and Performance on Benchmarks for these two models? what are your experiences?
Artificial Analysis ranks the 35B 3.6 higher than the 122B 3.5 on Coding, on Agentic Use Cases and on the general Index.
Now i am worried that it's gonna perform worse than the 3.6 in terms of long running tool calling tasks. and in terms of its "Intelligence" / IQ. What are your experiences so far?
Impossible_Car_3745@reddit
used both on 2x rtx pro 6000. qwen 3.6 35b-a3b wins in all aspects. The performances are exactly like the bench ,i e., a bit better than 3.5- 122b . and is super fast. with mtp it gives 300 tps. just..plazingly fast
meca23@reddit
Sorry for stupid question. Buy what is MTP?
m_mukhtar@reddit
multi token prediction. It predicts multiple tokens per infersnce step and It makes token generation faster for models that support it. But you have to use an inference engine that implements it (as far as i know llama.cpp does NOT have it implemented yet. I know vllm have it implemented)
ksmathers@reddit
Multi token acceleration is a little more complex than just multiple tokens per inference step.
There are actually two distinct methods:
Speculative Decoding combines a smaller, faster "draft" model with a large, complete model. The fast model predicts the next \~5 tokens, and the large model verifies them in a single batch, stopping at the first divergence. This delivers 2.5x to 3x speedups and allows separate hardware acceleration (e.g., running the small model on an NPU and the large one on a GPU).
Native Multi-Token Prediction uses extra "prediction heads" built directly into a single model. During a single forward pass, these heads forecast multiple future tokens simultaneously. If the sequence matches the model's confidence threshold, all tokens are emitted at once, allowing the remaining n+1, n+2, ... tokens to be processed at the rate of token ingestion instead of token inference.
ubrtnk@reddit
I use qwen 3.6 for the families default model on a pair of 4080 and I get a little over 100t/s at 131k context. Might be time to look at vllm for 3.6 vs llama-swap with cpp
ArtfulGenie69@reddit
Pretty sure you don't have to dump llama-swap it should handle the vllm commands and hosting just fine. That way you just get it built do the command you want and point at your models but then make a spot for it in the llama-swap yaml. They should officially support it :)
ubrtnk@reddit
Oh they do, just gotta figure it out. The image I currently run doesn't have docker in docker support so have to find the new image first
Infantryman1977@reddit
You don't have 131K context. Send 131K in one shot to your LLM and your 4080's will run OOM. Setting the context to 131K does not mean that's the context that's available. It is just a max setting. So many people don't understand that, you are not alone. I am running 3.6 in BF16 on quad 3090 which leaves me around 25GB for the context. Context is set to 128K and is also based on the vLLM estimate. If I am sending 140K in one shot, it will OOM. Therefore, 128K is fine. So, don't tell us you are running your pair of 4080's at 131K. The KV Cache alone is 25GB...do you have that much VRAM? lol
ubrtnk@reddit
That's fair lol. I mean I do, I just limit 3.6 to the 4080s. Max context I have configured is 131k lol
Every-Comment5473@reddit
Can you share your vllm command for running the qwen 3.6 with mtp?
Impossible_Car_3745@reddit
why not
```
MODEL="Qwen3.6-35B-A3B-FP8"
PORT=8080
VLLM_PORT=8000
docker run -d --rm --name vllm-server \
--gpus all \
-p ${PORT}:${VLLM_PORT} \
-v \~/local-models:/models \
-e NCCL_P2P_DISABLE=1 \
-e NCCL_DEBUG=INFO \
-e VLLM_USE_DEEP_GEMM=0 \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm/vllm-openai:v0.19.0 \
/models/${MODEL} \
--disable-custom-all-reduce \
--attention-backend FLASHINFER \
--max-model-len 524288 \
--tensor-parallel-size 2 \
--max-num-seqs 11 \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.926 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port ${VLLM_PORT} \
--dtype auto \
--tokenizer-mode auto \
--limit-mm-per-prompt '{"image":5, "video":0}' \
--speculative-config '{"method":"mtp", "num_speculative_tokens":2}' \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
--served-model-name qwen3p6
```
Voxandr@reddit
How about 3.6 knowledge Breath compare to 122b. have you tested it with niche frameworks like svelte? 122b can handle svelte well.
Apprehensive-View583@reddit
It’s fine cause knowledge can be extended by searching and rag, if it’s fast enough you don’t need that much knowledge
Voxandr@reddit
In Theory.
DarkEye1234@reddit
Small projects are reasonably well working. But you will need context7 all the time with 35b model
Impossible_Car_3745@reddit
svelte? no test
Storge2@reddit (OP)
Thats insane speed.
BankjaPrameth@reddit
I find 122B is better than 35B. It’s slower for sure but it can get things done more correctly and thoroughly. So I decided to stick with 122B.
However, last week 122B got stuck with the problem for hours so I decided to try free 397B via Ollama Cloud and find myself stunning on the quality difference. 397B easily solved mostly everything in single run (Hit the 5hr limit in like 10 minutes though).
They said with single DGX Spark, you leave the $1,700 ConnectX-7 port unused. So…. I just received my second Spark and still waiting for QSFP cable to connect between them to run 397B on dual Spark.
I hope you don’t find yourself follow my steps.
Caffdy@reddit
did you get your cable in order to connect your Sparks? if so, have you tried already Qwen 397B?
BankjaPrameth@reddit
Yep, read more here https://www.reddit.com/r/LocalLLaMA/s/c8OlqlXi52
floconildo@reddit
Care to expand a bit more on your 397B plans with DGX Spark? I’m in the research phase of bumping my specs and running 397B would be very nice if I manage to do so at proper speeds and without spending tens of thousands 😄
BankjaPrameth@reddit
Sure! I follow the these links
And if you can wait until I got my cable to test I can report back the result later.
My rush buy was because I got it at around $3,812 per unit and I believe this price or updated model won’t show up again for a long time.
floconildo@reddit
I can definitely wait haha. Thanks!
BankjaPrameth@reddit
I just received my cable today and finished setting 2 Sparks cluster.
Loading Qwen3.5 397B-A17B model completely deplete the RAM. You'll have like only 1-2 GB of ram for other processes.
I've tested much yet. But as far as I can feel with OpenCode, the 397B has very strong debug capability. It's one shot every problem so far and use token very efficiently. The token generation is about 28 t/s as expected. Instruction following, delegation, tool uses are noticeably better than 122B.
Lastly, with Dual Spark, I can now also choose to run Minimax M2.7 or Deepseek 4 Flash too. It opens another world of opportunity.
Last of last, I hope you don’t find yourself follow my steps.
floconildo@reddit
You followed through! Thanks a lot for that haha
Damn, 397B sounds like a beast. How long does it take to load the model?
And don't worry, gonna be waiting a few more months before I commit to a spec bump. My Strix Halo is still going strong.
BankjaPrameth@reddit
LOL!
Loading the model takes only 5 minutes with vLLM flag --load-format instanttensor. I was very surprised too. I thought it should take like 10+ minutes.
Let’s wait for Qwen 3.6 122B and 397B update and you will be able to enjoy your Strix Halo longer! (I have high hope for these 2 models)
Caffdy@reddit
that's a goddamn good deal, here the spark is almost double that
Prudent-Ad4509@reddit
Are you planning to run a 4bit quant or shoot for higher ?
Caffdy@reddit
remindme! 1week
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-04-28 01:46:44 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
^(Parent commenter can ) ^(delete this message to hide from others.)
AncientGrief@reddit
Qwen3.5 397B A17B UD-IQ4_XS 4-Bit Quant for dual Spark? or 3-Bit? Wonder how good these version perform vs the Cloud variant.
BankjaPrameth@reddit
It will be int4-AutoRound running via vLLM https://huggingface.co/Intel/Qwen3.5-397B-A17B-int4-AutoRound
East-Ferret6439@reddit
tu peux essayer avec ce moteur d'inférence aussi, il devrait être beaucoup plus rapide et flexible:
deharoalexandre-cyber/EIE: A generic, policy-driven, multi-model GGUF inference server. TurboQuant-native. CUDA + ROCm
ang3l12@reddit
Does it support RPC to use two hosts though?
Storge2@reddit (OP)
Yeah I am scared too hope I don't get dragged into the Compute hole.
po_stulate@reddit
You can use nvidia's free api for 397b too, sometimes it's very slow but it's free.
mitreffahcs@reddit
The biggest issue I have with 3.6 is it's tendency to go in massive thought-spirals and I haven't figured out how to reel that in. I've used 3.5 122B and it's good at RE stuff, using MCP for JADX and IDA. And I get about 55 tokens/sec on my Mac Studio. I'm trying 3.6 again and it's mostly been thinking the entire time I've been writing this post.
HornyGooner4402@reddit
OP you know you can download, try both, and delete the one you don't like, right? It took me a year to finally delete the old models I don't like or use, but it can be done
NotumRobotics@reddit
We keep an archive, deleting them feels like murder.
mecshades@reddit
I'm glad I am not the only one that feels this way. Is this data hoarding?
Makers7886@reddit
doomsday prepping for the cultured
NotumRobotics@reddit
OMG yes. I keep thinking "Sure, but what if I'm randomly transported to 1735. via micro-wormhole, what LLM will I have with me?!"
ProfessionalSpend589@reddit
I let live only the models with the proper weights and lineage.
My NAS (4TB) is not enough to support all models anyway.
Refefer@reddit
I bought a NAS to never feel that pain again. I'm up to 9 tbs of models now :D
Excellent-Skirt8115@reddit
Ah didn't know you're active here Dario
Storge2@reddit (OP)
Will test it. thanks.
HornyGooner4402@reddit
Good luck, let us know your results
xenophonf@reddit
Benchmarking anything is notoriously difficult to get right. It requires a lot of skill and experience. Blowing someone off by telling them to, in effect, become qualified subject-matter experts without also coaching them on how is... unhelpful, to say the least.
nakedspirax@reddit
The OP needs to test the models for their use case. There's no blowing someone off here. They are both very capable models but you have to test it yourself.
jikilan_@reddit
Ollama? 😅
a-babaka@reddit
Tried both. Both are bad in real java monolith project.
AdamDhahabi@reddit
Which model are you preferring for your use case?
a-babaka@reddit
Codex 5.4. none of local does the job well yet. I spent bunch of time to test qwen models. Despite everyone's delight, they behaved badly in my work project.
TheHonestCTO@reddit
maybe you have a very low quant. In my experience, 3.5 122B Q8_k_XL is very capable for any project I throw at it in any of the major languages.
Dry_Yam_4597@reddit
In terms of tool calling 3.6 is an absolute beast.
rorowhat@reddit
Even compared to the 122B model???
relmny@reddit
In a multi-turn chat, where minimax-m2.7 made an very wrong statement, qwen3.5-397b, when asked "check previous answer" said it was right (tried a few times and also 1-2 turns), asked it qwen3.6-35b, and it said the statement was wrong, asked it again, and it said the same.
I'm, for now, very surprised by qwen3.6-35b.
imac@reddit
Well, clearly you are talking about running local quants with decent accuracy .. pretty sure this thread is framed at 128GB ;
relmny@reddit
I don't understand.
Anyway, the Minimax one I used to run was q4-km so it should be good enough.
milpster@reddit
i tried the 122b model and the 27b model just before switching to 3.6 and they both appeared way dumber than 3.6
danish334@reddit
I can attest to that.
imac@reddit
at 200+ turns, reasoning/thinking enabled, when n_predict overflows 32k .. I doubt you are going beat 122B with the right params.. I use 3.6-35B for comms, and stuff with much lower output tokens; It isn't like you can just re-test harness scenarios in a blink, so at 1 week, I think you would be hard pressed to say anything is like 122B yet. I am about to replace Gemma4 with 3.6-35B as a reviewer for 122B .. and expect some lift there.
danish334@reddit
Scenarios are a factor. My use case was more of high token input like 80k tokens (multi-conversation) and predicting about 10k at each turn. Both 122b and 27b used too much thinking budget which I expected considering circular reasoning but 3.6 35b FP8 had just the right reasoning with reasonable output. My previous working model was gpt oss120b but it was dumb beyond repair at medium reasoning sometimes and at high reasoning, it was even worse due to nonstop reasoning.
575_Inverse@reddit
qwen3.5-122b is actually way better. But you have to use the correct parameters.
If you use the default temperature of 0.1, no wonder it looks like its head is stuck up its own ass.
I use Temperature 0.7 or more, Top K sampling 0. Repeat penalty 1.01. no presence penalty. Top P sampling 1.0, min P sampling 0.1
With these parameters, the 122b shows clear signs of actual reasoning and blows away everything else. The only downside is my RIG, not enough RAM to make it run fast.
If I want 3.6-35b to get closer to this, I have to bring the temperature to 0.9 minimum, and use the same parameters listed above. If I bring the temperature to 1.0 the thinking starts breaking... pulling hallucinations.
my_name_isnt_clever@reddit
I've been running the 122b for awhile, but since trying the 3.6 35b I haven't even loaded it. The 35b isn't as intuitive, but it's so much faster and lighter on my system while still nailing agentic tasks, I've been sticking with it. It's an incredible model.
DistanceSolar1449@reddit
All the newest Chinese models are trained on a trillion tokens of openclaw data, and it shows
Dry_Yam_4597@reddit
Just to be clear - i wasnt comparing. All i said is that it's doing an amasing job. So if one wants to save money they can use it just fine.
Storge2@reddit (OP)
Tried both, seem on par in terms of tool calling and intelligence, I am running now the 3.6 in FP8. Amazing how fast the Intelligence/GB of Model increases. it wasn't even two months since the 122B released and its already matched by 35B models.
Steus_au@reddit
second to it
Storge2@reddit (OP)
I will try that one.
Evening-Fox9785@reddit
Qwen 3.6 at Q8_K_XL is what I am running over Qwen 3.5 122B IQ_3_S, it may be marginally less capable but speed makes up for it
TheItalianDonkey@reddit
One thing i don't understand, if you're able to run Q8 K XL, you should be able to run Q6 of 122b ... why are you running Q3?
Evening-Fox9785@reddit
i’ll try IQ4_NL
i have around 70gb vram and 60gb ram
imac@reddit
Yeah, if you can't run 122B at a decent quality for long context, you are always going to be happer with 3.6-35B
TheItalianDonkey@reddit
Math isn't mathing?
Evening-Fox9785@reddit
Q6 is 100 gb+ no? I’d like to try it but I don’t think i’ll be able to fit the max context 262k tokens without quantizing the kv cache
TheItalianDonkey@reddit
you're actually right, my bad - you can run q8 on about 60, and for the q6 122b its more about 90-100
575_Inverse@reddit
Unless your rig is VERY expensive, expect 122b to spill into swap, that makes it remarkably slow, unfortunately, because it's otherwise almost flawless.
575_Inverse@reddit
If 122b starts spilling to swap it gets very very slow.
TheItalianDonkey@reddit
i was just wrong with mental calcs, he can run q8 k xl and not be able to run q6 122b, simple as that
Storge2@reddit (OP)
Will try and check to see how they perform.
PandaBearFred@reddit
well,I don't have much experience on 3.6-35B because I just switched to it 2 days ago. but I could tell the difference. you could test following prompts in with qwen coder cli by yourself:
"帮我用html实现一个电脑桌面,类似windows的风格,点开开始按钮后里面自带4个小程序:1. 计算器 2. 汇率换算小程序 3. 贪吃蛇小游戏 4. 文本编辑器"
It's in Chinese, just a simple request asking the model to write an HTML to simulate a windows style desktop that includes 4 apps in it: 1. a calculate, 2. a currency exchange rate converter, 3. a snake game 4. a text editor.
This test won't take long, I tried it with Qwen3.5-397B-A17B@Q4, Qwen3.5-122B-A10B@Q4, Qwen3.5-27B@Q4, Qwen3.5-35B-A3B, and Qwen3.6-35B-A3B. All of them can "Finish" the challange in minutes.
But, Qwen3.6 is the fastest and produced the best result ( yes, even better than 3.5-397B). The worst is Qwen3.5-35B-A3B, it was fast and can "finish" the job in less than 2 minutes (similar as Qwen3.6), but all the apps were not functional, feels like they're just mockups. Others, all built functional apps, but some of them have bugs.
My result: Qwen3.6 > Qwn3.5-397B > Qwen3.5-27B > Qwen3.5-122B > Qwen3.5-35B-A3B (IN THIS TEST!!)
vulcan4d@reddit
GPT-OSS-120B is underrated and I find it better of either of these but I don't code so maybe that is why. Also it tunes much better in llama.cpp so the performance is real good. Have yet to find a better one for the size & speed.
ravage382@reddit
I used both for actual agent based work last week using skill files and they both have their place..
122B is better all around out of the box, but its bigger and the speed drop snowballs pretty fast in my setup around 45k tokens of PP. I would give it my initial prompt, 1 or more skill files and then have it do something. By the time its ran for a few minutes, the context would start to pile up. At that point, my cache may or may not break and I have to reprocess everything for the next prompt of "Take the information you learned and update the skill files."
More often than not, I would have to wait 10 minutes for the PP to finish because the cache was broken. What I found was Qwen 3.6 was just as capable of looking over all the data that Qwen 3.5 122b had just churned and could make an update to the skill file, while only taking 45 seconds to PP and produce the update.
I did see there were some llama.cpp improvements to caching for those and speculative decoding, so it may be better today when I am using it.
The other thing I noticed is if I had 3.6 35b use the skill that had been created by 122b, it performed just as well as 122b did using the same skill file.
Caffdy@reddit
how do you know that the cache is broken?
ravage382@reddit
I was seeing errors from amdgpu in dmesg and having device resets and that was breaking my cache. It seems better after this weeks llama.cpp updates.
Caffdy@reddit
do you think is an issue with AMD in particular?
ravage382@reddit
The rocm stack isn't as mature as nvidias, but it's definitely improved in the last year. This problem in particular I saw noted in a llama.cpp GitHub issue when I was researching the GPU. I was hoping it was fixed, but not yet it seems.
mangoking1997@reddit
Why don't you just try them? Why worry, just test it and see what you prefer, you already have the hardware.
Havarem@reddit
This! I'm a teacher and the number of time people waited 15 minutes for me to be available to ask me a question they could have just tested in those 15 minutes is astronomical. Ok in that case it will take more than 15 minutes but still why nit do it :)
TheItalianDonkey@reddit
To be fair, how do you "test" LLMs in 15 mins?
Local benchmarks take anywhere from hours to days.
What remains is testing on feelings and asking if we should take the car to the carwash, which is ... meh ... from a production standpoint imho.
Caffdy@reddit
if you don't already have a framework of work ready to be tested, using LLMs will always be a test of feelings
TheItalianDonkey@reddit
We’re talking about time to benchmark, and I stand by my point that benchmarking an LLM in 15 mins is impossible.
Caffdy@reddit
yeah, maybe I got misinterpreted, I thought you were complaining that the other guy didn't share some insight or quick reply on the performance of the model. I agree that testing a model in 15 mins is just not realistic. Heck, I've been testing 122B the whole day with just one use-case (trying to fine-tune a markdown instruction document for it) and I'm not even done yet
jopereira@reddit
I figured this out long time ago and even if I was available, I make them wait/try to solve the problem first. With time, the calls numbers went down a lot and people gained skills.
Havarem@reddit
My reaction is "let's try it"
jopereira@reddit
Me too, and I think most of us do the same. It's the curiosity that moves us. I think we still have Grok Code Fast 1 (optimized) for free in Kilo Code but I've spent a day working with Qwen3.6 just because it does the same (for me, in that particular case) and I've even solved a problem I was unable to solve with Grok (embedded systems).
Storge2@reddit (OP)
Yeah you are right, will do sir.
iamapizza@reddit
Send it to me for testing. I'll definitely need to run a number of tests though before telling you which one you could have gone for.
EveningGold1171@reddit
3.5 122b is smarter, but 3.6 has clearly had more rl training. 3.6 122b should be very interesting if they release it.
DaniDubin@reddit
For my usecase - Hermes Agent, doing long-context conversion with lots of tool calls, Qwen3-122B is much smarter and consistent. Qwen-3.6-35B breaks after ~50-60k tokens, keeps repeating wrong solution and generally performs worse.
my_name_isnt_clever@reddit
What quantizations are you using? I've been using Hermes with Qwen 3.6 25b Q6 and I haven't had any issues, it's just less intuitive than the 122b. But the speed is crazy.
DaniDubin@reddit
I use it on Mac Studio M4 Max (128). Tried Qwen-3.6-35b Q8 mlx. I mean yea the speed is super fast, and I managed to write good code with it, but on shorter context. Maybe will give it another try, will increase temp to lower chances of repetitive loops. Anyway I’m hoping they’ll release Qwen-3.6-122B, i think its size is a sweet spot for 128gb systems (the 4-6 bits).
my_name_isnt_clever@reddit
I don't have issues with loops when using it with an agent harness like Hermes. I'm using the Q6 with offically recommended sampling params and it kills.
spaceman_@reddit
I was using 122B before, but 3.6 35B is in my subjective experience good enough for agentic coding and much faster on my setup (35B Q6 fits in one 32GB GPU vs 122B Q4 which I have to spread across 3 GPUs) so have been mostly using the new 35B since last weekend.
I'm eagerly awaiting the 122B update though.
Sabin_Stargem@reddit
3.6. It has been punching way above its weight, and is faster. It doesn't get into looping nearly as often with the thinking.
I deleted 122b and a quantized 397b, the 3.6 35b is just that good.
Prudent-Ad4509@reddit
One extra thing to keep in mind. I run 3-bit quant of 122B for coding and it works most of the time better than 35B with 8-bit quant. But I've recently tried to task them both with visual mechanical tasks aaaand... poof. Total collapse. The one with 3-bit even started to forget its working directory.
So, as long as you use them only for coding, you can experiment and switch between them. But when you move significantly far away from coding, quantization becomes a much bigger issue than lower knowledge base.
Terminator857@reddit
122B q4 worked better for me. 3.6 q8 got stuck in a loop. Haven't had that issue with 122B.
Steus_au@reddit
qwen3.6 does better for me but i’m not coding. better because its faster on my 5060ti and actually listens what i ask and capable use tools like tavily when needed.
AustinM731@reddit
3.6 feels smarter somehow. If you have tools available in your environment, it is very good at using them and will ground itself with Internet searches if you feed it a MCP like Brave or Tavily. I was running 122b as my daily driver, but I have since switched to 3.6 in the past few days.
Lucis_unbra@reddit
122B for anything knowledge related, and at least GLSL programming... Although Gemma 31B runs circles around Qwen for that language at least.
3.6 does patch up a bunch of issues 3.5 had. When it tried to do glsl, the 35b moe would usually change its mind during the code generation, even after reasoning. It doesn't do that anymore.
I tired using 3.6 for a demo, making a simple path tracer. Gemma made one mistake, flipped the camera, but had no issues.
Qwen 3.6 kept making mistakes.
I'd try both 122B and 3.6, and if possible, Gemma 4 31B.
They all hit different areas differently. But, 3.6 is shaping up nicely.
PassengerPigeon343@reddit
Depending on your use case and hardware your results may vary but for me, the speed of 3.6 makes it the easy choice. Fast tool calls, fast information processing, fast output. It’s amazing.
AlwaysLateToThaParty@reddit
I find the 122b heretic mxfp4_moe model the best all rounder for 75GB of VRAM. 35B may be good at some other use-cases, but i haven't felt any need to change. Maybe if we get a 122B 3.6 model.
Due_Net_3342@reddit
try step 3.5 flash it is better than 122b
Front-Relief473@reddit
123g/128g after step fun deployed iq4xs, oh, I don't do anything else.
Due_Net_3342@reddit
not true, i am using it on strix halo 128gb with 128k context and q8 kv cache
ilintar@reddit
3.6 35B and I don't think it's even close considering all factors included.