TheaterFire

Qwen 3 235b beats sonnet 3.7 in aider polyglot

Posted by Independent-Wind4462@reddit | LocalLLaMA | View on Reddit | 93 comments

Win for open source

93 Comments

skrshawk@reddit

How can the cost of running the model be evaluated in comparison? I suspect it would be quite favorable, but, for instance, if renting GPUs, how many would you need and how long would the run take? Alternatively, what are API services charging per token, and how many tokens did it take?
View on Reddit #55379476

a_beautiful_rhind@reddit

free on openrouter.
View on Reddit #55381954

Fit_Voice_3842@reddit

$0.15/M input tokens, $0.60/M output tokens? How is it free?
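For a rough sense of how per-token pricing turns into a bill, here is a minimal Python sketch. Only the two prices come from the comment above; the token counts and the function name `api_cost_usd` are made up for illustration and are not measured from any benchmark run.

```python
# Prices quoted in the comment above (USD per million tokens).
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for a run at the quoted per-token prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical token counts, for illustration only.
example_cost = api_cost_usd(input_tokens=2_000_000, output_tokens=500_000)
print(f"${example_cost:.2f}")  # prints "$0.60"
```

Plugging in the real token counts from a benchmark run (if the provider reports them) would give a figure comparable to the cost column of a leaderboard.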
View on Reddit #55666490

a_beautiful_rhind@reddit

it was free.. 3 days ago
View on Reddit #55676835

Lpaydat@reddit

The costs of OpenAI models are always absurd for me.
View on Reddit #55387328

power97992@reddit

They need to pay for their R&D and make a profit!
View on Reddit #55493268

Correct-Dimension786@reddit

I'm not sure what's going on, but Poe is charging only 40 points (you get 1 million for $20) for every message to this bot. It may even include 100k context at that 40-point price, but I haven't tested that beyond sending it the text from a PDF and having it write a song about it. Anyway, it wrote some amazing songs and I'm liking it so far, but this 40-point price is strange; it should be a lot more. It's the 235b-parameter model. https://preview.redd.it/cdclz2l4lwye1.png?width=627&format=png&auto=webp&s=0aaec58c51212ffc574a13cb448363b24a5c8166
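To put the point price in dollar terms, here is the arithmetic as a small Python sketch. It uses only the numbers given in the comment (40 points per message, 1 million points for $20) and assumes the price is flat per message regardless of length, which the comment suggests but does not confirm.

```python
POINTS_PER_PACKAGE = 1_000_000  # points bought for $20, per the comment
PACKAGE_PRICE_USD = 20.0
POINTS_PER_MESSAGE = 40

# How many messages one $20 package covers, and what each message costs.
messages_per_package = POINTS_PER_PACKAGE // POINTS_PER_MESSAGE
usd_per_message = PACKAGE_PRICE_USD / messages_per_package

print(messages_per_package)      # prints 25000
print(f"{usd_per_message:.4f}")  # prints "0.0008"
```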
View on Reddit #55477557

Osama_Saba@reddit

What stops Anthropic, OpenAI, Google and the rest from offering it in their APIs?
View on Reddit #55372529

robertpiosik@reddit

Stupidity of the idea ;)
View on Reddit #55372885

Former_Elderberry647@reddit

Hi Robert, I reached out to you about your app Taaabs via Reddit chat. Was hoping you could assist. Thanks in advance
View on Reddit #55476137

Osama_Saba@reddit

Why
View on Reddit #55373964

merotatox@reddit

Hmmmm, ok, let's all throw away the models we spent millions training/developing/maintaining, start hosting whichever model an online benchmark says is best, and then call it ours.
View on Reddit #55374304

Sudden-Lingonberry-8@reddit

so you're saying... they're saving face? AHAHAHA!
View on Reddit #55405173

ortegaalfredo@reddit

It's pretty easy to identify a model, because its tokenization is quite unique.
View on Reddit #55374745

Mbando@reddit

How is “cost” calculated? I would guess for the closed models it’s API calls, but there is at least some notional cost for Qwen, at least for electricity, right?
View on Reddit #55467879

SomeOddCodeGuy@reddit

Man, this model has me feeling like I'm taking crazy pills. I have not had nearly this good of experience with it for coding. I'll keep at it, though. Maybe the trick really is turning thinking off. Maybe the thinking is causing my hallucination woes.
View on Reddit #55372521

segmond@reddit

Are you running the full precision or q8 quant?
View on Reddit #55378440

SomeOddCodeGuy@reddit

q8 quant gguf. Latest quant I can find from unsloth, latest build of KoboldCpp (1.90.2) which was within 11 commits of main from Llama.cpp (all from today/yesterday, none that seem to affect Qwen3). I'll try pulling down the latest mlx-lm if Qwen3 support there looks good, and see how bf16 looks. I have the M3 Ultra 512GB, so I should slide just in on having enough RAM to run that.
View on Reddit #55379203

f3llowtraveler@reddit

How many tokens/sec are you getting on that Mac?
View on Reddit #55464524

Healthy-Nebula-3603@reddit

Coding under kobold... really?? Why don't you use llamacpp-server? You get far better experience. https://preview.redd.it/qmwcmiovpnye1.png?width=2459&format=png&auto=webp&s=16c9b8e18809077b6954b7d9adb14013c0581aef Maybe you have a wrong configuration.
View on Reddit #55380111

lannistersstark@reddit

>You get far better experience.

>Calculate weight after gaining 5%

I feel like what you're coding and what they're coding might not be comparable.
View on Reddit #55383954

Healthy-Nebula-3603@reddit

...and you're drawing assumptions from a testing interface? Lol
View on Reddit #55384320

lannistersstark@reddit

Yes? You provided that counterexample as "Look it can code fine."
View on Reddit #55392686

Healthy-Nebula-3603@reddit

Wow ... you're retarded.
View on Reddit #55401213

a_beautiful_rhind@reddit

I have thinking off and used both ik and llama-server.. model just hallucinates when it doesn't know something. Was one of the first things I noticed trying it over API. Local experience is no different.
View on Reddit #55381715

segmond@reddit

I'm currently downloading q8 gguf so going to be trying it tomorrow. Are you downloading the normal model or the extended 128k one? I looked at the discussions for the 128k ones and they seem to have some issues, so I decided to err on the side of caution and just do the original.
View on Reddit #55382914

SomeOddCodeGuy@reddit

Normal; figured there was likely a quality degradation on the 128k to extend the context length. Probably not enough to harm creative writing, but for coding/architecture/rag I want to claw back every ounce of quality I can get.
View on Reddit #55386565

brotie@reddit

Are you running a q4 quant through ollama or the full unquantized version? Thinking mode or no-think?
View on Reddit #55378634

SomeOddCodeGuy@reddit

Thinking, q8. I'm trying no thinking tonight to see if that helps at all.
View on Reddit #55379229

__Maximum__@reddit

And?
View on Reddit #55395933

SomeOddCodeGuy@reddit

So far it's actually ok. I need to test it a lot more thoroughly, but it's really starting to play nice in my workflows with thinking disabled. The responses it is giving are far more sane than what I was seeing before, and when coupled with GLM-4 it actually produces some reasonable responses. I'll need a few days with it to get a real feel, but right now I'm at least far happier without the thinking.
View on Reddit #55398637

DeltaSqueezer@reddit

Try with unquantized KV cache. It's still a bit too early for me to say, but so far, I much preferred the unquantized. I only use the standard 40960 context, not the extended 128k model, so it only takes <4GB VRAM for max KV cache.
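The "<4GB" figure is in line with the usual KV cache size formula for a grouped-query-attention model with a 16-bit cache. The sketch below shows that arithmetic in Python. The comment does not say which model is being run, so the layer count, KV head count and head dimension are assumed, illustrative values, and `kv_cache_bytes` is a made-up helper name.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    """Size of a full KV cache in bytes.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_element=2 corresponds to an
    unquantized 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * context_tokens


# Assumed architecture values for illustration; check the model's config
# file for the real numbers before relying on the result.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, context_tokens=40960)
size_gib = size / 2**30
print(f"{size_gib:.2f} GiB")  # prints "3.75 GiB"
```

Halving `bytes_per_element` (an 8-bit cache) halves the result, which is the memory an unquantized cache gives up in exchange for the quality difference described above.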
View on Reddit #55443736

CountlessFlies@reddit

What inference engine are you using? And how do you disable thinking completely? You can send /no_think with your initial request, but if you’re using a coding agent, subsequent requests made automatically won’t have this tag, and the model will start thinking again.
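One way around the agent problem described here is to patch the messages in a thin layer between the agent and the server, so every user turn carries the switch. A minimal Python sketch, assuming OpenAI-style message dicts; `with_no_think` is a hypothetical helper, not part of any agent or server.

```python
def with_no_think(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with /no_think appended to each user turn.

    The input list and its dicts are left unmodified. Turns that already
    contain the switch, or whose content is not a plain string, are copied
    through unchanged.
    """
    patched = []
    for message in messages:
        message = dict(message)  # shallow copy so the caller's data is untouched
        content = message.get("content")
        if (message.get("role") == "user"
                and isinstance(content, str)
                and "/no_think" not in content):
            message["content"] = content.rstrip() + " /no_think"
        patched.append(message)
    return patched


# Usage: patch a request body before forwarding it to the inference server.
request_messages = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Rename this function."},
]
patched_messages = with_no_think(request_messages)
```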
View on Reddit #55395489

SomeOddCodeGuy@reddit

I'm using koboldcpp, and I have WilmerAI between it and the front end. What I ended up doing, and it's working great for me, is making a chatml variant template with an assistant prefix that looks like this: "promptTemplateAssistantPrefix": "<|im_start|>assistant\n<think>\n\n</think>\n\n", which essentially mimics what the model does if you do /no_think. This causes the model to think that it's already produced those tags, and I never get thinking at all. So far it's working really well, and I'm a lot happier with the response quality now, so we'll see how it holds up.
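The same trick can be sketched outside of WilmerAI for any raw text-completion endpoint: build the ChatML prompt by hand and end it with an assistant turn that already contains an empty think block. This is a minimal Python illustration; the prefix string is the one quoted above, while `build_chatml_prompt` is a hypothetical helper and not WilmerAI's or koboldcpp's actual code.

```python
# Assistant prefix quoted in the comment: an already-closed, empty think block.
NO_THINK_ASSISTANT_PREFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def build_chatml_prompt(messages: list[dict]) -> str:
    """Render messages as ChatML and open the assistant turn with an empty think block."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    # The model continues from here, as if it had already finished thinking.
    parts.append(NO_THINK_ASSISTANT_PREFIX)
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Write a function that reverses a string."},
])
print(prompt)
```

The resulting string would be sent to a completion-style (not chat-style) endpoint, since a chat endpoint applies its own template.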
View on Reddit #55398759

davewolfs@reddit

I will add, I have been using this for a couple of hours now with aider after modifying LiteLLM so that it doesn't think and using the correct temperature etc per guidelines and this thing is a bit of a show and not in a good way. It is hallucinating like crazy.
View on Reddit #55436114

gamblingapocalypse@reddit

Is it possible that adding more 'thinking' is just burning through the token limit and actually making the outputs less accurate?
View on Reddit #55373867

Jonodonozym@reddit

Wouldn't be surprised, given Anthropic's studies showing Claude's explanations were often post-hoc, created after it had already intuited the answer. If Qwen 3 is the same, then a "show your working" or "reasoning" style of thinking block could well be a waste of valuable context size.
View on Reddit #55392312

randomanoni@reddit

So do it in even more steps: think, conclusion, drop think, answer, drop conclusion, User:
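The "drop think" step can be sketched as a small history-pruning pass: before the next request, strip the think blocks from earlier assistant turns so they stop consuming context. A minimal Python sketch, assuming the reasoning is wrapped in `<think>...</think>` tags and that messages are OpenAI-style dicts; `drop_think` and `prune_history` are hypothetical names.

```python
import re

# Non-greedy match so each think block is removed separately; DOTALL lets
# the block span multiple lines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def drop_think(text: str) -> str:
    """Remove every <think>...</think> block and the whitespace that follows it."""
    return THINK_BLOCK.sub("", text)


def prune_history(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with think blocks stripped from assistant turns."""
    pruned = []
    for message in messages:
        message = dict(message)
        if message.get("role") == "assistant" and isinstance(message.get("content"), str):
            message["content"] = drop_think(message["content"])
        pruned.append(message)
    return pruned


history = [
    {"role": "user", "content": "Is 91 prime?"},
    {"role": "assistant", "content": "<think>\n91 = 7 * 13\n</think>\n\nNo, 91 = 7 * 13."},
]
pruned_history = prune_history(history)
```

Dropping the intermediate "conclusion" turn in the same way would be one more filter over the list before the final answer is requested.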
View on Reddit #55396413

Echo9Zulu-@reddit

Got good results this way on a pygame task with 0.6b @q8 and 1.7b @ q4km
View on Reddit #55422118

davewolfs@reddit

Sorry for stating the obvious, but are you setting Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode?
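As a concrete illustration, these settings could be attached to a chat request body like this. It is a minimal Python sketch: the four values come from the comment above, but the model name is a placeholder, and whether a given server accepts `top_k` and `min_p` in the request body is an assumption (they are not part of the official OpenAI schema, so check your backend's documentation).

```python
import json

# Non-thinking-mode sampling settings quoted in the comment above.
NON_THINKING_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}


def build_request(messages: list[dict], model: str = "qwen3-235b-a22b") -> dict:
    """Build a chat request body with the non-thinking sampling settings applied.

    `model` is a placeholder name; use whatever identifier your server expects.
    """
    return {"model": model, "messages": messages, **NON_THINKING_SAMPLING}


payload = build_request([{"role": "user", "content": "Write a binary search in Rust."}])
print(json.dumps(payload, indent=2))
```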
View on Reddit #55417175

Leflakk@reddit

Thanks for the feedback; don’t forget to use the Qwen recommended parameters, which are different from those for thinking mode.
View on Reddit #55401306

shark8866@reddit

I'm pretty sure the Aider benchmark is based on competitive programming problems similar to LeetCode. You might be using it for SWE.
View on Reddit #55378653

das_rdsm@reddit

> Focuses on the *most difficult* 225 exercises out of the 697 that Exercism provides for those languages.

Among the coding benchmarks it is one of the less bad ones; it usually reflects real results kind of OK. Cost is usually a bit distorted, as those are really short tasks compared to real-world tasks. SWE-bench is usually better, and the tests of the frameworks there are usually even better. But Aider polyglot has its value. It certainly is not irrelevant.
View on Reddit #55400147

BoJackHorseMan53@reddit

Where were you all this time when a Chinese model wasn't at the top?
View on Reddit #55392046

Needausernameplzz@reddit

I think you're right. It got some "easy" questions wrong while thinking but gave me the solution perfectly with /no_think
View on Reddit #55393266

Timely_Second_6414@reddit

I think the use cases are very specific. I have had great experiences using this model (thinking mode) for testing neural network architectures and training them. It follows complex instructions very well and can reason very well about the datasets, structure, etc. It solves a few problems better than Gemini Pro for me (Gemini generates way too much code and implements things I didn't ask for). However, it is not very good at frontend (it feels very lazy, a problem many models have). I think for this the best experience you can get locally is GLM 4 32b, although quality starts to degrade after multiple turns of conversation.
View on Reddit #55372959

emprahsFury@reddit

aider polyglot is specifically about breadth of difficult problems. Hence the name polyglot. I don't know why we have to do this dance of not admitting something is good. There always has to be a caveat. It's just a good model; you don't have to save yourself by saying "it's not good at UI" or "it's good but only for turns 1, 2 & 3".
View on Reddit #55377436

ansmo@reddit

Because we're not here to rely on benchmarks. We're here to compare experience.
View on Reddit #55390257

maddogawl@reddit

Same, I have not seen good results with this model at all.
View on Reddit #55386295

cantgetthistowork@reddit

Qwen has always been benchmaxxed garbage unusable in real world situations
View on Reddit #55374294

tengo_harambe@reddit

relevant username
View on Reddit #55382113

ReasonablePossum_@reddit

>not had nearly this good of experience with it for coding.

This isn't a coding benchmark? I mean, people use LLMs for a lot of other stuff lol
View on Reddit #55374224

TheActualStudy@reddit

OK, this has convinced me to try it with 128GB RAM, a 3090, and mmap in llama.cpp and see what I get. I'm not super hopeful, but why not try? I'll update later.
View on Reddit #55428947

13henday@reddit

Use a smaller quant ?
View on Reddit #55458667

extraquacky@reddit

Whole mode my arse. I can't afford the time and dollars to let it rewrite the file every time.
View on Reddit #55371807

frivolousfidget@reddit

It performs as well as non thinking claude with diff mode.
View on Reddit #55373186

davewolfs@reddit

Not in Rust it doesn’t. Also it’s making some wild mistakes in practice.
View on Reddit #55439080

frivolousfidget@reddit

Yeah.. that is why I mentioned in Aider. YMMV for language specific. Always nice to have an eval with your most common interactions.
View on Reddit #55441102

davewolfs@reddit

It hallucinates like crazy. I don’t know how it’s scoring this high while making the mistakes I am seeing.
View on Reddit #55446858

frivolousfidget@reddit

Because AI acts very differently for different use cases. Usually hallucinations are caused by the AI being unsure about what is right or wrong. Your use case is probably one where this AI does not perform well.
View on Reddit #55447868

extraquacky@reddit

That's impressive tbh. Gotta check the benchmarks.
View on Reddit #55373225

robertpiosik@reddit

You can add to your instruction: `Use ellipsis comments, e.g. "// ...", when necessary.`
View on Reddit #55372801

extraquacky@reddit

Wdym? How does that perform on aider?
View on Reddit #55372926

robertpiosik@reddit

In real world use it performs great.
View on Reddit #55373072

Negative_Piece_7217@reddit

But how about it taking ages to return output? Smh
View on Reddit #55428599

Healthy-Nebula-3603@reddit

https://github.com/Aider-AI/aider/pull/3908/files And Qwen 32b: 45%. Impressive!
View on Reddit #55380468

Zpassing_throughZ@reddit

I'm running Qwen 30B on my phone (because it only uses 3B active parameters). Wow, what results. Very great.
View on Reddit #55407447

sannysanoff@reddit

At the time of writing, the image in this post is different from what I observe on the benchmark page: https://aider.chat/docs/leaderboards/ (there's no Qwen 3 on the leaderboard).
View on Reddit #55373615

rmontanaro@reddit

Maybe OP built the docs from this https://github.com/Aider-AI/aider/pull/3908/files But it's not live, not even reviewed
View on Reddit #55375002

Thireus@reddit

OP dropped the screenshot and has left the chat 👀
View on Reddit #55399308

intergalacticskyline@reddit

Same
View on Reddit #55374780

DinoAmino@reddit

Maybe OP is a lying karma whore and faked it?
View on Reddit #55382737

Different_Fix_2217@reddit

Does not match my use of the model at all.
View on Reddit #55396822

sirjoaco@reddit

Draw your own conclusions → [https://www.rival.tips/compare?model1=claude-3.7-sonnet&model2=qwen3-235b-a22b](https://www.rival.tips/compare?model1=claude-3.7-sonnet&model2=qwen3-235b-a22b)
View on Reddit #55386088

panchovix@reddit

Wow, that Pokemon UI is impressive, but it's kinda bugged. It seems Gemini got it working, but without animation.
View on Reddit #55393612

vitorgrs@reddit

I still have a few issues with it, especially with multilingual use. Sometimes when I use it in Portuguese, it answers with some words in English (Grok and Gemini do it too). Gemini Pro and DeepSeek translations are also superior.
View on Reddit #55393561

Justpassing017@reddit

What does o3 + 4.1 mean? o3 and 4.1 are their own models, aren't they?
View on Reddit #55384664

MRWONDERFU@reddit

architect plus implementing model
View on Reddit #55392760

ZookeepergameOld6699@reddit

The problem is throughput. I also confirmed Qwen 3 235b is awesome for other tasks such as summarization or research. But, it is very slow in a local environment. Not productive on coding usage.
View on Reddit #55390862

sunomonodekani@reddit

Today Qwen's lovers sleep with their stomachs full (of lies that inflate the ego)
View on Reddit #55386736

Frequent_Repeat_1634@reddit

Not visible for me right now 🤔 https://aider.chat/docs/leaderboards/
View on Reddit #55374028

DinoAmino@reddit

Wasn't visible to me yesterday either. Is it fake? What's up OP?
View on Reddit #55382248

TSG-AYAN@reddit

Not fake, but not verified yet either. The PR is still open
View on Reddit #55385236

DinoAmino@reddit

OP you need to answer for this. Yesterday there was a discussion about no Qwen 3 on the leaderboard and it is still not there today. How did you obtain this screenshot?
View on Reddit #55382630

Healthy-Nebula-3603@reddit

And it is no thinking version??? Wow
View on Reddit #55380335

StraightChemistry629@reddit

Source: https://x.com/scaling01/status/1918752403165462806
View on Reddit #55380022

merotatox@reddit

My only issue with it is that the context length is too small to get anything done.
View on Reddit #55374336

I_will_delete_myself@reddit

Dang, I'll give it a try. If they negotiate sanctions against private entities for the sale of TikTok being more lenient, then expect it to get even higher in the future.
View on Reddit #55374274

Ordinary_Mud7430@reddit

Let me guess... They gave him the same tests as always, which they already added to his training base 🙂
View on Reddit #55372946

13henday@reddit

Polyglot predates 3.7 by 3 months; they had more than enough time to benchmax if they wanted to. Also, I’ve been running this test today and it’s a very broad test.
View on Reddit #55373712

BoJackHorseMan53@reddit

Can be said for every top LLM
View on Reddit #55373452

Timely_Second_6414@reddit

Yes this model is very good in my experience. Do we know if this is with or without thinking?
View on Reddit #55371440

Independent-Wind4462@reddit (OP)

The crazy thing is that these results seem to be from non-thinking mode.
View on Reddit #55371586