Qwen 3 235b beats sonnet 3.7 in aider polyglot

skrshawk@reddit

How can the cost of running the model be evaluated in comparison? I suspect it would be quite favorable, but, for instance, if you rent GPUs, how many would you need and how long would the run take? Alternatively, which API services charge by the token, and how much did the run cost?

a_beautiful_rhind@reddit

free on openrouter.

Fit_Voice_3842@reddit

$0.15/M input tokens, $0.60/M output tokens? How is it free?
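
For anyone wanting to turn per-token prices into a run cost, here is a minimal sketch using the two prices quoted above. The token counts in the example are made up for illustration and are not taken from any benchmark run.

```python
# Prices quoted in the thread, in USD per million tokens.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = INPUT_PRICE_PER_M,
                 output_price_per_m: float = OUTPUT_PRICE_PER_M) -> float:
    """Return the USD cost of a request (or a whole run) billed per token."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
example_cost = api_cost_usd(2_000_000, 500_000)
print(f"${example_cost:.2f}")  # $0.60
```

The same function works for any provider if you pass that provider's own prices.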

a_beautiful_rhind@reddit

it was free.. 3 days ago

Lpaydat@reddit

The costs of OpenAI models are always absurd for me.

power97992@reddit

They need to pay for their R&D and make a profit!

Correct-Dimension786@reddit

I'm not sure what's going on, but Poe is charging only 40 points (you get 1 million for $20) for every message to this bot, and that 40-point price may even include 100k context, though I haven't tested that beyond sending it the text from a PDF and having it write a song about it. Anyway, it wrote some amazing songs and I'm liking it so far, but this 40-point pricing is strange; it should be a lot more. It's the 235B-parameter model. https://preview.redd.it/cdclz2l4lwye1.png?width=627&format=png&auto=webp&s=0aaec58c51212ffc574a13cb448363b24a5c8166
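
To put the point price in dollar terms, a quick back-of-the-envelope calculation using only the figures in the comment above (40 points per message, 1 million points for $20):

```python
POINTS_PER_PURCHASE = 1_000_000  # points received for one $20 purchase
PURCHASE_USD = 20.0
POINTS_PER_MESSAGE = 40

# Dollar cost of a single message, regardless of its length.
usd_per_message = PURCHASE_USD * POINTS_PER_MESSAGE / POINTS_PER_PURCHASE

# Number of messages one purchase covers at this rate.
messages_per_purchase = POINTS_PER_PURCHASE // POINTS_PER_MESSAGE

print(f"${usd_per_message:.4f} per message")      # $0.0008 per message
print(f"{messages_per_purchase} messages per $20")  # 25000 messages per $20
```

Since the price is per message rather than per token, a long-context message costs the same as a short one, which is presumably why it looks so cheap.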

Osama_Saba@reddit

What stops Anthropic, OpenAI, Google, and the rest from offering it in their APIs?

robertpiosik@reddit

Stupidity of the idea ;)

Former_Elderberry647@reddit

Hi Robert, I reached out to you about your app Taaabs via Reddit chat. Was hoping you could assist. Thanks in advance

Osama_Saba@reddit

Why

merotatox@reddit

Hmmmm, ok, let's all throw away the models we spent millions training/developing/maintaining, start hosting whichever model an online benchmark says is the best, and then let's call it ours.

Sudden-Lingonberry-8@reddit

so you're saying... they're saving face? AHAHAHA!

ortegaalfredo@reddit

It's pretty easy to identify a model because its tokenization is quite unique.

Mbando@reddit

How is “cost” calculated? I would guess for the closed models it’s API calls, but there is at least some notional cost for Qwen, at least for electricity, right?

SomeOddCodeGuy@reddit

Man, this model has me feeling like I'm taking crazy pills. I have not had nearly this good of an experience with it for coding. I'll keep at it, though. Maybe the trick really is turning thinking off. Maybe the thinking is causing my hallucination woes.

segmond@reddit

Are you running the full precision or q8 quant?

SomeOddCodeGuy@reddit

q8 quant gguf. Latest quant I can find from unsloth, latest build of KoboldCpp (1.90.2), which was within 11 commits of main from Llama.cpp (all from today/yesterday, none that seem to affect Qwen3). I'll try pulling down the latest mlx-lm if Qwen3 support there looks good, and see how bf16 looks. I have the M3 Ultra 512GB, so I should just slide in on having enough RAM to run that.
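
A rough sanity check on the "just enough RAM" estimate, counting weights only. This sketch assumes the nominal 235B parameter count from the model name, 16 bits per weight for bf16, and 8.5 bits per weight for a Q8_0-style quant (32 int8 values plus one fp16 scale per block). Real files mix tensor types and the runtime also needs room for KV cache and activations, so treat the numbers as a lower bound.

```python
PARAMS = 235e9  # nominal parameter count, taken from the "235b" in the model name


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)   # 2 bytes per weight
q8_gb = weight_gb(PARAMS, 8.5)    # assumed Q8_0-style block layout

print(f"bf16: ~{bf16_gb:.0f} GB")  # bf16: ~470 GB
print(f"q8:   ~{q8_gb:.0f} GB")    # q8:   ~250 GB
```

At roughly 470 GB for bf16 weights alone, a 512 GB machine leaves little headroom, which matches the "just slide in" expectation.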

f3llowtraveler@reddit

How many tokens/sec are you getting on that Mac?

Healthy-Nebula-3603@reddit

Coding under kobold... really?? Why don't you use llamacpp-server? You get a far better experience. https://preview.redd.it/qmwcmiovpnye1.png?width=2459&format=png&auto=webp&s=16c9b8e18809077b6954b7d9adb14013c0581aef Maybe you have a wrong configuration.

lannistersstark@reddit

> You get far better experience.

> Calculate weight after gaining 5%

I feel like what you're coding and what they're coding might not be comparable.

Healthy-Nebula-3603@reddit

...and you're drawing conclusions from a test of the interface? Lol

lannistersstark@reddit

Yes? You provided that counterexample as "Look it can code fine."

Healthy-Nebula-3603@reddit

Wow ... you're retarded.

a_beautiful_rhind@reddit

I have thinking off and used both ik and llama-server.. model just hallucinates when it doesn't know something. Was one of the first things I noticed trying it over API. Local experience is no different.

segmond@reddit

I'm currently downloading q8 gguf so going to be trying it tomorrow. Are you downloading the normal model or the extended 128k one? I looked at the discussions for the 128k ones and they seem to have some issues, so I decided to err on the side of caution and just do the original.

SomeOddCodeGuy@reddit

Normal; figured there was likely a quality degradation on the 128k to extend the context length. Probably not enough to harm creative writing, but for coding/architecture/rag I want to claw back every ounce of quality I can get.

brotie@reddit

Are you running a q4 quant through ollama or the full unquantized version? Thinking mode or no-think?

SomeOddCodeGuy@reddit

Thinking, q8. I'm trying no thinking tonight to see if that helps at all.

Maximum@reddit

And?

SomeOddCodeGuy@reddit

So far it's actually ok. I need to test it a lot more thoroughly, but it's really starting to play nice in my workflows with thinking disabled. The responses it is giving are far more sane than what I was seeing before, and when coupled with GLM-4 it actually produces some reasonable responses. I'll need a few days with it to get a real feel, but right now I'm at least far happier without the thinking.

DeltaSqueezer@reddit

Try with unquantized KV cache. It's still a bit too early for me to say, but so far, I much preferred the unquantized. I only use the standard 40960 context, not the extended 128k model, so it only takes <4GB VRAM for max KV cache.
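
For reference, the size of an unquantized KV cache can be estimated from a model's attention shape. The sketch below uses the usual formula for grouped-query attention (keys plus values, per layer, per KV head). The layer count, KV-head count, and head dimension are placeholder values chosen for illustration, since the comment does not say which model is being run; check the `config.json` of your own model before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for a full context window."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical attention shape (assumption, not stated in the thread).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
CONTEXT = 40960  # the standard context length mentioned above

fp16_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT) / 2**30
print(f"fp16 KV cache: {fp16_gib:.2f} GiB")  # fp16 KV cache: 3.75 GiB
```

With these placeholder values the full-context fp16 cache lands just under 4 GiB; a model with more layers or KV heads scales linearly from there.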

CountlessFlies@reddit

What inference engine are you using? And how do you disable thinking completely? You can send /no_think with your initial request, but if you’re using a coding agent, subsequent requests made automatically won’t have this tag, and the model will start thinking again.
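
One possible workaround for the agent case is a small proxy step that tags every outgoing user turn before the request reaches the server. This is only a sketch: it assumes OpenAI-style `messages` (a list of dicts with `role` and `content` string fields), and `with_no_think` is a hypothetical helper name, not part of any tool mentioned here.

```python
def with_no_think(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with "/no_think" appended to each user turn."""
    tagged = []
    for msg in messages:
        if msg["role"] == "user" and "/no_think" not in msg["content"]:
            # Build a new dict so the caller's message list is not mutated.
            msg = {**msg, "content": msg["content"].rstrip() + " /no_think"}
        tagged.append(msg)
    return tagged


history = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Fix the failing test."},
]
print(with_no_think(history)[1]["content"])  # Fix the failing test. /no_think
```

This only helps if you control the layer between the agent and the model; it does not change what the agent itself sends.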

SomeOddCodeGuy@reddit

I'm using koboldcpp, and I have WilmerAI between it and the front end. What I ended up doing, and it's working great for me, is making a chatml variant template with an assistant prefix that looks like this:

"promptTemplateAssistantPrefix": "<|im_start|>assistant\n<think>\n\n</think>\n\n",

Essentially mimicking what the model does if you do /no_think. This causes the model to think that it's already produced those tags, and I never get thinking at all. So far it's working really well, and I'm a lot happier with the response quality now, so we'll see how it holds up.
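
Outside WilmerAI, the same trick can be sketched as a plain prompt builder for a raw text-completion endpoint. The prefix string is the one quoted above; `build_chatml_prompt` is a hypothetical helper, and the ChatML layout is assumed to be the standard `<|im_start|>role ... <|im_end|>` form.

```python
# Assistant prefix from the comment above: an already-closed, empty think block.
ASSISTANT_PREFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def build_chatml_prompt(messages: list[dict]) -> str:
    """Render messages as ChatML and prefill an empty think block for the reply."""
    parts = [
        f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
        for msg in messages
    ]
    # The model continues after the closed think block, so it skips reasoning.
    parts.append(ASSISTANT_PREFIX)
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Write a binary search."}])
print(prompt)
```

The resulting string is sent as-is to a completion endpoint; a chat endpoint that applies its own template would add a second assistant header, so this only fits the raw-prompt path.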

davewolfs@reddit

I will add: I have been using this for a couple of hours now with aider, after modifying LiteLLM so that it doesn't think and using the correct temperature etc. per the guidelines, and this thing is a bit of a show, and not in a good way. It is hallucinating like crazy.

gamblingapocalypse@reddit

Is it possible that adding more 'thinking' is just burning through the token limit and actually making the outputs less accurate?

Jonodonozym@reddit

Wouldn't be surprised, given Anthropic's studies showing Claude's explanations were often post hoc, created after it had already intuited the answer. If Qwen 3 is the same, then "show your working" or "reasoning" style thinking blocks could well be a waste of valuable context size.

randomanoni@reddit

So do it in even more steps: think, conclusion, drop think, answer, drop conclusion, User:
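
The "drop think" step can be sketched as a filter over the chat history, so earlier reasoning never re-enters the context. This covers only that one step, not the full think/conclusion/answer loop, and it assumes reasoning is wrapped in `<think>...</think>` tags inside assistant messages; `drop_thinking` is a hypothetical helper name.

```python
import re

# Non-greedy match so multiple think blocks in one message are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def drop_thinking(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with think blocks stripped from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned


history = [
    {"role": "user", "content": "Is 91 prime?"},
    {"role": "assistant", "content": "<think>\n91 = 7 * 13\n</think>\n\nNo, 91 = 7 * 13."},
]
print(drop_thinking(history)[1]["content"])  # No, 91 = 7 * 13.
```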

Echo9Zulu-@reddit

Got good results this way on a pygame task with 0.6b @q8 and 1.7b @ q4km

davewolfs@reddit

Sorry for stating the obvious, but are you setting Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode?
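
As a concrete illustration, those four values could be attached to a chat request body like this. The sketch only builds the JSON; it assumes an OpenAI-style chat endpoint that also accepts `top_k` and `min_p` as extra fields (many local servers do, but verify for yours), and the model name is a placeholder.

```python
import json

# Non-thinking-mode sampling values as listed in the comment above.
NON_THINKING_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,   # not part of the official OpenAI schema; server-dependent
    "min_p": 0.0,  # likewise an extension field
}


def build_request(messages: list[dict], model: str = "qwen3-235b-a22b") -> dict:
    """Build a chat-completion request body with the sampling settings applied."""
    # `model` is a hypothetical identifier; use whatever name your server exposes.
    return {"model": model, "messages": messages, **NON_THINKING_SAMPLING}


request_body = build_request([{"role": "user", "content": "Refactor this function."}])
print(json.dumps(request_body, indent=2))
```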

Leflakk@reddit

Thanks for the feedback. Don’t forget to use the Qwen-recommended parameters, which differ from those for thinking mode.

shark8866@reddit

I'm pretty sure the aider benchmark is based on competitive programming problems similar to LeetCode. You might be using it for SWE.

das_rdsm@reddit

> Focuses on the *most difficult* 225 exercises out of the 697 that Exercism provides for those languages.

Among the coding benchmarks, it's one of the less bad ones; it usually reflects real results reasonably well. Cost is usually a bit distorted, as those are really short tasks compared to real-world tasks. SWEBench is usually better, and the tests of the frameworks there are usually even better. But Aider polyglot has its value. It certainly is not irrelevant.

BoJackHorseMan53@reddit

Where were you all this time when a Chinese model wasn't at the top?

Needausernameplzz@reddit

I think you're right. It got some "easy" questions wrong while thinking but gave me the solution perfectly with /nothink

Timely_Second_6414@reddit

I think the use cases are very specific. I have had great experiences using this model (thinking mode) for testing neural network architectures and training them. It follows complex instructions very well and can reason very well about the datasets, structure, etc. It solves a few problems better than Gemini Pro for me (Gemini generates way too much code and implements things I didn't ask for). However, it is not very good at frontend (it feels very lazy, a problem many models have). I think for this the best experience you can get locally is GLM 4 32b, although quality starts to degrade after multiple turns of conversation.

emprahsFury@reddit

Aider polyglot is specifically about a breadth of difficult problems. Hence the name polyglot. I don't know why we have to do this dance of not admitting something is good. There always has to be a caveat. It's just a good model; you don't have to save yourself by saying "it's not good at UI" or "it's good but only for turns 1, 2 & 3".

ansmo@reddit

Because we're not here to rely on benchmarks. We're here to compare experience.

maddogawl@reddit

Same. I have not seen good results with this model at all.

cantgetthistowork@reddit

Qwen has always been benchmaxxed garbage unusable in real world situations

tengo_harambe@reddit

relevant username

ReasonablePossum_@reddit

> not had nearly this good of experience with it for coding.

This isn't a coding benchmark? I mean, people use LLMs for a lot of other stuff lol

TheActualStudy@reddit

OK, this has convinced me to try it with 128GB RAM, a 3090, and mmap in llama.cpp and see what I get. I'm not super hopeful, but why not try? I'll update later.

13henday@reddit

Use a smaller quant ?

extraquacky@reddit

Whole mode, my arse. I can't afford the time and dollars to let it rewrite the file every time.

frivolousfidget@reddit

It performs as well as non thinking claude with diff mode.

davewolfs@reddit

Not in Rust it doesn’t. Also it’s making some wild mistakes in practice.

frivolousfidget@reddit

Yeah.. that is why I mentioned in Aider. YMMV for language specific. Always nice to have an eval with your most common interactions.

davewolfs@reddit

It hallucinates like crazy. I don’t know how it’s scoring this high while making the mistakes I am seeing.

frivolousfidget@reddit

Because AI models act very differently for different use cases. Usually hallucinations are caused by the model being unsure about what is right or wrong. Your use case is probably one where this model does not perform well.

extraquacky@reddit

That's impressive tbh. Gotta check the benchmarks.

robertpiosik@reddit

You can add to your instructions: `Use ellipsis comments, e.g. "// ...", when necessary.`

extraquacky@reddit

Wdym? How does that perform on aider?

robertpiosik@reddit

In real world use it performs great.

Negative_Piece_7217@reddit

But how about it taking ages to return output? Smh

Healthy-Nebula-3603@reddit

https://github.com/Aider-AI/aider/pull/3908/files And Qwen 32b: 45%. Impressive!

Zpassing_throughZ@reddit

I'm running Qwen 30B on my phone (because it only uses 3B active parameters). Wow, what results. Very great.

sannysanoff@reddit

At the time of writing, the image in this post is different from what I observe on the benchmark page: https://aider.chat/docs/leaderboards/ (there's no Qwen 3 in the leaderboard).

rmontanaro@reddit

Maybe OP built the docs from this https://github.com/Aider-AI/aider/pull/3908/files But it's not live, not even reviewed

Thireus@reddit

OP dropped the screenshot and has left the chat 👀

intergalacticskyline@reddit

Same

DinoAmino@reddit

Maybe OP is a lying karma whore and faked it?

Different_Fix_2217@reddit

Does not match my use of the model at all.

sirjoaco@reddit

Draw your own conclusions → [https://www.rival.tips/compare?model1=claude-3.7-sonnet&model2=qwen3-235b-a22b](https://www.rival.tips/compare?model1=claude-3.7-sonnet&model2=qwen3-235b-a22b)

panchovix@reddit

Wow, that Pokemon UI is impressive, but it's kinda bugged. It seems Gemini got it working, but without animation.

vitorgrs@reddit

I still have a few issues with it, especially multilingual. Sometimes when using it in Portuguese, it answers some words in English (Grok and Gemini do it too). Gemini Pro and DeepSeek translations are also superior.

Justpassing017@reddit

What does o3 + 4.1 mean? Aren't o3 and 4.1 their own separate models?

MRWONDERFU@reddit

architect plus implementing model

ZookeepergameOld6699@reddit

The problem is throughput. I also confirmed Qwen 3 235b is awesome for other tasks such as summarization or research. But it is very slow in a local environment. Not productive for coding use.

sunomonodekani@reddit

Today Qwen's lovers sleep with their stomachs full (of lies that inflate the ego)

Frequent_Repeat_1634@reddit

Not visible for me right now 🤔 https://aider.chat/docs/leaderboards/

DinoAmino@reddit

Wasn't visible to me yesterday either. Is it fake? What's up OP?

TSG-AYAN@reddit

Not fake, but not verified yet either. The PR is still open

DinoAmino@reddit

OP you need to answer for this. Yesterday there was a discussion about no Qwen 3 on the leaderboard and it is still not there today. How did you obtain this screenshot?

Healthy-Nebula-3603@reddit

And it is the no-thinking version??? Wow

StraightChemistry629@reddit

Source: https://x.com/scaling01/status/1918752403165462806

merotatox@reddit

My only issue with it is that the context length is too small to get anything done.

I_will_delete_myself@reddit

Dang, I'll give it a try. If negotiations over the sale of TikTok lead to more lenient sanctions against private entities, then expect it to get even higher in the future.

Ordinary_Mud7430@reddit

Let me guess... They gave him the same tests as always, which they already added to his training base 🙂

13henday@reddit

Polyglot predates 3.7 by 3 months, so they had more than enough time to bench max if they wanted to. Also, I’ve been running this test today and it’s a very broad test.

BoJackHorseMan53@reddit

Can be said for every top LLM

Timely_Second_6414@reddit

Yes this model is very good in my experience. Do we know if this is with or without thinking?

93 Comments