R1+Sonnet set a new SOTA on the aider polyglot benchmark, at 14X less cost compared to o1

Posted by Xhehab_@reddit | LocalLLaMA | 46 comments

https://preview.redd.it/zub2yfarfzee1.jpg?width=1656&format=pjpg&auto=webp&s=b92fd272248cd2290b56236ab40716acd51979aa

- **64% R1+Sonnet**
- 62% o1
- **57% R1**
- 52% Sonnet
- 48% DeepSeek V3

>"There has been some recent discussion about extracting the <think> tokens from R1 and feeding them to Sonnet. To be clear, the results above are not using R1’s thinking tokens. Using the thinking tokens appears to produce worse benchmark results.
>
>o1 paired with Sonnet didn’t produce better results than just using o1 alone. Using various other models as editor didn’t seem to improve o1 or R1 versus their solo scores.
>
>Aider supports using a pair of models for coding:
>
>- An Architect model is asked to describe how to solve the coding problem. Thinking/reasoning models often work well in this role.
>- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
>
>**R1 as architect with Sonnet as editor has set a new SOTA of 64.0%** on the aider polyglot benchmark. They achieve this at **14X less cost** compared to the previous o1 SOTA result."

[*https://aider.chat/2025/01/24/r1-sonnet.html*](https://aider.chat/2025/01/24/r1-sonnet.html)
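For anyone who wants the Architect/Editor split in code form, here is a minimal Python sketch of the two-step flow described in the quote. It is not aider's actual implementation: `ModelFn`, `architect_then_edit`, the prompt wording, and the two stub models are all made up for illustration, and a real setup would call real model APIs instead of the stubs.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
ModelFn = Callable[[str], str]


def architect_then_edit(task: str, source: str, architect: ModelFn, editor: ModelFn) -> str:
    """Two-step flow: the architect describes a solution, the editor turns it into edits."""
    # Step 1: ask the architect model how to solve the coding problem.
    plan = architect(
        f"Describe how to solve this coding problem.\n\nTask: {task}\n\nCode:\n{source}"
    )
    # Step 2: hand the editor the architect's solution plus the current source.
    return editor(
        f"Produce specific edits that apply this solution.\n\nSolution:\n{plan}\n\nCode:\n{source}"
    )


# Stub models so the sketch runs without any API access (illustrative only).
def fake_architect(prompt: str) -> str:
    return "Rename function `foo` to `bar`."


def fake_editor(prompt: str) -> str:
    # A real editor would emit edit instructions; this stub only checks it saw the plan.
    assert "Rename function `foo` to `bar`." in prompt
    return "REPLACE foo WITH bar"


result = architect_then_edit("rename foo", "def foo(): pass", fake_architect, fake_editor)
print(result)
```

The point of the split is that the editor never sees the original task on its own terms; it only applies the architect's solution to the existing file.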


LittleGalaxyBrain@reddit

Cool to see R1+Sonnet at 64%. Cheaper than o1 and better results. Actually, we at Refact hit 76% with our AI agent + non-thinking Sonnet setup. Haven't tested with other models yet, now working on the score with thinking enabled.

fedya1@reddit

Need to see R1+V3. That would be a killer on cost.

Soft_Hedgehog_4317@reddit

Yeah, looks like this will be the most bang for your buck! Sonnet is quite expensive.

Mother_Soraka@reddit

why not R1+DS3 or R1+R1?

AriyaSavaka@reddit

Yeah looking forward to the R1 + V3 result.

boredcynicism@reddit

From experience, these combos get confused about which edits are done and which they still have to do. Might be fixed with prompts but right now that is what it is.

OXKSA1@reddit

do you know what will be better than R1+R1? it's the **R1²**

Educational_Gap5867@reddit

Hey can someone show me how to actually use Aider to get stuff done?

You start aider in the root of your git repo, /add the file you want to work on, and type /code and describe what you want. /architect if it's complicated and you want to double check

Long-John-Sliver22@reddit

Lazy much? https://aider.chat/docs/usage/tutorials.html

NewGeneral7964@reddit

Btw, victortaelin's scripts and insights inspired this tool. Credits to him.

MoffKalast@reddit

Combine a few more models together and we'll finally have Devastator.

davewolfs@reddit

This is the Theranos moment for North American AI. Overhyped expensive bullshit.

vdp@reddit

Why are the Gemini thinking models still not supported?

pigeon57434@reddit

because google keeps releasing models in the AI studio but not the API and you need API to run benchmarks pretty much

eposnix@reddit

"gemini-2.0-flash-thinking-exp-01-21" is on the API. I was just using it, in fact. The big issue is that it's heavily rate limited. Doing just a handful of tasks in a short time will return errors after a while.

mycall@reddit

How long of a wait to reset? Just break up the benchmark into chunks and it will simply take longer to run.

HelpfulHand3@reddit

You get 15 per minute and 1500 per day. In any case they're not production ready and Google will train on submitted data (if on free plan) so it's understandable there's no support for it yet.
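To put those limits in perspective, here is a rough Python estimate of how long a chunked run would take. It assumes the 15/minute and 1500/day figures quoted above, one request per benchmark task, requests paced at the per-minute rate, and a daily quota that resets every 24 hours; `min_runtime_minutes` and the 300-request example are hypothetical, not numbers from the benchmark.

```python
import math


def min_runtime_minutes(n_requests: int, per_minute: int = 15, per_day: int = 1500) -> int:
    """Rough lower bound on wall-clock minutes to send n_requests under both limits."""
    if n_requests <= 0:
        return 0
    full_days, remainder = divmod(n_requests, per_day)
    if remainder == 0:
        # The last day uses a full daily quota, so it is not a "waiting" day.
        full_days -= 1
        remainder = per_day
    # Each exhausted daily quota costs a full 24 hours before the next batch can start.
    minutes_last_day = math.ceil(remainder / per_minute)
    return full_days * 24 * 60 + minutes_last_day


# Hypothetical run of 300 requests: limited only by the per-minute rate.
print(min_runtime_minutes(300))
```

Under these assumptions the per-minute limit only stretches a run by minutes, while anything over the daily quota adds a full day of waiting.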

ThisWillPass@reddit

I get .2 responses per second.

bitmoji@reddit

Yeah, exp 1206 is amazing, but the rate limiting makes it unusable. I wish they would release it to paid.

extopico@reddit

Try the new 2.0 thinking model. It is so far the best of them all. No random language switching either.

BoJackHorseMan53@reddit

Because they're experimental. Everyone will integrate them once they're out of experimental.

ArgumentFeeling@reddit

The API isn't released yet

HatZinn@reddit

Fascinating

Sky-kunn@reddit

https://preview.redd.it/kw9uzp5jmzee1.png?width=788&format=png&auto=webp&s=3a7cce8070e0e728df4aa103a28e4dd2e76cdb96 o1 at $186.5 😭

C'mon what do you have against the billionaires?

whosbabo@reddit

They are going to have to return their Yachts. So sad.

Enough-Meringue4745@reddit

I would like to see a reasoning + non-reasoning comparison: R1+Sonnet O1+Sonnet etc

MLDataScientist@reddit

If you read their post, they say "o1 paired with Sonnet didn’t produce better results than just using o1 alone."

Pro-editor-1105@reddit

it reflects their companies' ideologies.

ANONYMOUSEJR@reddit

I don't think we can do it for O1... the 'thinking' tokens that are displayed are actually summaries, from my understanding, as OpenAI's worst fears of others figuring out the process are slowly coming true. Basically, even if we do try it, it won't be nearly on the same level, since R1 shows its thought process while OpenAI will insta-ban you if you try to get O1 to reveal its thinking (from what I recall of some users over on openai when O1 preview was released).

jd_3d@reddit

In this case they only used the final output from R1, not all the thinking tokens. When they tried using the thinking tokens they actually got worse results.
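A minimal Python sketch of "use only the final output": strip the reasoning block and keep the rest. This assumes the raw text wraps its reasoning in literal `<think>...</think>` tags; some APIs return the reasoning separately, in which case there is nothing to strip. `final_answer` and the sample string are made up for illustration and are not taken from the benchmark setup.

```python
import re

# Matches one <think>...</think> block, including any newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def final_answer(raw_output: str) -> str:
    """Drop any <think> blocks and return only the remaining answer text."""
    return THINK_BLOCK.sub("", raw_output).strip()


# Illustrative sample, not real model output.
sample = "<think>\nMaybe use a dict? No, a set is enough.\n</think>\nUse a set to deduplicate the items."
print(final_answer(sample))
```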

jaMMint@reddit

they should try to condense the thinking tokens to just the ones reflected in the R1 output, not the unused ones.

Snoo_64233@reddit

Does o1 really need to charge that much to cover the bottom line, or is Sam just doing a huge markup?

Pvt_Twinkietoes@reddit

I'd like to see its performance on simple bench.

Agentic multi-agent role-based chat conversations are another form of thinking time, which R1's <think> is quite similar to.

hassan789_@reddit

The new flash 2 thinking + sonnet is 🔥

segmond@reddit

I'd like to see a benchmark of R1-Distill-Qwen-32B+Qwen2.5-32B-Coder.

vert1s@reddit

It’s an open benchmark; you can run it yourself fairly easily (if you can run those models).

m3kw@reddit

Somehow the architect/edit workflow never works for me. The architect would always write the code as well, while the Editor model would only be used to apply the code from the architect.

With DeepSeek you get the issue that the editor is confused whether or not the architect has done the edits, so indeed I almost always just use /code now. But I guess this combo works better.

Recoil42@reddit

This makes a lot of sense, actually. Sonnet is still the reigning champ on coding, but R1 very clearly is better at abstractions. I haven't actually used Aider, but does it fluidly switch between architect and editor by itself?

flextrek_whipsnake@reddit

Just so people are aware, Cline recently added a similar feature. You can easily toggle between "Plan" and "Act" modes.

t_krett@reddit

Yes, described here: https://aider.chat/docs/usage/modes.html#architect-mode-and-the-editor-model

What I think is really funny is their copy-paste mode: https://aider.chat/docs/usage/copypaste.html. It watches your clipboard to integrate models like o3, which don't actually have an API, into the workflow as the thinking model.

cant-find-user-name@reddit

As far as I know, you have to use `/architect` to invoke the architect model. When you finish planning out the changes, you apply the code and then the coding model comes into play.

Mediocre_Tree_5690@reddit

Why not r1 + sonnet
