DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Posted by Federal_Spend2412@reddit | LocalLLaMA | 23 comments

Is this accurate? I use DS v4 in OpenCode and find it nearly on par with Sonnet 4.6, so I'm surprised the score is so low.

https://deepswe.datacurve.ai/

[-]

Life-Screen-9923@reddit

They’re just terrified of Chinese AI and losing the price war. It's why they've resorted to smear campaigns and negative framing.

With the AI bubble stretching thin, investor expectations sky-high, and computational costs skyrocketing, they are desperate.

Even Eric Schmidt recently admitted that China is only like 6 months behind in AI.

This entire narrative is just a coping mechanism to calm nervous investors.

[-]

Wide_Big_6969@reddit

How many benchmarks are legitimately reflective of real performance? It seems almost none of them reflect model performance in a legitimate way.

[-]

ThePixelHunter@reddit

I've heard only bad things about this new benchmark

[-]

ThePixelHunter@reddit

Inconsistencies between models, older frontier beating newer frontier, things that just don't make sense. Search the sub for "DeepSWE" and you'll find those discussions from the past couple weeks.

[-]

MokoshHydro@reddit

No. You should evaluate models yourself for your exact flow. Currently you can "make up a benchmark" that will show just about anything.

P.S. DS4-pro is currently my primary coding model and I'm completely satisfied with it.
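A minimal sketch of what "evaluate it yourself" could look like, assuming you can wrap your model in a plain function and write a pass/fail check per task. `stub_model`, the two prompts, and the checks are hypothetical placeholders, not part of any real harness or API:

```python
from typing import Callable, List, Tuple

# A task is a prompt plus a function that decides whether the output passes.
Task = Tuple[str, Callable[[str], bool]]

def pass_rate(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Run every task through the model and return the fraction that passed."""
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)

def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; swap in your own client.
    if "add" in prompt:
        return "def add(a, b):\n    return a + b"
    return ""

# Hypothetical tasks; in practice these would come from your own workflow.
tasks: List[Task] = [
    ("Write a Python function add(a, b).", lambda out: "def add" in out),
    ("Write a Python function mul(a, b).", lambda out: "def mul" in out),
]

print(pass_rate(stub_model, tasks))
```

String matching is only a placeholder here. For coding tasks, a check that actually runs your project's tests on the model's output would be far more meaningful.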

[-]

It’s great when the technical difficulty of the task amounts to something like a TODO list. For real work where frontier intelligence is justified it falls flat. It’s good enough for novice or casual users. Not for any serious work where money and consequences are on the line.

[-]

sloptimizer@reddit

Here are the steps to replicate these kinds of benchmarks:

1. Take money from AI labs that want to score well
2. Create lots of individual tests
3. Run tests across all models
4. Discard all the tests where competing models outperformed
5. Hand pick tests where models you were paid for scored first
6. Publish the final test selection and results as the new unbiased benchmark

The variable you control is the test selection, so you can make stats show anything you want to show.
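A toy illustration of that selection effect, using made-up pass/fail patterns for two hypothetical models (`model_a`, `model_b`). Nothing here comes from a real benchmark; it only shows how filtering the test set can flip a ranking:

```python
# Synthetic pass/fail outcomes for two hypothetical models on 12 tests.
# These are made-up patterns, not results from any real benchmark.
NUM_TESTS = 12
results = {
    "model_a": [i % 2 == 0 for i in range(NUM_TESTS)],  # passes 6 of 12
    "model_b": [i % 3 != 0 for i in range(NUM_TESTS)],  # passes 8 of 12
}

def pass_rate(outcomes, indices):
    """Fraction of the selected tests that passed."""
    return sum(outcomes[i] for i in indices) / len(indices)

def cherry_pick(results, favored, rival):
    """Drop every test where the rival passed and the favored model failed."""
    return [
        i for i in range(len(results[favored]))
        if not (results[rival][i] and not results[favored][i])
    ]

all_tests = list(range(NUM_TESTS))
kept_tests = cherry_pick(results, "model_a", "model_b")

for label, indices in [("full set", all_tests), ("cherry-picked", kept_tests)]:
    a = pass_rate(results["model_a"], indices)
    b = pass_rate(results["model_b"], indices)
    print(f"{label}: model_a={a:.2f} model_b={b:.2f} ({len(indices)} tests)")
```

On the full set model_b leads (8 of 12 vs 6 of 12). After dropping the four tests where only model_b passed, model_a leads (6 of 8 vs 4 of 8), even though neither model changed.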

Just look at how Artificial Analysis keeps changing their test selection as soon as GPT loses the first spot, and then boom, they are back at number one!

[-]

Exciting_Garden2535@reddit

They are using large codebases and running models via the very simple mini-swe-agent (about 100 lines of code, if I'm not mistaken). For example, mini-swe-agent does not manage or compact context at all. I suspect that good models that failed the tests just read a lot of the codebase, relying on context compaction or on a system prompt that prevents that, so they failed. The successful models just acted differently, for example by relying more on console tools to search for specific files. That's it.
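A rough sketch of that failure mode, assuming an agent loop that appends every tool output to its history and never trims it. This is not mini-swe-agent's actual code, and the window size and per-step token costs are invented for illustration:

```python
# Toy model of an agent loop with no context compaction.
# All numbers (window size, token costs) are illustrative assumptions.
CONTEXT_WINDOW = 100_000  # hypothetical token limit

def run_agent(step_costs, window=CONTEXT_WINDOW):
    """Append each step's output to the history; never trim or summarize.

    Returns (steps_completed, tokens_used). The run stops at the first step
    whose output would push the history past the window.
    """
    used = 0
    for done, cost in enumerate(step_costs):
        if used + cost > window:
            return done, used
        used += cost
    return len(step_costs), used

# Strategy 1: read 20 whole files at roughly 8,000 tokens each.
read_everything = [8_000] * 20
# Strategy 2: 20 targeted searches (e.g. grep) at roughly 500 tokens each.
targeted_search = [500] * 20

print(run_agent(read_everything))
print(run_agent(targeted_search))
```

With these assumed numbers, the read-everything strategy stalls after 12 of its 20 steps at 96,000 tokens, while the targeted-search strategy finishes all 20 steps using 10,000 tokens.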

[-]

macboller@reddit

I found GPT5.5 to genuinely suck at coding regardless of the harness, so I'm not even sure what these trust-me-bro benchmarks are measuring.

[-]

korino11@reddit

Depends on your tasks AND the policies of the company. If your tasks involve something the policies do not allow, you will not get good output in code... or you need to think and create a special file with an explanation for the model of why it's legit.

[-]

Pleasant-Shallot-707@reddit

What nonsense is this?

[-]

stoppableDissolution@reddit

I have the exact opposite experience lol. 5.5 is the first model ever that I don't feel the need to handhold.

[-]

Different_Fix_2217@reddit

Lol what? GPT 5.5 on extra high is legit next level. It can one-shot cutting-edge cases with little to no hand-holding, and it rarely if ever makes mistakes. Nothing else is even close.

[-]

SteppenAxolotl@reddit

I don't like the trends. I think it will take a lot more than 96GB VRAM to get LocalLLaMA to the leading edge of perf. Prices will get a lot worse before it gets better so local might not win in the near term.

The cost of the multi-TB VRAM needed to run locally is not something the vast majority of people could manage.

[-]

sn2006gy@reddit

I built a custom API that has some smarts around the DeepSeek Architecture - I may give this benchmark a try sometime next week and see if it helps bridge the gap. Compressed Sparse Attention needs some work and assistance in different ways than large dense models.

[-]

Anbeeld@reddit

I dropped DeepSeek V4 Pro after like a week of trying it out in real tasks. GLM 5.1 is just better.

[-]

Federal_Spend2412@reddit (OP)

Yes, but DS v4 Pro only passes 8% of tasks?!

[-]

Anbeeld@reddit

It hallucinates quite a lot so it's entirely possible.

[-]

Dry_Yam_4597@reddit

Claude may pass benchmarks but it's become incredibly annoying. It forgets or omits important stuff, fails to follow guidelines, struggles to implement even basic things without overcomplicating, adds tons of prose to comments, and so on. I basically can't use it without reviewing everything it does using Qwen 3.6, and sometimes it's actually quicker or better to either do stuff myself or have Qwen do it.

[-]

wllmsaccnt@reddit

I've found on really big asks that the higher effort settings sometimes balloon up the context with all the extra planning, testing, and concerns requiring extra turns clarifying. You really have to think carefully about the scope of an ask with Claude because it will attempt to do everything you ask for in a session, even if you are asking it to rewrite half of a medium or larger system.

[-]

UniqueAttourney@reddit

I am not sure about this benchmark tbh, I saw it on Theo's stream. I kinda felt it is trying to show more differences between models, like a logarithmic-scale type of difference.

It certainly triggers some FOMO for anyone using GLM 4.7 (not me of course). How much difference in usefulness does the GPT5.5 model make??

[-]

Foxiya@reddit

On my tasks (VHDL, C, Dart, Python) Gemini 3.1 Pro is much better than MiMo, Kimi and GLM. So this bench doesn't align with my experience.