DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Posted by Federal_Spend2412@reddit | LocalLLaMA

Is this accurate? I use DS v4 in OpenCode and find it nearly on par with Sonnet 4.6, so I'm surprised the score is so low.

https://deepswe.datacurve.ai/