A moment of thanks for DeepSeek
Posted by DeltaSqueezer@reddit | LocalLLaMA | 3 comments
Even when I'm not using their models, they're sharing R&D that benefits the whole ecosystem and consumers, especially the work that makes AI cheaper and more efficient. And by setting low prices, they push costs down and reduce prices for us all.
NineThreeTilNow@reddit
V4 has some really suspect design decisions that I don't fully understand to this day. They pushed inference price into the floor, but did it at the cost of "intelligence".
It seems hyper optimized to be "mid" as this generation seems to call it... while being hyper cheap.
It seems like it needs a larger vocabulary if you're going to train THAT long. Further, the sliding local attention is TINY: 128 tokens with a tiny vocab is not a very dense information slice.
Gemma 4 by comparison uses sliding local attention at 1024 tokens (I think?) on 4 out of 5 layers, with a ~256k vocab.
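To make the window comparison concrete, here is a minimal sketch of a causal sliding-window attention mask. It is a generic illustration, not either model's actual implementation. The function names are made up for this example, the 128 and 1024 figures are just the numbers quoted in this comment, and whether the window counts the current token is an assumption.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] = True when query i may attend to key j.

    Assumption: causal local attention where token i sees itself
    plus the previous (window - 1) tokens.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_tokens(position, window):
    """Number of keys a query at 0-based `position` can attend to."""
    return min(position + 1, window)


# Deep into a long context, a local layer only ever sees `window` tokens.
small = visible_tokens(5000, 128)    # window size quoted for V4
large = visible_tokens(5000, 1024)   # window size quoted for Gemma 4
```

Under these assumptions, a local layer with a 1024-token window sees 8x as many tokens per query as one with a 128-token window once the context is longer than the window.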
Gemma 4 31b could be massively improved with some of DeepSeek's and Moonshot's research, specifically their transformer-based residual streams, and possibly by taking a real shot at implementing Engram.
There's some quantization-aware training (QAT) stuff that I believe is in V4 as well, which could help the quants of Gemma 31b. The QAT is done so that when they quantize DS V4 for inference, the model loses less performance. It's a well-understood concept, but I think DeepSeek is the first large lab to document it at a decent level. I believe it was developed under BitNet from Microsoft? Or the idea was kicked around and MS used it with BitNet.
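For anyone unfamiliar with QAT, the core trick is "fake quantization": during training the forward pass rounds weights to the low-precision grid, so the model learns to tolerate the rounding error it will see at inference. Below is a minimal sketch of that rounding step only. It is a generic symmetric per-tensor scheme, not DeepSeek's documented recipe, and the function name and 4-bit default are assumptions for illustration.

```python
def fake_quantize(weights, bits=4):
    """Round floats to a symmetric signed integer grid, then map back.

    Assumption: one scale for the whole tensor (per-tensor scaling).
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)                   # nearest grid point
        q = max(-qmax, min(qmax, q))           # clamp to the grid range
        out.append(q * scale)                  # back to float
    return out


quantized = fake_quantize([1.0, -1.0, 0.0, 0.3], bits=4)
```

In an actual QAT loop, gradients are usually passed through the rounding step as if it were the identity (the straight-through estimator) while the full-precision weights are the ones being updated. That part is omitted here.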
Lastly, DeepSeek's V4 is like V3.2: it uses something like 5x the tokens to solve the same problem other models solve. This is classic overthinking of a problem by these models.
So price adjusted, you have to multiply by 5, because the actual token count per hard problem is a lot higher.
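The price adjustment is simple arithmetic: cost per problem is price per token times tokens used. Here is a minimal sketch with made-up numbers. The prices and token counts are hypothetical, not measured, and the 5x factor is just the estimate from this comment.

```python
def effective_cost(price_per_million_tokens, tokens_used):
    """Dollar cost of one problem at a given per-million-token price."""
    return price_per_million_tokens * tokens_used / 1_000_000


def price_adjusted_ratio(price_a, tokens_a, price_b, tokens_b):
    """Cost of model A relative to model B on the same problem."""
    return effective_cost(price_a, tokens_a) / effective_cost(price_b, tokens_b)


# Hypothetical: model A is 5x cheaper per token but uses 5x the tokens.
cheap_but_verbose = effective_cost(1.0, 50_000)
pricey_but_terse = effective_cost(5.0, 10_000)
ratio = price_adjusted_ratio(1.0, 50_000, 5.0, 10_000)
```

With these illustrative inputs the two models cost the same per problem, so a 5x per-token discount is fully cancelled by a 5x token multiplier. Whether the multiplier is really 5x would need to be measured across multiple tasks.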
graypasser@reddit
So... did you actually calculate and compare token usage across multiple tasks, or is it "vibe writing"?
goldcakes@reddit
Reminded me of this meme, 10x developer at DeepSeek: https://youtube.com/shorts/yMYUDN4Tfws
But yeah, serious props to DeepSeek for being one of the most open labs when it comes to sharing research. Plus R1 really did change the landscape in a good way.