Deepseek Vision Coming
Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 41 comments
From Xiaokang Chen on 𝕏: https://x.com/PKUCXK/status/2049066514284962040
AnomalyNexus@reddit
What do people actually use vision for?
faizalmzain@reddit
my app uses it for auto-saving receipts / pay slips / bank statements
ritonlajoie@reddit
to view
RegisteredJustToSay@reddit
It tightens feedback loops for many many many tasks by allowing visual input rather than needing structured data (which can be really hard to obtain, too).
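A minimal sketch of that kind of loop, assuming an OpenAI-compatible endpoint with vision support; the base URL, API key, and model name below are placeholders, not anything DeepSeek has shipped:

```python
# Feed a raw screenshot to a vision model instead of hand-building
# structured input. Endpoint and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="some-vision-model",  # placeholder name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong in this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```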
Far_Cat9782@reddit
Helps so much in debugging. You can show it a screenshot of the problem, especially if the problem has console errors. Design UI to how you like it, etc.
PoccaPutanna@reddit
In Cursor, taking a screenshot is usually much faster than writing out the content of an app or web page, for both development and debugging. I don't even consider models without vision. Also, it's very useful for cataloguing images and videos for datasets.
Voxandr@reddit
Making apps from screenshots, very powerful that way.
And document OCR.
Enough-Astronaut9278@reddit
been running v4-flash on my agent setup all week. honestly the 1M context is the real upgrade here, my long tasks stopped breaking halfway through. pro is overkill for most things but flash at that price? no brainer.
Few_Painter_5588@reddit
They have the base models already, so that's most of the work done infrastructure-wise. Multimodality is usually baked in after the pretraining stage.
So the time between DeepSeek V4-Preview and V4 proper will probably not be that long, especially since DeepSeek V4 was deployed some 2-3 weeks ago.
aeroumbria@reddit
Honestly, I would have assumed that by now first-class vision training would be more seriously experimented with, rather than vision being left as second class.
segmond@reddit
it's not second class for them, check out their OCR model and papers on vision. They have a clue, they're just not in a pissing contest.
Recoil42@reddit
Everyone's speculating here, but I really think they did get (rightfully) sidetracked with the Huawei thing.
aeroumbria@reddit
I didn't mean how much importance they give vision, but rather how vision is technically trained. It was my impression that training a model with equal treatment of vision and language from the start would be the natural next step after training vision as a bolted-on component following language training.
ObsidianNix@reddit
deepseek-ocr model was one of the best when it came out. I’m sure it’s still up there versus current models.
zball_@reddit
They were already solving training instabilities while doing DeepSeek V4. I can't imagine what they will encounter when training a VLM from the start with all those novel architectures.
Few_Painter_5588@reddit
They probably found the performance lacking and culled the feature. The leaks for v4 all said that v4-lite was going to be multimodal. If they do implement vision in v4 proper or v4.1, it'll probably only be on the v4-lite model.
Arcosim@reddit
I'm currently super excited about V4. Everything points to it being heavily undertrained, which means we're going to see huge jumps in capabilities over the next few months.
Few_Painter_5588@reddit
My understanding is that V4-Flash-Preview was trained properly, while V4-Pro-Preview was underbaked. So V4-Pro has potential.
NerasKip@reddit
Training from a base model isn't as efficient as doing it from the start... seems strange.
Few_Painter_5588@reddit
No, most multimodal models are text-only during pretraining. Adding multimodal data at that stage has no real benefit.
NerasKip@reddit
But the AI has to be trained heavily, no? I've seen models lose a lot when you add a VLM layer for image generation; then you have to retrain it from there.
Few_Painter_5588@reddit
Correct, you bolt on the visual layers and then you continue training on that.
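For illustration, a rough sketch of that bolt-on step (LLaVA-style, not DeepSeek's actual recipe; the dimensions and class name here are made up):

```python
# A pretrained vision encoder produces patch features; a small projector
# maps them into the language model's embedding space so they can be
# interleaved with text tokens. Typically the projector is trained first,
# then (parts of) the LM are unfrozen and training continues on mixed
# vision-text data.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns "image tokens": (batch, num_patches, lm_dim)
        return self.proj(vision_features)
```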
Zymedo@reddit
Isn't Kimi K2.5 natively multimodal because Moonshot found that it yields better results than later-stage training?
Few_Painter_5588@reddit
Not really, here's the table from their paper. The differences are too small, given the non-deterministic nature of LLMs, amongst other issues.
Table 1: Performance comparison across different vision-text joint-training strategies. Early fusion with a lower vision ratio yields better results given a fixed total vision-text token budget.
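One way to make that caption precise, with notation introduced here rather than taken from the paper:

```latex
% Fixed total token budget B, split between vision and text tokens:
B = N_v + N_t, \qquad r = \frac{N_v}{B}
% The claim: at fixed B, early fusion with a smaller vision ratio r
% outperforms the other joint-training strategies compared in Table 1.
```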
dampflokfreund@reddit
Yes, and it also improves generalisation; even text performance increases because the model has a broader understanding of topics. Images say more than a thousand words, after all; it's more data. I believe there was a paper on that from another Chinese model maker.
VotZeFuk@reddit
Man, I just want a properly functioning GGUF for the Flash version supported in llama.cpp. Why does it seem like no one really cares about it (I mean, the developers / big contributors), unlike how it was with that Qwen3 Next thing?
zball_@reddit
Architecture too novel.
NickCanCode@reddit
Your link is not working.
`Hmm...this page doesn’t exist. Try searching for something else.`
Nunki08@reddit (OP)
Xiaokang Chen deleted his post.
Alternative-Row-5439@reddit
Is that a good or bad sign?
coder543@reddit
generally it is one of those, yes
ritonlajoie@reddit
the old yesarrooo
po_stulate@reddit
How many trillion parameters is it?
ComplexType568@reddit
I think V4 Pro is 1.6T and Flash is like 284B? (0.3T)
RegisteredJustToSay@reddit
Sweet! Always loved deepseek models but was forced to switch to others due to lack of native multimodality. I welcome the chance to start using these again.
Right-Law1817@reddit
I am expecting vision by 5th May.
Worried-Squirrel2023@reddit
hoping for a native multimodal v4.1, not a separate vision branch. separate models for image and text is how qwen ended up with 5 model variants nobody can keep straight.
silenceimpaired@reddit
Who could have seen this coming? Not Deepseek... At least not yet.
createthiscom@reddit
V4 being multimodal would be a big deal. It would be awesome to have a local frontier model with vision.
AykutSek@reddit
link's dead but excited to see what they ship.
dampflokfreund@reddit
Hope it's not separate models, but a V4.1 with native multimodality. If they release dedicated vision models now, they didn't get why people ask for native multimodality in the first place.