Deepseek Vision Coming
Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 41 comments
From Xiaokang Chen on 𝕏: https://x.com/PKUCXK/status/2049066514284962040
AnomalyNexus@reddit
What do people actually use vision for?
faizalmzain@reddit
my app uses it for auto-saving receipts / pay slips / bank statements
ritonlajoie@reddit
to view
RegisteredJustToSay@reddit
It tightens feedback loops for many many many tasks by allowing visual input rather than needing structured data (which can be really hard to obtain, too).
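A minimal sketch of that kind of loop, assuming an OpenAI-compatible endpoint with vision support; the base URL, API key, and model name below are placeholders, not anything DeepSeek has shipped:

```python
# Feed a raw screenshot to a vision model instead of hand-building
# structured input. Endpoint and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="some-vision-model",  # placeholder name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong in this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```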
Far_Cat9782@reddit
Helps so much in debugging. You can show it a screenshot of the problem, especially if the problem has console errors. Design UI to how you like it, etc.
PoccaPutanna@reddit
In Cursor, taking a screenshot is usually much faster than writing out the content of an app or web page, for both development and debugging. I don't even consider models without vision. Also, it's very useful for cataloguing images and videos for datasets.
Voxandr@reddit
Making apps from screenshots, very powerful that way.
And document OCR.
Enough-Astronaut9278@reddit
been running v4-flash on my agent setup all week. honestly the 1M context is the real upgrade here, my long tasks stopped breaking halfway through. pro is overkill for most things but flash at that price? no brainer.
Few_Painter_5588@reddit
They have the base models already, so that's most of the work done infrastructure-wise. Multimodality is usually baked in after the pretraining stage.
So the time between DeepSeek V4-Preview and V4 proper will probably not be that long, especially since DeepSeek V4 was deployed some 2-3 weeks ago.
aeroumbria@reddit
Honestly, I would have assumed that by now first-class vision training would be more seriously experimented with, rather than vision being left as second class.
segmond@reddit
it's not second class for them, check out their OCR model and papers on vision. They have a clue, they're just not in a pissing contest.
Recoil42@reddit
Everyone's speculating here, but I really think they did get (rightfully) sidetracked with the Huawei thing.
aeroumbria@reddit
I didn't mean how much importance they give vision, but rather how vision is technically trained. It was my impression that training a model with equal treatment of vision and language from the start would be the natural next step after training vision as a bolted-on component following language training.
ObsidianNix@reddit
deepseek-ocr model was one of the best when it came out. I’m sure it’s still up there versus current models.
zball_@reddit
They were already solving training instabilities while doing DeepSeek V4. I can't imagine what they will encounter when training a VLM from the start with all those novel architectures.
Few_Painter_5588@reddit
They probably found the performance lacking and culled the feature. The leaks for v4 all said that v4-lite was going to be multimodal. If they do implement vision in v4 proper or v4.1, it'll probably only be on the v4-lite model.
Arcosim@reddit
I'm currently super excited about V4. Everything points to it being heavily undertrained, which means we're going to see huge jumps in capabilities over the next few months.
Few_Painter_5588@reddit
My understanding is that V4-Flash-Preview was trained properly, while V4-Pro-Preview was underbaked. So V4-Pro has potential.
NerasKip@reddit
Training from a base model isn't as efficient as doing it from the start... seems strange.
Few_Painter_5588@reddit
No, most multimodal models are text-only during pretraining. Adding multimodal data at that stage has no real benefit.
NerasKip@reddit
But the AI has to be trained heavily, no? I've seen models lose a lot when you add a VLM layer for image generation; then you have to retrain it from there.
Few_Painter_5588@reddit
Correct, you bolt on the visual layers and then you continue training on that.
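For illustration, a rough sketch of that bolt-on step (LLaVA-style, not DeepSeek's actual recipe; the dimensions and class name here are made up):

```python
# A pretrained vision encoder produces patch features; a small projector
# maps them into the language model's embedding space so they can be
# interleaved with text tokens. Typically the projector is trained first,
# then (parts of) the LM are unfrozen and training continues on mixed
# vision-text data.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns "image tokens": (batch, num_patches, lm_dim)
        return self.proj(vision_features)
```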
Zymedo@reddit
Isn't Kimi K2.5 natively multimodal because Moonshot found that it yields better results than later-stage training?
Few_Painter_5588@reddit
Not really, here's the table from their paper. The differences are too small, given the non-deterministic nature of LLMs, amongst other issues.
Table 1: Performance comparison across different vision-text joint-training strategies. Early fusion with a lower vision ratio yields better results given a fixed total vision-text token budget.
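One way to make that caption precise, with notation introduced here rather than taken from the paper:

```latex
% Fixed total token budget B, split between vision and text tokens:
B = N_v + N_t, \qquad r = \frac{N_v}{B}
% The claim: at fixed B, early fusion with a smaller vision ratio r
% outperforms the other joint-training strategies compared in Table 1.
```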
dampflokfreund@reddit
Yes, and it also improves generalisation; even text performance increases because the model has a broader understanding of topics. Images say more than a thousand words, after all; it's more data. I believe there was a paper on that from another Chinese model maker.
VotZeFuk@reddit
Man, I just want a properly functioning GGUF for the Flash version supported in llama.cpp. Why does it seem like no one really cares about it (I mean, the developers / big contributors), unlike how it was with that Qwen3 Next thing?
zball_@reddit
Architecture too novel.
NickCanCode@reddit
Your link is not working.
`Hmm...this page doesn’t exist. Try searching for something else.`
Nunki08@reddit (OP)
Xiaokang Chen deleted his post.
Alternative-Row-5439@reddit
Is that a good or bad sign?
coder543@reddit
generally it is one of those, yes
ritonlajoie@reddit
the old yesarrooo
po_stulate@reddit
How many trillion parameters is it?
ComplexType568@reddit
I think V4 Pro is 1.6T and Flash is like 284B? (0.3T)
RegisteredJustToSay@reddit
Sweet! Always loved deepseek models but was forced to switch to others due to lack of native multimodality. I welcome the chance to start using these again.
Right-Law1817@reddit
I am expecting vision by 5th May.
Worried-Squirrel2023@reddit
hoping for a native multimodal v4.1, not a separate vision branch. separate models for image and text is how qwen ended up with 5 model variants nobody can keep straight.
silenceimpaired@reddit
Who could have seen this coming? Not Deepseek... At least not yet.
createthiscom@reddit
V4 being multimodal would be a big deal. It would be awesome to have a local frontier model with vision.
AykutSek@reddit
link's dead but excited to see what they ship.
dampflokfreund@reddit
Hope it's not separate models, but a V4.1 with native multimodality. If they release dedicated vision models now, they didn't get why people ask for native multimodality in the first place.