How are people doing the whole video captioning and understanding thing?

Posted by Lazy-Pattern-5171@reddit | LocalLLaMA

I’ve not found a single model that’s trained on video as input. Is this just some smart cv2 algorithm design coupled with a multimodal model? Or do true video->text models exist that are close to SoTA and, more importantly, open source? That sounds pretty difficult, all things considered: you would need an input space of Text + Video + Audio or Text + Image + Audio, somehow synced together, that outputs text or audio and is instruction-tuned as well. Am I missing some critical information?
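To make the "cv2 plus a multimodal model" guess concrete, here is a minimal sketch of what the frame-sampling half of such a pipeline might look like. It is only an assumption about how this could be done, not a description of how any particular model or tool works. The function names `sample_frame_indices` and `extract_frames` are hypothetical, and the step that would pass the frames to an image-capable model is left out entirely.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Split the clip into equal-width segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 8):
    """Decode the sampled frames with OpenCV (hypothetical helper)."""
    import cv2  # imported lazily; requires the opencv-python package

    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_samples):
            # Seek to the chosen frame, then decode it.
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append((idx, frame))
        return frames
    finally:
        cap.release()
```

Under this assumption, the returned `(index, frame)` pairs would be sent to an image-capable multimodal model along with a text prompt. Audio would need a separate path, for example a speech-to-text transcript aligned by timestamp.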