How are people doing the whole video captioning and understanding thing?
Posted by Lazy-Pattern-5171@reddit | LocalLLaMA | 2 comments
I’ve not found a single model that’s trained on video as input.
Is this just some smart cv2 (OpenCV) pipeline coupled with a multimodal model? Or are there true video->text models that are close to SOTA and, more importantly, open source?
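To make the first option concrete, here is a rough sketch of what a "cv2 plus multimodal model" pipeline might look like on the sampling side: pick a handful of evenly spaced frames and hand them to an image-capable model as a sequence of stills. The function names `uniform_frame_indices` and `sample_frames` are made up for illustration, the midpoint sampling strategy is just one reasonable choice, and the model call itself is left out because it depends entirely on which model you run.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return up to num_samples frame indices spread evenly across the video.

    Each index is the midpoint of one of num_samples equal-length segments,
    so the very first and very last frames (often black or title cards)
    are not over-represented.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short clip: just use every frame.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int((i + 0.5) * segment) for i in range(num_samples)]


def sample_frames(video_path, num_samples=8):
    """Read the selected frames from a video file using OpenCV.

    Assumes the opencv-python package is installed. Imported lazily so the
    index helper above works without it.
    """
    import cv2

    cap = cv2.VideoCapture(video_path)
    try:
        # Note: the reported frame count can be approximate for some containers.
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in uniform_frame_indices(total, num_samples):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)  # BGR numpy array
        return frames
    finally:
        cap.release()
```

The returned frames would then be encoded (for example as JPEG) and passed, in order, to whatever multimodal model you are using along with a prompt such as "describe what happens across these frames". Uniform sampling is the simplest choice; it will miss fast events between samples, which is one reason this approach is only an approximation of real video understanding.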
That sounds pretty difficult, all things considered. You would need an input space of Text + Video + Audio, or Text + Image + Audio, somehow synced together, to then output text or audio, and the model would need to be instruct-tuned as well.
Am I missing some critical information?
2 Comments
Lissanro@reddit
BusRevolutionary9893@reddit