How are people doing the whole video captioning and understanding thing?

Posted by Lazy-Pattern-5171@reddit | LocalLLaMA | View on Reddit | 2 comments

I’ve not found a single model that’s trained on video as input. Is this just some smart Cv2 algorithm design coupled with using a multimodal model? Or do there exist true video->text models that are close to SoTa and more importantly they’re open source. That sounds pretty difficult all things considered I mean you would need an input space of Text + Video + Audio or Text + Image + Audio somehow synched together to then output text or audio and then be instruct tuned as well. Am I missing some critical information?

2 Comments

[-]

Lissanro@reddit

Qwen2.5 VL 72B and 7B supports video input: [https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) [https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) >**Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and this time it has a new ability of cpaturing event by pinpointing the relevant video segments.

BusRevolutionary9893@reddit

I assume most are just using FFMPEG to grab a frame every couple of seconds instead of bothering with video.

Reply to Post

2 Comments