Which AI or LLM can be used to process a folder of MKV files?
Posted by DesignerFlaws@reddit | LocalLLaMA
Which AI or LLM can be used to process a folder of MKV files and perform tasks like identifying scenes featuring a person riding a bicycle with a green helmet, then exporting the clip as an 8-second MP4 file?
Scary-Knowledgable@reddit
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding - https://vision-cair.github.io/LongVU/
Dead_Internet_Theory@reddit
The tool you want does not exist, but you can build it.
Mixtral and MiniCPM support video, I think, but not huge files; I believe they convert the video to frames. You could extract scenes from a video, ask these vision models to describe each scene, and then use an LLM to check whether something similar happened, and when.
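A rough sketch of that last matching step, assuming a local OpenAI-compatible server; the base URL, model name, and captions below are all placeholders:

```python
# Hypothetical matching step: given per-scene captions from a vision model,
# ask a text LLM which scenes match the query. Assumes a local
# OpenAI-compatible server; model name and captions are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

captions = {
    "scene_03": "A man rides a bicycle along a path, wearing a green helmet.",
    "scene_04": "Two people sit at a cafe table.",
}

for scene, caption in captions.items():
    resp = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{
            "role": "user",
            "content": f'Scene description: "{caption}"\n'
                       'Does this show a person riding a bicycle with a '
                       'green helmet? Answer only yes or no.',
        }],
    )
    print(scene, resp.choices[0].message.content.strip())
```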
ShengrenR@reddit
If you want object detection, you probably want YOLO (v8?), or if you're specifically after an 'LLM' you might check out https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model, which can handle video (look ~75% down the page for an example). https://github.com/rhymes-ai/Aria/blob/main/inference/notebooks/04_video_understanding.ipynb is their example notebook (vLLM version in the same folder).
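For the YOLO route, a minimal sketch with the ultralytics package; note that the pretrained COCO weights know 'person' and 'bicycle' but not 'green helmet', so flagged frames would still need a color check or a VLM pass:

```python
# Minimal YOLOv8 sketch (ultralytics package), using pretrained COCO weights.
# COCO can flag the person and the bicycle; "green helmet" is not a COCO
# class, so flagged frames still need a color check or a VLM pass.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

results = model("frame_000123.jpg")  # hypothetical extracted frame
labels = {model.names[int(c)] for c in results[0].boxes.cls}
if {"person", "bicycle"} <= labels:
    print("candidate frame: person + bicycle detected")
```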
BGFlyingToaster@reddit
PySceneDetect can help you detect scenes and create one file per scene. If you're not a coder, then just ask an LLM to write that for you. Then you have the second problem: creating text summaries of each video. There are several tools out there that claim to be able to do that, but I'm not familiar with any. You'd be looking for one with an API you can call, and that will be a service you must pay for. Then you store those summaries in a searchable database of some kind. That could be as simple as a single text file if there aren't a lot of videos; otherwise, go with a database (local or cloud). Then you might need an interface built to search the text, plus a player to jump to the correct place in the video. You can get an LLM to write all that code for you, but you'll still need to pull it together yourself.
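A rough sketch of that first scene-split step with PySceneDetect's v0.6+ API; file names are placeholders:

```python
# Rough sketch of the scene-split step with PySceneDetect (v0.6+ API);
# file names are placeholders.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

scene_list = detect("input.mkv", ContentDetector())
for start, end in scene_list:
    print(f"scene: {start.get_timecode()} -> {end.get_timecode()}")

split_video_ffmpeg("input.mkv", scene_list)  # one file per scene, via ffmpeg
```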
kryptkpr@reddit
ffmpeg to take screen grabs every 1, 2, or 5 seconds, depending on how long the scenes you're looking for are.
Run them in an offline batch through a good VLM; you can test the latest 72B LLaVA model here: https://llava-onevision.lmms-lab.com/
Once you know roughly where the hits are, traditional video DSP can be used to find where the scene starts/ends (threshold frame deltas), or you can use the VLM again at a finer-grained timescale.
ffmpeg again to extract the scenes.
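A sketch of the two ffmpeg steps via subprocess; the paths and timestamps are placeholders:

```python
# Sketch of the two ffmpeg steps via subprocess; paths/timestamps are
# placeholders and the frames/ directory must already exist.
import subprocess

# Screen grabs: fps=1/2 writes one frame every 2 seconds.
subprocess.run([
    "ffmpeg", "-i", "input.mkv",
    "-vf", "fps=1/2",
    "frames/grab_%06d.jpg",
], check=True)

# Extraction: once a hit is localized, cut an 8-second MP4. Re-encoding
# gives a frame-accurate cut; -c copy is faster but snaps to keyframes.
subprocess.run([
    "ffmpeg", "-ss", "00:01:23", "-i", "input.mkv",
    "-t", "8", "-c:v", "libx264", "-an", "clip.mp4",
], check=True)
```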
herozorro@reddit
it would take foreeeeeever
kryptkpr@reddit
This is a video task, and videos contain a TON of information so it's always a resampling game when you have to process them.
With a modern GPU doing both the video decoding and running the LLM, it shouldn't be too bad. The caveat is that offline batching is a must, along with chunked prefill... this is basically a prompt-processing task from the vision LLM's perspective.
An alternative approach to the pipeline proposed above would be to invert it: do the scene split up front, then label a single image per scene. That would mean less LLM work but much more video processing, so maybe not a win.
herozorro@reddit
Yeah, I know. It's just that I can't get used to the idea that these GPUs can rip through thousands and thousands of images (frames). It's mind-boggling.
kryptkpr@reddit
Multimodal is really cool in general, each frame becomes a couple hundred tokens, depending on how you size and slice things. 10k tok/sec of batch prompt processing on a modern GPU is feasible with an 8B vision model.
herozorro@reddit
What do you mean by size/slice? Do you mean crop and resize? Do they have a size limit?
kryptkpr@reddit
No limit, because of tiling, but that doesn't mean you can send in a gigapixel image and expect good things. There are often multiple resolutions available.
Here, for example, is what the gpt-4o docs on vision show:
You can see there that a single tile is 512x512 and is represented by either 85 or 170 tokens.
For a task such as this, I would preprocess the inputs to exactly a single tile and use high resolution.
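A back-of-envelope using the numbers from this thread (one high-resolution 512x512 tile is about 170 tokens, and ~10k tok/sec of batch prompt processing from the comment above):

```python
# Back-of-envelope with this thread's numbers: one high-res 512x512 tile
# ~ 170 tokens, ~10k tok/sec of batch prompt processing.
frames = 2 * 60 * 60 // 2      # 2-hour video sampled every 2 s -> 3600 frames
tokens = frames * 170          # ~612,000 vision tokens
seconds = tokens / 10_000      # ~61 seconds of prompt processing
print(frames, tokens, f"{seconds:.0f}s")
```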
NeverSkipSleepDay@reddit
greenhelmetgrabber.exe, but be sure to get the bike version, you can select on the project’s website what context it should be for (if you are a patron, otherwise you’ll always get the WW1 trench version afaik)
Acceptable_Username9@reddit
Yeah, super easy. Just toss the MKV files into Notepad or something—computers are basically built to recognize green helmets on bicycles these days. Should automatically cut it down to the exact 8-second clip you need, no extra tools required.
(used prompt engineering to generate the sarcasm; couldn't get it as good as yours)
NeverSkipSleepDay@reddit
Ah true notepad works too, keep forgetting how far Microsoft have come in their offering!
If I’m not mistaken it’s Ctrl+H for helmets and then it should pick up on the bike part and the green from context and your intent.
Zulfiqaar@reddit
You need to extract frames at sufficiently granular intervals, and then process them with a competent vision model to see if your content is present, then splice out the segment from there.
Majestic-Quarter-958@reddit
Implement your own thing, because it's very specific: something like transforming the video into images and then feeding them to an LLM. Here's a tool you can take code inspiration from: https://github.com/AIxHunter/FileWizardAi
Such_Advantage_6949@reddit
Basically, write your own code or use an LLM to help you write it. No LLM will magically do what you just described.
MikePounce@reddit
Here's how I would do it:
Input video -> Split into .jpg using FFMPEG with frame number in the filename -> Yolo/Florence/Moondream/llama3.2v -> Find first and last image to match your prompt -> Add a little padding before and after -> FFMPEG to keep only the relevant portion.
If your video is 24 FPS, there are 24 frames each second, so you might analyse every 12th frame (0.5 seconds).
https://huggingface.co/models?pipeline_tag=object-detection
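Hypothetical glue for the tail of that pipeline: mapping the matched frame numbers (parsed from the filenames) back to timestamps, then keeping only the relevant portion with a little padding. The frame numbers and file names below are made up:

```python
# Hypothetical glue: map matched frame numbers (from the filenames) back to
# timestamps, then keep only the relevant portion with 1 s of padding.
import subprocess

FPS = 24
first_match, last_match = 2880, 3024  # from the first/last matching filenames

start = max(first_match / FPS - 1.0, 0.0)
duration = (last_match / FPS + 1.0) - start

subprocess.run([
    "ffmpeg", "-ss", f"{start:.2f}", "-i", "input.mkv",
    "-t", f"{duration:.2f}", "-c:v", "libx264", "-an", "match.mp4",
], check=True)
```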
eaerdiablosios@reddit
u/MikePounce how would you identify the scene with the bicycle that has a green helmet?
MikePounce@reddit
That part would be handled by Yolo/Florence/Moondream/etc., or any vision-capable model. You could prompt something like "Is there a bicycle in this image? Only output yes or no".
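One way that yes/no check could look against an OpenAI-compatible vision endpoint (many local servers expose one); the URL and model name are placeholders:

```python
# A sketch of the yes/no check against an OpenAI-compatible vision
# endpoint; URL and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("grab_000042.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="local-vlm",  # placeholder
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is there a person riding a bicycle wearing a green "
                     "helmet in this image? Only output yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content.strip())
```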
eaerdiablosios@reddit
Wow, I learned something new today, I did not know that! I'll look into it. I wanted to try something similar on videos: script it so that it edits the videos based on the specific sections found. Thanks man!
daHaus@reddit
That's a very specific request lol
To do this, you would need to write a script that parses each file and creates a list of scene changes for each one, then pulls a frame using that information and passes it to something to try to identify it.
DesignerFlaws@reddit (OP)
Agreed, it's a complex task; I wasn't sure if services existed yet to achieve this. It would be useful when searching large video sets, including for video forensics.