Which AI or LLM can be used to process a folder of MKV files?
Posted by DesignerFlaws@reddit | LocalLLaMA
Which AI or LLM can be used to process a folder of MKV files and perform tasks like identifying scenes featuring a person riding a bicycle with a green helmet, then exporting the clip as an 8-second MP4 file?
Scary-Knowledgable@reddit
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding - https://vision-cair.github.io/LongVU/
Dead_Internet_Theory@reddit
The tool you want does not exist, but you can build it.
Mixtral and MiniCPM support video, I think, but not huge files; I believe they convert the video to frames. You could extract scenes from a video, ask these vision models to describe each scene, and then use an LLM to check whether something similar happened, and when.
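A rough sketch of that last matching step, assuming a local OpenAI-compatible server; the base URL, model name, and captions below are all placeholders:

```python
# Hypothetical matching step: given per-scene captions from a vision model,
# ask a text LLM which scenes match the query. Assumes a local
# OpenAI-compatible server; model name and captions are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

captions = {
    "scene_03": "A man rides a bicycle along a path, wearing a green helmet.",
    "scene_04": "Two people sit at a cafe table.",
}

for scene, caption in captions.items():
    resp = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{
            "role": "user",
            "content": f'Scene description: "{caption}"\n'
                       'Does this show a person riding a bicycle with a '
                       'green helmet? Answer only yes or no.',
        }],
    )
    print(scene, resp.choices[0].message.content.strip())
```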
ShengrenR@reddit
If you want object detection, you probably want YOLO (v8?), or if you're specifically after an 'LLM' you might check out https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model, which can handle video (look ~75% down the page for an example). https://github.com/rhymes-ai/Aria/blob/main/inference/notebooks/04_video_understanding.ipynb is their example notebook (vLLM version in the same folder).
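For the YOLO route, a minimal sketch with the ultralytics package; note that the pretrained COCO weights know 'person' and 'bicycle' but not 'green helmet', so flagged frames would still need a color check or a VLM pass:

```python
# Minimal YOLOv8 sketch (ultralytics package), using pretrained COCO weights.
# COCO can flag the person and the bicycle; "green helmet" is not a COCO
# class, so flagged frames still need a color check or a VLM pass.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

results = model("frame_000123.jpg")  # hypothetical extracted frame
labels = {model.names[int(c)] for c in results[0].boxes.cls}
if {"person", "bicycle"} <= labels:
    print("candidate frame: person + bicycle detected")
```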
BGFlyingToaster@reddit
PySceneDetect can help you detect scenes and create one file per scene. If you're not a coder, then just ask an LLM to write that for you. Then you have the second problem: creating text summaries of each video. There are several tools out there that claim to be able to do that, but I'm not familiar with any. You'd be looking for one with an API you can call, and that will be a service you must pay for. Then you store those summaries in a searchable database of some kind. That could be as simple as a single text file if there aren't a lot of videos; otherwise, go with a database (local or cloud). Then you might need an interface built to search the text, plus a player to jump to the correct place in the video. You can get an LLM to write all that code for you, but you'll still need to pull it together yourself.
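A rough sketch of that first scene-split step with PySceneDetect's v0.6+ API; file names are placeholders:

```python
# Rough sketch of the scene-split step with PySceneDetect (v0.6+ API);
# file names are placeholders.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

scene_list = detect("input.mkv", ContentDetector())
for start, end in scene_list:
    print(f"scene: {start.get_timecode()} -> {end.get_timecode()}")

split_video_ffmpeg("input.mkv", scene_list)  # one file per scene, via ffmpeg
```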
kryptkpr@reddit
ffmpeg to take screen grabs every 1, 2, or 5 seconds, depending on how long the scenes you're looking for are.
Run them in an offline batch through a good VLM; you can test the latest 72B LLaVA model here: https://llava-onevision.lmms-lab.com/
Once you know roughly where the hits are, traditional video DSP can be used to find where the scene starts/ends (threshold frame deltas), or you can use the VLM again at a finer-grained timescale.
ffmpeg again to extract the scenes.
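A sketch of the two ffmpeg steps via subprocess; the paths and timestamps are placeholders:

```python
# Sketch of the two ffmpeg steps via subprocess; paths/timestamps are
# placeholders and the frames/ directory must already exist.
import subprocess

# Screen grabs: fps=1/2 writes one frame every 2 seconds.
subprocess.run([
    "ffmpeg", "-i", "input.mkv",
    "-vf", "fps=1/2",
    "frames/grab_%06d.jpg",
], check=True)

# Extraction: once a hit is localized, cut an 8-second MP4. Re-encoding
# gives a frame-accurate cut; -c copy is faster but snaps to keyframes.
subprocess.run([
    "ffmpeg", "-ss", "00:01:23", "-i", "input.mkv",
    "-t", "8", "-c:v", "libx264", "-an", "clip.mp4",
], check=True)
```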
herozorro@reddit
it would take foreeeeeever
kryptkpr@reddit
This is a video task, and videos contain a TON of information so it's always a resampling game when you have to process them.
With a modern GPU doing both the video decoding and running the LLM, it shouldn't be too bad. The caveat is that offline batching is a must, along with chunked prefill... this is basically a prompt-processing task from the vision LLM's perspective.
An alternative approach to the pipeline proposed above would be to invert it: do the scene split up front, then label a single image per scene. That would mean less LLM work but much more video processing, so maybe not a win.
herozorro@reddit
Yeah, I know. It's just that I can't get used to the idea that these GPUs can rip through thousands and thousands of images (frames). It's mind-boggling.
kryptkpr@reddit
Multimodal is really cool in general, each frame becomes a couple hundred tokens, depending on how you size and slice things. 10k tok/sec of batch prompt processing on a modern GPU is feasible with an 8B vision model.
herozorro@reddit
What do you mean by size/slice? Do you mean crop and resize? Do they have a size limit?
kryptkpr@reddit
No limit, because of tiling, but that doesn't mean you can send in a gigapixel image and expect good things. There are often multiple resolutions available.
Here, for example, is what the gpt-4o docs on vision show:
You can see there that a single tile is 512x512 and is represented by either 85 or 170 tokens.
For a task such as this, I would preprocess the inputs to exactly a single tile and use high resolution.
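A back-of-envelope using the numbers from this thread (one high-resolution 512x512 tile is about 170 tokens, and ~10k tok/sec of batch prompt processing from the comment above):

```python
# Back-of-envelope with this thread's numbers: one high-res 512x512 tile
# ~ 170 tokens, ~10k tok/sec of batch prompt processing.
frames = 2 * 60 * 60 // 2      # 2-hour video sampled every 2 s -> 3600 frames
tokens = frames * 170          # ~612,000 vision tokens
seconds = tokens / 10_000      # ~61 seconds of prompt processing
print(frames, tokens, f"{seconds:.0f}s")
```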
NeverSkipSleepDay@reddit
greenhelmetgrabber.exe, but be sure to get the bike version, you can select on the project’s website what context it should be for (if you are a patron, otherwise you’ll always get the WW1 trench version afaik)
Acceptable_Username9@reddit
Yeah, super easy. Just toss the MKV files into Notepad or something—computers are basically built to recognize green helmets on bicycles these days. Should automatically cut it down to the exact 8-second clip you need, no extra tools required.
(used prompt engineering to generate the sarcasm; couldn't get it as good as yours)
NeverSkipSleepDay@reddit
Ah true notepad works too, keep forgetting how far Microsoft have come in their offering!
If I’m not mistaken it’s Ctrl+H for helmets and then it should pick up on the bike part and the green from context and your intent.
Zulfiqaar@reddit
You need to extract frames at sufficiently granular intervals, and then process them with a competent vision model to see if your content is present, then splice out the segment from there.
Majestic-Quarter-958@reddit
Implement your own thing, because it's very specific: something like transforming the video into images and then feeding them to an LLM. Here's a tool you can take code inspiration from: https://github.com/AIxHunter/FileWizardAi
Such_Advantage_6949@reddit
Basically, write your own code or use an LLM to help you write it. No LLM will magically do what you just described.
MikePounce@reddit
Here's how I would do it:
Input video -> Split into .jpg using FFMPEG with frame number in the filename -> Yolo/Florence/Moondream/llama3.2v -> Find first and last image to match your prompt -> Add a little padding before and after -> FFMPEG to keep only the relevant portion.
If your video is 24 FPS, there are 24 frames each second, so you might analyse every 12th frame (0.5 seconds).
https://huggingface.co/models?pipeline_tag=object-detection
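Hypothetical glue for the tail of that pipeline: mapping the matched frame numbers (parsed from the filenames) back to timestamps, then keeping only the relevant portion with a little padding. The frame numbers and file names below are made up:

```python
# Hypothetical glue: map matched frame numbers (from the filenames) back to
# timestamps, then keep only the relevant portion with 1 s of padding.
import subprocess

FPS = 24
first_match, last_match = 2880, 3024  # from the first/last matching filenames

start = max(first_match / FPS - 1.0, 0.0)
duration = (last_match / FPS + 1.0) - start

subprocess.run([
    "ffmpeg", "-ss", f"{start:.2f}", "-i", "input.mkv",
    "-t", f"{duration:.2f}", "-c:v", "libx264", "-an", "match.mp4",
], check=True)
```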
eaerdiablosios@reddit
u/MikePounce how would you identify the scene with the bicycle that has a green helmet?
MikePounce@reddit
That part would be handled by Yolo/Florence/Moondream/etc., or any vision-capable model. You could prompt something like "Is there a bicycle in this image? Only output yes or no".
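One way that yes/no check could look against an OpenAI-compatible vision endpoint (many local servers expose one); the URL and model name are placeholders:

```python
# A sketch of the yes/no check against an OpenAI-compatible vision
# endpoint; URL and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("grab_000042.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="local-vlm",  # placeholder
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is there a person riding a bicycle wearing a green "
                     "helmet in this image? Only output yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content.strip())
```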
eaerdiablosios@reddit
Wow, I learned something new today, I did not know that! I'll look into it. I wanted to try something similar on videos: script it so that it edits the videos based on the specific sections found. Thanks man!
daHaus@reddit
That's a very specific request lol
To do this, you would need to write a script that parses each file and creates a list of scene changes for each one, then pulls a frame using that information and passes it to something to try to identify it.
DesignerFlaws@reddit (OP)
Agreed, it's a complex task; I wasn't sure if services existed yet to achieve this. It would be useful when searching large video sets, including for video forensics.