Microsoft silently releases OmniParser, a tool to convert screenshots into structured and easy-to-understand elements for Vision Agents
Posted by umarmnaq@reddit | LocalLLaMA | View on Reddit | 84 comments
arthurwolf@reddit
Oh wow, I've spent 3 months of my life doing exactly this, but for comic book pages instead of phone screenshots.
Like, detect panels, bubbles, faces, bodies, eyes, sound effects, speech bubble tails, etc., all so they can be fed to GPT-4V and it can reflect on them and use them to better understand what's going on in a given comic book page.
My thing was a lot of work. I think this one is a bit more straightforward all in all, but still pretty impressive.
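(For a rough idea of what "feeding detected elements to a vision model" can look like, here is a minimal sketch using the OpenAI Python SDK; it is not the author's code, and the element format is made up.)

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_page(page_png_path: str, elements: list[dict]) -> str:
    """Send a comic page plus pre-detected elements to a vision model.

    `elements` is a hypothetical list like
    [{"type": "panel", "bbox": [x, y, w, h]}, {"type": "bubble", ...}].
    """
    with open(page_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Here is a comic page with these detected elements: {elements}. "
                         "Describe what is going on in the page."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```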
frammie-@reddit
Hey there arthur,
Maybe you aren't aware, but there has been a niche effort doing exactly this kind of thing.
It's called magi (v2) and it's on huggingface right here: https://huggingface.co/ragavsachdeva/magiv2
Might be worth looking into
arthurwolf@reddit
Thanks a lot, I'll look into it.
WAHNFRIEDEN@reddit
In what way is yours more advanced? Can it be run locally, or does it use cloud-based LLMs? Is it a model, or an orchestration of other non-specialized models?
arthurwolf@reddit
It does a bunch more analysis, like understanding the actual story, the actual characters and their properties, the tails of bubbles and who they point at, sound effects, and the context for each panel. It's designed as an orchestration of a bunch of custom-trained models. I haven't worked on it for months; I got pretty far on the analysis part, but the rest of the project was just too much for one person.
nodeocracy@reddit
Message Microsoft and get yourself a job there
arthurwolf@reddit
I'm from the Linux crowd, if I got a job at Microsoft, the other bearded weirdos would likely murder me at the next bearded weirdo meetup.
:)
soothaa@reddit
MS has had a heavy Linux push recently; it's not what it used to be.
pushkin0521@reddit
They have a whole army of PhDs and Nobel-candidate-level hires stuffed in their labs, and get a hundred times more applicants from the Ivy Leagues. Why bother with a no-name otaku?
bucolucas@reddit
If I was able to get hired there, honestly anyone can.
Dazzling_Wear5248@reddit
What did you do?
bucolucas@reddit
Get fired
arthurwolf@reddit
Congrats. Doing LLM stuff?
bucolucas@reddit
hahahaaa no
Key_Extension_6003@reddit
Sounds cool. Any plans to open source this or offer a SaaS model?
arthurwolf@reddit
If I ever get to something usable, which isn't very likely considering how massive of a project it is.
RnRau@reddit
I would love to learn how you structure your prompts to do these things. Instead of releasing what you have done, perhaps write a gentle guide to prompt engineering for detecting visual elements.
I would have no idea how to start something like this, but I would love to learn, and I think a lot of others would too.
arthurwolf@reddit
Here are some of the templates the system uses: https://gist.github.com/arthurwolf/d44bfc8d8aa2c4c98b230ab9ab4a4661
Note that a lot of the stuff you see between {{brackets}} gets replaced by the system with info from the database and/or previous prompt runs and/or previous analysis.
RnRau@reddit
Appreciate it mate! Cheers!
Key_Extension_6003@reddit
Yeah, I've often pondered doing this for webtoons, which is even harder. I've not really used visual LLMs though, so it's been a whim rather than a plan.
Good luck with your project!
arthurwolf@reddit
You should try it out, you'll likely get further than you expect; LLMs can sort of be like magic for this stuff.
TheManicProgrammer@reddit
No reason to give up :)
arthurwolf@reddit
Well. The entire project is a manga-to-anime pipeline. And I'm pretty sure before I'm done with the project, we'll have SORA-like models that do everything my project does, but better, and in one big step... So, good reasons to give up. But I'm having fun, so I won't.
IJOY94@reddit
Do you decompose the comic into its separate pieces? How do you handle "sound effects" that are normally not bubbled? Do you have a way to extract them (especially when they have a texture applied)?
arthurwolf@reddit
Yep. Panels, faces, bodies, bubbles, tails, sound effects, etc. I have trained models for all of them pretty much.
They are a special type of bubble; they are recognized by the same model that handles bubbles.
Sure. I use segment-anything to segment the page, and then a custom trained model to classify each segment.
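(A minimal sketch of that segment-then-classify step, assuming the segment-anything package and a stand-in classify_segment() function in place of the custom-trained classifier; not the author's actual code.)

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM (v1) checkpoint, as mentioned above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def extract_elements(page_path, classify_segment):
    """Segment a comic page and label each segment with a custom classifier.

    `classify_segment` stands in for the custom-trained model; it takes an image
    crop and returns a label like "panel", "bubble", "face", "sfx", or "other".
    """
    image = cv2.cvtColor(cv2.imread(page_path), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # dicts with 'bbox', 'segmentation', 'area', ...
    elements = []
    for m in masks:
        x, y, w, h = map(int, m["bbox"])  # bbox is in XYWH format
        label = classify_segment(image[y:y + h, x:x + w])
        if label != "other":
            elements.append({"type": label, "bbox": [x, y, w, h]})
    return elements
```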
CheatCodesOfLife@reddit
I wonder how many of us are trying to build exactly this :D
I've got mine to the point where it's like those AI YouTube videos where they have an AI voice 'recapping' manga, but on the low end of that (forgetting which character is which, lots of GPT-isms, etc.)
Same here, but I'm giving it less attention now.
arthurwolf@reddit
wolf.arthur@gmail.com. We really should talk and exchange tips/tricks. Are you on Telegram, Wire, something like that?
I've actually contacted people running those channels, and have been chatting with one of them, learned a lot from it.
smulfragPL@reddit
I think a much better use of the technology you developed is contextual translation of manga. Try pivoting to that
CheatCodesOfLife@reddit
I've got a pipeline set up to do this with my hobby project. It automatically extracts the text, whites it out from the image, and stores the coordinates of each text bubble. I don't know where to source the raw manga, though, and the translation isn't always accurate.
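(For illustration, the white-out step can be as simple as the sketch below, assuming you already have the text-region boxes; the (x, y, w, h) box format is my own choice, not this commenter's pipeline.)

```python
import cv2
import numpy as np

def white_out_text(image_path, boxes):
    """Remove detected text regions from a page, keeping their coordinates.

    `boxes` are (x, y, w, h) rectangles from whatever text detector / OCR is used.
    Returns the cleaned page plus the boxes, for re-lettering with the translation later.
    """
    img = cv2.imread(image_path)
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 255
    # Inpainting fills the masked area from surrounding pixels; for plain
    # white bubbles a simple fill with white would also do.
    cleaned = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
    return cleaned, list(boxes)
```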
arthurwolf@reddit
Yeah, that's where the context (understanding who said what, and what happened in previous panels) helps a lot, especially if an LLM is doing the translation.
I might try to get the system to do translation, and see how it goes...
Tramagust@reddit
Sounds like you should put it up on github so the community can accelerate it. You can still make money off it by providing compute.
arthurwolf@reddit
I might at some point, once it starts being useful, yeah...
KarnotKarnage@reddit
That seems like an awesome, albeit completely gigantic, project!
Do you have a blog or repo you share stuff on? Would love to take a look.
arthurwolf@reddit
I might, at some point, publish videos about this on my Youtube channel: https://www.youtube.com/@ArthurWolf
NeverSkipSleepDay@reddit
You will have such fine control over everything, keep going mate
Powerful_Brief1724@reddit
Got any github or place I can follow your project? It's really cool!
arthurwolf@reddit
My github is https://github.com/arthurwolf/ but I'm not publishing any manga stuff there so far.
I might make videos about this at some point: https://www.youtube.com/@ArthurWolf
bfume@reddit
You accomplished this with just prompting? Care to share an early version of your prompts? I'd love to learn the techniques, but it's hard to learn from books; examples and "real" material are easier and what I prefer.
arthurwolf@reddit
Not just prompting. I've trained models to recognize stuff like panels and bubbles (though modern visual LLMs look like they should be able to handle some of that), and there's a ton of logic and tools I had to develop around it.
But a lot of the hard work is done by GPT-4V and general LLM processing, yes.
I put some of the prompt templates in here for the curious: https://gist.github.com/arthurwolf/d44bfc8d8aa2c4c98b230ab9ab4a4661
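(As a side note, the {{placeholder}} filling described earlier only needs a few lines; a sketch with made-up field names, not the ones in the gist.)

```python
import re

def fill_template(template, values):
    """Replace {{name}} placeholders with values pulled from the database or
    previous prompt runs, leaving unknown placeholders untouched."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  template)

prompt = fill_template(
    "Panel {{panel_number}} shows: {{panel_description}}. Who is speaking?",
    {"panel_number": 3, "panel_description": "two characters arguing"},
)
```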
Xeon06@reddit
It seems like their tool is meant to understand computer screenshots? What am I missing that nullifies your work with comics?
arthurwolf@reddit
It doesn't nullify my work with comics. I'm just saying I expect my work to at some point be nullified as general purpose models improve.
Boozybrain@reddit
What was your general process for training? This is an interesting CV problem due to the more organic and irregular shapes across panels.
arthurwolf@reddit
So for panels, I do the following.
I use segment-anything (the previous version, not moved to the latest yet) to split the page into segments.
Then I use a model I trained to figure out which segments are panels, and which are not.
The training data for the panel classifier is previous comics for which I did the work manually.
It figures the panels out with something like 98% accuracy, but I still have to manually fix a few things.
It then also figures out the order of the panels. That's an interesting bit too, I looked up published papers/algos to do this, and none were accurate enough, so I wrote my own, which is better than anything I found published online (there's still one edge case it can't do, but I know how to fix it, I just haven't yet because it's not worth the effort at this point).
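(He doesn't share the ordering algorithm itself; for reference, a naive reading-order heuristic, not the author's and prone to exactly the kind of edge cases he mentions, could look like this.)

```python
def manga_reading_order(panels, row_tolerance=0.5):
    """Order panel bounding boxes top-to-bottom, then right-to-left per row.

    `panels` is a list of {"bbox": [x, y, w, h]} dicts. Manga reads right-to-left;
    flip the inner sort for western comics.
    """
    if not panels:
        return []
    heights = sorted(p["bbox"][3] for p in panels)
    median_h = heights[len(heights) // 2]

    # Group panels into rows by vertical-center proximity.
    rows = []
    for p in sorted(panels, key=lambda p: p["bbox"][1] + p["bbox"][3] / 2):
        cy = p["bbox"][1] + p["bbox"][3] / 2
        if rows and abs(cy - rows[-1]["cy"]) < row_tolerance * median_h:
            rows[-1]["panels"].append(p)
        else:
            rows.append({"cy": cy, "panels": [p]})
        rows[-1]["cy"] = cy  # track the latest center seen in the current row

    # Within each row, read from right to left (largest x first).
    ordered = []
    for row in rows:
        ordered.extend(sorted(row["panels"], key=lambda p: -p["bbox"][0]))
    return ordered
```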
StaplerGiraffe@reddit
Have you considered turning your project into a manga to audiobook pipeline? It sounds like you have the image analysis done, and turning that into a script for an audiobook sounds feasible. Such a project would allow blind people to "read" manga, making the world a tiny bit better for them, even if it is not working perfectly.
arthurwolf@reddit
Yeah several people here suggested that, and I'll probably look into it.
FpRhGf@reddit
I was wondering if a tool like this exists. It'll be so useful for doing research analysis on graphic novels.
arthurwolf@reddit
I can probably share part of it, don't hesitate to email wolf.arthur@gmail.com
msbeaute00000001@reddit
Can you elaborate on what you need? If there are enough requests, I can relaunch my pipeline. DM is also good for me.
erm_what_@reddit
Build a comic reader for blind/partially sighted people. It's a big market, and they'd really appreciate it. Comic books are a medium they have little to no access to as it's so based on visual language. Text to speech doesn't work, but maybe your model could be the answer.
arthurwolf@reddit
That makes a lot of sense actually, I always wanted to do some accessibility-related stuff, and I think I can adapt this to do that. Thanks for the tip.
CheatCodesOfLife@reddit
This is literally how you can get the models to "narrate" the comic without refusing. You prefill it by saying it's for accessibility.
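(Concretely, that can just be a system prompt framing the task; a sketch in OpenAI-style chat format, with made-up wording.)

```python
# Hypothetical accessibility framing; the wording is made up, not a known prompt.
messages = [
    {"role": "system",
     "content": "You are an accessibility narrator. You describe comic pages aloud "
                "for blind and low-vision readers: panels in reading order, who is "
                "present, what they say, and any sound effects."},
    {"role": "user",
     "content": "Narrate this page."},  # the page image is attached as in the earlier sketch
]
```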
Doubleve75@reddit
Most of what we do in the community gets invalidated by these big guys... But hey, it's part of the game.
MoffKalast@reddit
"It's even funnier the 585th time."
It's the nature of how things move in new fields that solo devs will be first to the punch making something useful, only to then be steamrolled in support and functionality by a slow-moving large corporate team a year later.
For what it's worth, you didn't waste your time; corporate open source is always sketchy. All it takes is one internal management shift and the license changes, or the whole thing even goes private again. Happens again and again.
Down_The_Rabbithole@reddit
I could really use this for my translation pipeline. I'd appreciate it if you open-sourced it; it would reduce regular translation work by 80%.
Severin_Suveren@reddit
The obvious next logical step, now that you've mapped who says what, seems to me to be setting up a RAG system where you automatically fine-tune diffusion models on whatever comic book is entered, so as to use the existing comic book context as input for generating new content, perhaps also augmented by the user's choices in a sort of "Black Mirror: Bandersnatch"-type setup.
arthurwolf@reddit
Nope, not what I'm doing with it. I'm doing a manga-to-anime pipeline. But this sounds like a lot of fun too.
Severin_Suveren@reddit
Ahh, that makes a lot of sense too! Good luck with your project :)
ninomatsu92@reddit
Don't give up! Any plans to open source it? Cool project.
arthurwolf@reddit
I'm not sure yet, I'll probably rewrite it from scratch at some point, once it works better, and yeah, at some point it'd be open-source.
The part I described here is just one bit of it. The entire project is a semi-automated manga-to-anime pipeline.
That can somewhat also be used as an anime authoring tool (if you remove the manga analysis half and replace that with your own content / some generation tools).
I got it as far as being able to understand and fully analyze manga, do voice acting with the right character's voice, and color and (for now naively) animate images, all mostly automatically.
For now it makes some mistakes, but that's the point: I have to do some of it manually, and then that manual work turns into a dataset that can be used to train a model, which in turn will be able to do much more of the work autonomously.
I think at the rhythm I'm going at now, in like 5 to 10 years I'll have something that can just take a manga and make a somewhat watchable "pseudo"-anime from it.
InterstellarReddit@reddit
Is this what I would need to add to a workflow to help me make UIs? I am a shitty Python developer, and now I want to start making UIs with React, or anything really, for mobile devices. The problem is that I'm just awful and can't figure out a workflow to make my life easier when designing front ends.
I've already built the UIs in Figma, so how can I code them using something like this, or another workflow, to make my life easier?
cddelgado@reddit
I'm reminded of some tinkering I did with AutoGPT. Basically, I took advantage of HTML's nature by stripping out everything but semantic tags and tags for interactive elements, then converted that abstraction to JSON for parsing by a model.
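(Something in that spirit can be done with BeautifulSoup; a sketch where the tag whitelist and output shape are my own choices, not what AutoGPT uses.)

```python
from bs4 import BeautifulSoup

# Tags worth keeping for an agent: semantic structure plus interactive elements.
KEEP = ["main", "nav", "header", "footer", "section", "article", "h1", "h2", "h3",
        "a", "button", "input", "select", "textarea", "form", "label"]

def html_to_abstraction(html):
    """Strip a page down to semantic/interactive tags and emit JSON-able dicts."""
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(KEEP):
        elements.append({
            "tag": tag.name,
            "text": tag.get_text(strip=True)[:200],
            "attrs": {k: v for k, v in tag.attrs.items()
                      if k in ("id", "name", "href", "type", "placeholder", "aria-label")},
        })
    return elements
```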
AnomalyNexus@reddit
Tried it - works really well. Note that there is a typo in the requirements (== not =) and the gradio demo is set to public share.
How would one pass this into a vision model? The original image, the annotated one, and the text, all three in one go?
MagoViejo@reddit
After hunting down all the files missing from the git, I got the gradio demo running, but it's unable to interpret any of the 3 screenshots of user interfaces I had on hand. I have a 3060 and CUDA installed; I tried running it on Windows without CUDA or envs, just went ahead and pip installed all the requirements. What am I missing?
The last error message seems odd to me:
File "C:\Users\pyuser\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch_ops.py", line 755, in call return self._op(args, *(kwargs or {})) NotImplementedError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::nms' is only available for these backends: [CPU, Meta, QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
AnomalyNexus@reddit
No idea - I try to avoid windows for dev stuff
MagoViejo@reddit
Found the issue: it needs Python 3.12, so I went and used conda as the GitHub page said, and now it seems to be working :)
l33t-Mt@reddit
Is it running slow for you? It seems to take a long time for me.
AnomalyNexus@reddit
Around 5 seconds here for a website screenshot. 3090
MagoViejo@reddit
Well, on a 3060 12GB on Windows it takes 1-2 minutes to annotate a capture of some web interfaces my team has been working on. Not ready for production, but it is kind of promising. It has a lot of hit/miss problems identifying charts and tables. I've been playing monkey with the two sliders for Box Threshold & IOU Threshold, and that influences how long processing takes. So not useful YET, but worth keeping an eye on it.
msbeaute00000001@reddit
Try running it on CPU. Should be good. This error means nms is only available on the CPU backend in that install.
Boozybrain@reddit
Where did you find the florence icon caption model?
David_Delaune@reddit
So apparently the YOLOv8 model was pulled off GitHub a few hours ago, but it seems you can just grab the model.safetensors file off Hugging Face and run the conversion script.
logan__keenan@reddit
Why would they pull the model, but still allow the process you’re describing?
bfume@reddit
race condition
David_Delaune@reddit
I guess Hugging Face would be a better place for the model; it would make sense to remove it from the GitHub.
gtek_engineer66@reddit
Hey, can you elaborate?
David_Delaune@reddit
Sure, you can just download the model off Hugging Face and run the conversion script.
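(Roughly like this; the repo id, file path, and script name below are assumptions based on this thread, so verify them against the OmniParser Hugging Face and GitHub pages.)

```python
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions; check the OmniParser HF page for the real layout.
weights_path = hf_hub_download(
    repo_id="microsoft/OmniParser",
    filename="icon_detect/model.safetensors",
)
print(weights_path)
# Then run the conversion script from the OmniParser GitHub repo on that file
# to produce the .pt the demo expects (something like
# `python weights/convert_safetensor_to_pt.py`).
```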
angry_queef_master@reddit
this is such a wonderful time in computing
SwagMaster9000_2017@reddit
https://microsoft.github.io/OmniParser/
The benchmarks are only mildly above just using GPT-4.
qqpp_ddbb@reddit
Can this be combined with Claude computer use?
Boozybrain@reddit
I'm getting an error when trying to run the gradio demo. It references a nonexistent HF repo: https://huggingface.co/weights/icon_caption_florence/resolve/main/config.json
Even logged in I get a "Repository not found" error.
ValfarAlberich@reddit
They created this for GPT-4V; maybe someone has tried it with an open-source alternative?
Inevitable-Start-653@reddit
I'm gonna try to integrate it into my project that lets an LLM use the mouse and keyboard:
https://github.com/RandomInternetPreson/Lucid_Autonomy
Looks like the ID part is as good as or better than OWLv2, and if I can get decent descriptions of each element, I wouldn't need to run OWLv2 and minicpm1.6 together like the current implementation does.
ProposalOrganic1043@reddit
Really helpful for creating Anthropic-like computer-use features.
coconut7272@reddit
Love tools like this. It seems like so many companies are trying to push general intelligence as quickly as possible, when in reality the best use cases for LLMs, where the technology currently stands, are in more specific domains. Combining specialized models in new and exciting ways is where I think LLMs really shine, at least in the short term.