Microsoft silently releases OmniParser, a tool to convert screenshots into structured and easy-to-understand elements for Vision Agents
Posted by umarmnaq@reddit | LocalLLaMA | View on Reddit | 84 comments
arthurwolf@reddit
Oh wow, I've spent 3 months of my life doing exactly this, but for comic book pages instead of phone screenshots.
Like, detect panels, bubbles, faces, bodies, eyes, sound effects, speech bubble tails, etc., all so they can be fed to GPT-4V and it can reflect on them and use them to better understand what's going on in a given comic book page.
My thing was a lot of work. I think this one is a bit more straightforward all in all, but still pretty impressive.
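(For a rough idea of what "feeding detected elements to a vision model" can look like, here is a minimal sketch using the OpenAI Python SDK; it is not the author's code, and the element format is made up.)

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_page(page_png_path: str, elements: list[dict]) -> str:
    """Send a comic page plus pre-detected elements to a vision model.

    `elements` is a hypothetical list like
    [{"type": "panel", "bbox": [x, y, w, h]}, {"type": "bubble", ...}].
    """
    with open(page_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Here is a comic page with these detected elements: {elements}. "
                         "Describe what is going on in the page."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```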
frammie-@reddit
Hey there arthur,
Maybe you aren't aware, but there has been a niche effort doing exactly this kind of thing.
It's called magi (v2) and it's on huggingface right here: https://huggingface.co/ragavsachdeva/magiv2
Might be worth looking into
arthurwolf@reddit
Thanks a lot, I'll look into it.
WAHNFRIEDEN@reddit
In what way is yours more advanced? Can it be run locally, or does it use cloud-based LLMs? Is it a model, or an orchestration of other non-specialized models?
arthurwolf@reddit
It does a bunch more analysis, like understanding the actual story, the actual characters and their properties, the tails of bubbles and who they point at, sound effects, and the context for each panel. It's designed as an orchestration of a bunch of custom-trained models. I haven't worked on it for months; I got pretty far on the analysis part, but the rest of the project was just too much for one person.
nodeocracy@reddit
Message Microsoft and get yourself a job there
arthurwolf@reddit
I'm from the Linux crowd, if I got a job at Microsoft, the other bearded weirdos would likely murder me at the next bearded weirdo meetup.
:)
soothaa@reddit
MS has had a heavy Linux push recently; it's not what it used to be.
pushkin0521@reddit
They have a whole army of PhDs and Nobel-candidate-level hires stuffed in their labs, and get a hundred times more applicants from the Ivy Leagues. Why bother with a no-name otaku?
bucolucas@reddit
If I was able to get hired there, honestly anyone can.
Dazzling_Wear5248@reddit
What did you do?
bucolucas@reddit
Get fired
arthurwolf@reddit
Congrats. Doing LLM stuff?
bucolucas@reddit
hahahaaa no
Key_Extension_6003@reddit
Sounds cool. Any plans to open source this or offer a SaaS model?
arthurwolf@reddit
If I ever get to something usable, which isn't very likely considering how massive of a project it is.
RnRau@reddit
I would love to learn how you structure your prompts to do these things. Instead of releasing what you have done, perhaps write a gentle guide to prompt engineering for detecting visual elements.
I would have no idea how to start something like this, but I would love to learn, and I think a lot of others would too.
arthurwolf@reddit
Here are some of the templates the system uses: https://gist.github.com/arthurwolf/d44bfc8d8aa2c4c98b230ab9ab4a4661
Note that a lot of the stuff you see between {{brackets}} gets replaced by the system with info from the database and/or previous prompt runs and/or previous analysis.
RnRau@reddit
Appreciate it mate! Cheers!
Key_Extension_6003@reddit
Yeah, I've often pondered doing this for webtoons, which is even harder. I've not really used visual LLMs though, so it's been a whim rather than a plan.
Good luck with your project!
arthurwolf@reddit
You should try it out, you'll likely get further than you expect; LLMs can sort of be like magic for this stuff.
TheManicProgrammer@reddit
No reason to give up :)
arthurwolf@reddit
Well. The entire project is a manga-to-anime pipeline. And I'm pretty sure before I'm done with the project, we'll have SORA-like models that do everything my project does, but better, and in one big step... So, good reasons to give up. But I'm having fun, so I won't.
IJOY94@reddit
Do you decompose the comic into its separate pieces? How do you handle "sound effects" that are normally not bubbled? Do you have a way to extract them (especially when they have a texture applied)?
arthurwolf@reddit
Yep. Panels, faces, bodies, bubbles, tails, sound effects, etc. I have trained models for all of them pretty much.
They are a special type of bubble; they are recognized by the same model that handles bubbles.
Sure. I use segment-anything to segment the page, and then a custom trained model to classify each segment.
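(A minimal sketch of that segment-then-classify step, assuming the segment-anything package and a stand-in classify_segment() function in place of the custom-trained classifier; not the author's actual code.)

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM (v1) checkpoint, as mentioned above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def extract_elements(page_path, classify_segment):
    """Segment a comic page and label each segment with a custom classifier.

    `classify_segment` stands in for the custom-trained model; it takes an image
    crop and returns a label like "panel", "bubble", "face", "sfx", or "other".
    """
    image = cv2.cvtColor(cv2.imread(page_path), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # dicts with 'bbox', 'segmentation', 'area', ...
    elements = []
    for m in masks:
        x, y, w, h = map(int, m["bbox"])  # bbox is in XYWH format
        label = classify_segment(image[y:y + h, x:x + w])
        if label != "other":
            elements.append({"type": label, "bbox": [x, y, w, h]})
    return elements
```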
CheatCodesOfLife@reddit
I wonder how many of us are trying to build exactly this :D
I've got mine to the point where it's like those AI YouTube videos where they have an AI voice 'recapping' manga, but on the low end of that (forgetting which character is which, lots of GPT-isms, etc.)
Same here, but I'm giving it less attention now.
arthurwolf@reddit
wolf.arthur@gmail.com. We really should talk and exchange tips/tricks. Are you on Telegram, Wire, something like that?
I've actually contacted people running those channels, and have been chatting with one of them, learned a lot from it.
smulfragPL@reddit
I think a much better use of the technology you developed is contextual translation of manga. Try pivoting to that
CheatCodesOfLife@reddit
I've got a pipeline set up to do this with my hobby project. It automatically extracts the text, whites it out from the image, and stores the coordinates of each text bubble. I don't know where to source the raw manga, though, and the translation isn't always accurate.
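(For illustration, the white-out step can be as simple as the sketch below, assuming you already have the text-region boxes; the (x, y, w, h) box format is my own choice, not this commenter's pipeline.)

```python
import cv2
import numpy as np

def white_out_text(image_path, boxes):
    """Remove detected text regions from a page, keeping their coordinates.

    `boxes` are (x, y, w, h) rectangles from whatever text detector / OCR is used.
    Returns the cleaned page plus the boxes, for re-lettering with the translation later.
    """
    img = cv2.imread(image_path)
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 255
    # Inpainting fills the masked area from surrounding pixels; for plain
    # white bubbles a simple fill with white would also do.
    cleaned = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
    return cleaned, list(boxes)
```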
arthurwolf@reddit
Yeah, that's where the context (understanding who said what, and what happened in previous panels) helps a lot, especially if an LLM is doing the translation.
I might try to get the system to do translation, and see how it goes...
Tramagust@reddit
Sounds like you should put it up on github so the community can accelerate it. You can still make money off it by providing compute.
arthurwolf@reddit
I might at some point, once it starts being useful, yeah...
KarnotKarnage@reddit
That seems like an awesome, albeit completely gigantic, project!
Do you have a blog or repo you share stuff on? Would love to take a look.
arthurwolf@reddit
I might, at some point, publish videos about this on my Youtube channel: https://www.youtube.com/@ArthurWolf
NeverSkipSleepDay@reddit
You will have such fine control over everything, keep going mate
Powerful_Brief1724@reddit
Got any github or place I can follow your project? It's really cool!
arthurwolf@reddit
My github is https://github.com/arthurwolf/ but I'm not publishing any manga stuff there so far.
I might make videos about this at some point: https://www.youtube.com/@ArthurWolf
bfume@reddit
You accomplished this with just prompting? Care to share an early version of your prompts? I'd love to learn the techniques, but it's hard to learn from books; examples and "real" material are easier and what I prefer.
arthurwolf@reddit
Not just prompting. I've trained models to recognize stuff like panels and bubbles (though modern visual LLMs look like they should be able to handle some of that), and there's a ton of logic and tools I had to develop around it.
But a lot of the hard work is done by GPT-4V and general LLM processing, yes.
I put some of the prompt templates in here for the curious: https://gist.github.com/arthurwolf/d44bfc8d8aa2c4c98b230ab9ab4a4661
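(As a side note, the {{placeholder}} filling described earlier only needs a few lines; a sketch with made-up field names, not the ones in the gist.)

```python
import re

def fill_template(template, values):
    """Replace {{name}} placeholders with values pulled from the database or
    previous prompt runs, leaving unknown placeholders untouched."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  template)

prompt = fill_template(
    "Panel {{panel_number}} shows: {{panel_description}}. Who is speaking?",
    {"panel_number": 3, "panel_description": "two characters arguing"},
)
```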
Xeon06@reddit
It seems like their tool is meant to understand computer screenshots? What am I missing that nullifies your work with comics?
arthurwolf@reddit
It doesn't nullify my work with comics. I'm just saying I expect my work to at some point be nullified as general purpose models improve.
Boozybrain@reddit
What was your general process for training? This is an interesting CV problem due to the more organic and irregular shapes across panels.
arthurwolf@reddit
So for panels, I do the following.
I use segment-anything (the previous version, not moved to the latest yet) to split the page into segments.
Then I use a model I trained to figure out which segments are panels, and which are not.
The training data for the panel classifier is previous comics for which I did the work manually.
It figures the panels out with something like 98% accuracy, but I still have to manually fix a few things.
It then also figures out the order of the panels. That's an interesting bit too, I looked up published papers/algos to do this, and none were accurate enough, so I wrote my own, which is better than anything I found published online (there's still one edge case it can't do, but I know how to fix it, I just haven't yet because it's not worth the effort at this point).
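(He doesn't share the ordering algorithm itself; for reference, a naive reading-order heuristic, not the author's and prone to exactly the kind of edge cases he mentions, could look like this.)

```python
def manga_reading_order(panels, row_tolerance=0.5):
    """Order panel bounding boxes top-to-bottom, then right-to-left per row.

    `panels` is a list of {"bbox": [x, y, w, h]} dicts. Manga reads right-to-left;
    flip the inner sort for western comics.
    """
    if not panels:
        return []
    heights = sorted(p["bbox"][3] for p in panels)
    median_h = heights[len(heights) // 2]

    # Group panels into rows by vertical-center proximity.
    rows = []
    for p in sorted(panels, key=lambda p: p["bbox"][1] + p["bbox"][3] / 2):
        cy = p["bbox"][1] + p["bbox"][3] / 2
        if rows and abs(cy - rows[-1]["cy"]) < row_tolerance * median_h:
            rows[-1]["panels"].append(p)
        else:
            rows.append({"cy": cy, "panels": [p]})
        rows[-1]["cy"] = cy  # track the latest center seen in the current row

    # Within each row, read from right to left (largest x first).
    ordered = []
    for row in rows:
        ordered.extend(sorted(row["panels"], key=lambda p: -p["bbox"][0]))
    return ordered
```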
StaplerGiraffe@reddit
Have you considered turning your project into a manga to audiobook pipeline? It sounds like you have the image analysis done, and turning that into a script for an audiobook sounds feasible. Such a project would allow blind people to "read" manga, making the world a tiny bit better for them, even if it is not working perfectly.
arthurwolf@reddit
Yeah several people here suggested that, and I'll probably look into it.
FpRhGf@reddit
I was wondering if a tool like this exists. It'll be so useful for doing research analysis on graphic novels.
arthurwolf@reddit
I can probably share part of it, don't hesitate to email wolf.arthur@gmail.com
msbeaute00000001@reddit
Can you elaborate on what you need? If there are enough requests, I can relaunch my pipeline. DM is also good for me.
erm_what_@reddit
Build a comic reader for blind/partially sighted people. It's a big market, and they'd really appreciate it. Comic books are a medium they have little to no access to as it's so based on visual language. Text to speech doesn't work, but maybe your model could be the answer.
arthurwolf@reddit
That makes a lot of sense actually, I always wanted to do some accessibility-related stuff, and I think I can adapt this to do that. Thanks for the tip.
CheatCodesOfLife@reddit
This is literally how you can get the models to "narrate" the comic without refusing. You prefill it by saying it's for accessibility.
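(Concretely, that can just be a system prompt framing the task; a sketch in OpenAI-style chat format, with made-up wording.)

```python
# Hypothetical accessibility framing; the wording is made up, not a known prompt.
messages = [
    {"role": "system",
     "content": "You are an accessibility narrator. You describe comic pages aloud "
                "for blind and low-vision readers: panels in reading order, who is "
                "present, what they say, and any sound effects."},
    {"role": "user",
     "content": "Narrate this page."},  # the page image is attached as in the earlier sketch
]
```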
Doubleve75@reddit
Most of what we do in the community gets invalidated by these big guys... But hey, it's part of the game.
MoffKalast@reddit
"It's even funnier the 585th time."
It's the nature of how things move in new fields that solo devs will be first to the punch making something useful, only to then be steamrolled in support and functionality by a slow-moving large corporate team a year later.
For what it's worth, you didn't waste your time; corporate open source is always sketchy. All it takes is one internal management shift and the license changes, or the whole thing even goes private again. Happens again and again.
Down_The_Rabbithole@reddit
I could really use this for my translation pipeline. I'd appreciate it if you open-sourced it; it would reduce regular translation work by 80%.
Severin_Suveren@reddit
The obvious next logical step, now that you've mapped who says what, seems to me to be setting up a RAG system where you automatically fine-tune diffusion models on whatever comic book is entered, so as to use the existing comic book context as input for generating new content, perhaps also augmented by the user's choices in a sort of "Black Mirror: Bandersnatch"-type setup.
arthurwolf@reddit
Nope, not what I'm doing with it. I'm doing a manga-to-anime pipeline. But this sounds like a lot of fun too.
Severin_Suveren@reddit
Ahh, that makes a lot of sense too! Good luck with your project :)
ninomatsu92@reddit
Don't give up! Any plans to open source it? Cool project.
arthurwolf@reddit
I'm not sure yet, I'll probably rewrite it from scratch at some point, once it works better, and yeah, at some point it'd be open-source.
The part I described here is just one bit of it. The entire project is a semi-automated manga-to-anime pipeline.
That can somewhat also be used as an anime authoring tool (if you remove the manga analysis half and replace that with your own content / some generation tools).
I got it as far as being able to understand and fully analyze manga, do voice acting with the right character's voice, and color and (for now naively) animate images, all mostly automatically.
For now it makes some mistakes, but that's the point: I have to do some of it manually, and then that manual work turns into a dataset that can be used to train a model, which in turn will be able to do much more of the work autonomously.
I think at the rhythm I'm going at now, in like 5 to 10 years I'll have something that can just take a manga and make a somewhat watchable "pseudo"-anime from it.
InterstellarReddit@reddit
Is this what I would need to add to a workflow to help me make UIs? I am a shitty Python developer, and now I want to start making UIs with React, or anything really, for mobile devices. The problem is that I'm just awful and can't figure out a workflow to make my life easier when designing front ends.
I've already built the UIs in Figma, so how can I code them using something like this, or another workflow, to make my life easier?
cddelgado@reddit
I'm reminded of some tinkering I did with AutoGPT. Basically, I took advantage of HTML's nature by stripping out everything but semantic tags and tags for interactive elements, then converted that abstraction to JSON for parsing by a model.
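(Something in that spirit can be done with BeautifulSoup; a sketch where the tag whitelist and output shape are my own choices, not what AutoGPT uses.)

```python
from bs4 import BeautifulSoup

# Tags worth keeping for an agent: semantic structure plus interactive elements.
KEEP = ["main", "nav", "header", "footer", "section", "article", "h1", "h2", "h3",
        "a", "button", "input", "select", "textarea", "form", "label"]

def html_to_abstraction(html):
    """Strip a page down to semantic/interactive tags and emit JSON-able dicts."""
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(KEEP):
        elements.append({
            "tag": tag.name,
            "text": tag.get_text(strip=True)[:200],
            "attrs": {k: v for k, v in tag.attrs.items()
                      if k in ("id", "name", "href", "type", "placeholder", "aria-label")},
        })
    return elements
```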
AnomalyNexus@reddit
Tried it - works really well. Note that there is a typo in the requirements (== not =) and the gradio demo is set to public share.
How would one pass this into a vision model? The original image, the annotated one, and the text, all three in one go?
MagoViejo@reddit
After hunting down all the files missing from the git, I got the gradio demo running, but it's unable to interpret any of the 3 screenshots of user interfaces I had on hand. I have a 3060 and CUDA installed; I tried running it on Windows without CUDA or envs, just went ahead and pip installed all the requirements. What am I missing?
The last error message seems odd to me:
File "C:\Users\pyuser\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch_ops.py", line 755, in call return self._op(args, *(kwargs or {})) NotImplementedError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::nms' is only available for these backends: [CPU, Meta, QuantizedCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
AnomalyNexus@reddit
No idea - I try to avoid windows for dev stuff
MagoViejo@reddit
Found the issue: it needs Python 3.12, so I went and used conda as the GitHub page said, and now it seems to be working :)
l33t-Mt@reddit
Is it running slow for you? It seems to take a long time for me.
AnomalyNexus@reddit
Around 5 seconds here for a website screenshot. 3090
MagoViejo@reddit
Well, on a 3060 12GB on Windows it takes 1-2 minutes to annotate a capture of some web interfaces my team has been working on. Not ready for production, but it is kind of promising. It has a lot of hit/miss problems identifying charts and tables. I've been playing monkey with the two sliders for Box Threshold & IOU Threshold, and that influences how long processing takes. So not useful YET, but worth keeping an eye on it.
msbeaute00000001@reddit
Try running it on CPU. Should be good. This error means nms is only available on the CPU backend in that install.
Boozybrain@reddit
Where did you find the florence icon caption model?
David_Delaune@reddit
So apparently the YOLOv8 model was pulled off GitHub a few hours ago, but it seems you can just grab the model.safetensors file off Hugging Face and run the conversion script.
logan__keenan@reddit
Why would they pull the model, but still allow the process you’re describing?
bfume@reddit
race condition
David_Delaune@reddit
I guess Hugging Face would be a better place for the model; it would make sense to remove it from the GitHub.
gtek_engineer66@reddit
Hey, can you elaborate?
David_Delaune@reddit
Sure, you can just download the model off Hugging Face and run the conversion script.
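(Roughly like this; the repo id, file path, and script name below are assumptions based on this thread, so verify them against the OmniParser Hugging Face and GitHub pages.)

```python
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions; check the OmniParser HF page for the real layout.
weights_path = hf_hub_download(
    repo_id="microsoft/OmniParser",
    filename="icon_detect/model.safetensors",
)
print(weights_path)
# Then run the conversion script from the OmniParser GitHub repo on that file
# to produce the .pt the demo expects (something like
# `python weights/convert_safetensor_to_pt.py`).
```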
angry_queef_master@reddit
this is such a wonderful time in computing
SwagMaster9000_2017@reddit
https://microsoft.github.io/OmniParser/
The benchmarks are only mildly above just using GPT-4.
qqpp_ddbb@reddit
Can this be combined with Claude computer use?
Boozybrain@reddit
I'm getting an error when trying to run the gradio demo. It references a nonexistent HF repo: https://huggingface.co/weights/icon_caption_florence/resolve/main/config.json
Even logged in I get a "Repository not found" error.
ValfarAlberich@reddit
They created this for GPT-4V; maybe someone has tried it with an open-source alternative?
Inevitable-Start-653@reddit
I'm gonna try to integrate it into my project that lets an LLM use the mouse and keyboard:
https://github.com/RandomInternetPreson/Lucid_Autonomy
Looks like the ID part is as good as or better than OWLv2, and if I can get decent descriptions of each element, I wouldn't need to run OWLv2 and minicpm1.6 together like the current implementation does.
ProposalOrganic1043@reddit
Really helpful for creating Anthropic-like computer-use features.
coconut7272@reddit
Love tools like this. It seems like so many companies are trying to push general intelligence as quickly as possible, when in reality the best use cases for LLMs, where the technology currently stands, are in more specific domains. Combining specialized models in new and exciting ways is where I think LLMs really shine, at least in the short term.