Agent using Canva. Things are getting wild now...

If you need help getting Browser use running on Windows, I git it done by having Claude help me with the install. Then I had Claude write a 'bat' file for Windows to automate running and existing the app. The I had it build a small menu system as a UI, Let me know if you need help.

[-]

Relevant-Ad9432@reddit

what do you mean automate 'running and existing the app' ?? can you explain a bit?

[-]

Intraluminal@reddit

The app, as I recall, was not particularly Windows-friendly. The install was mildly difficult; getting the dependencies was a pain, setting up the environment was unpleasant, etc.

In addition, getting it running again, after shutting it down was also not Windows-friendly, you had to reset the environment, use Python in a Command Window etc.

I had Claude automate all that, so it effectively acts like a Windows app. I double-click it, and it runs. Admittedly it opends a Command Windows for its menu, but that's fine, and if I wanted, Claude could make that Windows-friendly too.

[-]

ImpossiblePlay@reddit

what was the issue? afaik, browser-use is based in DOM tree, and Canva is an iframe, in theory it won't work(i might be wrong though)

[-]

Relevant-Ad9432@reddit

no i got stuck much before i got to canva...

[-]

ThiccStorms@reddit

anyone here who has played with desktop control agents like these?
which is the most performant one wrt its size or footprint?

[-]

freecodeio@reddit

They are all hand-picked flashy videos. It just chokes after 2-3 steps due to the prompt growing.

[-]

ThiccStorms@reddit

Sad. Anyone tried UI-TARS? I just remember that by memory

[-]

waescher@reddit

I got some demos running in UI-TARS and found it very impressive actually. Tried a lot of stuff like 10-15 interactions for opening a web browser, navigating to a website with google, finding a value and opening the windows calculator to calculate that value's square root. Such stuff.

I found it so impressive that I actually signed into my work account that night and turned the AI model off because who can really tell what this thing is going to do overnight 😅

[-]

ScienceBeneficial404@reddit

I assume ur locally hosting it? I can only get the 7B to run, u think it's deemed fit for UI-complex tasks?

[-]

waescher@reddit

I used 7b as well, it worked pretty good actually.

[-]

ImpossiblePlay@reddit

not a super hard problem to solve? :P just build a SOP execution engine and convert complicated workflows to SOP, the success rate will in theory change from (step 1) * (step 2)*(step 3)... to (step 1) + (step 2)+(step 3)...

here is the implementation: https://github.com/Aident-AI/open-cuak/commit/c345755420f7d72128ac7861cee8479f70cbe23c

[-]

TheDailySpank@reddit

No desktop, but browser-use is an open source ai web browser that has a number of API options.

[-]

disciples_of_Seitan@reddit

This looks pretty shit no? Forever to complete a trivial task with a custom prompt.

[-]

ImpossiblePlay@reddit

The first time a human baby walks is pretty shit too, but it will get faster & cheaper really soon.

[-]

disciples_of_Seitan@reddit

"It's shit now but it'll get better" well we can at least agree that it looks shit now.

[-]

yVGa09mQ19WWklGR5h2V@reddit

Yeah, the "Things are getting wild now" title is a bit cringey. This is nothing different than what gets posted every day that also don't make me want to use it.

[-]

Reno0vacio@reddit

I've never understood the point of an agent interrogating a website based on a "picture". I mean, to do something that takes him 5 minutes and me half a second.

[-]

formspen@reddit

I see that this is OpenAI based at its core. Can this work with other multimodal models that are run locally?

[-]

ljhskyso@reddit (OP)

yeah, it works with openai compatible apis - so basically it can work with other open-source/open-weight VLMs. performance is another story 🤔

[-]

BoJackHorseMan53@reddit

Gemini flash ftw

[-]

mauroferra@reddit

Any chance to use a locally deployed LLM?

[-]

ljhskyso@reddit (OP)

Yeah, it supports connecting to open-ai api compatible servers, e.g. you can host any open-source VLM locally and hook it up with the system

[-]

SayfullahShehzad@reddit

What AI IS this?

[-]

ljhskyso@reddit (OP)

https://github.com/Aident-AI/open-cuak, and it uses GPT-4o for the demo

[-]

SayfullahShehzad@reddit

How many parameters does the model have ?

[-]

SayfullahShehzad@reddit

Thanks mate :)

[-]

YouAndThem@reddit

"President Day"?

[-]

ImpossiblePlay@reddit

A community member just fixed it! https://github.com/Aident-AI/open-cuak/commit/be9dc3d04d14ef989daf3dc53dc5a90473c55a22

[-]

fraschm98@reddit

Imo there can be a speedup instead of having the ai always screenshot and process the image after every single action, it could use something like shortcat on mac which gives vim like keybindings to every hyperlink and button/label actions

[-]

ImpossiblePlay@reddit

There are certainly huge room for efficiency gain. Could you expand on how keybindings will help?
The thing is that web is such a dynamic environment, the page can change easily (e.g., mouse move can trigger hover over popping up), so we are taking one screenshot after every action.

[-]

Yes_but_I_think@reddit

Few million tokens for a 1 min job

[-]

ljhskyso@reddit (OP)

"test time" scaling :D

now seriously, it will eventually get really cheap and open-source models will catch up - more DeepSeek-like VLMs will come i strongly believe

[-]

ImpossiblePlay@reddit

It indeed consumes a lot of tokens, not as many as you just mentioned :P
but since it supports open source model, one can rent a gpu for \~$1.5 per hour and run it, then the economics works

[-]

shokuninstudio@reddit

As always, the number one rule of demo tech videos is don't believe it until you use it in person yourself.

[-]

ImpossiblePlay@reddit

it's open sourced: https://github.com/Aident-AI/open-cuak. the only thing is that you will have to host Omniparser V2 and put Omniparser url in .env.local , it's too expensive for us to host :(

[-]

madaradess007@reddit

you meant 'fake demos are getting wild now..." ?

[-]

svantana@reddit

Impressive, but the detailed instructions on how to use Canva (click twice, don't double click) makes it look like it required a bunch of trial and error to get right.

[-]

ljhskyso@reddit (OP)

that's true - i think GPT-4o doesn't have these knowledge built-in yet. people might either list all the control details in the prompt (for better accuracy) or put those info in a knowledge-base and RAG it in.

[-]

potpro@reddit

And I assume all it takes is a fresh redesign of anything to make this explode right?

Either way great stuff

[-]

Puzzleheaded-Law7741@reddit

I think I've seen this on X before. What's the project again?

[-]

ljhskyso@reddit (OP)

oh you did? it's open sourced @ https://github.com/Aident-AI/open-cuak

[-]

Puzzleheaded-Law7741@reddit

Neat!

[-]

PermissionNext9894@reddit

What project is it? Gonna be fun to try it out