Agent using Canva. Things are getting wild now...
Posted by ljhskyso@reddit | LocalLLaMA | View on Reddit | 62 comments
Posted by ljhskyso@reddit | LocalLLaMA | View on Reddit | 62 comments
DeltaSqueezer@reddit
Best part was that it passed the "click to prove you are human" captcha :D
OkBase5453@reddit
Are We Entering the Era of Bots?!?!
LoafyLemon@reddit
Boy, do I have a rabbit hole for you - Dead Internet Theory.
Dead_Internet_Theory@reddit
Never heard of it.
ab2377@reddit
do you always know when people say you name?
Dead_Internet_Theory@reddit
No that would be horrible lol it's such a common ~~internet meme~~ business model.
LilPsychoPanda@reddit
So you are the one ha? Cool, now we now who to blame.
Clear-Ad-9312@reddit
looks at username... 🤔
OkBase5453@reddit
Exactly!
as-tro-bas-tards@reddit
lol no, of course not. what would give you that impression?
N U D E S I N B I O
HiddenoO@reddit
Ironically, a lot of of those captchas are easier to solve for AI than they are for humans nowadays.
pjeff61@reddit
Hmm it has a centimeter of the wheel. This square def has a bike in it
FAILED
LargelyInnocuous@reddit
That pisses me off so much. Fuck Google for unleashing that shit on the world.
jumperabg@reddit
Are you sure? This looks like a browser-use integration and the user is adding instructions and has the ability to click on the UI.
ImpossiblePlay@reddit
can browser-use even use Canva? browser-use is DOM tree based, Canva is an iframe.
Dinosaurrxd@reddit
Browser use has vision and click x, y so it should still be able to use i frames just fine
IrisColt@reddit
How!? It just simply did it?
Relevant-Ad9432@reddit
lol i was thinking of creating something like this with browser-use .. got stuck somewhere and forgot about it
Intraluminal@reddit
If you need help getting Browser use running on Windows, I git it done by having Claude help me with the install. Then I had Claude write a 'bat' file for Windows to automate running and existing the app. The I had it build a small menu system as a UI, Let me know if you need help.
Relevant-Ad9432@reddit
what do you mean automate 'running and existing the app' ?? can you explain a bit?
Intraluminal@reddit
The app, as I recall, was not particularly Windows-friendly. The install was mildly difficult; getting the dependencies was a pain, setting up the environment was unpleasant, etc.
In addition, getting it running again, after shutting it down was also not Windows-friendly, you had to reset the environment, use Python in a Command Window etc.
I had Claude automate all that, so it effectively acts like a Windows app. I double-click it, and it runs. Admittedly it opends a Command Windows for its menu, but that's fine, and if I wanted, Claude could make that Windows-friendly too.
ImpossiblePlay@reddit
what was the issue? afaik, browser-use is based in DOM tree, and Canva is an iframe, in theory it won't work(i might be wrong though)
Relevant-Ad9432@reddit
no i got stuck much before i got to canva...
ThiccStorms@reddit
anyone here who has played with desktop control agents like these?
which is the most performant one wrt its size or footprint?
freecodeio@reddit
They are all hand-picked flashy videos. It just chokes after 2-3 steps due to the prompt growing.
ThiccStorms@reddit
Sad. Anyone tried UI-TARS? I just remember that by memoryÂ
waescher@reddit
I got some demos running in UI-TARS and found it very impressive actually. Tried a lot of stuff like 10-15 interactions for opening a web browser, navigating to a website with google, finding a value and opening the windows calculator to calculate that value's square root. Such stuff.
I found it so impressive that I actually signed into my work account that night and turned the AI model off because who can really tell what this thing is going to do overnight 😅
ScienceBeneficial404@reddit
I assume ur locally hosting it? I can only get the 7B to run, u think it's deemed fit for UI-complex tasks?
waescher@reddit
I used 7b as well, it worked pretty good actually.
ImpossiblePlay@reddit
not a super hard problem to solve? :P just build a SOP execution engine and convert complicated workflows to SOP, the success rate will in theory change from (step 1) * (step 2)*(step 3)... to (step 1) + (step 2)+(step 3)...
here is the implementation: https://github.com/Aident-AI/open-cuak/commit/c345755420f7d72128ac7861cee8479f70cbe23c
TheDailySpank@reddit
No desktop, but browser-use is an open source ai web browser that has a number of API options.
disciples_of_Seitan@reddit
This looks pretty shit no? Forever to complete a trivial task with a custom prompt.
ImpossiblePlay@reddit
The first time a human baby walks is pretty shit too, but it will get faster & cheaper really soon.
disciples_of_Seitan@reddit
"It's shit now but it'll get better" well we can at least agree that it looks shit now.
yVGa09mQ19WWklGR5h2V@reddit
Yeah, the "Things are getting wild now" title is a bit cringey. This is nothing different than what gets posted every day that also don't make me want to use it.
Reno0vacio@reddit
I've never understood the point of an agent interrogating a website based on a "picture". I mean, to do something that takes him 5 minutes and me half a second.
formspen@reddit
I see that this is OpenAI based at its core. Can this work with other multimodal models that are run locally?
ljhskyso@reddit (OP)
yeah, it works with openai compatible apis - so basically it can work with other open-source/open-weight VLMs. performance is another story 🤔
BoJackHorseMan53@reddit
Gemini flash ftw
mauroferra@reddit
Any chance to use a locally deployed LLM?
ljhskyso@reddit (OP)
Yeah, it supports connecting to open-ai api compatible servers, e.g. you can host any open-source VLM locally and hook it up with the system
SayfullahShehzad@reddit
What AI IS this?
ljhskyso@reddit (OP)
https://github.com/Aident-AI/open-cuak, and it uses GPT-4o for the demo
SayfullahShehzad@reddit
How many parameters does the model have ?
SayfullahShehzad@reddit
Thanks mate :)
YouAndThem@reddit
"President Day"?
ImpossiblePlay@reddit
A community member just fixed it! https://github.com/Aident-AI/open-cuak/commit/be9dc3d04d14ef989daf3dc53dc5a90473c55a22
fraschm98@reddit
Imo there can be a speedup instead of having the ai always screenshot and process the image after every single action, it could use something like shortcat on mac which gives vim like keybindings to every hyperlink and button/label actions
ImpossiblePlay@reddit
There are certainly huge room for efficiency gain. Could you expand on how keybindings will help?
The thing is that web is such a dynamic environment, the page can change easily (e.g., mouse move can trigger hover over popping up), so we are taking one screenshot after every action.
Yes_but_I_think@reddit
Few million tokens for a 1 min job
ljhskyso@reddit (OP)
"test time" scaling :D
now seriously, it will eventually get really cheap and open-source models will catch up - more DeepSeek-like VLMs will come i strongly believe
ImpossiblePlay@reddit
It indeed consumes a lot of tokens, not as many as you just mentioned :P
but since it supports open source model, one can rent a gpu for \~$1.5 per hour and run it, then the economics works
shokuninstudio@reddit
As always, the number one rule of demo tech videos is don't believe it until you use it in person yourself.
ImpossiblePlay@reddit
it's open sourced: https://github.com/Aident-AI/open-cuak. the only thing is that you will have to host Omniparser V2 and put Omniparser url in .env.local , it's too expensive for us to host :(
madaradess007@reddit
you meant 'fake demos are getting wild now..." ?
svantana@reddit
Impressive, but the detailed instructions on how to use Canva (click twice, don't double click) makes it look like it required a bunch of trial and error to get right.
ljhskyso@reddit (OP)
that's true - i think GPT-4o doesn't have these knowledge built-in yet. people might either list all the control details in the prompt (for better accuracy) or put those info in a knowledge-base and RAG it in.
potpro@reddit
And I assume all it takes is a fresh redesign of anything to make this explode right?
Either way great stuff
Puzzleheaded-Law7741@reddit
I think I've seen this on X before. What's the project again?
ljhskyso@reddit (OP)
oh you did? it's open sourced @ https://github.com/Aident-AI/open-cuak
Puzzleheaded-Law7741@reddit
Neat!
PermissionNext9894@reddit
What project is it? Gonna be fun to try it out