Attend - Proof of Concept
Posted by Pedalnomica@reddit | LocalLLaMA | 19 comments
I've gotten fed up with hopping on the computer to do one thing, and doing other stuff instead.
I'm building Attend so that our devices can help us dedicate our time and attention to what matters to us, instead of what someone else thinks is best.
Right now, it is a voice assistant that uses a vision LLM to "watch" your screen and help you get back on track if what you're doing isn't aligned with what you said you wanted to do.
I've got some work to do on the workflows and prompts to reduce false positives, but it "works" and I'm very excited about it!
I'd like to get this down to a single 3090, but two seems pretty feasible. Part of the problem is that most open weight vision language models are garbage with 4K images/screenshots. Qwen2-VL seems to be an exception, but it (especially the 7B) is garbage when it comes to driving the workflows behind Attend. So, I've just been using Qwen2-VL-7B-Instruct and Llama-3.3 at 8-bit as I get it working. I'd love to hear suggestions for minimizing VRAM (InternVL2_5 also seems to handle 4K alright, but I haven't tested it enough on the workflows).
Attend interfaces with all models using OpenAI compatible API calls. So, you should be able to use the cloud, if you're into that kinda thing... You could also take a hybrid approach. I think you could get the STT and vision LLM into 16GB VRAM and run that locally. Piper TTS runs well on CPU. You could then use a cloud model just for the text LLM and keep the most sensitive stuff (screenshots!) local.
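To make the hybrid idea concrete, here's a minimal sketch of what pointing separate OpenAI-compatible clients at a local vision endpoint and a (possibly cloud) text endpoint could look like. The ports, env vars, and model names are illustrative assumptions, not Attend's actual configuration:

```python
# Illustrative sketch only: split OpenAI-compatible endpoints so screenshots
# stay local while the text LLM can live anywhere. Ports, env vars, and model
# names are assumptions, not Attend's actual config.
import os
from openai import OpenAI

# Vision LLM served locally (e.g. vLLM exposing an OpenAI-compatible API).
vision_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",
)

# Text LLM can point at any OpenAI-compatible endpoint, local or cloud.
text_client = OpenAI(
    base_url=os.environ.get("TEXT_LLM_BASE_URL", "http://localhost:8001/v1"),
    api_key=os.environ.get("TEXT_LLM_API_KEY", "not-needed-for-local"),
)

def describe_screenshot(image_b64: str) -> str:
    """Ask the local vision model what is currently on screen."""
    resp = vision_client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Briefly describe what the user is doing on this screen."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```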
Check out the code https://github.com/hyperfocAIs/Attend/ and a proof of concept video https://youtu.be/PETrY540zMM
Low88M@reddit
All this energy & resources just to tell you to work instead of surfing the web?
Really?
It may be a good coding challenge. Congrats on it! But investing in/developing "inner parents" will probably be more efficient! ;)
Pedalnomica@reddit (OP)
Oh... I'm constantly working on the latter as well.
The vision language model is the bulk of it (the rest is pretty intermittent), and that fits on a single 3090. Even running one of those full tilt for all your waking hours consumes 0.35*16=5.6 kWh of energy. That's less than $1 where I live. (I don't think it actually needs to run full tilt all the time, and you could get this more efficient with batching/serving multiple users, etc...)
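For anyone checking the math, the same estimate as a few lines of Python; the ~350 W draw and 16 waking hours come from the comment above, and the electricity rate is just an illustrative assumption:

```python
# Back-of-the-envelope energy/cost estimate for running a 3090 full tilt
# during waking hours. The $/kWh rate is an assumed example price.
power_kw = 0.35          # rough full-load draw of a single 3090
hours_awake = 16
rate_usd_per_kwh = 0.15  # assumed electricity price

energy_kwh = power_kw * hours_awake          # 5.6 kWh per day
daily_cost = energy_kwh * rate_usd_per_kwh   # ~$0.84 per day
print(f"{energy_kwh} kWh/day, ~${daily_cost:.2f}/day")
```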
Even if this only helps me spend several minutes better in a day, that's just a massive ROI. I'm hoping it ends up helping me much more, and eventually I want to add more generic voice assistant type features. I'm starting with what I think will help me most.
No_Afternoon_4260@reddit
Why do you want to use 4K pictures? I mean, you can scale the screenshots down or crop them to regions of interest.
Pedalnomica@reddit (OP)
How do you know what region is of interest ahead of time? I suppose you could have a workflow try to figure that out for each image. It might be worth trying, but I suspect it would add a lot of lag between when you get off track and when Attend notices.
Scaling them down doesn't seem to work well for reasons I described in another comment.
If there are models that do okay with 4K why not use them?
No_Afternoon_4260@reddit
I've played with a YOLO fine-tune that picks out the biggest, most central paragraphs. I only have about 200 images in the dataset, which may be a bit short. DM me if you want and we can try something next week.
fatihmtlm@reddit
So it's like my mama checking on me to see if I'm doing my homework and not playing games...
Pedalnomica@reddit (OP)
P.S. I'm trying to design this for whatever you want to do: work, play, or anything else.
In my mind, if you want to, e.g., play a game for 90 minutes, Attend should help you with that, jumping in if you start to get distracted (e.g. reading about something on sale on Steam)... and then help you actually wrap it up after 90 minutes.
fatihmtlm@reddit
Yeah, interesting idea. I don't know how it would work for me but I want to try. Also, I can see it being more helpful at work 😅
Pedalnomica@reddit (OP)
Yeah, it is definitely highly relevant for "work," but I often don't spend my downtime how I want either. I don't think I'm the only one.
I'm still early in getting this to work well (those false positives I mentioned). If you try it and it doesn't work well for you, give it a try again in a few weeks.
Pedalnomica@reddit (OP)
Yeah, but only after you've asked her to.
cyanheads@reddit
This looks great! Maybe check out screenpipe if you haven't already. It seems like this could be done fairly quickly with a screenpipe wrapper.
Start a task/goal and every N seconds pull the summary of what’s happening on the screen, send an alert if the activity isn’t towards the goal X times over Y minutes. Just need to add the correct temporal context to it if screenpipe doesn’t supply that already.
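For illustration, a minimal sketch of that polling loop; get_screen_summary() and is_on_task() are placeholders for whatever screenpipe (or a screenshot plus a vision LLM) and a text model would actually provide, and the thresholds are arbitrary:

```python
# Minimal sketch of the "poll every N seconds, alert after X off-task hits
# in Y minutes" idea. The two helpers are placeholders, not a real API.
import time
from collections import deque

POLL_SECONDS = 30          # "every N seconds"
WINDOW_MINUTES = 5         # "over Y minutes"
OFF_TASK_LIMIT = 4         # "X times"

def get_screen_summary() -> str:
    """Placeholder: return a short text summary of current screen activity."""
    raise NotImplementedError

def is_on_task(summary: str, goal: str) -> bool:
    """Placeholder: ask a text LLM whether the summary matches the goal."""
    raise NotImplementedError

def watch(goal: str) -> None:
    max_samples = int(WINDOW_MINUTES * 60 / POLL_SECONDS)
    recent = deque(maxlen=max_samples)   # rolling window of on/off-task flags
    while True:
        recent.append(is_on_task(get_screen_summary(), goal))
        if list(recent).count(False) >= OFF_TASK_LIMIT:
            print(f"Heads up: you said you wanted to '{goal}'.")
            recent.clear()               # don't re-alert immediately
        time.sleep(POLL_SECONDS)
```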
Pedalnomica@reddit (OP)
Interesting, I'll check it out!
Accomplished_Mode170@reddit
Love the idea; might've missed it in the post, but miniCPM's latest version is my default vision model for local use.
Pedalnomica@reddit (OP)
I saw their recent Omni model and I'm excited to try it! I've been working on this since well before that was released. Having speech to speech and vision all in one model sounds great for this use case, but:
1) I'm skeptical an 8B can drive workflows well. 2) I'm not sure how it will do with 4K screenshots. I'll have to test it soon. 3) I'm not really sure how to handle workflows with speech to speech. Parsing guided text generation works pretty well with text to text. I guess you could transcribe the conversation and pass it to a text-only LLM with good instruction following (rough sketch below)... Are there any good guides for this?
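As a rough sketch of that transcribe-then-text-LLM route: the transcript plus the latest screen description go to a text model, and guided generation constrains the output to a fixed set of workflow actions. The guided_choice parameter is vLLM's guided-decoding extension to the OpenAI API (other servers differ), and the action names and model are just placeholders:

```python
# Hedged sketch: let a text-only LLM drive the workflow from a transcript,
# using vLLM's guided_choice extra parameter to force one of a few actions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="local")

ACTIONS = ["stay_quiet", "gentle_reminder", "ask_to_wrap_up"]  # placeholder actions

def decide_action(transcript: str, screen_summary: str, goal: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",   # illustrative model name
        messages=[
            {"role": "system", "content": "You decide what a focus assistant should do next."},
            {"role": "user", "content": (
                f"Goal: {goal}\nScreen: {screen_summary}\n"
                f"Conversation so far:\n{transcript}\nPick one action."
            )},
        ],
        extra_body={"guided_choice": ACTIONS},       # vLLM guided decoding
    )
    return resp.choices[0].message.content
```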
AgnosticAndroid@reddit
Is there a reason you need to feed it 4K screenshots instead of something downscaled?
Pedalnomica@reddit (OP)
My screen is 4K and life's better with more screen real estate. If possible, I'd rather not downgrade my resolution.
Any model you can run with, e.g., vLLM will produce outputs with 4K image inputs. However, I'm pretty sure the images are getting downscaled to whatever resolution the model is designed for. For most open weight vision models this is way below 4K. So, text that was a comfortable size for me to read becomes totally illegible. This seems to lead the model to basically always hallucinate what you're doing.
From what I can tell, InternVL2_5 and Qwen2-VL natively accept resolutions slightly above 4K. There's still some resizing to match everything to "patches," but what the models "see" is pretty close to the original resolution.
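To illustrate what that downscaling does to a 4K capture: shrinking it to roughly a one-megapixel budget (an assumed stand-in for a typical model's input limit, not any specific model's number) leaves text at about a third of its original size in each dimension.

```python
# Rough illustration of why downscaling a 4K screenshot hurts legibility.
# The ~1M pixel budget is an assumption for illustration only.
import math
from PIL import Image

def fit_to_budget(img: Image.Image, max_pixels: int = 1_000_000) -> Image.Image:
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = math.sqrt(max_pixels / (w * h))   # ~0.35 for a 3840x2160 capture
    return img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)

screenshot = Image.new("RGB", (3840, 2160))   # stand-in for a real 4K capture
small = fit_to_budget(screenshot)
print(screenshot.size, "->", small.size)      # roughly (3840, 2160) -> (1333, 750)
```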
_thedeveloper@reddit
Hey!
You sure are working on something that excites you, but don't you think it sounds a little out of date? Also, try using a better TTS, something like Kokoro-TTS; it would give better speech output.
Try to get the voice output to be more human-like; it is too robotic. You can avoid screenshots for tasks that can be done with simple scripting, and use screenshots where the user is asking for actual assistance (something like "help me figure out what went wrong in this code block" or "how can I optimize it for lower complexity"). This way you broaden your target audience. Not everyone has a 3090.
Observe task switching using scripts, as that would be a quicker and more effective experience. You could ask any LLM to provide a starter script, then adjust it further to meet users' needs.
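Something like this could watch for task switches by polling the active window title instead of taking screenshots. This assumes Linux/X11 with xdotool installed; other platforms would need their own equivalent:

```python
# Sketch: detect task switching from the active window title rather than
# screenshots. Assumes Linux/X11 with the xdotool CLI available.
import subprocess
import time

def active_window_title() -> str:
    out = subprocess.run(
        ["xdotool", "getactivewindow", "getwindowname"],
        capture_output=True, text=True,
    )
    return out.stdout.strip()

last = None
while True:
    title = active_window_title()
    if title != last:
        print("Switched to:", title)   # feed this to the workflow instead of a screenshot
        last = title
    time.sleep(2)
```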
Great job!
If you try the suggestions I provided, I think they would improve your product a lot.
Pedalnomica@reddit (OP)
The whole point is that it passively keeps an eye on your activity to intervene if you're doing something other than what you want. I don't see how waiting for you to ask for help would work at all.
The TTS model in the video is in fact kokoro-tts.
_thedeveloper@reddit
True, I understand what you are saying. I am suggesting you add it as a feature that allows users to ask screen-specific questions.
About the TTS, the response provided sounds way too dated. I understand you may not have set a specific personality or tone for the response generation; if you have already done that, try to adjust it so it sounds closer to how you would ask someone to get back to work, or observe how people do that in the most efficient manner.
The closer you can get to human-like, the more people will actually use it. The reason is, when someone asks you to get back to work, you do it not just because they asked you to, but because they hold you responsible for that reminder you asked for. Making the user feel it's someone close to them asking them to get back to work requires more natural dialogue.
I know it sounds complex. You should think it through.
If you can figure that out, then you will surely have some attention.