Attend - Proof of Concept

Posted by Pedalnomica@reddit | LocalLLaMA

I've gotten fed up with hopping on the computer to do one thing and ending up doing other stuff instead.

I'm building Attend so that our devices can help us dedicate our time and attention to what matters to us, instead of what someone else thinks is best.

Right now, it is a voice assistant that uses a vision LLM to "watch" your screen and help you get back on track if what you're doing isn't aligned with what you said you wanted to do.
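If you're curious how that fits together, the core idea is: grab a screenshot, ask a vision LLM over an OpenAI-compatible API whether the screen matches your stated intention, and nudge you if it doesn't. Here's a rough sketch in Python (the endpoint, model name, prompt, and check interval are placeholders, not the exact code in the repo):

```python
# Minimal sketch of the "watch the screen" loop. Endpoint, model name,
# and prompt are illustrative; see the Attend repo for the real workflows.
import base64
import time

import mss  # cross-platform screenshots
from openai import OpenAI

# Any OpenAI-compatible server works here (local vLLM shown as an example).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
INTENTION = "write the quarterly report"

def screenshot_b64() -> str:
    """Capture the primary monitor and return it as base64-encoded PNG."""
    with mss.mss() as sct:
        img = sct.grab(sct.monitors[1])  # index 0 is the combined virtual screen
        png = mss.tools.to_png(img.rgb, img.size)
    return base64.b64encode(png).decode()

while True:
    reply = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"The user said they want to: {INTENTION}. "
                         "Does this screen look on-task? Answer YES or NO."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
    )
    if "NO" in reply.choices[0].message.content.upper():
        print("Hey, is this really what you wanted to be doing?")
    time.sleep(60)  # check once a minute
```

In Attend the nudge comes through the voice assistant rather than a print statement, and the actual workflows and prompts are more involved than a single YES/NO question.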

I've got some work to do on the workflows and prompts to reduce false positives, but it "works" and I'm very excited about it!

I'd like to get this down to a single 3090, but two seems pretty feasible. Part of the problem is that most open-weight vision language models are garbage with 4K images/screenshots. Qwen2-VL seems to be an exception, but it (especially the 7B) is garbage when it comes to driving the workflows behind Attend. So, while I get it working, I've just been using Qwen2-VL-7B-Instruct for vision and Llama-3.3 at 8-bit for the text LLM. I'd love to hear suggestions for minimizing VRAM (InternVL2_5 also seems to handle 4K alright, but I haven't tested it enough on the workflows).

Attend interfaces with all models using OpenAI-compatible API calls. So you should be able to use the cloud, if you're into that kinda thing... You could also take a hybrid approach. I think you could get the STT and vision LLM into 16GB VRAM and run that locally. Piper TTS runs well on CPU. You could then use a cloud model just for the text LLM and keep the most sensitive stuff (screenshots!) local.
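Concretely, the hybrid split is just two OpenAI-compatible clients pointed at different base URLs; something like this (base URLs, model names, and the env var are placeholders, not Attend's actual config):

```python
# Sketch of a hybrid setup: the vision LLM (which sees screenshots) runs on a
# local OpenAI-compatible server, while the text LLM uses a cloud endpoint.
import os

from openai import OpenAI

# Local server (e.g. vLLM or llama.cpp) -- screenshots never leave this box.
local_vision = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Cloud provider -- only ever receives text, never images.
cloud_text = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

# The screen-watching loop would use `local_vision` for its image calls;
# conversational/reasoning calls can go to the cloud text model instead.
reply = cloud_text.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "In one sentence, gently remind me to get back to my report."}],
)
print(reply.choices[0].message.content)
```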

Check out the code at https://github.com/hyperfocAIs/Attend/ and a proof-of-concept video at https://youtu.be/PETrY540zMM