has anyone tried local VLMs for desktop GUI automation?

Posted by Enough-Astronaut9278@reddit | LocalLLaMA | View on Reddit | 12 comments

Trying to use a quantized VLM on Apple Silicon to do desktop GUI automation from screenshots. Works OK for basic stuff, but small icons and dense UIs are rough. The visual token count per screenshot is also way higher than I expected, which kills prefill speed.
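For context, roughly what my loop looks like. This is a minimal sketch assuming the quantized VLM is served behind a local OpenAI-compatible endpoint (llama.cpp's `llama-server` / LM Studio style) on `localhost:8080`; the port, endpoint, and `"local-vlm"` model name are placeholders, not anything specific:

```python
# Send a screenshot + instruction to a local OpenAI-compatible VLM server.
# Assumes something like llama-server or LM Studio on localhost:8080;
# model name and URL below are placeholders.
import base64
import json
import urllib.request


def build_payload(png_bytes: bytes, instruction: str) -> dict:
    """Package a PNG screenshot and a text instruction as a
    chat-completions request with an inline base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "local-vlm",  # placeholder; local servers often ignore this
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 256,
    }


def ask_vlm(png_bytes: bytes, instruction: str) -> str:
    """POST the payload and return the model's text reply."""
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",  # placeholder URL
        data=json.dumps(build_payload(png_bytes, instruction)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Every screenshot gets base64-encoded into the prompt like this, so a full-resolution capture turns into a lot of visual tokens per turn, which is where the prefill cost shows up.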

Anyone else working on this locally? Curious what models/approaches people have tried.