has anyone tried local VLMs for desktop GUI automation?
Posted by Enough-Astronaut9278@reddit | LocalLLaMA | 12 comments
Trying to use a quantized VLM on Apple Silicon to do desktop GUI automation from screenshots. Works ok for basic stuff, but small icons and dense UIs are rough. Also, the visual token count per screenshot is way higher than I expected, which kills prefill speed.
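For concreteness, the token math (assuming a Qwen2-VL-style encoder, where 14 px ViT patches plus a 2x2 spatial merge give one LLM token per 28x28 pixel block; other models differ):

```python
import math

def visual_tokens(width: int, height: int, px_per_token: int = 28) -> int:
    # Assumption: Qwen2-VL-style encoder, one LLM token per 28x28 px
    # block (14px ViT patches + 2x2 spatial merge). Check your model card.
    return math.ceil(width / px_per_token) * math.ceil(height / px_per_token)

# One 2560x1440 screenshot is ~4800 visual tokens before any pruning,
# which is why prefill dominates end-to-end latency.
print(visual_tokens(2560, 1440))  # 4784
```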
Anyone else working on this locally? Curious what models/approaches people have tried.
wbulot@reddit
I've actually coded my own browser-use and computer-use agent with Qwen locally, and it works really well. The 35B MoE or 27B dense model gives good results. Qwen models are really good at screenshot understanding and precise element localization. You just need to develop your own logic on top, and you have working automation.
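The core loop is nothing fancy, something like this (sketch, not my exact code; assumes an OpenAI-compatible local server, pyautogui, and a model that answers with click coordinates as JSON):

```python
import base64, io, json
import pyautogui                    # pip install pyautogui pillow
from openai import OpenAI           # any OpenAI-compatible local server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def screenshot_b64() -> str:
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def step(instruction: str) -> None:
    resp = client.chat.completions.create(
        model="qwen-vl",  # placeholder; whatever your server exposes
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
             f"{instruction}\nReply with JSON only: {{\"x\": int, \"y\": int}}"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{screenshot_b64()}"}},
        ]}],
    )
    # Real code needs guards here: models sometimes wrap JSON in fences.
    target = json.loads(resp.choices[0].message.content)
    pyautogui.click(target["x"], target["y"])

step("Click the Save button")
```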
MindPsychological140@reddit
Have you looked at OmniParser? It runs a YOLO detection pass over the screenshot first and hands the VLM a set of labeled regions instead of raw pixels, which helps a lot with small icons and dense UIs.
Enough-Astronaut9278@reddit (OP)
good call on OmniParser, the YOLO step should help a lot with small icons since those are exactly what gets lost when the VLM downscales the screenshot. I've been doing token pruning on the VLM side but running detection first and feeding labeled regions makes more sense for dense UIs. will try combining the two
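roughly the plan (sketch; `detect` is a stand-in for the OmniParser pass, whose real output format differs):

```python
from PIL import Image

def detect(img: Image.Image) -> list[dict]:
    """Stand-in for an OmniParser/YOLO pass; returns labeled boxes."""
    ...  # e.g. [{"id": 3, "label": "icon", "box": (x0, y0, x1, y1)}]

def build_prompt(img: Image.Image) -> str:
    regions = detect(img)
    lines = [f'[{r["id"]}] {r["label"]} at {r["box"]}' for r in regions]
    # The VLM grounds against short labeled regions instead of hunting
    # for a 16px icon in a downscaled full-screen image.
    return ("Detected UI elements:\n" + "\n".join(lines) +
            "\nWhich element id should be clicked to open Preferences?")
```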
MindPsychological140@reddit
Token pruning + region pre-pass should stack well: the regions OmniParser hands you are already high-signal, so post-hoc pruning can be more aggressive without losing accuracy. Curious which pruning method you're running.
Enough-Astronaut9278@reddit (OP)
it's called GSPruning; it keeps spatially important tokens using anchor points and drops redundant background regions. you're right that stacking with OmniParser should let us prune more aggressively since the input is already pre-filtered. the code is in github.com/Mininglamp-AI/Mano-P if you wanna look at the implementation
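the core idea, heavily simplified (sketch from memory, not the actual repo code; the real scoring is more involved than nearest-anchor distance):

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,   # (N, d) visual tokens
                        coords: torch.Tensor,   # (N, 2) token grid positions
                        anchors: torch.Tensor,  # (A, 2) anchor positions
                        keep_ratio: float = 0.3) -> torch.Tensor:
    # Score each token by proximity to its nearest anchor; background
    # tokens far from every anchor score low and get dropped.
    dists = torch.cdist(coords, anchors)         # (N, A)
    score = -dists.min(dim=1).values             # closer anchor = higher
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep = score.topk(k).indices.sort().values   # preserve spatial order
    return tokens[keep]
```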
MindPsychological140@reddit
Nice, GSPruning + OmniParser becomes a hierarchical spatial filter (region-level + anchor-level). For GUI specifically I'd be curious if anchor selection benefits from an interactivity prior (clickable > static), since spatial importance in GUIs maps fairly cleanly to interactivity. Did Mano-P bake any of that in, or is it task-agnostic?
Enough-Astronaut9278@reddit (OP)
haven't tried weighting anchors by interactivity yet, currently just spatial salience. makes sense tho, clickable stuff does cluster in predictable spots so that could reduce the anchor budget a lot
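the cheap version would be something like this (hypothetical; `is_clickable` would come from detector labels or the accessibility tree, and the boost constant is made up):

```python
# Hypothetical tweak on top of the pruning sketch above: boost anchors
# that sit on interactive elements before selecting tokens.
CLICKABLE_BOOST = 2.0  # made-up constant; tune on a GUI benchmark

def weighted_anchor_scores(scores, is_clickable):
    # is_clickable: bool per anchor, e.g. from OmniParser labels
    # ("button", "icon", "link") or the platform accessibility tree.
    return [s * (CLICKABLE_BOOST if c else 1.0)
            for s, c in zip(scores, is_clickable)]
```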
MindPsychological140@reddit
exactly, append-only structure is the cache-friendly property. Claude Code's design is "stable prefix (system + repo context + memory) built once, everything else appended as turns." That maps cleanly onto how Anthropic's cache_control breakpoints work: the prefix never moves, so reads stay cheap.

The frameworks that fight this are the ones with history-mutation patterns: system → human → AI → tool → AI per iteration. Cacheable if you structure carefully, brutal if you don't. Append-only + an explicit cache breakpoint at the boundary between "stuff that never changes" and "stuff that grows" is the whole game. Most non-Anthropic agent frameworks were built before prompt caching existed, so they don't structure context with that in mind.
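Concretely, the breakpoint with Anthropic's SDK looks like this (sketch; STABLE_PREFIX, history, and new_turn are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

STABLE_PREFIX = "system prompt + repo context + memory"  # built once
history = []          # grows append-only, one turn at a time
new_turn = "user's latest message"

resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": STABLE_PREFIX,
        # Cache breakpoint: everything up to here never changes, so
        # later calls reuse the cached prefix at reduced cost.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=history + [{"role": "user", "content": new_turn}],
)
```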
FWIW I've been working on this from the other direction: deduping the chunks that go into context before they hit the model, since even Claude Code accumulates duplicate file snippets and tool results across long sessions. Open-sourced it as Merlin: github.com/corbenicai/merlin-community
MIT, runs locally, MCP tool. Measured 22% chunk-level dedup on typical agent sessions on my end, up to 71% on RAG-heavy stuff. Different problem from cache invalidation, but it stacks well with it.
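The core of chunk-level dedup fits in a few lines, for anyone curious (sketch; exact-match hashing after whitespace normalization, Merlin itself does more than this):

```python
import hashlib

seen: set[str] = set()

def dedupe(chunks: list[str]) -> list[str]:
    out = []
    for chunk in chunks:
        # Normalize whitespace so re-read file snippets with trivial
        # formatting drift still collide on the same hash.
        key = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(chunk)
    return out
```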
vko-@reddit
I'm working on that now. The Qwen 3/3.5 models are not bad at this stuff but will need some scaffolding. I'm building my own for my master's thesis, but you can check out https://github.com/simular-ai/agent-s and the like
Enough-Astronaut9278@reddit (OP)
qwen 3.5 is solid for vision tasks. what kind of scaffolding are you adding on top? I've found the verify-after-each-action step matters a lot for reliability. will check out agent-s, thanks
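what I mean by verify-after-each-action, roughly (sketch; `ask_vlm` and `take_screenshot` are stand-ins for your own inference and capture calls):

```python
import time

def act_and_verify(action, expectation: str, retries: int = 2) -> bool:
    """Run a GUI action, then ask the VLM if the UI changed as expected."""
    for _ in range(retries + 1):
        action()          # e.g. lambda: pyautogui.click(x, y)
        time.sleep(0.5)   # let the UI settle before re-capturing
        answer = ask_vlm(take_screenshot(),
                         f"Did this happen: {expectation}? Answer yes or no.")
        if answer.strip().lower().startswith("yes"):
            return True
    return False
```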
MixtureOfAmateurs@reddit
There's a really good framework for browser automation, pretty sure it's this https://github.com/browser-use/browser-use but I haven't used it for ages so idk. Full desktop navigation is hard. Giving a VLM terminal and browser access is easy.
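the quickstart was roughly this last time I touched it (from memory, so check their README; the model choice is just an example and you can swap in a local endpoint):

```python
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI  # any LangChain chat model works

async def main():
    agent = Agent(
        task="Find the latest release notes for browser-use on GitHub",
        llm=ChatOpenAI(model="gpt-4o"),  # point at a local server if you want
    )
    await agent.run()

asyncio.run(main())
```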
Enough-Astronaut9278@reddit (OP)
yeah I've seen browser-use. my use case is more non-browser stuff tho: Finder, System Preferences, and some legacy apps with no API. been trying Mano-P for that (github.com/Mininglamp-AI/Mano-P); it just screenshots and figures out where to click. works for simple stuff, but full desktop is def harder like you said