Looking for fast vision-capable local models that handle tool calls well (open-source app, want to add local support)

Posted by yaboyskales@reddit | LocalLLaMA | View on Reddit | 18 comments

Hi r/LocalLLaMA,

I built an open-source, MIT-licensed desktop app: a cursor-aware AI overlay. You hold a key, ask about whatever's around your cursor, and a vision LLM answers using a screenshot of the cursor region as context.

Currently it routes through cloud providers (OpenRouter, Anthropic, OpenAI, Gemini direct). The default model is Gemini 3 Flash because of its speed and vision quality. The UX needs sub-2-second time-to-first-token; otherwise the "hold a key and get an answer" flow falls apart.

I'd love to add local model support as a first-class option. The community here clearly knows this space better than me.

Requirements:

- Vision-capable (image input alongside text prompt)

- Fast on consumer hardware (M-series Macs, RTX 3090/4090, mid-range cards)

- Handles function calling / tool use reliably (the app exposes 6 tools: fetch_url, open_url, copy, save, reveal_folder, read_clipboard; rough schema sketch after this list)

- Good enough for short Q&A about screenshots (not asking for GPT-5-level reasoning, just accurate visual understanding)
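
To make the tool-calling requirement concrete, here's roughly the shape of what the app sends. It's the OpenAI-style function-calling format; the parameter schemas below are simplified placeholders, not the exact definitions from the repo. Whatever local model and stack we land on needs to emit well-formed calls against schemas like these:

```python
# Sketch of the tool definitions (OpenAI-style function-calling format).
# Parameter schemas here are illustrative, not the app's exact ones.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "fetch_url",
            "description": "Fetch a URL and return its contents as text",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "copy",
            "description": "Copy the given text to the system clipboard",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    # open_url, save, reveal_folder, read_clipboard follow the same pattern
]
```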

What I've seen in this sub but want input on:

- Qwen2.5-VL — looks promising for vision + tools

- MiniCPM-V — speed reportedly good

- Llama 3.2 Vision — slower but maybe better tool calling

- Pixtral — vision strong, tools unclear

- Anything else I'm missing?

What I'm asking:

  1. Which of these (or other) models would you bet on for a fast cursor-aware UX?

  2. Best inference stack? llama.cpp, Ollama, LM Studio, vLLM, MLX for Mac?

  3. Any of you running vision models locally with tool calls in production? What's the actual time-to-first-token like? (Rough script I'm using to measure it is below.)
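
For question 3, here's roughly how I'm measuring TTFT, assuming an OpenAI-compatible local endpoint (Ollama serves one at http://localhost:11434/v1 by default; llama.cpp's server and LM Studio expose the same API). The model tag, screenshot path, and tool definition are placeholders, and whether a given stack supports streaming together with tools and image input is exactly the kind of thing I'm hoping to hear about:

```python
import base64
import time

from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

# Point the client at a local server (Ollama's default shown; api_key is ignored locally).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

# One placeholder tool in the OpenAI function-calling format (see the sketch above).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "open_url",
        "description": "Open a URL in the default browser",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

with open("cursor_region.png", "rb") as f:  # placeholder screenshot of the cursor region
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen2.5vl:7b",  # placeholder tag; swap in whichever vision model you're testing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the dialog in this screenshot say, and what should I click?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    tools=TOOLS,
    stream=True,
)

ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - start  # time until the first streamed chunk arrives
    # ...consume the rest of the stream as usual

print(f"time to first token: {ttft:.2f}s")
```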

If we figure out a solid combo, I'll add it as a built-in provider option in AIPointer alongside the cloud routes. Source: github.com/talentsache/aipointer

Thanks in advance. Happy to share back what works once I've tested.