Looking for fast vision-capable local models that handle tool calls well (open-source app, want to add local support)
Posted by yaboyskales@reddit | LocalLLaMA | 18 comments
Hi r/LocalLLaMA,
I built an open-source, MIT-licensed desktop app: a cursor-aware AI overlay. Hold a key, ask the AI about whatever's around your cursor, and a vision LLM answers with a screenshot of the cursor region as context.
Currently it routes through cloud providers (OpenRouter, Anthropic, OpenAI, Gemini direct). Default model is Gemini 3 Flash because of its speed and vision quality. The UX needs sub-2-second time-to-first-token, otherwise the "hold a key and get an answer" flow falls apart.
I'd love to add local model support as a first-class option. The community here clearly knows this space better than me.
Requirements:
- Vision-capable (image input alongside text prompt)
- Fast on consumer hardware (M-series Macs, RTX 3090/4090, mid-range cards)
- Handles function calling / tool use reliably (6 tools in the app: fetch_url, open_url, copy, save, reveal_folder, read_clipboard; rough schemas sketched after this list)
- Good enough for short Q&A about screenshots (not asking for GPT-5-level reasoning, just accurate visual understanding)
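For context, the tools are plain function-calling definitions, roughly like this (OpenAI-style schemas; the parameter shapes here are simplified for illustration, not the exact ones in the repo):

```python
# Illustrative OpenAI-style schemas for the app's tools. Tool names match
# the app; parameter shapes are simplified for the example.
TOOLS = [
    {"type": "function", "function": {
        "name": "fetch_url",
        "description": "Fetch a URL and return its text content",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "copy",
        "description": "Copy text to the clipboard",
        "parameters": {"type": "object",
                       "properties": {"text": {"type": "string"}},
                       "required": ["text"]}}},
    {"type": "function", "function": {
        "name": "read_clipboard",
        "description": "Read the current clipboard contents",
        "parameters": {"type": "object", "properties": {}}}},
    # open_url, save, reveal_folder follow the same pattern
]
```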
What I've seen in this sub but want input on:
- Qwen2.5-VL — looks promising for vision + tools
- MiniCPM-V — speed reportedly good
- Llama 3.2 Vision — slower but maybe better tool calling
- Pixtral — vision strong, tools unclear
- Anything else I'm missing?
What I'm asking:
- Which of these (or other) models would you bet on for a fast cursor-aware UX?
- Best inference stack? llama.cpp, Ollama, LM Studio, vLLM, MLX for Mac?
- Any of you running vision models locally with tool calls in production? What's the actual time-to-first-token like?
If we figure out a solid combo, I'll add it as a built-in provider option in AIPointer alongside the cloud routes. Source: github.com/talentsache/aipointer
Thanks in advance. Happy to share back what works once I've tested.
ilintar@reddit
If you want fast + good + visual, Qwen3.6 35B-A3B is probably your best bet.
ilintar@reddit
For the stack, just use llama.cpp, should work out of the box on consumer hardware, might need to limit max context if short on VRAM.
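Something like this against llama-server's OpenAI-compatible endpoint is enough to test with; it assumes the server was started with a vision model plus its mmproj file, and the port, image path and model name are placeholders:

```python
# Minimal sketch: query a local llama-server (llama.cpp) through its
# OpenAI-compatible endpoint. Assumes it was launched with a vision model
# and the matching mmproj file; port, paths and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

with open("cursor_region.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="local",  # llama-server serves one model; the name is mostly cosmetic
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the error in this region about?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```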
yaboyskales@reddit (OP)
Yeah, that's exactly what I need, response time is especially important for a tool like this. Thanks for your take, I'll take a deeper look while building the upcoming update.
ilintar@reddit
Disregard accounts saying stuff like "Qwen2.5 VL", that stuff is prehistoric.
yaboyskales@reddit (OP)
That's how it responds with cloud providers; something in this speed/time frame would be nice to have with a local model too.
https://i.redd.it/p8zqsw3hs21h1.gif
Ha_Deal_5079@reddit
qwen2.5-vl is solid for tool calling - their benchmarks show 93% type match on function calling, which beats gpt-4o in structured tasks. miniCPM is faster on throughput but there's no tool calling eval for it yet, so qwen is the safer bet
dampflokfreund@reddit
Bot with old knowledge is old
yaboyskales@reddit (OP)
Agreed that miniCPM having no tool calling eval is a problem, even if its throughput is better. For a hold-key-to-answer UX I need a model that won't randomly fail to emit the tool_use schema mid-conversation. Qwen2.5-VL is the safer choice for v1 and I can always add miniCPM as an experimental option later if their tool calling story improves. Appreciate your take!
InteractionSmall6778@reddit
For M-series Mac: Qwen2.5-VL 7B through MLX is your best starting point. Hits under 2 seconds TTFT on M2/M3 Pro for screenshot queries, and the tool calling is actually reliable, not just documented as supported.
For CUDA (3090/4090): same model through Ollama or llama.cpp. The 7B at Q4 fits in 8GB VRAM and hits your speed target. Skip vLLM, the setup overhead doesn't pay off for single-user local inference.
One thing working in your favor: cropping to the cursor region means small images and fast prefill, so the 7B is more than enough. Llama 3.2 Vision and Pixtral both have inconsistent tool call support depending on backend, so I'd start with Qwen2.5-VL and work outward from there.
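If you want to verify the numbers on your own hardware, the quickest sanity check is to stream from whichever OpenAI-compatible endpoint you land on (llama-server, Ollama's /v1 route, LM Studio) and time the first chunk. Rough sketch; URL and model name are placeholders, and for a realistic number include the screenshot in the message the same way the app would:

```python
# Quick TTFT check against any local OpenAI-compatible endpoint.
# URL and model name are placeholders; include the image in the
# message for a realistic prefill measurement.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen2.5-vl-7b",  # whatever name your server exposes
    messages=[{"role": "user", "content": "Describe this region briefly."}],
    max_tokens=128,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"TTFT: {ttft:.2f}s")
print(f"total: {time.perf_counter() - start:.2f}s")
```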
fasti-au@reddit
One 3090 can do 27B…
yaboyskales@reddit (OP)
Thank you for the take
yaboyskales@reddit (OP)
Small image + fast prefill is a great angle. Cursor-region crops are typically under 512px after compression, so prefill stays cheap and 7B should be plenty.
Prototyping Qwen2.5-VL 7B over the next week, MLX on Mac and llama.cpp on CUDA, will report back. If it lands cleanly it goes in as a first-class provider option next to the cloud routes. Thank you!
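For reference, the preprocessing amounts to: crop a box around the cursor, cap the longest side, recompress. Roughly like this (Pillow; the box size and quality values are illustrative, not the app's exact settings):

```python
# Illustrative cursor-region preprocessing: crop around the cursor,
# cap the longest side at 512 px, re-encode as JPEG, return base64.
# Box size and quality are placeholders, not the app's exact settings.
import base64, io
from PIL import Image

def cursor_crop(screenshot: Image.Image, cx: int, cy: int,
                box: int = 480, max_side: int = 512) -> str:
    """Return a base64 JPEG of the region around the cursor position (cx, cy)."""
    left = max(cx - box // 2, 0)
    top = max(cy - box // 2, 0)
    right = min(left + box, screenshot.width)
    bottom = min(top + box, screenshot.height)
    region = screenshot.crop((left, top, right, bottom))
    region.thumbnail((max_side, max_side))  # caps the longest side, keeps aspect ratio
    buf = io.BytesIO()
    region.convert("RGB").save(buf, format="JPEG", quality=80)
    return base64.b64encode(buf.getvalue()).decode()
```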
jacky2060@reddit
You should try using Qwen 3.5/3.6. Make sure to set --image-min-tokens to something reasonable like 1024.
yaboyskales@reddit (OP)
that's a great tip. Setting a token floor on the vision encoder probably stabilizes TTFT a lot, especially for tiny cursor-region crops where the model otherwise overthinks small inputs. Will test alongside Qwen2.5-VL. Thank you!
fasti-au@reddit
Qwen 9b? Qwen vl?
yaboyskales@reddit (OP)
Qwen2.5-VL 7B is what got suggested elsewhere in the thread for the vision + tool calls combo. Qwen 9B itself doesn't seem to be a vision model from what I can find; do you have a link to the variant you mean? Thank you
Otherwise_Economy576@reddit
for sub-2s TTFT with vision + tools, the realistic shortlist on consumer hardware is shorter than people make out:
few honest caveats from running this stack:
for a hold-key-to-answer UX i'd ship Qwen2.5-VL 7B as default with a fallback to cloud when the model returns malformed tool calls
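rough shape of that fallback - validate the tool call locally and only escalate when it's malformed (call_local / call_cloud are hypothetical wrappers, not real APIs):

```python
# Sketch of local-first inference with a cloud fallback when the local
# model emits a malformed tool call. call_local / call_cloud are
# hypothetical wrappers around the two providers.
import json

VALID_TOOLS = {"fetch_url", "open_url", "copy", "save",
               "reveal_folder", "read_clipboard"}

def tool_calls_ok(message) -> bool:
    """Accept the response only if every tool call is well-formed."""
    for call in (message.tool_calls or []):
        if call.function.name not in VALID_TOOLS:
            return False
        try:
            json.loads(call.function.arguments)  # arguments must be valid JSON
        except (TypeError, json.JSONDecodeError):
            return False
    return True

def answer(prompt, image_b64, tools):
    local = call_local(prompt, image_b64, tools)    # e.g. Qwen2.5-VL 7B
    if tool_calls_ok(local.choices[0].message):
        return local
    return call_cloud(prompt, image_b64, tools)     # escalate to the cloud route
```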
yaboyskales@reddit (OP)
This is gold, thanks for taking the time 🙏
Right now my screenshots go up to 1024x768 but I could easily drop the default to 512px and let users opt-in to larger via a setting. That alone should halve TTFT on every backend.
Your suggestion of routing local for tool categories and cloud for harder cases makes a lot of sense. Cleaner mental model than fallback-on-error, too.
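Roughly what I have in mind for the router (the keyword list and provider names are placeholders, not the app's actual logic):

```python
# Placeholder routing sketch: send tool-style requests to the local model,
# everything else to the cloud route. Keyword hints are illustrative only.
TOOL_HINTS = ("copy", "save", "open", "clipboard", "fetch", "folder")

def pick_provider(prompt: str) -> str:
    """Return "local" for tool-style requests, "cloud" for everything else."""
    p = prompt.lower()
    return "local" if any(hint in p for hint in TOOL_HINTS) else "cloud"
```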
Going to start with Qwen2.5-VL 7B + llama.cpp on CUDA and Qwen2.5-VL 7B + MLX on Mac, exactly as you laid out. Will report back once I have real numbers from the cursor-region case.
Thank you for your take and time!