Looking for fast vision-capable local models that handle tool calls well (open-source app, want to add local support)
Posted by yaboyskales@reddit | LocalLLaMA | 18 comments
Hi r/LocalLLaMA,
I built an open-source, MIT-licensed desktop app: a cursor-aware AI overlay. Hold a key, ask the AI about whatever's around your cursor, and a vision LLM answers with a screenshot of the cursor region as context.
Currently it routes through cloud providers (OpenRouter, Anthropic, OpenAI, Gemini direct). Default model is Gemini 3 Flash because of its speed and vision quality. The UX needs sub-2-second time-to-first-token, otherwise the "hold a key and get an answer" flow falls apart.
I'd love to add local model support as a first-class option. The community here clearly knows this space better than me.
Requirements:
- Vision-capable (image input alongside text prompt)
- Fast on consumer hardware (M-series Macs, RTX 3090/4090, mid-range cards)
- Handles function calling / tool use reliably (6 tools in the app: fetch_url, open_url, copy, save, reveal_folder, read_clipboard; rough schemas sketched after this list)
- Good enough for short Q&A about screenshots (not asking for GPT-5-level reasoning, just accurate visual understanding)
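For context, the tools are plain function-calling definitions, roughly like this (OpenAI-style schemas; the parameter shapes here are simplified for illustration, not the exact ones in the repo):

```python
# Illustrative OpenAI-style schemas for the app's tools. Tool names match
# the app; parameter shapes are simplified for the example.
TOOLS = [
    {"type": "function", "function": {
        "name": "fetch_url",
        "description": "Fetch a URL and return its text content",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "copy",
        "description": "Copy text to the clipboard",
        "parameters": {"type": "object",
                       "properties": {"text": {"type": "string"}},
                       "required": ["text"]}}},
    {"type": "function", "function": {
        "name": "read_clipboard",
        "description": "Read the current clipboard contents",
        "parameters": {"type": "object", "properties": {}}}},
    # open_url, save, reveal_folder follow the same pattern
]
```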
What I've seen in this sub but want input on:
- Qwen2.5-VL — looks promising for vision + tools
- MiniCPM-V — speed reportedly good
- Llama 3.2 Vision — slower but maybe better tool calling
- Pixtral — vision strong, tools unclear
- Anything else I'm missing?
What I'm asking:
- Which of these (or other) models would you bet on for a fast cursor-aware UX?
- Best inference stack? llama.cpp, Ollama, LM Studio, vLLM, MLX for Mac?
- Any of you running vision models locally with tool calls in production? What's the actual time-to-first-token like?
If we figure out a solid combo, I'll add it as a built-in provider option in AIPointer alongside the cloud routes. Source: github.com/talentsache/aipointer
Thanks in advance. Happy to share back what works once I've tested.
ilintar@reddit
If you want fast + good + visual, Qwen3.6 35B-A3B is probably your best bet.
ilintar@reddit
For the stack, just use llama.cpp, should work out of the box on consumer hardware, might need to limit max context if short on VRAM.
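Something like this against llama-server's OpenAI-compatible endpoint is enough to test with; it assumes the server was started with a vision model plus its mmproj file, and the port, image path and model name are placeholders:

```python
# Minimal sketch: query a local llama-server (llama.cpp) through its
# OpenAI-compatible endpoint. Assumes it was launched with a vision model
# and the matching mmproj file; port, paths and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

with open("cursor_region.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="local",  # llama-server serves one model; the name is mostly cosmetic
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the error in this region about?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```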
yaboyskales@reddit (OP)
Yeah, that's exactly what I need, response time is especially important for a tool like this. Thanks for your take, I'll take a deeper look while building the upcoming update.
ilintar@reddit
Disregard accounts saying stuff like "Qwen2.5 VL", that stuff is prehistoric.
yaboyskales@reddit (OP)
That's how it responds with cloud providers; something in this speed/time frame would be nice to have with a local model too.
https://i.redd.it/p8zqsw3hs21h1.gif
Ha_Deal_5079@reddit
qwen2.5-vl is solid for tool calling - their benchmarks show 93% type match on function calling, which beats gpt-4o in structured tasks. miniCPM is faster on throughput but there's no tool calling eval for it yet, so qwen is the safer bet
dampflokfreund@reddit
Bot with old knowledge is old
yaboyskales@reddit (OP)
Agreed that miniCPM having no tool calling eval is a problem, even if its throughput is better. For a hold-key-to-answer UX I need a model that won't randomly fail to emit the tool_use schema mid-conversation. Qwen2.5-VL is the safer choice for v1 and I can always add miniCPM as an experimental option later if their tool calling story improves. Appreciate your take!
InteractionSmall6778@reddit
For M-series Mac: Qwen2.5-VL 7B through MLX is your best starting point. Hits under 2 seconds TTFT on M2/M3 Pro for screenshot queries, and the tool calling is actually reliable, not just documented as supported.
For CUDA (3090/4090): same model through Ollama or llama.cpp. The 7B at Q4 fits in 8GB VRAM and hits your speed target. Skip vLLM, the setup overhead doesn't pay off for single-user local inference.
One thing working in your favor: cropping to the cursor region means small images and fast prefill, so the 7B is more than enough. Llama 3.2 Vision and Pixtral both have inconsistent tool call support depending on backend, so I'd start with Qwen2.5-VL and work outward from there.
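If you want to verify the numbers on your own hardware, the quickest sanity check is to stream from whichever OpenAI-compatible endpoint you land on (llama-server, Ollama's /v1 route, LM Studio) and time the first chunk. Rough sketch; URL and model name are placeholders, and for a realistic number include the screenshot in the message the same way the app would:

```python
# Quick TTFT check against any local OpenAI-compatible endpoint.
# URL and model name are placeholders; include the image in the
# message for a realistic prefill measurement.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen2.5-vl-7b",  # whatever name your server exposes
    messages=[{"role": "user", "content": "Describe this region briefly."}],
    max_tokens=128,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"TTFT: {ttft:.2f}s")
print(f"total: {time.perf_counter() - start:.2f}s")
```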
fasti-au@reddit
One 3090 can do 27B…
yaboyskales@reddit (OP)
Thank you for the take
yaboyskales@reddit (OP)
Small image + fast prefill is a great angle. Cursor-region crops are typically under 512px after compression, so prefill stays cheap and 7B should be plenty.
Prototyping Qwen2.5-VL 7B over the next week, MLX on Mac and llama.cpp on CUDA, will report back. If it lands cleanly it goes in as a first-class provider option next to the cloud routes. Thank you!
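For reference, the preprocessing amounts to: crop a box around the cursor, cap the longest side, recompress. Roughly like this (Pillow; the box size and quality values are illustrative, not the app's exact settings):

```python
# Illustrative cursor-region preprocessing: crop around the cursor,
# cap the longest side at 512 px, re-encode as JPEG, return base64.
# Box size and quality are placeholders, not the app's exact settings.
import base64, io
from PIL import Image

def cursor_crop(screenshot: Image.Image, cx: int, cy: int,
                box: int = 480, max_side: int = 512) -> str:
    """Return a base64 JPEG of the region around the cursor position (cx, cy)."""
    left = max(cx - box // 2, 0)
    top = max(cy - box // 2, 0)
    right = min(left + box, screenshot.width)
    bottom = min(top + box, screenshot.height)
    region = screenshot.crop((left, top, right, bottom))
    region.thumbnail((max_side, max_side))  # caps the longest side, keeps aspect ratio
    buf = io.BytesIO()
    region.convert("RGB").save(buf, format="JPEG", quality=80)
    return base64.b64encode(buf.getvalue()).decode()
```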
jacky2060@reddit
You should try using Qwen 3.5/3.6. Make sure to set --image-min-tokens to something reasonable like 1024.
yaboyskales@reddit (OP)
that's a great tip. Setting a token floor on the vision encoder probably stabilizes TTFT a lot, especially for tiny cursor-region crops where the model otherwise overthinks small inputs. Will test alongside Qwen2.5-VL. Thank you!
fasti-au@reddit
Qwen 9b? Qwen vl?
yaboyskales@reddit (OP)
Qwen2.5-VL 7B is what got suggested elsewhere in the thread for the vision + tool calls combo. Qwen 9B itself doesn't seem to be a vision model from what I can find; do you have a link to the variant you mean? Thank you
Otherwise_Economy576@reddit
for sub-2s TTFT with vision + tools, the realistic shortlist on consumer hardware is shorter than people make out:
few honest caveats from running this stack:
for a hold-key-to-answer UX i'd ship Qwen2.5-VL 7B as default with a fallback to cloud when the model returns malformed tool calls
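rough shape of that fallback - validate the tool call locally and only escalate when it's malformed (call_local / call_cloud are hypothetical wrappers, not real APIs):

```python
# Sketch of local-first inference with a cloud fallback when the local
# model emits a malformed tool call. call_local / call_cloud are
# hypothetical wrappers around the two providers.
import json

VALID_TOOLS = {"fetch_url", "open_url", "copy", "save",
               "reveal_folder", "read_clipboard"}

def tool_calls_ok(message) -> bool:
    """Accept the response only if every tool call is well-formed."""
    for call in (message.tool_calls or []):
        if call.function.name not in VALID_TOOLS:
            return False
        try:
            json.loads(call.function.arguments)  # arguments must be valid JSON
        except (TypeError, json.JSONDecodeError):
            return False
    return True

def answer(prompt, image_b64, tools):
    local = call_local(prompt, image_b64, tools)    # e.g. Qwen2.5-VL 7B
    if tool_calls_ok(local.choices[0].message):
        return local
    return call_cloud(prompt, image_b64, tools)     # escalate to the cloud route
```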
yaboyskales@reddit (OP)
This is gold, thanks for taking the time 🙏
Right now my screenshots go up to 1024x768 but I could easily drop the default to 512px and let users opt-in to larger via a setting. That alone should halve TTFT on every backend.
Your suggestion of routing local for tool categories and cloud for harder cases makes a lot of sense. Cleaner mental model than fallback-on-error, too.
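Roughly what I have in mind for the router (the keyword list and provider names are placeholders, not the app's actual logic):

```python
# Placeholder routing sketch: send tool-style requests to the local model,
# everything else to the cloud route. Keyword hints are illustrative only.
TOOL_HINTS = ("copy", "save", "open", "clipboard", "fetch", "folder")

def pick_provider(prompt: str) -> str:
    """Return "local" for tool-style requests, "cloud" for everything else."""
    p = prompt.lower()
    return "local" if any(hint in p for hint in TOOL_HINTS) else "cloud"
```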
Going to start with Qwen2.5-VL 7B + llama.cpp on CUDA and Qwen2.5-VL 7B + MLX on Mac, exactly as you laid out. Will report back once I have real numbers from the cursor-region case.
Thank you for your take and time!