Best Local VLMs - November 2025
Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 36 comments
Share what your favorite models are right now and why. Given the nature of the beast in evaluating VLMs (untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, nature of your usage (what applications, how much, personal/professional use), tools/frameworks/prompts etc.
Rules
- Should be open weights models
SlowFail2433@reddit
janhq/Jan-v2-VL-high
This looked good and is super new
Privocado@reddit
Technically, it's a finetune of Qwen-3-VL-8B-Thinking specialized for long-horizon tasks. Thanks for the suggestion, definitely exploring it.
rm-rf-rm@reddit (OP)
Literally the first time I'm hearing about it. Their earlier model, claimed to be SOTA for search, was a letdown.
SlowFail2433@reddit
It’s just a small lab doing RL on 4Bs and 8Bs, so expectations should be in line with that, really.
rm-rf-rm@reddit (OP)
Sure, you can RL and produce something genuinely useful for specific use cases, OR you can just benchmaxx and spend most of your time marketing your scores. Jan feels like the latter.
Betadoggo_@reddit
It worked pretty well for me in q8. It's not doing anything too crazy, but it was able to retrieve current information like release numbers and such. For a 4B I found it pretty impressive, especially since it did it via tool calls rather than a multi-prompt pipeline, which is usually easier for models that size to handle. None of their claims were that crazy; they only gained 5 points over their base model on their main benchmark, SimpleQA.
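For anyone curious about the difference: in the tool-call approach the model itself decides when to search, instead of being driven through a fixed multi-prompt pipeline. A rough sketch, assuming an OpenAI-compatible chat API (the tool name, schema, and model id here are my own placeholders, not Jan's actual setup):

```python
# Hypothetical sketch of the tool-call approach: the model is handed a
# search tool definition and chooses when to invoke it. Tool name,
# schema, and model id are illustrative assumptions.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def build_tool_request(question, model="jan-v2-vl"):
    # One request carries both the user question and the available tools;
    # the server returns either an answer or a tool_call to execute.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [SEARCH_TOOL],
    }

req = build_tool_request("What is the latest llama.cpp release?")
print(req["tools"][0]["function"]["name"])
```

A multi-prompt pipeline would instead hard-code the search step between two separate model calls, which is why it's usually the safer bet for small models.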
Pitiful-Rub-3037@reddit
I have tried paddleocr-vl, dots.ocr, qwen3vl, and chandra. For me, Chandra was the clear winner, except for its restrictive licensing on the weights.
vasileer@reddit
minimax-m2, agentic use and coding with roocode
rm-rf-rm@reddit (OP)
Damn, really? The benchmarks do put it in the same regime, so good to hear they hold up in actual use.
kryptkpr@reddit
If you're going to play with minimax-m2, make sure you read their model card for the recommended sampler settings. In my evaluations this model behaves funny in the long tail with temps below 1.0 or low top-k.
rm-rf-rm@reddit (OP)
Oof, shouldn't it be the opposite? It should behave well below 1...
kryptkpr@reddit
It gets stuck in loops it can't escape if you cool it / limit it too much. The loops start around 6k context length and get fairly bad by 12k, so they only show up when I slam it.
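If you want to pin the sampler settings explicitly rather than trust server defaults, here's a minimal sketch assuming an OpenAI-compatible endpoint (llama.cpp server, vLLM, etc.). The values below are illustrative placeholders; copy the exact numbers from the MiniMax-M2 model card:

```python
# Hypothetical sketch: building a chat-completion payload with explicit
# sampler settings for minimax-m2. The numeric defaults here are
# placeholders, NOT the card's authoritative values.
import json

def build_request(messages, temperature=1.0, top_p=0.95, top_k=40):
    # Keeping temperature at 1.0 (not lower) is the point of the comment
    # above: cooling the model too much triggers repetition loops.
    return {
        "model": "minimax-m2",
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,  # top_k is a common server extension, not core OpenAI API
    }

payload = build_request([{"role": "user", "content": "Summarize this diff."}])
print(json.dumps(payload, indent=2))
```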
kc858@reddit
That happens to me with GLM 4.5 Air REAP AWQ; it loops until one of my cards hits 90°C.
Badger-Purple@reddit
not a VLM!
work_urek03@reddit
MiniMax M2 is a VLM?
airbus_a360_when@reddit
The Ovis series of models seem to be the best at general questions and not hallucinating visual details, especially for their size. Unfortunately nobody seems to be interested in actually making it runnable on llama.cpp and the like, so it's not useful for the average person at all.
aeroumbria@reddit
Qwen3 VL seems to work really well with Qwen Image / WAN for self-directed scene-sequence generation. I like the 8B Thinking, 30B MOE, and 32B Instruct so far. 32B Thinking is getting quite slow for interactive use on my machine, but it is quite good for world building and story sketching. 8B Thinking is quite good at following instructions compared to the non-thinking smaller models, and is still fast enough as a workflow step. 30B MOE is probably the best balance if you need the main GPU for other stuff like ComfyUI but can spare a second GPU with limited capabilities.
Badger-Purple@reddit
Qwen3 VL has been winning. 8B, 30B-A3B… not sure why they released the 32B, but it's also in the same league. 8B VL feels like their Qwen3 4B 2507 Thinking: it punches miles above its weight.
gaztrab@reddit
Currently testing Qwen3-Next-80b 6bit on my M3 Max 96GB. It's blazing fast (12s to first token, 50 t/s) and very good at following instructions too! Perfect for prototyping agentic frameworks.
Badger-Purple@reddit
not VLM!!
rm-rf-rm@reddit (OP)
have you used it for agentic coding (with claude code or open code etc.)?
Past-Grapefruit488@reddit
Qwen 3 VL for browser use
rekriux@reddit
Kimi-Linear-48B-A3B-Instruct-AWQ-4bit
qwq+flavors (FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview, AirliAI,...)
Hermes-3-Llama-3.2-3B
Llama-3_3-Nemotron-Super-49B-v1_5
DeepHermes-3-Mistral-24B-Preview
deepcogito_cogito-v2-preview-llama-109B
Qwen3-30B-A3B-Thinking-2507
granite-4.0-h-small
Olmo-3-32B-Think
rm-rf-rm@reddit (OP)
VLMs. Not LLMs
film_man_84@reddit
Well, I just installed qwen3-vl-8b and tested it with some different photos. I was blown away by how well it detected text, understood the text was in Finnish, recognized a screenshot from The Godfather, and even identified some famous actresses when I tested it, so I'm very impressed!
Actually, after a couple of such accurate detections I started to doubt whether it was really running locally, so I even unplugged my network cable and kept testing images, and it still worked :D
So yeah, for now this will be my model of choice for further testing because of its visual detection!
rm-rf-rm@reddit (OP)
OCRArena agrees that qwen3-vl-8b is the best open-weights model. It's doing remarkably well, sitting right next to Sonnet 4.5 right now.
Klutzy-Snow8016@reddit
I like Qwen 3 VL 32B Instruct for rewriting prompts in ComfyUI. The 30B-A3B and 8B variants also work, but the 32B is a little better.
LightBrightLeftRight@reddit
I’ve just started with ComfyUI, what do you mean you use it to rewrite the prompts?
Klutzy-Snow8016@reddit
Where the prompt normally goes, I basically just pipe in the output of an LLM node, and I give the LLM node my original prompt plus an instruction to rewrite it for creative image / video generation.
It seems to work well with newer models like Qwen Image, where the model creator recommends rewriting prompts with an LLM. In their GitHub repo, they have the system prompt they use for this task.
LightBrightLeftRight@reddit
Thanks so much! I'll try this out
Klutzy-Snow8016@reddit
Also, make sure you scroll to the bottom of the model card and apply the recommended sampler settings, not the ones in generation_config.json. The 32B Thinking model in particular goes absolutely insane in chat usage if you don't set the recommended presence_penalty.
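In practice that means explicitly overriding whatever defaults your server picked up from generation_config.json. A minimal sketch (the numeric values below are placeholders for illustration; take the real ones from the bottom of the Qwen3-VL model card):

```python
# Hypothetical: force card-recommended sampler values over whatever
# defaults came from generation_config.json. Numbers are placeholders.
CARD_RECOMMENDED = {
    "temperature": 0.7,        # placeholder, check the model card
    "top_p": 0.8,              # placeholder, check the model card
    "presence_penalty": 1.5,   # the setting the comment above warns about
}

def apply_card_settings(payload, overrides=CARD_RECOMMENDED):
    # Card settings win over any defaults already in the payload.
    merged = dict(payload)
    merged.update(overrides)
    return merged

payload = {"model": "qwen3-vl-32b-thinking", "temperature": 1.0}
print(apply_card_settings(payload))
```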
ViratBodybuilder@reddit
As I'm in the Medical field, I use MedGemma 27b and 4b. 4b performance is incredible on my M4 Pro 64 GiB.
JLeonsarmiento@reddit
I’m still using Mistral-Small 3.2 and Magistral 3.1 for VLM instruct and thinking tasks with dense models.
Qwen3-VL-30b-a3b-thinking is the king of local, isn’t it?
KvAk_AKPlaysYT@reddit
Qwen 3 VL!!
Medium_Chemist_4032@reddit
qwen3-vl, the thinking variant: both the 32B and 8B excel at reading screenshots, which is most of what I use VLMs for. They also format the answer nicely.
Betadoggo_@reddit
I use qwen-vl-30B-thinking with ik_llamacpp as the backend and openwebui as the frontend. I use it because it's fast (~15 t/s) on my 2060 6GB with DDR4-2900 and is capable enough for most of my image-input needs. I use it for general vision tasks (text extraction, translation, image description, etc.). It's also generally uncensored; it has no problem describing or explaining images that models like Gemma refuse to.
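For reference, those image-input tasks all go through the same request shape: the image is embedded as a base64 data URL alongside the text question in one user message. A hedged sketch assuming an OpenAI-compatible server like the one openwebui talks to (endpoint and model name are assumptions):

```python
# Hypothetical sketch: building a vision request for a local qwen3-vl
# server over the OpenAI-compatible chat API. Model name is an assumption.
import base64

def build_vision_request(image_bytes, question, model="qwen3-vl-30b-thinking"):
    # Image travels inline as a base64 data URL next to the text part.
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = build_vision_request(b"fake-png-bytes", "Extract all text in this image.")
print(req["messages"][0]["content"][0]["text"])
```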