AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 22 comments
We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.
Key Features
- MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
- Enhanced Long-Sequence and High-Resolution Processing. Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
- Think with Image. We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
- Reinforced OCR, Document, and Chart Capabilities. Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
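As a rough sketch of what running this looks like locally (the loading path mirrors earlier Ovis releases, which ship custom modeling code behind trust_remote_code; the message schema and the chat call below are assumptions, not this repo's confirmed API):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Ovis releases ship custom modeling code, hence trust_remote_code.
# MoE trade-off to keep in mind: only ~3B parameters are active per
# token (compute), but all 80B still have to be resident in memory.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.6-80B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs / offload to CPU
    trust_remote_code=True,
)

# ASSUMED message schema, modeled on earlier Ovis cards; the actual
# format for this release may differ.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("page.png")},
        {"type": "text", "text": "Summarize the key figures on this page."},
    ],
}]

# Hypothetical call: earlier Ovis custom code exposes a chat-style
# method; check the model card for the exact generation interface.
# response = model.chat(messages=messages, max_new_tokens=1024)
```

One design point worth stressing: the ~3B active parameters cut per-token compute, not memory; all 80B weights still need to fit somewhere, which is what the RAM question further down is about.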
Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, etc.
pmttyji@reddit (OP)
[image: benchmark table with footnote]
IrisColt@reddit
It can still hold its ground against other models from its era, but... that era was a year ago.
lakySK@reddit
This table gives me a headache. Just stick with bold for best…
mfarmemo@reddit
Agreed. Was confused until I read the footer note. My personal rule is: if a visual needs to be explained, it is the wrong visual.
pmttyji@reddit (OP)
[image]
IrisColt@reddit
Is the performance of the 30B and the 80B similar, or what?
Craftkorb@reddit
Qwen3-VL is severely outdated, so can I assume that it would fare badly against Qwen3.6?
silenceimpaired@reddit
Everyone seems to think this is based off Qwen… do we know that?
Craftkorb@reddit
Didn't say that, just that their graphic compares it against an old model
MaxKruse96@reddit
https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B/blob/main/config.json
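For anyone not clicking through, a minimal sketch of fetching that config and printing the fields that identify the LLM backbone (uses huggingface_hub; exactly which keys exist, e.g. architectures, model_type, or a nested llm_config, depends on how the repo structures its config):

```python
import json

from huggingface_hub import hf_hub_download

# Download only config.json from the repo, not the weights.
path = hf_hub_download(
    repo_id="AIDC-AI/Ovis2.6-80B-A3B",
    filename="config.json",
)

with open(path) as f:
    config = json.load(f)

# Print whichever backbone-identifying fields are present.
for key in ("architectures", "model_type", "llm_config"):
    if key in config:
        print(key, "->", config[key])
```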
silenceimpaired@reddit
Fair enough :)
Mountain_Patience231@reddit
How could a 64K-token model be effective for long-document question answering?
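Back-of-envelope, with assumed per-page visual token costs (the card doesn't state the vision tokenizer's compression rate, so these are placeholder numbers):

```python
# How many document pages plausibly fit in a 64K context?
# ASSUMPTION: a rendered page costs roughly 1,000-2,500 visual tokens
# after the vision encoder's downsampling; the real cost depends on
# resolution and the model's patch/merge settings.
CONTEXT = 64 * 1024
RESERVED = 4 * 1024  # assumed budget for the prompt text and the answer

for tokens_per_page in (1000, 1500, 2500):
    pages = (CONTEXT - RESERVED) // tokens_per_page
    print(f"~{tokens_per_page} tokens/page -> fits ~{pages} pages")
```

So 64K plausibly covers a few dozen pages, which is enough for many long-document QA setups, even if it's tight next to 128K+ models.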
Important_Quote_1180@reddit
The context size is really tight for a reasoning model to be competitive.
Finanzamt_Endgegner@reddit
Well, it's supposed to be a vision model, not necessarily a reasoning and coding model
seamonn@reddit
Context is always good even if not coding
PhoneOk7721@reddit
Worse than Qwen3.6 35B A3B in vision, it looks like.
tamerlanOne@reddit
Since it's relatively heavy on RAM, will MTP be implemented?
coolnq@reddit
There's still no implementation in llama.cpp. There's no point in using it if resources are limited.
Finanzamt_Endgegner@reddit
Shouldn't be too hard to add support since it's based on qwen next
seamonn@reddit
Regardless of how capable it is compared to frontier open-source LLMs, this is still pretty cool, especially for vision
Own_Suspect5343@reddit
Only 64k context?
MaxKruse96@reddit
Qwen3-next-reasoning with vision it seems