AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 22 comments
We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal performance at a fraction of the serving cost. It also brings major improvements in long-context and high-resolution understanding, visual reasoning with active image analysis, and information-dense document comprehension.
Key Features
- MoE Architecture: Superior Performance with Low Serving Cost. The LLM backbone has been upgraded to a Mixture-of-Experts (MoE) architecture. This allows Ovis2.6 to scale up to 80B total parameters, capturing vast amounts of knowledge and nuance. Crucially, it achieves this with only ~3B active parameters during inference, ensuring low serving costs and high throughput.
- Enhanced Long-Sequence and High-Resolution Processing. Ovis2.6 extends the context window to 64K tokens and supports image resolutions up to 2880×2880, significantly improving its ability to process high-resolution and information-dense visual inputs. These enhancements are particularly effective for long-document question answering, where the model must gather and synthesize clues scattered across multiple pages to derive the correct answer.
- Think with Image. We introduce the "Think with Image" capability, which transforms vision from a passive input into an active cognitive workspace. During reasoning, the model can actively invoke visual tools (e.g., cropping and rotation) to re-examine and analyze image regions within its Chain-of-Thought, enabling multi-turn, self-reflective reasoning over visual inputs for higher accuracy on complex tasks.
- Reinforced OCR, Document, and Chart Capabilities. Continuing our focus on information-dense visual tasks, we have further reinforced the model's capabilities in Optical Character Recognition (OCR), document understanding, and chart/diagram analysis. Ovis2.6 excels not only at accurately extracting structured information from visual data, but also at reasoning over the extracted content.
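As a rough sketch of what running this looks like locally (the loading path mirrors earlier Ovis releases, which ship custom modeling code behind trust_remote_code; the message schema and the chat call below are assumptions, not this repo's confirmed API):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Ovis releases ship custom modeling code, hence trust_remote_code.
# MoE trade-off to keep in mind: only ~3B parameters are active per
# token (compute), but all 80B still have to be resident in memory.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.6-80B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs / offload to CPU
    trust_remote_code=True,
)

# ASSUMED message schema, modeled on earlier Ovis cards; the actual
# format for this release may differ.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("page.png")},
        {"type": "text", "text": "Summarize the key figures on this page."},
    ],
}]

# Hypothetical call: earlier Ovis custom code exposes a chat-style
# method; check the model card for the exact generation interface.
# response = model.chat(messages=messages, max_new_tokens=1024)
```

One design point worth stressing: the ~3B active parameters cut per-token compute, not memory; all 80B weights still need to fit somewhere, which is what the RAM question further down is about.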
Previously they released Marco-Mini-Instruct, Marco-Nano-Instruct, Marco-DeepResearch-8B, Ovis2.6-30B-A3B, etc.
pmttyji@reddit (OP)
[image: benchmark table with footnote]
IrisColt@reddit
It can still hold its ground against other models from its era, but... that era was a year ago.
lakySK@reddit
This table gives me a headache. Just stick with bold for best…
mfarmemo@reddit
Agreed. Was confused until I read the footer note. My personal rule is: if a visual needs to be explained, it is the wrong visual.
pmttyji@reddit (OP)
[image]
IrisColt@reddit
Is the performance of the 30B and the 80B similar, or what?
Craftkorb@reddit
Qwen3-VL is severely outdated, so can I assume that it would fare badly against Qwen3.6?
silenceimpaired@reddit
Everyone seems to think this is based off Qwen… do we know that?
Craftkorb@reddit
Didn't say that, just that their graphic compares it against an old model
MaxKruse96@reddit
https://huggingface.co/AIDC-AI/Ovis2.6-80B-A3B/blob/main/config.json
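For anyone not clicking through, a minimal sketch of fetching that config and printing the fields that identify the LLM backbone (uses huggingface_hub; exactly which keys exist, e.g. architectures, model_type, or a nested llm_config, depends on how the repo structures its config):

```python
import json

from huggingface_hub import hf_hub_download

# Download only config.json from the repo, not the weights.
path = hf_hub_download(
    repo_id="AIDC-AI/Ovis2.6-80B-A3B",
    filename="config.json",
)

with open(path) as f:
    config = json.load(f)

# Print whichever backbone-identifying fields are present.
for key in ("architectures", "model_type", "llm_config"):
    if key in config:
        print(key, "->", config[key])
```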
silenceimpaired@reddit
Fair enough :)
Mountain_Patience231@reddit
How could a 64K-token model be effective for long-document question answering?
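Back-of-envelope, with assumed per-page visual token costs (the card doesn't state the vision tokenizer's compression rate, so these are placeholder numbers):

```python
# How many document pages plausibly fit in a 64K context?
# ASSUMPTION: a rendered page costs roughly 1,000-2,500 visual tokens
# after the vision encoder's downsampling; the real cost depends on
# resolution and the model's patch/merge settings.
CONTEXT = 64 * 1024
RESERVED = 4 * 1024  # assumed budget for the prompt text and the answer

for tokens_per_page in (1000, 1500, 2500):
    pages = (CONTEXT - RESERVED) // tokens_per_page
    print(f"~{tokens_per_page} tokens/page -> fits ~{pages} pages")
```

So 64K plausibly covers a few dozen pages, which is enough for many long-document QA setups, even if it's tight next to 128K+ models.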
Important_Quote_1180@reddit
The context size is really tight for a reasoning model to be competitive.
Finanzamt_Endgegner@reddit
Well, it's supposed to be a vision model, not necessarily a reasoning and coding model
seamonn@reddit
Context is always good even if not coding
PhoneOk7721@reddit
Worse than Qwen3.6 35B A3B in vision, it looks like.
tamerlanOne@reddit
Since it's relatively heavy on RAM, will MTP be implemented?
coolnq@reddit
There's still no implementation in llama.cpp. There's no point in using it if resources are limited.
Finanzamt_Endgegner@reddit
Shouldn't be too hard to add support since it's based on qwen next
seamonn@reddit
Regardless of how capable it is compared to frontier open-source LLMs, this is still pretty cool, especially for vision
Own_Suspect5343@reddit
Only 64k context?
MaxKruse96@reddit
Qwen3-next-reasoning with vision it seems