LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

Posted by isr_431@reddit | LocalLLaMA | 8 comments

The team behind LLaVA has released three new multimodal models, based on [LLaMA3 8B](https://huggingface.co/lmms-lab/llama3-llava-next-8b) and Qwen-1.5 [72B](https://huggingface.co/lmms-lab/llava-next-72b) and [110B](https://huggingface.co/lmms-lab/llava-next-110b). From the [blog post](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/):

> **Today, we expanded the LLaVA-NeXT with recent stronger open LLMs**, reporting our findings on more capable language models:
>
> **Increasing multimodal capabilities with stronger & larger language models, up to 3x model size.** This allows LMMs to present better visual world knowledge and logical reasoning inherited from LLM. It supports LLaMA3 (8B) and Qwen-1.5 (72B and 110B).
>
> **Better visual chat for more real-life scenarios, covering different applications.** To evaluate the improved multimodal capabilities in the wild, we collect and develop new evaluation datasets, [LLaVA-Bench (Wilder)](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/#3-llava-bench-wilder), which inherit the spirit of [LLaVA-Bench (in-the-wild)](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_Bench.md) to study daily-life visual chat and enlarge the data size for comprehensive evaluation.
