Hunyuan releases X-Omni, a unified discrete autoregressive model for both image and language modalities

Posted by ResearchCrafty1804@reddit | LocalLLaMA

🚀 We're excited to share our latest research on X-Omni: reinforcement learning makes discrete autoregressive image generative models great again, empowering a practical unified model for both image and language modality generation.

Highlights:

✅ Unified Modeling Approach: A discrete autoregressive model handling image and language modalities.
✅ Superior Instruction Following: Exceptional capability to follow complex instructions.
✅ Superior Text Rendering: Accurately renders text in multiple languages, including both English and Chinese.
✅ Arbitrary Resolutions: Produces aesthetically pleasing images at arbitrary resolutions.

Insight:

🔍 During the reinforcement learning process, the aesthetic quality of generated images is gradually enhanced, and the ability to adhere to instructions and the capacity to render long texts improve steadily.

Paper: https://arxiv.org/pdf/2507.22058
GitHub: https://github.com/X-Omni-Team/X-Omni
Project Page: https://x-omni-team.github.io/