Hunyuan releases X-Omni, a unified discrete autoregressive model for both image and language modalities

Posted by ResearchCrafty1804@reddit | LocalLLaMA | 7 comments

🚀 We're excited to share our latest research on X-Omni: reinforcement learning makes discrete autoregressive image generative models great again, empowering a practical unified model for both image and language modality generation.

Highlights:
✅ Unified Modeling Approach: A discrete autoregressive model handling image and language modalities.
✅ Superior Instruction Following: Exceptional capability to follow complex instructions.
✅ Superior Text Rendering: Accurately render text in multiple languages, including both English and Chinese.
✅ Arbitrary resolutions: Produces aesthetically pleasing images at arbitrary resolutions.

Insight:
🔍 During the reinforcement learning process, the aesthetic quality of generated images is gradually enhanced, and the ability to adhere to instructions and the capacity to render long texts improve steadily.

Paper: https://arxiv.org/pdf/2507.22058
Github: https://github.com/X-Omni-Team/X-Omni
Project Page: https://x-omni-team.github.io/


FrostAutomaton@reddit

The best autoregressive model I've seen so far. Though, as one would expect, the generated image quality isn't yet at the level of the best open-weights diffusion models. It does seem to handle text remarkably well though: https://preview.redd.it/37kmiq3jwdgf1.png?width=1152&format=png&auto=webp&s=7f465afa07b299599b0df0c458e64845c0f7c44c

Compare with Flux1-dev (the base diffusion model the solution uses), given the exact same prompt: https://preview.redd.it/jxmyi0znwdgf1.png?width=1024&format=png&auto=webp&s=d4278af93de3a20d6d1e85fab82c87a61d73b594

ninjasaid13@reddit

what was the prompt?

Granted, this is the default prompt in their example use-case, so I'd imagine it's the task they found the model performs best at:

A formal letter document with a professional tone. Create a document that includes a section starting with "To, Mr. Edward Robertson," aligned to the left. Underneath, place the date "Date: 27th July 2025" also aligned to the left. Begin the body of the letter with "Dear Sir," indented slightly from the left margin. The first paragraph should state, "I am writing to you with intent of purchasing your property located at #765, Lincoln Street, New York." The second paragraph should read, "I want to propose a purchase price of $100,000 for your property. I am willing to pay you $20,000 as advance." The closing remarks should be, "Kindly let me know what do you think of the offer and we can make a few changes as per your requirements." followed by "Regards," and then "William Specter". Finally, add a logo with a feather graphic in the bottom right corner.

I think X-Omni should be trained on more text-dense images like comic books.

Neither-Phone-7264@reddit

i hope it doesn't die like the other multimodal models because it takes too long to implement in llama.cpp or people just don't care to implement it

kkb294@reddit

Please include the Hugging Face model link in your post/comments. I thought only the inference code and paper had been released, not the model. Once I went to the GitHub repo I saw the Hugging Face link, and from there I could access the model. [Hugging Face](https://huggingface.co/collections/X-Omni/x-omni-models-6888aadcc54baad7997d7982)
