Qwen-Image — a 20B MMDiT model

Posted by Xhehab_@reddit | LocalLLaMA | 24 comments

🚀 Meet Qwen-Image, a 20B MMDiT model for next-gen text-to-image generation. Especially strong at creating stunning graphic posters with native text. Now open-source.

🔍 Key Highlights:

🔹 SOTA text rendering: rivals GPT-4o in English, best-in-class for Chinese

🔹 In-pixel text generation: no overlays, fully integrated

🔹 Bilingual support, diverse fonts, complex layouts

🎨 Also excels at general image generation, from photorealistic to anime, impressionist to minimalist. A true creative powerhouse.

[-]

ilintar@reddit

GGUFs are up! https://huggingface.co/city96/Qwen-Image-gguf Run with ComfyUI-GGUF.

[-]

Shivacious@reddit

https://preview.redd.it/rt728lg9f1hf1.png?width=1630&format=png&auto=webp&s=baab3b4de52787d05ac230d6dc2817ce527261ff tried running it

[-]

Rich_Artist_8327@reddit

how do you run it?

[-]

Shivacious@reddit

Used the diffusers library, kept the model loaded in GPU memory, and served it with FastAPI + httpx.
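For anyone asking how that setup might look, here is a minimal sketch, not the commenter's actual code. It assumes the weights live at the Hugging Face repo id `Qwen/Qwen-Image` (not stated in the thread) and load through the generic `DiffusionPipeline` class. The `/generate` route, the `build_payload` helper, and the default step count and image size are all hypothetical choices for illustration.

```python
"""Sketch: keep a diffusers pipeline resident on the GPU behind a small FastAPI app."""
import io
from functools import lru_cache

MODEL_ID = "Qwen/Qwen-Image"  # assumed repo id, not stated in the thread


@lru_cache(maxsize=1)
def get_pipeline():
    """Load the pipeline once; later calls reuse the copy already on the GPU."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    return pipe.to("cuda")


def build_payload(prompt, steps=50, width=1024, height=1024):
    """Hypothetical JSON body shared by the client and the server model below."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}


def create_app():
    """Build the FastAPI app; imports are local so this file loads without the deps."""
    from fastapi import FastAPI, Response
    from pydantic import BaseModel

    class GenerateRequest(BaseModel):
        prompt: str
        steps: int = 50
        width: int = 1024
        height: int = 1024

    app = FastAPI()

    @app.post("/generate")
    def generate(req: GenerateRequest):
        # Run one text-to-image pass on the cached pipeline.
        image = get_pipeline()(
            prompt=req.prompt,
            num_inference_steps=req.steps,
            width=req.width,
            height=req.height,
        ).images[0]
        buf = io.BytesIO()
        image.save(buf, format="PNG")  # diffusers returns PIL images by default
        return Response(content=buf.getvalue(), media_type="image/png")

    return app


def request_image(prompt, url="http://127.0.0.1:8000/generate"):
    """Client side: POST a prompt with httpx and return the raw PNG bytes."""
    import httpx

    resp = httpx.post(url, json=build_payload(prompt), timeout=600.0)
    resp.raise_for_status()
    return resp.content
```

If this were saved as `server.py` (a hypothetical file name), it could be started with `uvicorn server:create_app --factory`. The model is loaded on the first request and stays in VRAM afterwards, so only that first call pays the load time.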

[-]

Capable-Ad-7494@reddit

What dashboard is that, if you don't mind me asking?

[-]

NickCanCode@reddit

Wow. 56GB VRAM used! That's too much. I will wait for an optimized version.

[-]

Shivacious@reddit

1.5t a second too.

[-]

Temporary_Exam_3620@reddit

All cool and good, but is there any way companies can scale their image generation models in a way that's VRAM-affordable and not entirely reliant on Nvidia? For instance, by supporting llama.cpp instead of going straight to Hugging Face/PyTorch, so we get image generation on Vulkan? As of today, companies are happy to innovate by making image gen models bigger, which brings results. But there's an absurd number of people still relying on SDXL, which by today's standards is already a relic. China, do your thing and make a cheap flux-schnell-level model that fits in 6 GB of VRAM and has image editing!

[-]

taimusrs@reddit

FWIW PyTorch supports Intel Arc lmao. A couple of Arc B580s are not that expensive, relatively speaking. Or, if it's even possible, allocate 32GB of RAM to your Intel iGPU.
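A rough sketch of what non-Nvidia support looks like from the PyTorch side: probe for whichever backend the installed build exposes. The `pick_device` helper is a hypothetical name. Intel GPUs appear under `torch.xpu` only in newer PyTorch builds, and AMD ROCm builds report themselves through the `cuda` device name. Whether Qwen-Image actually runs well on those backends is not something this snippet checks.

```python
def pick_device() -> str:
    """Return the first available PyTorch backend name, falling back to "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe

    # NVIDIA CUDA builds and AMD ROCm builds both answer through torch.cuda.
    if torch.cuda.is_available():
        return "cuda"

    # Intel GPUs (Arc and iGPUs); torch.xpu is missing on older builds.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    return "cpu"


print(pick_device())
```

The returned string can be passed to a pipeline's `.to(...)` call in place of a hard-coded `"cuda"`.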

[-]

Psychological-Sale-3@reddit

Intel Arc B60 48GB: $1,300 USD

[-]

Weltleere@reddit

Right. They mostly prioritize achieving the best possible quality regardless of model size, unfortunately. It would be much better if they made continuous improvements within each parameter class - similar to how language models evolve with better training techniques, data, and architectures at consistent sizes - rather than just scaling up endlessly.

[-]

Rich_Artist_8327@reddit

How can I run this? Is a 5090 enough? vLLM? Does this work with ROCm and vLLM using two 7900 XTXs?

[-]

ihaag@reddit

Bring on image to image gen

[-]

Agreeable_Cat602@reddit

Too bad you need $100k equipment to run it - I mean - who is this really for?

[-]

Any_Pressure4251@reddit

Now you do; in a couple of days you will not.

[-]

Agreeable_Cat602@reddit

I f@cking love it when people predict my lottery winnings

[-]

momentcurve@reddit

In a couple of days there will be quantized versions available that will fit on consumer GPUs.
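Some back-of-the-envelope arithmetic on why quantization helps, as a sketch only. It takes the "20B" from the announcement as a nominal parameter count and counts nothing but the diffusion model's weights. The text encoder, VAE, and activations come on top, so real usage will be higher than these numbers. The bits-per-weight values are those of the standard GGUF `Q8_0` and `Q4_0` block formats; `weight_gb` is a hypothetical helper name.

```python
# Bits stored per weight, including per-block scales for the GGUF formats.
BITS_PER_WEIGHT = {
    "bf16": 16.0,  # unquantized half-precision weights
    "Q8_0": 8.5,   # 32 int8 weights + one fp16 scale per block
    "Q4_0": 4.5,   # 32 4-bit weights + one fp16 scale per block
}


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 20e9  # nominal "20B" from the announcement

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_gb(N_PARAMS, bpw):.2f} GB")
```

This prints 40.00 GB for bf16, 21.25 GB for Q8_0, and 11.25 GB for Q4_0. The bf16 figure is in line with the 56GB reported above once the other components are added, and the 4-bit figure is what puts the model within reach of consumer cards.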

[-]

Equivalent-Word-7691@reddit

Is it only available through the API?😐

[-]

stddealer@reddit

Apache 2.0 open weights

[-]

jferments@reddit

No, it's a free, open-weight model.

[-]

shokuninstudio@reddit

I don't think it's available to test or use on [chat.qwen.ai](http://chat.qwen.ai) yet. The image generator they have is labelled '**235B-A22B**'. It doesn't produce Japanese text that makes sense. It's a random mix of real and made-up characters... https://preview.redd.it/tks25n8951hf1.png?width=1280&format=png&auto=webp&s=09742b4052cdfa25eb52bc730e3ba342c2dab036

[-]

ilintar@reddit

GGUF when? (for ComfyUI-GGUF obviously)

[-]

Xhehab_@reddit (OP)

Benchmarks 🔥 https://preview.redd.it/xgqzksza11hf1.png?width=3036&format=png&auto=webp&s=b3480217cc5a15c83ae9d4b4461ce71741a50e9e

[-]

MrWeirdoFace@reddit

Cool.