
Lumina-mGPT 2.0: Stand-alone Autoregressive Image Modeling | Completely open source under Apache 2.0

Posted by umarmnaq@reddit | LocalLLaMA | 93 comments

[https://github.com/Alpha-VLLM/Lumina-mGPT-2.0](https://github.com/Alpha-VLLM/Lumina-mGPT-2.0) [https://huggingface.co/Alpha-VLLM/Lumina-mGPT-2.0](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-2.0) [https://huggingface.co/spaces/Alpha-VLLM/Lumina-Image-2.0](https://huggingface.co/spaces/Alpha-VLLM/Lumina-Image-2.0)


93 Comments

4hometnumberonefan@reddit

Why are autoregressive image models coming up after diffusion? GPT-4o image gen seems to be autoregressive, now this. Fascinating.

stduhpf@reddit

DALL-E 1 was autoregressive, and it sucked. Diffusion models run faster and typically have better image quality, though it looks like the modern autoregressive generators are catching up fast in terms of image quality.

Lissanro@reddit

Looks interesting, but I cannot try it yet due to lack of multi-GPU support: [https://github.com/Alpha-VLLM/Lumina-mGPT-2.0/issues/1](https://github.com/Alpha-VLLM/Lumina-mGPT-2.0/issues/1) - but it sounds like it is coming. With quantization, according to their GitHub, it fits into just 33.8 GB, so a pair of 3090 cards could potentially run it.

plankalkul-z1@reddit

> lack of Multi-GPU support: https://github.com/Alpha-VLLM/Lumina-mGPT-2.0/issues/1

That issue is now marked as "completed", but no word from the devs as to whether they actually did something, or just pressed the wrong button while closing the issue...

FrostAutomaton@reddit

Very cool! Getting the repo up and running was fairly straightforward, though the requirements in terms of both VRAM and time are rough, to put it mildly. I'm not entirely convinced this model has a niche when compared to the best open diffusion models yet, based on the image quality I get. It doesn't seem to handle text or prompt fidelity better than the open source SotA, but it's a step in the right direction.

plankalkul-z1@reddit

Did you manage to run it (that is, actually generate images)? If so, on what HW?

Memory requirements are a bit confusing, to say the least... Not only is there that GitHub issue about lack of support for multi-GPU inference, but I cannot fathom what a 7B model (plus another 200+MB one) is even doing with 80GB of VRAM. The dev's reply under that issue isn't very helpful either:

> We have contacted huggingface and will launch Lumina-mGPT 2.0 soon.

That was in response to a suggestion to ask Huggingface for help with multi-GPU inference (?). Besides, they've launched "Lumina-mGPT 2.0" already... So what does that quote even mean?!

I always liked what Lumina was doing (for me, personally, following the prompt is more important than pixel-perfect quality), but I'd say this release is a bit... messy.

FrostAutomaton@reddit

Yes, I've generated images with the model. I have access to an H100 so I could deploy it on a single GPU

AD7GD@reddit

Main requirement for following their setup instructions is to use python 3.10, because it calls for specific wheels built for 3.10. It's not clear how memory usage works. Their sample generation worked in 48G. It doesn't allocate it all immediately (still >24G, though) but it eventually uses all VRAM. Although it's not clear what the rules are, I was pleasantly surprised that it didn't just randomly run out of memory partway through.

maz_net_au@reddit

It looks like there's a hard requirement for flash attention 2, which means it doesn't run on Turing or earlier gen cards (i.e. the two RTX 8000's I have can't be used despite having 48gb of ram each)?

TemperFugit@reddit

Is it really a 7B model that uses 80GB VRAM? Or am I missing something?

FrostAutomaton@reddit

It does look like it. The model download is roughly the size of a non-quanted 7b model. I don't entirely understand why it is as memory intensive as it is.

Lifeisshort555@reddit

Has anyone made an autoregressive model that guides a diffusion model, rather than trying to have the autoregressive model draw the entire thing?

FallUpJV@reddit

Can someone explain/throw in a paper that explains why a 7B model needs 80GB RAM, since the autoregressive thing means it just generates tokens? Maybe I got that wrong

Dr_Karminski@reddit

I tried it out, and the performance was good, but the text generation doesn't seem very good. The prompt was: 'Generate a catgirl with pink hair, wearing black glasses, with a smile on her face, and wearing a black JK uniform. Her left hand is making an adjusting-glasses gesture, and her right hand is holding a book with the cover reading "Advanced Programming in the Unix Environment."' https://preview.redd.it/62qhnmlonwse1.jpeg?width=1024&format=pjpg&auto=webp&s=84f0d383a23516d061bb2c16259521c2045904b3

KefkaFollower@reddit

Her left hand looks weird. Not understanding how hands work is a common problem with image generation. At least for models that fit in consumer-grade hardware.

internal-pagal@reddit

Oh, the irony is just dripping, isn't it? LLMs are now flirting with diffusion techniques, while image generators are cozying up to autoregressive methods. It's like everyone's having an identity crisis.

Healthy-Nebula-3603@reddit

And it seems autoregressive even works better for pictures...

deadlydogfart@reddit

I suspect the better performance probably has more to do with the size of the model and multi-modality. We've seen in papers that cross-modal learning has a remarkable impact.

Iory1998@reddit

But the size is 7B. For comparison, Flux.1 is 12B!

deadlydogfart@reddit

I didn't realize, but I'm not surprised. My bet is it's the multi-modality. They can build better world models by learning not just from images, but text that describes how it works.

ron_krugman@reddit

Arguably the best image generation model (4o) uses the autoregressive method. On the other hand, I haven't seen any evidence that diffusion-based LLMs are able to produce higher quality outputs than transformer-based LLMs. They're usually advertised mostly for their generation speed. My hunch is that diffusion in general may be more resource efficient for consumer-grade hardware (in terms of generation time and VRAM requirements) but doesn't scale well beyond a certain point, while transformers are more resource intensive but scale better given sufficiently powerful hardware.

Healthy-Nebula-3603@reddit

That's quite a good assumption. As I understand from what I read: autoregressive picture models need more compute power, not more VRAM, and that's why diffusion models were used so far. Even the newest Imagen from Google or MJ 7 are not even close to what GPT-4o does autoregressively. In theory we could use an autoregressive model of size 32B at Q4_K_M with an RTX 3090 :).
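
For what it's worth, the 32B-on-a-3090 idea can be sanity-checked with some quick arithmetic. This is only a sketch: the ~4.85 bits per weight figure for Q4_K_M is an approximate value assumed here (it varies by model), and it counts weights only, not the KV cache or activations, against the 3090's 24 GB.

```python
# Approximate weight-only memory for a quantized model.
# ASSUMPTION: Q4_K_M averages roughly 4.85 bits per weight; the exact
# figure varies by model, and KV cache / activations are NOT included.


def weight_gb(params_billion, bits_per_weight):
    """Weight size in GB (10**9 bytes) for a given parameter count."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    print(f"32B @ ~4.85 bits/weight: {weight_gb(32, 4.85):.1f} GB")
    print(f"7B  @ 16 bits/weight:    {weight_gb(7, 16):.1f} GB")
```

Under that assumption the weights alone would come in under 24 GB, with whatever is left over having to cover the cache and activations.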

ron_krugman@reddit

GPT-4o is just a single transformer model with presumably hundreds of billions of parameters that does text, audio, and images natively, right? What I'm not sure about is if you actually need that many parameters to generate images at that level of quality or if a smaller model (e.g. 70B) with less world knowledge that's more focused on image generation could perform at a similar or better level. I for one will be strongly considering the RTX PRO 6000 Blackwell once it's released... 👀

Smile_Clown@reddit

Maybe AGI is just those two together plus whatever comes next...

hapliniste@reddit

This comment has the quirky LLM vibe all over it

Commercial-Chest-992@reddit

It’s especially weird when it’s sort of one's own default writing style that LLMs have claimed for their own.

MerePotato@reddit

Seems you've recognised that LLMs are artificial redditors

Randommaggy@reddit

It's among the better data sources for relatively civilized written communication that was sorted by subject and relatively easy to get a hold of up to a certain point in time. I'm not surprised if it's heavily over-represented in the commonly used training sets.

IrisColt@reddit

Yeah, busted!

Everlier@reddit

Feels like a Sonnet-style joke

ahmcode@reddit

🤭

plankalkul-z1@reddit

Great stuff; I especially appreciate the license. Is it for 768x768 images only though?..

AD7GD@reddit

I ran the test generate script according to the readme, and it did 768x768. I tried 1024x576 (same px count) and it also worked.
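
To double-check the "same px count" bit, here is a quick sketch. The pixel counts are plain arithmetic; the token counts assume a hypothetical image tokenizer that emits one token per 16x16 patch, which is a guess on my part and not something taken from the repo.

```python
# Pixel and (assumed) token counts for the resolutions mentioned in the thread.
# ASSUMPTION: a hypothetical image tokenizer that emits one token per
# 16x16 pixel patch; the real tokenizer's downsampling factor may differ.

PATCH = 16  # hypothetical downsampling factor


def pixel_count(width, height):
    """Total number of pixels in a width x height image."""
    return width * height


def token_count(width, height, patch=PATCH):
    """Number of patch tokens under the assumed patch size."""
    # Both sides must be divisible by the patch size in this simple model.
    if width % patch or height % patch:
        raise ValueError("width and height must be multiples of the patch size")
    return (width // patch) * (height // patch)


if __name__ == "__main__":
    for w, h in [(768, 768), (1024, 576), (1024, 1024)]:
        print(f"{w}x{h}: {pixel_count(w, h)} px, {token_count(w, h)} tokens")
```

If the sequence length is what matters, 768x768 and 1024x576 would cost the same number of tokens in this model, while 1024x1024 would need noticeably more.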

plankalkul-z1@reddit

Thanks for letting me know.

IrisColt@reddit

The demo generates 1024x1024 images.

plankalkul-z1@reddit

Good to know, thanks. My question stems from the fact that the link from Github page to Huggingface model page is named "7B_768px". The command line example there is also for 768x768. Would be nice to get some official info on size limitations.

IrisColt@reddit

Thanks! I just noticed it too. I was assuming that they use `--width 1024 --height 1024`, but now I am not so sure.

Right-Law1817@reddit

Is there any advantage using this over diffusion models?

AD7GD@reddit

You can ask for more specific elements arranged in particular ways instead of just saying "all of these elements are in the picture"

lothariusdark@reddit

Well, models like these have far more "world-knowledge", which means they know more stuff and how it works, so they can infer a lot of information from even short prompts. This makes them more versatile and easier to steer without huge and detailed prompts while still having good coherence. However, they lack in final quality: while they are accurate and will produce good images, the best sample quality can currently only be achieved with diffusion models. They are also large as fuck and slow to generate, scaling worse than diffusion models with resolution, so they get even slower at larger images. They aren't really feasible for consumer hardware, as even Flux looks tiny by comparison.

Right-Law1817@reddit

So it's more about versatility and understanding prompts better, while diffusion models still win in terms of raw image quality and efficiency. So it seems like a trade-off between coherence and final output quality. Thanks for the input :)

RMCPhoto@reddit

Sounds like they would make sense as the first step in an image pipeline. 

ClassyBukake@reddit

I mean surely the value that it provides in spatial and content awareness could allow you to generate low resolution base images, then upscale with diffusion. ATM the diffusion workflow is a combination of "generate at low resolution until you find something that is 80% there, inpaint until it's very good, upscale using a naive algorithm, then do a second pass of the upscale to add detail / blend the upscaled." In this case it eliminates the first 2 stages, which are easily the most time / energy consuming. Waiting 10 minutes for this to generate vs 40 minutes to generate. That said, there is more space to "discover" with diffusion, as its inherent randomness and its lack of awareness will guide it to make something that might not be coherent, but might be more interesting than the intent of the original prompt.

RMCPhoto@reddit

Many.

* They are compatible with LLM infrastructure, so they can benefit from flash attention.
* They can in theory be faster.
* They can be "smarter".
* They are more likely than not "multimodal" by nature.
* And you get to watch your images load like early 90's porn.

StartupTim@reddit

So as somebody who just uses Ollama and Open WebUI on top of that, how could I go about using this? Very cool by the way!

olliec42069@reddit

What about automatic1111?

Everlier@reddit

Unfortunately, no way with just these two for now.

What you need right now:

* 80 GB VRAM, run in transformers natively
* UI integration - build your own

What's needed for Open WebUI/Ollama:

* Architecture support in Ollama/llama.cpp - biggest problem, image gen is outside of scope for both, highly unlikely
* ComfyUI workflow that runs this model - possible in the near future, but requirements are likely to still be quite high for a long while

I might be very wrong about these, maybe this will be exciting enough for the image gen community to quickly solve these problems

IrisColt@reddit

I was about to ask the same question, but for Forge and ComfyUI

FullOf_Bad_Ideas@reddit

Model is 7B, arch `ChameleonXLLMXForConditionalGeneration`, type `chameleon`, with no GQA, default positional embedding size of 10240, with Qwen2Tokenizer, ChatML prompt format (mention of Qwen and Alibaba Cloud in default system message), 152k vocab, 172k embedding size and max model len of 131K. No vision layers, just LLM. Interesting, right?

TrashPandaSavior@reddit

172k embedding size?

uhuge@reddit

It's not like they've started from the Qwen 7B base, right? I'm not able to quickly check whether Qwen2.5 has GQA, but I'd suppose so.

FullOf_Bad_Ideas@reddit

Qwen 2 and up have GQA. 1.5 and 1.0 don't. They made some frankenstein stuff, I'm eagerly waiting for the technical report here.

Iory1998@reddit

Wait! Isn't this exactly how GPT-4o generates images?

Wild-Masterpiece3762@reddit

That was quick

gamblingapocalypse@reddit

WOWOWOW!!!

Willing_Landscape_61@reddit

Nice! Too bad the recommended VRAM is 80GB and minimum just ABOVE 32 GB.

Fun_Librarian_7699@reddit

Is it possible to load it into RAM like LLMs? Ofc with long computing time

IrisColt@reddit

About to try it.

aphasiative@reddit

been a few hours, how'd this go? (am I goofing off at work today with this, or...?) :)

human358@reddit

A few hours should be enough, he should have gotten a couple of tokens already

Hubbardia@reddit

Good luck, let us know how it goes

Fun_Librarian_7699@reddit

Great, let me know the results

Conscious-Lobster60@reddit

VRAM means *virtual* memory unchain the page file/zRam and use a Petabyte of old tape drives. Tokens are slow but never worry about OOM!

05032-MendicantBias@reddit

If this is a transformer architecture, it should be way easier to split it between VRAM and RAM. I wonder if a 24GB GPU + 64GB of RAM can run it.

AbdelMuhaymin@reddit

Just letting you know that SDXL, Flux Dev, Wan 2.1, Hunyuan, etc. all requested 80GB of VRAM upon launch. They got quantized in seconds.

FotografoVirtual@reddit

SDXL only required 8GB of VRAM at launch. https://preview.redd.it/6jw5g7tq8use1.png?width=792&format=png&auto=webp&s=e23abd4f4e9895c73b89511865f95fb4bf2d4dea

mpasila@reddit

Hunyuan, I think, still needs about 32GB of RAM; it's just that the VRAM can be quite low, so it's not all that good.

Karyo_Ten@reddit

Are those memory-bound like LLMs or compute-bound like LDMs? If the former, Macs are interesting, but if the latter :/ another ploy to force me into an 80~96GB VRAM GPU.

TurbulentStroll@reddit

5.3TB/s is absolutely insane, is there any reason why this shouldn't run at inference speeds ~5x that of a 3090?
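
A rough way to put numbers on that question, assuming decoding is purely bandwidth-bound (every generated token has to read the working set from memory once). The 936 GB/s figure is the RTX 3090's published peak bandwidth; the 14 GB working set is a hypothetical fp16 7B model with no KV cache, picked only for illustration.

```python
# Bandwidth-bound ceiling on decode speed.
# ASSUMPTIONS: 936 GB/s is the RTX 3090's published peak bandwidth; the
# 14 GB working set is a hypothetical fp16 7B model with no KV cache.
# Real throughput is lower: peak bandwidth is never fully achieved.


def max_tokens_per_second(bandwidth_gb_s, working_set_gb):
    """Upper bound on tokens/s if each token reads the working set once."""
    return bandwidth_gb_s / working_set_gb


def speedup_ceiling(fast_gb_s, slow_gb_s):
    """Best-case speedup from bandwidth alone (same working set on both)."""
    return fast_gb_s / slow_gb_s


if __name__ == "__main__":
    fast, rtx3090, working_set = 5300.0, 936.0, 14.0
    print(f"3090 ceiling:    {max_tokens_per_second(rtx3090, working_set):.0f} tok/s")
    print(f"5.3 TB/s ceiling: {max_tokens_per_second(fast, working_set):.0f} tok/s")
    print(f"speedup ceiling:  {speedup_ceiling(fast, rtx3090):.2f}x")
```

So the ~5x guess is about what the bandwidth ratio alone allows; anything compute-bound in the pipeline would pull the real number down.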

FullOf_Bad_Ideas@reddit

this one is memory bound

jonydevidson@reddit

It's gonna be on Replicate soon.

FullOf_Bad_Ideas@reddit

It looks fairly close to a normal LLM, though with a big 131k context length and no GQA. If it's normal MHA, we could apply [SlimAttention](https://arxiv.org/abs/2503.05840) to cut the KV cache in half, plus KV cache quantization to q8 to cut it in half yet again. Then quantize model weights to q8 to shave off a few gigs and I think you should be able to run it on a single 3090.
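
A back-of-the-envelope sketch of that arithmetic. The layer count, head count and head size below are assumptions (typical values for a 7B Llama-style model, not confirmed for this checkpoint), the full 131K context is a worst case rather than what one image necessarily uses, and q8 is treated as exactly one byte per value, ignoring scale overhead.

```python
# Rough KV-cache estimate for a multi-head-attention (no GQA) decoder.
# ASSUMPTIONS: 32 layers, 32 heads, head_dim 128 are typical 7B values,
# not numbers confirmed for Lumina-mGPT 2.0. q8 is modelled as 1 byte
# per value, ignoring the small overhead of quantization scales.

GIB = 2**30


def kv_cache_bytes(tokens, layers=32, heads=32, head_dim=128,
                   bytes_per_value=2, tensors_per_layer=2):
    """Bytes needed to cache attention state for `tokens` positions.

    tensors_per_layer=2 stores both K and V; tensors_per_layer=1 models a
    K-only cache, the halving attributed to SlimAttention above.
    """
    per_token = tensors_per_layer * layers * heads * head_dim * bytes_per_value
    return per_token * tokens


if __name__ == "__main__":
    ctx = 131_072  # the 131K max model length mentioned above
    cases = [
        ("fp16 K+V", kv_cache_bytes(ctx)),
        ("fp16 K only", kv_cache_bytes(ctx, tensors_per_layer=1)),
        ("q8 K only", kv_cache_bytes(ctx, tensors_per_layer=1, bytes_per_value=1)),
    ]
    for label, size in cases:
        print(f"{label}: {size / GIB:.0f} GiB")
```

With these assumed dimensions the cache at full context shrinks by 4x after both halvings, which is where the "single 3090" hope comes from once the weights are quantized too.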

a_beautiful_rhind@reddit

I'm sure it will get quantized. Video generation models started out similar.

slightlyintoout@reddit

Yes, with just over 32gb vram you can generate an image in five minutes. Still cool though!

uhuge@reddit

add two tall babes to the forest's foreground..

yoop001@reddit

Have you tested it? Does it handle text as well as 4o?

Ylsid@reddit

Wow, that was fast

Maleficent_Age1577@reddit

The problem with these big models is that people can't use them locally. Big models we need not; we need really specific models which we can run locally instead of paying $$$$$$ to big corps.

vibjelo@reddit

> Big models we need not

*You* don't need big models, and that's OK, not everything is for everyone. But let's not try to stop anyone from publishing big models. Even if you personally cannot run them today, the research and availability is still important to other entities today, and maybe even you in the future.

Maleficent_Age1577@reddit

I'm just a little bit scared of the way AI seems to go from open source to more consumerism-like. The bigger the models, the less access people have to research and study them. And don't get me wrong, most people would like to use big models, it's just they can't afford the equipment now and probably never will. And in consumerism, the big models available for pay-per-use are not the models released but really restricted versions of those.

vibjelo@reddit

> I'm just a little bit scared of the way AI seems to go from open source to more consumerism-like

I'm very scared of this too, and it is something I'm personally working against, so that open source models will actually be open source. I've already shared some posts at notes.victor.earth which help people get some better information, which sadly I cannot submit to r/localllama as my submissions get deleted after a few seconds :/

But with that said, I think it's very important we don't change the definition of "open source" just because Meta's marketing department feels like it's easier to advertise LLM models that way. How easy or hard something is to run doesn't matter for whether it is open source or not. If the "source" is available to be used for whatever you want, then it's open source. If you cannot, then it isn't. So big models, regardless of how easy/hard it is to run them, are open source if the "source" is available and you can freely re-distribute it without additional terms and conditions. If you cannot, then it isn't open source but maybe open weights, or something else.

> it's just they can't afford the equipment now and probably never will

Maybe I'm optimistic, but if I compare what I thought was possible when I got my first computer sometime around 2000 to what is actually possible today, I could never have expected what we have today. So with that mindset, trying to see 20 years into the future, I think we'll see a lot more changes than we think are possible.

Maleficent_Age1577@reddit

What I would like to see happen is the rise of small but really specific open source models. E.g. if I want a cat, does the model need to be able to generate cars? If I need a cat driving a car, well, then obviously, but could it work so that you load those two specific models and combine them to create the wanted result? I think that would be much faster and more power efficient than an all-around model that needs, let's say, 192GB of VRAM. Consumerism of course wants it so that people pay subscriptions, while they have the equipment and rule over what you can and cannot do with the larger-than-life supermodels.

FullOf_Bad_Ideas@reddit

It's a 7B model.

odragora@reddit

It needs 80 GB VRAM.

Bobby72006@reddit

You see the insane uber-rigs people are making just to be able to run a *~~kneecapped~~* quantized version of Deepseek r1?

Maleficent_Age1577@reddit

My bad, I didn't mention that the everyday Joe can't have builds like that. You need to be rich for that. 8 x 4090 gives 192GB of VRAM for a little bit of money, like $40k.

ai_waifu_enjoyer@reddit

Nice, hope this gets as good as 4o in the future, and uncensored too.

Stepfunction@reddit

I'm assuming that, depending on the architecture, this could probably be converted to a GGUF once support is added to llama.cpp.

BABA_yaaGa@reddit

Gpt4o shall be forgotten soon

RMCPhoto@reddit

OpenAI has their place in the hall of fame. DALL-E 1, GPT-4o image generation (they need to give it a name)

ikmalsaid@reddit

Waiting for quants! Seems possible to run on 12GB cards

ihaag@reddit

Demo won’t allow you to scroll down on mobile..

Kooky-Somewhere-2883@reddit

very cool