Flux.1 Quantization Quality: BNB nf4 vs GGUF-Q8 vs FP16
Posted by Iory1998@reddit | LocalLLaMA | 92 comments
Hello guys,
I quickly ran a test comparing the various Flux.1 quantized models against the full-precision model, and to make a long story short, the GGUF Q8 is 99% identical to the FP16 while requiring half the VRAM. Just use it.
I used ForgeUI (Commit hash: 2f0555f7dc3f2d06b3a3cc238a4fa2b72e11e28d) to run this comparative test. The models in question are:
- flux1-dev-bnb-nf4-v2.safetensors available at https://huggingface.co/lllyasviel/flux1-dev-bnb-nf4/tree/main.
- flux1Dev_v10.safetensors available at https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main.
- flux1-dev-Q8_0.gguf available at https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main.
The comparison is mainly about the quality of the generated images. The Q8 GGUF and the FP16 produce the same quality without any noticeable loss, while the BNB nf4 suffers from a noticeable quality drop. Attached is a set of images for your reference.
GGUF Q8 is the winner. It's faster and more accurate than the nf4, requires less VRAM, and is only about 1GB larger in size. Meanwhile, the fp16 requires about 22GB of VRAM, takes up almost 23.5GB of largely wasted disk space, and produces images identical to the GGUF.
The first set of images clearly demonstrates what I mean by quality. You can see that both the GGUF and the fp16 generated realistic gold dust, while the nf4 generated dust that looks fake. It also doesn't follow the prompt as well as the other versions.
I feel like this example demonstrates visually why GGUF Q8 is a great quantization method.
Please share with me your thoughts and experiences.
Open_Channel_8626@reddit
Sometimes the compressed models make changes that I find more aesthetically pleasing. Not necessarily on average, but some of the time.
Iory1998@reddit (OP)
But remember, we are comparing quality, not changes. The nf4 may not follow the prompt as well as the GGUF Q8 or the fp16 simply because the CLIP and T5 text encoders baked into it are also quantized, which leads to quality loss. The GGUF, on the other hand, uses the fp16 CLIP models, which means it respects the prompt as well as the fp16 does.
JamesIV4@reddit
Can you do another comparison with nf4 v2 but loading the full T5 fp16 separately? This is how I do it on my 12 GB card, and it's fast and muuuuch better than the T5 fp8. A lot more detail.
Iory1998@reddit (OP)
Interesting. I'd love to do that. Are you using it on ForgeUI or ComfyUI?
JamesIV4@reddit
ComfyUI
Iory1998@reddit (OP)
Then that wouldn't be a fair comparison since I used ForgeUI. I'll give it a try.
JamesIV4@reddit
Very cool, looking forward to the results. I'm testing gguf on my end.
max_force_@reddit
Did you guys ever test these out? /u/Iory1998
anshulsingh8326@reddit
But I have Flux.1 S Fp8.
Iory1998@reddit (OP)
OFC it would crash! FP8 is still too big for your 12GB of VRAM. Do you use Forge or Comfy? Try keeping the text encoders in RAM.
anshulsingh8326@reddit
SwarmUI.
How do I put the text encoders in RAM?
Iory1998@reddit (OP)
If you are using Swarm, then the underlying backend is Comfy. You have to hook the text encoders to the Set/Force Device node set to CPU. Just use Forge, man.
Tight-Program6415@reddit
OMG, LIFESAVER!!!! I was starting to despair because I couldn't fit any LoRAs into VRAM anymore with fp16
😅
Thanks!!!!
Due-Writer-7230@reddit
Can someone in here who is familiar with Python and AI models send me a message? I'm working on an app, and I've only worked with text-only models; I need a little help with using models like Flux and Stable Diffusion. I've searched for several days and can't find anything. I'm still kind of new to some of this stuff. Any help would be greatly appreciated.
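For anyone in the same spot, the usual route for driving Flux or Stable Diffusion from Python is the Hugging Face diffusers library. Below is a minimal sketch, assuming a recent diffusers release with Flux support and access to the gated FLUX.1-dev repo; the prompt and settings are just placeholders, and for SD 1.5 or SDXL you would swap in StableDiffusionPipeline or StableDiffusionXLPipeline with the matching model ID.
```python
# Minimal text-to-image sketch with diffusers (assumes a recent diffusers
# release with Flux support and that the FLUX.1-dev license has been accepted).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
# The full bf16 pipeline needs well over 24GB of VRAM; on consumer cards,
# offload submodels to system RAM and swap each onto the GPU only when needed.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a macro photo of a beetle covered in realistic gold dust, studio lighting",
    num_inference_steps=20,
    guidance_scale=3.5,
    height=1024,
    width=1024,
).images[0]
image.save("out.png")
```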
Just-Contract7493@reddit
Is it just me, or can I for some reason not get Q8 to run at all in my notebook (the cloud one)? It just gets killed as soon as I try to generate anything...
Drakojin-X@reddit
From what I've seen around, NF4 is being ditched, and it doesn't support LoRA (afaik).
Iory1998@reddit (OP)
Actually, NF4 supports LoRA just fine. In my testing, sometimes the NF4 yields better results than Q4.
no_witty_username@reddit
I find that the Q8 GGUF model takes 2x longer to generate images than the fp8 model. Also, the time goes up as you add LoRAs. This is with Comfy. Is it just me, or is this normal for Comfy now?
Iory1998@reddit (OP)
You might be right. I am updating this test to add the GGUF Q4 and Q6, and I didn't notice any drop in speed. My guess is that GGUF support is still not optimized.
fathomly@reddit
Did you have any time to experiment with LoRAs?
I also experienced this: slower generation under the Q8 GGUF, and each LoRA I stacked would almost double the generation time required. fp8 was faster without LoRAs and had no slow-down at all with them.
Quality-wise, FP8 looks similar, but it's definitely generating differences, so I guess accuracy to FP16 is what you're paying for with the slower speed.
racerx2oo3@reddit
Could you share the prompts you used?
Iory1998@reddit (OP)
Which image do you like?
racerx2oo3@reddit
1 & 2
ramzeez88@reddit
What about the Q5 and Q6 quants? How much VRAM do they use, and how is the quality?
Iory1998@reddit (OP)
I haven't tested them because I don't intend to use them. In image generation, quality is important, so I'd like to use the highest quality that stays close to the full precision.
Iory1998@reddit (OP)
I will add Q4 and Q6 to the mix and update the post
ramzeez88@reddit
Ok, thanks for getting back to me.
Healthy-Nebula-3603@reddit
I wonder why Flux is not compressed with the much more robust and newer Q4_K_M instead of the old Q4?
Iory1998@reddit (OP)
Good question. Well, this is just my opinion, but when people discover a new technique, they usually try to put it out in the wild as quickly as possible without testing it fully. Maybe we will get the rest of the quantization methods later, similar to what happened in the LLM space.
a_beautiful_rhind@reddit
FP8 vs Q8?
Already knew NF4 lacked.
Healthy-Nebula-3603@reddit
There are many tests... simply put, Q8 is much closer to fp16 than fp8 is.
a_beautiful_rhind@reddit
It's almost a tossup. Likely the Q8 GGUF properly converts from BF16; I'm not sure how FP8 does it with the built-in functions.
Healthy-Nebula-3603@reddit
fp8 has very low precision... compared to q8
a_beautiful_rhind@reddit
the file size difference isn't that big, under a GB. GGUF is just a better quantization scheme.
Iory1998@reddit (OP)
I didn't try the FP8, but I saw a few comparisons, and in the examples I saw the FP8 generated slightly different images, while in my experience the Q8 generated almost identical images. I may be wrong.
a_beautiful_rhind@reddit
I have a workflow set up with no LoRA and with a multi-LoRA stack. Same seed. The LoRAs behave differently on GGUF. But it's a wash anyway because it's slower and larger.
K-quants would be cool, but they aren't implemented in the quantize.py this is using, so the code for them would have to be ported from C++ to Python.
teddybear082@reddit
How do you run the gguf?
-p-e-w-@reddit
Forge supports it as of two days ago.
Healthy-Nebula-3603@reddit
It also works with ComfyUI.
ILoveThisPlace@reddit
Does comfyui?
NOThanyK@reddit
Only via an extension:
https://github.com/city96/ComfyUI-GGUF
mystonedalt@reddit
The NF4 ran fully in less than 12GB. Q8 is taking just over 16GB.
Fusseldieb@reddit
Aw, so no 8GB in sight, I guess... Kinda understandable, but a bummer nonetheless...
mystonedalt@reddit
You can always let it spill over into system RAM.
Fusseldieb@reddit
System RAM will probably be horrendously slow. Since I use a notebook with just one single stick of 8GB 2666MHz, I honestly don't even dare to try.
maddogxsk@reddit
I was about to say that it isn't that terrible for running it at fp16, until I read the notebook part 💀
Fusseldieb@reddit
Yeaa... My main computer is a notebook, so there's that. I can only dream for now...
ai_dubs@reddit
Like a Jupyter notebook? I didn't know that notebooks were slower than raw Python, is that true?
Fusseldieb@reddit
I think you misunderstand. Notebook = My laptop!
Roubbes@reddit
I run the big fp16 model on a 3060 12GB in ComfyUI. Two and a half minutes for a 20-step image at 1500x1000 resolution. I'm confused.
mystonedalt@reddit
The fp16 model is spilling over into system RAM. If it weren't, generation would be faster for you.
It would be closer to twice as fast to generate an image with NF4.
mikaelhg@reddit
So with 16 GB of memory, I can run a batch of, say, 128 images, first through half of the layers, save the intermediaries, then load the second half of the layers to GPU, and then run the 128 images through the rest of the layers?
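That two-pass pattern looks roughly like the sketch below in plain PyTorch, with a toy nn.Sequential standing in for the real model (an illustrative assumption, not Flux itself). With a diffusion model it's messier, since every denoising step runs the full stack of layers, so the halves would have to be swapped on every step rather than once per batch.
```python
# Toy sketch of "run the batch through half the layers, then swap halves".
# A plain nn.Sequential stands in for the real model; with a diffusion
# transformer you'd have to repeat the swap on every denoising step.
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(32)]).eval()
first_half, second_half = model[:16], model[16:]

batch = torch.randn(128, 4096)  # stand-in for the inputs of 128 images

with torch.no_grad():
    # Pass 1: only the first half of the layers occupies the GPU.
    first_half.to("cuda")
    intermediates = first_half(batch.to("cuda")).cpu()
    first_half.to("cpu")
    torch.cuda.empty_cache()

    # Pass 2: swap in the second half and finish the forward pass.
    second_half.to("cuda")
    outputs = second_half(intermediates.to("cuda")).cpu()
    second_half.to("cpu")
```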
urgettingtallpip@reddit
Can you finetune quantized models, or only the fp16/bf16 one?
Iory1998@reddit (OP)
To my knowledge, you need the fp16 one, but I might be wrong.
Xandred_the_thicc@reddit
Use the GGUF Q4! Nf4 is less accurate for no benefit, so this comparison makes little sense. Q4 LoRA support already exists in Forge.
Iory1998@reddit (OP)
Why do you think it makes little sense? I am comparing quality against the full precision because that's my concern. The NF4 gets the VAE and text encoders baked into it. They are quantized too, so the whole package is about 12GB.
Xandred_the_thicc@reddit
I don't mean it in a bad way. Nf4 is functionally already deprecated because the Q4 GGUF is the same size and has almost the same performance as nf4 with better quality. The T5, clip_L, and VAE can all be downloaded separately in whatever precision you want, and used with the GGUF it's almost the same VRAM usage.
Iory1998@reddit (OP)
I still disagree with you. With nf4, the model stays loaded in VRAM all the time, while with GGUF the model unfortunately has to be unloaded each time you change the prompt and then loaded back into VRAM, because we are using the T5 text encoder. Look at your VRAM usage and you will see it. The image generation time might be the same, but the overall time from hitting the generate button to getting the image can take minutes.
This is from ForgeUI.
Xandred_the_thicc@reddit
Are you using the Gradio 4 fork with the "vae/text encoder" drop-down that lets you select the VAE, clip, and T5 together? I'm not using that "enable t5" option. I'll have to double-check next time I have access to my PC, but I could've sworn time from request to gen was about the same for both.
Both T5 and the Q4 GGUF should fit in 12GB of VRAM together, so it shouldn't need to unload the model just to run T5.
Iory1998@reddit (OP)
I am using ForgeUI. Anyway, it seems ComfyUI has solved the issue with models unloading, and now it only takes about 10 seconds more when you change your prompt.
Unwitting_Observer@reddit
Best advice I've read in the last 24 hours!
OutrageousImpact931@reddit
Does Q8 work with LoRA and ControlNet?
Iory1998@reddit (OP)
LoRA yes; ControlNet, I am not sure.
TheInternalNet@reddit
Is there any way this could run on CPU along with 128GB of RAM?? Old Dell server. It doesn't have to be fast, just good. Set it and forget it is fine with me.
Iory1998@reddit (OP)
Actually, I have the same question. GGUF is meant to run on CPU only.
Enturbulated@reddit
Iory1998@reddit (OP)
I have NO idea why I wrote "CPU only" when I meant to say on CPU alone, without the need for a GPU.
Emma_OpenVINO@reddit
A Jupyter notebook to compress to int4/int8 with NNCF/OpenVINO: https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/flux.1-image-generation/flux.1-image-generation.ipynb
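For context, the int4/int8 compression mentioned there is typically done with NNCF's weight-compression API on an OpenVINO IR model. A rough sketch of just that step, with a placeholder model path (the linked notebook covers converting the Flux transformer to IR first):
```python
# Rough sketch: 8-bit weight compression with NNCF on an OpenVINO IR model.
# Assumes the Flux transformer has already been converted to OpenVINO IR;
# "flux_transformer.xml" is a placeholder path, see the linked notebook.
import openvino as ov
import nncf

core = ov.Core()
model = core.read_model("flux_transformer.xml")

# Default is 8-bit weight compression; 4-bit variants are available through
# the `mode` argument at a further quality cost.
compressed = nncf.compress_weights(model)

ov.save_model(compressed, "flux_transformer_int8.xml")
```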
ambient_temp_xeno@reddit
Q8 is a no brainer for anyone with less than 24gb vram. Epic win.
ProcurandoNemo2@reddit
Unfortunately, that Q8 is just the transformer. You still need the T5 text encoder, which is a few GB in size. It definitely ends up going just over 16GB, which means you need more than that to run the whole thing properly. I tried running Flux on 16GB of VRAM when there was only the FP8 model available, and it always froze up my entire PC. For now, I'm happy with NF4 Schnell. It always gives me something I can use.
Xandred_the_thicc@reddit
Use the q4! There's no reason not to if you're using nf4. It's objectively closer to fp16, even with fp8 t5.
ProcurandoNemo2@reddit
I will as soon as I figure out how to make it show up on Forge. I saw a post in the discussions tab on Github with a screenshot of a new UI. My Forge has been updated to the most recent version, but the UI doesn't look like the new one with separate sections for VAE and text encoders.
ambient_temp_xeno@reddit
I only have 12GB! It unloads and loads the models, and the Q8 doesn't all fit; it's using some magic with the system RAM (I don't think it's using system fallback in the driver). It's still pretty decent speed though. I tested it at 1 min 44 sec for a 1024x1024 image at 20 steps. I didn't get any freeze-ups, but I do have 128GB of quad-channel DDR4.
Side note: it's technically possible to load the clip and VAE onto a second card, but I can't get that to work right anymore. It didn't seem to make a big difference anyway, as I think ComfyUI caches models in system RAM (or maybe the 1GB/s M.2 drive makes it pretty fast if it's loading from disk each time).
ProcurandoNemo2@reddit
Yeah, I saw that it eats up like 44GB of RAM if not everything is in VRAM. I would need to buy 64GB of RAM to run Q8 properly.
ambient_temp_xeno@reddit
I did see it briefly spike to 40+GB of RAM, but only for like a second. It might be worth trying anyway if you have an SSD or NVMe drive.
ProcurandoNemo2@reddit
Yeah it can work, but it's too slow. I've had to hard restart my PC a few too many times recently because of Flux to try another adventure like that again lol
Iory1998@reddit (OP)
Try increasing the virtual memory and see if that helps.
Iory1998@reddit (OP)
You are right! I forgot to mention that the Q8 does not come with the rest of the text encoders and the VAE baked into it.
Iory1998@reddit (OP)
But in ForgeUI, the unet is loaded into VRAM and the rest into RAM. For instance, I noticed that when I use the fp16, my VRAM usage is about 21GB out of 24GB, and the fp16 alone is 23.5GB, so most of the VRAM is occupied by the model.
poli-cya@reddit
Wow, thanks so much for running all of this. Very interesting and valuable to the community. How the hell do you make prompts with such striking visuals, can you share one?
Iory1998@reddit (OP)
With Flux.1, prompt it the way you usually prompt ChatGPT. The text encoders use natural language, and the model understands it well.
"Create a magnificent illustration of an astronaut floating in space getting closer to a giant black hole. In the dark space, there is a half destroyed planet whose debris are sucked by the black whole. Use a professional realistic style that combines an aspect of science fiction and art." => this for the floating astronaut.
"Create a breathtaking, award-winning illustration of a woman's face in a professional, highly detailed style. The image should be in black and white, with the woman's eyes closed. Her hair is styled in a bun, transforming into a cloud of blue and pink light against a black background. Smoke emerges from her mouth, blending into her hair, creating an eerie, unsettling atmosphere. The theme is horror, with a focus on a dark, spooky, and suspenseful mood. The style should be dystopian, bleak, and post-apocalyptic, conveying a somber and dramatic tone." => This for the last image.
My favorite image:D
A little trick: for more complex scenes, I write the prompt and ask GPT-4o or Claude to refine it for me. Grok-2 seems to get Flux.1 prompts, so it gives good prompt suggestions too.
poli-cya@reddit
Wow, thanks so much for the detailed response. Looking to get back into image-gen after dabbling a bit a year ago and these prompts will be super helpful. It looks like you tried to link an image but it didn't work, was it one of the pictures in your original post?
One last question about the lora:flux_realism bit: do you need something extra for that to work, like an extra set of modifiers downloaded? Feel free to ignore any of the above if it's stupid or asking too much of your time; I really appreciate what you shared already.
Iory1998@reddit (OP)
My pleasure. As for the image, I can see it, so maybe it's an issue on your side.
The Realism LoRA adds a realism style to the image, which gives it some aesthetics, like Midjourney's, instead of looking like a stock image.
ArsNeph@reddit
All of these images are basically identical except for the last, where there's a more significant difference. I wonder why that is. Either way, it seems like transformers are significantly more resistant to quantization than UNets. Or maybe it's just that it's a bigger model and therefore more resistant? The question is: are backends like Automatic1111 going to support stable-diffusion.cpp inference?
Iory1998@reddit (OP)
ForgeUI does! I mean, I am using GGUF Q8 in ForgeUI, which is a fork of A1111!
Iory1998@reddit (OP)
EDIT:
The Q8 does not come with the rest of the text encoders and the VAE baked into it.
rerri@reddit
I don't think this is correct.
NF4 is faster than Q8 in your screenshot, and it's faster on my system too (4090, everything in VRAM).
Also, Q8 is larger than NF4 and takes more VRAM to run; at least this is the case on my system, dunno why it would be different on yours. In your screenshot we can see a smaller amount of VRAM in use, but I'm guessing T5 is unloaded from VRAM in the Q8 shot and not with NF4, or something similar.
Iory1998@reddit (OP)
No it isn't; it takes more time to generate an image with NF4. But that may be due to my machine, or ForgeUI may still have issues with optimization. If that's the case, it could change later.
whoisraiden@reddit
NF4 is also best for low-VRAM systems, whereas people with sufficient VRAM report negligible differences between fp8 and nf4.
ahmetfirat@reddit
Can you generate with nf4 on 12GB of VRAM without using swap space?
milksteak11@reddit
Thanks for doing the things I am too lazy to do
-p-e-w-@reddit
Thanks for posting this! I've been experimenting with the NF4 quant and I definitely noticed a quality loss, especially for text. The FP16 version can almost always render the requested text correctly, while the NF4 version randomly substitutes or omits words. Looks like GGUF is the way to go.