CogVideoX 5B - Open weights Text to Video AI model (less than 10GB VRAM to run) | Tsinghua KEG (THUDM)
Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 83 comments
CogVideo collection (weights): https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce
Space: https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space
Paper: https://huggingface.co/papers/2408.06072
The 2B model runs on a 1080 Ti and the 5B on a 3060.
The 2B model is licensed under Apache 2.0.
Source:
Vaibhav (VB) Srivastav on X: https://x.com/reach_vb/status/1828403580866384205
Adina Yakup on X: https://x.com/AdeenaY8/status/1828402783999218077
Tiezhen WANG: https://x.com/Xianbao_QIAN/status/1828402971622940781
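For context, a minimal way to drive the 2B checkpoint from Python is the diffusers CogVideoXPipeline. This is a rough sketch, assuming diffusers >= 0.30 and a CUDA GPU, not an official recipe; the prompt and output path are placeholders:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the Apache-2.0-licensed 2B checkpoint in fp16 (the 5B variant prefers bf16).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU

# 49 frames at 8 fps is roughly 6 seconds of 720x480 video.
frames = pipe(
    prompt="a golden retriever chasing a ball across a sunny park",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)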
IndicationMaleficent@reddit
I have a 4090 but it seems my RAM is the limiter at 32gb. Any idea what's a good amount of RAM to have?
Kiyushia@reddit
So 8GB vram possible?
Apart_Boat9666@reddit
I don't think so. The 5B model may be small on disk, but inference takes a lot of VRAM. For the 2B model it was 16-20 GB for inference, so the 5B might need 40 GB or more.
ninjasaid13@reddit
what do you mean?
Xthman@reddit
Yeah nah, it OOMs at 8GB for me.
Apart_Boat9666@reddit
I remember when they released the 2B model, the VRAM usage was 16-20 GB (written in the README). They also wrote that they were working on reducing inference requirements. There was also a post about this model before the 5B was released stating similar requirements. Maybe they have improved since.
uhuge@reddit
maybe I2Q next year;)
Similar_Piano_963@reddit
Possible for someone to turn this into an image-to-video model?
Maybe train an IP-Adapter model to condition on the beginning of the video?
This model looks pretty decent. In my experience, ALL current video gen models are quite slot-machine-y right now, so it would be great to be able to run it i2v locally.
Sand-Discombobulated@reddit
Hey, have you found anything like this yet? I am looking for a way to do image -> video, or "make images come to life", locally.
I am wondering what the bigger guys are using for this.
MMAgeezer@reddit
The creators of this model released an I2V version, and there are Alibaba versions which work for a range of resolutions too:
https://huggingface.co/THUDM/CogVideoX-5b-I2V
https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP/blob/main/README_en.md
Far_Lifeguard_5027@reddit
Dumb question, but where is this so-called "1-click launcher" they are referring to on X?
softwareweaver@reddit
https://i.redd.it/gde2boea79ld1.gif
Prompt: a boy wearing a red shirt and blue shorts playing fetch with his dog. His dog is a golden retriever.
Some limitations: it can only generate 49 frames at 8 fps, 720x480 (about 6 seconds of video).
If you reply with SFW prompts, I can try to generate videos from them.
infiniteContrast@reddit
woah i laughed hard when the dog jumped and turned into two dogs, lol
SeymourBits@reddit
I've seen this kind of "entity cloning" before. It's a known issue that occurs under certain combinations of heavy motion and occlusion. I consider this to be a clue of a SotA architecture and another victory for open-source models! CogVideo is hot on the heels of Sora!
infiniteContrast@reddit
Yeah, in my opinion it's not really an issue; I think it's easy to work around by generating from the prompt several times and choosing the best result.
martinerous@reddit
I guess that's what you get with overly complex prompts. It seems it picks up the keywords but doesn't care about sentence structure and filler words. So if the prompt mentions the dog twice, two dogs is what you might get :D
GasBond@reddit
is this the first text to video that is open source? any other?
Current-Rabbit-620@reddit
I think there is SVD
GasBond@reddit
i remember now
Maykey@reddit
OpenSora was before that. Maybe something else too, but this one is much easier to install than OpenSora.
GasBond@reddit
oh yeah
3-4pm@reddit
After some experimentation I can confirm this model is fairly uncensored. It will render nudity, violence, and famous people but sometimes with a Ken doll accuracy.
-p-e-w-@reddit
The example videos blow my mind. Prompt adherence is amazing. The fact that this can be run on consumer cards is unbelievable.
It feels like humanity skipped forward by a whole century in the past 3 years or so. If someone had asked me in 2010 for my prediction of when something like this would become possible, I would have guessed around 2070 or so. And I would have assumed it would require a quantum supercomputer, not an $800 gaming rig from the early 2020s.
FaceDeer@reddit
Sometimes, when I've got a local LLM running and I'm using it as a brainstorming buddy for an upcoming RPG adventure I'm planning, I have to stop, look down at my computer, and go "my graphics card just came up with a way better idea for this scenario than I did."
I'm very impressed with the technology, of course, but also kind of humbled that it turns out that significant aspects of the human mind can be emulated so easily. Turns out we're probably not as fancy as we thought we were.
Lemgon-Ultimate@reddit
Yeah, totally. I remember all the sci-fi movies and predictions about AI, and the conclusion was always "it may be intelligent enough to do things on its own, but it will never be creative; only humans can create art." I was pretty surprised when Stable Diffusion appeared, the first generative AI I learned about, and it creates art, lol.
Open_Channel_8626@reddit
Yeah I have seen this sentiment a lot about the deep learning boom and it surprised me the way the order went (art before spreadsheets)
Healthy-Nebula-3603@reddit
Because we have big egos and we think creativity is so "unique", but it turns out: nope ;)
Interestingly, I don't know of any SF book where the AI is creative and the action is set in this century.
AmericanNewt8@reddit
It turns out that the stuff we thought was easy to automate was hard, while the stuff we thought was hard to automate was actually simple.
Due-Memory-6957@reddit
Everything is easy after it's done.
FaceDeer@reddit
Indeed. Just the other day I was having an AI help me create lyrics for a song about whether red grilled cheese sandwiches or blue grilled cheese sandwiches were better, basically a pointless argument for a science fiction setting where there's red-coloured cheese and blue-coloured cheese. The LLM I was working with was doing okay, coming up with verses spinning subjective superlatives about each of the two types.
And then it wrote an outro in which the singer ends up suggesting that maybe purple cheese would be better than either red or blue on its own.
I didn't ask the AI to solve a generations-old war, but there you go, it did.
Wonderful-Top-5360@reddit
I second this feeling. My guess is we'll be able to generate almost all content entirely on our own devices.
Just as people have become famous for playing their MP3 playlists on stage, people will become famous for generating movies, TV shows, and music.
throwaway2676@reddit
It will be so amazing when we can translate almost any book to a movie or tv series with just a few days of prompting and inference. We'll even be able to modify storylines, play out "what if" scenarios, and introduce new characters at will. In just a few years, $100 million Hollywood productions will be available to the average person with something like a $5k GPU.
Wonderful-Top-5360@reddit
yes porn is going to be amazing
MostlyRocketScience@reddit
A golden rabbit jumping through a snowy grassy field with Faberge eggs lying around. The Northern Lights are visible in the background
A golden robot surfing on a lava waterfall with the nightsky in the background.
(What the prompts were before they were enhanced with GLM-4)
Open_Channel_8626@reddit
It's not the highest resolution or fidelity, but it works OK.
BeYourself2021@reddit
The 5B doesn't work on my 3060 12GB; out-of-memory error.
Uncle___Marty@reddit
Well crap, I have the 8GB version so I'm screwed, and I hear the 2B model is a MAJOR drop in quality.
Open_Channel_8626@reddit
It is, yeah; only the 5B is "viable" in my view. There will be options down the road for lower VRAM though, like distillation.
DragonfruitIll660@reddit
If you're using the ComfyUI wrapper, make sure to enable fp8 under the precision option. That gets it down to, I think, 11.6 GB.
Maykey@reddit
Couldn't get kitten to chase its own tail yet.
Can't wait to get some free time. I suspect using HQQ for quantization may lead to a good VRAM improvement, as it's really easy to set up.
dont--panic@reddit
Eating is still a challenge https://imgur.com/1magDBl
AmericanKamikaze@reddit
So… when can we run custom LoRAs with video like this?
Tobiaseins@reddit
The 5B version is really, really good. The best open-weights txt2vid by a long shot, not even close. In my first tests it also beats Runway Gen-3 on prompt adherence, though it's not as aesthetic.
ResidentPositive4122@reddit
I'm still in the queue, but I like their idea of "sparklifying" the prompts. I entered
and it came up with
Vivid_Dot_6405@reddit
I see we have a Stargate fan here. I'm literally watching an SG-1 episode as I read this.
pmp22@reddit
Indeed.
Uncle___Marty@reddit
Oddly, I heard O'Neill say that. I must have watched the body-swap episode recently....
Vivid_Dot_6405@reddit
That's what Thor says all the time, or maybe it's Heimdall.
Xanjis@reddit
There is a PR on https://github.com/kijai/ComfyUI-CogVideoXWrapper that supports the 5b
Quantum1248@reddit
How can i sue it? I have to put it in some folder in comfyui?
Davidyz_hz@reddit
I hope you meant "use" because "sue" looks unnecessarily scary
Homeschooled316@reddit
EU moment
martinerous@reddit
After a few updates from the awesome author of that repository, I can confirm that with both fp8_transformer and enable_vae_tiling turned on, it completed generating a video on one of the most hated GPUs - 4060 Ti with 16GB VRAM :)
To run it, you just download the repo as a zip and extract it into ComfyUI\custom_nodes, then restart ComfyUI and watch the console. If it complains that it could not load the node because of diffusers, you'll need to upgrade the diffusers installation. On Windows embedded ComfyUI I did it with
python_embeded\python.exe -m pip install -U diffusers
Then I restarted ComfyUI and loaded the example workflow from examples/cogvideox_5b_example_01.json
A few video-related nodes were missing, and I had to use ComfyUI Manager's ( https://github.com/ltdrdata/ComfyUI-Manager ) "Install missing custom nodes" command to install them.
Then you'll need the text encoder. I had t5xxl_fp16.safetensors from my earlier experiments with Flux, but CogVideoX recommends t5xxl_fp8_e4m3fn.safetensors, which I downloaded here https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main and put in the ComfyUI models/clip folder.
If you have a GPU with 24GB VRAM, this might be all you need. Hit "Queue Prompt" and wait. The first time, the "(Down)load CogVideo Model" node will appear stuck while it downloads the model (the console shows the progress). Then it should work.
However, if it fails with "Allocation on device" during the Sampler step, you'll need to toggle fp8_transformer ON for the "(Down)load CogVideo Model" node. This will reduce the needed VRAM on your GPU. However, it's not supported on all GPUs.
If it fails during the Decoder step (which is heartbreaking to see after having waited on Sampler for 20-ish minutes), you'll need to toggle enable_vae_tiling ON for the "CogVideo Decode" node.
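For anyone trying the plain-diffusers route instead of ComfyUI, roughly the same memory savers exist there too. A sketch, assuming diffusers >= 0.30 (method names may differ in older releases); these calls only loosely mirror the fp8_transformer/enable_vae_tiling toggles described above:

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Trade speed for VRAM: stream submodules to the GPU one at a time.
pipe.enable_sequential_cpu_offload()
# Decode the 49-frame latent in tiles/slices so the VAE step doesn't OOM.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()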
martinerous@reddit
I'm currently trying it in embedded ComfyUI. There are some caveats and issues.
ComfyUI needed a manual package install:
python_embeded\python.exe -m pip install -U diffusers
After that, it seemed to work, but then the Decoder node failed with out-of-memory on my 16GB VRAM. Then I received advice on GitHub to update the nodes once more and try the new tiling toggle. However, after the update it started failing even earlier - now it was the Sampler that failed with out-of-memory. When I reduced the sampler frame size, it started failing with
Error occurred when executing CogVideoSampler:
'CogVideoXPipeline' object has no attribute 'guidance_scale'
So yeah, it's very much a work in progress. But it looks hopeful, if it will run on 16GB VRAM at all.
Nvidia, where is your CUDA shared memory feature when I need it? Why doesn't it work now, when it clearly worked with a few LLMs, where I saw "Shared GPU memory" increasing?
phenotype001@reddit
How do I use int8 with diffusers? Please help, should I set a specific dtype here, or what do I do?
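Not an authoritative answer, but dtype alone won't give you int8; you quantize the transformer's weights with a separate library. A sketch using optimum-quanto, assuming it is installed and plays well with this pipeline (its API may have changed since):

import torch
from diffusers import CogVideoXPipeline
from optimum.quanto import freeze, qint8, quantize

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize only the transformer's weights to int8, then freeze them in place.
quantize(pipe.transformer, weights=qint8)
freeze(pipe.transformer)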
UnkarsThug@reddit
I have 8 GB Vram, 64 GB Ram, and a dream, but I'll have to see if I can get the small one to run later.
ThisGonBHard@reddit
Wait for Q8 quant.
Deluded-1b-gguf@reddit
We kinda need img2vid
complains_constantly@reddit
You don't need a different model for that, just software that supports it. Basically a controlnet to force the first frame. Similar to inpainting.
Open_Channel_8626@reddit
Training control nets is expensive sadly
AbstractedEmployee46@reddit
Way cheaper than training an entirely new model with a completely different architecture. Are you braindead?
Wonderful-Top-5360@reddit
interesting...go on
complains_constantly@reddit
Sora is the same type of model, their blog post details this with examples better than I could. Also, I'm pretty sure people were already doing it with Stability's video model that they open sourced.
Ylsid@reddit
Oh man I can't wait for "fine tunes"
LoafyLemon@reddit
I can already imagine BeaverAI proudly announcing CoxVideoxXx.
infiniteContrast@reddit
It sounds great!
ithkuil@reddit
Looks amazing in examples. License required for > 1 million visits or uses per month or something like that.
When I tried out the Space, it said I was in a queue with about 14,000 seconds remaining. That's fourteen thousand.
Gubru@reddit
I'm waiting in the queue; the estimated time is way off. It dropped from 100,000 to 30,000 seconds in 350 seconds.
Open_Channel_8626@reddit
I messed up, lol. I left the queue when it was around 1,800 seconds. I think I saw it before the crowds came.
ResidentPositive4122@reddit
queue: 24/29 | 971.3/4699.7s
Ne_Nel@reddit
My first try was unexpectedly decent. Has a real "game changer" arrived?
Few_Painter_5588@reddit
Is this not the first open weight Text to Video model?
Tight_Range_5690@reddit
There are a couple more local ones I tried - can't remember the names, sorry, but they're all unusably bad.
Few_Painter_5588@reddit
Yeah, I think this is the first one that is serviceable. Though I haven't tried out the 2b model lol
FullOf_Bad_Ideas@reddit
The 2B wasn't producing many convincing videos for me, and I generated about 100 of them locally, but it was fun to play with. They trained the 2B on a lot of Pond5 data, as the watermark was clearly visible in many of the outputs.
neph1010@reddit
AnimateDiff and Stable Diffusion are also text-to-video.
Radiant_Dog1937@reddit
I don't know how cherry picked they are, but the demos for this are pretty good.
-p-e-w-@reddit
I just used the HF Space to generate a video of green rubber kangaroos jumping around on an alien planet, and the quality was comparable to the examples.
Yes_but_I_think@reddit
https://i.redd.it/mvnbryghb8ld1.gif
For the prompt (created with help of glm-4) "The video opens with a majestic landscape, the ground teeming with life as various birds forage peacefully. Suddenly, dark clouds gather, and a torrential downpour begins, sending smaller birds into a flurry, darting away to seek refuge. Amidst the chaos, an eagle, with its powerful wings, starts to ascend rapidly. It climbs higher, its determined gaze fixed on the sky, until it punctures the dark canopy of clouds. The eagle continues its ascent, breaking through the storm into the serenity above, where the sun still shines. The bird is then shown gliding effortlessly, a look of triumph on its face as it shakes off droplets of water. The scene fades to a close-up of the eagle, its expression one of contentment and pride. "
A good start. I probably overestimated what can be generated in just 6 seconds. It took 700 seconds.
Homberger@reddit
GitHub repo: https://github.com/THUDM/CogVideo
formalsystem@reddit
Hugging Face got this running with 8GB of VRAM using torchao: https://x.com/aryanvs_/status/1828405977667793005
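The linked approach quantizes the transformer with torchao before assembling the pipeline and combines that with offloading and VAE tiling. Roughly, as a sketch (assuming recent torchao and diffusers; this is not the exact script from the post):

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# Quantize the 5B transformer's linear weights to int8 in place.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Build the pipeline around the quantized transformer and add the usual memory savers.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()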