CogVideoX 5B - Open weights Text to Video AI model (less than 10GB VRAM to run) | Tsinghua KEG (THUDM)
Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 83 comments
CogVideo collection (weights): https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce
Space: https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space
Paper: https://huggingface.co/papers/2408.06072
The 2B model runs on a 1080 Ti and the 5B on a 3060.
The 2B model is licensed under Apache 2.0.
Source:
Vaibhav (VB) Srivastav on X: https://x.com/reach_vb/status/1828403580866384205
Adina Yakup on X: https://x.com/AdeenaY8/status/1828402783999218077
Tiezhen WANG: https://x.com/Xianbao_QIAN/status/1828402971622940781
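For context, a minimal way to drive the 2B checkpoint from Python is the diffusers CogVideoXPipeline. This is a rough sketch, assuming diffusers >= 0.30 and a CUDA GPU, not an official recipe; the prompt and output path are placeholders:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the Apache-2.0-licensed 2B checkpoint in fp16 (the 5B variant prefers bf16).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU

# 49 frames at 8 fps is roughly 6 seconds of 720x480 video.
frames = pipe(
    prompt="a golden retriever chasing a ball across a sunny park",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)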
IndicationMaleficent@reddit
I have a 4090 but it seems my RAM is the limiter at 32gb. Any idea what's a good amount of RAM to have?
Kiyushia@reddit
So 8GB vram possible?
Apart_Boat9666@reddit
I don't think so. The 5B model may be small on disk, but inference takes a lot of VRAM. For the 2B model it was 16-20 GB for inference, so the 5B might need 40 GB or more.
ninjasaid13@reddit
what do you mean?
Xthman@reddit
Yeah nah, it OOMs at 8GB for me.
Apart_Boat9666@reddit
I remember when they released the 2B model, the VRAM usage was 16-20 GB (written in the README). They also wrote that they were working on reducing inference requirements. There was also a post about this model before the 5B was released stating similar requirements. Maybe they have improved since.
uhuge@reddit
maybe I2Q next year;)
Similar_Piano_963@reddit
Possible for someone to turn this into an image-to-video model?
Maybe train an IP-Adapter model to condition on the beginning of the video?
This model looks pretty decent. In my experience, ALL current video gen models are quite slot-machine-y right now, so it would be great to be able to run it i2v locally.
Sand-Discombobulated@reddit
Hey, have you found anything like this yet? I am looking for a way to do image -> video, or "make images come to life", locally.
I am wondering what the bigger guys are using for this.
MMAgeezer@reddit
The creators of this model released an I2V version, and there are Alibaba versions which work for a range of resolutions too:
https://huggingface.co/THUDM/CogVideoX-5b-I2V
https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP/blob/main/README_en.md
Far_Lifeguard_5027@reddit
Dumb question, but where is this so-called "1-click launcher" they are referring to on X?
softwareweaver@reddit
https://i.redd.it/gde2boea79ld1.gif
Prompt: a boy wearing a red shirt and blue shorts playing fetch with his dog. His dog is a golden retriever.
Some limitations: it can only generate 49 frames at 8 fps, 720x480 (about 6 seconds of video).
If you reply with SFW prompts, I can try to generate videos from them.
infiniteContrast@reddit
woah i laughed hard when the dog jumped and turned into two dogs, lol
SeymourBits@reddit
I've seen this kind of "entity cloning" before. It's a known issue that occurs under certain combinations of heavy motion and occlusion. I consider this to be a clue of a SotA architecture and another victory for open-source models! CogVideo is hot on the heels of Sora!
infiniteContrast@reddit
Yeah, in my opinion it's not really an issue; I think it's easy to work around by generating from the prompt several times and choosing the best result.
martinerous@reddit
I guess that's what you get with overly complex prompts. It seems it picks up the keywords but doesn't care about sentence structure and filler words. So if the prompt mentions the dog twice, two dogs is what you might get :D
GasBond@reddit
is this the first text to video that is open source? any other?
Current-Rabbit-620@reddit
I think there is SVD
GasBond@reddit
i remember now
Maykey@reddit
OpenSora was before that. Maybe something else too, but this one is much easier to install than OpenSora.
GasBond@reddit
oh yeah
3-4pm@reddit
After some experimentation I can confirm this model is fairly uncensored. It will render nudity, violence, and famous people but sometimes with a Ken doll accuracy.
-p-e-w-@reddit
The example videos blow my mind. Prompt adherence is amazing. The fact that this can be run on consumer cards is unbelievable.
It feels like humanity skipped forward by a whole century in the past 3 years or so. If someone had asked me in 2010 for my prediction of when something like this would become possible, I would have guessed around 2070 or so. And I would have assumed it would require a quantum supercomputer, not an $800 gaming rig from the early 2020s.
FaceDeer@reddit
Sometimes, when I've got a local LLM running and I'm using it as a brainstorming buddy for an upcoming RPG adventure I'm planning, I have to stop, look down at my computer, and go "my graphics card just came up with a way better idea for this scenario than I did."
I'm very impressed with the technology, of course, but also kind of humbled that it turns out that significant aspects of the human mind can be emulated so easily. Turns out we're probably not as fancy as we thought we were.
Lemgon-Ultimate@reddit
Yeah, totally. I remember all the sci-fi movies and predictions about AI, and the conclusion was always "it may be intelligent enough to do things on its own, but it will never be creative; only humans can create art." I was pretty surprised when Stable Diffusion appeared, the first generative AI I learned about, and it creates art, lol.
Open_Channel_8626@reddit
Yeah I have seen this sentiment a lot about the deep learning boom and it surprised me the way the order went (art before spreadsheets)
Healthy-Nebula-3603@reddit
Because we have big egos and we think creativity is so "unique", but it turns out: nope ;)
Interestingly, I don't know of any SF book where the AI is creative and the action is set in this century.
AmericanNewt8@reddit
It turns out that the stuff we thought was easy to automate was hard, while the stuff we thought was hard to automate was actually simple.
Due-Memory-6957@reddit
Everything is easy after it's done.
FaceDeer@reddit
Indeed. Just the other day I was having an AI help me create lyrics for a song about whether red grilled cheese sandwiches or blue grilled cheese sandwiches were better, basically a pointless argument for a science fiction setting where there's red-coloured cheese and blue-coloured cheese. The LLM I was working with was doing okay, coming up with verses spinning subjective superlatives about each of the two types.
And then it wrote an outro in which the singer ends up suggesting that maybe purple cheese would be better than either red or blue on its own.
I didn't ask the AI to solve a generations-old war, but there you go, it did.
Wonderful-Top-5360@reddit
I second this feeling. My guess is we'll be able to generate almost all content entirely on our own devices.
Just as people have become famous for playing their MP3 playlists on stage, people will become famous for generating movies, TV shows, and music.
throwaway2676@reddit
It will be so amazing when we can translate almost any book to a movie or tv series with just a few days of prompting and inference. We'll even be able to modify storylines, play out "what if" scenarios, and introduce new characters at will. In just a few years, $100 million Hollywood productions will be available to the average person with something like a $5k GPU.
Wonderful-Top-5360@reddit
yes porn is going to be amazing
MostlyRocketScience@reddit
A golden rabbit jumping through a snowy grassy field with Faberge eggs lying around. The Northern Lights are visible in the background
A golden robot surfing on a lava waterfall with the nightsky in the background.
(What the prompts were before they were enhanced with GLM-4)
Open_Channel_8626@reddit
It's not the highest resolution or fidelity, but it works OK.
BeYourself2021@reddit
The 5B doesn't work on my 3060 12GB; out-of-memory error.
Uncle___Marty@reddit
Well crap, I have the 8GB version so I'm screwed, and I hear the 2B model is a MAJOR drop in quality.
Open_Channel_8626@reddit
It is, yeah; only the 5B is "viable" in my view. There will be options down the road for lower VRAM though, like distillation.
DragonfruitIll660@reddit
If you're using the ComfyUI wrapper, make sure to enable fp8 under the precision option. That gets it down to, I think, 11.6 GB.
Maykey@reddit
Couldn't get kitten to chase its own tail yet.
Can't wait to get some free time. I suspect using HQQ for quantization may lead to a good VRAM improvement, as it's really easy to set up.
dont--panic@reddit
Eating is still a challenge https://imgur.com/1magDBl
AmericanKamikaze@reddit
So… when can we run custom LoRAs with video like this?
Tobiaseins@reddit
The 5B version is really, really good. The best open-weights txt2vid by a long shot, not even close. In my first tests it also beats Runway Gen-3 on prompt adherence, though it's not as aesthetic.
ResidentPositive4122@reddit
I'm still in the queue, but I like their idea of "sparklifying" the prompts. I entered
and it came up with
Vivid_Dot_6405@reddit
I see we have a Stargate fan here. I'm literally watching an SG-1 episode as I read this.
pmp22@reddit
Indeed.
Uncle___Marty@reddit
Oddly, I heard O'Neill say that. I must have watched the body-swap episode recently....
Vivid_Dot_6405@reddit
That's what Thor says all the time, or maybe it's Heimdall.
Xanjis@reddit
There is a PR on https://github.com/kijai/ComfyUI-CogVideoXWrapper that supports the 5b
Quantum1248@reddit
How can i sue it? I have to put it in some folder in comfyui?
Davidyz_hz@reddit
I hope you meant "use" because "sue" looks unnecessarily scary
Homeschooled316@reddit
EU moment
martinerous@reddit
After a few updates from the awesome author of that repository, I can confirm that with both fp8_transformer and enable_vae_tiling turned on, it completed generating a video on one of the most hated GPUs - 4060 Ti with 16GB VRAM :)
To run it, you just download the repo as a zip and extract it into ComfyUI\custom_nodes, then restart ComfyUI and watch the console. If it complains that it could not load the node because of diffusers, you'll need to upgrade the diffusers installation. On Windows embedded ComfyUI I did it with
python_embeded\python.exe -m pip install -U diffusers
Then I restarted ComfyUI and loaded the example workflow from examples/cogvideox_5b_example_01.json
A few video-related nodes were missing, and I had to use ComfyUI Manager's ( https://github.com/ltdrdata/ComfyUI-Manager ) "Install missing custom nodes" command to install them.
Then you'll need the text encoder. I had t5xxl_fp16.safetensors from my earlier experiments with Flux, but CogVideoX recommends t5xxl_fp8_e4m3fn.safetensors, which I downloaded here https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main and put in the ComfyUI models/clip folder.
If you have a GPU with 24GB VRAM, this might be all you need. Hit "Queue Prompt" and wait. The first time, the "(Down)load CogVideo Model" node will appear stuck while it downloads the model (the console shows the progress). Then it should work.
However, if it fails with "Allocation on device" during the Sampler step, you'll need to toggle fp8_transformer ON for the "(Down)load CogVideo Model" node. This will reduce the needed VRAM on your GPU. However, it's not supported on all GPUs.
If it fails during the Decoder step (which is heartbreaking to see after having waited on Sampler for 20-ish minutes), you'll need to toggle enable_vae_tiling ON for the "CogVideo Decode" node.
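For anyone trying the plain-diffusers route instead of ComfyUI, roughly the same memory savers exist there too. A sketch, assuming diffusers >= 0.30 (method names may differ in older releases); these calls only loosely mirror the fp8_transformer/enable_vae_tiling toggles described above:

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Trade speed for VRAM: stream submodules to the GPU one at a time.
pipe.enable_sequential_cpu_offload()
# Decode the 49-frame latent in tiles/slices so the VAE step doesn't OOM.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()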
martinerous@reddit
I'm currently trying it in embedded ComfyUI. There are some caveats and issues.
ComfyUI needed a manual package install:
python_embeded\python.exe -m pip install -U diffusers
After that, it seemed to work, but then the Decoder node failed with out-of-memory on my 16GB VRAM. Then I received advice on GitHub to update the nodes once more and try the new tiling toggle. However, after the update it started failing even earlier - now it was the Sampler that failed with out-of-memory. When I reduced the sampler frame size, it started failing with
Error occurred when executing CogVideoSampler:
'CogVideoXPipeline' object has no attribute 'guidance_scale'
So yeah, it's very much a work in progress. But it looks hopeful, if it will run on 16GB VRAM at all.
Nvidia, where is your CUDA shared memory feature when I need it? Why doesn't it work now, when it clearly worked with a few LLMs, where I saw "Shared GPU memory" increasing?
phenotype001@reddit
How do I use int8 with diffusers? Please help, should I set a specific dtype here, or what do I do?
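Not an authoritative answer, but dtype alone won't give you int8; you quantize the transformer's weights with a separate library. A sketch using optimum-quanto, assuming it is installed and plays well with this pipeline (its API may have changed since):

import torch
from diffusers import CogVideoXPipeline
from optimum.quanto import freeze, qint8, quantize

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize only the transformer's weights to int8, then freeze them in place.
quantize(pipe.transformer, weights=qint8)
freeze(pipe.transformer)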
UnkarsThug@reddit
I have 8 GB Vram, 64 GB Ram, and a dream, but I'll have to see if I can get the small one to run later.
ThisGonBHard@reddit
Wait for Q8 quant.
Deluded-1b-gguf@reddit
We kinda need img2vid
complains_constantly@reddit
You don't need a different model for that, just software that supports it. Basically a controlnet to force the first frame. Similar to inpainting.
Open_Channel_8626@reddit
Training control nets is expensive sadly
AbstractedEmployee46@reddit
Way cheaper than training an entirely new model with a completely different architecture. Are you braindead?
Wonderful-Top-5360@reddit
interesting...go on
complains_constantly@reddit
Sora is the same type of model, their blog post details this with examples better than I could. Also, I'm pretty sure people were already doing it with Stability's video model that they open sourced.
Ylsid@reddit
Oh man I can't wait for "fine tunes"
LoafyLemon@reddit
I can already imagine BeaverAI proudly announcing CoxVideoxXx.
infiniteContrast@reddit
It sounds great!
ithkuil@reddit
Looks amazing in examples. License required for > 1 million visits or uses per month or something like that.
When I tried out the Space, it said I was in a queue with about 14,000 seconds remaining. That's fourteen thousand.
Gubru@reddit
I'm waiting in the queue; the estimated time is way off. It dropped from 100,000 to 30,000 seconds in 350 seconds.
Open_Channel_8626@reddit
I messed up, lol. I left the queue when it was around 1,800 seconds. I think I saw it before the crowds came.
ResidentPositive4122@reddit
queue: 24/29 | 971.3/4699.7s
Ne_Nel@reddit
My first try was unexpectedly decent. Has a real "game changer" arrived?
Few_Painter_5588@reddit
Is this not the first open weight Text to Video model?
Tight_Range_5690@reddit
There are a couple more local ones I tried - can't remember the names, sorry, but they're all unusably bad.
Few_Painter_5588@reddit
Yeah, I think this is the first one that is serviceable. Though I haven't tried out the 2b model lol
FullOf_Bad_Ideas@reddit
The 2B wasn't producing many convincing videos for me, and I generated about 100 of them locally, but it was fun to play with. They trained the 2B on a lot of Pond5 data, as the watermark was clearly visible in many of the outputs.
neph1010@reddit
AnimateDiff and Stable Diffusion are also text-to-video.
Radiant_Dog1937@reddit
I don't know how cherry picked they are, but the demos for this are pretty good.
-p-e-w-@reddit
I just used the HF Space to generate a video of green rubber kangaroos jumping around on an alien planet, and the quality was comparable to the examples.
Yes_but_I_think@reddit
https://i.redd.it/mvnbryghb8ld1.gif
For the prompt (created with help of glm-4) "The video opens with a majestic landscape, the ground teeming with life as various birds forage peacefully. Suddenly, dark clouds gather, and a torrential downpour begins, sending smaller birds into a flurry, darting away to seek refuge. Amidst the chaos, an eagle, with its powerful wings, starts to ascend rapidly. It climbs higher, its determined gaze fixed on the sky, until it punctures the dark canopy of clouds. The eagle continues its ascent, breaking through the storm into the serenity above, where the sun still shines. The bird is then shown gliding effortlessly, a look of triumph on its face as it shakes off droplets of water. The scene fades to a close-up of the eagle, its expression one of contentment and pride. "
A good start. I probably overestimated what can be generated in just 6 seconds. It took 700 seconds.
Homberger@reddit
GitHub repo: https://github.com/THUDM/CogVideo
formalsystem@reddit
Hugging Face got this running with 8GB of VRAM using torchao: https://x.com/aryanvs_/status/1828405977667793005
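The linked approach quantizes the transformer with torchao before assembling the pipeline and combines that with offloading and VAE tiling. Roughly, as a sketch (assuming recent torchao and diffusers; this is not the exact script from the post):

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# Quantize the 5B transformer's linear weights to int8 in place.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Build the pipeline around the quantized transformer and add the usual memory savers.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()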