Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline
Posted by Inevitable-Log5414@reddit | LocalLLaMA | View on Reddit | 18 comments
Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.
Pipeline (8 stages, all sequential on the same GPU):
- Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
- Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
- Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
- Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
- Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
- Music - ACE-Step v1 generates a 30s instrumental from Director's brief
- Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
- Mix - ffmpeg with per-shot voice-over aligned via adelay
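For anyone curious about the mix step, here's a minimal sketch of how per-shot adelay alignment can look when driven from Python. Filenames, shot offsets, and the music gain are made up, and the actual filtergraph in the repo may differ:

```python
import subprocess

# Hypothetical mix: one music bed plus per-shot voice-over, each VO delayed (adelay,
# in milliseconds) to the start time of its shot. All paths/offsets are illustrative.
shots = [  # (voice-over wav, shot start time in seconds)
    ("vo_shot1.wav", 0.0),
    ("vo_shot2.wav", 5.0),
    ("vo_shot3.wav", 10.0),
]

inputs = ["-i", "reel_video.mp4", "-i", "music.wav"]
filters = ["[1:a]volume=0.35[music]"]          # duck the music under the narration
labels = ["[music]"]
for i, (vo_path, start_s) in enumerate(shots):
    inputs += ["-i", vo_path]
    ms = int(start_s * 1000)
    filters.append(f"[{i + 2}:a]adelay={ms}|{ms}[vo{i}]")  # delay both channels equally
    labels.append(f"[vo{i}]")
filters.append(f"{''.join(labels)}amix=inputs={len(labels)}:normalize=0[mix]")

subprocess.run([
    "ffmpeg", "-y", *inputs,
    "-filter_complex", ";".join(filters),
    "-map", "0:v", "-map", "[mix]",
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "reel_final.mp4",
], check=True)
```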
Wan 2.2 specifics (the bit this sub will care about):
- 1280×720, not the 640×640 default. Costs more but matches what producers want
- 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
- flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
- Negative prompt: the verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens; the English translation is observably weaker
- Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
- Avoid the word "cinematic" - it triggers Wan's stylization branch and gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")
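To make those rules concrete, here's a rough per-shot settings sketch. The keys, helper function, and example prompt text below are illustrative, not the repo's actual config, and the Chinese negative prompt is left as a placeholder rather than reproduced:

```python
# Illustrative per-shot settings following the rules above; names are my own,
# not necessarily what the repo uses.
WAN_NEGATIVE_PROMPT = "<verbatim Chinese negative prompt from shared_config.py>"  # kept in Chinese on purpose

def build_shot_prompt(camera_verb: str, subject_action: str, setting: str) -> str:
    # One camera verb, sentence-case, placed first; lens/film tags instead of "cinematic".
    return (
        f"{camera_verb}. {subject_action}, {setting}. "
        "Arri Alexa, anamorphic lens, 35mm film grain."
    )

shot_settings = {
    "width": 1280,
    "height": 720,
    "num_frames": 81,         # 81 frames @ 16 fps native, the distribution Wan2.2 was trained on
    "fps": 16,
    "flow_shift": 5,          # 5 for hero shots, 8 for b-roll
    "negative_prompt": WAN_NEGATIVE_PROMPT,
    "prompt": build_shot_prompt(
        "Tracking shot following from behind",
        "a woman walks through a rain-soaked market",
        "night, neon signs reflected in puddles",
    ),
}
```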
Performance work:
- ParaAttention FBCache (lossless 2× on Wan2.2)
- torch.compile on transformer_2 (selective - the dual-expert MoE makes full compile flaky) - another 1.2×
- AITER MoE acceleration on the Qwen director (vLLM)
- End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X
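A minimal sketch of the FBCache + selective-compile setup, assuming the diffusers Wan2.2 I2V pipeline and the ParaAttention adapter import path - treat the model id, pipeline class, and import path as assumptions and check them against the repo and your installed versions before copying:

```python
import torch
from diffusers import WanImageToVideoPipeline
# Import path as shown in the ParaAttention README; verify against your installed version.
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# Model id / pipeline class per the diffusers Wan2.2 integration - assumptions, not
# necessarily what the repo actually loads.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# FBCache: ParaAttention's first-block cache applied to the whole pipe (the "lossless 2x" above).
apply_cache_on_pipe(pipe)

# Selective compile: only transformer_2 (the low-noise expert). Compiling both experts
# of the dual-expert MoE was flaky, so the high-noise expert stays eager.
pipe.transformer_2 = torch.compile(pipe.transformer_2, mode="max-autotune-no-cudagraphs")
```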
Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.
Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300
Hugging Face Space (documentation - a like on the Space helps 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300
Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.
Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.
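To give a head start on that last one: a simplified sketch of how the critic's failure labels can map onto retry strategies. The ten labels are the ones listed above; the mapping, function, and field names are illustrative pseudocode, not the repo's actual code.

```python
import random

# Illustrative label -> strategy mapping; the labels come from the post,
# the specific assignments here are my own.
RETRY_STRATEGY = {
    "character_drift":      "flf2v_anchor",    # re-anchor identity on the previous frame
    "extras_invade_frame":  "prompt_simplify",
    "camera_ignored":       "prompt_simplify", # restate the single camera verb, drop the rest
    "walking_backwards":    "reseed",
    "object_morphing":      "reseed",
    "hand_finger_artifact": "reseed",
    "wardrobe_drift":       "flf2v_anchor",
    "neon_glow_leak":       "prompt_simplify",
    "stylized_ai_look":     "prompt_simplify", # strip style words, keep lens/film tags
    "random_intimacy":      "reseed",
}

def retry_shot(shot, labels, render_fn, max_retries=2):
    """Re-render a failed shot with a targeted strategy picked from its critic labels."""
    clip = None
    for _ in range(max_retries):
        strategy = RETRY_STRATEGY.get(labels[0], "reseed") if labels else "reseed"
        if strategy == "reseed":
            shot["seed"] = random.randint(0, 2**31 - 1)
        elif strategy == "flf2v_anchor":
            shot["first_frame"] = shot["prev_shot_last_frame"]   # FLF2V continuation anchor
        elif strategy == "prompt_simplify":
            shot["prompt"] = shot["prompt_simplified"]
        clip, labels = render_fn(shot)   # render, then re-run the critic
        if not labels:                   # no failure labels -> accept the clip
            return clip
    return clip  # give up after max_retries, keep the last attempt
```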
Fun_Employment6042@reddit
Super sick project. One sentence in → full 720p reel out on a single MI300X is wild. Love the vision-critic + auto-retry loop and the 81f @ 16fps Wan2.2 choice. Starred the repo and dropped a like on the HF space 🙌
DeerWoodStudios@reddit
I have a suggestion for your workflow: add a step with Sony woosh to add sound effects to the video. That thing is a game changer; too bad it hasn't been covered much in this subreddit. Wan 2.2 + Sony woosh is the best combo I've found so far. Cheers.
ArugulaAnnual1765@reddit
Imo purely AI-generated videos are going nowhere and would be too inefficient to get right (making sure physics is correct, objects not disappearing).
The AI should instead drive an actual 3D pipeline where objects are physically simulated, then run AI generation on top, similar to a DLSS 5-style technique. I think whoever cracks this pipeline will have the key to generating realistic videos.
Inevitable-Log5414@reddit (OP)
It's a great idea, but it'd require more memory for creating and verifying each object, so we'd need more VRAM and more CUDA kernels. Saved for the future :)
ArugulaAnnual1765@reddit
Indeed a massive project, but when it's figured out it will change the whole world forever
Inevitable-Log5414@reddit (OP)
Yeah, if I get an AMD GPU by winning this hackathon I'll try to simulate it. Like my HF Space please if you can :)
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300
Inevitable-Log5414@reddit (OP)
I appreciate the help with the HF Space, guys. We're gaining likes fast; soon I'll be able to get HF tokens for developing new open-source projects ❤️
Lachimos@reddit
The girl looks different in every other scene and her hair has different length.
Inevitable-Log5414@reddit (OP)
It's not a LoRA model, so it can't fully preserve the character's look right now. The demo was made with the dev version; the new prod version is more powerful.
Inevitable-Log5414@reddit (OP)
Guys, if you liked my project, please leave a like on Hugging Face. This will help me place in the HF nomination and get API credits for future open-source projects.
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300
CalligrapherFar7833@reddit
Nice
Inevitable-Log5414@reddit (OP)
Thanks!
scottgal2@reddit
Very nice!
Inevitable-Log5414@reddit (OP)
Thanks!
noctrex@reddit
Very nice. I would suggest upgrading the audio and video stack to LTX 2.3, that can handle them both at the same time.
Inevitable-Log5414@reddit (OP)
Great idea, in the next version I'll add a model list so you can pick your favourite model. Star the repo if you're interested in updates :)
ferranpons@reddit
Wow! Very interesting! Thanks for sharing!
theOliviaRossi@reddit
wow!