meituan-longcat/LongCat-Video-Avatar-1.5 · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | 15 comments
# 🚀 Model Introduction
We are excited to announce the release of LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes extreme empirical optimization and production-readiness for audio-driven human video generation. Built upon the LongCat-Video foundation model, v1.5 delivers highly stable, commercial-grade avatar video synthesis supporting native tasks including Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and Video Continuation, with seamless compatibility for both single-stream and multi-stream audio inputs.
# Key Features
* 🌟 **Upgraded Audio Encoder (Whisper-Large)**: Replaces Wav2Vec2 with Whisper-Large, yielding significantly smoother and more natural lip dynamics.
* 🌟 **Production-Ready Stability**: Achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency.
* 🌟 **Stylized Domain Generalization**: Robustly generalizes to anime, animals, and complex real-world conditions such as multi-person interactions and object handling.
* 🌟 **Efficient 8-Step Inference**: Advanced DMD2-based step distillation accelerates inference to 8 NFE, balancing cost-effective serving with exceptional visual fidelity.
# 📊 Human Evaluation
We introduce a comprehensive human evaluation benchmark specifically tailored for audio-driven digital human generation. The benchmark encompasses 6 application scenarios (News Broadcasting, Knowledge Education, Daily Life, Entertainment, Singing, Commercial Promotion), 2 languages (Chinese/English), and 2 visual styles (Realistic/Animated), yielding a total of 508 image-audio source pairs.

Evaluation methodology:

* **Subjective Track**: 770 crowdsourced evaluators rated each generated video on a 1–5 human-likeness scale, yielding 13,240 judgments.
* **Objective Track**: 10 domain experts conducted structured quality analysis across four dimensions: Physical Rationality, Harmony (Audio-Visual Coordination), Temporal Stability, and Identity Consistency.
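
To get a feel for the scale of the benchmark, here is a rough back-of-the-envelope sketch in Python. It assumes the scenarios, languages, and styles are fully crossed, which the model card does not state, and it only computes averages; the actual distribution of pairs per category and of judgments per evaluator is not given in the source.

```python
from itertools import product

# Categories as listed in the benchmark description.
SCENARIOS = [
    "News Broadcasting", "Knowledge Education", "Daily Life",
    "Entertainment", "Singing", "Commercial Promotion",
]
LANGUAGES = ["Chinese", "English"]
STYLES = ["Realistic", "Animated"]

# Totals quoted in the model card.
TOTAL_PAIRS = 508
EVALUATORS = 770
JUDGMENTS = 13_240

# Assumption: every scenario/language/style combination is present.
cells = list(product(SCENARIOS, LANGUAGES, STYLES))

# Averages only; the real per-cell and per-evaluator counts are unknown.
pairs_per_cell = TOTAL_PAIRS / len(cells)
judgments_per_evaluator = JUDGMENTS / EVALUATORS

print(len(cells))                          # 24
print(round(pairs_per_cell, 1))            # 21.2
print(round(judgments_per_evaluator, 1))   # 17.2
```

So under that assumption each scenario/language/style combination would have about 21 source pairs on average, and each crowdsourced evaluator contributed about 17 ratings on average.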
# ⚖️ License Agreement
The **model weights** are released under the **MIT License**.