Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu

Published: 2025-03-10
Tasks: Video Editing, Single-Domain Subject-to-Video, Open-Domain Subject-to-Video, Human-Domain Subject-to-Video, All, Video Generation

Abstract

Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.
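The abstract's central abstraction is the Video Condition Unit (VCU): a single interface that bundles the text prompt, source frames, masks, and reference images that different tasks require. The page does not spell out that interface, so the following is only a minimal sketch of what such a unified condition container might look like; the class name, field names, and the task-inference rule are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


# Hypothetical sketch of a VCU-style unified input. Field names and the
# task-inference heuristic are assumptions, not the paper's interface.
@dataclass
class VideoConditionUnit:
    prompt: str                                   # text description shared by all tasks
    frames: Optional[np.ndarray] = None           # (T, H, W, 3) source video for editing tasks
    masks: Optional[np.ndarray] = None            # (T, H, W) region masks for masked editing
    references: List[np.ndarray] = field(default_factory=list)  # reference images

    def task_kind(self) -> str:
        """Infer the task family from which inputs are present."""
        if self.frames is None and self.references:
            return "reference-to-video"
        if self.frames is not None and self.masks is not None:
            return "masked video-to-video"
        if self.frames is not None:
            return "video-to-video"
        return "text-to-video"


# Example: a reference-to-video request built from one subject image.
subject = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder reference image
vcu = VideoConditionUnit(prompt="a corgi surfing at sunset", references=[subject])
print(vcu.task_kind())  # -> "reference-to-video"
```

The point of such a container is that every task reduces to the same call signature, which is what lets a single model (plus the Context Adapter described in the abstract) dispatch on whichever conditions are supplied.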

Results

All results are reported on the OpenS2V-Eval benchmark. The leaderboard lists identical scores under four task categories (Video, Video Generation, "1 Image, 2*2 Stitching", and Image to Video Generation), so they are shown once here.

Metric       | Wan2.1-VACE-14B | Wan2.1-VACE-1.3B | Wan2.1-VACE-1.3B-Preview
------------ | --------------- | ---------------- | ------------------------
Aesthetics   | 0.4721          | 0.4824           | 0.4727
FaceSim      | 0.5509          | 0.2058           | 0.1658
GmeScore     | 0.6727          | 0.7126           | 0.7138
Motion       | 0.1502          | 0.1883           | 0.1203
NaturalScore | 0.7278          | 0.7178           | 0.7056
NexusScore   | 0.442           | 0.3795           | 0.4004
Total Score  | 0.5287          | 0.4553           | 0.4395
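For quick programmatic comparison, the same scores can be kept in a small lookup structure. The snippet below is only a convenience sketch over the numbers in the table above; the higher-is-better assumption for every metric is ours, not something stated on this page.

```python
# Deduplicated OpenS2V-Eval scores from the table above, as plain dicts.
RESULTS = {
    "Wan2.1-VACE-14B": {
        "Aesthetics": 0.4721, "FaceSim": 0.5509, "GmeScore": 0.6727, "Motion": 0.1502,
        "NaturalScore": 0.7278, "NexusScore": 0.442, "Total Score": 0.5287,
    },
    "Wan2.1-VACE-1.3B": {
        "Aesthetics": 0.4824, "FaceSim": 0.2058, "GmeScore": 0.7126, "Motion": 0.1883,
        "NaturalScore": 0.7178, "NexusScore": 0.3795, "Total Score": 0.4553,
    },
    "Wan2.1-VACE-1.3B-Preview": {
        "Aesthetics": 0.4727, "FaceSim": 0.1658, "GmeScore": 0.7138, "Motion": 0.1203,
        "NaturalScore": 0.7056, "NexusScore": 0.4004, "Total Score": 0.4395,
    },
}

# Report the best-scoring variant per metric (assuming higher is better throughout).
metrics = next(iter(RESULTS.values())).keys()
for metric in metrics:
    best = max(RESULTS, key=lambda model: RESULTS[model][metric])
    print(f"{metric:>12}: {best} ({RESULTS[best][metric]})")
```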

Related Papers

World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
Modeling Code: Is Text All You Need? (2025-07-15)
All Eyes, no IMU: Learning Flight Attitude from Vision Alone (2025-07-15)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)