---
language: [en]
license: other
library_name: motioncrafter
tags:
- motion
- video
- 4d
- diffusion
- scene-flow
pipeline_tag: image-to-3d
base_model: stabilityai/stable-video-diffusion-img2vid-xt
---

<h1 align="center" style="font-size: 1.6em;">MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE</h1>

<p align="center"><strong>🎉 Accepted by CVPR 2026 (Highlight🔥)</strong></p>

<div align="center">

[Ruijie Zhu](https://ruijiezhu94.github.io/ruijiezhu/)<sup>1,2</sup>,
[Jiahao Lu](https://scholar.google.com/citations?user=cRpteW4AAAAJ&hl=en)<sup>3</sup>,
[Wenbo Hu](https://wbhu.github.io/)<sup>2</sup>,
[Xiaoguang Han](https://scholar.google.com/citations?user=z-rqsR4AAAAJ&hl=en)<sup>4</sup><br>
[Jianfei Cai](https://jianfei-cai.github.io/)<sup>5</sup>,
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>,
[Chuanxia Zheng](https://physicalvision.github.io/people/~chuanxia)<sup>1</sup>

<sup>1</sup> NTU <sup>2</sup> ARC Lab, Tencent PCG <sup>3</sup> HKUST <sup>4</sup> CUHK(SZ) <sup>5</sup> Monash University

[📄 Paper](https://arxiv.org/abs/2602.08961) | [🌐 Project Page](https://ruijiezhu94.github.io/MotionCrafter_Page/) | [🎬 YouTube Video](https://youtu.be/oc0fRoZTyk8) | [💻 Code](https://github.com/TencentARC/MotionCrafter) | [📜 License](LICENSE.txt)

</div>

## Model Description

MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame within a shared world coordinate system, without requiring post-optimization.

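To make the two outputs concrete, the sketch below uses NumPy arrays as stand-ins for a predicted point map and scene flow. The array shape `(T, H, W, 3)`, the convention that the flow at frame `t` displaces each point toward its position in the next frame, and the helper name `advect_points` are assumptions made for illustration; they are not the library's documented output format.

```python
import numpy as np


def advect_points(point_map: np.ndarray, scene_flow: np.ndarray) -> np.ndarray:
    """Move every 3D point by its per-pixel flow vector.

    Both arrays are assumed to have shape (T, H, W, 3) and to be expressed
    in the same world coordinate system, so the update is a plain addition.
    """
    if point_map.shape != scene_flow.shape or point_map.shape[-1] != 3:
        raise ValueError("expected two matching arrays of shape (T, H, W, 3)")
    return point_map + scene_flow


# Toy data (hypothetical): 2 frames of a 4x5 image, all points at the origin.
T, H, W = 2, 4, 5
points = np.zeros((T, H, W, 3), dtype=np.float32)
flow = np.zeros_like(points)
flow[..., 0] = 0.1  # every point moves 0.1 units along x

moved = advect_points(points, flow)
speed = np.linalg.norm(flow, axis=-1)  # per-pixel motion magnitude, shape (T, H, W)
```

If that convention holds, no per-frame camera transform is needed before combining the two quantities, because both are already expressed in the shared world frame.
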
## Intended Use

- Research on 4D reconstruction and motion estimation from monocular videos
- Academic evaluation and benchmarking of dense point map and scene flow prediction

Not intended for safety-critical or real-time production use.

## Limitations

- Performance can degrade with extreme motion blur or severe occlusion.
- Output quality is sensitive to input resolution and video quality.
- Generalization may be limited for out-of-domain scenes.

## Training Data

Training datasets and preprocessing are described in the paper, the project page, and the main repository.

## Evaluation

Please refer to the paper for evaluation datasets, metrics, and results.

## How to Use

```python
import torch
from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for diffusion version
cache_dir = "./pretrained_models"

# Load the UNet for the selected variant in half precision and freeze it.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load the 4D geometry/motion VAE in full precision and freeze it.
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Build the pipeline on top of the Stable Video Diffusion base model,
# swapping in the MotionCrafter UNet loaded above.
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir
    ).to("cuda")
```

## Model Weights

- geometry_motion_vae/: 4D VAE for joint geometry and motion representation
- unet_determ/: deterministic UNet for motion prediction
- unet_diff/: diffusion UNet for motion prediction (the probabilistic variant)

## Model Variants

- Deterministic (unet_determ): fast inference with fixed predictions per input
- Diffusion (unet_diff): probabilistic predictions with diverse outputs

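Choosing a variant comes down to which UNet subfolder and which pipeline class get loaded. A minimal sketch of that mapping follows. The `VariantSpec` dataclass, the `VARIANTS` table, and the `resolve_variant` helper are hypothetical names introduced here for illustration; the subfolder and class names are the ones used in the How to Use snippet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSpec:
    unet_subfolder: str  # folder inside the weights repository
    pipeline_class: str  # name of the pipeline class in the motioncrafter package


# Hypothetical lookup table; not part of the motioncrafter package.
VARIANTS = {
    "determ": VariantSpec("unet_determ", "MotionCrafterDetermPipeline"),
    "diff": VariantSpec("unet_diff", "MotionCrafterDiffPipeline"),
}


def resolve_variant(model_type: str) -> VariantSpec:
    """Return the spec for a variant name, rejecting unknown names."""
    try:
        return VARIANTS[model_type]
    except KeyError:
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of {sorted(VARIANTS)}"
        ) from None


spec = resolve_variant("determ")
print(spec.unet_subfolder, spec.pipeline_class)
```

The inline conditional in the How to Use snippet falls back to the deterministic UNet for any value other than `"diff"`. A lookup like this one raises on a mistyped variant name instead of silently loading the deterministic weights.
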
## Citation

```bibtex
@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}
```

## License

This model is provided under the Tencent License. See [LICENSE.txt](LICENSE.txt) for details.

## Acknowledgments

This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.