HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
Xin Zhou1, Dingkang Liang1, Xiwu Chen2, Feiyang Tan2, Dingyuan Zhang1, Hengshuang Zhao3, Xiang Bai1
1Huazhong University of Science and Technology, 2Mach Drive, 3The University of Hong Kong
Abstract
Driving world models are important for autonomous driving because they simulate environmental dynamics and predict how a scene will evolve. Existing methods usually focus on future scene generation, while comprehensive 3D scene understanding is often handled by separate vision-language models. This separation leaves a gap between semantic interpretation and physical simulation. HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction in a single framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information into a structure compatible with Large Language Models (LLMs). LLM-enhanced world queries transfer semantic knowledge from the understanding branch to the generation branch, while a Current-to-Future Link conditions future geometric evolution on both scene context and language reasoning. HERMES++ further introduces Joint Geometric Optimization, combining explicit point-cloud constraints and implicit latent regularization to preserve structural consistency. Extensive experiments show that HERMES++ achieves strong performance on both future point cloud prediction and 3D scene understanding.
TL;DR
- Unified driving world model: jointly supports 3D scene understanding and future geometry prediction.
- BEV representation for LLMs: compresses multi-view visual inputs into spatially consistent BEV tokens.
- LLM-enhanced world queries: transfer semantic and world knowledge from language reasoning to future generation.
- Current-to-Future Link: bridges current scene understanding and future geometric evolution.
- Textual Injection: uses text embeddings as conditioning signals for future scene generation.
- Joint Geometric Optimization: aligns latent features with geometry-aware priors through explicit and implicit constraints.
Method Overview
HERMES++ unifies understanding and generation around a shared BEV representation:
- Multi-view images are encoded and projected into BEV space.
- BEV features are compressed into LLM-compatible visual tokens.
- The LLM performs scene understanding and enriches world queries with semantic knowledge.
- The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion.
- A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
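The pipeline above can be sketched as a data-flow skeleton. This is a minimal, hypothetical illustration using numpy stubs with assumed tensor shapes (view count, BEV resolution, token budget, query count); the real model replaces every stub with a learned network, and none of the function names below come from the HERMES++ codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
N_VIEWS, H, W, C = 6, 32, 32, 64   # multi-view feature maps
BEV_H, BEV_W = 16, 16              # BEV grid resolution
N_TOKENS, D_LLM = 64, 128          # LLM-compatible token budget
N_POINTS = 1024                    # predicted future point cloud size

def encode_to_bev(views):
    """Project multi-view features into a shared BEV grid (stub: mean-pool + crop)."""
    pooled = views.mean(axis=0)            # (H, W, C)
    return pooled[:BEV_H, :BEV_W]          # (BEV_H, BEV_W, C)

def compress_to_tokens(bev):
    """Compress BEV features into a fixed set of LLM-compatible visual tokens."""
    flat = bev.reshape(-1, bev.shape[-1])  # (BEV_H * BEV_W, C)
    proj = rng.standard_normal((flat.shape[-1], D_LLM)) / np.sqrt(flat.shape[-1])
    return (flat @ proj)[:N_TOKENS]        # (N_TOKENS, D_LLM)

def llm_enrich(world_queries, tokens):
    """Stand-in for the LLM: mix semantic scene context into the world queries."""
    context = tokens.mean(axis=0, keepdims=True)     # (1, D_LLM)
    return world_queries + context                   # broadcast over queries

def current_to_future_link(bev, queries, ego_motion):
    """Condition a future latent on current BEV, enriched queries, and ego-motion."""
    scene = bev.mean(axis=(0, 1))                    # (C,)
    semantic = queries.mean(axis=0)[: scene.shape[0]]
    return scene + semantic + ego_motion             # (C,)

def decode_future_points(latent):
    """Stub geometry decoder: map the future latent to a point cloud."""
    proj = rng.standard_normal((latent.shape[0], N_POINTS * 3)) / np.sqrt(latent.shape[0])
    return (latent @ proj).reshape(N_POINTS, 3)

views = rng.standard_normal((N_VIEWS, H, W, C))
bev = encode_to_bev(views)
tokens = compress_to_tokens(bev)
queries = llm_enrich(rng.standard_normal((8, D_LLM)), tokens)
ego_motion = rng.standard_normal(C)
latent = current_to_future_link(bev, queries, ego_motion)
future_points = decode_future_points(latent)
print(tokens.shape, latent.shape, future_points.shape)
```

The point of the sketch is the wiring, not the operators: understanding (tokens through the LLM stand-in) feeds generation (the Current-to-Future Link), which is exactly the coupling the paper's world queries provide.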
Checkpoints
The released checkpoints are stored under the ckpt/ directory:
- ckpt/hermes++.stage1.pth
- ckpt/hermes++.stage2.1.pth
- ckpt/hermes++.stage2.2.pth
- ckpt/hermes++.stage3.pth
Please refer to the GitHub repository for code, environment setup, data preparation, and evaluation details.
Links
- HERMES++ code: https://github.com/H-EmbodVis/HERMESV2
- HERMES++ project page: https://h-embodvis.github.io/HERMESV2/
- HERMES conference version: https://github.com/LMD0311/HERMES
- HERMES conference paper: https://arxiv.org/abs/2501.14729
License
The code and released model files are provided under the Apache 2.0 license.
Citation
@article{zhou2026hermespp,
title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
journal={arXiv preprint},
year={2026}
}
@inproceedings{zhou2025hermes,
title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}