HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
Xin Zhou1, Dingkang Liang1, Xiwu Chen2, Feiyang Tan2, Dingyuan Zhang1, Hengshuang Zhao3, Xiang Bai1
1Huazhong University of Science and Technology, 2Mach Drive, 3The University of Hong Kong
Abstract
Driving world models are important for autonomous driving because they simulate environmental dynamics and predict how a scene will evolve. Existing methods usually focus on future scene generation, while comprehensive 3D scene understanding is often handled by separate vision-language models. This separation leaves a gap between semantic interpretation and physical simulation. HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction in a single framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information into a structure compatible with Large Language Models (LLMs). LLM-enhanced world queries transfer semantic knowledge from the understanding branch to the generation branch, while a Current-to-Future Link conditions future geometric evolution on both scene context and language reasoning. HERMES++ further introduces Joint Geometric Optimization, combining explicit point-cloud constraints and implicit latent regularization to preserve structural consistency. Extensive experiments show that HERMES++ achieves strong performance on both future point cloud prediction and 3D scene understanding.
TL;DR
- Unified driving world model: jointly supports 3D scene understanding and future geometry prediction.
- BEV representation for LLMs: compresses multi-view visual inputs into spatially consistent BEV tokens.
- LLM-enhanced world queries: transfer semantic and world knowledge from language reasoning to future generation.
- Current-to-Future Link: bridges current scene understanding and future geometric evolution.
- Textual Injection: uses text embeddings as conditioning signals for future scene generation.
- Joint Geometric Optimization: aligns latent features with geometry-aware priors through explicit and implicit constraints.
Method Overview
HERMES++ unifies understanding and generation around a shared BEV representation:
- Multi-view images are encoded and projected into BEV space.
- BEV features are compressed into LLM-compatible visual tokens.
- The LLM performs scene understanding and enriches world queries with semantic knowledge.
- The Current-to-Future Link generates future latent representations conditioned on current BEV features, textual semantics, and future ego-motion.
- A future geometry decoder predicts future point clouds, optimized with Joint Geometric Optimization.
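The pipeline above can be sketched as a data-flow skeleton. This is a minimal, hypothetical illustration using numpy stubs with assumed tensor shapes (view count, BEV resolution, token budget, query count); the real model replaces every stub with a learned network, and none of the function names below come from the HERMES++ codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
N_VIEWS, H, W, C = 6, 32, 32, 64   # multi-view feature maps
BEV_H, BEV_W = 16, 16              # BEV grid resolution
N_TOKENS, D_LLM = 64, 128          # LLM-compatible token budget
N_POINTS = 1024                    # predicted future point cloud size

def encode_to_bev(views):
    """Project multi-view features into a shared BEV grid (stub: mean-pool + crop)."""
    pooled = views.mean(axis=0)            # (H, W, C)
    return pooled[:BEV_H, :BEV_W]          # (BEV_H, BEV_W, C)

def compress_to_tokens(bev):
    """Compress BEV features into a fixed set of LLM-compatible visual tokens."""
    flat = bev.reshape(-1, bev.shape[-1])  # (BEV_H * BEV_W, C)
    proj = rng.standard_normal((flat.shape[-1], D_LLM)) / np.sqrt(flat.shape[-1])
    return (flat @ proj)[:N_TOKENS]        # (N_TOKENS, D_LLM)

def llm_enrich(world_queries, tokens):
    """Stand-in for the LLM: mix semantic scene context into the world queries."""
    context = tokens.mean(axis=0, keepdims=True)     # (1, D_LLM)
    return world_queries + context                   # broadcast over queries

def current_to_future_link(bev, queries, ego_motion):
    """Condition a future latent on current BEV, enriched queries, and ego-motion."""
    scene = bev.mean(axis=(0, 1))                    # (C,)
    semantic = queries.mean(axis=0)[: scene.shape[0]]
    return scene + semantic + ego_motion             # (C,)

def decode_future_points(latent):
    """Stub geometry decoder: map the future latent to a point cloud."""
    proj = rng.standard_normal((latent.shape[0], N_POINTS * 3)) / np.sqrt(latent.shape[0])
    return (latent @ proj).reshape(N_POINTS, 3)

views = rng.standard_normal((N_VIEWS, H, W, C))
bev = encode_to_bev(views)
tokens = compress_to_tokens(bev)
queries = llm_enrich(rng.standard_normal((8, D_LLM)), tokens)
ego_motion = rng.standard_normal(C)
latent = current_to_future_link(bev, queries, ego_motion)
future_points = decode_future_points(latent)
print(tokens.shape, latent.shape, future_points.shape)
```

The point of the sketch is the wiring, not the operators: understanding (tokens through the LLM stand-in) feeds generation (the Current-to-Future Link), which is exactly the coupling the paper's world queries provide.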
Checkpoints
The released checkpoints are stored under the ckpt/ directory:
- ckpt/hermes++.stage1.pth
- ckpt/hermes++.stage2.1.pth
- ckpt/hermes++.stage2.2.pth
- ckpt/hermes++.stage3.pth
Please refer to the GitHub repository for code, environment setup, data preparation, and evaluation details.
Links
- HERMES++ code: https://github.com/H-EmbodVis/HERMESV2
- HERMES++ project page: https://h-embodvis.github.io/HERMESV2/
- HERMES conference version: https://github.com/LMD0311/HERMES
- HERMES conference paper: https://arxiv.org/abs/2501.14729
License
The code and released model files are provided under the Apache 2.0 license.
Citation
@article{zhou2026hermespp,
title={HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation},
author={Zhou, Xin and Liang, Dingkang and Chen, Xiwu and Tan, Feiyang and Zhang, Dingyuan and Zhao, Hengshuang and Bai, Xiang},
journal={arXiv preprint},
year={2026}
}
@inproceedings{zhou2025hermes,
title={HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation},
author={Zhou, Xin and Liang, Dingkang and Tu, Sifan and Chen, Xiwu and Ding, Yikang and Zhang, Dingyuan and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}