MotionVLA

MotionVLA is an end-to-end vision-language-action model for humanoid motion generation. It combines a Qwen3.5 autoregressive backbone (conditioned on a scene image and a text instruction) with DSFT (Dual-Stream Frequency-domain Tokenizer), which decouples low-frequency pose semantics from high-frequency physical dynamics.

Repository Contents

This HuggingFace repository contains:

| Path | Description |
| --- | --- |
| tokenizer/ | DSFT tokenizer checkpoints |
| tokenizer/base/ | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| tokenizer/phys/ | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| dataset/ | Dataset index files (motion_path → relative paths) |

Motion data files (.pt) and images are stored in the companion dataset repo: [your-hf-username]/MotionVLA-Dataset

Tokenizer Design

The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:

276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans   ← low-freq semantic
Phys  (75-dim): joints_vel + root_vel + root_trans_vel             ← high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence  (K=5,  vocab=4096)
Phys tokens: ~40/sequence   (K=15, vocab=4096)
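
The split and DCT truncation steps above can be illustrated with a minimal, standard-library sketch. It is not the DSFT implementation. It assumes that the Base channels occupy the first 201 columns of each frame, that "top K" means the K lowest-frequency coefficients, and that an orthonormal DCT-II is used; all function names here are hypothetical, and quantization and BPE encoding are omitted.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of one motion channel sampled over T frames."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(coeffs, n):
    """Inverse of dct_ii; coefficients beyond len(coeffs) are treated as zero."""
    out = []
    for t in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out

def split_streams(frame, base_dim=201):
    """Split one 276-dim frame into (base, phys).

    Assumption: Base channels come first; the real column order may differ.
    """
    return frame[:base_dim], frame[base_dim:]

def truncate_channel(x, k):
    """Keep K coefficients. Assumption: the K lowest-frequency ones."""
    return dct_ii(x)[:k]

# A slowly varying channel is reconstructed closely from K=5 coefficients.
T = 32
channel = [math.sin(2 * math.pi * t / T) for t in range(T)]
recon = idct_ii(truncate_channel(channel, 5), T)
max_err = max(abs(a - b) for a, b in zip(channel, recon))
```

Under these assumptions, a smaller K keeps only the slow trend of a channel, while a larger K retains more of its rapid variation.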

Each motion sample is laid out as a unified autoregressive sequence:

[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]

where b_i are Base tokens and p_j are Phys tokens. A phase-aware logit mask enforces the order BASE → SEP → PHYS → EOS at inference, so semantic pose structure is generated before high-frequency physical dynamics.
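
One way to picture the phase-aware mask is as a small state machine over token kinds. The sketch below is a hypothetical illustration, not the repository's inference code. It assumes each stream contributes at least one token, and the names `ALLOWED_NEXT`, `mask_logits`, and `greedy_step` are invented for this example.

```python
NEG_INF = float("-inf")

# Token kinds that may follow the previously generated kind.
# Assumption: each stream has at least one token, so BOS is never followed
# directly by SEP, and SEP is never followed directly by EOS.
ALLOWED_NEXT = {
    "BOS": {"BASE"},
    "BASE": {"BASE", "SEP"},
    "SEP": {"PHYS"},
    "PHYS": {"PHYS", "EOS"},
    "EOS": set(),
}

def mask_logits(logits, kinds, prev_kind):
    """Return a copy of logits with disallowed token IDs set to -inf.

    logits: one float per vocabulary ID.
    kinds:  the kind ("BOS", "BASE", "SEP", "PHYS", "EOS") of each ID.
    """
    allowed = ALLOWED_NEXT[prev_kind]
    return [x if kinds[i] in allowed else NEG_INF for i, x in enumerate(logits)]

def greedy_step(logits, kinds, prev_kind):
    """Pick the highest-scoring allowed token; do not call after EOS."""
    masked = mask_logits(logits, kinds, prev_kind)
    best = max(range(len(masked)), key=masked.__getitem__)
    return best, kinds[best]

# Toy 5-token vocabulary: the unmasked argmax is always ID 1 (a PHYS token),
# but the mask blocks it until SEP has been generated.
kinds = ["BASE", "PHYS", "BOS", "SEP", "EOS"]
logits = [0.1, 5.0, 0.2, 3.0, 4.0]
first_id, first_kind = greedy_step(logits, kinds, "BOS")
```

In a real decoder the same mask would be added to the backbone's logits at every step, with `prev_kind` derived from the last generated ID.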

Token Vocabulary

The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the ms-swift training pipeline):

| Token type | ID range | Count |
| --- | --- | --- |
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |
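
The ID ranges above suggest a simple offset mapping from per-stream BPE IDs to backbone vocabulary IDs. The following sketch assumes that the stream-local IDs are 0-indexed in [0, 4096) and that the mapping is a plain offset; the training pipeline may differ, and the helper names are hypothetical.

```python
BASE_OFFSET = 248320   # first Base motion token ID
PHYS_OFFSET = 252416   # first Phys motion token ID
STREAM_VOCAB = 4096    # BPE vocabulary size of each stream
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514

def to_backbone_ids(base_tokens, phys_tokens):
    """Build [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS] in backbone IDs."""
    for t in list(base_tokens) + list(phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"stream-local token ID out of range: {t}")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [MOTION_EOS]
    )

def from_backbone_ids(seq):
    """Invert to_backbone_ids, returning (base_tokens, phys_tokens)."""
    if not seq or seq[0] != MOTION_BOS or seq[-1] != MOTION_EOS:
        raise ValueError("sequence must start with M_BOS and end with M_EOS")
    sep = seq.index(MOTION_SEP)  # raises ValueError if M_SEP is missing
    base = [i - BASE_OFFSET for i in seq[1:sep]]
    phys = [i - PHYS_OFFSET for i in seq[sep + 1:-1]]
    return base, phys

# The boundary IDs of each stream land on the ends of the ranges above.
seq = to_backbone_ids([0, 4095], [0, 4095])
# seq == [256512, 248320, 252415, 256513, 252416, 256511, 256514]
```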

Usage

from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np

# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"])
# base_recon: (T, 201), phys_recon: (T, 75)

Code

Training code and model architecture: GitHub

Citation

@article{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
  year={2026}
}