MotionVLA

MotionVLA is an end-to-end vision-language-action model for humanoid motion generation. It pairs a Qwen3.5 autoregressive backbone, conditioned on a scene image and a text instruction, with DSFT (Dual-Stream Frequency-domain Tokenizer), a motion tokenizer that separates low-frequency pose semantics from high-frequency physical dynamics.

Repository Contents

This HuggingFace repository contains:

Path	Description
`tokenizer/`	DSFT tokenizer checkpoints
`tokenizer/base/`	Base stream BPE tokenizer (4096 vocab, 201-dim DCT)
`tokenizer/phys/`	Phys stream BPE tokenizer (4096 vocab, 75-dim DCT)
`dataset/`	Dataset index files (`motion_path` entries are relative paths)

Motion data files (.pt) and images are stored in the companion dataset repo: [your-hf-username]/MotionVLA-Dataset

Tokenizer Design

The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:

276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans   ← low-freq semantic
Phys  (75-dim): joints_vel + root_vel + root_trans_vel             ← high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence  (K=5,  vocab=4096)
Phys tokens: ~40/sequence   (K=15, vocab=4096)
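
The frequency-domain step in the diagram can be illustrated with a small NumPy sketch. It is not the DSFT implementation. It assumes an orthonormal DCT-II, reads "top K" as the K lowest-frequency coefficients, and assumes the Base stream occupies the first 201 columns and the Phys stream the last 75. The function names are hypothetical, and the sketch stops before BPE encoding, which the diagram does not detail.

```python
import numpy as np

def dct2_ortho_matrix(T: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (T, T); row k is the k-th basis vector."""
    n = np.arange(T)
    k = n[:, None]
    mat = np.cos(np.pi * (2 * n[None, :] + 1) * k / (2 * T)) * np.sqrt(2.0 / T)
    mat[0] *= 1.0 / np.sqrt(2.0)  # rescale the DC row so the matrix is orthogonal
    return mat

def truncated_dct(stream: np.ndarray, K: int) -> np.ndarray:
    """(T, D) -> (K, D): DCT along the time axis, keeping the K lowest frequencies."""
    T = stream.shape[0]
    return dct2_ortho_matrix(T)[:K] @ stream

def inverse_truncated_dct(coeffs: np.ndarray, T: int) -> np.ndarray:
    """(K, D) -> (T, D): treat the dropped coefficients as zero and invert."""
    K = coeffs.shape[0]
    return dct2_ortho_matrix(T)[:K].T @ coeffs

# Synthetic stand-in for a (T, 276) motion clip; real clips come from the dataset.
rng = np.random.default_rng(0)
motion = np.cumsum(rng.normal(size=(60, 276)), axis=0)

# Assumption: Base is the first 201 columns, Phys the remaining 75.
base, phys = motion[:, :201], motion[:, 201:]
base_coeffs = truncated_dct(base, K=5)    # shape (5, 201)
phys_coeffs = truncated_dct(phys, K=15)   # shape (15, 75)
base_approx = inverse_truncated_dct(base_coeffs, T=60)  # shape (60, 201)
```

Because the number of kept coefficients is fixed by K, the coefficient array has the same shape for any clip length T, which is why the decoder needs T to reconstruct the frames.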

Each motion sample is laid out as a unified autoregressive sequence:

[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]

where b_i are Base tokens and p_j are Phys tokens. A phase-aware logit mask enforces the order BASE → SEP → PHYS → EOS at inference, so semantic pose structure is generated before high-frequency physical dynamics.
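
One way to picture the phase-aware mask is as a small state machine over the tokens generated so far. The sketch below is an illustration, not the model's actual inference code. It uses the ID ranges from the Token Vocabulary table, assumes decoding is already inside a motion span (so all text tokens are masked out), and allows either stream to be empty. `allowed_next_ids` and `mask_logits` are hypothetical names.

```python
import math

# ID ranges from the Token Vocabulary table.
BASE_IDS = range(248320, 252416)   # Base motion tokens
PHYS_IDS = range(252416, 256512)   # Phys motion tokens
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514

def allowed_next_ids(generated: list[int]) -> set[int]:
    """Return the token IDs permitted after the motion tokens generated so far."""
    if not generated:
        return {MOTION_BOS}                   # a motion span must open with BOS
    if generated[-1] == MOTION_EOS:
        return set()                          # span is finished
    if MOTION_SEP in generated:
        return set(PHYS_IDS) | {MOTION_EOS}   # PHYS phase: Phys tokens or EOS
    return set(BASE_IDS) | {MOTION_SEP}       # BASE phase: Base tokens or SEP

def mask_logits(logits: list[float], generated: list[int]) -> list[float]:
    """Set the logit of every disallowed token to -inf."""
    allowed = allowed_next_ids(generated)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
```

In a Hugging Face `generate` loop, a rule like this could be wrapped in a custom `LogitsProcessor`; whether MotionVLA does so is not stated here.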

Token Vocabulary

The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the ms-swift training pipeline):

Token type	ID range	Count
Base motion tokens	248320 – 252415	4096
Phys motion tokens	252416 – 256511	4096
MOTION_BOS	256512	1
MOTION_SEP	256513	1
MOTION_EOS	256514	1
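
The table suggests a simple offset scheme between each stream's local BPE IDs and the extended backbone vocabulary. The sketch below assumes that local ID i in a stream maps to the first ID of that stream's range plus i. That mapping is an assumption rather than something confirmed by the training code, and the helper names are hypothetical.

```python
# Values taken from the table above.
BASE_OFFSET = 248320   # first Base motion token ID
PHYS_OFFSET = 252416   # first Phys motion token ID
STREAM_VOCAB = 4096    # size of each stream's BPE vocabulary
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514

def to_backbone_sequence(base_tokens: list[int], phys_tokens: list[int]) -> list[int]:
    """Lay out local BPE IDs as [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS]."""
    for t in (*base_tokens, *phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"local token ID out of range: {t}")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [MOTION_EOS]
    )

def from_backbone_sequence(ids: list[int]) -> tuple[list[int], list[int]]:
    """Invert to_backbone_sequence, returning (base_tokens, phys_tokens)."""
    if not ids or ids[0] != MOTION_BOS or ids[-1] != MOTION_EOS:
        raise ValueError("sequence must start with MOTION_BOS and end with MOTION_EOS")
    sep = ids.index(MOTION_SEP)  # raises ValueError if the separator is missing
    base = [i - BASE_OFFSET for i in ids[1:sep]]
    phys = [i - PHYS_OFFSET for i in ids[sep + 1:-1]]
    return base, phys

seq = to_backbone_sequence([0, 4095], [0, 4095])
# seq == [256512, 248320, 252415, 256513, 252416, 256511, 256514]
```

Under this assumption the two streams never collide in the backbone vocabulary, so the same local ID can appear in both streams without ambiguity.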

Usage

from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np

# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"])
# base_recon: (T, 201), phys_recon: (T, 75)

Code

Training code and model architecture: GitHub

Citation

@article{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
  year={2026}
}
