MotionVLA
MotionVLA is an end-to-end vision-language-action model for humanoid motion generation. It combines a Qwen3.5 autoregressive backbone (conditioned on a scene image and a text instruction) with DSFT (Dual-Stream Frequency-domain Tokenizer), which decouples low-frequency pose semantics from high-frequency physical dynamics.
Repository Contents
This HuggingFace repository contains:
| Path | Description |
|---|---|
| tokenizer/ | DSFT tokenizer checkpoints |
| tokenizer/base/ | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| tokenizer/phys/ | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| dataset/ | Dataset index files (motion_path → relative paths) |
Motion data files (.pt) and images are stored in the companion dataset repo: [your-hf-username]/MotionVLA-Dataset
Tokenizer Design
The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:
276-dim motion (T frames)
↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans → low-freq semantic
Phys (75-dim): joints_vel + root_vel + root_trans_vel → high-freq dynamics
↓ DCT along time axis, keep top K coefficients
↓ BPE encoding
Base tokens: ~477/sequence (K=5, vocab=4096)
Phys tokens: ~40/sequence (K=15, vocab=4096)
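
The split and DCT steps in the diagram above can be sketched with NumPy alone. This is a minimal illustration, not the DSFT implementation: it assumes the first 201 dimensions form the Base stream and the last 75 the Phys stream, and it reads "keep top K coefficients" as keeping the K lowest-frequency coefficients. The helper names (`dct_matrix`, `truncated_dct`, `inverse_truncated_dct`) are hypothetical, and the BPE step is omitted.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct_matrix(T):
    """Orthonormal DCT-II basis of shape (T, T); row k is frequency k."""
    n = np.arange(T)
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * T))
    basis[0] *= np.sqrt(1.0 / T)
    basis[1:] *= np.sqrt(2.0 / T)
    return basis


def truncated_dct(stream, k):
    """DCT along the time axis, keeping the k lowest-frequency rows (assumption)."""
    return dct_matrix(stream.shape[0])[:k] @ stream  # (k, D)


def inverse_truncated_dct(coeffs, T):
    """Reconstruct T frames from the kept coefficients (missing ones are zero)."""
    return dct_matrix(T)[: coeffs.shape[0]].T @ coeffs  # (T, D)


# Toy input standing in for a real (T, 276) motion clip.
T = 60
motion = np.random.default_rng(0).standard_normal((T, BASE_DIM + PHYS_DIM))

# Assumed layout: Base dimensions first, Phys dimensions last.
base, phys = motion[:, :BASE_DIM], motion[:, BASE_DIM:]

base_coeffs = truncated_dct(base, k=5)    # K=5 for the Base stream
phys_coeffs = truncated_dct(phys, k=15)   # K=15 for the Phys stream
base_smooth = inverse_truncated_dct(base_coeffs, T)
```

Because the basis is orthonormal, keeping all T coefficients reproduces the input exactly; a smaller K discards the higher-frequency content of that stream.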
Each motion sample is laid out as a unified autoregressive sequence:
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
where b_i are Base tokens and p_j are Phys tokens. A phase-aware logit mask
enforces the order BASE → SEP → PHYS → EOS at inference, so semantic pose
structure is generated before high-frequency physical dynamics.
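
One way to picture the phase-aware mask is as a small state machine over the generated prefix. The sketch below is plain Python over a list of logits and uses the ID ranges from the Token Vocabulary table. The function names are hypothetical. It also assumes that M_SEP may follow M_BOS directly and that text tokens are never allowed inside a motion sequence. The actual inference code may differ on all of these points.

```python
import math

# ID ranges taken from the Token Vocabulary table.
BASE_START, BASE_END = 248320, 252415
PHYS_START, PHYS_END = 252416, 256511
M_BOS, M_SEP, M_EOS = 256512, 256513, 256514


def allowed_next(prefix):
    """Return a predicate telling which token IDs may follow `prefix`."""
    if not prefix:
        return lambda t: t == M_BOS          # a motion sequence starts with M_BOS
    if prefix[-1] == M_EOS:
        return lambda t: False               # sequence is finished
    if M_SEP in prefix:
        # PHYS phase: only Phys tokens, or M_EOS to stop.
        return lambda t: PHYS_START <= t <= PHYS_END or t == M_EOS
    # BASE phase: only Base tokens, or M_SEP to switch streams.
    return lambda t: BASE_START <= t <= BASE_END or t == M_SEP


def mask_logits(logits, prefix):
    """Set the logits of disallowed tokens to -inf, leaving the rest unchanged."""
    ok = allowed_next(prefix)
    return [x if ok(i) else -math.inf for i, x in enumerate(logits)]


# Usage: in the BASE phase, Phys tokens and M_EOS are blocked.
logits = [0.0] * (M_EOS + 1)
masked = mask_logits(logits, [M_BOS, BASE_START])
```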
Token Vocabulary
The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the ms-swift training pipeline):
| Token type | ID range | Count |
|---|---|---|
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |
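
The table implies simple offset arithmetic between per-stream BPE IDs and backbone vocabulary IDs. The sketch below assumes that each stream's BPE IDs run from 0 to 4095 and are shifted by the start of the corresponding range; that mapping and the helper names `to_sequence` and `from_sequence` are illustrative assumptions, not part of the released code.

```python
BASE_OFFSET = 248320   # first Base motion token ID
PHYS_OFFSET = 252416   # first Phys motion token ID
STREAM_VOCAB = 4096    # tokens per stream
M_BOS, M_SEP, M_EOS = 256512, 256513, 256514


def to_sequence(base_tokens, phys_tokens):
    """Map per-stream BPE IDs to backbone IDs and lay out the unified sequence."""
    for t in list(base_tokens) + list(phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"stream token {t} outside [0, {STREAM_VOCAB})")
    return (
        [M_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [M_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [M_EOS]
    )


def from_sequence(seq):
    """Invert to_sequence: recover (base_tokens, phys_tokens) from backbone IDs."""
    if not seq or seq[0] != M_BOS or seq[-1] != M_EOS or M_SEP not in seq:
        raise ValueError("expected [M_BOS, ..., M_SEP, ..., M_EOS]")
    sep = seq.index(M_SEP)
    base = [t - BASE_OFFSET for t in seq[1:sep]]
    phys = [t - PHYS_OFFSET for t in seq[sep + 1:-1]]
    return base, phys


seq = to_sequence([0, 4095], [0, 7])
```

Here the last Base token (4095) maps to 252415 and the first Phys token (0) maps to 252416, which matches the range boundaries in the table.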
Usage
from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np
# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")
# Encode 276-dim motion
motion = np.load("motion.npy") # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames
# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"]
)
# base_recon: (T, 201), phys_recon: (T, 75)
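
A quick way to sanity-check the round trip is to compare each reconstructed stream against the matching slice of the input. The helper below is a hypothetical addition, not part of the tokenizer API, and it assumes the Base dimensions occupy the first 201 columns of the 276-dim motion array.

```python
import numpy as np


def stream_rmse(motion, base_recon, phys_recon, base_dim=201):
    """Root-mean-square error per stream, assuming Base columns come first."""
    base_err = np.sqrt(np.mean((motion[:, :base_dim] - base_recon) ** 2))
    phys_err = np.sqrt(np.mean((motion[:, base_dim:] - phys_recon) ** 2))
    return float(base_err), float(phys_err)


# Toy arrays standing in for a real clip and its decoded streams.
toy_motion = np.ones((4, 276))
errors = stream_rmse(toy_motion, np.ones((4, 201)), np.zeros((4, 75)))
```

With real data, `motion`, `base_recon`, and `phys_recon` from the usage example would be passed in place of the toy arrays.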
Code
Training code and model architecture: GitHub
Citation
@article{motionvla2026,
title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
year={2026}
}