SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844
Train a Vision-Language-Action model for drone navigation using Qwen3.5-0.8B + LoRA + Action Head.
Architecture: FPV Image + Text Instruction → Qwen3.5-0.8B (LoRA) → MLP → [Vx, Vy, Vz, Yaw_rate]
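The last stage of that pipeline can be sketched as a small PyTorch module. This is only an illustration, not the code in `train_drone_vla.py`: the class name `ActionHead`, the hidden size of 1024, the MLP width of 256, and the final `tanh` are all assumptions made for the example.

```python
import torch
from torch import nn


class ActionHead(nn.Module):
    """Hypothetical MLP mapping a pooled hidden state to 4 velocity commands."""

    def __init__(self, hidden_size: int = 1024, mlp_dim: int = 256):
        super().__init__()
        # hidden_size and mlp_dim are assumed values, not read from the real model config.
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, 4),
            nn.Tanh(),  # squashes every output into [-1, 1]
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) -> (batch, 4) ordered as [Vx, Vy, Vz, Yaw_rate]
        return self.net(pooled)


head = ActionHead()
actions = head(torch.zeros(2, 1024))  # dummy batch of two pooled feature vectors
```

In a full model the input would be a pooled hidden state taken from the LoRA-adapted backbone rather than zeros.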
| Script | What it does |
|---|---|
| `collect_data.py` | 🎮 Captures FPV screenshots + keyboard inputs from your simulator |
| `convert_to_hf_dataset.py` | 📦 Converts collected data → HuggingFace Dataset format |
| `generate_synthetic_data.py` | 🎲 Generates fake data for testing the pipeline |
| `train_drone_vla.py` | 🏋️ Trains Qwen3.5-0.8B + LoRA + action head |
| `inference_drone_vla.py` | 🎯 Runs inference (single image or live from simulator) |
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install pynput mss pillow numpy datasets torch "transformers>=4.57" "peft>=0.14" accelerate trackio huggingface_hub qwen-vl-utils
# Open your simulator window, then run:
python collect_data.py --output_dir ./drone_data --fps 10 --embodiment drone
# Controls:
# R = Start/Stop recording episode
# W/S/A/D = Forward/Back/Left/Right
# Space/Shift = Up/Down
# Q/E = Yaw left/right
# ESC = Quit and save
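The key bindings above can be turned into an action vector roughly as follows. This is a minimal sketch, not the logic inside `collect_data.py`: the function `keys_to_action`, the string key names, and the sign convention (right strafe and right yaw positive) are assumptions for illustration.

```python
# Hypothetical key names; a real collector would translate pynput key events into these.
# Each key maps to (axis index, contribution), with axes ordered [Vx, Vy, Vz, Yaw].
KEY_TO_AXIS = {
    "w": (0, 1.0), "s": (0, -1.0),          # forward / back
    "d": (1, 1.0), "a": (1, -1.0),          # strafe right / left (assumed sign)
    "space": (2, 1.0), "shift": (2, -1.0),  # up / down
    "e": (3, 1.0), "q": (3, -1.0),          # yaw right / left (assumed sign)
}


def keys_to_action(pressed):
    """Sum the contributions of all pressed keys and clamp each axis to [-1, 1]."""
    action = [0.0, 0.0, 0.0, 0.0]
    for key in pressed:
        if key in KEY_TO_AXIS:
            axis, value = KEY_TO_AXIS[key]
            action[axis] += value
    return [max(-1.0, min(1.0, v)) for v in action]


print(keys_to_action({"w", "e"}))  # forward while yawing right
```

Opposite keys pressed together cancel out, so holding `W` and `S` yields zero forward velocity.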
python convert_to_hf_dataset.py \
    --source collected \
    --input_dir ./drone_data \
    --output_repo YOUR_USER/drone-nav-data \
    --push_to_hub
python train_drone_vla.py \
    --dataset_repo YOUR_USER/drone-nav-data \
    --hub_model_id YOUR_USER/drone-vla-qwen3.5-0.8b \
    --num_epochs 10 \
    --batch_size 4
python inference_drone_vla.py \
    --model_repo YOUR_USER/drone-vla-qwen3.5-0.8b \
    --image frame.jpg \
    --instruction "fly forward through the gate"
The action vector maps directly to a standard RC drone controller:

| Left Stick | Right Stick |
|---|---|
| ↑ Vz (up) | ↑ Vx (forward) |
| ↓ Vz (down) | ↓ Vx (backward) |
| ← Yaw (rotate left) | ← Vy (strafe left) |
| → Yaw (rotate right) | → Vy (strafe right) |
All values normalized to [-1, 1].
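Because the outputs are normalized, a consumer has to scale them back to physical units before sending them to a simulator or flight controller. A minimal sketch, assuming hypothetical limits of 2 m/s for linear velocity and 90 deg/s for yaw rate (neither value comes from this repo):

```python
def denormalize_action(action, max_linear=2.0, max_yaw_rate=90.0):
    """Scale a normalized [Vx, Vy, Vz, Yaw] action to physical units.

    max_linear (m/s) and max_yaw_rate (deg/s) are illustrative limits only.
    """
    # Clamp first so an out-of-range prediction cannot exceed the limits.
    vx, vy, vz, yaw = (max(-1.0, min(1.0, float(v))) for v in action)
    return [vx * max_linear, vy * max_linear, vz * max_linear, yaw * max_yaw_rate]


print(denormalize_action([0.5, 0.0, -1.5, 1.0]))
```

Clamping before scaling keeps a slightly out-of-range model output from commanding more than the configured maximum.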
Same action space works across robots:
- `[Vx, Vy, Vz, Yaw]`: all 4 active
- `[Vx, 0, 0, Yaw]`: no strafe, no altitude
- `[Vx, Vy, 0, Yaw]`: no altitude
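One way to apply these three patterns is a per-axis mask that zeroes the axes a given robot cannot use. This is a sketch under assumptions: the dictionary keys below are hypothetical labels (only `drone` appears in this README, as the `--embodiment` value), and `mask_action` is not a function from the scripts.

```python
# Hypothetical embodiment labels mapped to which of [Vx, Vy, Vz, Yaw] are active.
EMBODIMENT_MASKS = {
    "drone": (True, True, True, True),                     # all 4 active
    "no_strafe_no_altitude": (True, False, False, True),   # [Vx, 0, 0, Yaw]
    "no_altitude": (True, True, False, True),              # [Vx, Vy, 0, Yaw]
}


def mask_action(action, embodiment):
    """Zero out the axes that the given embodiment does not support."""
    mask = EMBODIMENT_MASKS[embodiment]
    return [float(a) if active else 0.0 for a, active in zip(action, mask)]


print(mask_action([0.5, -0.3, 0.8, 0.1], "no_altitude"))
```

Keeping the vector at a fixed length of four means the same action head can be reused across embodiments, with unused axes simply held at zero.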