SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
This repository contains the official implementation for the paper SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization.
Overview
We propose SeeNav-Agent, a novel embodied navigation framework built on large vision-language models (LVLMs). It combines a zero-shot dual-view visual prompt technique on the input side with SRGPO, an efficient reinforcement fine-tuning (RFT) algorithm for post-training. Existing Vision-Language Navigation (VLN) agents often suffer from perception, reasoning, and planning errors, which SeeNav-Agent aims to mitigate with these two techniques.
Highlights
- Zero-Shot Visual Prompt: The visual prompt improves performance without any extra training.
- Efficient Step-Level Advantage Calculation: Step-level groups are randomly sampled from the entire batch.
- Significant Gains: +20.0pp (GPT-4.1+VP) and +5.6pp (Qwen2.5-VL-3B+VP+SRGPO) improvements on EmbodiedBench-Navigation.
Summary
- Dual-View Visual Prompt: We apply visual prompt techniques directly to the dual-view input image to reduce visual hallucination.
- Step Reward Group Policy Optimization (SRGPO): By defining a state-independent, verifiable process reward function, we achieve efficient step-level random grouping and advantage estimation.
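The step-level random grouping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `step_level_advantages`, the fixed `group_size`, and the mean/standard-deviation normalization within each group are assumptions made for illustration. The only idea taken from the text is that, because the process reward does not depend on the state, steps from anywhere in the batch can be shuffled into groups and compared with each other.

```python
import random
from statistics import mean, pstdev


def step_level_advantages(step_rewards, group_size, seed=0, eps=1e-6):
    """Hypothetical sketch: normalize step rewards inside random groups.

    step_rewards: one process reward per step, pooled over the whole batch.
    group_size:   number of steps per random group (assumed hyperparameter).
    """
    rng = random.Random(seed)

    # Shuffle step indices so each group mixes steps from the entire batch.
    indices = list(range(len(step_rewards)))
    rng.shuffle(indices)

    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = [step_rewards[i] for i in group]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 if all rewards match
        for i in group:
            # Assumed normalization: reward relative to its group.
            advantages[i] = (step_rewards[i] - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0, 0.5, 0.5, 1.0, 0.0]
    print(step_level_advantages(rewards, group_size=4))
```

Each advantage is placed back at the index of its original step, so the result lines up with the input list regardless of how the steps were shuffled.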
Results on EmbodiedBench-Navigation
Main Results
Training Curves for RFT
Testing Curves for OOD-Scenes
Checkpoint
| base model | env | link |
|---|---|---|
| Qwen2.5-VL-3B-Instruct-SRGPO | EmbodiedBench-Nav | Qwen2.5-VL-3B-Instruct-SRGPO |
Usage
Setup
Set up a separate environment for evaluation following EmbodiedBench-Nav and Qwen3-VL (the latter is needed to support Qwen2.5-VL-3B-Instruct).
Set up a separate environment for training following verl-agent and Qwen3-VL (the latter is needed to support Qwen2.5-VL-3B-Instruct).
Evaluation
Use the following commands to evaluate the model on EmbodiedBench:
conda activate <your_env_for_eval>
cd SeeNav
python testEBNav.py
Hint: before running the evaluation, set your endpoint, API key, and api_version in SeeNav/planner/models/remote_model.py.
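One way to keep these three values out of version control is to read them from environment variables and reference them from the model file. The sketch below is only an illustration: the class `RemoteModelSettings`, the function `load_remote_settings`, and the `SEENAV_*` variable names are hypothetical and do not come from the repository, whose `remote_model.py` may expect the values in a different form.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteModelSettings:
    """Hypothetical container for the three values named in the hint."""
    endpoint: str
    api_key: str
    api_version: str


def load_remote_settings(env=None):
    """Read the settings from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    names = {
        "endpoint": "SEENAV_ENDPOINT",
        "api_key": "SEENAV_API_KEY",
        "api_version": "SEENAV_API_VERSION",
    }
    # Fail early with a clear message instead of sending an empty key.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return RemoteModelSettings(**{field: env[var] for field, var in names.items()})
```

With the variables exported in the shell used for evaluation, `load_remote_settings()` returns one object that can be passed to whatever client the planner constructs.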
Training
verl-agent/examples/srgpo_trainer contains example scripts for SRGPO-based training on EmbodiedBench-Navigation.
Modify run_ebnav.sh according to your setup, then run the following commands:
conda activate <your_env_for_train>
cd verl-agent
bash examples/srgpo_trainer/run_ebnav.sh
Citation
If you find this work helpful in your research, please consider citing:
@article{wang2025seenav,
title={SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization},
author={Zhengcheng Wang and Zichuan Lin and Yijun Yang and Haobo Fu and Deheng Ye},
journal={arXiv preprint arXiv:2512.02631},
year={2025}
}
Base model: Qwen/Qwen2.5-VL-3B-Instruct