Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Abstract
Agents built on vision-language models with extended reasoning struggle to use tools effectively. AXPO addresses this by fixing the thinking prefix of failed tool-using rollouts and resampling the tool call and its continuation.
Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools because internal reasoning alone often cannot resolve them. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and acting through tool use (a high-variance auxiliary behavior). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, which suppresses the learning signal at the tool calls that need it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
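The resampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` fields, the `prefix_uncertainty` score, the "pick the most uncertain prefix" rule, and the `resample` callable are all hypothetical names and choices introduced here for illustration. The advantage function is the usual group-standardized reward used in GRPO-style training; it shows that when every tool-using rollout in a group fails, all of them receive the same advantage, so nothing distinguishes one tool call from another.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Rollout:
    thinking_prefix: str             # text generated before the tool call
    used_tool: bool                  # whether the rollout attempted a tool call
    reward: float                    # 1.0 if the final answer is correct, else 0.0
    prefix_uncertainty: float = 0.0  # hypothetical score (e.g. mean token entropy)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within the rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def all_wrong_tool_subgroup(group: List[Rollout]) -> List[Rollout]:
    """Return the tool-using rollouts if every one of them failed, else []."""
    tool_rollouts = [r for r in group if r.used_tool]
    if tool_rollouts and all(r.reward == 0.0 for r in tool_rollouts):
        return tool_rollouts
    return []


def select_prefix(subgroup: List[Rollout]) -> Rollout:
    """Assumed selection rule: keep the most uncertain thinking prefix."""
    return max(subgroup, key=lambda r: r.prefix_uncertainty)


def explore(
    group: List[Rollout],
    resample: Callable[[str], Rollout],  # hypothetical sampler: prefix -> new rollout
    k: int,
) -> List[Rollout]:
    """Fix one thinking prefix and resample the tool call and continuation k times."""
    subgroup = all_wrong_tool_subgroup(group)
    if not subgroup:
        return []  # some tool call already succeeded; no extra exploration
    anchor = select_prefix(subgroup)
    return [resample(anchor.thinking_prefix) for _ in range(k)]
```

In a real training loop, `resample` would call the policy to generate a fresh tool call and continuation from the fixed prefix; how the new rollouts are scored and merged back into the policy update is not specified in the abstract and is left out of this sketch.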
Community
AXPO (Agent Explorative Policy Optimization) addresses the thinking-acting gap in multimodal agentic reasoning by resampling tool calls in failed rollouts to improve training signal and model performance.
Thank you for uploading our work, @taesiri !
We've also shared the key contributions in companion posts:
X: https://x.com/mkkang_1133/status/2059872464461848581?s=20
LinkedIn: https://www.linkedin.com/posts/minki-kang-5aa7281bb_releasing-axpo-our-new-work-on-agentic-activity-7465640179035279360-UZuh
Happy to discuss or take feedback :)
