arxiv:2605.28774

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Published on May 27 · Submitted by taesiri on May 28 · #3 Paper of the day

Abstract

Agents built on vision-language models with extended reasoning struggle with tool use. AXPO addresses this by fixing the thinking prefix of failed tool-using rollouts and resampling the tool call and its continuation.

AI-generated summary

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools because internal reasoning alone often cannot resolve them. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
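
The resampling step described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the `Rollout` fields, the `resample` callback, the `n_extra` parameter, the choice of the highest-uncertainty prefix, and the way new rollouts are appended to the group are all assumptions made for illustration. The abstract only states that the thinking prefix is fixed, the tool call and its continuation are resampled, and prefix selection is uncertainty-based.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Rollout:
    prefix: str               # thinking text before the tool call ("" if no tool call)
    used_tool: bool           # whether this rollout attempted a tool call
    reward: float             # 1.0 if the final answer is correct, else 0.0
    uncertainty: float = 0.0  # hypothetical uncertainty score for the prefix


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def tool_subgroup_all_wrong(group: List[Rollout]) -> bool:
    """True if the group has tool-using rollouts and none of them is correct."""
    tool = [r for r in group if r.used_tool]
    return bool(tool) and all(r.reward == 0.0 for r in tool)


def select_prefix(group: List[Rollout]) -> Rollout:
    """Assumption: pick the tool-using rollout with the most uncertain prefix."""
    tool = [r for r in group if r.used_tool]
    return max(tool, key=lambda r: r.uncertainty)


def explore(
    group: List[Rollout],
    resample: Callable[[str], Rollout],
    n_extra: int = 4,
) -> List[Rollout]:
    """Keep the selected thinking prefix fixed and resample what follows it.

    `resample(prefix)` is a hypothetical callback that generates a new tool
    call and continuation from the given prefix and scores the result.
    """
    if not tool_subgroup_all_wrong(group):
        return list(group)
    anchor = select_prefix(group)
    extra = [resample(anchor.prefix) for _ in range(n_extra)]
    return list(group) + extra
```

In this sketch, a group whose tool-using rollouts all fail contributes no positive advantage at any tool call. Appending rollouts that share the selected prefix gives the tool call a chance to receive a nonzero, possibly positive, signal. How AXPO actually weights or normalizes the resampled rollouts is not specified in the abstract.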

Community

Paper submitter

AXPO (Agent Explorative Policy Optimization) addresses the thinking-acting gap in multimodal agentic reasoning by resampling tool calls in failed rollouts to improve training signal and model performance.

Paper author

Thank you for uploading our work, @taesiri !

We've also shared the key contributions in companion posts:
X: https://x.com/mkkang_1133/status/2059872464461848581?s=20
LinkedIn: https://www.linkedin.com/posts/minki-kang-5aa7281bb_releasing-axpo-our-new-work-on-agentic-activity-7465640179035279360-UZuh

Happy to discuss or take feedback :)

