Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
Abstract
SAERL uses Sparse Autoencoder-derived signals from model internals to enhance LLM reinforcement learning through diversity control, difficulty-aware curriculum learning, and quality-based data filtering.
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores the intrinsic signals in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), a mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAEs transfer effectively across model families and scales, making them lightweight and reusable data engineering tools. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
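The three operations named in the abstract can be pictured with a small sketch. The abstract does not say how the SAE signals are computed or combined, so everything below is an assumption made for illustration: the per-example SAE feature vectors, the scalar difficulty and quality scores, the threshold, the use of plain k-means, and the stage-wise round-robin mixing. The function names (`kmeans`, `interleave_by_cluster`, `plan_batches`) are hypothetical and are not taken from SAERL.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (a stand-in for SAE-space clustering)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center: shape (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def interleave_by_cluster(indices, cluster_of):
    """Reorder indices so that clusters alternate round-robin."""
    groups = {}
    for i in indices:
        groups.setdefault(cluster_of[i], []).append(i)
    queues = list(groups.values())
    out = []
    while queues:
        for q in queues:
            out.append(q.pop(0))
        queues = [q for q in queues if q]
    return out

def plan_batches(sae_feats, difficulty, quality, batch_size, stage_size,
                 k=4, min_quality=0.5):
    """Assumed pipeline: filter by quality, cluster, order easy to hard, mix."""
    # 1) Quality filtering: keep examples whose (hypothetical) probe score
    #    clears an assumed threshold.
    keep = np.flatnonzero(quality >= min_quality)
    # 2) Diversity: cluster the kept examples in SAE feature space.
    labels = kmeans(sae_feats[keep], k)
    cluster_of = dict(zip(keep.tolist(), labels.tolist()))
    # 3) Curriculum: sort kept examples by the difficulty proxy.
    order = keep[np.argsort(difficulty[keep], kind="stable")]
    # 4) Mixing: inside each difficulty stage, alternate clusters so that
    #    each batch draws from several clusters.
    batches = []
    for s in range(0, len(order), stage_size):
        stage = interleave_by_cluster(order[s:s + stage_size].tolist(), cluster_of)
        for b in range(0, len(stage), batch_size):
            batches.append(stage[b:b + batch_size])
    return batches
```

Calling `plan_batches(feats, difficulty, quality, batch_size=4, stage_size=8)` on NumPy arrays with one row or score per example returns a list of index batches. In this sketch the stage size controls the trade-off: larger stages allow more cluster mixing per batch but loosen the easy-to-hard ordering.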
Community
This paper proposes a method that predicts data diversity, difficulty, and quality from SAE signals and uses those predictions to guide data engineering for LLM RL post-training.