IMU-1: Sample-Efficient Pre-training of Small Language Models
Abstract
IMU-1 is a 430M-parameter language model that approaches the benchmark performance of models trained on far more data through an optimized training recipe and architectural interventions.
We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: https://huggingface.co/thepowerfuldeez/imu1_base
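The abstract only names the architectural interventions; as an illustration, below is a minimal PyTorch sketch of two of them, QK-norm attention and per-head output gating. The norm choice (RMSNorm, available in recent PyTorch), the gate form (a per-head sigmoid scalar), and all dimensions are assumptions made for the example, not the released IMU-1 implementation; see the linked repository for the actual recipe.

```python
# Hypothetical sketch of QK-norm attention with a per-head output gate.
# Not the IMU-1 code: norm choice (RMSNorm), gate form, and initialization
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormGatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # QK-norm: normalize queries and keys per head before the dot product,
        # which bounds the attention logits and stabilizes training.
        self.q_norm = nn.RMSNorm(self.d_head)
        self.k_norm = nn.RMSNorm(self.d_head)
        # Per-head gating: one learned scalar per head, squashed by a sigmoid.
        self.head_gate = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Scale each head's output by its gate before the final projection.
        out = out * torch.sigmoid(self.head_gate).view(1, -1, 1, 1)
        return self.proj(out.transpose(1, 2).reshape(b, t, -1))

# Example usage; the dimensions here are guesses, not IMU-1's configuration.
attn = QKNormGatedAttention(d_model=1024, n_heads=16)
y = attn(torch.randn(2, 128, 1024))
```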
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Progressive Residual Warmup for Language Model Pretraining (2026)
- Muon+: Towards Better Muon via One Additional Normalization Step (2026)
- Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion (2026)
- Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi (2026)
- H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code (2026)
- DySCO: Dynamic Attention-Scaling Decoding for Long-Context Language Models (2026)
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training (2026)