A small model that struggled against a random opponent now beats GPT-5-mini at tic-tac-toe
I took LiquidAI/LFM2-2.6B and trained it through play.
🧑‍🍳 Here's how:
1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
Done! Beats GPT-5-mini 🏆
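
To make steps 4️⃣ and 5️⃣ concrete, here is a minimal sketch of an opponent that plays a random legal move with some probability and a strong move otherwise. It is not the code from the actual env: the board encoding (a 9-character string), the function names, the use of minimax for the non-random moves, and drawing one random-move rate per game from the stated range are all assumptions made for illustration.

```python
import random
from functools import lru_cache

# Hypothetical board encoding: a 9-character string of "X", "O" or " ".
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """Indices of the empty cells."""
    return [i for i, cell in enumerate(board) if cell == " "]


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) for the player to move.

    Score is from that player's point of view: +1 win, 0 draw, -1 loss.
    """
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = legal_moves(board)
    if not moves:
        return 0, None  # full board, no winner: draw
    other = "O" if player == "X" else "X"
    best_score, best_move = -2, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        score = -minimax(child, other)[0]  # opponent's gain is our loss
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move


def opponent_move(board, player, random_prob, rng=random):
    """With probability random_prob play a random legal move, else the minimax move."""
    if rng.random() < random_prob:
        return rng.choice(legal_moves(board))
    return minimax(board, player)[1]


def sample_random_prob(low, high, rng=random):
    """Assumption: one random-move rate is drawn per game from [low, high]."""
    return rng.uniform(low, high)


# Stage 4 style opponent: 20-70% random moves.
p_easy = sample_random_prob(0.20, 0.70)
# Stage 5 style opponent: 0-25% random moves.
p_hard = sample_random_prob(0.00, 0.25)
move = opponent_move("XX  O    ", "O", p_hard)
```

Lowering the random-move rate is a simple difficulty knob: at 0% the sketch opponent never loses, so the best the trained model can get is a draw, while higher rates leave mistakes it can learn to punish.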
---
🎮 Play against the model: anakin87/LFM2-2.6B-mr-tictactoe
🤗 Model: anakin87/LFM2-2.6B-mr-tictactoe
📚 Walkthrough/course: https://github.com/anakin87/llm-rl-environments-lil-course
🤗 Dataset and checkpoints: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe