Abstract
Steganographic chain-of-thought reasoning can be concealed within seemingly innocent text, posing risks for LLM safety and alignment that require continuous monitoring and evaluation.
Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.
Get this paper in your agent:
hf papers read 2602.14095 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper