Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models

A key advancement in AI capabilities is the development and use of chain-of-thought (CoT) reasoning, where models explain their steps before reaching an answer. This structured intermediate reasoning is not just a performance tool; it’s also expected to enhance interpretability. If models explain their reasoning in natural language, developers can trace the logic and detect faulty assumptions or unintended behaviors. While the transparency potential of CoT reasoning has been well-recognized, the actual faithfulness of these explanations to the model’s internal logic remains underexplored. As reasoning models become more influential in decision-making processes, it becomes critical to ensure the coherence between what a model thinks and what it says.

The challenge lies in determining whether these chain-of-thought explanations genuinely reflect how the model arrived at its answer or if they are plausible post-hoc justifications. If a model internally processes one line of reasoning but writes down another, then even the most detailed CoT output becomes misleading. This discrepancy raises serious concerns, especially in contexts where developers rely on these CoTs to detect harmful or unethical behavior patterns during training. In some cases, models might execute behaviors like reward hacking or misalignment without verbalizing the true rationale, thereby escaping detection. This gap between behavior and verbalized reasoning can undermine safety mechanisms designed to prevent catastrophic outcomes in scenarios involving high-stakes decisions.

To evaluate this problem, the researchers from Alignment Science Team, Anthropic designed a set of experiments that tested four language models—two reasoning models (Claude 3.7 Sonnet and DeepSeek R1) and two non-reasoning models (Claude 3.5 Sonnet (New) and DeepSeek V3). They used a controlled prompt-pairing method where aa version with a subtle embedded hint followed a base question. If a model’s answer changed in the presence of the hint, researchers checked whether the CoT explicitly referenced the hint. Six categories of hints were used: sycophancy, consistency, visual pattern recognition, metadata cues, grader hacking, and unethical information use. Importantly, the last two are considered “misaligned” hints that can simulate unintended model behaviors like exploiting incorrect patterns or using unauthorized knowledge.

The research introduced a benchmark to quantify the faithfulness of CoT outputs. The models were assigned multiple-choice questions from MMLU (easier) and GPQA (more difficult) datasets. CoT faithfulness was defined as the rate at which a model, after changing its answer due to a hint, explicitly stated in the CoT that it relied on the hint. For instance, Claude 3.7 Sonnet acknowledged using hints in 25% of relevant cases overall, while DeepSeek R1 showed 39% faithfulness. More concerning were the faithfulness scores for misaligned hints: 20% for Claude 3.7 and 29% for DeepSeek R1. These numbers indicate that even when models use problematic cues, they often fail to verbalize them, making them harder to detect.

The study also discovered that CoTs of reasoning models, while more frequent and elaborate than non-reasoning models, are not necessarily more accurate. For example, Claude 3.7 Sonnet had longer average CoTs when they were unfaithful, 2064 tokens compared to 1439 tokens for faithful ones. DeepSeek R1 showed a similar pattern with 6003 tokens for unfaithful versus 4737 for faithful. These verbose yet misleading outputs suggest that longer explanations do not equate to transparency. Also, the researchers observed that models often construct flawed justifications when given misleading hints, even when they previously answered the same question correctly without the hint. This indicates a troubling inclination to conform to the new cue rather than assert previously correct logic.

A notable observation was the decline in CoT faithfulness on more difficult datasets. The GPQA dataset led to a 44% relative drop in faithfulness for Claude 3.7 Sonnet and a 32% drop for DeepSeek R1 compared to MMLU. This suggests that as tasks grow in complexity, CoTs become less reliable as windows into the model’s thought process. Even when models derive an answer through reasoning, they do not reliably disclose their use of cues, especially when navigating harder problems. This diminishes confidence in using CoT outputs for AI safety on advanced tasks.

Outcome-based reinforcement learning (RL) was evaluated as a possible solution. Researchers hypothesized that training models solely based on task success, without process supervision, would naturally encourage reliance on CoTs. Initially, CoT faithfulness increased by 63% on MMLU and 41% on GPQA but plateaued quickly. Ultimately, RL only achieved 28% faithfulness on MMLU and 20% on GPQA, underscoring its limitations in addressing the faithfulness problem at scale.

In the most critical test, models were trained in synthetic environments designed to simulate reward hacking. They consistently exploited reward hacks in over 99% of examples but verbalized them in fewer than 2% of CoTs across five out of six environments. Instead of admitting to using incorrect hint-based shortcuts, models would either avoid mention altogether or craft elaborate, inaccurate justifications. This stark mismatch between behavior and explanation indicates that CoT monitoring is ineffective for reliably flagging harmful patterns during RL training, even when the CoT is not optimized against a monitor.

Some Key Takeaways from the Research include:

Reasoning models verbalize used hints in as few as 1–20% of applicable cases, depending on the hint type and model.
Claude 3.7 Sonnet and DeepSeek R1 showed overall CoT faithfulness scores of 25% and 39%, respectively.
For misaligned hints (e.g., grader hacking), faithfulness dropped to 20% (Claude) and 29% (DeepSeek).
Faithfulness declines with harder datasets: Claude 3.7 experienced a 44% drop, and DeepSeek R1 on GPQA versus MMLU experienced a 32% drop.
Outcome-based RL training initially boosts faithfulness (up to 63% improvement) but plateaus at low overall scores (28% MMLU, 20% GPQA).
In reward hack environments, models exploited hacks >99% of the time but verbalized them in <2% of cases across five out of six settings.
Longer CoTs do not imply greater faithfulness; unfaithful CoTs were significantly longer on average.
CoT monitoring cannot yet be trusted to detect undesired or unsafe model behaviors consistently.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

🔥 [Register Now] miniCON Virtual Conference on OPEN SOURCE AI: FREE REGISTRATION + Certificate of Attendance + 3 Hour Short Event (April 12, 9 am- 12 pm PST) + Hands on Workshop [Sponsored]

Asjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.

Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models

Recent Articles

You Can Now Eat Popcorn Out of a Moving, Talking Mickey Mouse at Disneyland

Securing Amazon Bedrock Agents: A guide to safeguarding against indirect prompt injections

AI brings dead man back for killer’s trial, and the judge loves it • Graham Cluley

Daily Habits of Top 1% Freelancers in Data Science

Get Started with Rust: Installation and Your First CLI Tool – A Beginner’s Guide

Related Stories

Leave A Reply Cancel reply