Title
"Chain of Thought" is just more generated text
Description
<img attachment="chana_messinger.webp" align="right" caption="Chana Messinger">This ~10-minute video discusses research about chain-of-thought LLMs that "show their work". Chana points out that, once you can see what the machine says it's doing, you find it openly discussing "cheating" to achieve the correct result. She says that, once you add penalties for "cheating", the machine doesn't stop cheating; it simply stops writing about it. This feels hilarious because the machine really seems to be acting like a teenager, but it's exactly this kind of anthropomorphizing that is both so seductive and potentially counterproductive.
<media href="https://www.youtube.com/watch?v=Xx4Tpsk_fnM" src="https://www.youtube.com/v/Xx4Tpsk_fnM" source="YouTube" width="560px" author="Computerphile / Chana Messinger" caption="'Forbidden' AI Technique - Computerphile">
Anthropic recently published a long paper called <a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">Circuit Tracing: Revealing Computational Graphs in Language Models</a>. Their research shows that the explanation an LLM offers for how it arrived at an answer does not always, or even often, correspond to the actual path that the computation took through the model's layers when that path is traced in detail.
Chana says that the LLM is describing how it's going to "cheat" to get to the answer that has the greatest "weight", i.e., the thing that the questioner very clearly wants to hear, or whatever gets statistically closest to the "answer" given in the eval included in the query. However, it's actually describing this in a part of its processing that is only associated with generating the chain of thought and has little to nothing to do with producing the actual answer itself.
What we consider to be the "chain of thought" is, to the LLM, just more text to generate. It's just as likely to be completely made up as any other output, and it need not reflect how the answer itself was constructed. The LLM doesn't "know" that it's explaining one part of a text with another, just like it doesn't "know" that it's "lying" or "cheating".
The LLM is generating the answer that the weights in its model (fixed during training) make most likely, given the "pressures" included in the system prompt and the query. It's the human interlocutor who imbues the situation with humanity or intent, not the machine. The context is that you're "talking to something", and the interpretive gloss is wholly one-sided. The other side is just cheerily crunching numbers.
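To make the "just more text" point concrete, here is a minimal sketch, not how any real model is implemented. It assumes a toy lookup table (`NEXT_TOKEN_WEIGHTS`, entirely hypothetical) in place of a trained network, and made-up `<think>` and `</think>` marker tokens. The only thing it is meant to show is that the "thought" tokens and the "answer" tokens come out of the same sampling loop, and that the split between them is applied afterwards by the reader.

```python
import random

# Hypothetical toy "model": next-token weights keyed by the previous token.
# A real LLM conditions on the whole context using learned weights; this
# table is only a stand-in for illustration.
NEXT_TOKEN_WEIGHTS = {
    "<prompt>": {"<think>": 1.0},
    "<think>": {"first": 0.6, "maybe": 0.4},
    "first": {"add": 1.0},
    "maybe": {"guess": 1.0},
    "add": {"</think>": 1.0},
    "guess": {"</think>": 1.0},
    "</think>": {"42": 0.7, "41": 0.3},
    "42": {"<end>": 1.0},
    "41": {"<end>": 1.0},
}


def next_token(previous, rng):
    """Sample one token. The same code path serves 'thoughts' and 'answers'."""
    options = NEXT_TOKEN_WEIGHTS[previous]
    tokens = list(options)
    weights = [options[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(seed=0):
    """Run one undifferentiated loop until the end marker is produced."""
    rng = random.Random(seed)
    stream = ["<prompt>"]
    while stream[-1] != "<end>":
        stream.append(next_token(stream[-1], rng))
    return stream


def split_for_humans(stream):
    """Cut the finished stream into 'reasoning' and 'answer' after the fact."""
    close = stream.index("</think>")
    return stream[2:close], stream[close + 1:-1]


if __name__ == "__main__":
    stream = generate()
    reasoning, answer = split_for_humans(stream)
    print("stream:   ", stream)
    print("reasoning:", reasoning)
    print("answer:   ", answer)
```

Nothing in `generate` distinguishes explaining from answering; only `split_for_humans`, which runs after generation is finished, draws that line.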
I'm not convinced by Chana's explanation that the LLM is actually <iq>hiding private messages to itself</iq> with <a href="https://en.wikipedia.org/wiki/Steganography">steganography</a>, because I think that the better explanation comes from the Anthropic paper linked above, not the OpenAI one she discusses. However, I think that it's definitely good advice to avoid these types of validation pressures, not because the models are <iq>trying to trick us, or hack us</iq>, but because the pressures don't lead to the desired result.
I think this research is fascinating because, even though there is no one on the other side (or it's one of Searle's <a href="https://en.wikipedia.org/wiki/Chinese_room">Chinese Rooms</a>), we still might be able to figure out how to manipulate the machine so that it reliably gives us what we want. While I understand that the anthropomorphizing explanation is more approachable, I'm more wary than many others of the limiting effect it has on how we think about solutions.