HNNewShowAskJobs
Built with Tanstack Start
Evaluating chain-of-thought monitorability(openai.com)
44 points by mfiguiere 3 days ago | 14 comments
  • ursAxZA4 hours ago

    I might be missing something here as a non-expert, but isn’t chain-of-thought essentially asking the model to narrate what it’s “thinking,” and then monitoring that narration?

    That feels closer to injecting a self-report step than observing internal reasoning.

    • crthpl3 hours ago |parent

      the chain of thought is what it is thinking

      • skissanean hour ago |parent

        When we think, our thoughts are composed of both nonverbal cognitive processes (we have access to their outputs, but generally lack introspective awareness of their inner workings), and verbalised thoughts (whether the “voice in your head” or actually spoken as “thinking out loud”).

        Of course, there are no doubt significant differences between whatever LLMs are doing and whatever humans are doing when they “think” - but maybe they aren’t quite as dissimilar as many argue? In both cases, there is a mutual/circular relationship between a verbalised process and a nonverbal one (in the LLM case, the inner representations of the model)

        • ursAxZAan hour ago |parent

          The analogy breaks at the learning boundary.

          Humans can refine internal models from their own verbalised thoughts; LLMs cannot.

          Self-generated text is not an input-strengthening signal for current architectures.

          Training on a model’s own outputs produces distributional drift and mode collapse, not refinement.

          Equating CoT with “inner speech” implicitly assumes a safe self-training loop that today’s systems simply don’t have.

          CoT is a prompted, supervised artifact — not an introspective substrate.

          • skissane14 minutes ago |parent

            Models have some limited means of refinement available to themselves already: augment a model with any form of external memory, and it can learn by writing to its memory and then reading relevant parts of that accumulated knowledge back in the future. Of course, this is a lot more rigid than what biological brains can do, but it isn’t nothing.

            Does “distributional drift and mode collapse” still happen if the outputs are filtered with respect to some external ground truth - e.g. human preferences, or even (in certain restricted domains such as coding) automated evaluations?

      • Bjartr2 hours ago |parent

        It is text that describes a plausible/likely thought process that conditions future generation by it's presence in the context.

        • CamperBob2an hour ago |parent

          Interestingly, it doesn't always condition the final output. When playing with DeepSeek, for example, it's common to see the CoT arrive at a correct answer that the final answer doesn't reflect, and even vice versa, where a chain of faulty reasoning somehow yields the right final answer.

          It almost seems that the purpose of the CoT tokens in a transformer network is to act as a computational substrate of sorts. The exact choice of tokens may not be as important as it looks, but it's important that they are present.

          • Workaccount25 minutes ago |parent

            IIRC Anthropic has research finding CoT can sometimes be uncorrelated with the final output.

      • ursAxZA3 hours ago |parent

        Chain-of-thought is a technical term in LLMs — not literally “what it’s thinking.”

        As far as I understand it, it’s a generated narration conditioned by the prompt, not direct access to internal reasoning.

      • jablongoan hour ago |parent

        It is what it is thinking consciously / its internal narrative. For example a supervillain's internal narrative with their plans would go into their COT notepad. If we want to really lean into the analogy between human psychology and LLMs. The "internal reasoning" that people keep referencing in this thread.. referring to the transformer weights and inscrutable inner working of a GPT.. isn't reasoning, but more like instinct, or the subconscious.

        • canjobear13 minutes ago |parent

          It’s more like if the supervillain had to write one word of his chain of thought, then go away and forget what he was thinking, then come back and write one more word based on what he had written so far, repeating the process until the whole chain of thought is written out. Each token is generated conditional only on the previous tokens.

      • arthurcolle2 hours ago |parent

        Wrong to the point of being misleading. This is a goal, not an assumption

        Source: all of mechinterp

  • ramoz5 hours ago

    > Our expectation is that combining multiple approaches—a defense-in-depth strategy—can help cover gaps that any single method leaves exposed.

    Implement hooks in codex then.

  • leetrout4 hours ago

    Related check out chain of draft if you haven't.

    Similar performance with 7% of tokens as chain of thought.

    https://arxiv.org/abs/2502.18600