Since it is not explicitly stated, "RL" in this article means Reinforcement Learning.
I, too, started parsing this as RL=real life and that’s why I found the headline interesting
Thank god. Was driving me mad.
It's deliberate clickbait/ragebait, not a mistake. It makes people click and talk about it, just as is happening here.
That is a bizarre take. Dwarkesh Patel publishes in a very specific domain, where RL is a very common and unambiguous acronym. I'd bet it was immediately clear to 99% of his regular audience, and for him it's such a high-frequency term that it likely never crossed his mind that people would find it ambiguous.
(Like, would you expect people to expand LLM or AGI in a title?)
Never attribute to malice that which is adequately explained by stupidity.
There needs to be a new law, applicable to posts of any kind on the Internet.
Because that law doesn't hold when malice has a massive profit motive and almost zero downside.
Spam, popups, clickbait: all of it and more, not stupid, but planned.
The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning for a rollout, rewarding success/failure.
However, the way I'm seeing it, an RL rollout may involve, say, 100 small decisions drawn from a pool of 1,000 possible decisions. Each training step will slightly upregulate or downregulate each decision taken, in the context where it occurred. There will be uncertainty about which decisions were helpful or harmful -- we only have 1 bit of information, after all -- but this setup, where many small decisions are slowly learned across many examples, seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01-bit insights across 100 contexts). There may be benefits here that aren't captured by comparing bit counts against pretraining.
The blog's line "Fewer bits, sure, but very valuable bits" points at a second factor that also seems true: learning these small decisions may be vastly more valuable for producing accurate outputs than learning through pretraining.
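The credit-assignment picture above (one scalar reward spread thinly over many small decisions) can be sketched with a toy REINFORCE update. Everything here is illustrative: a made-up 10-way decision space and hand-picked rollouts, not anyone's actual training setup.

```python
import math

# Toy policy over 10 possible "decisions": preference scores, softmax-ed into probabilities.
prefs = [0.0] * 10

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(decisions, reward, lr=0.1):
    """Spread one scalar reward over every decision taken in the rollout."""
    probs = softmax(prefs)
    for d in decisions:
        # gradient of log-softmax w.r.t. prefs[i] is (1[i == d] - probs[i])
        for i in range(len(prefs)):
            prefs[i] += lr * reward * ((1.0 if i == d else 0.0) - probs[i])

# One rollout that happened to succeed: five small decisions, one +1 reward.
reinforce_update([0, 2, 1, 0, 2], reward=1.0)
# One rollout that failed: the decision it took gets pushed down.
reinforce_update([5], reward=-1.0)

probs = softmax(prefs)
# Decisions seen in the rewarded rollout gained probability mass;
# decision 5 lost some -- each individual nudge is tiny, but they accumulate.
```

Note that the single reward bit cannot tell the update which of the five decisions actually mattered; every decision in the rollout is nudged the same way, which is exactly the uncertainty described above.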
It is the same type of learning, fundamentally: increasing/decreasing token probabilities based on the left context. RL simply provides more training data from online sampling.
In the limit, in the "happy" case (positive reward), policy gradients boil down to performing more or less the same update as the usual supervised objective for each generated token (or some subset of them, if we use sampling). In the unhappy case, they penalise the model for selecting particular tokens in particular contexts. This is not something you can normally do with supervised learning, but it is unclear to what extent it helps: if a bad and a good answer share a prefix, that prefix will be reinforced in one case and penalised in the other (not in exactly the same way, but still).

So during on-policy learning we desperately need the model to stumble on correct answers often enough, and that can only happen if the model already roughly knows how to solve the problem; otherwise the search space is too big. In other words, while in supervised learning we moved away from providing models with inductive biases and trusted them to figure everything out by themselves, in RL this does not really seem possible.
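The happy/unhappy-case equivalence is easy to make concrete: a REINFORCE-style loss on one sampled answer is just the supervised cross-entropy scaled by the reward. The numbers below are made up for illustration.

```python
import math

def policy_gradient_loss(token_logprobs, reward):
    """REINFORCE loss for one sampled answer: -reward * sum(log p(token))."""
    return -reward * sum(token_logprobs)

def supervised_loss(token_logprobs):
    """Standard next-token cross-entropy on the same tokens."""
    return -sum(token_logprobs)

# log-probabilities the model assigned to the tokens of one sampled answer
logps = [math.log(0.5), math.log(0.25), math.log(0.8)]

happy = policy_gradient_loss(logps, reward=+1.0)    # identical to the supervised loss
unhappy = policy_gradient_loss(logps, reward=-1.0)  # same magnitude, opposite sign
```

With reward +1 the sampled answer is treated exactly as if it were a supervised target; with reward -1 the very same tokens are pushed down, which plain supervised learning has no way to express.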
The trick is to provide dense rewards, i.e. reward not only once the full goal is reached, but a little for every random flailing of the agent in approximately the right direction.
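A minimal sketch of the sparse-vs-dense contrast, assuming a toy 1-D walk toward a goal (the trajectory and distance-based shaping are illustrative, not any particular system's reward):

```python
def sparse_reward(pos, goal):
    """1 bit at the very end: did we reach the goal?"""
    return 1.0 if pos == goal else 0.0

def dense_reward(prev_pos, pos, goal):
    """Distance-based shaping: reward any step that moves closer to the goal."""
    return abs(prev_pos - goal) - abs(pos - goal)

goal = 3
path = [0, 1, 2, 1, 2, 3]  # a flailing trajectory that eventually gets there

sparse = [sparse_reward(p, goal) for p in path[1:]]
dense = [dense_reward(a, b, goal) for a, b in zip(path, path[1:])]

print(sparse)  # [0.0, 0.0, 0.0, 0.0, 1.0]  -> signal only at the very end
print(dense)   # [1, 1, -1, 1, 1]           -> every step gets graded
```

The dense version tells the agent its backtrack at step 3 was a mistake; the sparse version gives the whole trajectory one undifferentiated pat on the head.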
How do you know the correct direction? Isn’t the point of learning that the right path is unknown to start with?
The correct solutions and the viable paths probably are known to the trainers, just not to the trainee. Training only on problems where the solution is unknown but verifiable sounds like the ultimate hard mode, and pretty hard to justify unless you have a model that's already saturated the space of problems with known solutions.
(Actually, "pretty hard to justify" might be understating it. How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?)
The article talks about all of this and references the DeepSeek-R1 paper[0], section 4.2 (first bullet point, on PRMs), on why this is much trickier to do than it appears.
Bit of a nitpick, but I think his terminology is wrong. Like RL, pretraining is also a form of *un*supervised learning.
Usual terminology for the three main learning paradigms:
- supervised learning (e.g. matching labels to pictures)
- unsupervised / self-supervised learning (pretraining)
- reinforcement learning
Now the confusing thing is that Dwarkesh Patel instead calls pretraining "supervised learning" and you call reinforcement learning a form of unsupervised learning.
SL and SSL are very similar "algorithmically": both use gradient descent on a loss function of predicting labels, human-provided (SL) or auto-generated (SSL). Since LLMs are pretrained on human texts, you might say that the labels (i.e., next token to predict) were in fact human provided. So, I see how pretraining LLMs blurs the line between SL and SSL.
In modern RL, we also train deep nets on some (often non-trivial) loss function. And RL generates its own training data, hence it blurs the line with SSL. I'd say, however, it's more complex and more computationally expensive: you need many long rollouts to find a signal to learn from. All of this is automated, so from that perspective it blurs the line with UL too :-) Though its dependence on the reward is what makes the difference.
Overall, going from more structured to less structured, I'd order the learning approaches: SL, SSL (pretraining), RL, UL.
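The SL/SSL blurring described above is easy to see in code: self-supervised pretraining manufactures its (input, label) pairs from the raw text itself, and once the pairs exist, training looks exactly like supervised learning. A toy example (the sentence is arbitrary):

```python
text = "the cat sat on the mat".split()

# Self-supervised next-token prediction: every prefix of the raw text becomes
# an input, and the token that follows it becomes the "label". No human
# annotation step, yet the resulting pairs train exactly like supervised data.
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

# e.g. pairs[0] is (['the'], 'cat'),
# and pairs[-1] is (['the', 'cat', 'sat', 'on', 'the'], 'mat')
```

The only "supervision" is the text's own continuation, which is why the labels can be called either auto-generated (SSL) or human-provided (SL), depending on how you squint.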
You could think of supervised learning as learning against a known ground truth, which pretraining certainly is.
A large number of breakthroughs in AI are based on turning unsupervised learning into supervised learning (AlphaZero-style MCTS as a policy improver is also like this). So the confusion is kind of intrinsic.
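The AlphaZero pattern can be sketched in a couple of lines: the search acts as a policy-improvement operator, and its normalized visit counts become a supervised target for the raw network. The visit counts below are hypothetical, not from a real search.

```python
def mcts_policy_target(visit_counts):
    """AlphaZero-style: normalize search visit counts into a policy target."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

# hypothetical visit counts returned by a tree search over three candidate moves
visits = {"a": 60, "b": 30, "c": 10}
target = mcts_policy_target(visits)
# target == {"a": 0.6, "b": 0.3, "c": 0.1}
```

The raw policy network is then trained with ordinary cross-entropy against `target`: the search, which needed no labels, has manufactured supervised data, which is exactly the "turning unsupervised into supervised" move described above.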