How to scale RL to 10^26 FLOPs (blog.jxmo.io)
81 points by jxmorris12 7 days ago | 7 comments
  • moconnor 4 days ago

    A very long way of saying "during pretraining let the models think before continuing next-token prediction and then apply those losses to the thinking token gradients too."

    It seems like an interesting idea. You could apply some small regularisation penalty to the number of thinking tokens the model uses. You might have to break up the pretraining data into meaningfully partitioned chunks. I'd be curious whether at large enough scale models learn to make use of this thinking budget to improve their next-token prediction, and what that looks like.
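
    A minimal sketch of one reading of the idea above, assuming a Hugging Face-style causal LM. The THINK_ID token, the LAMBDA_THINK weight, and the masking choices are all illustrative, and how the discrete thinking tokens themselves get credit (e.g. via RL) is left out:

      # Sketch: a pretraining step on a sequence that already contains
      # model-inserted "thinking" tokens. The next-token loss is scored on
      # data tokens, but because the thinking tokens sit in the context,
      # gradients flow back through their representations too; a small
      # penalty discourages long thinking spans. All names are hypothetical.
      import torch
      import torch.nn.functional as F

      THINK_ID = 50257        # hypothetical id for inserted thinking tokens
      LAMBDA_THINK = 1e-3     # small penalty on how much the model "thinks"

      def pretraining_loss(model, input_ids):
          """input_ids: (batch, seq) data tokens interleaved with thinking tokens."""
          logits = model(input_ids).logits              # (batch, seq, vocab)
          preds = logits[:, :-1, :]                     # predict position t+1 from <= t
          targets = input_ids[:, 1:]
          ce = F.cross_entropy(
              preds.reshape(-1, preds.size(-1)),
              targets.reshape(-1),
              reduction="none",
          ).reshape(targets.shape)

          # Only score positions whose target is a real data token; thinking
          # positions still receive gradient through the attention context.
          data_mask = (targets != THINK_ID).float()
          nll = (ce * data_mask).sum() / data_mask.sum().clamp(min=1)

          # Regularisation: penalise the fraction of the sequence spent thinking.
          think_frac = (input_ids == THINK_ID).float().mean()
          return nll + LAMBDA_THINK * think_frac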

  • tekacs 4 days ago

    Besides the great subject matter, I love how densely packed this article is with links to relevant papers and materials!

    (Because I almost missed this) In the comments on the post, someone linked to this paper: https://arxiv.org/html/2408.15240v1

  • mvkel 4 days ago

    Grok 4 is effectively Grok 3 with massively scaled RL, and the improvements on the benchmarks (and experientially) are minimal.

    Is this a flaw in the theory, or in the application?

    • k__ 4 days ago | parent

      Half-OT:

      Is there any info out there on how the major models differ?

  • Iwan-Zotow 4 days ago

    More data? Where is it supposed to come from?

  • childintime 3 days ago

    The chemistry of RL dictates that 10^26 FLOPS is about 166 FLOP-mols. But how much weight is this? An electron/FLOP or 1eV/FLOP? That's 0.55mg or just 1ng. Regardless, I'd say it's close to 7 brainfucks, as it's common knowledge they exert a logarithmic force on the intelligence weighing apparatus, F = m * log2(a).