How to scale RL to 10^26 FLOPs (blog.jxmo.io)
81 points by jxmorris12 7 days ago | 7 comments
  • moconnor 4 days ago

    A very long way of saying "during pretraining let the models think before continuing next-token prediction and then apply those losses to the thinking token gradients too."

    It seems like an interesting idea. You could apply some small regularisation penalty to the number of thinking tokens the model uses. You might have to break up the pretraining data into meaningfully partitioned chunks. I'd be curious whether at large enough scale models learn to make use of this thinking budget to improve their next-token prediction, and what that looks like.
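
    A minimal sketch of one reading of the idea above, assuming a Hugging Face-style causal LM. The THINK_ID token, the LAMBDA_THINK weight, and the masking choices are all illustrative, and how the discrete thinking tokens themselves get credit (e.g. via RL) is left out:

      # Sketch: a pretraining step on a sequence that already contains
      # model-inserted "thinking" tokens. The next-token loss is scored on
      # data tokens, but because the thinking tokens sit in the context,
      # gradients flow back through their representations too; a small
      # penalty discourages long thinking spans. All names are hypothetical.
      import torch
      import torch.nn.functional as F

      THINK_ID = 50257        # hypothetical id for inserted thinking tokens
      LAMBDA_THINK = 1e-3     # small penalty on how much the model "thinks"

      def pretraining_loss(model, input_ids):
          """input_ids: (batch, seq) data tokens interleaved with thinking tokens."""
          logits = model(input_ids).logits              # (batch, seq, vocab)
          preds = logits[:, :-1, :]                     # predict position t+1 from <= t
          targets = input_ids[:, 1:]
          ce = F.cross_entropy(
              preds.reshape(-1, preds.size(-1)),
              targets.reshape(-1),
              reduction="none",
          ).reshape(targets.shape)

          # Only score positions whose target is a real data token; thinking
          # positions still receive gradient through the attention context.
          data_mask = (targets != THINK_ID).float()
          nll = (ce * data_mask).sum() / data_mask.sum().clamp(min=1)

          # Regularisation: penalise the fraction of the sequence spent thinking.
          think_frac = (input_ids == THINK_ID).float().mean()
          return nll + LAMBDA_THINK * think_frac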

  • tekacs 4 days ago

    Besides the great subject matter, I love how densely packed this article is with links to relevant papers and materials!

    (Because I almost missed this) In the comments on the post, someone linked to this paper: https://arxiv.org/html/2408.15240v1

  • mvkel 4 days ago

    Grok 4 is effectively Grok 3 with massively scaled RL, and the improvements on the benchmarks (and experientially) are minimal.

    Is this a flaw in the theory, or in the application?

    • k__ 4 days ago | parent

      Half-OT:

      Is there any info out there on how the major models differ?

  • Iwan-Zotow 4 days ago

    More data? Where is it supposed to come from?

  • childintime 3 days ago

    The chemistry of RL dictates that 10^26 FLOPS is about 166 FLOP-mols. But how much weight is this? An electron/FLOP or 1eV/FLOP? That's 0.55mg or just 1ng. Regardless, I'd say it's close to 7 brainfucks, as it's common knowledge they exert a logarithmic force on the intelligence weighing apparatus, F = m * log2(a).