CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL (github.com)
28 points by dzign an hour ago | 5 comments
  • j2kun 6 minutes ago

    They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.

    [1]: https://www.argmin.net/p/lore-laundering-machines

    • AlexCoventry 5 minutes ago | parent

      In the future, we will all be Jürgen Schmidhuber. :-)

  • stonogo 23 minutes ago

    Am I reading this wrong, or does this only support FP16 inputs and compare its performance against an FP32 solver?
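
    For concreteness, a minimal sketch of that concern (assuming PyTorch with a CUDA device; the shapes are arbitrary and the FP16 path just stands in for any half-precision kernel):

      import torch

      A = torch.randn(4096, 4096, device="cuda")
      B = torch.randn(4096, 4096, device="cuda")

      # FP32 path: slower, more accurate.
      C_fp32 = A @ B

      # FP16 path: typically much faster on tensor-core GPUs.
      C_fp16 = A.half() @ B.half()

      # Timing an FP16-only kernel against the FP32 path inflates the
      # apparent speedup; a fair baseline is cuBLAS at the same precision.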

  • bgwalter 20 minutes ago

    > To validate kernel correctness, we need to compare its output to a reference correct kernel with the same inputs.

    No, you need a numerical proof, which you don't have.
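
    For reference, a minimal sketch of the comparison-based check the paper describes (assuming numpy; the candidate kernel and the tolerances are placeholders). Passing on sampled inputs bounds error for those inputs only; it is not a numerical proof:

      import numpy as np

      rng = np.random.default_rng(0)
      A = rng.standard_normal((512, 512)).astype(np.float16)
      B = rng.standard_normal((512, 512)).astype(np.float16)

      # Placeholder for the tuned kernel: FP32 math, FP16-rounded output.
      candidate = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
      # Reference "correct" kernel: plain FP32 matmul.
      reference = A.astype(np.float32) @ B.astype(np.float32)

      # Elementwise tolerance check over random samples, not a proof.
      assert np.allclose(candidate.astype(np.float32), reference,
                         rtol=1e-2, atol=1e-2)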

    • krapht 7 minutes ago | parent

      This is a standard that few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all, because it won't ever happen unless you're validating silicon or something equally expensive.