CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL (github.com)
28 points by dzign an hour ago | 5 comments
  • j2kun 6 minutes ago

    They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.

    [1]: https://www.argmin.net/p/lore-laundering-machines

    • AlexCoventry 5 minutes ago | parent

      In the future, we will all be Jürgen Schmidhuber. :-)

  • stonogo 23 minutes ago

    Am I reading this wrong, or does this only support FP16 inputs and compare its performance against an FP32 solver?
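
    For concreteness, a minimal sketch of that concern (assuming PyTorch with a CUDA device; the shapes are arbitrary and the FP16 path just stands in for any half-precision kernel):

      import torch

      A = torch.randn(4096, 4096, device="cuda")
      B = torch.randn(4096, 4096, device="cuda")

      # FP32 path: slower, more accurate.
      C_fp32 = A @ B

      # FP16 path: typically much faster on tensor-core GPUs.
      C_fp16 = A.half() @ B.half()

      # Timing an FP16-only kernel against the FP32 path inflates the
      # apparent speedup; a fair baseline is cuBLAS at the same precision.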

  • bgwalter 20 minutes ago

    > To validate kernel correctness, we need to compare its output to a reference correct kernel with the same inputs.

    No, you need a numerical proof, which you don't have.
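
    For reference, a minimal sketch of the comparison-based check the paper describes (assuming numpy; the candidate kernel and the tolerances are placeholders). Passing on sampled inputs bounds error for those inputs only; it is not a numerical proof:

      import numpy as np

      rng = np.random.default_rng(0)
      A = rng.standard_normal((512, 512)).astype(np.float16)
      B = rng.standard_normal((512, 512)).astype(np.float16)

      # Placeholder for the tuned kernel: FP32 math, FP16-rounded output.
      candidate = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)
      # Reference "correct" kernel: plain FP32 matmul.
      reference = A.astype(np.float32) @ B.astype(np.float32)

      # Elementwise tolerance check over random samples, not a proof.
      assert np.allclose(candidate.astype(np.float32), reference,
                         rtol=1e-2, atol=1e-2)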

    • krapht 7 minutes ago | parent

      This is a standard that few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all, because it won't ever happen unless you're validating silicon or something equally expensive.