The benchmark results on the first page don't mean much without an explanation of what the task is and why the results came out the way they did. Yes, I know it is softmax, but if both sides use a hand-written softmax kernel, why would the hand-written kernel for Parrot be so much better than the one for PyTorch?
Information usually gets withheld when revealing it would work against you, so staying quiet invites skepticism.
As far as I can tell, this is basically a very bare-bones JAX, but for C++. That's pretty good, because I am honestly getting tired of everything being locked inside a monolithic Python ecosystem that is difficult to call from other languages.
It's basically functional sugar on top of Thrust. Every call returns a fresh new array, so I wonder why it's still that fast. Maybe because that makes it easier to parallelize.
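To make the "every call returns a fresh array" point concrete, here is a rough sketch of softmax written in that style. It is plain standard-library C++ running on the CPU, not Thrust and not Parrot's actual API; `Array`, `fmap`, `max_of`, `sum_of` and `softmax` are names I made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device array; just a std::vector here.
using Array = std::vector<double>;

// Apply f to every element and return a NEW array (the input is untouched).
template <typename F>
Array fmap(const Array& xs, F f) {
    Array out(xs.size());
    std::transform(xs.begin(), xs.end(), out.begin(), f);
    return out;
}

// Reduction: largest element. Requires a non-empty input.
double max_of(const Array& xs) {
    assert(!xs.empty());
    return *std::max_element(xs.begin(), xs.end());
}

// Reduction: sum of all elements.
double sum_of(const Array& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0);
}

// Numerically stable softmax: subtract the max, exponentiate, normalize.
// Each step allocates a fresh array, which is the cost I am wondering about.
Array softmax(const Array& xs) {
    const double m = max_of(xs);
    const Array shifted = fmap(xs, [m](double x) { return x - m; });
    const Array exps = fmap(shifted, [](double x) { return std::exp(x); });
    const double total = sum_of(exps);
    return fmap(exps, [total](double e) { return e / total; });
}
```

Calling `softmax({1.0, 2.0, 3.0})` goes through three intermediate arrays before the result comes back. Written naively like this, that is three allocations and three passes over memory, so I assume a fast library has to do something smarter than this sketch does.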