FlashAttention-T: Towards Tensorized Attention (dl.acm.org)
52 points by matt_d 3 hours ago | 8 comments
  • simianwords 22 minutes ago

    OT, but instead of quadratic attention could we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine there might be some questions worth paying a higher cost for.

    I feel like n^10 attention could capture patterns that lower-complexity attention may not. So it seems arbitrary that we have n^2 attention.

    • eldenring 14 minutes ago | parent

      This is a common way of thinking. In practice, this kind of thing is more about optimizing flop allocation. Surely, with an infinite compute and parameter budget, you could have a better model with more intensive operations.

      Another thing to consider is that transformers are very general computers. You can encode many, many more complex architectures in simpler, multi-layer transformers.
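
      A rough back-of-the-envelope sketch of the flop-allocation point (my own illustrative numbers, assuming attention cost scales like d * n^k for sequence length n and head dimension d): a fixed compute budget buys dramatically less context as the exponent grows.

        # Illustrative only: how much context a fixed attention-FLOP budget buys
        # as the exponent k in the n^k scaling grows.
        def max_context(flop_budget: float, d: int, k: int) -> int:
            # assume attention cost ~ d * n^k and solve for n
            return int((flop_budget / d) ** (1.0 / k))

        budget = 1e15   # hypothetical per-layer attention FLOP budget
        d = 128         # hypothetical head dimension

        for k in (2, 3, 10):
            print(f"n^{k} attention -> max context ~ {max_context(budget, d, k):,} tokens")
        # n^2  -> ~2.8 million tokens
        # n^3  -> ~20 thousand tokens
        # n^10 -> ~19 tokens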

    • refulgentis 3 minutes ago | parent

      n^2 isn't a setting someone chose, it's a mathematical consequence of what attention is.

      Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.

      Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?
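
      A minimal numpy sketch (illustrative only, nothing like a real fused kernel) showing where the n x n term lives:

        import numpy as np

        def naive_attention(Q, K, V):
            # Q, K, V: (n, d), one row per token
            scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): every token vs. every token
            scores -= scores.max(axis=-1, keepdims=True)
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
            return weights @ V                               # (n, d)

        n, d = 1024, 64
        Q, K, V = (np.random.randn(n, d) for _ in range(3))
        out = naive_attention(Q, K, V)   # materializing the (n, n) scores matrix is the n^2 cost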

      (sibling comment has same interpretation as you, then handwaves re: transformers can emulate more complex systems)

  • sigbottle 28 minutes ago

    Oh wow, there's still work being done on Ampere?

    I was wondering - I've been thinking about switching to AI systems programming (I know, easy task), but from what I understand, industry cloud GPUs are the main winners, right? Nobody's going to pay me (assuming I even had the skills) to optimize for consumer GPUs?

    From what I understand, it's not just core counts + memory capacity + raw performance, it's the actual core primitives. I don't think any of the "Blackwell" chips like the Grace one or the RTX 5090 have, for example, SM pairs in their ISA? And there are likewise similar fundamental differences between consumer and cloud Hopper (where the majority of the perf comes from the cloud one's ISA?).

    So I guess I'm wondering: should I buy a GPU myself, or should I just rent on the cloud if I wanted to start getting some experience in this field? How do you even get experience in this normally anyway - do you get into really good schools and into their AI labs, which have a lot of funding?

    • Maxious 6 minutes ago | parent

      yep, https://github.com/poad42/cuda-fp8-ampere - a recent attempt at squeezing whatever's left from Ampere

  • semiinfinitely an hour ago

    Tri Dao isn't on the paper - is it even allowed to call it "FlashAttention"???

  • saagarjha an hour ago

    Less annoying link directly to the paper: https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=...

    • SpaceManNabs an hour ago | parent

      Link if you don't want it to automatically download the file:

      https://dl.acm.org/doi/pdf/10.1145/3774934.3786425