Tanstack Start | Konwinski Prize

Konwinski Prize(andykonwinski.com)

146 points by tosh 7 months ago | 48 comments

neonate 7 months ago
https://twitter.com/andykonwinski/status/1867015050403385674
https://kprize.ai/
thih9 7 months ago
The Kaggle competition page has more details: https://www.kaggle.com/competitions/konwinski-prize
The prizes scale with the model’s score; the total prize pool is between $100,000 and $1,225,000, depending on the top scores.
- optimalsolver 7 months ago |parent
  >$1M for the AI that can close 90% of new GitHub issues
  If your AI can do this, it's worth several orders of magnitude more. Just FYI.
  - andyk 7 months ago |parent
    (reposting from locallama and lower down here) yep that's true.
    one of my goals is to inspire and honor those that work on open source AI. Those people tend to be motivated by things like impact and the excitement of being part of something big. i know that's how i always feel when i'm around Berkeley and get to meet or work with OG BSD hackers or the people who helped invent core internet protocols.
    those people are doing this kind of OSS work and sharing it with the world anyway, without any cash prize. i think of this as a sort of thank you gift for them. and also a way to maybe convince a few people to explore that path who might not have otherwise.
  - senko 7 months ago |parent
    And Linux kernel, curl, SQLite and many other open source software are worth infinitely more than the purchase price.
    Also, you cut off the "from the benchmark" part; this doesn't expect it to solve any random Github issue, just the ones from the (presumably manually vetted and cleaned up) bench dataset.
    - minimaxir 7 months ago |parent
      Linux kernel, curl, and SQLite don't require significant compute cost to develop that put it out of reach of hobbyists, and only in the reach of organizations expecting a positive ROI.
      - senko 7 months ago |parent
        The cost of Linux kernel development alone has been estimated at a few $B (https://dwheeler.com/essays/linux-kernel-cost.html), current figure is probably over tens of billions.
        Also, the prize doesn't require you to train a new foundational model, just that whatever you use is open weights or open source.
        Theoretically, might be get away with a Llama3.3 (or any other model which you think makes sense) with a cleverly designed agentic system and a fresh codebase-understanding approach, with minimal compute cost.
        (ok, probably not that easy, but just saying there's much more to AI coding that the underlying model)
        bruce511 7 months ago |parent
        >> The cost of Linux kernel development alone has been estimated at a few $B (https://dwheeler.com/essays/linux-kernel-cost.html), current figure is probably over tens of billions.
        I followed your link, but it doesn't seem to bear out upur assertion. The two numbers mentioned in the article are 176 mil and 612 mil. Mind you those weren't an estimate of cost, but rather an estimate to replace. Article is dated 2004, with an update in 2011.
        Using the lines-of-code estimation it crossed a billion in 2010 - again to replace. That has no relation to what it did actually cost.
        Getting from there to "tens of billions" seems a stretch. Assuming a bottom value in your estimate of 20 billion, and assuming a developer costs a million a year, that's 20 000 man-years of effort. Which implies something like 2000 people (very well paid people) working continuously for the last decade.
        Which seems, well, unlikely.
        senko 7 months ago |parent
        > The two numbers mentioned in the article are 176 mil and 612 mil.
        Those two numbers are from the intro. The postscript and the updates at the end mention $1.4b and $3b respectively.
        The real cost is probably impossible to calculate, but that order of magnitude is a reasonable estimate IMHO, and absolutely comparable, or even larger, than compute costs for SOTA LLMs
        olddustytrail 7 months ago |parent
        There are around 5000 active kernel devs, they are generally highly skilled and therefore highly paid, and they've been working for a lot longer than 10 years.
        So doesn't seem that unlikely based on your estimates.
        stevage 7 months ago |parent
        Highly paid like a million a year? Is that a thing?
        senko 7 months ago |parent
        Linux kernel has been in development since the nineties, not just for the last ten years. Also 5000 contributors is a lot more than 2000 from gp's comment.
        Let's ignore the years before dotcom boom since the dev community was probably much smaller, and assume an average of 3500 contributors since.
        That's 25 years * 3500 contributors on average * 200k salary (total employee cost, not take home) = $17.5b
        Napkin math, but order of magnitude checks out.
  - frgtpsswrdlame 7 months ago |parent
    If you're the only one that can come close. Kaggle competition prizes are about focusing smart people on the same problem. But it's very rare for one team to blow all the others out of the water. So if you wanted to make a business out of the problem kaggle will (probably) show the best you could do and still have no moat.
  - toast0 7 months ago |parent
    Can't the stale bot already do this?
  - segmondy 7 months ago |parent
    Exactly, I'll personally buy it for $2Million for anyone that can get it and assign me the full code/weight and rights.
    - andyk 7 months ago |parent
      I hope the competition will inspire people to make breakthroughs in the open, so I won't take any rights to the IP, instead the winning solutions must use open source code and open weight models.
    - noch 7 months ago |parent
      > I'll personally buy it for $2Million for anyone that can get it and assign me the full code/weight and rights.
      If you are serious, you should put the funds in an escrow contract and announce a bounty.
      There are many brilliant people who would work on this for you.
      - 7 months ago |parent
        [deleted]
  - thrance 7 months ago |parent
    It's 90% of a selection of new GitHub issues, we don't know about the complexity of these. I don't think they'd ask the AI for a giant refactoring of the codebase, for example.
xianshou 7 months ago
SWE-bench with a private final eval, so you can't hack the test set!
In a perfect world this wouldn't be necessary, but in the current research environment where benchmarks are the primary currency and are usually taken at face value, more unbiased evals with known methodology but hidden tests are exactly what we need.
Also one reason why, for instance, I trust small but well-curated benchmarks such as Aider (https://aider.chat/docs/leaderboards/) or Wolfram (https://www.wolfram.com/llm-benchmarking-project/index.php.e...) over large, widely targeted, and increasingly saturated or gamed benchmarks such as LMSYS Arena or HumanEval.
Goodhart's law is thriving and it's our duty to fight it.
Upvoter33 7 months ago
Very cool to see "outcome oriented" prizes like this -- it's another way to fund research, perhaps. Will be curious to track who does this and whether success in the prize correlates with deep innovation ...
andyk 7 months ago
andy here - happy to answer questions.
Also, I answered a bunch of questions yesterday on LocalLLaMA that people here might find interesting https://www.reddit.com/r/LocalLLaMA/comments/1hdfng5/ill_giv...
minimaxir 7 months ago
The author posted about the original tweet announcement a couple days ago: https://news.ycombinator.com/item?id=42413392
In reponse to my comment of "Realistically, an AI that can perform that well is worth a lot, lot more than $1M.", he said:
> yeah i agree. one of my goals is to inspire and honor those that work on open source AI.
> people who work on open source tend to be motivated by things like impact and the excitement of being part of something bigger than themselves - at least that's how i always feel when i'm around Berkeley and get to meet or work with OG BSD hackers and people who helped invent core internet protocols or the guys who invented RISC or more recently RISC-V
> those people are going to do this kind of OSS work and share it with the world anyway, without any cash prize. i think of this as a sort of thank you gift for them. and also a way to maybe convince a few people to explore that path who might not have otherwise.
cs702 7 months ago
Fabulous. A big round of applause for Andy Kowinski and the SWE-bench folks!
stevage 7 months ago
Man, imagine having a million bucks to just give away to something you think is cool.
kenjackson 7 months ago
Surprised to see Amazon Q Developer already at 55% on the verified suite.
But what I appreciate even more is that we keep pushing the bar for what an AI can/should be able to do. Excited to track this benchmark over time.
7 months ago
[deleted]
Mistletoe 7 months ago
Idk how accurate net worth predictors on the web are but it says his net worth is $20 million. Is this from his personal funds?
- SamvitJ 7 months ago |parent
  He is a Databricks co-founder: https://www.forbes.com/sites/kenrickcai/2021/05/26/accidenta...
  Likely worth at least $500M, even if his stake was smaller than some of the other co-founders.
- andyk 7 months ago |parent
  yes the prize money is from me to the winners
vouaobrasil 7 months ago
This is why AI advances so quickly. There are easy economic mechanisms to encourage it, while AI safety laws have to go through an arduous process. Seems rather lopsided when the technology can be potentially dangerous. We should have mechanisms to take a step back and examine this stuff with more caution, mechanisms which have equal force to economic force but we don't. The Amish have a much better model.
- MichaelZuo 7 months ago |parent
  Who gets to define ‘potentially dangerous’?
  Isn’t that the core issue in the first place?
  - vouaobrasil 7 months ago |parent
    We should have a discussion and debate about it. The point is, IF people decide that it IS dangerous, then there is no mechanism to stop it.
    - MichaelZuo 7 months ago |parent
      Huh? There are well known mechanisms after that point. Such as a resolution of the UN Security Council.
      - vouaobrasil 7 months ago |parent
        Really? Let's say the bottom 70% of earners in the U.S. decided that AI was dangerous and its development should be stopped. Do you think the top 30% would allow that?
        MichaelZuo 7 months ago |parent
        How does this relate to the Security Council or any other known mechanisms that operate in the world?
        vouaobrasil 7 months ago |parent
        Because those mechanisms don't work to stop most of the dangerous economic activities.
        MichaelZuo 7 months ago |parent
        They would work as mechanisms after the point when the world’s nations agree to it… Like I said…?
        Clearly the world’s nations aren’t guaranteed to share your views…
        vouaobrasil 7 months ago |parent
        My entire point is that we don't have mechanisms to protect the people. I was not referring to the power structure of the nations, or those with the most money.
        MichaelZuo 7 months ago |parent
        This seems to be going in circles…
        Who gets to define “protect the people”…?
        vouaobrasil 7 months ago |parent
        Perhaps we should hold a referendum: destroy AI?
        MichaelZuo 7 months ago |parent
        Who has the authority to hold any referendum for the entire planet?
        Dodging the core issue over and over again isn’t going to lead anywhere…
theogravity 7 months ago
What would be an example of cheating since it says "no cheating".
- NitpickLawyer 7 months ago |parent
  The only reasonable way to cheat on this would be to find real bugs in many repos, train your models on the solutions, wait till the cut-off period, report those bugs, propose PRs and hope your bugs get selected. Pretty small chances, tbh and probably not worth the rewards (the 90% solve rate is pretty much impossible given the constraints - 4x L4s and ~4-6min / problem. There's no way any models that can be ran on those machines under those time limits are that much better than the SotA frontier models)
- andyk 7 months ago |parent
  That has a double meaning - half tongue in cheek.
  1) since we are creating a contamination-free version of SWE-bench (i.e. scraping a new test set after submissions are frozen) it is guaranteed that agents in this contest can't "cheat", i.e., models can't have trained on the benchmark / agents cant memorize answers.
  2) as a general rule in life, don't cheat on things (not that there aren't exceptions)
- zamadatix 7 months ago |parent
  The "when they can’t cheat" comment relates to the "Why make a contamination free version?" section.
- tm-infringement 7 months ago |parent
  Spearfishing the organizers to get access to the shiny fresh dataset, or using some variation on wrench-based social engineering [0].
  0. https://xkcd.com/538/
  - alwa 7 months ago |parent
    Does this work if you have to finalize your submission before said organizers harvest shiny fresh dataset into existence?