I’m beginning to see LLMs more as a solution to the code reuse problem: they can implement anything that’s common enough to appear a number of times in their training data.
If you’re hoping they will soon implement entirely novel complex systems based on a loose specification I think you’ll be disappointed.
This kind of begs the question though in the “there’s nothing new under the sun” sense. Maybe your particular task is novel, but are its component parts and algorithms novel? In some cases, perhaps some of them are, but I think that’s more the exception than the rule.
LLMs aren’t just tape recorders. They can mix and match and adapt. What percentage of the code you write do you think is truly original?
I just don't know if this is true.
The music programming language Csound is a good test.
There is not a huge amount of Csound code online, but what is online generally works.
The models are just terrible with Csound. They cannot mix and match at all. They can't abstract at all.
An even better test is to have the LLM try to make a piece of music in a certain style in csound. That is when you can see that the model is not abstracting the concepts at all.
Zero. None. Nada.
Of course, it gives the probabilistic next-token prediction, but because none of this is in the training data, the most probable output is basically noise/nonsense.
In general, I think we highly underestimate how hard it is to come up with things that are not in the training data.
It's frustrating whenever folks throw up proof complexity as the reason LLMs can't work. If most programs most people want to write map onto predictable & verifiable abstractions, or if we recognize that almost no software is verified to begin with, then the world is already moving on irrespective of personal hobby horses here. Shocker: much of the world we interact with every day already runs on PHP, JavaScript, and untyped Python that is not verified, not type checked, and has repos overflowing with bugs and CVEs.
Prof. Dawn Song has IMO been articulating a more productive view. LLMs are generating half the new code on GitHub anyway, so lean in: use this as an opportunity to make it easy for new code to use formal methods where before it would have been too hard. Progress will happen either way, and at least this way we have a shot at bringing verifiability into more user code.
I don't think this is in direct contradiction with the article. This was written in response to ideas like:
(a) the machine is so powerful that you can just tell it what to do and it does it with no supervision
or
(b) we can ditch programming languages and just use English as the "source of truth", like you would read, modify and check in the source code rather than compiled object code.
(a) is very much yet to be seen, and (b) is just monumentally stupid. Productive uses of LLMs for code are either as a super-autocomplete or as a pair programmer, which doesn't make programming languages obsolete, much less computer science education as a whole. This is even called out in TFA:
'''Even if chatbots should become a useful tool for programmers, their existence cannot mean the end of programming. It may well be the case that this combination of a built-in repository of code examples and the ability to easily retrieve code examples based on it can be used for generating ideas and for forms of "pair programming" as recently studied by Bird et al. [2]
But this is not program synthesis, rather it gives software using LLMs the role of more modest and possibly useful tools. We are now seeing a flurry of interest in combining traditional approaches to program analysis with LLMs, and if anything, this points to the limitations inherent in both approaches. [3,8]'''
I'm mostly responding to the middle third of the article, which IMO interprets the word "correct" in an academic way that is unanchored from how, to a rounding error, software is actually written:
> "Even simple versions of the approach are intractable in the very precise sense of computational complexity theory if we are concerned about generating code that is correct." (Then 5 paragraphs about that being pspace-complete.)
The notion of correctness used in the article is a verified property. My point is probably less than 1% of the programmers on this site have ever verified their code in that sense, especially since they left a university setting. Instead, most commercial developers write tests based on their domain knowledge as developers and some limited understanding of their company's business domain... if at all. Which LLMs can also do.
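As a toy illustration of that gap (the function, fee values, and tests below are all made up): a test suite built from domain knowledge can pass while the property a verifier would check still fails on an input nobody thought to test.

```python
# Toy sketch: "tested" vs "verified". The unit tests encode the developer's
# domain knowledge and both pass, yet checking the intended property over the
# whole (bounded) input space finds an input the tests never exercised.

def shipping_fee(weight_kg: int) -> int:
    """Flat fee of 5 under 10 kg, 9 from 10 kg up -- but the boundary is wrong."""
    return 5 if weight_kg <= 10 else 9      # bug: 10 kg should already cost 9

def test_light_package():
    assert shipping_fee(2) == 5

def test_heavy_package():
    assert shipping_fee(25) == 9

# "Verification" over a small finite domain: check the intended property everywhere.
violations = [w for w in range(0, 100) if (shipping_fee(w) == 9) != (w >= 10)]
print(violations)                           # [10] -- the case the tests missed

test_light_package(); test_heavy_package()  # both pass anyway
```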
>> Much of the world we interact with every day already runs on PHP, JavaScript, and untyped Python that is not verified, not type checked, and has repos overflowing with bugs and CVEs.
Yes, see, that's exactly the motivation for program synthesis (i.e. the automatic generation of programs that are correct with respect to some specification). Human programmers are perfectly capable of producing buggy code on our own. Nobody needs a system that can automatically generate bugs for the programmer to find and fix. We want something that can help us find bugs and fix them, or generate bug-free code automatically (if you can do one, you can do the other).
That is computationally hard, and it hasn't stopped being hard just because LLMs exist; complexity theory hasn't suddenly been rendered obsolete. Every attempt so far to get LLMs to produce correct code uses an LLM as a generator and then filters its output through a classical verifier, which is subject to all the complexity issues that arise everywhere in program synthesis (for an early exemplar of that, see DeepMind's AlphaCode).
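Schematically, that generate-and-filter pattern looks something like the sketch below. Everything here is a hypothetical stand-in: `sample_candidate` plays the role of the LLM and `passes_tests` plays the role of the classical checker; it is the shape of such systems, not anyone's actual code.

```python
# Minimal generate-and-filter loop: an "LLM" proposes candidates, a checker
# filters them. The sampling budget is the bottleneck; each sample is cheap,
# but the space of wrong programs is combinatorially large.
import random

CANDIDATE_POOL = [                      # stand-in for LLM samples: bodies for abs(x)
    "lambda x: x",
    "lambda x: -x",
    "lambda x: x if x >= 0 else -x",
]

def sample_candidate(rng: random.Random) -> str:
    """Pretend LLM call: returns one candidate program as source text."""
    return rng.choice(CANDIDATE_POOL)

def passes_tests(source: str, tests: list[tuple[int, int]]) -> bool:
    """The classical filter: run the candidate against I/O examples."""
    f = eval(source)                    # fine for a toy; never eval untrusted model output for real
    return all(f(x) == y for x, y in tests)

def generate_and_filter(tests, budget=100, seed=0):
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = sample_candidate(rng)
        if passes_tests(candidate, tests):
            return candidate
    return None                         # budget exhausted, no verified candidate

print(generate_and_filter([(-3, 3), (0, 0), (5, 5)]))   # finds the correct abs candidate
```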
The article's claim is that:
"Generating Correct Program Code Is Hard"
Emphasis mine, on "Correct". The claim is widely accepted, and I don't understand why it would have changed just because LLMs can generate program code. Note which word keeps going missing.
See my comment above: there is an academic use of terminology here that makes much of this discussion irrelevant. A big chunk of the article reduces to pedantic trolling by basing its arguments on that usage.
There is nothing academic about this. Program synthesis is hard for a very practical reason, the same very practical reason that you can't just brute-force your way through to an encrypted message: computational complexity is a bitch.
I appreciate that we have bigger computers now and more data and we can train large language models, but for some things it doesn't matter how big your computer is or how deep your model is: there are things you just can't do. E.g. when was the last time you asked an LLM to decrypt an encrypted message for you and it did it correctly? I don't mean ROT13, btw.
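For a sense of scale, here is the back-of-the-envelope arithmetic; the key size and guess rate are illustrative assumptions, not a claim about any particular system.

```python
# Why "just brute-force it" fails regardless of how guesses are produced.
# Assumes AES-128 and a very generous 10^12 key trials per second.
keys = 2 ** 128                       # size of the AES-128 key space
rate = 10 ** 12                       # guesses per second (optimistic)
seconds_per_year = 60 * 60 * 24 * 365

years = keys / (rate * seconds_per_year)
print(f"{years:.2e} years to exhaust the key space")   # ~1.08e+19 years
```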
I don't feel like you're reading what I'm writing.
I regularly use LLMs to work with security protocols and encrypted data, which are wrapped in APIs, and write new GPU code. Likewise, I regularly use LLMs to more quickly write large test suites.
I do not use LLMs to augment verification work, because our team does not formally verify things to gain belief of sufficient correctness. Likewise, I do not try to brute-force decrypt things, though I am surprisingly more lax about how we do some of this because of the headroom from GPUs.
The problem is the repeated use of weasel words like 'correctness', and even here your 'some things', that are doing a lot of work. That some tasks are hard, especially under poor environmental assumptions, does not mean they matter to most people who need to write most programs. There is a hard question here of whether most people who need to make programs need to care about these at all.
>> I don't feel like you're reading what I'm writing.
I can assure you that I am.
>> I do not use LLMs to augment my verification work, because our team do not verify things to gain belief of sufficient correctness.
I don't understand where the "verification" part came into the discussion. Who said anything about using LLMs "to augment my verification work"?
Are you perhaps confusing "program verification" with "program synthesis"? [edit: yes, reading your comment above https://news.ycombinator.com/item?id=42426184 it seems you are. Read on please, those are not the same thing]
Program verification means you already have a program and some specification it must satisfy, and you check whether the program satisfies the specification. Program synthesis means you don't have a program, only a specification, and you generate a program that satisfies the specification. These are two different, although related, tasks, and most program synthesisers will verify the correctness of candidate programs. But they're two different tasks, and btw neither can be done by LLMs. The article above is discussing the computational hardness of generating correct program code with LLMs, i.e. of program synthesis, not of verification.
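To put the distinction in code (toy Python types, purely illustrative of what each task is given):

```python
# Verification checks a given program against a spec; synthesis starts from
# the spec alone and has to find a program, typically by searching and
# verifying candidates. These are toy stand-ins, not a real system.
from typing import Callable, Iterable, Optional

Program = Callable[[int], int]            # toy stand-in for "a program"
Spec = Callable[[Program], bool]          # toy stand-in for "a specification"

def verify(program: Program, spec: Spec) -> bool:
    """Program verification: program and spec in hand, answer yes/no."""
    return spec(program)

def synthesize(spec: Spec, candidates: Iterable[Program]) -> Optional[Program]:
    """Program synthesis: only the spec in hand; search for a program that passes."""
    for p in candidates:                  # this search is where the hardness lives
        if verify(p, spec):
            return p
    return None

# Toy usage: spec = "doubles its input on 0..4"; synthesis must find such a program.
spec: Spec = lambda p: all(p(i) == 2 * i for i in range(5))
found = synthesize(spec, [lambda x: x + 1, lambda x: 2 * x])
print(verify(found, spec))                # True
```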
So could you explain what you mean, please? You referred to verification, formal methods, etc before. What do you mean?
Also, could you avoid expressions like "weasel words" and so on? That's just adding noise. I can tell you're feeling hard done by, although I have no idea why and by whom, and I'm sure I have no relation to them so please tone it down and let's have an adult conversation.
The presumption of the article is that software correctness is the same as formal methods (into which I lump mainstream classical program-synthesis correctness), which is the same as some symbolic software satisfying some strong logical predicate, which, especially under general conditions, is PSPACE-complete.
And my repeat response is no, software correctness, e.g., formal software acceptance criteria, is typically something different: "Just" testing.
RE: Hardness, it comes from the verification step of most classical synthesizers that use iterative search-based solvers. If verification were 'easy', the problem would go away. Verifiers say 'yes', 'no', and, crucially, provide hints for 'try this next'.
RE: LLMs can't do it. As in my other thread, if I were still working in this area, I'd be very much working on the combination of the two, such as where I pointed below at work by Dawn Song and Terence Tao. (And some of our customer work on security policies and graph query synthesis may still lead us here.) There's a funny irony to statements like "LLMs can't prove...": the last decade's popular program synthesizers rely, in practice, on the impressive engineering tricks that go on inside SMT solvers to hit relevant scale.
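To make the "hints" part concrete, here is a minimal sketch of the counterexample-guided idea using the Z3 Python bindings (pip install z3-solver); the candidate and spec are toy stand-ins, not anything we actually ship.

```python
# The solver does not just say "no": when a candidate fails the spec, it hands
# back a counterexample input that a synthesizer can fold into the next round (CEGIS).
from z3 import Int, Solver, If, Not, sat

x = Int("x")
candidate = x                            # candidate implementation of abs(x): just return x
spec_holds = candidate == If(x >= 0, x, -x)

s = Solver()
s.add(Not(spec_holds))                   # ask: is there an input where the spec fails?
if s.check() == sat:
    print("counterexample:", s.model()[x])   # some negative x -- the hint for the next round
else:
    print("candidate verified against the spec")
```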
RE: Weasel words, that is pretty core to my criticism. By pointing out that some programs are currently tricky, especially under classical problem framings such as the above, the article exemplifies a pattern of pedantic argumentation. It hinges on a misunderstanding or misrepresentation of the task of writing acceptable software. I can't use the word "correct" here because of it.
Thanks for your reply but like I said above it's based on a misunderstanding, that program verification and Program Synthesis are one and the same. As far as I can tell the article is not trying to make that claim, either.
What the article does that perhaps causes confusion is that it uses the term "Program Synthesis" to refer to one kind of Program Synthesis, specifically, Inductive Program Synthesis. That is Program Synthesis from an incomplete specification, such as a set of examples of inputs and outputs of a target program. Deductive synthesis assumes a complete specification (a complete description of the target program, usually in some formal language).
Inductive synthesisers have to [1] search a large, combinatorial space of programs in order to find a program that satisfies a specification. This search itself has typically exponential time complexity and if it is coupled with a formal verifier then you have an at least NP [2] search on top of PSPACE verification. For this reason many inductive synthesisers don't use formal methods to verify candidate programs but instead check the output of a program's execution against I/O examples. That's very much like the software development process you describe, with informal tests such as unit tests. It's still very expensive because of the combinatorial hardness of the program search space.
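To see that blow-up concretely, here is a toy enumerative synthesiser over a made-up expression grammar, checked against I/O examples rather than a formal spec (illustrative only, not any real system):

```python
# Grow expression trees from a tiny grammar and test each one against I/O
# examples. Note how the candidate count explodes with expression depth.
from itertools import product

def grow(depth):
    """Enumerate expressions over {x, 1, +, *} up to the given depth."""
    if depth == 0:
        return ["x", "1"]
    smaller = grow(depth - 1)
    exprs = list(smaller)
    for op in ("+", "*"):
        exprs += [f"({a} {op} {b})" for a, b in product(smaller, smaller)]
    return exprs

def satisfies(expr, examples):
    return all(eval(expr, {"x": x}) == y for x, y in examples)

examples = [(0, 1), (1, 2), (3, 10)]     # intended target: x*x + 1
for depth in range(3):
    space = grow(depth)
    hits = [e for e in space if satisfies(e, examples)]
    print(f"depth {depth}: {len(space)} candidates, first hit: {hits[0] if hits else None}")
# candidate counts grow 2 -> 10 -> 210; the first satisfying program appears at depth 2
```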
That is what LLMs can't do: they can't do inductive program synthesis on the cheap. Generating programs with an LLM is subject to the same explosive combinatorial search as any other search-based approach: you're generating a large number of programs, until you find one that does what you want. It doesn't matter how you check whether you have the right program: the bottleneck is the search of a huge program space. On their own, of course, LLMs can generate correct programs only by chance, so you do need some kind of verification procedure to check their output, whether it's formal verification or "just" testing. It depends on what guarantees you want. Formal methods can give stronger guarantees while "just testing" can only tell you that a program "covers" its training examples.
I'll stop here because I don't want to write a huge comment but there's more things to say about how we ensure program correctness and what that even means in practice and in principle. I think I gave some hints above though.
_____________________
[1] Well, ish. There are alternatives, but not very well known.
[2] It depends on the approach.
Do you have any links for Prof Song talking or writing about this? Sounds interesting
Her keynote at last year's https://www.louie.ai/gen-ai-graph-the-planet/ explicitly ended on this
Basically, the way forward she's most excited about wrt AI x security, as a way to avoid devolving into AIs fighting AIs at runtime, is instead moving to infusing authoring time with formal methods. I think her references were to earlier verified C compiler extension frameworks, such as George Necula's various projects.
Interestingly, hints of LLMs-for-automating-proofs are in her older papers, e.g., https://arxiv.org/pdf/1806.00608 . Even Terence Tao is leaning in here now: https://www.youtube.com/watch?v=AayZuuDDKP0
There's a very hard cliff of complexity that you and your business will fall off of with this attitude, but ya don't have to take my word for it!
Not sure I understand
What % of programmers in your company write Coq proofs (or the equivalent) for what goes to production, vs. what % just do testing (unit, integration, etc.)?
For us, as a small team, we do 7-8 figures annually on tested but unverified GPU code, and our growth plan does not involve formal verification.
This article compares the hardness of program synthesis to the utility of chatbots to show chatbots fall short.
The difference is we don’t expect chatbots to be 100% right but we do expect program synthesis to be 100% right. For chatbots, 99% is amazing in terms of utility. That other 1% is really hard to get.
Given the limitations of English for robustly specifying a program, which is why program synthesis requires its constraints to be formal specifications, the author is committing a category error in comparing two incommensurable solutions to two distinct problems.
> For chatbots, 99% is amazing in terms of utility.
Is it really, though? If it's 99% of generated tokens, then 1 out of 100 being wrong is not great for code generation, since you're often going to generate more than that. But let's suppose it's 1 in 100 whole functions or programs that's wrong. Unless there's a way to automatically verify when a function is wrong, the error rate basically makes full automation, e.g. automatically fixing GitHub issues, out of the question.
So it seems like at 99% we're left with a system that requires constant vigilance from the user to check its work. This is a far cry from the promise of true program synthesis.
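Rough, back-of-the-envelope numbers (assuming, purely for illustration, that "99%" means each generated token is independently right with probability 0.99):

```python
# If every token must be right for the program to be right, whole programs
# are wrong far more often than 1% of the time.
for tokens in (10, 100, 500, 1000):
    p_all_correct = 0.99 ** tokens
    print(f"{tokens:5d} tokens: P(every token right) = {p_all_correct:.3f}")
# 10: 0.904, 100: 0.366, 500: 0.007, 1000: 0.000
```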
What would be amazingly useful is a 99% reliable system that can tell you when it can't answer the question correctly instead of providing a wrong answer.
If you expect to be able to prompt a chatbot and then not TEST that the code it wrote for you works, you're going to have a bad time.
If you're willing to put in that QA work yourself, the productivity boost you can get from them is immense.
And honestly, working with the best human programmer in the world won't guarantee that the software they produce is the exact software that you wanted. Details are always lost in the communication between the person who wants the software and the person who builds it.
What matters most is the dialog and speed of iteration that you can achieve to converge on the right product.
I think the reason that I haven't felt the productivity boost is because my work is mostly debugging and fixing existing code in a complicated codebase. Rarely is it writing new greenfield code. LLMs have so far struggled to be useful in this setting as it's more like the QA part of the workflow you're suggesting.
> If you're willing to put in that QA work yourself the productivity boost you can get them from is immense.
Isn't that work more than half the work of programming? I seem to find myself spending much more time on test code than on application code.
>> If you expect to be able to prompt a chatbot and then not TEST that the code it wrote for you works, you're going to have a bad time.
Then what do you need the chatbot for? I have to do that with my own code all the time anyway. What's the benefit?
Well... how do humans synthesize programs, then? We don't do exponential search over some symbols. We "somehow know" which branches get us closer to a program and which don't. It's foolish to believe it is impossible to teach an AI to "somehow know" this too. Also: for some reason the bar is always higher for AI. Next to no human code is verifiably correct, and the code actually verified to be correct is an even smaller subset.
I have always believed difficulty is inherent to the problem, regardless of whether the solver is a machine or a living organism. Maybe I am tainted in the same way, since I studied complexity theory.
Tl;dr (a) program synthesis is provably hard and (b) English is too ambiguous as a specification language. Therefore we’re nowhere close.
Ultimately not a convincing argument. We could use this same argument to argue that a human cannot write a program to spec. That may be strictly true, but not interesting.
The argument goes like this: a car will never be able to do all the things a horse can do.
Seems to me the car is the AI and the horse is the human.
Seems more like the AI is the Roomba and the human is the person using a vacuum cleaner. The Roomba can do the easy bits that are well-defined.