This has been a feature of the Scalene Python profiler (https://github.com/plasma-umass/scalene) for some time (at this point, about 1.5 years) - bring your own API key for OpenAI / Azure / Bedrock, also works with Ollama. Optimizing Python code to use NumPy or other similar native libraries can easily yield multiple order of magnitude improvements in real-world settings. We tried it on several of the success stories of Scalene (before the integration with LLMs); see https://github.com/plasma-umass/scalene/issues/58 - and found that it often automatically yielded the same or better optimizations - see https://github.com/plasma-umass/scalene/issues/554. (Full disclosure: I am one of the principal designers of Scalene.)One major limitation the LLM has is that it can't run a profiler on the code, but we can. (This would be a fun thing to do in the future - feed the output of perf to the LLM and say 'now optimize').
I think with new agent capabilities of AI IDEs like Cursor and Cline, you can simply instruct the agent to run a profiler in the shell and make optimizations based on the output of the profiler. You can even instruct the agent to try to make the program run faster than a certain time limit and will try to reach that goal iteratively.
This same approach can work for eliminating errors in LLM generated code. We've had good luck doing it with Lattice [1] going from 5% to 90% success rate of creating a fully functional application.
I'm drafting a blog post that talks about how but for now the documentation will have to do.
Correction: it’s been a feature for at least two years. To our knowledge, Scalene was the first profiler to incorporate LLMs.
There are many LLMs that have tools (function calling) available in their API. Some time ago, I was building AI agent that has profiler connected by function calling, and it works exactly as you described. First produce the code, then call profiler, and last step was to optimize code.
[dead]
> I think it's revealing how limited the LLMs were in algorithmically optimizing this problem vs approaches like "add some parallelism" or "use numpy", for which they were quite decent in Python and somewhat less decent in Rust. It's surprising to me that unless I prompted it overly-specifically, GPT 4o didn't suggest using a faster rand implementation, as that was a pretty effective, easy step.
As a human performing optimization one of the steps might be to generate a list of candidate optimizations. After that you might explore each idea in parallel. By using the LLM iteratively, you're side stepping that approach and constraining the LLM to instead picking a certain optimization first and then build on that. So one cure might be to adjust your prompt and iteration strategy to take into account that there are multiple candidate solutions rather than just one. This leads to more of an agentic / chain approach than a completions approach to the task though.
The other missing part of this is that each time you optimize the code (even if it's a failure), you learn something more about the optimization. Again this context seems missing when the LLM looks at the code (unless that history is provided in CoPilot completions automatically - it probably depends on what approach you're taking there).
I wonder if there are any AI tools out there doing context + planning + history + RAG + test + evaluation + tool use + feedback approaches. I've seen various tools do 2-3 of those. It's not a stretch to think that we might eventually see tools that start to emulate more of the developer inner loops.
> I wonder if there are any AI tools out there doing context + planning + history + RAG + test + evaluation + tool use + feedback approaches. I've seen various tools do 2-3 of those. It's not a stretch to think that we might eventually see tools that start to emulate more of the developer inner loops.
Neat.
For reference, here is (who I presume is) Mr Anderson's top-voted reply on the HN thread discussing the original article:
https://news.ycombinator.com/item?id=42585375
Aside: the original question begins "Given a list of 1 million random integers...". For me, this suggests that we ought to be benchmarking from the point after the random numbers have been generated. I don't know how easy this benchmarking is to do in Rust, but I know how I would do it in dotnet using something like BenchmarkDotNet.
Yup, that was my comment / I'm the author of the blog post. The discussion here inspired me to do the study more carefully and turn it into a longer post.
I think it's fair to leave the random generation in or out, really, as long as we're still comparing apples to apples. But I agree with you that it probably would have made the benchmarking more clear had I left them out.
i agree that excluding the test data generation from the test should have been required. additionally, the original task should have been formulated much more general, i.e:
then define the parameter values for testing and performance evaluation. that should have prevented at least some of the proposed compile time optimizations.Given a function that takes a list of n random integers between x and y, find the difference between the smallest and the largest numbers whose digits sum up to m.
edit: i'm aware though that the performance characteristics might be very different depending on the value of "n", especially when it comes to multithreading, so there should be a clause for that (e.g. n usually > 1m).
I feel that a lot of the "here's what LLMs can't do, and why humans are still king" posts are missing:
What kind of humans are we comparing LLMs against?
Junior devs? Intermediate devs? Senior devs? Experts?
I think few jr. devs would be able to apply the optimizations that were in the blog post. So maybe already LLMs are performing on a level that is beyond human jr. devs, and quickly catching up with the next tier.
For me, the show stopper is that every single token is suspect. Adding layers of more of it compounds this suspicion, leading to fatigue and automation bias, amplifying the speed of error or technical debt while possibly even acting against the understanding and control of stakeholders. They are actually in my mind far above human skill, actually, and far below basic microbes in basic existing and engagement of course. I think the true societal task now is to give names to these non-alive qualities and risks, the echo chamber aspects, what the total information of a model plus a generated result is, and things like X/Y problems and unknown unknowns.
> LLMs are performing on a level that is beyond human jr. devs, and quickly catching up with the next tier.
I am a little more pessimistic. I would say that LLMs have been hovering above junior for a year or two, with some progress towards intermediate in the context of a few methods/classes, but not much evidence of the wider architectural/systems-level thinking that I'd expect from an intermediate.
It's a good question. I want to think that most of the students we graduate could do this (once they learned Rust, as it's not part of our curriculum, but they could do it in C or Java or something). But our (CMU CS) students get a fair bit of algorithmic experience, and I'm not sure how you're defining "junior dev" (and the experience of a junior dev probably varies widely given the many different paths to get there).
CMU is a top school in CS, like tied for #1 with MIT. Probably puts out 99th percentile junior devs.
I think a lot of your posts are missing that someone's got to tell LLMs what to do.
What kind of humans are doing that?
How much more productive is it making them?
I think there is a much simpler explanation behind this: training data.
Llms have had truck load of training on normal code than a super optimized ones.
That's why you will see it struggle with nginx configs. And on newer technology where there is less amount of data available.
If they trained an LLM on demoscene entries, maybe the results could be better, but I doubt by much.
I just upgraded by AI optimised graphics driver and now every time I launch COD every map is a psychedelic rave trance disco. Help!
I can help, but we gotta get through the 20 minutes of scrolling greets first.
A lot of demoscene optimizations are sacrificing correctness for speed, and thus not really applicable in most domains.
Refining and adding to the text and code they train on is a major area of focus at the moment. Things will definitely get better, but it will take time.
This is the key question... "What is the total information in this model and interaction?"
Nice case study of how far LLMs are from AGI. I gave it to o1 instead of 4o, testing whether the thinking tokens help. Alas, they don't. It never came up with the "obvious" "skip-if-between" improvement over the naive algorithm.
The skipping was obvious to me at first glance, before I read the full blog post. So o1 failing it even after prodding for algorithmic improvements is a sign it's not that smart (in an AGI sense).
Note: the initial problem statement is ambiguous. Instead of saying "Given a list of random integers" - which the LLM somewhat reasonably interpreted as parsing from stain - you should phrase it as "Generate a list of random integers and then..."
Link to ChatGPT session: https://chatgpt.com/share/677d0665-3f9c-8005-aaf8-662dec923d...
- [deleted]