My experience with using AI tools for code review is that they do find critical bugs (from my retrospective analysis, maybe 80% of the time), but the signal to noise ratio is poor. It's really hard to get it not to tell you 20 highly speculative reasons why the code is problematic along with the one critical error. And in almost all cases, sufficient human attention would also have identified the critical bug - so human attention is the primary bottleneck here. Thus poor signal to noise ratio isn't a side issue, it's one of the core issues.
As a result, I'm mostly using this selectively so far, and I wouldn't want it turned on by default for every PR.
> but the signal to noise ratio is poor
Nail on the head. Every time I've seen it applied, it's awful at this. However, this is the one thing I loathe in human reviews as well, where people leave twenty comments about naming and the actual FUNCTIONAL issue is buried inside all of that mess. A good code reviewer knows how to drop all the things that irk them and hyperfocus on what matters, if there's a functional issue with the code.
I wonder if AI is ever gonna be able to conquer that one, as it's quite nuanced. If it does, though, then I feel the industry as it is today is kinda toast for a lot of developers, because outside of agency, this is the one thing we were sorta holding out on as not very automatable.
At my last job, code review was done directly in your editor (with tooling to show you diffs as well).
What this meant was that instead of leaving nitpicky comments, people would just change things that were nitpicky but clear improvements. They'd only leave comments (which blocked release) for stuff that was interesting enough to discuss.
This was typically a big shock for new hires who were used to the "comment for every nitpick" system; I think it can feel insulting when someone changes your feature. But I quickly came to love it and can't imagine doing code review any other way now. It's so much faster!
I'm not sure how to tie this to AI code review tbh. Right now I don't think I'd trust a model's taste for when to change things and when to leave a comment. But maybe that'll change. I agree that if you automated away my taste for code it'd put me in a weird spot!
What if you have two people with different ideas of how to name a certain variable and they just flip the name back and forth every release?
I like this review method too though, and like that some pr review tools have a 'suggest changes' and 'apply changes' button now too
> What if you have two people with different ideas of how to name a certain variable and they just flip the name back and forth every release?
Fire both. There is no amount of skill and productivity that can justify that amount of pettiness.
I think it's a good idea to have a style guide of sorts that you can point to when people sweat the small stuff.
> What this meant was that instead of leaving nitpicky comments, people would just change things that were nitpicky but clear improvements. They'd only leave comments (which blocked release) for stuff that was interesting enough to discuss.
This is my dream; have only had a team with little enough ego to actually achieve it once for an unfortunately short period of time. If it's something that there's a 99% chance the other person is going to say 'oh yeah, duh' or 'sure, whatever' then it's just wasting both of your time to not just do it.
That said, I've had people get upset over merging their changes for them after a LGTM approval when I also find letting it sit to be a meaningless waste of time.
If you are nitpicking style or conventions that do not have rules in your linting tools, then those should automatically be non-issues, IMO.
Naming comments can be very useful in code that gets read by a lot of people. It can make the process of understanding the code much quicker.
On the other hand, if it's less important code or the renaming is not clearly an improvement, it can be quite useless. But I've met some developers who have the opinion that reviews are pointless and just say "this works, just approve it already", which can be very frustrating in a codebase with a lot of collaboration.
Naming comments are useful when someone catches something like:
1. you are violating a previously agreed upon standard for naming things
2. inconsistent naming, eg some places you use "catalog ID" and other places you use "item ID" (using separate words and spaces here because case is irrelevant).
3. the name you chose makes it easy to conflate two or more concepts in your system
4. the name you chose calls into question whether you correctly understood the problem domain you are addressing
I'm sure there are other good naming comments, but this is a reasonable representation of the kinds of things a good comment will address.
However, most naming comments are just bike shedding.
If the person reading the code doesn't quickly understand what's going on from the name or finds the name confusing, the name is poor and should be changed. It is way too easy for the author to be caught up in their mental model and to be unaware of their implicit assumptions and context and choose a name that doesn't make sense.
The bigger problem is people who feel ownership of shared codebases tied to their ego and who get angry when people suggest changes to names and other bits of interfaces instead of just making the suggested change.
If you get code review feedback, the default answer is "Done" unless you have a strong reason not to. If it's not obvious whether the name suggested by the author or the reader is better, the reader's choice should be taken every time.
> Naming comments can be very useful in code that gets read by a lot of people. It can make the process of understanding the code much quicker.
Yes, but it can have severely diminishing returns. Like, let's step back a second and ask ourselves if:
var itemCount = items.Count;
vs
var numberOfItems = items.Count;
is ever worth spending the time discussing, versus how much of a soft improvement it makes to the code base. I've literally been in a meeting room with three other senior engineers killing 30 minutes discussing this and I just think that's a complete waste of time. They're not wrong, the latter is clearer, but if you have a PR that improves the repo and you're holding it back because of something like this, then I don't think you have your priorities straight.
Sorry for the dumb question, is the second version actually better than the first? Because I prefer the first. But perhaps you chose this as a particularly annoying/unuseful comment
I personally don't give a shit either way, but I've worked in dev shops with a clear preference for the second one. I can see their point, because the code parses better as natural language, but I don't think it's strong enough to care about.
The sort of place that is fussy about test naming, so where I would do something like:
TestSearchCriteriaWhere
they'd want
Test_That_Where_Clauses_In_Search_Criteria_Work
I think it's a waste of typing but idk, I'm willing to let it slide because I think it's a pointless hill to die on.
Let's take it up a notch!
    var itemCount = items.Count;

depends on what `items` is, no? Is the `.Count` O(1)? Do you really need a variable or is it fine for the (JIT) compiler to take care of it? Is it O(n) and n is significant enough? Maybe you need a variable and spend time arguing about that name. Yes, I chose this because almost everyone I know, at least, would argue you always have to create the variable (and then argue about the name) ;)
> fussy about test naming

I get fussiness about test naming. I believe that a good test "name" should tell you enough for you to be able to "double check" the test setup as well as the assertions against the test name, with some sort of "reasonable" knowledge of the code/problem domain. As such, both of those test names are really bad, because they can't tell you anything at all about whether you're testing for the correct thing. How do I know that your assertions are actually asserting that it "works"?
Instead, I'd want a test named something like this (assuming that that's what this particular test is actually about - i.e. imagine this particular test in the context of a user defined search, where one of the options is that they can specify a project to search by and this particular test is about verifying that we check the permissions the user has for said project. There would be different tests for each of the relevant where clauses that specifying a project in the search params would entail and different tests again for each of the other user specifiable parameters that result in one or more where clauses to be generated):
    shouldCheckProjectPermissionsWhenProjectIdInSearchParams()

Every single test case gives you the ability to specify both a good name and clear, concise test assertions. If I see anything but a bunch of assertions related to project permissions for the logged-in user in this test, I will fight you tooth and nail on that test ;) I couldn't care less tho if you use camelCase or snake_case or whatever, I just had to choose something to post. I also couldn't care less if you had 17 different assertions in the test (we all know that "rule", right?). I think the "test one thing" / "one assertion" rule is not about the actual number of "assert statements"; people who think that got the rule wrong. It's all about the "thing" the assertions test. If you have 17 assertions that are all relevant to testing the project permission in question, then they're great and required to be there. If 1 is for asserting the project permissions and the other 16 are repeating all the other "generic assertions" we copy-pasted from previous tests, then they're not supposed to be there. I will reject such a PR every time.
If I was going to nitpick it I would point out that `itemsCount` could easily be confused with `items.Count`, or vice versa, depending on syntax highlighting. That kind of bug can have a negative impact if one or the other is mutated while the function is running.
So clearly distinguishing the local `numberOfItems` from `items.Count` _could_ be helpful. But I wouldn't ping it in a review.
They're both equally bad to me; I don't see the improvement over just using items.Count. I may be nitpicking a toy example though.
I think in this case itemCount had application in a couple of conditions later in the function, so there was value in extracting the count. In my recollection I might be missing some nuance; let's say for the sake of argument it was:
var relevantCount = items.Count(x => x.SomeValue > 5);
vs
var numberOfRelevantItems = items.Count(x => x.SomeValue > 5);
so it wasn't necessarily cheap enough to want to repeat.
A lot of these comments are not pointing out actual issues, just "That's not how I would have done it" type comments.
And the most amazing part is that we got a mini PR review in the comments to a single line of code someone posted just to show an example of useless debates :D
Yeah, or worse, like my boss. We don't have a style guide, but he always wants style changes in every PR, and those style changes are sometimes contradictory across different PRs.
Eventually I've told him "if your comment does not affect performance or business logic, I'm ignoring it". He finally got the message. The fact that he accepted this tells me that deep down he knew his comments were just bike shedding.
You should have a style guide, or adopt one. Having uniform code is incredibly valuable as it greatly reduces the cognitive load of reading it. Same reason that Go's verbose "err != nil" works so well.
Style guidelines should be enforced automatically. Leaving that for humans to verify is a recipe for conflict and frustration.
I've been in teams like this - people who are lower on the chain of power get run in circles as they change to appease one, then change to appease another then change to go back to appease the first again.
Then, going through their code, they make excuses about their code not meeting the same standards they demand.
As the other responder recommends, a style guide is ideal, you can even create an unofficial one and point to it when conflicting style requests are made
> Then, going through their code, they make excuses about their code not meeting the same standards they demand.
Yes!! Exactly. When it comes to my PRs, he once made this snarky comment about him having high expectations in terms of code quality. When it comes to his PRs, he does the things he tells me not to do. In fact, I once sent him a "dis u?" with a link to his own code, as a response to something he told me I shouldn't do. To his credit he didn't make excuses, he responded "I could've done better there, agreed".
In general he's not bad, but his nitpicking is bad. I don't really understand what's going on in his mind that drives this behavior, it's weird.
This is why you should set guidelines for reviews (like e.g. https://go.dev/wiki/CodeReviewComments), and ideally automate as much as possible. I'm guilty of this as well, leaving loads of nitpicky code style comments - but granted, this was before Prettier was a thing. In hindsight, I could've spent all that time building a code formatter myself lol.
Let me throw something out there: poor naming obscures and distracts from functional issues. You are right about a good reviewer, but a good author strives for clarity in addition to correctness.
As an aside, naming is highly subjective. Like in writing, you tailor naming to the problem domain and the audience.
Depends on what you're targeting
- If it's a rough PR, you're looking for feedback on direction rather than nitpicks.
- If it's in a polished state, it's good to nitpick assuming you have a style guide you're intending to adhere to.
Perhaps this can be provided in the system prompt?
Human comments tend to be short and sweet, like "nit: rename creatorOfWidgets to widgetFactory". Whereas AI code review comments are long-winded and not as precise. So even if there are 20 human comments, I can easily see which are important and which aren't.
We are using BitBucket at work and decided to turn on RovoDev as a reviewer. It absolutely doesn't do that. Few but relevant comments are the norm, and when we don't like something it says, we tell it in its instructions file to stop doing that. It has been working great!
My coworker is so far on this spectrum it's a problem. He writes sentences with half the words missing making it actually difficult to understand what he is trying to suggest.
All of the non-critical words in English aren't useless bloat; they remove ambiguity and act as a kind of error correction if something is wrong.
it "nit" short for nitpick? I think prefixing PR comments with prefixes like that is very helpful for dealing with this problem.
Yes, but I don't know how effective it is. 99% of the time someone leaves a 'nit' the other person fixes it. So we're still dealing with most of them like regular comments. Only once or twice I've been like "nah, I like my way better" but I can only do that if they also leave an LGTM. Sometimes they do. There's one or two people that will hold your code hostage until you reply to every little nit. At that point they don't feel like nits. I always LGTM if the code is functionally correct or if the build breaks in a trivial way (that would also block them from submitting). Then they can address my nits or submit anyway and I'm cool with that.
Yes it is. I've really enjoyed those conventions at places I've worked. It probably wouldn't be too hard to instruct AIs to use this format too.
At Augment Code we specifically built our code review tool to address the signal-to-noise ratio problem. In benchmarks, our comments are 2 to 3x more likely to get fixed compared to Bugbot, CodeRabbit, etc.
You should check it out at Augmentcode.com
That's not even mentioning that a not-insignificant part of the point of code reviews is to propagate understanding of the evolution of the code base among other team members. The reviewer benefits from the act of reviewing as well.
How is that different from today's static analysis (SA), like CodeQL and SonarQube? Most of the feedback is just sh*t and drives programmers towards making senseless "perfections" that just double the amount of work that has to be done later to toggle or tune behaviour, because the configurable variables are gone due to bad static code analysis. Clearly present intent and convenience, like making a method virtual, adding a public method, not making a method static when it is likely to use instance fields in the future - these good practices are shunned in all SA just because the rules are opportunistic, not real.
My experience at work: Claude regularly says to use one method over another, because it's "safer"... But the method doesn't actually exist in that language. Seems to get rather confused between C# and C++, despite also getting told the language, before and after getting handed the code.
I've only managed to use it as a linter-but-on-steroids because, where I'd normally page through the Ruby docs about enumerators to find the exact method that does what someone has implemented in a PR (because there's almost always something in there that can help out), I can instead prompt to look up a more idiomatic version of the implementation for the ruby version being used. It's easy to cross-check and it saves me some time.
It's not very good with the rest, because there's an intuition that needs to be developed over time that takes all the weirdness into account. The dead code, the tech debt, the stuff that looks fundamentally broken but is depended on because of unintended side effects, etc. The code itself is not enough to explain that, it's not a holistic documentation of the system.
The AI is no different to a person here: something doesn't 'feel' right, you go and fix it, it breaks, so you have to put it back again because it's actually harder than you think to change it.
My experience is similar. AI's context is limited to the codebase. It has limited or no understanding of the broader architecture or business constraints, which adds to the noise and makes it harder to surface the issues that actually matter.
It also acts mainly as an advanced linter. The other day it pointed out some overall changes in a piece of code, but didn't catch that the whole thing was useless and could've been replaced with an "ON CONFLICT DO UPDATE" in Postgres.
Now, that could happen with a human reviewer as well. But it didn't catch the context of the change.
> signal to noise ratio is poor
I think this is the problem with just about every tool that examines code.
I've had the same problem with runtime checkers, with static analysis tools, and now ai code reviews.
Might be the nature of the beast.
probably happens with human code reviews too. Lots of style false positives :)
It very much depends on the product. In my experience, Copilot has a terrible signal-to-noise ratio. But Bugbot is incredible. Very little noise, and it consistently finds things the very experienced humans on my team didn't.
The signal-to-noise ratio problem is unexpectedly difficult.
We wrote about our approach to it some time ago here - https://www.greptile.com/blog/make-llms-shut-up
Much has changed in our approach since then, so we'll probably write a new blog post.
The tl;dr of what makes it hard:
- different people have different ideas of what a nitpick is
- it's not a spectrum; the differences are qualitative
- LLMs are reluctant to risk downplaying the severity of an issue and are therefore unable to usefully filter out nits
- theory: they are paid by the token and so they say more stuff
I've been using it a bit lately and at first I was enjoying it, but then it quickly devolved into finding more different minor issues with each minor iteration, including a lovely loop of check against null rather than undefined, check against undefined rather than null etc.
I agree but find it's fairly easy noise to ignore.
I wouldn't replace human review with LLM-review but it is a good complement that can be run less frequently than human review.
Maybe that's why I find it easy to ignore the noise: I have it do a huge review task after a lot of changes have happened. It'll find 10 or so things, and the top 3 or 4 are likely good ones to look deeper into.
For the signal-to-noise reason, I start with Claude Code reviewing a PR. Then I selectively choose what I want to bubble up to the actual review. Oftentimes there's additional context not available to the model, or it's just nitpicky.
You should try Codex. There's a pretty wide gap between the quality of code review tools out there.
Agreed.
I have to constantly push back against it proposing C++ library code, like std::variant, when C-style basics are working great.
I absolutely hate the verbosity of AI. I know that you can give it context; I have done it, and it helps a little. It will still give me 10 "ideas", many of which are closely related to each other.
One thing I've found to be successful is to
1) give it a number of things to list in order of severity
and
2) tell it to grade how serious of a problem it may be
The human reviewer can then look at the top ten list and what the LLM thinks about its own list for a very low overhead of thinking (i.e. if the LLM thinks its own ideas are dumb a human probably doesn't need to look into them too hard)
It also helps to explicitly call out types of issue (naming, security, performance, correctness, etc)
The human doesn't owe the LLM any amount of time considering its output; it's just an idea-generating tool. A top-ten list formatted as a table can be scanned in 10 seconds in a first pass.
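For concreteness, the instruction I give looks roughly like this (the cap of ten and the exact categories are arbitrary placeholders, tune to taste):

    Review this diff. Report at most 10 findings, ordered by severity.
    For each finding give: severity (blocker / major / minor / nit),
    category (correctness, security, performance, naming, style), your
    confidence that it is a real problem, and a one-line description
    with file and line. Output a single table and nothing else.

The single-table constraint is what makes the 10-second scan possible.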
None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
I agree that none perform _super_ well.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
I fully agree. Claude's review comments have been 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude's good finds are orthogonal to those found by other human reviewers on our team. I.e. it points out things human reviewers consistently miss, and vice versa.
TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.
> To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
do you have a bot to do this too?
I like number of "great catches" as a measure of AI code review effectiveness
I mean, how far Rust's own Clippy lints went before any LLMs is actually insane.
Clippy + Rust's type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly reduced the bar for bringing Clippy-quality linting to every language, but at the cost of determinism.
People more often say that to save face by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment
When mature, functional adults say it, the read is "wow, I would have missed that, good job, you did better than me".
Reading embarrassment into that is extremely childish and disrespectful.
What I'm saying is that a corporate or professional environment can make people communicate in weird ways due to various incentives. Reading into people's communication is an important skill in these kinds of environments, and looking superficially at their words can be misleading.
That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything.
Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.
over 7 days does contextualize it some, though.
9,078 comments / 7 (days) / 8 (hours) = 162.107 though, so if it were a human, that person would be making 162 comments an hour, 8 hours a day, 7 days a week?
In some code that I was working on, I had:

    // stuff
    obj.setSomeData(something);
    // fifteen lines of other code
    obj.setSomeData(something);
    // more stuff

The 'something' was a little bit more complex, but it was the same something with slightly different formatting. My linter didn't catch the repeat call. When asking the AI chat for a review of the code changes, it did correctly flag that there was a repeat call.
It also caught a repeat call in:

    List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
    // ...
    var something = someFunc(objs);

    Thingy someFunc(List<Objs> param) {
        return param.stream().filter(o -> o.field.isPresent()). ...

Where one of the filter calls is unnecessary... and it caught that across a call boundary.

So, I'd say that AI code reviews are better than a linter. There are still things that it fusses about because it doesn't know the full context of the application and the tables that make certain guarantees about the data, or the code conventions of the team (in particular the use of internal terms within naming conventions).
I had a similar review by AI, except my equivalent of setSomeData was stateful and needed to be there in both places; the AI just didn't understand any of it.
When this happens to me it makes me question my design.
If the AI doesn't understand it, chances are it's counter-intuitive. Of course not all LLMs are equal, etc, etc.
Then again, I have a rough idea of how I could implement this check with some (language-dependent) accuracy in a linter. With LLMs I... just hope and pray?
I'd agree with that but in the JS world, there's a lot of questionable library designs that are outside of my control.
My reaction in that case is that most other readers of the codebase would probably also assume this, and so it should be either made clearer that it's stateful, or it should be refactored to not be stateful
I'd say I see one anecdote, nothing to draw conclusions from.
Why isn’t `obj` immutable?
Because 'obj' is an object that was generated from a JSON schema and pulled in as a dependency. The POJO generator was not set up to create immutable objects.
Unit tests catch that kind of stuff
The code works perfectly - there is no issue that a unit test could catch... unless you are spying on objects created internally by a method and verifying that certain functions are called some number of times for given data.
Sure and you can do that
Trying to write the easiest code that I could test... I don't think I can without writing an excessively brittle test that would break at the slightest implementation change.
So you've got this Java:
    public List<Integer> someCall() {
        return IntStream.range(1, 10).boxed().toList();
    }

    public List<Integer> filterEvens(List<Integer> ints) {
        return ints.stream()
            .filter(i -> i % 2 == 0)
            .toList();
    }

    int aMethod() {
        List<Integer> data = someCall();
        return filterEvens(data.stream().filter(i -> i % 2 == 0).toList()).size();
    }

And I can mock the class and return a spied List. But now I've got to have that spied List return a spied stream that checks to see if .filter(i -> i % 2 == 0) was called. But then someone comes along later and writes it as .filter(i -> i % 2 != 1) and the test breaks. Or someone adds another call to sort them first, and the test breaks.

To that end, I'd be very curious to see the test code that verifies that when aMethod() is called, the List returned by someCall() is not filtered twice.
What's more, it's not a useful test - "not filtered twice" isn't something that is observable. It's an implementation detail that could change with a refactoring.
Writing a test that verifies that filterEvens returns a list that only contains even numbers? That's a useful test.
Writing a test that verifies that aMethod returns back the size of the even numbers that someCall produced? That's a useful test.
Writing a test that tries to enforce a particular implementation between the {} of aMethod? That's not useful and incredibly brittle (assuming that it can be written).
You are correct and the objection is just completely invalid. There's no way anyone would or should write tests like this at the client level.
I think they are just arguing for the sake of arguing.
You mention the tools you can use to make it happen.
I think we're at the point where you need concrete examples to talk about whether it's worth it or not. If you have functions that can't be called twice, then you have no other option but to test implementation details like that.
Yeah, there's a tradeoff between torturing your code to make everything about it testable and enforce certain behavior, and keeping it simpler.
I have worked in multiple code bases where every function call had asserts on how many times it was called and what the args were.
You're verifying std lib function call counts in unit tests? lmao.
You can do that with mocks if it's important that something is only called once, or likely there's some unintended side effect of calling it twice and tests would catch the bug.
I know you could do it, I'm asking why on earth you would feel it's vital to verify stream.filter() was called twice in a function.
You're not verifying the observable behavior of your application? lmao
How would you suggest tests around:
    void func() {
        printEvens(someCall().stream().filter(n -> n % 2 == 0).toList());
    }

    void printEvens(List<Integer> nums) {
        nums.stream().filter(n -> n % 2 == 0).forEach(n -> System.out.println(n));
    }

The first filter is redundant in this example. Duplicate code checkers are checking for exactly matching lines. I am unaware of any linter or static analyzer that would flag this.
What's more, the unit test for printEvens (there is one) passes because it's working properly... and the unit test for the calling function passes because it is working properly too.
Alternatively, write the failing test for this code.
Idk how exactly to do it in C++ because I'm not familiar with the tooling.
You could write a test that makes sure the output of someCall is passed directly to printEvens without being modified.
The example as you wrote it is hard to test in general. It's probably not something you would write if you're serious about testing.
In C++, the code would look like:
    #include <vector>
    #include <iostream>
    #include <algorithm>
    #include <iterator>  // for std::back_inserter

    std::vector<int> someCall() {
        return {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    }

    void printEvens(const std::vector<int>& nums) {
        std::ranges::for_each(nums, [](int n) {
            if (n % 2 == 0) {
                std::cout << n << '\n';
            }
        });
    }

    int main() {
        std::vector<int> data = someCall();
        std::vector<int> tmp;
        std::ranges::copy_if(data, std::back_inserter(tmp),
                             [](int n) { return n % 2 == 0; });
        printEvens(tmp);
        return 0;
    }

Nothing in there is wrong. There is no test that would fail, short of going through the hassle of creating a new type that does some sort of introspection of its call stack to verify which function it's being called in.
Likewise, identify if a linter or other static analysis tool could catch this issue.
Yes, this is a contrived example and it likely isn't idiomatic C++ (C++ isn't my 'native' language). The actual code in Java was more complex and had a lot more going on in other parts of the files. However, it should serve to show that there isn't a test for printEvens or someCall that would fail because it was filtered twice. Additionally, it should show that a linter or other static analysis wouldn't catch the problem (I would be rather impressed with one that did).
From ChatGPT a code review of the code: https://chatgpt.com/share/69780ce6-03e0-8011-a488-e9f3f8173f...
A redundant filter() isn't observable (except in execution time).
You could pick it up if you were to explicitly track whether it's being called redundantly but it'd be very hard and by the time you'd thought of doing that you'd certainly have already manually checked the code for it.
what happened to not testing implementation details?
Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.
Now I'm getting a robot to review the branch at regular intervals and poking holes in my thinking. The trick is not to use an LLM as a confirmation machine.
It doesn't replace a human reviewer.
I don't see the point of paying for yet another CI integration doing LLM code review.
Exactly. This is like buying a smoothie blender when you already have an all-purpose mixer-blender. This whole space is at best an open-source project, not a (multiple!) whole company.
It's very unlikely that any of these tools are getting better results than simply prompting verbatim "review these code changes" in your branch with the SOTA model du jour.
I came to the same conclusion and ended up wiring a custom pipeline with LangGraph and Celery. The markup on the SaaS options is hard to justify given the raw API costs. The main benefit of rolling it yourself seems to be the control over context retrieval—I can force it to look at specific Postgres schemas or related service definitions that a generic CI integration usually misses.
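Stripped of the LangGraph/Celery wiring, the core of it is roughly this (a simplified sketch, not my actual pipeline; `call_llm` and the `db/schemas` path are placeholders):

    import subprocess
    from pathlib import Path

    def call_llm(prompt: str) -> str:
        # Placeholder: swap in whatever model client you actually use.
        raise NotImplementedError

    def gather_context(schema_dir: str = "db/schemas") -> str:
        # The part generic CI bots miss: hand the reviewer the Postgres
        # schemas / service definitions that the diff depends on.
        return "\n\n".join(p.read_text() for p in sorted(Path(schema_dir).glob("*.sql")))

    def review_branch(base: str = "main") -> str:
        diff = subprocess.run(
            ["git", "diff", f"{base}...HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout
        prompt = (
            "Review this pull request. Relevant schemas and service definitions "
            "follow, then the diff. Report only issues you are confident about, "
            "ordered by severity.\n\n"
            f"=== CONTEXT ===\n{gather_context()}\n\n=== DIFF ===\n{diff}"
        )
        return call_llm(prompt)

The queueing part is just fanning this out per PR; for me the value is almost entirely in what goes into gather_context.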
Personally I'm hoping that once the bubble bursts and hardware improvement catches up, we start seeing reasonable prices for reasonable models on SaaS platforms that are not scary for SecOps.
Not guaranteed though of course.
All those LLM wrapper companies make no sense.
Currently attempting to get GitLab Duo's review feature enabled as a 'second pair of eyes'. I agree 100% that it's not replacing a human review.
I would on the whole prefer a 'lint-style' tool to catch most stuff because they don't hallucinate.
But obviously they don't catch everything so an LLM-based review seems like an additional useful tool.
You’ve found the smoking gun!
AI code review to me is similar to AI code itself. It's good (and constantly getting better) at dealing with mundane things, like: is the list reversed correctly? Are you dealing with pointers correctly? Do you have off-by-one issues?
Where they suck is high-level problems, like: is the code actually solving the business problem? Is it using the right dependencies? Does it fit into the broader design?
Which is expected for me, and a great help. I'm happier as a human to spend less time checking whether you're managing the lifecycle of a pointer correctly, and to focus on ensuring that the code is there to do what it needs to do.
GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).
I don't know that I fully agree with that. I use Copilot for AI code review - just because it's built in to GitHub and it's easy - and I'd say results are variable, but overall decent.
Like anything else AI you need to understand what you're doing, so you need to understand your code and the structure of your application or service or whatever because there are times it will say something that's just completely wide of the mark, or even the polar opposite of what's actually the case. And so you just ignore the crap and close the conversation in those situations.
At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes not but, again, in these cases you just close them and move on.
I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.
But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.
That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and Gitlab, and I have to imagine that an absolutely savage consolidation is coming.
I suspect this is primarily a unit economics problem. To get context beyond the diff you really need the full repository or a robust AST, but the token costs to load that state for every PR make the margins impossible right now.
I installed CodeRabbit for our reviews in GitLab and am pretty happy with the results, especially considering the low price ($15/user/mo I think).
It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.
It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.
We pay a lot more attention to what CodeRabbit says than what DeepSource said when we used it.
> The SOTA isn't capable of using a code diff as a jumping off point.
Not a jumping-off point, but I'm having pretty great results on a complicated fork of a big project with a `git diff main..fork > main.diff`, then loading in the specs I keep, and telling it to review the diff in chunks while updating a ./review.md
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
They 100% catch bugs in code I work on. Is it replacing human review fully? No, not yet. But it is a useful tool. Just like most of us wouldn’t do a code review without having tests, linters etc run first.
>The SOTA isn't capable of using a code diff as a jumping off point.
The low quality of HN comments has been blowing my mind.
I have quite literally been doing what you describe every working day for the last 6+ months.
Anecdotally, Claude Bug Bot has actually been super impressive at understanding non-trivial changes. Like, today, it noted a race condition in a ~1000-line Go change that `go test -race` didn't pick up. There are definitely issues though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.
I haven't used the bug bot, but I like asking claude code to just review my PR in the command line. Yesterday it found a bug in a data structure I was implementing (it didn't support ZSTs properly). Of course, the fix it suggested was completely wrong, but what are ya gonna do. Still saved me from embarrassing myself before asking for a review
I've tried Greptile and it's pretty much pure noise. I ran it for 3 PRs and then gave up. Here are three examples of things it wasted my time on in those 3 PRs:
* Suggested silencing an exception instead of letting it crash and burn, for "style" (the potential exception was handled earlier in the code, but it did not manage to catch that context). When I commented that silencing the exception could lead to uncaught bugs, it replied "You're absolutely right, remove the try-catch", which I of course never added.
* Us using Python 3.14 is a logic error, as "python 3.14 does not exist yet".
* "Review the async/await patterns Heavy use of async in model validation might indicate these should be application services instead." Whatever this vague sentence means. Not sure if it is suggesting we change the design pattern used in our entire code base.
Also the "confidence" score added to each PR being 4/5 or something due to these irrelevant comments was a really annoying feature IMO. In general AI tools giving a rating when they're wrong feels like a big productivity loss as then the human reviewer will see that number and think something is wrong with the PR.
--
Before this we were running Coderabbit which worked really well and caught a lot of bugs / implementation gotchas. It also had "learnings" which it referenced frequently so it seems like it actually did not repeat commenting on intentional things in our code base. With Coderabbit I found myself wanting to read the low confidence comments as well since they were often useful (so too quiet instead of too noisy). Unfortunately our entire Coderabbit integration just stopped working one day and since then we've been in a long back and forth with their support.
--
I'm not sure what the secret sauce is but it feels like Greptile was GPT 3.5-tier and Coderabbit was Sonnet 4.5-tier.
My experience is that basic generic agents are useless, but an agent with extensive prompting about your use case is extremely valuable.
In my case using these prompts:
https://github.com/masoncl/review-prompts
Took things from "pure noise" to a world where, if you say there's a bug in your patch, people's first question will be "has the AI looked at it?"
FWIW in my case the AI has never yet found _the_ bug I was hunting for but it has found several _other_ significant bugs. I also ran it against old commits that were already reviewed by excellent engineers and running in prod. It found a major bug that wasn't spotted in human review.
Most of the "noise" I get now just leads me to say "yeah I need to add more context to the commit message". E.g the model will say "you forgot to do X" when X is out of scope for the patch and I'm doing it in a later one. So ideally the commit messages should mention this anyway.
The main problem with current AI reviewers isn't catching bugs, it's shutting up when there is no bug. Humans have an intuitive filter like "this code is weird, but it works and won't break prod, so I'll let it slide". LLMs lack this; they generate 20 comments about variable naming and 1 comment about a critical race condition. As a result the developer gets fatigued and ignores everything. Until AI learns to understand the context of importance, not just code context, it will remain an expensive linter.
The problem with code review is that it is quite straightforward to just prompt for it, and the frontier models, whether Opus or GPT5.2Codex, do a great job at code reviews. I don't need a second subscription or API call when the first one I already have works well out of the box and lets me focus on integration.
In our case, agentastic.dev, we just baked the code review right into our IDE. It just packages the diff for the agent, with some prompt, and sends it out to different agent choices (whether Claude or Codex) in parallel. The reason our users like it so much is because they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is you don't need to read a freaking poem.
we use codex review. it's working really well for us. but i don't agree that it's straightforward. moving the number of bugs caught and the signal to noise ratio a few percentage points is a compounding advantage.
it's a valuable problem to solve, amplified by the fact that ai coding produces much more code.
that being said, i think it's damn hard to compete with openai or anthropic directly on a core product offering in the long run. they know that it's an important problem and will invest accordingly.
Greptile is a great product and I hope you succeed.
However, I disagree that independence is a competitive advantage. If it’s true that having a “firewall” between the coding agent and review agent leads to better code, I don’t see why a company like Cursor can’t create full independence between their coding and review products but still bundle them together for distribution.
Furthermore, there might well be benefits to not being fully independent. Imagine if an external auditor was brought in to review every decision made inside your company. There would likely be many things they simply don’t understand. Many decisions in code might seem irrational to an external standalone entity but make sense in the broader context of the organization’s goals. In this sense, I’m concerned that fully independent code review might miss the forest for the trees relative to a bundled product.
Again, I’m rooting for you guys. But I think this is food for thought.
It's a ridiculous argument. Complete gibberish. They're all just API calls. There's no incentive to be biased toward approving bad code.
I still think any business that is based on someone else's model is worthless. I know I'm sounding like the 'dropbox is just FTP' guy, but it really feels like that any good idea will just be copied by OpenAI and Anthropic. If AI code review is proven a good idea is there any reason to expect Codex or Claude Code to not implement some commands to do code review?
The thing is, the "Dropbox is just FTP" guy should be right most of the time when you are selling to experts.
There is no reason to not just ask Claude for a review and then distill this into PR comments. Especially because "every LLM output has to be filtered through a human" is a good policy with the current state of these tools.
However, this industry loves distilling frivolities into web tools and it sells for some unfathomable reason. It is the same with the existing static analyzers etc that some orgs pay for. I do not understand why.
Very, very strictly speaking, relying on models is not, in its essence, the problem, I think. There is enough "meat" there that you can build a nice small profitable company.
Those tools are better than vanilla agents because they dedicate expensive human time to evaluating and fine-tuning models. You can also build various integration, management and reporting features to add value. If you freeze model progress today, or 12 months ago when most of those companies started, it's a viable business I think.
But any gains you make on the first part will be lost to newer models, and the second part is not as valuable when LLMs allow people to build fairly complicated features quickly.
I don't know about worthless, but all those companies have very limited time to gather customers and at least make themselves valuable for an acquisition.
The shakiest business model is one where you have no competition: if nobody else had the idea already, you are probably wrong - or they did, but it was a bad idea, so they failed.
The real question is how can you compete. There are lots of answers here, but something new and good is rare.
We do our review through Claude github actions. Works well.
I've also noticed this explosion of code review tools and felt that there's some misplaced focus going on for companies.
Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering but I just don't see a benefit over their competition. We have GH copilot natively available on all our PRs, it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.
It's not really clear to me what Sentry/Vercel are offering beyond what copilot does and in my brief testing of them didn't see noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice and are ultimately limited on DX by how deeply GH and other source control service allow them to integrate.
What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.
What should be added, I think, to code reviewing is that it can get really complex, for example if we add formal verification in the mix to catch very subtle bugs.
So in the end I think there will still be some disappointment, as one would expect it should be fully automated and only about reading the code, like this article suggests. In reality, I think it is harder than writing code.
Fuzzy automated reviews should always run in an interactive loop with a developer on their workstation and contain enough context to quickly assess if they are valid or not.
When developers create a PR, they already feel they are "done", and they have likely already shifted their focus to another task. False positives are horrible at this point, especially when they keep changing with each push of commits.
Is "AI code review" a correct term?
A code review requires reasoning and understanding, things that to my knowledge a generative model cannot do.
Surely the most an AI code review ever could be is something that looks like a code review.
Given we are more interested in the end than in the means, it is a good usage.
I don't really understand how this differentiates against the competition.
> Independence
Any "agent" running against code review instead of code generation is "independent"?
> Autonomy
Most other code review tools can also be automated and integrated.
> Loops
You can also ping other code review tools for more reviews...
I feel like this article actually works against you by presenting the problem and inadequately solving them.
> Independence
It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
> Autonomy
Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, and so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that might say people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.
> Loops
You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin in the `/plugins` command that lets Claude Code do a plan, write, commit, get review comments, plan, write loop.
> Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
You're assuming models/prompts insist on a previous iteration of their work being right. They don't. Models try to follow instructions, so if you ask them to find issues, they will. 'Trust' is a human problem, not a model/harness problem.
> Our view is that code validation will be completely autonomous in the medium term.
If reviews are going to be autonomous, they'd be part of the coding agent. Nobody would see it as the independent activity you mentioned above.
> Our first step towards making this easier is a native Claude Code plugin.
Claude can review code based on a specific set of instructions/context in an MD file. An additional plugin is unnecessary.
My view is that to operate in this space, you gotta build a coding agent or get acquired by one. The writing was on the wall a year ago.
Independence is ridiculous - the underlying LLMs are too similar in their training data and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all of them have read Stack Overflow, Reddit, and GitHub in their training.
It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. Eg, adding type checking to lazy python code. These types of checks really are model independent, and using agents to build and manage them might bring a lot of value.
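For example (a toy sketch of the kind of check I mean; the function and types are made up for illustration):

    from typing import TypedDict

    # Before: "lazy" Python that an agent could be asked to tighten up.
    def total_owed(invoices):
        return sum(i["amount"] for i in invoices)

    # After: with types added, mypy can reject a caller passing a bare dict
    # or a list of strings at check time - deterministically, no model involved.
    class Invoice(TypedDict):
        amount: float

    def total_owed_typed(invoices: list[Invoice]) -> float:
        return sum(i["amount"] for i in invoices)

The agent does the tedious annotation work once; after that, the check itself stays deterministic and model-independent.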
> It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways.
Is there empirical evidence for that? Where is it on an epistemic meter between (1) "it sounds good when I say it" and (10) "someone ran an evaluation and got significant support"?
“Vibes” (2/3 on scale) are ok, just honestly curious.
> Unfortunately, code review performance is ephemeral and subjective
> Today's agents are better than the median human code reviewer
Which is it? You cannot have it both ways.
> Today's agents are better than the median human code reviewer
"...at catching issues and enforcing standards, and they're only getting better".
I took this to mean that what counts as good code review is subjective. But if you clearly define standards and patterns for your code, your linter / automated tools / AI code reviewer will always catch more than humans.
>A human rubber-stamping code being validated by a super intelligent machine is the equivalent of a human sitting silently in the driver's seat of a self-driving car, "supervising".
So, absolutely necessary and essential?
In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.
After testing several bots in our org, specifically Devin, Graphite, and Cursor, I’ve noticed Cursor is the better bug bot out there right now.
Same here, tested a bunch, and Cursor has given little noise and usually decent suggestions. In this case it's on a React app, so other projects might not find it as good.
> This might seem far-fetched but the counterfactual is Kafkaesque.
> As the proprietors of an, er, AI code review tool suddenly beset by an avalanche of competition, we're asking ourselves: what makes us different?
> Human engineers should be focused only on two things - coming up with brilliant ideas for what should exist, and expressing their vision and taste to agents that do the cruft of turning it all into clean, performant code.
> If there is ambiguity at any point, the agents Slack the human to clarify.
Was this LLM advertisement generated by an LLM? Feels so at least.
This article has a catchy headline, but there's really no content to it. This is content marketing without content. It seems like every week on Hacker News, there's a dozen of these. All seemingly code reviewers, too. Keep it to LinkedIn.
We used Greptile where I work and it was so bad we decided to switch to Claude. And even Claude isn’t nearly as good at reviewing as an experienced programmer with domain knowledge.
My experience is that Claude or others are good at pointing out things I will want to look at and then I can go review more thoroughly. So it's helped to some degree.
But like everything else with it, it tries to do too much.
What I want is a review "wizard" agent -- something that identifies the pieces I should look at, and takes me through them diff by diff asking me to read them, while offering its commentary ("this appears to be XX....") and letting me make my own.
> In addition, success is generally pretty well-defined. Everyone wants correct, performant, bug-free, secure code.
I feel like these are often not well defined? "Its not a bug it's a feature", "premature optimization is the root of all evil", etc
In different contexts, "performant enough" means different things. Similarly, many times I've seen different teams within a company have differing opinions on "correctness"
If you give an LLM a hammer, everything looks like a nail; give it a saw, and everything looks like wood. Ask an LLM to find issues and it will find "issues." At the end of the day, you will have to fix those issues, and if you decide to have another LLM fix them, by the time you are done with that cycle you will end up with code that is thoroughly over-engineered.
If by engineering you mean doing whatever vibes you feel, then yeah, over-engineering.
If by engineering you mean using the engineering design process, then it would not be engineered at all, let alone over-engineered.
Good code reviews are part of a team's culture, and it's hard to just patch that with an agent. With millions of tools, it will be an arms race over which one is louder about as many things as possible, because:
- it will have a higher chance of convincing the author that an issue was important by throwing more darts - something a human wouldn't do, because it takes real mental effort to go through an authentic review,
- it will sometimes find a real, big issue, which reinforces the bias that it's useful,
- there will always be a tendency toward more feedback (not higher-quality feedback), because if it's too silent, is it even doing anything?
So I believe it will just add more rounds of back-and-forth prompting between more people, and I'm not sure that's a net positive.
Plus, PRs are a good reality check on whether your code makes sense when another person reviews it - a final safeguard before a maintainability miss, or a disaster waiting to be deployed.
I liked that the post is self-aware about promoting its own product. But the writing seemed more focused on the philosophy behind code reviews and the impact of AI, and less on the mechanics of how Greptile differs from competitors. I was hoping to see more of the latter.
Thanks! We go over that on many other pages. Here are some:
https://www.greptile.com/benchmarks https://www.greptile.com/greptile-vs-coderabbit https://www.greptile.com/greptile-vs-bugbot
My company just finished a several week review period of Greptile. Devs were split over the usefulness of the tool (compared to our current solution, Cursor). While Greptile did occasionally offer better insights than Cursor, it also exhibited strange behavior such as entirely overwriting PR descriptions with its own text and occasionally arguing with itself in the comments. In the end we decided to NOT purchase Greptile as there were enough "not quite there" issues that made it more trouble than worthwhile. I am certain, though, that the Greptile team will resolve all those problems and I wish them the best of luck!
Either become a platform or get swallowed up by one (e.g. Cursor acquiring Graphite to become more of a platform). Trying to prove out that your code review agent is marginally better than others when the capability is being included in every single solution is a losing strategy. They can just give the capability away for free. Also, the idea that code review will scale dramatically in importance as more code is written by agents is not new.
Contrary to some of the other anecdotes in this thread, I've found automated code review to discover some tricky stuff that humans missed. We use https://www.cubic.dev/
Before I push any code, I always ask 2 different frontier LLMs to review the changes for any potential issues. Saved my ass a few times before pushing to production.
Founder of cubic here, thanks for the shoutout!
This article surprised me. I would have expected it would be about how _human_ code review is unsustainable in the face of AI-enhanced velocity.
I would be interested to hear of some specific use-cases for LLMs in code review.
With static analysis, tests, and formatters I thought code review was mostly interpersonal at this point. Mentorship, ensuring a chain of liability in approvals, negotiating comfort levels among peers with the shared responsibility of maintaining the code, that kind of thing.
I’ve found only one good code review bot, and that’s Unblocked. It doesn’t always leave a comment, and when it does, it’s often found 1-2 real bugs in the code crossing multiple files (even like “hey you forgot to update this reference in this other file not edited in the PR”). Things you’d expect someone with a deeper knowledge of the code to know.
You do get a handful of false positives, especially when what it reports is technically correct but we’re just handling the issue in a sort of weird/undocumented way. But it’s only the one comment, which is easy to dismiss, and it’s fairly rare. It’s not huge amounts of AI vomit all over PRs; it’s a lot more focused.
I find that a lot of the time with Copilot, it calls out issues where, if the AI had more context on the whole codebase, it would realize the scenario can’t actually occur.
Or it won’t understand some invariant that you know but is not explicit anywhere
It's not terribly hard to write a Copilot GHA that does this yourself for your specific team's needs. Not sure why you'd need to bring a vendor on for this....
What do the vendors provide?
I looked at a couple which were pretty snazzy at first glance, but now that I know more about how Copilot agents work, I'm pretty sure that in a few hours I could have a foundation for my team to build on that would take care of a lot of our PR review needs....
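As a rough sketch of what I mean, the Action mostly just needs to run a small script like this on each PR. The env var names, the PR_NUMBER plumbing, and ask_model are my assumptions, not any vendor's setup:

    # Gather the PR diff, ask a model for a short review, post it as a comment
    # via the GitHub REST API. Intended to be invoked from a workflow step.
    import os
    import subprocess
    import requests

    def pr_diff(base_ref: str = "main") -> str:
        return subprocess.run(["git", "diff", f"origin/{base_ref}...HEAD"],
                              capture_output=True, text=True, check=True).stdout

    def ask_model(prompt: str) -> str:
        # Placeholder for whichever model API your team already uses.
        return "(review text goes here)"

    def post_comment(body: str) -> None:
        repo = os.environ["GITHUB_REPOSITORY"]   # e.g. "org/repo", set by Actions
        pr_number = os.environ["PR_NUMBER"]      # assumed to be passed in by the workflow
        token = os.environ["GITHUB_TOKEN"]
        resp = requests.post(
            f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
            headers={"Authorization": f"Bearer {token}",
                     "Accept": "application/vnd.github+json"},
            json={"body": body},
            timeout=30,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        post_comment(ask_model("Review this diff for functional bugs only:\n" + pr_diff()))

The vendor's value, as far as I can tell, is mostly the prompt tuning, the noise filtering, and the UI, not the plumbing.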
I had a bad experience with greptile due to what seemed to be excessive noise and nit comments. I have been using cursorbot for a year and really like it.
> Only once would you have X write a PR, then have X approve and merge it to realize the absurdity of what you just did.
I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications, what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason.
Maybe I'm buying into the Kool-Aid, but I actually really liked the self-aware tone of this post.
> Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.
I would suggest you check out your Greptile discord and/or answer your messages on X where people are trying to reach you with problems and questions about your service. Unless that no longer matters.
1. I absolutely agree there's a bubble. Everybody is shipping a code review agent.
2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.
More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.
2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.
You're totally right that PR reviews go a lot farther than catching issues and enforcing standards. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge hasn't been shared yet seems unnecessary.
> 2. What on earth is this defense of their product?
I think the distribution channel is the only defensive moat for low-to-mid-complexity, fast-to-implement features like code-review agents. So in the case of Linear and Cursor Bugbot it makes a lot of sense. I wonder when GitHub/GitLab/Atlassian or Xcode will release their own review agent.
> More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
> The point of a PR is to share knowledge and to catch structural gaps.
Well, it was to share knowledge and to catch structural gaps.
Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.
PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.
So far I've been pretty happy with Greptile. Tried Copilot and Cubic.dev but landed on Greptile
Why not let AI write the code and then have it reviewed by humans? If you use AI to review my code, then you can't stop me from using another AI to refute it: this only foreshadows the beginning of internal friction.
where we draw the line on agent "identity" when the models being orchestrated are generally the same 3 frontier intelligences is an interesting question indeed
I would think this idea of creating a third party to verify things centers more around liability/safety cover for a steroidal increase in velocity (i.e. --dangerously-skip-permissions) than around anything particularly pragmatic or technical (but it's still poised to capture a ton of value).
LLMs writing code, and then LLMs reviewing the code. And when customers run into a problem with the buggy slop you just churned out, they can talk to a LLM chat bot. Isn't it just swell?
Just let the support chat bot submit, review, and deploy code changes and there are no longer any customer problems!
We have Code Rabbit at work, and it's made PRs unreadable. The bot pollutes the comments and code diffs with noise.
Claude Code's code review is _sufficient_, imo.
You still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.
As Claude Code (and Opus) improves, Greptile is finding fewer issues in my code reviews.
"While some other products have built out great UIs for humans to review code in an AI-assisted paradigm, we have chosen to build for what we consider to be an inevitable future - one where code validation requires vanishingly little human participation."
Ok good, now I know not to bother reading through any of their marketing literature, because while the product at first interested me, now I know it's exactly not what I want for my team.
The actual "bubble" we have right now is a situation where people can produce and publish code they don't understand, and where engineers working on a system no longer are forced to reckon with and learn the intricacies of their system, and even senior engineers don't gain literacy into the very thing they're working on, and so are somewhat powerless to assess quality and deal with crisis when it hits.
The agentic coding tools and review tools I want my team (and myself) to have access to are ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
What we got instead with claude code & friends is a thing way too eager to take over the whole thing. And while it can produce some good results it doesn't produce understandable systems.
To be clear, it's been a long time since writing code was the hard part of the job in many, many domains. The hard part is systems & architecture, and while these tools can help with that, there's nothing more potentially terrifying than a team full of people who have agentically produced a codebase whose nuances they cannot holistically understand.
So, yeah, I want review tools for that scenario. Since these people have marketed themselves off the table... what is out there?
Yep. We see this future and are working on exactly what you're talking about (Graphite)
You just completely contradicted yourself then.
Not sure how? Meant this:
> The agentic coding tools and review tools I want my team (and myself) to have access to are ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
> we have chosen to build for what we consider to be an inevitable future - one where code validation requires vanishingly little human participation.
> Today's agents are better than the median human code reviewer at catching issues
Not my experience
> A human rubber-stamping code being validated by a super intelligent machine
What? I dunno how they define intelligence, but LLMs are absolutely not super intelligent.
> If agents are approving code, it would be quite absurd and perhaps non-compliant to have the agent that wrote the code also approve the code.
It's all the same frontier models under the hood. Who are you kidding.
My experience with code review tools has been dreadful. In most cases I can remember, the reviews are inaccurate, "you are absolutely right" sycophantic garbage, or missing the big picture. The worst feature of all is the "PR summary", which is usually pure slop lacking the context around why a PR was made. Thankfully that can be turned off.
I have to be fair and say that yes, occasionally, some bug slips past the humans and is caught by the robot. But these bugs are usually also caught by automated unit/integration tests or by linters. All in all, you have to balance the occasional bug with all the time lost "reviewing the code review" to make sure the robot didn't just hallucinate something.
Haven’t used a single one that was any good. Basically a 50/50 crapshoot if what they are saying makes any sense at all, let alone it being considered “good” comments. Basically no different than random chance.
Hot take: Code review is an anti-pattern.
We spend a ton of time looking at the code and blocking merges, and the end result is still full of bugs. AI code review only provides a minor improvement. The only reason we do code review at all is humans don't trust that the code works. Know another way to tell if code works? Running it. If our code is so utterly inconceivable that we can't make tests that can accurately assess if the code works, then either our code design is too complicated, or our tests suck.
OTOH, if the reason you're doing code review is to ensure the code "is beautiful" or "is maintainable", again, this is a human concern; the AI doesn't care. In fact, it's becoming apparent that it's easier to replace entire sections of code with new AI generated code than to edit it.
Running the code checks if it works now, whereas code review checks if it will work in a year and if anyone else can understand it.
Tests don't catch architectural mistakes or time bombs. If you remove reviews and rely solely on tests, you end up with a "working" big ball of mud that is impossible to maintain. AI won't help if it's the one generating the mud.
Tests can't tell you whether the design of the code is fit for purpose, or about requirements you completely missed or punted on. They can't tell you that a core new piece that's about to be built upon is barely coherent, poorly performing slop that "works" but is going to need to be actually designed while being rewritten by the next person. They can't tell you that you skipped trying to understand how the feature should work, or thinking about the performance characteristics of the solution, before you started and just let the LLM drive, so you never designed anything, arriving at something that "works" on your machine and passes the tests that were generated for it but will hammer production under production load. Neither will running it on your own machine or in Dev.
No amount of telling the LLM to "Dig up! Make no mistakes!" will help with non-designed slop code actively poisoning the context, but you have to admire the attempt when you see comments added while removing code, referring to the code that's being removed.
It's weird to see tickets now effectively go from "ready for PR" to 0% progress, but at least you're helping that person meet whatever the secret AI* usage quota is for their performance review this year.
> Tests can't tell you if the design of the code is fit for purpose, or about requirements you completely missed or punted on
This is what acceptance tests are for. Does it do the thing you wanted it to do? Design a test that makes it do the thing, and check that the result matches what you expect. If it's not in the test, don't expect it to work anywhere else. Obviously this isn't easy, but that's why we either need a different design or different tests. Before, that would have been a tremendous amount of work, but now it's not.
(Making this work requires learning how to make it work right. It's a skill with brand-new techniques that 99.999% of people will need over a year to learn.)
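To make that concrete, here's a toy sketch of the kind of acceptance test I mean, assuming pytest; checkout_total and its module are made-up stand-ins for the feature under test:

    # Toy acceptance tests: they pin down what the feature should do for the
    # user (the business rule), not how the code happens to implement it.
    import pytest
    from mymodule import checkout_total  # hypothetical function under test

    def test_order_over_100_gets_free_shipping():
        # Rule the stakeholder actually asked for: shipping waived over $100.
        total = checkout_total(items=[("keyboard", 120.00)], shipping=9.99)
        assert total == pytest.approx(120.00)

    def test_small_order_pays_shipping():
        total = checkout_total(items=[("cable", 12.00)], shipping=9.99)
        assert total == pytest.approx(21.99)

If the implementation gets rewritten wholesale, these still state what "done" means.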
> or that a core new piece that's going to be built upon next is barely-coherent, poorly-performing slop that "works" but is going to need to be actually designed while being rewritten by the next person instead
This is the "human" part I mentioned being irrelevant now. AI does not care if the code is slop or maintainable. AI can just rewrite the entire thing in an hour. And if the tests pass, it doesn't matter either. Take the human out of the loop.
(Concerned about it "rewriting tests" to pass them? You need independent agents, quality gates, determinism, feedback loops, etc. New skills and methods designed to keep the AI on the rails, like a psychotic idiot savant that can build a spaceship if you can keep it from setting fire to it)
> or that you skipped trying to understand how the feature should work or thinking about the performance characteristics of the solution before you started and just let the LLM drive, so you never designed anything
This is not how AI driven coding works. You have to give the AI very specific design instructions. If you do it right, it will make what you want. Sadly, this means most programmers today will be irrelevant because they can't design their way out of a wet paper bag.
(You know how agile eschews planning and documentation, telling developers and product people to just build "whatever works right now" and keep rewriting it indefinitely as they meet blockers they never planned for? AI now encourages the planning and documentation.)
Reminder that this comes from the founder who got rightly lambasted for his comments about work-life balance and then doubled down when called out.
Or stick with known, well-documented frameworks so you don't have to pay for this nonsense, since they're likely telling you things you already know if you test and write your own code.
Oh, right - writing your own code is a thing of the past: AI writes, then AI finds the bugs.
one more ai code review please, I promise it will fix everything this time, please just one more
There is an AI bubble.
Can drop the extra words
No shit. What is the point of using an LLM to review code produced by an LLM?
Code review presupposes a different perspective, which no platform can offer at the moment because they are only as sophisticated as the model they wrap. Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess? If I want more emphasis I can ask Claude myself. Or Qwen. I can't even begin to understand this rationale.