One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
But that sets expectations way too high. Partly it is due to Amdahl's law: I spend only a portion of my time coding, and far more time thinking and communicating with others who are customers of my code. Even if it does make the coding 10x faster (and it doesn't most of the time), overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
Maybe it's due to a more R&D-ish nature of my current work, but for me, LLMs are delivering just as much gains in the "thinking" part as in "coding" part (I handle the "communicating" thing myself just fine for now). Using LLMs for "thinking" tasks feels similar to how mastering web search 2+ decades ago felt. Search engines enabled access to information provided you know what you're looking for; now LLMs boost that by helping you figure out what you're looking for in the first place (and then conveniently searching it for you, too). This makes trivial some tasks I previously classified as hard due to effort and uncertainty involved.
At this point I'd say about 1/3 of my web searches are done through ChatGPT o3, and I can't imagine giving it up now.
(There's also a whole psychological angle in how having an LLM help sort and rubber-duck your half-baked thoughts makes many tasks seem much less daunting, and that alone makes a big difference.)
This, and if you add in a voice mode (e.g. ChatGPT's Advanced Mode), it is perfect for brainstorming.
Once I decide I want to "think a problem through with an LLM", I often start with just the voice mode. This forces me to say things out loud — which is remarkably effective (see: rubber duck debugging) — and it also gives me a fundamentally different way of consuming the information the LLM provides me. Instead of being delivered a massive amount of text, where some information could be wrong, I get a sequential stream where I can stop, pause, or redirect the LLM as soon as something makes me curious or as I find problems with what it said.
You would think that having this way of interacting would be limiting, as having a fast LLM output large chunks of information would let you skim through it and commit it to memory faster. Yet, for me, the combination of hearing things and, most of all, not having to consume so much potentially wrong info (what good is it to skim pointless stuff), ensures that ChatGPT's Advanced Voice mode is a great way to initially approach a problem.
After the first round with the voice mode is done, I often move to written-form brainstorming.
This 100%. Though I think there is a personality component to this. At least, I think when I speak.
From time to time I use an LLM to pretend to research a topic that I had researched recently, to check how much time it would have saved me.
So far, most of the time, my impression was "I would have been so badly misled and wouldn't even know it until too late". It would have saved me some negative time.
The only thing LLMs can consistently help me with so far is typing out mindless boilerplate, and yet it still sometimes requires manual fixing (but I do admit that it still does save effort). Anything else is hit or miss. The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway. It can sometimes shine with a gold nugget among all the mud it produces, but it's rare. The best thing is being able to describe something and ask what it's called, so you can then search for it in traditional ways.
That said, search engines have gotten significantly worse for research in the last decade or so, so the bar is lower for LLMs to be useful.
> So far, most of the time, my impression was "I would have been so badly misled and wouldn't even know it until too late". It would have saved me some negative time.
That was my impression with Perplexity too, which is why I mostly stopped using it, except for when I need a large search space covered fast and am willing to double-check anything that isn't obviously correct. Most of the time, it's o3. I guess this is the obligatory "are you using good enough models" part, but it really does make a difference. Even in ChatGPT, I don't use "web search" with the default model (gpt-4o) because I find it hallucinates or misinterprets results too much.
> The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway.
I disagree, but then maybe it's also a matter of attitude. I've seen co-workers do the exact same research as I did, in parallel, using the same tools (Perplexity and later o3); they tend to do it 5-10x faster than I do, but then they get bad results, and I don't.
Thing is, I have an unusually high need to own the understanding of any thing I'm learning. So where some co-workers are happy to vibe-check the output of o3 and then copy-paste it to team Notion and call their research done, I'll actually read it, and chase down anything that I feel confused about, and keep digging until things start to add up and I feel I have a consistent mental model of the topic (and know where the simplifications and unknowns are). Yes, sometimes I get lost in following tangents, and the whole thing takes much longer than I feel it should, but then I don't get "misled by the LLM".
I do the same with people, and sometimes they hate it, because my digging makes them feel like I don't trust them. Well, I don't - most people hallucinate way more than SOTA LLMs.
Still, the research I'm talking about, would not be easy to do without LLMs, at least not for me. The models let me dig through things that would otherwise be overwhelming or too confusing to me, or not feasible in the time I have for it.
Own your understanding. That's my rule.
> Thing is, I have an unusually high need to own the understanding of any thing I'm learning.
Same here. Don't get me wrong, LLMs can be helpful, but what I mean is that they can at best aid my research rather than perform it for me. In my experience, relying on them to do that would usually be disastrous - but they do sometimes help in cases where I feel stuck and would otherwise have to find some human to ask.
I guess it's the difference between "using LLMs while thinking" and "using LLM to do the thinking". The latter just does not work (unless all you're ever thinking about is trivial :P), the former can boost you up if you're smart about it. I don't think it's as big of a boost as many claim and it's still far from being reliable, but it's there and it's non-negligible. It's just that being smart about it is non-optional, as otherwise you end up with slop and don't even realize it.
> I have an unusually high need to own the understanding of any thing I'm learning
This is called deprivation sensitivity. It’s different from intellectual curiosity, where the former is a need to understand vs. the latter, which is a need to know.
Deprivation sensitivity comes with anxiety and stress. Where intellectual curiosity is associated with joyous exploration.
I score very high with deprivation sensitivity. I have unbridled drive to acquire and retain important information.
It’s a blessing and curse. An exhausting way to live. I love it but sometimes wish I was not neurodivergent.
You're not neurodivergent. You're a suffering conscious being just like everyone else. Anxiety and depression are caused by ignorance, not circumstance or personality traits, or anything else. With ignorance there is greed, and anger, and delusion. It is because there is no limit to the diversity of delusion that you cling to the view that you are neurodivergent, and otherwise hold the view that you exist in such and such relations to such and such entities and possess so and so qualities and essences. This is why it is said that ignorance alone is the cause of all mental suffering and dissatisfaction experienced by conscious beings.
Our brains are prediction machines. Anxiety is the anticipation of unpleasant experience, which comes from conditioning, not ignorance.
You can be completely aware of your experience and still feel anxiety. So your thinking is flawed.
Your response is telling. You are triggered by a benign comment and generalize harsh views towards all people.
You sound like a troubled young man who feels invisible.
I’m surprised it’s only 1/3rd. 90% of my searches for information start at Perplexity or Claude at this point.
Perplexity is too bulky for queries Kagi can handle[0], and I don't want to waste o3 quota[1] on trivial lookups.
--
[0] - Though I admit that almost all my Kagi searches end in "?" to trigger AI answer, and in ~50% of the cases, I don't click on any result.
[1] - Which AFAIK still exists on the Plus plan, though I haven't hit it in ~two months.
> One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
Similar situation at my work, but all of the productivity claims from internal early adopters I've seen so far are based on very narrow ways of measuring productivity, and very sketchy math, to put it mildly.
> One thing I find frustrating is that management where I work has heard of 10x productivity gains.
That may also be in part because llms are not as big of an accelerant for junior devs as they are for seniors (juniors don't know what is good and bad as well).
So if you give 1 senior dev a souped up llm workflow I wouldn't be too surprised if they are as productive as 10 pre-llm juniors. Maybe even more, because a bad dev can actually produce negative productivity (stealing from the senior), in which case it's infinityx.
Even a decent junior is mostly limited to doing the low level grunt work, which llms can already do better.
Point is, I can see how jobs could be lost, legitimately.
The item lost is the pipeline of talent in all of this though.
Precision machining is going through an absolute nightmare where the journeymen or master machinists are aging out of the work force. These were people who originally learned on manual machines, and upgraded to CNC over the years. The pipeline collapsed about 1997.
Now there are no apprentice machinists to replace the skills of the retiring workforce.
This will happen to software developers. Probably faster because they tend to be financially independent WAY sooner than machinists.
> The item lost is the pipeline of talent in all of this though.
Totally agree.
However, I think this pipeline has been taking a hit for a while already because juniors as a whole have been devaluing themselves: if we expect them to leave after one year, what's the point of hiring and training them? Only helping their next employer at that point.
It's the employers who are responsible for the fact that almost everyone working in tech (across all skill levels) will have a far easier time advancing in both pay and title by jumping jobs often.
Very few companies put any real thought into meaningful retention but they are quick to complain about turnover.
Yes I agree it works both ways. Employment is a transaction and both sides are trying to optimize outcomes in their own best interest. No blame.
The health of the job market is a big factor as well.
That old canard? If you pay people in a way that incentivizes them to stay, they will. If you train people and treat them right and pay them right, they won't leave. If they are, try to fix one of those things, stop blaming the juniors for their massive collusion in a market where they literally are struggling to get jobs.
> However, I think this pipeline has been taking a hit for a while already because juniors as a whole have been devaluing themselves
I have seen the standards for junior devs in free fall for a few years as they hired tons of bootcamp fodder over the last few years. I have lost count of the number of whinging junior devs who think SQL or regex is 'too hard' for their poor little brains. No wonder they are being replaced by a probabilistic magician's hat.
> overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools. The total cost of production should always be considered, not just throughput.
> It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools.
Claude Max is $200/month, or ~2% of the salary of an average software engineer.
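Spelling out the arithmetic behind that 2% (a rough check; the assumption that it's measured against gross pay is mine):

    # Quick check of "$200/month ~= 2% of an average SWE salary"
    # (assuming the 2% is against gross pay, which is my reading).
    monthly_tool_cost = 200                            # Claude Max, USD/month
    implied_monthly_gross = monthly_tool_cost / 0.02   # 10,000 USD/month
    implied_annual_gross = implied_monthly_gross * 12  # 120,000 USD/year
    print(implied_monthly_gross, implied_annual_gross) # 10000.0 120000.0

So the 2% figure implies roughly $120k/year gross, which is in the right ballpark for US averages.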
Does anyone actually know what the real cost for the customers will be once the free AI money no longer floods those companies?
I'm no LLM evangelist, far from it, but I expect models of similar quality to the current bleeding edge will be freely runnable on consumer hardware within 3 years. Future bleeding-edge models may well be more expensive than current ones, who knows.
For the purpose of keeping the costs of LLM-dependent services down, you don't need to run bleeding-edge models on single consumer GPUs. Even if it takes a hundred GPUs, it still means people can start businesses around hosting those models, and compete with the large vendors.
How do the best models that can run on say a single 4090 today compare to GPT 3.5?
Qwen 2.5 32B which is an older model at this point clearly outperforms it:
https://llm-stats.com/models/compare/gpt-3.5-turbo-0125-vs-q...
Even when quantized down to 4 bits to fit on a 4090?
Not in my experience, running qwen3:32b is good, but it’s not as coherent or useful as 3.5 at a 4bit quant. But the gap is a lot narrower than llama 70b.
Yeah, there was an analysis that came out on Hacker News the other day. Between weak demand-side economics, virtually no impact on GDP, and corporate/VC subsidies going away soon, we're close to finding out. Sam Altman did convince SoftBank to do a $40B round though, so it might be another year or two. Current estimates are that it's cheaper to run than search, so it's likely more search features will be swapped over. OpenAI hasn't dropped their ad platform yet though, so I'm interested to see how that goes.
There's a potential for 100x+ lower cost of chips/energy for inference with compute-in-memory technology.
So they'll probably find a reasonable cost/value ratio.
Too cheap to meter? Inference is cheap and there's no long-term or even mid-term moat here.
As long as the courts don't shut down Meta over IP issues with Llama training data, that is.
I can't stress that enough: "open source" models are what can stop the "real costs" for the customers from growing. Despite popular belief, inference isn't that expensive. This isn't Uber - stopping isn't going to make LLMs infeasible; at worst, it's just going to make people pay API prices instead of subscription prices. As long as there are "open source" models that are legally available and track SOTA, anyone with access to some cloud GPUs can provide "SOTA of 6-12 months ago" for the price of inference, which puts a hard limit on how high OpenAI, et al. can hike the prices.
But that's only as long as there are open models. If Meta loses and Llama goes away, the chilling effect will just let OpenAI, Microsoft, Anthropic, and Google set whatever prices they want.
EDIT:
I mean Llama legally going away. Of course the cat is now out of the bag, the Pandora's box has been opened; the weights are out there and you can't untrain or uninvent them. But keeping the commercial LLM offerings' prices down requires a steady supply of improved open models, and the ability for smaller companies to make a legal business out of hosting them.
You can't just take cost of training out of the equation...
If these companies plan to stay afloat, they have to actually pay for the tens of billions they've spent at some point. That's what the parent comment meant by "free AI"
Yes, you can - because of Llama.
Training is expensive, but it's not that expensive either. It takes just one of those super-rich players to pay the training costs and then release the weights, to deny other players a moat.
If your economic analysis depends on "one of those super-rich players to pay" for it to work, it isn't as much analysis as wishful thinking.
All the 100s of billions of $ put into the models so far were not donations. They either make it back to the investors or the show stops at some point.
And with a major chunk of proponents' arguments being "it will keep getting better", if you lose that, what have you got? "This thing can spit out boilerplate code, re-arrange documents and sometimes corrupts data silently and in hard-to-detect ways, but hey, you can run it locally and cheaply"?
The economic analysis is not mine, and I thought it was pretty well-known by now: Meta is not in the compute biz and doesn't want to be in it, so by releasing Llamas, it denies Google, Microsoft and Amazon the ability to build a moat around LLM inference. Commoditize your complement and all that. Meta wants to use LLMs, not sell access to them, so occasionally burning a billion dollars to train and give away an open-weight SOTA model is a good investment, because it directly and indirectly keeps inference cheap for everyone.
You understand that according to what you just said, economically the current SOTA is untenable?
Which, again, leads to a future where we're stuck with local models corrupting data about half the time.
No, it just means that the big players have to keep advancing SOTA to make money; Llama lagging ~6 months behind just means there's only so much they can charge for access to the bleeding edge.
Short-term, it's a normal dynamics for a growing/evolving market. Long-term, the Sun will burn out and consume the Earth.
The cost to improve training increases exponentially for every milestone. No vendor is even coming close to recouping the costs now. Not to mention quality data to feed the training.
The R&D is running on hopes that increasing the magnitude (yes, actual magnitudes) of their models will eventually hit a miracle that makes their company explode in value and power. They can't explain what that could even look like... but they NEED evermore exorbitant amounts of funding flowing in.
This truly isn't a normal ratio of research-to-return.
Luckily, what we do have already is kinda useful and condensing models does show promise. In 5 years I doubt we'll have the post-labor dys/utopia we're being hyped up for. But we may have some truly badass models that can run directly on our phones.
Like you said, Llama and local inference is cheap. So that's the most logical direction all of this is taking us.
Nah, the vendors have generally been open about the limits of scaling. The bet isn't on that one last order of magnitude increase will hit a miracle - the bet is on R&D figuring out a new way to get better model performance before the last one hits diminishing returns. Which, for now, is what's been consistently happening.
There's risk to that assumption, but it's also a reasonable one - let's not forget the whole field is both new and has seen stupid amounts of money being pumped into it over the last few years; this is an inflationary period, there's tons of people researching every possible angle, but that research takes time. It's a safe bet that there are still major breakthroughs ahead us, to be achieved within the next couple years.
The risky part for the vendors is whether they'll happen soon enough so they can capitalize on them and keep their lead (and profits) for another year or so until the next breakthrough hits, and so on.
If Llama goes away, we would still get models from China that don't respect the laws that shut down Llama; at least until China is on top, they will continue to undercut using open-source models. Either way, open models will continue to exist.
Rapid progress in open source says otherwise.
In the US, maybe. Several times that by percentage in other places around the world.
the average software engineer makes $10000 a month after taxes?!
> if you are 10-15% more expensive to employ due to the cost of the LLM tools
How is one spending anywhere close to 10% of total compensation on LLMs?
That's a good insight, because with perfect competition it means you need to share your old salary with an LLM!
It's just another tech hype wave. Reality will be somewhere between total doom and boundless utopia. But probably neither of those.
The AI thing kind of reminds me of the big push to outsource software engineers in the early 2000's. There was a ton of hype among executives about it, and it all seemed plausible on paper. But most of those initiatives ended up being huge failures, and nearly all of those jobs came back to the US.
People tend to ignore a lot of the little things that glue it all together that software engineers do. AI lacks a lot of this. Foreigners don't necessarily lack it, but language barriers, time zone differences, cultural differences, and all sorts of other things led to similar issues. Code quality and maintainability took a nosedive and a lot of the stuff produced by those outsourced shops had to be thrown in the trash.
I can already see the AI slop accumulating in the codebases I work in. It's super hard to spot a lot of these things that manage to slip through code review, because they tend to look reasonable when you're looking at a diff. The problem is all the redundant code that you're not seeing, and the weird abstractions that make no sense at all when you look at it from a higher level.
This was what I was saying to a friend the other day. I think anyone vaguely competent that is using LLMs will make the technology look far better than it is.
Management thinks the LLM is doing most of the work. Work is off shored. Oh, the quality sucks when someone without a clue is driving. We need to hire again.
On my personal projects it's easily 10x faster if not more in some circumstances. At work where things are planned out months in advance and I'm working with 5 different teams to figure out the right way to do things for requirements that change 8 times during development? Even just stuff with PR review and making sure other people understand it and can access it. Idk, sometimes it's probably break-even or that 10-15%. It just doesn't work well in some environments and what really makes it flourish (having super high quality architectural planning/designs/standardized patterns etc.) is basically just not viable at anything but the smallest startups and solo projects.
Frankly, even just getting engineers to agree upon those super-specific standardized patterns is asking a ton, especially since lots of the things that help AI out are not what they are used to. As soon as you have stuff that starts deviating it can confuse the AI and makes that 10x no longer accessible. Also, no one would want to review the PRs I'd make for the changes I do on my "10x" local project... Maintaining those standards is already hard enough on my side projects; AI will naturally deviate and create noise, and the challenge is constructing systems to guide it so that nothing deviates (since noise would lead to more noise).
I think it's mostly a rebalancing thing: if you have one or a couple of like-minded engineers who intend to do it, they can get that 10x. I do not see that EVER existing in any actual corporate environment, or even once you get more than like 4 people tbh.

AI for middle management and project planning, on the other hand...
I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
> just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
You had to paste more into your prompts back then to make the output work with the rest of your codebase, because there weren't good IDEs/"agents" for it, but you've been able to get really, really good code for 90% of "most" day-to-day SWE work since at least OpenAI released the GPT-4 API, which was a couple of years ago.
Today it's a lot easier to demo low-effort "make a whole new feature or prototype" things than doing the work to make the right API calls back then, but most day to day work isn't "one shot a new prototype web app" and probably won't ever be.
I'm personally more productive than 1 or 2 years ago now because the time required to build the prompts was slower than my personal rate of writing code for a lot of things in my domain, but hardly 10x. It usually one-shots stuff wrong, and then there's a good chance that it'll take longer to chase down the errors than it would've to just write the thing - or only use it as "better autocomplete" - in the first place.
> I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
So? It sounds like you're prodding us to make an extrapolation fallacy (I don't even grant the "10x in 12 months" point, but let's just accept the premise for the sake of argument).
Honestly, 12 months ago the base models weren't substantially worse than they are right now. Some people will argue with me endlessly on this point, and maybe they're a bit better on the margin, but I think it's pretty much true. When I look at the improvements of the last year with a cold, rational eye, they've been in two major areas:

* cost & efficiency

* UI & integration

So how do we improve from here? Cost & efficiency are the obvious lever with historical precedent: GPUs kinda suck for inference, and costs are (currently) rapidly dropping. But, maybe this won't continue -- algorithmic complexity is what it is, and barring some revolutionary change in the architecture, LLMs are exponential algorithms.
UI and integration is where most of the rest of the recent improvement has come from, and honestly, this is pretty close to saturation. All of the various AI products already look the same, and I'm certain that they'll continue to converge to a well-accepted local maximum. After that, huge gains in productivity from UX alone will not be possible. This will happen quickly -- probably in the next year or two.
Basically, unless we see a Moore's law of GPUs, I wouldn't bet on indefinite exponential improvement in AI. My bet is that, from here out, this looks like the adoption curve of any prior technology shift (e.g. mainframe -> PC, PC -> laptop, mobile, etc.) where there's a big boom, then a long, slow adoption for the masses.
12 months ago, we had no reasoning models and even very basic arithmetic was outside of the models' grasp. Coding assistants mostly worked on the level of tab-completing individual functions, but now I can one-shot demo-able prototypes (albeit nothing production-ready) of webapps. I assume you consider the latter "integration", but I think coding is so key to how the base models are being trained that this is due to base model improvements too. This is testable - it would be interesting to get something like Claude Code running on top of a year-old open source model and see how it does.
If you're going to call all of that not substantial improvement, we'll have to agree to disagree. Certainly it's the most rapid rate of improvement of any tech I've personally seen since I started programming in the early '00s.
I consider the reasoning models to be primarily a development of efficiency/cost, and I thought the first one was about a year ago, but sure, ok. I don’t think it changes the argument I’m making. The LLM ouroboros / robot centipede has been done, and is not itself a path towards exponential improvement.
To be quite honest, I’ve found very little marginal value in using reasoning models for coding. Tool usage, sure, but I almost never use “reasoning” beyond that.
Also, LLMs still cannot do basic math. They can solve math exams, sure, but you can’t trust them to do a calculation in the middle of a task.
> but you can’t trust them to do a calculation in the middle of a task.
You can't trust a person either. Calculating is its own mode of thinking; if you don't pause and context switch, you're going to get it wrong. Same is the case with LLMs.
Tool usage, reasoning, and the "agentic approach" are all in part ways of allowing the LLM to do the context switch required, instead of taking the math challenge as it goes and blowing it.
The proper comparison is not a human, it’s a computer. Or even a human with a computer.
But my point wasn’t to judge LLMs on their (in)ability to do math - I was only responding to the parent comment’s assertion that they’ve gotten better in this area.
It’s worth noting that all of the major models still randomly decide to ignore schemas and tool calls, so even that is not a guarantee.
12 months ago, if I fed a list of ~800 poems with about ~250k tokens to an LLM and asked it to summarize this huge collection, they would be completely blind to some poems and were prone to hallucinating not simply verses but full-blown poems. I was testing this with every available model out there that could accept 250k tokens. It just wouldn't work. I also experimented with a subset that was at around ~100k tokens to try other models and results were also pretty terrible. Completely unreliable and nothing it said could be trusted.
Then Gemini 2.5 pro (the first one) came along and suddenly this was no longer the case. Nothing hallucinated, incredible pattern finding within the poems, identification of different "poetic stages", and many other rather unbelievable things — at least to me.
After that, I realized I could start sending in more of those "hard to track down" bugs to Gemini 2.5 pro than other models. It was actually starting to solve them reliably, whereas before it was mostly me doing the solving and models mostly helped if the bug didn't occur as a consequence of very complex interactions spread over multiple methods. It's not like I say "this is broken, fix it" very often! Usually I include my ideas for where the problem might be. But Gemini 2.5 pro just knows how to use these ideas better.
I have also experimented with LLMs consuming conversations, screenshots, and all kinds of ad-hoc documentation (e-mails, summaries, chat logs, etc) to produce accurate PRDs and even full-on development estimates. The first one that actually started to give good results (as in: it is now a part of my process) was, you guessed it, Gemini 2.5 pro. I'll admit I haven't tried o3 or o4-mini-high too much on this, but that's because they're SLOOOOOOOOW. And, when I did try, o4-mini-high was inferior and o3 felt somewhat closer to 2.5 pro, though, like I said, much much slower and...how do I put this....rude ("colder")?
All this to say: while I agree that perhaps the models don't feel like they're particularly better at some tasks which involve coding, I think 2.5 pro has represented a monumental step forward, not just in coding, but definitely overall (the poetry example, to this day, still completely blows my mind. It is still so good it's unbelievable).
> 12 months ago, if I fed a list of ~800 poems with about ~250k tokens to an LLM and asked it to summarize this huge collection, they would be completely blind to some poems and were prone to hallucinating not simply verses but full-blown poems.
for the past week claude code has been routinely ignoring CLAUDE.md and every single instruction in it. I have to manually prompt it every time.
As I was vibe coding the notes MCP mentioned in the article [1] I was also testing it with claude. At one point it just forgot that MCPs exist. It was literally this:
    > add note to mcp
    Calling mcp:add_note_to_project

    > add note to mcp
    Running find mcp.ex ...
    Interrupted by user ...

    > add note to mcp
    Running <convoluted code generation command with mcp in it>

We have no objective way of measuring performance and behavior of LLMs.
Your comment warrants a longer, more insightful reply than I can provide, but I still feel compelled to say that I get the same feeling from o3. Colder, somewhat robotic and unhelpful. It's like the extreme opposite of 4o, and I like neither.
My weapon of choice these days is Claude 4 Opus but it's slow, expensive and still not massively better than good old 3.5 Sonnet
Exactly! Here's my take:
4o tends to be, as they say, sycophantic. It's an AI masking as a helpful human, a personal assistant, a therapist, a friend, a fan, or someone on the other end of a support call. They sometimes embellish things, and will sometimes take a longer way getting to the destination if it makes for what may be a more enjoyable conversation — they make conversations feel somewhat human.
OpenAI's reasoning models, though, feel more like an AI masking as a code slave. It is not meant to embellish, to beat around the bush or to even be nice. Its job is to give you the damn answer.
This is why the o* models are terrible for creative writing, for "therapy" or pretty much anything that isn't solving logical problems. They are built for problem solving, coding, breaking down tasks, getting to the "end" of it. You present them a problem you need solved and they give you the solution, sometimes even omitting the intermediate steps because that's not what you asked for. (Note that I don't get this same vibe from 2.5 at all)
Ultimately, it's this "no-bullshit" approach that feels incredibly cold. It often won't even offer alternative suggestions, and it certainly doesn't bother about feelings because feelings don't really matter when solving problems. You may often hear 4o say it's "sorry to hear" about something going wrong in your life, whereas o* models have a much higher threshold for deciding that maybe they ought to act like a feeling machine, rather than a solving machine.
I think this is likely pretty deliberate of OpenAI. They must for some reason believe that if the model is more concise in its final answers (though not necessarily in the reasoning process, which we can't really see), then it produces better results. Or perhaps they lose less money on it, I don't know.
Claude is usually my go-to model if I want to "feel" like I'm talking to more of a human, one capable of empathy. 2.5 pro has been closing the gap, though. Also, Claude used to be by far much better than all other models at European Portuguese (+ Portuguese culture and references in general), but, again, 2.5 pro seems just as good nowadays.
On another note, this is also why I also completely understand the need for the two kinds of models for OpenAI. 4o is the model I'll use to review an e-mail, because it won't just try to remove all the humanity of it and make it the most succinct, bland, "objective" thing — which is what the o* models will.
In other words, I think: (i) o* models are supposed to be tools, and (ii) 4o-like models are supposed to be "human".
What exactly are you basing any of your assertions off of?
The same sort of rigorous analysis that the parent comment used (that’s a joke, btw).
But seriously: If you find yourself agreeing with one and not the other because of sourcing, check your biases.
It still isn't.
It’s great when they use AI to write a small app “without coding at all” over the weekend and then come in on Monday to brag about it and act baffled that tasks take engineers any time at all.
How much of the communication and meetings are because traditionally code was very expensive and slow to create? How many of those meetings might be streamlined or entirely disappear in the future? In my experience there is a lot of process around making sure that software is on schedule and that it's doing what it is supposed to do. I think that the software lifecycle is about to be reinvented.
The reports from analysis of open source projects are that it's something in the range of 10%-15% productivity gains... so it sounds like you're spot on.
That's about right for copilots. It's much higher for agentic coding.
[citation needed]
Agentic coding has really only taken off in the last few weeks due to better pricing.
Wait till they hear about the productivity gains from using vim/neovim.
Your developers still push a mouse around to get work done? Fire them.
Expectations are absolutely way too high. It's going to lead to a lot of toxicity and people being fired. It's really going to suck.
Canva has seen a 30% productivity uplift - https://fortune.com/2025/06/25/canva-cto-encourages-all-5000...
AI is the new uplift. Embrace and adapt, as a rift is forming in what employers seek in terms of skills from employees (see my talk at https://ghuntley.com/six-month-recap/).
I'm happy to answer any questions folks may have. Currently AFK [2] vibecoding a brand new programming language [1].
[1] https://x.com/GeoffreyHuntley/status/1940964118565212606 [2] https://youtu.be/e7i4JEi_8sk?t=29722
There’s something hilariously Portlandia about making outlandish claims with complete confidence and then plugging your own talk.
There’s citations to the facts in the links.
And that's with 50% adoption and probably a broad distribution of tool use skill.
> The productivity for software engineers is at around 30%
That would be a 70% descent?
I’m a tech lead and I have maybe 5X the output now compared to everybody else under me. Quantified by scoring tickets at a team level. I also have more responsibilities outside of IC work compared to the people under me. At this point I’m asking my manager to fire people that still think llms are just toys because I’m tired of working with people with this poor mindset. A pragmatic engineer continually reevaluates what they think they know. We are at a tipping point now. I’m done arguing with people that have a poor model of reality. The rest of us are trying to compete and get shit done. This isn’t an opinion or a game. It’s business with real life consequences if you fall behind. I’ve offered to share my workflows, prompts, setup. Guess how many of these engineers have taken me up on my offer: 1-2. The juniors and the ones that are very far behind have not.
It’s funny. We fired someone with this attitude Thursday. And by this attitude I mean yours.
Not necessarily because of their attitude but because it turns out the software they were shipping was rife with security issues. Security managed to quickly detect and handle the resulting incident. I can’t say his team were sad to see him go.
Are you the one at Ableton responsible for it ignoring the renaming of parameter names during the setState part of a Live program? Some of us are already jumping through ridiculous hoops to cover for your… mindset. There's stuff coming up that used to work and doesn't now, like in Live 12. From your response I would guess this is a trend that will hold.
We should not be having to code special 'host is Ableton Live' cases in JUCE just to get your host to work like the others.
Can you please not fire any people who are still holding your operation together?
Why do you think this person works at Ableton? From their comments it doesn't seem that they would be a fit for a small, cool Berlin company making tools for techno.
You've been doing the big I am about LLMs on HN for most of your last comments.
Everyone else who raises any doubts about LLMs is an idiot and you're 10,000x better than everyone else and all your co-workers should be fired.
But what's absent from all your comments is what you make. Can you tell us what you actually do in your >500k job?
Are you, by any chance, a front-end developer?
Also, a team-lead that can't fire their subordinates isn't a team-lead, they're a number two.
I will thank God every day I don’t work with you or for you. How toxic.
im glad I don’t have to work with you too lol.
It’s not toxic for me to expect someone to get their work done in a reasonable amount of time with the tools available to them. If you’re an accountant and you take 5X the time to do something because you have beef with Excel, you’re the problem. It’s not toxicity to tell you that you are a bad accountant.
You believe the cost of firing and rehiring to be cheaper than simple empirical persuasion?
You don't sound like a great lead to me, but I suppose you could be working with absolutely incompetent individuals, or perhaps your soft skills need work.
My apologies but I see only two possibilities for others not to take the time to follow your example given such strong evidence. They either actively dislike you or are totally incompetent. I find the former more often true than the latter.
You have about 50% of HN thinking LLMs are useless and you’re commenting on an article about how it’s still magical and wishful thinking, and that this is crypto all over again. But sure, the problem is me, not the people with a poor model of reality
> You have about 50% of HN thinking LLMs are useless and you’re commenting on an article about how it’s still magical and wishful thinking,
Perhaps you should try reading the article again (or maybe let some LLM summarize it for you)
> But sure, the problem is me, not the people with a poor model of reality
It's amazing how you almost literally use crypto-talk
You believe the cost of firing and rehiring to be cheaper than simple empirical persuasion?
My apologies but that does not sound like good leadership to me. It actually sounds like you may have deficiencies in your skills as it relates to leadership. Perhaps in a few years we will have an LLM who can provide better leadership.
> I’m done arguing with people that have a poor model of reality.
isn't this the entire LLM experience?
A new copypasta is born.
Go back to reddit
"I’ve offered to share my workflows, prompts" That should all be checked in.
It’s checked in, they have just written off llms
Dude, if you are a tech lead, and you measure productivity by scoring tickets, you are doing it pretty badly. I would fire you instead.
You seem completely insufferable and incredibly cringeworthy.
I have to say I’m in the exact camp the author is complaining about. I’ve shipped non trivial greenfield products which I started back when it was only ChatGPT and it was shitty. I started using Claude with copying and pasting back and forth between the web chat and XCode. Then I discovered Cursor. It left me with a lot of annoying build errors, but my productivity was still at least 3x. Now that agents are better and claude 4 is out, I barely ever write code, and I don’t mind. I’ve leaned into the Architect/Manager role and direct the agent with my specialized knowledge if I need to.
I started a job at a demanding startup and it’s been several months and I have still not written a single line of code by hand. I audit everything myself before making PRs and test rigorously, but Cursor + Sonnet is just insane with their codebase. I’m convinced I’m their most productive employee and that’s not by measuring lines of code, which don’t matter; people who are experts in the codebase ask me for help with niche bugs I can narrow in on in 5-30 minutes as someone who’s fresh to their domain. I had to lay off taking work away from the front end dev (which I’ve avoided my whole career) because I was stepping on his toes, fixing little problems as I saw them thanks to Claude. It’s not vibe coding - there’s a process of research and planning and perusing in careful steps, and I set the agent up for success. Domain knowledge is necessary. But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there’s two articles like this every week now.
But you just confirmed everything the blogpost claimed.
You didn't share any evidence with us even though you claim unbelievable things.
You even went as far as registering a throwaway account to hide your identity and to make verifying any of your claims impossible.
Your comment feels more like a joke to me
... this from an account with <100 karma.
Look, the person who wrote that comment doesn't need to prove anything to you just because you're hopped up after reading a blog post that has clearly given you a temporary dopamine bump.
People who understand their domains well and are excellent written communicators can craft prompts that will do what we used to spend a week spinning up. It's self-evident to anyone in that situation, and the only thing we see when people demand "evidence" is that you aren't using the tools properly.
We don't need to prove anything because if you are working on interesting problems, even the most skeptical person will prove it to themselves in a few hours.
Feeling triggered? Feeling afraid? And yes, every claim needs to be proven, otherwise those who make the claims will only convince 4 year olds.
>People who understand their domains well and are excellent written communicators can craft prompts that will do what we used to spend a week spinning up. It's self-evident to anyone in that situation, and the only thing we see when people demand "evidence" is that you aren't using the tools properly.
You have no proof of this, so I guess you chose your camp already?
Same experience here, probably in a slightly different way of work (PhD student). Was extremely skeptical of LLMs, Claude Code has completely transformed the way I work.
It doesn't take away the requirement of _curation_ - that remains firmly in my camp (partially what a PhD is supposed to teach you! to be precise and reflective about why you are doing X, what do you hope to show with Y, etc -- break down every single step, explain those steps to someone else -- this is a tremendous soft skill, and it's even more important now because these agents do not have persistent world models / immediately forget the goal of a sequence of interactions, even with clever compaction).
If I'm on my game with precise communication, I can use CC to organize computation in a way which has never been possible before.
It's not easier than programming (if you care about quality!), but it is different, and it comes with different idioms.
I find that the code quality LLMs output is pretty bad. I end up going through so many iterations that it ends up being faster to do it myself. What I find agents actually useful for is doing large scale mechanical refactors. Instead of trying to figure out the perfect vim macro or AST rewrite script, I'll throw an agent at it.
I disagree strongly at this point. The code is generally good if the prompt was reasonable at this point but also every test possible is now being written, every UI element has all the required traits, every function has the correct documentation attached, the million little refactors to improve the codebase are being done, etc.
Someone told me ‘ai makes all the little things trivial to do’ and i agree strongly with that. Those many little things are things that together make a strong statement about quality. Our codebase has gone up in quality significantly with ai whereas we’d let the little things slide due to understaffing before.
> The code is generally good if the prompt was reasonable at this point
Which, again, is 100% unverifiable and cannot be generalized. As described in the article.
How do I know this? Because, as I said in the article, I use these tools daily.
And "prompt was reasonable" is a yet another magical incantation that may or may not work. Here's my experience: https://news.ycombinator.com/item?id=44470144
> The code is generally good if the prompt was reasonable
The point is writing that prompt takes longer than writing the code.
> Someone told me ‘ai makes all the little things trivial to do’ and i agree strongly with that
Yeah, it's great for doing all of those little things. It's bad at doing the big things.
> The point is writing that prompt takes longer than writing the code.
Luckily we can reuse system prompts :) Mine usually contains something like https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313... + project-specific instructions, which is reused across sessions.
Currently, it does not take the same amount of time to prompt as if I was to write the code.
Have to disagree with this too - ask an LLM to architect a project, or propose a cleaner solution, and it usually does a good job.
Where it still sucks is doing both at once. Thus the shift to integrating "to do" lists in Cursor. My flow has shifted to "design this feature" then "continue to implement" 10 times in a row with code review between each step.
> I find that the code quality LLMs output is pretty bad.
That was my experience with Cursor, but Claude Code is a different world. What specific product/models brought you to this generalization?
Claude Code depending on weather, phase of the moon, and compute availability at a specific point in time: https://news.ycombinator.com/item?id=44470144
What sort of mechanical refactors?
"Find all places this API is used and rewrite it using these other APIs."
> I audit everything myself before making PRs and test rigorously
How do you audit code from an untrusted source that quickly? LLMs do not have the whole project in their heads and are prone to hallucinating.
On average how long are your prompts and does the LLM also write the unit tests?
The auditing is not quick. I prefer cursor to claude code because I can review its changes while it’s going more easily and stop and redirect it if it starts to veer off course (which is often, but the cost of doing business). Over time I still gain an understanding of the codebase that I can use to inform my prompts or redirection, so it’s not like I’m blindly asking it to do things. Yes, I do ask it to write unit tests a lot of the time. But I don’t have it spin off and just iterate until the unit tests pass — that’s a recipe for it to do what it needs to do to pass them and is counterproductive. I plan what I want the set of tests to look like and have them write functions in isolation without mentioning tests, and if tests fail I go through a process of auditing the failing code and then the tests themselves to make sure nothing was missed. It’s exactly how I would treat a coworkers code that I review. My prompts range from a few sentences to a few paragraphs, and nowadays I construct a large .md file with a checklist that we iterate on for larger refactors and projects to manage context
I use Claude code for hours a day, it’s a liar, trust what it does at your own risk.
I personally think you’re sugar coating the experience.
It lies with such enthusiasm though.
Recently I worked with a weird C flavor (Monkey C); it hallucinated every single method, all the time, every time.
I know it's just a question of time, likely. However, that was soooo far from helpful. And it was so sure it was doing it right, again and again, without ever consulting the docs.
> I use Claude code for hours a day, it’s a liar, trust what it does at your own risk.
The person you're responding to literally said, "I audit everything myself before making PRs and test rigorously".
I didn't see that but I assume they edited their comment.
Please re-read the article. Especially the first list of things we don't know about you, your projects etc.
Your specific experience cannot be generalized. And speaking as the author, and who is (as written in the article) literally using these tools everyday.
> But I’m just so floored how anyone could not be extracting the same utility from it. It feels like there’s two articles like this every week now.
This is where we learn that you haven't actually read the article. Because it is very clearly stating, with links, that I am extracting value from these tools.
And the article is also very clearly not about extracting or not extracting value.
I did read the entire article before commenting and acknowledge that you are using them to some effect, but the line about 50% of the time it works 50% of the time is where I lost faith in the claims you’re making. I agree it’s very context dependent but, in the same way, you did not outline your approaches and practices in how you use AI in your workflow. The same lack of context exists on the other side of the argument.
I agree about the 50/50 thing. It's about how much Claude helped me, and I use it daily too.
I'll give some context, though.
- I use OCaml and Python/SQL, on two different projects.
- Both are single-person.
- The first project is a real-time messaging system, the second one is logging a bunch of events in an SQL database.
In the first project, Claude has been... underwhelming. It casually uses C idioms, overuses records and procedural programming, ignores basic stuff about the OCaml standard library, and even gave me some data structures that slowed me down later down the line. It also casually lies about what functions do.
A real example: `Buffer.add_utf_8_uchar` adds the ASCII representation of a UTF-8 char to a buffer, so it adds something that looks like `\123\456` for non-ASCII.
I had to scold Claude for using this function to add a UTF-8 character to a Buffer so many times I've lost count.
In the second project, Claude really shined. Making most of the SQL database and moving most of the logic to the SQL engine, writing coherent and readable Python code, etc.
I think the main difference is that the first one is an arcane project in an underdog language. The second one is a special case of a common "shovel through lists of stuff and stuff them in SQL" problem, in the most common language.
You basically get what you trained for.
Just FYI, try adding a comment to that function saying what it is intended to be used for. Without more info, LLMs will rely strongly on function names. Heck, have the LLM add comments to every function and I bet it will start to do better.
It's not my function in the example, it's a standard library function. It does have a weird name though.
> but the line about 50% of the time it works 50% of the time is where I lost faith in the claims you’re making.
It's a play on the Anchorman joke that I slightly misremembered: "60% of the time it works 100% of the time"
> is where I lost faith in the claims you’re making.
Ah yes. You lost faith in mine, but I have to have 100% faith in your 100% unverified claim about "job at a demanding startup" where "you still haven't written a single line of code by hand"?
Why do you assume that your word and experience is more correct than mine? Or why should anyone?
> you did not outline your approaches and practices in how you use AI in your workflow
No one does. And if you actually read the article, you'd see that is literally the point.
> …the line about 50% of the time it works 50% of the time is where I lost faith in the claims you’re making…
That's where the author lost me as well. I'd really be interested in a deep dive on their workflow/tools to understand how I've been so unbelievably lucky in comparison.
Sibling comment: https://news.ycombinator.com/item?id=44468374
> I started a job at a demanding startup and it’s been several months and I have still not written a single line of code by hand
Damn, this sounds pretty boring.
It’s not. It’s like I used to play baseball professionally and now I’m a coach or GM building teams and yielding results. It’s a different set of skills. I’m working mostly in idea space and seeing my ideas come to life with a faster feedback loop and the toil is mostly gone
> I’ve shipped non trivial greenfield products
Links please
Here's maybe the most impressive thing I've vibecoded, where I wanted to track a file write/read race condition in a vscode extension: https://github.com/go-go-golems/go-go-labs/tree/main/cmd/exp...
This is _far_ from web crud.
Otherwise, 99% of my code these days is LLM generated, there's a fair amount of visible commits from my opensource on my profile https://github.com/wesen .
A lot of it is more on the system side of things, although there are a fair amount of one-off webapps, now that I can do frontends that don't suck.
I’d like to, but purposefully am using a throwaway account. It’s an iOS app rated 4.5 stars on the app store and has a nice community. Mild userbase, in the hundreds.
> but my productivity was still at least 3x
How do you measure this?
Mean time to shipping features of various estimated difficulty. It’s subjective and not perfect, but generally speaking I need to work way less. I’ll be honest, one thing I think I could have done faster without AI was to implement CRDT-based cloud sync for a project I have going. I think I’ve tried to utilize AI too much for this. It’s good at implementing vector clocks, but not at preventing race conditions.
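(For reference, the vector-clock side of that is the mechanical bookkeeping sketched below, which the models handle fine; the part they kept fumbling for me was everything around it when real clients race each other. A minimal sketch, not my actual sync code:)

    # Minimal vector clock: replica id -> counter.
    from typing import Dict

    Clock = Dict[str, int]

    def tick(clock: Clock, replica: str) -> Clock:
        """Local event on `replica`: bump its own counter."""
        out = dict(clock)
        out[replica] = out.get(replica, 0) + 1
        return out

    def merge(a: Clock, b: Clock) -> Clock:
        """Element-wise max, applied when receiving a remote update."""
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

    def compare(a: Clock, b: Clock) -> str:
        """Returns 'equal', 'before', 'after', or 'concurrent' (i.e. a conflict to resolve)."""
        keys = a.keys() | b.keys()
        a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
        b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "before"
        if b_le_a:
            return "after"
        return "concurrent"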
Are you sure it wasn't just stealing from open source projects? If so, you could just cut out the middle man.
And you created an account just to write this unbelievable claim?
A bit suspicious, wouldn’t you agree?
> there’s a process of research and planning and perusing in careful steps, and I set the agent up for success
Are there any good articles you can share or maybe your process? I’m really trying to get good at this but I don’t find myself great at using agents and I honestly don’t know where to start. I’ve tried the memory bank in cline, tried using more thinking directives, but I find I can’t get it to do complex things and it ends up being a time sink for me.
More anecdata: +1 for “LLMs write all my production code now”. 25+ years in industry, as expert as it’s possible to be in my domain. 100% agree LLMs fail hilariously badly, often, and dangerously. And still, write ~all my code.
No agenda here, not selling anything. Just sitting here towards the later part of my career, no need to prove anything to anyone, stating the view from a grey beard.
Crypto hype was shilling from grifters pumping whatever bag-holding scam they could, which was precisely what the behavioral economic incentives drove. GenAI dev is something else. I've watched many people working with it; your mileage will vary. But in my opinion (and it's mine, you do you), hand coding is becoming an anachronism. The only part I wonder about is how far up and down the system/design/architecture stack the power-tooling is going to go. My intuition and empirical findings incline towards a direction I think would fuel a flame war. But I'm just a grey beard Internet random, and hey look, no evidence, just more baseless claims. Nothing to see here.
Disclosure: I hold no direct shares in Mag 7, nor do I work for one.
Web dev CRUD in node?
Multi platform web+native consumer application with lots of moving parts and integration. I think to call it a CRUD app would be oversimplifying it.
I personally don't really get this.
_So much_ work in the 'services' industries globally really comes down to a human transposing data from one Excel sheet to another (or from a CRM/emails to Excel), manually. Every (or nearly every) enterprise-scale company will have hundreds if not thousands of FTEs doing this kind of work day in, day out, often with a lot of it outsourced. I would guess that for every 1 software engineer there are 100 people doing this kind of 'manual data pipelining'.
So really, for giant value to be created out of LLMs you do not need them to be incredible at OCaml. They just need to ~outperform humans on Excel. Where I do think MCP really helps is that you can connect all these systems together easily, and a lot of the errors in this kind of work come from trying to pass the entire 'task' into the context. If you can take an email via MCP, extract some data out of it and put it into a CRM (again via MCP) a row at a time, the hallucination rate is very low IME. I would say at least the level of an overworked junior human.
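To make the row-at-a-time point concrete, here is a rough sketch of the shape of that loop; fetch_emails, extract_fields and crm_insert are hypothetical stand-ins for whatever your MCP servers actually expose, not real API names:

    # Sketch: process one email per LLM call instead of passing the whole task.
    def process_inbox(fetch_emails, extract_fields, crm_insert):
        failures = []
        for email in fetch_emails():            # one small unit of work at a time
            try:
                row = extract_fields(email)      # one LLM call, tiny context
                crm_insert(row)                  # one auditable write to the CRM
            except ValueError as err:            # bad extractions get flagged,
                failures.append((email, err))    # not silently inserted
        return failures                          # a human reviews the leftovers

Keeping each call down to one email and one row is what keeps the hallucination rate low; the human only has to look at the failure pile.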
Perhaps this was the point of the article, but non-determinism is not an issue for these kinds of use cases, given that the humans involved are not deterministic either. We can build systems and processes to enforce quality on non-deterministic (e.g. human) systems.
Finally, I've followed crypto closely and LLMs closely too. They do not seem similar in terms of utility or adoption. The closest thing I can recall is smartphone adoption. A lot of my non-technical friends didn't think they wanted a smartphone when the iPhone first came out. Within a few years, all of them had one. Similar with LLMs: virtually all of my non-technical friends use them now, for incredibly varied use cases.
Making a comparison to crypto is lazy criticism. It’s not even worth validating. It’s people who want to take the negative vibe from crypto and repurpose it. The two technologies have nothing to do with each other, and therefore there’s clearly no reason to make comparative technical assessments between them.
That said, the social response is a trend of tech worship that I suspect many engineers who have been around the block are weary of. It’s easy to find unrealistic claims, the worst coming from the CEOs of AI companies.
At the same time, a LOT of people are practically computer illiterate. I can only imagine how exciting it must seem to people who have very limited exposure to even basic automation. And the whole “talking computer” we’ve all become accustomed to seeing in science fiction is pretty much becoming reality.
There’s a world of takes in there. It’s wild.
I worked in ML and NLP for several years before the current AI wave. What's most striking to me is that this is way more mainstream than anything that has ever happened in the field. And with that comes a lot of inexperience in designing with statistical inference. It's going to be the Wild West for a while — in opinions, in successful implementation, in learning how to form realistic project ideas.
Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.
> Look at it this way: now your friend with a novel app idea can be told to do it themselves. That’s at least a win for everyone.
For now, anyways. Thing is, that friend now also has a reasonable shot at succeeding in doing it themselves. It'll take some more time for people to fully internalize it. But let's not forget that there's a chunk of this industry that's basically building apps for people with "novel app ideas" that have some money but run out of friends to pester. LLMs are going to eat a chunk out of that business quite soon.
wrong.
ultimately, crypto is information science. mathematically, cryptography, compression, and so on (data transmission) are all the "same" problem.
LLMs compress knowledge, not just data, and they do it in a lossy way.
traditional information science work is all about dealing with lossless data in a highly lossy world.
And it's all powered by electricity. Coincidence? I think not.
Each FTE doing that manual data pipelining work is also validating that work, and they have a quasi-legal responsibility to do their job correctly and on time. They may have substantial emotional investment in the company, whether survival instinct to not be fired, or ambition to overperform, or ethics and sense to report a rogue manager through alternate channels.
An LLM won't call other nodes in the organization to check when it sees that the value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0. *It can absolutely be worth an FTE salary to make sure these numbers are accurate.* And for there to be a person to blame/fire/imprison if they aren't accurate.
People are also incredibly accurate at doing this kind of manual data piping all day.
There is also a reason these jobs aren't already automated. For many of them you don't need language models; we could have automated them already, but it isn't worth it for anyone to sign off on. I have been in this situation at a bank: I could have automated a process rather easily, but the upside for me was a smaller team and no real gain, while the downside was getting fired for a massive automated mistake if something went wrong.
> An LLM won't call other nodes in the organization to check when it sees that the value is unreasonable for some out-of-context reason, like yesterday was a one-time-only bank holiday and so the value should be 0.
Why not? LLMs are the first kind of technology that can take this kind of global view. We're not making much use of it in this way just yet, but considering "out-of-context reasons" and taking a wider perspective is pretty much the defining aspect of LLMs as general-purpose AI tools. In time, I expect them to match humans on this (at least humans that care; it's not hard to match those who don't).
I do agree on the liability angle. This increasingly seems to be the main value a human brings to the table. It's not a new trend, though. See e.g. medicine, architecture, civil engineering - licensed professionals aren't doing the bulk of the work, but they're in the loop and well-compensated for verifying and signing off on the work done by less-paid technicians.
> considering "out-of-context reasons" and taking a wider perspective is pretty much the defining aspect of LLMs as general-purpose AI tools.
"out-of-context" literally means that the reason isn't in its context. Even if it can make the leap that the number should be zero if it's a bank holiday, how would an LLM know that yesterday was a one-off bank holiday? A human would only know through their lived experience that the markets were shut down, the news was making a big deal over it, etc. It's the same problem using cheap human labor in a different region of the world for this kind of thing; they can perform the mechanical task, but they don't have the context to detect the myriad of ways it can go subtly wrong.
> "out-of-context" literally means that the reason isn't in its context. Even if it can make the leap that the number should be zero if it's a bank holiday, how would an LLM know that yesterday was a one-off bank holiday?
Depends. Was it a one-off holiday announced at the 11th hour or something? Then it obviously won't know. You'd need extra setup to enable it to realize that, such as first feeding an LLM the context of your task plus a digest of news stories spanning a week, asking it to find anything potentially relevant, and then appending that output to the LLM calls doing the work. It's not something you'd do by default in the general case, but that's only because tokens cost money and context space is scarce.
Is it a regular bank holiday? Then all it would need is today's date in the context, which is often just appended somewhere between system and user prompts, along with e.g. user location data.
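A minimal sketch of what that looks like, assuming the usual chat-style message list (the news-digest part is the optional extra setup described above, not something most deployments do):

    from datetime import date

    def build_messages(system_prompt: str, user_task: str, news_digest: str | None = None):
        """Assemble messages with today's date (and optionally a pre-filtered
        news digest) injected between the system and user prompts."""
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "system", "content": f"Today's date: {date.today().isoformat()}"},
        ]
        if news_digest:
            messages.append({"role": "system",
                             "content": "Possibly relevant recent events:\n" + news_digest})
        messages.append({"role": "user", "content": user_task})
        return messages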
I see that by "out-of-context reasons" you meant the first case; I read it as the second. In the second case, the "out-of-context" bit could be the fact that a bank holiday could alter the entry for that day; if that rule is important or plausible enough but not given explicitly in the prompt, the model will have learned it during training and will likely connect the dots. This is what I meant by the "defining aspect of LLMs as general-purpose AI tools".
The flip side is, when it connects the dots when it shouldn't, we say it's hallucinating.
That kind of knowledge is present in the training set, doesn't need to be in the context or system prompt.
That said, I too would only use an LLM today in the same kinds of role that five years ago would be outsourced to a different culture.
Culture, not even language: this is how you get the difference between "biscuits and gravy" as understood in the UK vs in the USA.
LLMs handle major trappings of culture just fine. As long as a culture has enough of a footprint in terms of written words, the LLM probably knows it better than any single individual, even though it has not lived it.
Looking at your other comment sibling to mine, I think part of the difficulty discussing these topics is how much these things are considered isolated magic artefacts (bad for engineering) or one tool amongst many where the magic word is "synergy".
So I agree with you: LLMs do know all written cultures on the internet and can mimic them acceptably — but they only actually do so when this is requested by some combination of the fine-tuning, RLHF, system prompt, and context.
In your example, that means having some current news injected, which is easy but still requires someone to plumb it in. And as you say, you'd not do that unless you thought you needed to.
But even easier to pick, lower-hanging fruit, often gets missed. When the "dangerous sycophancy" behaviour started getting in the news, I updated my custom ChatGPT "traits" setting to this:
Honesty and truthfulness are of primary importance. Avoid American-style positivity, instead aim for German-style bluntness: I absolutely *do not* want to be told everything I ask is "great", and that goes double when it's a dumb idea.
But cultural differences can be subtle, and there's a long tail of cultural traits of the same kind, which means 1980s Text Adventure NLP doesn't scale to what ChatGPT itself does. While this can still be solved with fine-tuning or getting your staff to RLHF it, the number of examples current AI needs in order to learn is high compared to a real human, so it won't learn your corporate culture from experience *as fast* as a new starter on your team, unless you're a sufficiently big corporation that it can be on enough teams (I don't know how many exactly) within your company at the same time.
No, it does not know culture. And no, it can't handle talking about it.
Ask an LLM "Can you compare Egyptian mythology with aliens?" and it will happily do it:
That's an offensive, pseudoscientific take on Egyptian culture, one shunned by academics.
Even ChatGPT's "Critical Viewpoint" section (a small part of a large bullshit response) _still_ entertains offensive ideas:
They should have answered that such comparisons are potentially offensive, and explained why academia thinks so, _before_ spilling out nonsense.
There legitimately is a lot of crossover between Egyptian mythology and other high-strangeness phenomena as they're understood culturally, though, such as aliens/UFOs.
I think you just demonstrated that you know less about culture than LLMs do, which is not at all surprising.
Dude, I chose this example precisely because I know for a fact there is a lot of bullshit about it on the internet and LLMs cannot differentiate between a good source and a bad source.
This is honestly unbelievable. You're defending ancient aliens. What's next? Heaven's Gate? Ashtar Sheran?
Even the LLMs themselves acknowledge that this is regarded as offensive. If you correct it, it will apologize (they just can't do it _before_ you correct them).
You're wrong.
> Even the LLMs themselves acknowledge that this is regarded as offensive. If you correct it, it will apologize (they just can't do it _before_ you correct them).
Nah, that's just LLMs being trained to acquiesce to the insanity of the last ~15 years, as many people seem to expect that claiming you're offended by something is an ultimate argument that everyone must yield to (and they'll keep making a fuss out of it until they do).