Hard agree. As LLMs drive the cost of writing code toward zero, the volume of code we produce is going to explode. But the cost of complexity doesn't go down—it actually might go up because we're generating code faster than we can mentally model it.
SRE becomes the most critical layer because it's the only discipline focused on 'does this actually run reliably?' rather than 'did we ship the feature?'. We're moving from a world of 'crafting logic' to 'managing logic flows'.
I dunno, I don't think in practice SRE or DevOps are even really different from the people we used to call sysadmins (former sysadmin myself). I think the future of mediocre companies is SRE chasing after LLM fires, but I think a competitive business would have a much better strategy for building systems. Humans are still by far the most efficient and generalized reasoners, and putting the energy-intensive, brittle AI model in charge of most implementation is setting yourself up to fail.
Former sysadmin and I've been an SRE for >15 years now.
They are very different. If your SREs are spending much of their time chasing fires, they are doing it wrong.
Unfortunately sometimes it's more of a title than a job description. Companies define the job and call it whatever they feel like.
By "SRE", are people actually talking about "QA"?
SREs usually don't know the first thing about whether particular logic within the product is working according to a particular set of business requirements. That's just not their role.
Good SREs at a senior level do. They are familiar with the product, and the customers and the business requirements.
Without that it's impossible to correctly prioritise your work.
Any SRE who does that is really filling a QA role. It's not part of the SRE job title, which is more about deployments/monitoring/availability/performance, than about specific functional requirements.
In a well-run org, the software engineers (along with QA if you have them) are responsible for validation of requirements.
Most companies don't have QA anymore, just their CI/CD's automated tests.
I see it less as SRE and more about defensive backend architecture. When you are dealing with non-deterministic outputs, you can't just monitor for uptime, you have to architect for containment. I've been relying heavily on LangGraph and Celery to manage state, basically treating the LLM as a fuzzy component that needs a rigid wrapper. It feels like we are building state machines where the transitions are probabilistic, so the infrastructure (Redis, queues) has to be much more robust than the code generating the content.
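A minimal sketch of that containment idea (all names and the schema here are invented for illustration, not taken from the comment): the LLM is called through a rigid wrapper that validates its output against a fixed schema, retries on bad output, and degrades gracefully instead of crashing.

```python
import json

class ValidationError(Exception):
    """Raised when the LLM's output doesn't fit the rigid schema."""
    pass

def parse_order(raw: str) -> dict:
    """Validate the fuzzy output against a fixed schema; reject anything else."""
    data = json.loads(raw)
    if (not isinstance(data, dict)
            or not isinstance(data.get("sku"), str)
            or not isinstance(data.get("qty"), int)):
        raise ValidationError("output does not match schema")
    if data["qty"] <= 0:
        raise ValidationError("qty must be positive")
    return {"sku": data["sku"], "qty": data["qty"]}

def call_with_containment(llm, prompt: str, retries: int = 3) -> dict:
    """Treat the LLM as a fuzzy component: validate every attempt, retry on
    bad output, and return a contained fallback instead of propagating junk."""
    for _ in range(retries):
        try:
            return parse_order(llm(prompt))
        except (ValidationError, json.JSONDecodeError):
            continue  # probabilistic transition failed; try again
    return {"sku": None, "qty": 0, "degraded": True}  # contained failure, not a crash
```

In a real setup the retry loop would live in a Celery task and the fallback would push the workflow into an explicit error state rather than returning a sentinel dict.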
> But the cost of complexity doesn't go down
But how much of current day software complexity is inherent in the problem space vs just bad design and too many (human) chefs in the kitchen? I'm guessing most of it is the latter category.
We might get more software but with less complexity overall, assuming LLMs become good enough.
I agree that there's a lot of complexity today due to the process in which we write code (people, lack of understanding the problem space, etc.) vs the problem itself.
Would we say we as humans have captured the "best" way to reduce complexity and write great code? Maybe there are patterns and guidelines, but no hard and fast rules. Until we have a better understanding around that, LLMs may not arrive at those levels either. Most of that knowledge is gleaned by sticking with a system -- dealing with past choices and making changes and tweaks to the code, complexity, and solution over time. Maybe the right "memory" or compaction could help LLMs get better over time, but we're just scratching the surface there today.
LLMs output code only as good as their training data. They can reason about the parts of code they're prompted with and offer ideas, but they're inherently based on the data and concepts they've trained on. And unfortunately, it's likely much more average code than highly respected code that floods the training data, at least for now.
Ideally I'd love to see better code written and complexity driven down by _whatever_ writes the code. But there will always be verification required when using a writer that is probabilistic.
That probably requires superhuman AI, though.
>> As LLMs drive the cost of writing code toward zero
And they drive the cost of validating the correctness of such code towards infinity...
This sounds like the most min maxed drivel. What if I took every concept and dialed it to either zero or 11 and then picked a random conclusion!!!??
I think there's two kinds of software-producing-organizations:
There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.
Then you have large shops: the ones where you're running at the scale where the cost premium of container platforms is higher than the salary of an engineer to move you off it, the ones where you have to figure out how to get the systems from different companies pre-M&A to talk to each other, where you have N development teams organizationally far away from the sales and legal teams signing SLAs yet need to be constrained by said SLAs, where you have some system that was architected to handle X scale and the business has now sold 100X and you have to figure out what band-aids to throw at the failing system while telling the devs they need to re-architect, where you need to build your Alertmanager routing tree configuration dynamically because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus ensuring that devs have the ability to self-service create new services, plus progressive rollout of new alerts across the organization, etc., so even Alertmanager config needs to be owned by an engineer.
I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.
> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.
According to the stated goals of SRE, this is actually not just a small fraction - it's something that shouldn't happen at all. To be clear, I'm fully aware that this will always be necessary sometimes - but whenever it happens, it's because the site reliability engineer (SRE) overlooked something.
Hence if that's considered a large part of the job.. then you're just not an SRE as Google defined that role
https://sre.google/sre-book/table-of-contents/
Very little connection to the blog post we're commenting on though - at least as far as I can tell.
At least I didn't find any focus on debugging. It puts forward that the capability to produce reliable software is what will distinguish companies in the future, and I think this holds up and is in line with the official definition of SRE
I don't think people really adhere to Google's definition; most companies don't even have nearly similar scale. Most SREs I've seen are running from one PagerDuty alert to the next and not really doing much of a deep dive into understanding the problem.
This makes sense - as an analogy, the flight crash investigator is presumably a very different role from the engineer designing flight safety systems.
I think you've identified analogous functions, but I don't think your analogy holds as you've written it. A more faithful analogy to OP is that there is no better flight crash investigator than the aviation engineer designing the plane, but flight crash investigation is an actual failure of his primary duty of engineering safe planes.
Still not a great rendition of this thought, but closer.
those alertmanager descriptions feel scary. I'm stuck in the zabbix era.
what do you mean "progressive rollout of new alerts across the organization"? what kind of alerts?
Well, all kinds. Alerting is a really great way to track things that need to change, tell people about that thing along established channels, and also tell them when it's been addressed satisfactorily. Alertmanager will already be configured with credentials and network access to PagerDuty, Slack, Jira, email, etc., and you can use something like Karma to give people interfaces to the different Alertmanagers and manage silences.
If you're deploying alerts, then yeah you want a progressive rollout just like anything else, or you run the risk of alert fatigue from false positives, which is Really Bad because it undermines faith in the alerting system.
For example, say you want to start to track, per team, how many code quality issues they have, and set thresholds above which they will get alerted. The alert will make a Jira ticket - getting code quality under control can be afforded to be scheduled into a sprint. You probably need different alert thresholds for different teams, and you want to test the waters before you start having Alertmanager make real Jira issues. So, yeah, progressive rollout.
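A minimal sketch of what that could look like if the rules are generated programmatically rather than hand-edited (team names, the metric, and the thresholds are all invented for illustration): each team gets its own threshold, and alerts in the dry-run stage route to a log-only receiver instead of creating real Jira issues.

```python
# Hypothetical per-team alert config; "enforce" makes real Jira tickets,
# "dry_run" routes to a log-only receiver while the alert is being tuned.
TEAMS = {
    "payments": {"code_quality_issues_max": 50, "stage": "enforce"},
    "search":   {"code_quality_issues_max": 200, "stage": "dry_run"},
}

def alert_rule(team: str, cfg: dict) -> dict:
    """Build one Prometheus-style alerting rule as a plain dict,
    ready to be serialized into the rule file."""
    receiver = "jira" if cfg["stage"] == "enforce" else "log-only"
    return {
        "alert": f"CodeQualityBudgetExceeded-{team}",
        "expr": f'code_quality_issues{{team="{team}"}} > {cfg["code_quality_issues_max"]}',
        "labels": {"team": team, "receiver": receiver},
        "annotations": {"summary": f"{team} exceeded its code-quality budget"},
    }

rules = [alert_rule(t, c) for t, c in TEAMS.items()]
```

Rolling the alert out then just means flipping a team's `stage` and redeploying, rather than editing a routing tree by hand.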
Having worked on Cloud Run/Cloud Functions, I think almost every company that isn't itself a cloud provider could be in category 1, with moderately more featureful implementations that actually competed with K8s.
Kubernetes is a huge problem, it's IMO a shitty prototype that industry ran away with (because Google tried to throw a wrench at Docker/AWS when Containers and Cloud were the hot new things, pretending Kubernetes is basically the same as Borg), then the community calcified around the prototype state and bought all this SAAS/structured their production environments around it, and now all these SAAS providers and Platform Engineers/Devops people who make a living off of milking money out of Kubernetes users are guarding their gold mines.
Part of the K8s marketing push was rebranding Infrastructure Engineering = building atop Kubernetes (vs operating at the layers at and beneath it), and K8s leaks abstractions/exposes an enormous configuration surface area, so you just get K8s But More Configuration/Leaks. Also, You Need A Platform, so do Platform Engineering too, for your totally unique use case of connecting git to CI to slackbot/email/2FA to our release scripts.
At my new company we're working on fixing this but it'll probably be 1-2 more years until we can open source it (mostly because it's not generalized enough yet and I don't want to make the same mistake as Kubernetes. But we will open source it). The problem is mostly multitenancy, better primitives, modeling the whole user story in the platform itself, and getting rid of false dichotomies/bad abstractions regarding scaling and state (including the entire control plane). Also, more official tooling, and you have to put on a dunce cap if YAML gets within 2 network hops of any zone.
In your example, I think
1. you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better
2. YAML is indeed garbage, but availability reporting and alerting need better official support; it doesn't make sense for every ecommerce shop and bank to be building this stuff
3. a huge amount of alerts and configs could actually be expressed in business logic if cloud platforms exposed synchronous/real-time billing with the scaling speed of Cloud Run.
If you think about it, so so so many problems devops teams deal with are literally just
1. We need to be able to handle scaling events
2. We need to control costs
3. Sometimes these conflict and we struggle to translate between the two.
4. Nobody lets me set hard billing limits/enforcement at the platform level.
(I implemented enforcement for something close to this for Run/Appengine/Functions, it truly is a very difficult problem, but I do think it's possible. Real time usage->billing->balance debits was one of the first things we implemented on our platform).
5. For some reason scaling and provisioning are different things (partly because the cloud provider is slow, partly because Kubernetes is single-tenant)
6. Our ops team's job is to translate between business logic and resource logic, and half our alerts are basically asking a human to manually make some cost/scaling analysis or tradeoff, because we can't automate that, because the underlying resource model/platform makes it impossible.
You gotta go under the hood to fix this stuff.
Since you are developing in this domain: our challenge with both Lambdas and Cloud Run-type managed solutions is that they seem incompatible with our service mesh. Cloud Run and Lambdas can only be incorporated into the GCP service mesh if the mesh is managed through GCP as well; anything custom is out of the question. Since we require end-to-end mTLS in our setup, we cannot use Cloud Run.
To me this shows that Cloud Run is more of an end product than a building block, and it hinders adoption: basically we'd need to replicate most of Cloud Run ourselves just to add that tiny bit of also running our sidecar.
How do you see this going in your new solution?
> Cloud run and lambdas can not be incorporated with gcp service mesh, but only if it is managed through gcp as well
I'm not exactly sure what this means; a few different interpretations make sense to me. If this is purely a Run <-> other-GCP-product-in-a-VPC problem, I'm not sure how much info about that is considered proprietary and shareable, or whether my understanding of it is even accurate anymore. If it's that Cloud Run can't run in your service mesh, then it's simply that these are both managed services. But yes, I do think it's possible to run into a situation/configuration that is impossible to express in Run but doesn't seem like it should be inexpressible.
This is why designing around multitenancy is important. I think with hierarchical namespacing and a transparent resource model you could offer better escape hatches for integrating managed services/products that don't know how to talk to each other. Even though your project may be a single "tenant", because these managed services are probably implemented in different ways under the hood and have opaque resource models (ie run doesn't fully expose all underlying primitives), they end up basically being multitenant relative to each other.
That being said, I don't see why you couldn't use mTLS to talk to Cloud Run instances, you just might have to implement it differently from how you're doing it elsewhere? This almost just sounds like a shortcoming of your service mesh implementation that it doesn't bundle something exposing run-like semantics by default (which is basically what we're doing), because why would it know how to talk to a proprietary third party managed service?
Lots to unpack here.
I will just say, based on recent experience: the fix is not "Kubernetes bad", it's that Kubernetes is not a product platform; it's a substrate, and most orgs actually want a platform.
We recently ripped out a barebones Kubernetes product (like Rancher but not Rancher). It was hosting a lot of our software development apps like GitLab, Nexus, KeyCloak, etc
But in order to run those things, you have to build an entire platform and wire it all together. This is on premises running on vxRail.
We ended up discovering that our company had an internal software development platform based on EKS-A and it comes with auto installers with all the apps and includes ArgoCD to maintain state and orchestrate new deployments.
The previous team did a shitty job DIY-ing the prior platform. So we switched to something more maintainable.
If someone made a product like that then I am sure a lot of people would buy it.
> real-time usage -> billing
This is one of the things that excites me about TigerBeetle; the reason why so much billing by cloud providers is reported only on an hourly granularity at best is because the underlying systems are running batch jobs to calculate final billed sums. Having a billing database that is efficient enough to keep up with real-time is a game-changer and we've barely scratched the surface of what it makes possible.
Thanks for mentioning them, we're doing quite similar debit-credit stuff as https://docs.tigerbeetle.com/concepts/debit-credit/ but reading https://docs.tigerbeetle.com/concepts/performance/ they are definitely thinking about the problem differently from us. You need much more prescribed entities (eg resources and skus) on the modelling side and different choices on the performance side (for something like a usage pricing system) for a cloud platform.
This feels like a single-tenant, centralized ACH, but I think what you actually want for a multitenant, multizonal cloud platform is not ACH but something more capability-based. The problem is that cloud resources are billed as subscriptions/rates, and you can't centralize anything on the hot path (like this does), because any zone interacting with that node then loses availability for everything else when it's down. Also, the business logic and complexity for computing an actual final bill for a cloud customer's usage is considerable, because it's reliant on so many different kinds of things, including pricing models which can get very complex or bespoke, and it doesn't seem like TigerBeetle wants calculating prices to be part of their transactions (I think)
The way we're modelling this is with hierarchical sub-ledgers (eg per-zone, per-tenant, per-resourcegroup) and something which you could think of as a line of credit. In my opinion the pricing and resource modelling + integration with the billing tx are much more challenging because they need to be able to handle a lot of business logic. Anyway, if someone chooses to opt-in to invoice billing there's an escape hatch and way for us to handle things we can't express yet.
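A toy sketch of that hierarchical sub-ledger idea (tenant names, zones, and the flat credit limit are invented; real pricing/resource modelling is elided): usage debits land in per-(zone, tenant) sub-ledgers, and a per-tenant line of credit acts as the hard billing limit.

```python
from collections import defaultdict

class Ledger:
    """Toy hierarchical ledger: debits accumulate in (zone, tenant)
    sub-ledgers, and a per-tenant line of credit caps total exposure."""

    def __init__(self, credit_limits: dict):
        self.credit_limits = credit_limits   # tenant -> max amount owed
        self.sub = defaultdict(int)          # (zone, tenant) -> amount owed

    def balance(self, tenant: str) -> int:
        """Roll the tenant's sub-ledgers up into one owed balance."""
        return sum(v for (_, t), v in self.sub.items() if t == tenant)

    def debit(self, zone: str, tenant: str, amount: int) -> bool:
        """Record usage if it fits inside the line of credit; otherwise
        refuse, which is where a hard billing limit would kick in."""
        if self.balance(tenant) + amount > self.credit_limits.get(tenant, 0):
            return False
        self.sub[(zone, tenant)] += amount
        return True
```

The key property is that enforcement happens per debit, not in an hourly batch job, which is what makes real-time limits possible at all.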
Every time I’ve pushed for cloud run at jobs that were on or leaning towards k8s I was looked at as a very unserious person. Like you can’t be a “real” engineer if you’re not battling yaml configs and argoCD all day (and all night).
It does have real tradeoffs/flaws/limitations, chief among them, Run isn't allowed to "become" Kubernetes, you're expected to "graduate". There's been an immense marketing push for Kubernetes and Platform Engineering and all the associated SAAS sending the same message (also, notice how much less praise you hear about it now that the marketing has died down?).
The incentives are just really messed up all around. Think about all the actual people working in devops who have their careers/job tied to Kubernetes, and how many developers get drawn in by the allure and marketing because it lets them work on more fun problems than their actual job, and all the provisioned instances and vendor software and certs and conferences, and all the money that represents.
There are plenty of PaaS components that run on k8s if you want to use them. I'm not a fan, because I think giving developers direct access to k8s is the better pattern.
Managed k8s services like EKS have been super reliable the last few years.
YAML is fine, it's just configuration language.
> you shouldn't have to think about scaling and provisioning at this level of granularity, it should always be at the multitenant zonal level, this is one of the cardinal sins Kubernetes made that Borg handled much better
I'm not sure what you mean here. Manage k8s services, and even k8s clusters you deploy yourself, can autoscale across AZ's. This has been a feature for many years now. You just set a topology key on your pod template spec, your pods will spread across the AZ's, easy.
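For concreteness, an illustrative pod template fragment (the app label is made up) showing the topology key in question:

```yaml
# Spread replicas evenly across availability zones; maxSkew: 1 means
# no zone may run more than one pod above the least-loaded zone.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-service
```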
Most tasks you would want to do to deploy an application, there's an out of the box solution for k8s that already exists. There have been millions of labor-hours poured into k8s as a platform, unless you have some extremely niche use case, you are wasting your time building an alternative.
stackskipton makes a good point about authority. SRE works at Google because SREs can block launches and demand fixes. Without that organizational power, you're just an on-call engineer who also writes tooling.
The article's premise (AI makes code cheap, so operations becomes the differentiator) has some truth to it. But I'd frame it differently: the bottleneck was never really "writing code." It was understanding what to build and keeping it running. AI helps with one of those. Maybe.
> because SREs can block launches and demand fixes
I didn't find that particularly true during my tenure, but obviously Google is huge, so there probably exist teams that actually can afford to behave this way...
If the agent swarm is collectively smarter and better than the SRE, they'll be replaced just like other types of workers. There is no domain that has special protection.
The models are not smarter than us, by far. Have you not run into issues with reasoning and comprehension with them? They get confused, they miss big details, they build complicated code that's ineffective. They don't work well at tasks that require a larger holistic understanding of the problem. The models are weak, brittle reasoners, because they have an indirect and contradictory understanding of the world. We're several breakthroughs and several hardware generations away from having models that are robust reasoners for grounded, non-kind problems.
My thoughts exactly. This is just some guy grasping at straws before he understands that he will have to bow to our new overlords sooner or later.
Edit: Or maybe he is fully aware and just needs to push some books before it's too late.
Or, most charitably, maybe they're not sure and trying to Cunningham's Law their way through the conundrum.
What about C-suite executives & shareholders? Are they safe from automation?
A uniquely important thing that a CEO brings to the table is accountability. You can't automate accounta- ...sorry, I can't continue this with a straight face :DDD
You can only replace someone who was useful. If one is useless, but is still there, it means they are not there for their contribution and you can't replace them by automating whatever it might have been.
The thing about C-suite executives is they usually have short tenures; however, the management levels below them are often cozy in their bureaucracy, resisting change and trying to outlast the new management.
I actually argue that AI will therefore impact these levels of management the most.
Think about it, if you were employed as a transformational CEO would you risk trying to fight existing managers or just replace them with AI?
>I actually argue that AI will therefore impact these levels of management the most.
Not AI but bad economy and mass layoffs tend to wipe out management positions the most. As a decent IC, in case of layoffs in bad economy, you'll always find some place to work at if you're flexible with location and salary because everyone still needs people who know how to actually build shit, but nobody needs to add more managers in their ranks to consume payroll and add no value.
A lot of large companies regularly lay off swathes of technical staff (or watch them leave) and rotate CEOs, but their middle management have jobs for life - as the Peter Principle states, they are promoted to their respective level of incompetence and stay there, because no CEO has time to replace them.
AI will transform this.
Disagree with the "jobs for life" part for management. Only managers who are there thanks to connections, nepotism, or cronyism are there for life, and only as long as those shielding them also stay in place. Those who got into or were promoted to management meritocratically don't have that protection and are the first to be let go.
At all the large MNCs I worked at, management got hired and fired mostly on their connections (or lack thereof) and less on what they actually did. Once they got let go, they had a near-impossible time finding another management position elsewhere without connections in other places.
This is so true. Especially middle managers - they are the ones hit the hardest.
Yes I was talking about middle managers mostly. Upper management, C-suite, execs are mostly protected from firing unless they F-up big time like sexual assault, hate speech, etc.
Generally yes. The more power one holds in an organization the more safe they are from automation.
You can probably automate the full economy. Both production and consumption
Yes. The AI cannot be the child/other type of beneficiary of a well-connected person, yet.
Ultimately, no. But when we get to this point - once we have AI deciding on its own what needs to be done in the world in general - then the bottom falls out, and we'll all be watching a new global economy, in which humans won't partake anymore. At best, we'll become pets to our new AI overlords; more likely, resources to exploit.
Automating away shareholders can't come soon enough.
They make the decisions, so I doubt they will allow themselves to be automated away. Their main risk will be that nobody can buy their products once everything is automated.
I wonder if capitalism and democracy will be just a short chapter in history that will be replaced by something else. Autocratic governments seem to be the most prevalent form of government in history.
There absolutely is. Sports.
Couldn't disagree with this article more. I think the future of software engineering is more T-shaped.
Look at the 'Product Engineer' roles we are seeing spreading in forward-thinking startups and scaleups.
That's the future of SWE I think. SWEs take on more PM and design responsibilities as part of the existing role.
I don't think the two are mutually exclusive! e.g. a T-shaped product engineer on one side and a T-shaped SRE on the other. Both will kind of compact what used to be multiple roles/responsibilities together. The good news (and my prediction) IMO is the engineering won't be going away as much as the other roles.
I agree. In many cases it's probably easier for a developer to become more of a product person, than for a product person to become a dev. Even with LLM's you still need to have some technical skills & be able to read code to handle technical tasks effectively.
Of course things might look different when the product is something that requires really deep domain knowledge.
Or architects, someone has to draw the nice diagrams and spec files for the robots.
However, like in automated factories, only a small percentage is required to stay around.
I was an old school SRE before the days of containerization and such. Today, we have one who is a YAML wizard and I won't even pretend to begin to understand the entire architecture between all the moving pieces(kube, flux, helm, etc).
That said, Claude has absolutely no problem not only answering questions, but finding bugs and adding new features to it.
In short, I feel they're just as screwed as us devs.
For those who were oblivious to what SRE means, just like me: SRE is _site reliability engineering_
Servers, Ready to Eat
I knew what an SRE was and found the article somewhat interesting with a slightly novel (throwaway), more realistic take, on the "why need Salesforce when you can vibe your own Salesforce convo."
But not defining what an SRE is feels like a glaring, almost suffocating, omission.
Seemingly Random Engineering
Sysadmin Really Expensive
Sales Recovery Engineering
Stuckup Retro Engineer
Super Ready Engineer
As an SRE I can tell you AI can't do everything. I have done a little software development, even AI can't do everything. What we are likely to see is operational engineering become the consolidated role between the two. Knows enough about software development and knows enough about site reliability... blamo operational engineer.
"As an SRE I can tell you AI can't do everything."
That's what they used to say about software engineering and yet this is becoming less and less obvious as capabilities increase.
There are no hiding places for any of us.
Not the person you are replying to but, even if the technical skills of AI increase (and stuff like Codex and Claude Code is indeed insanely good), you still need someone to make risky decisions that could take down prod.
Not sure management is eager to give software owned by other companies (inference providers) the permission to delete prod DBs.
Also, these roles usually involve talking to other teams and stakeholders more often than a traditional SWE role does.
Though
> There are no hiding places for any of us.
I agree with this statement. While the timeline is unclear (LLM use is heavily subsidized), I think this will translate into less demand for engineers, overall.
I think it is important to note that AI needs to be maintained. You can't reasonably expect it to have a 99.9% reliability rate. As long as this remains true, work will exist for the foreseeable future.
Indeed, however the amount of "someone" is going to be way less.
It's still perfectly obvious as AI can't remotely write software if you want it to actually, you know, work.
Paraphrase: "As an SRE I can tell you that the undetermined and unknowable potential of AI definitely won't involve my job being replaced."
Actually it is more that my role will transform and I have no say in it.
Yet AI is not there. Even the top models struggle at the simplest SRE tasks.
We just created a benchmark on adding distributed logs (OpenTelemetry instrumentation) to small services, around 300 lines of code.
Claude Opus 4.5 succeeded at 29%, GPT 5.2 at 26%, Gemini 3 Pro at 16%.
Agreed. I believe this is going to be the trend.
I don't think LLM context will be able to digest large codebases, and their algorithms are not going to reason like SREs in the coming years. And given the current hype and market, investors are going to pull out, with recessions all over the world, and we will see another AI winter.
Code has become a commodity. Corporate engineering hierarchy will be much flatter in coming years, both horizontally and vertically - one staff engineer will command two senior engineers with two juniors each, orchestrating N agents each.
I think that’s it - this is the end of bootcamp devs. This will act as a great filter and probably decrease the mass influx of bootcamp devs.
Bootcamp devs were always going to be doomed in the job market. They were a symptom of not having enough true classically trained computer science degree holding engineers to hire, so you compromised by looking for anyone that knew how to code well enough. But this problem eventually corrects.
Now, there are way too many computer science grads in a time when code is easy and cheap. Not much to gain from hiring a bootcamp dev over the real deal.
But I would say if you truly enjoy coding and you didn’t get to study CS in a university, a bootcamp is probably a fun experience to go through just for your own enjoyment, not for job seeking purposes. Just don’t pay too much.
“People don’t buy software, they hire a service” is a bullshit straw man.
That OS on your laptop? Software. The terminal your SSH runs in? Software. The browser you’re reading this take in? Software. The editor you wrote your last 10k LOC in? Software.
The only “service” I buy is email — and even that I run myself. It’s still just software, plus ops.
Yes, running things is hard. Nobody serious disputes that. But pretending this is some new revelation is ahistorical. We used to call this systems engineering, operations, reliability, or just doing your job before SRE needed a brand deck.
And let’s be clear about the direction of value:
Software without SRE still has value. SRE without software has none.
A binary I can run, copy, fork, and understand beats a perfectly monitored nothing. A CLI tool with zero uptime guarantees still solves problems. A library still ships value. A game still runs. A compiler still compiles.
Ops exists to serve software, not replace it. Reliability amplifies value — it does not create it.
If “writing code is easy,” why is the world drowning in unreliable, unmaintainable, over-engineered trash with immaculate dashboards and flawless incident postmortems?
People buy software. They appreciate service when the software becomes infrastructure. Confusing the two is how you end up worshipping uptime graphs while shipping nothing worth running.
Yeah, I think that when writing code becomes cheap, then all the COMPLEMENTS become more valuable:
- testing
- reviewing, and reading/understanding/explaining
- operations / SRE

But what if those complementary skills also become cheap?
As someone who works in an Ops role (SRE/DevOps/Sysadmin): SRE is something that only really works at Google, mainly because for devs to do SRE they need the authority to reject code or demand fixes. That means your prompt engineer needs to actually understand the code, and at that point they're back to being a developer.
As for the more dedicated Ops side, it's garbage in, garbage out. I've already seen too many outages caused by AI slop being fed into production. Calling all developers "SREs" won't change the fact that AI can't program today without experienced people heavily controlling it.
Most devs can't do SRE, in fact the best devs I've met know they can't do SRE (and vice versa). If I may get a bit philosophical, SRE must be conservative by nature and I feel that devs are often innovative by nature. Another argument is that they simply focus on different problems. One sets up an IDE and clicks play, has some ephemeral devcontainer environment that "just works", and the hard part is to craft the software. The other has the software ready and sometimes very few instructions on how to run it, + your typical production issues, security, scaling, etc. The brain of each gets wired differently over time to solve those very different issues effectively.
I don’t understand this take - if all engineers go on call, they learn real quick what happens when their coworkers are too innovative. It is a good feedback loop that teaches them not to make unreliable software.
SREs are great when the problem is “the network is down” or “kubernetes won’t run my pods”, but expecting a random engineer to know all the failure modes of software they didn’t build and don’t have context on never seems to work out well.
It's possible to do both, you just need to be cognizant of what you're doing in both positions.
The tricky part comes when you don't have both roles for something, like SRE-developed tools that are maintained by the ones writing them, and you need to strike the balance yourself until/unless you wind up with that split. If you're not aware of both hats and juggling them intentionally, you can wind up with tools out of SRE that are worse than any SWE-only tool would ever be, because the SREs sometimes think they won't make the same mistakes, but all the same feature-focused pressures apply to SRE-written tools too.
I manage a team of developers in a low code environment without AI. The junior developer positions require 8 years of experience, which I think is absurd. Everybody has to program on their own, though pair programming for knowledge transfer is super frequent, but the primary skills of concern are operational excellence (including some project management tasks), transmission, and reliability.
From a people perspective that means excellence when working with outside teams and gathering requirements on your own. It also means always knowing the status of your work in all environments, even in production after deployment. If your soft skills are strong and you can independently program work streams that touch multiple external parties you are golden. It seems this is the future.
I'm sorry, nothing personal...but any place that requires 8 years of experience but only gives a title of "junior" is pretty dang close to a sweat shop.
On a different note, I do see what you mention about some op-excellence skills (e.g. project management, requirements gathering, etc.) being areas of concern at my $dayjob. But I've always seen those as skills that are valuable in any era, not only in this AI era... though everyone's mileage and environment can certainly vary that expectation. Also, at my $dayjob the business lacks so much funding to pay software vendors fairly and properly that we get what we pay for... so it's often low-quality output. It's not low *code*, because we employ and contract regular, full-code devs... but it certainly often is poor quality. And I wonder, as low-code offerings and opportunities, paired with more solid AI development assistance, continue to emerge, whether something like an SRE role becomes that much more important, regardless of whether one works in the low-code or low-cost arena.
I think you are too hung up on titles. This is the least sweatshop job I have ever had in my 20-year career. Vanity titles are how they get you.
I acknowledge being hung up on titles. I used to give titles too much attention as a very new person on my very first job... then over the decades, I learned not to get hung up on them... but I guess my current $dayjob has sooooo many flaws (organizationally, they're very amateurish) that it seems I got re-sensitized to titles. Here, titles seem to give a person everything from significantly better pay, to better authority, to training offerings, etc., etc... even beyond the point of being rational and sound. It's almost something silly like in the Office Space movie. So yeah, I guess the dire state of my current employer has made me a bit more negative than I used to be, now focusing on crap like titles. ;-) I guess they got me (for now)!
If your place is indeed the least sweatshop job, then congrats and enjoy the good parts! :-)
In other words, the apps will be trash, and an operations team that doesn't have the time, capability, or mandate to fix them will be constantly scrambling to keep the fires out?
Sounds... reliable.
Same as it ever was.
Totally agree. Vibe coding will generate lots of internal AI apps, but turning them into reliable, secure, governed services still requires real engineering, which is exactly why we’re building https://manifest.build. It lets non-technical teams build Agentic apps fast through an AI powered workflow builder while giving engineering and IT a single platform to add governance, security, data access, and keep everything production-ready at scale.
If the future of software engineering is SRE, because GenAI is taking care of coding, a similar trend is coming for SRE-type work.
It's called AI SRE, and for now, it's mostly targeted at helping on-call engineers investigate and solve incidents. But of course, these agents can also be used proactively to improve reliability.
> And you definitely don't care how a payments network point of sale terminal and your bank talk to each other... Good software is invisible.
> ...
> Are you keeping up with security updates? Will you leak all my data? Do I trust you? Can I rely on you?
IMO, if the answers to those questions matter to you, then you damn well should care how it works. Because even if you aren't sufficiently technically minded to audit the system, having someone be able to describe it to you coherently is an important starting point in building that trust and having reason to believe that security and privacy will work as advertised.
Euh, our job is hard enough as it is, don't start leaning on us to clean up the AI mess too.
Again, there's a cognitive dissonance in play here, where the future of coding is somehow LLMs, but at the same time the LLMs would somehow not evolve to handle the operations as well, even if we disregard pipe dreams about AGI being just around the corner. Especially when markdown files for AI are essentially glorified runbooks.
The only thing lacking in this article was an explanation of the abbreviation in the title: SRE = Site Reliability Engineer(ing).
True, but you also need to know the basics of what constitutes good code and how it should scale, versus just working code. Too many people are relying on LLMs to produce stuff which just about works but gives users a terrible experience, because it barely works.
I have a lot of work:
Make the agents work at warp speed.
Prepare specs for the next iteration.
Hopefully exhaust resources... for free time. <rest as much as possible>
Every 5 hours, 24/7. Rinse, repeat.
Real SRE? Or low-skilled sysadmins drowning in pages, calling themselves SREs? Because the future is bleak if it's the latter.
Who wants to be on-call for someone else's buggy vibe-coded app? Sign me right up for that...
Operational excellency was always part of the job, regardless of what fancy term described it, be it DevOps, SRE or something else. The future of software engineering is software engineering, with emphasis on engineering.
And the other part of the future is that we are all going to become "editors" (in the publishing sense) instead of "writers"
IMO SRE works mostly because they exist outside the product engineering organization. They want to help you succeed but if you want to YOLO your launch and move fast and break things they have the option to hand back the pager and find other work. That option is rarely used but the option alone seems to create better than usual incentives.
With vibe coding, I imagine the LLM will get an MCP server that lets it schedule jobs on Kubernetes or whatever IaaS, and a fleet of agents will do the basic troubleshooting and whack-a-mole activities, leaving only the hard problems for human SREs. Before and after AI, the corporate incentive will always be to ship slop unless there is a counterbalancing force keeping the shipping team accountable to higher standards.
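To make the whack-a-mole part concrete, here's a minimal sketch of the kind of triage policy such an agent fleet might apply (all names and failure reasons are hypothetical illustrations, not a real MCP SDK or Kubernetes client): auto-restart the known-boring failures, escalate anything repeated or unfamiliar to the human on call.

```python
from dataclasses import dataclass

# Failure modes the agents are allowed to handle on their own;
# everything else escalates to a human SRE.
AUTO_REMEDIABLE = {"CrashLoopBackOff", "OOMKilled", "ImagePullBackOff"}

@dataclass
class PodStatus:
    name: str
    reason: str              # e.g. "CrashLoopBackOff"
    restarts_last_hour: int

def triage(pods: list[PodStatus], max_auto_restarts: int = 3) -> dict[str, list[str]]:
    """Split failing pods into whack-a-mole restarts vs. human escalation."""
    plan: dict[str, list[str]] = {"restart": [], "escalate": []}
    for pod in pods:
        if pod.reason in AUTO_REMEDIABLE and pod.restarts_last_hour < max_auto_restarts:
            plan["restart"].append(pod.name)
        else:
            # Flapping or unknown failures are the "hard problems":
            # hand them to the on-call human with full context.
            plan["escalate"].append(pod.name)
    return plan

plan = triage([
    PodStatus("api-7f9c", "CrashLoopBackOff", 1),
    PodStatus("api-d41e", "CrashLoopBackOff", 7),     # flapping: a human should look
    PodStatus("worker-a1", "ReadinessProbeFailed", 0),
])
print(plan)  # {'restart': ['api-7f9c'], 'escalate': ['api-d41e', 'worker-a1']}
```

The interesting design question is the escalation threshold, since that line decides what the agents are trusted to touch and what still pages a person.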
Except for the small detail that, as shown by all the people who lost their jobs to factory robots, the number of required SREs is relatively small in proportion to the existing demographics of SWEs.
Also this doesn't cover most of the jobs, which are actually in consulting, and not product development.
This may be true about SaaS. Not all software is SaaS, thankfully.
Surely SRE is just a .md file like everything else? :upside-down-face:
What’s an “SRE”?
Site Reliability Engineering. It is the role that, among other things, ensures that a service's uptime is optimal. It's the closest thing we have nowadays to the sysadmin role.
Thank you!
Seems like that would only be relevant to web development, not software engineering in general.
True, but since the vast majority of software engineering is web engineering and the title is clearly about web, it seems fit to mention that.
IMO, that isn't true, nor is the vast majority of software engineering related to the web.
Every industry has been undergoing digital transformation for decades. There are SREs ensuring service levels for everything, from your electrical meter, to satellite navigation systems. Someone wrote the code that boots your phone and starts your car. Somebody's wireless code is passing through your body as you read this, while an SRE ensures the packet loss isn't too high.
Your point doesn't really change what I said. There are many languages in the world but English is the most common one. Those two facts are true at the same time. This is the same, there are many types of software engineering out there but the most common software engineering job relates to building web applications. If you don't believe me, hit your regular job board and count.
This says nothing about why, if AI can write software, it cannot do these other things too.
CRE - Code Reliability Engineering
AI will not get much better than what we have today, and what we have today is not enough to totally transform software engineering. It is a little easier to be a software engineer now, but that’s it. You can still fuck everything up.
> AI will not get much better than what we have today
Wow, where did this come from?
From what just comes to my mind based on recent research, I'd expect at least the following this or next year:
* Continuous learning via an architectural change like Titans or TTT-E2E.
* Advancement in World Models (many labs focusing on them now)
* Longer-running agentic systems, with Gas Town being a recent proof of concept.
* Advances in computer and browser usage - tons of money being poured into this, and RL with self-play is straightforward
* AI integration into robotics, especially when coupled with world models
What does robotics have to do with writing better code? Is this just a random AI wishlist?
All the new “advances” in AI (LLMs) will mostly be from better context engineering. The core feature of an intelligent response for a given prompt will not improve much.
The stuff you mention is unproven in usefulness or is so far away that most software engineers have enough time to wrap up their careers and retire gainfully.
AI has already been integrated with robotics. We have entire factories running entirely with robots in the dark. For mass consumer markets, a floor vacuuming and mopping robot that can also climb stairs is probably peak robotics. They already build world models that map out your entire home and reason about materials and cleanliness.
There’s not much more juice left to squeeze here. The next frontier is genetic programming (biological).
> All he wanted was to make his job easier and now he's shackled to this stupid system.
What people failed to grasp about low-code/no-code tools (and what I believe the author ultimately says) is that it was never about technical ability. It was about time.
The people who were "supposed" to be the targets of these tools didn't have the time to begin with, let alone the technical experience to round out the rough edges. It's a chore maintaining these types of things.
These tools don't change that equation. I truly believe that we'll see a new golden age of targeted, bespoke software that can now be developed more cheaply, instead of small/medium businesses utilizing off-the-shelf, one-size-fits-all solutions.
What? Maybe OP's future. SWE is just going to replace QA and maybe architects if the industry adopts AI more, but there are a lot of holdouts. There are plenty of projects out there that are 'boring' and will not bother.
There were several cheaper-than-programmers options to automate things, Robotic Process Automation probably being the best known, but it never got the expected traction.
Why (imo)? Senior leaders still like to say: "I run a 500-headcount finance EMEA organization for Siemens", "I am the Chief People Officer of Meta and I lead an org of 1000 smart HR pros". Most of their status is still tied to the org headcount.
We have another person without any respect for the actual stack that powers his fantasies writing LLM propaganda.
Who probably has never written anything of value in his life and therefore approves the theft of other people's valuable work.
Until you find out there are 40 - 80 startups writing agents in the SRE space :/
It only matters if any of those can promise reliability and either put their own money where their mouth is or convince (and actually get them to pay up) a bigger player to insure them.
Ultimately hardware, software, QA, etc is all about delivering a system that produces certain outputs for certain inputs, with certain penalties if it doesn’t. If you can, great, if you can’t, good luck. Whether you achieve the “can” with human development or LLM is of little concern as long as you can pay out the penalties of “can’t”.
Basically that’s what people are doing with YOLO mode letting Claude do everything in the system.
Reliable ai agents would make you a trillionaire.
And I wish them luck, because the thought of current ai bots doing SRE work effectively is laughable.
Operational excellence will always be needed but part of that is writing good code. If the slop machine has made bad decisions it could be more efficient to rewrite using human expertise and deploy that.
But there is bad code and good code, and SREs can't tell you which is which, nor fix it.
My take (I'm an SRE) is that SRE should work pre-emptively to provide reproducible prod-like environments so that QA can test DEV code closer to real-life conditions. Most prod platforms I've seen are nowhere near that level of automation, which makes it really hard to detect or even reproduce production issues.
And no, as an SRE I won't read DEV code, but I can help my team test it.
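As a sketch of what "reproducible prod-like environments" can mean in practice (a hypothetical config fragment; the service names, images, and registry are made up for illustration), even a small compose file pinned to the exact versions running in prod lets QA reproduce production issues locally:

```yaml
# prod-like.compose.yml -- hypothetical stack; pin the SAME versions as prod
services:
  app:
    image: registry.example.com/shop/api:1.42.3   # exact prod tag, never :latest
    environment:
      DATABASE_URL: postgres://shop:shop@db:5432/shop
    depends_on: [db, cache]
    ports: ["8080:8080"]
  db:
    image: postgres:15.6        # match prod major *and* minor version
    environment:
      POSTGRES_USER: shop
      POSTGRES_PASSWORD: shop
  cache:
    image: redis:7.2.4
```

The point is less the tooling and more the discipline: if the versions, topology, and config drift from prod, the environment stops reproducing the bugs you actually care about.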
> And no, as an SRE I won't read DEV code, but I can help my team test it.
I mean to each their own. Sometimes if I catch a page and the rabbit hole leads to the devs code, I look under the covers.
And sometimes it's a bug I can identify and fix pretty quickly. Sometimes faster than the dev team because I just saw another dev team make the same mistake a month prior.
You gotta know when to cut your losses and stop searching the rabbit hole though, that's true.
I agree with your nuance, but that's not my default mode; unless I know the language and the domain well, I am not going to write an MR. I am going to read the stack trace to see if it's a config issue, though.
Why not? I'm a SWE SRE and I'm arguably better at telling good code from bad code than many of the pure devs I've worked with.
Edit: ^ At the cost of being much worse at being able to tell what features are useful or well implemented.
> Writing code was always the easy part of this job. The hard part was keeping your code running for a long time.
Spoken like a true SRE. I'm mostly writing code, rather than working on keeping it in production, but I've had websites up since 2006 (hope that counts as long time in this corner of the internet) with very little down time and frankly not much effort.
My experience with SREs was largely that they're glorified SSH: they tell me I'm the programmer and I should know what to type into their shell to debug the problem (despite them SREing those services for years, while I joined two months ago and haven't even seen the particular service). But no I can't have shell access, and yes I should be the one spelling out what needs to be typed in.