Context: I lead the DevOps team in a mid-size engineering org (approximately 60 engineers). Our product is B2B SaaS. The organization started taking cloud costs seriously in early 2024, and my team worked throughout the year on infrastructure changes to reduce them. This included projects like removing stale, manually created infrastructure, automating infra management, rightsizing, purchasing the right Savings Plans (and RIs), and many others. We're now at a point where infra-only cost optimization projects are pretty much exhausted. Meanwhile, most of the cost anomalies (and surprises) come from application-level changes. This causes serious problems: not only are these anomalies identified late in the deployment lifecycle, but they are also inherently harder to resolve quickly. One example: a service triggered a downstream workflow that started spawning additional background jobs (10x in production), which blew up our cost projections.

So, my question to the group is: who do you hold (or who should be held) accountable when cloud costs spike unexpectedly? The engineers who write the code, the platform team that manages the infrastructure, or the product managers who set the requirements? (My current solution is a mix of the platform team and engineers, but we're still trying to formalize the accountability model.)