Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
108 points by azhenley 7 months ago | 71 comments
  • btown7 months ago

    > In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

    Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have an interrupt fired periodically that listens exclusively on a port that will only be used for emergency instructions (but uses equal security measures to production, and is only visible internally). Ensure that it can allocate against a preallocated block of memory; allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's useful it's vital.
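
    A rough OS-level approximation (not what the incident write-up describes; the unit name, user, and numbers below are invented) is to keep a break-glass shell parked in a cgroup with guaranteed resources, e.g. via systemd:

      # sketch: an admin shell inside a cgroup with protected memory and high CPU weight (cgroup v2)
      systemd-run --unit=breakglass --slice=breakglass.slice \
          -p MemoryMin=512M -p CPUWeight=1000 \
          --uid=oncall -t /bin/bash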

    • dotancohen7 months ago |parent

      I have a 1 GiB file to rm in case of emergency on every filesystem I manage - both personal and professional. I've only had to delete it maybe three or four times that I remember, but it's been a system-saver each time. I've long considered writing a process that just consumes some CPU, memory, and bandwidth for the same kind of emergency buffer. I've even considered having it examine `last` every second and free up those resources when I SSH in.
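
      If anyone wants to copy the trick, creating the ballast is a one-liner (the path is whatever you like):

        # preallocate a 1 GiB ballast file to delete when the disk fills up
        fallocate -l 1G /var/ballast    # or: dd if=/dev/zero of=/var/ballast bs=1M count=1024
        # in an emergency:
        rm /var/ballast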

      • tux37 months ago |parent

        It is also common for filesystems to reserve a small percentage for the root user. I think the ext4 default is still 5% (which can be quite a bit more than 1 GB on modern drives!)
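
        You can check and change that reservation per filesystem (device name is just an example):

          # show the current reserved block count on an ext4 filesystem
          tune2fs -l /dev/sda1 | grep -i 'reserved block'
          # lower the reservation from the default 5% to 1%
          tune2fs -m 1 /dev/sda1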

        • dotancohen7 months ago |parent

          I haven't used root directly for over a decade. Modern usage is to log in as an unprivileged user, and to use sudo for all root operations.

          • tucnak7 months ago |parent

            Haha, I'm going to steal that!

          • skywhopper7 months ago |parent

            What’s your point here? Using sudo is using root.

            • remram7 months ago |parent

              If you can't log in as a regular user because resources are reserved for root... you can't sudo.

            • tucnak7 months ago |parent

              I think you missed the joke here

              • tux37 months ago |parent

                I don't think they were joking?

                I personally tend to leave a root shell open (and both my feet remain largely hole-free to this day), but avoiding that is pretty common advice, so you don't accidentally type something unfortunate like rm ./* into the wrong shell.

                • tucnak7 months ago |parent

                  Yeah, no. If you're going to be doing `rm /` you might as well do `sudo rm /` just as easily. It's the same security model, and honestly the distinction is quite funny.

                  • tux37 months ago |parent

                    No, there's a distinction if you look again. People don't accidentally type sudo in front of commands, but people do type things into the wrong window, or the wrong tab.

                    • tucnak7 months ago |parent

                      > People don't accidentally type sudo in front of commands

                      And you're basing this assumption off...?

                      • tux36 months ago |parent

                        Why, I am acutely tuned to the hivemind, of course.

              • emptiestplace7 months ago |parent

                There is no joke here.

              • dotancohen7 months ago |parent

                I missed the joke too. Care to share?

                • tucnak7 months ago |parent

                  sudo is equivalent to root

                  • thephyber7 months ago |parent

                    But it’s not. It’s a subset of what root can do.

                    The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

                    Your top comment’s parent didn’t say the SSH login user had all sudo permissions. For best security, there should be many users, each with different, limited permissions. Navigating the multiple `sudo su` hops is frustrating, but it has a purpose.

                    • LinuxBender6 months ago |parent

                      > The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

                      In all of my career I have seen that at only one company. Everyone else just leaves it unrestricted. I would be impressed to see sudo used the way it was intended in more places. Some places even use passwordless sudo and SSH multiplexing, which together with simple phishing give unfettered and unlogged access to production.
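
                      For anyone who hasn't seen it used as intended, a locked-down entry looks roughly like this (user and commands are made up):

                        # /etc/sudoers.d/dbadmin  (edit with: visudo -f /etc/sudoers.d/dbadmin)
                        # dbadmin may only restart postgresql and read its logs
                        dbadmin ALL=(root) /usr/bin/systemctl restart postgresql, /usr/bin/journalctl -u postgresql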

        • timschmidt7 months ago |parent

          Yup. 'tune2fs -m0' has saved my bacon more than once.

      • hank8087 months ago |parent

        We used to refer to this as a "cork file" back in the day.

    • tcdent7 months ago |parent

      Resource exhaustion can be super frustrating, and always feels like a third world slum situation when it happens.

      Like, why would operating systems allow themselves to run out of headroom entirely in this day and age?

      • TeMPOraL7 months ago |parent

        > a third world slum situation when it happens

        > why would operating systems allow themselves to run out of headroom entirely in this day and age?

        Following the analogy, perhaps for the same reason the richest cities in the richest, most developed Western nations, are rapidly beginning to look very much like third-world slums?

        Over-optimization is the name of the game. All systems need some amount of slack in them to stay flexible, robust (and livable). Unfortunately, cutting into the slack is always profitable on the margin, so without top-down intervention, the slack will be cut into until it's gone entirely. "Look, these machines are utilized only to 75% of capacity; adding this new system will only increase it by 5%"; "look, we can save XX$/month by cutting on compute, the machines will still be maxing out at 90%, so we'll still have a buffer". "Oh, this new telemetry service will bump that only by 1%".

        "Look, there's so much free space here in between these blocks of flats; adding another block won't hurt."

        And so on. Until your machines are running at 99% capacity and you risk a global outage every time someone sneezes near the server room. Until your city starts to look like London, and if you're from Central Europe like me, you may start to realize that being a few months or years behind on the most recent gadgets is a small price to pay in exchange for cities that are affordable and clean.

      • skywhopper7 months ago |parent

        They usually don’t? If they did, they would have just crashed. If they are still running, then they allowed themselves headroom. The real question is who gets to use that headroom and how? What happens when the administrative processes use “too many” resources?

        There are tons of tools built into modern OSes to manage prioritization of processes, etc. Whether it’s worth the tradeoff for you to deal with the operational overhead of harnessing those features is up to you.

      • throwaway2907 months ago |parent

        Because it makes the computer slower, and people pay for compute.

        I think consumer systems do it. macOS shows the "quit some apps, you're out of RAM" dialog, but the system itself keeps working.

        But if you are asking whether the OS knows this is a container process and this is a control-plane process and treats them differently, I think no one does that.

        Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because it nerfed your main process, thinking something more important was running.

        • yuliyp7 months ago |parent

          > the OS knows this is a container process and this is a control-plane process and treats them differently, I think no one does that

          cgroups on Linux do exactly this, and they are a standard part of ensuring that containers don't exceed their allocated resources.

          • dotancohen7 months ago |parent

            Most software running on a Linux server is not running in a container, which itself has (minimal) resource overhead.

            Unless you are talking about VMs in cloud computing, but even in those cases the VM is usually abstracted away and the end client only sees a VPS (e.g. with EC2 or GCE).

            • yuliyp7 months ago |parent

              You don't need containers to use cgroups. For instance, systemd uses cgroups to control resource allocation to various services.
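
              For example (unit names are invented; the memory/CPU knobs need cgroup v2):

                # cap a noisy agent and give a critical service a guaranteed floor
                systemctl set-property telemetry-agent.service MemoryMax=512M CPUQuota=50%
                systemctl set-property critical-api.service MemoryMin=2G CPUWeight=500
                # watch per-cgroup usage
                systemd-cgtop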

          • throwaway2907 months ago |parent

            OK, so with some configuration a server OS can do it too. Looks like ClosedAI doesn't pay its sysadmins enough, if they didn't configure such a standard thing.

            • yuliyp7 months ago |parent

              I was replying to your comment where you suggested that OS-level resource reservation wasn't a thing that happened in production systems. The situation here is that a single process was being overloaded by one type of load, preventing it from doing the other things that process was responsible for. Making that work properly would require load-shedding/prioritization within the process itself.

              • throwaway2907 months ago |parent

                > Kubernetes API servers saturated because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed

                I am not sure where it says this is about "one process"?

        • TeMPOraL7 months ago |parent

          > Imagine how frustrating would it be if your os does it wrongly and you end up paying 10x because the os nerfed your main process because it thought something more important was running

          Imagine how frustrating it is when the Linux on your desktop suddenly decides to start randomly killing processes, or worse, attempts to swap some memory out, causing a feedback loop of delays that completely freezes the system until you power-cycle the machine.

          That's one of the two things Windows always did better (the other thing is disabling write buffering for removable storage, on account of that storage being, well, removable).

          Resource limits are not something you want to discover when you've exceeded them; they need to be managed and alerted about in advance. "Makes computer slower" and alerting the "people who pay for compute" is preferable to crashing - especially in distributed systems, where failures cascade (particularly when the whole system has been penny-pinched / overoptimized to the same degree as any single computer within it).

          • throwaway2907 months ago |parent

            > That's one of the two things Windows always did better

            Someone replied about cgroups on Linux and how it is bog standard stuff (that ClosedAI just didn't know how to use apparently?)

        • 7 months ago |parent
          [deleted]
    • remram7 months ago |parent

      You always have a load-balancer in front of your control plane (apiservers) if you have more than one; that's what I would use to break the glass. You need your engineers to know how to do that, though.

      You don't even need to break everything: take one apiserver out of the pool for admin access (provided etcd is not overwhelmed too).

      • btown7 months ago |parent

        Yep - my understanding of https://github.com/kubernetes/kubeadm/blob/main/docs/ha-cons... is that Kubernetes doesn't usually control that load balancer (nor should it, since you could accidentally tell Kubernetes to take down the control plane LB, then not be able to get it back up again!).

        I suppose one could set up a "fast pass lane" kubeconfig that adds a header that haproxy would understand, and route to a priority class in its queue with e.g. https://www.haproxy.com/documentation/haproxy-configuration-... . But there's no easy `kubectl --with-priority` (or, to my knowledge, good guidelines for the various gitops solutions) that follows this pattern out of the box.

        • remram7 months ago |parent

          I just connect to the backend server directly after taking that backend out of the load-balanced pool. Easier than going through the load-balancer in a special way.
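
          Concretely, that's just pointing kubectl at one apiserver instead of the LB endpoint (address here is illustrative):

            # talk to a single apiserver directly, bypassing the load balancer
            kubectl --server=https://10.0.0.11:6443 get pods -A
            # assumes your kubeconfig credentials and the apiserver cert are valid for that address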

    • xyzzy1237 months ago |parent

      As I understand it, you can configure API rate limiting per user, but you'd need to work out "reasonable" values on your own to leave enough headroom for admin requests.

      This would require per-cluster testing (and is complex to test since you need to induce representative load) so I suppose hardly anyone does it.

      • dilyevsky7 months ago |parent

        There’s a whole set of features built[0] to prevent this exact scenario - runaway controllers consuming all API resources. Not sure why it didn’t work; maybe they are running an old release.

        [0] - https://kubernetes.io/docs/concepts/cluster-administration/f...
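
        If anyone wants to check what their cluster is doing here, the APF objects are ordinary API resources (output varies by Kubernetes version):

          # list the flow schemas and priority levels API Priority and Fairness is using
          kubectl get flowschemas
          kubectl get prioritylevelconfigurations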

    • jiggawatts7 months ago |parent

      Microsoft SQL Server has a dedicated admin connection (DAC) feature.
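
      For reference, you reach the DAC from the command line with sqlcmd's -A flag (server name and credentials are placeholders):

        # dedicated admin connection: a single reserved session with pre-reserved resources
        sqlcmd -S myserver -U sa -P '<password>' -A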

    • cies7 months ago |parent

      In Zig you have to handle the case where malloc does not succeed.

  • geocrasher7 months ago

       In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
    
    The DNS song seems appropriate.

    https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...

  • dilyevsky7 months ago

    Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the APIServer is down, because it uses cache.Indexer. The records would be stale, but unless their application pods all restarted (which they could not, since the APIServer was down) or all the CoreDNS pods restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.

    • jimmyl027 months ago |parent

      Wouldn't it be that CoreDNS caches the information and records from the API server for X amount of time (it seems like this might be 20 minutes?), and then once the 20 minutes expired, CoreDNS would query the API server, receive no response, and fail?

      I think the idea of just serving cached responses indefinitely when the API server is unreachable is what you're describing, but I'm not sure if that's the default (and it probably has other tradeoffs that I'm not sure about either).

      • dilyevsky7 months ago |parent

        Based on my understanding of the plugin code, it is the default. The way cache.Indexer works is that it continuously streams resources from the APIServer using the Watch API and updates an internal map. I think if the Watch API is down it just sits there and doesn't purge anything, but I haven't tested that. The 20 min expiry is probably referring to the CoreDNS cache stanza, which is a separate plugin[0].

        [0] - https://coredns.io/plugins/cache
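
        On kubeadm-style clusters you can check whether the cache stanza is even enabled by dumping the Corefile (namespace and configmap name assume the defaults):

          kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'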

    • dmauskop6 months ago |parent

      I had the same thought! https://news.ycombinator.com/item?id=42446318

      My guess is they were running CoreDNS on control plane nodes since that's the kubeadm default.

  • JohnMakin7 months ago

    I caused an API server outage once with a monitoring tool, though in my case it was a monstrosity of a 20,000 line script. We quickly realized what we had done and turned it off. I have seen in very large clusters with 1000+ nodes that you need to be especially sensitive about monitoring API server resource usage, depending on what precisely you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.

    • dboreham7 months ago |parent

      > 20,000 line script

      Dude.

      • fuzzy_biscuit7 months ago |parent

        They meant manuscript I assume.

        • cbsmith7 months ago |parent

          Yeah, "script" is just an abbreviation of "manuscript", right? ;-)

          • TeMPOraL7 months ago |parent

            Right, and the (manu)script that size makes a long scroll. Humanity had to more or less invent pagination to deal with this.

            (I'll see myself out.)

            • cbsmith7 months ago |parent

              Thanks dad. ;-)

              Job well done.

      • JohnMakin7 months ago |parent

        This was most definitely not my choice or preference, but you know how it goes.

  • StarlaAtNight7 months ago

    This quote cracked me up:

    “I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS”

    • akshayshah7 months ago |parent

      James Mickens is a comedic genius. The linked article always makes me laugh out loud.

      https://www.usenix.org/system/files/1311_05-08_mickens.pdf

      • rrr_oh_man7 months ago |parent

        Excellent article, but the typesetting / justification in that pdf is horrendous.

        The text columns look like the side of that hallway rubber mat that my dog keeps chewing on.

          spend a lot of time trying
          edge. However, as someone
          lieve that true progress is
          mes, and for the chickens
          y zombies, and the polite
          to eat your brain to acquire
          be prepared; thus, in the
          e scientific breakthroughs,
          ast inevitably becomes
          he main thing that I ponder is
          post-apocalyptic survival
          ag-tag group of associates.
          cruit: a locksmith (to open
          ith has run out of ideas);
          row snakes at my enemies
          g is a reasonable way to
          ble in my ultimate success

        • remram6 months ago |parent

          Converted to markdown: https://gist.github.com/remram44/767dd41676866cc2d30dae6e3a1...

      • hyperdimension7 months ago |parent

        That was an amazing read. Thanks for linking it.

  • ilaksh7 months ago

    Wow, sounds like a nightmare. Operations staff definitely have real jobs.

  • dang7 months ago

    Recent and related:

    ChatGPT Down - https://news.ycombinator.com/item?id=42394391 - Dec 2024 (30 comments)

  • jimmyl027 months ago

    splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished for service discovery / routing.

    maybe instead of relying on kubernetes DNS for discovery it can be closer to something like envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane has access to the location information of other peer clusters.

    • cbsmith7 months ago |parent

      > maybe instead of relying on kubernetes DNS for discovery it can be closer to something like envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane has access to the location information of other peer clusters.

      That is generally how it is done... there's this never ending conflict between "push" architectures and "pull" architectures, and this scenario sure makes "push" seem better, and it is... until you're in one of those scenarios where "pull" is better. ;-)

    • nijave7 months ago |parent

      I imagine a peer to peer scheme could also work. All nodes are control plane and data plane and advertise/broadcast to other nodes (similar to how network routing tables are distributed)

      That'd be a pretty big architectural change, though

  • ec1096857 months ago

    Surprised they don’t have a slower rollout across multiple regions / Kubernetes clusters, given that the K8s APIs are a SPOF, as shown here, where a change brought the control plane down.

    Also, stale-if-error is a far safer pattern for service discovery than TTL'd DNS.

  • feyman_r7 months ago

    For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

    How can such a large change not be staged in some manner or the other? Feedback loops have a way of catching up later which is why it’s important to roll out gradually.

    • antod7 months ago |parent

      In the words of DevOps Borat...

      "To make error is human. To propagate error to all server in automatic way is #devops."

      • Havoc7 months ago |parent

        So sad that the account isn’t posting anymore.

    • TeMPOraL7 months ago |parent

      DNS makes for an atypically slow feedback loop. If you're not aware of it, then for an otherwise safe-looking change, you may test and complete the gradual roll-out before the failure hits you.

  • nijave7 months ago

    Seems like automated node access could also have been helpful here: kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.
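
    On a containerd node that's roughly the following (container name is hypothetical), going through the CRI socket rather than the API server:

      # find and stop the offending container directly on the node
      crictl ps --name telemetry-agent
      crictl stop <container-id>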

    • dilyevsky7 months ago |parent

      The only problem is you have to find out where those pods are, and your primary source of that information is currently under DoS attack by said pods.

      • nijave7 months ago |parent

        Not if you connect to all the nodes

  • 7 months ago
    [deleted]