Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
108 points by azhenley a year ago | 71 comments
  • btown a year ago

    > In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

    Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have an interrupt fired periodically that listens exclusively on a port that will only be used for emergency instructions (but uses equal security measures to production, and is only visible internally). Ensure that it can allocate against a preallocated block of memory; allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's useful it's vital.
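
    (A rough OS-level approximation of this idea - not the in-process scheme described above - is to reserve resources for a break-glass shell via cgroups; this sketch assumes systemd on cgroup v2, and the values are illustrative:)

      # start an emergency shell in its own transient unit with reserved memory and a high CPU weight
      sudo systemd-run --pty -p MemoryMin=512M -p CPUWeight=1000 /bin/bash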

    • dotancohen a year ago | parent

      I have a 1 GiB file to rm in case of emergency on every filesystem I manage - both personal and professional. I've only had to delete it maybe three or four times that I remember, but it's been a system-saver each time. I've long considered writing a process that just consumes some CPU, memory, and bandwidth for the same emergency buffer. I've even considered that it should examine `last` every second and free up those resources when I SSH in.
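
      (A minimal sketch of the ballast-file trick; the path is just an example:)

        # preallocate 1 GiB of "break glass" disk space
        fallocate -l 1G /var/ballast    # or: dd if=/dev/zero of=/var/ballast bs=1M count=1024
        # in an emergency, delete it for instant headroom, then recreate it once the fire is out
        rm /var/ballast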

      • tux3 a year ago | parent

        It is also common for filesystems to reserve a small percentage for the root user. I think the ext4 default is still 5% (which can be quite a bit more than 1GB on modern drives!)
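
        (The reserve can be checked and tuned per filesystem; the device name is a placeholder:)

          # show how many blocks are reserved for root on an ext4 filesystem
          tune2fs -l /dev/sda1 | grep -i 'reserved block count'
          # shrink the reserve to 1% if 5% is more than you want to give up
          tune2fs -m 1 /dev/sda1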

        • dotancohen a year ago | parent

          I haven't used root directly for over a decade. Modern usage is to log in as an unprivileged user, and to use sudo for all root operations.

          • tucnak a year ago | parent

            Haha, I'm going to steal that!

          • skywhopper a year ago | parent

            What’s your point here? Using sudo is using root.

            • remram a year ago | parent

              If you can't log in as user because resources are reserved for root... you can't sudo.

            • tucnak a year ago | parent

              I think you missed the joke here

              • tux3 a year ago | parent

                I don't think they were joking?

                I personally tend to leave a root shell open (and both my feet remain largely hole-free to this day), but it's pretty common advice, to avoid accidentally typing something unfortunate like rm ./* into the wrong shell

                • tucnak a year ago | parent

                  Yeah, no. If you're going to be doing `rm /` you might as well do `sudo rm /` just as easily. It's the same security model, and honestly the distinction is quite funny.

                  • tux3 a year ago | parent

                    No, there's a distinction if you look again. People don't accidentally type sudo in front of commands, but people do type things into the wrong window, or the wrong tab.

                    • tucnak a year ago | parent

                      > People don't accidentally type sudo in front of commands

                      And you're basing this assumption off...?

                      • tux3 a year ago | parent

                        Why, I am acutely tuned to the hivemind, of course.

              • emptiestplace a year ago | parent

                There is no joke here.

              • dotancohen a year ago | parent

                I missed the joke too. Care to share?

                • tucnak a year ago | parent

                  sudo is equivalent to root

                  • thephyber a year ago | parent

                    But it’s not. It’s a subset of what root can do.

                    The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

                    Your top comment’s parent didn’t say the ssh login user had all sudo permissions. For best security, there should be many users, each with different limited permissions. Navigating the multiple `sudo su` is frustrating but has a purpose.

                    • LinuxBender a year ago | parent

                      > The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

                      In all of my career I've seen that at one company. Everyone else just leaves it unrestricted. I would be impressed to see sudo used the way it was intended in more places. Some places even use passwordless sudo and ssh multiplexing, which together with simple phishing give unfettered and unlogged access to production.
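
                      (A sketch of what "used as intended" can look like - a hypothetical drop-in that grants one command rather than blanket root:)

                        # allow the oncall group to restart a single service, nothing else
                        echo '%oncall ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx' \
                          | sudo tee /etc/sudoers.d/oncall-restart
                        sudo chmod 0440 /etc/sudoers.d/oncall-restart
                        sudo visudo -cf /etc/sudoers.d/oncall-restart   # validate the syntax before relying on it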

        • timschmidt a year ago | parent

          Yup. 'tune2fs -m0' has saved my bacon more than once.

      • hank808 a year ago | parent

        We used to refer to this as a "cork file" back in the day.

    • tcdent a year ago | parent

      Resource exhaustion can be super frustrating, and always feels like a third world slum situation when it happens.

      Like, why would operating systems allow themselves to run out of headroom entirely in this day and age?

      • TeMPOraL a year ago | parent

        > a third world slum situation when it happens

        > why would operating systems allow themselves to run out of headroom entirely in this day and age?

        Following the analogy, perhaps for the same reason the richest cities in the richest, most developed Western nations are rapidly beginning to look very much like third-world slums?

        Over-optimization is the name of the game. All systems need some amount of slack in them to stay flexible, robust (and livable). Unfortunately, cutting into the slack is always profitable on the margin, so without top-down intervention, the slack will be cut into until it's gone entirely. "Look, these machines are utilized only to 75% of capacity; adding this new system will only increase it by 5%"; "look, we can save XX$/month by cutting on compute, the machines will still be maxing out at 90%, so we'll still have a buffer". "Oh, this new telemetry service will bump that only by 1%".

        "Look, there's so much free space here in between these blocks of flats; adding another block won't hurt."

        And so on. Until your machines are running at 99% capacity and you risk global outage every time someone sneezes near the server room. Until your city starts to look like London, and if you're from Central Europe like me, you may start to realize that being a few months or years behind on the most recent gadgets is a small price to pay in exchange for cities that are affordable and clean.

      • skywhopper a year ago | parent

        They usually don’t? If they did, they would have just crashed. If they are still running, then they allowed themselves headroom. The real question is who gets to use that headroom and how? What happens when the administrative processes use “too many” resources?

        There are tons of tools built into modern OSes to manage prioritization of processes, etc. Whether it’s worth the tradeoff for you to deal with the operational overhead of harnessing those features is up to you.

      • throwaway290 a year ago | parent

        Because it makes the computer slower, and people pay for compute.

        I think consumer systems do it. macOS shows the "quit some app, you're out of RAM" dialog, but the system itself keeps working.

        But if you are asking for the OS to know this is a container process and this is a control-plane process and treat them differently, I think no one does that.

        Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because the OS nerfed your main process, thinking something more important was running.

        • yuliyp a year ago | parent

          > the OS to know this is a container process and this is a control-plane process and treat them differently, I think no one does that

          cgroups on Linux does exactly this, and are a standard part of ensuring that containers don't exceed their allocated resources.

          • dotancohen a year ago | parent

            Most software running on a Linux server is not running in a container, which itself has (minimal) resource overhead.

            Unless you are talking about VMs in cloud computing, but even in those cases the VM is usually abstracted away and the end client only sees a VPS (e.g. with EC2 or GCE).

            • yuliyp a year ago | parent

              You don't need containers to use cgroups. For instance, systemd uses cgroups to control resource allocation to various services.
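
              (For example - the unit name and limits are illustrative:)

                # cap an existing service's memory and CPU via its cgroup
                systemctl set-property some-batch-job.service MemoryMax=2G CPUQuota=50%
                # or run a one-off command inside its own constrained cgroup
                systemd-run --scope -p MemoryMax=512M -p CPUQuota=25% -- ./heavy-task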

          • throwaway290 a year ago | parent

            OK, with some configuration a server OS can do it too. Looks like ClosedAI doesn't pay its sysadmins enough, if they didn't configure such a standard thing.

            • yuliyp a year ago | parent

              I was replying to your comment where you suggested that OS-level resource reservation wasn't a thing that happened in production systems. The situation here is that a single process was being overloaded by one type of load, preventing it from doing the other things that process was responsible for. Making that work properly would require load-shedding/prioritization within the process itself.

              • throwaway290 a year ago | parent

                > Kubernetes API servers saturated because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed

                I'm not sure where the "one process" part comes from?

        • TeMPOraL a year ago | parent

          > Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because the OS nerfed your main process, thinking something more important was running

          Imagine how frustrating it is when the Linux on your desktop suddenly decides to start randomly killing processes, or worse, attempts to swap some memory out, causing a feedback loop of delays that completely freezes the system until you power-cycle the machine.

          That's one of the two things Windows always did better (the other thing is disabling write buffering for removable storage, on account of that storage being, well, removable).

          Resource limits are not something you want to discover when you've exceeded them; they need to be managed and alerted about in advance. "Makes computer slower" and alerting the "people who pay for compute" is preferable to crashing - especially in distributed systems, where failures cascade (particularly when the whole system has been penny-pinched / overoptimized to the same degree as any single computer within it).

          • throwaway290 a year ago | parent

            > That's one of the two things Windows always did better

            Someone replied about cgroups on Linux and how it is bog standard stuff (that ClosedAI just didn't know how to use apparently?)

    • remram a year ago | parent

      You always have a load-balancer in front of your control plane (apiservers) if you have more than 1, that's what I would use to break the glass. You need your engineers to know how to do that though.

      You don't even need to break everything, take 1 apiserver out for admin access (provided etcd is not overwhelmed too).

      • btown a year ago | parent

        Yep - my understanding of https://github.com/kubernetes/kubeadm/blob/main/docs/ha-cons... is that Kubernetes doesn't usually control that load balancer (nor should it, since you could accidentally tell Kubernetes to take down the control plane LB, then not be able to get it back up again!).

        I suppose one could set up a "fast pass lane" kubeconfig that adds a header that haproxy would understand, and route to a priority class in its queue with e.g. https://www.haproxy.com/documentation/haproxy-configuration-... . But there's no easy `kubectl --with-priority` (or, to my knowledge, good guidelines for the various gitops solutions) that follows this pattern out of the box.

        • remram a year ago | parent

          I just connect to the backend server directly after taking that backend out of the load-balanced pool. Easier than going through the load-balancer in a special way.
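
          (Roughly like this, once the backend is out of the pool - the address and resource names are placeholders; kubectl can be pointed at a single apiserver directly:)

            # bypass the load balancer and talk to one apiserver
            kubectl --server=https://10.0.0.11:6443 get pods -A
            kubectl --server=https://10.0.0.11:6443 delete deployment runaway-telemetry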

    • xyzzy123 a year ago | parent

      As I understand it you can configure API rate limiting per user, but you'd need to work out "reasonable" values on your own to allow enough headroom for admin requests.

      This would require per-cluster testing (and is complex to test since you need to induce representative load) so I suppose hardly anyone does it.

      • dilyevsky a year ago | parent

        There’s a whole set of features built[0] to prevent this exact scenario - runaway controllers consuming all API resources. Not sure why it didn’t work; maybe they are running an old release.

        [0] - https://kubernetes.io/docs/concepts/cluster-administration/f...
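
        (If API Priority and Fairness is enabled - it is on by default in recent releases - the objects and metrics it exposes can be inspected, e.g.:)

          # list the flow schemas and priority levels the apiserver is enforcing
          kubectl get flowschemas
          kubectl get prioritylevelconfigurations
          # see whether requests are being rejected per priority level
          kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total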

    • jiggawatts a year ago | parent

      Microsoft SQL Server has a dedicated admin connection (DAC) feature.
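
      (Reachable with sqlcmd's -A flag, which requests the DAC; the server name and login are placeholders:)

        # connect over the dedicated admin connection
        sqlcmd -S dbserver -U sa -A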

    • cies a year ago | parent

      In Zig you have to handle the case where malloc does not succeed.

  • geocrasher a year ago

       In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
    
    The DNS song seems appropriate.

    https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...

  • dilyevsky a year ago

    Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the APIServer is down, because it uses cache.Indexer. The records would be stale, but unless all their application pods restarted (which they could not, since the APIServer was down) or all CoreDNS pods got restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.

    • jimmyl02 a year ago | parent

      Wouldn't it be that CoreDNS caches the information and records from the API server for X amount of time (it seems like this might be 20 minutes?), and then once the 20 minutes expired, CoreDNS would query the API server, receive no response, and fail?

      I think serving cached responses indefinitely when the API server is unreachable is what you're describing, but I'm not sure if that's the default (and it probably has other tradeoffs that I'm not sure about either).

      • dilyevsky a year ago | parent

        Based on my understanding of the plugin code, it is the default. The way cache.Indexer works is that it continuously streams resources from the APIServer using the Watch API and updates an internal map. I think if the Watch API is down it just sits there and doesn't purge anything, but I haven't tested that. The 20 min expiry is probably referring to the CoreDNS cache stanza, which is a separate plugin[0].

        [0] - https://coredns.io/plugins/cache
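
        (One way to check a given cluster, assuming the usual kubeadm-style ConfigMap named "coredns":)

          # dump the live Corefile to see the kubernetes plugin options and any cache stanza
          kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'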

    • dmauskop a year ago | parent

      I had the same thought! https://news.ycombinator.com/item?id=42446318

      My guess is they were running CoreDNS on control plane nodes since that's the kubeadm default.

  • JohnMakin a year ago

    I caused an API server outage once with a monitoring tool; in my case, however, it was a monstrosity of a 20,000 line script. We quickly realized what we had done and turned it off, and I have seen in very large clusters with 1000+ nodes that you need to be especially sensitive about monitoring API server resource usage, depending on what precisely you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.

    • dboreham a year ago | parent

      > 20,000 line script

      Dude.

      • fuzzy_biscuit a year ago | parent

        They meant manuscript I assume.

        • cbsmith a year ago | parent

          Yeah, "script" is just an abbreviation of "manuscript", right? ;-)

          • TeMPOraL a year ago | parent

            Right, and a (manu)script that size makes for a long scroll. Humanity had to more or less invent pagination to deal with this.

            (I'll see myself out.)

            • cbsmith a year ago | parent

              Thanks dad. ;-)

              Job well done.

      • JohnMakin a year ago | parent

        This was most definitely not my choice or preference, but you know how it goes.

  • StarlaAtNight a year ago

    This quote cracked me up:

    “I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS”

    • akshayshah a year ago | parent

      James Mickens is a comedic genius. The linked article always makes me laugh out loud.

      https://www.usenix.org/system/files/1311_05-08_mickens.pdf

      • rrr_oh_man a year ago | parent

        Excellent article, but the typesetting / justification in that pdf is horrendous.

        The text columns look like the side of that hallway rubber mat that my dog keeps chewing on.

          spend a lot of time trying
          edge. However, as someone
          lieve that true progress is
          mes, and for the chickens
          y zombies, and the polite
          to eat your brain to acquire
          be prepared; thus, in the
          e scientific breakthroughs,
          ast inevitably becomes
          he main thing that I ponder is
          post-apocalyptic survival
          ag-tag group of associates.
          cruit: a locksmith (to open
          ith has run out of ideas);
          row snakes at my enemies
          g is a reasonable way to
          ble in my ultimate success

        • remram a year ago | parent

          Converted to markdown: https://gist.github.com/remram44/767dd41676866cc2d30dae6e3a1...

      • hyperdimension a year ago | parent

        That was an amazing read. Thanks for linking it.

  • ilaksh a year ago

    Wow, sounds like a nightmare. Operations staff definitely have real jobs.

  • dang a year ago

    Recent and related:

    ChatGPT Down - https://news.ycombinator.com/item?id=42394391 - Dec 2024 (30 comments)

  • jimmyl02 a year ago

    splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished with service discovery / routing.

    maybe instead of relying on kubernetes DNS for discovery it can be closer to something like envoy. The control plane updates configs that are stored locally (and are eventually consistent) so even if the control plane dies the data plane has access to location information of other peer clusters.

    • cbsmith a year ago | parent

      > maybe instead of relying on kubernetes DNS for discovery it can be closer to something like envoy. The control plane updates configs that are stored locally (and are eventually consistent) so even if the control plane dies the data plane has access to location information of other peer clusters.

      That is generally how it is done... there's this never ending conflict between "push" architectures and "pull" architectures, and this scenario sure makes "push" seem better, and it is... until you're in one of those scenarios where "pull" is better. ;-)

    • nijave a year ago | parent

      I imagine a peer to peer scheme could also work. All nodes are control plane and data plane and advertise/broadcast to other nodes (similar to how network routing tables are distributed)

      That'd be a pretty big architectural change, though

  • ec109685 a year ago

    Surprised they don’t have a slower rollout across multiple regions / Kubernetes clusters, given the K8s APIs are a SPOF, as shown here where a change brought the control plane down.

    Also, stale-if-error is a far safer pattern for service discovery than TTL’d DNS.

  • feyman_r a year ago

    For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

    How can such a large change not be staged in some manner or the other? Feedback loops have a way of catching up later which is why it’s important to roll out gradually.

    • antod a year ago | parent

      In the words of DevOps Borat...

      "To make error is human. To propagate error to all server in automatic way is #devops."

      • Havoc a year ago | parent

        So sad that the account isn’t posting anymore.

    • TeMPOraL a year ago | parent

      DNS makes for an atypically slow feedback loop. If you're not aware of it, then for an otherwise safe-looking change, you may test and complete the gradual roll-out before the failure hits you.

  • nijave a year ago

    Seems like automated node access could have also been helpful here. Kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.

    • dilyevsky a year ago | parent

      The only problem is you have to find out where those pods are, and your primary source of this information is currently under DoS attack by said pods.

      • nijave a year ago | parent

        Not if you connect to all the nodes
