Something doesn't add up - CoreDNS's kubernetes plugin should keep serving Service RRs from its internal cache even if the APIServer is down, because it's built on cache.Indexer. The records would be stale, but unless all of their application pods restarted (which they couldn't, since the APIServer was down) or all of the CoreDNS pods restarted (again, which they couldn't), records merely expiring from the cache shouldn't have caused a full discovery outage.
Wouldn't it be that CoreDNS caches the records from the API server for X amount of time (it seems like this might be 20 minutes?), and then once those 20 minutes expired CoreDNS would query the API server, receive no response, and fail?
I think serving cached responses indefinitely when the API server is unreachable is what you're describing, but I'm not sure that's the default (and it probably has other tradeoffs I'm not aware of, too).
Based on my understanding of the plugin code, it is the default. The way cache.Indexer works is that it continuously streams resources from the APIServer via the Watch API and updates an internal map. I think that if the Watch API is down it just sits there and doesn't purge anything, but I haven't tested that. The 20 min expiry is probably referring to the CoreDNS cache stanza, which is a separate plugin[0].
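To illustrate what I mean, here's a minimal client-go sketch (my own standalone setup, not CoreDNS's actual code) of an informer-backed Indexer: once the initial sync completes, the local store keeps the last-known Services, and a failing watch only stops updates - the reflector retries in the background without purging anything.

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Resync period 0: rely purely on the watch stream, which is roughly
        // what the kubernetes plugin does.
        factory := informers.NewSharedInformerFactory(client, 0)
        svcInformer := factory.Core().V1().Services().Informer()

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        cache.WaitForCacheSync(stop, svcInformer.HasSynced)

        // If the API server becomes unreachable now, these lookups keep
        // returning the last-known Services; watch errors don't evict entries.
        for {
            for _, obj := range svcInformer.GetIndexer().List() {
                svc := obj.(*corev1.Service)
                fmt.Println(svc.Namespace+"/"+svc.Name, svc.Spec.ClusterIP)
            }
            time.Sleep(30 * time.Second)
        }
    }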
I caused an API server outage once with a monitoring tool, though in my case it was a monstrosity of a 20,000-line script. We quickly realized what we had done and turned it off. I have seen in very large clusters with 1000+ nodes that you need to be especially sensitive about API server resource usage, depending on precisely what your monitoring is doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.
> 20,000 line script
Dude.
They meant manuscript I assume.
> In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.
Something that I wish all databases and API servers would do, and that few actually do in practice, is allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Run a handler, woken by a periodic interrupt, that listens exclusively on a port used only for emergency instructions (but with the same security measures as production, and only visible internally). Ensure it can allocate against a preallocated block of memory; allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's useful it's vital.
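As a rough sketch of that idea (the port numbers, limits, and handlers below are my own assumptions, not any particular server's API): bind an internal-only emergency listener at startup, preallocate its scratch memory, and cap normal request concurrency below total capacity so the emergency path always has headroom.

    package main

    import (
        "log"
        "net/http"
    )

    var (
        // Preallocated so the emergency path never has to allocate while the
        // process is under memory pressure. Size is arbitrary.
        emergencyScratch = make([]byte, 8<<20)

        // Cap normal request concurrency below total capacity; the slack is
        // the reserved emergency headroom.
        normalSlots = make(chan struct{}, 512)
    )

    func throttled(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case normalSlots <- struct{}{}:
                defer func() { <-normalSlots }()
                next.ServeHTTP(w, r)
            default:
                http.Error(w, "overloaded", http.StatusServiceUnavailable)
            }
        })
    }

    func main() {
        // Emergency plane: internal-only port, never throttled, tiny handlers
        // (drain, dump state, raise limits). Bound first so it always exists.
        go func() {
            mux := http.NewServeMux()
            mux.HandleFunc("/drain", func(w http.ResponseWriter, r *http.Request) {
                _ = emergencyScratch // work only against the reserved buffer here
                w.WriteHeader(http.StatusAccepted)
            })
            log.Fatal(http.ListenAndServe("127.0.0.1:9901", mux))
        }()

        // Normal plane: public port, throttled so the emergency plane keeps
        // its headroom even when the process is saturated.
        log.Fatal(http.ListenAndServe(":8080", throttled(http.DefaultServeMux)))
    }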
Resource exhaustion can be super frustrating, and always feels like a third world slum situation when it happens.
Like, why would operating systems allow themselves to run out of headroom entirely in this day and age?
Wow, sounds like a nightmare. Operations staff definitely have real jobs.
Splitting the control plane and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished for service discovery / routing.
Maybe instead of relying on Kubernetes DNS for discovery, it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies, the data plane still has the location information of its peer clusters.
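Roughly like this sketch (the control-plane URL, snapshot path, and JSON format are all hypothetical): the data plane polls the control plane for endpoint config, keeps an in-memory and on-disk snapshot, and simply keeps serving the last good view when the control plane is unreachable.

    package main

    import (
        "encoding/json"
        "io"
        "log"
        "net/http"
        "os"
        "sync"
        "time"
    )

    const (
        controlPlaneURL = "http://control-plane.internal/v1/endpoints" // hypothetical
        snapshotPath    = "/var/run/discovery/endpoints.json"          // hypothetical
    )

    // Endpoints maps a service name to the addresses of its peers.
    type Endpoints map[string][]string

    var (
        mu      sync.RWMutex
        current = Endpoints{}
    )

    // Lookup answers discovery queries from the in-memory snapshot only; the
    // control plane is never on the request path.
    func Lookup(service string) []string {
        mu.RLock()
        defer mu.RUnlock()
        return current[service]
    }

    // loadSnapshot restores the last persisted view on startup.
    func loadSnapshot() {
        data, err := os.ReadFile(snapshotPath)
        if err != nil {
            return // first boot: no snapshot yet
        }
        var eps Endpoints
        if json.Unmarshal(data, &eps) == nil {
            mu.Lock()
            current = eps
            mu.Unlock()
        }
    }

    // refresh pulls new config from the control plane; on any failure it
    // leaves the existing (stale but usable) snapshot in place.
    func refresh() {
        resp, err := http.Get(controlPlaneURL)
        if err != nil {
            log.Printf("control plane unreachable, serving stale endpoints: %v", err)
            return
        }
        defer resp.Body.Close()
        data, err := io.ReadAll(resp.Body)
        if err != nil {
            return
        }
        var eps Endpoints
        if json.Unmarshal(data, &eps) != nil {
            return
        }
        mu.Lock()
        current = eps
        mu.Unlock()
        _ = os.WriteFile(snapshotPath, data, 0o644) // survive restarts, too
    }

    func main() {
        loadSnapshot()
        for {
            refresh()
            time.Sleep(30 * time.Second)
        }
    }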
Recent and related:
ChatGPT Down - https://news.ycombinator.com/item?id=42394391 - Dec 2024 (30 comments)