I use DO's load balancers in a couple of projects, and they don't list Cloudflare as an upstream dependency anywhere that I've seen. It's so frustrating to think you're clear of a service and then find out you're actually in its blast radius too, through no fault of your own.
It is mentioned in their list of subprocessors: https://www.digitalocean.com/trust/subprocessors
I find stuff like this all the time. Railway.com recently launched an object storage service, but it's simply a wrapper for Wasabi buckets under the hood, and they don't mention this anywhere... not even on the subprocessors page: https://railway.com/legal/subprocessors. Customers have no idea they're using Wasabi storage buckets unless they dig around in the DNS records, so I have to do all this research to find the upstream dependencies, subscribe to status.wasabi.com alerts, and so on.
dig b1.eu-central-1.storage.railway.app +short
s3.eu-central-1.wasabisys.com.
eu-central-1.wasabisys.com.
Hey, I'm the person who was responsible for adding object storage to Railway. It was my onboarding project: basically a project I got to choose myself and implement in 3 weeks, in my 3rd month after joining Railway.
Object Storage is currently in Priority Boarding, our beta program. We can and definitely will do better: document this and add it to the subprocessor list. I'm really sorry about the current lack of it. There was another important project I had to do between the beta release of Buckets and now. I'm on call this week, but I will continue bringing Buckets to GA next week. So, just to give this context: there's no intentional malevolence or shadiness going on, it's simply that there's one engineer (me) working on it, and there's a lot of stuff to prioritize and do.
It's also super important to get user feedback as early as possible. That's why it's a beta release right now, and the beta release is a bit "rushed". The earlier I can get user feedback, the better the GA version will be.
On the "simply a wrapper for wasabi buckets" - yes, we're currently using wasabi under the hood. I can't add physical Object Storage within 3 weeks to all our server locations :D But that's something we'll work towards. I wouldn't say it's "simply" a wrapper, because we're adding substantial value when you use Buckets on Railway: automatic bucket creation for new environments, variable references, credentials as automatic variables, included in your usage limits and alerts, and so on.
I'll do right by you, and by all users.
Slightly off topic: I used DO LBs for a little while but found myself moving away from them toward a small droplet with an haproxy or nginx setup. Worked much better for me personally!
The point of an LB for these projects is to get away from a single point of failure, and I find configuring HA and setting up the networking and everything to be a pain point.
These are all low-traffic projects so it's more cost effective to just throw on the smallest LB than spend the time setting it up myself.
If they are small projects, why are they behind a load balancer to begin with?
Usually because of SSL termination. It's generally "easier" to just let DO manage getting the cert installed. Of course, there are tradeoffs.
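For anyone weighing the DIY side of that tradeoff, the self-managed version on a single droplet is roughly this (just a sketch, assuming an Ubuntu/Debian box running nginx, with example.com as a placeholder domain):

# install certbot and its nginx plugin
sudo apt install certbot python3-certbot-nginx
# get a Let's Encrypt cert and have certbot rewrite the nginx server block to terminate TLS
sudo certbot --nginx -d example.com
# certbot sets up automatic renewal; confirm it works
sudo certbot renew --dry-run

Not hard, but it's one more thing to own per box, and the managed LB makes it somebody else's problem.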
I use the LBs for high availability rather than because I need load balancing. The LB + 2 web back-ends + managed DB means a project is resilient to a single server failing, for relatively low devops effort and around $75/mo.
Are both servers deployed from the exact same repo/scripts? Or are they meaningfully different, and/or balanced across multiple data centers?
Did your high availability system survive this outage?
I have a couple of instances of this same pattern for various things that have been running for 5+ years, none of them have suffered downtime caused by the infrastructure. I use ansible scripts for the web servers, and the DO API or dashboard to provision the Load Balancer and Database. You can get it all hooked up in a half hour, and it really doesn't take any maintenance other than setting up good practices for rotating the web servers out for updates.
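If it helps anyone: the LB provisioning itself is a single API call, something like the below (a sketch against DO's /v2/load_balancers endpoint; the field names are from memory, so double-check the current API docs):

# $DO_TOKEN, the certificate ID, and the droplet IDs are placeholders to fill in
curl -X POST "https://api.digitalocean.com/v2/load_balancers" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "web-lb",
    "region": "nyc3",
    "forwarding_rules": [
      {"entry_protocol": "https", "entry_port": 443,
       "target_protocol": "http", "target_port": 80,
       "certificate_id": "your-cert-id"}
    ],
    "health_check": {"protocol": "http", "port": 80, "path": "/healthz"},
    "droplet_ids": [111111, 222222]
  }'

Keep that call (or its equivalent) in source control next to the ansible scripts and the whole setup is reproducible.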
They wouldn't survive DO losing a DC, but they're not so mission critical that it's worth the extra complexity to handle that, and I don't recall DO losing a DC in the past 10 years or so.
They did stay up during this outage, which was apparently mostly concentrated on a different product called the 'global load balancer' - which, ironically, is exactly the extra complexity I mentioned for surviving a DC outage in theory.
Keep in mind these are "important" in the sense that they justify $100/mo on infra and monitoring, but not "life critical" in the sense that an outage is going to kill somebody or cost millions of bucks an hour. Once your traffic gets past a certain threshold, DO's costs don't scale that well and you're better off with a large distributed self-managed setup on Hetzner or buying into a stack like AWS.
To me their LB and DB products hit a real sweet spot -- better reliability than one box, and meaningfully less work than setting up a cluster with floating IP and heartbeats and all that for a very minimal price difference.
Regional LBs do not have Cloudflare as an upstream dependency.
They don't name names but it's probably due to the ongoing Cloudflare explosion. I know the DigitalOcean Spaces CDN is just Cloudflare under the hood.
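Easy enough to verify the same way as the Wasabi example upthread: resolve a Spaces CDN endpoint and see whose network the CNAME chain points at. Assuming the usual <bucket>.<region>.cdn.digitaloceanspaces.com endpoint format, with a made-up bucket name:

dig example-bucket.nyc3.cdn.digitaloceanspaces.com +short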
Just the Spaces CDN, not Spaces itself - you'd think they'd just turn the CDN off for a bit.
You can't just "turn off the CDN" on the modern internet. You'd instantly DDoS your customers' origins. They're not provisioned to handle the traffic, and even if they were, the size of the pipe going to them isn't. The modern internet is built around the expectation that everything is distributed via CDN. Some more "traditional" websites would probably be fine.
Might be just me, but I can think of many origins under my control which could live without a (non-functional) CDN for a while.
A CDN is great for peak load, latency reduction, and cost - but not all sites depend on it for scale 24/7.
If you are DO you could, you just decided not to bother. They control the origins; it's Spaces (S3), so they could absolutely spin up further gateways or a cache layer and then turn the CDN off.
Either you are wrong and they do not have the capacity to do that, or they have decided it is acceptable to be down because a major provider is down
I imagine a cache layer cannot be that easy to spin up - otherwise why would they outsource it?
You outsource it because Cloudflare have more locations than you, so they offer lower latency and can offer it at a cost that's cheaper than or the same as doing it yourself.
Which suggests it's expensive enough that it's unlikely they just have the capacity lying around to spin up.
To the contrary, CDN pricing will usually beat cloud provider egress fees.
Common example: you can absolutely serve static content from an S3 bucket worldwide without using a CDN. It will usually scale OK under load. However, you're going to pay more for egress and give your customers a worse experience. Capacity isn't the problem you're engineering around.
For a site serving content at scale, a CDN is purpose-built to get content around the world efficiently. This is usually cheaper and faster than trying to do it yourself.
That is not what I said. I said DO will not have the spare capacity because it's too expensive. Can you please tell me who DO pay egress fees to?
They will be doing a mix of peering, both across free PNIs and very low-cost IXP ports, with the remainder going down transit like Colt or Cogent. Probably an average cost of the order of $1 per 20 TB of egress in the European and NA markets.
The thing with edge capacity is that you massively overbuild, on the basis that:
It's generally a long-ish lead time to scale capacity (days, not minutes or hours)
Transit contracts are usually 12-60 months
Traffic is variable, and not all ports cover all destinations
Redundancy
So if you are doing, say, 100 Gbps at the 95th percentile out of, say, London, then you will probably have at least 6+ 100 Gb ports, so you do have quite a bit of latent capacity if you just need it for a few hours.
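To put rough numbers on that, using the $1 per 20 TB figure above and pretending the 95th percentile is a flat 100 Gbps (which overstates real volume):

# 100 Gbps is about 12.5 GB/s; a 30-day month is ~2,592,000 seconds
echo "12.5 * 2592000 / 1000" | bc -l   # ~32,400 TB (~32 PB) if the pipe ran flat out
echo "32400 / 20" | bc -l              # ~1,620, i.e. roughly $1,600/mo of transit at that rate

Variable traffic and 95th-percentile billing pull the real number down, but it gives a sense of the scale the spare ports sit on top of.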
nit: that's more DoS (from a handful of DO LBs) than DDoS.
Yes, all sites are showing the Cloudflare error due to the massive outage. It seems their outages are getting more frequent and taking down the internet in new ways each time.
Man, it really seems like the cloud providers are having some tough times lately. Azure, AWS, and Cloudflare! Is everything just secretly AWS?
I have two projects on DO using droplets and they are still running fine.
Droplets are fine.
> This incident affects: API, App Platform (Global), Load Balancers (Global), and Spaces (Global).
It seems to be mostly a Cloudflare-related issue.
My DOs are working fine as well.
Are you using their "reserved IPs"? I was thinking of starting to use them, but now I wonder if they're part of the load balancing stack under the hood.
So yesterday Azure got hit hard, and today CF and DO are down. Bad week, or something else?
The Azure DDoS event happened in October. The blog post about the attack was published yesterday and was quickly picked up by news sites.
A DDoS, but I don't really understand why, in particular.
Having known people like this, it's either flexing about who has the more powerful botnet or advertising who can do what.
NATO testing internal infra, or Russian hackers stepping it up after aggressive sabotage efforts in Eastern Europe?
I would also like to know people’s opinion on this.
The year-end promotion cycle is the worst time for end users and the best time for engineers greedy for promotions.
Don't blame individual engineers for wanting to do what will be rewarded; blame the company performance policies that reward this type of behavior.
Shoot, there are also the end-of-year layoffs and reorgs to pump up those year-end numbers.
What engineers, mate? They're AI now.
and they're doing just spectacular
I knew it: the DigitalOcean CDN is using Cloudflare behind the scenes. Why, DO?
Cloudflare outage.
Who is next?
My guess would be to look at who has a FedRAMP-capable service first.
Maybe also GCP, Hetzner, or Akamai.
Dominos falling into dominos falling into dominos…