> With Prequal we saw dramatic improvements across all metrics [for youtube], including reductions of 2x in tail latency, 5-10x in tail RIF, 10-20% in tail memory usage, 2x in tail CPU utilization, and a near-elimination of errors due to load imbalance. In addition to meeting SLOs, this also allowed us to significantly raise the utilization targets that govern how much traffic we are willing to send to each datacenter, thereby saving significant resources.
This feels like one of those "my company realized savings greater than my entire career's expected compensation."
The question you should ask is how much less others would accept to do it. How much the work saves isn't how you price it; you price work based on the people available who can do it.
If the fire department puts out a fire in your house you don't pay them the cost of the building. You don't give your life to a doctor, etc. That way of thinking is weird.
It's not weird, but incomplete. It is broadly captured in the Economics concept of "willingness to pay", loosely the maximum price someone or some firm would be willing to pay for something of benefit to them.
This contrasts with "willingness to accept", loosely the minimum compensation someone or some firm would accept to produce a good or service (or accept some negative thing).
Neither of these is sufficient to determine the price of something precisely, but, in aggregate, these concepts bound the market price for some good or service.
I agree I wasn't super precise; I thought it was enough to show that pricing work at or close to the value it produces is unlikely as long as there are many people available to do it.
In my fire department example, if nobody is really coming and you have no insurance or other way to save your stuff, you would indeed pay a lot. From what I read, these were the dynamics in Roman times.
If you're thinking of Marcus Crassus, the dynamic there was that he'd offer to buy your burning property from you at a steep discount, and would only put the fire out if you agreed.
> "willingness to pay" [and] "willingness to accept"
Often abbreviated somewhat sloppily into demand and supply.
which should make the employee proud, and I'm sure Google is compensating them very, very well.
let's also not forget that the people involved didn't create this in a vacuum. it cost Google a LOT more than their compensation to make it possible for them to even start working on this project, let alone carrying it forward to completion.
people underestimate how hard and expensive it is to manage a company in a way that allows its employees to do a good job.
> it cost Google a LOT more than their compensation to make it possible for them to even start working on this project
The actual "not a vacuum" context here is an environment that has been basically printing money for google for the last twenty years. It did not "cost" them anything. It's fine to acknowledge that the people who built google, however well paid they are, are creating vastly more value than they are personally receiving.
If it costs nothing, why don't we both get together and build a money-printing machine as well?
You say that like it's something easy anyone can get done. Have you ever tried? If you try to build a sustainable and successful business, you'll see how hard it is.
You probably know, if you think about it, that I wasn’t saying there were no expenditures involved. This is just specious.
You said “it cost Google a LOT more than their compensation to make it possible for them to even start working on this project, let alone carrying it forward to completion.” As if Google is groaning under the weight of sacrifices made specifically so these SREs can play in their engineering playground every day. I am saying this is backwards — that there has been no such sacrifice on Google’s part.
> Google is groaning under the weight of sacrifices made specifically so these SREs can play in their engineering playground every day
they're literally doing that. google has +$300B net positive assets. [1] they could distribute that to stockholders, go to the bahamas and call it a day. but they keep this wealth there so that SWEs can "play" and create stuff for others to use.
yes, they're not a humanitarian org. they do it so that they earn more, then invest more. but they are sacrificing short term gains and risking actually losing money.
> but they keep this wealth there so that SWEs can "play" and create stuff for others to use.
That is not their motive or even a secondary consideration for shareholders.
To be fair, Google didn’t found YouTube.
And YouTube wouldn't have lasted long on its own.
Regarding infra, they were not capable of reaching the scale they needed and were already struggling by the time they got acquired.
On the money side, they couldn't attract advertisers as well as Google.
My point was that for whatever reason(s), Google chose to buy YouTube, while continuing to work on Google Video (which was started the month before YouTube), instead of buying a different competitor.
Given that YouTube was founded by 3 PayPal alums and was angel-investor funded while being one of the most-visited sites at the time of its acquisition by Google, I am inclined to believe that the issues you mentioned above, and others not mentioned, would have been resolved in-house, via their partnership with Google Video or another company, or via acquisition by another company.
I don’t think it’s likely that YouTube would have failed as a company with the founders and investors it had, such as Sequoia, as it was simply too popular. YouTube had already started running ads for ~9 months before acquisition, though they were initially against pre-roll and other site-wide ads.
https://en.wikipedia.org/wiki/History_of_YouTube
> Participatory video ads were designed to link specific promotions to specific channels rather than advertising on the entire platform at once. When the ads were introduced, in August 2006, YouTube CEO Chad Hurley rejected the idea of expanding into areas of advertising seen as less user-friendly at the time, saying, "we think there are better ways for people to engage with brands than forcing them to watch a commercial before seeing content. You could ask anyone on the net if they enjoy that experience and they'd probably say no." However, YouTube began running in-video ads in August 2007, with preroll ads introduced in 2008.
Profit margins matter a hell of a lot to publicly traded companies. Share price is a multiple of earnings.
> let's also not forget that the people involved didn't create this in a vacuum. it cost Google a LOT more than their compensation to make it possible for them to even start working on this project
...let's also not forget that Google didn't manufacture this opportunity in a vacuum. It cost the rest of society a lot more than their revenue to make it possible for them to employ people who can spend their entire days working on software.
Didn't it cost society to raise us to adulthood so that we can work for Google and other businesses? How many people have sacrificed their time and attention, directly or indirectly, to our benefit, since we were born? Yet, nobody's saying our salaries aren't deserved or the merit of our own efforts.
I think many people are explicitly saying that.
Pretty common really. Huge web properties like this have just massive room for cost optimizations, and AFAICT it's largely bottlenecked by management capacity. On the other hand, it's also pretty common to emphasize pet metrics like tails, which are pretty much by definition a minority of the cost. We give them the benefit of the doubt because a) Google and b) peer-reviewed publication, but that's not usually a heuristic available to decision makers.
Optimizing tail latency isn't about saving fleet cost. It's about reducing the chance of one tail latency event ruining a page view, when a page view incurs dozens of backend requests.
The more requests you have, the higher the chance one of them hits a tail. So the overall latency a user sees is largely dependent on a) number of requests b) tail latency of each event.
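To make that concrete, a toy calculation (illustrative numbers, not from the paper): if each backend call independently has a 1% chance of landing in the tail, a page that fans out to 50 backends hits at least one tail event roughly 40% of the time.

    # Toy illustration: probability that a page view with N fan-out requests
    # hits at least one tail-latency event, assuming each request independently
    # has probability p of being "slow".
    def p_any_tail(n_requests: int, p_slow: float) -> float:
        return 1.0 - (1.0 - p_slow) ** n_requests

    for n in (1, 10, 50, 100):
        print(n, round(p_any_tail(n, 0.01), 3))  # 0.01, 0.096, 0.395, 0.634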
This method improves the tail latency for ALL supported services, in a generic way. That's multiplicative impact across all services, from a user perspective.
Presumably, the number of requests is harder to reduce if they're all required for the business.
The last sentence of the abstract indicates the goal is to be able to run at higher utilization without sacrificing latency. The outcome of that is you need less hardware for a given load, which for Google undoubtedly means massive cost savings and higher profitability.
> Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization.
> Optimizing tail latency isn't about saving fleet cost.
Indirectly, it is. As the quote I replied to suggests, in order to combat tail latency, services often run with surplus capacity. This is just a fundamental tradeoff between the two variables mentioned.
So, by improving the LB algo, they (and anyone, really) can reduce the surplus needed to meet any specific SLO.
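Back-of-the-envelope (made-up numbers, just to illustrate the tradeoff): if a better LB lets you meet the same SLO at 60% target utilization instead of 40%, the fleet needed for a given peak load shrinks by about a third.

    import math

    # Toy sizing: machines needed to serve a peak load at a target utilization.
    # Numbers are illustrative, not from the paper.
    def machines_needed(peak_qps: float, qps_per_machine: float, target_util: float) -> int:
        return math.ceil(peak_qps / (qps_per_machine * target_util))

    print(machines_needed(1_000_000, 500, 0.40))  # 5000 machines
    print(machines_needed(1_000_000, 500, 0.60))  # 3334 machines, ~33% fewer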
You're correct - that's a fair point.
The person chose to be employed instead of starting their own business. Less risk, less reward.
I disagree with toenail being downvoted. In Europe, employees enjoy a great deal of stability. In the Netherlands, it's nigh-impossible to fire someone: you have to fill in 10 pages of justification paperwork and have an independent government agency review and approve it. If someone has a long-term illness, you have to pay 70% of their salary for up to 2 years, even when they do no work at all. Most people don't want to be entrepreneurs: they want clear instructions and stability. When you try to give stock to employees, the tax authorities raise an eyebrow: why would you give stock to employees when they enjoy none of your risks? It makes no sense, so we'll treat it as a form of salary and tax you 52% on the stock's paper value.
At the end of the day, what's left for the entrepreneur? You enjoy all the risk, but you don't get to have a paid 2 year sick leave. Even sympathy for your hard work can be hard to get. The potential of money is all you have.
Things are different in the US of course, where people can be fired the next minute without reason. That looks like just borderline abuse to me. But from a European perspective, the above comment does not deserve downvoting at all.
> In the Netherlands, it's nigh-impossible to fire someone: you have to fill in 10 pages of justification paperwork and have an independent government agency review and approve it.
Practically speaking, won't they fire someone by negotiating a "voluntary" severance agreement? It's not like an employee wants to stay on for long once it's been made known that they are unwanted.
Though obviously a severance agreement is better than at-will employment, it's also not a guarantee of long employment.
That is possible, but you'll have to go through a lawyer to draft a contract. It costs time and money (apart from the severance fee), and that's assuming negotiations are successful. It's still not easy.
What you get as an employee is certainty that your compensation will only increase in the single digits per year and that the pension fund will compound a part of your income in the single digits as well (pension funds typically underperform the S&P, even in bad years).
So either 2 years of “stable income” or the chance of much higher compounding rate. As I see it, employment is a nice backup if being self-employed doesn’t work out.
> why would you give stock to employees when they enjoy none of your risks? It makes no sense, so we'll treat it as a form of salary, so we'll tax you 52% on the stock's paper value.
The Netherlands is the only country in the EU that taxes on unrealised gains.
Not sure if you mean just for stock options, but more countries/regions have some form of wealth tax on unrealised gains.
Just stock, but I forgot that Spain also has stock included in its wealth tax for people with over 700k in assets. I think it's just NL and ES though - I don't think any other EU countries tax stocks in this way?
I thought there were more, but a quick search only revealed Norway, apart from Spain which you've already mentioned.
Also worth noting that in the Madrid region they don't have the wealth tax.
Also Denmark I believe
Especially given that many people will elect to be stably and cheaply employed and do just as good work as someone charging a "royalties model" fee for this kind of solution, it's hard to see the advantage of going it alone. The competition is the "cheap" FTE.
This is pretty much the only option for a person to have any sort of ownership... at least in the US where co-ops and the like are extremely rare w/ little legal help.
Honestly this guy probably got paid so much that the reward will be higher than many "successful" startup outcomes.
The biggest YouTube tail latency is “skip after x seconds”. And I don’t think this paper helps.
I'm surprised that S3/AWS hasn't blogged or done a white paper about their approach yet. It's been something like 7 years now since they moved away from standard load balancing.
If you think about an object storage platform, much like with YouTube, traditional load balancing is a really bad fit. No two requests are even remotely the same in terms of resource requirements, duration etc.
The latest deep dive on S3 at reinvent sheds some light on how it’s done.
Awesome, thanks for the link. Will be interesting to hear how things have developed in the several years since I left.
Out of curiosity, are there any documents on this, even for something other than AWS's S3? I find the idea very interesting.
Nothing specific that I'm aware of. Off the top of my head, and sticking to things that are pretty safe NDA territory: load balancer algorithms typically do things round-robin style, or by least current connections, or by speed of response, etc. They don't know anything about the underlying servers, or the requests, just what they've sent in which direction. If you have multiple load balancers sitting in front of your fleet, they often don't know anything about what each other is doing either.
With an Object Storage service like S3, no two GET or PUT requests an LB serves are really the same, or have the same impact. They use different amounts of bandwidth, pull data at different speeds, have different latency, and require different amounts of CPU for handling or checksumming, etc. It didn't used to be too weird to find API servers that were bored stiff, while others were working hard, all while having approximately the same number of requests going to them.
Smartphones used to be a nightmare, especially with the number that would be on poor signal quality, and/or reaching internationally. Millions of live connections just sitting there slowly GETing or PUTing requests, using up precious connection resources on web servers, but not much CPU time.
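A toy illustration of that (made-up request costs, nothing to do with S3 internals): round-robin hands every server the same number of requests, yet the actual work per server can end up wildly lopsided when request costs are heavy-tailed.

    import random

    # Toy demo: equal request counts per server, unequal actual work,
    # because per-request "cost" (CPU/bandwidth/duration) varies a lot.
    random.seed(1)
    servers = [0.0, 0.0, 0.0, 0.0]                               # accumulated work
    costs = [random.paretovariate(1.5) for _ in range(10_000)]   # heavy-tailed costs

    for i, cost in enumerate(costs):
        servers[i % len(servers)] += cost                        # round-robin assignment

    print([round(s) for s in servers])  # same request count, very different totals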
They talked a lot about using probes to select candidate servers, but I struggled to find a good explanation of what exactly a probe was.
However, the "Replica selection" section seems to shed some light on the details, albeit somewhat indirectly. From what I can gather, a probe consists of N metrics, which are gathered by the backend servers upon request from the load balancers.
In the paper they used two metrics, requests in flight (RIF) and measured latency for the most recent requests.
I assume the backend server maintains a RIF counter and a circular list of the last N requests, which it uses to compute the average latency of recent requests (so skipping old requests in the list presumably). They mention that responding to a probe should be fast and O(1).
At least that's my understanding after glossing through the paper.
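A minimal sketch of that reading (hypothetical names and structure, not the paper's actual code): the backend keeps a requests-in-flight counter plus a small ring buffer of recent latencies, so answering a probe is just reading two numbers.

    import collections
    import time

    class ProbeState:
        def __init__(self, window: int = 128):
            self.rif = 0                                         # requests in flight
            self.latencies = collections.deque(maxlen=window)    # recent latencies (s)

        def start_request(self) -> float:
            self.rif += 1
            return time.monotonic()

        def finish_request(self, started: float) -> None:
            self.rif -= 1
            self.latencies.append(time.monotonic() - started)

        def answer_probe(self) -> dict:
            # Effectively constant time: a counter plus an average over a fixed-size window.
            avg = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
            return {"rif": self.rif, "latency": avg}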
The key is that the probes
a) are fast: they certainly incur the same network cost as a regular request. But more than that, all they do is read two counters, so they're super quick for backends to serve.
b) cheap: they don't do nearly as much work as a "real" request, so the cost of enabling this system is not prohibitive. They simply return two numbers. The probes don't compete with "real" requests for resources.
c) give the load balancer useful information: among all the metrics they could have returned from the backend, the ones they chose led to good prediction outcomes.
One could imagine playing with the metrics used, even using ML to select the best ones, and to adapt them dynamically based on workload and time period.
I would have liked to have read something quantitative about the measured cost (in terms of client and server CPU) other than describing them as "small". I'm trying to imagine doing this in gRPC and it seems like the overhead would be pretty high. I know Stubby is more efficient but some hard numbers would have been nice.
Using observed latency, power of two choices, and requests in flight reminds me a lot of Finagle https://twitter.github.io/finagle/guide/Clients.html#p2c-lea...
They specifically compare to client-side least-loaded with Po2C in section 5.2/figure 7.
Need to read this in depth over the weekend. I've been fully immersed in LLMs for the past 2 years and have ignored systems research.
This appears to be a very logical solution, i.e., use estimated service quality instead of resource metrics for scheduling. This has also been more or less known in recent years, as systems are becoming so complex and so intertwined in their distribution that scheduling based on host load addresses only a minor factor in serving requests. It's like growing taller: you need to worry less about tripping over hurdles and more about bumping your head on the door frame.
But we do need this kind of research to formalize the practice, and get everyone on board.
Google's applied research absolutely winning here.
From TFA:
> PReQuaL does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight
So, still load balancing
Load balancing is a term of art; the actual algorithm for distributing requests need not be load-based. A more accurate term for the component might be "request distributor," but I don't foresee people changing their vocabulary any time soon.
I’ve never heard of a load balancer that balances CPU load. They balance queuing depth and that’s only a proxy for cpu load and a pretty terrible one at that.
I really don’t understand how their claim is anything more than a least-conn with a better weighting algorithm.
We don't generally use heterogeneous server clusters anymore. Noisy neighbors and differences from one data center to the next are definitely things, but outside of microservices, you've got a lot of requests with different overhead to them. Route B might be five times as expensive as route A. So it's not server predictors that I want, but route predictors. Those need a weight or cost estimator based on previous traffic.
Poor man's version of this: we had ingress load balancers and then a local load balancer, like one does for Ruby or NodeJS or a handful of other languages. I found that we got much better tail latency running a more "square" arrangement. We initially had a little under 3 times as many boxes as cores per box, and I switched to the next biggest EC2 instance, which takes you to a 3:4 ratio. That not only cancelled out a slight latency increase from moving to docker containers but also let me reduce the cluster size by about 5% and still have a bit better p95 times.
I get two equally weighted attempts to balance the load fairly, instead of one and change.
Abstract:
We present PReQuaL (Probing to Reduce Queuing and Latency), a load balancer for...
“Don’t load balance, erm, here’s our loadbalancer” struck me as quite humorous too. :)
Maybe RIF-balancing is a better term.
Fascinating that 2-3 probes per request is a sweet spot, intuitively it seems like a lot of overhead.
Requests (in flight or currently processing) are the load in this case. But I guess "queue balancing" captures the intuition better: what matters for latency is the future delay more than the current delay.
Least-conn is requests in flight. It’s like they’re trying to make it hard for people to search for prior art. My ass feels very smoky right now.
They estimate what load will be in the future too.
Are people really balancing load based on simple cpu utilization for non-trivial services? That seems really surprising to me, but they present it as the current best practice?
In my experience, by far the most widespread approach to directing requests is randomly. Round robin is pretty popular. Everything more sophisticated than that is vanishingly rare outside of large organizations. Look at gRPC: what client-side load balancers does it come with? Pick_first and round_robin.
Weighted round robin has some traction, but even in engineering groups with lots of experience and talent, the complexity of measuring CPU utilization is underappreciated.
When reading a paper like this, a lot of the time you should ignore what the paper frames as the current state of the art. Often the research is framed in a way that gives it an impact boost by setting up somewhat of a strawman about the current state.
Every production load balancer that I have come across in the last 10 years load balances on the metric that is important to them.
Also, replace "reading a paper" with "reading a startup pitch", "reading a company's marketing material" and any number of other things... Practically any time you ask a company what their value prop is you'll get a description as though we all just came out of the caves.
Cuts to black and white video of someone struggling, visibly depressed almost panicked about opening a bag of potato chips. Cut to color video of happy relaxed individual after applying entirely obvious sold as-new load balancing heuristic.
I'm curious what you think kubernetes does, because AFAICT, cpu -- a metric available to hpa -- would be an _improvement_.
A few notes on Power-Of-Two-Choices (Po2C) which is part of the solution described in the paper.
NGINX appears to have Po2C available as an option on the "random" balancing algorithm. [1] It only considers load (requests in flight), unlike the paper which also considers recent latency (and does many other things).
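For reference, the core Po2C selection is tiny; a hedged Python sketch (illustrative only, not NGINX's implementation): sample two backends at random and send the request to whichever has fewer requests in flight.

    import random

    # Power-of-two-choices over requests in flight (illustrative sketch).
    def pick_backend(in_flight: dict) -> str:
        a, b = random.sample(list(in_flight), 2)
        return a if in_flight[a] <= in_flight[b] else b

    backends = {"b1": 3, "b2": 7, "b3": 1, "b4": 5}
    print(pick_backend(backends))  # whichever of the two sampled backends is less loaded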
However NGINX "Plus" (the paid version) appears to have a couple additional settings that use actual traffic as latency probes (not asynchronous synthetic probes as described in the paper). [also 1] It's not fully clear to me but I think if you use these options it's going to only use latency, not load. The paper described in this post actually uses both load (RIF) and latency. Eager to try these out.
Also, I noted that HAProxy author and kernel maintainer Willy Tarreau wrote an article in 2019 about adding Power-of-two-choices to HAProxy. [2] He builds a test bed with a few micro services and compares several balancing algorithms to Po2C, considering response time, throughput, and max load.
One conclusion is that "At the very least, Power of Two is always better than Random. So, we decided to change the default number of draws for the Random algorithm from one to two, matching Power of Two." However, in the response time data, the test appears to consider only average response time. I would love to see the p90 and p99 tail latency of these tests. The main focus of the paper in this post is minimizing p90/99, not average latency.
Tarreau also helpfully provides links to both a 2001 research report by Mitzenmacher, Richa & Sitaraman and also the original 1996 Po2C PhD thesis by Mitzenmacher.
[1] https://docs.nginx.com/nginx/admin-guide/load-balancer/http-...
[2] https://www.haproxy.com/blog/power-of-two-load-balancing
The previous policy uses weighted round robin (WRR), which is described as:
> weight w_i is calculated as q_i/u_i, where q_i and u_i represent the recent query-per-second (QPS) rate and CPU utilization of replica i.
The pathological case describes a situation where antagonist loads are soaking up CPU on some of the machines, but WRR equally distributes the load because the weight formula prioritizes the utilization of the individual tenant, not of the entire machine. Wouldn't including the machine utilization (probably downweighted) in the WRR formula also solve the issue?
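A toy version of that reading (made-up numbers): the per-tenant stats look identical on both machines, so the quoted q_i/u_i weights come out equal, while a hypothetical machine-level term would separate them.

    # Replica A has the machine to itself; replica B shares it with a CPU-hungry antagonist.
    replicas = {
        "A": {"qps": 100, "tenant_cpu": 0.20, "machine_cpu": 0.25},
        "B": {"qps": 100, "tenant_cpu": 0.20, "machine_cpu": 0.90},
    }

    for name, r in replicas.items():
        w_tenant = r["qps"] / r["tenant_cpu"]    # q_i / u_i as quoted: 500 for both replicas
        w_machine = r["qps"] / r["machine_cpu"]  # machine-aware variant: 400 vs ~111
        print(name, round(w_tenant), round(w_machine, 1))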
What about this is a surprising result? Optimize as directly as possible for load balancing the thing(s) that matter most to your users (e.g. latency), not the thing that indirectly correlates to the thing(s) that matter (CPU Utilization). I've been making this exact argument at work for a while.
Funny to read this article today, just after myself and two others at work also just saved the company we work at far more than our combined expected lifetime gross earnings with a single (different) optimization.
The paper reiterates -- multiple times -- that optimizing for latency is obviously the right choice; the other major signal is instantaneous requests-in-flight (RIF). They explicitly mention, on the very first page, that their contributions are not that "low latency is good, aim for that", it's that A) they use a particular combination of RIF and latency to select replicas called their "hot-cold lexicographic rule" which they find works really well, and B) their "probes" (requests by the balancer to the backends, to discover what their current state is) are collected asynchronously rather than in the critical request serving path, to help further drive utilization. Most of the ink in the paper in Section 4 explains the design choices they made around probing as a result.
I'm not done with the paper yet, but the basics are in fact written on page one.
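Roughly, my reading of the hot-cold rule (simplified; as I understand it, the paper derives the hot/cold RIF threshold from the distribution of probed RIF values rather than using a fixed constant): if any probed replica is "cold", pick the cold one with the lowest estimated latency; otherwise fall back to the lowest RIF.

    # Simplified sketch of the hot-cold lexicographic selection, as I read it.
    # probes: list of (replica, rif, latency) tuples gathered asynchronously.
    def select_replica(probes, rif_threshold):
        cold = [p for p in probes if p[1] <= rif_threshold]
        if cold:
            return min(cold, key=lambda p: p[2])[0]   # cold: lowest estimated latency
        return min(probes, key=lambda p: p[1])[0]     # all hot: lowest requests in flight

    probes = [("r1", 12, 0.030), ("r2", 4, 0.045), ("r3", 5, 0.020)]
    print(select_replica(probes, rif_threshold=8))    # "r3": cold and lowest latency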
you got me curious. can you share?
Is this available anywhere yet, or still only inside Google? It would be great to test it on our K8s cluster, which uses nginx-ingress.
I wonder how predictable the failure modes are?
Traditional load balancing has some known failure modes and engineers know to design to avoid them.