The article discusses the specific use case of serverless computing, e.g. AWS Lambda, and how a central database doesn't always work well with apps constructed in a serverless fashion.
I was immediately interested in this post because 6-7 years ago I worked on this very problem: I needed to ingest a set of complex hierarchical files that could change at any time, and I needed to "query" them to extract particular information. FaaS is costly for computationally intensive tasks, and it also didn't make sense to load big XML files and parse them every time I needed to do a lookup in any instance of my Lambda function.
My solution was to have a central function on a timer that read and parsed the files every couple of minutes, loaded the data into a SQLite database, indexed it, and put the file in S3.
My functions then just downloaded the file from S3 (if it was newer than the local copy, or on a cold start) and did the lookup. Blindingly fast and no duplication of effort.
One of the things that is not immediately obvious about Lambda is that it has a local /tmp directory that you can read from and write to. Also, the Python runtime includes SQLite; no need to upload any code besides your function.
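A minimal sketch of that pattern (bucket, key, and the `items` table are made up for illustration; the original setup isn't specified):

```python
# Hypothetical sketch of the /tmp-cached SQLite lookup pattern described
# above: refresh the local copy only when S3 has a newer build, then do
# all lookups against the local file.
import os
import sqlite3

DB_PATH = "/tmp/lookup.db"  # persists across warm invocations of one instance

def refresh_if_stale(s3_client, bucket="my-bucket", key="lookup.db"):
    """Download the prebuilt database only if S3 has a newer copy."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    remote_mtime = head["LastModified"].timestamp()
    if not os.path.exists(DB_PATH) or os.path.getmtime(DB_PATH) < remote_mtime:
        s3_client.download_file(bucket, key, DB_PATH)

def lookup(item_id):
    # Read-only connection against the locally cached, pre-indexed file.
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT payload FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()
```

On a warm invocation the whole lookup is local, which is where the "blindingly fast" part comes from.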
I'm excited that work is going on that might make such solutions even faster; I think it's a very useful pattern for distributed computing.
We have that issue at work, though I solved it by including the sqlite database within the container image that we use. We then deploy the new container image (with the same code as before, but with a revised database file) at most every fifteen minutes.
This gives you an atomic point at which you are 100% confident all instances are using the same database, and by provisioning concurrency, you can also avoid a "thundering herd" of instances all fetching from the file on S3 at startup (which can otherwise lead to throttling).
Of course, that's only feasible if it's acceptable that your data can be stale for some number of minutes, but if you're caching the way you are, and periodically checking S3 for an updated database, it probably is.
> "thundering herd" of instances all fetching from the file on S3 at startup (which can otherwise lead to throttling).
Have any "thundering herd" problems with S3, including throttling, actually been seen?
I think S3 is advertised to have no concurrent connection limit, and to support at least 5,500 GETs per second per "prefix" (though I'm not sure exactly what that means in practice). I don't think S3 ever applies intentional throttling, although of course if you exceed its capacity to deliver data you will see "natural" throttling.
Do you have a fleet big enough that you might be exceeding those limits, or have people experienced problems even well under these limits, or is it just precautionary?
Sorry--the throttling was at the AWS Lambda layer, not S3. We were being throttled because we'd deploy a new container image and suddenly thousands of new containers were all simultaneously trying to pull the database file from S3.
We aim to return a response in the single digit milliseconds and sometimes get tens of thousands of requests per second, so even if it only takes a second or two to fetch that file from S3, the request isn't getting served while it's happening, and new requests are coming in.
You very quickly hit your Lambda concurrency limit and get throttled just waiting for your instances to fetch the file, even though logically you're doing exactly what you planned to.
By having the file exist already in the container image, you lean on AWS's existing tools for a phased rollout to replace portions of your deployment at a time, and every one is responding in single digit milliseconds from its very first request.
EDIT: The same technique could be applied for other container management systems, but for stuff like Kubernetes or ECS, it might be simpler to use OP's method with a readiness check that only returns true if you fetched the file successfully. And maybe some other logic to do something if your file gets too stale, or you're failing to fetch updates for some reason.
I actually versioned my database file - I had a small metadata table with version number and creation time.
Then in the output from each of my other functions, I included the database version number. So all my output could be subsequently normalized by re-running the same input versus an arbitrary version of the database file.
Have you looked at the user_version pragma? I've been able to avoid use of special metadata tables with this.
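For reference, the pragma reads and writes a 32-bit integer stored in the database file header itself, so no table is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stamp the build version directly into the file header.
# (PRAGMA values can't be parameterized; the literal is intentional.)
conn.execute("PRAGMA user_version = 42")
version = conn.execute("PRAGMA user_version").fetchone()[0]
print(version)  # 42
```

A creation timestamp would still need a metadata table (or a second convention like `application_id`), since `user_version` holds only a single integer.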
> One of the things that is not immediately obvious from Lambda is that it has a local /tmp directory that you can read from and write to.
The other big thing a lot of people don't know about Python on Lambda is that your global scope is also persisted for that execution context's lifetime like /tmp is. I ran into issues at one point with Lambdas that processed a high volume of data getting intermittent errors connecting to S3. An AWS engineer told me to cache my boto3 stuff (session, client, resources, etc.) in the global namespace, and that solved the problem overnight.
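The underlying pattern is just module-level (global) caching; a generic sketch, where in a real Lambda the factory would be something like `lambda: boto3.client("s3")` (hypothetical, since the exact boto3 setup isn't shown above):

```python
# Cache an expensive client in the module's global scope so that warm
# invocations of the same execution context reuse it instead of
# rebuilding it (and its connection pool) on every request.
_client = None

def get_client(factory):
    global _client
    if _client is None:       # only built on a cold start
        _client = factory()   # e.g. boto3.client("s3") in a real handler
    return _client
```

Anything assigned at module scope (sessions, clients, parsed config) gets the same treatment for free, since the module itself is only imported once per execution context.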
> Now my functions just downloaded the file from S3, if it was newer than the local copy
if you have strong consistency requirements, this doesn't work. synchronizing clocks reliably between different servers is surprisingly hard. you might end up working with stale data. might work for use cases that can accept eventual consistency.
If you have strong consistency requirements, then it doesn't work by the very nature of making multiple copies of the database. Even if the clocks are perfect. (Though the clocks are probably close enough that it doesn't matter.)
One of the announcements from AWS this year at Re:invent is that they now can guarantee that the instances clocks are synced within microseconds of each other. Close enough that you can rely on it for distributed timekeeping.
Wouldn't ETag version numbers also work? Or just having a .json with version metadata next to the db blob? No need to sync clocks. Just GET the small db-ver.json and compare version details?
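A sketch of the version-file idea (the metadata layout here is hypothetical): each function does one tiny GET for `db-ver.json` and only pulls the big blob when the version number has advanced, with no clock comparison anywhere:

```python
import json
import os

def needs_refresh(local_meta_path, remote_meta_bytes):
    """Decide whether to re-download the big db blob.

    remote_meta_bytes is the body of db-ver.json fetched from S3,
    assumed (hypothetically) to look like {"version": 7, "created": "..."}.
    """
    remote = json.loads(remote_meta_bytes)
    if not os.path.exists(local_meta_path):
        return True  # cold start: nothing cached yet
    with open(local_meta_path) as f:
        local = json.load(f)
    return remote["version"] > local["version"]
```

Comparing a monotonically increasing version sidesteps the new -> old -> new problem entirely, which an ETag (being opaque) can't guarantee on its own.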
I versioned the databases and my use case didn’t require strong consistency.
I don't really know if that matters for this use case. Just by the very nature of source_data -> processing -> dest_data taking nonzero time, anything consuming dest_data must already be tolerant of some amount of lag. And the way it's coded guarantees you can never observe dest_data going new -> old -> new.
It is so painful to see so many people make wrong use of S3.
Is there a list of correct uses of s3 we can all follow?
Yep. Right here. [0].
Generally people ignore the per-PUT and per-GET pricing on S3, along with the higher latency since it's a "global" service. If your objects are small, you almost always benefit from using DynamoDB, as the GET pricing and latency are far more favorable, as long as you don't mind the region dependency or a multi-region setup.
[0]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...
storing massive amounts of unstructured data
One small comment: it may be worth disclaiming that one of the two cited researchers is the author's boss.
It's a small detail, but I mistakenly thought the author and the researchers were unrelated until I read a bit more
"The benefits become noticeable only at p999 onwards; for p90 and p99, the performance is almost the same as SQLite."
I hate to be a hater, and I love sqlite and optimizations, but this is true.
The benchmark seems a bit weird. Fetch 100 results from a table with no filtering, sorting, or anything? That feels like the IO is going to be really small anyway.
They compare threads and coroutines for Limbo. Threads have much worse p90 latencies since they context switch... I'm not sure they can draw any conclusions except that coroutines are faster (of course).
I went down a rabbit hole one week trying to figure out if there was a simple pathway to making a JSON-like format that was just a strict subset of SQLite file format. I figured for read-only workloads, like edge networking situations, that this might be useful.
There's a lot of arbitrariness to the file format though that made me quickly lose steam. But maybe someone with a more complementary form of stubbornness than mine could pull it off.
One of the nice things about SQLite is its very extensive test suite. The question is whether the rewrite will have something similar, or get similar testing, especially if it uses fast but hard-to-write and potentially buggy features like io_uring.
^^^ - this was my first reaction too. I wonder how they would ensure the same level of quality (e.g. not just safe code due to Rust)
This is a great article.
There was a previous attempt to bring async io to Postgres, but sadly it went dormant: https://commitfest.postgresql.org/34/3316/
A more recent proposal was to make it possible to swap out the storage manager for a custom one without having to fork the codebase. I.e. extensions can provide an alternative. https://commitfest.postgresql.org/49/4428/
This would allow for custom ones that do async IO to any custom storage layer.
There are a lot of interested parties in the new proposal (it's come out of Neon, as they run a fork with a custom storage manager). With the move to separate compute from storage this becomes something many Postgres orgs will want to be able to do.
A change of core to use async io becomes slightly less relevant when you can swap out the whole storage manager.
(Note that the storage manager only handles pages in the heap tables, not the WAL. There is more exploration needed there to make the WAL extendable/replaceable)
So silly question - if i understand right, the idea is you can do other stuff while i/o is working async.
When working on a database, don't you want to wait for the transaction to complete before continuing on? How does this affect durability of transactions? Or do i just have the wrong mental model for this.
I think the OP is about a runtime that runs hundreds of programs concurrently. When one program is waiting for a transaction other programs can execute.
From the paper it looks like this is for read heavy workloads (testing write performance is "future work") and I think for network file systems which will add latency.
The complex thing with a transactional db is that many concurrent transactions (should be) executed simultaneously, and that's a mix of single-query transactions and ones that load 1 million rows.
The sqlite model is that only one write transaction can be run at a time. That's kind of a defining trade-off because it allows simplifying locking.
I've seen this show before. Let's async all the things IO and not pay attention to database integrity and reliably fsync'ing with storage. I look forward to drh's rebuttal.
IIUC this is only about read performance. It's totally fine to async all your reads as long as (like SQLite does) you have a Reader-Writer lock and verify integrity properly on writes.
Jepsen will have interesting things to say as well.
sqlite is open source, but an important test harness is not. How does any alternative ensure compatibility?
I argue it's not Open Source (Freedom, not Free Beer) because PRs are locked and only Hipp and close contributors can merge code. It's openly developed, but not by the community.
You can certainly argue that, but that's not what Open Source or Free Software has ever been. It's about your freedoms as a user; you are always free to fork with a different model. I think the expectation of "open contributions" is quite damaging, to the point where people/organizations are hesitant to release their software as open source at all.
That's not what Open Source means. The development team not being willing to review your pull requests does not limit your freedom to use sqlite in any way.
This is a case of you deciding that open source means something which it does not, never has, and will not mean.
I consider this an empty exercise, but if it pleases you to use language that way, no one can stop you.
sqlite is actually public domain. https://sqlite.org/copyright.html. This is also the reason why they are closed contribution.
It's a strange combination in the free software world, but I'm grateful for it.
I wonder if using a different allocator in SQLite (https://www.sqlite.org/malloc.html) would improve performance in their workload to a greater degree than any amount of Rust or io_uring.
I can understand how io_uring increases server utilization, but I fail to see how it will make any individual query faster.
- An "individual query" can be a very complex, Turing-complete computer program. A single query may do more than one IO operation, e.g. read or write more than one database page. io_uring and an async IO strategy would allow this work to occur concurrently.
- Even if no new op-codes are introduced and the design is basically exactly the same, io_uring could allow some amortization of syscall overhead. Doing (N ring-buffer prepares + N/10 syscalls) instead of (N syscalls) will improve your straight-line speed.
I wonder why Limbo has an installer script and isn't just `cargo install limbo`
Update: Checked out the script and it seems to just be for convenience and maybe compatibility with OSes that Cargo can compile for but not run on.
Seeing a curl pipe script makes me worry it's going to ask for odd permissions, if I don't also see something simpler like a binary download or cargo install. There is a zip for Windows so maybe the script is just for getting the binary.
Just this weekend I had the perfect problem for sqlite, unfortunately 200MB and above it became unwieldy.
I’d like to hear more about this
Are we sure edge computing providers have io_uring enabled? It is disabled in inter alia, ChromeOS and Android, because it's been a significant source of vulnerabilities. Seems deadly in a multi tenant environment.
Not an expert at all: I wonder if getting the perf requires trading-off some (advertised) safety. IO uring has been noted to be confusing with async in Rust, https://news.ycombinator.com/item?id=41992975
I'm reminded of how Confluent advertised Kafka as a database. They quietly externalized key guarantees of an RDBMS onto their customers, who were then saddled with implementing those guarantees in application level logic. By obscuring the trade-offs, Confluent made developers feel they could have their cake and eat it too.
IMHO, most of io_uring's performance should come from reducing transition between kernel and userspace. There doesn't need to be a safety tradeoff there (although in practice, there have been several safety issues). There may be a trade off against having a simple concurrency model; without io_uring you can't really request an accept and not handle the response that comes back, because a syscall is necessarily synchronous from the point of view of the caller; the calling thread can't continue until it gets a response, even if that's EINTR.
Out of my league / knowledge, but a tidbit that you might understand better: last time I mentioned this, someone said something about SELinux can't do security checks and implied it's some fundamental mismatch, rather than some work SELinux can do
Well I'm a little outside my league on SELinux, but as I understand it, SELinux a way to limit syscalls by configuration.
io_uring operations are similar to syscalls, but not exactly the same thing, so if I've read up correctly, I think SELinux originally may not have covered io_uring operations --- if the config allowed you to use io_uring, you could bypass SELinux restrictions, and the easiest/safest thing to do for that is to restrict io_uring. I think this may have been resolved; it looks like there was work on SELinux controls for io_uring back in 2021 [1] that got merged, but I haven't really followed this. There's also the issue that what happened in Linus's tree in 2021 doesn't necessarily reflect what's in common kernels in 2024; some distributions are built from ancient branches.
Based on the language in the pull request, I would think you'd end up needing to have near duplicate policy entries for regular syscalls and io_uring operations, which sounds tedious (but the whole SELinux thing sounds tedious to me, so I'm maybe not the right person to ask :D )
[1] https://lore.kernel.org/lkml/CAHC9VhRJ=fHzMHM6tt8JqkZa4bf0h7...
Their goal is to run this on their own cloud.
Despite their lofty claims about community building, their projects are very much about forwarding their use case.
Given that SQLite is public domain, they're not required to give anything back. So, it's very cool that they're making parts of their tech FOSS.
But I've yet to see anything coming from them that isn't “just because we need it, and SQLite wouldn't do it for us.”
There's little concern about making things useful to others, and very little community consensus about any of it.
I think you've taken the most cynical view possible.
SQLite is open source but not open contribution. So if they "won't do it for us" and "we need it", what else are they supposed to do? They're offering SQLite in the cloud, obviously they need to offer a streaming backup solution. Is there something wrong in building that?
Alternatively, do you want them to reimplement features in SQLite already built out by the SQLite team?
Really sounds like you're complaining about getting an open source, MIT licensed, open contribution bit of software for free that you're under no obligation to use. And here you are complaining about "community consensus".
The problem is that SQLite already exists, and is public domain. It's a fantastic piece of software engineering that has had a lot of time and effort put into making it great.
The pitch for these SQLite alternatives is:
- SQLite is public domain so there's no problem with us rewriting it
- We're going to rewrite it in Rust because that's going to make it inherently better (and don't question this)
- We're going to MIT license our rewrite because there's various reasons why people would rather use MIT-licensed code than public domain code (but SQLite being public domain means we get to do all of this rewriting and re-licensing to begin with)
- Also we're going to extend SQLite to be “cloud-capable” because that's our business use-case, even though it's completely at odds with SQLite's intended use-case
- Also we're going to foster a “community” around our rewrite(-in-progress) (because people still think this is something desirable for some reason, as though a nonzero part of SQLite's greatness is that it operates entirely without having to deal with “nerd drama” that such “communities” inevitably breed)
- Also, we replaced that pesky, unsavory-to-our-sensibilities “Code of Ethics” with a bog-standard “Code of Conduct”—because, again: “community”
- But we're not going to even attempt to replicate the rigorous, arduous, and closed-source testing that goes into developing SQLite (which really makes up the bulk of its engineering effort)
- We've made some progress toward all of this, but it's nowhere near done yet
- But we're announcing it now anyway, because “hype” and “community” and “modern”
- Also, look at our microbenchmarks that show that our unfinished SQLite rewrite is already faster than SQLite (because we haven't reimplemented everything from SQLite yet) (and also we don't plan to reimplement everything from SQLite anyway)
I find it really odd that I can only seem to find criticism of any of this here on HN, and in a couple of reddit threads. It's kind of like when there was that Godot controversy a while back, and some people made a fork and announced it, despite being identical aside from the Code of Conduct. Merely announcing a fork/rewrite of existing open-source software as a means of "building hype" and “creating a community”, while benefiting from untold man-hours of work done by others, strikes me personally as rather untoward behavior—regardless of whether the reasoning behind said fork/rewrite is ideological (Redot) or business (libSQL/“Limbo”).
Software—especially software used to create other software—should be lauded for its engineering efforts, rather than its ability to “build hype” or “build and foster online ‘community’ ‘engagement’”. If SQLite was abandonware, then it would be a somewhat different story, but SQLite is an amazing piece of software—perhaps the most widely-used on the planet—that is still being actively developed and supported. So, piggybacking on its success, without having concrete, demonstrable improvements to its functionality, comes across as… well, I can't think of another polite, applicable adjective than “untoward”.
I may be able to shed some light.
It seems they proposed a way to resolve the contradiction I raised (io_uring isn't supported on cloud providers) with the ground reality in the company's blog post.
Namely, that:
* it is intended for edge function deployment.
* the paper they mention that is informing their decision to rewrite is based on perf improvements in the longtail by using io_uring.
I framed it as "Are we sure...?", but truth is, I know providers don't have io_uring support. This is relatively straightforward to derive from the fact that edge functions are well-known to be multi-tenant.
A cursory search shows unresolved AWS tickets re: this, and multiple announcements from Google about how it has been disabled in gCloud.
Thus, it is likely they understood I was framing it politely, and that there was a clear contradiction here, hence their reply to me, raising a potential resolution for that contradiction, a resolution I hadn't considered.
I don't see anything complaining, or bringing up, or implicitly denying, all the other stuff ("open source", "MIT licensed", "open contribution", "software for free" "under no obligation to use."). In fact, they explicitly indicated they completely agree with that view ("it's very cool that they're making parts of their tech FOSS.")
Of course they are scratching their own itch, so to speak. That's what companies do. I think the fact that they are doing so much in the open is itself an indication of good stewardship. I'm not sure what else they would do or release that they didn't need internally. For that matter, I'm not really aware of many significant contributions to FLOSS at all that aren't initially intended for company use; that's kinda how it works. Where I'm surprised here is how much secret sauce Turso is sharing at all.
I have no problem with them scratching their itch. That's par for the course.
I'm salty about them describing the SQLite licensing, development model, and code of ethics as almost toxic, setting up a separate entity with a website and a manifesto promising to do better, and then folding “libSQL into the Turso family” within the year.
They forked, played politics, added a few features (with some ill-considered incompatibilities), and properly documented zero of them.
And I'm salty because I'm actually interested in some of those features, and they're impossible to use without proper documentation.
I've had much better luck interacting with SQLite developers in the SQLite forum.
Anything new is going to have people poking and prodding at it. It doesn't mean that the concept is 'bad'.
Linux has desperately needed an interface like io_uring for decades. Async IO was in Windows NT 3.1 in 1993.
Might've replied to the wrong comment: I don't think io_uring is bad, and the comment doesn't contain 'bad', and I certainly don't think async IO is bad :)
Amazon runs every Lambda function in its own microVM.
So... did they talk to the SQLite maintainer to see how much of this can be taken on board? Because it seems weird to omit that if they did, and it seems even weirder if they didn't after benchmarking showed two orders of magnitude improvement.
(Even if that information should only be a line item in the paper, I don't see one... and a post _about_ the paper should definitely have something to link to?)
They're rewriting SQLite. They're going to put their effort into that surely? Also SQLite explicitly state that they do not accept outside contributions, so there's no point trying.
It is not quite correct to say that the sqlite project does not accept outside contributions at all. The web site says "the project does not accept patches from people who have not submitted an affidavit dedicating their contribution into the public domain."
Much better framing than the previous "yet another library rewritten in Rust"
> "However, the authors argue that KV doesn’t suit all problem domains. Mapping table-like data into a KV model leads to poor developer experience and (de)serialization costs. SQL would be much better, and SQLite being embedded solves this—it can be directly embedded in the serverless runtime."
The lengths people will go to so that they can use SQL never cease to astound me.
Relations are one of the most efficient and flexible ways to represent arbitrary graphs.
In my experience, everyone goes to incredible lengths to avoid SQL, in ignorance of this fact.
They store (key, value) tables that they then extract into an object graph.
Relations are cool, but SQL DBs either prohibit or make it hard to present relations inside relations, which is one of the most common ways of structuring data in everyday programming life. You can see people suggesting writing SQL functions that convert rows to json or using ORM simply to query a one-to-many relationship, that's crazy: https://stackoverflow.com/questions/54601529/efficiently-map...
Any tool can be used incorrectly...
I'm not sure what relations in relations means. Do you just mean M:N?
I mean 'tables' inside 'tables', 0NF. If I have a list of restaurants with their reviews, naturally, I have a table of restaurants, each of which contains a table of reviews. If I have a table of nested comments, naturally, I have a table of comments, each of which recursively contains a table of child comments.
Those are some of the most common use cases for data presentation, and SQL-based DBs are not flexible enough to handle them in a straightforward way.
No. This is binding your data structure to a single access pattern. You can get all the reviews for a restaurant. Now what if you want all the reviews left by a particular user?
I invite you to read the original relational database paper which addresses this exact use case: A Relational Model of Data for Large Shared Data Banks
Storing such a data structure would be binding to a single access pattern, I agree. But a query to retrieve such a structure is simply convoluted for no reason, to the point that many people resort to hacks like json_agg.
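For what it's worth, the json_agg-style workaround being described looks roughly like this in SQLite (the JSON functions are built into SQLite 3.38+; the table and column names here are made up):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (restaurant_id INTEGER, body TEXT);
    INSERT INTO restaurants VALUES (1, 'Chez Example');
    INSERT INTO reviews VALUES (1, 'great'), (1, 'fine');
""")
# One row per restaurant, with its reviews nested as a JSON array:
# the "table inside a table" shape, reconstructed client-side.
rows = conn.execute("""
    SELECT r.name,
           (SELECT json_group_array(body)
              FROM reviews WHERE restaurant_id = r.id)
      FROM restaurants r
""").fetchall()
name, reviews_json = rows[0]
reviews = json.loads(reviews_json)  # back to a real list on the client
```

It works, but the round-trip through a JSON string for every nested collection is exactly the kind of convolution the comment is complaining about.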
Make a SQL view for the tree and a function in the client application which can read a tree in its linear projected form.
If SQL were flexible, I wouldn’t need to create a whole view for a single query.
> Mapping table-like data into a KV model leads to poor developer experience
This is definitely true in my experience. Unless you are literally storing a hashmap, KV databases are a pain to use directly. I think they're meant to be building blocks for other databases.
Trying to put relational data into K-V store is always going to be a crazy mess. Even if the language used was not SQL, it would still obviously be a good idea to use the right type of database for your problem space.
I mean, if SQL is a good match for the data, embedding a database engine designed to be embedded doesn't seem like too far of a reach?