HNNewShowAskJobs
Built with Tanstack Start
Litestream VFS(fly.io)
349 points by emschwartz a day ago | 80 comments
  • psanforda day ago

    Oh hey this is using my go sqlite vfs module[0]. I love it when I find out some code I wrote is useful to others!

    [0]: https://github.com/psanford/sqlite3vfs

    • benbjohnsona day ago |parent

      It worked great! Thanks for your work on it.

    • fragmedea day ago |parent

      that's all we really want in life.

  • bencorniaa day ago

    > What we’re doing here is instantaneous point-in-time recovery (PITR), expressed simply in SQL and SQLite pragmas.

    > Ever wanted to do a quick query against a prod dataset, but didn’t want to shell into a prod server and fumble with the sqlite3 terminal command like a hacker in an 80s movie? Or needed to do a quick sanity check against yesterday’s data, but without doing a full database restore? Litestream VFS makes that easy. I’m so psyched about how it turned out.

    Man this is cool. I love the unix ethos of Litestream's design. SQLite works as normal and Litestream operates transparently on that process.

  • simonwa day ago

    This is such a clean interface design:

      export LITESTREAM_REPLICA_URL="s3://my-bucket/my.db"
      export AWS_ACCESS_KEY_ID="your-access-key"
      export AWS_SECRET_ACCESS_KEY="your-secret-key"
    
      sqlite3
    
      .load litestream.so
      .open file:///my.db?vfs=litestream
      PRAGMA litestream_time = '5 minutes ago'; 
      select * from sandwich_ratings limit 3;
    • zackifya day ago |parent

      For macos users,

      brew install sqlite3, then change the bottom part:

        /opt/homebrew/opt/sqlite/bin/sqlite3
        .load litestream sqlite3_litestreamvfs_init
        .open file:///my.db?vfs=litestream
      
      you have to manually pass in the init function name
  • zackifya day ago

    This is great... just got it working using bun:sqlite! Just need to have "LITESTREAM_REPLICA_URL" and the key id and secret env vars set when running the script.

      import { Database } from "bun:sqlite";
      Database.setCustomSQLite("/opt/homebrew/opt/sqlite/lib/libsqlite3.dylib");
    
      // Load extension first with a temp db
      const temp = new Database(":memory:");
      temp.loadExtension("/path/to/litestream.dylib", "sqlite3_litestreamvfs_init");
    
      // Now open with litestream VFS
      const db = new Database("file:my.db?vfs=litestream");
    
      const fruits = db.query("SELECT * FROM fruits;").all();
      console.log(fruits);
    • koeng18 hours ago |parent

      Neat! Would this mainly be used for JavaScript servers running bun (ie, not end users)?

    • seigel21 hours ago |parent

      Cool that you got this to work! How did you get the "dylib" location or file.

      • ricardobeat20 hours ago |parent

            brew list sqlite
        
        gives you the installed path, works for any formula.
        • seigel19 hours ago |parent

          Neat. What I wasn't able to find was the dynamic library, just the `litestream` executable. Was there some secret you used for the litestream dylib? Thanks in advance!

          • ricardobeat19 hours ago |parent

            Looks like you need to build it yourself: https://litestream.io/guides/vfs/

            • seigel19 hours ago |parent

              Got it....you beat me to it. I had just figured that part out!! I didn't understand that part. ALSO, and this might be helpful for others, in the releases section, if you expand the little blue link at the bottom that says something like "see the other 35 assets", then the VFS extension will be downloadable from there!

              Thanks for humouring me! :D

      • zackify19 hours ago |parent

        The litestream one, from the litestream github releases page!

    • seigel18 hours ago |parent

      For anyone following this example, one thing to note that I figured out the hard way is that this line from @zackify SHOULD BE HIGHTLIGHTED...

      * "Just need to have "LITESTREAM_REPLICA_URL" and the key id and secret env vars set when running the script"

      ... and that attempting to load the variables using `dotenv` will not work!!

  • indigodaddya day ago

    This is awesome. Especially for sqlite db’s that are read only from a website user perspective. My use case would be an sqlite DB that would live on S3 and get updated by cron or some other task runner/automation means (eg some other facility independent of the website that is using the db), and the website would use litestream vfs and just make use of that “read only” (the website will never change or modify the db) db straightup. Can it be used in this described fashion? Also/if so, how will litestream vfs react to the remote db updating itself within this scenario? Will it be cool with that? Also I’m assuming there is or will be Python modules/integration for doing the needful around Litestream VFS?

    Currently on this app, I have the Python/flask app just refreshing the sqlite db from a Google spreadsheet as the auth source (via dataframe then convert to sqlite) for the sqlite db on a daily scheduled basis done within the app.

    For reference this is the current app: (yes the app is kinda shite but I’m just a sysadmin trying to learn Python!) https://github.com/jgbrwn/my-upc/blob/main/app.py

    • benbjohnsona day ago |parent

      Author here. Litestream VFS will automatically poll for new back up data every second so it keeps itself up to date with any changes made by the original database.

      You don't need any additional code (Python or otherwise) to use the VFS. It will work on the SQLite CLI as is.

      • indigodaddya day ago |parent

        Ok, yeah I think litestream vfs isn’t suitable to do as I described in the intended scenario.

    • Eikona day ago |parent

      Perhaps you would find ZeroFS [0] useful. It works great out the box with SQLite [1] and only depends on S3 as an external service.

      [0] https://github.com/Barre/ZeroFS

      [1] https://github.com/Barre/ZeroFS?#sqlite-performance

      • indigodaddya day ago |parent

        Forgot to say, thanks for posting this, looks quite useful for various projects that have been on my mind. At one point I was looking for a git vfs for Python (I did find one for caddy static serving specifically, but I needed it for Python) but couldn’t find much that wasn’t abandoned—- an s3 vfs might do the trick for a lot of use cases though.

        • Eikona day ago |parent

          My pleasure, this project has been a lot of fun :).

      • indigodaddya day ago |parent

        Yes this approach might be better. Sounds like litestream vfs won’t really do what I wanted in my described scenario

  • ncrucesa day ago

    I also have this implemented and ready to go in my Go SQLite driver: https://github.com/ncruces/go-sqlite3/blob/main/litestream/e...

    Slightly different API (programmatic, no env variables, works with as many databases as you may want), but otherwise, everything should work.

    Note that PRAGMA litestream_time is per connection, so some care is necessary when using a connection pool.

    • apitman5 hours ago |parent

      ncruces/go-sqlite3 was the first thing I thought of when I saw .load litestream.so. That's awesome that you've implemented this. Was it a big lift to make it work with wasm?

  • dkam13 hours ago

    Love the progress being made here. I've been really enjoying learning about another embedded database - DuckDB - the OLAP to SQLite's OLTP.

    DuckDB has a lakehouse extension called "DuckLake" which generates "snapshots" for every transaction and lets you "time travel" through your database. Feels kind of analogous to LiteStream VFS PITR - but it's fascinating to see the nomenclature used for similar features. The OLTP world calls it Point In Time Recovery, while in the OLAP/data lake world, they call it Time Travel and it feels like a first-class feature.

    In SQLite Litestream VFS, you use `PRAGMA litestream_time = ‘5 minutes ago’` ( or a timestamp ) - and in DuckLake, you use `SELECT * FROM tbl AT (VERSION => 3);` ( or a time stamp ).

    DuckDB (unlike SQLite) doesn't allow other processes to read while one process is writing to the same file - all processes get locked out during writes. DuckLake solves this by using an external catalog database (PostgreSQL, MySQL, or SQLite) to coordinate concurrent access across multiple processes, while storing the actual data as Parquet files. It's a clever architecture for "multiplayer DuckDB.” - deliciously dependent on an OLTP to manage their distributed multiple user OLAP. Delta Lake uses uploaded JSON files to manage the metadata skipping the OLTP.

    Another interesting comparison is the Parquet files used in the OLAP world - they’re immutable, column oriented and contain summaries of the content in the footers. LTX seems analogous - they’re immutable, stored on shared storage s3, allowing multiple database readers. No doubt they’re row oriented, being from the OLTP world.

    Parquet files (in DuckLake) can be "merged" together - with DuckLake tracking this in its PostgreSQL/SQLite catalog - and in SQLite Litestream, the LTX files get “compacted” by the Litestream daemon, and read by the LitestreamVFS client. They both use range requests on s3 to retrieve the headers so they can efficiently download only the needed pages.

    Both worlds are converging on immutable files hosted on shared storage + metadata + compaction for handling versioned data.

    I'd love to see more cross-pollination between these projects!

  • chickensonga day ago

    As a sandwich enthusiast, I would like to know more about these sandwich ratings.

  • skybriana day ago

    This sounds pretty cool, but I’m confused about what software being announced. Is there a new release of Litestream?

    • benbjohnsona day ago |parent

      Author here. Yes, Litestream v0.5.3 has been released with a new read-only VFS option: https://github.com/benbjohnson/litestream/releases/tag/v0.5....

      • zackifya day ago |parent

        so how would i connect from a separate machine, i can't figure out from the post or release notes or current litestream website docs, how would i use the extension to do that?

        Edit:

        need to set LITESTREAM_ACCESS_KEY_ID, LITESTREAM_SECRET_ACCESS_KEY, LITESTREAM_REPLICA_URL

        then the module works

    • wima day ago |parent

      I noticed the new release also includes "directory replication support for multi-tenant databases", great addition as well!

  • darintaya day ago

    Does this work with sqlite extensions? If I were using e.g. sqlite-vec or -vss or some other vector search extension would I be able to use litestream to back it up to S3 live, and then litestream-vfs to query it remotely without downloading the whole thing?

    • andersmurphya day ago |parent

      Yeah I wonder how it would work with things like custome application functions etc. But I guess the query is run locally and it's the pages that are fetched from S3? So it just works? That would be awesome.

      I guess there's only one way to find out.

      • JaggerJoa day ago |parent

        because liters works at the lowest level (virtual file system) all other things should just work.

    • ncrucesa day ago |parent

      Yes. (that's all, really)

  • JSR_FDED21 hours ago

    From a Litestream user’s perspective:

    Litestream continues to work as always, making continuous backups to S3.

    Like always, I can restore from those backups to my local system.

    But now I have the option of doing “virtual restores” where I can query a database backup directly on S3.

  • pkhuong20 hours ago

    I remember when Litestream not being a VFS was a plus https://news.ycombinator.com/item?id=29461406 ;)

    • tptacek20 hours ago |parent

      It still is! I don't know how much clearer we could be about this point. :)

      Litestream does not require a VFS to work. It still does all the cool stuff it did before; in fact, it does those things much better now, even without the VFS.

  • dangoodmanUTa day ago

    I'm glad they did this! I've always thought VFS was a better fit for the objectives of Litestream than the original design.

    SQLite VFS is really cool tech, and pretty easy to work with (IMO easier than FUSE).

    I had made a _somewhat similar_ VFS [1] (with a totally different set of guarantees), and it felt pretty magical how it "just worked" with normal SQLite

    [1] https://github.com/danthegoodman1/gRPSQLite

  • orliesaurusa day ago

    Been tinkering with litestream... the read-only VFS is neat but I'm curious about eventual write capabilities... using VFS for distributed DBs could unlock some interesting patterns.

    ALSO I'm thinking about mixing this with object store caching... maybe combining memfs with remote metadata; would love to see more details on performance.

    BUT I might be overthinking it... just excited to see SQLite exploring beyond local files...

    • reactordeva day ago |parent

      Opens up a whole new can of worms. Transactions come to mind. Who would be responsible for coordinating? If two nodes wrote to the table conflicting information at the same time, who wins?

      • benbjohnsona day ago |parent

        Author here. We've done some proof-of-concept work on creating distributed leases using S3. We have some use cases internally where we've considered adding write capabilities to the VFS but we haven't started any work on it yet.

        • reactordeva day ago |parent

          It’s a tricky problem that goes beyond the fs as you know. Since it’s cloud and since it’s distributed, a manager wouldn’t be that far fetched that could issue CRDT like messages across the cluster of nodes to issue a “write” or utilize the sync mechanism you have to propagate a “master” db that you write to (aggregate or designate, either way). I did some work on this on a go based database graph and ended up doing a gossip sync crdt message bus.

          • withinboredom6 hours ago |parent

            If latency isn't important. Something like this could work. For applications where latency or linearizability is important, it probably won't work.

        • hobo_marka day ago |parent

          Is this VFS for read-only databases? Or can I query a database that has a single litestream writer somewhere continously making updates and backing them up to S3?

          • benbjohnsona day ago |parent

            The VFS is read only but it will continuously poll for new updates so if you have a writer somewhere else using regular Litestream then it will pick up those updates automatically.

  • joshstrangea day ago

    I work with many distributed, often offline, hosts with varied levels of internet speeds. Does this do any offline caching? Like if I load a vfs litestream database on one of my nodes and it goes offline can it still query or will it fall over unless the data was recently fetched?

    • benbjohnsona day ago |parent

      Author here. It will cache pages in memory right now but it doesn't do disk-based caching currently. That's a good idea though and probably not too difficult to implement. Feel free to post a feature request as a GitHub issue on the project.

  • honktimea day ago

    I dont fully understand this, would this be useful for scaling sqlite on systems that have really high read needs and a single writer? I thought that was what LiteFS was for, or am i off on that too?

    • tptaceka day ago |parent

      No, you're right: scaling "out" SQLite is what LiteFS is about, and this is about (significantly) improving operational capabilities for single-server (or multi-independent-server, like distributed cache) SQLite deployments.

    • davnicwila day ago |parent

      I have to say I'm similarly not really sure I'm getting it, and I've been following litestream pretty closely over the years (not an expert but certainly familiar).

      I think what we're getting here is a way to just spin up a local shell / app and run arbitrary queries from any point in time over the network without having to sync the full prod database. I guess with LiteFS you would have to do this, or pre-plan to do this, it's not totally on-demand.

      Or said another way, do things locally as though in prod without having to ssh to prod and do it there (if you even can, I guess if 'prod' is just s3 you can't really do this anyway so it's an entirely new capability).

      @benbjohnson is this right? I humbly suggest adding a tl;dr of the main takeaway up top of the post to clarify. Love your work on litestream, thanks for what you do!

      • tptacek21 hours ago |parent

        Your second paragraph is correct, with the added proviso that you can build features in your own application tooling to take advantage of the same capability.

        One reason you're not getting such a clear usage statement at the top of this post is, it's an early feature for a general-purpose capability. I think we might rather get other people's takes about what it's most useful for? There are some obvious use cases, like the one you just identified.

        • davnicwil20 hours ago |parent

          very fair! Thanks for confirming

  • hintoftimea day ago

    Does this mean that I can run an application in K8s via one or many horizontally scaled pods all running off DB in s3? No StatefulSet required?

    • dpedua day ago |parent

      I was doing something similar just the other day and came across sqlite-s3vfs[0]. It is likewise a SQLite VFS that translates IO to S3 api calls. However, that project is only for python and seemingly abandoned. Additionally, if you want more than one writer, you'd have to coordinate that yourself, afaik.

      [0]: https://pypi.org/project/sqlite-s3vfs/

    • benbjohnsona day ago |parent

      Author here. The VFS support right now is currently read only so it's useful for something more like a shared cache of data.

  • born-jrea day ago

    more goodies nice!

    I am going to integrate Litestream into the thing I am going to building[1]. I experimented with a lot of ways, but it turns out there is WebDAV support recently merged, not in the docs.

    [1]: https://github.com/blue-monads/potatoverse

  • dzongaa day ago

    dumb question: can this be used for versioned tables then ? what to see the state of a table 1 hour ago ?

    • benbjohnsona day ago |parent

      Author here. You can query the state of a table from an hour ago with Litestream VFS. It won't give you versioned tables in the sense that every time you update a row that it writes a new version in a table somewhere though.

  • nickpetersona day ago

    Would this work with other object stores or is it s3 specific?

    • tptacek21 hours ago |parent

      Works with other object stores.

  • jauntywundrkinda day ago

    So much fun streaming/sync/cdc stuff happening, all so cool. Having an underlying FUSE driver doing the Change Data Capture is really neat. This looks like such an incredibly lightweight way to remote-connect to sqlite. And to add a sort of exterior transaction management.

    Different use case, but makes me think of sqlite Rewrite-it-it-Rust Turso announcing AgentFS. Here the roles are flipped, sqlite is acting as a file store to back FUSE, to allow watching/transaction-managing the filesystem/what agents are doing. Turso also has a sick CDC system built in, that just writes all changes to a cdc table. Which is related to this whole meta question, of what is happening to my sqlite DB. https://turso.tech/blog/agentfs

    • tptaceka day ago |parent

      Just to be clear, the underlying FUSE thing is LiteFS, not Litestream; nothing described in this post needs a FUSE filesystem, just a small SQLite VFS plugin library.

      • jauntywundrkinda day ago |parent

        Thanks Thomas.

        To just drop the relevant paragraph that addresses my un-clarity/in-correctness (and which is super fun to read):

        > Litestream v0.5 integrates LTX, our SQLite data-shipping file format. Where earlier Litestream blindly shipped whole raw SQLite pages to and from object storage, LTX ships ordered sets of pages. We built LTX for LiteFS, which uses a FUSE filesystem to do transaction-aware replication for unmodified applications, but we’ve spent this year figuring out ways to use LTX in Litestream, without all that FUSE drama.

        • tptaceka day ago |parent

          You got it. Yeah, Ben built LiteFS a year or two ago, which is why he designed LTX. But using LiteFS required people to set up FUSE filesystems, which was too complicated for a lot of people, so Litestream got way more uptake. This past year he's been harvesting all the good stuff from LiteFS that doesn't require FUSE and building it into Litestream.

          The easiest way so far to understand the split between Litestream and LiteFS: Litestream is an operational tool, for backup and restore. LiteFS is a method for doing online leader/follower replica clusters.

  • rekwaha day ago

    Now do this with DuckDB.

    • sundbrya day ago |parent

      Use iceberg tables for that in duckdb

  • itissida day ago

    Really nice. We should have this as an add-on to https://app.codecrafters.io/courses/sqlite/overview It can probably teach one a lot about the value of good replication and data formats.

    If you are not familiar with data systems, havea read DDIA(Designing Data Intensive Applications) Chapter 3. Especially the part on building a database from the ground up — It almost starts with sthing like "Whats the simplest key value store?": `echo`(O(1) write to end of file, super fast) and `grep`(O(n) read, slow) — and then build up all the way to LSMTrees and BTrees. It will all make a lot more sense why this preserves so many of those ideas.

  • petcata day ago

    Are people still trying to shoehorn sqlite to run in a server-side context? I thought that was a fad that everyone gave up on.

    • ashish01a day ago |parent

      I use Litestream for near real-time backups. Does not change how SQLite is used on the server, just a replacement for .backup

    • mhitzaa day ago |parent

      No, it's still pretty cool, easy to use with low operational complexity in low volume read-mostly projects: CMSs, blogs, ecommerce platforms.

      • andersmurphya day ago |parent

        It's got crazy write throughput too if you hold it right.

    • 0xbadcafebeea day ago |parent

      I am a heavy skeptic of this thing, but I can see a good use case for it: S3 I/O, ephemeral compute (1 instance), versioned blobs. The first two allow you to abstract the data away from the compute (flexibility), and the third lets you recover from mistakes or bugs quicker (or do immutable migrations easier).

      I think the devil's in the details though. I expect a high number of unusual bugs due to the novel code, networking, and multiple abstractions. I'd need to trial it for a year before I called it reliable.

    • jtbaylya day ago |parent

      I am. Super simple. Super cheap. Great dev experience. Want to know whether the migration is going to work? Just download the prod db locally and test it. I'm happy.

      • christophilusa day ago |parent

        Works for very small prod databases, I guess.

        • tptaceka day ago |parent

          We use it internally for some rather large databases. It's not database size that matters, it's usage pattern.

        • victorbjorklunda day ago |parent

          What is a very small database? At what size do you think read queries stops working in SQLite?

        • andersmurphya day ago |parent

          Handles billions of rows just fine. Can take you unreasonably far on a single server.

    • jauntywundrkinda day ago |parent

      For things like config management I feel like it makes all the sense in the world. Whomever the primary is can soak some infrequent-ish write-load. Then the whole DB can quickly copy to where it's needed, or, in lite stream VFS 's case, even less needs to be shipped.

    • 9rxa day ago |parent

      People are building DBMSes and, instead of writing the engine from scratch, are choosing an off-the-shelf solution that integrates into a DBMS with ease.

      A better question to ask is why the world needs yet another DBMS, but the reasons are no doubt valid.

    • andersmurphya day ago |parent

      Nope, still going strong.

      https://news.ycombinator.com/item?id=46124205