Tanstack Start | Kuvasz-streamer: open-source CDC for Postgres for low latency replication

Kuvasz-streamer: open-source CDC for Postgres for low latency replication(streamer.kuvasz.io)

129 points by thunderbong 10 months ago | 50 comments

rockwotj 10 months ago
lots of Go based CDC stuff going on these days. Redpanda Connect (formerly benthos) recently added support for Postgres CDC [1] and MySQL is coming soon too [2].
1: https://github.com/redpanda-data/connect/pull/2917
2: https://github.com/redpanda-data/connect/pull/3014
- mebcitto 10 months ago |parent
  From the newish Go-based Postgres CDC tools, I know about:
  * pgstream: https://github.com/xataio/pgstream
  * pg_flo: https://github.com/pgflo/pg_flo
  Are there others? Each of them has slightly different angles and messaging, but it is interesting to see.
  - rockwotj 10 months ago |parent
    https://github.com/artie-labs/reader is another I know of
    - jitl 10 months ago |parent
      No support for streaming PG changes
      - rockwotj 10 months ago |parent
        It’s in the works from my understanding. I helped build the Redpanda Connect one and it’s quite easy to use
- andruby 10 months ago |parent
  Sequinstream is written in Elixir and also pretty recent.
  https://github.com/sequinstream/sequin
  Any reason we're seeing so many CDC tools pop up?
  - jitl 10 months ago |parent
    The status quo tool Debezium is annoying/heavy because it’s a banana that comes attached to the Java, Kafka Connect, Zookeeper jungle - it’s a massive ecosystem and dependency chain you need to buy into. The Kafka clients outside of Java-land I’ve looked are all sketchy - in Node, KafkaJS went in maintained for years, Confluent recently started maintaining rdkafka-based client that’s somehow slower than the pure JS one and breaks every time I try to upgrade it. The Rust Kafka client has months-old issues in the latest release where half the messages go missing and APIs seem to no-op, and any version will SIGSEGV if you hold it wrong - obviously memory unsafe. The official rdkafka Go client depends on system C library package versions “matching up” meaning you often need a newer librdkafka and libsasl which is annoying; the unofficial pure-go one looks decent though.
    Overall the Confluent ecosystem feels targeted at “data engineer” use-cases so if you want to build a reactive product it’s not a great fit. I’m not sure what the performance target is of the Debezium Postgres connector maintainers but I get the sense they’re not ambitious because there’s so little documentation about performance optimization; data ecosystem feels contemporary with “nightly batch job” kind of thing vs product people today who want 0ms latency.
    If you look at backend infrastructure there’s a clear trope of “good idea implemented in Java becomes standard, but re-implementing in $AOT_COMPILED_LANGUAGE gives big advantage:
    - Cassandra -> ScyllaDB
    - Kafka -> RedPanda, …
    - Zookeeper -> ClickHouse Keeper, Consul, etcd, …
    - Debezium -> All these thingies
    There’s also a lot of hype around Postgres right now, so a bit of VC funded Cambrian explosion going on and I think a lot these will die off as a clear winner emerges.
    - rockwotj 10 months ago |parent
      BTW I think most the ecosystem has settled on https://github.com/twmb/franz-go being the best and highest performing kafka client for Golang (purego)
  - nijave 10 months ago |parent
    >Any reason we're seeing so many CDC tools pop up?
    When I looked for something ~1 year ago to dump to S3 (object storage) they all sucked in some way.
    I'm also of the opinion Postgres gives you a pretty "raw" interface with logical replication so a decent amount of building is needed and each person is going to have slightly different requirements/goals.
    I haven't looked recently but hopefully these do a better job handling edge cases like TOASTd values, schema changes, and ideally full load
dewey 10 months ago
I’ve recently looked into tools like that, I have a busy Postgres table that has a lot of updates on one column and it’s overwhelming Debezium.
I’ve tried many things and looked into excluding them from replication with a publication filter but this still causes “events”.
Anyone has some pointers on CDC on busy tables?
- hundredwatt 10 months ago |parent
  That’s one of the cases where query-based CDC may out perform log-based (as long as you don’t care to see every intermediate change that happened to a row between syncs)
- kuvasz-io 10 months ago |parent
  kuvasz-streamer batches updates in a single timed transaction (1 second for example). This is independent of the source transaction.
  I have seen a significant increase in performance with this feature.
mitchbregs 10 months ago
Good to see another Postgres CDC solution. I have used both Debezium and PeerDB before. Currently, I am using PeerDB in my work to replicate data to ClickHouse and have been loving the experience. The pace at which it performs the initial load is impressive, and the replication latency to ClickHouse is just a few seconds. I haven’t tested PeerDB other targets, such as Kafka, where the latency might be lower.
datadeft 10 months ago
Kuvasz is a Hungarian dog breed if anyone is wondering.
- thealch3m1st 10 months ago |parent
  https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExZzJwejljdHJybnp...
- thealch3m1st 10 months ago |parent
  Really awesome and loyal dogs
gregw2 10 months ago
I like the "type=history" mode which can auto-build a slowly changing dimension ("SCD type 2") for you; more CDC solutions should do that: https://streamer.kuvasz.io/streaming-modes/
That said, their implementation is kinda poor since it allows overlapping dates for queries when a row gets updated multiple times per day. When you SQL join to that kind of SCD2 by a given date you can easily get duplicates.
This can be avoided by A) updating old rows to end-date yesterday rather than today, and B) if a row begins and ends on the same day, the start date or end date can be NULL or a hardcoded ancient or far-future end-date, such as having the record from "2023-01-01 to 2023-01-01" instead be "2023-01-01 to 0001-01-01". Those rows won't show up in joins, but the change remains visible/auditable and you do get the last row available for every given date (and only one such row.)
- kuvasz-io 10 months ago |parent
  The date fields are actually timestamps having microsecond accuracy. Maybe this was not clear in the docs.
```
       1 | 12  |  1 | r1   |        | 1900-01-01 00:00:00+00        | 2025-01-07 21:15:49.233384+00 | f
       7 | 12  |  2 | r2   |        | 1900-01-01 00:00:00+00        | 2025-01-07 21:15:49.233384+00 | f
      13 | 12  |  1 | x1   |        | 2025-01-07 21:15:49.233384+00 | 9999-01-01 00:00:00+00        | f
```
  - gregw2 10 months ago |parent
    Ah yeah, makes sense. The docs are misleading.
    The tricky thing with DateTime SCD2s vs Date-only SCD2s is that DateTime SCD2s work for identifying what was true for a given click/transaction/ingest time, but doesn't work for identifying what was the final truth associated for a given "business date" such as an "invoice date". That tends to take an ETL or SQL window functions complexity/performance hit. But with streams/CDC, DateTime SCD2s are the easy+clean thing to implement.
    Do you use the DateTime the message is received on the target system, or some DateTime from the originating journal/WAL-log?
davidarenas 10 months ago
What kind of delivery guarantees does this offer? And does it provide data replay?
Currently evaluating https://sequinstream.com/ which claims sub-200ms latency, but has a lot of extras that I don’t need and a lighter weight alternative would be nice.
- jitl 10 months ago |parent
  I looked at sequin just now and wish they published their benchmark code. I’m curious how they configure Debezium. At 500 change/s, I get ~90ms average latency with my Debezium cluster after fiddling with Debezium options a bunch.
  I don’t love Debezium but I also don’t love Erlang, plenty of us have scars from RabbitMQ’s weird sharp edges…
  - _acco 10 months ago |parent
    Sequin engineer here. We'll publish our benchmark repo soon! Indeed, we're still doing a lot of fiddling with Debezium ourselves to make sure we cover different configurations, deployments, etc.
    The main thing we want to communicate is that we're able to keep up with workloads Debezium can handle.
    (And, re: RabbitMQ, I wouldn't write off a platform based on a single application built on that platform :) )
- __s 10 months ago |parent
  You can get single digit latency with almost anything for CDC to queue if you properly colocate your services & your CDC ingestion is keeping up with postgres slot
  - davidarenas 10 months ago |parent
    Should have specified; this is for tracking bursts of up to 1.5k tps not in the same AZ or even VPC.
- kuvasz-io 10 months ago |parent
  No guarantees on latency as this depends on the hardware.
  Check the load test
  https://kuvasz.io/kuvasz-streamer-load-test/
CAP_NET_ADMIN 10 months ago
Does anyone know a battle-tested tool that would help with (almost)online migrations of postgresql servers to other hosts? I know it can be done by manually, but I'd like to avoid that
- andruby 10 months ago |parent
  PG's internal wal-level replication? primary to a read-replica, and then switch the read to become the primary. You'll have a bit of downtime while you stop connections on the original server, switch the new server to primary, and update your app config to connect to the new server.
  I believe that's a pretty standard way to provide "HA" postgres. (We use Patroni for our HA setup)
  https://github.com/patroni/patroni
  - dikei 10 months ago |parent
    We use the same setup, though we use PGBouncer so after switching primary we just force reconnect all clients from PGBouncer instead.
    The clients will have to retry on-going transactions, but that's a basic fault tolerant requirement anyway.
- nijave 10 months ago |parent
  In my experience, everyone's setup is slightly different so it's hard to find a generic solution. pgcopydb is pretty good
  I can't remember the name but I saw a Ruby based tool on Hacker News a few months ago that'd automate logical rep setup and failover for you
- hans_castorp 10 months ago |parent
  pglogical can do that (or at least minimize the manual steps as much as possible)
  I am not entirely sure, but I think CloudNativePG (a Kubernetes operator) can also be used for that.
zsoltkacsandi 10 months ago
From the domain I suspect it’s a fellow Hungarian developer?
- _zoltan_10 months ago |parent
  I was surprised to learn: no. so why pick a Hungarian and? A mystery.
  - kuvasz-io 10 months ago |parent
    The mystery shall remain
boomskats 10 months ago
Are there any benchmarks documented for this, possibly comparing it to alternatives like pgstream or debezium? The "Test report" link on the website[0] returns a 404.
[0]: https://streamer.kuvasz.io/report.html
- kuvasz-io 10 months ago |parent
  The load test report is here.
  https://kuvasz.io/kuvasz-streamer-load-test/
  - boomskats 10 months ago |parent
    Thanks!
mdaniel 10 months ago
One will want to be cognizant of its AGPLv3 license https://github.com/kuvasz-io/kuvasz-streamer/blob/v1.19.2/LI...
- withinboredom 10 months ago |parent
  This is becoming pretty standard for these types of projects that want to be open source but don't want to end up finding out their product got sucked up by AWS and friends. Usually, they offer a commercial license as well.
  - immibis 10 months ago |parent
    It should have always been standard, but corps like AWS managed to convince developers to give them free labour, for a while.
- 1oooqooq 10 months ago |parent
  why would you want to keep infra helper code private anyway? it's this core for your business?
rbanffy 10 months ago
When I was a teenager, my uncle had a Kuvasz. She was the most loving dog I ever met, and would easily win a Miss Congeniality contest against any Golden Retriever.
arcticfox 10 months ago
Looks nice! How does this compare to PeerDB?
rednafi 10 months ago
Love seeing a CDC tool that doesn't aim to replicate the universe and fails to do any of it correctly.
olavgg 10 months ago
How does it compare versus Debezium?
- dikei 10 months ago |parent
  I'd be more similar to Debezium-server that runs everything in the same process, than regular KafkaConnect-based Debezium.
  However, this only does postgres-postgres, so it's a lot more limited compared to Debezium.
sushidev 10 months ago
Seems very useful. This stuff can’t be done already with pg replication?
- jitl 10 months ago |parent
  I think it’s targeted at [many -> one] database consolidation versus Postgres replication which is more suited for [one -> one]. I’m sure you can do [many -> one] with just Postgres and some shenanigans but it’s probably quite Rube Goldberg. Also in Postgres <16 you can’t replica of replica, using a CDC tool to pipe data around you don’t have that restriction.
  - kuvasz-io 10 months ago |parent
    It also takes care of setting up publications, replication slots and the initial sync.
brodouevencode 10 months ago
[flagged]
- 10 months ago |parent
  [deleted]