Great post, but it seems like you still rely on Fabric to run Spark NEE. If you're on AWS or GCP, you should probably not ditch Spark but combine both. DuckDB's gotcha is that it can't scale horizontally (multi-node), unlike Databricks. A single node can get you quite far, though: you can rent 2TB of memory + 20TB of NVMe in AWS, and if you use PySpark, you can run DuckDB through its Spark integration (https://duckdb.org/docs/api/python/spark_api.html) until it no longer scales, then switch to Databricks if you need to scale out. That way, you get the best of both worlds.
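To make the migration path concrete: the integration lets you write PySpark-style code and only swap the import when you outgrow a single node. A rough sketch based on the linked docs (the module is still experimental, so the path may change):

    import pandas as pd
    # Swap this import for `from pyspark.sql import SparkSession` when moving to a real cluster.
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()  # backed by DuckDB, not a JVM

    df = spark.createDataFrame(pd.DataFrame({"age": [34, 45, 23], "name": ["Joan", "Peter", "John"]}))
    df = df.withColumn("location", lit("Seattle"))
    print(df.select(col("age"), col("location")).collect())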
DuckDB on AWS EC2's price-performance ratio is 10x that of Databricks and Snowflake with its native file format, so it's a better deal if you're not processing petabyte-level data. That's unsurprising, given that DuckDB operates on a single node (no need for distributed shuffles) and works primarily with NVMe (no use of object stores such as S3 for intermediate data). Thus, it can optimize the workloads much better than the other data warehouses.
If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon / Databricks Unity Catalog thanks to Apache Iceberg, which enables multi-engine support in the same catalog. I'm experimenting with this multi-stack idea with DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql
> A single node can get you quite far, though: you can rent 2TB of memory + 20TB of NVMe in AWS
What I'm a little curious about with these "single node" solutions - is redundancy not a concern with setups like this? Is it assumed that you can just rebuild your data warehouse from some form of "cold" storage if you lose your nvme data?
Exactly. Let's say you have a data pipeline in dbt or your favourite transformation tool. All the source data will be in your cold storage, AWS S3, all the intermediate data (TEMP tables) is written to NVMe, and then the final tables are created in S3 & registered in your catalog. AWS recently released S3 Tables, which aims to make that maintenance easier with Apache Iceberg as well: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
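A rough sketch of that flow with DuckDB (bucket names and paths are made up; S3 credential/secret setup omitted):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")                            # S3 access
    con.execute("SET temp_directory = '/mnt/nvme/tmp'")   # larger-than-memory work spills to NVMe

    # Source data stays in cold storage (S3); intermediate work is local
    con.execute("""
        CREATE TEMP TABLE stage AS
        SELECT * FROM read_parquet('s3://my-bucket/raw/events/*.parquet')
    """)

    # Final table goes back to S3 (and gets registered in the catalog, e.g. as Iceberg / S3 Tables)
    con.execute("""
        COPY (SELECT user_id, count(*) AS n_events FROM stage GROUP BY user_id)
        TO 's3://my-bucket/marts/user_events.parquet' (FORMAT PARQUET)
    """)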
You can use spot instances if you can tolerate more latency, or prefer on-demand instances and keep them warm if you need low latency. Databricks (compute) and Snowflake (warehouses) do that automatically for you, for a premium price.
In these setups compute is ephemeral and decoupled from storage (you would normally use an object storage offering that is actually redundant out-of-the-box), so the 2TB is for working memory + OS + whatever and the 20TB NVMe is purely for maybe local spilling and a local cache so you can save on storage reads.
If a node fails when running a process (e.g. for an external reason not related to your own code or data: like your spot EC2 instance terminating due to high demand), you just run it again. When you're done running your processes, normally the processing node is completely terminated.
tl;dr: you treat them like cattle with a very short lifecycle around data processes. The specifics of resource/process scheduling depend on your data needs.
Or just use Clickhouse.
Clickhouse has excellent performance and it does indeed scale out. Unfortunately, it's hard to deploy and maintain a Clickhouse cluster, and its catalog features are pretty basic compared to Databricks and Snowflake.
I’m sorry what? Clickhouse is crazy easy to deploy. Single node is just a binary. “Cluster” is single binaries on individual nodes plus clickhouse-keeper (the one extra layer of complexity). Altinity and others offer k8s operators or managed solutions to do all this on-prem to make it even easier.
Your data catalog point remains; it doesn't offer anything other than basic SQL describe functionality.
The comparison is with Snowflake and Databricks, which take a button click or a query (https://docs.snowflake.com/en/sql-reference/sql/create-wareh...) to deploy new clusters. No disk, network, or permission setup is needed for each cluster.
I have rarely seen data people familiar with K8s as they mostly use managed services, but feel free to prove me wrong!
I fully disagree with suggesting ridiculously overpriced for-profit SaaS solutions in the context of choosing among open-source tech.
I don't recommend overpriced SaaS solutions; I highlight that there is a lesson to be learned on why they're so popular, and open-source tech could be as simple as them.
That's why I started Universql in the first place; use their interface/protocol as an open-source tech and reduce the compute cost (often ~90% of the total cost of DWHs) by using DuckDB. You still get all the catalog features without paying the premium price.
They are so popular because the decisions are made by management in many companies and they have a good sales team.
Using popularity as justification for anything is a slippery slope.
I agree but I also don't think you can have $3B ARR with only good sales.
Clickhouse also has managed service (https://clickhouse.com/)
There are in fact multiple cloud services for ClickHouse. Aiven and Altinity also offer them, among others.
Disclaimer: I work at Altinity.
I mean there’s clickhouse cloud then…
Clickhouse is open core. Which is a dealbreaker for me.
In my testing, Clickhouse dies with OOM on various queries over large datasets, because its algorithms are inefficient.
I went through this trade off at my last job. I started off migrating my adhoc queries to duckdb directly from delta tables. Over time, I used duckdb enough to do some performance tuning. I found that migrating from Delta to duckdb's native file format provided substantial speed wins.
The author focuses on read/write performance on Delta (makes sense for the scope of the comparison). I think if an engineer is considering switching from Spark to duckdb/polars for their data warehouse, they would likely be open to data formats other than Delta, which is tightly coupled to Spark (and even more so to the closed-source Databricks implementation). In my use case, we saw enough speed wins and cost savings that it made sense to fully migrate our data warehouse to a self-managed DuckDB warehouse using DuckDB's native file format.
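For anyone curious, the migration itself was mostly one statement per table via DuckDB's delta extension. A rough sketch (paths and table names here are made up):

    import duckdb

    con = duckdb.connect("warehouse.duckdb")   # DuckDB's native file format
    con.execute("INSTALL delta")
    con.execute("LOAD delta")

    # Materialize a Delta table into the native format
    con.execute("""
        CREATE OR REPLACE TABLE sales AS
        SELECT * FROM delta_scan('./lake/sales')
    """)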
Thanks for sharing, very interesting!
I'm thinking the same wrt dropping Parquet.
I don't need concurrent writes, which seems to me about the only true caveat DuckDB would have.
Two other questions I am asking myself:
1) Is there an upper file size limit in duckdb's native format where performance might degrade?
2) Are there significant performance degradations/ hard limits if I want to consolidate from X DuckDB's into a single one by programmatically attaching them all and pulling data in via a large `UNION ALL` query?
Or would you use something like Polars to query over N DuckDB's?
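For context on 2), the consolidation I have in mind is roughly this (file and table names are hypothetical):

    import duckdb

    con = duckdb.connect("combined.duckdb")
    for i, path in enumerate(["part1.duckdb", "part2.duckdb", "part3.duckdb"]):
        con.execute(f"ATTACH '{path}' AS src{i} (READ_ONLY)")

    con.execute("""
        CREATE TABLE events AS
        SELECT * FROM src0.events
        UNION ALL SELECT * FROM src1.events
        UNION ALL SELECT * FROM src2.events
    """)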
I can offer my experience with respect to your two questions, though my use case is likely atypical.
1) I haven't personally run into upper size limits to the point of non-linear performance degradation. However, some caveats to that are (a) most of my files are in the range of 2-10 GB with a few topping out near 100 GB, and (b) I am running a single r6gd.metal as the primary interface for this, which has 512 GB of RAM. So, essentially, any one of my files can fit into RAM.
Even given that setup, I will mention that I find myself hand tuning queries a lot more than I was with Spark. Since duckdb is meant to be more lightweight the query optimization engine is less robust.
2) I can't speak too much towards this use case. I haven't had any occasion to query across duckdb files. However, I experimented on top of Delta Lake between duckdb and polars and never really found a true performance case for polars in my (again atypical) set of tests. But definitely worth doing your own benchmarking on your specific use case :)
Thanks for sharing! :)
To state the obvious, Delta is an open standards format which should be widely supported.
Databricks have also bought into Iceberg and will probably lead with that or unify the two in future.
There is an open source Delta that is a very good library. This is not the same as Databricks' implementation and there are at times compatibility issues. For example, by default, if you write a delta table using Databricks' dbr runtimes, that table is not readable by the open source Delta because of the "deletion vectors" optimization that is only accessible within Databricks.
That aside, I was more pointing out that Delta, particularly via a commercial offering, is a data format biased towards Spark in terms of performance, since it is being developed primarily by Databricks as a part of the Spark ecosystem. If you plan to use Delta regardless of your compute engine, it makes perfect sense as a benchmark. However, for certain circumstances, the performance wins could be (in my case were) worth it to switch data formats.
OSS Delta supports deletion vectors. The problem is that the OSS Deltalake (based on delta-rs) Python library does not, and this prevents engines like DuckDB and Polars from writing to such tables. I'm pretty sure DV support has been in OSS Delta since 3.1.
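If interop with those engines matters, my understanding is that you can opt out per table on the Spark/Databricks side, something like this (property and command names as I read them in the Delta docs):

    # Disable deletion vectors on a table so delta-rs / DuckDB / Polars can keep writing to it
    spark.sql("""
        ALTER TABLE my_schema.my_table
        SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')
    """)
    # Tables that already contain deletion vectors also need them purged, e.g.:
    # spark.sql("REORG TABLE my_schema.my_table APPLY (PURGE)")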
Interesting. So what does that look like on disk? Possibly slightly naively I'm imagining a single massive file?
Polars is much more useful if you’re doing complex transformations instead of basic ETL.
Something under appreciated about polars is how easy it is to build a plugin. I recently took a rust crate that reimplemented the h3 geospatial coordinate system, exposed it at as a polars plugin and achieved performance 5X faster than the DuckDB version.
With knowing 0 rust and some help from AI it only took me 2ish days - I can’t imagine doing this in C++ (DuckDB).
Polars is a life saver. We used it in a fairly complex project and it worked very well.
Neat! Really curious how you managed to outperform DuckDB 5x. Do you see yourself maintaining this long term? would love to use polars as my one stop shop + plugins rather than pulling in additional tooling.
Maybe this is telling more of the company I work in, but it is just incomprehensible for me to casually contemplate dumping a generally comparable, installed production capability.
All I think when I read this is, standing up new environments, observability, dev/QA training, change control, data migration, mitigating risks to business continuity, integrating with data sources and sinks, and on and on...
I've got enough headaches already without another one of those projects.
I submitted this because I thought it was a good, high effort post, but I must admit I was surprised by the conclusion. In my experience, admittedly on different workloads, duckdb is both faster and easier to use than spark, and requires significantly less tuning and less complex infrastructure. I've been trying to transition as much as possible over to duckdb.
There are also some interesting points in the following podcast about ease of use and transactional capabilities of duckdb which are easy to overlook (you can skip the first 10 mins): https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb
Of course, if you have truly massive data, you probably still need spark
Thanks, I'll give that a listen. Here is the Apple Podcast link to the same episode: https://podcasts.apple.com/gb/podcast/the-joe-reis-show/id16...
I've also experimented with duckdb whilst on a databricks project, and did also think "we could do this whole thing with duckdb and a large EC2 instance spun up for a few hours a week".
But of course duckdb was new then, and you can't re-architect on a hunch. Thanks for the article.
This guy says "I live and breathe Spark"... I would take the conclusions with a grain of salt.
Author of the blog here: fair point. Pretty much every published benchmark has an agenda that ultimately skews the conclusion. I did my best here to be impartial, i.e. I fully designed the benchmark and each test prior to running code on any engine to mimic typical ELT demands, w/o having the opportunity to optimize for Spark, which I know well.
Do you still use file formats at all in your work?
I'm currently thinking of ditching Parquet altogether, and going all in on DuckDB files.
I don't need concurrent writes, my data would rarely exceed 1TB and if it were, I could still offload to Parquet.
Conceptually I can't see a reason for this not working, but given the novelty of the tech I'm wondering if it'll hold up.
We're still using parquet. So we use the native duckdb format for intermediate processing (during pipeline execution) but the end results are saved out as parquet. This is partly because customers often read the data from other tools (e.g. AWS athena)
I'd be interested in hearing about experiences of using duckdb files though, i can see instances where it could be useful to us
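The hand-off at the end of a pipeline is just a COPY, e.g. (paths made up; in practice it goes to object storage so Athena and friends can read it):

    import duckdb

    con = duckdb.connect("pipeline.duckdb")   # intermediate work lives in the native format
    # ... pipeline steps build final_results ...
    con.execute("""
        COPY (SELECT * FROM final_results)
        TO 'exports/final_results.parquet' (FORMAT PARQUET)
    """)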
My opinion: the high-prevalence of implementations using Spark, Pandas, etc. are mostly driven by (1) people's tendency to work with tools that use APIs they are already familiar with, (2) resume driven development, and/or to a much lesser degree (3) sustainability with regard to future maintainers, versus what may be technically sensible with regard to performance. A decade ago we saw similar articles referencing misapplications of Hadoop/Mapreduce, and today it is Spark as its successor.
Pandas' use of dataframe concepts and APIs was informed by R and a desire to provide something familiar and accessible to R users (i.e. ease of user adoption).
Likewise, when the Spark development community, somewhere around the early 1.x days, began implementing the dataframe abstraction over its original native RDD abstractions, it understood the need to provide a robust Python API similar to the Pandas APIs for accessibility (i.e. ease of user adoption).
At some point those familiar APIs also became a burden, or were not-great to begin with, in several ways and we see new tools emerge like DuckDB and Polars.
However, we now have a non-unique issue where people are learning and applying specific tools versus general problem-solving skills and tradecraft in the related domain (i.e. the common pattern of people with hammers seeing everything as nails). Note all of the "learn these -n- tools/packages to become a great ____ engineer and make xyz dollars" type tutorials and starter-packs on the internet today.
You can use Fugue as a translation layer. https://github.com/fugue-project/fugue
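A rough sketch of the idea, assuming Fugue's transform() API (function and column names are made up):

    import pandas as pd
    from fugue import transform

    def add_fee(df: pd.DataFrame) -> pd.DataFrame:
        df["fee"] = df["amount"] * 0.05
        return df

    data = pd.DataFrame({"amount": [100.0, 250.0, 40.0]})

    # Same function, different execution engines: pandas locally by default,
    # or pass a SparkSession / another backend as `engine` to scale out.
    local = transform(data, add_fee, schema="*,fee:double")
    # distributed = transform(data, add_fee, schema="*,fee:double", engine=spark_session)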
It's always easy to ignorantly criticise technology choices.
But from my experience, in almost all cases it is misguided requirements (e.g. "we want to support 100x the data in 5 years") that drive choices that look bad in hindsight. Not resume-driven development.
And at least in enterprise space having a vendor who can support the technology is just as important as the merits of the technology itself. And vendors tend to spring up from popular, trendy technologies.
It's not binary. Valid uses of a technology don't mean there aren't others using that same technology in a resume-driven manner.
Yeah, I’ve seen enough of both to know that this is genuine.
The problem with resume-driven technology choices is that they are a kind of tech debt that typically costs a lot of cash to operate and, perhaps worse, carries terrible opportunity costs, both of which really do sink businesses.
Premature scaling doesn't. The challenge is not leaving it too late, and even doing that has only led to a very few high-profile business failures.
The author disparages Ibis but I really think that this is short-sighted. Ibis does a great job of mixing SQL with dataframes to perform complex queries, abstracts away a lot of the underlying logic, and allows for query optimization.
Example:
    df = (
        df.mutate(new_column=df.old_column.dosomething())
        .alias('temp_table')
        .sql('SELECT db_only_function(new_column) AS newer_column FROM temp_table')
        .mutate(other_new_column=_.newer_column.do_other_stuff())
    )
It's super flexible and duckdb makes it very performant. The general vice I experience is creating overly complex transforms, but otherwise it's super useful and really easy to mix dataframes and SQL. Finally, it supports pretty much every backend, including pyspark and polars.
Nice write up. I don’t think the comments about duckdb spilling to disk are correct. I believe if you create a temp or persistent db duckdb will spill to disk.
I might have missed it, but the integration of duckdb and the arrow library makes mixing and matching dataframes and sql syntax fairly seamless.
I’m convinced the simplicity of duckdb is worth a performance penalty compared to spark for most workloads. Ime, people struggle with fully utilizing spark.
About spilling to disk, in DuckDB’s docs I see:
> Both persistent and in-memory databases use spilling to disk to facilitate larger-than-memory workloads (i.e., out-of-core-processing).
I don’t have personal experience with it though.
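For reference, the knobs involved look something like this (per the docs, as I read them):

    import duckdb

    con = duckdb.connect("my.duckdb")              # persistent DB: spill files go next to it by default
    con.execute("SET memory_limit = '8GB'")        # cap RAM usage; larger operators spill past this
    con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")  # where the spill files land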
Miles Cole here… thx for the correction, another reader just noted this as well. I'll get this corrected tomorrow and possibly retest after verifying I have spill set up. Thx!
Yeah I'd never go for spark again, all of its use cases are better handled with either DuckDB or Ray (or combination of both).
I thought Ray was a reinforcement learning platform? Can you elaborate on how it is a replacement for Spark?
Ray is a distributed computing framework that has a fast scheduler, zero-copy data sharing between tasks (shared memory), and stateful "actors", making it great for RL, but it's more general than that.
I'd recommend checking out their architecture whitepaper: https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-l...
Imagine spark without the JVM baggage and with no need to spill to disk / serialize between steps unless it's necessary.
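A minimal sketch of the two core primitives mentioned above (tasks and actors):

    import ray

    ray.init()

    @ray.remote
    def square(x):              # a stateless task, scheduled across the cluster
        return x * x

    @ray.remote
    class Counter:              # a stateful actor
        def __init__(self):
            self.total = 0
        def add(self, x):
            self.total += x
            return self.total

    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))                  # [0, 1, 4, 9]

    counter = Counter.remote()
    print(ray.get(counter.add.remote(10)))   # 10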
Ray has some really unfortunate nomenclature/branding in my opinion. Ray is, itself, a framework for distributed computing, on top of which they have built numerous applied platforms such as Ray Data, Ray Train, Ray Serve, etc. This includes your reinforcement learning platform, Ray RLlib.
I think they’ve diluted the “brand” a bit with this approach and would be better off sticking with “Ray” for the distributed computing and spinning up the others as something completely separate, but that’s just me.
I do wonder if some new tech adoption will actually be slowed due to the prevalence of LLM assisted coding?
That is - all these code assistants are going to be 10x as useful on spark/pandas as they would be on duckdb/polars, due to the age of the former and the continued rate of change in the latter.
Yeah, I've noticed this a lot with ibis. Coming from pandas, ibis was a bit of a learning curve and the LLMs were very bad at generating correct code. LLM-generated pandas code is generally copy-paste-run and it often just works correctly. Ibis, however, was usually completely wrong or just slightly tweaked pandas code.
Makes me wonder if each new technology project will need to make an effort to synthesize data to allow for "indexing" by LLMs. It will be like a new form of SEO.
> My name is Miles, I’m a Principal Program Manager at Microsoft. While a Spark specialist by role
JFYI. I think the article itself is pretty unbiased but I feel like its worth putting this disclaimer for the author.
I was a little confused as why he couldn't run Spark locally because he insisted on the "Fabric Spark runtime".
Why? I don't understand what working at Microsoft has to do with this.
And I found the entire article very informative with little room for bias.
You know there are always going to be comments from those who bring the low-value criticism of sleuthing out someone's employment and making implications. Being transparent can assuage such discussion.
The author is pretty transparent. The topic sentence of the second paragraph:
"Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark."
The blog post echoes my experience that duckDB just works (due to superior disk spilling capabilities) and polars OOMs a lot.
My biggest issue with Polars (don’t shoot me) was the poor LLM support. 4o at least seems to frequently get confused with Pandas syntax or logic.
This pushed me to finally investigate the DuckDB hype, and it’s hard to believe I can just write SQL and it just works.
The API has mostly stabilized and at this point other than some minor errors (“groupby” vs “group_by”) LLM’s seem to do pretty well with it. Personally, I’m glad they made the breaking changes they did for the long-term health of the library.
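For reference, the current spelling:

    import polars as pl

    df = pl.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
    df.group_by("key").agg(pl.col("value").sum())   # formerly df.groupby(...)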
groupby and group_by is a big problem for polars with 4o.
Problem is you're using 4o. Use 3.5 Sonnet, it's much better.
Would love to see Lake Sail overtake Spark, so we could generally dodge tuning the JVM for big Spark jobs.
Have you used it? How does it compare to other Rust-based ETL frameworks like Arroyo?
I haven't personally used it and I believe it's still relatively early in development, but it provides a pyspark compatible python API, so the idea is you should be able to migrate your Spark/Databricks workloads easily. Arroyo seems to be focused on Streaming. Lake Sail targets streaming and batch workloads.
Good write up. The only real bias I can detect is that the author seems to conflate their (lack of) familiarity with ease of use. I bet if they spent a few months using DuckDB and Polars on a daily basis, they might find some of the tasks just as easy or easier to implement.
One thing that's not clear to me about the 'use duckdb instead' proposals is how to orchestrate the batches. In databricks/spark there are two components:
- Auto Loader / cloudFiles. It can attach to a blob store of e.g. CSV or JSON files and serve them as batches. As new files come in, you get batches containing only the new files.
- Structured Streaming and its checkpoints. It keeps track across runs of how far into the source it has read (including cloudFiles sources), and it's easy to either continue the job with only the new data, or delete the checkpoint and rebuild everything.
How can you do something similar with duckdb? If you have e.g a growing blob store of csv/avro/json files? Just read everything every day? Create some homegrown setup?
I guess what I describe above is independent of the actual compute library, you could use any transformation library to do the actual batches (and with foreachbatch you can actually use duckdb in spark like this).
You'd typically go for an orchestrator like Airflow or Kestra, which gives you orchestration state.
This will enable you to run incremental batches or do structured backfilling.
Another option would be to use a data loading framework like dlt, which will also do the same but even more light weight (without the orchestration part, you write your batch flow logic in it).
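And if you do want a homegrown setup instead of an orchestrator, the minimal version is just tracking processed files yourself, roughly like this (paths and table names made up; assumes the target table already exists):

    import glob
    import duckdb

    con = duckdb.connect("warehouse.duckdb")
    con.execute("CREATE TABLE IF NOT EXISTS _processed_files (path TEXT PRIMARY KEY)")

    landed = set(glob.glob("/data/landing/*.json"))   # or list an S3 prefix instead
    done = {row[0] for row in con.execute("SELECT path FROM _processed_files").fetchall()}

    for path in sorted(landed - done):
        con.execute(f"INSERT INTO events SELECT * FROM read_json_auto('{path}')")
        con.execute("INSERT INTO _processed_files VALUES (?)", [path])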
I like Spark a lot. When I was doing a lot of data work, DataBricks was a great tool for a team that wasn’t elbow deep into data engineering to be able to get a lot of stuff done.
DuckDB is fantastic, though. I’ve never really built big data streaming situations so I can really accomplish anything I’ve needed with DuckDB. I’m not sure about building full data pipelines with it. Any time I’ve tried, it feels a little “duck-tapey” but the ecosystem has matured tremendously in the past couple years.
Polars never got a lot of love from me, though I love what they’re doing. I used to do a lot of work in pandas and python but I kind of moved onto greener pastures. I really just prefer any kind of ETL work in SQL.
Compute was kind of always secondary to developer experience for me. It kills me to say this but my favorite tool for data exploration is still PowerBI. If I have a strange CSV it’s tough to beat dragging it into a BI tool and exploring it that way. I’d love something like DuckDB Harlequin but for BI / Data Visualization. I don’t really love all the SaaS BI platforms I’ve explored. I did really like Plotly.
Totally open to hearing other folks’ experiences or suggestions. Nothing here is an indictment of any particular tool, just my own ADD and the pressure of needing to ship.
I love using the Malloy vscode extension to explore .csv and .parquet. Malloy embeds duckdb - and is a fresh take on analytic SQL. https://www.malloydata.dev/
Another alternative to consider is https://www.getdaft.io/ . AFAIU it is a more direct competitor to Spark (distributed mode).
Miles Cole here: I’d love to see Daft on Ray become more widely used. Same Dataframe API and run it in either single or multi-machine mode. The only thing I don’t love about it today is that their marketing is a bit misleading. Daft is distributed VIA Ray, Daft itself is not distributed.
Hey, I'm one of the developers of Daft :)
Thanks for the feedback on marketing! Daft is indeed distributed using Ray, but to do so involves Daft being architected very carefully for distributed computing (e.g. using map/reduce paradigms).
Ray fulfills almost a Kubernetes-like role for us in terms of orchestration/scheduling (admittedly it does quite a bit more as well especially in the area of data movement). But yes the technologies are very complementary!
Interesting and well-written article. Thanks to the author for writing it. Replacing Spark with these single-machine tools seems to be on the hype, and Spark is not en vogue anymore.
The author ran Spark in Fabric, which has V-Order write enabled by default. DuckDB and Polars don't have this, as it's an MS proprietary algorithm. V-Order adds about 15% overhead to write, so it does change the result a bit.
The data sizes were a bit on the large side, at least for the data amounts I see daily. There definitely are tables in the 10GB, 100GB, and even 1TB size range, but most tables traveling through data pipelines are much smaller.
FYI I had V-Order and Optimized Write disabled in the benchmark. The only write config diff was that I enabled deletion vectors in Spark since it's supported there, which the other two don't support.
I'm a bit confused by the claim that duckdb doesn't support dataframes.
This blog post suggests that it has been supported since 2021 and matches my experience.
From what I understood the article refers to the point that DuckDB doesn't provide its own dataframe API, meaning a way to express SQL queries in Python classes/functions.
The link you shared shows how DuckDB can run SQL queries on a pandas dataframe (e.g. `duckdb.query("<SQL query>")`). The SQL query in this case is a string. A dataframe API would allow you to write it completely in Python. An example of this would be polars dataframes (`df.select(pl.col("...").alias("...")).filter(pl.col("...") > x)`).
Dataframe APIs benefit from autocompletion, error handling, syntax highlighting, etc. that the SQL strings wouldn't. Please let me know if I missed something from the blog post you linked!
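To make the contrast concrete (column names made up):

    import duckdb
    import polars as pl

    df = pl.DataFrame({"region": ["eu", "us", "eu"], "amount": [10, 20, 30]})

    # DuckDB: the query is a SQL string (it finds `df` in scope via a replacement scan)
    duckdb.sql("SELECT region, sum(amount) AS total FROM df GROUP BY region").show()

    # Polars: the same query expressed through a dataframe API
    print(df.group_by("region").agg(pl.col("amount").sum().alias("total")))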
For non trivial queries I write them in a separate SQL file where I get the benefit of syntax highlighting, auto formatting and error checks.
There may be another benefit: a lot of LLMs are getting good at how do I do X in Duckdb.
Your point about SQL strings vs more strongly typed DF APIs stands.
However it's somewhat weakened by the possibility that some parts of the SQL string are resolved by the surrounding python context.
Author here: that’s exactly what I was trying to communicate but you said it better :)
There is a Spark API[1] being built using their Relational API[2].
Progress is being tracked on Github Discussions[3].
[1]: https://duckdb.org/docs/api/python/spark_api.html
Very cool! This seems like fantastic functionality and would make it super easy to migrate small Spark workloads to DuckDB :)
Maybe they mean the other way around aka querying duckdb using dataframe queries instead of sql, which can be achieved through the ibis project
I do think code should be shared when you are benchmarking. He could be using Polars' eager API for instance, which would not be apples to apples.
Hi - Miles Cole here… I used lazy APIs where available. I.e. everything up to write_delta() is lazy in the Polars (Mod) variant.
Yeah I was debating whether to share all of the source code. I may share a portion of it soon.
Great! A small correction on your post: Polars does have SQL support. It isn't the main use case, so it isn't as good as that of Spark and DuckDB, but it does exist and is being improved on.
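A quick sketch of what exists today:

    import polars as pl

    df = pl.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
    ctx = pl.SQLContext(frames={"t": df})
    print(ctx.execute("SELECT a, SUM(b) AS total FROM t GROUP BY a", eager=True))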
Ritchie - thx for graciously correcting some things I got wrong, will get it corrected!
What is the current, ballpark, maximum amount of data one can realistically handle on a single-machine setup on AWS / GCP / Azure?
Realistically meaning: keeping in mind that the processing itself also requires memory, as do prerequisites like indexes, which also need to be kept in memory.
Maximum memory at AWS would be 1.5TB using r8g.metal-48xl. So, assuming 50% is usable for the raw data, about 750GB is realistic.
pretty new to the large scale data processing space so not sure if this is a known question but isn't the purpose of spark that it can be distributed across many workers and parallelized?
I guess the scale of data here ~100GB is manageable with something like DuckDB but once data gets past a certain scale, wouldn't single machine performance have no way of matching a distributed spark cluster?
Yes, but such situations are incredibly rare IMO. I was just recently in a job where a self described expert was obsessed with writing all data pipelines (in this context, 1-2GB of data) in spark, because that was the “proper” way to do things since we had Databricks. It was beyond a waste of time.
Yup, everyone thought they had “big data” for a moment in the industry, and it turns out, they all just read the same trendy blog posts. I made the mistake of “clustering” at my previous company, and it introduced so much unnecessary complexity. If you are really worried about future proofing, just design a schema where the main entities are denormalized and use UUID for your primary keys. If you need to shard later, it will be much easier. Although, again, it’s likely that you will never need to.
> Yes, but such situations are incredibly rare IMO.
They happen all the time if you work for banks, large finance companies or government.
It’s not just the 2TB databases - it’s the 100 analysts all doing it their own thing with that data at the same time.
Maybe "rare" was the wrong choice of word. Maybe I should have said such situations are "highly contextual". It's one of those things where if you have to ask the question "Do I need this?" you almost certainly don't. And even in situations where you actually do need it, it's strongly preferred to "find a way not to need it" for day to day things (aka downsample, or be judicious about what you load into the analytics environment). I can only speak to my experience in FAANG, but even though we had the infra and budget to run huge distributed queries, it was highly discouraged unless absolutely needed!
This is my problem with databricks. It seems like in the course of selling their product they have taken the received wisdom of "do not run expensive and complex compute clusters unless absolutely necessary" and turned it into "it's fun and easy to run distributed compute clusters - everyone's doing it and you should too" regardless of how contextually appropriate it is.
My impression is that companies like databricks because of the framework it offers for organizing data workflows. Spark is then used since it is how databricks does its processing.
Microsoft’s synapse is their «ripoff product» which tries to compete directly by offering the same thing (only worse) with MS branding and better azure integration.
I’ve yet to see Spark being used outside of these products but would be happy to hear of such use.
So many think they (their company) have “Big Data” but don’t.
My employer's product database is about 2TB currently and is still perfectly manageable with a single machine, and to be honest, it's not even all that optimized.
People make the mistake of thinking it's a single 100GB dataset.
It's more common to be manipulating lots of smaller datasets together in a way where you need more than 100GB of disk space. And in this situation you really need Spark.
Can you elaborate? I had no problem with DuckDB and a few TB of data. But it depends on what exactly you do with it, of course.
few hundred TBs here - no issues :)
I think it's quite rare that a dataset is too large to be processed on a single volume. EBS volumes max out at 64TB.
If you're using EBS for shuffle/source data you've already behind.
If you can afford it, you should be using the host NVMe drives or FSx and relying on Spark to handle outages. The performance difference will be orders of magnitude.
And in this case you won't have the ability to store 64TB. The average max is 2TB.
It's not just disk capacity though. Disk bandwidth, CPU and memory capacity/bandwidth etc can also become bottlenecks on a single machine.
It's really the only reason to use Spark in the first place, because you're doing non-local level processing - when I was throwing 100TB at the problem it made more sense than most of the data science tasks I see it used for.
This magnitude of data fascinates me. Can you elaborate on how that much data came to be and what kind of processing was needed for all that data? And if not, maybe point to some podcast, blog that goes into the nitty gritty of those types of real big data challenges.
Financial market data (especially FX data) can have thousands of ticks per second. It's not unheard of for data engineers in that space to sometimes handle 1TB per day.
Click/traffic data for a top 100 website. We weren't doing a ton, but basic recommendation processing, search improvement, pattern matching in user behavior, etc
We normally still would only need to process say, the last 10 days of user data to get decent recommendations, but occasionally it would make sense for processes running over the entire dataset.
Also, this isn't that large when you consider binary artifacts (say, healthcare imaging) being stored in a database, which I'm pretty sure is what a lot of electronic healthcare record systems do.
A random company I bumped into has a 40TB OLTP database to this effect.
Isn't Spark extremely memory inefficient due to the use of Java?
Yea, but isn't DuckDB extremely memory unsafe due to the use of C++? Isn't Pandas extremely slow due to the use of Python?
Why is Apache DataFusion not there as an alternative?
Or use https://lancedb.com/ — "LanceDB is a developer-friendly, open source database for AI. From hyper-scalable vector search and advanced retrieval for RAG, to streaming training data and interactive exploration of large scale AI datasets, LanceDB is the best foundation for your AI application."
Recent podcast https://talkpython.fm/episodes/show/488/multimodal-data-with...
Hey, you have some silly thing rendered at your product's landing page chewing CPU.
It's not mine, I just learned about it is all.