TimeSeries Benchmark on CrateDB and MongoDB

A proud day for Nyrkiö Oy and me personally, as we are today publishing our first report as an “independent performance expert”, comparing two databases as fairly and objectively as possible. It’s a daunting task, as benchmarking is hard and benchmarking databases is harder. And as a performance expert you are asked to put all your professional experience and reputation on the line, living up to the standards of a more or less scientific rigor of objectivity and reproducibility, while everyone knows the work was commissioned by the marketing department of one of the contestants. Even so, I was honored to be asked to do this project and humbly submit the results for scrutiny by the database performance community – and of course to the benefit of the wider community of database users.

In this case the work was commissioned by CrateDB, who also sponsored the cloud infrastructure for both the CrateDB and MongoDB benchmarks. (On identical hardware, obviously.) CrateDB has not had any influence in the test setup nor the authoring of the report. They were allowed an opportunity to review the report before publication, and they were given the right to decide whether they want it to be published or not, or whether they would have wanted to use it non-publicly with their own customers, or not at all. They chose to publish.

Reality fits theory remarkably well

I don’t want to spoil the results too much, they are summarized on the front page of the attached PDF. But reflecting on the results now as I write this, in hindsight they fit well into the theory: MongoDB is a classic BTree based DBMS, and even if the JSON based user experience is its own world completely, from a performance point of view it is actually very similar to the traditional general purpose OLTP relational databases Oracle, Postgresql and MySQL. The 20x difference in insert speed between CrateDB and MongoDB could just as well have been a benchmark 15 years ago between Infobright and MySQL, or Teradata and Oracle. It is the typical, ballpark difference to expect between a BTree-based versus columnar database architecture.

On the read side we know from theory that the BTree is the optimal data structure to query a data set. But this is for the case where a specific query is known in advance, and the corresponding optimal BTree was defined when creating and loading the database. For queries not covered well, or at all, by an index, MongoDB results quickly deteriorate to seconds and minutes. The columnar architecture of CrateDB on the other hand is able to return results within hundreds of milliseconds. This is not as good as the fastest MongoDB queries, but for a human user waiting for the results, both are essentially immediate. The remarkable property of the columnar CrateDB architecture is that it is able to return results for any query with similar performance.

Note by the way that at the start of this work we did not set out to compare a BTree based and columnar based database. The comparative setup in this report was chosen because both MongoDB and CrateDB are general purpose databases that can also be used for a timeseries database. It’s just the observation with the benefit of hindsight that of course in this particular use case CrateDB’s results are explained by the columnar side of its dualist architecture.

The database that puts performance consultants out of work?

Usually when a vendor commissions and “indepedent” expert report, the project is given to some well known performance expert who is familiar with the product of the client, and often less familiar with the competing products of the report. (It’s really difficult to pretend to be an expert in two different databases, so that is almost never the case for these comparative benchmark exercises.)

In this case though, it just so happens that MongoDB is the database I’m most familiar with, and once upon a time I could claim the undisputed title of being the leading performance expert in Europe, at least, if not beyond even… CrateDB on the other hand was a new product for me, and part of the project was to learn enough of it so that I could make an informed evaluation of how to benchmark it, and how to interpret the results.

CrateDB fits well into my own professional history in that it is a Java-based database with close ties to the (Big) Data ecosystem of the Apache Software Foundation. While there’s still a small part of me questioning who in their right mind would build a database in Java… Over the last 4 years I have had to soften my stance and admit that there’s strong innovation happening in the ecosystem that has given birth to Cassandra (my previous 4 years), Elastic, Lucene… Not to mention Kafka, Spark and Trino… These projects all borrow components and ideas from each other, and are able to do so as they share the same programming language and license.

CrateDB is another member of that family, and has built a modern distributed database with some truly innovative features, standing on the architectural shoulders of Netty and Lucene. In fact, working with CrateDB reminded me of a time 10 years ago when Elastic was gaining in popularity. Its users thought it was so great, they wanted to use Elastic as their primary and only database. And Elastic was busy educating their user base to please don’t do that, because they were a search engine and had no intention of being a good database.

When using CrateDB – that got started as a project around those times – it’s like you’ve stumbled into an alternative reality where Elastic is a proper database. You can use it as a decent OLTP database for your apps, but the same database product also allows you to do analytical queries and even text search as it comes with the full power of the Lucene libraries. (Yes yes, HNSW vector search too of course. It is 100% hype compatible.)

A big part of database performance work, is to plan the optimal set of indexes given a set of queries the application developer needs to execute. There’s a joke going around the database consulting community saying that half of what we do is to just execute EXPLAIN and tell the customer to add another index; while the other half you just tell them to buy more RAM.

I knew from blog posts that Rockset had developed a service where they would index every column by default. In the case of Rockset the LSM indexing structure was the innovative thing that was efficient enough to make such a revolutionary idea even possible. CrateDB is now the second product I’ve heard of offering this feature – and with Rockset being acquired and shutting down, I’m left asking is it the only one or are there others that do this and I haven’t heard of them yet?

(CC) BY Brent @ Flickr

The ability to index all columns, in other words making all queries equally fast seems revolutionary to me. It allows completely ad hoc exploratory querying. For example in this case, if a database was set up to be a timeseries database, you expect all queries to be limited to some range in time, and queries that aren’t will either be slow or impossible. With CrateDB you are at any point free to turn around and explore a different path – follow your curiosity or instinct to peek around the next peak, where those other databases couldn’t go. In the report we have added one such query to verify and demonstrate that this really works.

But not only that. Even just for the application developer this means you don’t need to go back to rethink your indexing strategy when a new feature does something you didn’t plan to do before. You can just focus on writing code, and the database will be fine with whatever query you need it to do.

I’m still fairly new as a CrateDB user myself, but after this first touch I’m even more excited about this feature. (Which I have first heard about from Rockset, to be clear, but I never got a chance to use their service before it was too late.) It seems to me after 50 years of planning the optimal combination of BTree indexes, we are so entrenched in our ways, that nobody has even thought of asking: what if I didn’t need to add any indexes? What if I could just put data into the database and queries would just work?

Now if you’ve read this blog this far, what do you think about this? Find Nyrkiö on Twitter or Linkedin and let us know!

DSI: Distributed Systems Infrastructure

For these benchmarks I was happy to return to the project we built 5 years ago at MongoDB to automate all our benchmarking: DSI. While originally this was a framework designed to benchmark MongoDB with multiple different benchmarks, CrateDB kindly sponsored the work to generalize the MongoDB parts so that we are now able to easily add support for any database or other software product, just as we can easily add support for new benchmark clients.

DSI now obviously supports CrateDB too, but the true validation for this work was seen last week, as I was able to add support for MySQL in just 2 days! (And half of that time was just me trying to remember how I can login to the damn thing…)

DSI will be the underlying open source project for Nyrkiö’s Benchmarking-as-a-Service product, now offered as a closed beta to select customers. If you’re interested, contact our sales or just fork it and join me in hacking it on an open source basis.

Nyrkiö comparison of CrateDB and MongoDB with TSBS v2 Download