Using SIMD to aggregate billions of values per second

bluestreak | 171 points

QuestDB co-founder and CTO here - happy to share QuestDB, a performance-driven open-source time-series database that uses SQL.

High-performance databases have a reputation for being inaccessible. They are expensive, closed-source, and require complex proprietary languages.

We have made our code available under Apache 2.0. In this new release, QuestDB leverages SIMD instructions, vectorization, and parallel execution to achieve performance figures that rank among the top high-performance databases. Results are shown in our blog post.

I sincerely hope that you will find this helpful, and that this will unlock new possibilities! In the meantime, we will carry on working on QuestDB, adding more features, and bringing this speed to more query types. Feel free to join us on Slack or GitHub if you want to participate.

bluestreak | 4 years ago

It would be great if you included ClickHouse in your benchmark. It also boasts heavy SIMD use and is free + open source.

polskibus | 4 years ago

not to hijack too much, but since this is on the topic of timeseries...i'm currently working on a fast* Canvas2D timeseries chart:

https://github.com/leeoniya/uPlot

* ~4,000 pts/ms on an i5 and integrated gpu

leeoniya | 4 years ago

Super interesting product. I'll definitely be taking a deeper look at this when I'm at work tomorrow.

I notice all your comparisons use floating-point and integer types. I was recently looking at SIMD as a possible way to speed up some of our calculations, but we build financial software, and most of our data is stored as decimals rather than floats to avoid binary floating-point precision problems during calculations.

Does QuestDB handle decimals, just without the SIMD speedup?

Is this just a dead end? Are there any SIMD implementations that deal with decimal numbers? I considered hacky workarounds such as using integer-based types internally and treating them as fixed-point decimals for display, but that doesn't give enough range for my purposes.
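To make the integer workaround concrete, here is a minimal Python sketch of fixed-point decimals stored as scaled integers. The scale of 4 decimal places and the helper names `to_fixed` and `format_fixed` are assumptions chosen for illustration; nothing here reflects how QuestDB stores data. It also shows the range ceiling a 64-bit integer would impose at that scale.

```python
from decimal import Decimal

# Hypothetical scale: 4 decimal places, so 1 unit = 0.0001.
SCALE = 10_000

def to_fixed(text):
    """Parse a decimal string into an integer count of 1/SCALE units.

    Digits beyond 4 decimal places are truncated in this sketch.
    """
    return int(Decimal(text) * SCALE)

def format_fixed(units):
    """Render an integer count of 1/SCALE units as a decimal string."""
    sign = "-" if units < 0 else ""
    whole, frac = divmod(abs(units), SCALE)
    return f"{sign}{whole}.{frac:04d}"

# Sums over scaled integers are exact, unlike binary floating point.
prices = ["0.10", "0.20", "19.99"]
total_units = sum(to_fixed(p) for p in prices)
print(format_fixed(total_units))

# Python ints are unbounded, but a SIMD lane would typically be int64.
# This is the largest value int64 could hold at this scale.
INT64_MAX = 2**63 - 1
print(format_fixed(INT64_MAX))
```

Integer addition like this vectorizes the same way any 64-bit integer sum does; the trade-off is exactly the one raised above, since every extra decimal place costs one digit of whole-number range.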

SimonPStevens | 4 years ago

So how does it compare to ClickHouse?

jmakov | 4 years ago

Cool. I see you're doing a regular (not compensated) horizontal sum in a loop. Horizontal sums are slow, but I'm guessing you wanted to have exactly the same result as if the sum was calculated sequentially (for doubles)? Do you know if any databases use more accurate summation methods (compensated summation)?
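For anyone unfamiliar with the term, here is a minimal scalar sketch of compensated (Kahan) summation in Python next to a plain running sum. This is not QuestDB's code and it is not vectorized; the function names and the test data are assumptions for illustration, and `math.fsum` is used only as a correctly rounded reference.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan compensated sum: carries the rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total

# Illustrative data: one million copies of 0.1, which is inexact in binary.
values = [0.1] * 1_000_000
reference = math.fsum(values)  # correctly rounded sum from the standard library

naive_error = abs(naive_sum(values) - reference)
kahan_error = abs(kahan_sum(values) - reference)
print(naive_error, kahan_error)
```

The same idea can be applied per SIMD lane by keeping one compensation term per accumulator, at the cost of a few extra floating-point operations per element.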

zbjornson | 4 years ago

I came across QuestDB in the past but never tried it myself. At my company, we use kx and OneTick. Could you please elaborate on why you are also comparing with Postgres, since it's not really a time-series database and doesn't claim to be part of the "high performance" club?

quod_2058 | 4 years ago

Were these benchmarks before or after 2020.03.26? There was a bug that caused max operations to take twice as long.

From the KDB+ 4.0 release notes:

2020.03.26 FIX fixed performance regression for max. e.g. q)x:100000000?100;system"ts:10 max x"

binomiq | 4 years ago

The numbers are impressive, especially since the comparison is against kdb. q/kdb is mostly finance-focused and closed source, so it is not very flexible; QuestDB has an advantage there. It might be a bit of a tangent, but I wonder whether this could be used to replace Redis. I can see how having SQL as a query language could be a big plus.

sirffuzzylogik | 4 years ago

In the reference there's no mention of SQL window functions. Is it possible to do multiple moving averages over different time spans?

If not, are there plans to add support in the future?

stereosteve | 4 years ago

Great work! 2.3x faster than kdb and 500x faster than Postgres for sum (double) time series. What are your goals for the next release?

jeromerousselot | 4 years ago

I assume the values must all be in memory beforehand and not on disk.

popotamonga | 4 years ago

What about data compression? How does that compare to other time series DBs?

continuations | 4 years ago

What dialect of SQL are you using?

georgewfraser | 4 years ago

Is it difficult to add SIMD to an existing time-series database?

sch00lb0y | 4 years ago

This can't be called a database because it lacks any persistent storage. It does not survive a system crash.

It is a structured in-memory (or rather in-JVM) cache with a rudimentary SQL interface.

Calling things by their proper names is halfway to intelligence.

johndoe42377 | 4 years ago