drrtuy`s board

июн 09, 2019

MDB ColumnStore and ClickHouse performance comparison

Introduction

ColumnStore and ClickHouse have a lot in common. Both are columnar-oriented analytical workload engines and extensively use MPP concept so they share almost the same niche. That is why I'm always asked at conferences about performance comparison between CS and CH. Until very recently there was not much to read in the wild on this matter except couple of slightly outdated tests[1] [2]. However there was a series of comparisons presented at the last Percona Live 19[3]. But in March 2019 I wasn't aware of those speeches yet so I decided to conduct my own comparison against CH that I personally find the best of a kind b/c this state of art engine exploits lots of concepts that any modern analytical engine needs.

Setup

I wanted a comparison not a stress test so I used my MacBook Pro 2018 with Linux Arch on it as a hardware. On top I put Docker to separate CS and CH environments. CS was a custom 1.2 release build for U18[15] that included MCOL-498 feature[4]. CH was a release build of 9158e0 commit. Configuration-wise I increased DBBC.NumBlocksPct in CS'es config up to 80%, Lang to UTF-8, DBRoot1.PreallocSpace to OFF. CH was good enough w/o any additional changes in my environment but feel free to add a note if I missed some important setting here. To avoid fancy memory performance degradation I disabled swap before I run the tests.

Dataset

There are number of analytical data sets available in the Net however I appreciate Yandex'es data set the most[5] b/c it comes from Yandex.Metrics that is the largest web-analytics in Russia. There was a remarkable speech[6] given by Alexey Milovidov about the origins of this data set. I decided to use only one table of the pair, namely hits because of the queries I used for the comparison. I also had to change the structure of the hits table to avoid array data type that CS doesn't support now. Here are the DDL statements for CS[7] and CH[8]. When I created the table I ingested the final dataset[9] into hits. It is worth to note that CH specific ddl file[8] contains both CREATE TABLE and INSERT statements that could be used to extract columns from the original hits table.

Data ingestion

When it comes down to data ingestion cpimport is the tool to use in CS. I uploaded data using 8 threads set both writers and readers to 8 and allocated 10MB for the buffer. Here is the command for data ingestion in CS.

cpimport -r8 -w8 -E'"' -c10000000   -e20000 -s',' test hits ./hits_partial_col_set.csv

To parellize data ingestion in CH I have to split data files. Playing with parallelize factor I came up with the number 4 that happened to be the best in my case. And here are the commands that I used to ingest data into CH.

split -n l/4
for i in $(echo 'abcd' | fold -w1); do echo $ hits_partial_col_set_a$i; time clickhouse-client --query "INSERT INTO test.h1 FORMAT CSV" --max_insert_block_size=100000 < hits_partial_col_set_a$i & done

CS spent 100 seconds to ingest the data set into CS and it took 1.9GB of disk space whilst CH spent 125 seconds and the table took 2.5GB of disk space.

Queries

I used the queries from CH'es benchmark page[10] but I changed them a bit b/c of the differences b/w CS and CH:

  • uniqExact() was used instead of uniq() for CH
  • all queries with ORDER BY/LIMIT is wraped in subqueries for CS
  • replaced time period lower and upper limits with a new ones that for both CS and CH
  • replaced CounterID for both CS and CH

Here are the lists for CS[11] and for CH[12] I got in the end.

Methodology

I run every query two times first using cold cache and then using a prewarmed cache. To clean caches I restarted CS and clean OS caches for CH b/c latter utilizes Page Cache extensively while the former uses its own block cache. Here[13] is the scenario log for CS and here[14] for CH.

Results

# Query in CS Query in CH ColumnStore ClickHouse
1
SELECT count(*) FROM test.hits;
SELECT count() FROM test.hits;
(0.111 s.) (0.080 s.) (0.075 s.) (0.008 s.)
2
SELECT count(*) FROM test.hits WHERE AdvEngineID != 0;
SELECT count() FROM test.hits WHERE AdvEngineID != 0;
(0.146 s.) (0.081 s.) (0.151 s.) (0.003 s.)
3
SELECT sum(AdvEngineID), count(*), avg(ResolutionWidth) FROM test.hits;
SELECT sum(AdvEngineID), count(), avg(ResolutionWidth) FROM test.hits;
(0.330 s.) (0.264 s.) (0.171 s.) (0.024 s.)
4
SELECT sum(UserID) FROM test.hits;
SELECT sum(UserID) FROM test.hits;
(0.240 s.) (0.178 s.) (0.067 s.) (0.016 s.)
5
SELECT distinct UserID FROM test.hits;
SELECT uniqExact(UserID) FROM test.hits;
(0.353 s.) (0.297 s.) (0.122 s.) (0.035 s.)
6
SELECT distinct SearchPhrase FROM test.hits;
SELECT uniqExact(SearchPhrase) FROM test.hits;
(0.844 s.) (0.773 s.) (0.275 s.) (0.111 s.)
7
SELECT min(EventDate), max(EventDate) FROM test.hits;
SELECT min(EventDate), max(EventDate) FROM test.hits;
(0.279 s.) (0.258 s.) (0.171 s.) (0.015 s.)
8
SELECT * FROM (SELECT AdvEngineID, count(*) FROM test.hits WHERE AdvEngineID != 0 GROUP BY AdvEngineID ORDER BY count(*) DESC) a;
SELECT AdvEngineID, count() FROM test.hits WHERE AdvEngineID != 0 GROUP BY AdvEngineID ORDER BY count() DESC;
(0.136 s.) (0.078 s.) (0.170 s.) (0.007 s.)
9
SELECT * FROM ( SELECT distinct UserID as u, regionid FROM test.hits GROUP BY regionid ORDER BY u DESC LIMIT 10 )a ;
SELECT RegionID, uniqExact(UserID) AS u FROM test.hits GROUP BY RegionID ORDER BY u DESC LIMIT 10;
(0.370 s.) (0.272 s.) (0.262 s.) (0.059 s.)
10
SELECT * FROM (SELECT RegionID, sum(AdvEngineID), count(*) AS c, avg(ResolutionWidth), count(distinct UserID) FROM test.hits GROUP BY RegionID ORDER BY c DESC LIMIT 10) a;
SELECT RegionID, sum(AdvEngineID), count() AS c, avg(ResolutionWidth), uniqExact(UserID) FROM test.hits GROUP BY RegionID ORDER BY c DESC LIMIT 10;
(0.628 s.) (0.518 s.) (0.230 s.) (0.070 s.)
11
SELECT * FROM (SELECT distinct UserID AS u, MobilePhoneModel FROM test.hits WHERE MobilePhoneModel != '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10) a;
SELECT MobilePhoneModel, uniqExact(UserID) AS u FROM test.hits WHERE MobilePhoneModel != '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
(0.539 s.) (0.475 s.) (0.213 s.) (0.040 s.)
12
SELECT * FROM (SELECT distinct UserID AS u, MobilePhone, MobilePhoneModel FROM test.hits WHERE MobilePhoneModel != '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10) a;
SELECT MobilePhone, MobilePhoneModel, uniqExact(UserID) AS u FROM test.hits WHERE MobilePhoneModel != '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
(0.561 s.) (0.443 s.) (0.215 s.) (0.047 s.)
13
SELECT * FROM (SELECT SearchPhrase, count(*) AS c FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10) a;
SELECT SearchPhrase, count() AS c FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
(0.539 s.) (0.476 s.) (0.171 s.) (0.094 s.)
14
SELECT * FROM (SELECT distinct UserID AS u, SearchPhrase FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10) a;
SELECT SearchPhrase, uniqExact(UserID) AS u FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
(0.627 s.) (0.522 s.) (0.252 s.) (0.077 s.)
15
SELECT * FROM (SELECT SearchEngineID, SearchPhrase, count(*) AS c FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10) a;
SELECT SearchEngineID, SearchPhrase, count() AS c FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
(0.644 s.) (0.474 s.) (0.183 s.) (0.107 s.)
16
SELECT * FROM(SELECT UserID, count(*) FROM test.hits GROUP BY UserID ORDER BY count(*) DESC LIMIT 10) a;
SELECT UserID, count() FROM test.hits GROUP BY UserID ORDER BY count() DESC LIMIT 10;
(0.312 s.) (0.258 s.) (0.194 s.) (0.050 s.)
17
SELECT * FROM(SELECT UserID, SearchPhrase, count(*) FROM test.hits GROUP BY UserID, SearchPhrase ORDER BY count(*) DESC LIMIT 10) a;
SELECT UserID, SearchPhrase, count() FROM test.hits GROUP BY UserID, SearchPhrase ORDER BY count() DESC LIMIT 10;
(0.947 s.) (0.830 s.) (0.316 s.) (0.163 s.)
18
SELECT * FROM(SELECT UserID, SearchPhrase, count(*) FROM test.hits GROUP BY UserID, SearchPhrase LIMIT 10) a;
SELECT UserID, SearchPhrase, count() FROM test.hits GROUP BY UserID, SearchPhrase LIMIT 10;
(0.915 s.) (0.783 s.) (0.322 s.) (0.158 s.)
19
SELECT * FROM(SELECT UserID, extract( minute from EventTime) AS m, SearchPhrase, count(*) FROM test.hits GROUP BY UserID, m, SearchPhrase ORDER BY count(*) DESC LIMIT 10) a;
SELECT UserID, toMinute(EventTime) AS m, SearchPhrase, count() FROM test.hits GROUP BY UserID, m, SearchPhrase ORDER BY count() DESC LIMIT 10;
(2.383 s.) (2.057 s.) (0.673 s.) (0.460 s.)
20
SELECT UserID FROM test.hits WHERE UserID = 1711847583975330569;
SELECT UserID FROM test.hits WHERE UserID = 12345678901234567890;
(0.151 s.) (0.081 s.) (0.054 s.) (0.012 s.)
21
SELECT count(*) FROM test.hits WHERE URL LIKE '%metrika%';
SELECT count() FROM test.hits WHERE URL LIKE '%metrika%';
(3.717 s.) (3.717 s.) (0.262 s.) (0.215 s.)
22
SELECT * FROM(SELECT SearchPhrase, min(URL), count(*) AS c FROM test.hits WHERE URL LIKE '%metrika%' AND SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10) a;
SELECT SearchPhrase, any(URL), count() AS c FROM test.hits WHERE URL LIKE '%metrika%' AND SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
(1.093 s.) (0.629 s.) (0.405 s.) (0.195 s.)
23
SELECT * FROM(SELECT min(UserID),SearchPhrase, min(URL), min(Title), count(*) AS c FROM test.hits WHERE Title LIKE '%Яндекс%' AND URL NOT LIKE '%.yandex.%' AND SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10) a;
SELECT SearchPhrase, any(URL), any(Title), count() AS c, uniqExact(UserID) FROM test.hits WHERE Title LIKE '%Яндекс%' AND URL NOT LIKE '%.yandex.%' AND SearchPhrase != '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
(2.339 s.) (1.415 s.) (0.647 s.) (0.436 s.)
24
SELECT * FROM(SELECT * FROM test.hits WHERE URL LIKE '%metrika%' ORDER BY EventTime LIMIT 10) a;
SELECT * FROM test.hits WHERE URL LIKE '%metrika%' ORDER BY EventTime LIMIT 10;
(7.356 s.) (3.491 s.) (1.464 s.) (0.954 s.)
25
SELECT * FROM(SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY EventTime LIMIT 10) a;
SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY EventTime LIMIT 10;
(0.548 s.) (0.419 s.) (0.213 s.) (0.042 s.)
26
SELECT * FROM(SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY SearchPhrase LIMIT 10) a;
SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY SearchPhrase LIMIT 10;
(0.484 s.) (0.400 s.) (0.198 s.) (0.032 s.)
27
SELECT * FROM(SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY EventTime, SearchPhrase LIMIT 10) a;
SELECT SearchPhrase FROM test.hits WHERE SearchPhrase != '' ORDER BY EventTime, SearchPhrase LIMIT 10;
(0.548 s.) (0.433 s.) (0.195 s.) (0.045 s.)
28
SELECT * FROM(SELECT CounterID, avg(length(URL)) AS l, count(*) AS c FROM test.hits WHERE URL != '' GROUP BY CounterID HAVING c > 100000 ORDER BY l DESC LIMIT 25) a;
SELECT CounterID, avg(length(URL)) AS l, count() AS c FROM test.hits WHERE URL != '' GROUP BY CounterID HAVING c > 100000 ORDER BY l DESC LIMIT 25;
(2.415 s.) (2.150 s.) (0.436 s.) (0.208 s.)
29
SELECT * FROM(SELECT Referer, avg(length(Referer)) AS l, count(*) AS c FROM test.hits WHERE Referer != '' GROUP BY 1 HAVING c > 100000 ORDER BY l DESC LIMIT 25) a;
SELECT domainWithoutWWW(Referer) AS key, avg(length(Referer)) AS l, count() AS c, any(Referer) FROM test.hits WHERE Referer != '' GROUP BY key HAVING c > 100000 ORDER BY l DESC LIMIT 25;
(3.130 s.) (2.712 s.) (0.600 s.) (0.375 s.)
30
SELECT sum(ResolutionWidth), sum(ResolutionWidth + 1), sum(ResolutionWidth + 2), sum(ResolutionWidth + 3), sum(ResolutionWidth + 4), sum(ResolutionWidth + 5), sum(ResolutionWidth + 6), sum(ResolutionWidth + 7), sum(ResolutionWidth + 8), sum(ResolutionWidth + 9), sum(ResolutionWidth + 10), sum(ResolutionWidth + 11), sum(ResolutionWidth + 12), sum(ResolutionWidth + 13), sum(ResolutionWidth + 14), sum(ResolutionWidth + 15), sum(ResolutionWidth + 16), sum(ResolutionWidth + 17), sum(ResolutionWidth + 18), sum(ResolutionWidth + 19), sum(ResolutionWidth + 20), sum(ResolutionWidth + 21), sum(ResolutionWidth + 22), sum(ResolutionWidth + 23), sum(ResolutionWidth + 24), sum(ResolutionWidth + 25), sum(ResolutionWidth + 26), sum(ResolutionWidth + 27), sum(ResolutionWidth + 28), sum(ResolutionWidth + 29), sum(ResolutionWidth + 30), sum(ResolutionWidth + 31), sum(ResolutionWidth + 32), sum(ResolutionWidth + 33), sum(ResolutionWidth + 34), sum(ResolutionWidth + 35), sum(ResolutionWidth + 36), sum(ResolutionWidth + 37), sum(ResolutionWidth + 38), sum(ResolutionWidth + 39), sum(ResolutionWidth + 40), sum(ResolutionWidth + 41), sum(ResolutionWidth + 42), sum(ResolutionWidth + 43), sum(ResolutionWidth + 44), sum(ResolutionWidth + 45), sum(ResolutionWidth + 46), sum(ResolutionWidth + 47), sum(ResolutionWidth + 48), sum(ResolutionWidth + 49), sum(ResolutionWidth + 50), sum(ResolutionWidth + 51), sum(ResolutionWidth + 52), sum(ResolutionWidth + 53), sum(ResolutionWidth + 54), sum(ResolutionWidth + 55), sum(ResolutionWidth + 56), sum(ResolutionWidth + 57), sum(ResolutionWidth + 58), sum(ResolutionWidth + 59), sum(ResolutionWidth + 60), sum(ResolutionWidth + 61), sum(ResolutionWidth + 62), sum(ResolutionWidth + 63), sum(ResolutionWidth + 64), sum(ResolutionWidth + 65), sum(ResolutionWidth + 66), sum(ResolutionWidth + 67), sum(ResolutionWidth + 68), sum(ResolutionWidth + 69), sum(ResolutionWidth + 70), sum(ResolutionWidth + 71), sum(ResolutionWidth + 72), sum(ResolutionWidth + 73), sum(ResolutionWidth + 74), sum(ResolutionWidth + 75), sum(ResolutionWidth + 76), sum(ResolutionWidth + 77), sum(ResolutionWidth + 78), sum(ResolutionWidth + 79), sum(ResolutionWidth + 80), sum(ResolutionWidth + 81), sum(ResolutionWidth + 82), sum(ResolutionWidth + 83), sum(ResolutionWidth + 84), sum(ResolutionWidth + 85), sum(ResolutionWidth + 86), sum(ResolutionWidth + 87), sum(ResolutionWidth + 88), sum(ResolutionWidth + 89) FROM test.hits;
SELECT sum(ResolutionWidth), sum(ResolutionWidth + 1), sum(ResolutionWidth + 2), sum(ResolutionWidth + 3), sum(ResolutionWidth + 4), sum(ResolutionWidth + 5), sum(ResolutionWidth + 6), sum(ResolutionWidth + 7), sum(ResolutionWidth + 8), sum(ResolutionWidth + 9), sum(ResolutionWidth + 10), sum(ResolutionWidth + 11), sum(ResolutionWidth + 12), sum(ResolutionWidth + 13), sum(ResolutionWidth + 14), sum(ResolutionWidth + 15), sum(ResolutionWidth + 16), sum(ResolutionWidth + 17), sum(ResolutionWidth + 18), sum(ResolutionWidth + 19), sum(ResolutionWidth + 20), sum(ResolutionWidth + 21), sum(ResolutionWidth + 22), sum(ResolutionWidth + 23), sum(ResolutionWidth + 24), sum(ResolutionWidth + 25), sum(ResolutionWidth + 26), sum(ResolutionWidth + 27), sum(ResolutionWidth + 28), sum(ResolutionWidth + 29), sum(ResolutionWidth + 30), sum(ResolutionWidth + 31), sum(ResolutionWidth + 32), sum(ResolutionWidth + 33), sum(ResolutionWidth + 34), sum(ResolutionWidth + 35), sum(ResolutionWidth + 36), sum(ResolutionWidth + 37), sum(ResolutionWidth + 38), sum(ResolutionWidth + 39), sum(ResolutionWidth + 40), sum(ResolutionWidth + 41), sum(ResolutionWidth + 42), sum(ResolutionWidth + 43), sum(ResolutionWidth + 44), sum(ResolutionWidth + 45), sum(ResolutionWidth + 46), sum(ResolutionWidth + 47), sum(ResolutionWidth + 48), sum(ResolutionWidth + 49), sum(ResolutionWidth + 50), sum(ResolutionWidth + 51), sum(ResolutionWidth + 52), sum(ResolutionWidth + 53), sum(ResolutionWidth + 54), sum(ResolutionWidth + 55), sum(ResolutionWidth + 56), sum(ResolutionWidth + 57), sum(ResolutionWidth + 58), sum(ResolutionWidth + 59), sum(ResolutionWidth + 60), sum(ResolutionWidth + 61), sum(ResolutionWidth + 62), sum(ResolutionWidth + 63), sum(ResolutionWidth + 64), sum(ResolutionWidth + 65), sum(ResolutionWidth + 66), sum(ResolutionWidth + 67), sum(ResolutionWidth + 68), sum(ResolutionWidth + 69), sum(ResolutionWidth + 70), sum(ResolutionWidth + 71), sum(ResolutionWidth + 72), sum(ResolutionWidth + 73), sum(ResolutionWidth + 74), sum(ResolutionWidth + 75), sum(ResolutionWidth + 76), sum(ResolutionWidth + 77), sum(ResolutionWidth + 78), sum(ResolutionWidth + 79), sum(ResolutionWidth + 80), sum(ResolutionWidth + 81), sum(ResolutionWidth + 82), sum(ResolutionWidth + 83), sum(ResolutionWidth + 84), sum(ResolutionWidth + 85), sum(ResolutionWidth + 86), sum(ResolutionWidth + 87), sum(ResolutionWidth + 88), sum(ResolutionWidth + 89) FROM test.hits;
(14.983 s.) (15.201 s.) (0.913 s.) (0.759 s.)
31
SELECT * FROM(SELECT SearchEngineID, ClientIP, count(*) AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10) a;
SELECT SearchEngineID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits WHERE SearchPhrase != '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;
(0.610 s.) (0.393 s.) (0.250 s.) (0.073 s.)
32
SELECT * FROM(SELECT WatchID, ClientIP, count(*) AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits WHERE SearchPhrase != '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10) a;
SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits WHERE SearchPhrase != '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
(0.711 s.) (0.511 s.) (0.303 s.) (0.130 s.)
33
SELECT * FROM(SELECT WatchID, ClientIP, count(*) AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10) a;
SELECT WatchID, ClientIP, count() AS c, sum(Refresh), avg(ResolutionWidth) FROM test.hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
(3.860 s.) (3.877 s.) (1.413 s.) (1.130 s.)
34
SELECT * FROM(SELECT URL, count(*) AS c FROM test.hits GROUP BY URL ORDER BY c DESC LIMIT 10) a;
SELECT URL, count() AS c FROM test.hits GROUP BY URL ORDER BY c DESC LIMIT 10;
(2.782 s.) (2.447 s.) (0.863 s.) (0.635 s.)
35
SELECT * FROM(SELECT 1, URL, count(*) AS c FROM test.hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10) a;
SELECT 1, URL, count() AS c FROM test.hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;
(2.745 s.) (2.445 s.) (0.870 s.) (0.621 s.)
36
SELECT * FROM(SELECT ClientIP AS x, ClientIP - 1, ClientIP - 2, ClientIP - 3, count(*) AS c FROM test.hits GROUP BY x, x - 1, x - 2, x - 3 ORDER BY c DESC LIMIT 10) a;
SELECT ClientIP AS x, x - 1, x - 2, x - 3, count() AS c FROM test.hits GROUP BY x, x - 1, x - 2, x - 3 ORDER BY c DESC LIMIT 10;
(0.631 s.) (0.560 s.) (0.332 s.) (0.139 s.)
37
select * from (SELECT URL, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate >= '2014-03-01' AND EventDate <= '2014-03-31' AND DontCountHits =0 AND Refresh =0 AND URL != '' and URL IS NOT NULL GROUP BY URL ORDER BY PageViews DESC LIMIT 10) a;
SELECT URL, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2014-03-23') AND NOT DontCountHits AND NOT Refresh AND notEmpty(URL) GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
(1.366 s.) (0.852 s.) (0.195 s.) (0.004 s.)
38
SELECT * FROM (SELECT Title, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate >= '2014-03-01' AND EventDate <= '2014-03-31' AND NOT DontCountHits AND NOT Refresh AND Title IS NOT NULL GROUP BY Title ORDER BY PageViews DESC LIMIT 10) a;
SELECT Title, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2014-03-23') AND NOT DontCountHits AND NOT Refresh AND notEmpty(Title) GROUP BY Title ORDER BY PageViews DESC LIMIT 10;
(0.386 s.) (0.210 s.) (0.091 s.) (0.004 s.)
39
SELECT * FROM (SELECT URL, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate >= '2014-03-01' AND EventDate <= '2014-03-31' AND NOT Refresh AND IsLink AND NOT IsDownload GROUP BY URL ORDER BY PageViews DESC LIMIT 1000) a;
SELECT URL, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2014-03-23') AND NOT Refresh AND IsLink AND NOT IsDownload GROUP BY URL ORDER BY PageViews DESC LIMIT 1000;
(0.258 s.) (0.138 s.) (0.178 s.) (0.008 s.)
40
SELECT * FROM (SELECT TraficSourceID, SearchEngineID, AdvEngineID, IF((SearchEngineID = 0 AND AdvEngineID = 0),Referer,'') AS Src, URL AS Dst, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate between '2014-03-01' AND '2014-03-31' AND NOT Refresh GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 1000) a;
SELECT TraficSourceID, SearchEngineID, AdvEngineID, ((SearchEngineID = 0 AND AdvEngineID = 0) ? Referer : '') AS Src, URL AS Dst, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2014-03-23') AND NOT Refresh GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 1000;
(0.405 s.) (0.199 s.) (0.200 s.) (0.014 s.)
41
SELECT * FROM (SELECT WindowClientWidth, WindowClientHeight, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate between '2014-03-01' AND '2014-03-31' AND NOT Refresh AND NOT DontCountHits AND URLHash = 8759892953031403647 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10000) a;
SELECT WindowClientWidth, WindowClientHeight, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2014-03-23') AND NOT Refresh AND NOT DontCountHits AND URLHash = halfMD5('http://yandex.ru/') GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10000;
(0.233 s.) (0.113 s.) (0.190 s.) (0.005 s.)
42
SELECT * FROM (SELECT EventTime AS Minute, count(*) AS PageViews FROM test.hits WHERE CounterID = 63 AND EventDate between '2014-03-01' and '2014-03-31' AND NOT Refresh AND NOT DontCountHits GROUP BY Minute ORDER BY Minute) a;
SELECT toStartOfMinute(EventTime) AS Minute, count() AS PageViews FROM test.hits WHERE CounterID = 93229 AND EventDate >= toDate('2014-03-17') AND EventDate <= toDate('2013-07-02') AND NOT Refresh AND NOT DontCountHits GROUP BY Minute ORDER BY Minute;
(0.301 s.) (0.182 s.) (0.093 s.) (0.008 s.)

Conclusion

As you can see CS is beaten by CH in most cases. The most notable reasons for that in CS are:

  • Lack of parallel ORDER BY facility. But we are working on it and hopefully the project converges 3Q-ish 2019. Here[15] is the issue to follow up
  • Very inefficient variable size strings filtering. We are attacking this from number of sides first introducing segment files support for wide(>8 byte) data type. JFYI CS now uses a Dictionary for storing DT wider then 8bytes and this imposes a number of performance penalties. I will write more on this another day
  • CS lacks NULL bitmaps a lot. To save NULL or empty value we use tokens that are MAX_VALUE-1 and MAX_VALUE-2 of the range available for the data type. To check the column value for NULL CS literally extracts the value and compares it with a magic and that is ugly. We have project to eliminate the inefficiency in CS roadmap though

However CH is a great example of performance-centered approach in making analytical engines. I must confess when it boils down to performance I have CH as a golden exemplar in mind. Let's look into queries where timigs differ the most. I will mostly look into CS problems rather than CH advantages.

  • Query 21. This one does simple scans using regexp. CS utilizes regcomp/regexec pair for doing the search and this approach isn't very good at times. We have a different approach in mind using trigrams + counting Bloom filters to produce a fingerprint of a string.
  • Query 24. This query suffers from inefficient regex handling, outdated algorithms of Dictionary scanning and single threaded sort.
  • Query 28. Strings scanning and single-threaded sort both take enourmous amount of total time.
  • Query 29. Same as previous.
  • Query 30. The most significant difference and the most interesting case. That is mix of reason number 3 from the list above combined with the fact CS uses Vulcano processing approach thus processing the set row by row that is very inefficient in analytical workload world. We have plans how to overcome this limitation with using both vectorization and JIT for the most intensively working loops.

Any comments and/or suggestions are welcome and thanks for reading this. And as a spoiler. I'm going to post an additional comparison concerned with JOIN-s mostly so stay tuned.

Links

  1. https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/
  2. https://linuxjedi.co.uk/2017/09/21/correcting-mariadb-columnstore-benchmarks/#more-485
  3. https://www.percona.com/live/19/sessions/opensource-column-store-databases-mariadb-columnstore-vs-clickhouse
  4. https://jira.mariadb.org/browse/MCOL-498
  5. https://github.com/yandex/ClickHouse/blob/master/dbms/tests/instructions/developer_instruction_ru.md#%D1%82%D0%B5%D1%81%D1%82%D0%BE%D0%B2%D1%8B%D0%B5-%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5 *Russian text
  6. https://www.highload.ru/spb/2019/abstracts/4808
  7. https://drrtuy.zerothree.su/CS-ddl.txt
  8. https://drrtuy.zerothree.su/CH-ddl.txt
  9. s3://datasetyandex/hits_partial_col_set.csv.lzma
  10. https://clickhouse.yandex/benchmark.html
  11. https://drrtuy.zerothree.su/CS-testing-queries.txt
  12. https://drrtuy.zerothree.su/CH-testing-queries.txt
  13. https://drrtuy.zerothree.su/CS-testing-raw-results.txt
  14. https://drrtuy.zerothree.su/CH-testing-raw-results.txt
  15. s3://mcol498/mariadb-columnstore-1.2.4-1-bionic.amd64.deb.tar.gz
Click to read and post comments