ClickHouse data skipping index seemingly not skipping any rows

Question

I have a simple table in ClickHouse with 10M rows, defined like this:

CREATE TABLE data.syslogs
(
    `id` UInt32,
    `time` DateTime,
    `priority` UInt8,
    `message` String,
    INDEX ngrambf_message_index message TYPE ngrambf_v1(5, 65536, 3, 37) GRANULARITY 256
)
ENGINE = MergeTree
PRIMARY KEY (time, id)
ORDER BY (time, id, priority)
SETTINGS index_granularity = 8192

I'm running a match query over it like so:

SELECT message FROM syslogs WHERE match(syslogs.message, 'stat');

My issue is that this seems to be skipping no data at all. Here is the output:

3900390 rows in set. Elapsed: 6.016 sec. Processed 10.00 million rows, 777.80 MB (1.66 million rows/s., 129.29 MB/s.) Peak memory usage: 21.30 MiB.

Here are some example rows:

9996. │ 17069 │ 2024-08-20 08:40:25 │       13 │ statusd: something event: station count: 5 │
9997. │ 17069 │ 2024-08-20 08:40:25 │       13 │ statusd: something event: station count: 5 │

It seems like the index has no effect. I would have expected it to skip at least some rows, even if it's defined badly.

Any ideas as to why?

Thanks!

Rich Raposa · Accepted Answer · 2024-11-14 16:19:17Z

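Your GRANULARITY is way off. It's not a number of rows; it's a number of granules, and each granule is already 8,192 rows (your index_granularity). So each index entry covers 256 * 8192 = 2,097,152 rows, and that whole block can only be skipped if none of its rows could match. That's going to be hard to skip.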
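To make the arithmetic concrete, here is a rough sketch that computes how many rows one skip-index block covers for a few GRANULARITY values, and about how many such blocks a 10M-row table would have. The column aliases are made up for illustration. The block count is only an estimate: it assumes all rows sit in a single part and ignores adaptive granularity.

```sql
-- Illustrative only: aliases are hypothetical, and the block count assumes
-- one data part with fixed 8192-row granules.
SELECT
    8192 AS index_granularity,                                   -- rows per primary-key granule
    arrayJoin([1, 2, 3, 256]) AS skip_index_granularity,         -- GRANULARITY in the INDEX clause
    index_granularity * skip_index_granularity AS rows_per_index_block,
    ceil(10000000 / rows_per_index_block) AS approx_index_blocks;
```

With GRANULARITY 256 the whole table is covered by only about five index blocks, so a single possible match in any of them forces roughly two million rows to be read.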

Try setting your GRANULARITY to 1 first, and then if you see results that are acceptable try setting it to 2 or 3. The index will quickly become ineffective with a higher GRANULARITY.
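A minimal sketch of that change, assuming the table and index names from the question and leaving the ngrambf_v1 parameters untouched. MATERIALIZE INDEX runs as a mutation, so existing parts are rebuilt in the background and the effect may not be visible immediately.

```sql
-- Recreate the skip index with GRANULARITY 1 (one index entry per 8192-row granule).
ALTER TABLE data.syslogs DROP INDEX ngrambf_message_index;

ALTER TABLE data.syslogs
    ADD INDEX ngrambf_message_index message
    TYPE ngrambf_v1(5, 65536, 3, 37) GRANULARITY 1;

-- Build the index for parts that already exist (new inserts get it automatically).
ALTER TABLE data.syslogs MATERIALIZE INDEX ngrambf_message_index;

-- Check whether the index is actually dropping granules for this query.
EXPLAIN indexes = 1
SELECT message FROM data.syslogs WHERE match(message, 'stat');
```

In the EXPLAIN output, the Skip entry for ngrambf_message_index reports how many granules remain out of the total. If the two numbers are equal, the index is not skipping anything for that query.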

5 Comments

Simeon Over a year ago

Thanks, will try this. I was actually attempting to control the number of rows in the granule. Can I do this only for the secondary index?

Simeon Over a year ago

Much, much slower with low granularity: 178.53 thousand rows/s (gran 1) vs. 1.38 million rows/s (gran 8192). Could be the dataset: I have very large chunks of identical data because it's simulated. So I'd have 500k rows that are exactly the same, and I'm trying to skip those. Ideally, IMO, for this dataset I need a granule of 500k and 1 index entry per granule.

Slach Over a year ago

You have two levels of granularity. The first is at the table level, in rows (and it is actually adaptive). The second is in data skipping indexes, and it sets how many primary-key granules are covered by one granule of the secondary index. So GRANULARITY 1 in the INDEX definition has no effect if the searched data is present in most of the granules in the data parts.

Rich Raposa Over a year ago

You're not understanding what I said. Leave your table's index_granularity at 8192, and set the GRANULARITY of ngrambf_message_index to 1

Simeon Over a year ago

Yes, this is what I did. I don't care about the primary index granularity; I changed the data skipping index one (the bloom filter).
