I have a simple table in ClickHouse with 10M rows, defined like this:

CREATE TABLE data.syslogs
(
    `id` UInt32,
    `time` DateTime,
    `priority` UInt8,
    `message` String,
    INDEX ngrambf_message_index message TYPE ngrambf_v1(5, 65536, 3, 37) GRANULARITY 256
)
ENGINE = MergeTree
PRIMARY KEY (time, id)
ORDER BY (time, id, priority)
SETTINGS index_granularity = 8192

I'm running a match query over it like so:

SELECT message FROM syslogs WHERE match(syslogs.message, 'stat');

My issue is that this seems to be skipping no data at all. Here is the output:

3900390 rows in set. Elapsed: 6.016 sec. Processed 10.00 million rows, 777.80 MB (1.66 million rows/s., 129.29 MB/s.) Peak memory usage: 21.30 MiB.

Here are some example rows:

9996. │ 17069 │ 2024-08-20 08:40:25 │       13 │ statusd: something event: station count: 5 │
9997. │ 17069 │ 2024-08-20 08:40:25 │       13 │ statusd: something event: station count: 5 │

It seems like the index doesn't have any effect. I would have expected it to skip at least some rows, even if it's defined badly.

Any ideas as to why?

Thanks!

1 Answer

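Your GRANULARITY is way off. It's not a number of rows; it's a number of granules, and each granule already holds 8,192 rows (your index_granularity). So each index block covers 256 * 8192 = 2,097,152 rows, and a block can only be skipped if none of those rows can match. That's going to be hard to skip.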
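
To make the arithmetic concrete, here is a small sketch that can be run in clickhouse-client. It assumes every granule holds exactly 8,192 rows and that all 10M rows sit in a single data part; with adaptive granularity or several parts the real numbers will differ somewhat.

```sql
-- Rows covered by one skip-index block: GRANULARITY * index_granularity.
-- Assumption: fixed 8192-row granules and one data part.
SELECT
    256 * 8192                      AS rows_per_index_block,  -- 2097152
    ceil(10000000 / (256 * 8192))   AS index_blocks_in_table; -- 5
```

Under those assumptions, the whole table is described by about five bloom filters, so a single matching row anywhere in a 2-million-row block forces the entire block to be read.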

Try setting your GRANULARITY to 1 first, and then if you see results that are acceptable try setting it to 2 or 3. The index will quickly become ineffective with a higher GRANULARITY.
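
A minimal sketch of how the index could be rebuilt with GRANULARITY 1, reusing the table and index names from the question. The bloom filter parameters are copied unchanged from the original definition, and the `mutations_sync` setting is an optional addition here so that the statement waits for the rebuild to finish.

```sql
-- Drop the existing skip index and recreate it with one index block per granule.
ALTER TABLE data.syslogs DROP INDEX ngrambf_message_index;

ALTER TABLE data.syslogs
    ADD INDEX ngrambf_message_index message
    TYPE ngrambf_v1(5, 65536, 3, 37) GRANULARITY 1;

-- ADD INDEX only applies to newly written parts; rebuild it for existing data.
-- mutations_sync = 2 (optional) makes the statement wait for the mutation.
ALTER TABLE data.syslogs MATERIALIZE INDEX ngrambf_message_index
    SETTINGS mutations_sync = 2;

-- Check whether the index is used and how many granules it drops.
EXPLAIN indexes = 1
SELECT message FROM data.syslogs WHERE match(message, 'stat');
```

If the index is applied, the EXPLAIN output should list it under the skip indexes along with the number of granules selected before and after it; if every granule survives, the index is not filtering anything for this query.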

5 Comments

Thanks, will try this, I was actually attempting to control the number of rows in the granule. Can I do this only for the secondary index?
Much, much slower with low granularity: 178.53 thousand rows/s (gran 1) vs. 1.38 million rows/s (gran 8192). Could be the dataset: I have very large chunks of the same data, as it's simulated. So I'd have 500k rows that are exactly the same, and I'm trying to skip those. Ideally, IMO, for this dataset I need a granule of 500k and 1 index entry per granule.
You have two levels of granularity. The first is at the table level, in rows (and this is actually adaptive). The second is in the data skipping index, and it means how many primary-key granules are covered by one granule of the secondary index. So GRANULARITY 1 in the INDEX definition will have no effect if the searched data is present in most of the granules in the data parts.
You're not understanding what I said. Leave your table's index_granularity at 8192, and set the GRANULARITY of ngrambf_message_index to 1
Yes this is what I did, I don't care about the primary index granularity, I changed the data skipping index one (the bloom filter)
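
The two levels of granularity discussed in these comments can be inspected from the system tables. This is a rough sketch that assumes the database and table names from the question; the exact set of columns available may vary between ClickHouse versions.

```sql
-- Table level: rows and marks (granules) per active data part.
SELECT name, rows, marks
FROM system.parts
WHERE database = 'data' AND table = 'syslogs' AND active;

-- Index level: how many granules each skip-index block covers.
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = 'data' AND table = 'syslogs';
```

Dividing `rows` by `marks` gives a rough rows-per-granule figure for each part, and multiplying that by `granularity` approximates the rows covered by one skip-index block.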
