
I have the following table:

CREATE TABLE recipemetadata
(
  --Lots of columns
  diet_glutenfree boolean NOT NULL
);

Almost every row will be set to FALSE unless someone comes up with some crazy new gluten-free diet that sweeps the country.

I need to be able to very quickly query for rows where this value is true. I've created the index:

CREATE INDEX IDX_RecipeMetadata_GlutenFree ON RecipeMetadata(diet_glutenfree) WHERE diet_glutenfree;

It appears to work, however I can't figure out how to tell if indeed it's only indexing rows where the value is true. I want to make sure it's not doing something silly like indexing any rows with any value at all.

Should I add an operator to the WHERE clause, or is this syntax perfectly valid? Hopefully this isn't one of those super easy RTFM questions that will get downvoted 30 times.
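One way to confirm the predicate was recorded is to pull the index definition straight out of the catalog with `pg_get_indexdef` (a sketch, using the index name created above):

```sql
-- Show the full index definition, including its WHERE predicate:
SELECT pg_get_indexdef('idx_recipemetadata_glutenfree'::regclass);
-- A partial index definition will end with: WHERE diet_glutenfree
```

In psql, `\d recipemetadata` also lists each index along with any partial-index predicate.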

UPDATE:

I've gone ahead and added 10,000 rows to RecipeMetadata with random values. I then did an ANALYZE on the table and a REINDEX just to be sure. When I run the query:

select recipeid from RecipeMetadata where diet_glutenfree;

I get:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=5010 width=16)'
'  Filter: diet_glutenfree'

So, it appears to be doing a sequential scan on the table even though only about half the rows have this flag. The index is being ignored.

If I do:

select recipeid from RecipeMetadata where not diet_glutenfree;

I get:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=5016 width=16)'
'  Filter: (NOT diet_glutenfree)'

So no matter what, this index is not being used.
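When the planner's cost estimate favours a sequential scan, you can temporarily disable seq scans to confirm the index is at least usable (a diagnostic sketch only; this is a session setting, not something to leave on in production):

```sql
-- Force the planner away from a sequential scan for this session:
SET enable_seqscan = off;
EXPLAIN SELECT recipeid FROM recipemetadata WHERE diet_glutenfree;
RESET enable_seqscan;
```

If the plan still shows a seq scan with this set, the index genuinely cannot serve the query; if it switches to an index scan, the earlier choice was purely a cost decision.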

4 Comments
  • Please add a link to your PostgreSQL mailing list post from the archives so people can connect this discussion with that one. It'd be nice if you'd post a follow-up to your mailing list post with a link to this, too. If you're going to cross-post in multiple places, please say so to prevent people from repeating work. Commented Dec 15, 2011 at 5:45
  • Not a problem, I'll do this in the future (I usually don't post in both places). Commented Dec 15, 2011 at 5:47
  • BTW, I think the short answer to your question is "Yes" ... but if you're concerned, fill a table with some dummy data, ANALYZE the table, then use EXPLAIN ANALYZE to examine the plans of some queries that should hit the partial index. Commented Dec 15, 2011 at 5:56
  • Yup, doing that now actually. Commented Dec 15, 2011 at 6:08

2 Answers


I've confirmed the index works as expected.

I re-created the random data, only this time set diet_glutenfree to random() > 0.9 so there's only a 10% chance of an on bit.
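The regeneration step above might look something like this (a sketch; it assumes the table's other columns have defaults, and the 10,000-row count matches the earlier test):

```sql
-- Start fresh, then insert rows where ~10% have diet_glutenfree = true
TRUNCATE recipemetadata;

INSERT INTO recipemetadata (diet_glutenfree)
SELECT random() > 0.9
FROM generate_series(1, 10000);

-- Refresh planner statistics so the estimates reflect the new distribution
ANALYZE recipemetadata;
```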

I then re-created the indexes and tried the query again.

SELECT RecipeId from RecipeMetadata where diet_glutenfree;

Returns:

'Index Scan using idx_recipemetadata_glutenfree on recipemetadata  (cost=0.00..135.15 rows=1030 width=16)'
'  Index Cond: (diet_glutenfree = true)'

And:

SELECT RecipeId from RecipeMetadata where NOT diet_glutenfree;

Returns:

'Seq Scan on recipemetadata  (cost=0.00..214.26 rows=8996 width=16)'
'  Filter: (NOT diet_glutenfree)'

It seems my first attempt was polluted since PG estimates it's faster to scan the whole table rather than hit the index if it has to load over half the rows anyway.

However, I think I would get these exact results on a full index of the column. Is there a way to verify the number of rows indexed in a partial index?

UPDATE

The index is around 40k. I created a full index of the same column and it's over 200k, so it looks like it's definitely partial.
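For reference, that size comparison can be done with `pg_relation_size` (the partial index name matches the one created above; the full-column index name here is hypothetical):

```sql
-- Compare on-disk size of the partial index vs. a full index on the same column
SELECT pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree'))      AS partial_idx,
       pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree_full')) AS full_idx;
```

A partial index covering ~10% of the rows should come out correspondingly smaller than the full index, which is exactly the 40k-vs-200k result above.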


3 Comments

Yep, bang on. "About half" the rows won't cause Pg to favour the index. You'll need much better selectivity than 50% before an index scan is faster than a seqscan.
Thanks so much! I created a full index as well to compare sizes. It's definitely working as expected.
Note: you seem to have only 10K records. The 'working set' for your query will probably fit in core. The optimisation you perform is an optimisation in terms of cpu-usage. Once the "working set" is bigger than available buffer space, your query will become I/O bound, and the index won't help you anymore (unless your rows are so large that only a few fit on a disk-page).

An index on a one-bit field makes no sense. To understand the decisions made by the planner, you must think in terms of pages, not in terms of rows.

For 8K pages and an (estimated) row size of 80 bytes, there are about 100 rows on every page. Assuming a random distribution, the chance that a page contains no rows with a true value is negligible: pow(0.5, 100), about 8e-31 (and the same for 'false', of course). Thus for a query on diet_glutenfree = true, every page has to be fetched anyway and filtered afterwards. Using an index would only cause more pages (the index itself) to be fetched.
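That probability estimate can be checked directly in SQL:

```sql
-- Chance that a 100-row page contains only false (or only true) values
-- at 50/50 selectivity: vanishingly small, so every page holds a match.
SELECT power(0.5, 100);  -- ≈ 7.9e-31
```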

2 Comments

"An index on a one-bit field makes no sense". Postgres bools require 8 bits of storage: postgresql.org/docs/8.4/static/datatype-boolean.html "Assuming a random distribution" -- this is potentially a big assumption. Far less than 50% of foods are typically gluten free. Insightful response, regardless.
"one-bit field" was about the information content, not about the required storage size. There might be a storage structure possible for effectively storing/indexing/retrieving bitfields (think: judy-trees) these might need fewer disk pages to be fetched, but it will be hard to combine them with the ATOM requirements for a RDBMS.
