PostgreSQL: suitable multi-column indexing for (timestamp, string)

Question

I have a table with a timestamp field (timestamp without time zone, in the format yyyy-MM-dd HH:mm:ss.SSS) and a non-unique string field.

Consider an example:
Assume the table is User(userId, userType, modifiedOn). userType is a non-unique key and modifiedOn is a timestamp without time zone.

Other jobs update the User table, based on some eligibility criteria, every 20 to 40 minutes.

userType can have at most 200 distinct values, while the User table has millions of rows.

What type of indexing should I use?

Currently I am trying

-- "user" is a reserved word in PostgreSQL, so the table name has to be quoted
CREATE INDEX user_modifiedOn_userType_index ON "user" USING btree (modifiedOn, userType);

Note:
I filter on a time range like this: modifiedOn between '04-APR-18 07:44:21' and '06-APR-18 07:44:21'.
I am currently using PostgreSQL 9.6 and will move to 10.3 later.

But I have doubts:

1) How much does the order of columns matter in a multi-column index?

Thought: modifiedOn will have millions of distinct values, so it should come first, while userType has barely 200 distinct values.

2) Is it possible to index a timestamp only up to the hour or minute? If so, how much would it affect performance?

The indexing strategy depends on the queries, not the tables. The rule of thumb is: index for equality first, then for ranges.
– user330315, Commented Apr 27, 2018 at 14:18
@a_horse_with_no_name But userType has at most 200 distinct values, so I think modifiedOn should come first. My query is either Select * from user where modifiedOn between ? and ? and userType = ? or Select * from user where modifiedOn >= ?.
– Badman, Commented Apr 27, 2018 at 14:37
@a_horse_with_no_name This is the critical one: Select * from user where modifiedOn between ? and ? and userType = ?.
– Badman, Commented Apr 27, 2018 at 14:41

Laurenz Albe · Accepted Answer · 2018-04-30 06:59:34Z

2

TL;DR: In light of your most frequent query, you should index on (usertype, modifiedon). An index on modifiedon alone, without the usertype column, would not be optimal, but it would still be useful.

Try to think of the way data is organized in an index: effectively, it is a sorted list, ordered first by the first index column and then – within each group of equal values of the first column – by the second index column.

So if you index on (modifiedon, usertype), the index will look similar to this:

 modifiedon |  usertype
------------+-------------
 2018-01-01 | basicuser
 2018-01-01 | normaluser
 2018-01-01 | superuser
 2018-01-01 | .........
 2018-01-02 | normaluser
 2018-01-02 | .........
 .......... | .........
 2018-04-29 | basicuser
 2018-04-29 | normaluser
 2018-04-29 | xpertuser

An index scan is efficient only if the entries you are looking for form a contiguous block in the index.

Now if your query is

SELECT * FROM "user" WHERE modifiedon BETWEEN $1 AND $2 AND usertype = $3;

the index can be used for the first condition, because the entries with modifiedon between two dates form a contiguous block of index entries. However, the second condition cannot narrow the scan any further, because the index entries for a given usertype are not next to each other within the block selected by the first condition: every entry in the date range has to be read and checked against usertype.

However, if you have an index on (usertype, modifiedon), it will look like this:

 usertype   | modifiedon
------------+-------------
 basicuser  | 2018-01-01
 basicuser  | 2018-01-02
 basicuser  | ..........
 basicuser  | 2018-04-29
 normaluser | 2018-01-01
 normaluser | 2018-01-02
 normaluser | ..........
 normaluser | 2018-04-29
 .......... | ..........
 xpertuser  | 2018-03-01
 xpertuser  | ..........
 xpertuser  | 2018-04-29

Here the entries that match the query form a single contiguous block in the index, so the index can be used for the whole condition.

So this combined index is the best index for the query.
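One way to check this on your own data is to build the index and look at the execution plan. The following is only a sketch under assumptions: the table user_demo, its generated contents (one row per minute, 200 made-up user types) and the index name are hypothetical stand-ins for the table in the question, not anything taken from it.

```sql
-- Hypothetical demo table mirroring User(userId, userType, modifiedOn).
CREATE TABLE user_demo (
    userid     bigint PRIMARY KEY,
    usertype   text NOT NULL,
    modifiedon timestamp without time zone NOT NULL
);

-- One million rows, 200 distinct user types, one row per minute
-- starting at 2018-01-01 (made-up data for illustration only).
INSERT INTO user_demo (userid, usertype, modifiedon)
SELECT g,
       'type_' || (g % 200),
       timestamp '2018-01-01 00:00:00' + g * interval '1 minute'
FROM generate_series(1, 1000000) AS g;

-- Equality column first, range column second.
CREATE INDEX user_demo_type_modified_idx
    ON user_demo (usertype, modifiedon);

-- Refresh planner statistics after the bulk load.
ANALYZE user_demo;

-- Show the chosen plan together with actual timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM user_demo
WHERE usertype = 'type_42'
  AND modifiedon BETWEEN '2018-04-04 07:44:21' AND '2018-04-06 07:44:21';
```

To compare, you could drop this index, create one on (modifiedon, usertype) instead, and rerun the EXPLAIN; the buffer counts give a rough idea of how much of each index had to be read. The actual numbers depend on your data distribution and configuration.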

However, it may be that there are only very few distinct usertypes. Then the usertype condition is not very selective, and there is little benefit in including the usertype column in the index. In fact, it could be harmful: the extra column makes the index larger, which means more work during the index scan, so you could end up worse off overall.
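If you want to weigh that trade-off, you could create both candidate indexes and compare their sizes. This is a rough sketch that assumes the table from the question is named "user" (quoted, since user is a reserved word) with columns usertype and modifiedon; both index names are made up for the example.

```sql
-- Hypothetical index names; adjust table and column names to your schema.
CREATE INDEX user_modifiedon_idx
    ON "user" (modifiedon);
CREATE INDEX user_usertype_modifiedon_idx
    ON "user" (usertype, modifiedon);

-- List every index on the table with its on-disk size.
SELECT indexrelname                                 AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'user'
ORDER BY pg_relation_size(indexrelid) DESC;
```

Running the critical query under EXPLAIN (ANALYZE, BUFFERS) with both indexes in place then shows which one the planner prefers for your data. Keep only the index that pays off, since every extra index also has to be maintained by the update jobs.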

edited Apr 30, 2018 at 6:59

answered Apr 27, 2018 at 15:09

Laurenz Albe


2 Comments

Badman Over a year ago

Can you give a reference to support your points? Why index on (user_type, modifiedon) rather than (modifiedon, user_type)? See my discussion with "a_horse_with_no_name" above to understand more about my queries.

Laurenz Albe Over a year ago

Fair enough. I have expanded the answer substantially by adding an explanation. If you need a reference, a simple web search should turn up plenty of information.
