Most Efficient Way to Create an Index in Postgres

Question

Is it more efficient to create an index after loading data is complete or before, or does it not matter?

For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:

1. Create index when table is created, then load each file into table; or
2. Create index after all files have been loaded into the table.

The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:

CREATE INDEX idx_name ON table_name (column_name);

My data loading uses COPY FROM.

Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.

mvp · Accepted Answer · 2016-06-20 09:33:08Z

Your observation is correct: it is much more efficient to load the data first and only then create the index. The reason is that maintaining an index during inserts is expensive, because every inserted row also has to be added to the index. Building the index once, after all the data is in place, is much faster.
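As a minimal sketch of the load-then-index order, reusing the table, column and index names from the question: the file paths are hypothetical placeholders, and the `maintenance_work_mem` value is only an assumption to be sized against the memory actually available.

```sql
-- Load every file while the table has no index to maintain.
-- COPY from a server-side file requires superuser rights;
-- the paths below are hypothetical.
COPY table_name FROM '/path/to/file_001';
COPY table_name FROM '/path/to/file_002';
-- repeat for the remaining files

-- Optional: give the index build more sort memory for this session only.
-- '1GB' is an illustrative value, not a recommendation.
SET maintenance_work_mem = '1GB';

-- Build the index once, over the complete data set.
CREATE INDEX idx_name ON table_name (column_name);

-- Refresh planner statistics after the bulk load.
ANALYZE table_name;
```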

It goes even further: if you need to import a large amount of data into an existing indexed table, it is often more efficient to drop the existing index first, import the data, and then re-create the index.
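A rough sketch of that drop, import, re-create sequence, again with the question's names and a hypothetical file path. Keep in mind that queries running between the DROP and the CREATE have no index to use, and that an index backing a unique or primary key constraint cannot simply be dropped this way.

```sql
-- Remove the index so the bulk load does not have to maintain it.
DROP INDEX idx_name;

-- Import the new data (hypothetical server-side path).
COPY table_name FROM '/path/to/new_data';

-- Rebuild the index in one pass over the full table.
CREATE INDEX idx_name ON table_name (column_name);
```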

One downside of creating the index after the import is that CREATE INDEX locks the table against writes while it runs (reads are still allowed), and that may take a long time; loading into an already-indexed table needs no such lock. However, in PostgreSQL 8.2 and later you can use CREATE INDEX CONCURRENTLY, which does not block writes during indexing (with some caveats).
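For illustration, a concurrent build of the question's index might look like the sketch below. Two of the caveats are that the statement cannot run inside a transaction block, and that a failed build leaves behind an invalid index that has to be dropped before retrying; the catalog query is one possible way to check for that.

```sql
-- Must be run outside BEGIN ... COMMIT.
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);

-- List any indexes left invalid by a failed concurrent build.
SELECT c.relname AS invalid_index
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- If idx_name shows up above, drop it and run the build again:
-- DROP INDEX idx_name;
```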

edited Jun 20, 2016 at 9:33

answered Sep 2, 2013 at 20:41

mvp


8 Comments

user330315 Over a year ago

"when table is locked no one can read or write" - I don't think that's true. When a CREATE INDEX is running, the table can still be read, but not updated if I'm not mistaken.

mvp Over a year ago

@a_horse_with_no_name: I stand corrected. CREATE INDEX acquires a SHARE lock, which protects the table against concurrent data changes. CREATE INDEX CONCURRENTLY acquires a SHARE UPDATE EXCLUSIVE lock, which protects the table against concurrent schema changes and VACUUM runs. postgresql.org/docs/9.1/static/explicit-locking.html

jjanes Over a year ago

Often a newly-made 45 Gigabyte table is going to be pretty much useless until after it is indexed. Locking a useless table is no loss, so get the indexing over with as fast as possible.

mat_boy Over a year ago

Which caveats? Can you explain?

Stoic_Observer Over a year ago

@mat_boy I'm about 2 years too late here, but for the lazy: when creating an index concurrently, PostgreSQL performs two scans of the table, and each scan must wait for existing transactions that have modified the table to terminate. This makes the index build take longer and use more CPU, which can slow down other queries against the database while the index is being created.
