Is it more efficient to create an index after loading data is complete or before, or does it not matter?

For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:

  1. Create index when table is created, then load each file into table; or
  2. Create index after all files have been loaded into the table.

The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:

CREATE INDEX idx_name ON table_name (column_name);

My data loading uses COPY FROM.

Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). Which scenario is more efficient? Initial testing suggests that loading all the files and then creating the index (scenario 2) is faster, but I have not done a rigorous comparison of the two approaches.

1 Answer

Your observation is correct: it is much more efficient to load the data first and create the index afterwards. Maintaining an index during inserts is expensive, because every inserted row also has to be added to the index. Building the index once, after all the data is in place, is much faster.
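
As a rough sketch of that ordering, reusing the placeholder names from the question (table_name, column_name, idx_name); the file paths and the CSV format are assumptions for illustration only:

```sql
-- Load every file first, while the table has no index yet.
-- The paths are hypothetical; COPY FROM a file name reads the file on the server.
COPY table_name FROM '/path/to/file_001.csv' WITH CSV;
COPY table_name FROM '/path/to/file_002.csv' WITH CSV;
-- Repeat for the remaining files.

-- Build the index once, after all rows are in place.
CREATE INDEX idx_name ON table_name (column_name);
```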

It goes even further: if you need to import a large amount of data into an existing indexed table, it is often more efficient to drop the existing index first, import the data, and then re-create the index.
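
A minimal sketch of that drop, load, re-create sequence, again with the question's placeholder names and a hypothetical file path:

```sql
-- Drop the index so the bulk load does not have to maintain it.
DROP INDEX idx_name;

-- Hypothetical path and format; substitute the real import here.
COPY table_name FROM '/path/to/new_data.csv' WITH CSV;

-- Rebuild the index over the old and new rows in one pass.
CREATE INDEX idx_name ON table_name (column_name);
```

Keep in mind that while the index is missing, queries that depend on it fall back to slower plans, so this is best suited to a window in which the table is not being queried heavily.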

One downside of creating the index after importing is that the table must be locked against writes while the index is built (reads are still allowed), and that may take a long time. In PostgreSQL 8.2 and later, however, you can use CREATE INDEX CONCURRENTLY, which does not block writes during indexing (with some caveats).
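
For illustration, the concurrent variant of the statement from the question would look roughly like this (same placeholder names):

```sql
-- Must be run outside a transaction block (no surrounding BEGIN ... COMMIT).
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
```

If a concurrent build fails partway through, it can leave behind an invalid index, which has to be dropped before trying again.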

8 Comments

"when table is locked no one can read or write" - I don't think that's true. When a CREATE INDEX is running, the table can still be read, but not updated if I'm not mistaken.
@a_horse_with_no_name: I stand corrected. CREATE INDEX must acquire SHARE lock, which protects a table against concurrent data changes. CREATE INDEX CONCURRENTLY must acquire SHARE UPDATE EXCLUSIVE lock, which protects a table against concurrent schema changes and VACUUM runs. postgresql.org/docs/9.1/static/explicit-locking.html
Often a newly-made 45 Gigabyte table is going to be pretty much useless until after it is indexed. Locking a useless table is no loss, so get the indexing over with as fast as possible.
Which caveats? Can you explain?
@mat_boy I'm about 2 years too late here, but for the lazy... when creating an index concurrently, PostgreSQL performs two scans of the table, and each scan must wait for existing transactions that have modified the table to terminate. This makes the index creation take longer and use more CPU, which can slow down queries against this database while the index is being created.