9

I see plenty of examples of importing a CSV into a PostgreSQL db, but what I need is an efficient way to import 500,000 CSVs into a single PostgreSQL db. Each CSV is a bit over 500KB (so a grand total of approximately 272GB of data).

The CSVs are identically formatted and there are no duplicate records (the data was generated programmatically from a raw data source). I have been searching and will continue to search online for options, but I would appreciate any direction on getting this done in the most efficient manner possible. I do have some experience with Python, but will dig into any other solution that seems appropriate.

Thanks!

1 Comment
    The fastest way to import data into PostgreSQL is to use the COPY command. Commented Sep 28, 2012 at 20:10

3 Answers

11

If you start by reading the PostgreSQL guide "Populating a Database" you'll see several pieces of advice:

  1. Load the data in a single transaction.
  2. Use COPY if at all possible.
  3. Remove indexes, foreign key constraints, etc. before loading the data, and restore them afterwards.

PostgreSQL's COPY statement already supports the CSV format:

COPY table (column1, column2, ...) FROM '/path/to/data.csv' WITH (FORMAT CSV)

so it looks as if you are best off not using Python at all, or using Python only to generate the required sequence of COPY statements.
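
For example, a minimal Python sketch of that second approach (the table name my_table and columns column1, column2 are placeholders, and the CSVs are assumed to have no header row) could generate one SQL script that wraps every COPY in a single transaction:

    # Rough sketch: generate one SQL script that COPYs every CSV inside a
    # single transaction. my_table, column1 and column2 are placeholders.
    import glob

    with open("import-commands.sql", "w") as out:
        out.write("BEGIN;\n")
        for path in sorted(glob.glob("/path/to/*.csv")):
            # COPY runs server-side, so these paths must be readable by the
            # PostgreSQL server process.
            out.write("COPY my_table (column1, column2) "
                      "FROM '%s' WITH (FORMAT CSV);\n" % path)
        out.write("COMMIT;\n")

Feeding the result to psql -f import-commands.sql then loads everything in a single transaction, in line with the advice above.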


5 Comments

Gareth - I assume from your response that you cannot use wildcards in the COPY statement? From what I can tell, I need some type of programming method to create a list of the files and then issue individual COPY commands. Please correct me if I am wrong. I hope I am :). It would be much easier if I could just issue a COPY statement with a wildcard *.csv.
Surely generating the sequence of COPY commands is easy? You could do it shell, for example: (for FILE in /path/to/*.csv; do echo "COPY table (column1, column2, ...) FROM '$FILE' WITH (FORMAT CSV);"; done) > import-commands.sql
Isn't that a bit inefficient? Generating a file with 500,000 Copy commands?
In the context of loading 272 GB of data, a few hundred thousand SQL commands are small change.
For others coming across this: you could also use Python or R (or whatever your preference is) to group some of those CSV files into bigger batches (as much as your RAM can handle) and then use the COPY that Postgres provides.
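
A rough sketch of that batching idea in Python with psycopg2 (the connection string, table and column names are placeholders; the CSVs are assumed to have identical columns and no header rows):

    # Sketch: concatenate CSVs into batches and stream each batch through a
    # single client-side COPY ... FROM STDIN.
    import glob
    import io
    import psycopg2

    BATCH_SIZE = 1000  # files per COPY; tune to available RAM

    conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
    files = sorted(glob.glob("/path/to/*.csv"))

    with conn, conn.cursor() as cur:  # one transaction for the whole load
        for i in range(0, len(files), BATCH_SIZE):
            buf = io.StringIO()
            for path in files[i:i + BATCH_SIZE]:
                with open(path) as f:
                    data = f.read()
                # Guard against files missing a trailing newline.
                buf.write(data if data.endswith("\n") else data + "\n")
            buf.seek(0)
            cur.copy_expert(
                "COPY my_table (column1, column2) FROM STDIN WITH (FORMAT CSV)",
                buf,
            )
    conn.close()
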
0

Nice chunk of data you have there. I'm not 100% sure about Postgres, but MySQL at least provides SQL commands to feed a CSV directly into a table. This bypasses any insert checks and so on, and is therefore more than an order of magnitude faster than ordinary insert operations.

So probably the fastest way to go is to create a simple Python script telling your Postgres server which CSV files, in which order, to hungrily devour into its endless tables.
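
As a hedged sketch of that idea with psycopg2 (the connection string, table and column names are placeholders): loop over the files and let the server COPY each one directly. Note that server-side COPY from a file requires superuser rights or, on PostgreSQL 11+, membership in the pg_read_server_files role.

    # Sketch: issue one server-side COPY per file, in whatever order sorted()
    # yields. my_table, column1, column2 and the DSN are placeholders.
    import glob
    import psycopg2
    from psycopg2 import sql

    conn = psycopg2.connect("dbname=mydb user=myuser")

    with conn, conn.cursor() as cur:
        for path in sorted(glob.glob("/path/to/*.csv")):
            # The server process reads the file itself, so no CSV data is
            # shipped through this client.
            cur.execute(
                sql.SQL("COPY my_table (column1, column2) FROM {} WITH (FORMAT CSV)")
                .format(sql.Literal(path))
            )
    conn.close()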

Comments

0

I use PHP and Postgres. I read the CSV file with PHP and build a string in the following format:

{{line1 column1, line1 column2, line1 column3}, {line2 column1, line2 column2, line2 column3}}

Take care to do it in a single transaction by passing the string as a parameter to a PostgreSQL function.

This lets me check all records, formatting, amount of data, etc., and I end up importing 500,000 records in about 3 minutes.

To read the data in the PostgreSQL function:

    -- The function name and signature below are placeholders; the original
    -- post only showed the body.
    CREATE OR REPLACE FUNCTION schema.import_rows(p_dados varchar[])
    RETURNS void AS $$
    DECLARE
        d varchar[];
    BEGIN
        -- Each SLICE of the 2-D array is one CSV row.
        FOREACH d SLICE 1 IN ARRAY p_dados
        LOOP
            INSERT INTO schema.table (
                column1,
                column2,
                column3
            )
            VALUES (
                d[1],
                d[2]::INTEGER, -- explicit conversion to INTEGER
                d[3]::BIGINT   -- explicit conversion to BIGINT
            );
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;
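
For anyone not using PHP, the same call can be sketched in Python with psycopg2 (the function name schema.import_rows, the connection string and the CSV path are placeholders, since the original post only shows the function body):

    # Sketch: read one CSV and hand all rows to the PL/pgSQL function above
    # as a single 2-D text array, i.e. one call and one transaction.
    import csv
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=myuser")

    with open("/path/to/data.csv", newline="") as f:
        rows = list(csv.reader(f))  # rows must all have the same column count

    with conn, conn.cursor() as cur:
        # psycopg2 adapts a list of lists of strings to the {{...},{...}}
        # array literal described above.
        cur.execute("SELECT schema.import_rows(%s::varchar[])", (rows,))
    conn.close()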

Comments
