
I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.

I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.

Here's the code:

import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
    con = psycopg2.connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except psycopg2.Error:
    print("Can't connect to database.")
    sys.exit(1)
cur = con.cursor()

upload_file = io.StringIO()

mypath = '[redacted]'   # directory containing the input CSV files
file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
    id_match = re.search(r'.*-(\d+)\.csv', file)
    if id_match:
        id = id_match.group(1)
        file_name = id_match.group()
        with open(mypath+file_name) as fh:
            id_reader = csv.reader(fh)
            next(id_reader, None)   # Skip the header row
            for row in id_reader:
                # [stuff goes here to get desired values from file]
                if upload_file.getvalue() != '': upload_file.write('\n')
                upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))

print(upload_file.getvalue())   # prints output that looks like I expect it to,
                                # with thousands of rows that seem to have the right values in the right fields

cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()

if con:
    con.close()

This runs without error, but a select query in psql still shows no records in the table. What am I missing?

Edit: I ended up giving up and writing the data to a temporary file, then uploading the file. That worked without any trouble... I'd obviously rather not have the temporary file, though, so I'm happy to hear suggestions if someone sees the problem.
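For reference, a rough sketch of what that temporary-file workaround looks like (the table and column names are placeholders, and upload_file is built exactly as above):

import tempfile

# Flush the in-memory buffer out to a real file on disk...
with tempfile.NamedTemporaryFile(mode='w', suffix='.tsv', delete=False) as tmp:
    tmp.write(upload_file.getvalue())
    tmp_name = tmp.name

# ...then reopen it and hand it to copy_from. Reopening the file puts the
# read position back at the start, which is why this version works.
with open(tmp_name) as fh:
    cur.copy_from(fh, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()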

  • The code seems OK. You said your print line outputs thousands of rows but you passed '\t' as a separator to copy_from. Maybe that's the problem? Commented Oct 31, 2016 at 19:46
  • I'm reading through a couple dozen files that each have a couple hundred records, so it's a total of a few thousands of rows. This line of the code makes them show up as separate rows in the print statement: if upload_file.getvalue() != '': upload_file.write('\n') Commented Oct 31, 2016 at 19:48
  • Yup, that might be the problem then. Try changing sep='\t' to sep='\n' in your copy_from parameters and see if anything changes in the database. Commented Oct 31, 2016 at 20:04
  • psycopg2.DataError: COPY delimiter cannot be newline or carriage return Commented Oct 31, 2016 at 20:09
  • In that case it might be better to change upload_file.write('\n') to upload_file.write('\t') Commented Oct 31, 2016 at 20:10

1 Answer

When you write to an io.StringIO object (or any other file-like object), the file pointer is left positioned after the last character written. So, when you do

f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()

the file pointer stays at the end of the file and s contains an empty string.


To read the contents (as opposed to getvalue(), which ignores the file position), you must first move the file pointer back to the beginning, e.g. with seek(0):

upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns=('id', 'col_1', 'col_2'))

This allows copy_from to read from the beginning and import all the lines in your upload_file.
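As an aside on the separator discussion in the comments above: sep is the column delimiter within a row, while rows are always separated by newlines (which is why COPY rejects sep='\n'). A minimal sketch with made-up values, showing the shape copy_from expects:

buf = io.StringIO()
buf.write('1\tfoo\tbar\n')   # one row: columns joined by \t
buf.write('2\tbaz\tqux')     # the next row starts after the newline
buf.seek(0)                  # rewind before handing it to copy_from
cur.copy_from(buf, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))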


Don't forget that you read and keep all the files in memory. That might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel.


3 Comments

Don't forget that you read and keep all the files in memory. That might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel. This makes me sad though; the whole reason I was doing it this way was that I thought it'd be the fastest way to upload thousands of records at once. I'll dig around and see if I can find a better suggestion in the archives.
I'm not saying it's bad, just that it's something to keep in mind if memory is a concern. And if it's only a few thousand records, I wouldn't worry too much. If memory is a concern, you might try writing the modified CSV to a pipe that copy_from then reads from, if that's possible.
Ah, OK. So it's about the definition of large and small: thousands of records is large if the question is about doing INSERT ... VALUES statements, but small if the question is "will it be painful to store these in memory for a few seconds during a daily batch update". Thanks so much.
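For anyone who lands here with a bigger import: a sketch of a bounded-memory variant of the loop from the question, which flushes the buffer to the database once per input file instead of accumulating everything first (untested against the original schema; table and column names are the same placeholders):

for file in file_list:
    buf = io.StringIO()
    # ... write this file's processed rows into buf, one '\t'-separated row per line ...
    buf.seek(0)                       # rewind so copy_from reads from the start
    cur.copy_from(buf, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()                          # a single commit keeps the whole load atomic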
