
I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.

    import csv

    with open(file, "rb") as csvfile:
        re = csv.DictReader(csvfile)
        for row in re:
            # insert row['column_name'] into DB

For CSV files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the CSV file with its column names, which is why I chose DictReader, since it easily provides column-level access to my CSV files. Can anyone tell me why this is happening and how it can be avoided?

2 Comments

  • stackoverflow.com/questions/24868856/… Commented Apr 23, 2015 at 6:30
  • Although not answering your actual question, if you need to load the data as is, it could be easier and faster to use the DB's own facilities (for example, COPY table(col1, col2) FROM file WITH CSV in Postgres or LOAD DATA INFILE in MySQL); see the sketch below. Commented Apr 23, 2015 at 6:31
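
For the bulk-load route mentioned above, here is a minimal sketch of the Postgres variant using psycopg2's copy_expert; the connection string, table name, and column names are placeholders rather than anything taken from the question:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
    with conn, conn.cursor() as cur, open("data.csv", "r") as f:
        # Let Postgres parse and load the file itself; HEADER skips the first line
        cur.copy_expert(
            "COPY my_table (col1, col2) FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    conn.close()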

2 Answers

4

The DictReader does not load the whole file into memory but reads it in chunks, as explained in the answer suggested by DhruvPathak.

But depending on your database engine, the actual write to disk may only happen at commit. That means that the database (and not the CSV reader) keeps all the data in memory and eventually exhausts it.

So you should try to commit every n records, with n typically between 10 and 1000, depending on the size of your rows and the available memory.
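
As an illustration, here is a minimal sketch of batched commits while streaming the rows with DictReader; the sqlite3 connection, table name, and column name are stand-ins for whatever engine or ORM is actually in use:

    import csv
    import sqlite3  # stand-in for your real database driver / ORM session

    BATCH_SIZE = 1000  # tune to your row size and available memory

    conn = sqlite3.connect("example.db")  # hypothetical database
    cur = conn.cursor()

    with open("data.csv", "r") as csvfile:
        reader = csv.DictReader(csvfile)  # streams the file row by row
        for i, row in enumerate(reader, start=1):
            cur.execute(
                "INSERT INTO my_table (column_name) VALUES (?)",
                (row["column_name"],),
            )
            if i % BATCH_SIZE == 0:
                conn.commit()  # flush every BATCH_SIZE rows

    conn.commit()  # commit the final partial batch
    conn.close()

If you are going through SQLAlchemy (as the comments below suggest), the same idea applies: commit the session every n rows so pending objects do not keep accumulating in memory.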


3 Comments

So I guess it is SQLAlchemy that is eating my memory
@Tania: just try to commit every n-th statement and you'll get confirmation :-)
Yes, I tried. Looks like that's my problem. Any ways to avoid it?
1

If you don't need all the columns at once, you can simply read the file line by line, as you would with a plain text file, and parse each row. The exact parsing depends on your data format, but you could do something like:

delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)  # read the header line once
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: i for i, hdr in enumerate(headers)}  # map column name -> index
    for line in fil:
        row = line.strip().split(delimiter)
        # do something with row[dic_headers['column_name']]

This is a very simple example, but it can be made more elaborate. For example, it does not work if your data contains commas inside field values.
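
If commas can appear inside quoted fields, the standard library's csv.reader still reads one row at a time but handles the quoting for you; a minimal sketch along the same lines (the filename and column name are placeholders):

    import csv

    with open(filename, "r") as fil:
        reader = csv.reader(fil)  # streams rows and honours quoted commas
        headers = next(reader)
        dic_headers = {hdr: i for i, hdr in enumerate(headers)}
        for row in reader:
            value = row[dic_headers["column_name"]]
            # do something with value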

6 Comments

Can you please tell me which line in my previous code takes all the columns at once?
AFAIK the csv readers already iterate over lines internally and do not load the whole file into memory, so I highly doubt that this really solves OP's problem
I just saw that and upvoted your answer. We learn every day.
headers = headers.strip().split(delimiter) results in the error: built-in method has no attribute 'split'
@Tania As suggested by DhruvPathak and Serge Ballesta, this is most likely not solving your memory error, as the DictReader does not put the entire file into memory
