
I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.

    import csv

    with open(file, "rb") as csvfile:
        re = csv.DictReader(csvfile)
        for row in re:
            # insert row['column_name'] into DB

For CSV files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the CSV file with its column names, which is why I chose DictReader, since it easily provides column-level access to my CSV files. Can anyone tell me why this is happening and how it can be avoided?

2 Comments

  • stackoverflow.com/questions/24868856/… Commented Apr 23, 2015 at 6:30
  • Although not answering your actual question, if you need to load the data as is, it could be easier and faster to use the DB's own facilities (for example, COPY table(col1, col2) FROM file WITH CSV in Postgres or LOAD DATA INFILE in MySQL); see the sketch below. Commented Apr 23, 2015 at 6:31
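
For the bulk-load route mentioned above, here is a minimal sketch of the Postgres variant using psycopg2's copy_expert; the connection string, table name, and column names are placeholders rather than anything taken from the question:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
    with conn, conn.cursor() as cur, open("data.csv", "r") as f:
        # Let Postgres parse and load the file itself; HEADER skips the first line
        cur.copy_expert(
            "COPY my_table (col1, col2) FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    conn.close()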

2 Answers

4

The DictReader does not load the whole file into memory but reads it in chunks, as explained in the answer suggested by DhruvPathak.

But depending on your database engine, the actual write to disk may only happen at commit. That means that the database (and not the CSV reader) keeps all the data in memory and eventually exhausts it.

So you should try to commit every n records, with n typically between 10 and 1000, depending on the size of your rows and the available memory.
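
As an illustration, here is a minimal sketch of batched commits while streaming the rows with DictReader; the sqlite3 connection, table name, and column name are stand-ins for whatever engine or ORM is actually in use:

    import csv
    import sqlite3  # stand-in for your real database driver / ORM session

    BATCH_SIZE = 1000  # tune to your row size and available memory

    conn = sqlite3.connect("example.db")  # hypothetical database
    cur = conn.cursor()

    with open("data.csv", "r") as csvfile:
        reader = csv.DictReader(csvfile)  # streams the file row by row
        for i, row in enumerate(reader, start=1):
            cur.execute(
                "INSERT INTO my_table (column_name) VALUES (?)",
                (row["column_name"],),
            )
            if i % BATCH_SIZE == 0:
                conn.commit()  # flush every BATCH_SIZE rows

    conn.commit()  # commit the final partial batch
    conn.close()

If you are going through SQLAlchemy (as the comments below suggest), the same idea applies: commit the session every n rows so pending objects do not keep accumulating in memory.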


3 Comments

So I guess it is SQLAlchemy that is eating my memory
@Tania: just try to commit every n-th statement and you'll get confirmation :-)
Yes, I tried. Looks like that's my problem. Any ways to avoid it?
1

If you don't need all the columns at once, you can simply read the file line by line, as you would with a plain text file, and parse each row. The exact parsing depends on your data format, but you could do something like:

delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)  # read the header line once
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: i for i, hdr in enumerate(headers)}  # map column name -> index
    for line in fil:
        row = line.strip().split(delimiter)
        # do something with row[dic_headers['column_name']]

This is a very simple example, but it can be made more elaborate. For example, it does not work if your data contains commas inside field values.
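
If commas can appear inside quoted fields, the standard library's csv.reader still reads one row at a time but handles the quoting for you; a minimal sketch along the same lines (the filename and column name are placeholders):

    import csv

    with open(filename, "r") as fil:
        reader = csv.reader(fil)  # streams rows and honours quoted commas
        headers = next(reader)
        dic_headers = {hdr: i for i, hdr in enumerate(headers)}
        for row in reader:
            value = row[dic_headers["column_name"]]
            # do something with value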

6 Comments

Can you please tell me which line in my previous code takes all the columns at once?
AFAIK the csv readers already iterate over lines internally and do not load the whole file into memory, so I highly doubt that this really solves OP's problem
I just saw that and upvoted your answer. We learn every day.
headers = headers.strip().split(delimiter) results in the error: built-in method has no attribute 'split'
@Tania As suggested by DhruvPathak and Serge Ballesta, this is most likely not solving your memory error, as the DictReader does not put the entire file into memory
