
I currently have a CSV file with 200k rows, each row containing 80 entries separated by commas. I open the CSV file with open() and append the data to a 2-D Python list. When I then iterate through that list and try to combine the 80 entries of each row into a single one, the computer freezes. Does my code produce some kind of memory issue? Should I work with my data in batches, or is there a more efficient way to do what I'm trying to do?

In short: open the CSV, go through the 200k rows, and transform each row from a list of 80 entries, [1, 2, 3, 4, 5, ..., 80], into a single concatenated string, '12345...80', for all 200k rows.

import csv


# create empty shells
raw_data = []
concatenate_data = []


def get_data():
    counter = 1

    # open the raw data file and put it into a list
    with open('raw_data_train.csv', 'r') as file:
        reader = csv.reader(file, dialect='excel')

        for row in reader:
            print('\rCurrent item: {0}'.format(counter), end='', flush=True)
            raw_data.append(row)
            counter += 1

    print('\nReading done')


def format_data():
    counter = 1
    temp = ''

    # concatenate the separated letters for each string in the csv file
    for batch in raw_data:
        for letters in batch:
            temp += letters
        concatenate_data.append(temp)
        print('\rCurrent item: {0}'.format(counter), end='', flush=True)
        counter += 1

    print('\nTransforming done')
    print(concatenate_data[0:10])
  • is it normal that temp is only initialized at start? Commented Dec 26, 2016 at 17:45
  • @Jean-FrançoisFabre What do you mean by normal? I just need this variable temporarily to hold the 80 single entries and transform them into a single one. That's why it is only included in the format_data() function. Commented Dec 26, 2016 at 17:49

1 Answer


The format_data() routine is bound to hog your CPU a lot:

  • it uses string concatenation, which is sub-optimal compared to other methods (io.StringIO, str.join)
  • it uses the same temp variable for the whole routine, never resetting it between rows
  • it appends temp inside the loop, so each appended string contains everything from all previous rows and keeps growing (a minimal fix is sketched right after this list).
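
For reference, a minimal fix of the original loop would just reset temp for every row. The sketch below keeps the concatenation approach purely to illustrate the bug (it assumes the raw_data and concatenate_data lists from the question) and is still slower than str.join:

for batch in raw_data:
    temp = ''  # reset per row, otherwise temp keeps all previous rows
    for letters in batch:
        temp += letters
    concatenate_data.append(temp)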

I suppose you just want to do this: append all the text of each line as one string, without spaces. That is much faster with str.join, which avoids repeated string concatenation:

for batch in raw_data:
    concatenate_data.append("".join(batch))

or even faster if you can get rid of the prints:

concatenate_data = ["".join(batch) for batch in raw_data]
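
If memory is also a concern (the data ends up held twice, once in raw_data and once in concatenate_data), a possible variant is to join each row while the file is being read and skip the intermediate raw_data list entirely. A sketch, assuming the same 'raw_data_train.csv' file from the question:

import csv

def read_and_concatenate(path='raw_data_train.csv'):
    # join each row into one string while streaming the file,
    # so only the final list of strings is kept in memory
    with open(path, 'r', newline='') as file:
        reader = csv.reader(file, dialect='excel')
        return ["".join(row) for row in reader]

concatenate_data = read_and_concatenate()
print(concatenate_data[0:10])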

5 Comments

I figured out that much because it made no sense appending the same data over and over.
I'm so sorry, I completely forgot to set the temp variable back to an empty string. The goal is to append 80 single strings into one big one, and to do that for all 200k entries.
you mean that concatenate_data is a list of concatenated lines, or a huge string containing all the strings of the csv file, flat?
I mean a list of concatenated chars. So change [[a, b, c, d, ...], [a, b, c, d, ...], ...] to [abcd..., abcd..., abcd..., ...]
Thank you very much. Didn't know that .join is a lot faster than the usual concatenation. The prints are only there to debug :)
