
I currently have a CSV file with 200k rows, each row containing 80 entries separated by commas. I open the CSV file with open() and append the data to a 2-D Python list. When I then iterate through that list and try to combine the 80 entries of each row into a single one, the computer freezes. Does my code produce some kind of memory issue? Should I work with my data in batches, or is there a more efficient way to do what I'm trying to do?

In short: open the CSV, go through the 200k rows, and transform each row from a list of 80 entries, [1, 2, 3, 4, 5, ..., 80], into a single concatenated string, '12345...80', for all 200k rows.

import csv


# create empty shells
raw_data = []
concatenate_data = []


def get_data():
    counter = 1

    # open the raw data file and put it into a list
    with open('raw_data_train.csv', 'r') as file:
        reader = csv.reader(file, dialect='excel')

        for row in reader:
            print('\rCurrent item: {0}'.format(counter), end='', flush=True)
            raw_data.append(row)
            counter += 1

    print('\nReading done')


def format_data():
    counter = 1
    temp = ''

    # concatenate the separated letters for each string in the csv file
    for batch in raw_data:
        for letters in batch:
            temp += letters
        concatenate_data.append(temp)
        print('\rCurrent item: {0}'.format(counter), end='', flush=True)
        counter += 1

    print('\nTransforming done')
    print(concatenate_data[0:10])
  • is it normal that temp is only initialized at start? Commented Dec 26, 2016 at 17:45
  • @Jean-FrançoisFabre What do you mean by normal? I just need this variable temporarily to hold the 80 single entries and transform them into a single one. That's why it is only included in the format_data() function. Commented Dec 26, 2016 at 17:49

1 Answer


The format_data() routine is bound to hog your CPU a lot:

  • it uses string concatenation, which is sub-optimal compared to other methods (io.StringIO, str.join)
  • it uses the same temp variable for the whole routine, never resetting it between rows
  • it appends temp inside the loop, so each appended string contains everything from all previous rows and keeps growing (a minimal fix is sketched right after this list).
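
For reference, a minimal fix of the original loop would just reset temp for every row. The sketch below keeps the concatenation approach purely to illustrate the bug (it assumes the raw_data and concatenate_data lists from the question) and is still slower than str.join:

for batch in raw_data:
    temp = ''  # reset per row, otherwise temp keeps all previous rows
    for letters in batch:
        temp += letters
    concatenate_data.append(temp)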

I suppose you just want to do this: append all the text of each line as one string, without spaces. That is much faster with str.join, which avoids repeated string concatenation:

for batch in raw_data:
    concatenate_data.append("".join(batch))

or even faster if you can get rid of the prints:

concatenate_data = ["".join(batch) for batch in raw_data]
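
If memory is also a concern (the data ends up held twice, once in raw_data and once in concatenate_data), a possible variant is to join each row while the file is being read and skip the intermediate raw_data list entirely. A sketch, assuming the same 'raw_data_train.csv' file from the question:

import csv

def read_and_concatenate(path='raw_data_train.csv'):
    # join each row into one string while streaming the file,
    # so only the final list of strings is kept in memory
    with open(path, 'r', newline='') as file:
        reader = csv.reader(file, dialect='excel')
        return ["".join(row) for row in reader]

concatenate_data = read_and_concatenate()
print(concatenate_data[0:10])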

5 Comments

I figured out that much because it made no sense appending the same data over and over.
I'm so sorry, I completely forgot to set the temp variable back to an empty string. The goal is to append 80 single strings into one big one, and to do that for all 200k entries.
you mean that concatenate_data is a list of concatenated lines, or a huge string containing all the strings of the csv file, flat?
I mean a list of concatenated chars. So change [[a, b, c, d, ...], [a, b, c, d, ...], ...] to [abcd..., abcd..., abcd..., ...]
Thank you very much. Didn't know that .join is a lot faster than the usual concatenation. The prints are only there to debug :)
