
I have a number of JSON files to combine and output as a single CSV (to load into R), with each JSON file at about 1.5 GB. While doing a trial run on 4-5 JSON files at 250 MB each, I get the error below. I'm running Python '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' with 8 GB of RAM on Windows 7 Professional 64-bit.

I'm a Python novice and have little experience with writing optimized code and would appreciate guidance on how I can optimize my script below. Thank you!

======= Python MemoryError =======

Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]

======= json to csv conversion script =======

import json
from csv import writer

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# lists that the loops below append to
tweets = []
ids, texts, time_created, retweet_counts = [], [], [], []
# ... the remaining per-field lists are initialised the same way

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except ValueError:  # skip lines that aren't valid JSON
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)

csv = writer(out)

for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()
  • You're loading everything into memory (tweets.append(json.loads(line))). Can you phrase your algorithm in a way such that you write to output.csv immediately after reading each line? Commented May 15, 2014 at 2:38
  • This is probably more appropriate for codereview.stackexchange.com Commented May 15, 2014 at 2:43
  • But, while I'm here, you should just open the files one at a time. No reason to open them all at once. Especially since you're not closing them when you're done with them. Commented May 15, 2014 at 2:43
  • thanks @dano. How should I amend the code so I close files when I'm done with them? Commented May 15, 2014 at 2:45
  • @VasiliSyrakis How will classes help? This seems like a straightforward imperative programming requirement. Open file, read line, close file. Not every problem needs to be solved with OOP. Commented May 15, 2014 at 3:03
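The one-file-at-a-time approach suggested in the comments can be sketched as a generator (a minimal sketch; `read_tweets` is a hypothetical helper name, and blank or malformed lines are skipped as in the question's script):

```python
import json

def read_tweets(filenames):
    """Yield parsed tweets one at a time. Each file is opened,
    read line by line, and closed before the next one is opened."""
    for name in filenames:
        with open(name) as f:        # closed automatically on leaving the block
            for line in f:
                if line.rstrip():    # only keep tweets, not the empty lines
                    try:
                        yield json.loads(line)
                    except ValueError:
                        pass         # skip lines that aren't valid JSON
```

Because it yields one tweet at a time, the caller never holds more than a single parsed tweet in memory, and no file handles are left open.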

1 Answer


This line right here:

open_files = map(open, filenames)

opens every file at once.

Then you read everything and munge it into a single list, tweets.

And between your two main for loops, the zip call, and the final write loop, each tweet (of which there are several gigabytes' worth) is iterated over a staggering four times. Any one of those points could be the cause of the memory error.

Unless absolutely necessary, try to only touch each piece of data once. As you iterate through a file, process the line and write it out immediately.

Try something like this instead:

out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

def process_tweet_into_line(line):
    # load as json, pull out the fields you need, and return a csv line
    return line

# process one file at a time; `with` closes each file when you're done with it
for name in filenames:
    with open(name) as file:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = process_tweet_into_line(line)
                    out.write(tweet)
                except ValueError:
                    pass
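To flesh that out a bit, the whole conversion can be written as one streaming pass (a sketch only, assuming a subset of the fields from the question; shown in Python 3 syntax with a text-mode CSV file, though the same structure works on 2.7 with a `"wb"` output file):

```python
import csv
import json

# assumed subset of the question's columns; extend with the rest as needed
FIELDS = ["id_str", "text", "created_at", "retweet_count"]

def convert(filenames, out_path):
    with open(out_path, "w", newline="") as out:   # Python 3; use "wb" on 2.7
        w = csv.writer(out)
        w.writerow(FIELDS)                         # header row
        for name in filenames:
            with open(name) as f:                  # one input file at a time
                for line in f:
                    if not line.rstrip():          # skip the empty lines
                        continue
                    try:
                        tweet = json.loads(line)
                    except ValueError:             # skip malformed lines
                        continue
                    # write immediately: no tweets list, no zip,
                    # no per-field lists held in memory
                    w.writerow([tweet.get(k, "") for k in FIELDS])
```

Peak memory use is now one line of JSON plus one CSV row, regardless of how many gigabytes the input files total.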

1 Comment

thanks @Lego Stormtroopr but I'm having a little difficulty implementing your suggestion in my code. Would you be able to help by fleshing it out a bit more please? Thanks!
