I have a number of JSON files to combine and output as a single CSV (to load into R), with each JSON file at about 1.5 GB. While doing a trial run on 4-5 JSON files at 250 MB each, I get the error below. I'm running Python version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' with 8 GB of RAM on Windows 7 Professional 64-bit.
I'm a Python novice with little experience writing optimized code, and I would appreciate guidance on how to optimize my script below. Thank you!
======= Python MemoryError =======
Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]
======= JSON to CSV conversion script =======
import json
from csv import writer

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# one list per output column, filled in parallel
tweets = []
ids, texts, time_created, retweet_counts = [], [], [], []
in_reply_to_screen_name, geos, coordinates, places = [], [], [], []
places_country, lang, user_screen_names = [], [], []
user_followers_count, user_friends_count = [], []
user_statuses_count, user_locations = [], []

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
    # ... (the remaining fields are appended the same way)

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"

rows = zip(ids, texts, time_created, retweet_counts, in_reply_to_screen_name,
           geos, coordinates, places, places_country, lang, user_screen_names,
           user_followers_count, user_friends_count, user_statuses_count,
           user_locations)

csv = writer(out)
for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value)
              for value in row]
    csv.writerow(values)

out.close()
Edit: a commenter pointed to the line tweets.append(json.loads(line)) and asked: can you phrase your algorithm in a way such that you write to output.csv immediately after reading each line?
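For what it's worth, here is a minimal sketch of that streaming approach, assuming Python 2.7 and the same tweet fields as the script above (only four of the columns are shown; the others would be added the same way). Each line is parsed and its row is written to output.csv immediately, so only one tweet is held in memory at a time instead of every tweet from every file:

import json
from csv import writer

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

out = open("output.csv", "wb")  # 'wb' so the csv module controls line endings on Windows
csv_out = writer(out)
csv_out.writerow(["ids", "text", "time_created", "retweet_counts"])

for filename in filenames:
    with open(filename) as f:
        for line in f:
            if not line.rstrip():
                continue  # skip empty lines
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            row = [tweet["id_str"], tweet["text"],
                   tweet["created_at"], tweet["retweet_count"]]
            # encode unicode values to utf-8, pass numbers through unchanged
            csv_out.writerow([v.encode("utf8") if hasattr(v, "encode") else v
                              for v in row])

out.close()

With this shape, memory use no longer grows with the total size of the input files, which likely sidesteps the roughly 2 GB address-space limit a 32-bit Python process runs into when the accumulating lists fill up.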