
I've been asked by a colleague to convert 6 huge files from the "Yelp Dataset Challenge" from somewhat "flat," regular JSON into CSV (he thinks they look like fun teaching data).

I thought I could bang it out with:

# With thanks to http://www.diveintopython3.net/files.html and https://www.reddit.com/r/MachineLearning/comments/33eglq/python_help_jsoncsv_pandas/cqkwyu8/

import os
import pandas

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'
if not os.path.exists(csvdir):
    os.makedirs(csvdir)

for file in os.listdir(jsondir):
    with open(jsondir + file, 'r', encoding='utf-8') as f:
        data = f.readlines()
    # Wrap the one-object-per-line JSON in brackets so pandas parses it as an array.
    df = pandas.read_json('[' + ','.join(line.rstrip() for line in data) + ']')
    df.to_csv(csvdir + os.path.splitext(file)[0] + '.csv', index=0, quoting=1)

Unfortunately, my computer's memory isn't up to the task on files of this size. (Even without the loop, it bangs out a 50 MB file in under a minute, but it freezes my computer or crashes outright on the 100 MB+ files, and the biggest file is 3.25 GB.)

Is there something else simple but performant that I can run instead?

Running it in a loop would be great, but I can also run it 6 times with separate filenames if that makes a difference for memory (there are only 6 files).

Here's an example of the contents of a ".json" file -- note that each file actually contains many JSON objects, one per line.

{"business_id":"xyzzy","name":"Business A","neighborhood":"","address":"XX YY ZZ","city":"Tempe","state":"AZ","postal_code":"85283","latitude":33.32823894longitude":-111.28948,"stars":3,"review_count":3,"is_open":0,"attributes":["BikeParking: True","BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","DogsAllowed: False","RestaurantsPriceRange2: 2","WheelchairAccessible: True"],"categories":["Tobacco Shops","Nightlife","Vape Shops","Shopping"],"hours":["Monday 11:0-21:0","Tuesday 11:0-21:0","Wednesday 11:0-21:0","Thursday 11:0-21:0","Friday 11:0-22:0","Saturday 10:0-22:0","Sunday 11:0-18:0"],"type":"business"}
{"business_id":"dsfiuweio2f","name":"Some Place","neighborhood":"","address":"Strip or something","city":"Las Vegas","state":"NV","postal_code":"89106","latitude":36.189134,"longitude":-115.92094,"stars":1.5,"review_count":2,"is_open":1,"attributes":["BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True"],"categories":["Caterers","Grocery","Food","Event Planning & Services","Party & Event Planning","Specialty Food"],"hours":["Monday 0:0-0:0","Tuesday 0:0-0:0","Wednesday 0:0-0:0","Thursday 0:0-0:0","Friday 0:0-0:0","Saturday 0:0-0:0","Sunday 0:0-0:0"],"type":"business"}
{"business_id":"abccb","name":"La la la","neighborhood":"Blah blah","address":"Yay that","city":"Toronto","state":"ON","postal_code":"M6H 1L5","latitude":43.283984,"longitude":-79.28284,"stars":2,"review_count":6,"is_open":1,"attributes":["Alcohol: none","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}","BikeParking: True","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}","Caters: True","GoodForKids: True","GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False}","HasTV: True","NoiseLevel: quiet","OutdoorSeating: False","RestaurantsAttire: casual","RestaurantsDelivery: True","RestaurantsGoodForGroups: True","RestaurantsPriceRange2: 1","RestaurantsReservations: False","RestaurantsTableService: False","RestaurantsTakeOut: True","WiFi: free"],"categories":["Restaurants","Pizza","Chicken Wings","Italian"],"hours":["Monday 11:0-2:0","Tuesday 11:0-2:0","Wednesday 11:0-2:0","Thursday 11:0-3:0","Friday 11:0-3:0","Saturday 11:0-3:0","Sunday 11:0-2:0"],"type":"business"}

The nested JSON data can simply remain as the string literals representing it -- I'm only looking to convert top-level keys into CSV file headings.

  • Instead of reading and parsing the whole file at once, you can read one JSON object (one line) at a time, parse it, and append it to the CSV as a row. It takes a little more manual coding, but it performs well in a file-stream style; a sketch follows below. Commented Feb 10, 2017 at 22:25
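For illustration, that streaming approach might look like the following, using only the standard-library json and csv modules (a sketch, not tested against the real dataset; it assumes every line shares the first line's top-level keys, and it keeps nested values as string literals, as the question requests):

import csv
import json
import os

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'

for name in os.listdir(jsondir):
    csv_path = os.path.join(csvdir, os.path.splitext(name)[0] + '.csv')
    with open(os.path.join(jsondir, name), 'r', encoding='utf-8') as f, \
         open(csv_path, 'w', encoding='utf-8', newline='') as out:
        writer = None
        for line in f:
            record = json.loads(line)
            # Serialize nested dicts/lists back to strings; leave scalars alone.
            row = {k: json.dumps(v) if isinstance(v, (dict, list)) else v
                   for k, v in record.items()}
            if writer is None:
                # The first line's top-level keys become the CSV headings.
                writer = csv.DictWriter(out, fieldnames=list(row), quoting=csv.QUOTE_ALL)
                writer.writeheader()
            writer.writerow(row)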

1 Answer


The problem is that your code reads the entire file into memory, then creates a near-copy of it in memory. I suspect it also creates a third copy, but haven't verified that. The solution, as suggested by Neo X, is to read in the file line-by-line and process it accordingly. Here's a replacement for the for loop:

for file in os.listdir(jsondir):
    csv_file = csvdir + os.path.splitext(file)[0] + '.csv'
    # newline='' keeps the CSV output from gaining blank lines between rows on Windows.
    with open(jsondir + file, 'r', encoding='utf-8') as f, \
         open(csv_file, 'w', encoding='utf-8', newline='') as csv:
        header = True
        for line in f:
            # Parse a single line (one JSON object) at a time, so memory use stays flat.
            df = pandas.read_json('[' + line.rstrip() + ']')
            df.to_csv(csv, header=header, index=0, quoting=1)
            header = False  # only the first write emits the column headings

I've tested this using Python 3.5 on a Mac; it should work on Windows, but I haven't tested it there.
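As an aside: newer pandas versions can stream newline-delimited JSON natively. This is a hedged sketch -- read_json grew lines=True in 0.19 and chunksize in 0.21, so it may postdate your install, and I haven't run it against the full dataset:

# Stream the file in chunks of 10,000 records instead of parsing it whole.
reader = pandas.read_json(jsondir + file, lines=True, chunksize=10000)
first = True
for chunk in reader:
    chunk.to_csv(csv_file, mode='w' if first else 'a',
                 header=first, index=False, quoting=1)
    first = False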

Notes:

  1. I've adjusted your JSON data; there appeared to be an error around latitude/longitude in the first line.

  2. This has only been tested with a small file; I'm not sure where to get your 3.25 GB file from.

  3. I'm assuming this is one-time usage for your friend. If this were production code, you would need to verify that exception handling for the 'with' statement is correct. See How can I open multiple files using "with open" in Python? for details; a sketch of one approach follows these notes.

  4. This should be fairly performant, but again, I'm not sure where to get your large files from.
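Regarding note 3, one standard way to manage several files with correct cleanup is contextlib.ExitStack; a minimal sketch (the per-line body is a placeholder, not the real conversion):

from contextlib import ExitStack

# Every file registered on the stack is closed, even if the loop raises.
with ExitStack() as stack:
    src = stack.enter_context(open(jsondir + file, 'r', encoding='utf-8'))
    dst = stack.enter_context(open(csv_file, 'w', encoding='utf-8', newline=''))
    for line in src:
        dst.write(line)  # placeholder: replace with the real per-line conversion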


4 Comments

Check out ijson; it makes streaming a JSON file as easy as using a Python iterator.
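For context, ijson is built for incrementally parsing one large JSON document rather than line-delimited files, so it fits a file holding a single top-level array. A rough sketch under that assumption (the filename and key are illustrative):

import ijson

with open('businesses.json', 'rb') as f:       # hypothetical single-array file
    for record in ijson.items(f, 'item'):      # yields one object at a time
        print(record['business_id'])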
@kevin: Question: Why doesn't your to_csv() include a mode='a' parameter? Is there something about calling to_csv() inside a with open that makes it auto-append? Also, your code works beautifully -- converting a small file takes longer, but my computer no longer freezes, and the job still completes in a reasonable amount of time (it should easily be done by the end of the day), so I can just let it run in the background. Thanks a million. (Finally, I edited your code to include UTF-8 encoding on the output file--I was getting errors w/ foreign input data until I did that.)
@k.. Glad I could help! Since csv is already open and is passed into to_csv(), the latter doesn't close the file after writing. You can verify by looking at def save() in the source; it sets close = False at line 1476. github.com/pandas-dev/pandas/blob/master/pandas/formats/…. Good question!
There is also the option of computing the header flag inline (e.g. header=(i == 0) with enumerate), which eliminates two lines of your code.
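Concretely, that tweak to the answer's inner loop might look like this (same behavior, untested):

for i, line in enumerate(f):
    df = pandas.read_json('[' + line.rstrip() + ']')
    df.to_csv(csv, header=(i == 0), index=0, quoting=1)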
