
I'm trying to read a very large set of nested json records into a pandas dataframe, using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.

Does anyone know a quicker way to do this?

Is it possible to just load a sample of the json records? I would probably be fine with just a couple hundred thousand records.

Also, I probably don't need all the fields from review.json. Could I just load a subset of them, like user_id, business_id, and stars? And would that speed things up?
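Something along these lines is what I have in mind, just as a sketch (it assumes a pandas version new enough to support chunksize together with lines=True; the column names are the ones mentioned above):

import pandas as pd

# read the line-delimited file in chunks and stop after ~200,000 rows
wanted_cols = ['user_id', 'business_id', 'stars']
chunks = []

reader = pd.read_json('dataset/review.json', lines=True, chunksize=100000)
for i, chunk in enumerate(reader):
    chunks.append(chunk[wanted_cols])   # keep only the columns I care about
    if i >= 1:                          # two chunks of 100,000 rows each
        break

df_review = pd.concat(chunks, ignore_index=True)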

I would post sample data but I can't even get it to finish loading.

Code:

df_review = pd.read_json('dataset/review.json', lines=True)

Update:

Code:

reviews = ''

with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews,lines=True)

Error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
      5         reviews += line
      6 
----> 7 testdf = pd.read_json(reviews,lines=True)

/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    273         # commas and put it in a json list to make a valid json object.
    274         lines = list(StringIO(json.strip()))
--> 275         json = u'[' + u','.join(lines) + u']'
    276 
    277     obj = None

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)

Update 2:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''

with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews,lines=True)
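For reference, an alternative that avoids the setdefaultencoding hack (just a sketch, assuming Python 2) is to decode the file explicitly with io.open, so that read_json receives unicode instead of raw bytes:

import io
import pandas as pd

reviews = u''

# io.open decodes each line as UTF-8, so no ASCII decode error later
with io.open('dataset/review.json', 'r', encoding='utf-8') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews, lines=True)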

3 Answers


If your file has line-separated json objects, as your use of lines=True implies, this might work: just read the first 1000 lines of the file and then parse them with pandas.

import pandas as pd

reviews = ''

# read only the first 1000 lines of the line-delimited JSON file
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

# parse the sampled lines as line-delimited JSON
pd.read_json(reviews, lines=True)
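If readlines() on the full multi-gigabyte file is too slow or memory-hungry, a variant (just a sketch) that stops after the first N lines without reading the rest:

import itertools
import pandas as pd

n_lines = 1000  # number of records to sample

with open('dataset/review.json', 'r') as f:
    # islice stops the iteration after n_lines, so the rest of the file is never read
    reviews = ''.join(itertools.islice(f, n_lines))

pd.read_json(reviews, lines=True)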

6 Comments

This won't work since it will break the json format.
@OrDuan I agree, but the original question included "lines=True" which implies that the json is formatted in such a way that each line is a different json object. If that is the case, the above solution should work.
@NathanH Thank you for getting back to me so quickly. I tried your suggestion and got an error message; I added the code I ran and the error message as an update to the original post. Do you know what the issue might be? Also, with your suggestion, would I be able to read in a set number of records instead of the whole file? Is that the idea?
@user3476463 are you able to post an example of the json structure? This would help when trying to figure this out. Also, could you not just open up the json file, copy some of the objects into a new file, and then process the file?
@NathanH I was able to import some records from the json file using your suggestion once I set the default encoding to utf-8, thank you! If I wanted to post a sample of the json data, how could I open it up and grab some? I tried throwing the file into a text editor before, but it's so large it just locked up the editor. Is there a way to do it in Python? (A sketch of one way follows these comments.)
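A small sketch of one way to grab a sample of the records in Python without opening the whole file in an editor (the output filename here is arbitrary):

import itertools

# copy the first 20 line-delimited JSON records into a small sample file
with open('dataset/review.json', 'r') as src, open('review_sample.json', 'w') as dst:
    dst.writelines(itertools.islice(src, 20))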

I agree with @Nathan H's suggestion, but the real gain will probably lie in parallelization.

import pandas as pd

buf_lst = []
chunk_size = 1000

# split the file into chunks of 1000 lines each
with open('dataset/review.json', 'r') as f:
    lines = f.readlines()
    buf_lst += [''.join(lines[x:x + chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    # parse one chunk of line-delimited JSON into a dataframe
    return pd.read_json(buf, lines=True)

#### single-thread
df_lst = map(f, buf_lst)

#### multi-process
import multiprocessing as mp
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()   # close the pool before joining
pool.join()

However, I am not sure how to combine the pandas dataframes yet.
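A minimal sketch of combining the per-chunk frames with pandas.concat (as the comments below also point out):

import pandas as pd

# df_lst is the list of per-chunk dataframes built above
df = pd.concat(df_lst, ignore_index=True)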

2 Comments

I expect you figured this out already, but pandas.concat(list_of_frames) is what you want to combine them
This one solved my problem. In my code, real multiprocessing didn't work; "import multiprocessing.dummy as mp" solved it.

Speeding up that one line would be challenging because it's already super optimized.

I would first check whether you can get fewer rows/less data from the provider, as you mentioned.

If you can preprocess the data, I would recommend parsing and filtering it beforehand (even try different JSON parsers; their performance changes with each dataset's structure), saving just the data you need, and then calling the pandas method on that output.
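For example, a rough sketch of that kind of preprocessing with the standard json module (a faster parser such as ujson could be swapped in if it is installed; the field names are just the ones from the question):

import json
import pandas as pd

wanted = ('user_id', 'business_id', 'stars')   # keep only the fields that are needed
records = []

with open('dataset/review.json', 'r') as f:
    for line in f:
        obj = json.loads(line)                  # parse one review at a time
        records.append({k: obj.get(k) for k in wanted})

df_review = pd.DataFrame(records, columns=list(wanted))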

Here you can find some benchmarks of JSON parsers; keep in mind that you should test on your own data, and that the article is from 2015.

