
I'm trying to read a very large set of nested json records into a pandas dataframe, using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.

Does anyone know a quicker way to do this?

Is it possible to just load a sample of the json records? I would probably be fine with just a couple hundred thousand records.

Also, I probably don't need all the fields from review.json. Could I just load a subset of them, like user_id, business_id, and stars? And would that speed things up?
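Something along these lines is what I have in mind, just as a sketch (it assumes a pandas version new enough to support chunksize together with lines=True; the column names are the ones mentioned above):

import pandas as pd

# read the line-delimited file in chunks and stop after ~200,000 rows
wanted_cols = ['user_id', 'business_id', 'stars']
chunks = []

reader = pd.read_json('dataset/review.json', lines=True, chunksize=100000)
for i, chunk in enumerate(reader):
    chunks.append(chunk[wanted_cols])   # keep only the columns I care about
    if i >= 1:                          # two chunks of 100,000 rows each
        break

df_review = pd.concat(chunks, ignore_index=True)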

I would post sample data but I can't even get it to finish loading.

Code:

df_review = pd.read_json('dataset/review.json', lines=True)

Update:

Code:

reviews = ''

with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews,lines=True)

Error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
      5         reviews += line
      6 
----> 7 testdf = pd.read_json(reviews,lines=True)

/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    273         # commas and put it in a json list to make a valid json object.
    274         lines = list(StringIO(json.strip()))
--> 275         json = u'[' + u','.join(lines) + u']'
    276 
    277     obj = None

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)

Update 2:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''

with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews,lines=True)
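For reference, an alternative that avoids the setdefaultencoding hack (just a sketch, assuming Python 2) is to decode the file explicitly with io.open, so that read_json receives unicode instead of raw bytes:

import io
import pandas as pd

reviews = u''

# io.open decodes each line as UTF-8, so no ASCII decode error later
with io.open('dataset/review.json', 'r', encoding='utf-8') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews, lines=True)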

3 Answers


If your file has line-separated json objects, as your use of lines=True implies, this might work: just read the first 1000 lines of the file and then parse them with pandas.

import pandas as pd

reviews = ''

# read only the first 1000 lines of the line-delimited JSON file
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

# parse the sampled lines as line-delimited JSON
pd.read_json(reviews, lines=True)
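If readlines() on the full multi-gigabyte file is too slow or memory-hungry, a variant (just a sketch) that stops after the first N lines without reading the rest:

import itertools
import pandas as pd

n_lines = 1000  # number of records to sample

with open('dataset/review.json', 'r') as f:
    # islice stops the iteration after n_lines, so the rest of the file is never read
    reviews = ''.join(itertools.islice(f, n_lines))

pd.read_json(reviews, lines=True)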

6 Comments

This won't work since it will break the json format.
@OrDuan I agree, but the original question included "lines=True" which implies that the json is formatted in such a way that each line is a different json object. If that is the case, the above solution should work.
@NathanH Thank you for getting back to me so quickly. I tried your suggestion and got an error message; I added the code I ran and the error message as an update to the original post. Do you know what the issue might be? Also, with your suggestion, would I be able to read in a set number of records instead of the whole file? Is that the idea?
@user3476463 are you able to post an example of the json structure? This would help when trying to figure this out. Also, could you not just open up the json file, copy some of the objects into a new file, and then process the file?
@NathanH I was able to import some records from the json file using your suggestion once I set the default encoding to utf-8, thank you! If I wanted to post a sample of the json data, how could I open it up and grab some? I tried throwing the file into a text editor before, but it's so large it just locked up the editor. Is there a way to do it in Python? (A sketch of one way follows these comments.)
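A small sketch of one way to grab a sample of the records in Python without opening the whole file in an editor (the output filename here is arbitrary):

import itertools

# copy the first 20 line-delimited JSON records into a small sample file
with open('dataset/review.json', 'r') as src, open('review_sample.json', 'w') as dst:
    dst.writelines(itertools.islice(src, 20))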

I agree with @Nathan H's suggestion, but the real gain will probably lie in parallelization.

import pandas as pd

buf_lst = []
chunk_size = 1000

# split the file into chunks of 1000 lines each
with open('dataset/review.json', 'r') as f:
    lines = f.readlines()
    buf_lst += [''.join(lines[x:x + chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    # parse one chunk of line-delimited JSON into a dataframe
    return pd.read_json(buf, lines=True)

#### single-thread
df_lst = map(f, buf_lst)

#### multi-process
import multiprocessing as mp
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()   # close the pool before joining
pool.join()

However, I am not sure how to combine the pandas dataframes yet.
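A minimal sketch of combining the per-chunk frames with pandas.concat (as the comments below also point out):

import pandas as pd

# df_lst is the list of per-chunk dataframes built above
df = pd.concat(df_lst, ignore_index=True)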

2 Comments

I expect you figured this out already, but pandas.concat(list_of_frames) is what you want to combine them
This one solved my problem. In my code, real multiprocessing didn't work; "import multiprocessing.dummy as mp" solved it.

Speeding up that one line would be challenging because it's already super optimized.

I would first check whether you can get fewer rows/less data from the provider, as you mentioned.

If you can preprocess the data, I would recommend parsing and filtering it beforehand (even try different JSON parsers; their performance changes with each dataset's structure), saving just the data you need, and then calling the pandas method on that output.
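For example, a rough sketch of that kind of preprocessing with the standard json module (a faster parser such as ujson could be swapped in if it is installed; the field names are just the ones from the question):

import json
import pandas as pd

wanted = ('user_id', 'business_id', 'stars')   # keep only the fields that are needed
records = []

with open('dataset/review.json', 'r') as f:
    for line in f:
        obj = json.loads(line)                  # parse one review at a time
        records.append({k: obj.get(k) for k in wanted})

df_review = pd.DataFrame(records, columns=list(wanted))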

Here you can find some benchmarks of JSON parsers; keep in mind that you should test on your own data, and that the article is from 2015.

