I'm trying to read a very large set of nested JSON records into a pandas DataFrame using the code below. It's a few million records: the "review" file from the Yelp academic dataset.
Does anyone know a quicker way to do this?
Is it possible to load just a sample of the JSON records? I would probably be fine with a couple hundred thousand records.
Also, I probably don't need all the fields from review.json. Could I load just a subset of them, like user_id, business_id, and stars, and would that speed things up?
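For reference, this is roughly the kind of subset-loading I have in mind (an untested sketch; the field names come from the Yelp docs, and the inline records are made-up placeholders since I can't load the real data yet):

```python
import json
from io import StringIO
from itertools import islice

import pandas as pd

# made-up records standing in for lines of dataset/review.json
f = StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 4, "text": "long review..."}\n'
    '{"user_id": "u2", "business_id": "b2", "stars": 2, "text": "another one..."}\n'
)

fields = ['user_id', 'business_id', 'stars']  # the only columns I need

# islice stops after N lines instead of reading the whole file into memory
records = [json.loads(line) for line in islice(f, 200000)]

# passing columns= keeps only the listed fields from each record
df_review = pd.DataFrame(records, columns=fields)
print(list(df_review.columns))  # ['user_id', 'business_id', 'stars']
```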
I would post sample data but I can't even get it to finish loading.
Code:
df_review = pd.read_json('dataset/review.json', lines=True)
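One thing I've seen mentioned is the chunksize argument to read_json, which I gather was added in pandas 0.21 (newer than what I have). A sketch of how I think it would work, using made-up inline records in place of the real file:

```python
from io import StringIO

import pandas as pd

# made-up lines standing in for dataset/review.json
data = StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 4}\n'
    '{"user_id": "u2", "business_id": "b2", "stars": 2}\n'
    '{"user_id": "u3", "business_id": "b3", "stars": 5}\n'
)

frames = []
# with chunksize set, read_json yields DataFrames of that many rows at a
# time, so I can stop early instead of parsing every record
for i, chunk in enumerate(pd.read_json(data, lines=True, chunksize=2)):
    frames.append(chunk)
    if i >= 0:  # keep only the first chunk for a quick sample
        break

sample = pd.concat(frames, ignore_index=True)
print(len(sample))  # 2
```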
Update:
Code:
reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
5 reviews += line
6
----> 7 testdf = pd.read_json(reviews,lines=True)
/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
273 # commas and put it in a json list to make a valid json object.
274 lines = list(StringIO(json.strip()))
--> 275 json = u'[' + u','.join(lines) + u']'
276
277 obj = None
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)
Update 2:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
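I've read that the setdefaultencoding hack above is discouraged, so the alternative I'm considering is decoding the file as UTF-8 up front with io.open. Sketch below: it writes a tiny made-up UTF-8 file (review_sample.json, a placeholder for the real one) and wraps the string in StringIO, since I understand newer pandas deprecates passing raw JSON strings to read_json:

```python
import io
from itertools import islice

import pandas as pd

# tiny made-up stand-in for dataset/review.json, including a non-ASCII
# character like the byte (0xc3) that triggered my UnicodeDecodeError
with io.open('review_sample.json', 'w', encoding='utf-8') as f:
    f.write(u'{"user_id": "u1", "stars": 5, "text": "caf\u00e9 was great"}\n')
    f.write(u'{"user_id": "u2", "stars": 3, "text": "ok"}\n')

# io.open decodes each line as UTF-8 while reading, so read_json only ever
# sees unicode and the ascii-codec error can't happen
with io.open('review_sample.json', 'r', encoding='utf-8') as f:
    reviews = u''.join(islice(f, 1000))

testdf = pd.read_json(io.StringIO(reviews), lines=True)
print(len(testdf))  # 2
```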