I'm very new to Python (< 2 weeks), and was asked to read 200k+ JSON files I was provided (as-is) into a single database (using Python). These JSON files have flat, one-level attributes, which vary across files from 50 to over 1,000, but those 50 are a subset of the 1,000.
Here is a snippet of a JSON file:
{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}
Our goal is to combine these JSON files into one master database, clean up the individual columns, run descriptive statistics on those columns, and produce a final, cleaned-up database.
I'm coming from a SAS background, so my thought was to use pandas and build one (very) large dataframe. I've been combing through Stack Overflow this past week and have applied some of what I learned, but I feel there has to be a way to make this far more efficient.
Below is the code I have written so far. It runs, but very slowly (I estimate it would take days, if not weeks, to run, even after eliminating the unneeded input attributes/columns starting with "result").
In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

# Get the list of files in the directory with a *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# Initialize the list of Series
df_list = []

# For every JSON file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())  # Load JSON file into a dictionary
        data_file.close()  # Close data file / remove from memory
    data_copy = deepcopy(data)  # Copy the dictionary
    for k in data_copy.keys():  # Iterate over the copied keys
        if k.startswith('result'):  # If a field starts with "result", delete it from the dictionary
            del data[k]
    df = pd.Series(data)  # Convert dictionary to Series
    df_list.append(df)  # Append to the list of Series

database = pd.concat(df_list, axis=1).reset_index()  # Concatenate Series into database
output_db = database.transpose()  # Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode='w', index=False)
Any thoughts or advice is greatly appreciated. I'm completely open to a different technique or approach entirely (in Python) if it's more efficient and still lets us meet the objectives above.
Thanks!
You can use json.load() (docs) to read each file directly; there's no need to call read() and then json.loads(). You can then collect the rows into a single DataFrame (see here) that you can write to file.
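A minimal sketch of that approach: build a plain dict per file, filter out the "result*" keys while reading, and call pd.DataFrame once on the list of dicts, which aligns columns by key and fills missing attributes with NaN. The sample documents and the temporary directory below are made up for illustration; in the question's setting you would iterate over /home/ubuntu/json_flat/ instead.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for two of the 200k files
tmpdir = tempfile.mkdtemp()
samples = [
    {"study_type": "Observational", "result.count": "5"},
    {"study_type": "Interventional", "start_date": "January 2014"},
]
for i, doc in enumerate(samples):
    with open(os.path.join(tmpdir, f"file{i}.json"), "w") as f:
        json.dump(doc, f)

records = []
for name in os.listdir(tmpdir):
    if not name.endswith(".json"):
        continue
    with open(os.path.join(tmpdir, name)) as f:
        data = json.load(f)  # parse the file directly; no read() + json.loads()
    # Drop the unwanted "result*" attributes while building the row
    records.append({k: v for k, v in data.items() if not k.startswith("result")})

# One DataFrame from the list of dicts: columns align on keys,
# and files missing an attribute simply get NaN there.
database = pd.DataFrame(records)
print(sorted(database.columns))  # → ['start_date', 'study_type']
database.to_csv(os.path.join(tmpdir, "output_db.csv"), index=False)
```

Building plain dicts and constructing the DataFrame once avoids creating 200k+ intermediate Series objects and the concat/transpose/reset_index steps, which is also what leaves the stray integer index row in the CSV.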