I'm very new to Python (< 2 weeks), and was asked to read 200k+ JSON files I was provided (as-is) into a single database (using Python). These JSON files have flat, one-level attributes, which vary across files from 50 to over 1,000, but those 50 are a subset of the 1,000.
Here is a snippet of a JSON file:
{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}
Our goal is to combine these JSON files into one master database, clean up the individual columns, run descriptive statistics on those columns, and produce a final, cleaned-up database.
I'm coming from a SAS background, so my thought was to use pandas and build one (very) large dataframe. I've been combing through Stack Overflow this past week and have applied some of what I learned, but I feel there has to be a way to make this far more efficient.
Below is the code I have written so far. It runs, but very slowly (I estimate it would take days, if not weeks, to run, even after eliminating the unneeded input attributes/columns starting with "result").
In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

# Get the list of files in the directory with a *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# Initialize the list of Series
df_list = []

# For every JSON file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())  # Load JSON file into a dictionary
        data_file.close()  # Close data file / remove from memory
    data_copy = deepcopy(data)  # Copy the dictionary
    for k in data_copy.keys():  # Iterate over the copied keys
        if k.startswith('result'):  # If a field starts with "result", delete it from the dictionary
            del data[k]
    df = pd.Series(data)  # Convert dictionary to Series
    df_list.append(df)  # Append to the list of Series

database = pd.concat(df_list, axis=1).reset_index()  # Concatenate Series into database
output_db = database.transpose()  # Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode='w', index=False)
Any thoughts or advice is greatly appreciated. I'm completely open to a different technique or approach entirely (in Python) if it's more efficient and still lets us meet the objectives above.
Thanks!
You can use json.load() (docs) to read each file directly; there's no need to call read() and then json.loads(). You can then collect the rows into a single DataFrame (see here) that you can write to file.
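A minimal sketch of that approach: build a plain dict per file, filter out the "result*" keys while reading, and call pd.DataFrame once on the list of dicts, which aligns columns by key and fills missing attributes with NaN. The sample documents and the temporary directory below are made up for illustration; in the question's setting you would iterate over /home/ubuntu/json_flat/ instead.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for two of the 200k files
tmpdir = tempfile.mkdtemp()
samples = [
    {"study_type": "Observational", "result.count": "5"},
    {"study_type": "Interventional", "start_date": "January 2014"},
]
for i, doc in enumerate(samples):
    with open(os.path.join(tmpdir, f"file{i}.json"), "w") as f:
        json.dump(doc, f)

records = []
for name in os.listdir(tmpdir):
    if not name.endswith(".json"):
        continue
    with open(os.path.join(tmpdir, name)) as f:
        data = json.load(f)  # parse the file directly; no read() + json.loads()
    # Drop the unwanted "result*" attributes while building the row
    records.append({k: v for k, v in data.items() if not k.startswith("result")})

# One DataFrame from the list of dicts: columns align on keys,
# and files missing an attribute simply get NaN there.
database = pd.DataFrame(records)
print(sorted(database.columns))  # → ['start_date', 'study_type']
database.to_csv(os.path.join(tmpdir, "output_db.csv"), index=False)
```

Building plain dicts and constructing the DataFrame once avoids creating 200k+ intermediate Series objects and the concat/transpose/reset_index steps, which is also what leaves the stray integer index row in the CSV.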