
I have 10K folders, each holding 200 records spread across 200 JSON files. I am trying to compile all records into one dataframe and then finally into a CSV (other format suggestions welcome).

Here is my working solution, which takes around 8.3 hours just for the dataframe-building step (not counting the CSV conversion).

%%time
import json
from pathlib import Path

import pandas as pd

finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)

Any suggestions to optimize this and bring the time down?

5 Comments
  • I found a similar post. How about trying that? stackoverflow.com/questions/27407430/… Commented Jul 15, 2020 at 1:46
  • Yup, will try UltraJSON (ujson). But not very optimistic. (A drop-in swap is sketched just after these comments.) Commented Jul 15, 2020 at 2:31
  • What platform are you on? Does this need to run on Windows? Commented Jul 15, 2020 at 14:19
  • Is the goal just to write the CSV or are you going to process the full DF first? Commented Jul 15, 2020 at 14:20
  • There are other, faster serialization formats such as Feather, Parquet, and HDF. And depending on what you want to do with the data long term, a NoSQL solution like MongoDB or even good ol' SQL are good choices. Once you've imported these JSON files you have rich query capabilities, and if this grows with more JSON over time, just keep importing as more come in. Commented Jul 15, 2020 at 16:24
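
The UltraJSON swap mentioned above is essentially a one-line change, since ujson mirrors the standard library's load/loads interface. A minimal sketch (ujson is a third-party package, and the gain on many small files is usually modest):

# Sketch: swap the standard-library parser for ujson (UltraJSON).
# ujson is a third-party package (pip install ujson) with the same load API.
import ujson

with open('/path/foldername/some_file.json', 'r') as f:   # placeholder path
    data = ujson.load(f)   # same call shape as json.load, C-accelerated parsing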

2 Answers


One important issue is that the dataframe appending runs in O(n^2): for each newly processed JSON file, finalDf is entirely copied!

Here is a modified version running in O(n) time:

%%time
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        allDf.append(df)                               # just collect the small frames
finalDf = pd.concat(allDf, ignore_index=True)          # one O(n) concatenation at the end

If this is not enough, the JSON parsing and the pandas post-processing could be executed in parallel using the multiprocessing module.
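
A minimal sketch of that idea (not the answerer's own implementation), assuming the same placeholder column names 'A' and 'B': each worker parses and normalizes one file, and only the final concatenation runs in the parent process. Note that the per-file frames are pickled back to the parent, so very large frames can eat into the speedup.

import json
import multiprocessing as mp
from pathlib import Path

import pandas as pd

def load_one(filename):
    # runs in a worker process: parse one file and return a small DataFrame
    with open(filename, 'r') as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    return (pd.json_normalize(data)
              .drop(columns=['A'])
              .rename(columns={'B': 'Date'}))

if __name__ == "__main__":
    rootdir = '/path/foldername'
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool() as pool:                            # defaults to one worker per CPU
        frames = pool.map(load_one, all_files, chunksize=100)
    finalDf = pd.concat(frames, ignore_index=True)     # single O(n) concat at the end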


1 Comment

This certainly helps. Could you elaborate on how the JSON parsing and pandas post-processing could be executed in parallel using the multiprocessing module? Any implementations/code?

If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DF as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # write the header only for the first chunk, and skip the index column
            df.to_csv(f, header=f.tell() == 0, index=False)

if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()        # a Manager lock can be shared with pool workers
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files],
            chunksize=1)

Personally, I prefer to use a filesystem lock file for this type of thing, but that's platform dependent and you may have problems on some filesystem types (like a mounted remote filesystem). multiprocessing.Manager uses a background server process for synchronization - I'm not sure whether its Lock is efficient or not. But it's good enough here.... it'll only be a minor fraction of the cost.
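
For what it's worth, a lock-file variant could be sketched like this; it is only an illustration, and it relies on os.O_EXCL making the create atomic, which holds on local filesystems but not reliably on some network mounts:

import os
import time
from contextlib import contextmanager

@contextmanager
def lockfile(path, poll=0.05):
    # acquire: atomically create the lock file; if it already exists, wait and retry
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)      # release: delete the lock file

# usage inside update_csv, in place of the Manager lock:
# with lockfile(outfile + '.lock'):
#     with open(outfile, mode="a", newline="") as f:
#         df.to_csv(f, header=f.tell() == 0, index=False)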

7 Comments

In the last line, result = pool.map(...): passing args=[(lock, outfile, fn) for fn in all_files] gives TypeError: map() got an unexpected keyword argument 'args'. Passing [(lock, outfile, fn) for fn in all_files] gives RuntimeError: Lock objects should only be shared between processes through inheritance
Okay, a few problems there.... I've posted an update.
Is the sequence args=[(lock, outfile, fn)] correct? I am getting the same error somehow, even after correcting the if __name__ == "__main__": and exists typos.
Okay, now it runs, at least with 0 JSON files to process.
There are lots of other options, but it depends on how you want to consume the data later. You could put the JSON into a SQL db, a NoSQL db such as MongoDB or CouchDB, a table in HDFS, or use Apache Arrow. These all have ways to query and filter data on read, so that you don't have to pull the entire dataset into memory. If you really want the entire dataset in memory, then formats like Parquet and Feather will be efficient for reading into pandas (a rough Parquet sketch follows below).
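
A rough illustration of that Parquet route (it needs pyarrow or fastparquet installed; the file name is a placeholder, and finalDf is the combined frame built earlier):

import pandas as pd

# write the combined frame once as a compressed, columnar Parquet file
finalDf.to_parquet('records.parquet', index=False)

# later reads can pull just the columns they need instead of the whole file
dates = pd.read_parquet('records.parquet', columns=['Date'])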
