
I have 10K folders, each holding 200 records spread across 200 JSON files. I am trying to compile all records into one dataframe and then finally into a CSV (other format suggestions welcome).

Here is my working solution, which takes around 8.3 hours just for the dataframe-building step (not counting the CSV conversion).

%%time
import json
from pathlib import Path

import pandas as pd

finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)

Any suggestions to optimize this and bring the time down?

5 Comments
  • I found a similar post. How about trying that? stackoverflow.com/questions/27407430/… Commented Jul 15, 2020 at 1:46
  • Yup, will try UltraJSON (ujson). But not very optimistic. (A drop-in swap is sketched just after these comments.) Commented Jul 15, 2020 at 2:31
  • What platform are you on? Does this need to run on Windows? Commented Jul 15, 2020 at 14:19
  • Is the goal just to write the CSV or are you going to process the full DF first? Commented Jul 15, 2020 at 14:20
  • There are other, faster serialization formats such as Feather, Parquet, and HDF. And depending on what you want to do with the data long term, a NoSQL solution like MongoDB or even good ol' SQL are good choices. Once you've imported these JSON files you have rich query capabilities, and if this grows with more JSON over time, just keep importing as more come in. Commented Jul 15, 2020 at 16:24
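
The UltraJSON swap mentioned above is essentially a one-line change, since ujson mirrors the standard library's load/loads interface. A minimal sketch (ujson is a third-party package, and the gain on many small files is usually modest):

# Sketch: swap the standard-library parser for ujson (UltraJSON).
# ujson is a third-party package (pip install ujson) with the same load API.
import ujson

with open('/path/foldername/some_file.json', 'r') as f:   # placeholder path
    data = ujson.load(f)   # same call shape as json.load, C-accelerated parsing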

2 Answers


One important issue is that the dataframe appending runs in O(n^2): for each newly processed JSON file, finalDf is entirely copied!

Here is a modified version running in O(n) time:

%%time
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        allDf.append(df)                               # just collect the small frames
finalDf = pd.concat(allDf, ignore_index=True)          # one O(n) concatenation at the end

If this is not enough, the JSON parsing and the pandas post-processing could be executed in parallel using the multiprocessing module.
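
A minimal sketch of that idea (not the answerer's own implementation), assuming the same placeholder column names 'A' and 'B': each worker parses and normalizes one file, and only the final concatenation runs in the parent process. Note that the per-file frames are pickled back to the parent, so very large frames can eat into the speedup.

import json
import multiprocessing as mp
from pathlib import Path

import pandas as pd

def load_one(filename):
    # runs in a worker process: parse one file and return a small DataFrame
    with open(filename, 'r') as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    return (pd.json_normalize(data)
              .drop(columns=['A'])
              .rename(columns={'B': 'Date'}))

if __name__ == "__main__":
    rootdir = '/path/foldername'
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool() as pool:                            # defaults to one worker per CPU
        frames = pool.map(load_one, all_files, chunksize=100)
    finalDf = pd.concat(frames, ignore_index=True)     # single O(n) concat at the end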


1 Comment

This certainly helps. Could you elaborate on how the JSON parsing and pandas post-processing could be executed in parallel using the multiprocessing module? Any implementations/code?

If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DF as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # write the header only for the first chunk, and skip the index column
            df.to_csv(f, header=f.tell() == 0, index=False)

if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()        # a Manager lock can be shared with pool workers
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files],
            chunksize=1)

Personally, I prefer to use a filesystem lock file for this type of thing, but that's platform dependent and you may have problems on some filesystem types (like a mounted remote filesystem). multiprocessing.Manager uses a background server process for synchronization - I'm not sure whether its Lock is efficient or not. But it's good enough here.... it'll only be a minor fraction of the cost.
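
For what it's worth, a lock-file variant could be sketched like this; it is only an illustration, and it relies on os.O_EXCL making the create atomic, which holds on local filesystems but not reliably on some network mounts:

import os
import time
from contextlib import contextmanager

@contextmanager
def lockfile(path, poll=0.05):
    # acquire: atomically create the lock file; if it already exists, wait and retry
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)      # release: delete the lock file

# usage inside update_csv, in place of the Manager lock:
# with lockfile(outfile + '.lock'):
#     with open(outfile, mode="a", newline="") as f:
#         df.to_csv(f, header=f.tell() == 0, index=False)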

7 Comments

In the last line, result = pool.map(...): passing args=[(lock, outfile, fn) for fn in all_files] gives TypeError: map() got an unexpected keyword argument 'args'. Passing [(lock, outfile, fn) for fn in all_files] gives RuntimeError: Lock objects should only be shared between processes through inheritance
Okay, a few problems there.... I've posted an update.
Is the sequence args=[(lock, outfile, fn)] correct? I am getting the same error somehow, even after correcting the if __name__ == "__main__": and exists typos.
Okay, now it runs, at least with 0 JSON files to process.
There are lots of other options, but it depends on how you want to consume the data later. You could put the JSON into a SQL db, a NoSQL db such as MongoDB or CouchDB, a table in HDFS, or use Apache Arrow. These all have ways to query and filter data on read, so that you don't have to pull the entire dataset into memory. If you really want the entire dataset in memory, then formats like Parquet and Feather will be efficient for reading into pandas (a rough Parquet sketch follows below).
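
A rough illustration of that Parquet route (it needs pyarrow or fastparquet installed; the file name is a placeholder, and finalDf is the combined frame built earlier):

import pandas as pd

# write the combined frame once as a compressed, columnar Parquet file
finalDf.to_parquet('records.parquet', index=False)

# later reads can pull just the columns they need instead of the whole file
dates = pd.read_parquet('records.parquet', columns=['Date'])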
