
I have a JSON file of size less than 1 GB. I am trying to read the file on a server that has 400 GB of RAM using the following simple command:

import pandas as pd

df = pd.read_json('filepath.json')

However, this code takes forever (several hours) to execute. I tried several suggestions, such as

df = pd.read_json('filepath.json', low_memory=False)

or

df = pd.read_json('filepath.json', lines=True)

but none of them worked. How can reading a 1 GB file on a server with 400 GB of RAM be so slow?

  • Did you try import json; d=json.load(open('filepath.json')); df=pd.DataFrame(d)? (A sketch of this approach appears after the comments.) Commented Feb 24, 2022 at 14:14
  • Is your json essentially a list of dictionaries? Is it one dictionary per line? Do you need all the attributes or just some of them? Commented Feb 24, 2022 at 14:57
  • Even though pandas.read_json is not fast, I don't think it will take several hours (It's just a wild guess). I suspect that your table has too many columns, or pandas.read_json is reading it that way. pandas is terrible at handling tables with too many columns. For example, pd.DataFrame([range(100000)]) will take more than one second to create. Please check how many rows and columns your table has. Commented Feb 24, 2022 at 16:30
  • Thanks, I think the problem was with reading directly using read_json, while @tomerar's suggestion worked in a few seconds! Commented Feb 24, 2022 at 19:23
  • @Youcef, what was the solution from @tomera? Commented Dec 15, 2022 at 23:21
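For reference, here is a minimal sketch of the approach from the first comment, which the asker reports worked in a few seconds. It assumes the file is a single JSON document (for example, a list of record dictionaries) and reuses the asker's path 'filepath.json':

import json
import pandas as pd

with open('filepath.json') as f:
    data = json.load(f)      # parse the whole file with the standard-library json module

df = pd.DataFrame(data)      # build the DataFrame from the parsed records in one step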

1 Answer


You can use chunking to shrink memory use. I also recommend the Dask library, which can load the data in parallel.
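A rough sketch of both suggestions, assuming the file is in JSON Lines format (one record per line), since pandas only supports chunksize together with lines=True, and assuming dask.dataframe.read_json is the intended Dask entry point:

import pandas as pd

# Read the file in chunks and concatenate them, keeping peak memory per chunk small
chunks = pd.read_json('filepath.json', lines=True, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)

# Alternatively, let Dask read and parse the blocks in parallel
import dask.dataframe as dd

ddf = dd.read_json('filepath.json', lines=True, blocksize=2**27)  # ~128 MB blocks
df = ddf.compute()  # materialize the result as a regular pandas DataFrame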

