
I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:

name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"

After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?

After about an hour, the only thing I could come up with was:

import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))

This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.

The desired output is the dataframe object below. I added the following lines of code to get there in my (crappy) way:

df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df

Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94

6 Answers


I think applying json.loads is a good idea, but from there you can convert it directly to dataframe columns instead of dumping and re-loading it:

stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)

or alternatively in one step:

df.join(df['stats'].apply(json.loads).apply(pd.Series))
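Putting the one-step version together as a runnable sketch (the sample CSV is rebuilt in-memory from the question, and the original stats column is dropped at the end, as in the desired output):

```python
import io
import json

import pandas as pd

csv_data = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"'''

df = pd.read_csv(io.StringIO(csv_data))

# Parse each JSON string into a dict, expand the dicts into columns,
# join them back, then drop the original column.
df = df.join(df['stats'].apply(json.loads).apply(pd.Series)).drop(columns='stats')
```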

2 Comments

ty, this was perfectly sufficient for my current task but i marked the other one as the answer since it's more broadly applicable
I was wondering how to parallelise this statement df.join(df['stats'].apply(json.loads).apply(pd.Series)). Any help please?

There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:

converters : dict, optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

So first define your custom parser. In this case the below should work:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

In your case you'll have something like:

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)

We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. This makes each value in the stats column a dict.

From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (every json object needs to have the same 3 keys, or at least missing values need to be handled in our CustomParser).

df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)

On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
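A runnable sketch of the whole converter approach, with an in-memory CSV standing in for f1:

```python
import io
import json

import pandas as pd

csv_data = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"'''

# json.loads runs on each cell while the file is read,
# so df['stats'] holds dicts rather than raw strings.
df = pd.read_csv(io.StringIO(csv_data), converters={'stats': json.loads}, header=0)

# Bulk-assign under sorted key names, per the answer above.
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pd.Series)
```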

5 Comments

thanks, this is great, i expect i'll need to deal with more mutant data in the future and this will help.
The last line in this answer does not guarantee that the dict elements get matched to the correct column names. .apply(pandas.Series) converts each row into a Series and automatically sorts the index, which in this case is the list of dictionary keys. So for consistency, you have to ensure that the list of keys on the LHS is sorted.
I would import json and then use: pandas.read_csv(f1, converters={'stats': json.loads}). You don't need to define a new function, and you definitely don't need to import inside it.
Hello. I tried this in Python 3 and got the error: ValueError: Columns must be same length as key. My requirement and expected output is exactly the same except that I have nested values in my JSON.
The only issue is when the json keys are inconsistent; then the "Columns must be same length as key" error pops up.

Option 1

If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:

import json
import pandas as pd

df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})

Option 2

If you didn't then you might need to use this:

import json
import pandas as pd

df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
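A note on Option 2: eval will execute arbitrary code found in the file, so where the column holds Python-literal-style dicts a safer near-drop-in (my suggestion, not part of the original answer) is ast.literal_eval, which only parses literals:

```python
import ast

# Python-literal dict: the single quotes would make json.loads fail,
# but ast.literal_eval parses it without executing code.
row = "{'eye_color': 'brown', 'height': 160}"
parsed = ast.literal_eval(row)
```

It can be passed to read_csv the same way: converters={'json_column_name': ast.literal_eval}.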

Option 3

For more complicated situations you can write a custom converter like this:

import json
import pandas as pd

def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None


df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
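For example, this converter swallows bad cells instead of failing the whole read; a minimal sketch with an in-memory file and a deliberately malformed row:

```python
import io
import json

import pandas as pd

def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None

csv_data = '''name,stats
ok,"{""height"": 160}"
bad,not json at all'''

# The malformed second row becomes None rather than raising.
df = pd.read_csv(io.StringIO(csv_data), converters={'stats': parse_column})
```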

2 Comments

Hello, I have got a nan value in my JSON string 'sv': [nan, nan, nan, nan, nan, 1.0] and I got the error "name 'nan' is not defined". Do you know how to handle that case?
Hmm, you could try Option 3, the custom parser: do something like data = data.replace('nan,', 'None,') and then return eval(data). Be careful with the replacement, though, so you don't replace other values you want to keep; I'm not sure what your data looks like. You could get a bit smarter and use a regex like (?<=[\[,\s\]])(nan)(?=[\,\s\]]), which should match all the nan but not stuff like bnan or *nan. regexr.com is a good tool to play around on.

Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)

We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.

def CustomParser(data):
  import json
  j1 = json.loads(data)
  return j1

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
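One caveat (not in the original answer): on recent pandas versions, pd.Series preserves dict insertion order instead of sorting it, so the sorted LHS can still mismatch the positionally-assigned RHS. A sketch of a variant that sidesteps ordering entirely by joining on the expanded frame's own column labels:

```python
import io
import json

import pandas as pd

# Keys deliberately appear in non-sorted order.
csv_data = '''name,stats
a,"{""weight"": 76, ""height"": 160}"'''

df = pd.read_csv(io.StringIO(csv_data), converters={'stats': json.loads})

# Expand the dicts into their own frame and join on its own column
# labels, so no manual key ordering is needed.
expanded = df['stats'].apply(pd.Series)
df = df.join(expanded)
```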

1 Comment

Thx for spotting that. I have updated my answer with your additional sorted for completeness

The json_normalize function in the pandas.io.json package helps to do this without a custom function.

(assuming you are loading the data from a file)

import ujson
from pandas.io.json import json_normalize

df = pd.read_csv(file_path, header=0)
stats_df = json_normalize(df['stats'].apply(ujson.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(df.columns[2], axis=1, inplace=True)
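One thing json_normalize adds over the apply(pd.Series) approaches is flattening of nested objects, which commenters above asked about. A minimal sketch using the modern pd.json_normalize spelling, with a made-up nested record:

```python
import pandas as pd

# Hypothetical record with one level of nesting.
records = [{"eye_color": "brown", "body": {"height": 160, "weight": 76}}]

# Nested keys become dotted column names like 'body.height'.
flat = pd.json_normalize(records)
```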

2 Comments

Thanks for your answer. Shouldn't the ujson.loads actually be json.loads?
Note that pandas.io.json is deprecated, use pandas.json_normalize (or pd.json_normalize)
  • If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
  • This link has some tips on how to read a csv file with json strings into a dataframe.

You could do the following to read csv file with json string column and convert your json string into columns.

  1. Read your csv into the dataframe (read_df)

    read_df = pd.read_csv('yourFile.csv', converters={'state':json.loads}, header=0, quotechar="'")

  2. Convert the json string column to a new dataframe

    state_df = read_df['state'].apply(pd.Series)

  3. Merge the two dataframes on the index.

    df = pd.merge(read_df, state_df, left_index=True, right_index=True)
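The three steps can be sketched end-to-end with an in-memory CSV that uses the same single-quote quoting as the read_csv call above (the column names bar and baz are made up for illustration):

```python
import io
import json

import pandas as pd

csv_data = """name,state
a,'{"bar": 1, "baz": 2}'"""

# Step 1: read the csv, parsing the json column via a converter.
read_df = pd.read_csv(io.StringIO(csv_data), converters={'state': json.loads},
                      header=0, quotechar="'")

# Step 2: expand the dicts into a new dataframe.
state_df = read_df['state'].apply(pd.Series)

# Step 3: merge the two frames on the index.
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
```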

