I have a pandas dataframe where one column contains a non-nested JSON object (as a string) in each row.

                             js
0  {"k1":"1","k2":"A","k3":"X"}
1  {"k1":"2","k2":"B","k3":"X"}
2  {"k1":"3","k2":"A","k3":"Y"}
3  {"k1":"4","k2":"D","k4":"M"}

Created like this:

import pandas as pd
L0 = ['{"k1":"1","k2":"A","k3":"X"}',
      '{"k1":"2","k2":"B","k3":"X"}',
      '{"k1":"3","k2":"A","k3":"Y"}',
      '{"k1":"4","k2":"D","k4":"M"}']
df = pd.DataFrame({'js':L0})

I want to turn the JSON objects into their own dataframe:

  k1 k2   k3   k4
0  1  A    X  NaN
1  2  B    X  NaN
2  3  A    Y  NaN
3  4  D  NaN    M

Right now the only way I know is by using the json module and df.iterrows():

import json
all_json = []
for _,row in df.iterrows():
    all_json.append(json.loads(row["js"]))
df2 = pd.DataFrame.from_dict(all_json)

Is there a better way to do this, ideally without iterating?

EDIT 1:

Thanks for the answers.

I timed the three suggested approaches that use ast.literal_eval on my real-world data, where my own approach takes 158 ms ± 4.01 ms:

df = df.apply(lambda x: ast.literal_eval(x[0]), 1).apply(pd.Series) takes 640 ms ± 7.8 ms

df['js'].apply(ast.literal_eval).apply(pd.Series) takes 636 ms ± 19 ms

pd.DataFrame(df.js.apply(ast.literal_eval).tolist()) takes 180 ms ± 5.11 ms

As suggested, the third approach is the fastest of the three, but sadly all of them are slower than my iterrows approach, and my whole intention was to get rid of iterrows to make this faster.

EDIT 2: pd.DataFrame(df["js"].apply(json.loads).tolist()) takes 25.2 ms ± 512 µs so we have a winner I guess.
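
For anyone who wants to repeat the comparison, here is a minimal timing sketch. It is only an illustration: the synthetic rows, the row count and the function names with_iterrows and with_apply_tolist are assumptions made up for this example, not my real data, so the absolute numbers will differ from the ones quoted above.

```python
import json
import timeit

import pandas as pd

# Synthetic stand-in for the real data (assumption: 2,000 rows of flat JSON).
rows = ['{"k1":"%d","k2":"A","k3":"X"}' % i for i in range(2_000)]
df = pd.DataFrame({"js": rows})


def with_iterrows(frame):
    # Original approach: parse row by row via iterrows().
    parsed = [json.loads(row["js"]) for _, row in frame.iterrows()]
    return pd.DataFrame(parsed)


def with_apply_tolist(frame):
    # Approach from EDIT 2: parse the column, then build the frame once.
    return pd.DataFrame(frame["js"].apply(json.loads).tolist())


# Sanity check: both variants should build the same frame.
same_result = with_iterrows(df).equals(with_apply_tolist(df))

for func in (with_iterrows, with_apply_tolist):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per run")
```

In a notebook, %timeit on each call gives the same kind of "mean ± std" figures as above.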

3 Answers

3

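Use ast.literal_eval to parse each row, then apply pd.Series to expand the resulting dicts into columns:

import ast
# Access the column by label; x[0] relies on positional fallback, which newer pandas versions deprecate.
df = df.apply(lambda x: ast.literal_eval(x["js"]), axis=1).apply(pd.Series)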

print(df)
  k1 k2   k3   k4
0  1  A    X  NaN
1  2  B    X  NaN
2  3  A    Y  NaN
3  4  D  NaN    M

OR (starting again from the original df, since the line above overwrites it):

df = pd.DataFrame([ast.literal_eval(i) for i in df['js']])

OR:

import json
df = pd.DataFrame([json.loads(i) for i in df['js']])
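
One difference between the two parsers is worth keeping in mind when choosing: ast.literal_eval parses Python literals, not JSON, so it only works here because the sample strings happen to be valid in both. A small sketch with a made-up string (the flag key and the true/null values are hypothetical and not part of the question's data):

```python
import ast
import json

# Hypothetical value containing JSON-only literals (not in the question's data).
s = '{"k1": 1, "flag": true, "k3": null}'

# json.loads maps true -> True and null -> None.
parsed = json.loads(s)

try:
    ast.literal_eval(s)
    literal_eval_ok = True
except ValueError:
    # true and null are not Python literals, so literal_eval rejects them.
    literal_eval_ok = False

print(parsed, literal_eval_ok)
```

If the column really holds JSON, json.loads is the safer default.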

3 Comments

Thanks, but this approach is much slower than mine, see Edit.
@Khris Check my new approach.
The second one is faster; the third one is as fast as the tolist approach by @anky_91.
2

I would call the DataFrame constructor after converting each string to a dict (I think this would be faster):

import ast
pd.DataFrame(df.js.apply(ast.literal_eval).tolist())

Or:

import json
pd.DataFrame(df["js"].apply(json.loads).tolist())

  k1 k2   k3   k4
0  1  A    X  NaN
1  2  B    X  NaN
2  3  A    Y  NaN
3  4  D  NaN    M
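
Note that tolist() drops the original index, so the new frame gets a fresh 0..n-1 index. If the source dataframe has a different index and the parsed columns need to line up with it, passing index=df.index to the constructor should take care of that. A minimal sketch, assuming a made-up two-row frame with index labels 10 and 20 (both the data and the names parsed and combined are only for illustration):

```python
import json

import pandas as pd

# Hypothetical frame with a non-default index (assumption for this example).
df = pd.DataFrame(
    {"js": ['{"k1":"1","k2":"A"}', '{"k1":"2","k4":"M"}']},
    index=[10, 20],
)

# Parse each string, then reuse the original index for the new frame.
parsed = pd.DataFrame(df["js"].apply(json.loads).tolist(), index=df.index)

# With matching indexes, the parsed columns can be joined back onto df.
combined = df.join(parsed)
print(combined)
```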

3 Comments

Thanks, but this approach is much slower than mine, see Edit.
@Khris how about pd.DataFrame(df["js"].apply(json.loads).tolist()) ?
25.2 ms ± 512 µs, that's what I was looking for.
1

You can use apply(pd.Series):

import ast
print(df['js'].apply(ast.literal_eval).apply(pd.Series))

Output:

  k1 k2   k3   k4
0  1  A    X  NaN
1  2  B    X  NaN
2  3  A    Y  NaN
3  4  D  NaN    M

1 Comment

Thanks, this approach is almost as fast as mine, but sadly still slower.
