0

I have a dataframe where a column is a json string with a dictionary, and I need to expand the json into separate columns. Example:

   c1                              c2
0  a1     {'x1': 1, 'x3': 3, 'x2': 2}
1  a2  {'x1': 21, 'x3': 23, 'x2': 22}

should become:

   c1    x1    x2    x3
0  a1   1.0   2.0   3.0
1  a2  21.0  22.0  23.0

My problem is very similar to this thread, except that I have strings, not dictionaries (although the strings evaluate to a dictionary), and the simple, optimized solution proposed there doesn't work in my case. I have a working solution, but it is clearly horribly inefficient. Here's a snippet with my code and the solution proposed in that thread:

import json
import pandas as pd

def expandFeatures(df, columnName):
    """Expands column 'columnName', which contains a dictionary in form of a json string, into N single columns, each containing a single feature"""
    # get names of new columns from the first row
    features = json.loads(df.iloc[0].loc[columnName])
    featureNames = list(features.keys())
    featureNames.sort()
    # add new columns (empty values)
    newCols = list(df.columns) + featureNames
    df = df.reindex(columns=newCols, fill_value=0.0)
    # fill in the values of the new columns
    for index, row in df.iterrows():
        features = json.loads(row[columnName])
        for key,val in features.items():
            df.at[index, key] = val
    # remove column 'columnName'
    return df.drop(columns=[columnName])

def expandFeatures1(df, columnName):
    return df.drop(columnName, axis=1).join(pd.DataFrame(df[columnName].values.tolist()))

df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'], ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                    columns=['c1', 'c2'])
df_dict = pd.DataFrame([['a1', {'x1': 1, 'x2': 2, 'x3': 3}], ['a2', {'x1': 21, 'x2': 22, 'x3': 23}]],
                    columns=['c1', 'c2'])

# correct result, but inefficient
print("expandFeatures, df_json")
df = df_json.copy()
print(df)
df = expandFeatures(df, 'c2')
print(df)

# this gives an error because expandFeatures expects a string, not a dictionary 
# print("expandFeatures, df_dict")
# df = df_dict.copy()
# print(df)
# df = expandFeatures(df, 'c2')
# print(df)

# WRONG, doesn't expand anything
print("expandFeatures1, df_json")
df = df_json.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)

# correct and efficient, but not my use case (I have strings not dicts)
print("expandFeatures1, df_dict")
df = df_dict.copy()
print(df)
df = expandFeatures1(df, 'c2')
print(df)

I'm sure there is some obvious way to improve the efficiency of my code, make it more similar to the single line proposed in the other thread, but I can't really see it myself... Thanks in advance for any help.

1 Answer 1

1

If you json strings are valid dictionaries, you can use ast.literal_eval to parse them:

import pandas as pd
from ast import literal_eval

df_json = pd.DataFrame([['a1', '{"x1": 1, "x2": 2, "x3": 3}'],
                        ['a2', '{"x1": 21, "x2": 22, "x3": 23}']],
                        columns=['c1', 'c2'])

print (pd.concat([df_json,pd.DataFrame(df_json["c2"].apply(literal_eval).to_list())],axis=1).drop("c2",axis=1))

#
   c1  x1  x2  x3
0  a1   1   2   3
1  a2  21  22  23
Sign up to request clarification or add additional context in comments.

3 Comments

thanks @henry-yik, that works. Is there a particular reason why you use pd.concat, is it more efficient than pd.join?
Also, because my string is certainly json, I can also use apply(json.loads). Is there any difference between the two, is one more efficient than the other, or are they internally essentially the same?
concat is base on axis, which fits our purpose here since we do not need to do lookup. I am not aware of the performance difference between ast and json.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.