1

I need to transform following data frame with json values in column into dataframe columnar structure so that it will be taking less space and easy to compute.

Sample DataFrame:

obs_id date obs
I2213 2021-12-31 23:20:02.761008 "[{'type': 'air', 'results': {'bat': {'F1': 0.1, 'F2': 0.2}}, {'type': 'water', 'results': {'neo': {'F1': 0.3}}]"
I2213 2022-01-01 23:20:02.761008 "[{'type': 'earth', 'results': {'cat': {'F1': 0.4}}]"
I2213 2022-01-02 23:20:02.761008 "[{'type': 'air', 'results': {'bat': {'F1': 0.2, 'F2': 0.1}}]"

Required Transformation format:

obs_id date obs.air.bat.F1 obs.air.bat.F2 obs.water.neo.F1 obs.earth.cat.F1

not sure if multi-level columns will suit better here.

I tried to create a separate dataframe from obs column like:

df1 = pd.DataFrame(df['obs'].values.tolist())

but since it contains list instead of dictionary, it doesn't work. Is it possible to achieve the require format?

1
  • obs is string ? Commented Jan 3, 2022 at 18:28

1 Answer 1

1

Let's start from a correction to your source data. As your sample contains unbalanced parentheses, probably its actual content should be:

   obs_id  date        obs
0  I2213   2021-12-31  [{'type': 'air', 'results': {'bat': {'F1': 0.1, 'F2': 0.2}}}, {'type': 'water', 'results': {'neo': {'F1': 0.3}}}]
1  I2213   2022-01-01  [{'type': 'earth', 'results': {'cat': {'F1': 0.4}}}]
2  I2213   2022-01-02  [{'type': 'air', 'results': {'bat': {'F1': 0.2, 'F2': 0.1}}}]

To keep printouts of reasonable breadth, I dropped the time part from your date column.

Start the coding part from necessary imports:

import pandas as pd
import json

Then define a row processing function as:

def procRow(row):
    wrk1 = pd.json_normalize(json.loads(row.obs.replace("'", '"')))
    wrk2 = wrk1.set_index('type').stack().reset_index()
    return pd.Series(wrk2[0].values, index='obs.' + wrk2.type\
        + wrk2.level_1.str.slice(7))

And concatenate first 2 columns of df with the result of application of this function to each row:

result = pd.concat([df.iloc[:, 0:2], df.apply(procRow, axis=1)], axis=1)

The result is:

  obs_id       date  obs.air.bat.F1  obs.air.bat.F2  obs.earth.cat.F1  obs.water.neo.F1
0  I2213 2021-12-31             0.1             0.2               NaN               0.3
1  I2213 2022-01-01             NaN             NaN               0.4               NaN
2  I2213 2022-01-02             0.2             0.1               NaN               NaN
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.