Parse nested JSON data in a DataFrame

Question

I have a delimited file in which one column holds key=value pairs, and one of those values is a JSON document. I need to parse this data into a DataFrame.

Below is the record format

**trx_id|name|service_context|status**

abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success

I need to convert all the information in these records into this format:

trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123        |abs         |product                     |transfer                    |.....|0                   |success
abc456|order|cdr |abc456        |abs         |product                     |                            |.....|1                   |success

Currently I parse the key=value part manually: I split with sep=';|[|]', strip everything up to the '=', and rename the columns. For the JSON I run the command below, but the result replaces the existing table and contains only the parsed JSON.

test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])

Is there any way to avoid the manual steps when processing this type of data?

You need to do it column-wise. I have solved half of it; try to complete the rest.
– Pygirl, Commented Dec 19, 2020 at 13:28

Pygirl · Accepted Answer · 2020-12-19 13:25:48Z

1

The hint below should be sufficient to solve the problem.

Process each column separately and then merge the results together (once a column has been split into multiple columns, drop the original):

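import json
import pandas as pd

# Parse the JSON that follows 'payload=' in each service_context value.
# If the column still holds the full 'type=cdr;payload={...}' string, splitting on '='
# and taking [1] returns 'cdr;payload', so split on 'payload=' instead.
# json.loads replaces ast.literal_eval here because the payload is JSON.
x = pd.json_normalize(
    df3['service_context'].apply(lambda s: json.loads(s.split('payload=', 1)[1])).tolist()
).add_prefix('payload.')

# One column per position in the counter list; shorter lists are padded with None.
y = pd.DataFrame(x['payload.counter'].apply(lambda counters: [c['counter_type'] for c in counters]).to_list())
y = y.rename(columns={0: 'counter_type', 1: 'counter_info'})

for row in x['payload.product']:
    z1 = pd.json_normalize(row)
    z2 = pd.json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.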
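One way to finish the loop over `payload.product` is to walk the nesting explicitly and collect one row per `resource_pecification` entry. This is only a sketch, not the answerer's own solution: the `payloads` list holds trimmed copies of the two sample payloads, and the choice of "one output row per resource entry" is an assumption about what the flattened result should look like.

```python
import pandas as pd

# Trimmed copies of the two sample payloads (only the keys used below).
payloads = [
    {"trx_id": "abc123",
     "product": [{"flag": "0", "customer_spec": {
         "period": "0", "period_unit": "month",
         "resource_pecification": [{"amount": {"ssp": 0.0, "discount": 0.0}}]}}]},
    {"trx_id": "abc456",
     "product": [{"flag": "0", "customer_spec": {
         "period_unit": "month",
         "resource_pecification": [{"amount": {"ssp": 0.0, "discount": 0.0},
                                    "bt": {"service_id": "500_USSD", "amount": "65000"}}]}}]},
]

rows = []
for payload in payloads:
    for product in payload.get("product", []):
        spec = product.get("customer_spec", {})
        for resource in spec.get("resource_pecification", []):
            # Assumed rule: one output row per resource entry, carrying its parents' fields.
            rows.append({
                "trx_id": payload["trx_id"],
                "flag": product.get("flag"),
                "period": spec.get("period"),
                "period_unit": spec.get("period_unit"),
                "resource": resource,
            })

# json_normalize flattens the remaining nested dicts into dotted column names,
# e.g. resource.amount.ssp and resource.bt.service_id.
flat = pd.json_normalize(rows)
print(flat.columns.tolist())
```

Keys that are missing in a record (such as `period` or `bt`) end up as missing values rather than raising, which matters because the two sample rows do not share the same keys.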


answered Dec 19, 2020 at 13:25

Pygirl


Rob Raymond · Accepted Answer · 2020-12-29 20:19:49Z

0

It's really a 3-step approach:

1. read the file using the primary pipe | delimiter
2. extract the key / value pairs
3. normalize the JSON

import pandas as pd
import io, json

# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""), 
            sep="|", header=None, names=["trx_id","name","data","status"])

df2 = pd.concat([
    df,
    # split the ";"-delimited key=value pairs in the 3rd column into their own columns
    pd.DataFrame(
        [[c.split("=", 1)[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=", 1)[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)

# extract the JSON payload into columns. Embedded lists are left as-is because they are
# many-to-many; how to flatten them has to be worked out by the data owner
df3 = pd.concat([df2, 
           pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)

output (row 0, one column per line)

trx_id                     abc123
name                       order
data                       type=cdr;payload={"trx_id":"abc123","name":"ab...
status                     success
type                       cdr
payload                    {"trx_id":"abc123","name":"abs","counter":[{"c...
index                      0
payload.trx_id             abc123
payload.name               abs
payload.counter            [{'counter_type': 'product'}, {'counter_type':...
payload.language           id
payload.type               AD
payload.can_replace        yes
payload.product            [{'flag': '0', 'identifier_flag': '0', 'custom...
payload.renewal_flag       0
payload.price.transaction  1800
payload.price.discount     0
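Splitting the third column on ";" and "=" works for the sample rows, but it would break if a JSON string value ever contained either character. A slightly more defensive sketch is to cut the payload off first and only then split the remaining pairs. The helper name `parse_service_context` and the `note` field in the example are made up for illustration, and the code assumes `payload=` is always the last key in the column.

```python
import json

def parse_service_context(text):
    """Split 'k1=v1;k2=v2;payload={json}' into a dict, parsing the payload as JSON."""
    # Everything after the first 'payload=' is treated as JSON (assumes payload is the last key).
    head, sep, payload = text.partition("payload=")
    fields = {}
    for pair in head.strip(";").split(";"):
        if pair:
            key, _, value = pair.partition("=")
            fields[key] = value
    if sep:
        fields["payload"] = json.loads(payload)
    return fields

# The 'note' value is a hypothetical field that contains both '=' and ';'.
example = 'type=cdr;payload={"trx_id":"abc123","note":"a=b;c","price":{"transaction":1800}}'
parsed = parse_service_context(example)
print(parsed["type"], parsed["payload"]["note"])
```

The resulting dicts can be passed to `pd.json_normalize`, which turns the nested `payload` keys into `payload.`-prefixed columns next to `type`.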

Use with caution: `explode()` the embedded lists

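# one row per product, with the keys of each product dict as columns
df3p = df3["payload.product"].explode().apply(pd.Series)
# counter dicts become columns; a row repeats when its list has more than one entry
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
    pd.json_normalize(
        df3p.join(df3p["customer_spec"].apply(pd.Series))
        .explode("resource_pecification")
        .to_dict(orient="records")
    )
)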

edited Dec 29, 2020 at 20:19

answered Dec 21, 2020 at 20:19

Rob Raymond


2 Comments

Tiara Over a year ago

Hi, thanks for this answer. It almost solved my problem. I just need to find a way to extract the counter and product columns.

Rob Raymond Over a year ago

For embedded lists, the logic for how you want to handle them needs to be defined. explode() expands an embedded list; there are two here, so you need to define rules. Also, your embedded lists contain dicts, so those need to be extracted using pd.Series or pd.json_normalize(). I updated the answer with an example, but it's probably incorrect because no data rules have been applied.
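As an illustration of what such a rule could look like for `payload.counter`, the sketch below maps list positions to fixed column names, which matches the layout asked for in the question (first entry to `counter_type`, second to `counter_info`). That positional mapping is an assumption, and the small `df` here is a stand-in for the real frame with only the relevant columns.

```python
import pandas as pd

# Stand-in for the normalized frame: only the columns needed for the example.
df = pd.DataFrame({
    "trx_id": ["abc123", "abc456"],
    "payload.counter": [
        [{"counter_type": "product"}, {"counter_type": "transfer"}],
        [{"counter_type": "product"}],
    ],
})

# Assumed rule: position 0 -> counter_type, position 1 -> counter_info; extras are dropped.
names = ["payload.counter.counter_type", "payload.counter.counter_info"]
counters = pd.DataFrame(
    [[c.get("counter_type") for c in lst][:len(names)] for lst in df["payload.counter"]],
    index=df.index,
)
# Guarantee both positional columns exist even if every list is short.
counters = counters.reindex(columns=range(len(names)))
counters.columns = names

# join() keeps the original columns, so the parsed values are added rather than replacing the table.
wide = df.drop(columns="payload.counter").join(counters)
print(wide)
```

Because the result keeps one row per record, this avoids the row duplication that `explode()` introduces when a list has more than one entry.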
