I have a delimited file in which one column contains both key=value pairs and a JSON payload. I need to parse this data into a DataFrame.

Below is the record format

**trx_id|name|service_context|status**

abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success

I need to convert all the information in these records into this format:

trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123        |abs         |product                     |transfer                    |.....|0                   |success
abc456|order|cdr |abc456        |abs         |product                     |                            |.....|1                   |success

Currently I parse the key=value part manually: I split with sep=';|[|], remove everything up to the '=', and rename the columns. For the JSON I run the command below, but the result replaces the existing table and contains only the parsed JSON.

test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])

Is there any way to avoid the manual steps when processing this type of data?

  • is the problem/query solved? Commented Dec 19, 2020 at 10:02
  • not yet, im still searching for the solution. Commented Dec 19, 2020 at 11:28
  • Do you need column names to be like this only? Commented Dec 19, 2020 at 11:48
  • yes, if possible.. Commented Dec 19, 2020 at 12:07
  • you need to do it column wise. I have solved the half of it. Try to complete it. Commented Dec 19, 2020 at 13:28

2 Answers

The hint below should be sufficient to solve the problem.

Do it part by part for each column and then merge the parts together (you will need to drop the original columns once you have split them into multiple columns):

import json
import pandas as pd

# Keep only the JSON after "payload=" and flatten nested dicts into dotted columns
x = pd.json_normalize(
    df3['service_context'].apply(lambda s: json.loads(s.split('payload=', 1)[1])).tolist()
).add_prefix('payload.')

# One column per position in each counter list
y = pd.DataFrame(x['payload.counter'].apply(lambda lst: [i['counter_type'] for i in lst]).to_list())
y = y.rename(columns={0: 'counter_type', 1: 'counter_info'})

for row in x['payload.product']:
    z1 = pd.json_normalize(row)
    z2 = pd.json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
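
The merge step mentioned above is not shown, so here is a minimal sketch of one way it could look. It assumes every frame still has the default 0..n-1 index, and it uses shortened, hypothetical records (only `counter` and `renewal_flag` in the payload) rather than the full data from the question. The column names `payload.counter.counter_type` and `payload.counter.counter_info` follow the format the question asks for.

```python
import json
import pandas as pd

# Hypothetical, shortened records in the same shape as the question's data.
df3 = pd.DataFrame({
    "trx_id": ["abc123", "abc456"],
    "service_context": [
        'type=cdr;payload={"trx_id":"abc123","counter":[{"counter_type":"product"},'
        '{"counter_type":"transfer"}],"renewal_flag":"0"}',
        'type=cdr;payload={"trx_id":"abc456","counter":[{"counter_type":"product"}],'
        '"renewal_flag":"1"}',
    ],
    "status": ["success", "success"],
})

# Parse the JSON part of each cell and flatten nested dicts into dotted columns.
payloads = df3["service_context"].apply(lambda s: json.loads(s.split("payload=", 1)[1]))
x = pd.json_normalize(payloads.tolist()).add_prefix("payload.")

# One column per position in the counter list; shorter lists are padded with missing values.
y = pd.DataFrame([[c["counter_type"] for c in row] for row in x["payload.counter"]])
y = y.rename(columns={0: "payload.counter.counter_type", 1: "payload.counter.counter_info"})

# A column-wise concat keeps the original columns next to the parsed ones,
# instead of replacing the table with the JSON result only.
merged = pd.concat(
    [df3.drop(columns="service_context"), x.drop(columns="payload.counter"), y],
    axis=1,
)
print(merged)
```

Because the pieces are aligned on the index, any filtering or sorting done in between would need a `reset_index(drop=True)` before the final concat.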

It's really a 3-step approach:

  1. split on the primary pipe (|) delimiter
  2. extract the key/value pairs
  3. normalize the JSON

import pandas as pd
import io, json

# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""), 
            sep="|", header=None, names=["trx_id","name","data","status"])

df2 = pd.concat([
    df,
    # split the ;-delimited key=value pairs in the 3rd column into sub-columns
    pd.DataFrame(
        # maxsplit=1 keeps any "=" inside a value intact
        [[c.split("=", 1)[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=", 1)[0] for c in df.data.str.split(";")[0]], 
    )
], axis=1)

# extract the JSON payload into columns. Embedded lists are left as-is because they are
# many-to-many and the data owner has to decide how to handle them
# (reset_index() is what adds the "index" column seen in the output)
df3 = pd.concat([df2, 
           pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)

output

    trx_id  name    data    status  type    payload index   payload.trx_id  payload.name    payload.counter payload.language    payload.type    payload.can_replace payload.product payload.renewal_flag    payload.price.transaction   payload.price.discount
0   abc123  order   type=cdr;payload={"trx_id":"abc123","name":"ab...   success cdr {"trx_id":"abc123","name":"abs","counter":[{"c...   0   abc123  abs [{'counter_type': 'product'}, {'counter_type':...   id  AD  yes [{'flag': '0', 'identifier_flag': '0', 'custom...   0   1800    0

Use with caution: explode() the embedded lists

df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
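
If one row per record is wanted (as in the question's target format) rather than the extra rows that explode() produces, one possible rule is to spread list items into positional columns. The sketch below is only an illustration of that rule, not part of the approach above: `flatten` is a hypothetical helper, the payloads are shortened samples, and the resulting names (`payload.counter.0.counter_type`, and so on) are an assumed naming scheme that would still need renaming to match the desired headers.

```python
import json
import pandas as pd

def flatten(obj, prefix=""):
    """Recursively flatten dicts and lists into one {dotted_key: scalar} dict.

    List items get their position as a key part, e.g. counter.0.counter_type.
    """
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{i}."))
    else:
        out[prefix[:-1]] = obj  # drop the trailing dot
    return out

# Shortened, hypothetical payloads in the same shape as the question's JSON.
payloads = [
    '{"trx_id":"abc123","counter":[{"counter_type":"product"},'
    '{"counter_type":"transfer"}],"price":{"transaction":1800}}',
    '{"trx_id":"abc456","counter":[{"counter_type":"product"}],'
    '"price":{"transaction":1800}}',
]

# One row per record; keys missing from a record become NaN in that row.
flat = pd.DataFrame([flatten(json.loads(p), "payload.") for p in payloads])
print(flat)
```

The resulting frame can be placed next to the other columns with `pd.concat([...], axis=1)`, provided both sides share the same index. Note that empty lists or dicts produce no column at all under this rule.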


2 Comments

Hi, thanks for this answer. It almost solves my problem; I just need to find a way to extract the counter and product columns.
For embedded lists, the logic of how you want to handle them needs to be defined. explode() expands an embedded list; there are two of them, so you need to define rules. Also, your embedded lists contain dicts, so those need to be extracted using pd.Series or pd.json_normalize(). I've updated the answer with an example, but it's probably incorrect because no data rules have been applied.
