1

Here are the 3 rows of my sample json.

{"customer": 10, "date": "2017.04.06 12:09:32", "itemList": [{"item": "20126907_EA", "price": 1.88, "quantity": 1.0}, {"item": "20185742_EA", "price": 0.99, "quantity": 1.0}, {"item": "20138681_EA", "price": 1.79, "quantity": 1.0}, {"item": "20049778001_EA", "price": 2.47, "quantity": 1.0}, {"item": "20419715007_EA", "price": 3.33, "quantity": 1.0}, {"item": "20321434_EA", "price": 2.47, "quantity": 1.0}, {"item": "20068076_KG", "price": 28.24, "quantity": 10.086}, {"item": "20022893002_EA", "price": 1.77, "quantity": 1.0}, {"item": "20299328003_EA", "price": 1.25, "quantity": 1.0}], "store": "825f9cd5f0390bc77c1fed3c94885c87"}
{"customer": 100, "date": "2017.01.10 12:59:09", "itemList": [{"item": "20132638_KG", "price": 3.33, "quantity": 0.28}, {"item": "20320042001_EA", "price": 2.99, "quantity": 1.0}, {"item": "20320832003_EA", "price": 2.58, "quantity": 2.0}, {"item": "20128148_KG", "price": 4.85, "quantity": 0.256}, {"item": "20027478_KG", "price": 4.58, "quantity": 0.135}, {"item": "20653232_EA", "price": 5.99, "quantity": 1.0}, {"item": "20317755_EA", "price": 3.69, "quantity": 1.0}, {"item": "20519704_KG", "price": 4.24, "quantity": 0.214}, {"item": "20591843_KG", "price": 5.56, "quantity": 0.286}], "store": "a666587afda6e89aec274a3657558a27"}
{"customer": 1000, "date": "2017.04.17 18:53:40", "itemList": [{"item": "20788909_EA", "price": 3.49, "quantity": 1.0}, {"item": "20975073_EA", "price": 5.0, "quantity": 1.0}, {"item": "20868904_EA", "price": 5.0, "quantity": 1.0}, {"item": "20189092_EA", "price": 0.05, "quantity": 1.0}], "store": "ebb71045453f38676c40deb9864f811d"}

I would like to flatten each record so that every nested entry in itemList becomes its own row, alongside the top-level fields. Below is the code I'm trying, but I'm running into an error:

def data_load():
    p=Path(r'C:\Users\rohgorthy\Downloads\LBD_Assignemtn\sample_tag.json')
    with p.open('r', encoding='utf-8') as f:
        data = f.read()
    
    df = pd.json_normalize(data, record_path='itemList', meta=['customer', 'date', 'store'])
    return df

Error below:

result = result[spec]
TypeError: string indices must be integers

Can anyone please help me get the format below?

df Columns:

customer date item price quantity store 

Thank you in advance.

2 Answers

1

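The error occurs because json_normalize is being given the raw file contents as a single string. The file holds one JSON object per line, so I think you need to parse each line into an object (dict) first and pass the resulting list of dicts to json_normalize.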

from pathlib import Path
from json import loads
from pandas import json_normalize

def data_load(p):
    p = Path(p) if not isinstance(p, Path) else p
    text = p.read_text(encoding='utf-8')
    data = [loads(ln) for ln in text.splitlines()]
    return json_normalize(data, record_path='itemList', meta=['customer', 'date', 'store'])

df = data_load('sample_tag.json')
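To make it clearer what record_path and meta are doing here, the following is a rough plain-Python sketch of the same flattening, using only the json module. The two input lines are shortened from the sample in the question, and flatten_lines is a hypothetical helper name, not part of pandas.

```python
import json

# Two records shortened from the question's sample, one JSON object per line.
raw = '''{"customer": 10, "date": "2017.04.06 12:09:32", "itemList": [{"item": "20126907_EA", "price": 1.88, "quantity": 1.0}, {"item": "20185742_EA", "price": 0.99, "quantity": 1.0}], "store": "825f9cd5f0390bc77c1fed3c94885c87"}
{"customer": 1000, "date": "2017.04.17 18:53:40", "itemList": [{"item": "20788909_EA", "price": 3.49, "quantity": 1.0}], "store": "ebb71045453f38676c40deb9864f811d"}'''


def flatten_lines(text):
    """Hypothetical helper: one output row per entry in each record's itemList."""
    rows = []
    for ln in text.splitlines():
        if not ln.strip():
            continue  # skip blank lines, e.g. a trailing newline in the file
        record = json.loads(ln)  # parse one line into a dict
        for entry in record["itemList"]:
            # Copy the top-level fields (the "meta") onto every nested item.
            rows.append({
                "customer": record["customer"],
                "date": record["date"],
                "item": entry["item"],
                "price": entry["price"],
                "quantity": entry["quantity"],
                "store": record["store"],
            })
    return rows


rows = flatten_lines(raw)
```

Passing rows to pandas.DataFrame should give a frame with the columns customer, date, item, price, quantity and store, which is the layout asked for in the question.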

3 Comments

Excellent, this works, with one correction: data = [json.loads(ln) for ln in text.splitlines()]
Oops, I was working from memory, sorry. I've fixed the answer.
Thank you so much! It was very kind of you, and I really appreciate your answer. I wish you a wonderful day ahead!
0

I used PySpark to solve this. The code is below:

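from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists

def data_load():
    df = spark.read.json(r"transactions.json")
    df.createOrReplaceTempView("df")
    # One row per element of itemList
    df2 = spark.sql("select customer, date, explode(itemList) as item_list, store from df")
    df2.createOrReplaceTempView("df2")
    # Pull the struct fields out into columns: customer, date, item, price, quantity, store
    df3 = spark.sql("select customer, date, item_list.item as item, item_list.price as price, item_list.quantity as quantity, store from df2")
    df3.createOrReplaceTempView("df3")
    # Count how often each pair of items appears for the same customer and date
    df4 = spark.sql("select a.item as itema, b.item as itemb, count(*) as cnt from df3 a join df3 b on a.customer = b.customer and a.item < b.item and a.date = b.date group by a.item, b.item")
    return df4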
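
The final self-join query goes one step beyond the flattened format: it counts how often two items show up together for the same customer and date. As a rough illustration of what that join computes, here is a plain-Python sketch over already flattened rows. The function name count_item_pairs and the items A, B and C are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_item_pairs(rows):
    """Hypothetical helper: count item pairs sharing a customer and date."""
    # Group items by (customer, date), mirroring the join condition.
    baskets = defaultdict(list)
    for r in rows:
        baskets[(r["customer"], r["date"])].append(r["item"])

    counts = Counter()
    for items in baskets.values():
        # Sorting puts each pair in (smaller, larger) order; the a < b check
        # drops pairs of identical items, like a.item < b.item in the SQL.
        for a, b in combinations(sorted(items), 2):
            if a < b:
                counts[(a, b)] += 1
    return counts


# Made-up rows: one transaction with A, B, C and another with A, B.
sample_rows = [
    {"customer": 1, "date": "2017.01.01 10:00:00", "item": "A"},
    {"customer": 1, "date": "2017.01.01 10:00:00", "item": "B"},
    {"customer": 1, "date": "2017.01.01 10:00:00", "item": "C"},
    {"customer": 2, "date": "2017.01.02 11:00:00", "item": "B"},
    {"customer": 2, "date": "2017.01.02 11:00:00", "item": "A"},
]

pair_counts = count_item_pairs(sample_rows)
```

Each key in pair_counts corresponds to an (itema, itemb) row of the query result, and the value corresponds to cnt.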
