Nested JSON to Dataframe using Python Pandas

Question

I am new to python and trying to extract the data from JSON object to table format using Dataframe

Sample structure of JSON

{
    "subject": "Computer",
    "records": [
      {
        "name": "Section A",
        "elements": [
          {
            "type": "easy",
            "name": "Section A Module 1",
            "elements": [
              {
                "type": "text",
                "name": "SectionQ1",
                "title": "Some question?",
                "readOnly": true
              },
              {
                "type": "text",
                "name": "SectionQ2",
                "title": "some other question?",
                "isRequired": true
              }
            ]
          }
        ],
        "title": "Section A details"
      },
      {
        "name": "Section B",
        "elements": [
          {
            "type": "medium",
            "name": "Section B Module 1",
            "elements": [
              {
                "type": "radiogroup",
                "name": "SectionQ1",
                "title": "some question?",
                "choices": [
                  {
                    "value": "1",
                    "text": "Choose 1"
                  },
                  {
                    "value": "2",
                    "text": "Choose 2"
                  },
                  {
                    "value": "3",
                    "text": "Choose 3"
                  }
                ],
                "colCount": 3
              },
              {
                "type": "radiogroup",
                "name": "SectionQ2",
                "title": "some other question?",
                "description": "question 2",
                "isRequired": true,
                "choices": [
                  {
                    "value": "value 1",
                    "text": "choice 1"
                  },
                  {
                    "value": "value 2",
                    "text": "choice 2"
                  },
                  {
                    "value": "value 3",
                    "text": "choice 3"
                  }
                ]
              }
            ],
            "title": "title"
          }         
        ],
        "title": "Title"
      }
    ]
  }

I'm trying to get inner elements array as table linking to outer elements. so table could be inner_elements, outer_elements , records.

I have tried to below code along with few other approaches.

df = pd.read_json (r'sample.json')
df.columns = df.columns.map(lambda x: x.split(".")[-1])
print(df)

The expected output will be like



dataframe records

Name        Title
Section A   Section A details
Section B   Section B details


dataframe elements1

Key(records)    Type    Name
Section A       easy    Section A Module 1
Section B       medium  Section B Module 1


dataframe elements2

Key(elements1)          type        name        title                   readOnly    Description     isRequired  colCount
Section A Module 1      text        SectionQ1   Some question?          true
Section A Module 1      text        SectionQ2   some other question?    true
Section B Module 1      radiogroup  SectionQ1   some question?                                                  3
Section B Module 1      radiogroup  SectionQ2   some other question?                question 2      true


dataframe choice

Key(elements2)          type        name        value       text    
Section B Module 1      radiogroup  SectionQ1   1           Choose 1
Section B Module 1      radiogroup  SectionQ1   2           Choose 2
Section B Module 1      radiogroup  SectionQ1   3           Choose 3
Section B Module 1      radiogroup  SectionQ2   value 1     choice 1
Section B Module 1      radiogroup  SectionQ2   value 2     choice 2
Section B Module 1      radiogroup  SectionQ2   value 3     choice 3

However not getting any clue to could I proceed further. Please guide me to achieve the same

Thank you for comment,each array should be the separate dataframe but should have relation with other dataframe with some key. — Python Learner
– Python Learner, Commented Jul 17, 2020 at 17:33
Hi @Sushanth I have added the expected output in the question it self. Please take a look and let me know if you can help me in this. — Python Learner
– Python Learner, Commented Jul 18, 2020 at 4:46

Edgar Ramírez Mondragón · Accepted Answer · 2020-07-18 08:29:33Z

To extract nested arrays from a JSON object, I'd like to do something similar to MongoDB's unwind. In fact, I'd try to mimic Mongo's aggregation pipelines. The package toolz can really help with that.

First there's the explode operation, where a record containing an array field gets expanded into a sequence of records identical to the original but with the value for the key in question replaced by the corresponding array element.

Something like this:

from toolz.dicttoolz import assoc_in, get_in

def explode(d, keys):
    values = get_in(keys, d)
    if isinstance(values, list):
        for v in values:
            yield assoc_in(d, keys, v)
    else:
        yield d


record = {"x": 1, "y": [1, 2, 3]}
assert list(explode(record, ["y"])) == [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 1, "y": 3}]

To create the unwind stage, the explode operation needs to be applied a sequence of records:

from toolz.itertoolz import concat

def unwind(iterator, keys):
    return concat(map(lambda d: explode(d, keys), iterator))

records = [{"x": 1, "y": [1, 2, 3]}, {"x": 2, "y": [4, 5]}]
assert list(unwind(records, ["y"])) == [
    {"x": 1, "y": 1}, 
    {"x": 1, "y": 2}, 
    {"x": 1, "y": 3},
    {"x": 2, "y": 4},
    {"x": 2, "y": 5},
]

Now you can create the pipelines to extract the tables you want:

from toolz import curried
from toolz.functoolz import pipe

sections = pipe(data["records"],
                curried.map(lambda d: {
                    "section": get_in(["name"], d),
                    "title": get_in(["title"], d),
                }),
                list,
)

modules = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               curried.map(lambda d: {
                   "section": get_in(["name"], d),
                   "type": get_in(["elements", "type"], d),
                   "module": get_in(["elements", "name"], d),
               }),
               list,
)

questions = pipe(data["records"],
                 lambda i: unwind(i, ["elements"]),
                 lambda i: unwind(i, ["elements", "elements"]),
                 curried.map(lambda d: {
                     "module": get_in(["elements", "name"], d),
                     "type": get_in(["elements", "elements", "type"], d),
                     "question": get_in(["elements", "elements", "name"], d),
                     "title": get_in(["elements", "elements", "title"], d),
                     "readOnly": get_in(["elements", "elements", "readOnly"], d),
                     "description": get_in(["elements", "elements", "description"], d),
                     "isRequired": get_in(["elements", "elements", "isRequired"], d),
                     "colCount": get_in(["elements", "elements", "colCount"], d),
                 }),
                 list,
)

choices = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               lambda i: unwind(i, ["elements", "elements"]),
               lambda i: unwind(i, ["elements", "elements", "choices"]),
               curried.filter(lambda d: get_in(["elements", "elements", "choices"], d) is not None),
               curried.map(lambda d: {
                   "module": get_in(["elements", "name"], d),
                   "type": get_in(["elements", "elements", "type"], d),
                   "question": get_in(["elements", "elements", "name"], d),
                   "value": get_in(["elements", "elements", "choices", "value"], d),
                   "text": get_in(["elements", "elements", "choices", "text"], d),
               }),
               list,
)

Now you can create the dataframes:

import pandas as pd

df_sections = pd.DataFrame.from_records(sections)
df_modules = pd.DataFrame.from_records(modules)
df_questions = pd.DataFrame.from_records(questions)
df_choices = pd.DataFrame.from_records(choices)

Collectives™ on Stack Overflow

Nested JSON to Dataframe using Python Pandas

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related