2

I am new to python and trying to extract the data from JSON object to table format using Dataframe

Sample structure of JSON

{
    "subject": "Computer",
    "records": [
      {
        "name": "Section A",
        "elements": [
          {
            "type": "easy",
            "name": "Section A Module 1",
            "elements": [
              {
                "type": "text",
                "name": "SectionQ1",
                "title": "Some question?",
                "readOnly": true
              },
              {
                "type": "text",
                "name": "SectionQ2",
                "title": "some other question?",
                "isRequired": true
              }
            ]
          }
        ],
        "title": "Section A details"
      },
      {
        "name": "Section B",
        "elements": [
          {
            "type": "medium",
            "name": "Section B Module 1",
            "elements": [
              {
                "type": "radiogroup",
                "name": "SectionQ1",
                "title": "some question?",
                "choices": [
                  {
                    "value": "1",
                    "text": "Choose 1"
                  },
                  {
                    "value": "2",
                    "text": "Choose 2"
                  },
                  {
                    "value": "3",
                    "text": "Choose 3"
                  }
                ],
                "colCount": 3
              },
              {
                "type": "radiogroup",
                "name": "SectionQ2",
                "title": "some other question?",
                "description": "question 2",
                "isRequired": true,
                "choices": [
                  {
                    "value": "value 1",
                    "text": "choice 1"
                  },
                  {
                    "value": "value 2",
                    "text": "choice 2"
                  },
                  {
                    "value": "value 3",
                    "text": "choice 3"
                  }
                ]
              }
            ],
            "title": "title"
          }         
        ],
        "title": "Title"
      }
    ]
  }

I'm trying to get inner elements array as table linking to outer elements. so table could be inner_elements, outer_elements , records.

I have tried to below code along with few other approaches.

df = pd.read_json (r'sample.json')
df.columns = df.columns.map(lambda x: x.split(".")[-1])
print(df)

The expected output will be like



dataframe records

Name        Title
Section A   Section A details
Section B   Section B details


dataframe elements1

Key(records)    Type    Name
Section A       easy    Section A Module 1
Section B       medium  Section B Module 1


dataframe elements2

Key(elements1)          type        name        title                   readOnly    Description     isRequired  colCount
Section A Module 1      text        SectionQ1   Some question?          true
Section A Module 1      text        SectionQ2   some other question?    true
Section B Module 1      radiogroup  SectionQ1   some question?                                                  3
Section B Module 1      radiogroup  SectionQ2   some other question?                question 2      true


dataframe choice

Key(elements2)          type        name        value       text    
Section B Module 1      radiogroup  SectionQ1   1           Choose 1
Section B Module 1      radiogroup  SectionQ1   2           Choose 2
Section B Module 1      radiogroup  SectionQ1   3           Choose 3
Section B Module 1      radiogroup  SectionQ2   value 1     choice 1
Section B Module 1      radiogroup  SectionQ2   value 2     choice 2
Section B Module 1      radiogroup  SectionQ2   value 3     choice 3

However not getting any clue to could I proceed further. Please guide me to achieve the same

3
  • How will the output dataframe look like ? Commented Jul 17, 2020 at 17:26
  • Thank you for comment,each array should be the separate dataframe but should have relation with other dataframe with some key. Commented Jul 17, 2020 at 17:33
  • Hi @Sushanth I have added the expected output in the question it self. Please take a look and let me know if you can help me in this. Commented Jul 18, 2020 at 4:46

1 Answer 1

1

To extract nested arrays from a JSON object, I'd like to do something similar to MongoDB's unwind. In fact, I'd try to mimic Mongo's aggregation pipelines. The package toolz can really help with that.

First there's the explode operation, where a record containing an array field gets expanded into a sequence of records identical to the original but with the value for the key in question replaced by the corresponding array element.

Something like this:

from toolz.dicttoolz import assoc_in, get_in

def explode(d, keys):
    values = get_in(keys, d)
    if isinstance(values, list):
        for v in values:
            yield assoc_in(d, keys, v)
    else:
        yield d


record = {"x": 1, "y": [1, 2, 3]}
assert list(explode(record, ["y"])) == [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 1, "y": 3}]

To create the unwind stage, the explode operation needs to be applied a sequence of records:

from toolz.itertoolz import concat

def unwind(iterator, keys):
    return concat(map(lambda d: explode(d, keys), iterator))

records = [{"x": 1, "y": [1, 2, 3]}, {"x": 2, "y": [4, 5]}]
assert list(unwind(records, ["y"])) == [
    {"x": 1, "y": 1}, 
    {"x": 1, "y": 2}, 
    {"x": 1, "y": 3},
    {"x": 2, "y": 4},
    {"x": 2, "y": 5},
]

Now you can create the pipelines to extract the tables you want:

from toolz import curried
from toolz.functoolz import pipe

sections = pipe(data["records"],
                curried.map(lambda d: {
                    "section": get_in(["name"], d),
                    "title": get_in(["title"], d),
                }),
                list,
)

modules = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               curried.map(lambda d: {
                   "section": get_in(["name"], d),
                   "type": get_in(["elements", "type"], d),
                   "module": get_in(["elements", "name"], d),
               }),
               list,
)

questions = pipe(data["records"],
                 lambda i: unwind(i, ["elements"]),
                 lambda i: unwind(i, ["elements", "elements"]),
                 curried.map(lambda d: {
                     "module": get_in(["elements", "name"], d),
                     "type": get_in(["elements", "elements", "type"], d),
                     "question": get_in(["elements", "elements", "name"], d),
                     "title": get_in(["elements", "elements", "title"], d),
                     "readOnly": get_in(["elements", "elements", "readOnly"], d),
                     "description": get_in(["elements", "elements", "description"], d),
                     "isRequired": get_in(["elements", "elements", "isRequired"], d),
                     "colCount": get_in(["elements", "elements", "colCount"], d),
                 }),
                 list,
)

choices = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               lambda i: unwind(i, ["elements", "elements"]),
               lambda i: unwind(i, ["elements", "elements", "choices"]),
               curried.filter(lambda d: get_in(["elements", "elements", "choices"], d) is not None),
               curried.map(lambda d: {
                   "module": get_in(["elements", "name"], d),
                   "type": get_in(["elements", "elements", "type"], d),
                   "question": get_in(["elements", "elements", "name"], d),
                   "value": get_in(["elements", "elements", "choices", "value"], d),
                   "text": get_in(["elements", "elements", "choices", "text"], d),
               }),
               list,
)

Now you can create the dataframes:

import pandas as pd

df_sections = pd.DataFrame.from_records(sections)
df_modules = pd.DataFrame.from_records(modules)
df_questions = pd.DataFrame.from_records(questions)
df_choices = pd.DataFrame.from_records(choices)
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you for help.Very well explained and presented.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.