I have a DataFrame (df1) containing two columns:

id          information 
00100       {'DriversList': {'ProblematicDrivers': [], 'In...   
00200       {'DriversList': {'ProblematicDrivers': [], 'In...

The information column contains a nested JSON object, which needs to be converted into a DataFrame, with each resulting row associated with its id.

JSON in the df1['information'] column:

{
  'DriversList': {
    'ProblematicDrivers': [
    ],
    'InstalledDrivers': [
      {
        'DriverName': 'FaxMachine',
        'DisplayName': 'Fax',
        'Version': '10',
        'Date': '06-21-2006'
      },
      {
        'DriverName': 'FaxMachine',
        'DisplayName': 'Fax',
        'Version': '10',
        'Date': '06-21-2006'
      }
    ]
  }
}

My code so far:

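import pandas as pd

# Assumes each cell of df1['information'] is already a dict, not a JSON string
data = pd.json_normalize(df1['information'].tolist())
# DataFrame.append was removed in pandas 2.0, so build the frames and concatenate once
frames = [pd.DataFrame(x) for x in data['DriversList.InstalledDrivers']]
df2 = pd.concat(frames, ignore_index=True)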

Each record in the information column should be associated with the id of its row in the original DataFrame (df1).

For example, the information column of the first row contains 2 InstalledDrivers records, so the final output should have 2 rows with id 00100.

Expected output:

id      Date        DriverName  DisplayName   Version
00100   06-21-2006  FaxMachine  Fax           10
00100   06-21-2006  FaxMachine  Fax           10
00200   06-21-2006  FaxMachine  Fax           10
00200   06-21-2006  FaxMachine  Fax           10

I am looking for an approach that works at the DataFrame level only. I have also tried json_normalize but was unable to load this JSON into a DataFrame. Is it possible to do this with json_normalize, or is there a more efficient solution? I am also unable to associate the id with the converted DataFrame.

  • Do you mind sharing the original dataframe in dict form, including the ids, so that a solution can be offered that covers both columns? Commented Apr 12, 2020 at 22:22
  • I have shared the original dataframe (df1) at the start. The data in the information column is the same in both rows. Commented Apr 12, 2020 at 22:43

1 Answer

IIUC, this is a possible approach:

import json
import pandas as pd

# setup
d = """{"DriversList": {
    "ProblematicDrivers": [],
    "InstalledDrivers": [
        {"DriverName": "FaxMachine", "DisplayName": "Fax", "Version": "10", "Date": "06-21-2006"},
        {"DriverName": "FaxMachine", "DisplayName": "Fax", "Version": "10", "Date": "06-21-2006"}
    ]}
}"""
df = pd.DataFrame(data=[d], columns=["information"])

# extract data: one dict per installed driver, across all rows
data = [
    drivers
    for info in df["information"].values
    for drivers in json.loads(info)["DriversList"]["InstalledDrivers"]
]

# create DataFrame
result = pd.DataFrame.from_records(data)

print(result)

Output

   DriverName DisplayName Version        Date
0  FaxMachine         Fax      10  06-21-2006
1  FaxMachine         Fax      10  06-21-2006
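
The snippet above assumes each cell holds a valid JSON string. The preview in the question shows single-quoted keys, which suggests the cells may already be Python dicts, or the string form of one, and in either case json.loads would raise. A minimal sketch of a helper that accepts all three forms follows; the name to_dict is hypothetical and not part of the original answer.

```python
import ast
import json


def to_dict(cell):
    """Return a dict from a dict, a JSON string, or a Python-literal string."""
    if isinstance(cell, dict):
        # Already parsed, nothing to do.
        return cell
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Single-quoted text such as "{'a': 1}" is not valid JSON,
        # but it can be parsed safely as a Python literal.
        return ast.literal_eval(cell)


from_json = to_dict('{"a": [1, 2]}')
from_literal = to_dict("{'a': [1, 2]}")
from_dict = to_dict({"a": [1, 2]})
```

With such a helper, json.loads(info) in the comprehension could be swapped for to_dict(info) if the column turns out not to contain strict JSON.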

Update

You can associate each id with its drivers by doing the following:

df = pd.DataFrame(data=[['00100', d]], columns=["id", "information"])

# extract data: copy the row's id into every driver record
data = [
    {"id": i, **drivers}
    for i, info in df[["id", "information"]].values
    for drivers in json.loads(info)["DriversList"]["InstalledDrivers"]
]

# create DataFrame
result = pd.DataFrame.from_records(data)

print(result)

The above code adds an id entry to each driver record, so every driver row carries the id of the row it came from.
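
Since the question asks whether json_normalize can do this, here is a hedged sketch of that route using its record_path and meta parameters. It assumes the information cells are already dicts (not JSON strings) and that both rows hold the same data, as the asker states in the comments; the sample frame is rebuilt here only so the example runs on its own.

```python
import pandas as pd

# Sample data rebuilt from the question; both rows share the same content.
info = {
    "DriversList": {
        "ProblematicDrivers": [],
        "InstalledDrivers": [
            {"DriverName": "FaxMachine", "DisplayName": "Fax", "Version": "10", "Date": "06-21-2006"},
            {"DriverName": "FaxMachine", "DisplayName": "Fax", "Version": "10", "Date": "06-21-2006"},
        ],
    }
}
df1 = pd.DataFrame({"id": ["00100", "00200"], "information": [info, info]})

# Turn each row into a plain dict so that id sits next to the nested data.
records = df1.to_dict(orient="records")

# record_path walks down to the list of drivers; meta repeats id on each driver row.
result = pd.json_normalize(
    records,
    record_path=["information", "DriversList", "InstalledDrivers"],
    meta=["id"],
)

# Reorder the columns to match the expected output in the question.
result = result[["id", "Date", "DriverName", "DisplayName", "Version"]]
print(result)
```

Rows whose InstalledDrivers list is empty contribute no rows to the result. If the cells are JSON strings, they would need to be parsed first, for example with json.loads.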

3 Comments

Hey, thanks. It worked for the column, but I am still curious how the original dataframe's id can be associated with each row.
@user2597209 So in your output, more than one driver can have the same id, right? For example, in your data the two installed drivers will both have id 00100?
Yes, correct. I left the second row's JSON data out of the shared snippet.
