I have an Excel file with a string column in a nested JSON-like format, and I would like to parse/expand it. The dataframe looks like this when I call df.head(2):

   json_str
0 {"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI","content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},{"c_id":"003","p_id":"P03","type":"org","source":"internet"},{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}
1 {"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT","content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"},{"c_id":"034","p_id":"P17","type":"people","source":"news"},{"c_id":"098","p_id":"K54","type":"people","source":"news"}]}

The structure of each row looks like this:

{
   "id":"lni001",
   "pub_date":"20220301",
   "doc_id":"7098727",
   "unique_id":"64WP-UI-POLI",
   "content":[
      {
         "c_id":"002",
         "p_id":"P02",
         "type":"org",
         "source":"internet"  
      },
      {
         "c_id":"003",
         "p_id":"P03",
         "type":"org",
         "source":"internet" 
      },
      {
         "c_id":"005",
         "p_id":"K01",
         "type":"people",
         "source":"news" 
      }
   ]
}

Each value in the column is of type str, as reported by type(df['json_str'].iloc[0]).

All the rows have the same structure/format, but some of them may have more entries in content. The example above has 3 nested objects, but other rows may have 1, 2, 4, 5, or more. The expected result looks like this:

id      pub_date  doc_id   unique_id     c_id  p_id  type      source
lni001  20220301  7098727  64WP-UI-POLI  002   P02   org       internet
lni001  20220301  7098727  64WP-UI-POLI  003   P03   org       internet
lni001  20220301  7098727  64WP-UI-POLI  005   K01   people    news
lni002  20220301  7097889  64WP-UI-CFGT  012   K21   location  internet
lni002  20220301  7097889  64WP-UI-CFGT  034   P17   people    news
lni002  20220301  7097889  64WP-UI-CFGT  098   K54   people    news

I have tried converting the column into dictionaries and extracting the information, but it doesn't work that well. Is there a better way to do it?

  • Is the JSON minified in each row? Also, is it actual JSON or just a string that looks like JSON? Check with print(type(df['YOUR COLUMN'].iloc[0])) Commented Mar 14, 2022 at 21:16
  • It's of string type; it just looks like JSON. Commented Mar 14, 2022 at 21:20
  • I have also edited the question to state the type of the column. Commented Mar 14, 2022 at 21:26

2 Answers

We could apply json.loads to each row and then use json_normalize:

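import json
import pandas as pd

# Parse each JSON string into a Python dict
data = df['json_str'].apply(json.loads).tolist()
# Flatten the "content" records and carry the other top-level keys along as metadata
out = (pd.json_normalize(data, ['content'], list(data[0].keys()-{'content'}))
       [['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']])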

Output:

       id  pub_date   doc_id     unique_id c_id p_id      type    source
0  lni001  20220301  7098727  64WP-UI-POLI  002  P02       org  internet
1  lni001  20220301  7098727  64WP-UI-POLI  003  P03       org  internet
2  lni001  20220301  7098727  64WP-UI-POLI  005  K01    people      news
3  lni002  20220301  7097889  64WP-UI-CFGT  012  K21  location  internet
4  lni002  20220301  7097889  64WP-UI-CFGT  034  P17    people      news
5  lni002  20220301  7097889  64WP-UI-CFGT  098  K54    people      news

Here, data[0].keys()-{'content'} is the set of all keys other than "content" in the first dictionary; since every row has the same structure, it applies to all of them.
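As a self-contained sketch of the same approach: the sample strings below are shortened versions of the question's rows, and meta_cols is a name introduced here for illustration. It spells out the top-level keys explicitly instead of deriving them from a set, on the assumption that every row has the same four keys. The two rows have different numbers of content entries to show that varying lengths are handled.

```python
import json
import pandas as pd

# Two sample rows in the question's format, with 2 and 1 "content" entries.
df = pd.DataFrame({"json_str": [
    '{"id":"lni001","pub_date":"20220301","doc_id":"7098727","unique_id":"64WP-UI-POLI",'
    '"content":[{"c_id":"002","p_id":"P02","type":"org","source":"internet"},'
    '{"c_id":"005","p_id":"K01","type":"people","source":"news"}]}',
    '{"id":"lni002","pub_date":"20220301","doc_id":"7097889","unique_id":"64WP-UI-CFGT",'
    '"content":[{"c_id":"012","p_id":"K21","type":"location","source":"internet"}]}',
]})

# json.loads accepts a single string, not a Series, so apply it element-wise.
records = df["json_str"].apply(json.loads).tolist()

# Top-level keys to repeat on every output row (assumed identical across rows).
meta_cols = ["id", "pub_date", "doc_id", "unique_id"]

# One output row per entry in "content", with the meta columns attached.
out = pd.json_normalize(records, record_path="content", meta=meta_cols)

# json_normalize puts the record columns first; reorder to match the question.
out = out[meta_cols + ["c_id", "p_id", "type", "source"]]
print(out)
```

Listing the keys explicitly keeps the column order predictable, at the cost of having to update the list if the input format changes.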

2 Comments

I tried both approaches, but neither works. data = json.loads(df['json_str']) gives an error: TypeError: the JSON object must be str, bytes or bytearray, not Series
For the line data = pd.Series([data, data]).apply(json.loads).tolist(), should I put my df instead of data inside the square brackets? Something like data = pd.Series([df['json_str'], df['json_str']]).apply(json.loads).tolist()?

Building off of @enke's answer, you could first convert the strings to Python dictionaries with ast.literal_eval, and then use pd.json_normalize:

import ast
import pandas as pd
data = df['YOUR COLUMN'].apply(ast.literal_eval).tolist()  # parse each string into a dict
new_df = pd.json_normalize(data, ['content'], list(data[0].keys()-{'content'}))

If you care about the order of the columns, you can rearrange them:

new_df = new_df[['id', 'pub_date', 'doc_id', 'unique_id', 'c_id', 'p_id', 'type', 'source']]
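One caveat when choosing between the two parsers: ast.literal_eval evaluates Python literals, so it works on the sample rows here (only strings, lists, and dicts) but rejects JSON-only tokens such as null, true, and false. A minimal sketch, using a hypothetical row with a null value that does not appear in the question's data:

```python
import ast
import json

# Hypothetical string with a JSON null; "flag" is not a key in the question's data.
s = '{"id": "lni001", "flag": null}'

# json.loads understands JSON syntax and maps null to None.
parsed = json.loads(s)

# ast.literal_eval only accepts Python literals, so the bare name null is rejected.
try:
    ast.literal_eval(s)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(parsed, literal_eval_ok)
```

If the strings are known to be valid JSON, json.loads is probably the safer default.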
