Converting pandas JSON rows into separate columns

Question

I have a pandas DataFrame in which one of the columns is in JSON format. It contains the list of production companies for each movie title. A sample of the structure is below:

ID | production_companies
---------------
 1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
 2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
 3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
 4 | nan
 5 | nan
 6 | nan
 7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"

As you can see, one movie (row) can have multiple production companies. For each movie I want to create separate columns containing the names of the producers. The columns should be named name_1, name_2, name_3, etc. If there is no second or third producer, the value should be NaN.

I don't have much experience working with JSON formats and I've tried a few methods (iterators with lambda functions) but they are not even close to what I need.

Therefore I hope for your help guys!

EDIT:

The following code ("movies" is the main database):

from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)

gives me the following error:

AttributeError: 'str' object has no attribute 'values'

Please take a step back and start by loading your JSON into a list, then call json_normalize.
– cs95, Commented Jun 20, 2019 at 19:46
This dataframe is simply one column taken from the entire pandas database. I will try json_normalize and give you feedback.
– Roberto, Commented Jun 20, 2019 at 19:48
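
The AttributeError suggests that json_normalize received plain strings where it expected dicts: each cell holds the text of a Python list rather than a parsed list. Because those strings use single quotes, they are Python literals rather than strict JSON, so json.loads would reject them; ast.literal_eval can parse them instead. A minimal sketch of the step cs95 describes, assuming pandas 1.0 or later (where json_normalize is available as pd.json_normalize) and using one sample cell from the question:

```python
import ast

import pandas as pd

# One cell from the production_companies column, still a plain string
raw = "[{'name': 'Walt Disney Pictures', 'id': 2}]"

# Parse the string into a real list of dicts
records = ast.literal_eval(raw)

# json_normalize accepts a list of dicts and returns one row per dict
flat = pd.json_normalize(records)
print(flat)
```

Missing cells (NaN) are floats, not strings, so they would need to be skipped or filled before parsing.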

Jan33 · Accepted Answer · 2019-06-21 01:54:15Z

Adding on to @Andy's answer above to answer OP's question.

This part was by @Andy:

import pandas as pd
import numpy as np
import ast
import itertools

# dummy data
df = pd.DataFrame({
    "ID": [1,2,3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})

# remove the nans
df.dropna(inplace=True)

# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(ast.literal_eval)

My additions to answer OP's requirements:

tmp_lst = []
for idx, item in df.groupby(by='ID'):

    # Crediting this part to @Andy above
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')

    # Transpose dataframe
    tmp_df = tmp_df.T

    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values

    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)

pd.concat(tmp_lst, sort=False)

Result:

                         0               1                          2  ID
name    Paramount Pictures  United Artists  Metro-Goldwyn-Mayer (MGM)   1
name  Walt Disney Pictures             NaN                        NaN   3
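
To get the name_1, name_2, ... layout the question asks for while keeping the movies that have no companies, a more direct variant is possible. This is only a sketch: company_names is a hypothetical helper, not part of pandas, and it assumes every non-missing cell parses with ast.literal_eval.

```python
import ast

import numpy as np
import pandas as pd

# Same dummy data as above
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})


def company_names(cell):
    """Hypothetical helper: list of company names in one cell, [] if missing."""
    if pd.isna(cell):
        return []
    return [company["name"] for company in ast.literal_eval(cell)]


# One list of names per movie; shorter lists are padded with missing values
names = pd.DataFrame(
    df["production_companies"].apply(company_names).tolist(),
    index=df.index,
)
names.columns = [f"name_{i + 1}" for i in range(names.shape[1])]

# Put the movie ID next to the name columns
result = pd.concat([df[["ID"]], names], axis=1)
print(result)
```

Here the movie with ID 2 stays in the result with missing values in every name column, instead of being dropped by dropna.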

Ian · Accepted Answer · 2019-06-20 20:57:21Z

This should do it

import pandas as pd
import numpy as np
import ast
import itertools

# dummy data
df = pd.DataFrame({
    "ID": [1,2,3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})

# remove the nans
df.dropna(inplace=True)

# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(ast.literal_eval)

# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))

Which yields:

    id      name
0   4       Paramount Pictures
1   60      United Artists
2   8411    Metro-Goldwyn-Mayer (MGM)
3   2       Walt Disney Pictures
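
Note that the flattened frame above no longer records which movie each company belongs to. One possible way to keep the movie ID is sketched here, assuming pandas 0.25 or later (where DataFrame.explode is available); the column names company_id and company_name are made up for illustration:

```python
import ast

import pandas as pd

# Dummy data with the NaN row already removed
df = pd.DataFrame({
    "ID": [1, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}]",
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})
df["production_companies"] = df["production_companies"].apply(ast.literal_eval)

# One row per (movie, company) pair; the movie ID is repeated as needed
long_df = df.explode("production_companies").reset_index(drop=True)

# Pull the fields out of each company dict into ordinary columns
long_df["company_id"] = [c["id"] for c in long_df["production_companies"]]
long_df["company_name"] = [c["name"] for c in long_df["production_companies"]]
long_df = long_df.drop(columns="production_companies")
print(long_df)
```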

edited Jun 20, 2019 at 20:57

answered Jun 20, 2019 at 20:48


Roberto Over a year ago

Thanks Andy! However, your code maps ids to names. What I want is to create columns containing the names of the companies. At most this should probably generate 5-6 columns. The ID column in my example is a movie ID.
