I have a pandas DataFrame where one of the columns holds JSON-like strings. Each string contains the list of production companies for a given movie title. A sample of the structure is below:

ID | production_companies
---------------
 1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
 2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
 3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
 4 | nan
 5 | nan
 6 | nan
 7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"

As you can see, one movie (row) can have multiple production companies. For each movie, I want to create separate columns containing the names of the producers. The columns should look like name_1, name_2, name_3, etc. If there is no second or third producer, the value should be NaN.

I don't have much experience working with JSON formats. I've tried a few methods (iterators with lambda functions), but they are not even close to what I need.

Therefore I hope for your help guys!

EDIT:

The following code ("movies" is the main DataFrame):

from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)

gives me the following error:

AttributeError: 'str' object has no attribute 'values'
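A likely cause of this error, sketched here under the assumption that every non-missing cell is a plain Python string like the ones in the sample: the strings use single quotes, so they are Python literals rather than valid JSON, and json_normalize receives unparsed str objects instead of dicts. The variable names below are illustrative only.

```python
import ast
import json

# One cell copied from the sample data; it is a str, not a list of dicts.
cell = "[{'name': 'Walt Disney Pictures', 'id': 2}]"

# Strict JSON requires double quotes, so json.loads rejects this string.
try:
    json.loads(cell)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval safely parses Python literals (lists, dicts, strings, numbers).
companies = ast.literal_eval(cell)
names = [company["name"] for company in companies]

print(is_valid_json)  # False
print(names)          # ['Walt Disney Pictures']
```

Once a cell has been parsed this way, it is an ordinary list of dicts that can be processed further.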
  • How did you end up with this dataframe? Commented Jun 20, 2019 at 19:46
  • Please take a step back and start by loading your JSON into a list, then call json_normalize. Commented Jun 20, 2019 at 19:46
  • This dataframe is simply one column taken from the full pandas DataFrame. I will try json_normalize and report back. Commented Jun 20, 2019 at 19:48

2 Answers

Adding on to @Andy's answer to address the OP's question.

This part was by @Andy:

import pandas as pd
import numpy as np
import ast
import itertools

# dummy data
df = pd.DataFrame({
    "ID": [1,2,3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})

# remove the nans
df.dropna(inplace=True)

# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(ast.literal_eval)

My additions to answer OP's requirements:

tmp_lst = []
for idx, item in df.groupby(by='ID'):

    # Crediting this part to @Andy above
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')

    # Transpose dataframe
    tmp_df = tmp_df.T

    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values

    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)

pd.concat(tmp_lst, sort=False)  

Result:

                         0               1                          2  ID
name    Paramount Pictures  United Artists  Metro-Goldwyn-Mayer (MGM)   1
name  Walt Disney Pictures             NaN                        NaN   3
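The result above keeps numeric column labels and drops the movie with no companies. As a possible alternative, here is a minimal sketch that builds the name_1, name_2, ... columns the question asks for and keeps rows whose cell is NaN. The helper company_names and the variable names are hypothetical, and the sketch assumes every non-missing cell is a string that ast.literal_eval can parse.

```python
import ast

import numpy as np
import pandas as pd

# Dummy data in the same shape as the question (ID 2 has no companies).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})


def company_names(cell):
    """Return the company names in one cell, or an empty list if it is missing."""
    if not isinstance(cell, str):
        return []
    return [company["name"] for company in ast.literal_eval(cell)]


# One list of names per movie; shorter lists are padded with missing values
# when the lists are turned into a DataFrame.
names = df["production_companies"].apply(company_names)
wide = pd.DataFrame(names.tolist(), index=df.index)
wide.columns = [f"name_{i + 1}" for i in range(wide.shape[1])]

# Attach the name columns back to the movie IDs.
result = df[["ID"]].join(wide)
print(result)
```

The number of name_N columns follows the longest list in the data, so nothing needs to be hard-coded for 5 or 6 companies.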

This should do it:

import pandas as pd
import numpy as np
import ast
import itertools

# dummy data
df = pd.DataFrame({
    "ID": [1,2,3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})

# remove the nans
df.dropna(inplace=True)

# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(ast.literal_eval)

# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))

Which yields:

    id      name
0   4       Paramount Pictures
1   60      United Artists
2   8411    Metro-Goldwyn-Mayer (MGM)
3   2       Walt Disney Pictures
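The flattened table above no longer shows which movie each company belongs to. A minimal sketch of one way to keep the movie ID next to each company name, assuming a pandas version that provides DataFrame.explode (0.25 or later); the variable names are illustrative only:

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})

# Drop movies without companies, then parse each string into a list of dicts.
parsed = df.dropna(subset=["production_companies"]).copy()
parsed["production_companies"] = parsed["production_companies"].apply(ast.literal_eval)

# explode() emits one row per list element and repeats the movie ID for each.
long_df = parsed.explode("production_companies").reset_index(drop=True)
long_df["name"] = long_df["production_companies"].map(lambda company: company["name"])
long_df = long_df[["ID", "name"]]
print(long_df)
```

This long format has one row per (movie, company) pair and can be reshaped into one column per company afterwards.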

1 Comment

Thanks Andy! However, your code maps ids to names. What I want is to create columns containing the names of the companies. At most, this should generate 5-6 columns. The ID column in my example is a movie ID.
