
I am reading data from this source: https://www.kaggle.com/tmdb/tmdb-movie-metadata using the command shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

With this approach, some of the columns contain data in the form of an array of JSON objects, e.g. production_countries, keywords, etc.

How can I convert these columns into multiple columns?

I am trying to do this as shown below:

pd.io.json.json_normalize(tmdbDataSet.production_companies.apply(json.loads))

But I am getting this error:

AttributeError: 'list' object has no attribute 'values'

Edit: Thanks, Yang, for the help. However, I observed that production_companies has a maximum array length of 26, and, as you stated, it is not an effective approach to create that many columns. I used the code below to find the length:

import json

# Find the length of the longest production_companies array
length = 0
for index, row in tmdbDataSet.iterrows():
    company = json.loads(row['production_companies'])
    if len(company) > length:
        length = len(company)
print(length)
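As an aside, the same maximum can be computed without an explicit loop. Here is a sketch using a made-up miniature of the column (the real code would read tmdb_5000_movies.csv instead):

```python
import json
import pandas as pd

# Toy stand-in for the real CSV: each cell is a JSON-encoded list of objects
# (the column name matches the TMDB dataset; the rows are made up).
tmdbDataSet = pd.DataFrame({
    'production_companies': [
        '[{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}]',
        '[{"name": "Ingenious Film Partners", "id": 289}]',
        '[]',
    ]
})

# Parse every cell once, then take the longest list length.
length = tmdbDataSet['production_companies'].apply(json.loads).apply(len).max()
print(length)  # 2 for this toy data; 26 on the full dataset
```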

It seems I need to look for some other alternative. However, I observed that the spoken_languages column has a maximum array length of 9. I created 9 separate columns using the code shown below:

for i in range(9):
    tmdbDataSet['spoken_languages_' + str(i)] = ""

Then, when I run the code below:

columns = ['spoken_languages_' + str(i) for i in range(9)]
tmdbDataSet[columns] = pd.DataFrame(tmdbDataSet.spoken_languages.values.tolist(), index= tmdbDataSet.index)
print(tmdbDataSet.head())

I get an error:

Columns must be same length as key

This is understandable, as the arrays do not have a fixed length. Please let me know a possible solution.

1 Answer


The problem is that you are calling json.loads on a list object. When you type tmdbDataSet.production_companies, it returns a Series object from the dataframe, on which you can call the apply() method (documentation here).

However, each element in the Series is still a list object - as you astutely observed when you noticed that some of the columns have arrays of JSON objects. Therefore, applying the function json.loads to the series will not work, as json.loads expects a JSON string but is instead receiving a list object.

This is unfortunate data packaging by the data source, but it could be because the length of the array varies depending on the row/movie. Perhaps the best/easiest approach to accessing this data is to write a loop (i.e. for company in row['production_companies']:) instead of trying to decompress that column into multiple dataframe columns. If you want to decompress that column without losing any data, you would first have to iterate through the column and find the length of the longest list, so you know how many new columns to create. You will also run into the problem that a large chunk of the entries in your dataframe will be blank placeholders, since the longest array length will probably only appear once or twice.
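A sketch of that loop-based approach, using a made-up miniature dataframe in place of the real CSV (note the column is stored as JSON strings, so each row is parsed with json.loads first):

```python
import json
import pandas as pd

# Toy stand-in for the real dataset; the actual column holds JSON strings.
tmdbDataSet = pd.DataFrame({
    'title': ['Spider-Man 3', 'John Carter'],
    'production_companies': [
        '[{"name": "Columbia Pictures", "id": 5}, {"name": "Laura Ziskin Productions", "id": 326}]',
        '[{"name": "Walt Disney Pictures", "id": 2}]',
    ]
})

# Access the companies per movie with a loop instead of widening the frame.
movie_to_companies = {}
for index, row in tmdbDataSet.iterrows():
    companies = []
    for company in json.loads(row['production_companies']):
        companies.append(company['name'])
    movie_to_companies[row['title']] = companies

print(movie_to_companies['Spider-Man 3'])
```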

EDIT: If you must melt the dataframe, however, here is a suggested process for doing so (sorry, I do not have the personal time needed to provide any more detail than this):

1) Iterate through the production_companies column and find the length k of the longest array.

2) Create k more (empty) columns for storing JSON objects in the dataframe.

3) Iterate through the production_companies column again; for each array, and for each JSON item in that array, pull out the JSON object and place it into the next available JSON column.

Note that you will have quite a few NaNs in your dataframe now, as many movies will have fewer than the highest number of production companies.
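The three steps above could be sketched as follows. The column names like production_companies_0 are illustrative, and a toy dataframe stands in for the real CSV:

```python
import json
import pandas as pd

# Toy stand-in for the real production_companies column (JSON strings).
tmdbDataSet = pd.DataFrame({
    'production_companies': [
        '[{"name": "A", "id": 1}, {"name": "B", "id": 2}]',
        '[{"name": "C", "id": 3}]',
    ]
})

parsed = tmdbDataSet['production_companies'].apply(json.loads)

# 1) Find the length k of the longest array.
k = parsed.apply(len).max()

# 2) Build k new columns, padding shorter lists with None so every row fits.
padded = parsed.apply(lambda lst: lst + [None] * (k - len(lst)))
new_cols = ['production_companies_' + str(i) for i in range(k)]

# 3) Place each JSON object into the next available column.
expanded = pd.DataFrame(padded.tolist(), index=tmdbDataSet.index, columns=new_cols)
tmdbDataSet = tmdbDataSet.join(expanded)

print(tmdbDataSet[new_cols])
```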


5 Comments

How, using the for loop for company in tmdbDataSet['production_companies']:, can I associate production companies with a movie? I need the list of companies associated with each movie in separate columns.
What are you planning on doing with the data after associating the list of companies to the movie?
^ this will give me a better idea on how to answer your follow up question
Basically, the data is not clean as of now and is difficult to process. What I am planning is to have the production companies in separate columns, and then I can melt my dataset to get clean data. I hope this gives you an idea of my final objective.
Ah, I see. The task of creating a dataframe to be melted from this dataset is going to be very challenging and inefficient, since there is a variable number of production companies per movie (e.g. Spider-Man 3 has 4, while John Carter has only 1). I would recommend bypassing the 'melt' step and instead accessing each production company when needed using the for loop above. However, if you are adamant on melting the dataframe, see my EDIT to the answer.
