
I am reading data from this source: https://www.kaggle.com/tmdb/tmdb-movie-metadata using the command shown below:

import pandas as pd

tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')

With this approach, some of the columns contain data in the form of an array of JSON objects, e.g. production_countries, keywords, etc.

How can I convert these columns into multiple columns?

I am trying to do this as shown below:

pd.io.json.json_normalize(tmdbDataSet.production_companies.apply(json.loads))

But I am getting this error:

AttributeError: 'list' object has no attribute 'values'

Edit: Thanks, Yang, for the help. However, I observed that production_companies has a maximum array length of 26, and, as you stated, it is not an effective approach to create that many columns. I used the code below to find the length:

import json

# Find the length of the longest production_companies array
length = 0
for index, row in tmdbDataSet.iterrows():
    company = json.loads(row['production_companies'])
    if len(company) > length:
        length = len(company)
print(length)
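As an aside, the same maximum can be computed without an explicit loop. Here is a sketch using a made-up miniature of the column (the real code would read tmdb_5000_movies.csv instead):

```python
import json
import pandas as pd

# Toy stand-in for the real CSV: each cell is a JSON-encoded list of objects
# (the column name matches the TMDB dataset; the rows are made up).
tmdbDataSet = pd.DataFrame({
    'production_companies': [
        '[{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}]',
        '[{"name": "Ingenious Film Partners", "id": 289}]',
        '[]',
    ]
})

# Parse every cell once, then take the longest list length.
length = tmdbDataSet['production_companies'].apply(json.loads).apply(len).max()
print(length)  # 2 for this toy data; 26 on the full dataset
```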

It seems I need to look for some other alternative. However, I observed that the spoken_languages column has a maximum array length of 9. I created 9 separate columns using the code shown below:

for i in range(9):
    tmdbDataSet['spoken_languages_' + str(i)] = ""

Then, when I run the code below:

columns = ['spoken_languages_' + str(i) for i in range(9)]
tmdbDataSet[columns] = pd.DataFrame(tmdbDataSet.spoken_languages.values.tolist(), index= tmdbDataSet.index)
print(tmdbDataSet.head())

I get an error:

Columns must be same length as key

This is understandable, as the arrays do not have a fixed length. Please let me know a possible solution.

1 Answer


The problem is that you are calling json.loads on a list object. When you type tmdbDataSet.production_companies, it returns a Series object from the dataframe, on which you can call the apply() method (documentation here).

However, each element in the Series is still a list object - as you astutely observed when you noticed that some of the columns have arrays of JSON objects. Therefore, applying the function json.loads to the series will not work, as json.loads expects a JSON string but is instead receiving a list object.

This is unfortunate data packaging by the data source, but it could be because the length of the array varies depending on the row/movie. Perhaps the best/easiest approach to accessing this data is to write a loop (i.e. for company in row['production_companies']:) instead of trying to decompress that column into multiple dataframe columns. If you want to decompress that column without losing any data, you would first have to iterate through the column and find the length of the longest list, so you know how many new columns to create. You will also run into the problem that a large chunk of the entries in your dataframe will be blank placeholders, since the longest array length will probably only appear once or twice.
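A sketch of that loop-based approach, using a made-up miniature dataframe in place of the real CSV (note the column is stored as JSON strings, so each row is parsed with json.loads first):

```python
import json
import pandas as pd

# Toy stand-in for the real dataset; the actual column holds JSON strings.
tmdbDataSet = pd.DataFrame({
    'title': ['Spider-Man 3', 'John Carter'],
    'production_companies': [
        '[{"name": "Columbia Pictures", "id": 5}, {"name": "Laura Ziskin Productions", "id": 326}]',
        '[{"name": "Walt Disney Pictures", "id": 2}]',
    ]
})

# Access the companies per movie with a loop instead of widening the frame.
movie_to_companies = {}
for index, row in tmdbDataSet.iterrows():
    companies = []
    for company in json.loads(row['production_companies']):
        companies.append(company['name'])
    movie_to_companies[row['title']] = companies

print(movie_to_companies['Spider-Man 3'])
```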

EDIT: If you must melt the dataframe, however, here is a suggested process for doing so (sorry, I do not have the personal time needed to provide any more detail than this):

1) Iterate through the production_companies column and find the length k of the longest array.

2) Create k more (empty) columns for storing JSON objects in the dataframe.

3) Iterate through the production_companies column again; for each array, and for each JSON item in that array, pull out the JSON object and place it into the next available JSON column.

Note that you will have quite a few NaNs in your dataframe now, as many movies will have fewer than the highest number of production companies.
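The three steps above could be sketched as follows. The column names like production_companies_0 are illustrative, and a toy dataframe stands in for the real CSV:

```python
import json
import pandas as pd

# Toy stand-in for the real production_companies column (JSON strings).
tmdbDataSet = pd.DataFrame({
    'production_companies': [
        '[{"name": "A", "id": 1}, {"name": "B", "id": 2}]',
        '[{"name": "C", "id": 3}]',
    ]
})

parsed = tmdbDataSet['production_companies'].apply(json.loads)

# 1) Find the length k of the longest array.
k = parsed.apply(len).max()

# 2) Build k new columns, padding shorter lists with None so every row fits.
padded = parsed.apply(lambda lst: lst + [None] * (k - len(lst)))
new_cols = ['production_companies_' + str(i) for i in range(k)]

# 3) Place each JSON object into the next available column.
expanded = pd.DataFrame(padded.tolist(), index=tmdbDataSet.index, columns=new_cols)
tmdbDataSet = tmdbDataSet.join(expanded)

print(tmdbDataSet[new_cols])
```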


5 Comments

How, using the for loop for company in tmdbDataSet['production_companies']:, can I associate production companies with a movie? I need the list of companies associated with each movie in separate columns.
What are you planning on doing with the data after associating the list of companies to the movie?
^ this will give me a better idea on how to answer your follow up question
Basically, the data is not clean as of now and is difficult to process. What I am planning is to have the production companies in separate columns, and then I can melt my dataset to get clean data. I hope this gives you an idea of my final objective.
Ah, I see. The task of creating a dataframe to be melted from this dataset is going to be very challenging and inefficient, since there is a variable number of production companies per movie (e.g. Spider-Man 3 has 4, while John Carter has only 1). I would recommend bypassing the 'melt' step and instead accessing each production company when needed using the for loop above. However, if you are adamant on melting the dataframe, see my EDIT to the answer.
