I am reading data from the source https://www.kaggle.com/tmdb/tmdb-movie-metadata using the command shown below:
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
Using this approach, some of the columns contain data in the form of an array of JSON objects, e.g. production_countries, keywords, etc.
How can I convert these columns into multiple columns?
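For context, each cell in these columns is a JSON string; inspecting a single value shows the structure (a minimal sketch, assuming the CSV has been loaded as above, with production_countries used just as an example):

import json

# Look at the raw string and the parsed Python object for one row.
raw_value = tmdbDataSet.production_countries.iloc[0]
print(raw_value)
print(json.loads(raw_value))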
I am trying to do this as shown below:
pd.io.json.json_normalize(tmdbDataSet.production_companies.apply(json.loads))
But I am getting the error:
AttributeError: 'list' object has no attribute 'values'
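My understanding is that json.loads turns each cell into a Python list, while json_normalize expects a dict or a list of dicts, which is probably what causes the error. I think flattening the per-row lists first might work, although it produces one row per (movie, company) pair rather than new columns (a minimal sketch, assuming every cell is a JSON string encoding a list of objects):

import json
import pandas as pd

# Parse each cell into a Python list of dicts.
parsed = tmdbDataSet.production_companies.apply(json.loads)

# Flatten all the per-row lists into a single list of dicts, then normalize.
records = [company for row in parsed for company in row]
companies = pd.io.json.json_normalize(records)  # pd.json_normalize in newer pandas
print(companies.head())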
Edit: Thanks Yang for the help. However, I observed that production_companies has a maximum array length of 26, and, as you stated, creating that many columns is not an effective approach. I used the code below to find the maximum length.
length = 0
for index, row in tmdbDataSet.iterrows():
    company = json.loads(row['production_companies'])
    if len(company) > length:
        length = len(company)
print(length)
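A shorter way to compute the same maximum, which I believe is equivalent (assuming every cell is a valid JSON string), would be:

max_len = tmdbDataSet.production_companies.apply(lambda s: len(json.loads(s))).max()
print(max_len)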
It seems I need to look for another alternative. However, I observed that the spoken_languages column has at most 9 values per row, so I created 9 separate columns using the code shown below:
for i in range(9):
    tmdbDataSet['spoken_languages_' + str(i)] = ""
Then, when I run the code below:
columns = ['spoken_languages_' + str(i) for i in range(9)]
tmdbDataSet[columns] = pd.DataFrame(tmdbDataSet.spoken_languages.values.tolist(), index= tmdbDataSet.index)
print(tmdbDataSet.head())
I get the error:
Columns must be same length as key
This is understandable, as the arrays do not have a fixed length. Could you please suggest a possible solution?
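My current idea is to skip pre-creating the empty spoken_languages_* columns, parse the column first, and let the DataFrame constructor pad shorter rows with NaN, roughly as sketched below (assuming each spoken_languages cell is a JSON string encoding a list), but I am not sure whether this is the right way:

import json
import pandas as pd

# Parse each cell into a Python list of dicts.
parsed = tmdbDataSet.spoken_languages.apply(json.loads)

# Building a DataFrame from lists of unequal length pads the shorter rows
# with NaN, so a fixed array length is no longer required.
langs = pd.DataFrame(parsed.tolist(), index=tmdbDataSet.index)
langs.columns = ['spoken_languages_' + str(i) for i in range(langs.shape[1])]

# Attach the new columns (each cell holds a language dict or NaN).
tmdbDataSet = tmdbDataSet.join(langs)
print(tmdbDataSet.head())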