My CSV file is at this link:
https://drive.google.com/file/d/1Pac9-YLAtc7iaN0qEuiBOpYYf9ZPDDaL/view?usp=sharing
I want to remove duplicates from the CSV by comparing the length of the genres list for each artist ID. If an artist has two records in the CSV (e.g., Ed Sheeran's ID 6eUKZXaKkcviH0Ku9w2n3V appears twice: one record has 1 genre while row #5 has 5 genres), I want to keep the row with the longest genres list.
I'm using this script for now:
import pandas
import ast

df = pandas.read_csv('39K.csv', encoding='latin-1')
# Parse the stringified genres list and store its length
df['lst_len'] = df['genres'].map(lambda x: len(ast.literal_eval(str(x))))
print(df['lst_len'][0])
# Sort so the row with the longest genres list comes first for each ID
df = df.sort_values('lst_len', ascending=False)
# Drop duplicates, preserving the first (longest) list by ID
df = df.drop_duplicates(subset='ID')
# Remove the extra column that we introduced, then write to file
df = df.drop('lst_len', axis=1)
df.to_csv('clean_39K.csv', index=False)
This script works on a 500-record file (maybe I'm imagining that the number of records matters), but when I run it on my largest file, 39K.csv, I get this error:
Traceback (most recent call last):
  File "...", line 5, in <module>
    df['lst_len'] = df['genres'].map(lambda x: len(list(x)))
TypeError: 'float' object is not iterable
Please point out where I am going wrong. Thanks!
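My guess is that the big file has rows where the genres cell is empty, which pandas loads as a float NaN, so `len(...)` blows up on a float. If that's the cause, something like this sketch might work (the `genres_len` helper and the toy data are mine; the column names are taken from my script above):

```python
import ast
import math

import pandas as pd

def genres_len(x):
    """Length of a stringified genres list; treat empty cells (NaN) as length 0."""
    if isinstance(x, float) and math.isnan(x):
        return 0
    return len(ast.literal_eval(str(x)))

# Toy frame standing in for 39K.csv: a duplicated artist ID plus an empty genres cell
df = pd.DataFrame({
    'ID': ['6eUKZXaKkcviH0Ku9w2n3V', '6eUKZXaKkcviH0Ku9w2n3V', 'xyz'],
    'genres': ["['pop']", "['pop', 'uk pop', 'singer-songwriter']", float('nan')],
})
df['lst_len'] = df['genres'].map(genres_len)
# Keep only the row with the longest genres list for each ID
df = (df.sort_values('lst_len', ascending=False)
        .drop_duplicates(subset='ID')
        .drop('lst_len', axis=1))
```

After this, each ID appears once and the row kept for the duplicated ID is the one whose genres list was longest, while the NaN row survives as that artist's only record.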