I'm doing an exercise for a basic Python and Data Analysis course, but I'm having trouble with pandas' drop_duplicates function. In my working directory I have a CSV file with this structure:
name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
eclipse,.png,64243
pippo,.odt,8299
song1,.mp3,1087849
song2,.mp3,764176
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
This is the part of the program where I move files to folders based on their extension, create or update a recap file with the file data and, finally, try to remove any duplicate rows from the CSV:
def move_files_and_update_recap(files, files_dir_path):
    with open('recap.csv', 'a', newline='') as recap:
        writer = csv.writer(recap)
        if("recap.csv" not in work_dir_elements):
            writer.writerow(['name', 'type', 'size(B)'])
        for file in sorted(files):
            # original file path
            file_path = os.path.join(files_dir_path, file)
            # file name
            file_name = os.path.splitext(file)[0]
            # file extension
            file_extension = os.path.splitext(file)[1]
            # file size
            file_size = os.path.getsize(file_path)
            # file type
            file_type = ""
            for key, value in file_types.items():
                if(file.endswith(tuple(value))):  # if the file has a recognizable extension findable in "file_types"
                    file_type = key
            # if file already exists in the specific folder, print an error
            if(file in os.listdir(os.path.join(files_dir_path, file_type))):
                print("Operation failed: {} already exists in {} folder".format(file, file_type))
            else:
                # moving file to a specific directory based on its extension
                shutil.move(os.path.join(files_dir_path, file), os.path.join(files_dir_path, file_type, file))
                # print file info
                print("{} type:{} size:{}".format(file_name, file_extension, file_size))
                file_data = [file_name, file_extension, str(file_size)]  # data info for csv file
                writer.writerow(file_data)
    df = pd.read_csv('recap.csv')
    df.drop_duplicates(inplace=True)
I also tried different settings of the function:
df.drop_duplicates(subset=None, keep=False, inplace=True)
or:
df.drop_duplicates(subset=None, keep="first", inplace=True)
If I print df the result is an indexed dataframe:
        name   type  size(B)
0         bw   .png    94926
1       ciao   .txt       12
2   daffodil   .jpg    24657
3    eclipse   .png    64243
4      pippo   .odt     8299
5      song1   .mp3  1087849
6      song2   .mp3   764176
7      trump  .jpeg    10195
8         bw   .png    94926
9   daffodil   .jpg    24657
10   eclipse   .png    64243
11     trump  .jpeg    10195
12        bw   .png    94926
13  daffodil   .jpg    24657
14   eclipse   .png    64243
15     trump  .jpeg    10195
If I print the result of drop_duplicates, the return value is None. Any suggestions on how to fix it?



You are using the inplace=True option, meaning that the function will not return any value (hence the None); instead, it will update your original df variable.
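
A minimal sketch of the difference, assuming a small in-memory sample in place of recap.csv (the csv_text string and the variable names are made up for illustration and are not from the original program). It also shows that neither form changes the file on disk: if the goal is a deduplicated CSV, the DataFrame probably has to be written back, for example with to_csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for recap.csv, using a few rows from the question.
csv_text = """name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
bw,.png,94926
daffodil,.jpg,24657
"""

# inplace=True: df itself is modified and the call returns None.
df = pd.read_csv(io.StringIO(csv_text))
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 3

# Default (inplace=False): df2 is left untouched and a new DataFrame is returned.
df2 = pd.read_csv(io.StringIO(csv_text))
deduped = df2.drop_duplicates()
print(len(df2), len(deduped))  # 5 3

# Dropping duplicates only affects the DataFrame in memory. To persist the
# result, write it back; a buffer is used here, but in the real program this
# would presumably be deduped.to_csv('recap.csv', index=False).
out = io.StringIO()
deduped.to_csv(out, index=False)
print(out.getvalue())
```

So either keep inplace=True and print df afterwards, or drop inplace and assign the returned DataFrame to a variable; printing the return value of the inplace call will always show None.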