Pandas drop_duplicates() function does not work on my csv file

Question

I'm doing an exercise for a basic Python and data analysis course, but I'm having trouble with pandas' drop_duplicates function. My working directory contains a csv file with this structure:

name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
eclipse,.png,64243
pippo,.odt,8299
song1,.mp3,1087849
song2,.mp3,764176
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195

This is the part of the program where I move files to their folders based on their extension, create or update a recap file with the file data and, finally, try to remove any duplicate rows from the csv:

import csv
import os
import shutil

import pandas as pd

# file_types and work_dir_elements are defined elsewhere in the program (not shown)


def move_files_and_update_recap(files, files_dir_path):
    with open('recap.csv', 'a', newline='') as recap:
        writer = csv.writer(recap)
        # write the header only when the recap file is new
        if "recap.csv" not in work_dir_elements:
            writer.writerow(['name', 'type', 'size(B)'])

        for file in sorted(files):
            # original file path
            file_path = os.path.join(files_dir_path, file)
            # file name and file extension
            file_name, file_extension = os.path.splitext(file)
            # file size
            file_size = os.path.getsize(file_path)
            # file type
            file_type = ""

            for key, value in file_types.items():
                # if the file has a recognizable extension listed in "file_types"
                if file.endswith(tuple(value)):
                    file_type = key
                    # if the file already exists in the specific folder, print an error
                    if file in os.listdir(os.path.join(files_dir_path, file_type)):
                        print("Operation failed: {} already exists in {} folder".format(file, file_type))
                    else:
                        # move the file to a specific directory based on its extension
                        shutil.move(file_path, os.path.join(files_dir_path, file_type, file))
                        # print file info
                        print("{} type:{} size:{}".format(file_name, file_extension, file_size))

                        file_data = [file_name, file_extension, str(file_size)]  # data info for csv file
                        writer.writerow(file_data)

    df = pd.read_csv('recap.csv')
    df.drop_duplicates(inplace=True)

I also tried different settings of the function:

df.drop_duplicates(subset=None, keep=False, inplace=True)

or:

df.drop_duplicates(subset=None, keep="first", inplace=True)

If I print df the result is an indexed dataframe:

        name   type  size(B)
0         bw   .png    94926
1       ciao   .txt       12
2   daffodil   .jpg    24657
3    eclipse   .png    64243
4      pippo   .odt     8299
5      song1   .mp3  1087849
6      song2   .mp3   764176
7      trump  .jpeg    10195
8         bw   .png    94926
9   daffodil   .jpg    24657
10   eclipse   .png    64243
11     trump  .jpeg    10195
12        bw   .png    94926
13  daffodil   .jpg    24657
14   eclipse   .png    64243
15     trump  .jpeg    10195

If I print the drop_duplicates result the return value is None. Any suggestions on how to fix it?

"If I print the drop_duplicates result the return value is None." You're using the inplace=True option, which means the function does not return a value; it updates your original df variable instead.
– aaossa, Commented Feb 18, 2022 at 21:58
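
To illustrate the comment above, here is a minimal sketch of the two calling styles. The three-row DataFrame is made-up sample data in the same shape as the recap file, not the asker's actual file.

```python
import pandas as pd

# Made-up sample data with one duplicated row
data = {
    "name": ["bw", "ciao", "bw"],
    "type": [".png", ".txt", ".png"],
    "size(B)": [94926, 12, 94926],
}

# Style 1: inplace=True modifies df itself and returns None
df = pd.DataFrame(data)
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2

# Style 2: without inplace, the original is untouched and a new DataFrame is returned
df2 = pd.DataFrame(data)
deduped = df2.drop_duplicates()
print(len(df2))      # 3
print(len(deduped))  # 2
```

In both styles only the in-memory DataFrame changes; neither call touches a file on disk.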

Nomiluks · Accepted Answer · 2022-02-18 22:08:43Z

I think something else must be going wrong. I tried to reproduce the whole scenario you described, and drop_duplicates works in my case.

Let me share some details.

Code to create a dataframe:

import re
import pandas as pd

lines = '''name   type  size(B)
0         bw   .png    94926
1       ciao   .txt       12
2   daffodil   .jpg    24657
3    eclipse   .png    64243
4      pippo   .odt     8299
5      song1   .mp3  1087849
6      song2   .mp3   764176
7      trump  .jpeg    10195
8         bw   .png    94926
9   daffodil   .jpg    24657
10   eclipse   .png    64243
11     trump  .jpeg    10195
12        bw   .png    94926
13  daffodil   .jpg    24657
14   eclipse   .png    64243
15     trump  .jpeg    10195'''.splitlines()

# the first line holds the column names
columns = lines[0].split()
# strip the leading index number from every data line
lines = [re.sub(r'^\d+\s+', '', line).strip() for line in lines[1:]]
# pair each column name with the matching value on the line
lines = [dict(zip(columns, line.split())) for line in lines]
df = pd.DataFrame(lines)

Then I applied the functions to remove duplicates.

Scenario 1:

Scenario 2:
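
The output of the two scenarios is missing from this copy of the answer. As a rough sketch, assuming they correspond to the two calls from the question (keep=False and keep="first"), the following runs both against the csv data from the question, loaded from a string with io.StringIO rather than from the real file.

```python
import io

import pandas as pd

# The csv content from the question, inlined for the example
csv_text = """name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
eclipse,.png,64243
pippo,.odt,8299
song1,.mp3,1087849
song2,.mp3,764176
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (16, 3)

# keep=False drops every row that has a duplicate, including the first occurrence
no_copies = df.drop_duplicates(subset=None, keep=False)
print(no_copies.shape)  # (4, 3)

# keep="first" keeps one copy of each distinct row
first_only = df.drop_duplicates(subset=None, keep="first")
print(first_only.shape)  # (8, 3)
```

With keep=False only ciao, pippo, song1 and song2 remain, because every other file appears three times.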

3 Comments

Aurora Over a year ago

Could it have something to do with the csv file being created via the python csv module? Unfortunately I cannot use the pandas features to create the file. The project specifications strictly require the use of csv for that step.

Nomiluks Over a year ago

I don't think so. If you use inplace=True, the df object is updated automatically. You can check df.shape before and after the call, for example: print(df.shape); df.drop_duplicates(subset=None, keep=False, inplace=True); print(df.shape)

Aurora Over a year ago

In the end I could only fix the problem with this code snippet: df = pd.read_csv('recap.csv', index_col=False); df = df.drop_duplicates(); df.to_csv("recap.csv", mode="w", index=False)
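
A likely reason this snippet helped: drop_duplicates only changes the DataFrame in memory, so recap.csv keeps its duplicate rows until the result is written back with to_csv. The sketch below illustrates this under a few assumptions: the three-row file and the temporary directory are hypothetical stand-ins, and the file is created with the csv module as the exercise requires.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical recap content with one duplicated row
rows = [
    ["name", "type", "size(B)"],
    ["bw", ".png", 94926],
    ["ciao", ".txt", 12],
    ["bw", ".png", 94926],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "recap.csv")

    # Create the file with the csv module
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Dropping duplicates changes the DataFrame, not the file
    df = pd.read_csv(path)
    df.drop_duplicates(inplace=True)
    rows_on_disk_before = len(pd.read_csv(path))
    print(rows_on_disk_before)  # 3

    # Writing the DataFrame back overwrites the file with the cleaned rows
    df.to_csv(path, mode="w", index=False)
    rows_on_disk_after = len(pd.read_csv(path))
    print(rows_on_disk_after)  # 2
```

Passing index=False keeps the DataFrame index out of the file, so the columns stay name, type, size(B).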
