
I'm doing an exercise for an introductory Python and data analysis course, but I'm having trouble with pandas' drop_duplicates function. In my working directory I have a CSV file with this structure:

name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
eclipse,.png,64243
pippo,.odt,8299
song1,.mp3,1087849
song2,.mp3,764176
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195

This is the part of the program where I move files into folders based on their extension, create or update a recap file with the file data and, finally, try to remove any duplicate rows from the CSV:

def move_files_and_update_recap(files, files_dir_path):
    
    with open('recap.csv', 'a', newline='') as recap:
        writer = csv.writer(recap)
        if("recap.csv" not in work_dir_elements):
            
            writer.writerow(['name', 'type', 'size(B)'])
    
            
        for file in sorted(files):
            # original file path
            file_path = os.path.join(files_dir_path, file)
            # file name
            file_name = os.path.splitext(file)[0]
            # file extension
            file_extension = os.path.splitext(file)[1]
            #file size
            file_size = os.path.getsize(file_path)
            #file type
            file_type = ""
        
            for key, value in file_types.items():
                # the file has a recognizable extension listed in "file_types"
                if file.endswith(tuple(value)):
                    file_type = key
                    # if the file already exists in the target folder, print an error
                    if file in os.listdir(os.path.join(files_dir_path, file_type)):
                        print("Operation failed: {} already exists in {} folder".format(file, file_type))
                    else:
                        # move the file to a directory based on its extension
                        shutil.move(os.path.join(files_dir_path, file), os.path.join(files_dir_path, file_type, file))
                        # print file info
                        print("{} type:{} size:{}".format(file_name, file_extension, file_size))

                        file_data = [file_name, file_extension, str(file_size)]  # row for the csv file
                        writer.writerow(file_data)
                            
    df = pd.read_csv('recap.csv')
    df.drop_duplicates(inplace=True)
    

I also tried different settings of the function:

df.drop_duplicates(subset=None, keep=False, inplace=True)

or:

df.drop_duplicates(subset=None, keep="first", inplace=True)

If I print df the result is an indexed dataframe:

        name   type  size(B)
0         bw   .png    94926
1       ciao   .txt       12
2   daffodil   .jpg    24657
3    eclipse   .png    64243
4      pippo   .odt     8299
5      song1   .mp3  1087849
6      song2   .mp3   764176
7      trump  .jpeg    10195
8         bw   .png    94926
9   daffodil   .jpg    24657
10   eclipse   .png    64243
11     trump  .jpeg    10195
12        bw   .png    94926
13  daffodil   .jpg    24657
14   eclipse   .png    64243
15     trump  .jpeg    10195

If I print the drop_duplicates result the return value is None. Some suggestions on how to fix it?

  • "If I print the drop_duplicates result the return value is None. Some suggestions on how to fix it?" You're using the inplace=True option, so the function does not return a value; it updates your original df variable in place. Commented Feb 18, 2022 at 21:58
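To make the point in this comment concrete, here is a minimal sketch on a small made-up frame (hypothetical data, not the asker's recap.csv). It assumes the usual pandas behaviour for drop_duplicates: with inplace=True the call returns None and df itself is changed, while without it a new frame is returned and has to be assigned.

```python
import pandas as pd

# Hypothetical sample data with one repeated row
df = pd.DataFrame({
    "name": ["bw", "ciao", "bw"],
    "type": [".png", ".txt", ".png"],
    "size(B)": [94926, 12, 94926],
})

# With inplace=True the method modifies df and returns None
result = df.drop_duplicates(inplace=True)
print(result)
print(len(df))

# Without inplace, the original frame is untouched and a new one is returned
df2 = pd.DataFrame({"a": [1, 1, 2]})
deduped = df2.drop_duplicates()
print(len(deduped), len(df2))
```

So printing the return value is not a reliable check here; printing df (or df.shape) after the call is.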

1 Answer

I suspect something else in your code is going wrong. I tried to reproduce the scenario you described, and drop_duplicates works as expected in my case.

Let me share some details.

Code to create a dataframe:

import re
import pandas as pd

lines = '''name   type  size(B)
0         bw   .png    94926
1       ciao   .txt       12
2   daffodil   .jpg    24657
3    eclipse   .png    64243
4      pippo   .odt     8299
5      song1   .mp3  1087849
6      song2   .mp3   764176
7      trump  .jpeg    10195
8         bw   .png    94926
9   daffodil   .jpg    24657
10   eclipse   .png    64243
11     trump  .jpeg    10195
12        bw   .png    94926
13  daffodil   .jpg    24657
14   eclipse   .png    64243
15     trump  .jpeg    10195'''.splitlines()

columns = lines[0].split()
# Strip the leading index number from each printed row
rows = [re.sub(r'^\d+\s+', '', line).strip() for line in lines[1:]]
# Build one dict per row; all values stay strings at this point
records = [dict(zip(columns, row.split())) for row in rows]
df = pd.DataFrame(records)


I then applied the functions to remove duplicates.

Scenario 1:

[screenshot of the output, not preserved]

Scenario 2:

[screenshot of the output, not preserved]
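Since the outputs for the two scenarios are not reproduced as text, the following sketch shows what the two calls from the question do on the same data. It assumes the scenarios correspond to keep=False and keep="first"; which one is which is a guess. The CSV content is copied from the question and read from a string rather than from recap.csv.

```python
import io
import pandas as pd

csv_text = """name,type,size(B)
bw,.png,94926
ciao,.txt,12
daffodil,.jpg,24657
eclipse,.png,64243
pippo,.odt,8299
song1,.mp3,1087849
song2,.mp3,764176
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
bw,.png,94926
daffodil,.jpg,24657
eclipse,.png,64243
trump,.jpeg,10195
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep=False drops every row that has a duplicate, including its first occurrence
no_repeats = df.drop_duplicates(subset=None, keep=False)

# keep="first" keeps exactly one copy of each duplicated row
first_only = df.drop_duplicates(subset=None, keep="first")

print(no_repeats["name"].tolist())
print(first_only["name"].tolist())
```

With keep=False only the files that appear once (ciao, pippo, song1, song2) remain, whereas keep="first" leaves the eight distinct rows, which is probably what the recap file should contain.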


3 Comments

Could it have something to do with the CSV file being created via the Python csv module? Unfortunately I cannot use pandas to create the file. The project specifications strictly require the use of csv for that step.
I don't think so. If you use inplace=True, the df object is updated automatically. You can check df.shape before and after the call. For example:
    print(df.shape)
    df.drop_duplicates(subset=None, keep=False, inplace=True)
    print(df.shape)
In the end I could only fix the problem with this code snippet:
    df = pd.read_csv('recap.csv', index_col=False)
    df = df.drop_duplicates()
    df.to_csv("recap.csv", mode="w", index=False)
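A likely reason this last snippet works where the original did not: the original code deduplicated df in memory but never wrote it back, so recap.csv on disk kept its duplicates. The sketch below walks through the full round trip under that assumption. The temporary path and the three sample rows are hypothetical stand-ins for the real recap.csv; the file is still written with the csv module, as the project requires.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical location; the question uses 'recap.csv' in the working directory
recap_path = os.path.join(tempfile.mkdtemp(), "recap.csv")

# Hypothetical rows, with one duplicate
rows = [["bw", ".png", "94926"], ["ciao", ".txt", "12"], ["bw", ".png", "94926"]]

# Write the header and rows with the csv module
with open(recap_path, "a", newline="") as recap:
    writer = csv.writer(recap)
    writer.writerow(["name", "type", "size(B)"])
    writer.writerows(rows)

# Deduplicate in memory, then overwrite the file with the result
df = pd.read_csv(recap_path, index_col=False)
df = df.drop_duplicates()
df.to_csv(recap_path, mode="w", index=False)

# Re-read the file to confirm that the duplicates are gone on disk
on_disk = pd.read_csv(recap_path)
print(on_disk)
```

Passing index=False to to_csv matters here: without it pandas would write the row index as an extra first column, and the file would no longer match the name,type,size(B) layout.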
