Renaming files based on Dataframe content with Python and Pandas

Question

I am trying to read an xlsx file, compare the reference numbers in one column against the files in a folder and, where they match, rename each file to the email address associated with that reference number.

Excel File has fields such as:

 Reference     EmailAddress
   1123        [email protected]
   1233        [email protected]
   1334        [email protected]
   ...         .....

My folder applicantsCVs contains only doc files named after the values in the Reference column (for example, 1233.doc).

How can I compare the contents of the applicantsCVs folder to the Reference field in my Excel file and, where a file matches, rename it to the corresponding email address?

Here is what I've tried so far:

import os
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
references = dfOne['Reference']

emailAddress = dfOne['EmailAddress']

cleanedEmailList = [x for x in emailAddress if str(x) != 'nan']

print(cleanedEmailList)
excelArray = []
filesArray = []

for root, dirs, files in os.walk("applicantCVs"):
    for filename in files:
        print(filename) #Original file name with type 1233.doc
        reworkedFile = os.path.splitext(filename)[0]
        filesArray.append(reworkedFile)

for entry in references:
    excelArray.append(str(entry))

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I compare the reference names to the folder contents and print each one that matches:

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I've tried to rename the files with os.rename(filename, cleanedEmailList), but it didn't work because cleanedEmailList is a list of emails rather than a single file name.

How can I match and rename the files?

Update:

from os.path import dirname
import pandas as pd
from pathlib import Path
import os

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")

emailAddress = dfOne['EmailAddress']
reference = dfOne['Reference'] = dfOne.references.astype(str)

references = dict(dfOne.dropna(subset=[reference, "EmailAddress"]).set_index(reference)["EmailAddress"])
print(references)
files = Path("applicantCVs").glob("*")

for file in files:
    new_name = references.get(file.stem, file.stem)
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

Do you want to match the contents of the Word documents, or just match the Reference in the Excel file against the names of the Word documents? – Rahul Agarwal, Commented Jun 24, 2019 at 7:51
@MaartenFabré I want to rename the files to the email address column inside the CSV – someonewithakeyboardz1, Commented Jun 24, 2019 at 8:17

Maarten Fabré · Accepted Answer · 2019-06-24 08:12:18Z

3

+25

Based on this sample data:

Reference     EmailAddress
   1123        [email protected]
   1233        [email protected]
   nan         jane.smith#example.com
   1334        [email protected]

First you assemble a dict with the set of references as keys and the new names as values:

references = dict(df.dropna(subset=["Reference","EmailAddress"]).set_index("Reference")["EmailAddress"])

{'1123': '[email protected]',
 '1233': '[email protected]',
 '1334': '[email protected]'}

Note that the references are strs here. If they aren't strs in your original data, you can convert them with astype(str).
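As a rough illustration of that conversion, here is a minimal sketch that builds the same kind of dict when the Reference column has been read as float (which happens when the column contains a NaN). The sample frame, the example.com addresses and the helper name build_reference_map are made up for this example and are not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data; the missing Reference forces the column to float.
df = pd.DataFrame({
    "Reference": [1123, 1233, None, 1334],
    "EmailAddress": ["a@example.com", "b@example.com",
                     "c@example.com", "d@example.com"],
})

def build_reference_map(frame):
    """Return {reference-as-str: email} for rows where both fields are present."""
    clean = frame.dropna(subset=["Reference", "EmailAddress"])
    # Drop NaN first, then go float -> int -> str so 1123.0 becomes "1123".
    keys = clean["Reference"].astype(int).astype(str)
    return dict(zip(keys, clean["EmailAddress"]))

references = build_reference_map(df)
print(references)
```

The keys come out as plain digit strings, which is what file.stem returns for a file such as 1123.docx, so the lookup in the renaming loop can match.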

Then you use pathlib.Path to look for all the files in the data directory:

files = Path("../data/renames").glob("*")

[WindowsPath('../data/renames/1123.docx'),
 WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/1233.txt')]

The renaming can be made very simple:

for file in files:
    new_name = references.get(file.stem, file.stem )
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

references.get looks up the new file name and, if it doesn't find one, falls back to the original stem, so unmatched files keep their names.

[WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/[email protected]'),
 WindowsPath('../data/renames/[email protected]')]
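If you want to try the idea without touching real files, a sketch along the following lines runs the rename in a throwaway directory. The function name rename_by_reference, the file names and the example.com addresses are assumptions for the demonstration; it also skips unmatched files outright and refuses to overwrite an existing target, which is a slightly more cautious variant of the loop above rather than the same code.

```python
import tempfile
from pathlib import Path

def rename_by_reference(folder, references):
    """Rename files whose stem is a key in references; return (old, new) names."""
    renamed = []
    # sorted() materialises the listing so renames do not disturb the iteration.
    for file in sorted(Path(folder).iterdir()):
        new_stem = references.get(file.stem)
        if new_stem is None or not file.is_file():
            continue  # no matching reference: leave the file alone
        target = file.with_name(f"{new_stem}{file.suffix}")
        if target.exists():
            continue  # never overwrite an existing file
        file.rename(target)
        renamed.append((file.name, target.name))
    return renamed

# Demonstration in a temporary directory with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.docx", "1156.pptx", "1233.txt"):
        (Path(tmp) / name).touch()
    result = rename_by_reference(
        tmp, {"1123": "a@example.com", "1233": "b@example.com"}
    )
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(result)
print(remaining)
```

Returning the list of renamed pairs makes it easy to see when nothing matched, which is usually a sign that the dict keys and the file stems have different types or formats.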

edited Jun 24, 2019 at 8:12

answered Jun 24, 2019 at 7:52

Maarten Fabré


10 Comments

someonewithakeyboardz1 Over a year ago

It's hard for me to integrate your changes into my script. Sorry for the late response.

someonewithakeyboardz1 Over a year ago

The problem I am having with this code is that whenever I run it, the files are not renamed to the appropriate EmailAddress; they keep their original names.

Maarten Fabré Over a year ago

It worked for the test files I used. You can start diagnosing your problem at the top, and check which part fails. Are the keys in references correct and strs, is files really a generator with all the files, etc..

someonewithakeyboardz1 Over a year ago

I found the problem: new_name consists only of the reference numbers. The strs and files are correct. The problem is that `file.rename(file.with_name(f"{new_name}{file.suffix}"))` passes new_name as the reference number, not the email.

Maarten Fabré Over a year ago

The email address will normally be a string already; it's the Reference you need to convert to a string.


Mig B · Accepted Answer · 2019-06-24 07:58:01Z

0

How about adding the associated email (your new name, I guess?) to a dictionary whose keys are your reference numbers? This could look something like:

cor_dict = {}

for i in excelArray:
    if i in filesArray:
        # take the first email whose Reference (compared as str) equals i
        cor_dict[i] = dfOne.loc[dfOne['Reference'].astype(str) == i, 'EmailAddress'].iloc[0]


for entry in cor_dict.items():
    path = 'path to file...'
    filename = str(entry[0])+'.doc'
    new_filename =  str(entry[1]).replace('@','_') + '_.doc'

    filepath = os.path.join(path, filename)
    new_filepath = os.path.join(path,new_filename)

    # rename using the full paths, not the bare file names
    os.rename(filepath, new_filepath)

answered Jun 24, 2019 at 7:58

Mig B


Rakesh · Accepted Answer · 2019-06-24 11:19:10Z

This is one approach using a simple iteration.

Ex:

import os
import pandas as pd

#Sample Data#
#dfOne = pd.DataFrame({'Reference': [1123, 1233, 1334, 4444, 5555],'EmailAddress': ["[email protected]", "[email protected]", "[email protected]", np.nan, "[email protected]"]})
dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
dfOne.dropna(inplace=True)  #Drop rows with NaN

#References as str; strip the ".0" left over if the column was read as float
refs = dfOne['Reference'].astype(str).str.replace(r"\.0$", "", regex=True)

for root, dirs, files in os.walk("applicantsCVs"):
    for file in files:
        file_name, ext = os.path.splitext(file)
        #Exact comparison, so "123" does not also match reference "1233"
        email = dfOne[refs == file_name]["EmailAddress"]
        if not email.empty:
            os.rename(os.path.join(root, file), os.path.join(root, email.values[0]+ext))

Or, if you only have .docx files to rename:

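import os
import re
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")

dfOne.dropna(inplace=True)  #Drop rows with NaN before converting, otherwise NaN becomes the string "nan"
#Convert to str; strip the ".0" left over if the column was read as float
dfOne["Reference"] = dfOne["Reference"].astype(str).str.replace(r"\.0$", "", regex=True)
ext = ".docx"
for root, dirs, files in os.walk("applicantsCVs"):
    #Anchor the whole alternation so that only complete references match
    stems = [re.escape(os.path.splitext(i)[0]) for i in files if i.endswith(ext)]
    pattern = r"^(?:" + "|".join(stems) + r")$"
    matched = dfOne[dfOne['Reference'].str.contains(pattern, regex=True)]
    for ref, email in matched[["Reference", "EmailAddress"]].values:
        os.rename(os.path.join(root, ref+ext), os.path.join(root, email+ext))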

Sebastien D · Accepted Answer · 2019-06-24 14:58:14Z

0

You could do it directly in your dataframe using df.apply():

import glob
import os.path

#Filter out null addresses (df is the dataframe read from the Excel file)
df2 = df.dropna(subset=['EmailAddress']).copy()

#Add a column to check if file exists
df2['Existing_file'] = df2.apply(lambda row: glob.glob("applicantsCVs/{}.*".format(row['Reference'])), axis=1)

#Rename the first matching file of each row, keeping its extension
df2.apply(lambda row: os.rename(row.Existing_file[0], 'applicantsCVs/{}.{}'.format( row.EmailAddress, row.Existing_file[0].split('.')[-1])) if len(row.Existing_file) else None, axis = 1)
print((df2.Existing_file.map(len) > 0).sum(), "existing files renamed")

EDIT : works now with any extension (.doc, .docx) by using glob module
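To see what the wildcard lookup does in isolation, here is a small sketch using pathlib's glob in a temporary directory. The helper name find_by_reference and the file names are invented for the example. It assumes the reference contains no glob metacharacters such as * or [.

```python
import tempfile
from pathlib import Path

def find_by_reference(folder, reference):
    """Return files in folder named '<reference>.<any extension>'."""
    return sorted(Path(folder).glob(f"{reference}.*"))

with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.doc", "1123.docx", "11234.doc", "1233.doc"):
        (Path(tmp) / name).touch()
    hits_1123 = [p.name for p in find_by_reference(tmp, "1123")]
    hits_9999 = [p.name for p in find_by_reference(tmp, "9999")]

print(hits_1123)  # both 1123 files, but not 11234.doc
print(hits_9999)  # empty list when nothing matches
```

Because the pattern includes the dot, a reference that is a prefix of another one does not match the longer name. A reference can still match several files with different extensions, so the caller has to decide which one to rename.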

edited Jun 24, 2019 at 14:58

answered Jun 24, 2019 at 8:23

Sebastien D


1 Comment

Maarten Fabré Over a year ago

You can use df.dropna(subset=['EmailAddress']) to filter the empty addresses

Community · Accepted Answer · 2020-06-20 09:12:55Z

Let's assume the sample data in the Excel sheet is the following:

Reference   EmailAddress
1123    [email protected]
1233    [email protected]
1334    [email protected]
nan     [email protected]

The following steps solve this problem.

Step 1

Import the data from the Excel sheet "my.xlsx". Here I am using the sample data.

import pandas as pd
import os
#import data from excel sheet and drop rows with nan 
df = pd.read_excel('my.xlsx').dropna()
#check the head of data if the data is in desirable format
df.head()

You will see that the values in the Reference column are floats at this point.

Step 2

Change the data type in the reference column to integer and then into string

df['Reference'] = df['Reference'].astype(int)  # astype returns a new object and takes no inplace argument
df = df.astype(str)
df.head()

Now the data is in the desired format.
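The reason for going through int first can be seen in a short sketch with made-up values: converting a float column straight to str keeps the decimal part, which then never equals a file stem such as 1123.

```python
import pandas as pd

# Hypothetical references as pandas reads them when the column holds floats.
refs = pd.Series([1123.0, 1233.0, 1334.0])

direct = refs.astype(str).tolist()               # float -> str keeps ".0"
via_int = refs.astype(int).astype(str).tolist()  # float -> int -> str

print(direct)
print(via_int)
```

Note that astype(int) raises an error if the column still contains NaN, so the rows with missing references have to be dropped beforehand, as Step 1 does.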

Step 3

Renaming the files in the desired folder. Zip the lists of 'Reference' and 'EmailAddress' to use in for loop.

#absolute path to the folder. I assume the folder "applicant cv" is in the home directory
path_to_files='/home/applicant cv/'
for ref,email in zip(list(df['Reference']),list(df['EmailAddress'])):
    try:
        os.rename(path_to_files+ref+'.doc',path_to_files+email+'.doc')
    except FileNotFoundError:
        #no file with this reference in the folder, so there is nothing to rename
        print ("No file named", ref+'.doc', "in the folder, leaving it as it is")

Vineet Dhaimodker · Accepted Answer · 2019-06-29 18:59:30Z

Step 1: import the data from excel sheet "Book1.xlsx"

import pandas as pd
df = pd.read_excel (r'path of your file here\Book1.xlsx')        
print (df)

Step 2: Choose the path that your ".docx" files are in and store their names. Keep only the relevant part of each file name for the comparison.

mypath = r'path of docx files\doc files'
from os import listdir,rename
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#print(onlyfiles)
currentfilename=onlyfiles[0].split(".")[0]

This is how I stored the files

Step 3: Loop over the files and check whether each name matches a Reference. Then use the rename(src, dest) function from os.

for filename in onlyfiles:
    currentfilename = filename.split(".")[0]
    for i in range(len(df)):
        #print(currentfilename,df['Reference'][i])
        if str(currentfilename)==str(df['Reference'][i]):
            corresponding_email=df['EmailAddress'][i]
            #rename inside the loop, so every matching file is handled
            rename(join(mypath, filename), join(mypath, corresponding_email+".docx"))
            break

Check out the code with an example: https://github.com/Vineet-Dhaimodker
