Extracting necessary columns from multiple Excel files in Python

Question

I am trying to extract selected columns from 19 Excel files and combine them into a single Excel file. I am able to extract the required columns from a single file with the code below.

import pandas as pd
import openpyxl

file = pd.read_excel("Shift Handover To A - 05-25-2021.xlsx", "25th May")

dataframe=pd.DataFrame(file[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]])

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

rslt_df.to_excel('output.xlsx')

I am trying to apply it to all the files with the code below:

import os
import pandas as pd
cwd = os.path.abspath('')
import openpyxl
files = os.listdir(cwd)

for file in files:
    if file.startswith('Shift'):
        file = pd.read_excel(os.path.join(cwd, file))
dataframe=pd.DataFrame(file[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]])

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

#print(rslt_df)
rslt_df.to_excel('output.xlsx')

But I receive a TypeError at the line dataframe=pd.DataFrame(file.....: "TypeError: string indices must be integers". What could be wrong?

read_excel already returns a DataFrame, so there is no need to convert it to one again.
– Naveen, Commented Jun 19, 2021 at 6:43
You use 'file' both as the loop variable (for file in files) and as the DataFrame inside the loop. Use a different name for one of them.
– IoaTzimas, Commented Jun 19, 2021 at 6:46

SeaBean · Accepted Answer · 2021-06-19 06:50:13Z

You can amend your code as follows:

Define an empty DataFrame before the loop and accumulate the result of each iteration with .append():

There is no need to call pd.DataFrame after the loop; just select the columns you want and assign the result back with dataframe = dataframe[["S No", ...]]

import os
import pandas as pd

cwd = os.path.abspath('')
files = os.listdir(cwd)

dataframe = pd.DataFrame()
for file in files:
    if file.startswith('Shift'):
        file_read = pd.read_excel(os.path.join(cwd, file))
        dataframe = dataframe.append(file_read) 

dataframe = dataframe[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]]

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

#print(rslt_df)
rslt_df.to_excel('output.xlsx')
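
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the loop above raises an AttributeError. A minimal sketch of the same accumulate-then-filter idea using pd.concat follows. The function names, the reader parameter, and the shortened column list are hypothetical and only there for illustration; substitute the full column list from the question.

```python
import os

import pandas as pd

# Hypothetical, shortened column list; use the full list from the question.
WANTED_COLUMNS = ("S No", "Severity", "Current Stus")


def combine_shift_files(folder, columns=WANTED_COLUMNS, reader=pd.read_excel):
    """Read every file in `folder` whose name starts with 'Shift' and stack the rows."""
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.startswith("Shift"):
            # Keep the file name and the DataFrame in separate variables.
            frames.append(reader(os.path.join(folder, name)))
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame.
        return pd.DataFrame(columns=list(columns))
    combined = pd.concat(frames, ignore_index=True)
    return combined[list(columns)]


def in_progress_rows(df):
    """Keep only the rows whose status column equals 'In-Progress'."""
    return df.loc[df["Current Stus"] == "In-Progress"]


# Example usage (assumes the Excel files sit in the current directory):
# result = in_progress_rows(combine_shift_files(os.getcwd()))
# result.to_excel("output.xlsx", index=False)
```

Collecting the frames in a list and calling pd.concat once also avoids copying the growing DataFrame on every iteration.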

IoaTzimas · Accepted Answer · 2021-06-19 06:49:36Z


The problem with your code is in these lines:

for file in files:
    if file.startswith('Shift'):
        file = pd.read_excel(os.path.join(cwd, file))
dataframe=pd.DataFrame(file[["S No", ... "Current Stus"]])

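You use 'file' as the loop variable (for file in files) and then reassign it to a DataFrame inside the loop. When the loop ends, 'file' holds whatever the last iteration left in it. If the last entry in the directory does not start with 'Shift', the condition file.startswith('Shift') is False for it, so 'file' is still a plain string (the file name). Indexing a string with a list, as in file[["S No", ... "Current Stus"]], then raises "TypeError: string indices must be integers".

Just use another name for the DataFrame.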
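
The effect can be reproduced without any Excel files. The sketch below is only an illustration: the file names are made up, and a dict stands in for the DataFrame that pd.read_excel would return.

```python
# Hypothetical directory listing: the last entry does not start with 'Shift'.
files = ["Shift Handover A.xlsx", "notes.txt"]

for file in files:
    if file.startswith("Shift"):
        # Stand-in for the DataFrame returned by pd.read_excel.
        file = {"S No": [1]}

# After the loop, 'file' is the last loop value: the plain string 'notes.txt'.
try:
    file[["S No"]]
    error = None
except TypeError as exc:
    error = exc

print(type(file).__name__, "->", error)
```

Because the last name in the list fails the startswith check, 'file' is never reassigned on the final iteration, and the list index is applied to a string.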

