0

If I have lots of excel files as follows (here are just two examples):

data1.xlsx

df1

data2.xlsx

enter image description here

Is it possible I just take the part with columns of id, a, b, c and ignore the rest and concatenate all those files together into a new excel file in Python. Thanks.

enter image description here

Here is what I have tried:

import os

for root, dirs, files in os.walk(src, topdown=False):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except:
                continue
            df1 = pd.DataFrame(columns = [columns_selected])
            df1 = df1.append(df0, ignore_index = True)
            print(df1)
            df1.to_excel('test.xlsx', index = False)
4
  • Using iloc to get only the data after specific columns from .xlsx, then concatenate both the sheets to make a new one. and pufff! done. Commented Jan 24, 2019 at 12:58
  • Thanks. In fact my real data is literally a mess, quite difficult to deal with. :( Commented Jan 24, 2019 at 14:04
  • ahbon: You can probably adapt the code in my answer to your other question to do this (as I already told you I thought was likely). Instead of the destination being a single directory somewhere, for this it's single dataframe—and instead of copying files to the destination directory, you'll want to extract and concatenate data from all the files which have one of the wanted file extensions. Commented Jan 25, 2019 at 11:09
  • Yeah. I agree. I means if it's possible to concatenate data1.xlsx and data2.xlsx ignore contents before and after empty rows to get final excel with columns of id, a, b, c. Commented Jan 25, 2019 at 11:36

2 Answers 2

1

use skpirows and nrows https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

import pandas

df1 = pd.read_excel('data1.xlsx', skpirows=3, nrows=5)
df2 = pd.read_excel('data2.xlsx', skpirows=4, nrows=5)

dfFinal = df1.append(df2)
Sign up to request clarification or add additional context in comments.

6 Comments

My mistake... I delete it
Thanks for your help. In fact I have lots of those type of excel files under a folder I don't want real_excel one by one, are there other solutions?
if all files have the same pattern, you can have one file with many sheets on it. Then you can set sheet_name=None to read and parse all data.
@ahbon are those files or sheets within file?
Thanks. The problem is they are located in different folders and subfolders, and plus thoese file have sheets as well. Mabey first I need to do is to copy all excel files into one folder via shutil.copytree(), then combine them to one file with multiple sheets, and at last step, take your solution.
|
1

Extending @Charles R's answer with your requirement of multiple excel files.

# get all the files
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)

and then:

for File in FileList:
    for x in File:
        # the rest of the code for reading

2 Comments

Thanks. Please check the poster I put here. The structure of folders and files is similar to this one: stackoverflow.com/questions/54346748/…
I think we need use os.walk to iterate all excel files end with xlsx or xls, non?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.