14

I am trying to combine two different Excel files (thanks to the post "Import multiple excel files into python pandas and concatenate them into one dataframe").

What I have worked out so far is:

import os
import pandas as pd

df = pd.DataFrame()

for f in ['c:\\file1.xls', 'c:\\file2.xls']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)

df.to_excel("c:\\all.xls")

Here is what they look like:

[image: the two source files]

However I want to:

  1. Exclude the last two rows of each file (i.e. row4 and row5 in File1.xls; row7 and row8 in File2.xls).
  2. Add a column (or overwrite Column A) to indicate which file the data came from.

For example:

[image: the desired combined output]

Is it possible? Thanks.

3 Answers

15

For no. 1, you can pass skip_footer to read_excel (newer pandas versions spell it skipfooter); or, alternatively, do

data = data.iloc[:-2]

once you have read the data.
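To see what the slice does, here is a minimal sketch on an in-memory frame. The column names and values are made up for illustration; they stand in for one sheet with three data rows followed by two footer rows.

```python
import pandas as pd

# Hypothetical stand-in for one sheet: three data rows plus two footer rows.
data = pd.DataFrame({
    "A": ["row1", "row2", "row3", "row4", "row5"],
    "B": [10, 20, 30, None, None],
})

# Keep everything except the last two rows; the original frame is untouched.
trimmed = data.iloc[:-2]

print(trimmed["A"].tolist())  # ['row1', 'row2', 'row3']
```

Note that a sheet with two or fewer rows would come out empty after this slice, so it only fits files that are known to end with exactly two unwanted rows.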

For num. 2, you may do:

from os.path import basename
data.index = [basename(f)] * len(data)

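Also, it would perhaps be better to put all the data frames in a list and then concat them at the end; something like:

import os
import pandas as pd

frames = []
for f in ['c:\\file1.xls', 'c:\\file2.xls']:
    # Read the sheet and drop its last two rows.
    data = pd.read_excel(f, 'Sheet1').iloc[:-2]
    # Label every row with the name of the file it came from.
    data.index = [os.path.basename(f)] * len(data)
    frames.append(data)

# Concatenate once at the end instead of growing a frame inside the loop.
df = pd.concat(frames)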
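If you would rather have the file name in a regular column than in the index (the question allows either), a variation could look like the sketch below. The helper name `combine`, the column name `source`, and the sample paths are all hypothetical; the frames are built in memory so the example runs without any Excel files.

```python
import os
import pandas as pd


def combine(frames_by_path, n_footer=2):
    """Drop n_footer trailing rows from each frame, tag rows with the file name, and concatenate."""
    parts = []
    for path, frame in frames_by_path.items():
        trimmed = frame.iloc[:-n_footer] if n_footer else frame
        # assign() returns a new frame, so the input frames are not modified.
        parts.append(trimmed.assign(source=os.path.basename(path)))
    return pd.concat(parts, ignore_index=True)


# Hypothetical in-memory stand-ins for the two sheets.
frames = {
    "data/file1.xls": pd.DataFrame({"A": [1, 2, 3, 4, 5]}),
    "data/file2.xls": pd.DataFrame({"A": [6, 7, 8]}),
}
combined = combine(frames)
print(combined)
```

With real files, the dictionary could be built as `{p: pd.read_excel(p, sheet_name="Sheet1") for p in paths}`. Keep in mind that `os.path.basename` only splits on backslashes when running on Windows.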

1 Comment

Magnificent, I have to say. behzad.nouri, you are gorgeous!
4
import os
import xlrd
import xlsxwriter

# Note: xlrd 2.0 and later read only .xls files; reading .xlsx needs xlrd < 2.0.
file_name = input("Enter the destination file name (without extension): ")
merged_file_name = file_name + ".xlsx"
dest_book = xlsxwriter.Workbook(merged_file_name)
dest_sheet_1 = dest_book.add_worksheet()
dest_row = 1
header_written = False
path = input("Enter the path of the folder to scan: ")
for root, dirs, files in os.walk(path):
    files = [name for name in files if name.endswith('.xlsx')]
    for xlsfile in files:
        print("File in mentioned folder is: " + xlsfile)
        temp_book = xlrd.open_workbook(os.path.join(root, xlsfile))
        temp_sheet = temp_book.sheet_by_index(0)
        # Copy the header row once, from the first file only.
        if not header_written:
            for col_index in range(temp_sheet.ncols):
                value = temp_sheet.cell_value(0, col_index)
                dest_sheet_1.write(0, col_index, value)
            header_written = True
        # Append the data rows (everything below the header).
        for row_index in range(1, temp_sheet.nrows):
            for col_index in range(temp_sheet.ncols):
                value = temp_sheet.cell_value(row_index, col_index)
                dest_sheet_1.write(dest_row, col_index, value)
            dest_row = dest_row + 1
dest_book.close()

# Reopen the merged file to report its size.
book = xlrd.open_workbook(merged_file_name)
sheet = book.sheet_by_index(0)
print("number of rows in destination file are:", sheet.nrows)
print("number of columns in destination file are:", sheet.ncols)

0

Change

df.to_excel("c:\\all.xls")

to

df.to_excel("c:\\all.xls", index=False)

You may need to play around with the double quotes, but I think that will work.
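To illustrate what `index=False` changes, the sketch below uses `to_csv` as a stand-in for `to_excel`, because it accepts the same `index` argument and its output is easy to print without an Excel engine installed. The frame and the column name `source` are made up for the example.

```python
import pandas as pd

# Hypothetical frame whose index holds the source file name.
df = pd.DataFrame({"A": [1, 2]}, index=["file1.xls", "file1.xls"])

with_index = df.to_csv()                # index written as the first column
without_index = df.to_csv(index=False)  # index dropped entirely

# To keep the labels while still passing index=False, move them into a column first.
kept = df.rename_axis("source").reset_index().to_csv(index=False)

print(with_index)
print(without_index)
print(kept)
```

So `index=False` removes the extra first column, but if the file name is stored in the index it is lost as well unless it is moved into a column beforehand.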
