I have a folder with several Excel files in xls and xlsx format, and I am trying to read them and concatenate them into a single DataFrame. The problem is that Python does not read the files in the folder in the correct order.

My folder contains the following files: 190.xls , 195.xls , 198.xls , 202.xlsx , 220.xlsx and so on

This is my code:

import pandas as pd
from pathlib import Path

my_path = 'my_Dataset/'

xls_files = pd.concat([pd.read_excel(f2) for f2 in Path(my_path).rglob('*.xls')], sort = False)

xlsx_files = pd.concat([pd.read_excel(f1) for f1 in Path(my_path).rglob('*.xlsx')],sort = False)

all_files = pd.concat([xls_files, xlsx_files], sort=False).reset_index(drop=True)

I get the data I want, but the files are not concatenated in the order in which they appear in the folder: in the all_files DataFrame, the data from 202.xlsx comes first, followed by the data from 190.xls.

How can I solve this problem? Thank you in advance!

  • What if you read the xls files, append the xlsx files to that list, and call pd.concat only once? That way the xls files are guaranteed to come before the xlsx files, and concatenating only once should also be somewhat more efficient. Just a suggestion. Commented Feb 24, 2020 at 9:44
  • Instead of Path(my_path).rglob('*.xls') use sorted(Path(my_path).rglob('*.xls')) Commented Feb 24, 2020 at 9:45
  • You can loop over the files in the order you need and do your pd.concat inside that loop. Commented Feb 24, 2020 at 9:46
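
A note on the sorted() suggestion above: plain sorted() compares paths as text, so it only matches numeric order while all the file names have the same number of digits. It also sorts each extension separately, so the xls and xlsx groups still do not interleave. A small sketch of the difference, using hypothetical file names that are not from the question:

```python
from pathlib import PurePosixPath

# Hypothetical file names, chosen so that the two orderings differ
names = ["190.xls", "1000.xls", "95.xls"]
paths = [PurePosixPath(n) for n in names]

# Plain sorted() compares the paths as text, character by character
text_order = [p.name for p in sorted(paths)]

# Sorting on the integer value of the stem gives numeric order
numeric_order = [p.name for p in sorted(paths, key=lambda p: int(p.stem))]

print(text_order)
print(numeric_order)
```

With three-digit names only, as in the question, both orderings should agree.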

2 Answers

Try using

import pandas as pd
from pathlib import Path

my_path = 'my_Dataset/'
all_files = pd.concat([pd.read_excel(f) for f in sorted(list(Path(my_path).rglob('*.xls')) + list(Path(my_path).rglob('*.xlsx')), key=lambda x: int(x.stem))],sort = False).reset_index(drop=True) 
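
The same idea can be split into a small helper so the ordering can be checked before any file is read. This is only a sketch: the helper name excel_files_in_numeric_order and the EXCEL_SUFFIXES set are made up for illustration, and it assumes every file name is a plain integer such as 190.xls. The demo uses empty placeholder files in a temporary folder, so pandas is not needed to run it.

```python
import tempfile
from pathlib import Path

# Assumption: only these two Excel formats are present in the folder
EXCEL_SUFFIXES = {".xls", ".xlsx"}


def excel_files_in_numeric_order(folder):
    """Return all .xls/.xlsx files under `folder`, sorted by the number in the file name."""
    candidates = [
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in EXCEL_SUFFIXES
    ]
    # int(p.stem) raises ValueError for a non-numeric name such as "notes.xlsx"
    return sorted(candidates, key=lambda p: int(p.stem))


# Demo on empty placeholder files with mixed extensions
with tempfile.TemporaryDirectory() as tmp:
    for name in ["202.xlsx", "190.xls", "220.xlsx", "195.xlsx", "198.xls"]:
        (Path(tmp) / name).touch()
    ordered_names = [p.name for p in excel_files_in_numeric_order(tmp)]
    print(ordered_names)
```

On the real folder, the returned list can be passed to pd.concat([pd.read_excel(p) for p in ...], sort=False).reset_index(drop=True) exactly as in the one-liner above.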

3 Comments

I have a question. What if the files in the folder are in the following order: 190.xls, 195.xlsx, 198.xls, 202.xlsx? In other words, what if the file I want concatenated second is in xlsx format, unlike the one before it? Would this code still work? Would 195.xlsx be placed after 190.xls or after 198.xls?
Try print(sorted(list(Path(my_path).rglob('*.xls')) + list(Path(my_path).rglob('*.xlsx')), key=lambda x: int(x.stem))) to check the result
Great! It works! Thank you very much for your help!

Update this

all_files = pd.concat([xls_files, xlsx_files], sort=False).reset_index(drop=True)

to this

all_files = pd.concat([xlsx_files, xls_files], sort=False).reset_index(drop=True)
