Iterate excel files and output in one folder in Python

Question

I have a folder and subfolders structure as follows:

D:/src
├─ xyz.xlsx
├─ dist
│  ├─ xyz.xlsx
│  ├─ xxx.zip
│  └─ xxy.xlsx
├─ lib
│  ├─ xy.rar
│  └─ xyx.xlsx
├─ test
│  ├─ xyy.xlsx
│  ├─ x.xls
│  └─ xyz.xlsx

I want to extract all Excel files (xls or xlsx) from the source directory and its subdirectories, drop duplicates based on the Excel file names, and put all the unique files in the D:/dst directory. How can I get the following result in Python? Thanks. Expected result:

D:/dst
├─ xyz.xlsx
├─ xxy.xlsx
├─ xyx.xlsx
├─ xyy.xlsx
├─ x.xls

Here is what I have tried:

import os

for root, dirs, files in os.walk(src, topdown=False):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except:
                continue
            df1 = pd.DataFrame(columns = [columns_selected])
            df1 = df1.append(df0, ignore_index = True)
            print(df1)
            df1.to_excel('test.xlsx', index = False)

I think you could do all that via shutil.copytree(). See the question "Copying specific files to a new folder, while maintaining the original subdirectory tree".
– martineau, Commented Jan 24, 2019 at 13:09
Thank you for asking. I'll try tomorrow, and if I have a problem I'll let you know.
– ah bon, Commented Jan 24, 2019 at 15:35
@martineau The solution you mentioned copies all xlsx and xls files to a new folder D:/dst, but it maintains the original subdirectories. What if I just want to put them in one folder and ignore the original subdirectories?
– ah bon, Commented Jan 25, 2019 at 1:23
While all that pandas dataframe stuff you added is kind of distracting and not exactly relevant to your question, I think it has exposed what is called an XY Problem, because it reveals why you want to do this file copying. If I understand things correctly, there is really no need to copy all the files to a single folder first: just use the process that finds them to drive the concatenation you want to do instead. This will greatly reduce the I/O that needs to be done.
– martineau, Commented Jan 25, 2019 at 8:55
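
A minimal sketch of the idea in the last comment, assuming the real goal is one concatenated DataFrame: walk the tree once, skip repeated file names, and read each workbook where it sits instead of copying it first. The names `iter_unique_excel` and `concat_excel` are hypothetical, and pandas (with an Excel engine such as openpyxl) is assumed to be installed; it is imported only inside the function that needs it.

```python
import os

WANTED = {'.xls', '.xlsx'}


def iter_unique_excel(src):
    """Yield the path of each Excel file under src, once per file name."""
    seen = set()
    for root, dirs, filenames in os.walk(src):
        for filename in filenames:
            ext = os.path.splitext(filename)[1].lower()
            if ext in WANTED and filename not in seen:
                seen.add(filename)
                yield os.path.join(root, filename)


def concat_excel(src):
    """Read every unique workbook under src into one DataFrame."""
    import pandas as pd  # assumption: pandas and an Excel engine are installed
    frames = [pd.read_excel(path) for path in iter_unique_excel(src)]
    # pd.concat raises ValueError if no workbook was found.
    return pd.concat(frames, ignore_index=True)
```

Calling `concat_excel('D:/src')` would then replace both the copy step and the later read step, at the cost of deciding up front which duplicate wins (here, the first one the walk encounters).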

martineau · Accepted Answer · 2019-01-25 08:36:58Z

2

I think this will do what you want:

import os
import shutil


src = os.path.abspath(r'.\_src')
dst = os.path.abspath(r'.\_dst')
wanted = {'.xls', '.xlsx'}

os.makedirs(dst, exist_ok=True)  # shutil.copy() needs dst to exist as a folder
copied = set()  # file names already copied, used to skip duplicates

for root, dirs, filenames in os.walk(src, topdown=False):
    for filename in filenames:
        ext = os.path.splitext(filename)[1]
        if ext in wanted and filename not in copied:
            src_filepath = os.path.join(root, filename)
            shutil.copy(src_filepath, dst)  # dst is a folder, so the name is kept
            copied.add(filename)
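
The same loop can be wrapped in a reusable function. This is only a sketch: the name `collect_excel_files` and its `wanted` parameter are hypothetical, the extension check is made case-insensitive, and the default top-down walk is used, so the copy nearest the top of the tree wins when names collide.

```python
import os
import shutil


def collect_excel_files(src, dst, wanted=('.xls', '.xlsx')):
    """Copy one file per unique Excel file name from src's tree into dst."""
    os.makedirs(dst, exist_ok=True)  # shutil.copy needs dst to be a directory
    copied = set()
    for root, dirs, filenames in os.walk(src):
        for filename in filenames:
            ext = os.path.splitext(filename)[1].lower()  # also matches '.XLSX'
            if ext in wanted and filename not in copied:
                shutil.copy(os.path.join(root, filename), dst)
                copied.add(filename)
    return sorted(copied)
```

For the tree in the question, `collect_excel_files('D:/src', 'D:/dst')` would return the five unique names: x.xls, xxy.xlsx, xyx.xlsx, xyy.xlsx and xyz.xlsx.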

answered Jan 25, 2019 at 8:36

martineau

4 Comments

ah bon Over a year ago

Your solution perfectly solved my problem of copying Excel files into one folder. Thanks a lot.

martineau Over a year ago

ahbon: That's good to hear and you're welcome. Sorry about any distraction my shutil.copytree() suggestion may have caused. While it could be done with it, in this situation it would not have been a very good approach.

aneroid Over a year ago

Was about to share this answer you provided for another question.

martineau Over a year ago

@aneroid: I, too, originally thought that answer would be a good approach, but when I was finally able, after much discussion, to understand what the goal was, I decided it wouldn't. Although it would technically be possible, "abusing" the ignore option to the point where it ignores everything is a little too much in my opinion. All that's needed for this question is the ability to walk the folder hierarchy, so os.walk() is the more logical (and better) choice.

aneroid · Answer · 2019-01-25 07:40:20Z

1

Since you've already got glob.glob, you don't need to also use os.walk, and vice versa. But since glob only matches one pattern at a time and has no way to denote an optional extra 'x' in the extension, you'll either need to run the glob loop twice, once for each extension, or use glob.glob('D:\\src\\*.xls*'), which could also match '*.xlsm', etc.

For each file matched, use shutil.move:

import glob, os, shutil

for file in glob.glob('D:\\src\\*.xls*'):
    shutil.move(file, 'D:\\dst\\' + os.path.basename(file))
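
If subfolders should be searched with glob as well, a recursive pattern plus an exact extension filter is one option. The sketch below is an assumption about how that could look, not part of the answer's original code; `find_excel_files` is a hypothetical name. It relies on `glob.glob(..., recursive=True)`, where `**` matches zero or more directory levels.

```python
import glob
import os


def find_excel_files(src):
    """Return paths of .xls and .xlsx files anywhere under src."""
    # glob.escape() protects any special characters in the folder name itself.
    pattern = os.path.join(glob.escape(src), '**', '*.xls*')
    matches = glob.glob(pattern, recursive=True)
    # '*.xls*' also matches '.xlsm', '.xlsb', etc., so filter by exact extension.
    return [path for path in matches
            if os.path.splitext(path)[1].lower() in {'.xls', '.xlsx'}]
```

The returned paths can then be passed to shutil.copy or shutil.move one at a time, as in the loop above.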

With os.walk, you can do each extension check with fnmatch.fnmatch in the same loop:

import fnmatch
import os
import shutil

for root, dirs, files in os.walk('D:\\src'):
    for file in files:
        if fnmatch.fnmatch(file, '*.xls') or fnmatch.fnmatch(file, '*.xlsx'):
            shutil.move(f'{root}\\{file}', f'D:\\dst\\{file}')
            # shutil.move(root + '\\' + file, 'D:\\dst\\' + file)

edited Jan 25, 2019 at 7:40

answered Jan 25, 2019 at 1:55

aneroid

7 Comments

martineau Over a year ago

Why bother calling os.walk() only to then quit after first iteration? I think the question is unclear and have asked the OP for clarification.

ah bon Over a year ago

Thanks @aneroid. I think shutil.copy would be better, since it leaves the original data unchanged. What do you think?

ah bon Over a year ago

Thanks for your help. Your solution only copies D:/src/xyz.xlsx to D:/dst, not the Excel files in its subfolders.

aneroid Over a year ago

@martineau I gathered from the OP's code that he/she wanted to stop after the first folder/iteration. Even with glob, we'd need two loops, one for each extension, and without recursive=True. So I figured os.walk() could simplify that. Anyway, from the OP's post, they did want to include subdirs.

aneroid Over a year ago

@ahbon Yes, you can use copy instead of move if you want the original data to remain in 'D:\src'. I've removed the break so that loop doesn't stop after the first iteration. From your code, it looked like you only want to use the top-level. Also, if you are recursively copying, then f'D:\\dst\\{file}' will need to be modified to have the parent folders of each file. Otherwise it will copy flat, without the subdirs. Look at shutil.copytree() for that.
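
For the other case raised in this comment, keeping the subfolders instead of copying flat, a rough sketch with shutil.copytree() and its `ignore` callback might look like the following. The names `ignore_non_excel` and `copy_excel_tree` are hypothetical. Subfolders that contain no Excel files are still created, just empty.

```python
import os
import shutil

WANTED = {'.xls', '.xlsx'}


def ignore_non_excel(directory, names):
    """Return the entries copytree should skip: every file that is not Excel."""
    return [name for name in names
            if not os.path.isdir(os.path.join(directory, name))
            and os.path.splitext(name)[1].lower() not in WANTED]


def copy_excel_tree(src, dst):
    """Copy src to dst, keeping subfolders but only the Excel files."""
    # dst must not exist yet; Python 3.8+ also accepts dirs_exist_ok=True.
    return shutil.copytree(src, dst, ignore=ignore_non_excel)
```

This is the opposite of what the question asks for (duplicates in different subfolders are all kept), which is why os.walk() with a flat destination fits the question better.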
