
I have a folder and subfolders structure as follows:

D:/src
├─ xyz.xlsx
├─ dist
│  ├─ xyz.xlsx
│  ├─ xxx.zip
│  └─ xxy.xlsx
├─ lib
│  ├─ xy.rar
│  └─ xyx.xlsx
├─ test
│  ├─ xyy.xlsx
│  ├─ x.xls
│  └─ xyz.xlsx

I want to extract all Excel files (xls or xlsx) from the source directory and its subdirectories, drop duplicates based on file name, and put the unique files in the D:/dst directory. How can I get the following result in Python? Thanks. Expected result:

D:/dst
├─ xyz.xlsx
├─ xxy.xlsx
├─ xyx.xlsx
├─ xyy.xlsx
├─ x.xls

Here is what I have tried:

import os
import pandas as pd

for root, dirs, files in os.walk(src, topdown=False):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except:
                continue
            df1 = pd.DataFrame(columns = [columns_selected])
            df1 = df1.append(df0, ignore_index = True)
            print(df1)
            df1.to_excel('test.xlsx', index = False)
  • I think you could do all that via shutil.copytree(). See the question "Copying specific files to a new folder, while maintaining the original subdirectory tree". Commented Jan 24, 2019 at 13:09
  • @ahbon, any luck with solving this one yet? Commented Jan 24, 2019 at 14:35
  • Thank you for asking. I'll try tomorrow and if I have problem I'll let you know. Commented Jan 24, 2019 at 15:35
  • @martineau The solution you mentioned copies all xlsx and xls files to a new folder D:/dst, but it maintains the original subdirectories. What if I just want to put them in one folder and ignore the original subdirectories? Commented Jan 25, 2019 at 1:23
  • While all that pandas dataframe stuff you added is kind of distracting and not exactly relevant to your question, it has, I think, exposed what is called an XY Problem, because it reveals why you want to do this file copying. If I understand things correctly, there really is no need to copy all the files to a single folder first: just use the process that finds them to drive the concatenation you want to do instead. This will greatly reduce the I/O that needs to be done. Commented Jan 25, 2019 at 8:55

2 Answers


I think this will do what you want:

import os
import shutil


src = os.path.abspath(r'.\_src')
dst = os.path.abspath(r'.\_dst')
wanted = {'.xls', '.xlsx'}

copied = set()

for root, dirs, filenames in os.walk(src, topdown=False):
    for filename in filenames:
        ext = os.path.splitext(filename)[1]
        if ext in wanted and filename not in copied:
            src_filepath = os.path.join(root, filename)
            shutil.copy(src_filepath, dst)
            copied.add(filename)
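
One thing to watch with the loop above: if the destination folder does not exist yet, shutil.copy(src_filepath, dst) treats dst as a file name rather than a directory, so it is safer to create the folder first. A minimal variant sketch using pathlib, assuming extensions should also match case-insensitively (e.g. .XLSX); the function name collect_unique_excel is only illustrative and is not part of the original answer:

```python
import shutil
from pathlib import Path

WANTED = {'.xls', '.xlsx'}


def collect_unique_excel(src, dst):
    """Copy one file per unique Excel file name found under src into dst (flat)."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)  # make sure dst really is a directory
    copied = set()
    # sorted() fixes the visiting order, so the kept duplicate is predictable
    for path in sorted(src.rglob('*')):
        if not path.is_file() or path.suffix.lower() not in WANTED:
            continue
        if path.name in copied:
            continue  # same file name already copied from another folder
        shutil.copy2(path, dst / path.name)
        copied.add(path.name)
    return sorted(copied)
```

With the tree from the question, this should leave the five unique file names in the destination folder. Which of the duplicate xyz.xlsx files is kept depends on the visiting order, just as in the os.walk version.
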

4 Comments

Your solution perfectly solved my copying excel files to one folder issues. Thanks a lot.
ahbon: That's good to hear and you're welcome. Sorry about any distraction my shutil.copytree() suggestion may have caused. While it could be done with it, in this situation it would not have been a very good approach.
Was about to share this answer you provided for another question.
@aneroid: I, too, originally thought that answer would be a good approach, but once I finally understood the goal, after much discussion, I decided it wouldn't be. Although it would technically be possible by "abusing" the ignore option to the point where it ignores everything, that is a little too much in my opinion. All that's needed for this question is the ability to walk the folder hierarchy, so os.walk() is the more logical (and better) choice.

Since you've already got glob.glob, you don't need os.walk as well, and vice versa. But since glob matches only one pattern at a time and has no way to denote an optional extra 'x' in the extension, you'll need either to run the glob loop twice, once for each extension, or to use glob.glob('D:\\src\\*.xls*'), which could also match '*.xlsm', etc.

For each file matched, use shutil.move:

for file in glob.glob('D:\\src\\*.xls*'):
    shutil.move(file, 'D:\\dst\\' + os.path.basename(file))
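
The glob call above only looks at the top level of D:\src. A sketch of a recursive variant, assuming Python 3.5+ (where glob.glob accepts recursive=True and the ** pattern) and using copy rather than move so the source is left untouched. The names find_excel_files and flatten_copy are illustrative, not from the original answer:

```python
import glob
import os
import shutil


def find_excel_files(src):
    """Return .xls and .xlsx paths under src, using one recursive glob per extension."""
    matches = []
    for pattern in ('*.xls', '*.xlsx'):
        # With recursive=True, '**' matches src itself and every subdirectory
        matches.extend(glob.glob(os.path.join(src, '**', pattern), recursive=True))
    return sorted(matches)


def flatten_copy(src, dst):
    """Copy each match into dst, skipping file names that were already copied."""
    os.makedirs(dst, exist_ok=True)
    seen = set()
    for path in find_excel_files(src):
        name = os.path.basename(path)
        if name not in seen:
            shutil.copy(path, os.path.join(dst, name))
            seen.add(name)
    return seen
```

Running one exact pattern per extension avoids the '*.xls*' wildcard picking up '.xlsm' and similar files.
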

With os.walk, you can do each extension check with fnmatch.fnmatch in the same loop:

import fnmatch
import os
import shutil

for root, dirs, files in os.walk('D:\\src'):
    for file in files:
        # Check both Excel extensions in the same pass
        if fnmatch.fnmatch(file, '*.xls') or fnmatch.fnmatch(file, '*.xlsx'):
            shutil.move(os.path.join(root, file), os.path.join('D:\\dst', file))

7 Comments

Why bother calling os.walk() only to then quit after the first iteration? I think the question is unclear and have asked the OP for clarification.
Thanks @aneroid. I think shutil.copy would be better, since it leaves the original data unchanged. What do you think?
Thanks for your help. Your solution only copies D:/src - xyz.xlsx to D:/dst, not the Excel files in its subfolders.
@martineau I gathered from the OP's code that he/she wanted to stop after the first folder/iteration. Even with glob, we'd need two loops for each extension, and without recursive=True. So figured os.walk() could simplify that. Anyway, from OP's post, they did want to do subdirs.
@ahbon Yes, you can use copy instead of move if you want the original data to remain in 'D:\src'. I've removed the break so that loop doesn't stop after the first iteration. From your code, it looked like you only want to use the top-level. Also, if you are recursively copying, then f'D:\\dst\\{file}' will need to be modified to have the parent folders of each file. Otherwise it will copy flat, without the subdirs. Look at shutil.copytree() for that.
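
For the case described in the last comment, where the subfolders should be kept rather than flattened, the destination path can be built from each file's path relative to the source root. A sketch under that assumption (copy_preserving_tree is an illustrative name, not from the answer):

```python
import os
import shutil

EXTENSIONS = ('.xls', '.xlsx')


def copy_preserving_tree(src, dst):
    """Copy Excel files from src to dst, recreating each file's parent folders."""
    copied = []
    for root, dirs, files in os.walk(src):
        for name in files:
            if not name.lower().endswith(EXTENSIONS):
                continue
            rel_dir = os.path.relpath(root, src)  # '.' for the top level
            target_dir = os.path.normpath(os.path.join(dst, rel_dir))
            os.makedirs(target_dir, exist_ok=True)
            shutil.copy(os.path.join(root, name), os.path.join(target_dir, name))
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

Because every file keeps its own folder, files with the same name no longer collide, so no duplicate check is needed in this variant.
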