I have the following folder and subfolder structure:
D:/src
├─ xyz.xlsx
├─ dist
│ ├─ xyz.xlsx
│ ├─ xxx.zip
│ └─ xxy.xlsx
├─ lib
│ ├─ xy.rar
│ └─ xyx.xlsx
├─ test
│ ├─ xyy.xlsx
│ ├─ x.xls
│ └─ xyz.xlsx
I want to extract all Excel files (xls or xlsx) from the source directory and its subdirectories, drop duplicates based on the Excel file names, and put all the unique files in the D:/dst directory. How can I get the following result in Python? Thanks. Expected result:
D:/dst
├─ xyz.xlsx
├─ xxy.xlsx
├─ xyx.xlsx
├─ xyy.xlsx
├─ x.xls
Here is what I have tried:
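import os
import pandas as pd

src = 'D:/src'

for root, dirs, files in os.walk(src, topdown=False):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except Exception:
                # skip files that cannot be read
                continue
            # columns_selected has to be defined beforehand (not shown here)
            df1 = pd.DataFrame(columns = [columns_selected])
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            df1 = pd.concat([df1, df0], ignore_index = True)
            print(df1)
            df1.to_excel('test.xlsx', index = False)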
Take a look at shutil.copytree(). See the question "Copying specific files to a new folder, while maintaining the original subdirectory tree".
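Since the expected result is a flat D:/dst with one copy per file name, while shutil.copytree() keeps the subdirectory tree, a plain os.walk loop combined with shutil.copy2 may be a closer fit. The following is only a minimal sketch under a few assumptions: the function name `collect_unique_excel` is made up for illustration, the first file found with a given name wins (the top-level folder is visited first), and names are compared case-insensitively.

```python
import os
import shutil


def collect_unique_excel(src, dst, extensions=(".xls", ".xlsx")):
    """Copy every Excel file under src into dst, keeping one file per name.

    Hypothetical helper: the first file found with a given name is kept,
    later files with the same name are skipped.
    """
    os.makedirs(dst, exist_ok=True)
    seen = set()   # lower-cased file names that were already copied
    copied = []    # names of the copied files, in copy order
    for root, dirs, files in os.walk(src):
        dirs.sort()  # visit subdirectories in a predictable order
        for name in sorted(files):
            key = name.lower()
            # str.endswith accepts a tuple of suffixes
            if not key.endswith(extensions) or key in seen:
                continue
            seen.add(key)
            shutil.copy2(os.path.join(root, name), os.path.join(dst, name))
            copied.append(name)
    return copied
```

Calling `collect_unique_excel("D:/src", "D:/dst")` on the tree from the question should leave the five files from the expected result in D:/dst. If a different copy of a duplicated name should win (for example the newest one), the `key in seen` check is the place to change.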