1

I have to make an application in which I have to import all the excel files in the given folder and add it to a dataframe. The dataframe should look as shown:

Expected Data Frame

As seen in the image one of the columns for the dataframe is the name of the file.

I have successfully added that column in the final dataframe and the code is as follows:

import pandas as pd
import os
import shutil
import re  
path = 'C:/Users/Administrator/Desktop/Zerodha/Day2'
lst = os.listdir(path)
files = [os.path.join(path,x) for x in lst]
print(lst)
dataframes_lst = []

for file in files:
 filename = file.split('/')[-1]
 dataframe = pd.read_csv(file, usecols=[0,4], names ["date",filename],index_col=["date"])
dataframes_lst.append(dataframe)

df = pd.concat(dataframes_lst, axis=1)

print(df)


df.to_csv('data.csv')

The dataframe which is obtained using this code is as displayed:

Obtained Data Frame

For reference I will attach the snippet of one of the excel files:

Excel snippet

Also as seen there are many nan values obtained. I tried to remove them by using pd.dropna(inplace = True) function and also by doing as suggested in this post:

  1. How to drop rows of Pandas DataFrame whose value in a certain column is NaN

But the resultant dataframe still contains the nan values.

1
  • I have solved the nan issue by using the fillna function Commented Nov 19, 2020 at 9:22

2 Answers 2

2

Regarding

My doubt is that how do I loop through all the files in the directory and extract data of each file in the required format

You can loop through all the files in the directory and extract the data and filename as the header of the dataframe as such:

import pandas as pd
import os

path = './data'
lst = os.listdir('./data/')
files = [os.path.join(path,el) for el in lst]

output_of_files

and the structure of example.xlsx is:

excel_structure

dataframes_lst = []

for file in files:
    filename = file.split('/')[-1]
    dataframe = pd.read_excel(file, usecols=[3], names=[filename])
    dataframes_lst.append(dataframe)

df = pd.concat(dataframes_lst, axis=1)
print(df)

Here, the dataframes are concatenated along the axis=1 and the output of print(df)is

output_final_df

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for your time. This is the solution I was looking for,
0

Try this:

import pandas as pd    
from pathlib import Path
read_path = Path('C:/Users/Administrator/Desktop/Zerodha/Day2')
    
df = pd.concat([pd.read_csv(path) for path in read_path.glob('*.csv')])

If you want to read from excel just use read_excel and change the pattern to '*.xlsx'

1 Comment

I am getting the data but not as expected in the Expected Data Frame. Your code gives me the data in the format mentioned in the Excel snippet. I need it in the format as mentioned in the Expected Data Frame

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.