
I'm trying to combine Excel data files from different dates into one file so that I can analyze them with the pandas package. I'm having difficulty because the files are named by date and each one contains multiple sheets.

[Screenshots: the inside of a file; the inside of a month folder; the inside of the year folder, which contains multiple directories]

This is for an assignment to analyze the data and plot various parameters (temp, atm, GHI, etc.) against the number of days/hours/minutes.

import pandas as pd
import glob

all_data = pd.DataFrame() #Creating an empty dataframe
for f in glob.glob("/Data-Concentrated Solar Power-NamPower/Arandis 2016/2016 01 January/*.xlsx"): #path to datafiles and using glob to select all files with .xlsx extension
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)


2 Answers


Append each file's DataFrame to a list, then use pandas.concat to combine them all into one DataFrame:

import pandas as pd
import glob

frames = []

for f in glob.glob("/home/humblefool/Dropbox/MSc/MSc Project/Data-Concentrated Solar Power-NamPower/Arandis 2016/2016 01 January/*.xlsx"): #path to datafiles and using glob to select all files with .xlsx extension
    df = pd.read_excel(f).assign(file_name=f)
    # Add date column for sorting later
    df['date'] = pd.to_datetime(df.file_name.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False), errors='coerce')
    frames.append(df)

all_data = pd.concat(frames, ignore_index=True).sort_values('date')
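
The question also mentions multiple sheets per file, and the header apparently starts on line 17. As a rough sketch, assuming every sheet shares the same layout, `pd.read_excel` with `sheet_name=None` returns a dict of DataFrames keyed by sheet name, which can then be stacked. The helper name `combine_sheets`, the `sheet` column, and the placeholder path are made up for illustration:

```python
import pandas as pd


def combine_sheets(sheets, file_name):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    Adds 'sheet' and 'file_name' columns so every row can be traced back.
    """
    frames = [
        df.assign(sheet=sheet_name, file_name=file_name)
        for sheet_name, df in sheets.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Usage with a real workbook (the path is a placeholder; skiprows=16 assumes
# the header sits on row 17 of every sheet):
# sheets = pd.read_excel("some_file.xlsx", sheet_name=None, skiprows=16)
# df = combine_sheets(sheets, "some_file.xlsx")
```

Inside the loop above, the result of `combine_sheets` could be appended to `frames` in place of the single-sheet `df`.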

7 Comments

Is it possible to confirm that the files were added in order of their dates? And can pandas be told to start reading at line 17 of every file, perhaps with the header option?
I've updated my answer, this will include a column with the file that the dataframe came from
Also, I realised the merged files are not in chronological order of their dates, and it's messing up the data when I convert the generated file to a CSV file. Any idea how to fix this?
I’m not at my desk just now, but will be able to look into this for you in about 30 minutes or so
@Tonikami04 Apologies for the delay in getting back to you. IIUC, you want to extract the date part from the filename so that you can sort by that date? I have updated my answer to add a date column, using .str.extract and pd.to_datetime. Hope this is what you're looking for.

Can you try the following:

import glob
import os

import pandas as pd

all_data = pd.DataFrame() #Creating an empty dataframe
for f in glob.glob("/home/humblefool/Dropbox/MSc/MSc Project/Data-Concentrated Solar Power-NamPower/Arandis 2016/2016 01 January/*.xlsx"): #path to datafiles and using glob to select all files with .xlsx extension
    # Read Sheet1, skipping the first 16 rows so the header comes from row 17
    df = pd.ExcelFile(f).parse('Sheet1', skiprows=16)
    # Take the date from the file name (assumes names like <prefix>_<date>.xlsx)
    file_date = os.path.splitext(os.path.basename(f))[0].split('_')[1]
    df['file_date'] = pd.to_datetime(file_date)
    all_data = pd.concat([all_data, df])
# Index by file date and sort so the rows are in chronological order
all_data = all_data.set_index('file_date').sort_index()
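
To check that the combined rows really are in date order, one option is the index's `is_monotonic_increasing` property. The sketch below uses two tiny in-memory stand-ins for per-file DataFrames instead of real Excel files; the helper name `sort_by_file_date` is made up, and `kind="stable"` is an added choice here, since the default sort does not promise to keep rows with the same date in their original order:

```python
import pandas as pd


def sort_by_file_date(frames):
    """Concatenate frames and sort by 'file_date', keeping row order within each file."""
    combined = pd.concat(frames, ignore_index=True)
    # kind="stable" keeps rows that share a file_date in their original order
    combined = combined.sort_values("file_date", kind="stable")
    return combined.set_index("file_date")


# Two stand-ins for per-file DataFrames, deliberately passed out of order
later = pd.DataFrame({"temp": [30, 31], "file_date": pd.to_datetime("2016-01-02")})
earlier = pd.DataFrame({"temp": [20, 21], "file_date": pd.to_datetime("2016-01-01")})

result = sort_by_file_date([later, earlier])
print(result.index.is_monotonic_increasing)  # True when the dates never go backwards
```

Running the same check on `all_data.index` before writing the CSV would show whether the export is chronological.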

3 Comments

This is actually working. But how can I be sure that the files are combined in order of their dates?
I have revised the solution to skip the first 16 rows. You can check now.
Also, I realised the merged files are not in chronological order of their dates, and it's messing up the data when I convert the generated file to a CSV file. Any idea how to fix this?
