0

I have a huge number of excel files i need to extract data from, hopefully into a pandas df. The file contains a number of columns, one of which is time as a string like "16:30"

The filenames are "Monday 21st September 2020.xlsx" for example

I'm trying to loop through the files, and add a datetime column that includes the date from the filename, and the time from the column in the excel file. I've tried the following with the loop limited to one file

import pandas as pd
import datetime
import dateutil
import glob
import pathlib

folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"

for file in glob.glob(folder, recursive=False):
    #read in the excel file
    df = pd.read_excel(file, sheet_name="SB", usecols="B,I,J")
    
    #workout the date from the file name
    filedate = dateutil.parser.parse(pathlib.Path(file).stem.replace(" (SB)",""))
    
    #print filedate because it doesnt end up in the df correctly!
    print(type(filedate))
    print(filedate)
    df.insert(0,'Date', racedate.strftime('%d-$m-%Y'))
  
    print(df)
    

And that gives, this..... So the date add to the df is going wrong somewhere

C:\temp\Friday 1st April 2022 (SB).xlsx
<class 'datetime.datetime'>
2022-04-01 00:00:00
           Date   Time  R1  R2
0    01-$m-2022  16:30   9   5
1    01-$m-2022  16:30   5   5
2    01-$m-2022  16:30   6   5
3    01-$m-2022  16:30   3   6
4    01-$m-2022  16:30   3   3
..          ...    ...  ..  ..
446  01-$m-2022  16:15   3   4
447  01-$m-2022  16:15   3   3
448  01-$m-2022  16:15   3   3
449  01-$m-2022  16:15   5   3
450  01-$m-2022  16:15   5   4

[451 rows x 4 columns]

Also once i get this sorted i want to merge the two Date and Time columns into one datetime object.

2
  • It looks like you've just accidentally put a $M in your string format instead of %M, hence the $M output. I think correcting that should give you your desired output for the day column Commented Jan 23, 2023 at 21:20
  • I just realised that too, I've changed to %m and it's made no difference so i have edited the original question. Commented Jan 23, 2023 at 21:23

1 Answer 1

1

You can extract the date from the folder variable like this.

from dateutil import parser
import re 

folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"

date_string = re.findall(r'\d{1,2}[a-z]{2} \w+ \d{4}', folder)[0]
date_obj = parser.parse(date_string)
print(date_obj.strftime("%d-%m-%Y"))

This will print 01-04-2022

The date_string variable is calculated by regex and you get it as 1st April 2022 which the dateutil library can easily parse then.

Regex explanation:

\d{1,2}: one or two digits to match the day of month number

[a-z]{2}: two lowercase letters to match the "st", "nd" and "rd"

\w+: one or more word characters to match the month name

\d{4}: four digits to match the year number

With the above, you can get the 1st April 2022 from the whole path.

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks. The problem is that the folder will in reality end in *.xlsx not a file name. The loop below will then loop through the files adding the correct date column for each file each time.
Then instead of the folder use the file instead the loop.
Of course! Thanks, I'll give this a try, not that I understand the regex lol
@ChrisLewis I added a regex explanation. Hope it is more clear now.
That's very clear. Thanks!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.