Using Pandas .apply() method with a regex-based function

Question

I am trying to create a new column in a DataFrame by applying a function to a column that holds numbers as strings.

I have written the function to extract the numbers I want and tested it on a single string input and can confirm that it works.

import re

SEARCH_PATTERN = r'([0-9]{1,2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})'

def get_total_time_minutes(time_col, pattern=SEARCH_PATTERN):
    """Uses regex to parse time_col which is a string in the format 'd hh:mm:ss' to
    obtain a total time in minutes
    """
    days, hours, minutes, _ = re.match(pattern, time_col).groups()
    total_time_minutes = (int(days)*24 + int(hours))*60 + int(minutes)
    return total_time_minutes

#test that the function works for a single input
text = "2 23:24:46"
print(get_total_time_minutes(text))

Output: 4284

#apply the function to the required columns
df['Minutes Available'] = df['Resource available (d hh:mm:ss)'].apply(get_total_time_minutes)

The picture below is a screenshot of my dataframe columns.

[Screenshot of my dataframe]

The 'Resource available (d hh:mm:ss)' column of my dataframe is of Pandas dtype 'O' (string, if my understanding is correct), and has data in the following format: '5 08:00:00'. When I call apply(get_total_time_minutes) on it, though, I get the following error:

TypeError: expected string or bytes-like object

To clarify further, the "Resource available" column is a string representing the total time in days, hours, minutes and seconds that the resource was available. I want to convert that time string to a total time in minutes, hence the regex and arithmetic within the get_total_time_minutes function. – Sam Ezebunandu
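A possible way to narrow this down, as a sketch only: Series.apply already calls the function once per cell, so this TypeError typically means re.match received a cell that is not a string. In an object-dtype column that is often a missing value (NaN is a float). The sample data and the name get_total_time_minutes_safe below are hypothetical, and the assumption that the real column contains such cells is not confirmed by the question.

```python
import re

import numpy as np
import pandas as pd

SEARCH_PATTERN = r'([0-9]{1,2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})'

# Hypothetical data: the middle cell is missing (NaN is a float, not a string).
df = pd.DataFrame(
    {"Resource available (d hh:mm:ss)": ["5 08:00:00", np.nan, "2 23:24:46"]}
)
col = df["Resource available (d hh:mm:ss)"]

# Flag the cells that are not strings; these are the ones re.match rejects.
not_str = ~col.map(lambda v: isinstance(v, str))
print(col[not_str])


def get_total_time_minutes_safe(value, pattern=SEARCH_PATTERN):
    """Variant of the question's function that returns NaN instead of raising."""
    if not isinstance(value, str):
        return float("nan")
    match = re.match(pattern, value)
    if match is None:
        return float("nan")
    days, hours, minutes, _ = match.groups()
    return (int(days) * 24 + int(hours)) * 60 + int(minutes)


df["Minutes Available"] = col.apply(get_total_time_minutes_safe)
```

If the printed selection is empty, the column holds only strings and the cause lies elsewhere.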

try .applymap() instead of .apply() because get_total_time_minutes() is designed to operate on each cell of your column; not the column itself as a vector.
– jeschwar, Commented Aug 9, 2019 at 15:09

It seems to be working for me:

```
>>> d = pd.DataFrame({"Resource available (d hh:mm:ss)": ["2 23:24:46","3 23:12:45"]})
>>> d['Minutes Available'] = d['Resource available (d hh:mm:ss)'].apply(get_total_time_minutes)
>>> d
  Resource available (d hh:mm:ss)  Minutes Available
0                      2 23:24:46               4284
1                      3 23:12:45               5712
```
– ranka47, Commented Aug 9, 2019 at 15:11

Thanks @AlexandreB. I have added a screenshot of my dataframe.
– Sam Ezebunandu, Commented Aug 9, 2019 at 15:51

ifly6 · Accepted Answer · 2019-08-09 15:55:12Z

1

This might be a bit hacky, because it uses pandas' datetime parsing (pd.to_datetime) to read the string as a date and then turns it into a Timedelta by subtracting the default epoch (1900-01-01):

>>> pd.to_datetime('2 23:48:30', format='%d %H:%M:%S') - pd.to_datetime('0', format='%S')
Out[47]: Timedelta('1 days 23:48:30')

>>> Out[47] / pd.Timedelta('1 minute')
Out[50]: 2868.5

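But it does give you the elapsed time in minutes. Note that %d is parsed as a day of the month, so the Timedelta above comes out as 1 day 23:48:30 rather than 2 days 23:48:30; if the first field is a count of days, add one day back. It's also vectorised, so you can run it on whole columns and get your minute values a lot faster than using apply.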
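As a sketch of a different route to the same kind of number (not the approach in this answer): inserting the word "days" after the first field gives a string that pd.to_timedelta parses directly, so the first field is treated as a count of days rather than a day of the month. The sample DataFrame is hypothetical; the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data in the question's 'd hh:mm:ss' format.
df = pd.DataFrame(
    {"Resource available (d hh:mm:ss)": ["2 23:24:46", "5 08:00:00"]}
)
col = df["Resource available (d hh:mm:ss)"]

# Turn "2 23:24:46" into "2 days 23:24:46", a form pd.to_timedelta accepts.
with_unit = col.str.replace(" ", " days ", n=1, regex=False)
as_timedelta = pd.to_timedelta(with_unit)

# Floor to whole minutes, matching the question's function (seconds dropped).
df["Minutes Available"] = (as_timedelta // pd.Timedelta(minutes=1)).astype(int)
print(df)
```

Floor division is used here only to reproduce the integer result of get_total_time_minutes; plain division would keep the fractional minutes.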

edited Aug 9, 2019 at 15:55

answered Aug 9, 2019 at 15:02

6 Comments

Stael Over a year ago

TIL the Out object.

ifly6 Over a year ago

If you're dealing with big data, it can also be a real memory drain

Sam Ezebunandu Over a year ago

The column is actually a time-delta in days hours:minutes:seconds and not a time stamp.

ifly6 Over a year ago

If your column is already pd.Timedelta, then just take the column and divide it by pd.Timedelta('1 minute').
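
A minimal sketch of that division, assuming the column already has a timedelta dtype; the sample values and the name `available` are made up for illustration.

```python
import pandas as pd

# Hypothetical column that already holds Timedelta values.
available = pd.Series(pd.to_timedelta(["2 days 23:24:46", "5 days 08:00:00"]))

# Dividing by a one-minute Timedelta yields minutes as floats.
minutes = available / pd.Timedelta("1 minute")
print(minutes)
```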

Sam Ezebunandu Over a year ago

Thanks, @ifly6 This works! I can get rid of the convoluted regex and keep things simple.
