I have a dataset which contains several time features. These time features contain object data like so:

    12h 22min
    7 hours
    18 minutes
    27h 37min
    1h 35min
    2 hours
    NaN

As you can see, the time is represented in different formats and also contains NaN values. As part of my data preprocessing, I want to convert this object data to numeric form (strings to minutes).

I tried adapting a solution I found for a similar problem, as follows:

def parse_time(time):
    if not pd.isna(time):
        mins = 0
        fields = time.split()
        print(fields) #inserted this line to debug why output was 0
        for idx in range(0, len(fields)-1):
            if fields[idx+1] in ('min', 'mins', 'minutes'):
                mins += int(fields[idx])
            elif fields[idx+1] in ('h', 'hour', 'hours'):
                mins += int(fields[idx]) * 60

        return mins

But when testing this function, I realised that it only works when each number and its unit are separated by a space, which is not the case for my data:

   In[20]: parse_time('1h')
           ['1h']
   Out[20]: 0
   In[22]: parse_time('10h 50min')
           ['10h', '50min']
   Out[22]: 0
   In[24]: parse_time('10 h 50 min')
           ['10', 'h', '50', 'min']
   Out[24]: 650

Can anyone advise me what to change in my code so that this works, or offer an alternative, simpler solution?

Thanks :)

2 Answers


You can just use pd.to_timedelta:

pd.to_timedelta(df[0].fillna('0 min')
                    .str.replace('NaN', '0 m')
               )

Output:

0   0 days 12:22:00
1   0 days 07:00:00
2   0 days 00:18:00
3   1 days 03:37:00
4   0 days 01:35:00
5   0 days 02:00:00
6   0 days 00:00:00
Name: 0, dtype: timedelta64[ns]

Update: To get the periods in minutes:

pd.to_timedelta(df[0].fillna('0 min')
                    .str.replace('NaN', '0 m')
               ) / pd.to_timedelta('1 m')

Output:

0     742.0
1     420.0
2      18.0
3    1657.0
4      95.0
5     120.0
6       0.0
Name: 0, dtype: float64

Update 2: If you want to keep the NaN values, you can pass errors='coerce':

pd.to_timedelta(df[0], errors='coerce') / pd.to_timedelta('1 m')

Output:

0     742.0
1     420.0
2      18.0
3    1657.0
4      95.0
5     120.0
6       NaN
Name: 0, dtype: float64
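
As a variation on the above, here is a minimal self-contained sketch that rebuilds the sample column and converts it to minutes through `.dt.total_seconds()` instead of dividing by a one-minute timedelta. The series name `duration` and the variable names are just placeholders for illustration, and it assumes your pandas version parses all of these unit spellings, as the output above suggests.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question; the name "duration" is a placeholder.
durations = pd.Series(
    ["12h 22min", "7 hours", "18 minutes", "27h 37min", "1h 35min", "2 hours", np.nan],
    name="duration",
)

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
deltas = pd.to_timedelta(durations, errors="coerce")

# total_seconds() yields floats; NaT becomes NaN, so missing values are preserved.
minutes = deltas.dt.total_seconds() / 60
print(minutes)
```

Keeping the missing entries as NaN (rather than filling them with 0) lets you decide later how to impute them.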

2 Comments

I don't think this solves my problem. I want to convert a time string to minutes, how has this helped me do that?
Is it better to use errors='ignore' rather than errors='coerce' here?

You could try to use re.findall with the time stripped, if you want to keep that function:

import re
import pandas as pd

def parse_time(time):
    # Check for NaN before calling string methods: NaN is a float and has no .strip()
    if not pd.isna(time):
        mins = 0
        # Split into runs of letters and runs of digits, so '10h' becomes ['10', 'h']
        fields = re.findall(r'[A-Za-z]+|\d+', time.strip())
        print(fields) #inserted this line to debug why output was 0
        for idx in range(0, len(fields)-1):
            if fields[idx+1] in ('min', 'mins', 'minutes'):
                mins += int(fields[idx])
            elif fields[idx+1] in ('h', 'hour', 'hours'):
                mins += int(fields[idx]) * 60

        return mins

print(parse_time('20 hours 10min'))

print(parse_time('10h 50min'))

print(parse_time('10 h 50 min'))

Output:

['20', 'hours', '10', 'min']
1210
['10', 'h', '50', 'min']
650
['10', 'h', '50', 'min']
650
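
If you would rather avoid the index arithmetic, a possible alternative is to let the regex capture each number together with its unit and look the unit up in a table. This is only a sketch: `to_minutes`, `MINUTES_PER_UNIT` and `TOKEN` are names I made up, the unit list is limited to the spellings shown in the question, and returning NaN for non-string input is my assumption about how you want missing values handled.

```python
import math
import re

# Hypothetical lookup table: unit spelling -> minutes per unit.
MINUTES_PER_UNIT = {
    "h": 60, "hour": 60, "hours": 60,
    "min": 1, "mins": 1, "minute": 1, "minutes": 1,
}

# A number, optional whitespace, then a run of letters (the unit).
TOKEN = re.compile(r"(\d+)\s*([A-Za-z]+)")

def to_minutes(value):
    """Return total minutes for strings like '1h 35min'; NaN for non-strings."""
    if not isinstance(value, str):
        return math.nan  # covers float NaN and None
    total = 0
    for number, unit in TOKEN.findall(value):
        unit = unit.lower()
        if unit not in MINUTES_PER_UNIT:
            # Fail loudly on unexpected units instead of silently skipping them.
            raise ValueError(f"unknown unit {unit!r} in {value!r}")
        total += int(number) * MINUTES_PER_UNIT[unit]
    return total

samples = ["12h 22min", "7 hours", "18 minutes", "27h 37min", "1h 35min", "2 hours", None]
results = [to_minutes(s) for s in samples]
print(results)
```

On a DataFrame column this could be applied with something like `df[col].apply(to_minutes)`, where `col` is whichever time feature you are converting.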
