I have an extremely large dataset with date/time columns in various formats. I have a validation function that detects the possible date/time string formats; it handles 24-hour as well as 12-hour time, and the separator is always a colon (:). A sample is below. However, after profiling my code, this function appears to be a bottleneck that is expensive in execution time. Is there a more efficient way to do this?

import datetime
def validate_time(time_str: str):
    for time_format in ["%H:%M", "%H:%M:%S", "%H:%M:%S.%f", "%I:%M %p"]:
        try:
            return datetime.datetime.strptime(time_str, time_format)
        except ValueError:
            continue
    return None

print(validate_time(time_str="9:21 PM"))
  • How many possible time formats do you have? Commented May 4, 2022 at 16:11
  • @PranavHosangadi 4 so far. Commented May 4, 2022 at 16:17
  • Could you share what they are? How general-purpose do you want this to be? Should it handle 24 hour time as well as 12 hour? What about a different separator? Please include all constraints and requirements in your question Commented May 4, 2022 at 16:21
  • Try the numpy(datetime64) format, it's much faster. Commented May 4, 2022 at 16:48

1 Answer

Instead of trying to parse using every format string, you could split by colons to obtain the segments of your string that denote hours, minutes, and everything that remains. Then you can parse the result depending on the number of values the split returns:

def validate_time_new(time_str: str):
    time_vals = time_str.split(':')
    
    try:
        if len(time_vals) == 1: 
            # No split, so invalid time
            return None
        elif len(time_vals) == 2:
            if time_vals[-1][-2:].lower() in ["am", "pm"]:
                # if last element contains am or pm, try to parse as 12hr time
                return datetime.datetime.strptime(time_str, "%I:%M %p")
            else:
                # try to parse as 24h time
                return datetime.datetime.strptime(time_str, "%H:%M")
        elif len(time_vals) == 3:
            if "." in time_vals[-1]:
                # If the last element has a decimal point, try to parse microseconds
                return datetime.datetime.strptime(time_str, "%H:%M:%S.%f")
            else:
                # try to parse without microseconds
                return datetime.datetime.strptime(time_str, "%H:%M:%S")
        else: return None
    except ValueError:
        # If any of the attempts to parse throws an error, return None
        return None

To test this, let's time both methods for a bunch of test strings:

import timeit
print("old\t\t\tnew\t\t\t\told/new\t\ttest_string")
for s in ["12:24", "12:23:42", "13:53", "1:53 PM", "12:24:43.220", "not a date", "54:23:21"]:
    t1 = timeit.timeit('validate_time(s)', 'from __main__ import datetime, validate_time, s', number=100)
    t2 = timeit.timeit('validate_time_new(s)', 'from __main__ import datetime, validate_time_new, s', number=100)
    print(f"{t1:.6f}\t{t2:.6f}\t\t{t1/t2:.6f}\t\t{s}")
old         new             old/new     test_string
0.001628    0.001143        1.424322        12:24
0.001567    0.001012        1.548661        12:23:42
0.000935    0.000979        0.955177        13:53
0.003004    0.000722        4.161657        1:53 PM
0.004523    0.001396        3.241204        12:24:43.220
0.002148    0.000025        84.897370       not a date
0.002262    0.000622        3.638629        54:23:21
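
The timings above only compare speed, so it may also be worth checking that the split-based approach returns the same results as trying every format. The following is a minimal sketch of such a check, not the code that produced the timings: the names FORMATS, parse_brute_force, pick_format, and parse_by_shape are hypothetical, and it assumes the am/pm test looks at the last two characters of the final segment.

```python
import datetime

FORMATS = ["%H:%M", "%H:%M:%S", "%H:%M:%S.%f", "%I:%M %p"]


def parse_brute_force(time_str):
    """Try every format in turn, as in the question."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(time_str, fmt)
        except ValueError:
            continue
    return None


def pick_format(time_str):
    """Hypothetical helper: choose a single candidate format from the string's shape."""
    parts = time_str.split(":")
    if len(parts) == 2:
        # Assumption: 12-hour times end in "am"/"pm" (any case)
        if parts[-1][-2:].lower() in ("am", "pm"):
            return "%I:%M %p"
        return "%H:%M"
    if len(parts) == 3:
        # A decimal point in the last segment suggests fractional seconds
        return "%H:%M:%S.%f" if "." in parts[-1] else "%H:%M:%S"
    return None


def parse_by_shape(time_str):
    """Parse with at most one strptime call; return None if the string is invalid."""
    fmt = pick_format(time_str)
    if fmt is None:
        return None
    try:
        return datetime.datetime.strptime(time_str, fmt)
    except ValueError:
        return None


samples = ["12:24", "12:23:42", "13:53", "1:53 PM", "9:21 pm",
           "12:24:43.220", "not a date", "54:23:21"]
for s in samples:
    # Both approaches should give the same datetime (or None) for each sample
    assert parse_by_shape(s) == parse_brute_force(s), s
print("both parsers agree on all samples")
```

Agreement on these samples does not prove the two approaches are equivalent for every input, so it would be sensible to extend the sample list with strings drawn from the real dataset.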