I have a column of strings like the ones below that contain date information, and I need to add leading zeros to single-digit months and days. I've run into some issues trying to do this purely with pandas.DataFrame.replace and regular expressions.

import pandas as pd
df = pd.DataFrame({'Key':['0123456789_1/2/2019','0123456789_11/23/2019','0145892367_10/2/2019','0145892367_4/13/2019']})

df
Out[323]: 
                     Key
0    0123456789_1/2/2019
1  0123456789_11/23/2019
2   0145892367_10/2/2019
3   0145892367_4/13/2019

For the above column, the output I'd want after reformatting would be:

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

By now I've figured out I can do this by splitting the strings:

r = df['Key'].str.split('_|/', expand=True)
df2 = r[0] + '_' + r[1].str.zfill(2) + '/' + r[2].str.zfill(2) + '/' + r[3]

df2
Out[333]: 
0    0123456789_01/02/2019
1    0123456789_11/23/2019
2    0145892367_10/02/2019
3    0145892367_04/13/2019
dtype: object

...But when I was initially trying to do it with pandas.DataFrame.replace, the closest I was able to get was:

df2 = df.replace(r'(_|/)([1-9]/)',r'\1 0\2',regex=True)

df2
Out[335]: 
                      Key
0   0123456789_ 01/2/2019
1   0123456789_11/23/2019
2  0145892367_10/ 02/2019
3  0145892367_ 04/13/2019

There are two problems with this that I'd like to know more about:

  1. In cases like row 0 where both the month and day are single-digit, it only finds the month. How can I get it to match both?
  2. I don't want the spaces, but when I try to replace using r'\10\2', of course I get an error because it thinks I'm trying to substitute in group 10, and there is no such group in the first regex. If I try r'(\1)0\2', it works, except it prints the literal parentheses. Why does it do this, and how can I properly write this so that it prints group 1 immediately followed by a literal zero?

Edit for clarification: I'm aware I could also fix it by parsing the dates, but I'm specifically interested in the regex solution, both as a learning exercise and because a single replace is much faster for large dataframes.
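
For reference, here is a minimal sketch of how both problems could be handled with Python's re syntax. It is one possible approach and is not taken from the answers below; the names PATTERN, REPLACEMENT, and pad_single_digits are illustrative assumptions. The lookahead (?=/) checks for the trailing slash without consuming it, so that slash is still available as the leading separator of the next match. \g<1>0 is the unambiguous way to write "group 1 followed by a literal zero". Parentheses have no special meaning in a replacement string, which is why r'(\1)0\2' emits them literally.

```python
import re

# Separator (group 1), then a single digit 1-9 (group 2), then a lookahead
# for '/' that is checked but not consumed by the match.
PATTERN = re.compile(r'([_/])([1-9])(?=/)')

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would be read as group 10.
REPLACEMENT = r'\g<1>0\2'

def pad_single_digits(key):
    """Zero-pad single-digit month/day fields in a key like '0123456789_1/2/2019'."""
    return PATTERN.sub(REPLACEMENT, key)

keys = [
    '0123456789_1/2/2019',
    '0123456789_11/23/2019',
    '0145892367_10/2/2019',
    '0145892367_4/13/2019',
]
padded = [pad_single_digits(k) for k in keys]
print(padded)
```

pandas' regex replacement is built on Python's re module, so the same pattern and replacement strings should carry over to df.replace(pattern, replacement, regex=True); treat that as an assumption to verify on your pandas version.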

2 Answers

IIUC, you can use:

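# Split once into prefix and date, then reformat the date with zero-padded month and day
parts = df.Key.str.split("_", expand=True)
df.Key = parts[0] + "_" + pd.to_datetime(parts[1], format='%m/%d/%Y').dt.strftime('%m/%d/%Y')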
print(df)

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

1 Comment

That does work, but I'm trying to understand how to get around the specific issues I encountered using regex. I'd like to be able to use the regex solution for other cases in the future that may not involve dates.

Using the datetime module:

from datetime import datetime

df['Key'] = df.Key.str.split('_').apply(lambda x: x[0]+'_'+datetime.strptime(x[1], "%m/%d/%Y").strftime("%m/%d/%Y"))

Output

                     Key
0  0123456789_01/02/2019
1  0123456789_11/23/2019
2  0145892367_10/02/2019
3  0145892367_04/13/2019

2 Comments

Thank you, but I'm trying to understand how to get around the specific issues I encountered using regex. I'd like to be able to use the regex solution for other cases in the future.
Using datetime or pd.to_datetime, as @anky_91 did, is better as far as I understand: it covers all the cases because it understands dates, whereas regex doesn't and might fail in some of them.
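
To make the trade-off in this comment concrete, here is a small sketch under stated assumptions: the helper names pad_with_regex and pad_with_strptime and the invalid sample key are hypothetical, and the regex is one possible lookahead-based pattern rather than anything posted in this thread. It shows that a regex only rewrites text and passes an impossible date through silently, while datetime.strptime rejects it.

```python
import re
from datetime import datetime

def pad_with_regex(key):
    # Purely textual: pads a single digit that sits between a separator and '/'.
    return re.sub(r'([_/])([1-9])(?=/)', r'\g<1>0\2', key)

def pad_with_strptime(key):
    # Parses the date part, so impossible dates raise ValueError.
    prefix, date_text = key.split('_', 1)
    parsed = datetime.strptime(date_text, '%m/%d/%Y')
    return prefix + '_' + parsed.strftime('%m/%d/%Y')

bad_key = '0123456789_13/45/2019'  # hypothetical key: month 13, day 45

# The regex finds nothing to pad and returns the invalid date unchanged.
regex_result = pad_with_regex(bad_key)

# strptime refuses to parse the same text.
try:
    pad_with_strptime(bad_key)
    strptime_rejected = False
except ValueError:
    strptime_rejected = True

print(regex_result, strptime_rejected)
```

Which behavior is preferable depends on the data: the regex generalizes to non-date text, while parsing adds validation at some extra cost.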
