I have a column with strings such as: Posted: 1 day ago, Posted: 2 days ago. I want to convert this column to a column of dates, i.e.: datetime.date(2021, 12, 22), datetime.date(2021, 12, 21).

I tried using regex groups combined with df.replace() to achieve it in a compact operation:

df2 = df.replace({r"Posted: (\d+) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)

but this raises ValueError: invalid literal for int() with base 10: '\\1', because int() receives the literal string r"\1" rather than the captured group: Python evaluates the replacement expression before df.replace() ever runs. Retrieving the matched group on its own works fine; either of the following would do if I only wanted to keep the numeric value in the column rather than convert it to a date:

df2 = df.replace({r"Posted: (\d+) days? ago": r"\g<1>"}, regex=True)

df2 = df.replace({r"Posted: (\d+) days? ago": r"\1"}, regex=True)

How can I obtain the referenced regex value to pass it on to timedelta()?

Full code:

import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    [['Posted: 1 day ago', 'xa01332cs', 101],
     ['Posted: 2 days ago', 'd11as99101', 630],
     ['Posted: 11 days ago', '12011rww1a', 301]
    ],
    columns = ['Date', 'Code', 'Value']
)

def preprocess(df):
    
    #df2 = df.replace({r"Posted: (\d+) days? ago": r"\g<1>"}, regex=True)    # this works
    #df2 = df.replace({r"Posted: (\d+) days? ago": r"\1"}, regex=True)       # this works identically to previous row
    df2 = df.replace({r"Posted: (\d+) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
    return df2

preprocess(df)

2 Answers

Plain Python allows date - timedelta, but pandas does not support subtracting a Series of timedeltas from a datetime.date. Subtracting it from a datetime works:

from datetime import datetime, timedelta

df['Date'] = datetime.today() - df.Date.str.extract(r'Posted: (\d+) days? ago')[0].astype(int).apply(timedelta)

Output:

>>> df
                        Date        Code  Value
0 2021-12-22 08:33:03.396630   xa01332cs    101
1 2021-12-21 08:33:03.396630  d11as99101    630
2 2021-12-12 08:33:03.396630  12011rww1a    301
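
The question asked for datetime.date values rather than full timestamps. A minimal sketch of one way to get them, assuming the same 'Date' column as in the question (the names `days` and `today` are only illustrative): extract the day counts first, then do ordinary date - timedelta arithmetic row by row.

```python
from datetime import date, timedelta

import pandas as pd

df = pd.DataFrame(
    {"Date": ["Posted: 1 day ago", "Posted: 2 days ago", "Posted: 11 days ago"]}
)

# Column 0 of the extract() result holds the first capture group.
days = df["Date"].str.extract(r"Posted: (\d+) days? ago")[0].astype(int)

# Fix "today" once so every row is computed against the same reference date.
today = date.today()

# Element-wise date - timedelta; the column ends up holding datetime.date objects.
df["Date"] = days.map(lambda n: today - timedelta(days=int(n)))
```

The resulting column has object dtype, so the vectorised .dt accessor is not available on it; keep the datetime64 version from the line above if further date arithmetic is needed.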

You can extract the numbers, convert them to timedeltas, then subtract:

import datetime

df['New Date'] = datetime.datetime.today() - df['Date'].str.extract(r"Posted: (\d+) days? ago")[0].astype(int) * pd.Timedelta('1D')

Output:

                  Date        Code  Value                   New Date
0    Posted: 1 day ago   xa01332cs    101 2021-12-22 10:36:13.361973
1   Posted: 2 days ago  d11as99101    630 2021-12-21 10:36:13.361973
2  Posted: 11 days ago  12011rww1a    301 2021-12-12 10:36:13.361973
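
To answer the literal question of how to pass the captured group on to timedelta(): a replacement string is evaluated before the regex runs, but a replacement function is called once per match. The sketch below assumes Series.str.replace with a callable `repl` and regex=True; the helper name `to_iso_date` is hypothetical, and the result is a column of ISO date strings, not date objects.

```python
from datetime import date, timedelta

import pandas as pd

s = pd.Series(["Posted: 1 day ago", "Posted: 2 days ago", "Posted: 11 days ago"])

today = date.today()

def to_iso_date(match):
    # match is a re.Match object; group(1) is the captured day count.
    return str(today - timedelta(days=int(match.group(1))))

# The callable is invoked for every match and must return a string.
converted = s.str.replace(r"Posted: (\d+) days? ago", to_iso_date, regex=True)
```

This stays closest to the original df.replace() attempt, at the cost of a Python-level function call per row.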
