I have a column with strings such as: Posted: 1 day ago, Posted: 2 days ago. I want to convert this column to a column of dates, i.e.: datetime.date(2021, 12, 22), datetime.date(2021, 12, 21).
I tried using regex groups combined with df.replace() to achieve it in a compact operation:
df2 = df.replace({r"Posted: (\d+) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
but this raises ValueError: invalid literal for int() with base 10: '\\1'. Python evaluates int(r"\1") before replace() is ever called, so int() receives the literal string \1 rather than the captured group. Simply keeping the matched group works fine, though: either of the following two would work if I only wanted to preserve the numerical value in the column instead of translating it to date objects:
df2 = df.replace({r"Posted: (\d+) days? ago": r"\g<1>"}, regex=True)
df2 = df.replace({r"Posted: (\d+) days? ago": r"\1"}, regex=True)
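To illustrate the evaluation order described above, here is a minimal sketch using only the standard re module (no pandas). The variable names error_message and doubled are only for illustration. It shows that int(r"\1") fails on its own, and that a callable replacement receives a match object, so the group is available as a real value at match time.

```python
import re

pattern = r"Posted: (\d+) days? ago"

# The replacement expression is evaluated on its own, before any matching
# takes place, so int() only ever sees the two-character string \1.
try:
    int(r"\1")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A callable replacement is invoked once per match and receives the match
# object, so the captured group can be converted to an int there.
doubled = re.sub(pattern, lambda m: str(int(m.group(1)) * 2), "Posted: 11 days ago")
```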
How can I obtain the referenced regex value to pass it on to timedelta()?
Full code:
import pandas as pd
from datetime import date, timedelta
df = pd.DataFrame(
    [['Posted: 1 day ago', 'xa01332cs', 101],
     ['Posted: 2 days ago', 'd11as99101', 630],
     ['Posted: 11 days ago', '12011rww1a', 301]
    ],
    columns = ['Date', 'Code', 'Value']
)
def preprocess(df):
    #df2 = df.replace({r"Posted: (\d+) days? ago": r"\g<1>"}, regex=True) # this works
    #df2 = df.replace({r"Posted: (\d+) days? ago": r"\1"}, regex=True) # this works identically to the previous line
    # this raises ValueError: int(r"\1") is evaluated before replace() runs
    df2 = df.replace({r"Posted: (\d+) days? ago": str(date.today() - timedelta(int(r"\1")))}, regex=True)
    return df2
preprocess(df)
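For comparison, one possible direction is sketched below. It is not a confirmed solution, and it gives up on df.replace() in favour of Series.str.extract() followed by a per-row subtraction. The function name to_dates and its today parameter are hypothetical, added only so the result is reproducible. The sketch assumes every row in the Date column matches the pattern; a non-matching row would yield NaN and make astype(int) fail.

```python
from datetime import date, timedelta

import pandas as pd

df = pd.DataFrame(
    [["Posted: 1 day ago", "xa01332cs", 101],
     ["Posted: 2 days ago", "d11as99101", 630],
     ["Posted: 11 days ago", "12011rww1a", 301]],
    columns=["Date", "Code", "Value"],
)

PATTERN = r"Posted: (\d+) days? ago"


def to_dates(df, today=None):
    """Hypothetical helper: turn 'Posted: N day(s) ago' into date objects."""
    today = today or date.today()
    out = df.copy()
    # Pull the captured number out as its own Series of integers.
    # Assumes every row matches PATTERN.
    days = out["Date"].str.extract(PATTERN, expand=False).astype(int)
    # Subtract the offset row by row to get datetime.date objects.
    out["Date"] = days.map(lambda n: today - timedelta(days=int(n)))
    return out


# A fixed reference date is used here so the example is reproducible.
df2 = to_dates(df, today=date(2021, 12, 23))
```

With the reference date fixed to 2021-12-23, the first two rows come out as datetime.date(2021, 12, 22) and datetime.date(2021, 12, 21), matching the example at the top.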