I'm trying to capture and replace text with a regex in a DataFrame column that contains dates. I want to capture dates in the format "YYYY-MM-DD" in the text. My capture-and-replace syntax seems correct, but somehow it doesn't work.

import pandas as pd

lst_date_version = ["2021-10-10 rev. 002", "2021-11-28 rev. 003", "2021-09-27 rev. 008", "2021-11-29 rev. 008", "2021-10-16 rev. 003", "2021-10-25 rev. 008", "2021-11-03 rev. 003", "2021-04-12 rev. 008", "2021-03-19 rev. 004"]
df_test_date = pd.DataFrame({"Version": lst_date_version})
df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\1", regex=True)
print(df_test_date["Version"])

In the result, the day is removed from the date (2021-10-10 rev. 002 ==> 2021-10 rev. 002):

0    2021-10 rev. 002
1    2021-11 rev. 003
2    2021-09 rev. 008
3    2021-11 rev. 008
4    2021-10 rev. 003
5    2021-10 rev. 008
6    2021-11 rev. 003
7    2021-04 rev. 008
8    2021-03 rev. 004

But when I do the following:

df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\d{4}-\d{2})-(\d{2})", r"\0", regex=True)
print(df_test_date["Version"])

The result is:

0     rev. 002
1     rev. 003
2     rev. 008
3     rev. 008
4     rev. 003
5     rev. 008
6     rev. 003
7     rev. 008
8     rev. 004

In the meantime, I found a different way (inverting the capture) to achieve what I wanted:

df_test_date["Version"] = df_test_date["Version"].str.replace(r"(\srev.+)", r"\0", regex=True)

Many thanks in advance for your help :)

PS: I adapted the question based on the remarks :)

  • What does r have to do with your question? Please consider editing your tags in order for us to be more helpful to your problem. Commented Mar 8, 2022 at 8:52
  • no changes to initial list ... What happened to the day component of the date? You start off with YYYY-MM-DD and somehow end up with just YYYY-MM. Commented Mar 8, 2022 at 9:03
  • Also, it is not really clear what you want to achieve. What is the expected output? Commented Mar 8, 2022 at 9:05

2 Answers

2

Your code fails because (\d{4}-\d{2})-(\d{2}) never matches.

You could use str.split with n=1:

df_test_date['Version'] = df_test_date['Version'].str.split(n=1).str[1]

Otherwise, change your regex to \d{4}-\d{2}-\d{2}\s*:

df_test_date['Version'] = df_test_date['Version'].str.replace(r"\d{4}-\d{2}-\d{2}\s*", '', regex=True)
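
As a rough side-by-side of the two options in this answer, here is a minimal sketch. It assumes the goal is to keep only the "rev. 00x" part; the two-row frame and the names `df`, `by_split`, and `by_replace` are made up for illustration.

```python
import pandas as pd

# Small illustrative frame; the values follow the format used in the question.
df = pd.DataFrame({"Version": ["2021-10-10 rev. 002", "2021-11-28 rev. 003"]})

# Option 1: split once on whitespace and keep everything after the date.
by_split = df["Version"].str.split(n=1).str[1]

# Option 2: delete the leading date plus any whitespace that follows it.
by_replace = df["Version"].str.replace(r"\d{4}-\d{2}-\d{2}\s*", "", regex=True)

print(by_split.tolist())
print(by_replace.tolist())
```

Both lines should print the same list of revision strings. The split version relies on the date never containing whitespace, while the regex version only removes text that actually looks like a date.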

3 Comments

Thank you for your reply :) Can you give some details on why you use str.split with n=1, please? I mean, the date in the text "2021-10-10 rev. 002" can match the regex "(\d{4}-\d{2})-(\d{2})", correct?
@Dinesh I assumed here you wanted to extract the "rev. 00x" string. n=1 limits to 1 split (after the date). But I might be wrong. What do you expect here?
In fact, I wanted the date in the format "YYYY-MM-DD" from the text. I forgot to mention this in my initial question (I've edited it now) :) Sorry for this.
1

I would use str.extract here:

df_test_date["Version"] = df_test_date["Version"].str.extract(r'^(\d{4}-\d{2}-\d{2})')
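
A minimal sketch of the extract approach, assuming the date always sits at the start of the string. The sample frame and the name `dates` are illustrative, and `expand=False` is added here only so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Small illustrative frame; the values follow the format used in the question.
df = pd.DataFrame({"Version": ["2021-10-10 rev. 002", "2021-03-19 rev. 004"]})

# Capture the leading YYYY-MM-DD; rows that do not match would become NaN.
dates = df["Version"].str.extract(r"^(\d{4}-\d{2}-\d{2})", expand=False)

print(dates.tolist())
```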

If you want to use str.replace, then use a pattern which matches the entire input:

df_test_date["Version"] = df_test_date["Version"].str.replace(r'^(\d{4}-\d{2}-\d{2}) rev\. \d+$', r'\1', regex=True)
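
A note on the backreferences, as a sketch using Python's `re` module directly. With ordinary object-dtype strings, pandas should hand the pattern and replacement to `re`, so the same rules would apply to `str.replace`; other string backends may behave differently. In `re`, `\1` refers to the first group and `\g<0>` to the whole match, whereas `\0` is read as an octal escape for the NUL character. That would explain why the date seemed to vanish in the question. The variable names below are illustrative.

```python
import re

text = "2021-10-10 rev. 002"
pattern = r"(\d{4}-\d{2})-(\d{2})"

# \g<0> puts the whole match back, so the text is unchanged.
whole = re.sub(pattern, r"\g<0>", text)

# \1 keeps only the first group (year and month).
first = re.sub(pattern, r"\1", text)

# \0 is an octal escape: the date is replaced by a NUL character.
nul = re.sub(pattern, r"\0", text)

# Matching the entire input and keeping group 1 leaves just the date.
date_only = re.sub(r"^(\d{4}-\d{2}-\d{2}) rev\. \d+$", r"\1", text)

print(repr(whole), repr(first), repr(nul), repr(date_only))
```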

4 Comments

That won't work (no match), the regex should be something like r'^\d{4}-\d{2}\s*(.*)'
I strongly suggest you read the actual question before leaving comments like this.
then I guess I have not understood what is being asked…
Wait...I see your point...but the output is not consistent with the data being used to populate the data frame. In the latter case, my answer is correct, in the former, your answer might be right.
