I have a dataframe whose raw text column contains dates in several different formats. I am looking to extract these dates into a separate column.

Sample raw text:

"Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ Within the role I am expected to achieve sales targets which I currently have no problems reaching. Job Role/Establishment - Plasterer @ XX Plasterer’s Duration - September 2016 - Nov 2016 Job Role/Establishment - Customer Advisor @ AA Duration - (2015 – 2016) Job Role/Establishment - Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment - Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 Working at the airport . Job Role/Establishment - Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012 "

Expected Dataframe :

id      Raw_text                   Dates
01     "sample_raw_text"         June 2021 - 2023 , September 2016 - Nov 2016,(2015 – 2016),03/2014 to 08/2015 , 01/2012 - 06/2013, DEC 2010 – APRIL 2012

I have tried the pattern below:

def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'

    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)

With the above I am unable to get the required output. Please guide me on what I am missing.

  • Use regexr.com, it's very helpful, but if I answer with the right regex please upvote me :D Also, when using complex patterns like this, it is very helpful to split the pattern up into subpatterns and then combine them into one regex string, rather than writing one massive regex string by hand. It will help you identify which patterns are working and which are not. Just use the OR operator to separate each subpattern. Commented Jan 10, 2023 at 7:07
  • sure will upvote :-) please guide Commented Jan 10, 2023 at 7:13
  • This is a very complex regex so I probably won't write it for you, sorry man :) But please create a flowchart and organise the regex so it is easier for you. For example, you have three patterns you want to start capturing: starts with day, month, or year. Within these patterns, you then have the potential for new variation: is there a day following? Is there a year following? Is there more? Once you have a flowchart that describes the pattern, you can easily create a regex that will match. It just requires some patience to lay out all the possibilities. Commented Jan 10, 2023 at 7:14
  • Basically I would begin by constructing a "false" regex, i.e. r'(day)?(sep)?((month)?(year)?)+(sep)...' etc., and once this makes sense logically you can begin to substitute "day" with the actual pattern for a day, e.g. \d?\d should match single-digit and two-digit days. I don't know the exact syntax at the moment, but this is the approach I would use to make sure my pattern covers all bases. I would make it easier for myself by storing day, month etc. as variables containing the pattern definitions, so that when I construct the regex string I can substitute them in with an f-string. Commented Jan 10, 2023 at 7:27
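
The approach described in these comments (named subpatterns combined with an f-string) could look roughly like the sketch below. The variable names `MONTH`, `YEAR`, `NUMERIC`, `SEP`, `DATE`, `RANGE` and the helper `find_ranges` are hypothetical and chosen only for illustration, and the subpatterns are one possible choice rather than a complete treatment of every date format.

```python
import re

# Hypothetical building blocks; each one can be tested on its own.
MONTH = r"[A-Za-z]{3,9}"          # month name, e.g. Jun, June, SEPTEMBER
YEAR = r"[12]\d{3}"               # four-digit year starting with 1 or 2
NUMERIC = r"(?:\d{1,2}/){0,2}"    # optional "mm/" or "dd/mm/" prefix
SEP = r"\s*(?:–|-|to)\s*"         # en dash, hyphen or the word "to"

# A single date: optional month name, optional numeric prefix, then a year.
DATE = rf"(?:\b{MONTH}\s+)?{NUMERIC}{YEAR}"

# A date range: two dates joined by a separator, each optionally parenthesised.
RANGE = rf"\(?{DATE}\)?{SEP}\(?{DATE}\)?"

RANGE_RE = re.compile(RANGE, re.IGNORECASE)


def find_ranges(text):
    """Return every date range found in text as a list of strings."""
    return RANGE_RE.findall(text)


print(find_ranges("03/2014 to 08/2015 and DEC 2010 – APRIL 2012"))
```

Because none of the pieces use capturing groups, `findall` returns the full matched ranges, and a subpattern that misbehaves can be debugged in isolation before it is combined with the others.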

1 Answer


Try this:

\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)
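  • \(? an optional (.

  • (?:\b[A-Za-z]{3,9}\s*)? non-capturing group.

    • \b a word boundary, so the letters must start at the beginning of a word.
    • [A-Za-z]{3,9} between 3 and 9 letters.
    • \s* zero or more whitespace characters.
    • ? makes the whole group optional.
  • (?:\d\d?\/){0,2} non-capturing group, repeated zero to two times.

    • \d a digit between 0-9.
    • \d? an optional second digit between 0-9.
    • \/ a literal forward slash /.
  • [12]\d{3}\)?\s*

    • [12] match one digit from the listed digits 1 or 2.
    • \d{3} three digits between 0-9.
    • \)? an optional ).
    • \s* zero or more whitespace characters.
  • (?:–|-|[Tt][Oo])\s*

    • (?:–|-|[Tt][Oo]) match –, -, TO, to, To or tO.
    • \s* zero or more whitespace characters.
  • \(? an optional (.

  • (?:[A-Za-z]{3,9}\s*)? explained above (here without \b).

  • (?:\d\d?\/){0,2} explained above.

  • [12]\d{3} explained above.

  • \)? an optional ).

  • |\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\) a second alternative that matches a parenthesised month range followed by a single year, such as (Jan – Mar 2020).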

See regex demo
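
As a rough sketch of how this pattern might be applied in Python, the snippet below copies the regex from this answer unchanged and joins all matches into one string, which is the shape of the `Dates` column in the question. The names `DATE_RANGE` and `extract_dates`, and the shortened sample text, are assumptions made for illustration.

```python
import re

# Pattern copied from the answer above. It contains no capturing groups,
# so findall() returns the full text of every match.
DATE_RANGE = re.compile(
    r"\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?"
    r"\s*(?:–|-|[Tt][Oo])\s*"
    r"\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?"
    r"|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)"
)


def extract_dates(text):
    """Return all date ranges found in text, joined by ', '."""
    return ", ".join(DATE_RANGE.findall(text))


# Shortened version of the sample text from the question.
sample = (
    "Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ "
    "Job Role/Establishment - Customer Advisor @ AA Duration - (2015 – 2016) "
    "Job Role/Establishment - Warehouse Operative @ xyz Duration - 03/2014 to 08/2015"
)
print(extract_dates(sample))
```

With pandas, the same idea could be written as `df['Dates'] = df[column].str.findall(pattern).str.join(', ')`. Note that `str.extract` returns only the first match and produces one column per capturing group, which is one reason the attempt in the question does not give a single `Dates` column.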


8 Comments

Thanks! What changes need to be made if dates like (12/03/2020)-(2/11/2021) are to be fetched?
@Roshankumar Thank you! updated See regex demo
@Roshankumar Updated, See regex demo
@Roshankumar Glad to help! See regex demo
@Roshankumar See regex demo