I have a string such as:

 fmt_string2 = 'I want to apply for leaves from 12/12/2017 to 12/18/2017'

Here I want to extract the dates. But my code needs to be robust, as the date can be in any format, e.g. 12 January 2017 or 12 Jan 17, and its position can also change. For the above string I have tried:

''.join(fmt_string2.split()[-1].split('.')[::-10])

But here I am giving the position of my date, which I don't want. Can anyone help me write robust code for extracting dates?

  • See the third party library, dateparser Commented Jul 11, 2017 at 6:18
  • Possible duplicate of "Python - finding date in a string" Commented Jul 11, 2017 at 6:21
  • I tried dateparser, but it's not helping in this case Commented Jul 11, 2017 at 6:31

2 Answers

If 12/12/2017, 12 January 2017, and 12 Jan 17 are the only possible patterns then the following code that uses regex should be enough.

import re

string = 'I want to apply for leaves from 12/12/2017 to 12/18/2017 I want to apply for leaves from 12 January 2017 to ' \
       '12/18/2017 I want to apply for leaves from 12/12/2017 to 12 Jan 17 '

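# Raw string so the backslashes reach the regex engine unchanged; the month
# group is non-capturing so findall returns whole matches rather than tuples.
matches = re.findall(r'\d{2}[/ ](?:\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/ ]\d{2,4}', string)

for match in matches:
    print(match)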

Output:

12/12/2017
12/18/2017
12 January 2017
12/18/2017
12/12/2017
12 Jan 17

To understand the regex, play with it here in regex101.
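The month alternation above is typed out by hand. As a minimal sketch of the same idea, the list can instead be built from the standard library's calendar module. The names find_dates, month_alt, and date_re are illustrative, and the sketch assumes English month names (the default locale) and one- or two-digit days:

```python
import calendar
import re

# Month names and abbreviations from the standard library (English locale assumed).
# Longer names are sorted first so "January" is tried before "Jan".
months = sorted(
    {name for name in list(calendar.month_name) + list(calendar.month_abbr) if name},
    key=len,
    reverse=True,
)
month_alt = "|".join(map(re.escape, months))

# Day, then a numeric or named month, then a 4- or 2-digit year.
date_re = re.compile(
    rf"\b\d{{1,2}}[/ ](?:\d{{1,2}}|{month_alt})[/ ](?:\d{{4}}|\d{{2}})\b"
)


def find_dates(text):
    """Return every date-like substring, wherever it appears in the text."""
    return date_re.findall(text)


sample = "I want to apply for leaves from 12/12/2017 to 12 Jan 17"
print(find_dates(sample))  # ['12/12/2017', '12 Jan 17']
```

Because the pattern contains only non-capturing groups, findall returns the matched strings directly, and the position of the date in the sentence does not matter.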

Using Regular Expressions

Rather than relying on regex alone, I suggest the following approach:

import re
from dateutil.parser import parse

Sample Text

text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""

A regular expression finds anything of the form "from A to B". The advantage is that I don't have to handle each date format individually and keep extending the regex; the pattern stays generic and the date parsing is left to dateutil.

pattern = re.compile(r'from (.*) to (.*)')
matches = pattern.findall(text)

The matches found by the above regex in the text are:

[('12/12/2017', '12/18/2017'), ('12 January 2018', '18 January 2018'), ('12 Feb 2018', '18 Feb 2018')]

For each match I parse the dates. parse raises a ValueError for a value that isn't a date, so the except block reports the pair and skips it.

for val in matches:
    try:
        dt_from = parse(val[0])
        dt_to = parse(val[1])

        print("Leave applied from", dt_from.strftime('%d/%b/%Y'), "to", dt_to.strftime('%d/%b/%Y'))
    except ValueError:
        print("skipping", val)

Output:

Leave applied from 12/Dec/2017 to 18/Dec/2017
Leave applied from 12/Jan/2018 to 18/Jan/2018
Leave applied from 12/Feb/2018 to 18/Feb/2018
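The loop above drops the whole pair when either side fails to parse. If you would rather keep the pair and get None for the missing side, a helper along these lines may work. This is only a sketch: it swaps dateutil for the standard library's datetime.strptime so it has no third-party dependency, and the name parse_or_none and the FORMATS list are assumptions for illustration.

```python
from datetime import datetime

# Formats to try, in order. This list is an assumption for illustration;
# extend it to cover the formats you expect. Month-first is tried before
# day-first because the question's example (12/18/2017) is month-first.
FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%d %B %Y", "%d %b %Y", "%d %b %y")


def parse_or_none(value):
    """Return a datetime for value, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


# Pairs shaped like the output of re.findall with two groups;
# the second pair has an empty "to" value.
matches = [("12/12/2017", "12/18/2017"), ("12 Jan 2018", "")]
parsed = [(parse_or_none(a), parse_or_none(b)) for a, b in matches]

for dt_from, dt_to in parsed:
    print(dt_from, dt_to)
```

Unlike dateutil's parse, strptime only accepts the formats you list, so an ambiguous value such as 05/06/2017 is resolved by the order of FORMATS.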

Using pyparsing

Regular expressions have the limitation that they can become very complex when made flexible enough to handle less straightforward input, e.g.

text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""

So the third-party pyparsing module is a better fit here.

import calendar
import pyparsing as pp

Here the approach is to build a parsing grammar (called a "dictionary" below) that can parse the entire text.

Create keywords for the month names that can be used as pyparsing keywords:

months_list= []
for month_idx in range(1, 13):
    months_list.append(calendar.month_name[month_idx])
    months_list.append(calendar.month_abbr[month_idx])

# join the list to use it as pyparsing keyword
month_keywords = " ".join(months_list)

Dictionary for parsing:

# date separator - can be one of '/', '.', or ' '
separator = pp.Word("/. ")

# Dictionary for numeric date e.g. 12/12/2018
numeric_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=4))

# Dictionary for text date e.g. 12/Jan/2018
text_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.oneOf(month_keywords) + separator + pp.Word(pp.nums, max=4))

# Either numeric or text date
date_pattern = numeric_date | text_date

# Final dictionary - from x to y
# pp.Keyword matches the literal word; pp.Word("from") would match any run of the letters f, r, o, m
pattern = pp.Suppress(pp.SkipTo("from") + pp.Keyword("from") + pp.Optional("start") + pp.Optional("date")) + date_pattern
pattern += pp.Suppress(pp.Keyword("to") + pp.Optional("end") + pp.Optional("date")) + date_pattern

# Group the pattern, also it can be multiple
pattern = pp.OneOrMore(pp.Group(pattern))

Parse the input text:

result = pattern.parseString(text)

# Print result
for match in result:
    print("from", match[0], "to", match[1])

Output:

from 12/12/2017 to 12/18/2017
from 12 January 2018 to 18 January 2018
from 12 Feb 2018 to 18 Feb 2018
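For comparison, here is a rough regex-only sketch for the same input, using only the standard library. It illustrates the trade-off mentioned earlier: each optional filler word has to be written into the pattern by hand. The names DATE and RANGE are illustrative, and the date pattern is deliberately loose (any 3 to 9 letter word is accepted as a month).

```python
import re

# A deliberately simple date pattern: numeric, or day + month word + year.
DATE = r"\d{1,2}[/. ](?:\d{1,2}|[A-Za-z]{3,9})[/. ]\d{2,4}"

# Optional filler words are spelled out one by one; every new variant
# ("starting", "ending on", ...) would need another optional group.
RANGE = re.compile(
    rf"from\s+(?:start\s+)?(?:date\s+)?({DATE})\s+to\s+(?:end\s+)?(?:date\s+)?({DATE})"
)

text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""

ranges = RANGE.findall(text)
for start, end in ranges:
    print("from", start, "to", end)
```

This gives the same three pairs for this sample, but the pattern grows with every new phrasing, which is where a grammar built from named pieces tends to stay easier to read.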

6 Comments

In the above code, if a value is missing (e.g. 18/Dec/2017), I want None in that position, but it is giving me ' '.
@GeetanjaliBisht Please elaborate. As I understand it, you are getting a blank string instead of None; in that case you can always convert it in Python.
If I don't call dt_to.strftime I get this: ('skipping', ('12/12/2017', '')). Instead I want this: ('12/12/2017', 'none'). I tried doing this but I am not able to.
That won't come from the regex, as it matches whatever is there in the text. If you want none, you can add a check in the except ValueError block where print("skipping", val) is. Since val is a tuple, and tuples are immutable, build a new one that replaces '' with 'none', e.g. if val[1] == '': val = (val[0], 'none')
In the following code, if I give 'from date 12/12/2017 to end date 12/18/2017', it picks up 'date' and 'end date' in my output, but I only want the dates. Can you tell me how I can solve this?
