Using Regular Expressions
Rather than writing a regex for every possible date format, I suggest using a regex only to locate the candidate strings and letting dateutil parse them:
import re
from dateutil.parser import parse
Sample Text
text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""
A regular expression that finds anything of the form "from A to B". The advantage is that I don't have to handle each date format in the regex and keep extending it; the pattern only captures whatever sits in the A and B positions.
pattern = re.compile(r'from (.*) to (.*)')
matches = re.findall(pattern, text)
The matches the above regex finds in the sample text are:
[('12/12/2017', '12/18/2017'), ('12 January 2018', '18 January 2018'), ('12 Feb 2018', '18 Feb 2018')]
For each match I parse both values as dates. parse raises a ValueError for a value that isn't a date, so the except block reports the match and skips it.
for val in matches:
    try:
        dt_from = parse(val[0])
        dt_to = parse(val[1])
        print("Leave applied from", dt_from.strftime('%d/%b/%Y'), "to", dt_to.strftime('%d/%b/%Y'))
    except ValueError:
        print("skipping", val)
Output:
Leave applied from 12/Dec/2017 to 18/Dec/2017
Leave applied from 12/Jan/2018 to 18/Jan/2018
Leave applied from 12/Feb/2018 to 18/Feb/2018
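The same locate-then-parse idea can be wrapped in a reusable function. The sketch below is a stdlib-only variant: it swaps dateutil's parse for datetime.strptime with a fixed list of formats, so it only covers the three formats in the sample text. The names KNOWN_FORMATS, parse_date and find_leave_ranges are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: only these three formats occur; dateutil handles far more.
KNOWN_FORMATS = ("%m/%d/%Y", "%d %B %Y", "%d %b %Y")

def parse_date(value):
    """Try each known format in turn; raise ValueError if none fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError("not a recognised date: %r" % value)

def find_leave_ranges(text):
    """Return (start, end) datetime pairs for every 'from A to B' match."""
    ranges = []
    for start, end in re.findall(r'from (.*) to (.*)', text):
        try:
            ranges.append((parse_date(start), parse_date(end)))
        except ValueError:
            # Not a date pair; skip it, as in the loop above.
            continue
    return ranges

sample = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""
ranges = find_leave_ranges(sample)
for start, end in ranges:
    print(start.strftime('%d/%b/%Y'), "->", end.strftime('%d/%b/%Y'))
```

Returning datetime pairs instead of printing makes it easy to do further checks on the result, for example that each start date is not after its end date.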
Using pyparsing
Regular expressions have a limitation: making them flexible enough to handle less straightforward input can get very complex. For example:
text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""
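To see the problem concretely, here is a small sketch that runs the "from A to B" regex from the first section against this noisier text. The variable name noisy_text is only used here for illustration.

```python
import re

noisy_text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""

# Same pattern as in the regex section: the greedy groups also pick up
# the filler words around the dates and everything up to the end of the line.
matches = re.findall(r'from (.*) to (.*)', noisy_text)
for start, end in matches:
    print(repr(start), "|", repr(end))
```

The first match comes back as 'start 12/12/2017' and 'end date 12/18/2017 some random text', so the captured values are no longer clean date strings, and a strict date parser would likely reject them.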
So, Python's pyparsing module is a better fit here.
import calendar
import pyparsing as pp
The approach here is to build a dictionary of parsing rules (a grammar) that can parse the entire text.
First, create the month names and abbreviations that will be used as pyparsing keywords:
months_list = []
for month_idx in range(1, 13):
    months_list.append(calendar.month_name[month_idx])
    months_list.append(calendar.month_abbr[month_idx])
# join the list into a space-separated string, the form pp.oneOf accepts
month_keywords = " ".join(months_list)
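As a quick check of what this keyword string contains, the sketch below builds it with only the calendar module and removes the one duplicate ("May" is both a full name and an abbreviation). It assumes the default English locale, since calendar takes month names from the current locale. The name unique_months is made up for illustration.

```python
import calendar

# month_name[0] and month_abbr[0] are empty strings, so start at index 1.
months = [
    name
    for idx in range(1, 13)
    for name in (calendar.month_name[idx], calendar.month_abbr[idx])
]

# dict.fromkeys drops duplicates while keeping the original order.
unique_months = list(dict.fromkeys(months))
month_keywords = " ".join(unique_months)
print(month_keywords)
```

Removing the duplicate is optional tidying; the space-separated string has the same shape as the one built above.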
Dictionary for parsing:
# date separator - can be one of '/', '.', or ' '
separator = pp.Word("/. ")
# Dictionary for numeric date e.g. 12/12/2018
numeric_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=4))
# Dictionary for text date e.g. 12/Jan/2018
text_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.oneOf(month_keywords) + separator + pp.Word(pp.nums, max=4))
# Either numeric or text date
date_pattern = numeric_date | text_date
# Final dictionary - from x to y
# pp.Keyword matches the exact word; pp.Word("from") would match any run of the letters f, r, o, m
pattern = pp.Suppress(pp.SkipTo("from") + pp.Keyword("from") + pp.Optional("start") + pp.Optional("date")) + date_pattern
pattern += pp.Suppress(pp.Keyword("to") + pp.Optional("end") + pp.Optional("date")) + date_pattern
# Group the pattern, also it can be multiple
pattern = pp.OneOrMore(pp.Group(pattern))
Parse the input text:
result = pattern.parseString(text)
# Print result
for match in result:
    print("from", match[0], "to", match[1])
Output:
from 12/12/2017 to 12/18/2017
from 12 January 2018 to 18 January 2018
from 12 Feb 2018 to 18 Feb 2018
dateparser